[PATCH v2 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle

2018-11-05 Thread Steve Sistare
When a CPU has no more CFS tasks to run, and idle_balance() fails to find a
task, then attempt to steal a task from an overloaded CPU in the same LLC,
using the cfs_overload_cpus bitmap to efficiently identify candidates.  To
minimize search time, steal the first migratable task that is found when
the bitmap is traversed.  For fairness, search for migratable tasks on an
overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle.  idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy.  Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default.
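The feature can be flipped at run time through the usual sched_features debugfs interface; the path and the exact feature string shown here are assumptions based on the standard mechanism, so check the patch for the string it registers.

```shell
# Disable work stealing (assumes debugfs mounted at /sys/kernel/debug;
# the NO_ prefix turns a scheduler feature off)
echo NO_STEAL > /sys/kernel/debug/sched_features

# Re-enable it
echo STEAL > /sys/kernel/debug/sched_features
```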

Stealing improves utilization with only a modest CPU overhead in scheduler
code.  In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:

  %find - percent of time spent in old and new functions that search for
idle CPUs and tasks to steal and set the overloaded CPUs bitmap.

  steal - number of times a task is stolen from another CPU.

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench  process 10
sched_wakeup_granularity_ns=15000000

  baseline
  grps  time    %busy  slice   sched    idle    wake   %find  steal
  1      8.084  75.02   0.10   105476   46291   59183   0.31  0
  2     13.892  85.33   0.10   190225   70958  119264   0.45  0
  3     19.668  89.04   0.10   263896   87047  176850   0.49  0
  4     25.279  91.28   0.10   322171   94691  227474   0.51  0
  8     47.832  94.86   0.09   630636  144141  486322   0.56  0

  new
  grps  time    %busy  slice   sched    idle    wake   %find  steal  %speedup
  1      5.938  96.80   0.24    31255    7190   24061   0.63   7433  36.1
  2     11.491  99.23   0.16    74097    4578   69512   0.84  19463  20.9
  3     16.987  99.66   0.15   115824    1985  113826   0.77  24707  15.8
  4     22.504  99.80   0.14   167188    2385  164786   0.75  29353  12.3
  8     44.441  99.86   0.11   389153    1616  387401   0.67  38190   7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization rises by
5 to 22%, hitting 99% for 2 or more groups (80 or more tasks).
The cost is at most 0.4% additional find time.

Additional performance results follow.  A negative "speedup" is a
regression.  Note: for all hackbench runs, sched_wakeup_granularity_ns
is set to 15 msec.  Otherwise, preemptions increase at higher loads and
distort the comparison between baseline and new.

-- 1 Socket Results --

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench  process 10

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
  1        8.008    0.1   5.905    0.2      35.6
  2       13.814    0.2  11.438    0.1      20.7
  3       19.488    0.2  16.919    0.1      15.1
  4       25.059    0.1  22.409    0.1      11.8
  8       47.478    0.1  44.221    0.1       7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench  process 10

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
  1        4.586    0.8   4.596    0.6      -0.3
  2        7.693    0.2   5.775    1.3      33.2
  3       10.442    0.3   8.288    0.3      25.9
  4       13.087    0.2  11.057    0.1      18.3
  8       24.145    0.2  22.076    0.3       9.3
  16      43.779    0.1  41.741    0.2       4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs

  clients  %speedup
  1            16.2
  2            11.7
  4             9.9
  8            12.8
  16           13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

  Benchmark                  %speedup
  specjbb2015_critical_jops       5.7
  mysql_sysb1.0.14_mutex_2       40.6
  mysql_sysb1.0.14_oltp_2         3.9

-- 2 Socket Results --

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench  process 10

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
  1        7.945    0.2   7.219    8.7      10.0
  2        8.444    0.4   6.689    1.5      26.2
  3       12.100    1.1   9.962    2.0      21.4
  4       15.001    0.4  13.109    1.1      14.4
  8       27.960    0.2  26.127    0.3       7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
