Group Imbalance Bug - performance drop by a factor of 10x on NUMA boxes with cgroups

2018-10-27 Thread Jirka Hladky
Hi Mel and Srikar,

I would like to ask you if you could look into the Group Imbalance Bug
described in chapter 3.1 of this paper:

http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf

See also comment [1]. The paper describes the bug with a workload that
involves several ssh sessions and assumes
kernel.sched_autogroup_enabled=1. We have found that it can be
reproduced more easily with cgroups.

The reproducer consists of this workload:
* 2 separate "stress --cpu 1" processes; each stress process needs 1 CPU.
* the NAS benchmark (https://www.nas.nasa.gov/publications/npb.html), from
which I use the lu.C.x binary (Lower-Upper Gauss-Seidel solver) in
OpenMP (OMP) mode.

We run the workload in two modes:

NORMAL - both stress and lu.C.x run in the same control group
GROUP  - each binary runs in a separate control group (setup sketched below):
         stress, first instance:  cpu:test_group_1
         stress, second instance: cpu:test_group_2
         lu.C.x:                  cpu:test_group_main
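
For the GROUP mode, the setup can look roughly like this (a sketch
assuming cgroup v1 and the libcgroup tools cgcreate/cgexec; the group
names match those above):

  # create one cpu cgroup per binary
  cgcreate -g cpu:test_group_1
  cgcreate -g cpu:test_group_2
  cgcreate -g cpu:test_group_main

  # start each stress instance in its own group
  cgexec -g cpu:test_group_1 stress --cpu 1 &
  cgexec -g cpu:test_group_2 stress --cpu 1 &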

I run lu.C.x with different numbers of threads - for example, on a
4-node NUMA server with 4x Xeon Gold 6126 CPUs (96 CPUs in total) I run
lu.C.x with 72, 80, 88, and 92 threads. Since the server has 96 CPUs in
total, even with 92 lu.C.x threads plus the two stress processes the
server is still not fully loaded.
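
The sweep over thread counts can be driven by a small loop along these
lines (a sketch; the lu.C.x path and the GNU time invocation are
illustrative, and the cgexec wrapper applies only to the GROUP mode):

  # GROUP mode: run lu.C.x in its own cgroup for each thread count
  # and record the wall-clock runtime (GNU time, %e = elapsed seconds)
  for t in 72 80 88 92; do
      OMP_NUM_THREADS=$t /usr/bin/time -f "$t threads: %e s" \
          cgexec -g cpu:test_group_main ./lu.C.x
  done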

Here are the runtimes in seconds of lu.C.x for different numbers of threads:

#Threads  NORMAL  GROUP
72         21.27  30.01
80         15.32    164
88         17.91    367
92         19.22    432

As you can see, lu.C.x is already significantly slower at 72 threads
when executed in a dedicated cgroup, and it gets much worse with an
increasing number of threads: at 80 threads the slowdown is roughly 10x
(164 s vs. 15.32 s), and at 92 threads it exceeds 20x (432 s vs. 19.22 s).

Some more details are below.

Please let me know if this sounds interesting and if you would like to
look into it. I can provide you with the reproducer plus some
supplementary Python scripts to further analyze the results.

Thanks a lot!
Jirka

Here are some more details on the case with 80 lu.C.x threads and 2
stress processes, run on the 96-CPU server with 4 NUMA nodes.

Analyzing the ps output is very interesting (shown here for 5
consecutive runs of the workload):

Average number of threads scheduled for NUMA node          0      1      2      3

lu.C.x_80_NORMAL_1.ps.numa.hist         Average  21.25  21.00  19.75  18.00
lu.C.x_80_NORMAL_1.stress.ps.numa.hist  Average   1.00   1.00
lu.C.x_80_NORMAL_2.ps.numa.hist         Average  20.50  20.75  18.00  20.75
lu.C.x_80_NORMAL_2.stress.ps.numa.hist  Average   1.00   0.75   0.25
lu.C.x_80_NORMAL_3.ps.numa.hist         Average  21.75  22.00  18.75  17.50
lu.C.x_80_NORMAL_3.stress.ps.numa.hist  Average   1.00   1.00
lu.C.x_80_NORMAL_4.ps.numa.hist         Average  21.50  21.00  18.75  18.75
lu.C.x_80_NORMAL_4.stress.ps.numa.hist  Average   1.00   1.00
lu.C.x_80_NORMAL_5.ps.numa.hist         Average  18.00  23.33  19.33  19.33
lu.C.x_80_NORMAL_5.stress.ps.numa.hist  Average   1.00   1.00


As you can see, in NORMAL mode lu.C.x is scheduled uniformly across the
NUMA nodes.
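
For reference, a per-node histogram like the one above can be sampled
along these lines (a sketch only; it assumes CPUs 0-23 belong to node
#0, 24-47 to node #1, and so on - the real mapping should be verified
with numactl --hardware):

  # count lu.C.x threads per NUMA node, based on the CPU each
  # thread last ran on (psr); assumes 24 consecutive CPU ids per node
  ps -T -o psr= -C lu.C.x \
    | awk '{ node[int($1 / 24)]++ }
           END { for (n in node) printf "NUMA %s: %d threads\n", n, node[n] }'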

Compare it with the GROUP (cgroups) mode:

Average number of threads scheduled for NUMA node          0      1      2      3

lu.C.x_80_GROUP_1.ps.numa.hist         Average  13.05  13.54  27.65  25.76
lu.C.x_80_GROUP_1.stress.ps.numa.hist  Average   1.00   1.00
lu.C.x_80_GROUP_2.ps.numa.hist         Average  12.18  14.85  27.56  25.41
lu.C.x_80_GROUP_2.stress.ps.numa.hist  Average   1.00   1.00
lu.C.x_80_GROUP_3.ps.numa.hist         Average  15.32  13.23  26.52  24.94
lu.C.x_80_GROUP_3.stress.ps.numa.hist  Average   1.00   1.00
lu.C.x_80_GROUP_4.ps.numa.hist         Average  13.82  14.86  25.64  25.68
lu.C.x_80_GROUP_4.stress.ps.numa.hist  Average   1.00   1.00
lu.C.x_80_GROUP_5.ps.numa.hist         Average  15.12  13.03  25.12  26.73
lu.C.x_80_GROUP_5.stress.ps.numa.hist  Average   1.00   1.00

In the GROUP mode, the scheduler moves lu.C.x away from nodes #0 and #1,
where the stress processes are running. It does so to such an extent
that NUMA nodes #2 and #3 become overcommitted - they have more NAS
threads scheduled than CPUs available (each NUMA node has 24 CPUs; in
run 1, for example, nodes #2 and #3 average 27.65 and 25.76 lu.C.x
threads).

Here is the detailed report:
$ more lu.C.x_80_GROUP_1.ps.numa.hist
#Date                   NUMA 0  NUMA 1  NUMA 2  NUMA 3
2018-Oct-27_04h39m57s        6       7      37      30
2018-Oct-27_04h40m02s       16      15      23      26
2018-Oct-27_04h40m08s       13      12      27      28
2018-Oct-27_04h40m13s        9      15      29      27
2018-Oct-27_04h40m18s       16      13      27      24
2018-Oct-27_04h40m23s       16      14      25      25
2018-Oct-27_04h40m28s       16      15      24      25
2018-Oct-27_04h40m33s       10      11      34      25
2018-Oct-27_04h40m38s       16      13      25      26
2018-Oct-27_04h40m43s       10      10      32      28
