Group Imbalance Bug - performance drop by a factor of 10 on NUMA boxes with cgroups
Hi Mel and Srikar,

I would like to ask you if you could look into the Group Imbalance Bug described in chapter 3.1 of this paper: http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf. See also comment [1].

The paper describes the bug on a workload involving several ssh sessions and assumes kernel.sched_autogroup_enabled=1. We have found that it can be reproduced more easily with cgroups. The reproducer consists of this workload:

 * 2 separate "stress --cpu 1" processes; each stress process needs 1 CPU.
 * the NAS benchmark suite (https://www.nas.nasa.gov/publications/npb.html), from which I use the lu.C.x binary (Lower-Upper Gauss-Seidel solver) in Open Multi-Processing (OMP) mode.

We run the workload in two modes:

 NORMAL - both stress and lu.C.x run in the same control group
 GROUP  - each binary runs in a separate control group:
          stress, first instance:  cpu:test_group_1
          stress, second instance: cpu:test_group_2
          lu.C.x:                  cpu:test_group_main

(A sketch of the GROUP-mode setup is included below, after the ps histograms.)

I run lu.C.x with different numbers of threads - for example, on a 4-node NUMA server with 4x Xeon Gold 6126 CPUs (96 CPUs in total) I run lu.C.x with 72, 80, 88, and 92 threads. Since the server has 96 CPUs in total, even with 92 lu.C.x threads plus the two stress processes the server is still not fully loaded.

Here are the runtimes in seconds for lu.C.x with different numbers of threads:

 #Threads  NORMAL   GROUP
 72         21.27   30.01
 80         15.32  164
 88         17.91  367
 92         19.22  432

As you can see, already with 72 threads lu.C.x is significantly slower when executed in a dedicated cgroup, and it gets much worse as the number of threads increases (slowdown by a factor of 10 and more). Some more details are below.

Please let me know if this sounds interesting and whether you would like to look into it. I can provide the reproducer plus some supplementary python scripts to further analyze the results. Thanks a lot!

Jirka

Some more details on the case with 80 lu.C.x threads and 2 stress processes, run on the 96-CPU server with 4 NUMA nodes. Analyzing the ps output is very interesting (here for 5 subsequent runs of the workload).

Average number of threads scheduled for NUMA nodes 0-3:

                                                      0      1      2      3
 lu.C.x_80_NORMAL_1.ps.numa.hist         Average  21.25  21.00  19.75  18.00
 lu.C.x_80_NORMAL_1.stress.ps.numa.hist  Average   1.00   1.00
 lu.C.x_80_NORMAL_2.ps.numa.hist         Average  20.50  20.75  18.00  20.75
 lu.C.x_80_NORMAL_2.stress.ps.numa.hist  Average   1.00   0.75   0.25
 lu.C.x_80_NORMAL_3.ps.numa.hist         Average  21.75  22.00  18.75  17.50
 lu.C.x_80_NORMAL_3.stress.ps.numa.hist  Average   1.00   1.00
 lu.C.x_80_NORMAL_4.ps.numa.hist         Average  21.50  21.00  18.75  18.75
 lu.C.x_80_NORMAL_4.stress.ps.numa.hist  Average   1.00   1.00
 lu.C.x_80_NORMAL_5.ps.numa.hist         Average  18.00  23.33  19.33  19.33
 lu.C.x_80_NORMAL_5.stress.ps.numa.hist  Average   1.00   1.00

As you can see, in NORMAL mode lu.C.x is scheduled uniformly across the NUMA nodes.
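To make the setup concrete, here is roughly what the GROUP mode looks like as a script. This is only a minimal sketch, assuming bash and the cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu; the actual reproducer I can send you differs in detail.

 # GROUP mode: one cpu cgroup per binary (cgroup v1 assumed, run as root)
 mkdir -p /sys/fs/cgroup/cpu/test_group_1
 mkdir -p /sys/fs/cgroup/cpu/test_group_2
 mkdir -p /sys/fs/cgroup/cpu/test_group_main

 # Each subshell moves itself into its group before exec'ing the workload,
 # so every worker/thread forked later inherits that cgroup.
 ( echo $BASHPID > /sys/fs/cgroup/cpu/test_group_1/tasks
   exec stress --cpu 1 ) &
 ( echo $BASHPID > /sys/fs/cgroup/cpu/test_group_2/tasks
   exec stress --cpu 1 ) &
 ( echo $BASHPID > /sys/fs/cgroup/cpu/test_group_main/tasks
   OMP_NUM_THREADS=80 exec ./lu.C.x ) &
 wait

 # NORMAL mode is identical except that all three commands are started
 # in one and the same cgroup.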
Compare it with GROUP mode:

Average number of threads scheduled for NUMA nodes 0-3:

                                                      0      1      2      3
 lu.C.x_80_GROUP_1.ps.numa.hist          Average  13.05  13.54  27.65  25.76
 lu.C.x_80_GROUP_1.stress.ps.numa.hist   Average   1.00   1.00
 lu.C.x_80_GROUP_2.ps.numa.hist          Average  12.18  14.85  27.56  25.41
 lu.C.x_80_GROUP_2.stress.ps.numa.hist   Average   1.00   1.00
 lu.C.x_80_GROUP_3.ps.numa.hist          Average  15.32  13.23  26.52  24.94
 lu.C.x_80_GROUP_3.stress.ps.numa.hist   Average   1.00   1.00
 lu.C.x_80_GROUP_4.ps.numa.hist          Average  13.82  14.86  25.64  25.68
 lu.C.x_80_GROUP_4.stress.ps.numa.hist   Average   1.00   1.00
 lu.C.x_80_GROUP_5.ps.numa.hist          Average  15.12  13.03  25.12  26.73
 lu.C.x_80_GROUP_5.stress.ps.numa.hist   Average   1.00   1.00

In GROUP mode, the scheduler moves lu.C.x away from nodes #0 and #1, where the stress processes are running. It does so to such an extent that NUMA nodes #2 and #3 become overcommitted: they have more NAS threads scheduled than CPUs available (each NUMA node has 24 CPUs).

Here is the detailed report:

 $ more lu.C.x_80_GROUP_1.ps.numa.hist
 #Date                  NUMA 0  NUMA 1  NUMA 2  NUMA 3
 2018-Oct-27_04h39m57s       6       7      37      30
 2018-Oct-27_04h40m02s      16      15      23      26
 2018-Oct-27_04h40m08s      13      12      27      28
 2018-Oct-27_04h40m13s       9      15      29      27
 2018-Oct-27_04h40m18s      16      13      27      24
 2018-Oct-27_04h40m23s      16      14      25      25
 2018-Oct-27_04h40m28s      16      15      24      25
 2018-Oct-27_04h40m33s      10      11      34      25
 2018-Oct-27_04h40m38s      16      13      25      26
 2018-Oct-27_04h40m43s      10      10      32      28
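For completeness: the *.ps.numa.hist files above are produced by periodically sampling, via ps, the CPU on which each lu.C.x thread last ran and mapping that CPU to its NUMA node. The real collection and averaging is done by the supplementary python scripts mentioned above; a rough bash equivalent of the sampling loop would be:

 # Every 5 seconds, count lu.C.x threads per NUMA node.
 PID=$(pgrep -o lu.C.x)
 while kill -0 "$PID" 2>/dev/null; do
     date +%Y-%b-%d_%Hh%Mm%Ss
     # psr = CPU the thread last ran on; sysfs maps each CPU to its node
     for cpu in $(ps -Lo psr= -p "$PID"); do
         basename /sys/devices/system/cpu/cpu$cpu/node*
     done | sort | uniq -c
     sleep 5
 done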