[RESEND RFC/RFT V2 PATCH 0/5] Improve scheduler scalability for fast path
Currently select_idle_sibling() first tries to find a fully idle core using select_idle_core(), which can potentially search all cores, and if that fails it finds any idle CPU using select_idle_cpu(). select_idle_cpu() can potentially search all CPUs in the LLC domain. This doesn't scale for large LLC domains and will only get worse with more cores in the future.

This patch series addresses the scalability problem by:

- Setting an upper and lower limit on the idle CPU search in
  select_idle_cpu() to keep the search time low and bounded
- Adding a new sched feature, SIS_CORE, to disable select_idle_core()

It also introduces a new per-CPU variable, next_cpu, to track where the previous search ended so that each new search starts from that point. This rotating search window over the CPUs in the LLC domain ensures that idle CPUs are eventually found even under high load.

Following are the performance numbers from various benchmarks with SIS_CORE true (idle core search enabled).

Hackbench process on a 2 socket, 44 core, 88 thread Intel x86 machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       0.5816    8.94    0.5903 (-1.5%)   11.28
2       0.6428    10.64   0.5843 (9.1%)    4.93
4       1.0152    1.99    0.9965 (1.84%)   1.83
8       1.8128    1.4     1.7921 (1.14%)   1.76
16      3.1666    0.8     3.1345 (1.01%)   0.81
32      5.6084    0.83    5.5677 (0.73%)   0.8

Sysbench MySQL on a 2 socket, 44 core, 88 thread Intel x86 machine
(higher is better):
threads  baseline  %stdev  patch             %stdev
8        2095.45   1.82    2102.6 (0.34%)    2.11
16       4218.45   0.06    4221.35 (0.07%)   0.38
32       7531.36   0.49    7607.18 (1.01%)   0.25
48       10206.42  0.21    10324.26 (1.15%)  0.13
64       12053.73  0.1     12158.3 (0.87%)   0.24
128      14810.33  0.04    14840.4 (0.2%)    0.38

Oracle DB on a 2 socket, 44 core, 88 thread Intel x86 machine
(normalized, higher is better):
users  baseline  %stdev  patch            %stdev
20     1         0.9     1.0068 (0.68%)   0.27
40     1         0.8     1.0103 (1.03%)   1.24
60     1         0.34    1.0178 (1.78%)   0.49
80     1         0.53    1.0092 (0.92%)   1.5
100    1         0.79    1.0090 (0.9%)    0.88
120    1         0.06    1.0048 (0.48%)   0.72
140    1         0.22    1.0116 (1.16%)   0.05
160    1         0.57    1.0264 (2.64%)   0.67
180    1         0.81    1.0194 (1.94%)   0.91
200    1         0.44    1.028 (2.8%)     3.09
220    1         1.74    1.0229 (2.29%)   0.21

Uperf pingpong on a 2 socket, 44 core, 88 thread Intel x86 machine with
message size = 8k (higher is better):
threads  baseline  %stdev  patch            %stdev
8        45.36     0.43    46.28 (2.01%)    0.29
16       87.81     0.82    89.67 (2.12%)    0.38
32       151.19    0.02    153.5 (1.53%)    0.41
48       190.2     0.21    194.79 (2.41%)   0.07
64       190.42    0.35    202.9 (6.55%)    1.66
128      323.86    0.28    343.56 (6.08%)   1.34

Dbench on a 2 socket, 44 core, 88 thread Intel x86 machine
(higher is better):
clients  baseline  patch
1        629.8     603.83 (-4.12%)
2        1159.65   1155.75 (-0.34%)
4        2121.61   2093.99 (-1.3%)
8        2620.52   2641.51 (0.8%)
16       2879.31   2897.6 (0.64%)
32       2791.24   2936.47 (5.2%)
64       1853.07   1894.74 (2.25%)
128      1484.95   1494.29 (0.63%)

Tbench on a 2 socket, 44 core, 88 thread Intel x86 machine
(higher is better):
clients  baseline  patch
1        256.41    255.8 (-0.24%)
2        509.89    504.52 (-1.05%)
4        999.44    1003.74 (0.43%)
8        1982.7    1976.42 (-0.32%)
16       3891.51   3916.04 (0.63%)
32       6819.24   6845.06 (0.38%)
64       8542.95   8568.28 (0.3%)
128      15277.6   15754.6 (3.12%)

Schbench on a 2 socket, 44 core, 88 thread Intel x86 machine with 44
tasks (lower is better):
percentile  baseline  %stdev  patch              %stdev
50          94        2.82    92 (2.13%)         2.17
75          124       2.13    122 (1.61%)        1.42
90          152       1.74    151 (0.66%)        0.66
95          171       2.11    170 (0.58%)        0
99          512.67    104.96  208.33 (59.36%)    1.2
99.5        2296      82.55   3674.66 (-60.05%)  22.19
99.9        12517.33  2.38    12784 (-2.13%)     0.66

Hackbench process on a 2 socket, 16 core, 128 thread SPARC machine
(lower is better):
groups  baseline  %stdev  patch            %stdev
1       1.3085    6.65    1.2213 (6.66%)   10.32
2       1.4559    8.55    1.5048 (-3.36%)  4.72
4       2.6271    1.74    2.5532 (2.81%)   2.02
8       4.7089    3.01    4.5118 (4.19%)