Re: [PATCH 00/33] AutoNUMA27
* Andrea Arcangeli [2012-10-14 06:57:16]:

> I'll release an autonuma29 behaving like 28fast if there are no
> surprises. The new algorithm change in 28fast will also save memory
> once I rewrite it properly.

Here are my results of specjbb2005 on a 2-node box (still on autonuma27,
but I plan to run on a newer release soon).

 ---------------------------------------------------------------------------------
| kernel      | vm   |             nofit             |              fit              |
|             |      |     noksm     |      ksm      |     noksm     |      ksm      |
|             |      | nothp |   thp | nothp |   thp | nothp |   thp | nothp |   thp |
 ---------------------------------------------------------------------------------
| mainline_v36| vm_1 | 136085| 188500| 133871| 163638| 133540| 178159| 132460| 164763|
|             | vm_2 |  61549|  80496|  61420|  74864|  63777|  80573|  60479|  73416|
|             | vm_3 |  60688|  79349|  62244|  73289|  64394|  80803|  61040|  74258|
 ---------------------------------------------------------------------------------
| autonuma27_ | vm_1 | 143261| 186080| 127420| 178505| 141080| 201436| 143216| 183710|
|             | vm_2 |  72224|  94368|  71309|  89576|  59098|  83750|  63813|  90862|
|             | vm_3 |  61215|  94213|  71539|  89594|  76269|  99637|  72412|  91191|
 ---------------------------------------------------------------------------------
| improvement | vm_1 |  5.27%| -1.28%| -4.82%|  9.09%|  5.65%| 13.07%|  8.12%| 11.50%|
| from        | vm_2 | 17.34%| 17.23%| 16.10%| 19.65%| -7.34%|  3.94%|  5.51%| 23.76%|
| mainline    | vm_3 |  0.87%| 18.73%| 14.93%| 22.25%| 18.44%| 23.31%| 18.63%| 22.80%|
 ---------------------------------------------------------------------------------

(Results with suggested tweaks from Andrea)

echo 0 > /sys/kernel/mm/autonuma/knuma_scand/pmd
echo 15000 > /sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs

 ---------------------------------------------------------------------------------
| kernel      | vm   |             nofit             |              fit              |
|             |      |     noksm     |      ksm      |     noksm     |      ksm      |
|             |      | nothp |   thp | nothp |   thp | nothp |   thp | nothp |   thp |
 ---------------------------------------------------------------------------------
| mainline_v36| vm_1 | 136142| 178362| 132493| 166169| 131774| 179340| 133058| 164637|
|             | vm_2 |  61143|  81943|  60998|  74195|  63725|  79530|  61916|  73183|
|             | vm_3 |  61599|  79058|  61448|  73248|  62563|  80815|  61381|  74669|
 ---------------------------------------------------------------------------------
| autonuma27_ | vm_1 | 142023|     na| 142808| 177880|     na| 197244| 145165| 174175|
|             | vm_2 |  61071|     na|  61008|  91184|     na|  78893|  71675|  80471|
|             | vm_3 |  72646|     na|  72855|  92167|     na|  99080|  64758|  91831|
 ---------------------------------------------------------------------------------
| improvement | vm_1 |  4.32%|     na|  7.79%|  7.05%|     na|  9.98%|  9.10%|  5.79%|
| from        | vm_2 | -0.12%|     na|  0.02%| 22.90%|     na| -0.80%| 15.76%|  9.96%|
| mainline    | vm_3 | 17.93%|     na| 18.56%| 25.83%|     na| 22.60%|  5.50%| 22.98%|
 ---------------------------------------------------------------------------------

Host: Enterprise Linux Distro
  2 NUMA nodes, 6 cores + 6 hyperthreads/node, 12 GB RAM/node
  (total of 24 logical CPUs and 24 GB RAM)

VMs: Enterprise Linux Distro, distro kernel
  Main VM (VM1) -- relevant benchmark score.
    12 vCPUs
    Either 12 GB (for the '< 1 Node' configuration, i.e. the fit case)
    or 14 GB (for '> 1 Node', i.e. the nofit case)
  Noise VMs (VM2 and VM3): each noise VM has half of the remaining resources.
    6 vCPUs
    Either 4 GB (for the '< 1 Node' configuration) or 3 GB ('> 1 Node')
    (to sum 20 GB w/ Main VM + 4
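As a cross-check of the improvement rows: specjbb2005 reports throughput (higher is better), so each percentage is consistent with (autonuma / mainline - 1) * 100 for the matching cell. A minimal sketch (the formula is inferred from the numbers, not taken from the reporting scripts), using two cells from the first table:

```python
def improvement(mainline: float, autonuma: float) -> float:
    """Throughput gain in percent; positive means autonuma27 scored higher."""
    return (autonuma / mainline - 1.0) * 100.0

# vm_1, nofit columns of the first table (noksm/nothp and noksm/thp)
print(f"{improvement(136085, 143261):+.2f}%")  # +5.27%
print(f"{improvement(188500, 186080):+.2f}%")  # -1.28%
```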
Re: [PATCH 00/33] AutoNUMA27
> Interesting. So numa01 should be improved in autonuma28fast. Not sure
> why the hard binds show any difference, but I'm more concerned in
> optimizing numa01. I get the same results from hard bindings on
> upstream or autonuma, strange.
>
> Could you repeat only numa01 with the origin/autonuma28fast branch?

Okay, will try to get the numbers on autonuma28 soon.

> Also if you could post the two pdf convergence charts generated by
> numa01 on autonuma27 and autonuma28fast, I think that would be
> interesting to see the full effect and why it is faster.

Have attached the chart for autonuma27 in a private email.

--
Thanks and Regards
Srikar

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/33] AutoNUMA27
Hi Srikar,

On Sun, Oct 14, 2012 at 12:10:19AM +0530, Srikar Dronamraju wrote:
> * Andrea Arcangeli [2012-10-04 01:50:42]:
>
> > Hello everyone,
> >
> > This is a new AutoNUMA27 release for Linux v3.6.
>
> Here are results of autonumabenchmark on a 328GB 64-core box with HT
> disabled, comparing v3.6 with autonuma27.

*snip*

> numa01: 1805.19 1907.11 1866.39 -3.88%

Interesting. So numa01 should be improved in autonuma28fast. Not sure
why the hard binds show any difference, but I'm more concerned in
optimizing numa01. I get the same results from hard bindings on
upstream or autonuma, strange.

Could you repeat only numa01 with the origin/autonuma28fast branch?

Also, if you could post the two pdf convergence charts generated by
numa01 on autonuma27 and autonuma28fast, I think it would be
interesting to see the full effect and why it is faster.

I only had the time for a quick push after having the idea added in
autonuma28fast (which is further improved compared to autonuma28), but
I've been told already that it's dealing with numa01 on the 8-node box
very well, as expected.

numa01 on the 8-node box is a workload without a perfect solution
(other than MADV_INTERLEAVE). Full convergence preventing cross-node
traffic is impossible because there are 2 processes spanning 8 nodes
and all process memory is touched by all threads constantly. Yet
autonuma28fast should deal optimally with that scenario too.

As a side note: numa01 on the 2-node box instead converges fully (2
processes + 2 nodes = full convergence). numa01 on 2 nodes and numa01
on >2 nodes are very different kinds of test.

I'll release an autonuma29 behaving like 28fast if there are no
surprises. The new algorithm change in 28fast will also save memory
once I rewrite it properly.

Thanks!
Andrea
Re: [PATCH 00/33] AutoNUMA27
* Andrea Arcangeli [2012-10-04 01:50:42]:

> Hello everyone,
>
> This is a new AutoNUMA27 release for Linux v3.6.

Here are results of autonumabenchmark on a 328GB 64-core box with HT
disabled, comparing v3.6 with autonuma27.

$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32510 MB
node 0 free: 31689 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32512 MB
node 1 free: 31930 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32512 MB
node 2 free: 31917 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32512 MB
node 3 free: 31928 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32512 MB
node 4 free: 31926 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32512 MB
node 5 free: 31913 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 65280 MB
node 6 free: 63952 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 65280 MB
node 7 free: 64230 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

KernelVersion: 3.6.0-mainline_v36
Testcase:                              Min      Max      Avg
numa01:                            1509.14  2098.75  1793.90
numa01_HARD_BIND:                   865.43  1826.40  1334.85
numa01_INVERSE_BIND:               3242.76  3496.71  3345.12
numa01_THREAD_ALLOC:                944.28  1418.78  1214.32
numa01_THREAD_ALLOC_HARD_BIND:      696.33  1004.99   825.63
numa01_THREAD_ALLOC_INVERSE_BIND:  2072.88  2301.27  2186.33
numa02:                             129.87   146.10   136.88
numa02_HARD_BIND:                    25.81    26.18    25.97
numa02_INVERSE_BIND:                341.96   354.73   345.59
numa02_SMT:                         160.77   246.66   186.85
numa02_SMT_HARD_BIND:                25.77    38.86    33.57
numa02_SMT_INVERSE_BIND:            282.61   326.76   296.44

KernelVersion: 3.6.0-autonuma27+
Testcase:                              Min      Max      Avg  %Change
numa01:                            1805.19  1907.11  1866.39   -3.88%
numa01_HARD_BIND:                   953.33  2050.23  1603.29  -16.74%
numa01_INVERSE_BIND:               3515.14  3882.10  3715.28   -9.96%
numa01_THREAD_ALLOC:                323.50   362.17   348.81  248.13%
numa01_THREAD_ALLOC_HARD_BIND:      841.08  1205.80   977.43  -15.53%
numa01_THREAD_ALLOC_INVERSE_BIND:  2268.35  2654.89  2439.51  -10.38%
numa02:                              51.64    73.35    58.88  132.47%
numa02_HARD_BIND:                    25.23    26.31    25.93    0.15%
numa02_INVERSE_BIND:                338.39   355.70   344.82    0.22%
numa02_SMT:                          51.76    66.78    58.63  218.69%
numa02_SMT_HARD_BIND:                34.95    45.39    39.24  -14.45%
numa02_SMT_INVERSE_BIND:            287.85   300.82   295.80    0.22%
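The %Change column follows from the Avg columns of the two kernels: since these are runtimes (lower is better), the printed values match (mainline_avg / autonuma_avg - 1) * 100. A small sketch reproducing three of the rows (the formula is inferred from the numbers, not taken from the harness):

```python
def pct_change(mainline_avg: float, autonuma_avg: float) -> float:
    """Runtime speedup in percent; positive means autonuma27 was faster."""
    return (mainline_avg / autonuma_avg - 1.0) * 100.0

print(f"numa01:              {pct_change(1793.90, 1866.39):7.2f}%")  # -3.88%
print(f"numa01_THREAD_ALLOC: {pct_change(1214.32, 348.81):7.2f}%")   # 248.13%
print(f"numa02:              {pct_change(136.88, 58.88):7.2f}%")     # 132.47%
```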
Re: [PATCH 00/33] AutoNUMA27
On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> > Hi Mel,
> >
> > On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > > Benchmark on a 4-node machine and the following fell out.
> > >
> > >                                      3.6.0               3.6.0
> > >                                    vanilla      autonuma-v33r6
> > > User    SMT            82851.82 (  0.00%)  33084.03 ( 60.07%)
> > > User    THREAD_ALLOC  142723.90 (  0.00%)  47707.38 ( 66.57%)
> > > System  SMT              396.68 (  0.00%)    621.46 (-56.67%)
> > > System  THREAD_ALLOC     675.22 (  0.00%)    836.96 (-23.95%)
> > > Elapsed SMT             1987.08 (  0.00%)    828.57 ( 58.30%)
> > > Elapsed THREAD_ALLOC    3222.99 (  0.00%)   1101.31 ( 65.83%)
> > > CPU     SMT             4189.00 (  0.00%)   4067.00 (  2.91%)
> > > CPU     THREAD_ALLOC    4449.00 (  0.00%)   4407.00 (  0.94%)
> >
> > Thanks a lot for the help and for looking into it!
> >
> > Just curious, why are you running only numa02_SMT and
> > numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> > without _suffix)
>
> Bug in the testing script on my end. Each of them are run separately and it

Ok, MMTests 0.06 (released a few minutes ago) patches autonumabench so
it can run the tests individually. I know start_bench.sh can run all
the tests itself, but in time I'll want mmtests to collect additional
stats that can also be applied to other benchmarks consistently.

The revised results look like this

AUTONUMA BENCH
                                      3.6.0               3.6.0
                                    vanilla      autonuma-v33r6
User    NUMA01             66395.58 (  0.00%)  32000.83 ( 51.80%)
User    NUMA01_THEADLOCAL  55952.48 (  0.00%)  16950.48 ( 69.71%)
User    NUMA02              6988.51 (  0.00%)   2150.56 ( 69.23%)
User    NUMA02_SMT          2914.25 (  0.00%)   1013.11 ( 65.24%)
System  NUMA01               319.12 (  0.00%)    483.60 ( -51.54%)
System  NUMA01_THEADLOCAL     40.60 (  0.00%)    184.39 (-354.16%)
System  NUMA02                 1.62 (  0.00%)     23.92 (-1376.54%)
System  NUMA02_SMT             0.90 (  0.00%)     16.20 (-1700.00%)
Elapsed NUMA01              1519.53 (  0.00%)    757.40 ( 50.16%)
Elapsed NUMA01_THEADLOCAL   1269.49 (  0.00%)    398.63 ( 68.60%)
Elapsed NUMA02               181.12 (  0.00%)     57.09 ( 68.48%)
Elapsed NUMA02_SMT           164.18 (  0.00%)     53.16 ( 67.62%)
CPU     NUMA01              4390.00 (  0.00%)   4288.00 (  2.32%)
CPU     NUMA01_THEADLOCAL   4410.00 (  0.00%)   4298.00 (  2.54%)
CPU     NUMA02              3859.00 (  0.00%)   3808.00 (  1.32%)
CPU     NUMA02_SMT          1775.00 (  0.00%)   1935.00 ( -9.01%)

MMTests Statistics: duration
               3.6.0          3.6.0
             vanilla autonuma-v33r6
User       132257.44       52121.30
System        362.79         708.62
Elapsed      3142.66        1275.72

MMTests Statistics: vmstat
                        3.6.0          3.6.0
                      vanilla autonuma-v33r6
THP fault alloc         17660          19927
THP collapse alloc         10          12399
THP splits                  4          12637

The System CPU usage is high but is compensated for with reduced User
and Elapsed times in this particular case.

--
Mel Gorman
SUSE Labs
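For reference, the bracketed percentages in the MMtests output above are consistent with (vanilla - patched) / vanilla * 100, i.e. positive means less time spent than the vanilla kernel (the convention is inferred from the numbers, not from the MMtests source):

```python
def mmtests_gain(vanilla: float, patched: float) -> float:
    """MMtests-style gain; positive = patched kernel used less time."""
    return (vanilla - patched) / vanilla * 100.0

print(f"Elapsed NUMA01:            {mmtests_gain(1519.53, 757.40):8.2f}%")  # 50.16%
print(f"System  NUMA01_THEADLOCAL: {mmtests_gain(40.60, 184.39):8.2f}%")    # -354.16%
```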
Re: [PATCH 00/33] AutoNUMA27
On Fri, Oct 12, 2012 at 03:45:53AM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> > So after getting through the full review of it, there wasn't anything
> > I could not stand. I think it's *very* heavy on some of the paths like
> > the idle balancer which I was not keen on and the fault paths are also
> > quite heavy. I think the weight on some of these paths can be reduced
> > but not to 0 if the objectives of autonuma are to be met.
> >
> > I'm not fully convinced that the task exchange is actually necessary or
> > beneficial because it somewhat assumes that there is a symmetry between CPU
> > and memory balancing that may not be true. The fact that it only considers
>
> The problem is that without an active task exchange and no explicit
> call to stop_one_cpu*, there's no way to migrate a currently running
> task and clearly we need that. We can indefinitely wait hoping the
> task goes to sleep and leaves the CPU idle, or that a couple of other
> tasks start and trigger load balance events.

Stick that in a comment, although I still don't fully see why the
actual exchange is necessary and why you cannot just move the current
task to the remote CPU's runqueue. Maybe it's something to do with them
converging faster if you do an exchange. I'll figure it out eventually.

> We must move tasks even if all cpus are in a steady rq->nr_running ==
> 1 state and there's no other scheduler balance event that could
> possibly attempt to move tasks around in such a steady state.

I see: just because there is a 1:1 mapping between tasks and CPUs does
not mean that the system has converged from a NUMA perspective. The
idle balancer could be moving a task to an idle CPU that is poor from a
NUMA point of view. Better integration with the load balancer, and
caching both the best and worst converged processes on a per-NUMA-node
basis, might help, but I'm hand-waving.

> Of course one could hack the active idle balancing so that it does the
> active NUMA balancing action, but that would be a purely artificial
> complication: it would add unnecessary delay and it would provide no
> benefit whatsoever.
>
> Why don't we dump the active idle balancing too, and we hack the load
> balancing to do the active idle balancing as well? Of course then the
> two will be more integrated. But it'll be a mess and slower and
> there's a good reason why they exist as totally separated pieces of
> code working in parallel.

I'm not 100% convinced they have to be separate, but you have thought
about this a hell of a lot more than I have and I'm a scheduling dummy.
For example, to me it seems that if the load balancer was going to move
a task to an idle CPU on a remote node, it could also check if it would
be more or less converged before moving, and reject the balancing if it
would be less converged after the move. This increases the search cost
in the load balancer but is not necessarily any worse than what happens
currently.

> We can integrate it more, but in my view the result would be worse and
> more complicated. Last but not the least messing the idle balancing
> code to do an active NUMA balancing action (somehow invoking
> stop_one_cpu* in the steady state described above) would force even
> cellphones and UP kernels to deal with NUMA code somehow.

hmm...

> > tasks that are currently running feels a bit random but examining all tasks
> > that recently ran on the node would be far too expensive so there is no
>
> So far this seems a good tradeoff. Nothing will prevent us from
> scanning deeper into the runqueues later if we find a way to do that
> efficiently.

I don't think there is an efficient way to do that, but I'm hoping that
caching an exchange candidate on a per-NUMA-node basis could reduce the
cost while still converging reasonably quickly.

> > good answer. You are caught between a rock and a hard place and either
> > direction you go is wrong for different reasons. You need something more
>
> I think you described the problem perfectly ;).
>
> > frequent than scans (because it'll converge too slowly) but doing it from
> > the balancer misses some tasks and may run too frequently and it's unclear
> > how it affects the current load balancer decisions. I don't have a good
> > alternative solution for this but ideally it would be better integrated with
> > the existing scheduler when there is more data on what those scheduling
> > decisions should be. That will only come from a wide range of testing and
> > the inevitable bug reports.
> >
> > That said, this is concentrating on the problems without considering the
> > situations where it would work very well. I think it'll come down to HPC
> > and anything jitter-sensitive will hate this while workloads like JVM,
> > virtualisation or anything that uses a lot of memory without caring about
> > placement will love it. It's not perfect but it's better than incurring
> > the cost of remote access unconditionally.
>
> Full agreement.
>
> Your detailed full review was very appreciated, thanks!

You're welcome.

--
Mel Gorman
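The convergence check floated above can be sketched as a toy user-space model (purely illustrative; the per-node page counts and the locality score are my assumptions, not AutoNUMA's actual statistics): before the balancer moves a task to an idle CPU on a remote node, compare the fraction of the task's memory local to the source and destination nodes, and reject moves that lower it.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    # Hypothetical bookkeeping: resident pages per NUMA node.
    pages_per_node: dict[int, int] = field(default_factory=dict)

    def locality(self, node: int) -> float:
        """Fraction of this task's pages resident on `node`."""
        total = sum(self.pages_per_node.values())
        return self.pages_per_node.get(node, 0) / total if total else 0.0

def should_migrate(task: Task, src_node: int, dst_node: int) -> bool:
    """Allow an idle-balance migration only if it does not reduce
    the task's NUMA locality (i.e. more of its memory is near the
    destination CPU than the source CPU)."""
    return task.locality(dst_node) >= task.locality(src_node)

t = Task("worker", {0: 800, 1: 200})        # most memory on node 0
print(should_migrate(t, src_node=0, dst_node=1))  # False: would hurt locality
print(should_migrate(t, src_node=1, dst_node=0))  # True: improves locality
```

In the real scheduler this check would of course have to be weighed against leaving the destination CPU idle, which is exactly the tension the thread is discussing.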
Re: [PATCH 00/33] AutoNUMA27
Hi Mel, On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote: > So after getting through the full review of it, there wasn't anything > I could not stand. I think it's *very* heavy on some of the paths like > the idle balancer which I was not keen on and the fault paths are also > quite heavy. I think the weight on some of these paths can be reduced > but not to 0 if the objectives to autonuma are to be met. > > I'm not fully convinced that the task exchange is actually necessary or > beneficial because it somewhat assumes that there is a symmetry between CPU > and memory balancing that may not be true. The fact that it only considers The problem is that without an active task exchange and no explicit call to stop_one_cpu*, there's no way to migrate a currently running task and clearly we need that. We can indefinitely wait hoping the task goes to sleep and leaves the CPU idle, or that a couple of other tasks start and trigger load balance events. We must move tasks even if all cpus are in a steady rq->nr_running == 1 state and there's no other scheduler balance event that could possibly attempt to move tasks around in such a steady state. Of course one could hack the active idle balancing so that it does the active NUMA balancing action, but that would be a purely artificial complication: it would add unnecessary delay and it would provide no benefit whatsoever. Why don't we dump the active idle balancing too, and we hack the load balancing to do the active idle balancing as well? Of course then the two will be more integrated. But it'll be a mess and slower and there's a good reason why they exist as totally separated pieces of code working in parallel. We can integrate it more, but in my view the result would be worse and more complicated. 
Last but not least, messing with the idle balancing code to do an
active NUMA balancing action (somehow invoking stop_one_cpu* in the
steady state described above) would force even cellphones and UP
kernels to deal with NUMA code somehow.

> tasks that are currently running feels a bit random but examining all tasks
> that recently ran on the node would be far too expensive so there is no

So far this seems a good tradeoff. Nothing will prevent us from
scanning deeper into the runqueues later if we find a way to do that
efficiently.

> good answer. You are caught between a rock and a hard place and either
> direction you go is wrong for different reasons. You need something more

I think you described the problem perfectly ;).

> frequent than scans (because it'll converge too slowly) but doing it from
> the balancer misses some tasks and may run too frequently and it's unclear
> how it effects the current load balancer decisions. I don't have a good
> alternative solution for this but ideally it would be better integrated with
> the existing scheduler when there is more data on what those scheduling
> decisions should be. That will only come from a wide range of testing and
> the inevitable bug reports.
>
> That said, this is concentrating on the problems without considering the
> situations where it would work very well. I think it'll come down to HPC
> and anything jitter-sensitive will hate this while workloads like JVM,
> virtualisation or anything that uses a lot of memory without caring about
> placement will love it. It's not perfect but it's better than incurring
> the cost of remote access unconditionally.

Full agreement.

Your detailed full review was very appreciated, thanks!

Andrea
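The steady-state argument above can be illustrated with a toy model (plain Python, not kernel code; all names are invented): when every runqueue holds exactly one task, the regular balancer sees nothing to move, and only an explicit exchange, the moral equivalent of stop_one_cpu*, can correct NUMA placement.

```python
# Toy model of two single-task "runqueues" in the steady state described
# above: every CPU has exactly one running task, so the idle and periodic
# load balancers see balanced queues and never act.

class Task:
    def __init__(self, name, memory_node):
        self.name = name
        self.memory_node = memory_node   # node holding most of its pages

class RunQueue:
    def __init__(self, node):
        self.node = node
        self.tasks = []                  # nr_running == len(self.tasks)

def load_balance_would_act(rqs):
    # The regular balancer only moves tasks when queue lengths differ.
    return len({len(rq.tasks) for rq in rqs}) > 1

def numa_exchange(rq_a, rq_b):
    # Explicit task exchange: swap the two running tasks so each ends up
    # on the node that holds its memory (stand-in for stop_one_cpu*).
    rq_a.tasks[0], rq_b.tasks[0] = rq_b.tasks[0], rq_a.tasks[0]

rq0, rq1 = RunQueue(0), RunQueue(1)
rq0.tasks.append(Task("A", memory_node=1))   # A's memory is remote
rq1.tasks.append(Task("B", memory_node=0))   # B's memory is remote

assert not load_balance_would_act([rq0, rq1])  # steady state: balancer idle
numa_exchange(rq0, rq1)
assert rq0.tasks[0].memory_node == 0 and rq1.tasks[0].memory_node == 1
```

The point of the sketch is only that no queue-length-based event ever fires in this state, so the exchange must be driven explicitly.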
Re: [PATCH 00/33] AutoNUMA27
On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> If System CPU time really does go down as this converges then that
> should be obvious from monitoring vmstat over time for a test. Early on
> - high usage with that dropping as it converges. If that doesn't happen
> then the tasks are not converging, the phases change constantly or
> something unexpected happened that needs to be identified.

Yes, all measurable kernel cost should be in the memory copies
(migration and khugepaged, the latter is going to be optimized away).
The migrations must stop after the workload converges. Either
migrations are used to reach convergence or they shouldn't happen in
the first place (not in any measurable amount).

> Ok. Are they separate STREAM instances or threads running on the same
> arrays?

My understanding is separate instances. I think it's a single-threaded
benchmark and you run many copies. It was modified to run for 5min
(otherwise upstream doesn't have enough time to get it wrong as a
result of background scheduling jitters).

Thanks!
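The benchmark shape being discussed, N separate single-threaded copies each with private arrays rather than one process with N threads sharing three arrays, can be sketched like this (illustrative only; this is not the real STREAM kernel, and the function name is made up):

```python
# Sketch of "separate instances": each Pool worker is an independent
# process with its own private arrays, mirroring "1 stream, then 4
# stream, then 8 until we max out all CPUs".  Not the real STREAM code.
from multiprocessing import Pool

def stream_copy_instance(n):
    # One independent instance: private a/b arrays, a simple copy kernel,
    # and a checksum so the work cannot be optimized away.
    a = list(range(n))
    b = [0] * n
    for i in range(n):
        b[i] = a[i]
    return sum(b)

if __name__ == "__main__":
    with Pool(4) as pool:                      # 4 separate instances
        print(pool.map(stream_copy_instance, [100_000] * 4))
```

Because each worker is a full process with private memory, a NUMA-aware scheduler only has to keep each process near its own pages, which is why this is "the simplest test of all to get right".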
Re: [PATCH 00/33] AutoNUMA27
On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
> On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > Benchmark on a 4-node machine and the following fell out.
> >
> >                                  3.6.0                 3.6.0
> >                                vanilla        autonuma-v33r6
> > User    SMT            82851.82 (  0.00%)   33084.03 ( 60.07%)
> > User    THREAD_ALLOC  142723.90 (  0.00%)   47707.38 ( 66.57%)
> > System  SMT              396.68 (  0.00%)     621.46 (-56.67%)
> > System  THREAD_ALLOC     675.22 (  0.00%)     836.96 (-23.95%)
> > Elapsed SMT             1987.08 (  0.00%)     828.57 ( 58.30%)
> > Elapsed THREAD_ALLOC    3222.99 (  0.00%)    1101.31 ( 65.83%)
> > CPU     SMT             4189.00 (  0.00%)    4067.00 (  2.91%)
> > CPU     THREAD_ALLOC    4449.00 (  0.00%)    4407.00 (  0.94%)
>
> Thanks a lot for the help and for looking into it!
>
> Just curious, why are you running only numa02_SMT and
> numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> without _suffix)

Bug in the testing script on my end. Each of them is run separately,
and in retrospect it looks like a THREAD_ALLOC test actually ran numa01
and then numa01_THREAD_ALLOC. The intention was to allow additional
stats to be gathered independently of what start_bench.sh collects.
Will improve it in the future.

> > The performance improvements are certainly there for this basic test but
> > I note the System CPU usage is very high.
>
> Yes, migration is expensive, but after convergence has been reached the
> system time should be the same as upstream.

Ok.

> btw, I improved things further in autonuma28 (new branch in aa.git).

Ok.

> > The vmstats showed this
> >
> > THP fault alloc        81376    86070
> > THP collapse alloc        14    40423
> > THP splits                 8    41792
> >
> > So we're doing a lot of splits and collapses for THP there. There is a
> > possibility that khugepaged and the autonuma kernel thread are doing some
> > busy work. Not a show-stopper, just interesting.
> >
> > I've done no analysis at all and this was just to have something to look
> > at before looking at the code closer.
>
> Sure, the idea is to have THP native migration, then we'll do zero
> collapse/splits.

Seems reasonable. It should be obvious to measure when/if that happens.

> > > The objective of AutoNUMA is to provide out-of-the-box performance as
> > > close as possible to (and potentially faster than) manual NUMA hard
> > > bindings.
> > >
> > > It is not very intrusive into the kernel core and is well structured
> > > into separate source modules.
> > >
> > > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > > NUMA placement algorithms such as numad (in userland through cpusets)
> > > and schednuma (in kernel too) and was found superior in all cases.
> > >
> > > Most important: not a single benchmark showed a regression yet when
> > > compared to vanilla kernels. Not even on the 2 node systems where the
> > > NUMA effects are less significant.
> >
> > Ok, I have not run a general regression test and won't get the chance to
> > soon but hopefully others will. One thing they might want to watch out
> > for is System CPU time. It's possible that your AutoNUMA benchmark
> > triggers a worst-case but it's worth keeping an eye on because any cost
> > from that has to be offset by gains from better NUMA placements.
>
> Good idea to monitor it indeed.

If System CPU time really does go down as this converges then that
should be obvious from monitoring vmstat over time for a test. Early on
- high usage with that dropping as it converges. If that doesn't happen
then the tasks are not converging, the phases change constantly or
something unexpected happened that needs to be identified.

> > Is STREAM really a good benchmark in this case? Unless you also ran it in
> > parallel mode, it basically operates against three arrays and is not really
> > NUMA friendly once the total size is greater than a NUMA node. I guess
> > it makes sense to run it just to see does autonuma break it :)
>
> The way this is run is that there is 1 stream, then 4 stream, then 8
> until we max out all CPUs.

Ok. Are they separate STREAM instances or threads running on the same
arrays?

> I think we could run "memhog" instead of "stream" and it'd be the
> same. stream probably better resembles real life computations.
>
> The upstream scheduler lacks any notion of affinity so eventually
> during the 5 min run, once a process changes node, it doesn't notice its
> memory was elsewhere so it stays there, and the memory can't follow
> the cpu either. So then it runs much slower.
>
> So it's the simplest test of all to get right, all it requires is some
> notion of node affinity.

Ok.

> It's
Re: [PATCH 00/33] AutoNUMA27
Hi Mel,

On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> As a basic sniff test I added a test to MMtests for the AutoNUMA
> Benchmark on a 4-node machine and the following fell out.
>
>                                  3.6.0                 3.6.0
>                                vanilla        autonuma-v33r6
> User    SMT            82851.82 (  0.00%)   33084.03 ( 60.07%)
> User    THREAD_ALLOC  142723.90 (  0.00%)   47707.38 ( 66.57%)
> System  SMT              396.68 (  0.00%)     621.46 (-56.67%)
> System  THREAD_ALLOC     675.22 (  0.00%)     836.96 (-23.95%)
> Elapsed SMT             1987.08 (  0.00%)     828.57 ( 58.30%)
> Elapsed THREAD_ALLOC    3222.99 (  0.00%)    1101.31 ( 65.83%)
> CPU     SMT             4189.00 (  0.00%)    4067.00 (  2.91%)
> CPU     THREAD_ALLOC    4449.00 (  0.00%)    4407.00 (  0.94%)

Thanks a lot for the help and for looking into it!

Just curious, why are you running only numa02_SMT and
numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
without _suffix)

> The performance improvements are certainly there for this basic test but
> I note the System CPU usage is very high.

Yes, migration is expensive, but after convergence has been reached the
system time should be the same as upstream.

btw, I improved things further in autonuma28 (new branch in aa.git).

> The vmstats showed this
>
> THP fault alloc        81376    86070
> THP collapse alloc        14    40423
> THP splits                 8    41792
>
> So we're doing a lot of splits and collapses for THP there. There is a
> possibility that khugepaged and the autonuma kernel thread are doing some
> busy work. Not a show-stopper, just interesting.
>
> I've done no analysis at all and this was just to have something to look
> at before looking at the code closer.

Sure, the idea is to have THP native migration, then we'll do zero
collapse/splits.

> > The objective of AutoNUMA is to provide out-of-the-box performance as
> > close as possible to (and potentially faster than) manual NUMA hard
> > bindings.
> >
> > It is not very intrusive into the kernel core and is well structured
> > into separate source modules.
> >
> > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > NUMA placement algorithms such as numad (in userland through cpusets)
> > and schednuma (in kernel too) and was found superior in all cases.
> >
> > Most important: not a single benchmark showed a regression yet when
> > compared to vanilla kernels. Not even on the 2 node systems where the
> > NUMA effects are less significant.
>
> Ok, I have not run a general regression test and won't get the chance to
> soon but hopefully others will. One thing they might want to watch out
> for is System CPU time. It's possible that your AutoNUMA benchmark
> triggers a worst-case but it's worth keeping an eye on because any cost
> from that has to be offset by gains from better NUMA placements.

Good idea to monitor it indeed.

> Is STREAM really a good benchmark in this case? Unless you also ran it in
> parallel mode, it basically operates against three arrays and is not really
> NUMA friendly once the total size is greater than a NUMA node. I guess
> it makes sense to run it just to see does autonuma break it :)

The way this is run is that there is 1 stream, then 4 streams, then 8,
until we max out all CPUs.

I think we could run "memhog" instead of "stream" and it'd be the
same. stream probably better resembles real life computations.

The upstream scheduler lacks any notion of affinity, so eventually
during the 5 min run, once a process changes node, it doesn't notice
its memory was elsewhere so it stays there, and the memory can't follow
the cpu either. So then it runs much slower.

So it's the simplest test of all to get right; all it requires is some
notion of node affinity.

It's also the only workload that the home node design in schednuma in
tip.git can get right (schednuma post current tip.git introduced the
cpu-follow-memory design of AutoNUMA, so schednuma will have a chance
to get right more stuff than just the stream multi instance benchmark).

So it's just a verification that the simple stuff (single threaded
process computing) is ok and the upstream regression vs hard NUMA
bindings is fixed.

stream is also one case where we have to perform identically to the
hard NUMA bindings. No migration of CPU or memory must ever happen with
AutoNUMA in the stream benchmark. AutoNUMA will just monitor it, find
that it is already in the best place, and leave it alone.

With the autonuma-benchmark it's impossible to reach identical
performance to the _HARD_BIND case because _HARD_BIND doesn't need to
do any memory migration (I'm 3 seconds away from hard bindings in a 198
sec run though, just the 3 seconds it takes to migrate 3g of ram ;).

> > == iozone ==
> >
> > ALL INIT RE RE RANDOM
Re: [PATCH 00/33] AutoNUMA27
On Fri, 05 Oct 2012 16:14:44 -0700 Andi Kleen wrote:
> IMHO needs a performance shot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of work. Any volunteers?

Here are some preliminary results from simple benchmarks on a 4-node,
32 CPU core (4x8 core) Dell PowerEdge R910 system.

For the simple linpack streams benchmark, both sched/numa and autonuma
are within the margin of error compared to manual tuning of task
affinity. This is a big win, since the current upstream scheduler has
regressions of 10-20% when the system runs 4 through 16 streams
processes.

For specjbb, the story is more complicated. After Larry, Peter and I
fixed the obvious bugs in sched/numa and got some basic
cpu-follows-memory code working (not yet in -tip AFAIK), the averaged
results look like this:

baseline:        246019
manual pinning:  285481 (+16%)
autonuma:        266626 (+8%)
sched/numa:      226540 (-8%)

This is with newer sched/numa code than what is in -tip right now. Once
Peter pushes the fixes by Larry and me into -tip, as well as his
cpu-follows-memory code, others should be able to run tests like this
as well.

Now for some other workloads, and tests on 8 node systems, etc...

Full results for the specjbb run below:

BASELINE - disabling autonuma (matches RHEL6 within 1%)

[root@perf74 SPECjbb]# cat r7_36_auto27_specjbb4_noauto.txt
spec1.txt: throughput = 243639.70 SPECjbb2005 bops
spec2.txt: throughput = 249186.20 SPECjbb2005 bops
spec3.txt: throughput = 247216.72 SPECjbb2005 bops
spec4.txt: throughput = 244035.60 SPECjbb2005 bops

Manual NUMACTL results are:

[root@perf74 SPECjbb]# more r7_36_numactl_specjbb4.txt
spec1.txt: throughput = 291430.22 SPECjbb2005 bops
spec2.txt: throughput = 283550.85 SPECjbb2005 bops
spec3.txt: throughput = 284028.71 SPECjbb2005 bops
spec4.txt: throughput = 282919.37 SPECjbb2005 bops

AUTONUMA27 - 3.6.0-0.24.autonuma27.test.x86_64

[root@perf74 SPECjbb]# more r7_36_auto27_specjbb4.txt
spec1.txt: throughput = 261835.01 SPECjbb2005 bops
spec2.txt: throughput = 269053.06 SPECjbb2005 bops
spec3.txt: throughput = 261230.50 SPECjbb2005 bops
spec3.txt: throughput = 274386.81 SPECjbb2005 bops

Tuned SCHED_NUMA from Friday 10/4/2012 with fixes from Peter, Rik and
Larry:

[root@perf74 SPECjbb]# more r7_36_schednuma_specjbb4.txt
spec1.txt: throughput = 222349.74 SPECjbb2005 bops
spec2.txt: throughput = 232988.59 SPECjbb2005 bops
spec3.txt: throughput = 223386.03 SPECjbb2005 bops
spec4.txt: throughput = 227438.11 SPECjbb2005 bops

--
All rights reversed.
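As a sanity check, the averaged figures and percentages quoted earlier in the message can be reproduced from the per-run throughputs above (small rounding differences aside):

```python
# Reproduce the averaged specjbb numbers from the per-run throughputs
# (values copied from the full results above).
def avg(runs):
    return sum(runs) / len(runs)

baseline  = avg([243639.70, 249186.20, 247216.72, 244035.60])
manual    = avg([291430.22, 283550.85, 284028.71, 282919.37])
autonuma  = avg([261835.01, 269053.06, 261230.50, 274386.81])
schednuma = avg([222349.74, 232988.59, 223386.03, 227438.11])

def pct(new, base):
    # Percent improvement over the baseline kernel.
    return (new - base) / base * 100.0

print(int(baseline))                    # 246019
print(round(pct(manual, baseline)))     # +16
print(round(pct(autonuma, baseline)))   # +8
print(round(pct(schednuma, baseline)))  # -8
```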
Re: [PATCH 00/33] AutoNUMA27
On 10/05/2012 05:11 PM, Andi Kleen wrote:
> Tim Chen <tim.c.c...@linux.intel.com> writes:
>>
>> I remembered that 3 months ago when Alex tested the numa/sched patches
>> there were 20% regression on SpecJbb2005 due to the numa balancer.
>
> 20% on anything sounds like a show stopper to me.
>
> -Andi

Much worse than that on an 8-way machine for a multi-node
multi-threaded process, from what I can tell. (Andrea's AutoNUMA
microbenchmark is a simple version of that). The contention on the page
table lock ( &(&mm->page_table_lock)->rlock ) goes through the roof,
with threads constantly fighting to invalidate translations and
re-fault them.

This is on a DL980 with Xeon E7-2870s @ 2.4 GHz, btw. Running
linux-next with no tweaks other than kernel.sched_migration_cost_ns =
50 gives:

numa01             8325.78
numa01_HARD_BIND    488.98

(The Hard Bind being a case where the threads are pre-bound to the node
set with their memory, so what should be a fairly "best case" for
comparison).

If the SchedNUMA scanning period is upped to 25000 ms (to keep repeated
invalidations from being triggered while the contention for the first
invalidation pass is still being fought over):

numa01             4272.93
numa01_HARD_BIND    498.98

Since this is a "big" process in the current SchedNUMA code and hence
much more likely to trip invalidations, forcing task_numa_big() to
always return false in order to avoid the frequent invalidations gives:

numa01              429.07
numa01_HARD_BIND    466.67

Finally, with SchedNUMA entirely disabled but the rest of linux-next
left intact:

numa01             1075.31
numa01_HARD_BIND    484.20

I didn't write down the lock contentions for comparison, but yes - the
contention does decrease similarly to the time decreases. There are
other microbenchmarks, but those suffice to show the regression
pattern.

I mentioned this to the Red Hat folks last week, so I expect this is
already being worked on. It seemed pertinent to bring up given the
discussion about the current state of linux-next though, just so folks
know.

From where I'm sitting, it looks to me like the scan period is way too
aggressive and there's too much work potentially attempted during a
"scan" (by which I mean the hard tick driven choice to invalidate in
order to set up potential migration faults). The current code
walks/invalidates the entire virtual address space, skipping few vmas.
For a very large 64-bit process, that's going to be a *lot* of
translations (or even vmas if the address space is fragmented) to walk.
That's a seriously long path coming from the timer code. I would think
capping the number of translations to process per visit would help.

Hope this helps the discussion,
Don Morris
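The capping suggestion amounts to a bounded, resumable walk: invalidate at most N translations per visit and keep a cursor so the next visit continues where the last one stopped. A minimal model (pure Python with invented names; the real fix would live in the kernel's scan path):

```python
# Illustrative model of capping work per scan visit.  Instead of walking
# the entire address space from one timer tick, each visit processes at
# most `cap` translations and records a resume point.

class CappedScanner:
    def __init__(self, num_translations, cap_per_visit):
        self.num = num_translations
        self.cap = cap_per_visit
        self.cursor = 0              # resume point for the next visit

    def scan_visit(self):
        # Invalidate at most `cap` translations starting at the cursor;
        # a later tick resumes here rather than rescanning everything.
        todo = min(self.cap, self.num - self.cursor)
        self.cursor += todo
        done_pass = self.cursor == self.num
        if done_pass:
            self.cursor = 0          # full pass complete; wrap around
        return todo, done_pass

scanner = CappedScanner(num_translations=10_000, cap_per_visit=256)
visits, total = 0, 0
while True:
    todo, done = scanner.scan_visit()
    visits += 1
    total += todo
    if done:
        break
print(visits, total)                 # 40 visits, 10000 translations
```

Each visit does bounded work (here at most 256 entries), so the cost per timer tick stays constant regardless of how large the process's address space is; a full pass simply takes more visits.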
Re: [PATCH 00/33] AutoNUMA27
On Fri, 05 Oct 2012 16:14:44 -0700 Andi Kleen wrote:
> IMHO needs a performance shoot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of work. Any volunteers?

Here are some preliminary results from simple benchmarks on a 4-node, 32-CPU-core (4x8 core) Dell PowerEdge R910 system.

For the simple linpack streams benchmark, both sched/numa and autonuma are within the margin of error compared to manual tuning of task affinity. This is a big win, since the current upstream scheduler has regressions of 10-20% when the system runs 4 through 16 streams processes.

For specjbb, the story is more complicated. After fixing the obvious bugs in sched/numa and getting some basic cpu-follows-memory code going (not yet in -tip AFAIK), work by Larry, Peter and me, the averaged results look like this:

baseline:        246019
manual pinning:  285481 (+16%)
autonuma:        266626  (+8%)
sched/numa:      226540  (-8%)

This is with newer sched/numa code than what is in -tip right now. Once Peter pushes the fixes by Larry and me into -tip, as well as his cpu-follows-memory code, others should be able to run tests like this as well.

Now for some other workloads, and tests on 8-node systems, etc...
Full results for the specjbb run below:

BASELINE - disabling auto numa (matches RHEL6 within 1%)
[root@perf74 SPECjbb]# cat r7_36_auto27_specjbb4_noauto.txt
spec1.txt: throughput = 243639.70 SPECjbb2005 bops
spec2.txt: throughput = 249186.20 SPECjbb2005 bops
spec3.txt: throughput = 247216.72 SPECjbb2005 bops
spec4.txt: throughput = 244035.60 SPECjbb2005 bops

Manual NUMACTL results are:
[root@perf74 SPECjbb]# more r7_36_numactl_specjbb4.txt
spec1.txt: throughput = 291430.22 SPECjbb2005 bops
spec2.txt: throughput = 283550.85 SPECjbb2005 bops
spec3.txt: throughput = 284028.71 SPECjbb2005 bops
spec4.txt: throughput = 282919.37 SPECjbb2005 bops

AUTONUMA27 - 3.6.0-0.24.autonuma27.test.x86_64
[root@perf74 SPECjbb]# more r7_36_auto27_specjbb4.txt
spec1.txt: throughput = 261835.01 SPECjbb2005 bops
spec2.txt: throughput = 269053.06 SPECjbb2005 bops
spec3.txt: throughput = 261230.50 SPECjbb2005 bops
spec4.txt: throughput = 274386.81 SPECjbb2005 bops

Tuned SCHED_NUMA from Friday 10/4/2012 with fixes from Peter, Rik and Larry:
[root@perf74 SPECjbb]# more r7_36_schednuma_specjbb4.txt
spec1.txt: throughput = 222349.74 SPECjbb2005 bops
spec2.txt: throughput = 232988.59 SPECjbb2005 bops
spec3.txt: throughput = 223386.03 SPECjbb2005 bops
spec4.txt: throughput = 227438.11 SPECjbb2005 bops

--
All rights reversed.
Re: [PATCH 00/33] AutoNUMA27
Tim Chen writes:
>
> I remembered that 3 months ago when Alex tested the numa/sched patches
> there was a 20% regression on SpecJbb2005 due to the numa balancer.

20% on anything sounds like a show stopper to me.

-Andi

--
a...@linux.intel.com -- Speaking for myself only
Re: [PATCH 00/33] AutoNUMA27
On Fri, 2012-10-05 at 16:14 -0700, Andi Kleen wrote:
> Andrew Morton writes:
>
> > On Thu, 4 Oct 2012 01:50:42 +0200
> > Andrea Arcangeli wrote:
> >
> >> This is a new AutoNUMA27 release for Linux v3.6.
> >
> > Peter's numa/sched patches have been in -next for a week.
>
> Did they pass review? I have some doubts.
>
> The last time I looked it also broke numactl.
>
> > Guys, what's the plan here?
>
> Since they are both performance features their ultimate benefit
> is how much faster they make things (and how seldom they make things
> slower)
>
> IMHO needs a performance shoot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of work. Any volunteers?
>
> For a change like this I think less regression is actually more
> important than the highest peak numbers.
>
> -Andi

I remember that 3 months ago, when Alex tested the numa/sched patches, there was a 20% regression on SpecJbb2005 due to the numa balancer. Those issues may have been fixed, but we probably need to run this benchmark against the latest code. For most of the other kernel performance workloads we ran, we didn't see much change.

Maurico has a different config for this benchmark, and it would be nice if he could also check whether there are any performance changes on his side.

Tim
Re: [PATCH 00/33] AutoNUMA27
Andrew Morton writes:

> On Thu, 4 Oct 2012 01:50:42 +0200
> Andrea Arcangeli wrote:
>
>> This is a new AutoNUMA27 release for Linux v3.6.
>
> Peter's numa/sched patches have been in -next for a week.

Did they pass review? I have some doubts.

The last time I looked it also broke numactl.

> Guys, what's the plan here?

Since they are both performance features, their ultimate benefit is how much faster they make things (and how seldom they make things slower).

IMHO this needs a performance shoot-out. Run both on the same 10 workloads and see who wins. Just a lot of work. Any volunteers?

For a change like this I think less regression is actually more important than the highest peak numbers.

-Andi

--
a...@linux.intel.com -- Speaking for myself only