Re: [PATCH 00/33] AutoNUMA27

2012-10-23 Thread Srikar Dronamraju
* Andrea Arcangeli <aarca...@redhat.com> [2012-10-14 06:57:16]:

> I'll release an autonuma29 behaving like 28fast if there are no
> surprises. The new algorithm change in 28fast will also save memory
> once I rewrite it properly.
> 

Here are my results of specjbb2005 on a 2-node box (still on autonuma27, but I
plan to run on a newer release soon).


-----------------------------------------------------------------------------------
|   kernel   | vm |             nofit             |              fit              |
|            |    |     noksm     |      ksm      |     noksm     |      ksm      |
|            |    | nothp |  thp  | nothp |  thp  | nothp |  thp  | nothp |  thp  |
-----------------------------------------------------------------------------------
|mainline_v36|vm_1| 136085| 188500| 133871| 163638| 133540| 178159| 132460| 164763|
|            |vm_2|  61549|  80496|  61420|  74864|  63777|  80573|  60479|  73416|
|            |vm_3|  60688|  79349|  62244|  73289|  64394|  80803|  61040|  74258|
-----------------------------------------------------------------------------------
| autonuma27_|vm_1| 143261| 186080| 127420| 178505| 141080| 201436| 143216| 183710|
|            |vm_2|  72224|  94368|  71309|  89576|  59098|  83750|  63813|  90862|
|            |vm_3|  61215|  94213|  71539|  89594|  76269|  99637|  72412|  91191|
-----------------------------------------------------------------------------------
| improvement|vm_1|  5.27%| -1.28%| -4.82%|  9.09%|  5.65%| 13.07%|  8.12%| 11.50%|
|    from    |vm_2| 17.34%| 17.23%| 16.10%| 19.65%| -7.34%|  3.94%|  5.51%| 23.76%|
|  mainline  |vm_3|  0.87%| 18.73%| 14.93%| 22.25%| 18.44%| 23.31%| 18.63%| 22.80%|
-----------------------------------------------------------------------------------
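
For reference, the improvement rows above work out to
(autonuma27 - mainline) / mainline, e.g. for vm_1 nofit/noksm/nothp:

$ awk 'BEGIN { printf "%.2f%%\n", (143261 - 136085) / 136085 * 100 }'
5.27%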


(Results with the tweaks suggested by Andrea)

echo 0 > /sys/kernel/mm/autonuma/knuma_scand/pmd

echo 15000 > /sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs 
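
A quick sanity check that the tweaks took effect (reading back the same
sysfs files set above):

$ cat /sys/kernel/mm/autonuma/knuma_scand/pmd
0
$ cat /sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs
15000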


-----------------------------------------------------------------------------------
|   kernel   | vm |             nofit             |              fit              |
|            |    |     noksm     |      ksm      |     noksm     |      ksm      |
|            |    | nothp |  thp  | nothp |  thp  | nothp |  thp  | nothp |  thp  |
-----------------------------------------------------------------------------------
|mainline_v36|vm_1| 136142| 178362| 132493| 166169| 131774| 179340| 133058| 164637|
|            |vm_2|  61143|  81943|  60998|  74195|  63725|  79530|  61916|  73183|
|            |vm_3|  61599|  79058|  61448|  73248|  62563|  80815|  61381|  74669|
-----------------------------------------------------------------------------------
| autonuma27_|vm_1| 142023|     na| 142808| 177880|     na| 197244| 145165| 174175|
|            |vm_2|  61071|     na|  61008|  91184|     na|  78893|  71675|  80471|
|            |vm_3|  72646|     na|  72855|  92167|     na|  99080|  64758|  91831|
-----------------------------------------------------------------------------------
| improvement|vm_1|  4.32%|     na|  7.79%|  7.05%|     na|  9.98%|  9.10%|  5.79%|
|    from    |vm_2| -0.12%|     na|  0.02%| 22.90%|     na| -0.80%| 15.76%|  9.96%|
|  mainline  |vm_3| 17.93%|     na| 18.56%| 25.83%|     na| 22.60%|  5.50%| 22.98%|
-----------------------------------------------------------------------------------


Host:

Enterprise Linux Distro
2 NUMA nodes. 6 cores + 6 hyperthreads/node, 12 GB RAM/node.
(total of 24 logical CPUs and 24 GB RAM) 

VMs:

Enterprise Linux Distro
Distro Kernel
Main VM (VM1) -- relevant benchmark score.
12 vCPUs

Either 12 GB (for the '< 1 Node' configuration, i.e. the fit case: 12 GB
 fits within a single 12 GB node)
 or 14 GB (for the '> 1 Node' configuration, i.e. the nofit case)
Noise VMs (VM2 and VM3)
Each noise VM has half of the remaining resources.
6 vCPUs

Either 4 GB (for the '< 1 Node' configuration) or 3 GB ('> 1 Node')
(to sum 20 GB w/ Main VM + 4

Re: [PATCH 00/33] AutoNUMA27

2012-10-15 Thread Srikar Dronamraju
> 
> Interesting. So numa01 should be improved in autonuma28fast. Not sure
> why the hard binds show any difference, but I'm more concerned with
> optimizing numa01. I get the same results from hard bindings on
> upstream or autonuma, strange.
> 
> Could you repeat only numa01 with the origin/autonuma28fast branch?

Okay, will try to get the numbers on autonuma28 soon.

> Also if you could post the two pdf convergence chart generated by
> numa01 on autonuma27 and autonuma28fast, I think that would be
> interesting to see the full effect and why it is faster.

Have attached the chart for autonuma27 in a private email.

-- 
Thanks and Regards
Srikar



Re: [PATCH 00/33] AutoNUMA27

2012-10-13 Thread Andrea Arcangeli
Hi Srikar,

On Sun, Oct 14, 2012 at 12:10:19AM +0530, Srikar Dronamraju wrote:
> * Andrea Arcangeli <aarca...@redhat.com> [2012-10-04 01:50:42]:
> 
> > Hello everyone,
> > 
> > This is a new AutoNUMA27 release for Linux v3.6.
> > 
> 
> 
> Here are results of autonumabenchmark on a 328GB 64-core machine with HT
> disabled, comparing v3.6 with autonuma27.

*snip*

>   numa01: 1805.19  1907.11  1866.39    -3.88%

Interesting. So numa01 should be improved in autonuma28fast. Not sure
why the hard binds show any difference, but I'm more concerned with
optimizing numa01. I get the same results from hard bindings on
upstream or autonuma, strange.

Could you repeat only numa01 with the origin/autonuma28fast branch?
Also if you could post the two pdf convergence chart generated by
numa01 on autonuma27 and autonuma28fast, I think that would be
interesting to see the full effect and why it is faster.
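
For testing, assuming aa.git is already cloned and configured as the
'origin' remote, something along these lines checks out and builds the
branch:

$ git fetch origin
$ git checkout -b autonuma28fast origin/autonuma28fast
$ make oldconfig && make -j"$(nproc)"   # then install and boot the test kernel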

I only had the time for a quick push after having the idea added in
autonuma28fast (which is further improved compared to autonuma28), but
I've been told already that it's dealing with numa01 on the 8-node
system very well, as expected.

numa01 on the 8-node system is a workload without a perfect solution (other
than MADV_INTERLEAVE). Full convergence preventing cross-node traffic
is impossible because there are 2 processes spanning over 8 nodes and
all process memory is touched by all threads constantly. Yet
autonuma28fast should deal optimally with that scenario too.

As a side note: numa01 on the 2-node system instead converges fully (2
processes + 2 nodes = full convergence). numa01 on 2 nodes vs. >2 nodes
is a very different kind of test.

I'll release an autonuma29 behaving like 28fast if there are no
surprises. The new algorithm change in 28fast will also save memory
once I rewrite it properly.

Thanks!
Andrea


Re: [PATCH 00/33] AutoNUMA27

2012-10-13 Thread Srikar Dronamraju
* Andrea Arcangeli <aarca...@redhat.com> [2012-10-04 01:50:42]:

> Hello everyone,
> 
> This is a new AutoNUMA27 release for Linux v3.6.
> 


Here are results of autonumabenchmark on a 328GB 64-core machine with HT
disabled, comparing v3.6 with autonuma27.

$ numactl -H 
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32510 MB
node 0 free: 31689 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32512 MB
node 1 free: 31930 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 32512 MB
node 2 free: 31917 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 32512 MB
node 3 free: 31928 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 32512 MB
node 4 free: 31926 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 32512 MB
node 5 free: 31913 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 65280 MB
node 6 free: 63952 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 65280 MB
node 7 free: 64230 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  20  20  20  20  20  20  20 
  1:  20  10  20  20  20  20  20  20 
  2:  20  20  10  20  20  20  20  20 
  3:  20  20  20  10  20  20  20  20 
  4:  20  20  20  20  10  20  20  20 
  5:  20  20  20  20  20  10  20  20 
  6:  20  20  20  20  20  20  10  20 
  7:  20  20  20  20  20  20  20  10 



  KernelVersion: 3.6.0-mainline_v36
                        Testcase:     Min      Max      Avg
                          numa01: 1509.14  2098.75  1793.90
                numa01_HARD_BIND:  865.43  1826.40  1334.85
             numa01_INVERSE_BIND: 3242.76  3496.71  3345.12
             numa01_THREAD_ALLOC:  944.28  1418.78  1214.32
   numa01_THREAD_ALLOC_HARD_BIND:  696.33  1004.99   825.63
numa01_THREAD_ALLOC_INVERSE_BIND: 2072.88  2301.27  2186.33
                          numa02:  129.87   146.10   136.88
                numa02_HARD_BIND:   25.81    26.18    25.97
             numa02_INVERSE_BIND:  341.96   354.73   345.59
                      numa02_SMT:  160.77   246.66   186.85
            numa02_SMT_HARD_BIND:   25.77    38.86    33.57
         numa02_SMT_INVERSE_BIND:  282.61   326.76   296.44

  KernelVersion: 3.6.0-autonuma27+

                        Testcase:     Min      Max      Avg   %Change
                          numa01: 1805.19  1907.11  1866.39    -3.88%
                numa01_HARD_BIND:  953.33  2050.23  1603.29   -16.74%
             numa01_INVERSE_BIND: 3515.14  3882.10  3715.28    -9.96%
             numa01_THREAD_ALLOC:  323.50   362.17   348.81   248.13%
   numa01_THREAD_ALLOC_HARD_BIND:  841.08  1205.80   977.43   -15.53%
numa01_THREAD_ALLOC_INVERSE_BIND: 2268.35  2654.89  2439.51   -10.38%
                          numa02:   51.64    73.35    58.88   132.47%
                numa02_HARD_BIND:   25.23    26.31    25.93     0.15%
             numa02_INVERSE_BIND:  338.39   355.70   344.82     0.22%
                      numa02_SMT:   51.76    66.78    58.63   218.69%
            numa02_SMT_HARD_BIND:   34.95    45.39    39.24   -14.45%
         numa02_SMT_INVERSE_BIND:  287.85   300.82   295.80     0.22%
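
For reference, %Change here works out to (mainline_avg - autonuma_avg) /
autonuma_avg, i.e. positive when autonuma27 is faster, e.g. for numa02:

$ awk 'BEGIN { printf "%.2f%%\n", (136.88 - 58.88) / 58.88 * 100 }'
132.47%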



Re: [PATCH 00/33] AutoNUMA27

2012-10-12 Thread Mel Gorman
On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> > Hi Mel,
> > 
> > On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > > Benchmark on a 4-node machine and the following fell out.
> > > 
> > >                              3.6.0                3.6.0
> > >                            vanilla       autonuma-v33r6
> > > User    SMT              82851.82 (  0.00%)    33084.03 ( 60.07%)
> > > User    THREAD_ALLOC    142723.90 (  0.00%)    47707.38 ( 66.57%)
> > > System  SMT                396.68 (  0.00%)      621.46 (-56.67%)
> > > System  THREAD_ALLOC       675.22 (  0.00%)      836.96 (-23.95%)
> > > Elapsed SMT               1987.08 (  0.00%)      828.57 ( 58.30%)
> > > Elapsed THREAD_ALLOC      3222.99 (  0.00%)     1101.31 ( 65.83%)
> > > CPU     SMT               4189.00 (  0.00%)     4067.00 (  2.91%)
> > > CPU     THREAD_ALLOC      4449.00 (  0.00%)     4407.00 (  0.94%)
> > 
> > Thanks a lot for the help and for looking into it!
> > 
> > Just curious, why are you running only numa02_SMT and
> > numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> > without _suffix)
> > 
> 
> Bug in the testing script on my end. Each of them is run separately and it

Ok, MMTests 0.06 (released a few minutes ago) patches autonumabench so
it can run the tests individually. I know start_bench.sh can run all the
tests itself but in time I'll want mmtests to collect additional stats
that can also be applied to other benchmarks consistently. The revised
results look like this

AUTONUMA BENCH
                                        3.6.0                3.6.0
                                      vanilla       autonuma-v33r6
User    NUMA01              66395.58 (  0.00%)    32000.83 ( 51.80%)
User    NUMA01_THEADLOCAL   55952.48 (  0.00%)    16950.48 ( 69.71%)
User    NUMA02               6988.51 (  0.00%)     2150.56 ( 69.23%)
User    NUMA02_SMT           2914.25 (  0.00%)     1013.11 ( 65.24%)
System  NUMA01                319.12 (  0.00%)      483.60 (-51.54%)
System  NUMA01_THEADLOCAL      40.60 (  0.00%)      184.39 (-354.16%)
System  NUMA02                  1.62 (  0.00%)       23.92 (-1376.54%)
System  NUMA02_SMT              0.90 (  0.00%)       16.20 (-1700.00%)
Elapsed NUMA01               1519.53 (  0.00%)      757.40 ( 50.16%)
Elapsed NUMA01_THEADLOCAL    1269.49 (  0.00%)      398.63 ( 68.60%)
Elapsed NUMA02                181.12 (  0.00%)       57.09 ( 68.48%)
Elapsed NUMA02_SMT            164.18 (  0.00%)       53.16 ( 67.62%)
CPU     NUMA01               4390.00 (  0.00%)     4288.00 (  2.32%)
CPU     NUMA01_THEADLOCAL    4410.00 (  0.00%)     4298.00 (  2.54%)
CPU     NUMA02               3859.00 (  0.00%)     3808.00 (  1.32%)
CPU     NUMA02_SMT           1775.00 (  0.00%)     1935.00 ( -9.01%)
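
For reference, the bracketed percentages work out to relative gains,
(vanilla - autonuma) / vanilla, e.g. for User NUMA01:

$ awk 'BEGIN { printf "%.2f%%\n", (66395.58 - 32000.83) / 66395.58 * 100 }'
51.80%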

MMTests Statistics: duration
                3.6.0           3.6.0
              vanilla  autonuma-v33r6
User        132257.44        52121.30
System         362.79          708.62
Elapsed       3142.66         1275.72

MMTests Statistics: vmstat
                       3.6.0           3.6.0
                     vanilla  autonuma-v33r6
THP fault alloc        17660           19927
THP collapse alloc        10           12399
THP splits                 4           12637

The System CPU usage is high but is compensated for by reduced User
and Elapsed times in this particular case.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/33] AutoNUMA27

2012-10-12 Thread Mel Gorman
On Fri, Oct 12, 2012 at 03:45:53AM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> > So after getting through the full review of it, there wasn't anything
> > I could not stand. I think it's *very* heavy on some of the paths like
> > the idle balancer which I was not keen on and the fault paths are also
> > quite heavy.  I think the weight on some of these paths can be reduced
> > but not to 0 if the objectives to autonuma are to be met.
> > 
> > I'm not fully convinced that the task exchange is actually necessary or
> > beneficial because it somewhat assumes that there is a symmetry between CPU
> > and memory balancing that may not be true. The fact that it only considers
> 
> The problem is that without an active task exchange and without an explicit
> call to stop_one_cpu*, there's no way to migrate a currently running
> task and clearly we need that. We can indefinitely wait hoping the
> task goes to sleep and leaves the CPU idle, or that a couple of other
> tasks start and trigger load balance events.
> 

Stick that in a comment although I still don't fully see why the actual
exchange is necessary and why you cannot just move the current task to
the remote CPU's runqueue. Maybe it's something to do with them converging
faster if you do an exchange. I'll figure it out eventually.

> We must move tasks even if all cpus are in a steady rq->nr_running ==
> 1 state and there's no other scheduler balance event that could
> possibly attempt to move tasks around in such a steady state.
> 

I see, because just because there is a 1:1 mapping between tasks and
CPUs does not mean that it has converged from a NUMA perspective. The
idle balancer could be moving to an idle CPU that is poor from a NUMA
point of view. Better integration with the load balancer and caching on
a per-NUMA basis both the best and worst converged processes might help
but I'm hand-waving.

> Of course one could hack the active idle balancing so that it does the
> active NUMA balancing action, but that would be a purely artificial
> complication: it would add unnecessary delay and it would provide no
> benefit whatsoever.
> 
> Why don't we dump the active idle balancing too, and we hack the load
> balancing to do the active idle balancing as well? Of course then the
> two will be more integrated. But it'll be a mess and slower and
> there's a good reason why they exist as totally separated pieces of
> code working in parallel.
> 

I'm not 100% convinced they have to be separate but you have thought about
this a hell of a lot more than I have and I'm a scheduling dummy.

For example, to me it seems that if the load balancer was going to move a
task to an idle CPU on a remote node, it could also check if it would be
more or less converged before moving and reject the balancing if it would
be less converged after the move. This increases the search cost in the
load balancer but is not necessarily any worse than what happens currently.

> We can integrate it more, but in my view the result would be worse and
> more complicated. Last but not the least messing the idle balancing
> code to do an active NUMA balancing action (somehow invoking
> stop_one_cpu* in the steady state described above) would force even
> cellphones and UP kernels to deal with NUMA code somehow.
> 

hmm...

> > tasks that are currently running feels a bit random but examining all tasks
> > that recently ran on the node would be far too expensive so there is no
> 
> So far this seems a good tradeoff. Nothing will prevent us from scanning
> deeper into the runqueues later if we find a way to do that efficiently.
> 

I don't think there is an efficient way to do that but I'm hoping
caching an exchange candidate on a per-NUMA basis could reduce the cost
while still converging reasonably quickly.

> > good answer. You are caught between a rock and a hard place and either
> > direction you go is wrong for different reasons. You need something more
> 
> I think you described the problem perfectly ;).
> 
> > frequent than scans (because it'll converge too slowly) but doing it from
> > the balancer misses some tasks and may run too frequently and it's unclear
> how it affects the current load balancer decisions. I don't have a good
> > alternative solution for this but ideally it would be better integrated with
> > the existing scheduler when there is more data on what those scheduling
> > decisions should be. That will only come from a wide range of testing and
> > the inevitable bug reports.
> > 
> > That said, this is concentrating on the problems without considering the
> > situations where it would work very well.  I think it'll come down to HPC
> > and anything jitter-sensitive will hate this while workloads like JVM,
> > virtualisation or anything that uses a lot of memory without caring about
> > placement will love it. It's not perfect but it's better than incurring
> > the cost of remote access unconditionally.
> 
> Full agreement.
> 
> Your detailed full review was very appreciated, thanks!
> 

You're welcome.

-- 
Mel Gorman

Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Andrea Arcangeli
Hi Mel,

On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
> So after getting through the full review of it, there wasn't anything
> I could not stand. I think it's *very* heavy on some of the paths like
> the idle balancer which I was not keen on and the fault paths are also
> quite heavy.  I think the weight on some of these paths can be reduced
> but not to 0 if the objectives to autonuma are to be met.
> 
> I'm not fully convinced that the task exchange is actually necessary or
> beneficial because it somewhat assumes that there is a symmetry between CPU
> and memory balancing that may not be true. The fact that it only considers

The problem is that without an active task exchange and without an explicit
call to stop_one_cpu*, there's no way to migrate a currently running
task and clearly we need that. We can indefinitely wait hoping the
task goes to sleep and leaves the CPU idle, or that a couple of other
tasks start and trigger load balance events.

We must move tasks even if all cpus are in a steady rq->nr_running ==
1 state and there's no other scheduler balance event that could
possibly attempt to move tasks around in such a steady state.

Of course one could hack the active idle balancing so that it does the
active NUMA balancing action, but that would be a purely artificial
complication: it would add unnecessary delay and it would provide no
benefit whatsoever.

Why don't we dump the active idle balancing too, and we hack the load
balancing to do the active idle balancing as well? Of course then the
two will be more integrated. But it'll be a mess and slower and
there's a good reason why they exist as totally separated pieces of
code working in parallel.

We can integrate it more, but in my view the result would be worse and
more complicated. Last but not the least messing the idle balancing
code to do an active NUMA balancing action (somehow invoking
stop_one_cpu* in the steady state described above) would force even
cellphones and UP kernels to deal with NUMA code somehow.

> tasks that are currently running feels a bit random but examining all tasks
> that recently ran on the node would be far too expensive so there is no

So far this seems a good tradeoff. Nothing will prevent us from scanning
deeper into the runqueues later if we find a way to do that efficiently.

> good answer. You are caught between a rock and a hard place and either
> direction you go is wrong for different reasons. You need something more

I think you described the problem perfectly ;).

> frequent than scans (because it'll converge too slowly) but doing it from
> the balancer misses some tasks and may run too frequently and it's unclear
> how it affects the current load balancer decisions. I don't have a good
> alternative solution for this but ideally it would be better integrated with
> the existing scheduler when there is more data on what those scheduling
> decisions should be. That will only come from a wide range of testing and
> the inevitable bug reports.
> 
> That said, this is concentrating on the problems without considering the
> situations where it would work very well.  I think it'll come down to HPC
> and anything jitter-sensitive will hate this while workloads like JVM,
> virtualisation or anything that uses a lot of memory without caring about
> placement will love it. It's not perfect but it's better than incurring
> the cost of remote access unconditionally.

Full agreement.

Your detailed full review was very appreciated, thanks!

Andrea


Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Andrea Arcangeli
On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
> If System CPU time really does go down as this converges then that
> should be obvious from monitoring vmstat over time for a test. Early on
> there would be high usage, with that dropping as it converges. If that
> doesn't happen then the tasks are not converging, the phases change
> constantly, or something unexpected happened that needs to be identified.

Yes, all measurable kernel cost should be in the memory copies
(migration and khugepaged, the latter is going to be optimized away).

The migrations must stop after the workload converges. Either
migrations are used to reach convergence or they shouldn't happen in
the first place (not in any measurable amount).

> Ok. Are they separate STREAM instances or threads running on the same
> arrays? 

My understanding is separate instances. I think it's a single-threaded
benchmark and you run many copies. It was modified to run for 5 min
(otherwise upstream does not have enough time to get it wrong, as a
result of background scheduling jitters).
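
Something like this hypothetical launcher matches that description
(assuming a stream binary modified to loop for ~5 minutes):

$ for i in $(seq 1 8); do ./stream & done; wait   # 8 unbound single-threaded copies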

Thanks!


Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Mel Gorman
On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> > As a basic sniff test I added a test to MMtests for the AutoNUMA
> > Benchmark on a 4-node machine and the following fell out.
> > 
> >                              3.6.0                3.6.0
> >                            vanilla       autonuma-v33r6
> > User    SMT              82851.82 (  0.00%)    33084.03 ( 60.07%)
> > User    THREAD_ALLOC    142723.90 (  0.00%)    47707.38 ( 66.57%)
> > System  SMT                396.68 (  0.00%)      621.46 (-56.67%)
> > System  THREAD_ALLOC       675.22 (  0.00%)      836.96 (-23.95%)
> > Elapsed SMT               1987.08 (  0.00%)      828.57 ( 58.30%)
> > Elapsed THREAD_ALLOC      3222.99 (  0.00%)     1101.31 ( 65.83%)
> > CPU     SMT               4189.00 (  0.00%)     4067.00 (  2.91%)
> > CPU     THREAD_ALLOC      4449.00 (  0.00%)     4407.00 (  0.94%)
> 
> Thanks a lot for the help and for looking into it!
> 
> Just curious, why are you running only numa02_SMT and
> numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
> without _suffix)
> 

Bug in the testing script on my end. Each of them is run separately and it
looks like in retrospect that a THREAD_ALLOC test actually ran numa01 then
numa01_THREAD_ALLOC. The intention was to allow additional stats to be
gathered independently of what start_bench.sh collects. Will improve it
in the future.

> > 
> > The performance improvements are certainly there for this basic test but
> > I note the System CPU usage is very high.
> 
> Yes, migration is expensive, but after convergence has been reached the
> system time should be the same as upstream.
> 

Ok.

> btw, I improved things further in autonuma28 (new branch in aa.git).
> 

Ok.

> > 
> > The vmstats showed up this
> > 
> > THP fault alloc       81376   86070
> > THP collapse alloc       14   40423
> > THP splits                8   41792
> > 
> > So we're doing a lot of splits and collapses for THP there. There is a
> > possibility that khugepaged and the autonuma kernel thread are doing some
> > busy work. Not a show-stopped, just interesting.
> > 
> > I've done no analysis at all and this was just to have something to look
> > at before looking at the code closer.
> 
> Sure, the idea is to have THP native migration, then we'll do zero
> collapse/splits.
> 

Seems reasonable. It should be easy to measure when/if that happens.

> > > The objective of AutoNUMA is to provide out-of-the-box performance as
> > > close as possible to (and potentially faster than) manual NUMA hard
> > > bindings.
> > > 
> > > It is not very intrusive into the kernel core and is well structured
> > > into separate source modules.
> > > 
> > > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > > NUMA placement algorithms such as numad (in userland through cpusets)
> > > and schednuma (in kernel too) and was found superior in all cases.
> > > 
> > > Most important: not a single benchmark showed a regression yet when
> > > compared to vanilla kernels. Not even on the 2 node systems where the
> > > NUMA effects are less significant.
> > > 
> > 
> > Ok, I have not run a general regression test and won't get the chance to
> > soon but hopefully others will. One thing they might want to watch out
> > for is System CPU time. It's possible that your AutoNUMA benchmark
> > triggers a worst-case but it's worth keeping an eye on because any cost
> > from that has to be offset by gains from better NUMA placements.
> 
> Good idea to monitor it indeed.
> 

If System CPU time really does go down as this converges then that
should be obvious from monitoring vmstat over time for a test. Early on
there would be high usage, with that dropping as it converges. If that
doesn't happen then the tasks are not converging, the phases change
constantly, or something unexpected happened that needs to be identified.
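
A minimal sketch of that monitoring loop; the thp_* names below are the
/proc/vmstat counters of this era and vary across kernel versions:

while sleep 10; do
    date +%T
    vmstat 1 2 | tail -1 | awk '{print "system cpu%:", $14}'
    grep -E '^thp_(fault_alloc|collapse_alloc|split)' /proc/vmstat
done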

> > Is STREAM really a good benchmark in this case? Unless you also ran it in
> > parallel mode, it basically operates against three arrays and is not really
> > NUMA friendly once the total size is greater than a NUMA node. I guess
> > it makes sense to run it just to see does autonuma break it :)
> 
> The way this is run is that there is 1 stream, then 4 streams, then 8,
> until we max out all CPUs.
> 

Ok. Are they separate STREAM instances or threads running on the same
arrays? 

> I think we could run "memhog" instead of "stream" and it'd be the
> same. stream probably better resembles real life computations.
> 
> The upstream scheduler lacks any notion of affinity so eventually
> during the 5 min run, one process changes node, it doesn't notice its
> memory was elsewhere so it stays there, and the memory can't follow
> the cpu either. So then it runs much slower.
> 
> So it's the simplest test of all to get right, all it requires is some
> notion of node affinity.
> 

Ok.

> It's 

Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Andrea Arcangeli
Hi Mel,

On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
> As a basic sniff test I added a test to MMtests for the AutoNUMA
> Benchmark on a 4-node machine and the following fell out.
> 
>                              3.6.0                3.6.0
>                            vanilla       autonuma-v33r6
> User    SMT              82851.82 (  0.00%)    33084.03 ( 60.07%)
> User    THREAD_ALLOC    142723.90 (  0.00%)    47707.38 ( 66.57%)
> System  SMT                396.68 (  0.00%)      621.46 (-56.67%)
> System  THREAD_ALLOC       675.22 (  0.00%)      836.96 (-23.95%)
> Elapsed SMT               1987.08 (  0.00%)      828.57 ( 58.30%)
> Elapsed THREAD_ALLOC      3222.99 (  0.00%)     1101.31 ( 65.83%)
> CPU     SMT               4189.00 (  0.00%)     4067.00 (  2.91%)
> CPU     THREAD_ALLOC      4449.00 (  0.00%)     4407.00 (  0.94%)

Thanks a lot for the help and for looking into it!

Just curious, why are you running only numa02_SMT and
numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
without _suffix)

> 
> The performance improvements are certainly there for this basic test but
> I note the System CPU usage is very high.

Yes, migration is expensive, but after convergence has been reached the
system time should be the same as upstream.

btw, I improved things further in autonuma28 (new branch in aa.git).

> 
> The vmstats showed up this
> 
> THP fault alloc       81376   86070
> THP collapse alloc       14   40423
> THP splits                8   41792
> 
> So we're doing a lot of splits and collapses for THP there. There is a
> possibility that khugepaged and the autonuma kernel thread are doing some
> busy work. Not a show-stopped, just interesting.
> 
> I've done no analysis at all and this was just to have something to look
> at before looking at the code closer.

Sure, the idea is to have THP native migration, then we'll do zero
collapse/splits.

> > The objective of AutoNUMA is to provide out-of-the-box performance as
> > close as possible to (and potentially faster than) manual NUMA hard
> > bindings.
> > 
> > It is not very intrusive into the kernel core and is well structured
> > into separate source modules.
> > 
> > AutoNUMA was extensively tested against 3.x upstream kernels and other
> > NUMA placement algorithms such as numad (in userland through cpusets)
> > and schednuma (in kernel too) and was found superior in all cases.
> > 
> > Most important: not a single benchmark showed a regression yet when
> > compared to vanilla kernels. Not even on the 2 node systems where the
> > NUMA effects are less significant.
> > 
> 
> Ok, I have not run a general regression test and won't get the chance to
> soon but hopefully others will. One thing they might want to watch out
> for is System CPU time. It's possible that your AutoNUMA benchmark
> triggers a worst-case but it's worth keeping an eye on because any cost
> from that has to be offset by gains from better NUMA placements.

Good idea to monitor it indeed.

> Is STREAM really a good benchmark in this case? Unless you also ran it in
> parallel mode, it basically operates against three arrays and is not really
> NUMA friendly once the total size is greater than a NUMA node. I guess
> it makes sense to run it just to see does autonuma break it :)

The way this is run is that there is 1 stream, then 4 streams, then 8,
until we max out all CPUs.

I think we could run "memhog" instead of "stream" and it'd be the
same. stream probably better resembles real life computations.

The upstream scheduler lacks any notion of affinity so eventually
during the 5 min run, one process changes node, it doesn't notice its
memory was elsewhere so it stays there, and the memory can't follow
the cpu either. So then it runs much slower.

So it's the simplest test of all to get right, all it requires is some
notion of node affinity.

It's also the only workload that the home node design in schednuma in
tip.git can get right (schednuma newer than the current tip.git
introduced the cpu-follow-memory design of AutoNUMA, so schednuma will
have a chance to get more right than just the stream multi-instance
benchmark).

So it's just a verification that the simple stuff (single threaded
process computing) is ok and the upstream regression vs hard NUMA
bindings is fixed.

stream is also one case where we have to perform identical to the hard
NUMA bindings. No migration of CPU or memory must ever happen with
AutoNUMA in the stream benchmark. AutoNUMA will just monitor it and
find that it is already in the best place and it will leave it alone.

With the autonuma-benchmark it's impossible to reach identical
performance of the _HARD_BIND case because _HARD_BIND doesn't need to
do any memory migration (I'm 3 seconds away from hard bindings in a
198 sec run though, just the 3 seconds it takes to migrate 3g of ram ;).
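
For reference, a sketch of what hard vs. inverse binding means in these
runs, using numactl (the actual runs are driven by start_bench.sh):

$ numactl --cpunodebind=0 --membind=0 ./numa01   # HARD_BIND: cpu and memory on the same node
$ numactl --cpunodebind=0 --membind=1 ./numa01   # INVERSE_BIND: memory deliberately remote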

> 
> > 
> > == iozone ==
> > 
> >  ALL  INIT   RE RE   RANDOM 

Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Andrea Arcangeli
Hi Mel,

On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
 As a basic sniff test I added a test to MMtests for the AutoNUMA
 Benchmark on a 4-node machine and the following fell out.
 
  3.6.0 3.6.0
vanillaautonuma-v33r6
 UserSMT 82851.82 (  0.00%)33084.03 ( 60.07%)
 UserTHREAD_ALLOC   142723.90 (  0.00%)47707.38 ( 66.57%)
 System  SMT   396.68 (  0.00%)  621.46 (-56.67%)
 System  THREAD_ALLOC  675.22 (  0.00%)  836.96 (-23.95%)
 Elapsed SMT  1987.08 (  0.00%)  828.57 ( 58.30%)
 Elapsed THREAD_ALLOC 3222.99 (  0.00%) 1101.31 ( 65.83%)
 CPU SMT  4189.00 (  0.00%) 4067.00 (  2.91%)
 CPU THREAD_ALLOC 4449.00 (  0.00%) 4407.00 (  0.94%)

Thanks a lot for the help and for looking into it!

Just curious, why are you running only numa02_SMT and
numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
without _suffix)

 
 The performance improvements are certainly there for this basic test but
 I note the System CPU usage is very high.

Yes, migrate is expensive, but after convergence has been reached the
system time should be the same as upstream.

btw, I improved things further in autonuma28 (new branch in aa.git).

 
 The vmstats showed up this
 
 THP fault alloc   81376   86070
 THP collapse alloc   14   40423
 THP splits8   41792
 
 So we're doing a lot of splits and collapses for THP there. There is a
 possibility that khugepaged and the autonuma kernel thread are doing some
 busy work. Not a show-stopped, just interesting.
 
 I've done no analysis at all and this was just to have something to look
 at before looking at the code closer.

Sure, the idea is to have THP native migration, then we'll do zero
collapse/splits.

  The objective of AutoNUMA is to provide out-of-the-box performance as
  close as possible to (and potentially faster than) manual NUMA hard
  bindings.
  
  It is not very intrusive into the kernel core and is well structured
  into separate source modules.
  
  AutoNUMA was extensively tested against 3.x upstream kernels and other
  NUMA placement algorithms such as numad (in userland through cpusets)
  and schednuma (in kernel too) and was found superior in all cases.
  
  Most important: not a single benchmark showed a regression yet when
  compared to vanilla kernels. Not even on the 2 node systems where the
  NUMA effects are less significant.
  
 
 Ok, I have not run a general regression test and won't get the chance to
 soon but hopefully others will. One thing they might want to watch out
 for is System CPU time. It's possible that your AutoNUMA benchmark
 triggers a worst-case but it's worth keeping an eye on because any cost
 from that has to be offset by gains from better NUMA placements.

Good idea to monitor it indeed.

 Is STREAM really a good benchmark in this case? Unless you also ran it in
 parallel mode, it basically operations against three arrays and not really
 NUMA friendly once the total size is greater than a NUMA node. I guess
 it makes sense to run it just to see does autonuma break it :)

The way this is run is that there is 1 stream, then 4 stream, then 8
until we max out all CPUs.

I think we could run memhog instead of stream and it'd be the
same. stream probably better resembles real life computations.

The upstream scheduler lacks any notion of affinity so eventually
during the 5 min run, on process changes node, it doesn't notice its
memory was elsewhere so it stays there, and the memory can't follow
the cpu either. So then it runs much slower.

So it's the simplest test of all to get right, all it requires is some
notion of node affinity.

It's also the only workload that the home node design in schednuma in
tip.git can get right (schednuma post current tip.git introduced
cpu-follow-memory design of AutoNUMA so schednuma will have a chance
to get right more stuff than just the stream multi instance
benchmark).

So it's just for a verification than the simple stuff (single threaded
process computing) is ok and the upstream regression vs hard NUMA
bindings is fixed.

stream is also one case where we have to perform identical to the hard
NUMA bindings. No migration of CPU or memory must ever happen with
AutoNUMA in the stream benchmark. AutoNUMA will just monitor it and
find that it is already in the best place and it will leave it alone.

With the autonuma-benchmark it's impossible to reach identical
performance of the _HARD_BIND case because _HARD_BIND doesn't need to
do any memory migration (I'm 3 seconds away from hard bindings in a
198 sec run though, just the 3 seconds it takes to migrate 3g of ram ;).

 
  
  == iozone ==
  
   ALL  INIT   RE RE   RANDOM RANDOM BACKWD  
  RECRE STRIDE  F  FRE F  FRE
  FILE TYPE (KB)  IOS  

Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Mel Gorman
On Thu, Oct 11, 2012 at 04:56:11PM +0200, Andrea Arcangeli wrote:
 Hi Mel,
 
 On Thu, Oct 11, 2012 at 11:19:30AM +0100, Mel Gorman wrote:
  As a basic sniff test I added a test to MMtests for the AutoNUMA
  Benchmark on a 4-node machine and the following fell out.
  
   3.6.0 3.6.0
 vanillaautonuma-v33r6
  UserSMT 82851.82 (  0.00%)33084.03 ( 60.07%)
  UserTHREAD_ALLOC   142723.90 (  0.00%)47707.38 ( 66.57%)
  System  SMT   396.68 (  0.00%)  621.46 (-56.67%)
  System  THREAD_ALLOC  675.22 (  0.00%)  836.96 (-23.95%)
  Elapsed SMT  1987.08 (  0.00%)  828.57 ( 58.30%)
  Elapsed THREAD_ALLOC 3222.99 (  0.00%) 1101.31 ( 65.83%)
  CPU SMT  4189.00 (  0.00%) 4067.00 (  2.91%)
  CPU THREAD_ALLOC 4449.00 (  0.00%) 4407.00 (  0.94%)
 
 Thanks a lot for the help and for looking into it!
 
 Just curious, why are you running only numa02_SMT and
 numa01_THREAD_ALLOC? And not numa01 and numa02? (the standard version
 without _suffix)
 

Bug in the testing script on my end. Each of them are run separtly and it
looks like in retrospect that a THREAD_ALLOC test actually ran numa01 then
numa01_THREAD_ALLOC. The intention was to allow additional stats to be
gathered independently of what start_bench.sh collects. Will improve it
in the future.

  
  The performance improvements are certainly there for this basic test but
  I note the System CPU usage is very high.
 
 Yes, migrate is expensive, but after convergence has been reached the
 system time should be the same as upstream.
 

Ok.

 btw, I improved things further in autonuma28 (new branch in aa.git).
 

Ok.

  
  The vmstats showed up this
  
  THP fault alloc   81376   86070
  THP collapse alloc   14   40423
  THP splits8   41792
  
  So we're doing a lot of splits and collapses for THP there. There is a
  possibility that khugepaged and the autonuma kernel thread are doing some
  busy work. Not a show-stopped, just interesting.
  
  I've done no analysis at all and this was just to have something to look
  at before looking at the code closer.
 
 Sure, the idea is to have THP native migration, then we'll do zero
 collapse/splits.
 

Seems reasonably. It should be obvious to measure when/if that happens.

   The objective of AutoNUMA is to provide out-of-the-box performance as
   close as possible to (and potentially faster than) manual NUMA hard
   bindings.
   
   It is not very intrusive into the kernel core and is well structured
   into separate source modules.
   
   AutoNUMA was extensively tested against 3.x upstream kernels and other
   NUMA placement algorithms such as numad (in userland through cpusets)
   and schednuma (in kernel too) and was found superior in all cases.
   
   Most important: not a single benchmark showed a regression yet when
   compared to vanilla kernels. Not even on the 2 node systems where the
   NUMA effects are less significant.
   
  
  Ok, I have not run a general regression test and won't get the chance to
  soon but hopefully others will. One thing they might want to watch out
  for is System CPU time. It's possible that your AutoNUMA benchmark
  triggers a worst-case but it's worth keeping an eye on because any cost
  from that has to be offset by gains from better NUMA placements.
 
 Good idea to monitor it indeed.
 

If System CPU time really does go down as this converges then that
should be obvious from monitoring vmstat over time for a test: high
usage early on, dropping as it converges. If that doesn't happen then
the tasks are not converging, the phases change constantly, or something
unexpected happened that needs to be identified.
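
For illustration, a minimal userspace sketch of that kind of monitoring;
the sampled fields (the 3.x THP event counters quoted earlier) and the
10 second interval are arbitrary assumptions:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the current value of one /proc/vmstat field, or -1. */
static long vmstat_field(const char *field)
{
        char name[64];
        long val;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fscanf(f, "%63s %ld", name, &val) == 2) {
                if (!strcmp(name, field)) {
                        fclose(f);
                        return val;
                }
        }
        fclose(f);
        return -1;
}

int main(void)
{
        const char *fields[] = {
                "thp_fault_alloc", "thp_collapse_alloc", "thp_split",
        };
        long prev[3] = { 0, 0, 0 };

        /* The first pass prints absolute values as the deltas. */
        for (;;) {
                for (int i = 0; i < 3; i++) {
                        long cur = vmstat_field(fields[i]);

                        printf("%-20s %10ld  (+%ld)\n",
                               fields[i], cur, cur - prev[i]);
                        prev[i] = cur;
                }
                putchar('\n');
                sleep(10);
        }
}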

  Is STREAM really a good benchmark in this case? Unless you also ran it in
  parallel mode, it basically operates against three arrays and is not really
  NUMA friendly once the total size is greater than a NUMA node. I guess
  it makes sense to run it just to see whether autonuma breaks it :)
 
 The way this is run is that there is 1 stream, then 4 streams, then 8
 until we max out all CPUs.
 

Ok. Are they separate STREAM instances or threads running on the same
arrays? 

 I think we could run memhog instead of stream and it'd be the
 same. stream probably better resembles real life computations.
 
 The upstream scheduler lacks any notion of affinity, so eventually
 during the 5 min run one process changes node; it doesn't notice its
 memory is elsewhere so it stays there, and the memory can't follow
 the cpu either. So then it runs much slower.
 
 So it's the simplest test of all to get right, all it requires is some
 notion of node affinity.
 

Ok.

 It's also the only workload that the home node design in schednuma in
 tip.git can get right (schednuma post current tip.git introduced
 cpu-follow-memory design of AutoNUMA 

Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Andrea Arcangeli
On Thu, Oct 11, 2012 at 04:35:03PM +0100, Mel Gorman wrote:
 If System CPU time really does go down as this converges then that
 should be obvious from monitoring vmstat over time for a test: high
 usage early on, dropping as it converges. If that doesn't happen then
 the tasks are not converging, the phases change constantly, or
 something unexpected happened that needs to be identified.

Yes, all measurable kernel cost should be in the memory copies
(migration and khugepaged, the latter is going to be optimized away).

The migrations must stop after the workload converges. Either
migrations are used to reach convergence or they shouldn't happen in
the first place (not in any measurable amount).

 Ok. Are they separate STREAM instances or threads running on the same
 arrays? 

My understanding is separate instances. I think it's a single-threaded
benchmark and you run many copies. It was modified to run for 5 min
(otherwise upstream doesn't have enough time to get it wrong as a
result of background scheduling jitter).
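
To make the distinction concrete, here is a sketch of that kind of
harness (assuming a single-threaded ./stream binary in the current
directory): each copy is a fully separate process with its own private
arrays, rather than a thread sharing one set:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int i, n = argc > 1 ? atoi(argv[1]) : 4;        /* copies to run */

        for (i = 0; i < n; i++) {
                switch (fork()) {
                case 0:
                        /* Child: a private address space, so its memory
                         * can be placed (or migrated) independently of
                         * its siblings. */
                        execl("./stream", "stream", (char *)NULL);
                        perror("execl");
                        _exit(1);
                case -1:
                        perror("fork");
                        return 1;
                }
        }
        while (wait(NULL) > 0)
                ;       /* reap every copy */
        return 0;
}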

Thanks!


Re: [PATCH 00/33] AutoNUMA27

2012-10-11 Thread Andrea Arcangeli
Hi Mel,

On Thu, Oct 11, 2012 at 10:34:32PM +0100, Mel Gorman wrote:
 So after getting through the full review of it, there wasn't anything
 I could not stand. I think it's *very* heavy on some of the paths like
 the idle balancer which I was not keen on and the fault paths are also
 quite heavy.  I think the weight on some of these paths can be reduced
 but not to 0 if the objectives to autonuma are to be met.
 
 I'm not fully convinced that the task exchange is actually necessary or
 beneficial because it somewhat assumes that there is a symmetry between CPU
 and memory balancing that may not be true. The fact that it only considers

The problem is that without an active task exchange and no explicit
call to stop_one_cpu*, there's no way to migrate a currently running
task and clearly we need that. We can indefinitely wait hoping the
task goes to sleep and leaves the CPU idle, or that a couple of other
tasks start and trigger load balance events.

We must move tasks even if all cpus are in a steady rq->nr_running ==
1 state and there's no other scheduler balance event that could
possibly attempt to move tasks around in such a steady state.

Of course one could hack the active idle balancing so that it does the
active NUMA balancing action, but that would be a purely artificial
complication: it would add unnecessary delay and it would provide no
benefit whatsoever.

Why don't we dump the active idle balancing too, and hack the load
balancing to do the active idle balancing as well? Of course then the
two would be more integrated. But it would be a mess and slower, and
there's a good reason why they exist as totally separate pieces of
code working in parallel.

We can integrate it more, but in my view the result would be worse and
more complicated. Last but not least, messing with the idle balancing
code to do an active NUMA balancing action (somehow invoking
stop_one_cpu* in the steady state described above) would force even
cellphones and UP kernels to deal with NUMA code somehow.
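
For illustration, a minimal sketch of the mechanism (hypothetical
names, not the actual AutoNUMA code): stop_one_cpu() preempts the
target CPU with the stopper thread, and only at that point is the
victim task no longer running and safe to move:

#include <linux/stop_machine.h>
#include <linux/sched.h>

struct numa_migrate_arg {               /* hypothetical */
        struct task_struct *task;
        int dest_cpu;
};

static int numa_migrate_stopper(void *data)
{
        struct numa_migrate_arg *arg = data;

        /*
         * This runs as the stopper kthread on the source CPU, so
         * arg->task has been preempted and is no longer running;
         * the real code would deactivate it here, move it towards
         * arg->dest_cpu with set_task_cpu() and activate it on the
         * destination runqueue (holding both rq locks).
         */
        return 0;
}

static void numa_migrate_running_task(struct task_struct *p, int dest_cpu)
{
        struct numa_migrate_arg arg = { .task = p, .dest_cpu = dest_cpu };

        /*
         * This works even when every runqueue sits in the steady
         * rq->nr_running == 1 state where no load balance event
         * would ever fire.
         */
        stop_one_cpu(task_cpu(p), numa_migrate_stopper, &arg);
}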

 tasks that are currently running feels a bit random but examining all tasks
 that recently ran on the node would be far too expensive to there is no

So far this seems a good tradeoff. Nothing will prevent us from
scanning deeper into the runqueues later if we find a way to do that
efficiently.

 good answer. You are caught between a rock and a hard place and either
 direction you go is wrong for different reasons. You need something more

I think you described the problem perfectly ;).

 frequent than scans (because it'll converge too slowly) but doing it from
 the balancer misses some tasks and may run too frequently and it's unclear
 how it affects the current load balancer decisions. I don't have a good
 alternative solution for this but ideally it would be better integrated with
 the existing scheduler when there is more data on what those scheduling
 decisions should be. That will only come from a wide range of testing and
 the inevitable bug reports.
 
 That said, this is concentrating on the problems without considering the
 situations where it would work very well.  I think it'll come down to HPC
 and anything jitter-sensitive will hate this while workloads like JVM,
 virtualisation or anything that uses a lot of memory without caring about
 placement will love it. It's not perfect but it's better than incurring
 the cost of remote access unconditionally.

Full agreement.

Your detailed full review was very appreciated, thanks!

Andrea


Re: [PATCH 00/33] AutoNUMA27

2012-10-08 Thread Rik van Riel
On Fri, 05 Oct 2012 16:14:44 -0700
Andi Kleen  wrote:

> IMHO this needs a performance shoot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of work. Any volunteers?

Here are some preliminary results from simple benchmarks on a
4-node, 32 CPU core (4x8 core) Dell PowerEdge R910 system.

For the simple linpack streams benchmark, both sched/numa and
autonuma are within the margin of error compared to manual
tuning of task affinity.  This is a big win, since the current
upstream scheduler has regressions of 10-20% when the system
runs 4 through 16 stream processes.

For specjbb, the story is more complicated. After Larry, Peter and I
fixed the obvious bugs in sched/numa and got some basic
cpu-follows-memory code going (not yet in -tip AFAIK), the averaged
results look like this:

baseline:   246019
manual pinning: 285481 (+16%)
autonuma:   266626 (+8%)
sched/numa: 226540 (-8%)

This is with newer sched/numa code than what is in -tip right now.
Once Peter pushes the fixes by Larry and me into -tip, as well as
his cpu-follows-memory code, others should be able to run tests
like this as well.

Now for some other workloads, and tests on 8 node systems, etc...


Full results for the specjbb run below:

BASELINE - disabling auto numa (matches RHEL6 within 1%)

[root@perf74 SPECjbb]# cat r7_36_auto27_specjbb4_noauto.txt
spec1.txt:   throughput = 243639.70 SPECjbb2005 bops
spec2.txt:   throughput = 249186.20 SPECjbb2005 bops
spec3.txt:   throughput = 247216.72 SPECjbb2005 bops
spec4.txt:   throughput = 244035.60 SPECjbb2005 bops

Manual NUMACTL results are:

[root@perf74 SPECjbb]# more r7_36_numactl_specjbb4.txt
spec1.txt:   throughput = 291430.22 SPECjbb2005 bops
spec2.txt:   throughput = 283550.85 SPECjbb2005 bops
spec3.txt:   throughput = 284028.71 SPECjbb2005 bops
spec4.txt:   throughput = 282919.37 SPECjbb2005 bops

AUTONUMA27 - 3.6.0-0.24.autonuma27.test.x86_64
[root@perf74 SPECjbb]# more r7_36_auto27_specjbb4.txt
spec1.txt:   throughput = 261835.01 SPECjbb2005 bops
spec2.txt:   throughput = 269053.06 SPECjbb2005 bops
spec3.txt:   throughput = 261230.50 SPECjbb2005 bops
spec4.txt:   throughput = 274386.81 SPECjbb2005 bops

Tuned SCHED_NUMA from Friday 10/4/2012 with fixes from Peter, Rik and 
Larry:

[root@perf74 SPECjbb]# more r7_36_schednuma_specjbb4.txt
spec1.txt:   throughput = 222349.74 SPECjbb2005 bops
spec2.txt:   throughput = 232988.59 SPECjbb2005 bops
spec3.txt:   throughput = 223386.03 SPECjbb2005 bops
spec4.txt:   throughput = 227438.11 SPECjbb2005 bops

-- 
All rights reversed.


Re: [PATCH 00/33] AutoNUMA27

2012-10-08 Thread Don Morris
On 10/05/2012 05:11 PM, Andi Kleen wrote:
> Tim Chen  writes:
>>>
>>
>> I remembered that 3 months ago when Alex tested the numa/sched patches
>> there were 20% regression on SpecJbb2005 due to the numa balancer.
> 
> 20% on anything sounds like a show stopper to me.
> 
> -Andi
> 

Much worse than that on an 8-way machine for a multi-node multi-threaded
process, from what I can tell. (Andrea's AutoNUMA microbenchmark is a
simple version of that). The contention on the page table lock
( &(mm->page_table_lock)->rlock ) goes through the roof, with threads
constantly fighting to invalidate translations and re-fault them.

This is on a DL980 with Xeon E7-2870s @ 2.4 GHz, btw.

Running linux-next with no tweaks other than
kernel.sched_migration_cost_ns = 50 gives:
numa01
8325.78
numa01_HARD_BIND
488.98

(The Hard Bind being a case where the threads are pre-bound to the
node set with their memory, so what should be a fairly "best case" for
comparison).

If the SchedNUMA scanning period is upped to 25000 ms (to keep repeated
invalidations from being triggered while the contention for the first
invalidation pass is still being fought over):
numa01
4272.93
numa01_HARD_BIND
498.98

Since this is a "big" process in the current SchedNUMA code and hence
much more likely to trip invalidations, forcing task_numa_big() to
always return false in order to avoid the frequent invalidations gives:
numa01
429.07
numa01_HARD_BIND
466.67

Finally, with SchedNUMA entirely disabled but the rest of linux-next
left intact:
numa01
1075.31
numa01_HARD_BIND
484.20

I didn't write down the lock contention numbers for comparison, but
yes - the contention decreases in line with the times.

There are other microbenchmarks, but those suffice to show the
regression pattern. I mentioned this to the RedHat folks last
week, so I expect this is already being worked on. It seemed pertinent
to bring up given the discussion about the current state of linux-next
though, just so folks know. From where I'm sitting, it looks to
me like the scan period is way too aggressive and there's too much
work potentially attempted during a "scan" (by which I mean the
hard tick driven choice to invalidate in order to set up potential
migration faults). The current code walks/invalidates the entire
virtual address space, skipping few vmas. For a very large 64-bit
process, that's going to be a *lot* of translations (or even vmas
if the address space is fragmented) to walk. That's a seriously
long path coming from the timer code. I would think capping the
number of translations to process per visit would help.
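
For illustration, a sketch of that kind of cap (made-up names and
budget, not a patch against SchedNUMA): keep a per-mm cursor and touch
at most a fixed number of translations per timer-driven visit,
resuming where the previous visit stopped:

#include <linux/mm.h>
#include <linux/sched.h>

#define NUMA_SCAN_PTE_CAP       4096    /* assumed per-visit budget */

struct scan_cursor {                    /* hypothetical, one per mm */
        unsigned long next_addr;
};

static void numa_scan_one_visit(struct mm_struct *mm,
                                struct scan_cursor *cur)
{
        unsigned long addr = cur->next_addr;
        long budget = NUMA_SCAN_PTE_CAP;

        while (budget > 0 && addr < TASK_SIZE) {
                /*
                 * Invalidating one of mm's translations to arm a
                 * migration fault would go here; every translation
                 * touched consumes budget.
                 */
                budget--;
                addr += PAGE_SIZE;
        }
        /* Resume here on the next visit; wrap at the end. */
        cur->next_addr = addr < TASK_SIZE ? addr : 0;
}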

Hope this helps the discussion,
Don Morris



Re: [PATCH 00/33] AutoNUMA27

2012-10-05 Thread Andi Kleen
Tim Chen  writes:
>> 
>
> I remember that 3 months ago when Alex tested the numa/sched patches
> there was a 20% regression on SpecJbb2005 due to the numa balancer.

20% on anything sounds like a show stopper to me.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [PATCH 00/33] AutoNUMA27

2012-10-05 Thread Tim Chen
On Fri, 2012-10-05 at 16:14 -0700, Andi Kleen wrote:
> Andrew Morton  writes:
> 
> > On Thu,  4 Oct 2012 01:50:42 +0200
> > Andrea Arcangeli  wrote:
> >
> >> This is a new AutoNUMA27 release for Linux v3.6.
> >
> > Peter's numa/sched patches have been in -next for a week. 
> 
> Did they pass review? I have some doubts.
> 
> The last time I looked it also broke numactl.
> 
> > Guys, what's the plan here?
> 
> Since they are both performance features, their ultimate benefit
> is how much faster they make things (and how seldom they make things
> slower).
> 
> IMHO this needs a performance shoot-out. Run both on the same 10 workloads
> and see who wins. Just a lot of work. Any volunteers?
> 
> For a change like this I think less regression is actually more
> important than the highest peak numbers.
> 
> -Andi
> 

I remember that 3 months ago when Alex tested the numa/sched patches
there was a 20% regression on SpecJbb2005 due to the numa balancer.
Those issues may have been fixed, but we probably need to run this
benchmark against the latest.  For most of the other kernel performance
workloads we ran we didn't see much change.

Maurico has a different config for this benchmark, and it would be nice
if he could also check whether there are any performance changes on his
side.

Tim




Re: [PATCH 00/33] AutoNUMA27

2012-10-05 Thread Andi Kleen
Andrew Morton  writes:

> On Thu,  4 Oct 2012 01:50:42 +0200
> Andrea Arcangeli  wrote:
>
>> This is a new AutoNUMA27 release for Linux v3.6.
>
> Peter's numa/sched patches have been in -next for a week. 

Did they pass review? I have some doubts.

The last time I looked it also broke numactl.

> Guys, what's the plan here?

Since they are both performance features, their ultimate benefit
is how much faster they make things (and how seldom they make things
slower).

IMHO this needs a performance shoot-out. Run both on the same 10 workloads
and see who wins. Just a lot of work. Any volunteers?

For a change like this I think less regression is actually more
important than the highest peak numbers.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only

