Re: [PATCH 00/31] numa/core patches

2012-11-12 Thread Mel Gorman
On Sat, Nov 10, 2012 at 10:47:41AM +0800, Alex Shi wrote:
> On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman  wrote:
> > On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> >> >
> >> > In reality, this report is larger but I chopped it down a bit for
> >> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> >> > terms of average operations per numa node and overall throughput.
> >> >
> >> > SPECJBB PEAKS
> >> >                             3.7.0                3.7.0                3.7.0
> >> >                    rc2-stats-v2r1   rc2-autonuma-v27r8   rc2-schednuma-v1r3
> >> > Expctd Warehouse       12.00 (  0.00%)      12.00 (  0.00%)      12.00 (  0.00%)
> >> > Expctd Peak Bops   442225.00 (  0.00%)  596039.00 ( 34.78%)  555342.00 ( 25.58%)
> >> > Actual Warehouse        7.00 (  0.00%)       9.00 ( 28.57%)       8.00 ( 14.29%)
> >> > Actual Peak Bops   550747.00 (  0.00%)  646124.00 ( 17.32%)  560635.00 (  1.80%)
> >>
> >> It is an impressive report!
> >>
> >> Would you like to share which JVM and options you are using in the
> >> testing, and on which kind of platform?
> >>
> >
> > Oracle JVM version "1.7.0_07"
> > Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> > Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
> >
> > 4 JVMs were run, one for each node.
> >
> > JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> > memory overall.
> >
> > Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> > total with HT enabled.
> >
> 
> Thanks for sharing the configuration!
> 
> I used JRockit and OpenJDK with hugepages, plus pinning the JVM to the CPU socket.

If you are using hugepages then automatic NUMA balancing is not migrating
those pages. If you are pinning the JVMs to the socket then automatic NUMA
balancing is unnecessary as they are already on the correct node.
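
For reference, a pinned, hugepage-backed JVM is typically launched along
these lines (an illustrative sketch only: the node number, the
-XX:+UseLargePages flag and the benchmark jar name are assumptions, not
the exact command that was used):

  # bind the JVM's CPUs and memory to node 0, back the heap with large pages
  numactl --cpunodebind=0 --membind=0 \
      java -Xmx12901m -XX:+UseLargePages -jar specjbb.jar

With the memory hard-bound like that there is simply no placement decision
left for the kernel to make.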

> With a previous sched numa version I had found a 20% drop with JRockit
> in our configuration, but with this version no clear regression was
> found; no benefit was found either.
> 

You are only checking for regressions with your configuration, which is
still important because it showed that schednuma introduced only overhead
in an already-optimised NUMA configuration.

In your case, you will see little or no benefit from any automatic NUMA
balancing implementation as the most important pages neither can migrate
nor need to.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/31] numa/core patches

2012-11-09 Thread Alex Shi
On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman  wrote:
> On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
>> >
>> > In reality, this report is larger but I chopped it down a bit for
>> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
>> > terms of average operations per numa node and overall throughput.
>> >
>> > SPECJBB PEAKS
>> >                             3.7.0                3.7.0                3.7.0
>> >                    rc2-stats-v2r1   rc2-autonuma-v27r8   rc2-schednuma-v1r3
>> > Expctd Warehouse       12.00 (  0.00%)      12.00 (  0.00%)      12.00 (  0.00%)
>> > Expctd Peak Bops   442225.00 (  0.00%)  596039.00 ( 34.78%)  555342.00 ( 25.58%)
>> > Actual Warehouse        7.00 (  0.00%)       9.00 ( 28.57%)       8.00 ( 14.29%)
>> > Actual Peak Bops   550747.00 (  0.00%)  646124.00 ( 17.32%)  560635.00 (  1.80%)
>>
>> It is an impressive report!
>>
>> Would you like to share which JVM and options you are using in the
>> testing, and on which kind of platform?
>>
>
> Oracle JVM version "1.7.0_07"
> Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>
> 4 JVMs were run, one for each node.
>
> JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> memory overall.
>
> Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> total with HT enabled.
>

Thanks for sharing the configuration!

I used JRockit and OpenJDK with hugepages, plus pinning the JVM to the
CPU socket. With a previous sched numa version I had found a 20% drop
with JRockit in our configuration, but with this version no clear
regression was found; no benefit was found either.

Seems we need to expand the testing configurations. :)
-- 
Thanks
Alex


Re: [PATCH 00/31] numa/core patches

2012-11-09 Thread Rik van Riel

On 10/30/2012 08:20 AM, Mel Gorman wrote:

On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:

Hi all,

Here's a re-post of the NUMA scheduling and migration improvement
patches that we are working on. These include techniques from
AutoNUMA and the sched/numa tree and form a unified basis - it
has got all the bits that look good and mergeable.



Thanks for the repost. I have not even started a review yet as I was
travelling and just online today. It will be another day or two before I can
start but I was at least able to do a comparison test between autonuma and
schednuma today to see which actually performs the best. Even without the
review I was able to stick on similar vmstats as was applied to autonuma
to give a rough estimate of the relative overhead of both implementations.


Peter, Ingo,

do you have any comments on the performance measurements
by Mel?

Any ideas on how to fix sched/numa or numa/core?

At this point, I suspect the easiest way forward might be
to merge the basic infrastructure from Mel's combined
tree (in -mm? in -tip?), so we can experiment with different
NUMA placement policies on top.

That way we can do an apples-to-apples comparison of the
policies, and figure out what works best, and why.




Re: [PATCH 00/31] numa/core patches

2012-11-05 Thread Srikar Dronamraju
Hey Peter, 


Here are results on 2node and 8node machine while running the autonuma
benchmark.

On 2 node, 12 core 24GB 

KernelVersion: 3.7.0-rc3
Testcase:                              Min       Max       Avg
numa01:                             121.23    122.43    121.53
numa01_HARD_BIND:                    80.90     81.07     80.96
numa01_INVERSE_BIND:                145.91    146.06    145.97
numa01_THREAD_ALLOC:                395.81    398.30    397.47
numa01_THREAD_ALLOC_HARD_BIND:      264.09    264.27    264.18
numa01_THREAD_ALLOC_INVERSE_BIND:   476.36    476.65    476.53
numa02:                              53.11     53.19     53.15
numa02_HARD_BIND:                    35.20     35.29     35.25
numa02_INVERSE_BIND:                 63.52     63.55     63.54
numa02_SMT:                          60.28     62.00     61.33
numa02_SMT_HARD_BIND:                42.63     43.61     43.22
numa02_SMT_INVERSE_BIND:             76.27     78.06     77.31

KernelVersion: numasched (i.e 3.7.0-rc3 + your patches)
Testcase:                              Min       Max       Avg   %Change
numa01:                             121.28    121.71    121.47     0.05%
numa01_HARD_BIND:                    80.89     81.01     80.96     0.00%
numa01_INVERSE_BIND:                145.87    146.04    145.96     0.01%
numa01_THREAD_ALLOC:                398.07    400.27    398.90    -0.36%
numa01_THREAD_ALLOC_HARD_BIND:      264.02    264.21    264.14     0.02%
numa01_THREAD_ALLOC_INVERSE_BIND:   476.13    476.62    476.41     0.03%
numa02:                              52.97     53.25     53.13     0.04%
numa02_HARD_BIND:                    35.21     35.28     35.24     0.03%
numa02_INVERSE_BIND:                 63.51     63.54     63.53     0.02%
numa02_SMT:                          61.35     62.46     61.97    -1.03%
numa02_SMT_HARD_BIND:                42.89     43.85     43.22     0.00%
numa02_SMT_INVERSE_BIND:             76.53     77.68     77.08     0.30%
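
(The %Change column works out to (baseline avg - numasched avg) / baseline
avg, so a positive value means the patched kernel finished slightly faster.)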



KernelVersion: 3.7.0-rc3(with HT enabled )
Testcase:                              Min       Max       Avg
numa01:                             242.58    244.39    243.68
numa01_HARD_BIND:                   169.36    169.40    169.38
numa01_INVERSE_BIND:                299.69    299.73    299.71
numa01_THREAD_ALLOC:                399.86    404.10    401.50
numa01_THREAD_ALLOC_HARD_BIND:      278.72    278.77    278.75
numa01_THREAD_ALLOC_INVERSE_BIND:   493.46    493.59    493.54
numa02:                              53.00     53.33     53.19
numa02_HARD_BIND:                    36.77     36.88     36.82
numa02_INVERSE_BIND:                 66.07     66.10     66.09
numa02_SMT:                          53.23     53.51     53.35
numa02_SMT_HARD_BIND:                35.19     35.27     35.24
numa02_SMT_INVERSE_BIND:             63.50     63.54     63.52

KernelVersion: numasched (i.e 3.7.0-rc3 + your patches) (with HT enabled)
Testcase:                              Min       Max       Avg   %Change
numa01:                             242.68    244.59    243.53     0.06%
numa01_HARD_BIND:                   169.37    169.42    169.40    -0.01%
numa01_INVERSE_BIND:                299.83    299.96    299.91    -0.07%
numa01_THREAD_ALLOC:                399.53    403.13    401.62    -0.03%
numa01_THREAD_ALLOC_HARD_BIND:      278.78    278.80    278.79    -0.01%
numa01_THREAD_ALLOC_INVERSE_BIND:   493.63    493.90    493.78    -0.05%
numa02:                              53.06     53.42     53.22    -0.06%
numa02_HARD_BIND:                    36.78     36.87     36.82     0.00%
numa02_INVERSE_BIND:                 66.09     66.10     66.10    -0.02%
numa02_SMT:                          53.34     53.55     53.42    -0.13%
numa02_SMT_HARD_BIND:                35.22     35.29     35.25    -0.03%
numa02_SMT_INVERSE_BIND:             63.50     63.58     63.53    -0.02%




On 8 node, 64 core, 320 GB 


KernelVersion: 3.7.0-rc3()
Testcase:                              Min       Max       Avg
numa01:                            1550.56   1596.03   1574.24
numa01_HARD_BIND:                   915.25   2540.64   1392.42
numa01_INVERSE_BIND:               2964.66   3716.33   3149.10
numa01_THREAD_ALLOC:                922.99   1003.31    972.99
numa01_THREAD_ALLOC_HARD_BIND:      579.54   1266.65    896.75
numa01_THREAD_ALLOC_INVERSE_BIND:  1794.51   2057.16   1922.86
numa02:                             126.22    133.01    130.91
numa02_HARD_BIND:                    25.85     26.25     26.06
numa02_INVERSE_BIND:                341.38    350.35    345.82
numa02_SMT:                         153.06    175.41    163.47
numa02_SMT_HARD_BIND:                27.10    212.39    114.37
numa02_SMT_INVERSE_BIND:            285.70   1542.83    540.62

KernelVersion: numasched()
 


Re: [PATCH 00/31] numa/core patches

2012-11-03 Thread Mel Gorman
On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> >
> > In reality, this report is larger but I chopped it down a bit for
> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> > terms of average operations per numa node and overall throughput.
> >
> > SPECJBB PEAKS
> >                             3.7.0                3.7.0                3.7.0
> >                    rc2-stats-v2r1   rc2-autonuma-v27r8   rc2-schednuma-v1r3
> > Expctd Warehouse       12.00 (  0.00%)      12.00 (  0.00%)      12.00 (  0.00%)
> > Expctd Peak Bops   442225.00 (  0.00%)  596039.00 ( 34.78%)  555342.00 ( 25.58%)
> > Actual Warehouse        7.00 (  0.00%)       9.00 ( 28.57%)       8.00 ( 14.29%)
> > Actual Peak Bops   550747.00 (  0.00%)  646124.00 ( 17.32%)  560635.00 (  1.80%)
> 
> It is an impressive report!
> 
> Would you like to share which JVM and options you are using in the
> testing, and on which kind of platform?
> 

Oracle JVM version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

4 JVMs were run, one for each node.

JVM switch specified was -Xmx12901m so it would consume roughly 80% of
memory overall.
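(Four JVMs at -Xmx12901m is about 50.4G, i.e. roughly 79% of the 64G.)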

Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
total with HT enabled.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 00/31] numa/core patches

2012-11-03 Thread Alex Shi
>
> In reality, this report is larger but I chopped it down a bit for
> brevity. autonuma beats schednuma *heavily* on this benchmark both in
> terms of average operations per numa node and overall throughput.
>
> SPECJBB PEAKS
>                             3.7.0                3.7.0                3.7.0
>                    rc2-stats-v2r1   rc2-autonuma-v27r8   rc2-schednuma-v1r3
>  Expctd Warehouse       12.00 (  0.00%)      12.00 (  0.00%)      12.00 (  0.00%)
>  Expctd Peak Bops   442225.00 (  0.00%)  596039.00 ( 34.78%)  555342.00 ( 25.58%)
>  Actual Warehouse        7.00 (  0.00%)       9.00 ( 28.57%)       8.00 ( 14.29%)
>  Actual Peak Bops   550747.00 (  0.00%)  646124.00 ( 17.32%)  560635.00 (  1.80%)

It is an impressive report!

Would you like to share which JVM and options you are using in the
testing, and on which kind of platform?

-- 
Thanks
Alex


Re: [PATCH 00/31] numa/core patches

2012-11-02 Thread Hugh Dickins
On Fri, 2 Nov 2012, Zhouping Liu wrote:
> On 11/01/2012 09:41 PM, Hugh Dickins wrote:
> > 
> > Here's a patch fixing and tidying up that and a few other things there.
> > But I'm not signing it off yet, partly because I've barely tested it
> > (quite probably I didn't even have any numa pmd migration happening
> > at all), and partly because just a moment ago I ran across this
> > instructive comment in __collapse_huge_page_isolate():
> > /* cannot use mapcount: can't collapse if there's a gup pin */
> > if (page_count(page) != 1) {
> > 
> > Hmm, yes, below I've added the page_mapcount() check I proposed to
> > do_huge_pmd_numa_page(), but is even that safe enough?  Do we actually
> > need a page_count() check (for 2?) to guard against get_user_pages()?
> > I suspect we do, but then do we have enough locking to stabilize such
> > a check?  Probably, but...
> > 
> > This will take more time, and I doubt get_user_pages() is an issue in
> > your testing, so please would you try the patch below, to see if it
> > does fix the BUGs you are seeing?  Thanks a lot.
> 
> Hugh, I have tested the patch for 5 more hours,
> the issue can't be reproduced again,
> so I think it has fixed the issue, thank you :)

Thanks a lot for testing and reporting back, that's good news.

However, I've meanwhile become convinced that more fixes are needed here,
to be safe against get_user_pages() (including get_user_pages_fast());
to get the Mlocked count right; and to recover correctly when !pmd_same
with an Unevictable page.

Won't now have time to update the patch today,
but these additional fixes shouldn't hold up your testing.

Hugh


Re: [PATCH 00/31] numa/core patches

2012-11-01 Thread Zhouping Liu

On 11/01/2012 09:41 PM, Hugh Dickins wrote:

On Wed, 31 Oct 2012, Hugh Dickins wrote:

On Wed, 31 Oct 2012, Zhouping Liu wrote:

On 10/31/2012 03:26 PM, Hugh Dickins wrote:

There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
would help if we could focus on the one which is giving the trouble,
but I don't know which that is.  Zhouping, if you can, please would
you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
is the next function, and post or mail privately just that disassembly.
That should be good to identify which of the put_page()s is involved.

Hugh, I didn't find the next function, as I can't find any words that matched
"do_huge_pmd_numa_page".
is there any other methods?

Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
unless I've made a typo but am blind to it.

Were you applying objdump to the vmlinux which gave you the
BUG at mm/memcontrol.c:1134! ?

Thanks for the further info you then sent privately: I have not made any
more effort to reproduce the issue, but your objdump did tell me that the
put_page hitting the problem is the one on line 872 of mm/huge_memory.c,
"Drop the local reference", just before successful return after migration.

I didn't really get the inspiration I'd hoped for out of knowing that,
but it did make me wonder whether you're suffering from one of the issues
I already mentioned, and I can now see a way in which it might cause
the mm/memcontrol.c:1134 BUG:-

migrate_page_copy() does TestClearPageActive on the source page:
so given the unsafe way in which do_huge_pmd_numa_page() was proceeding
with a !PageLRU page, it's quite possible that the page was sitting in
a pagevec, and added to the active lru (so added to the lru_size of the
active lru), but our final put_page removes it from lru, active flag has
been cleared, so we subtract it from the lru_size of the inactive lru -
that could indeed make it go negative and trigger the BUG.

Here's a patch fixing and tidying up that and a few other things there.
But I'm not signing it off yet, partly because I've barely tested it
(quite probably I didn't even have any numa pmd migration happening
at all), and partly because just a moment ago I ran across this
instructive comment in __collapse_huge_page_isolate():
/* cannot use mapcount: can't collapse if there's a gup pin */
if (page_count(page) != 1) {

Hmm, yes, below I've added the page_mapcount() check I proposed to
do_huge_pmd_numa_page(), but is even that safe enough?  Do we actually
need a page_count() check (for 2?) to guard against get_user_pages()?
I suspect we do, but then do we have enough locking to stabilize such
a check?  Probably, but...

This will take more time, and I doubt get_user_pages() is an issue in
your testing, so please would you try the patch below, to see if it
does fix the BUGs you are seeing?  Thanks a lot.


Hugh, I have tested the patch for 5 more hours, the issue can't be
reproduced again, so I think it has fixed the issue, thank you :)

Zhouping



Not-Yet-Signed-off-by: Hugh Dickins 
---

 mm/huge_memory.c |   24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

--- 3.7-rc2+schednuma+johannes/mm/huge_memory.c	2012-11-01 04:10:43.812155671 -0700
+++ linux/mm/huge_memory.c	2012-11-01 05:52:19.512153771 -0700
@@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
 	struct mem_cgroup *memcg = NULL;
 	struct page *new_page = NULL;
 	struct page *page = NULL;
-	int node, lru;
+	int node = -1;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry)))
@@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
 		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
 
 		get_page(page);
-		node = mpol_misplaced(page, vma, haddr);
+		if (page_mapcount(page) == 1)	/* Only do exclusively mapped */
+			node = mpol_misplaced(page, vma, haddr);
 		if (node != -1)
 			goto migrate;
 	}
@@ -801,13 +802,11 @@ migrate:
 	if (!new_page)
 		goto alloc_fail;
 
-	lru = PageLRU(page);
-
-	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+	if (isolate_lru_page(page))	/* Does an implicit get_page() */
 		goto alloc_fail;
 
-	if (!trylock_page(new_page))
-		BUG();
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);
 
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
@@ -820,8 +819,6 @@ migrate:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry))) {
 		spin_unlock(&mm->page_table_lock);
-		if (lru)
-			putback_lru_page(page);
 
 		unlock_page(new_page);
 		ClearPageActive(new_page);	/* Set by

Re: [PATCH 00/31] numa/core patches

2012-11-01 Thread Hugh Dickins
On Wed, 31 Oct 2012, Hugh Dickins wrote:
> On Wed, 31 Oct 2012, Zhouping Liu wrote:
> > On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > > 
> > > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > > would help if we could focus on the one which is giving the trouble,
> > > but I don't know which that is.  Zhouping, if you can, please would
> > > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > > is the next function, and post or mail privately just that disassembly.
> > > That should be good to identify which of the put_page()s is involved.
> > 
> > Hugh, I didn't find the next function, as I can't find any words that 
> > matched
> > "do_huge_pmd_numa_page".
> > is there any other methods?
> 
> Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
> unless I've made a typo but am blind to it.
> 
> Were you applying objdump to the vmlinux which gave you the
> BUG at mm/memcontrol.c:1134! ?

Thanks for the further info you then sent privately: I have not made any
more effort to reproduce the issue, but your objdump did tell me that the
put_page hitting the problem is the one on line 872 of mm/huge_memory.c,
"Drop the local reference", just before successful return after migration.

I didn't really get the inspiration I'd hoped for out of knowing that,
but it did make me wonder whether you're suffering from one of the issues
I already mentioned, and I can now see a way in which it might cause
the mm/memcontrol.c:1134 BUG:-

migrate_page_copy() does TestClearPageActive on the source page:
so given the unsafe way in which do_huge_pmd_numa_page() was proceeding
with a !PageLRU page, it's quite possible that the page was sitting in
a pagevec, and added to the active lru (so added to the lru_size of the
active lru), but our final put_page removes it from lru, active flag has
been cleared, so we subtract it from the lru_size of the inactive lru -
that could indeed make it go negative and trigger the BUG.

Here's a patch fixing and tidying up that and a few other things there.
But I'm not signing it off yet, partly because I've barely tested it
(quite probably I didn't even have any numa pmd migration happening
at all), and partly because just a moment ago I ran across this
instructive comment in __collapse_huge_page_isolate():
/* cannot use mapcount: can't collapse if there's a gup pin */
if (page_count(page) != 1) {

Hmm, yes, below I've added the page_mapcount() check I proposed to
do_huge_pmd_numa_page(), but is even that safe enough?  Do we actually
need a page_count() check (for 2?) to guard against get_user_pages()?
I suspect we do, but then do we have enough locking to stabilize such
a check?  Probably, but...

This will take more time, and I doubt get_user_pages() is an issue in
your testing, so please would you try the patch below, to see if it
does fix the BUGs you are seeing?  Thanks a lot.

Not-Yet-Signed-off-by: Hugh Dickins 
---

 mm/huge_memory.c |   24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

--- 3.7-rc2+schednuma+johannes/mm/huge_memory.c	2012-11-01 04:10:43.812155671 -0700
+++ linux/mm/huge_memory.c	2012-11-01 05:52:19.512153771 -0700
@@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
 	struct mem_cgroup *memcg = NULL;
 	struct page *new_page = NULL;
 	struct page *page = NULL;
-	int node, lru;
+	int node = -1;
 
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry)))
@@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
 		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
 
 		get_page(page);
-		node = mpol_misplaced(page, vma, haddr);
+		if (page_mapcount(page) == 1)	/* Only do exclusively mapped */
+			node = mpol_misplaced(page, vma, haddr);
 		if (node != -1)
 			goto migrate;
 	}
@@ -801,13 +802,11 @@ migrate:
 	if (!new_page)
 		goto alloc_fail;
 
-	lru = PageLRU(page);
-
-	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+	if (isolate_lru_page(page))	/* Does an implicit get_page() */
 		goto alloc_fail;
 
-	if (!trylock_page(new_page))
-		BUG();
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);
 
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
@@ -820,8 +819,6 @@ migrate:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry))) {
 		spin_unlock(&mm->page_table_lock);
-		if (lru)
-			putback_lru_page(page);
 
 		unlock_page(new_page);
 		ClearPageActive(new_page);	/* Set by migrate_page_copy() */
@@ -829,6 +826,7 @@ migrate:
 		put_page(new_page);


Re: [PATCH 00/31] numa/core patches

2012-10-31 Thread Hugh Dickins
On Wed, 31 Oct 2012, Zhouping Liu wrote:
> On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > 
> > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > would help if we could focus on the one which is giving the trouble,
> > but I don't know which that is.  Zhouping, if you can, please would
> > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > is the next function, and post or mail privately just that disassembly.
> > That should be good to identify which of the put_page()s is involved.
> 
> Hugh, I didn't find the next function, as I can't find any words that matched
> "do_huge_pmd_numa_page".
> is there any other methods?

Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
unless I've made a typo but am blind to it.

Were you applying objdump to the vmlinux which gave you the
BUG at mm/memcontrol.c:1134! ?

Maybe just do "objdump -ld mm/huge_memory.o >notsobigfile"
and mail me an attachment of the notsobigfile.
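
For what it's worth, an untested sketch along these lines should carve just
that one function's disassembly out of the objdump output (the output file
name is arbitrary):

  objdump -ld mm/huge_memory.o | \
  awk '/^[0-9a-f]+ </ { p = /<do_huge_pmd_numa_page>:/ } p' > dhpnp.txt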

I did try building your config here last night, but ran out of disk
space on this partition, and it was already clear that my gcc version
differs from yours, so not quite matching.

> also I tried to use kdump to dump vmcore file,
> but unluckily kdump didn't
> work well, if you think it useful to dump vmcore file, I can try it again and
> provide more info.

It would take me awhile to get up to speed on using that,
I'd prefer to start with just the objdump of huge_memory.o.

I forgot last night to say that I did try stress (but not on a kernel
of your config), but didn't see the BUG: I expect there are too many
differences in our environments, and I'd have to tweak things one way
or another to get it to happen - probably a waste of time.

Thanks,
Hugh


Re: [PATCH 00/31] numa/core patches

2012-10-31 Thread Zhouping Liu

On 10/31/2012 03:26 PM, Hugh Dickins wrote:

On Tue, 30 Oct 2012, Johannes Weiner wrote:

[88099.923724] [ cut here ]
[88099.924036] kernel BUG at mm/memcontrol.c:1134!
[88099.924036] invalid opcode:  [#1] SMP
[88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
drm_kms_helper ttm drm i2c_core
[88099.924036] CPU 7
[88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
Dell Inc. PowerEdge 6950/0WN213
[88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
mem_cgroup_update_lru_size+0x27/0x30

Thanks a lot for your testing efforts, I really appreciate it.

I'm looking into it, but I don't expect power to get back for several
days where I live, so it's hard to reproduce it locally.

But that looks like an LRU accounting imbalance that I wasn't able to
tie to this patch yet.  Do you see weird numbers for the lru counters
in /proc/vmstat even without this memory cgroup patch?  Ccing Hugh as
well.

Sorry, I didn't get very far with it tonight.

Almost certain to be a page which was added to lru while it looked like
a 4k page, but taken off lru as a 2M page: we are taking a 2M page off
lru here, it's likely to be the page in question, but not necessarily.

There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
would help if we could focus on the one which is giving the trouble,
but I don't know which that is.  Zhouping, if you can, please would
you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
is the next function, and post or mail privately just that disassembly.
That should be good to identify which of the put_page()s is involved.


Hugh, I didn't find the next function, as I can't find any words that
matched "do_huge_pmd_numa_page". Are there any other methods? Also, I
tried to use kdump to dump a vmcore file, but unluckily kdump didn't
work well; if you think it useful to dump a vmcore file, I can try it
again and provide more info.


Thanks,
Zhouping


Re: [PATCH 00/31] numa/core patches

2012-10-31 Thread Hugh Dickins
On Tue, 30 Oct 2012, Johannes Weiner wrote:
> On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> > On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> > >On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> > >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > >>>[  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> > >>>[  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > >>>[  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > >>>[  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > >>>[  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > >>>[  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > >>>[  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > >>>[  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > >>>[  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
> > >>Johannes, this looks like the thp migration memcg hookery gone bad,
> > >>could you have a look at this?
> > >Oops.  Here is an incremental fix, feel free to fold it into #31.
> > Hello Johannes,
> > 
> > maybe I don't think the below patch completely fix this issue, as I
> > found a new error(maybe similar with this):
> > 
> > [88099.923724] [ cut here ]
> > [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> > [88099.924036] invalid opcode:  [#1] SMP
> > [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> > amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> > joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> > megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> > drm_kms_helper ttm drm i2c_core
> > [88099.924036] CPU 7
> > [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> > Dell Inc. PowerEdge 6950/0WN213
> > [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> > mem_cgroup_update_lru_size+0x27/0x30
> 
> Thanks a lot for your testing efforts, I really appreciate it.
> 
> I'm looking into it, but I don't expect power to get back for several
> days where I live, so it's hard to reproduce it locally.
> 
> But that looks like an LRU accounting imbalance that I wasn't able to
> tie to this patch yet.  Do you see weird numbers for the lru counters
> in /proc/vmstat even without this memory cgroup patch?  Ccing Hugh as
> well.

Sorry, I didn't get very far with it tonight.

Almost certain to be a page which was added to lru while it looked like
a 4k page, but taken off lru as a 2M page: we are taking a 2M page off
lru here, it's likely to be the page in question, but not necessarily.

There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
would help if we could focus on the one which is giving the trouble,
but I don't know which that is.  Zhouping, if you can, please would
you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
is the next function, and post or mail privately just that disassembly.
That should be good to identify which of the put_page()s is involved.

do_huge_pmd_numa_page() does look a bit worrying, but I've not pinned
the misaccounting seen to the aspects which have worried me so far.
Where is a check for page_mapcount(page) being 1?  And surely it's
unsafe to be migrating the page when it was found !PageLRU?  It's
quite likely to be sitting in a pagevec or on a local list somewhere,
about to be added to lru at any moment.

Hugh


Re: [PATCH 00/31] numa/core patches

2012-10-31 Thread Zhouping Liu

On 10/31/2012 03:26 PM, Hugh Dickins wrote:

On Tue, 30 Oct 2012, Johannes Weiner wrote:

[88099.923724] [ cut here ]
[88099.924036] kernel BUG at mm/memcontrol.c:1134!
[88099.924036] invalid opcode:  [#1] SMP
[88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
drm_kms_helper ttm drm i2c_core
[88099.924036] CPU 7
[88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
Dell Inc. PowerEdge 6950/0WN213
[88099.924036] RIP: 0010:[81188e97] [81188e97]
mem_cgroup_update_lru_size+0x27/0x30

Thanks a lot for your testing efforts, I really appreciate it.

I'm looking into it, but I don't expect power to get back for several
days where I live, so it's hard to reproduce it locally.

But that looks like an LRU accounting imbalance that I wasn't able to
tie to this patch yet.  Do you see weird numbers for the lru counters
in /proc/vmstat even without this memory cgroup patch?  Ccing Hugh as
well.

Sorry, I didn't get very far with it tonight.

Almost certain to be a page which was added to lru while it looked like
a 4k page, but taken off lru as a 2M page: we are taking a 2M page off
lru here, it's likely to be the page in question, but not necessarily.

There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
would help if we could focus on the one which is giving the trouble,
but I don't know which that is.  Zhouping, if you can, please would
 you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
 from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
is the next function, and post or mail privately just that disassembly.
That should be good to identify which of the put_page()s is involved.


Hugh, I didn't find the next function, as I can't find any words that 
matched do_huge_pmd_numa_page.
Are there any other methods? Also, I tried to use kdump to dump a vmcore
file, but unluckily kdump didn't
work well; if you think it would be useful to dump a vmcore file, I can try it
again and provide more info.


Thanks,
Zhouping
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-31 Thread Hugh Dickins
On Wed, 31 Oct 2012, Zhouping Liu wrote:
 On 10/31/2012 03:26 PM, Hugh Dickins wrote:
  
  There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
  would help if we could focus on the one which is giving the trouble,
  but I don't know which that is.  Zhouping, if you can, please would
  you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
  from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
  is the next function, and post or mail privately just that disassembly.
  That should be good to identify which of the put_page()s is involved.
 
 Hugh, I didn't find the next function, as I can't find any words that matched
 do_huge_pmd_numa_page.
 Are there any other methods?

Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
unless I've made a typo but am blind to it.

Were you applying objdump to the vmlinux which gave you the
BUG at mm/memcontrol.c:1134! ?

Maybe just do "objdump -ld mm/huge_memory.o >notsobigfile"
and mail me an attachment of the notsobigfile.

I did try building your config here last night, but ran out of disk
space on this partition, and it was already clear that my gcc version
differs from yours, so not quite matching.

 Also, I tried to use kdump to dump a vmcore file,
 but unluckily kdump didn't
 work well; if you think it would be useful to dump a vmcore file, I can try it again and
 provide more info.

It would take me awhile to get up to speed on using that,
I'd prefer to start with just the objdump of huge_memory.o.

I forgot last night to say that I did try stress (but not on a kernel
of your config), but didn't see the BUG: I expect there are too many
differences in our environments, and I'd have to tweak things one way
or another to get it to happen - probably a waste of time.

Thanks,
Hugh
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-30 Thread Johannes Weiner
On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> >On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> >>>[  180.918591] RIP: 0010:[]  [] 
> >>>mem_cgroup_prepare_migration+0xba/0xd0
> >>>[  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
> >>>[  182.775090]  [] handle_mm_fault+0x1e9/0x360
> >>>[  182.863038]  [] __do_page_fault+0x172/0x4e0
> >>>[  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
> >>>[  183.041512]  [] ? __switch_to+0x3ce/0x4a0
> >>>[  183.126832]  [] ? __schedule+0x3c6/0x7a0
> >>>[  183.211216]  [] do_page_fault+0xe/0x10
> >>>[  183.293705]  [] page_fault+0x28/0x30
> >>Johannes, this looks like the thp migration memcg hookery gone bad,
> >>could you have a look at this?
> >Oops.  Here is an incremental fix, feel free to fold it into #31.
> Hello Johannes,
> 
> maybe I don't think the below patch completely fix this issue, as I
> found a new error(maybe similar with this):
> 
> [88099.923724] [ cut here ]
> [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> [88099.924036] invalid opcode:  [#1] SMP
> [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> drm_kms_helper ttm drm i2c_core
> [88099.924036] CPU 7
> [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> Dell Inc. PowerEdge 6950/0WN213
> [88099.924036] RIP: 0010:[] []
> mem_cgroup_update_lru_size+0x27/0x30

Thanks a lot for your testing efforts, I really appreciate it.

I'm looking into it, but I don't expect power to get back for several
days where I live, so it's hard to reproduce it locally.

But that looks like an LRU accounting imbalance that I wasn't able to
tie to this patch yet.  Do you see weird numbers for the lru counters
in /proc/vmstat even without this memory cgroup patch?  Ccing Hugh as
well.

Thanks,
Johannes
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-30 Thread Mel Gorman
On Tue, Oct 30, 2012 at 08:28:10AM -0700, Andrew Morton wrote:
> 
> On Tue, 30 Oct 2012 12:20:32 + Mel Gorman  wrote:
> 
> > ...
> 
> Useful testing - thanks.  Did I miss the description of what
> autonumabench actually does?  How representative is it of real-world
> things?
> 

It's not representative of anything at all. It's a synthetic benchmark
that just measures if automatic NUMA migration (whatever the mechanism)
is working as expected. I'm not aware of a decent description of what
the test does and why. Here is my current interpretation and hopefully
Andrea will correct me if I'm wrong.

NUMA01
  Two processes
  NUM_CPUS/2 number of threads so all CPUs are in use
  
  On startup, the process forks
  Each process mallocs a 3G buffer but there is no communication
  between the processes.
  Threads are created that zero out the full buffer 1000 times
  (a rough user-space sketch of this access pattern follows these
  benchmark descriptions)

  The objective of the test is that initially the two processes
  allocate their memory on the same node. As the threads are
  created, the memory will migrate from the initial node to
  nodes that are closer to the referencing thread.

  It is worth noting that this benchmark is specifically tuned
  for two nodes and the expectation is that the two processes
  and their threads split so that all of process A runs on node 0
  and all threads of process B run on node 1

  With 4 or more nodes, this is actually an adverse workload.
  As all of the buffer is zeroed in both processes, there is an
  expectation that it will continually bounce between two nodes.

  So, on 2 nodes, this benchmark tests convergence. On 4 or more
  nodes, this partially measures how much busy work automatic
  NUMA migration does and it'll be very noisy due to cache conflicts.

NUMA01_THREADLOCAL
  Two processes
  NUM_CPUS/2 number of threads so all CPUs are in use

  On startup, the process forks
  Each process mallocs a 3G buffer but there is no communication
  between the processes
  Threads are created that zero out their own subset of the buffer.
  Each buffer is 3G/NR_THREADS in size
  
  This benchmark is more realistic. In an ideal situation, each
  thread will migrate its data to its local node. The test really
  is to see whether it converges and how quickly.

NUMA02
 One process, NR_CPU threads

 On startup, malloc a 1G buffer
 Create threads that zero out a thread-local portion of the buffer.
  Each thread zeroes it multiple times - the number of passes is fixed
  and seems to be there just to make the test take a period of time

 This is similar in principle to NUMA01_THREADLOCAL except that only
 one process is involved. I think it was aimed at being more JVM-like.

NUMA02_SMT
 One process, NR_CPU/2 threads

 This is a variation of NUMA02 except that with half the cores idle it
 is checking if the system migrates the memory to two or more nodes or
 if it tries to fit everything in one node even though the memory should
 migrate to be close to the CPU
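
For illustration only, here is a scaled-down user-space sketch of the NUMA01
access pattern described above (buffer size, thread count and pass count are
placeholders, not the autonumabench values, and this is not the benchmark's
source):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_THREADS	4
#define BUF_SIZE	(64UL << 20)
#define NR_PASSES	100

static char *buf;

static void *zero_full_buffer(void *arg)
{
	(void)arg;
	for (int i = 0; i < NR_PASSES; i++)
		memset(buf, 0, BUF_SIZE);	/* every thread touches every page */
	return NULL;
}

static void run_threads(void)
{
	pthread_t tid[NR_THREADS];

	buf = malloc(BUF_SIZE);	/* faulted in near whichever node the process starts on */
	if (!buf)
		return;
	for (int i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, zero_full_buffer, NULL);
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	free(buf);
}

int main(void)
{
	pid_t child = fork();	/* two independent processes, no shared memory */

	run_threads();
	if (child > 0)
		wait(NULL);
	return 0;
}

Build with something like "cc -pthread"; the point is only that each process
faults its buffer in near its starting node and the threads then write the
whole buffer from wherever the scheduler places them.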

> > I also expect autonuma is continually scanning whereas schednuma is
> > reacting to some other external event or at least less frequently scanning.
> 
> Might this imply that autonuma is consuming more CPU in kernel threads,
> the cost of which didn't get included in these results?

It might but according to top, knuma_scand only used 7.86 seconds of CPU
time during the whole test and the time used by the migration threads is
also very low. Most migration threads used less than 1 second of CPU
time. Two migration threads used 2 seconds of CPU time each but that
still seems low.
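
For scale (rough arithmetic that mixes CPU time with wall-clock time, so it
is only a loose bound), 7.86 seconds of knuma_scand CPU time over the
1215.63-second autonuma run works out to well under one percent:

#include <stdio.h>

int main(void)
{
	double knuma_scand = 7.86;	/* seconds of CPU time, from top */
	double elapsed = 1215.63;	/* autonuma elapsed time, from the duration table */

	printf("%.2f%%\n", knuma_scand / elapsed * 100.0);	/* ~0.65% */
	return 0;
}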

-- 
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-30 Thread Andrew Morton

On Tue, 30 Oct 2012 12:20:32 + Mel Gorman  wrote:

> ...

Useful testing - thanks.  Did I miss the description of what
autonumabench actually does?  How representative is it of real-world
things?

> I also expect autonuma is continually scanning whereas schednuma is
> reacting to some other external event or at least less frequently scanning.

Might this imply that autonuma is consuming more CPU in kernel threads,
the cost of which didn't get included in these results?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-30 Thread Mel Gorman
On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:
> Hi all,
> 
> Here's a re-post of the NUMA scheduling and migration improvement
> patches that we are working on. These include techniques from
> AutoNUMA and the sched/numa tree and form a unified basis - it
> has got all the bits that look good and mergeable.
> 

Thanks for the repost. I have not even started a review yet as I was
travelling and just online today. It will be another day or two before I can
start but I was at least able to do a comparison test between autonuma and
schednuma today to see which actually performs the best. Even without the
review I was able to stick on similar vmstat patches to those applied to autonuma
to give a rough estimate of the relative overhead of both implementations.

Machine was a 4-node box running autonumabench and specjbb.

Three kernels are

3.7-rc2-stats-v2r1  vmstat patches on top
3.7-rc2-autonuma-v27latest autonuma with stats on top
3.7-rc2-schednuma-v1r3  schednuma series minus the last patch + stats

AUTONUMA BENCH
                                        3.7.0                 3.7.0                 3.7.0
                               rc2-stats-v2r1    rc2-autonuma-v27r8    rc2-schednuma-v1r3
User    NUMA01             67145.71 (  0.00%)    30110.13 ( 55.16%)    61666.46 (  8.16%)
User    NUMA01_THEADLOCAL  55104.60 (  0.00%)    17285.49 ( 68.63%)    17135.48 ( 68.90%)
User    NUMA02              7074.54 (  0.00%)     2219.11 ( 68.63%)     2226.09 ( 68.53%)
User    NUMA02_SMT          2916.86 (  0.00%)      999.19 ( 65.74%)     1038.06 ( 64.41%)
System  NUMA01                42.28 (  0.00%)     469.07 (-1009.44%)   2808.08 (-6541.63%)
System  NUMA01_THEADLOCAL     41.71 (  0.00%)     183.24 ( -339.32%)    174.92 ( -319.37%)
System  NUMA02                34.67 (  0.00%)      27.85 (   19.67%)     15.03 (   56.65%)
System  NUMA02_SMT             0.89 (  0.00%)      18.36 (-1962.92%)      5.05 ( -467.42%)
Elapsed NUMA01              1512.97 (  0.00%)      698.18 ( 53.85%)    1422.71 (  5.97%)
Elapsed NUMA01_THEADLOCAL   1264.23 (  0.00%)      389.51 ( 69.19%)     377.51 ( 70.14%)
Elapsed NUMA02               181.52 (  0.00%)       60.65 ( 66.59%)      52.86 ( 70.88%)
Elapsed NUMA02_SMT           163.59 (  0.00%)       58.57 ( 64.20%)      48.82 ( 70.16%)
CPU     NUMA01              4440.00 (  0.00%)     4379.00 (  1.37%)    4531.00 ( -2.05%)
CPU     NUMA01_THEADLOCAL   4362.00 (  0.00%)     4484.00 ( -2.80%)    4585.00 ( -5.11%)
CPU     NUMA02              3916.00 (  0.00%)     3704.00 (  5.41%)    4239.00 ( -8.25%)
CPU     NUMA02_SMT          1783.00 (  0.00%)     1737.00 (  2.58%)    2136.00 (-19.80%)
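
The bracketed figures appear to be percentage deltas relative to the
rc2-stats baseline, with positive meaning an improvement; a quick check
under that assumption reproduces the numbers above:

#include <stdio.h>

/* Assumed formula for the bracketed deltas: gain relative to the baseline. */
static double delta(double base, double value)
{
	return (base - value) / base * 100.0;
}

int main(void)
{
	printf("autonuma  System  NUMA01: %9.2f%%\n", delta(42.28, 469.07));	/* ~ -1009.44 */
	printf("schednuma Elapsed NUMA01: %9.2f%%\n", delta(1512.97, 1422.71));	/* ~     5.97 */
	return 0;
}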

Two figures really matter here - System CPU usage and Elapsed time.

autonuma was known to hurt system CPU usage for the NUMA01 test case but
schednuma does *far* worse. I do not have a breakdown of where this time
is being spent but the raw figure is bad. autonuma is 10 times worse
than a vanilla kernel and schednuma is 5 times worse than autonuma.

For the overhead of the other test cases, schednuma is roughly
comparable with autonuma -- i.e. both pretty high overhead.

In terms of elapsed time, autonuma in the NUMA01 test case massively
improves elapsed time while schednuma barely makes a dent on it. Looking
at the memory usage per node (I generated a graph offline), it appears
that schednuma does not migrate pages to other nodes fast enough. The
convergence figures do not reflect this because the convergence seems
high (towards 1) but it may be because the approximation using faults is
insufficient.

In the other cases, schednuma does well and is comparable to autonuma.

MMTests Statistics: duration
                            3.7.0              3.7.0              3.7.0
                   rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
User                    132248.88           50620.50           82073.11
System                     120.19             699.12            3003.83
Elapsed                   3131.10            1215.63            1911.55

This is the overall time to complete the test. autonuma is way better
than schednuma but this is all due to how it handles the NUMA01 test
case.

MMTests Statistics: vmstat
                          3.7.0              3.7.0              3.7.0
                 rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
Page Ins 37256   37508   37360
Page Outs2   13372   19488
Swap Ins 0   0   0
Swap Outs0   0   0
Direct pages scanned 0   0   0
Kswapd pages scanned 0   0   0
Kswapd pages reclaimed   0   0   0
Direct pages reclaimed   0   0   0
Kswapd efficiency 100%100%100%
Kswapd velocity  0.000   0.000   0.000
Direct efficiency 100%100%100%
Direct 

Re: [PATCH 00/31] numa/core patches

2012-10-30 Thread Zhouping Liu

On 10/29/2012 01:56 AM, Johannes Weiner wrote:

On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:

On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:

[  180.918591] RIP: 0010:[]  [] 
mem_cgroup_prepare_migration+0xba/0xd0
[  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
[  182.775090]  [] handle_mm_fault+0x1e9/0x360
[  182.863038]  [] __do_page_fault+0x172/0x4e0
[  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
[  183.041512]  [] ? __switch_to+0x3ce/0x4a0
[  183.126832]  [] ? __schedule+0x3c6/0x7a0
[  183.211216]  [] do_page_fault+0xe/0x10
[  183.293705]  [] page_fault+0x28/0x30

Johannes, this looks like the thp migration memcg hookery gone bad,
could you have a look at this?

Oops.  Here is an incremental fix, feel free to fold it into #31.

Hello Johannes,

maybe I don't think the below patch completely fix this issue, as I 
found a new error(maybe similar with this):


[88099.923724] [ cut here ]
[88099.924036] kernel BUG at mm/memcontrol.c:1134!
[88099.924036] invalid opcode:  [#1] SMP
[88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm 
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp 
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi 
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit 
drm_kms_helper ttm drm i2c_core

[88099.924036] CPU 7
[88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3 
Dell Inc. PowerEdge 6950/0WN213
[88099.924036] RIP: 0010:[] [] 
mem_cgroup_update_lru_size+0x27/0x30

[88099.924036] RSP: :88021b247ca8  EFLAGS: 00010082
[88099.924036] RAX: 88011d310138 RBX: ea0002f18000 RCX: 
0001
[88099.924036] RDX: fe00 RSI: 000e RDI: 
88011d310138
[88099.924036] RBP: 88021b247ca8 R08:  R09: 
a8000bc6
[88099.924036] R10:  R11:  R12: 
fe00
[88099.924036] R13: 88011ffecb40 R14: 0286 R15: 

[88099.924036] FS:  7f787d0bf740() GS:88021fc8() 
knlGS:

[88099.924036] CS:  0010 DS:  ES:  CR0: 8005003b
[88099.924036] CR2: 7f7873a00010 CR3: 00021bda CR4: 
07e0
[88099.924036] DR0:  DR1:  DR2: 

[88099.924036] DR3:  DR6: 0ff0 DR7: 
0400
[88099.924036] Process stress (pid: 3441, threadinfo 88021b246000, 
task 88021b399760)

[88099.924036] Stack:
[88099.924036]  88021b247cf8 8113a9cd ea0002f18000 
88011d310138
[88099.924036]  0200 ea0002f18000 88019bace580 
7f7873c0
[88099.924036]  88021aca0cf0 ea00081e 88021b247d18 
8113aa7d

[88099.924036] Call Trace:
[88099.924036]  [] __page_cache_release.part.11+0xdd/0x140
[88099.924036]  [] __put_compound_page+0x1d/0x30
[88099.924036]  [] put_compound_page+0x5d/0x1e0
[88099.924036]  [] put_page+0x45/0x50
[88099.924036]  [] do_huge_pmd_numa_page+0x2ec/0x4e0
[88099.924036]  [] handle_mm_fault+0x1e9/0x360
[88099.924036]  [] __do_page_fault+0x172/0x4e0
[88099.924036]  [] ? task_numa_work+0x1c9/0x220
[88099.924036]  [] ? task_work_run+0xac/0xe0
[88099.924036]  [] do_page_fault+0xe/0x10
[88099.924036]  [] page_fault+0x28/0x30
[88099.924036] Code: 00 00 00 00 66 66 66 66 90 44 8b 1d 1c 90 b5 00 55 
48 89 e5 45 85 db 75 10 89 f6 48 63 d2 48 83 c6 0e 48 01 54 f7 08 78 02 
5d c3 <0f> 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec
[88099.924036] RIP  [] 
mem_cgroup_update_lru_size+0x27/0x30

[88099.924036]  RSP 
[88099.924036] ---[ end trace c8d6b169e0c3f25a ]---
[88108.054610] [ cut here ]
[88108.054610] WARNING: at kernel/watchdog.c:245 
watchdog_overflow_callback+0x9c/0xd0()

[88108.054610] Hardware name: PowerEdge 6950
[88108.054610] Watchdog detected hard LOCKUP on cpu 3
[88108.054610] Modules linked in: lockd sunrpc kvm_amd kvm 
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp 
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi 
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit 
drm_kms_helper ttm drm i2c_core

[88108.054610] Pid: 3429, comm: stress Tainted: G  D 3.7.0-rc2Jons+ #3
[88108.054610] Call Trace:
[88108.054610][] warn_slowpath_common+0x7f/0xc0
[88108.054610]  [] warn_slowpath_fmt+0x46/0x50
[88108.054610]  [] ? sched_clock_cpu+0xa8/0x120
[88108.054610]  [] ? touch_nmi_watchdog+0x80/0x80
[88108.054610]  [] watchdog_overflow_callback+0x9c/0xd0
[88108.054610]  [] __perf_event_overflow+0x9d/0x230
[88108.054610]  [] ? perf_event_update_userpage+0x24/0x110
[88108.054610]  [] perf_event_overflow+0x14/0x20
[88108.054610]  [] x86_pmu_handle_irq+0x10a/0x160
[88108.054610]  [] perf_event_nmi_handler+0x1d/0x20
[88108.054610]  [] nmi_handle.isra.0+0x51/0x80
[88108.054610]  [] do_nmi+0x179/0x350
[88108.054610]  [] end_repeat_nmi+0x1e/0x2e
[88108.054610]  [] ? 

Re: [PATCH 00/31] numa/core patches

2012-10-28 Thread Zhouping Liu

On 10/29/2012 01:56 AM, Johannes Weiner wrote:

On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:

On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:

[  180.918591] RIP: 0010:[]  [] 
mem_cgroup_prepare_migration+0xba/0xd0
[  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
[  182.775090]  [] handle_mm_fault+0x1e9/0x360
[  182.863038]  [] __do_page_fault+0x172/0x4e0
[  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
[  183.041512]  [] ? __switch_to+0x3ce/0x4a0
[  183.126832]  [] ? __schedule+0x3c6/0x7a0
[  183.211216]  [] do_page_fault+0xe/0x10
[  183.293705]  [] page_fault+0x28/0x30

Johannes, this looks like the thp migration memcg hookery gone bad,
could you have a look at this?

Oops.  Here is an incremental fix, feel free to fold it into #31.


Hi Johannes,

Tested the below patch, and I'm sure it has fixed the above issue, thank 
you.


 Zhouping



Signed-off-by: Johannes Weiner 
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c30a14..0d7ebd3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (!new_page)
goto alloc_fail;
  
-	mem_cgroup_prepare_migration(page, new_page, &memcg);

-
lru = PageLRU(page);
  
  	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */

@@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
  
  		return;

}
+   /*
+* Traditional migration needs to prepare the memcg charge
+* transaction early to prevent the old page from being
+* uncharged when installing migration entries.  Here we can
+* save the potential rollback and start the charge transfer
+* only when migration is already known to end successfully.
+*/
+   mem_cgroup_prepare_migration(page, new_page, );
  
  	entry = mk_pmd(new_page, vma->vm_page_prot);

entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, entry);
page_remove_rmap(page);
+   /*
+* Finish the charge transaction under the page table lock to
+* prevent split_huge_page() from dividing up the charge
+* before it's fully transferred to the new page.
+*/
+   mem_cgroup_end_migration(memcg, page, new_page, true);
spin_unlock(&mm->page_table_lock);
  
  	put_page(page);			/* Drop the rmap reference */

@@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
  
  	unlock_page(new_page);
  
-	mem_cgroup_end_migration(memcg, page, new_page, true);

-
unlock_page(page);
put_page(page); /* Drop the local reference */
  
  	return;
  
  alloc_fail:

-   if (new_page) {
-   mem_cgroup_end_migration(memcg, page, new_page, false);
+   if (new_page)
put_page(new_page);
-   }
  
  	unlock_page(page);
  
diff --git a/mm/memcontrol.c b/mm/memcontrol.c

index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, 
struct page *newpage,
  struct mem_cgroup **memcgp)
  {
struct mem_cgroup *memcg = NULL;
+   unsigned int nr_pages = 1;
struct page_cgroup *pc;
enum charge_type ctype;
  
  	*memcgp = NULL;
  
-	VM_BUG_ON(PageTransHuge(page));

if (mem_cgroup_disabled())
return;
  
+	if (PageTransHuge(page))

+   nr_pages <<= compound_order(page);
+
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, 
struct page *newpage,
 * charged to the res_counter since we plan on replacing the
 * old one and only one page is going to be left afterwards.
 */
-   __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+   __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
  }
  
  /* remove redundant charge if migration failed*/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-28 Thread Johannes Weiner
On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [  180.918591] RIP: 0010:[]  [] 
> > mem_cgroup_prepare_migration+0xba/0xd0
> 
> > [  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
> > [  182.775090]  [] handle_mm_fault+0x1e9/0x360
> > [  182.863038]  [] __do_page_fault+0x172/0x4e0
> > [  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
> > [  183.041512]  [] ? __switch_to+0x3ce/0x4a0
> > [  183.126832]  [] ? __schedule+0x3c6/0x7a0
> > [  183.211216]  [] do_page_fault+0xe/0x10
> > [  183.293705]  [] page_fault+0x28/0x30 
> 
> Johannes, this looks like the thp migration memcg hookery gone bad,
> could you have a look at this?

Oops.  Here is an incremental fix, feel free to fold it into #31.

Signed-off-by: Johannes Weiner 
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c30a14..0d7ebd3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
if (!new_page)
goto alloc_fail;
 
-   mem_cgroup_prepare_migration(page, new_page, &memcg);
-
lru = PageLRU(page);
 
if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
@@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 
return;
}
+   /*
+* Traditional migration needs to prepare the memcg charge
+* transaction early to prevent the old page from being
+* uncharged when installing migration entries.  Here we can
+* save the potential rollback and start the charge transfer
+* only when migration is already known to end successfully.
+*/
+   mem_cgroup_prepare_migration(page, new_page, );
 
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, entry);
page_remove_rmap(page);
+   /*
+* Finish the charge transaction under the page table lock to
+* prevent split_huge_page() from dividing up the charge
+* before it's fully transferred to the new page.
+*/
+   mem_cgroup_end_migration(memcg, page, new_page, true);
spin_unlock(&mm->page_table_lock);
 
put_page(page); /* Drop the rmap reference */
@@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
 
unlock_page(new_page);
 
-   mem_cgroup_end_migration(memcg, page, new_page, true);
-
unlock_page(page);
put_page(page); /* Drop the local reference */
 
return;
 
 alloc_fail:
-   if (new_page) {
-   mem_cgroup_end_migration(memcg, page, new_page, false);
+   if (new_page)
put_page(new_page);
-   }
 
unlock_page(page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, 
struct page *newpage,
  struct mem_cgroup **memcgp)
 {
struct mem_cgroup *memcg = NULL;
+   unsigned int nr_pages = 1;
struct page_cgroup *pc;
enum charge_type ctype;
 
*memcgp = NULL;
 
-   VM_BUG_ON(PageTransHuge(page));
if (mem_cgroup_disabled())
return;
 
+   if (PageTransHuge(page))
+   nr_pages <<= compound_order(page);
+
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, 
struct page *newpage,
 * charged to the res_counter since we plan on replacing the
 * old one and only one page is going to be left afterwards.
 */
-   __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+   __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
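
For scale, the nr_pages change above makes the charge cover the whole
compound page: assuming x86-64 2M THP, compound_order() of the head page is
9, so one THP accounts for 512 base pages rather than 1, which also matches
the 4k-versus-2M imbalance discussed for the LRU counters elsewhere in this
thread:

#include <stdio.h>

int main(void)
{
	unsigned int nr_pages = 1;
	const unsigned int thp_order = 9;	/* 2M / 4k = 512 base pages */

	nr_pages <<= thp_order;
	printf("base pages charged for one THP: %u\n", nr_pages);	/* 512 */
	return 0;
}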
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-26 Thread Ingo Molnar

* Zhouping Liu  wrote:

> On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> >* Peter Zijlstra  wrote:
> >
> >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> >>>[  180.918591] RIP: 0010:[]  [] 
> >>>mem_cgroup_prepare_migration+0xba/0xd0
> >>>[  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
> >>>[  182.775090]  [] handle_mm_fault+0x1e9/0x360
> >>>[  182.863038]  [] __do_page_fault+0x172/0x4e0
> >>>[  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
> >>>[  183.041512]  [] ? __switch_to+0x3ce/0x4a0
> >>>[  183.126832]  [] ? __schedule+0x3c6/0x7a0
> >>>[  183.211216]  [] do_page_fault+0xe/0x10
> >>>[  183.293705]  [] page_fault+0x28/0x30
> >>Johannes, this looks like the thp migration memcg hookery gone bad,
> >>could you have a look at this?
> >Meanwhile, Zhouping Liu, could you please not apply the last
> >patch:
> >
> >   [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> >
> >and see whether it boots/works without that?
> 
> Hi Ingo,
> 
> Your supposition is right: after reverting the 31st patch (sched, numa,
> mm: Add memcg support to do_huge_pmd_numa_page())
> the issue is gone, thank you.

The tested bits you can find in the numa/core tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/core

It includes all changes (patches #1-#30) except patch #31 - I 
wanted to test and apply that last patch today, but won't do it 
now that you've reported this regression.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-26 Thread Zhouping Liu

On 10/26/2012 05:20 PM, Ingo Molnar wrote:

> * Peter Zijlstra  wrote:
> 
> > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > [  180.918591] RIP: 0010:[]  [] 
> > > mem_cgroup_prepare_migration+0xba/0xd0
> > > [  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
> > > [  182.775090]  [] handle_mm_fault+0x1e9/0x360
> > > [  182.863038]  [] __do_page_fault+0x172/0x4e0
> > > [  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
> > > [  183.041512]  [] ? __switch_to+0x3ce/0x4a0
> > > [  183.126832]  [] ? __schedule+0x3c6/0x7a0
> > > [  183.211216]  [] do_page_fault+0xe/0x10
> > > [  183.293705]  [] page_fault+0x28/0x30
> > 
> > Johannes, this looks like the thp migration memcg hookery gone bad,
> > could you have a look at this?
> 
> Meanwhile, Zhouping Liu, could you please not apply the last
> patch:
> 
>    [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> 
> and see whether it boots/works without that?

Hi Ingo,

your supposition is right: after reverting the 31st patch (sched, numa, mm:
Add memcg support to do_huge_pmd_numa_page()), the issue is gone. Thank you.


Thanks,
Zhouping

> 
> Thanks,
> 
> 	Ingo

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majord...@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: em...@kvack.org


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-26 Thread Zhouping Liu

On 10/26/2012 05:20 PM, Ingo Molnar wrote:

> * Peter Zijlstra  wrote:
> 
> > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > [  180.918591] RIP: 0010:[]  [] 
> > > mem_cgroup_prepare_migration+0xba/0xd0
> > > [  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
> > > [  182.775090]  [] handle_mm_fault+0x1e9/0x360
> > > [  182.863038]  [] __do_page_fault+0x172/0x4e0
> > > [  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
> > > [  183.041512]  [] ? __switch_to+0x3ce/0x4a0
> > > [  183.126832]  [] ? __schedule+0x3c6/0x7a0
> > > [  183.211216]  [] do_page_fault+0xe/0x10
> > > [  183.293705]  [] page_fault+0x28/0x30
> > 
> > Johannes, this looks like the thp migration memcg hookery gone bad,
> > could you have a look at this?
> 
> Meanwhile, Zhouping Liu, could you please not apply the last
> patch:
> 
>    [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> 
> and see whether it boots/works without that?

Ok, I reverted the 31st patch and will provide the results here after I
finish testing.


Thanks,
Zhouping
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-26 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [  180.918591] RIP: 0010:[]  [] 
> > mem_cgroup_prepare_migration+0xba/0xd0
> 
> > [  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
> > [  182.775090]  [] handle_mm_fault+0x1e9/0x360
> > [  182.863038]  [] __do_page_fault+0x172/0x4e0
> > [  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
> > [  183.041512]  [] ? __switch_to+0x3ce/0x4a0
> > [  183.126832]  [] ? __schedule+0x3c6/0x7a0
> > [  183.211216]  [] do_page_fault+0xe/0x10
> > [  183.293705]  [] page_fault+0x28/0x30 
> 
> Johannes, this looks like the thp migration memcg hookery gone bad,
> could you have a look at this?

Meanwhile, Zhouping Liu, could you please not apply the last 
patch:

  [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()

and see whether it boots/works without that?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/31] numa/core patches

2012-10-26 Thread Peter Zijlstra
On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> [  180.918591] RIP: 0010:[]  [] 
> mem_cgroup_prepare_migration+0xba/0xd0

> [  182.681450]  [] do_huge_pmd_numa_page+0x180/0x500
> [  182.775090]  [] handle_mm_fault+0x1e9/0x360
> [  182.863038]  [] __do_page_fault+0x172/0x4e0
> [  182.950574]  [] ? __switch_to_xtra+0x163/0x1a0
> [  183.041512]  [] ? __switch_to+0x3ce/0x4a0
> [  183.126832]  [] ? __schedule+0x3c6/0x7a0
> [  183.211216]  [] do_page_fault+0xe/0x10
> [  183.293705]  [] page_fault+0x28/0x30 

Johannes, this looks like the thp migration memcg hookery gone bad,
could you have a look at this?
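
For context, the fix quoted earlier in this archive removes a
VM_BUG_ON(PageTransHuge(page)) from mem_cgroup_prepare_migration(), which is
consistent with the oops above: patch #31 lets the THP NUMA fault path hand a
huge page to a hook that still asserted it would never see one. A rough
userspace sketch of that shape (function names taken from the trace, bodies
invented for illustration; this is not the actual kernel source):

/*
 * Illustrative sketch only, pieced together from the symbols in the
 * trace above and the fix quoted earlier in this archive.
 */
#include <assert.h>
#include <stdbool.h>

struct page { bool transhuge; };

/* Stand-in for the kernel's VM_BUG_ON(): trips when the condition holds. */
#define VM_BUG_ON(cond) assert(!(cond))

static void mem_cgroup_prepare_migration(struct page *page)
{
	VM_BUG_ON(page->transhuge);	/* removed by the later fix */
	/* ... prepare the memcg charge for the migration target ... */
}

static void do_huge_pmd_numa_page(struct page *huge_page)
{
	/* With patch #31 applied, a THP reaches the hook for the first time. */
	mem_cgroup_prepare_migration(huge_page);
}

int main(void)
{
	struct page thp = { .transhuge = true };

	do_huge_pmd_numa_page(&thp);	/* assertion fires, mimicking the oops */
	return 0;
}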
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

