Re: [PATCH 00/31] numa/core patches
On Sat, Nov 10, 2012 at 10:47:41AM +0800, Alex Shi wrote:
> On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman wrote:
> > On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> > > >
> > > > In reality, this report is larger but I chopped it down a bit for
> > > > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> > > > terms of average operations per numa node and overall throughput.
> > > >
> > > > SPECJBB PEAKS
> > > >                                 3.7.0               3.7.0               3.7.0
> > > >                        rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
> > > > Expctd Warehouse      12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)
> > > > Expctd Peak Bops  442225.00 (  0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
> > > > Actual Warehouse       7.00 (  0.00%)      9.00 ( 28.57%)      8.00 ( 14.29%)
> > > > Actual Peak Bops  550747.00 (  0.00%) 646124.00 ( 17.32%) 560635.00 (  1.80%)
> > >
> > > It is an impressive report!
> > >
> > > Could you share what JVM and options you used in the testing, and on
> > > which kind of platform?
> > >
> >
> > Oracle JVM version "1.7.0_07"
> > Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> > Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
> >
> > 4 JVMs were run, one for each node.
> >
> > JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> > memory overall.
> >
> > Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> > total with HT enabled.
> >
>
> Thanks for sharing the configuration!
>
> I used JRockit and OpenJDK with hugepages, plus pinning each JVM to a
> CPU socket.

If you are using hugepages then automatic NUMA balancing is not migrating
those pages. If you are pinning the JVMs to the socket then automatic NUMA
balancing is unnecessary as they are already on the correct node.

> In a previous sched/numa version I had found a 20% drop with JRockit in
> our configuration, but with this version no clear regression was found,
> and no benefit either.

You are only checking for regressions with your configuration, which is
important because it showed that schednuma introduced only overhead in an
already-optimised NUMA configuration. In your case you will see little or
no benefit from any automatic NUMA balancing implementation, as the most
important pages neither can migrate nor need to.

-- 
Mel Gorman
SUSE Labs
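For context on the pinning discussed above: binding a task and its memory to
one node, which is roughly what pinning a JVM to a socket achieves, can be
done with libnuma. The sketch below is illustrative only, not code from this
thread, and it assumes libnuma and its numa.h header are installed (build
with: cc pin.c -lnuma). Once a workload is confined like this, automatic
NUMA balancing has nothing left to migrate or place.

/* pin.c: run on node 0's CPUs and allocate from node 0's memory only */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t sz = 16 << 20;	/* 16MB scratch buffer */
	char *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not supported on this system\n");
		return 1;
	}

	if (numa_run_on_node(0) < 0) {	/* restrict CPUs to node 0 */
		perror("numa_run_on_node");
		return 1;
	}

	buf = numa_alloc_onnode(sz, 0);	/* restrict memory to node 0 */
	if (!buf)
		return 1;
	memset(buf, 0, sz);	/* touch the pages so they fault in on node 0 */

	printf("task and %zu bytes pinned to node 0\n", sz);
	numa_free(buf, sz);
	return 0;
}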
Re: [PATCH 00/31] numa/core patches
On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman wrote:
> On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> > >
> > > In reality, this report is larger but I chopped it down a bit for
> > > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> > > terms of average operations per numa node and overall throughput.
> > >
> > > SPECJBB PEAKS
> > >                                 3.7.0               3.7.0               3.7.0
> > >                        rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
> > > Expctd Warehouse      12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)
> > > Expctd Peak Bops  442225.00 (  0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
> > > Actual Warehouse       7.00 (  0.00%)      9.00 ( 28.57%)      8.00 ( 14.29%)
> > > Actual Peak Bops  550747.00 (  0.00%) 646124.00 ( 17.32%) 560635.00 (  1.80%)
> >
> > It is an impressive report!
> >
> > Could you share what JVM and options you used in the testing, and on
> > which kind of platform?
> >
>
> Oracle JVM version "1.7.0_07"
> Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>
> 4 JVMs were run, one for each node.
>
> JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> memory overall.
>
> Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> total with HT enabled.
>

Thanks for sharing the configuration!

I used JRockit and OpenJDK with hugepages, plus pinning each JVM to a CPU
socket. In a previous sched/numa version I had found a 20% drop with
JRockit in our configuration, but with this version no clear regression
was found, and no benefit either.

Seems we need to expand the testing configurations. :)

-- 
Thanks
    Alex
Re: [PATCH 00/31] numa/core patches
On 10/30/2012 08:20 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:
> > Hi all,
> >
> > Here's a re-post of the NUMA scheduling and migration improvement
> > patches that we are working on. These include techniques from AutoNUMA
> > and the sched/numa tree and form a unified basis - it has got all the
> > bits that look good and mergeable.
>
> Thanks for the repost. I have not even started a review yet as I was
> travelling and just online today. It will be another day or two before
> I can start but I was at least able to do a comparison test between
> autonuma and schednuma today to see which actually performs the best.
> Even without the review I was able to stick on similar vmstats as was
> applied to autonuma to give a rough estimate of the relative overhead
> of both implementations.

Peter, Ingo, do you have any comments on the performance measurements
by Mel? Any ideas on how to fix sched/numa or numa/core?

At this point, I suspect the easiest way forward might be to merge the
basic infrastructure from Mel's combined tree (in -mm? in -tip?), so we
can experiment with different NUMA placement policies on top.

That way we can do apples to apples comparison of the policies, and
figure out what works best, and why.
Re: [PATCH 00/31] numa/core patches
Hey Peter,

Here are results on 2 node and 8 node machines while running the
autonuma benchmark.

On 2 node, 12 core, 24GB:

KernelVersion: 3.7.0-rc3
Testcase:                             Min      Max      Avg
numa01:                            121.23   122.43   121.53
numa01_HARD_BIND:                   80.90    81.07    80.96
numa01_INVERSE_BIND:               145.91   146.06   145.97
numa01_THREAD_ALLOC:               395.81   398.30   397.47
numa01_THREAD_ALLOC_HARD_BIND:     264.09   264.27   264.18
numa01_THREAD_ALLOC_INVERSE_BIND:  476.36   476.65   476.53
numa02:                             53.11    53.19    53.15
numa02_HARD_BIND:                   35.20    35.29    35.25
numa02_INVERSE_BIND:                63.52    63.55    63.54
numa02_SMT:                         60.28    62.00    61.33
numa02_SMT_HARD_BIND:               42.63    43.61    43.22
numa02_SMT_INVERSE_BIND:            76.27    78.06    77.31

KernelVersion: numasched (i.e. 3.7.0-rc3 + your patches)
Testcase:                             Min      Max      Avg  %Change
numa01:                            121.28   121.71   121.47    0.05%
numa01_HARD_BIND:                   80.89    81.01    80.96    0.00%
numa01_INVERSE_BIND:               145.87   146.04   145.96    0.01%
numa01_THREAD_ALLOC:               398.07   400.27   398.90   -0.36%
numa01_THREAD_ALLOC_HARD_BIND:     264.02   264.21   264.14    0.02%
numa01_THREAD_ALLOC_INVERSE_BIND:  476.13   476.62   476.41    0.03%
numa02:                             52.97    53.25    53.13    0.04%
numa02_HARD_BIND:                   35.21    35.28    35.24    0.03%
numa02_INVERSE_BIND:                63.51    63.54    63.53    0.02%
numa02_SMT:                         61.35    62.46    61.97   -1.03%
numa02_SMT_HARD_BIND:               42.89    43.85    43.22    0.00%
numa02_SMT_INVERSE_BIND:            76.53    77.68    77.08    0.30%

KernelVersion: 3.7.0-rc3 (with HT enabled)
Testcase:                             Min      Max      Avg
numa01:                            242.58   244.39   243.68
numa01_HARD_BIND:                  169.36   169.40   169.38
numa01_INVERSE_BIND:               299.69   299.73   299.71
numa01_THREAD_ALLOC:               399.86   404.10   401.50
numa01_THREAD_ALLOC_HARD_BIND:     278.72   278.77   278.75
numa01_THREAD_ALLOC_INVERSE_BIND:  493.46   493.59   493.54
numa02:                             53.00    53.33    53.19
numa02_HARD_BIND:                   36.77    36.88    36.82
numa02_INVERSE_BIND:                66.07    66.10    66.09
numa02_SMT:                         53.23    53.51    53.35
numa02_SMT_HARD_BIND:               35.19    35.27    35.24
numa02_SMT_INVERSE_BIND:            63.50    63.54    63.52

KernelVersion: numasched (i.e. 3.7.0-rc3 + your patches, with HT enabled)
Testcase:                             Min      Max      Avg  %Change
numa01:                            242.68   244.59   243.53    0.06%
numa01_HARD_BIND:                  169.37   169.42   169.40   -0.01%
numa01_INVERSE_BIND:               299.83   299.96   299.91   -0.07%
numa01_THREAD_ALLOC:               399.53   403.13   401.62   -0.03%
numa01_THREAD_ALLOC_HARD_BIND:     278.78   278.80   278.79   -0.01%
numa01_THREAD_ALLOC_INVERSE_BIND:  493.63   493.90   493.78   -0.05%
numa02:                             53.06    53.42    53.22   -0.06%
numa02_HARD_BIND:                   36.78    36.87    36.82    0.00%
numa02_INVERSE_BIND:                66.09    66.10    66.10   -0.02%
numa02_SMT:                         53.34    53.55    53.42   -0.13%
numa02_SMT_HARD_BIND:               35.22    35.29    35.25   -0.03%
numa02_SMT_INVERSE_BIND:            63.50    63.58    63.53   -0.02%

On 8 node, 64 core, 320 GB:

KernelVersion: 3.7.0-rc3
Testcase:                             Min      Max      Avg
numa01:                           1550.56  1596.03  1574.24
numa01_HARD_BIND:                  915.25  2540.64  1392.42
numa01_INVERSE_BIND:              2964.66  3716.33  3149.10
numa01_THREAD_ALLOC:               922.99  1003.31   972.99
numa01_THREAD_ALLOC_HARD_BIND:     579.54  1266.65   896.75
numa01_THREAD_ALLOC_INVERSE_BIND: 1794.51  2057.16  1922.86
numa02:                            126.22   133.01   130.91
numa02_HARD_BIND:                   25.85    26.25    26.06
numa02_INVERSE_BIND:               341.38   350.35   345.82
numa02_SMT:                        153.06   175.41   163.47
numa02_SMT_HARD_BIND:               27.10   212.39   114.37
numa02_SMT_INVERSE_BIND:           285.70  1542.83   540.62

KernelVersion: numasched()
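The %Change column above is not defined in the mail; assuming it is
(baseline avg - numasched avg) / baseline avg * 100, so that a positive
value means the patched kernel ran faster, the published figures
reproduce. A quick check in C:

#include <stdio.h>

/* assumed definition of the %Change column */
static double pct_change(double base_avg, double patched_avg)
{
	return (base_avg - patched_avg) / base_avg * 100.0;
}

int main(void)
{
	/* 2-node averages from the tables above */
	printf("numa01:              %+.2f%%\n", pct_change(121.53, 121.47)); /* +0.05% */
	printf("numa01_THREAD_ALLOC: %+.2f%%\n", pct_change(397.47, 398.90)); /* -0.36% */
	return 0;
}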
Re: [PATCH 00/31] numa/core patches
On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> >
> > In reality, this report is larger but I chopped it down a bit for
> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> > terms of average operations per numa node and overall throughput.
> >
> > SPECJBB PEAKS
> >                                 3.7.0               3.7.0               3.7.0
> >                        rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
> > Expctd Warehouse      12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)
> > Expctd Peak Bops  442225.00 (  0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
> > Actual Warehouse       7.00 (  0.00%)      9.00 ( 28.57%)      8.00 ( 14.29%)
> > Actual Peak Bops  550747.00 (  0.00%) 646124.00 ( 17.32%) 560635.00 (  1.80%)
>
> It is an impressive report!
>
> Could you share what JVM and options you used in the testing, and on
> which kind of platform?
>

Oracle JVM version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

4 JVMs were run, one for each node.

JVM switch specified was -Xmx12901m so it would consume roughly 80% of
memory overall.

Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
total with HT enabled.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 00/31] numa/core patches
>
> In reality, this report is larger but I chopped it down a bit for
> brevity. autonuma beats schednuma *heavily* on this benchmark both in
> terms of average operations per numa node and overall throughput.
>
> SPECJBB PEAKS
>                                 3.7.0               3.7.0               3.7.0
>                        rc2-stats-v2r1  rc2-autonuma-v27r8  rc2-schednuma-v1r3
> Expctd Warehouse      12.00 (  0.00%)     12.00 (  0.00%)     12.00 (  0.00%)
> Expctd Peak Bops  442225.00 (  0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
> Actual Warehouse       7.00 (  0.00%)      9.00 ( 28.57%)      8.00 ( 14.29%)
> Actual Peak Bops  550747.00 (  0.00%) 646124.00 ( 17.32%) 560635.00 (  1.80%)

It is an impressive report!

Could you share what JVM and options you used in the testing, and on
which kind of platform?

-- 
Thanks
    Alex
Re: [PATCH 00/31] numa/core patches
On Fri, 2 Nov 2012, Zhouping Liu wrote:
> On 11/01/2012 09:41 PM, Hugh Dickins wrote:
> >
> > Here's a patch fixing and tidying up that and a few other things there.
> > But I'm not signing it off yet, partly because I've barely tested it
> > (quite probably I didn't even have any numa pmd migration happening
> > at all), and partly because just a moment ago I ran across this
> > instructive comment in __collapse_huge_page_isolate():
> > 	/* cannot use mapcount: can't collapse if there's a gup pin */
> > 	if (page_count(page) != 1) {
> >
> > Hmm, yes, below I've added the page_mapcount() check I proposed to
> > do_huge_pmd_numa_page(), but is even that safe enough? Do we actually
> > need a page_count() check (for 2?) to guard against get_user_pages()?
> > I suspect we do, but then do we have enough locking to stabilize such
> > a check? Probably, but...
> >
> > This will take more time, and I doubt get_user_pages() is an issue in
> > your testing, so please would you try the patch below, to see if it
> > does fix the BUGs you are seeing? Thanks a lot.
>
> Hugh, I have tested the patch for 5 more hours, the issue can't be
> reproduced again, so I think it has fixed the issue, thank you :)

Thanks a lot for testing and reporting back, that's good news.

However, I've meanwhile become convinced that more fixes are needed here,
to be safe against get_user_pages() (including get_user_pages_fast());
to get the Mlocked count right; and to recover correctly when !pmd_same
with an Unevictable page.

Won't now have time to update the patch today, but these additional
fixes shouldn't hold up your testing.

Hugh
Re: [PATCH 00/31] numa/core patches
On 11/01/2012 09:41 PM, Hugh Dickins wrote:
> On Wed, 31 Oct 2012, Hugh Dickins wrote:
> > On Wed, 31 Oct 2012, Zhouping Liu wrote:
> > > On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > > >
> > > > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > > > would help if we could focus on the one which is giving the trouble,
> > > > but I don't know which that is. Zhouping, if you can, please would
> > > > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > > > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > > > is the next function, and post or mail privately just that disassembly.
> > > > That should be good to identify which of the put_page()s is involved.
> > >
> > > Hugh, I didn't find the next function, as I can't find any words that
> > > matched "do_huge_pmd_numa_page". Is there any other method?
> >
> > Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
> > unless I've made a typo but am blind to it.
> >
> > Were you applying objdump to the vmlinux which gave you the
> > BUG at mm/memcontrol.c:1134! ?
>
> Thanks for the further info you then sent privately: I have not made
> any more effort to reproduce the issue, but your objdump did tell me
> that the put_page hitting the problem is the one on line 872 of
> mm/huge_memory.c, "Drop the local reference", just before successful
> return after migration.
>
> I didn't really get the inspiration I'd hoped for out of knowing that,
> but it did make me wonder whether you're suffering from one of the
> issues I already mentioned, and I can now see a way in which it might
> cause the mm/memcontrol.c:1134 BUG:
>
> migrate_page_copy() does TestClearPageActive on the source page: so
> given the unsafe way in which do_huge_pmd_numa_page() was proceeding
> with a !PageLRU page, it's quite possible that the page was sitting in
> a pagevec, and added to the active lru (so added to the lru_size of
> the active lru), but our final put_page removes it from lru, active
> flag has been cleared, so we subtract it from the lru_size of the
> inactive lru - that could indeed make it go negative and trigger the
> BUG.
>
> Here's a patch fixing and tidying up that and a few other things there.
> But I'm not signing it off yet, partly because I've barely tested it
> (quite probably I didn't even have any numa pmd migration happening
> at all), and partly because just a moment ago I ran across this
> instructive comment in __collapse_huge_page_isolate():
> 	/* cannot use mapcount: can't collapse if there's a gup pin */
> 	if (page_count(page) != 1) {
>
> Hmm, yes, below I've added the page_mapcount() check I proposed to
> do_huge_pmd_numa_page(), but is even that safe enough? Do we actually
> need a page_count() check (for 2?) to guard against get_user_pages()?
> I suspect we do, but then do we have enough locking to stabilize such
> a check? Probably, but...
>
> This will take more time, and I doubt get_user_pages() is an issue in
> your testing, so please would you try the patch below, to see if it
> does fix the BUGs you are seeing? Thanks a lot.

Hugh, I have tested the patch for 5 more hours, the issue can't be
reproduced again, so I think it has fixed the issue, thank you :)

Zhouping

> Not-Yet-Signed-off-by: Hugh Dickins
> ---
>  mm/huge_memory.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)
>
> --- 3.7-rc2+schednuma+johannes/mm/huge_memory.c  2012-11-01 04:10:43.812155671 -0700
> +++ linux/mm/huge_memory.c  2012-11-01 05:52:19.512153771 -0700
> @@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
>  	struct mem_cgroup *memcg = NULL;
>  	struct page *new_page = NULL;
>  	struct page *page = NULL;
> -	int node, lru;
> +	int node = -1;
>
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_same(*pmd, entry)))
> @@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
>  		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>
>  		get_page(page);
> -		node = mpol_misplaced(page, vma, haddr);
> +		if (page_mapcount(page) == 1)	/* Only do exclusively mapped */
> +			node = mpol_misplaced(page, vma, haddr);
>  		if (node != -1)
>  			goto migrate;
>  	}
> @@ -801,13 +802,11 @@ migrate:
>  	if (!new_page)
>  		goto alloc_fail;
>
> -	lru = PageLRU(page);
> -
> -	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> +	if (isolate_lru_page(page))	/* Does an implicit get_page() */
>  		goto alloc_fail;
>
> -	if (!trylock_page(new_page))
> -		BUG();
> +	__set_page_locked(new_page);
> +	SetPageSwapBacked(new_page);
>
>  	/* anon mapping, we can simply copy page->mapping to the new page: */
>  	new_page->mapping = page->mapping;
> @@ -820,8 +819,6 @@ migrate:
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_same(*pmd, entry))) {
>  		spin_unlock(&mm->page_table_lock);
> -		if (lru)
> -			putback_lru_page(page);
>
>  		unlock_page(new_page);
>  		ClearPageActive(new_page);	/* Set by
Re: [PATCH 00/31] numa/core patches
On Wed, 31 Oct 2012, Hugh Dickins wrote:
> On Wed, 31 Oct 2012, Zhouping Liu wrote:
> > On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > >
> > > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > > would help if we could focus on the one which is giving the trouble,
> > > but I don't know which that is. Zhouping, if you can, please would
> > > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > > is the next function, and post or mail privately just that disassembly.
> > > That should be good to identify which of the put_page()s is involved.
> >
> > Hugh, I didn't find the next function, as I can't find any words that
> > matched "do_huge_pmd_numa_page". Is there any other method?
>
> Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
> unless I've made a typo but am blind to it.
>
> Were you applying objdump to the vmlinux which gave you the
> BUG at mm/memcontrol.c:1134! ?

Thanks for the further info you then sent privately: I have not made
any more effort to reproduce the issue, but your objdump did tell me
that the put_page hitting the problem is the one on line 872 of
mm/huge_memory.c, "Drop the local reference", just before successful
return after migration.

I didn't really get the inspiration I'd hoped for out of knowing that,
but it did make me wonder whether you're suffering from one of the
issues I already mentioned, and I can now see a way in which it might
cause the mm/memcontrol.c:1134 BUG:

migrate_page_copy() does TestClearPageActive on the source page: so
given the unsafe way in which do_huge_pmd_numa_page() was proceeding
with a !PageLRU page, it's quite possible that the page was sitting in
a pagevec, and added to the active lru (so added to the lru_size of
the active lru), but our final put_page removes it from lru, active
flag has been cleared, so we subtract it from the lru_size of the
inactive lru - that could indeed make it go negative and trigger the
BUG.

Here's a patch fixing and tidying up that and a few other things there.
But I'm not signing it off yet, partly because I've barely tested it
(quite probably I didn't even have any numa pmd migration happening
at all), and partly because just a moment ago I ran across this
instructive comment in __collapse_huge_page_isolate():
	/* cannot use mapcount: can't collapse if there's a gup pin */
	if (page_count(page) != 1) {

Hmm, yes, below I've added the page_mapcount() check I proposed to
do_huge_pmd_numa_page(), but is even that safe enough? Do we actually
need a page_count() check (for 2?) to guard against get_user_pages()?
I suspect we do, but then do we have enough locking to stabilize such
a check? Probably, but...

This will take more time, and I doubt get_user_pages() is an issue in
your testing, so please would you try the patch below, to see if it
does fix the BUGs you are seeing? Thanks a lot.

Not-Yet-Signed-off-by: Hugh Dickins
---
 mm/huge_memory.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

--- 3.7-rc2+schednuma+johannes/mm/huge_memory.c  2012-11-01 04:10:43.812155671 -0700
+++ linux/mm/huge_memory.c  2012-11-01 05:52:19.512153771 -0700
@@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
 	struct mem_cgroup *memcg = NULL;
 	struct page *new_page = NULL;
 	struct page *page = NULL;
-	int node, lru;
+	int node = -1;

 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry)))
@@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
 		VM_BUG_ON(!PageCompound(page) || !PageHead(page));

 		get_page(page);
-		node = mpol_misplaced(page, vma, haddr);
+		if (page_mapcount(page) == 1)	/* Only do exclusively mapped */
+			node = mpol_misplaced(page, vma, haddr);
 		if (node != -1)
 			goto migrate;
 	}
@@ -801,13 +802,11 @@ migrate:
 	if (!new_page)
 		goto alloc_fail;

-	lru = PageLRU(page);
-
-	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+	if (isolate_lru_page(page))	/* Does an implicit get_page() */
 		goto alloc_fail;

-	if (!trylock_page(new_page))
-		BUG();
+	__set_page_locked(new_page);
+	SetPageSwapBacked(new_page);

 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
@@ -820,8 +819,6 @@ migrate:
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry))) {
 		spin_unlock(&mm->page_table_lock);
-		if (lru)
-			putback_lru_page(page);

 		unlock_page(new_page);
 		ClearPageActive(new_page);	/* Set by migrate_page_copy() */
@@ -829,6 +826,7 @@ migrate:
 		put_page(new_page);		/* Free
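Hugh's accounting scenario can be made concrete with a standalone toy
model; the code below is plain userspace C with invented constants, not
kernel code. A page counted onto the active lru but subtracted from the
inactive lru (because migrate_page_copy() cleared PageActive in between)
leaves the inactive lru_size negative, the kind of underflow a sanity
check such as the one behind the mm/memcontrol.c:1134 BUG would catch:

#include <assert.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512		/* one 2M huge page = 512 4k pages on x86-64 */

enum { LRU_INACTIVE, LRU_ACTIVE };

static long lru_size[2];		/* toy stand-in for per-lru page counts */

static void update_lru_size(int lru, long nr_pages)
{
	lru_size[lru] += nr_pages;
	assert(lru_size[lru] >= 0);	/* toy version of the check that fires */
}

int main(void)
{
	/* the pagevec drains the page onto the *active* lru... */
	update_lru_size(LRU_ACTIVE, 1);
	/* ...but with PageActive cleared, the final put_page subtracts
	 * the whole THP from the *inactive* lru: negative, assert fires */
	update_lru_size(LRU_INACTIVE, -HPAGE_PMD_NR);
	printf("never reached\n");
	return 0;
}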
Re: [PATCH 00/31] numa/core patches
On Wed, 31 Oct 2012, Zhouping Liu wrote:
> On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> >
> > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > would help if we could focus on the one which is giving the trouble,
> > but I don't know which that is. Zhouping, if you can, please would
> > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > is the next function, and post or mail privately just that disassembly.
> > That should be good to identify which of the put_page()s is involved.
>
> Hugh, I didn't find the next function, as I can't find any words that
> matched "do_huge_pmd_numa_page". Is there any other method?

Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
unless I've made a typo but am blind to it.

Were you applying objdump to the vmlinux which gave you the
BUG at mm/memcontrol.c:1134! ?

Maybe just do "objdump -ld mm/huge_memory.o >notsobigfile" and
mail me an attachment of the notsobigfile.

I did try building your config here last night, but ran out of disk
space on this partition, and it was already clear that my gcc version
differs from yours, so not quite matching.

> Also, I tried to use kdump to dump the vmcore file, but unluckily kdump
> didn't work well. If you think it useful to dump the vmcore file, I can
> try it again and provide more info.

It would take me a while to get up to speed on using that, I'd prefer
to start with just the objdump of huge_memory.o.

I forgot last night to say that I did try stress (but not on a kernel
of your config), but didn't see the BUG: I expect there are too many
differences in our environments, and I'd have to tweak things one way
or another to get it to happen - probably a waste of time.

Thanks,
Hugh
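The extraction Hugh asks for can also be scripted. Below is a hypothetical
little filter, a sketch rather than anything from the thread (an awk or
sed one-liner would serve equally well): it copies objdump output from the
"<do_huge_pmd_numa_page>:" label up to the next function label. Run it as:
objdump -ld vmlinux | ./extract > notsobigfile

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	int in_func = 0;

	while (fgets(line, sizeof(line), stdin)) {
		if (!in_func) {
			/* wait for the target function's label line */
			if (strstr(line, "<do_huge_pmd_numa_page>:"))
				in_func = 1;
			else
				continue;
		} else if (strstr(line, ">:")) {
			break;	/* next function's label: stop copying */
		}
		fputs(line, stdout);
	}
	return 0;
}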
Re: [PATCH 00/31] numa/core patches
On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> On Tue, 30 Oct 2012, Johannes Weiner wrote:
> > > [88099.923724] ------------[ cut here ]------------
> > > [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> > > [88099.924036] invalid opcode: [#1] SMP
> > > [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> > > amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> > > joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> > > megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> > > drm_kms_helper ttm drm i2c_core
> > > [88099.924036] CPU 7
> > > [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> > > Dell Inc. PowerEdge 6950/0WN213
> > > [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> > > mem_cgroup_update_lru_size+0x27/0x30
> >
> > Thanks a lot for your testing efforts, I really appreciate it.
> >
> > I'm looking into it, but I don't expect power to get back for several
> > days where I live, so it's hard to reproduce it locally.
> >
> > But that looks like an LRU accounting imbalance that I wasn't able to
> > tie to this patch yet. Do you see weird numbers for the lru counters
> > in /proc/vmstat even without this memory cgroup patch? Ccing Hugh as
> > well.
>
> Sorry, I didn't get very far with it tonight. Almost certain to be a
> page which was added to lru while it looked like a 4k page, but taken
> off lru as a 2M page: we are taking a 2M page off lru here, it's likely
> to be the page in question, but not necessarily.
>
> There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> would help if we could focus on the one which is giving the trouble,
> but I don't know which that is. Zhouping, if you can, please would
> you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> is the next function, and post or mail privately just that disassembly.
> That should be good to identify which of the put_page()s is involved.

Hugh, I didn't find the next function, as I can't find any words that
matched "do_huge_pmd_numa_page". Is there any other method?

Also, I tried to use kdump to dump the vmcore file, but unluckily kdump
didn't work well. If you think it useful to dump the vmcore file, I can
try it again and provide more info.

Thanks,
Zhouping
Re: [PATCH 00/31] numa/core patches
On Tue, 30 Oct 2012, Johannes Weiner wrote:
> On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> > On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> > > On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> > > > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > > > [  180.918591] RIP: 0010:[<ffffffff8118c39a>]  [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> > > > > [  182.681450]  [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > > > > [  182.775090]  [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > > > > [  182.863038]  [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > > > > [  182.950574]  [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > > > > [  183.041512]  [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > > > > [  183.126832]  [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > > > > [  183.211216]  [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > > > > [  183.293705]  [<ffffffff8162f518>] page_fault+0x28/0x30
> > > >
> > > > Johannes, this looks like the thp migration memcg hookery gone bad,
> > > > could you have a look at this?
> > >
> > > Oops. Here is an incremental fix, feel free to fold it into #31.
> >
> > Hello Johannes,
> >
> > maybe the below patch doesn't completely fix this issue, as I found a
> > new error (maybe similar to this):
> >
> > [88099.923724] ------------[ cut here ]------------
> > [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> > [88099.924036] invalid opcode: [#1] SMP
> > [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> > amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> > joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> > megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> > drm_kms_helper ttm drm i2c_core
> > [88099.924036] CPU 7
> > [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> > Dell Inc. PowerEdge 6950/0WN213
> > [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> > mem_cgroup_update_lru_size+0x27/0x30
>
> Thanks a lot for your testing efforts, I really appreciate it.
>
> I'm looking into it, but I don't expect power to get back for several
> days where I live, so it's hard to reproduce it locally.
>
> But that looks like an LRU accounting imbalance that I wasn't able to
> tie to this patch yet. Do you see weird numbers for the lru counters
> in /proc/vmstat even without this memory cgroup patch? Ccing Hugh as
> well.

Sorry, I didn't get very far with it tonight. Almost certain to be a
page which was added to lru while it looked like a 4k page, but taken
off lru as a 2M page: we are taking a 2M page off lru here, it's likely
to be the page in question, but not necessarily.

There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
would help if we could focus on the one which is giving the trouble,
but I don't know which that is. Zhouping, if you can, please would
you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
is the next function, and post or mail privately just that disassembly.
That should be good to identify which of the put_page()s is involved.

do_huge_pmd_numa_page() does look a bit worrying, but I've not pinned
the misaccounting seen to the aspects which have worried me so far.
Where is a check for page_mapcount(page) being 1? And surely it's
unsafe to be migrating the page when it was found !PageLRU? It's quite
likely to be sitting in a pagevec or on a local list somewhere, about
to be added to lru at any moment.

Hugh
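A toy model of the distinction Hugh is drawing, in plain C with invented
numbers (not kernel code): page_mapcount() counts only page-table
mappings, while a get_user_pages() pin raises page_count() alone, so a
mapcount-only test can declare a gup-pinned page "exclusively mapped"
when it is not safe to treat it that way:

#include <stdbool.h>
#include <stdio.h>

struct toy_page {
	int count;		/* all references, including gup pins */
	int mapcount;		/* page-table mappings only */
};

static bool looks_exclusive(const struct toy_page *p)
{
	return p->mapcount == 1;	/* the mapcount-only test */
}

static bool really_exclusive(const struct toy_page *p)
{
	/* stricter, in the spirit of __collapse_huge_page_isolate():
	 * one mapping plus the one local reference we hold, nothing else */
	return p->mapcount == 1 && p->count == 2;
}

int main(void)
{
	struct toy_page plain  = { .count = 2, .mapcount = 1 };
	struct toy_page pinned = { .count = 3, .mapcount = 1 };	/* +1 gup pin */

	printf("plain:  looks %d, really %d\n",
	       looks_exclusive(&plain), really_exclusive(&plain));	/* 1 1 */
	printf("pinned: looks %d, really %d\n",
	       looks_exclusive(&pinned), really_exclusive(&pinned));	/* 1 0 */
	return 0;
}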
Re: [PATCH 00/31] numa/core patches
On Tue, 30 Oct 2012, Johannes Weiner wrote: On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote: On 10/29/2012 01:56 AM, Johannes Weiner wrote: On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote: On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote: [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0 [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500 [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360 [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0 [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0 [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0 [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0 [ 183.211216] [81632ede] do_page_fault+0xe/0x10 [ 183.293705] [8162f518] page_fault+0x28/0x30 Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this? Oops. Here is an incremental fix, feel free to fold it into #31. Hello Johannes, maybe I don't think the below patch completely fix this issue, as I found a new error(maybe similar with this): [88099.923724] [ cut here ] [88099.924036] kernel BUG at mm/memcontrol.c:1134! [88099.924036] invalid opcode: [#1] SMP [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [88099.924036] CPU 7 [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3 Dell Inc. PowerEdge 6950/0WN213 [88099.924036] RIP: 0010:[81188e97] [81188e97] mem_cgroup_update_lru_size+0x27/0x30 Thanks a lot for your testing efforts, I really appreciate it. I'm looking into it, but I don't expect power to get back for several days where I live, so it's hard to reproduce it locally. But that looks like an LRU accounting imbalance that I wasn't able to tie to this patch yet. Do you see weird numbers for the lru counters in /proc/vmstat even without this memory cgroup patch? Ccing Hugh as well. Sorry, I didn't get very far with it tonight. Almost certain to be a page which was added to lru while it looked like a 4k page, but taken off lru as a 2M page: we are taking a 2M page off lru here, it's likely to be the page in question, but not necessarily. There's quite a few put_page()s in do_huge_pmd_numa_page(), and it would help if we could focus on the one which is giving the trouble, but I don't know which that is. Zhouping, if you can, please would you do an objdump -ld vmlinux bigfile of your kernel, then extract from bigfile just the lines from do_huge_pmd_numa_page: to whatever is the next function, and post or mail privately just that disassembly. That should be good to identify which of the put_page()s is involved. do_huge_pmd_numa_page() does look a bit worrying, but I've not pinned the misaccounting seen to the aspects which have worried me so far. Where is a check for page_mapcount(page) being 1? And surely it's unsafe to to be migrating the page when it was found !PageLRU? It's quite likely to be sitting in a pagevec or on a local list somewhere, about to be added to lru at any moment. Hugh -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/31] numa/core patches
On 10/31/2012 03:26 PM, Hugh Dickins wrote: On Tue, 30 Oct 2012, Johannes Weiner wrote: [88099.923724] [ cut here ] [88099.924036] kernel BUG at mm/memcontrol.c:1134! [88099.924036] invalid opcode: [#1] SMP [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [88099.924036] CPU 7 [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3 Dell Inc. PowerEdge 6950/0WN213 [88099.924036] RIP: 0010:[81188e97] [81188e97] mem_cgroup_update_lru_size+0x27/0x30 Thanks a lot for your testing efforts, I really appreciate it. I'm looking into it, but I don't expect power to get back for several days where I live, so it's hard to reproduce it locally. But that looks like an LRU accounting imbalance that I wasn't able to tie to this patch yet. Do you see weird numbers for the lru counters in /proc/vmstat even without this memory cgroup patch? Ccing Hugh as well. Sorry, I didn't get very far with it tonight. Almost certain to be a page which was added to lru while it looked like a 4k page, but taken off lru as a 2M page: we are taking a 2M page off lru here, it's likely to be the page in question, but not necessarily. There's quite a few put_page()s in do_huge_pmd_numa_page(), and it would help if we could focus on the one which is giving the trouble, but I don't know which that is. Zhouping, if you can, please would you do an objdump -ld vmlinux bigfile of your kernel, then extract from bigfile just the lines from do_huge_pmd_numa_page: to whatever is the next function, and post or mail privately just that disassembly. That should be good to identify which of the put_page()s is involved. Hugh, I didn't find the next function, as I can't find any words that matched do_huge_pmd_numa_page. is there any other methods? also I tried to use kdump to dump vmcore file, but unluckily kdump didn't work well, if you think it useful to dump vmcore file, I can try it again and provide more info. Thanks, Zhouping -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/31] numa/core patches
On Wed, 31 Oct 2012, Zhouping Liu wrote:
> On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > There are quite a few put_page()s in do_huge_pmd_numa_page(), and it would help if we could focus on the one which is giving the trouble, but I don't know which that is.  Zhouping, if you can, please would you do an objdump -ld vmlinux >bigfile of your kernel, then extract from bigfile just the lines from do_huge_pmd_numa_page: to whatever is the next function, and post or mail privately just that disassembly.  That should be good to identify which of the put_page()s is involved.
>
> Hugh, I didn't find the next function, as I can't find any words that match do_huge_pmd_numa_page - is there any other method?

Hmm, do_huge_pmd_numa_page does appear in your stacktrace, unless I've made a typo but am blind to it.  Were you applying objdump to the vmlinux which gave you the BUG at mm/memcontrol.c:1134?

Maybe just do objdump -ld mm/huge_memory.o >notsobigfile and mail me an attachment of the notsobigfile.  I did try building your config here last night, but ran out of disk space on this partition, and it was already clear that my gcc version differs from yours, so not quite matching.

> Also, I tried to use kdump to dump a vmcore file, but unluckily kdump didn't work well.  If you think it useful to dump the vmcore file, I can try it again and provide more info.

It would take me a while to get up to speed on using that; I'd prefer to start with just the objdump of huge_memory.o.

I forgot last night to say that I did try stress (but not on a kernel of your config), but didn't see the BUG: I expect there are too many differences in our environments, and I'd have to tweak things one way or another to get it to happen - probably a waste of time.

Thanks,
Hugh
Re: [PATCH 00/31] numa/core patches
On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> > On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> > > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > > > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > > > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > > > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > > > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > > > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > > > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > > > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > > > [ 183.293705] [8162f518] page_fault+0x28/0x30
> > > Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?
> > Oops.  Here is an incremental fix, feel free to fold it into #31.
>
> Hello Johannes,
>
> I don't think the below patch completely fixes this issue, as I found a new error (maybe similar to this):
>
> [88099.923724] [ cut here ]
> [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> [88099.924036] invalid opcode: [#1] SMP
> [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
> [88099.924036] CPU 7
> [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3 Dell Inc. PowerEdge 6950/0WN213
> [88099.924036] RIP: 0010:[81188e97] [81188e97] mem_cgroup_update_lru_size+0x27/0x30

Thanks a lot for your testing efforts, I really appreciate it.

I'm looking into it, but I don't expect power to get back for several days where I live, so it's hard to reproduce it locally.

But that looks like an LRU accounting imbalance that I wasn't able to tie to this patch yet.  Do you see weird numbers for the lru counters in /proc/vmstat even without this memory cgroup patch?

Ccing Hugh as well.

Thanks,
Johannes
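The lru counters Johannes is asking about are the nr_active_*/nr_inactive_* lines of /proc/vmstat.  A trivial watcher for them - an illustrative sketch, not part of the thread:

    #include <stdio.h>
    #include <string.h>

    /* Print the global LRU accounting counters; a wildly skewed value,
     * or one that has wrapped to a huge number, would point at the
     * imbalance Johannes suspects. */
    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f) {
                    perror("/proc/vmstat");
                    return 1;
            }
            while (fgets(line, sizeof(line), f)) {
                    if (!strncmp(line, "nr_active_", 10) ||
                        !strncmp(line, "nr_inactive_", 12) ||
                        !strncmp(line, "nr_unevictable", 14))
                            fputs(line, stdout);
            }
            fclose(f);
            return 0;
    }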
Re: [PATCH 00/31] numa/core patches
On Tue, Oct 30, 2012 at 08:28:10AM -0700, Andrew Morton wrote:
> On Tue, 30 Oct 2012 12:20:32 +0000 Mel Gorman <mgor...@suse.de> wrote:
>
> > ...
>
> Useful testing - thanks.  Did I miss the description of what autonumabench actually does?  How representative is it of real-world things?

It's not representative of anything at all.  It's a synthetic benchmark that just measures whether automatic NUMA migration (whatever the mechanism) is working as expected.

I'm not aware of a decent description of what the test does and why.  Here is my current interpretation, and hopefully Andrea will correct me if I'm wrong.

NUMA01
  Two processes
  NUM_CPUS/2 number of threads so all CPUs are in use

  On startup, the process forks.  Each process mallocs a 3G buffer but there is no communication between the processes.  Threads are created that zero out the full buffer 1000 times.

  The objective of the test is that initially the two processes allocate their memory on the same node.  As the threads are created, the memory will migrate from the initial node to nodes that are closer to the referencing thread.  It is worth noting that this benchmark is specifically tuned for two nodes: the expectation is that the two processes and their threads split so that all threads of process A run on node 0 and all threads of process B run on node 1.  (A minimal sketch of this access pattern follows after this mail.)

  With 4 or more nodes, this is actually an adverse workload.  As the whole buffer is zeroed in both processes, the expectation is that it will continually bounce between two nodes.

  So, on 2 nodes, this benchmark tests convergence.  On 4 or more nodes, this partially measures how much busy work automatic NUMA migration does, and it'll be very noisy due to cache conflicts.

NUMA01_THREADLOCAL
  Two processes
  NUM_CPUS/2 number of threads so all CPUs are in use

  On startup, the process forks.  Each process mallocs a 3G buffer but there is no communication between the processes.  Threads are created that zero out their own subset of the buffer; each thread's portion is 3G/NR_THREADS in size.

  This benchmark is more realistic.  In an ideal situation, each thread will migrate its data to its local node.  The test really is to see whether it converges and how quickly.

NUMA02
  One process, NR_CPU threads

  On startup, malloc a 1G buffer.  Create threads that zero out a thread-local portion of the buffer, multiple times - the number of iterations is fixed and seems to just be there to take a period of time.

  This is similar in principle to NUMA01_THREADLOCAL except that only one process is involved.  I think it was aimed at being more JVM-like.

NUMA02_SMT
  One process, NR_CPU/2 threads

  This is a variation of NUMA02, except that with half the cores idle it is checking whether the system migrates the memory to two or more nodes, or whether it tries to fit everything on one node even though the memory should migrate to be close to the CPU.

> > I also expect autonuma is continually scanning whereas schednuma is reacting to some other external event or at least less frequently scanning.
>
> Might this imply that autonuma is consuming more CPU in kernel threads, the cost of which didn't get included in these results?

It might, but according to top, knuma_scand only used 7.86 seconds of CPU time during the whole test, and the time used by the migration threads is also very low.  Most migration threads used less than 1 second of CPU time.  Two migration threads used 2 seconds of CPU time each, but that still seems low.
-- 
Mel Gorman
SUSE Labs
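A minimal sketch of the NUMA01-style workload Mel describes above.  This is illustrative only: the thread count, buffer size and loop count follow his description rather than the actual autonumabench source (shrink BUF_SIZE and LOOPS if you just want to watch it run; build with gcc -O2 -pthread):

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE (3UL << 30)   /* 3G buffer per process, as described */
    #define LOOPS    1000

    static char *buf;

    /* NUMA01 pattern: every thread zeroes the whole shared buffer.
     * The THREADLOCAL variant would instead give each thread its own
     * BUF_SIZE/nthreads slice. */
    static void *zero_full_buffer(void *arg)
    {
            (void)arg;
            for (int i = 0; i < LOOPS; i++)
                    memset(buf, 0, BUF_SIZE);
            return NULL;
    }

    int main(void)
    {
            long nthreads = sysconf(_SC_NPROCESSORS_ONLN) / 2;
            pthread_t *tids;

            if (nthreads < 1)
                    nthreads = 1;
            tids = malloc(sizeof(*tids) * nthreads);

            fork();                     /* two processes, no communication */
            buf = malloc(BUF_SIZE);     /* each process gets its own 3G buffer */
            if (!buf || !tids)
                    return 1;

            for (long i = 0; i < nthreads; i++)
                    pthread_create(&tids[i], NULL, zero_full_buffer, NULL);
            for (long i = 0; i < nthreads; i++)
                    pthread_join(tids[i], NULL);
            return 0;
    }

On a 2-node machine the interesting question is whether the two processes' threads and pages drift apart onto separate nodes; with 4 or more nodes the full-buffer writes from every thread keep pulling the same pages back and forth, which is why Mel calls it an adverse workload there.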
Re: [PATCH 00/31] numa/core patches
On Tue, 30 Oct 2012 12:20:32 +0000 Mel Gorman <mgor...@suse.de> wrote:

> ...

Useful testing - thanks.  Did I miss the description of what autonumabench actually does?  How representative is it of real-world things?

> I also expect autonuma is continually scanning whereas schednuma is reacting to some other external event or at least less frequently scanning.

Might this imply that autonuma is consuming more CPU in kernel threads, the cost of which didn't get included in these results?
Re: [PATCH 00/31] numa/core patches
On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:
> Hi all,
>
> Here's a re-post of the NUMA scheduling and migration improvement
> patches that we are working on.  These include techniques from
> AutoNUMA and the sched/numa tree and form a unified basis - it
> has got all the bits that look good and mergeable.

Thanks for the repost.  I have not even started a review yet as I was travelling and am only back online today.  It will be another day or two before I can start, but I was at least able to do a comparison test between autonuma and schednuma today to see which actually performs the best.  Even without the review I was able to stick on similar vmstat patches as were applied to autonuma to give a rough estimate of the relative overhead of both implementations.

Machine was a 4-node box running autonumabench and specjbb.  The three kernels are

3.7-rc2-stats-v2r1        vmstat patches on top
3.7-rc2-autonuma-v27      latest autonuma with stats on top
3.7-rc2-schednuma-v1r3    schednuma series minus the last patch + stats

AUTONUMA BENCH
                                        3.7.0                 3.7.0                 3.7.0
                               rc2-stats-v2r1    rc2-autonuma-v27r8    rc2-schednuma-v1r3
User    NUMA01             67145.71 (  0.00%)    30110.13 ( 55.16%)    61666.46 (  8.16%)
User    NUMA01_THEADLOCAL  55104.60 (  0.00%)    17285.49 ( 68.63%)    17135.48 ( 68.90%)
User    NUMA02              7074.54 (  0.00%)     2219.11 ( 68.63%)     2226.09 ( 68.53%)
User    NUMA02_SMT          2916.86 (  0.00%)      999.19 ( 65.74%)     1038.06 ( 64.41%)
System  NUMA01                42.28 (  0.00%)     469.07 (-1009.44%)   2808.08 (-6541.63%)
System  NUMA01_THEADLOCAL     41.71 (  0.00%)     183.24 (-339.32%)     174.92 (-319.37%)
System  NUMA02                34.67 (  0.00%)      27.85 ( 19.67%)       15.03 ( 56.65%)
System  NUMA02_SMT             0.89 (  0.00%)      18.36 (-1962.92%)      5.05 (-467.42%)
Elapsed NUMA01              1512.97 (  0.00%)     698.18 ( 53.85%)     1422.71 (  5.97%)
Elapsed NUMA01_THEADLOCAL   1264.23 (  0.00%)     389.51 ( 69.19%)      377.51 ( 70.14%)
Elapsed NUMA02               181.52 (  0.00%)      60.65 ( 66.59%)       52.86 ( 70.88%)
Elapsed NUMA02_SMT           163.59 (  0.00%)      58.57 ( 64.20%)       48.82 ( 70.16%)
CPU     NUMA01              4440.00 (  0.00%)    4379.00 (  1.37%)     4531.00 ( -2.05%)
CPU     NUMA01_THEADLOCAL   4362.00 (  0.00%)    4484.00 ( -2.80%)     4585.00 ( -5.11%)
CPU     NUMA02              3916.00 (  0.00%)    3704.00 (  5.41%)     4239.00 ( -8.25%)
CPU     NUMA02_SMT          1783.00 (  0.00%)    1737.00 (  2.58%)     2136.00 (-19.80%)

Two figures really matter here - System CPU usage and Elapsed time.  autonuma was known to hurt system CPU usage for the NUMA01 test case, but schednuma does *far* worse.  I do not have a breakdown of where this time is being spent, but the raw figure is bad.  autonuma is 10 times worse than a vanilla kernel, and schednuma is 5 times worse than autonuma.  For the overhead of the other test cases, schednuma is roughly comparable with autonuma -- i.e. both pretty high overhead.

In terms of elapsed time, autonuma in the NUMA01 test case massively improves elapsed time while schednuma barely makes a dent in it.  Looking at the memory usage per node (I generated a graph offline), it appears that schednuma does not migrate pages to other nodes fast enough.  The convergence figures do not reflect this because the convergence seems high (towards 1), but it may be because the approximation using faults is insufficient.  In the other cases, schednuma does well and is comparable to autonuma.

MMTests Statistics: duration
                 3.7.0              3.7.0              3.7.0
        rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
User         132248.88           50620.50           82073.11
System          120.19             699.12            3003.83
Elapsed        3131.10            1215.63            1911.55

This is the overall time to complete the test.  autonuma is way better than schednuma, but this is all due to how it handles the NUMA01 test case.

MMTests Statistics: vmstat
                                 3.7.0              3.7.0              3.7.0
                        rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
Page Ins                         37256              37508              37360
Page Outs                            2              13372              19488
Swap Ins                             0                  0                  0
Swap Outs                            0                  0                  0
Direct pages scanned                 0                  0                  0
Kswapd pages scanned                 0                  0                  0
Kswapd pages reclaimed               0                  0                  0
Direct pages reclaimed               0                  0                  0
Kswapd efficiency                 100%               100%               100%
Kswapd velocity                  0.000              0.000              0.000
Direct efficiency                 100%               100%               100%
Direct
Re: [PATCH 00/31] numa/core patches
On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > > [ 183.293705] [8162f518] page_fault+0x28/0x30
> > Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?
> Oops.  Here is an incremental fix, feel free to fold it into #31.

Hello Johannes,

I don't think the below patch completely fixes this issue, as I found a new error (maybe similar to this):

[88099.923724] [ cut here ]
[88099.924036] kernel BUG at mm/memcontrol.c:1134!
[88099.924036] invalid opcode: [#1] SMP
[88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
[88099.924036] CPU 7
[88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3 Dell Inc. PowerEdge 6950/0WN213
[88099.924036] RIP: 0010:[81188e97] [81188e97] mem_cgroup_update_lru_size+0x27/0x30
[88099.924036] RSP: :88021b247ca8 EFLAGS: 00010082
[88099.924036] RAX: 88011d310138 RBX: ea0002f18000 RCX: 0001
[88099.924036] RDX: fe00 RSI: 000e RDI: 88011d310138
[88099.924036] RBP: 88021b247ca8 R08: R09: a8000bc6
[88099.924036] R10: R11: R12: fe00
[88099.924036] R13: 88011ffecb40 R14: 0286 R15:
[88099.924036] FS: 7f787d0bf740() GS:88021fc8() knlGS:
[88099.924036] CS: 0010 DS: ES: CR0: 8005003b
[88099.924036] CR2: 7f7873a00010 CR3: 00021bda CR4: 07e0
[88099.924036] DR0: DR1: DR2:
[88099.924036] DR3: DR6: 0ff0 DR7: 0400
[88099.924036] Process stress (pid: 3441, threadinfo 88021b246000, task 88021b399760)
[88099.924036] Stack:
[88099.924036]  88021b247cf8 8113a9cd ea0002f18000 88011d310138
[88099.924036]  0200 ea0002f18000 88019bace580 7f7873c0
[88099.924036]  88021aca0cf0 ea00081e 88021b247d18 8113aa7d
[88099.924036] Call Trace:
[88099.924036]  [8113a9cd] __page_cache_release.part.11+0xdd/0x140
[88099.924036]  [8113aa7d] __put_compound_page+0x1d/0x30
[88099.924036]  [8113ac4d] put_compound_page+0x5d/0x1e0
[88099.924036]  [8113b1a5] put_page+0x45/0x50
[88099.924036]  [8118378c] do_huge_pmd_numa_page+0x2ec/0x4e0
[88099.924036]  [81158089] handle_mm_fault+0x1e9/0x360
[88099.924036]  [8162cd22] __do_page_fault+0x172/0x4e0
[88099.924036]  [810958b9] ? task_numa_work+0x1c9/0x220
[88099.924036]  [8107c56c] ? task_work_run+0xac/0xe0
[88099.924036]  [8162d09e] do_page_fault+0xe/0x10
[88099.924036]  [816296d8] page_fault+0x28/0x30
[88099.924036] Code: 00 00 00 00 66 66 66 66 90 44 8b 1d 1c 90 b5 00 55 48 89 e5 45 85 db 75 10 89 f6 48 63 d2 48 83 c6 0e 48 01 54 f7 08 78 02 5d c3 <0f> 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec
[88099.924036] RIP [81188e97] mem_cgroup_update_lru_size+0x27/0x30
[88099.924036] RSP 88021b247ca8
[88099.924036] ---[ end trace c8d6b169e0c3f25a ]---
[88108.054610] [ cut here ]
[88108.054610] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
[88108.054610] Hardware name: PowerEdge 6950
[88108.054610] Watchdog detected hard LOCKUP on cpu 3
[88108.054610] Modules linked in: lockd sunrpc kvm_amd kvm amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core
[88108.054610] Pid: 3429, comm: stress Tainted: G D 3.7.0-rc2Jons+ #3
[88108.054610] Call Trace:
[88108.054610]  NMI [8105c29f] warn_slowpath_common+0x7f/0xc0
[88108.054610]  [8105c396] warn_slowpath_fmt+0x46/0x50
[88108.054610]  [81093fa8] ? sched_clock_cpu+0xa8/0x120
[88108.054610]  [810e95c0] ? touch_nmi_watchdog+0x80/0x80
[88108.054610]  [] watchdog_overflow_callback+0x9c/0xd0
[88108.054610]  [] __perf_event_overflow+0x9d/0x230
[88108.054610]  [] ? perf_event_update_userpage+0x24/0x110
[88108.054610]  [] perf_event_overflow+0x14/0x20
[88108.054610]  [] x86_pmu_handle_irq+0x10a/0x160
[88108.054610]  [] perf_event_nmi_handler+0x1d/0x20
[88108.054610]  [] nmi_handle.isra.0+0x51/0x80
[88108.054610]  [] do_nmi+0x179/0x350
[88108.054610]  [] end_repeat_nmi+0x1e/0x2e
[88108.054610]  [] ?
Re: [PATCH 00/31] numa/core patches
On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > > [ 183.293705] [8162f518] page_fault+0x28/0x30
> > Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?
> Oops.  Here is an incremental fix, feel free to fold it into #31.

Hi Johannes,

Tested the below patch, and I'm sure it has fixed the above issue, thank you.

Zhouping

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c30a14..0d7ebd3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		goto alloc_fail;
 
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	lru = PageLRU(page);
 
 	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
@@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return;
 	}
 
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
+
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pmd_at(mm, haddr, pmd, entry);
 	update_mmu_cache_pmd(vma, address, entry);
 	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(&mm->page_table_lock);
 
 	put_page(page);			/* Drop the rmap reference */
@@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unlock_page(new_page);
 
-	mem_cgroup_end_migration(memcg, page, new_page, true);
-
 	unlock_page(page);
 	put_page(page);			/* Drop the local reference */
 
 	return;
 
 alloc_fail:
-	if (new_page) {
-		mem_cgroup_end_migration(memcg, page, new_page, false);
+	if (new_page)
 		put_page(new_page);
-	}
 
 	unlock_page(page);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 				  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	enum charge_type ctype;
 
 	*memcgp = NULL;
 
-	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return;
 
+	if (PageTransHuge(page))
+		nr_pages <<= compound_order(page);
+
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
Re: [PATCH 00/31] numa/core patches
On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > [ 183.293705] [8162f518] page_fault+0x28/0x30
>
> Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?

Oops.  Here is an incremental fix, feel free to fold it into #31.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c30a14..0d7ebd3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		goto alloc_fail;
 
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	lru = PageLRU(page);
 
 	if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
@@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return;
 	}
 
+	/*
+	 * Traditional migration needs to prepare the memcg charge
+	 * transaction early to prevent the old page from being
+	 * uncharged when installing migration entries.  Here we can
+	 * save the potential rollback and start the charge transfer
+	 * only when migration is already known to end successfully.
+	 */
+	mem_cgroup_prepare_migration(page, new_page, &memcg);
+
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pmd_at(mm, haddr, pmd, entry);
 	update_mmu_cache_pmd(vma, address, entry);
 	page_remove_rmap(page);
+	/*
+	 * Finish the charge transaction under the page table lock to
+	 * prevent split_huge_page() from dividing up the charge
+	 * before it's fully transferred to the new page.
+	 */
+	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(&mm->page_table_lock);
 
 	put_page(page);			/* Drop the rmap reference */
@@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unlock_page(new_page);
 
-	mem_cgroup_end_migration(memcg, page, new_page, true);
-
 	unlock_page(page);
 	put_page(page);			/* Drop the local reference */
 
 	return;
 
 alloc_fail:
-	if (new_page) {
-		mem_cgroup_end_migration(memcg, page, new_page, false);
+	if (new_page)
 		put_page(new_page);
-	}
 
 	unlock_page(page);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 				  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	enum charge_type ctype;
 
 	*memcgp = NULL;
 
-	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return;
 
+	if (PageTransHuge(page))
+		nr_pages <<= compound_order(page);
+
 	pc = lookup_page_cgroup(page);
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
 }
 
 /* remove redundant charge if migration failed*/
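The effect of the memcontrol.c hunk above is to scale the migration charge to the page size instead of refusing THP outright.  Working through the arithmetic from the patch itself:

    /* from the hunk above: charge a THP as its full set of base pages */
    unsigned int nr_pages = 1;
    if (PageTransHuge(page))
            nr_pages <<= compound_order(page);  /* 2M THP on x86-64: order 9, 1 << 9 = 512 */
    __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);

so a 2M huge page is committed as 512 base pages rather than 1, keeping the res_counter charge consistent with the huge page that replaces the old one.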
Re: [PATCH 00/31] numa/core patches
* Zhouping Liu <z...@redhat.com> wrote:

> On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> > * Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
> > > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > > > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > > > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > > > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > > > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > > > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > > > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > > > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > > > [ 183.293705] [8162f518] page_fault+0x28/0x30
> > Meanwhile, Zhouping Liu, could you please not apply the last patch:
> >
> >    [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> >
> > and see whether it boots/works without that?
>
> Hi Ingo,
>
> your supposition is right: after reverting the 31st patch (sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()) the issue is gone, thank you.

The tested bits you can find in the numa/core tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/core

It includes all changes (patches #1-#30) except patch #31 - I wanted to test and apply that last patch today, but won't do it now that you've reported this regression.

Thanks,

	Ingo
Re: [PATCH 00/31] numa/core patches
On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
> > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > > [ 183.293705] [8162f518] page_fault+0x28/0x30
> > Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?
>
> Meanwhile, Zhouping Liu, could you please not apply the last patch:
>
>    [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>
> and see whether it boots/works without that?

Hi Ingo,

your supposition is right: after reverting the 31st patch (sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()) the issue is gone, thank you.

Thanks,
Zhouping

> Thanks,
>
> 	Ingo
Re: [PATCH 00/31] numa/core patches
On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
> > On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > > [ 183.293705] [8162f518] page_fault+0x28/0x30
> > Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?
>
> Meanwhile, Zhouping Liu, could you please not apply the last patch:
>
>    [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>
> and see whether it boots/works without that?

Ok, I reverted the 31st patch and will provide the results here after I finish testing.

Thanks,
Zhouping
Re: [PATCH 00/31] numa/core patches
* Peter Zijlstra <a.p.zijls...@chello.nl> wrote:

> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> > [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> > [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> > [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> > [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> > [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> > [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> > [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> > [ 183.293705] [8162f518] page_fault+0x28/0x30
>
> Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?

Meanwhile, Zhouping Liu, could you please not apply the last patch:

   [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()

and see whether it boots/works without that?

Thanks,

	Ingo
Re: [PATCH 00/31] numa/core patches
On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> [ 180.918591] RIP: 0010:[8118c39a] [8118c39a] mem_cgroup_prepare_migration+0xba/0xd0
> [ 182.681450] [81183b60] do_huge_pmd_numa_page+0x180/0x500
> [ 182.775090] [811585c9] handle_mm_fault+0x1e9/0x360
> [ 182.863038] [81632b62] __do_page_fault+0x172/0x4e0
> [ 182.950574] [8101c283] ? __switch_to_xtra+0x163/0x1a0
> [ 183.041512] [8101281e] ? __switch_to+0x3ce/0x4a0
> [ 183.126832] [8162d686] ? __schedule+0x3c6/0x7a0
> [ 183.211216] [81632ede] do_page_fault+0xe/0x10
> [ 183.293705] [8162f518] page_fault+0x28/0x30

Johannes, this looks like the thp migration memcg hookery gone bad, could you have a look at this?
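A note on where this likely trips: judging by the context lines of Johannes' incremental fix elsewhere in this thread, mem_cgroup_prepare_migration() at this point in the series begins like this:

    	*memcgp = NULL;

    	VM_BUG_ON(PageTransHuge(page));

    	if (mem_cgroup_disabled())
    		return;

do_huge_pmd_numa_page() hands it a transparent huge page, so that assertion is the plausible suspect for the RIP above; the fix deletes exactly that line and makes the charge transfer THP-aware instead.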