Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-04 Thread Dave Chinner
On Thu, Mar 05, 2015 at 12:35:45AM +0100, Ingo Molnar wrote:
> 
> * Dave Chinner  wrote:
> 
> > > After going through the series again, I did not spot why there is 
> > > a difference. It's functionally similar and I would hate the 
> > > theory that this is somehow hardware related due to the use of 
> > > bits it takes action on.
> > 
> > I doubt it's hardware related - I'm testing inside a VM, [...]
> 
> That might be significant, I doubt Mel considered KVM's interpretation 
> of pte details?

I did actually mention that before:

| I am running a fake-numa=4 config on this test VM so it's got 4
| nodes of 4p/4GB RAM each.

but I think it got snipped before Mel was cc'd.

Perhaps the size of the nodes is relevant, too, because the steady state
phase 3 memory usage is 5-6GB when this problem first shows up, and
then continues into phase 4 where memory usage grows again and peaks
at ~10GB.
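
(For reference, the fake NUMA layout above is just x86 NUMA emulation in
the guest; a minimal sketch of the relevant boot setup, assuming the
standard numa=fake= parameter is what is behind this "fake-numa=4" config:

  # guest kernel command line fragment (requires CONFIG_NUMA_EMU)
  numa=fake=4

which carves the 16GB guest into four equally sized nodes.)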

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-04 Thread Ingo Molnar

* Dave Chinner  wrote:

> > After going through the series again, I did not spot why there is 
> > a difference. It's functionally similar and I would hate the 
> > theory that this is somehow hardware related due to the use of 
> > bits it takes action on.
> 
> I doubt it's hardware related - I'm testing inside a VM, [...]

That might be significant, I doubt Mel considered KVM's interpretation 
of pte details?

Thanks,

Ingo


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-04 Thread Dave Chinner
On Wed, Mar 04, 2015 at 08:00:46PM +, Mel Gorman wrote:
> On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> > On Tue, Mar 03, 2015 at 01:43:46PM +, Mel Gorman wrote:
> > > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner  
> > > > > wrote:
> > > > > >>
> > > > > >> But are those migrate-page calls really common enough to make these
> > > > > >> things happen often enough on the same pages for this all to 
> > > > > >> matter?
> > > > > >
> > > > > > It's looking like that's a possibility.
> > > > > 
> > > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > > re-introduced the "pte was already NUMA" case.
> > > > > 
> > > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > > numa balancing isn't working right.
> > > > 
> > > > So that should show up in the vmstats, right? Oh, and there's a
> > > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > > > 
> > > 
> > > The stats indicate both more updates and more faults. Can you try this
> > > please? It's against 4.0-rc1.
> > > 
> > > ---8<---
> > > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
> > 
> > Makes no noticeable difference to behaviour or performance. Stats:
> > 
> 
> After going through the series again, I did not spot why there is a
> difference. It's functionally similar and I would hate the theory that
> this is somehow hardware related due to the use of bits it takes action
> on.

I doubt it's hardware related - I'm testing inside a VM, and the
host is a year-old Dell R820 server, so it's pretty common
hardware, I'd think.

Guest:

processor   : 15
vendor_id   : GenuineIntel
cpu family  : 6
model   : 6
model name  : QEMU Virtual CPU version 2.0.0
stepping: 3
microcode   : 0x1
cpu MHz : 2199.998
cache size  : 4096 KB
physical id : 15
siblings: 1
core id : 0
cpu cores   : 1
apicid  : 15
initial apicid  : 15
fpu : yes
fpu_exception   : yes
cpuid level : 4
wp  : yes
flags   : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16 x2apic 
popcnt hypervisor lahf_lm
bugs:
bogomips: 4399.99
clflush size: 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Host:

processor   : 31
vendor_id   : GenuineIntel
cpu family  : 6
model   : 45
model name  : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
stepping: 7
microcode   : 0x70d
cpu MHz : 1190.750
cache size  : 16384 KB
physical id : 1
siblings: 16
core id : 7
cpu cores   : 8
apicid  : 47
initial apicid  : 47
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm 
tpr_shadow vnmi flexpriority ept vpid
bogomips: 4400.75
clflush size: 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

> There is nothing in the manual that indicates that it would. Try this
> as I don't want to leave this hanging before LSF/MM because it'll mask other
> reports. It alters the maximum rate automatic NUMA balancing scans ptes.
> 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..40ae5d84d4ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct 
> sched_entity *se)
>   * calculated based on the tasks virtual memory size and
>   * numa_balancing_scan_size.
>   */
> -unsigned int sysctl_numa_balancing_scan_period_min = 1000;
> +unsigned int sysctl_numa_balancing_scan_period_min = 2000;
>  unsigned int sysctl_numa_balancing_scan_period_max = 60000;

Made absolutely no difference:

357,635  migrate:mm_migrate_pages  ( +-  4.11% )

numa_hit 36724642
numa_miss 92477
numa_foreign 92477
numa_interleave 11835
numa_local 36709671
numa_other 107448
numa_pte_updates 83924860
numa_huge_pte_updates 0
numa_hint_faults 81856035
numa_hint_faults_local 22104529
numa_pages_migrated 32766735
pgmigrate_success 32766735
pgmigrate_fail 0

Runtime was actually a minute worse (18m35s vs 17m39s) than without
this patch.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-04 Thread Mel Gorman
On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2015 at 01:43:46PM +, Mel Gorman wrote:
> > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner  
> > > > wrote:
> > > > >>
> > > > >> But are those migrate-page calls really common enough to make these
> > > > >> things happen often enough on the same pages for this all to matter?
> > > > >
> > > > > It's looking like that's a possibility.
> > > > 
> > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > re-introduced the "pte was already NUMA" case.
> > > > 
> > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > numa balancing isn't working right.
> > > 
> > > So that should show up in the vmstats, right? Oh, and there's a
> > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > > 
> > 
> > The stats indicate both more updates and more faults. Can you try this
> > please? It's against 4.0-rc1.
> > 
> > ---8<---
> > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
> 
> Makes no noticeable difference to behaviour or performance. Stats:
> 

After going through the series again, I did not spot why there is a
difference. It's functionally similar and I would hate the theory that
this is somehow hardware related due to the use of bits it takes action
on. There is nothing in the manual that indicates that it would. Try this
as I don't want to leave this hanging before LSF/MM because it'll mask other
reports. It alters the maximum rate automatic NUMA balancing scans ptes.

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..40ae5d84d4ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * calculated based on the tasks virtual memory size and
  * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_min = 2000;
 unsigned int sysctl_numa_balancing_scan_period_max = 60000;
 
 /* Portion of address space to scan in MB */
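
(For quick experiments the same knob can also be changed at runtime; a
sketch, assuming the usual NUMA balancing sysctls are exposed on this
kernel:

  # equivalent runtime change, no rebuild needed
  sysctl kernel.numa_balancing_scan_period_min_ms=2000

which matches the 1000 -> 2000 change in the patch above.)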


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-03 Thread Dave Chinner
On Tue, Mar 03, 2015 at 01:43:46PM +, Mel Gorman wrote:
> On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner  wrote:
> > > >>
> > > >> But are those migrate-page calls really common enough to make these
> > > >> things happen often enough on the same pages for this all to matter?
> > > >
> > > > It's looking like that's a possibility.
> > > 
> > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > re-introduced the "pte was already NUMA" case.
> > > 
> > > So that's not it either, afaik. Plus your numbers seem to say that
> > > it's really "migrate_pages()" that is done more. So it feels like the
> > > numa balancing isn't working right.
> > 
> > So that should show up in the vmstats, right? Oh, and there's a
> > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > 
> 
> The stats indicate both more updates and more faults. Can you try this
> please? It's against 4.0-rc1.
> 
> ---8<---
> mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Makes no noticeable difference to behaviour or performance. Stats:

359,857  migrate:mm_migrate_pages ( +-  5.54% )

numa_hit 36026802
numa_miss 14287
numa_foreign 14287
numa_interleave 18408
numa_local 36006052
numa_other 35037
numa_pte_updates 81803359
numa_huge_pte_updates 0
numa_hint_faults 79810798
numa_hint_faults_local 21227730
numa_pages_migrated 32037516
pgmigrate_success 32037516
pgmigrate_fail 0

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-03 Thread Mel Gorman
On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner  wrote:
> > >>
> > >> But are those migrate-page calls really common enough to make these
> > >> things happen often enough on the same pages for this all to matter?
> > >
> > > It's looking like that's a possibility.
> > 
> > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > re-introduced the "pte was already NUMA" case.
> > 
> > So that's not it either, afaik. Plus your numbers seem to say that
> > it's really "migrate_pages()" that is done more. So it feels like the
> > numa balancing isn't working right.
> 
> So that should show up in the vmstats, right? Oh, and there's a
> tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> 

The stats indicate both more updates and more faults. Can you try this
please? It's against 4.0-rc1.

---8<---
mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

   Across the board the 4.0-rc1 numbers are much slower, and the
   degradation is far worse when using the large memory footprint
   configs. Perf points straight at the cause - this is from 4.0-rc1
   on the "-o bhash=101073" config:

   -   56.07%  56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
  - default_send_IPI_mask_sequence_phys
 - 99.99% physflat_send_IPI_mask
- 99.37% native_send_call_func_ipi
 smp_call_function_many
   - native_flush_tlb_others
  - 99.85% flush_tlb_page
   ptep_clear_flush
   try_to_unmap_one
   rmap_walk
   try_to_unmap
   migrate_pages
   migrate_misplaced_page
 - handle_mm_fault
- 99.73% __do_page_fault
 trace_do_page_fault
 do_async_page_fault
   + async_page_fault
  0.63% native_send_call_func_single_ipi
 generic_exec_single
 smp_call_function_single

This was bisected to commit 4d94246699 ("mm: convert p[te|md]_mknonnuma
and remaining page table manipulations") but I expect the full issue is
related to the series up to and including that patch.

There are two important changes that might be relevant here. The first is
that marking huge PMDs to trap a hinting fault potentially sends an IPI to
flush TLBs. This did not show up in Dave's report and it almost certainly
is not a factor, but it would affect IPI counts for other users. The second
is that the PTE protection update now clears the PTE, leaving a window where
parallel faults can be trapped, resulting in more overhead from faults. More
faults, even if handled correctly, can indirectly result in higher scan rates
and may explain what Dave is seeing.

This is not signed off or tested.
---
 mm/huge_memory.c | 11 +--
 mm/mprotect.c| 17 +++--
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8cb5a82..7fc4732c77d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1494,8 +1494,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}
 
if (!prot_numa || !pmd_protnone(*pmd)) {
-   ret = 1;
-   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+   /*
+* NUMA hinting update can avoid a clear and flush as
+* it is not a functional correctness issue if access
+* occurs after the update
+*/
+   if (prot_numa)
+   entry = *pmd;
+   else
+   entry = pmdp_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 44727811bf4c..1efd03ffa0d8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,19 +77,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
pte_t ptent;
 
/*
-* Avoid trapping faults against the zero or KSM
-* pages. See similar comment in change_huge_pmd.
+* prot_numa does not clear the pte during protection
+* update as asynchronous hardware updates are not
+* a concern but unnecessary faults while the PTE is
+* cleared is overhead.
 */
if (prot_numa) {

Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-03 Thread Dave Chinner
On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner  wrote:
> >>
> >> But are those migrate-page calls really common enough to make these
> >> things happen often enough on the same pages for this all to matter?
> >
> > It's looking like that's a possibility.
> 
> Hmm. Looking closer, commit 10c1045f28e8 already should have
> re-introduced the "pte was already NUMA" case.
> 
> So that's not it either, afaik. Plus your numbers seem to say that
> it's really "migrate_pages()" that is done more. So it feels like the
> numa balancing isn't working right.

So that should show up in the vmstats, right? Oh, and there's a
tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:

3.19:

55,898  migrate:mm_migrate_pages

And a sample of the events shows 99.99% of these are:

mm_migrate_pages: nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=

4.0-rc1:

364,442  migrate:mm_migrate_pages

They are also single page MIGRATE_ASYNC events like for 3.19.

And 'grep "numa\|migrate" /proc/vmstat' output for the entire
xfs_repair run:

3.19:

numa_hit 5163221
numa_miss 121274
numa_foreign 121274
numa_interleave 12116
numa_local 5153127
numa_other 131368
numa_pte_updates 36482466
numa_huge_pte_updates 0
numa_hint_faults 34816515
numa_hint_faults_local 9197961
numa_pages_migrated 1228114
pgmigrate_success 1228114
pgmigrate_fail 0

4.0-rc1:

numa_hit 36952043
numa_miss 92471
numa_foreign 92471
numa_interleave 10964
numa_local 36927384
numa_other 117130
numa_pte_updates 84010995
numa_huge_pte_updates 0
numa_hint_faults 81697505
numa_hint_faults_local 21765799
numa_pages_migrated 32916316
pgmigrate_success 32916316
pgmigrate_fail 0
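
(Rough arithmetic on the two runs above: on 3.19 about 1.23M of 36.5M
pte updates ended up as migrations, roughly 3%, while on 4.0-rc1 about
32.9M of 84.0M did, roughly 39%. Updates only grew ~2.3x but migrations
grew ~27x, so the extra shootdown IPIs look dominated by the migration
rate rather than by the larger scan volume alone.)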

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Linus Torvalds
On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner  wrote:
>>
>> But are those migrate-page calls really common enough to make these
>> things happen often enough on the same pages for this all to matter?
>
> It's looking like that's a possibility.

Hmm. Looking closer, commit 10c1045f28e8 already should have
re-introduced the "pte was already NUMA" case.

So that's not it either, afaik. Plus your numbers seem to say that
it's really "migrate_pages()" that is done more. So it feels like the
numa balancing isn't working right.

But I'm not seeing what would cause that in that commit. It really all
looks the same to me. The few special-cases it drops get re-introduced
later (although in a different form).

Mel, do you see what I'm missing?

 Linus


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Dave Chinner
On Mon, Mar 02, 2015 at 06:37:47PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
>  wrote:
> >
> > There might be some other case where the new "just change the
> > protection" doesn't do the "oh, but it the protection didn't change,
> > don't bother flushing". I don't see it.
> 
> Hmm. I wonder.. In change_pte_range(), we just unconditionally change
> the protection bits.
> 
> But the old numa code used to do
> 
> if (!pte_numa(oldpte)) {
> ptep_set_numa(mm, addr, pte);
> 
> so it would actually avoid the pte update if a numa-prot page was
> marked numa-prot again.
> 
> But are those migrate-page calls really common enough to make these
> things happen often enough on the same pages for this all to matter?

It's looking like that's a possibility.  I am running a fake-numa=4
config on this test VM so it's got 4 nodes of 4p/4GB RAM each.
Both kernels are running through the same page fault path and that
is straight through migrate_pages().

3.19:

   13.70%  0.01%  [kernel]  [k] native_flush_tlb_others
   - native_flush_tlb_others
  - 98.58% flush_tlb_page
   ptep_clear_flush
   try_to_unmap_one
   rmap_walk
   try_to_unmap
   migrate_pages
   migrate_misplaced_page
 - handle_mm_fault
- 96.88% __do_page_fault
 trace_do_page_fault
 do_async_page_fault
   + async_page_fault
+ 3.12% __get_user_pages
  + 1.40% flush_tlb_mm_range

4.0-rc1:

-   67.12%  0.04%  [kernel]  [k] native_flush_tlb_others
   - native_flush_tlb_others
  - 99.80% flush_tlb_page
   ptep_clear_flush
   try_to_unmap_one
   rmap_walk
   try_to_unmap
   migrate_pages
   migrate_misplaced_page
 - handle_mm_fault
- 99.50% __do_page_fault
 trace_do_page_fault
 do_async_page_fault
   - async_page_fault

Same call chain, just a lot more CPU used further down the stack.

> Odd.
> 
> So it would be good if your profiles just show "there's suddenly a
> *lot* more calls to flush_tlb_page() from XYZ" and the culprit is
> obvious that way..

Ok, I did a simple 'perf stat -e tlb:tlb_flush -a -r 6 sleep 10' to
count all the tlb flush events from the kernel. I then pulled the
full events for a 30s period to get a sampling of the reason
associated with each flush event.

4.0-rc1:

 Performance counter stats for 'system wide' (6 runs):

 2,190,503  tlb:tlb_flush  ( +-  8.30% )

  10.001970663 seconds time elapsed( +-  0.00% )

The reason breakdown:

81% TLB_REMOTE_SHOOTDOWN
19% TLB_FLUSH_ON_TASK_SWITCH

3.19:

 Performance counter stats for 'system wide' (6 runs):

   467,151  tlb:tlb_flush  ( +- 25.50% )

  10.002021491 seconds time elapsed( +-  0.00% )

The reason breakdown:

  6% TLB_REMOTE_SHOOTDOWN
 94% TLB_FLUSH_ON_TASK_SWITCH

The difference would appear to be the number of remote TLB
shootdowns that are occurring from otherwise identical page fault
paths.
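
(A sketch of one way to reproduce that reason breakdown, assuming the
symbolic TLB_* reason names show up in perf script output the way they
do here:

  perf record -e tlb:tlb_flush -a -- sleep 30
  perf script | grep -oE 'TLB_[A-Z_]+' | sort | uniq -c | sort -rn

the per-reason counts give the percentages quoted above.)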

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Linus Torvalds
On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds
 wrote:
>
> There might be some other case where the new "just change the
> protection" doesn't do the "oh, but it the protection didn't change,
> don't bother flushing". I don't see it.

Hmm. I wonder.. In change_pte_range(), we just unconditionally change
the protection bits.

But the old numa code used to do

if (!pte_numa(oldpte)) {
ptep_set_numa(mm, addr, pte);

so it would actually avoid the pte update if a numa-prot page was
marked numa-prot again.

But are those migrate-page calls really common enough to make these
things happen often enough on the same pages for this all to matter?

Odd.

So it would be good if your profiles just show "there's suddenly a
*lot* more calls to flush_tlb_page() from XYZ" and the culprit is
obvious that way..
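
(A sketch of one way to get that caller breakdown, assuming kprobes and
kernel symbols are available in the guest:

  perf probe --add flush_tlb_page
  perf record -e probe:flush_tlb_page -a -g -- sleep 10
  perf report --stdio

the recorded call graphs then show which paths into flush_tlb_page()
grew between the two kernels.)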

   Linus


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Linus Torvalds
On Mon, Mar 2, 2015 at 5:47 PM, Dave Chinner  wrote:
>
> Anyway, the difference between good and bad is pretty clear, so
> I'm pretty confident the bisect is solid:
>
> 4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit

Well, it's the mm queue from Andrew, so I'm not surprised. That said,
I don't see why that particular one should matter.

Hmm. In your profiles, can you tell which caller of "flush_tlb_page()"
changed the most? The change from "mknonnuma" to "prot_none" *should*
be 100% equivalent (both just change the page to be not-present, just
set different bits elsewhere in the pte), but clearly something
wasn't.

Oh. Except for that special "huge-zero-page" special case that got
dropped, but that got re-introduced in commit e944fd67b625.

There might be some other case where the new "just change the
protection" doesn't do the "oh, but it the protection didn't change,
don't bother flushing". I don't see it.

  Linus


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Dave Chinner
On Mon, Mar 02, 2015 at 11:47:52AM -0800, Linus Torvalds wrote:
> On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner  wrote:
> >
> > Across the board the 4.0-rc1 numbers are much slower, and the
> > degradation is far worse when using the large memory footprint
> > configs. Perf points straight at the cause - this is from 4.0-rc1
> > on the "-o bhash=101073" config:
> >
> > -   56.07%56.07%  [kernel][k] 
> > default_send_IPI_mask_sequence_phys
> >   - 99.99% physflat_send_IPI_mask
> >  - 99.37% native_send_call_func_ipi
> ..
> >
> > And the same profile output from 3.19 shows:
> >
> > -9.61% 9.61%  [kernel][k] 
> > default_send_IPI_mask_sequence_phys
> >  - 99.98% physflat_send_IPI_mask
> >  - 96.26% native_send_call_func_ipi
> ...
> >
> > So either there's been a massive increase in the number of IPIs
> > being sent, or the cost per IPI have greatly increased. Either way,
> > the result is a pretty significant performance degradatation.

> I assume it's the mm queue from Andrew, so adding him to the cc. There
> are changes to the page migration etc, which could explain it.
> 
> There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
> really could be just that the IPI sending itself has gotten much
> slower. Adding Ingo for that, although I don't think
> default_send_IPI_mask_sequence_phys() itself has actually changed,
> only other things around the apic. So I'd be inclined to blame the mm
> changes.
> 
> Obviously bisection would find it..

Yes, though the time it takes to do a 13 step bisection means it's
something I don't do just for an initial bug report. ;)
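
(13 steps is about what you'd expect here: for a merge window on the
order of 10k commits, a bisection needs ceil(log2(N)), i.e. 13-14
builds.)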

Anyway, the difference between good and bad is pretty clear, so
I'm pretty confident the bisect is solid:

4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit
commit 4d9424669946532be754a6e116618dcb58430cb4
Author: Mel Gorman 
Date:   Thu Feb 12 14:58:28 2015 -0800

mm: convert p[te|md]_mknonnuma and remaining page table manipulations

With PROT_NONE, the traditional page table manipulation functions are
sufficient.

[andre.przyw...@arm.com: fix compiler warning in pmdp_invalidate()]
[a...@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
Signed-off-by: Mel Gorman 
Acked-by: Linus Torvalds 
Acked-by: Aneesh Kumar 
Tested-by: Sasha Levin 
Cc: Benjamin Herrenschmidt 
Cc: Dave Jones 
Cc: Hugh Dickins 
Cc: Ingo Molnar 
Cc: Kirill Shutemov 
Cc: Paul Mackerras 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

:040000 040000 50985a3f84e80bb2bdd049d4f34739d99436f988 1bc79bfac2c138844373b603f9bc5914f0d010f3 M	arch
:040000 040000 ea69bcd1c59f832a4b012a57b4eb1d0c7516947d 0822692fa6c356952e723b56038585716fa51723 M	include
:040000 040000 c11960b9f1ee72edb08dc3fdc46f590fb1d545f7 f5d17ff5b639adcb7363a196a9efe70f2a7312b5 M	mm

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Linus Torvalds
On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner  wrote:
>
> Across the board the 4.0-rc1 numbers are much slower, and the
> degradation is far worse when using the large memory footprint
> configs. Perf points straight at the cause - this is from 4.0-rc1
> on the "-o bhash=101073" config:
>
> -   56.07%56.07%  [kernel][k] 
> default_send_IPI_mask_sequence_phys
>   - 99.99% physflat_send_IPI_mask
>  - 99.37% native_send_call_func_ipi
..
>
> And the same profile output from 3.19 shows:
>
> -9.61% 9.61%  [kernel][k] 
> default_send_IPI_mask_sequence_phys
>  - 99.98% physflat_send_IPI_mask
>  - 96.26% native_send_call_func_ipi
...
>
> So either there's been a massive increase in the number of IPIs
> being sent, or the cost per IPI have greatly increased. Either way,
> the result is a pretty significant performance degradatation.

And on Mon, Mar 2, 2015 at 11:17 AM, Matt  wrote:
>
> Linus already posted a fix to the problem, however I can't seem to
> find the matching commit in his tree (searching for "TLC regression"
> or "TLB cache").

That was commit f045bbb9fa1b, which was then refined by commit
721c21c17ab9, because it turned out that ARM64 had a very subtle
relationship with tlb->end and fullmm.

But both of those hit 3.19, so none of this should affect 4.0-rc1.
There's something else going on.

I assume it's the mm queue from Andrew, so adding him to the cc. There
are changes to the page migration etc, which could explain it.

There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
really could be just that the IPI sending itself has gotten much
slower. Adding Ingo for that, although I don't think
default_send_IPI_mask_sequence_phys() itself has actually changed,
only other things around the apic. So I'd be inclined to blame the mm
changes.

Obviously bisection would find it..

  Linus


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Matt
On Mon, Mar 2, 2015 at 8:25 PM, Dave Hansen  wrote:
> On 03/02/2015 11:17 AM, Matt wrote:
>> Linus already posted a fix to the problem, however I can't seem to
>> find the matching commit in his tree (searching for "TLC regression"
>> or "TLB cache").
>
> It's in 721c21c17ab958abf19a8fc611c3bd4743680e38 iirc.

Mea culpa, should have looked at the date of the thread - was just
grasping at straws in an attempt to help :/

I'll refrain from posting in this thread then, to avoid adding clutter &
load to the list

(this is way over my head, I'm mostly doing minor patch porting and
custom kernels as a hobby)

Kind Regards

Matt


Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.

2015-03-02 Thread Dave Hansen
On 03/02/2015 11:17 AM, Matt wrote:
> Linus already posted a fix to the problem, however I can't seem to
> find the matching commit in his tree (searching for "TLC regression"
> or "TLB cache").

It's in 721c21c17ab958abf19a8fc611c3bd4743680e38 iirc.