Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Thu, Mar 05, 2015 at 12:35:45AM +0100, Ingo Molnar wrote:
> 
> * Dave Chinner wrote:
> 
> > > After going through the series again, I did not spot why there is
> > > a difference. It's functionally similar and I would hate the
> > > theory that this is somehow hardware related due to the use of
> > > bits it takes action on.
> > 
> > I doubt it's hardware related - I'm testing inside a VM, [...]
> 
> That might be significant, I doubt Mel considered KVM's interpretation
> of pte details?

I did actually mention that before:

| I am running a fake-numa=4 config on this test VM so it's got 4
| nodes of 4p/4GB RAM each.

but I think it got snipped before Mel was cc'd. Perhaps the size of
the nodes is relevant, too, because the steady-state phase 3 memory
usage is 5-6GB when this problem first shows up, and it then continues
into phase 4, where memory usage grows again and peaks at ~10GB.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
* Dave Chinner wrote:

> > After going through the series again, I did not spot why there is
> > a difference. It's functionally similar and I would hate the
> > theory that this is somehow hardware related due to the use of
> > bits it takes action on.
> 
> I doubt it's hardware related - I'm testing inside a VM, [...]

That might be significant, I doubt Mel considered KVM's interpretation
of pte details?

Thanks,

	Ingo
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Wed, Mar 04, 2015 at 08:00:46PM +0000, Mel Gorman wrote:
> On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> > On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> > > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner wrote:
> > > > > >>
> > > > > >> But are those migrate-page calls really common enough to make these
> > > > > >> things happen often enough on the same pages for this all to
> > > > > >> matter?
> > > > > >
> > > > > > It's looking like that's a possibility.
> > > > >
> > > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > > re-introduced the "pte was already NUMA" case.
> > > > >
> > > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > > numa balancing isn't working right.
> > > >
> > > > So that should show up in the vmstats, right? Oh, and there's a
> > > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > > >
> > >
> > > The stats indicate both more updates and more faults. Can you try this
> > > please? It's against 4.0-rc1.
> > >
> > > ---8<---
> > > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
> >
> > Makes no noticeable difference to behaviour or performance. Stats:
> >
> After going through the series again, I did not spot why there is a
> difference. It's functionally similar and I would hate the theory that
> this is somehow hardware related due to the use of bits it takes action
> on.

I doubt it's hardware related - I'm testing inside a VM, and the host
is a year old Dell r820 server, so it's pretty common hardware, I'd
think.

Guest:

processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 6
model name	: QEMU Virtual CPU version 2.0.0
stepping	: 3
microcode	: 0x1
cpu MHz		: 2199.998
cache size	: 4096 KB
physical id	: 15
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 15
initial apicid	: 15
fpu		: yes
fpu_exception	: yes
cpuid level	: 4
wp		: yes
flags		: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl pni cx16 x2apic popcnt hypervisor lahf_lm
bugs		:
bogomips	: 4399.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

Host:

processor	: 31
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
stepping	: 7
microcode	: 0x70d
cpu MHz		: 1190.750
cache size	: 16384 KB
physical id	: 1
siblings	: 16
core id		: 7
cpu cores	: 8
apicid		: 47
initial apicid	: 47
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4400.75
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

> There is nothing in the manual that indicates that it would. Try this
> as I don't want to leave this hanging before LSF/MM because it'll mask
> other reports.
> It alters the maximum rate at which automatic NUMA balancing scans ptes.
>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..40ae5d84d4ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * calculated based on the tasks virtual memory size and
>   * numa_balancing_scan_size.
>   */
> -unsigned int sysctl_numa_balancing_scan_period_min = 1000;
> +unsigned int sysctl_numa_balancing_scan_period_min = 2000;
>  unsigned int sysctl_numa_balancing_scan_period_max = 60000;

Made absolutely no difference:

   357,635      migrate:mm_migrate_pages                  ( +- 4.11% )

numa_hit 36724642
numa_miss 92477
numa_foreign 92477
numa_interleave 11835
numa_local 36709671
numa_other 107448
numa_pte_updates 83924860
numa_huge_pte_updates 0
numa_hint_faults 81856035
numa_hint_faults_local 22104529
numa_pages_migrated 32766735
pgmigrate_success 32766735
pgmigrate_fail 0

Runtime was actually
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Wed, Mar 04, 2015 at 08:33:53AM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> > On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner wrote:
> > > > >>
> > > > >> But are those migrate-page calls really common enough to make these
> > > > >> things happen often enough on the same pages for this all to matter?
> > > > >
> > > > > It's looking like that's a possibility.
> > > >
> > > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > > re-introduced the "pte was already NUMA" case.
> > > >
> > > > So that's not it either, afaik. Plus your numbers seem to say that
> > > > it's really "migrate_pages()" that is done more. So it feels like the
> > > > numa balancing isn't working right.
> > >
> > > So that should show up in the vmstats, right? Oh, and there's a
> > > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> > >
> >
> > The stats indicate both more updates and more faults. Can you try this
> > please? It's against 4.0-rc1.
> >
> > ---8<---
> > mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing
>
> Makes no noticeable difference to behaviour or performance. Stats:
>

After going through the series again, I did not spot why there is a
difference. It's functionally similar and I would hate the theory that
this is somehow hardware related due to the use of bits it takes action
on. There is nothing in the manual that indicates that it would. Try
this as I don't want to leave this hanging before LSF/MM because it'll
mask other reports. It alters the maximum rate at which automatic NUMA
balancing scans ptes.

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..40ae5d84d4ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -799,7 +799,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * calculated based on the tasks virtual memory size and
  * numa_balancing_scan_size.
  */
-unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_min = 2000;
 unsigned int sysctl_numa_balancing_scan_period_max = 60000;

 /* Portion of address space to scan in MB */
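[For reference, not part of Mel's mail: the default being patched here is
also exposed at runtime as /proc/sys/kernel/numa_balancing_scan_period_min_ms,
so the same experiment can be tried without rebuilding the kernel. A minimal
sketch, assuming that sysctl file is present; needs root:]

	#include <stdio.h>

	/* Runtime equivalent of the one-liner above: raise the minimum
	 * NUMA balancing scan period (i.e. lower the maximum scan rate)
	 * via its sysctl instead of editing kernel/sched/fair.c. */
	int main(void)
	{
		const char *knob =
			"/proc/sys/kernel/numa_balancing_scan_period_min_ms";
		FILE *f = fopen(knob, "w");

		if (!f) {
			perror(knob);
			return 1;
		}
		fprintf(f, "2000\n");	/* same value Mel's patch hardcodes */
		fclose(f);
		return 0;
	}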
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Tue, Mar 03, 2015 at 01:43:46PM +0000, Mel Gorman wrote:
> On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> > On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner wrote:
> > > >>
> > > >> But are those migrate-page calls really common enough to make these
> > > >> things happen often enough on the same pages for this all to matter?
> > > >
> > > > It's looking like that's a possibility.
> > >
> > > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > > re-introduced the "pte was already NUMA" case.
> > >
> > > So that's not it either, afaik. Plus your numbers seem to say that
> > > it's really "migrate_pages()" that is done more. So it feels like the
> > > numa balancing isn't working right.
> >
> > So that should show up in the vmstats, right? Oh, and there's a
> > tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
> >
>
> The stats indicate both more updates and more faults. Can you try this
> please? It's against 4.0-rc1.
>
> ---8<---
> mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Makes no noticeable difference to behaviour or performance. Stats:

   359,857      migrate:mm_migrate_pages                  ( +- 5.54% )

numa_hit 36026802
numa_miss 14287
numa_foreign 14287
numa_interleave 18408
numa_local 36006052
numa_other 35037
numa_pte_updates 81803359
numa_huge_pte_updates 0
numa_hint_faults 79810798
numa_hint_faults_local 21227730
numa_pages_migrated 32037516
pgmigrate_success 32037516
pgmigrate_fail 0

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Tue, Mar 03, 2015 at 10:34:37PM +1100, Dave Chinner wrote:
> On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> > On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner wrote:
> > >>
> > >> But are those migrate-page calls really common enough to make these
> > >> things happen often enough on the same pages for this all to matter?
> > >
> > > It's looking like that's a possibility.
> >
> > Hmm. Looking closer, commit 10c1045f28e8 already should have
> > re-introduced the "pte was already NUMA" case.
> >
> > So that's not it either, afaik. Plus your numbers seem to say that
> > it's really "migrate_pages()" that is done more. So it feels like the
> > numa balancing isn't working right.
>
> So that should show up in the vmstats, right? Oh, and there's a
> tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:
>

The stats indicate both more updates and more faults. Can you try this
please? It's against 4.0-rc1.

---8<---
mm: numa: Reduce amount of IPI traffic due to automatic NUMA balancing

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the
  degradation is far worse when using the large memory footprint
  configs. Perf points straight at the cause - this is from 4.0-rc1
  on the "-o bhash=101073" config:

  -   56.07%    56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
     - default_send_IPI_mask_sequence_phys
        - 99.99% physflat_send_IPI_mask
           - 99.37% native_send_call_func_ipi
                smp_call_function_many
              - native_flush_tlb_others
                 - 99.85% flush_tlb_page
                      ptep_clear_flush
                      try_to_unmap_one
                      rmap_walk
                      try_to_unmap
                      migrate_pages
                      migrate_misplaced_page
                    - handle_mm_fault
                       - 99.73% __do_page_fault
                            trace_do_page_fault
                            do_async_page_fault
                          + async_page_fault
             0.63% native_send_call_func_single_ipi
                generic_exec_single
                smp_call_function_single

This was bisected to commit 4d94246699 ("mm: convert p[te|md]_mknonnuma
and remaining page table manipulations") but I expect the real issue is
the full series up to and including that patch.

There are two important changes that might be relevant here. The first
is that marking huge PMDs to trap a hinting fault potentially sends an
IPI to flush TLBs. This did not show up in Dave's report and it almost
certainly is not a factor, but it would affect IPI counts for other
users. The second is that the PTE protection update now clears the PTE,
leaving a window where parallel faults can be trapped, resulting in more
overhead from faults. Higher fault counts, even if correct, can
indirectly drive up scan rates and may explain what Dave is seeing.

This is not signed off or tested.
---
 mm/huge_memory.c | 11 +++++++++--
 mm/mprotect.c    | 17 +++++++++++++++--
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8cb5a82..7fc4732c77d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1494,8 +1494,15 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 
 	if (!prot_numa || !pmd_protnone(*pmd)) {
-		ret = 1;
-		entry = pmdp_get_and_clear_notify(mm, addr, pmd);
+		/*
+		 * NUMA hinting update can avoid a clear and flush as
+		 * it is not a functional correctness issue if access
+		 * occurs after the update
+		 */
+		if (prot_numa)
+			entry = *pmd;
+		else
+			entry = pmdp_get_and_clear_notify(mm, addr, pmd);
 		entry = pmd_modify(entry, newprot);
 		ret = HPAGE_PMD_NR;
 		set_pmd_at(mm, addr, pmd, entry);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 44727811bf4c..1efd03ffa0d8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,19 +77,32 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			pte_t ptent;
 
 			/*
-			 * Avoid trapping faults against the zero or KSM
-			 * pages. See similar comment in change_huge_pmd.
+			 * prot_numa does not clear the pte during protection
+			 * update as asynchronous hardware updates are not
+			 * a concern but unnecessary faults while the PTE is
+			 * cleared is overhead.
 			 */
 			if (prot_numa) {
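[The mprotect.c hunk is cut off in the archive. For reference, a rough
sketch of the pte-side idea the changelog describes - rewrite the
protection in place for prot_numa instead of clear + flush - as it might
look inside change_pte_range()'s pte loop. This is an illustration under
those assumptions, not Mel's literal patch:]

	pte_t oldpte = *pte;

	if (prot_numa) {
		/* Rewrite protection in place: no pte clear, no fault
		 * window, and no remote TLB flush.  A stale TLB entry
		 * permitting one more access is harmless for hinting. */
		set_pte_at(mm, addr, pte, pte_modify(oldpte, newprot));
	} else {
		/* Existing clear-modify-commit path for real mprotect() */
		pte_t ptent = ptep_modify_prot_start(mm, addr, pte);

		ptent = pte_modify(ptent, newprot);
		ptep_modify_prot_commit(mm, addr, pte, ptent);
	}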
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Mon, Mar 02, 2015 at 10:56:14PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner wrote:
> >>
> >> But are those migrate-page calls really common enough to make these
> >> things happen often enough on the same pages for this all to matter?
> >
> > It's looking like that's a possibility.
>
> Hmm. Looking closer, commit 10c1045f28e8 already should have
> re-introduced the "pte was already NUMA" case.
>
> So that's not it either, afaik. Plus your numbers seem to say that
> it's really "migrate_pages()" that is done more. So it feels like the
> numa balancing isn't working right.

So that should show up in the vmstats, right? Oh, and there's a
tracepoint in migrate_pages, too. Same 6x10s samples in phase 3:

3.19:

    55,898      migrate:mm_migrate_pages

And a sample of the events shows 99.99% of these are:

  mm_migrate_pages: nr_succeeded=1 nr_failed=0 mode=MIGRATE_ASYNC reason=

4.0-rc1:

   364,442      migrate:mm_migrate_pages

They are also single page MIGRATE_ASYNC events like for 3.19.

And 'grep "numa\|migrate" /proc/vmstat' output for the entire
xfs_repair run:

3.19:

numa_hit 5163221
numa_miss 121274
numa_foreign 121274
numa_interleave 12116
numa_local 5153127
numa_other 131368
numa_pte_updates 36482466
numa_huge_pte_updates 0
numa_hint_faults 34816515
numa_hint_faults_local 9197961
numa_pages_migrated 1228114
pgmigrate_success 1228114
pgmigrate_fail 0

4.0-rc1:

numa_hit 36952043
numa_miss 92471
numa_foreign 92471
numa_interleave 10964
numa_local 36927384
numa_other 117130
numa_pte_updates 84010995
numa_huge_pte_updates 0
numa_hint_faults 81697505
numa_hint_faults_local 21765799
numa_pages_migrated 32916316
pgmigrate_success 32916316
pgmigrate_fail 0

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Mon, Mar 2, 2015 at 9:20 PM, Dave Chinner wrote:
>>
>> But are those migrate-page calls really common enough to make these
>> things happen often enough on the same pages for this all to matter?
>
> It's looking like that's a possibility.

Hmm. Looking closer, commit 10c1045f28e8 already should have
re-introduced the "pte was already NUMA" case.

So that's not it either, afaik. Plus your numbers seem to say that
it's really "migrate_pages()" that is done more. So it feels like the
numa balancing isn't working right.

But I'm not seeing what would cause that in that commit. It really all
looks the same to me. The few special-cases it drops get re-introduced
later (although in a different form).

Mel, do you see what I'm missing?

                 Linus
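[For reference: the "pte was already NUMA" case referred to above looks
roughly like this in mm/mprotect.c:change_pte_range() after commit
10c1045f28e8 - a paraphrase, not the verbatim kernel source. With
PROT_NONE-based hinting, the old pte_numa() test becomes pte_protnone();
oldpte is the pte value read inside the loop:]

	if (prot_numa) {
		struct page *page;

		page = vm_normal_page(vma, addr, oldpte);
		if (!page || PageKsm(page))
			continue;	/* don't trap zero/KSM pages */

		/* Avoid the pte rewrite (and the TLB flush it forces)
		 * if the pte is already marked for a hinting fault. */
		if (pte_protnone(oldpte))
			continue;
	}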
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Mon, Mar 02, 2015 at 06:37:47PM -0800, Linus Torvalds wrote:
> On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds wrote:
> >
> > There might be some other case where the new "just change the
> > protection" doesn't do the "oh, but if the protection didn't change,
> > don't bother flushing". I don't see it.
>
> Hmm. I wonder.. In change_pte_range(), we just unconditionally change
> the protection bits.
>
> But the old numa code used to do
>
>         if (!pte_numa(oldpte)) {
>                 ptep_set_numa(mm, addr, pte);
>
> so it would actually avoid the pte update if a numa-prot page was
> marked numa-prot again.
>
> But are those migrate-page calls really common enough to make these
> things happen often enough on the same pages for this all to matter?

It's looking like that's a possibility. I am running a fake-numa=4
config on this test VM so it's got 4 nodes of 4p/4GB RAM each.

Both kernels are running through the same page fault path, and that is
straight through migrate_pages().

3.19:

  13.70%  0.01%  [kernel]  [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 98.58% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 96.88% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               + async_page_fault
            + 3.12% __get_user_pages
      + 1.40% flush_tlb_mm_range

4.0-rc1:

-  67.12%  0.04%  [kernel]  [k] native_flush_tlb_others
   - native_flush_tlb_others
      - 99.80% flush_tlb_page
           ptep_clear_flush
           try_to_unmap_one
           rmap_walk
           try_to_unmap
           migrate_pages
           migrate_misplaced_page
         - handle_mm_fault
            - 99.50% __do_page_fault
                 trace_do_page_fault
                 do_async_page_fault
               - async_page_fault

Same call chain, just a lot more CPU used further down the stack.

> Odd.
>
> So it would be good if your profiles just show "there's suddenly a
> *lot* more calls to flush_tlb_page() from XYZ" and the culprit is
> obvious that way..

Ok, I did a simple 'perf stat -e tlb:tlb_flush -a -r 6 sleep 10' to
count all the tlb flush events from the kernel. I then pulled the full
events for a 30s period to get a sampling of the reason associated with
each flush event.

4.0-rc1:

 Performance counter stats for 'system wide' (6 runs):

         2,190,503      tlb:tlb_flush               ( +- 8.30% )

      10.001970663 seconds time elapsed             ( +- 0.00% )

The reason breakdown:

	81% TLB_REMOTE_SHOOTDOWN
	19% TLB_FLUSH_ON_TASK_SWITCH

3.19:

 Performance counter stats for 'system wide' (6 runs):

           467,151      tlb:tlb_flush               ( +- 25.50% )

      10.002021491 seconds time elapsed             ( +- 0.00% )

The reason breakdown:

	 6% TLB_REMOTE_SHOOTDOWN
	94% TLB_FLUSH_ON_TASK_SWITCH

The difference would appear to be the number of remote TLB shootdowns
that are occurring from otherwise identical page fault paths.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
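[For scale: 81% of ~2.19M flushes is roughly 1.77M remote shootdowns per
10 seconds on 4.0-rc1, versus ~28k (6% of ~467k) on 3.19 - around a 60x
increase. The reason strings in the breakdown come from the kernel's
tlb_flush_reason enum; a paraphrase of its definition circa v3.19/v4.0
follows (comments added here; check include/linux/mm_types.h in your
tree):]

	/* Values behind the "reason" field of the tlb:tlb_flush
	 * tracepoint sampled above (include/linux/mm_types.h). */
	enum tlb_flush_reason {
		TLB_FLUSH_ON_TASK_SWITCH,  /* flush on context switch */
		TLB_REMOTE_SHOOTDOWN,      /* this CPU was IPI'd by another */
		TLB_LOCAL_SHOOTDOWN,       /* explicit local flush */
		TLB_LOCAL_MM_SHOOTDOWN,    /* explicit local flush of one mm */
		NR_TLB_FLUSH_REASONS,
	};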
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Mon, Mar 2, 2015 at 6:22 PM, Linus Torvalds wrote:
>
> There might be some other case where the new "just change the
> protection" doesn't do the "oh, but if the protection didn't change,
> don't bother flushing". I don't see it.

Hmm. I wonder.. In change_pte_range(), we just unconditionally change
the protection bits.

But the old numa code used to do

        if (!pte_numa(oldpte)) {
                ptep_set_numa(mm, addr, pte);

so it would actually avoid the pte update if a numa-prot page was
marked numa-prot again.

But are those migrate-page calls really common enough to make these
things happen often enough on the same pages for this all to matter?

Odd.

So it would be good if your profiles just show "there's suddenly a
*lot* more calls to flush_tlb_page() from XYZ" and the culprit is
obvious that way..

                 Linus
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Mon, Mar 2, 2015 at 5:47 PM, Dave Chinner wrote:
>
> Anyway, the difference between good and bad is pretty clear, so
> I'm pretty confident the bisect is solid:
>
> 4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit

Well, it's the mm queue from Andrew, so I'm not surprised.

That said, I don't see why that particular one should matter.

Hmm. In your profiles, can you tell which caller of "flush_tlb_page()"
changed the most?

The change from "mknnuma" to "prot_none" *should* be 100% equivalent
(both just change the page to be not-present, just set different bits
elsewhere in the pte), but clearly something wasn't.

Oh. Except for that "huge-zero-page" special case that got dropped,
but that got re-introduced in commit e944fd67b625.

There might be some other case where the new "just change the
protection" doesn't do the "oh, but if the protection didn't change,
don't bother flushing". I don't see it.

                 Linus
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Mon, Mar 02, 2015 at 11:47:52AM -0800, Linus Torvalds wrote:
> On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner wrote:
> >
> > Across the board the 4.0-rc1 numbers are much slower, and the
> > degradation is far worse when using the large memory footprint
> > configs. Perf points straight at the cause - this is from 4.0-rc1
> > on the "-o bhash=101073" config:
> >
> > -   56.07%    56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> >    - 99.99% physflat_send_IPI_mask
> >       - 99.37% native_send_call_func_ipi
> ..
> >
> > And the same profile output from 3.19 shows:
> >
> > -    9.61%     9.61%  [kernel]  [k] default_send_IPI_mask_sequence_phys
> >    - 99.98% physflat_send_IPI_mask
> >       - 96.26% native_send_call_func_ipi
> ...
> >
> > So either there's been a massive increase in the number of IPIs
> > being sent, or the cost per IPI has greatly increased. Either way,
> > the result is a pretty significant performance degradation.
>
> I assume it's the mm queue from Andrew, so adding him to the cc. There
> are changes to the page migration etc, which could explain it.
>
> There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
> really could be just that the IPI sending itself has gotten much
> slower. Adding Ingo for that, although I don't think
> default_send_IPI_mask_sequence_phys() itself has actually changed,
> only other things around the apic. So I'd be inclined to blame the mm
> changes.
>
> Obviously bisection would find it..

Yes, though the time it takes to do a 13-step bisection means it's
something I don't do just for an initial bug report. ;)

Anyway, the difference between good and bad is pretty clear, so I'm
pretty confident the bisect is solid:

4d9424669946532be754a6e116618dcb58430cb4 is the first bad commit
commit 4d9424669946532be754a6e116618dcb58430cb4
Author: Mel Gorman
Date:   Thu Feb 12 14:58:28 2015 -0800

    mm: convert p[te|md]_mknonnuma and remaining page table manipulations

    With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przyw...@arm.com: fix compiler warning in pmdp_invalidate()]
    [a...@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

:040000 040000 50985a3f84e80bb2bdd049d4f34739d99436f988 1bc79bfac2c138844373b603f9bc5914f0d010f3 M	arch
:040000 040000 ea69bcd1c59f832a4b012a57b4eb1d0c7516947d 0822692fa6c356952e723b56038585716fa51723 M	include
:040000 040000 c11960b9f1ee72edb08dc3fdc46f590fb1d545f7 f5d17ff5b639adcb7363a196a9efe70f2a7312b5 M	mm

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Sun, Mar 1, 2015 at 5:04 PM, Dave Chinner wrote:
>
> Across the board the 4.0-rc1 numbers are much slower, and the
> degradation is far worse when using the large memory footprint
> configs. Perf points straight at the cause - this is from 4.0-rc1
> on the "-o bhash=101073" config:
>
> -   56.07%    56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
>    - 99.99% physflat_send_IPI_mask
>       - 99.37% native_send_call_func_ipi
..
>
> And the same profile output from 3.19 shows:
>
> -    9.61%     9.61%  [kernel]  [k] default_send_IPI_mask_sequence_phys
>    - 99.98% physflat_send_IPI_mask
>       - 96.26% native_send_call_func_ipi
...
>
> So either there's been a massive increase in the number of IPIs
> being sent, or the cost per IPI has greatly increased. Either way,
> the result is a pretty significant performance degradation.

And on Mon, Mar 2, 2015 at 11:17 AM, Matt wrote:
>
> Linus already posted a fix to the problem, however I can't seem to
> find the matching commit in his tree (searching for "TLC regression"
> or "TLB cache").

That was commit f045bbb9fa1b, which was then refined by commit
721c21c17ab9, because it turned out that ARM64 had a very subtle
relationship with tlb->end and fullmm.

But both of those hit 3.19, so none of this should affect 4.0-rc1.
There's something else going on.

I assume it's the mm queue from Andrew, so adding him to the cc. There
are changes to the page migration etc, which could explain it.

There are also a fair amount of APIC changes in 4.0-rc1, so I guess it
really could be just that the IPI sending itself has gotten much
slower. Adding Ingo for that, although I don't think
default_send_IPI_mask_sequence_phys() itself has actually changed, only
other things around the apic. So I'd be inclined to blame the mm
changes.

Obviously bisection would find it..

                 Linus
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On Mon, Mar 2, 2015 at 8:25 PM, Dave Hansen wrote:
> On 03/02/2015 11:17 AM, Matt wrote:
>> Linus already posted a fix to the problem, however I can't seem to
>> find the matching commit in his tree (searching for "TLC regression"
>> or "TLB cache").
>
> It's in 721c21c17ab958abf19a8fc611c3bd4743680e38 iirc.

Mea culpa, I should have looked at the date of the thread - I was just
grasping at straws in an attempt to help :/

I'll refrain from posting in this thread then, to avoid clutter & load
on the list (this is way over my head, I'm mostly doing minor patch
porting and custom kernels as a hobby).

Kind Regards

Matt
Re: [regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
On 03/02/2015 11:17 AM, Matt wrote:
> Linus already posted a fix to the problem, however I can't seem to
> find the matching commit in his tree (searching for "TLC regression"
> or "TLB cache").

It's in 721c21c17ab958abf19a8fc611c3bd4743680e38 iirc.
[regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
Hi Dave,

is the following thread and patch related to your problem? I just
happened to stumble upon it a few days ago:

https://lkml.org/lkml/2014/12/17/280
http://marc.info/?l=linux-kernel&m=141876582909898&w=2

Re: post-3.18 performance regression in TLB flushing code

Linus already posted a fix to the problem, however I can't seem to
find the matching commit in his tree (searching for "TLC regression"
or "TLB cache").

Kind Regards

Matt
[regression v4.0-rc1] mm: IPIs from TLB flushes causing significant performance degradation.
Hi folks,

Running one of my usual benchmarks (fsmark to create 50 million zero
length files in a 500TB filesystem, then running xfs_repair on it) has
indicated a significant regression in xfs_repair performance.

config                          3.19     4.0-rc1
defaults                        8m08s    9m34s
-o ag_stride=-1                 4m04s    4m38s
-o bhash=101073                 6m04s    17m43s
-o ag_stride=-1,bhash=101073    4m54s    9m58s

The default is to create a number of concurrent threads that process
AGs in parallel (https://lkml.org/lkml/2014/7/3/15), and this is
running on a 500 AG filesystem, so there's lots of parallelism. "-o
ag_stride=-1" turns this off and just leaves a single prefetch group
working on AGs sequentially. As you can see, turning off the
concurrency halves the runtime. The concurrency is really there for
large spinning disk arrays, where IO wait time dominates performance.
I'm running on SSDs, so there is almost no IO wait time.

"-o bhash=X" controls the size of the buffer cache. The default value
is 4096, which means xfs_repair is operating with a memory footprint
of about 1GB and is small enough to suffer from readahead thrashing on
large filesystems. Setting it to 101073 increases that to around
7-10GB and prevents readahead thrashing, so it should run much faster
than the default concurrent config. It does run faster on 3.19, but on
4.0-rc1 it runs almost twice as slow, and burns a huge amount of
system CPU time doing so.

Across the board the 4.0-rc1 numbers are much slower, and the
degradation is far worse when using the large memory footprint
configs. Perf points straight at the cause - this is from 4.0-rc1 on
the "-o bhash=101073" config:

-   56.07%    56.07%  [kernel]  [k] default_send_IPI_mask_sequence_phys
   - default_send_IPI_mask_sequence_phys
      - 99.99% physflat_send_IPI_mask
         - 99.37% native_send_call_func_ipi
              smp_call_function_many
            - native_flush_tlb_others
               - 99.85% flush_tlb_page
                    ptep_clear_flush
                    try_to_unmap_one
                    rmap_walk
                    try_to_unmap
                    migrate_pages
                    migrate_misplaced_page
                  - handle_mm_fault
                     - 99.73% __do_page_fault
                          trace_do_page_fault
                          do_async_page_fault
                        + async_page_fault
           0.63% native_send_call_func_single_ipi
              generic_exec_single
              smp_call_function_single

And the same profile output from 3.19 shows:

-    9.61%     9.61%  [kernel]  [k] default_send_IPI_mask_sequence_phys
   - default_send_IPI_mask_sequence_phys
      - 99.98% physflat_send_IPI_mask
         - 96.26% native_send_call_func_ipi
              smp_call_function_many
            - native_flush_tlb_others
               - 98.44% flush_tlb_page
                    ptep_clear_flush
                    try_to_unmap_one
                    rmap_walk
                    try_to_unmap
                    migrate_pages
                    migrate_misplaced_page
                    handle_mm_fault
               + 1.56% flush_tlb_mm_range
      + 3.74% native_send_call_func_single_ipi

So either there's been a massive increase in the number of IPIs being
sent, or the cost per IPI has greatly increased. Either way, the
result is a pretty significant performance degradation.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
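[For reference, not part of Dave's report: the IPI volume can also be
corroborated without perf by diffing the "TLB shootdowns" row of
/proc/interrupts over an interval. A minimal sketch, assuming the usual
x86 /proc/interrupts layout with a "TLB:" row of per-CPU counters:]

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	/* Sum the per-CPU "TLB shootdowns" counters from /proc/interrupts. */
	static long long read_tlb_shootdowns(void)
	{
		FILE *f = fopen("/proc/interrupts", "r");
		char line[8192];
		long long sum = -1;

		if (!f)
			return -1;
		while (fgets(line, sizeof(line), f)) {
			char *tag = strstr(line, "TLB:");
			char *p, *end;

			if (!tag)
				continue;
			sum = 0;
			/* Parse the per-CPU counts after the "TLB:" tag;
			 * strtoll stops cleanly at the trailing text. */
			for (p = tag + 4; ; p = end) {
				long long v = strtoll(p, &end, 10);

				if (end == p)
					break;
				sum += v;
			}
			break;
		}
		fclose(f);
		return sum;
	}

	int main(void)
	{
		long long a, b;

		a = read_tlb_shootdowns();
		sleep(10);
		b = read_tlb_shootdowns();

		if (a < 0 || b < 0) {
			fprintf(stderr, "no TLB: row in /proc/interrupts\n");
			return 1;
		}
		printf("TLB shootdown IPIs: %lld in 10s (%.0f/s)\n",
		       b - a, (double)(b - a) / 10.0);
		return 0;
	}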