Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, 2018-08-27 at 19:02 +1000, Nicholas Piggin wrote:
> > More tlbies ? With the cost of the broadcasts on the fabric ? I don't
> > think so.. or I'm not understanding your point...
>
> More tlbies are no good, but there will be some places where it works
> out much better (and fewer tlbies). Worst possible case for current code
> is a big unmap with lots of scattered page sizes. We _should_ get that
> with just a single PID flush at the end, but what we will get today is
> a bunch of PID and VA flushes.
>
> I don't propose doing that though, I'd rather be explicit about
> tracking start and end range of each page size. Still not "optimal"
> but neither is existing single range for sparse mappings... anyway it
> will need to be profiled, but my point is we don't really fit exactly
> what x86/arm want.

If we have an arch specific part, we could just remember up to N
"large" pages there without actually flushing, and if that overflows,
upgrade to a full flush.

Cheers,
Ben.
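A minimal sketch of the idea Ben describes, for illustration only; struct
mmu_gather_arch, MAX_DEFERRED_HUGE and the field names below are invented
for this example and are not existing kernel API:

/*
 * Hypothetical arch-private part of struct mmu_gather: remember up to
 * N deferred huge-page invalidates; overflow upgrades to a full flush.
 */
#define MAX_DEFERRED_HUGE	8

struct mmu_gather_arch {
        unsigned long   deferred_va[MAX_DEFERRED_HUGE]; /* page-aligned VAs */
        unsigned int    nr_deferred;
        bool            need_full_flush;
};

static inline void arch_track_huge_page(struct mmu_gather_arch *atlb,
                                        unsigned long addr)
{
        if (atlb->need_full_flush)
                return;
        if (atlb->nr_deferred == MAX_DEFERRED_HUGE) {
                /* too many deferred entries: flush the whole PID instead */
                atlb->need_full_flush = true;
                return;
        }
        atlb->deferred_va[atlb->nr_deferred++] = addr;
}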
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Thu, Aug 30, 2018 at 12:13:50AM +, Vineet Gupta wrote:
> On 08/27/2018 04:00 AM, Peter Zijlstra wrote:
> >
> > The one obvious thing SH and ARM want is a sensible default for
> > tlb_start_vma(). (also: https://lkml.org/lkml/2004/1/15/6 )
> >
> > The below makes tlb_start_vma() default to flush_cache_range(), which
> > should be right and sufficient. The only exceptions that I found were
> > (oddly):
> >
> >   - m68k-mmu
> >   - sparc64
> >   - unicore
> >
> > Those architectures appear to have a non-NOP flush_cache_range(), but
> > their current tlb_start_vma() does not call it.
>
> So indeed we follow DaveM's insight from 2004 about tlb_{start,end}_vma()
> and those are no-ops for ARC for the general case. For the historic VIPT
> aliasing dcache they are what they should be per the 2004 link above - I
> presume that is all hunky dory with you ?

Yep, I was just confused about those 3 architectures having
flush_cache_range() but not calling it from tlb_start_vma(). The
regular VIPT aliasing thing is all good. And not having them is also
fine.

> > Furthermore, I think tlb_flush() is broken on arc and parisc; in
> > particular they don't appear to have any TLB invalidate for the
> > shift_arg_pages() case, where we do not call tlb_*_vma() and fullmm=0.
>
> Care to explain this issue a bit more ?
> And that is independent of the current discussion.

> > Possibly shift_arg_pages() should be fixed instead.

So the problem is that shift_arg_pages() does not call tlb_start_vma()
/ tlb_end_vma(). It also has fullmm=0. Therefore, on ARC, there will
not be a TLB invalidate at all when freeing the page-tables.

And while looking at this more, move_page_tables() also looks dodgy; I
think it always needs a TLB flush of the entire 'old' range.

> > diff --git a/arch/arc/include/asm/tlb.h b/arch/arc/include/asm/tlb.h
> > index a9db5f62aaf3..7af2b373ebe7 100644
> > --- a/arch/arc/include/asm/tlb.h
> > +++ b/arch/arc/include/asm/tlb.h
> > @@ -23,15 +23,6 @@ do {						\
> >   *
> >   * Note, read http://lkml.org/lkml/2004/1/15/6
> >   */
> > -#ifndef CONFIG_ARC_CACHE_VIPT_ALIASING
> > -#define tlb_start_vma(tlb, vma)
> > -#else
> > -#define tlb_start_vma(tlb, vma)				\
> > -do {							\
> > -	if (!tlb->fullmm)				\
> > -		flush_cache_range(vma, vma->vm_start, vma->vm_end);	\
> > -} while(0)
> > -#endif
> >
> >  #define tlb_end_vma(tlb, vma)				\
> >  do {							\

[snip..]

> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index e811ef7b8350..1d037fd5bb7a 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -181,19 +181,21 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
> >   * the vmas are adjusted to only cover the region to be torn down.
> >   */
> >  #ifndef tlb_start_vma
> > -#define tlb_start_vma(tlb, vma) do { } while (0)
> > +#define tlb_start_vma(tlb, vma)				\
> > +do {							\
> > +	if (!tlb->fullmm)				\
> > +		flush_cache_range(vma, vma->vm_start, vma->vm_end);	\
> > +} while (0)
> >  #endif
>
> So for non aliasing arches to be not affected, this relies on
> flush_cache_range() to be no-op ?

Yes; a cursory inspection shows this to be so. With 'obvious' exception
from the above 3 architectures.

> > -#define __tlb_end_vma(tlb, vma)				\
> > -	do {							\
> > -		if (!tlb->fullmm && tlb->end) {			\
> > -			tlb_flush(tlb);				\
> > -			__tlb_reset_range(tlb);			\
> > -		}						\
> > -	} while (0)
> > -
> >  #ifndef tlb_end_vma
> > -#define tlb_end_vma	__tlb_end_vma
> > +#define tlb_end_vma(tlb, vma)				\
> > +	do {							\
> > +		if (!tlb->fullmm && tlb->end) {			\
> > +			tlb_flush(tlb);				\
> > +			__tlb_reset_range(tlb);			\
> > +		}						\
> > +	} while (0)
> >  #endif
> >
> >  #ifndef __tlb_remove_tlb_entry
>
> And this one is for shift_arg_pages() but will also cause extraneous
> flushes for other cases - not happening currently !

No, shift_arg_pages() d
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On 08/27/2018 04:00 AM, Peter Zijlstra wrote:
>
> The one obvious thing SH and ARM want is a sensible default for
> tlb_start_vma(). (also: https://lkml.org/lkml/2004/1/15/6 )
>
> The below makes tlb_start_vma() default to flush_cache_range(), which
> should be right and sufficient. The only exceptions that I found were
> (oddly):
>
>   - m68k-mmu
>   - sparc64
>   - unicore
>
> Those architectures appear to have a non-NOP flush_cache_range(), but
> their current tlb_start_vma() does not call it.

So indeed we follow DaveM's insight from 2004 about tlb_{start,end}_vma()
and those are no-ops for ARC for the general case. For the historic VIPT
aliasing dcache they are what they should be per the 2004 link above - I
presume that is all hunky dory with you ?

> Furthermore, I think tlb_flush() is broken on arc and parisc; in
> particular they don't appear to have any TLB invalidate for the
> shift_arg_pages() case, where we do not call tlb_*_vma() and fullmm=0.

Care to explain this issue a bit more ?
And that is independent of the current discussion.

> Possibly shift_arg_pages() should be fixed instead.
>
> Some archs (nds32,sparc32) avoid this by having an unconditional
> flush_tlb_mm() in tlb_flush(), which seems somewhat suboptimal if you
> have flush_tlb_range(). TLB_FLUSH_VMA() might be an option, however
> hideous it is.
>
> ---
>
> diff --git a/arch/arc/include/asm/tlb.h b/arch/arc/include/asm/tlb.h
> index a9db5f62aaf3..7af2b373ebe7 100644
> --- a/arch/arc/include/asm/tlb.h
> +++ b/arch/arc/include/asm/tlb.h
> @@ -23,15 +23,6 @@ do {						\
>   *
>   * Note, read http://lkml.org/lkml/2004/1/15/6
>   */
> -#ifndef CONFIG_ARC_CACHE_VIPT_ALIASING
> -#define tlb_start_vma(tlb, vma)
> -#else
> -#define tlb_start_vma(tlb, vma)				\
> -do {							\
> -	if (!tlb->fullmm)				\
> -		flush_cache_range(vma, vma->vm_start, vma->vm_end);	\
> -} while(0)
> -#endif
>
>  #define tlb_end_vma(tlb, vma)				\
>  do {							\

[snip..]

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index e811ef7b8350..1d037fd5bb7a 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -181,19 +181,21 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
>   * the vmas are adjusted to only cover the region to be torn down.
>   */
>  #ifndef tlb_start_vma
> -#define tlb_start_vma(tlb, vma) do { } while (0)
> +#define tlb_start_vma(tlb, vma)				\
> +do {							\
> +	if (!tlb->fullmm)				\
> +		flush_cache_range(vma, vma->vm_start, vma->vm_end);	\
> +} while (0)
>  #endif

So for non aliasing arches to be not affected, this relies on
flush_cache_range() to be no-op ?

> -#define __tlb_end_vma(tlb, vma)				\
> -	do {							\
> -		if (!tlb->fullmm && tlb->end) {			\
> -			tlb_flush(tlb);				\
> -			__tlb_reset_range(tlb);			\
> -		}						\
> -	} while (0)
> -
>  #ifndef tlb_end_vma
> -#define tlb_end_vma	__tlb_end_vma
> +#define tlb_end_vma(tlb, vma)				\
> +	do {							\
> +		if (!tlb->fullmm && tlb->end) {			\
> +			tlb_flush(tlb);				\
> +			__tlb_reset_range(tlb);			\
> +		}						\
> +	} while (0)
>  #endif
>
>  #ifndef __tlb_remove_tlb_entry

And this one is for shift_arg_pages() but will also cause extraneous
flushes for other cases - not happening currently !

-Vineet
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, 27 Aug 2018 09:36:50 -0400 Rik van Riel wrote:
> On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
>
> > It could do that. It requires a tlbie that matches the page size,
> > so it means 3 sizes. I think possibly even that would be better
> > than current code, but we could do better if we had a few specific
> > fields in there.
>
> Would it cause a noticeable overhead to keep track
> of which page sizes were removed, and to simply flush
> the whole TLB in the (unlikely?) event that multiple
> page sizes were removed in the same munmap?
>
> Once the unmap is so large that multiple page sizes
> were covered, you may already be looking at so many
> individual flush operations that a full flush might
> be faster.

It will take some profiling and measuring. Unmapping a small number of
huge pages plus a small number of surrounding small pages may not be
uncommon if THP is working well. That could become a lot more
expensive.

> Is there a point on PPC where simply flushing the
> whole TLB, and having other things be reloaded later,
> is faster than flushing every individual page mapping
> that got unmapped?

There is. For local TLB flushes that point is well over 100 individual
invalidates though. We're generally better off flushing all page sizes
for that case.

Thanks,
Nick
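A minimal sketch of the heuristic being discussed here, assuming
hypothetical names (size_tracker, page_size_mask, FULL_FLUSH_THRESHOLD);
none of these are existing mmu_gather fields:

/*
 * Illustrative only: track which page-size classes an munmap touched
 * and fall back to a full-mm flush when several sizes mix, or when the
 * number of individual invalidates would exceed a threshold (per Nick
 * above, well over 100 for local flushes on powerpc).
 */
#define FULL_FLUSH_THRESHOLD	100

struct size_tracker {
        unsigned int page_size_mask;    /* one bit per size: 4K/2M/1G... */
        unsigned int nr_invalidates;
};

static bool want_full_flush(const struct size_tracker *st)
{
        /* more than one bit set => multiple page sizes in this gather */
        bool mixed = st->page_size_mask & (st->page_size_mask - 1);

        return mixed || st->nr_invalidates > FULL_FLUSH_THRESHOLD;
}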
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> It could do that. It requires a tlbie that matches the page size,
> so it means 3 sizes. I think possibly even that would be better
> than current code, but we could do better if we had a few specific
> fields in there.

Would it cause a noticeable overhead to keep track of which page sizes
were removed, and to simply flush the whole TLB in the (unlikely?)
event that multiple page sizes were removed in the same munmap?

Once the unmap is so large that multiple page sizes were covered, you
may already be looking at so many individual flush operations that a
full flush might be faster.

Is there a point on PPC where simply flushing the whole TLB, and having
other things be reloaded later, is faster than flushing every
individual page mapping that got unmapped?

--
All Rights Reversed.
Re: removing ia64, was: Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
I cannot speak to how widely it has been adopted, but the linux
(kernel) package for version 4.17.17 has been successfully built and
installed for ia64 under Debian ports. There is clearly more work to do
to get ia64 rehabilitated, but there are over 10,000 packages currently
successfully built for ia64 under Debian ports[1].

Jason

[1] https://buildd.debian.org/status/architecture.php?a=ia64&suite=sid

On Mon, Aug 27, 2018 at 4:57 AM Christoph Hellwig wrote:
>
> On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> > sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> > arch/ia64' leaving us with s390.
>
> Is removing ia64 a serious plan? It is the cause for a fair share of
> oddities in dma land, and I did not have much luck getting maintainer
> replies lately, but I didn't know of a plan to get rid of it.
>
> What is the state of people still using ia64 mainline kernels vs just
> old distros in the still existing machines?
Re: removing ia64, was: Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, Aug 27, 2018 at 01:57:08AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> > sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> > arch/ia64' leaving us with s390.
>
> Is removing ia64 a serious plan?

I 'joked' about it a while ago on IRC, and aegl reacted that it might
not be entirely unreasonable.

> It is the cause for a fair share of
> oddities in dma land, and I did not have much luck getting maintainer
> replies lately, but I didn't know of a plan to get rid of it.
>
> What is the state of people still using ia64 mainline kernels vs just
> old distros in the still existing machines?

Both arjan and aegl said that the vast majority of people still running
ia64 machines run old distros.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> And there's only like 4 architectures that still have a custom
> mmu_gather:
>
>   - sh
>   - arm
>   - ia64
>   - s390
>
> sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> arch/ia64' leaving us with s390.

The one obvious thing SH and ARM want is a sensible default for
tlb_start_vma(). (also: https://lkml.org/lkml/2004/1/15/6 )

The below makes tlb_start_vma() default to flush_cache_range(), which
should be right and sufficient. The only exceptions that I found were
(oddly):

  - m68k-mmu
  - sparc64
  - unicore

Those architectures appear to have a non-NOP flush_cache_range(), but
their current tlb_start_vma() does not call it.

Furthermore, I think tlb_flush() is broken on arc and parisc; in
particular they don't appear to have any TLB invalidate for the
shift_arg_pages() case, where we do not call tlb_*_vma() and fullmm=0.

Possibly shift_arg_pages() should be fixed instead.

Some archs (nds32,sparc32) avoid this by having an unconditional
flush_tlb_mm() in tlb_flush(), which seems somewhat suboptimal if you
have flush_tlb_range(). TLB_FLUSH_VMA() might be an option, however
hideous it is.

---

diff --git a/arch/arc/include/asm/tlb.h b/arch/arc/include/asm/tlb.h
index a9db5f62aaf3..7af2b373ebe7 100644
--- a/arch/arc/include/asm/tlb.h
+++ b/arch/arc/include/asm/tlb.h
@@ -23,15 +23,6 @@ do {						\
  *
  * Note, read http://lkml.org/lkml/2004/1/15/6
  */
-#ifndef CONFIG_ARC_CACHE_VIPT_ALIASING
-#define tlb_start_vma(tlb, vma)
-#else
-#define tlb_start_vma(tlb, vma)				\
-do {							\
-	if (!tlb->fullmm)				\
-		flush_cache_range(vma, vma->vm_start, vma->vm_end);	\
-} while(0)
-#endif

 #define tlb_end_vma(tlb, vma)				\
 do {							\
diff --git a/arch/mips/include/asm/tlb.h b/arch/mips/include/asm/tlb.h
index b6823b9e94da..9d04b4649692 100644
--- a/arch/mips/include/asm/tlb.h
+++ b/arch/mips/include/asm/tlb.h
@@ -5,16 +5,6 @@
 #include
 #include

-/*
- * MIPS doesn't need any special per-pte or per-vma handling, except
- * we need to flush cache for area to be unmapped.
- */
-#define tlb_start_vma(tlb, vma)				\
-	do {						\
-		if (!tlb->fullmm)			\
-			flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-	} while (0)
-#define tlb_end_vma(tlb, vma) do { } while (0)
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)

 /*
diff --git a/arch/nds32/include/asm/tlb.h b/arch/nds32/include/asm/tlb.h
index b35ae5eae3ab..0bf7c9482381 100644
--- a/arch/nds32/include/asm/tlb.h
+++ b/arch/nds32/include/asm/tlb.h
@@ -4,12 +4,6 @@
 #ifndef __ASMNDS32_TLB_H
 #define __ASMNDS32_TLB_H

-#define tlb_start_vma(tlb,vma)				\
-	do {						\
-		if (!tlb->fullmm)			\
-			flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-	} while (0)
-
 #define tlb_end_vma(tlb,vma)				\
 	do {						\
 		if(!tlb->fullmm)			\
diff --git a/arch/nios2/include/asm/tlb.h b/arch/nios2/include/asm/tlb.h
index d3bc648e08b5..9b518c6d0f62 100644
--- a/arch/nios2/include/asm/tlb.h
+++ b/arch/nios2/include/asm/tlb.h
@@ -15,16 +15,6 @@

 extern void set_mmu_pid(unsigned long pid);

-/*
- * NiosII doesn't need any special per-pte or per-vma handling, except
- * we need to flush cache for the area to be unmapped.
- */
-#define tlb_start_vma(tlb, vma)				\
-	do {						\
-		if (!tlb->fullmm)			\
-			flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-	} while (0)
-
 #define tlb_end_vma(tlb, vma) do { } while (0)

 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
diff --git a/arch/parisc/include/asm/tlb.h b/arch/parisc/include/asm/tlb.h
index 0c881e74d8a6..b1984f9cd3af 100644
--- a/arch/parisc/include/asm/tlb.h
+++ b/arch/parisc/include/asm/tlb.h
@@ -7,11 +7,6 @@
 do {	if ((tlb)->fullmm)			\
 		flush_tlb_mm((tlb)->mm);	\
 } while (0)

-#define tlb_start_vma(tlb, vma)			\
-do {	if (!(tlb)->fullmm)			\
-		flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-} while (0)
-
 #define tlb_end_vma(tlb, vma)			\
 do {	if (!(tlb)->fullmm)			\
 		flush_tlb_range(vma, vma->vm_start, vma->vm_end);
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, 27 Aug 2018 18:09:50 +1000 Benjamin Herrenschmidt wrote:
> On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> > > Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> > > ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> > > do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> > > still correct for huge pages, inefficient maybe, but so is flushing
> > > every other page because 'sparse' transparent-huge-pages.
> >
> > It could do that. It requires a tlbie that matches the page size,
> > so it means 3 sizes. I think possibly even that would be better
> > than current code, but we could do better if we had a few specific
> > fields in there.
>
> More tlbies ? With the cost of the broadcasts on the fabric ? I don't
> think so.. or I'm not understanding your point...

More tlbies are no good, but there will be some places where it works
out much better (and fewer tlbies). Worst possible case for current
code is a big unmap with lots of scattered page sizes. We _should_ get
that with just a single PID flush at the end, but what we will get
today is a bunch of PID and VA flushes.

I don't propose doing that though, I'd rather be explicit about
tracking start and end range of each page size. Still not "optimal"
but neither is existing single range for sparse mappings... anyway it
will need to be profiled, but my point is we don't really fit exactly
what x86/arm want.

Thanks,
Nick
removing ia64, was: Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> arch/ia64' leaving us with s390.

Is removing ia64 a serious plan? It is the cause for a fair share of
oddities in dma land, and I did not have much luck getting maintainer
replies lately, but I didn't know of a plan to get rid of it.

What is the state of people still using ia64 mainline kernels vs just
old distros in the still existing machines?
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, 27 Aug 2018 10:20:45 +0200 Peter Zijlstra wrote:
> On Mon, Aug 27, 2018 at 06:09:50PM +1000, Benjamin Herrenschmidt wrote:
>
> > Sadly our architecture requires a precise match between the page size
> > specified in the tlbie instruction and the entry in the TLB or it won't
> > be flushed.
>
> Argh.. OK I see. That is rather unfortunate and does seem to require
> something along the lines of tlb_remove_check_page_size_change().

Or we can do better with some more of our own data in mmu_gather, but
things that probably few or no other architectures want. I've held off
trying to put any crap in generic code because there's other lower
hanging fruit still, but I'd really rather just give archs the ability
to put their own data in there.

I don't really see a downside to it (divergence of course, but the
existing proliferation of code is much harder to follow than some data
that would be maintained and used purely by the arch, and beats having
to implement entirely your own mmu_gather).

Thanks,
Nick
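One way such an arch opt-in could look - a sketch only; struct
arch_mmu_gather and CONFIG_ARCH_WANT_MMU_GATHER_DATA are invented names
for this example, not existing kernel symbols:

/* include/asm-generic/tlb.h (sketch) */
struct mmu_gather {
        struct mm_struct        *mm;
        unsigned long           start, end;
        unsigned int            fullmm : 1;
#ifdef CONFIG_ARCH_WANT_MMU_GATHER_DATA
        /*
         * Opaque arch-private state, defined by the architecture in
         * asm/tlb.h and touched only by arch code (e.g. powerpc/radix
         * could keep per-page-size flush ranges here), while the
         * batching machinery stays generic.
         */
        struct arch_mmu_gather  arch;
#endif
        /* ... generic batching machinery elided ... */
};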
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> > Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> > ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> > do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> > still correct for huge pages, inefficient maybe, but so is flushing
> > every other page because 'sparse' transparent-huge-pages.
>
> It could do that. It requires a tlbie that matches the page size,
> so it means 3 sizes. I think possibly even that would be better
> than current code, but we could do better if we had a few specific
> fields in there.

More tlbies ? With the cost of the broadcasts on the fabric ? I don't
think so.. or I'm not understanding your point...

Sadly our architecture requires a precise match between the page size
specified in the tlbie instruction and the entry in the TLB or it won't
be flushed.

Ben.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, Aug 27, 2018 at 06:09:50PM +1000, Benjamin Herrenschmidt wrote:
> Sadly our architecture requires a precise match between the page size
> specified in the tlbie instruction and the entry in the TLB or it won't
> be flushed.

Argh.. OK I see. That is rather unfortunate and does seem to require
something along the lines of tlb_remove_check_page_size_change().
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, 27 Aug 2018 09:47:01 +0200 Peter Zijlstra wrote:
> On Mon, Aug 27, 2018 at 03:00:08PM +1000, Nicholas Piggin wrote:
> > On Fri, 24 Aug 2018 13:39:53 +0200
> > Peter Zijlstra wrote:
> > > On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:
> > > >
> > > > Hurm.. look at commit:
> > > >
> > > > e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and
> > > > force flush if page size change")
> > >
> > > Ah, good, it seems that already got cleaned up a lot. But it all moved
> > > into the power code.. blergh.
> >
> > I lost track of what the problem is here?
>
> Aside from the commit above being absolute crap (which did get fixed up,
> luckily) I would really like to get rid of all arch specific mmu_gather.
>
> We can have opt-in bits to the generic code, but the endless back and
> forth between common and arch code is an utter pain in the arse.
>
> And there's only like 4 architectures that still have a custom
> mmu_gather:
>
>   - sh
>   - arm
>   - ia64
>   - s390
>
> sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> arch/ia64' leaving us with s390.

Well I don't see a big problem in having an arch_mmu_gather field or
small bits. powerpc would actually like that rather than trying to add
things it wants into generic code (and it wants more than just a few
flags bits, ideally).

> After that everyone uses the common code and we can clean up.
>
> > For powerpc, tlb_start_vma is not the right API to use for this because
> > it wants to deal with different page sizes within a vma.
>
> Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> still correct for huge pages, inefficient maybe, but so is flushing
> every other page because 'sparse' transparent-huge-pages.

It could do that. It requires a tlbie that matches the page size,
so it means 3 sizes. I think possibly even that would be better
than current code, but we could do better if we had a few specific
fields in there.

Thanks,
Nick
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Mon, Aug 27, 2018 at 03:00:08PM +1000, Nicholas Piggin wrote:
> On Fri, 24 Aug 2018 13:39:53 +0200
> Peter Zijlstra wrote:
> > On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:
> >
> > > Hurm.. look at commit:
> > >
> > > e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force
> > > flush if page size change")
> >
> > Ah, good, it seems that already got cleaned up a lot. But it all moved
> > into the power code.. blergh.
>
> I lost track of what the problem is here?

Aside from the commit above being absolute crap (which did get fixed
up, luckily) I would really like to get rid of all arch specific
mmu_gather.

We can have opt-in bits to the generic code, but the endless back and
forth between common and arch code is an utter pain in the arse.

And there's only like 4 architectures that still have a custom
mmu_gather:

  - sh
  - arm
  - ia64
  - s390

sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
arch/ia64' leaving us with s390.

After that everyone uses the common code and we can clean up.

> For powerpc, tlb_start_vma is not the right API to use for this because
> it wants to deal with different page sizes within a vma.

Yes.. I see that. tlb_remove_check_page_size_change() really is a
rather ugly thing, it can cause loads of TLB flushes. Do you really
_have_ to do that? The way ARM and x86 work is that using INVLPG in a
4K stride is still correct for huge pages, inefficient maybe, but so is
flushing every other page because 'sparse' transparent-huge-pages.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Fri, 24 Aug 2018 13:39:53 +0200 Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 24, 2018 at 10:47:17AM +0200, Peter Zijlstra wrote:
> > > On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> > > > The only problem with this approach is that we've lost track of the
> > > > granule size by the point we get to the tlb_flush(), so we can't
> > > > adjust the stride of the TLB invalidations for huge mappings, which
> > > > actually works nicely in the synchronous case (e.g. we perform a
> > > > single invalidation for a 2MB mapping, rather than iterating over it
> > > > at a 4k granule).
> > > >
> > > > One thing we could do is switch to synchronous mode if we detect a
> > > > change in granule (i.e. treat it like a batch failure).
> > >
> > > We could use tlb_start_vma() to track that, I think. Shouldn't be too
> > > hard.
> >
> > Hurm.. look at commit:
> >
> > e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force
> > flush if page size change")
>
> Ah, good, it seems that already got cleaned up a lot. But it all moved
> into the power code.. blergh.

I lost track of what the problem is here?

For powerpc, tlb_start_vma is not the right API to use for this because
it wants to deal with different page sizes within a vma.

Thanks,
Nick
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Fri, Aug 24, 2018 at 10:26:50AM -0700, Nadav Amit wrote:
> at 1:47 AM, Peter Zijlstra wrote:
>
> > On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> >> The only problem with this approach is that we've lost track of the
> >> granule size by the point we get to the tlb_flush(), so we can't
> >> adjust the stride of the TLB invalidations for huge mappings, which
> >> actually works nicely in the synchronous case (e.g. we perform a
> >> single invalidation for a 2MB mapping, rather than iterating over it
> >> at a 4k granule).
> >>
> >> One thing we could do is switch to synchronous mode if we detect a
> >> change in granule (i.e. treat it like a batch failure).
> >
> > We could use tlb_start_vma() to track that, I think. Shouldn't be too
> > hard.
>
> Somewhat unrelated, but I use this opportunity that TLB got your
> attention for something that has bothered me for some time.
> clear_fixmap(), which is used in various places (e.g., text_poke()),
> ends up doing only a local TLB flush (in __set_pte_vaddr()).
>
> Is that sufficient?

Urgh.. weren't the fixmaps per cpu?

Bah, I remember looking at this during PTI, but I seem to have
forgotten everything again.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
at 1:47 AM, Peter Zijlstra wrote:

> On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
>> The only problem with this approach is that we've lost track of the
>> granule size by the point we get to the tlb_flush(), so we can't
>> adjust the stride of the TLB invalidations for huge mappings, which
>> actually works nicely in the synchronous case (e.g. we perform a
>> single invalidation for a 2MB mapping, rather than iterating over it
>> at a 4k granule).
>>
>> One thing we could do is switch to synchronous mode if we detect a
>> change in granule (i.e. treat it like a batch failure).
>
> We could use tlb_start_vma() to track that, I think. Shouldn't be too
> hard.

Somewhat unrelated, but I use this opportunity that TLB got your
attention for something that has bothered me for some time.
clear_fixmap(), which is used in various places (e.g., text_poke()),
ends up doing only a local TLB flush (in __set_pte_vaddr()).

Is that sufficient?
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
Hi Peter,

On Fri, Aug 24, 2018 at 03:13:32PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 10:35:56AM +0200, Peter Zijlstra wrote:
>
> > Anyway, it's sorted now; although I'd like to write me a fairly big
> > comment in asm-generic/tlb.h about things, before I forget again.
>
> How's something like so? There's a little page_size thingy in this;
> mostly because I couldn't be arsed to split it for now.
>
> Will has opinions on the page_size thing; I'll let him explain.

They're not especially strong opinions, it's just that I don't think
the page size is necessarily the right thing to track and I'd rather
remove that altogether.

In the patches I've hacked up (I'll post shortly as an RFC), I track
the levels of page-table instead, so you can relate the mmu_gather
explicitly with the page-table structure, rather than have to infer it
from the page size. For example, if an architecture could put down huge
mappings at the pte level (e.g. using a contiguous hint in the pte like
we have on arm64), then actually you want to know about the level
rather than the size. You can also track the levels using only 4 bits
in the gather structure.

Finally, both approaches have a funny corner case when a VMA contains a
mixture of granule sizes. With the "page size has changed so flush
synchronously" approach you can theoretically end up with a lot of
flushes, where you'd have been better off just invalidating the whole
mm. If you track the levels instead and postpone a flush using the
smallest level you saw, then you're likely to hit whatever threshold
you have and nuke the mm.

Will
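A sketch of the level-tracking idea, for illustration; the struct and
helper names are invented here (the kernel macros PAGE_SIZE, PMD_SIZE,
PUD_SIZE and P4D_SIZE are real), and the exact bit layout is an
assumption:

/*
 * Illustrative: record which page-table levels had entries cleared,
 * using 4 bits in the gather structure, so tlb_flush() can pick the
 * smallest stride actually needed instead of inferring it from a
 * single page_size field.
 */
struct mmu_gather_levels {
        unsigned int    cleared_ptes : 1;       /* pte level (4k) */
        unsigned int    cleared_pmds : 1;       /* pmd level (e.g. 2M) */
        unsigned int    cleared_puds : 1;       /* pud level (e.g. 1G) */
        unsigned int    cleared_p4ds : 1;
};

static inline unsigned long flush_stride(const struct mmu_gather_levels *l)
{
        /* flush at the granularity of the smallest level we touched */
        if (l->cleared_ptes)
                return PAGE_SIZE;
        if (l->cleared_pmds)
                return PMD_SIZE;
        if (l->cleared_puds)
                return PUD_SIZE;
        return P4D_SIZE;
}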
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Fri, Aug 24, 2018 at 03:13:32PM +0200, Peter Zijlstra wrote:
> + * HAVE_RCU_TABLE_FREE
> + *
> + * This provides tlb_remove_table(), to be used instead of tlb_remove_page()
> + * for page directories (__p*_free_tlb()). This provides separate freeing of
> + * the page-table pages themselves in a semi-RCU fashion (see comment below).
> + * Useful if your architecture doesn't use IPIs for remote TLB invalidates
> + * and therefore doesn't naturally serialize with software page-table walkers.
> + *
> + * HAVE_RCU_TABLE_INVALIDATE
> + *
> + * This makes HAVE_RCU_TABLE_FREE call tlb_flush_mmu_tlbonly() before freeing
> + * the page-table pages. Required if you use HAVE_RCU_TABLE_FREE and your
> + * architecture uses the Linux page-tables natively.

Writing that also made me think we maybe should've negated that option.
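For context, the slow path these options govern looks roughly like this
(condensed from the hunks quoted elsewhere in this thread; simplified,
not the exact patch text):

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
        struct mmu_table_batch **batch = &tlb->batch;

        if (*batch == NULL) {
                *batch = (struct mmu_table_batch *)
                        __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
                if (*batch == NULL) {
                        /*
                         * Allocation failed: we cannot defer the free,
                         * so flush hardware walker caches right now
                         * (this is the call HAVE_RCU_TABLE_INVALIDATE
                         * enables) and wait out software walkers
                         * synchronously.
                         */
                        tlb_table_invalidate(tlb);
                        tlb_remove_table_one(table);
                        return;
                }
                (*batch)->nr = 0;
        }

        (*batch)->pages[(*batch)->nr++] = table;
        if ((*batch)->nr == MAX_TABLE_BATCH)
                tlb_table_flush(tlb);   /* invalidates before freeing */
}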
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Fri, Aug 24, 2018 at 10:35:56AM +0200, Peter Zijlstra wrote:

> Anyway, it's sorted now; although I'd like to write me a fairly big
> comment in asm-generic/tlb.h about things, before I forget again.

How's something like so? There's a little page_size thingy in this;
mostly because I couldn't be arsed to split it for now.

Will has opinions on the page_size thing; I'll let him explain.

---
 arch/Kconfig                   |   3 +
 arch/arm/include/asm/tlb.h     |   3 +-
 arch/ia64/include/asm/tlb.h    |   3 +-
 arch/powerpc/Kconfig           |   1 +
 arch/powerpc/include/asm/tlb.h |  17 --
 arch/s390/include/asm/tlb.h    |   4 +-
 arch/sh/include/asm/tlb.h      |   4 +-
 arch/um/include/asm/tlb.h      |   4 +-
 include/asm-generic/tlb.h      | 130 -
 mm/huge_memory.c               |   4 +-
 mm/hugetlb.c                   |   2 +-
 mm/madvise.c                   |   2 +-
 mm/memory.c                    |   9 ++-
 13 files changed, 137 insertions(+), 49 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..053c44703539 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -365,6 +365,9 @@ config HAVE_RCU_TABLE_FREE
 config HAVE_RCU_TABLE_INVALIDATE
 	bool

+config HAVE_MMU_GATHER_PAGE_SIZE
+	bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index f854148c8d7c..d644c3c7c6f3 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -286,8 +286,7 @@ tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr

 #define tlb_migrate_finish(mm)		do { } while (0)

-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
+static inline void tlb_change_page_size(struct mmu_gather *tlb,
 					unsigned int page_size)
 {
 }
diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index 516355a774bf..bf8985f5f876 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -282,8 +282,7 @@ do {						\
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
 	tlb_remove_tlb_entry(tlb, ptep, address)

-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
+static inline void tlb_change_page_size(struct mmu_gather *tlb,
 					unsigned int page_size)
 {
 }
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index db0b6eebbfa5..4db1072868f7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -217,6 +217,7 @@ config PPC
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE		if SMP
+	select HAVE_MMU_GATHER_PAGE_SIZE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if PPC64 && CPU_LITTLE_ENDIAN
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..b29a67137acf 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -27,7 +27,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
 #define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change

 extern void tlb_flush(struct mmu_gather *tlb);

@@ -46,22 +45,6 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 #endif
 }

-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
-						     unsigned int page_size)
-{
-	if (!tlb->page_size)
-		tlb->page_size = page_size;
-	else if (tlb->page_size != page_size) {
-		if (!tlb->fullmm)
-			tlb_flush_mmu(tlb);
-		/*
-		 * update the page size after flush for the new
-		 * mmu_gather.
-		 */
-		tlb->page_size = page_size;
-	}
-}
-
 #ifdef CONFIG_SMP
 static inline int mm_is_core_local(struct mm_struct *mm)
 {
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 457b7ba0fbb6..cf3d64313740 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -180,9 +180,7 @@ static inline void pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
 	tlb_remove_tlb_entry(tlb, ptep, address)

-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
-
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 10:47:17AM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> > > The only problem with this approach is that we've lost track of the
> > > granule size by the point we get to the tlb_flush(), so we can't
> > > adjust the stride of the TLB invalidations for huge mappings, which
> > > actually works nicely in the synchronous case (e.g. we perform a
> > > single invalidation for a 2MB mapping, rather than iterating over it
> > > at a 4k granule).
> > >
> > > One thing we could do is switch to synchronous mode if we detect a
> > > change in granule (i.e. treat it like a batch failure).
> >
> > We could use tlb_start_vma() to track that, I think. Shouldn't be too
> > hard.
>
> Hurm.. look at commit:
>
> e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force
> flush if page size change")

Ah, good, it seems that already got cleaned up a lot. But it all moved
into the power code.. blergh.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Fri, Aug 24, 2018 at 10:47:17AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> > The only problem with this approach is that we've lost track of the
> > granule size by the point we get to the tlb_flush(), so we can't
> > adjust the stride of the TLB invalidations for huge mappings, which
> > actually works nicely in the synchronous case (e.g. we perform a
> > single invalidation for a 2MB mapping, rather than iterating over it
> > at a 4k granule).
> >
> > One thing we could do is switch to synchronous mode if we detect a
> > change in granule (i.e. treat it like a batch failure).
>
> We could use tlb_start_vma() to track that, I think. Shouldn't be too
> hard.

Hurm.. look at commit:

e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force
flush if page size change")

yuck yuck yuck. That needs fixing.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> The only problem with this approach is that we've lost track of the
> granule size by the point we get to the tlb_flush(), so we can't
> adjust the stride of the TLB invalidations for huge mappings, which
> actually works nicely in the synchronous case (e.g. we perform a
> single invalidation for a 2MB mapping, rather than iterating over it
> at a 4k granule).
>
> One thing we could do is switch to synchronous mode if we detect a
> change in granule (i.e. treat it like a batch failure).

We could use tlb_start_vma() to track that, I think. Shouldn't be too
hard.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Thu, Aug 23, 2018 at 02:54:20PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-08-22 at 20:59 -0700, Linus Torvalds wrote:
> > The problem is that x86 _used_ to do this all correctly long long ago.
> >
> > And then we switched over to the "generic" table flushing (which
> > harkens back to the powerpc code).
>
> Yes, we wrote the RCU stuff to solve the races with SW walking, which
> is completely orthogonal with HW walking & TLB content. We didn't do
> the move to generic code though ;-)
>
> > Which actually turned out to be not generic at all, and did not flush
> > the internal pages like x86 used to (back when x86 just used
> > tlb_remove_page for everything).
>
> Well, having RCU do the flushing is rather generic, it makes sense
> whenever there's somebody doing a SW walk *and* you don't have IPIs to
> synchronize your flushes (ie, anybody with HW TLB invalidation
> broadcast basically, so ARM and us).

Right, so (many many years ago) I moved it over to generic code because
Sparc-hash wanted fast_gup and I figured having multiple copies of this
stuff wasn't ideal.

Then ARM came along and used it because it does the invalidate
broadcast.

And then when we switched x86 over last year or so; because paravirt; I
had long since forgotten all details and completely overlooked this.

Worse; somewhere along the line we tried to get s390 on this and they
ran into the exact problem being fixed now. That _should_ have been a
big clue, but somehow I never got around to thinking about it properly
and they went back to a private copy of all this.

So double fail on me I suppose :-/

Anyway, it's sorted now; although I'd like to write me a fairly big
comment in asm-generic/tlb.h about things, before I forget again.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 05:55:27PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 22, 2018 at 05:30:15PM +0200, Peter Zijlstra wrote:
> > ARM
> > which later used this put an explicit TLB invalidate in their
> > __p*_free_tlb() functions, and PowerPC-radix followed that example.
>
> > +/*
> > + * If we want tlb_remove_table() to imply TLB invalidates.
> > + */
> > +static inline void tlb_table_invalidate(struct mmu_gather *tlb)
> > +{
> > +#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
> > +	/*
> > +	 * Invalidate page-table caches used by hardware walkers. Then we still
> > +	 * need to RCU-sched wait while freeing the pages because software
> > +	 * walkers can still be in-flight.
> > +	 */
> > +	__tlb_flush_mmu_tlbonly(tlb);
> > +#endif
> > +}
>
> Nick, Will is already looking at using this to remove the synchronous
> invalidation from __p*_free_tlb() for ARM, could you have a look to see
> if PowerPC-radix could benefit from that too?
>
> Basically, using a patch like the below, would give your tlb_flush()
> information on if tables were removed or not.

Just to say that I have something up and running for arm64 based on
this. I'll post it once it's seen a bit more testing (either tomorrow
or Monday).

Will
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 10:11:41PM -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt wrote:
> >
> > So we do need a different flush instruction for the page tables vs.
> > the normal TLB pages.
>
> Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> invalidates all the internal tlb cache nodes.
>
> So the "new world order" is exactly that patch that PeterZ sent you,
> that adds a
>
> +	unsigned int		freed_tables : 1;
>
> to the 'struct mmu_gather', and then makes all those
> pte/pmd/pud/p4d_free_tlb() functions set that bit.
>
> So I'm referring to the email PeterZ sent you in this thread that said:
>
>   Nick, Will is already looking at using this to remove the synchronous
>   invalidation from __p*_free_tlb() for ARM, could you have a look to see
>   if PowerPC-radix could benefit from that too?
>
>   Basically, using a patch like the below, would give your tlb_flush()
>   information on if tables were removed or not.
>
> then, in that model, you do *not* need to override these
> pte/pmd/pud/p4d_free_tlb() macros at all (well, you *can* if you want
> to, for doing games with the range modification, but let's say that
> you don't need that right now).
>
> So instead, when you get to the actual "tlb_flush(tlb)", you do
> exactly that - flush the tlb. And the mmu_gather structure shows you
> how much you need to flush. If you see that "freed_tables" is set,
> then you know that you need to also do the special instruction to
> flush the inner level caches. The range continues to show the page
> range.

The only problem with this approach is that we've lost track of the
granule size by the point we get to the tlb_flush(), so we can't adjust
the stride of the TLB invalidations for huge mappings, which actually
works nicely in the synchronous case (e.g. we perform a single
invalidation for a 2MB mapping, rather than iterating over it at a 4k
granule).

One thing we could do is switch to synchronous mode if we detect a
change in granule (i.e. treat it like a batch failure).

Will
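A sketch of "switch to synchronous mode on a granule change", modelled
on the powerpc tlb_remove_check_page_size_change() code quoted elsewhere
in this thread; the helper name is invented and a page_size of 0 means
"not yet seen":

static inline void tlb_track_page_size(struct mmu_gather *tlb,
                                       unsigned int page_size)
{
        if (!tlb->page_size) {
                tlb->page_size = page_size;
        } else if (tlb->page_size != page_size) {
                /* granule changed: flush now, at the old stride */
                if (!tlb->fullmm)
                        tlb_flush_mmu(tlb);
                /* continue gathering at the new granule */
                tlb->page_size = page_size;
        }
}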
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, 22 Aug 2018 17:30:15 +0200 Peter Zijlstra wrote:
> Jann reported that x86 was missing required TLB invalidates when he
> hit the !*batch slow path in tlb_remove_table().
>
> This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
> invalidates, the PowerPC-hash where this code originated and the
> Sparc-hash where this was subsequently used did not need that. ARM
> which later used this put an explicit TLB invalidate in their
> __p*_free_tlb() functions, and PowerPC-radix followed that example.

So this is interesting, I _think_ a145abf12c did fix this bug for
powerpc, but then it seems to have been re-broken by a46cc7a90f because
that one defers the flush back to tlb_flush time. There was quite a lot
of churn getting the radix MMU off the ground at that point though, so
I'm not 100% sure. But AFAIKS powerpc today has this same breakage, and
this patch should fix it.

I have a couple of patches that touch the same code I'll send, you
might have some opinions on them.

Thanks,
Nick
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, 22 Aug 2018 22:20:32 -0700 Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 10:11 PM Linus Torvalds wrote:
> >
> > So instead, when you get to the actual "tlb_flush(tlb)", you do
> > exactly that - flush the tlb. And the mmu_gather structure shows you
> > how much you need to flush. If you see that "freed_tables" is set,
> > then you know that you need to also do the special instruction to
> > flush the inner level caches. The range continues to show the page
> > range.
>
> Note that this obviously works fine for a hashed table model too - you
> just ignore the "freed_tables" bit entirely and continue to do
> whatever you always did.
>
> And we can ignore it on x86 too, because we just see the range, and we
> invalidate the range, and that will always invalidate the mid-level
> caching too.
>
> So the new bit is literally for arm and powerpc-radix (and maybe
> s390), but we want to make the actual VM interface truly generic and
> not have one set of code with five different behaviors (which we
> _currently_ kind of have with the whole in addition to all the
> HAVE_RCU_TABLE_FREE etc config options that modify how the code works).
>
> It would be good to also cut down on the millions of functions that
> each architecture can override, because Christ, it got very confusing
> at times to follow just how the code flowed from generic code to
> architecture-specific macros back to generic code and then
> arch-specific inline helper functions.
>
> It's a maze of underscores and "page" vs "table", and "flush" vs
> "remove" etc.
>
> But that "it would be good to really make everybody use as much of
> the generic code as possible" and everybody have the same pattern,
> that's a future thing. But the whole "let's just add that
> freed_tables thing" would be part of trying to get people to use the
> same overall pattern, even if some architectures might not care about
> that detail.

For s390 the new freed_tables bit looks to be a step in the right
direction. Right now there is the flush_mm bit in the mm->context that
I use to keep track of the need for a global flush of the mm. The
flush_mm bit is currently used for both lazy PTE flushing and TLB
flushes for freed page table pages. I'll look into it once the generic
patch is merged.

The switch to the generic mmu_gather is certainly not a change that can
be done in a hurry, we had too many subtle TLB flush issues in the
past.

--
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Thu, 23 Aug 2018 15:21:30 +1000 Benjamin Herrenschmidt wrote:
> On Wed, 2018-08-22 at 22:11 -0700, Linus Torvalds wrote:
> > On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt wrote:
> > >
> > > So we do need a different flush instruction for the page tables vs.
> > > the normal TLB pages.
> >
> > Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> > invalidates all the internal tlb cache nodes.
> >
> > So the "new world order" is exactly that patch that PeterZ sent you,
> > that adds a
> >
> > +	unsigned int		freed_tables : 1;
> >
> > .../...
>
> > So instead, when you get to the actual "tlb_flush(tlb)", you do
> > exactly that - flush the tlb. And the mmu_gather structure shows you
> > how much you need to flush. If you see that "freed_tables" is set,
> > then you know that you need to also do the special instruction to
> > flush the inner level caches. The range continues to show the page
> > range.
>
> Yup. That looks like a generic version of the "need_flush_all" flag we
> have, which is fine by us.
>
> Just don't blame powerpc for all the historical crap :-)

And yes we very much want to remove the x86 hacks from generic code and
have them use the sane powerpc/radix page walk cache flushing model.
That would indeed allow us to stop overriding those macros and start
sharing more code with other archs. We can help write or review code to
make sure bugs don't creep in when moving it to a generic
implementation. It will be much more relevant to us now because radix
is very similar to x86.

The hack is not the powerpc override macros though, let's be clear
about that. Even x86 will be helped out by removing that crap because
it won't have to do a full TLB flush caused by a massive TLB range if
it frees 0..small number of pages that happen to also free some page
tables.

Thanks,
Nick
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, 22 Aug 2018 22:03:40 -0700 Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 9:33 PM Nicholas Piggin wrote:
> >
> > I think it was quite well understood and fixed here, a145abf12c9 but
> > again that was before I really started looking at it.
>
> You don't understand the problem.

More fundamentally I think I didn't understand this fix, I think
actually powerpc/radix does have a bug here. a145abf12c9 was really
just a replacement for x86's hack of expanding the TLB invalidation
range when freeing page table to capture page walk cache
(powerpc/radix needs a different instruction so that didn't work for
us). But I hadn't really looked at this fix closely, rather Peter's
follow up post about making powerpc page walk cache flushing design a
generic concept.

My point in this reply was more that my patches from the other month
weren't a blundering attempt to fix this bug without realising it, they
were purely about avoiding the x86 TLB range expanding hack (that won't
be needed if generic users all move over).

> All the x86 people thought WE ALREADY DID THAT.
>
> Because we had done this all correctly over a decade ago!
>
> Nobody realized that it had been screwed up by the powerpc code, and

The powerpc/hash code is not screwed up though AFAIKS. You can't take
arch specific code and slap a "generic" label on it, least of all the
crazy powerpc/hash code, you of all people would agree with that :)

> the commit you point to was believed to be a new *powerpc* only issue,
> because the semantics on powerpc has changed because of the radix
> tree.
>
> The semantics on x86 have never changed, they've always been the same.
> So why would the x86 people react to powerpc doing something that x86
> had already always done.
>
> See?
>
> Nobody cared one whit about commit a145abf12c9, because it just
> handles a new powerpc-specific case.
>
> > I don't really understand what the issue you have with powerpc here.
> > powerpc hash has the page table flushing accessors which are just
> > no-ops, it's the generic code that fails to call them properly. Surely
> > there was no powerpc patch that removed those calls from generic code?
>
> Yes there was.
>
> Look where the generic code *came* from.
>
> It's powerpc code.
>
> See commit 267239116987 ("mm, powerpc: move the RCU page-table freeing
> into generic code").
>
> The powerpc code was made into the generic code, because the powerpc
> code had to handle all those special RCU freeing things etc that
> others didn't.
>
> It's just that when x86 was then switched over to use the "generic"
> code, people didn't realize that the generic code didn't do the TLB
> invalidations for page tables, because they hadn't been needed on
> powerpc.

Sure, there was a minor bug in the port. Not that it was a closely
guarded secret that powerpc didn't flush page table pages, but it's a
relatively subtle issue in complex code. That happens.

> So the powerpc code that was made generic, never really was. The new
> "generic" code had a powerpc-specific quirk.
>
> That then very subtly broke x86, without the x86 people ever
> realizing. Because the old simple non-RCU x86 code had never had that
> issue, it just treated the leaf pages and the directory pages exactly
> the same.
>
> See?
>
> And THAT is why I talk about the powerpc code. Because what is
> "generic" code in 4.18 (and several releases before) isn't actually
> generic.
>
> And that's basically exactly the bug that the patches from PeterZ are
> fixing. Making the "tlb_remove_table()" code always flush the tlb, the
> way it should have when it was made generic.

It just sounded like you were blaming correct powerpc/hash code for
this. It's just a minor bug in taking that code into generic, not
really a big deal, right? Or are you saying powerpc devs or code could
be doing something better to play nicer with the rest of the archs?

Honestly trying to improve things here, and encouraged by x86 and ARM
looking to move over to a saner page walk cache tracking design and
sharing more code with powerpc/radix. I would help with reviewing
things or writing code or porting powerpc bits if I can.

Thanks,
Nick
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, 2018-08-22 at 22:11 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt wrote:
> >
> > So we do need a different flush instruction for the page tables vs.
> > the normal TLB pages.
>
> Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> invalidates all the internal tlb cache nodes.
>
> So the "new world order" is exactly that patch that PeterZ sent you,
> that adds a
>
> +	unsigned int		freed_tables : 1;
>
> .../...
>
> So instead, when you get to the actual "tlb_flush(tlb)", you do
> exactly that - flush the tlb. And the mmu_gather structure shows you
> how much you need to flush. If you see that "freed_tables" is set,
> then you know that you need to also do the special instruction to
> flush the inner level caches. The range continues to show the page
> range.

Yup. That looks like a generic version of the "need_flush_all" flag we
have, which is fine by us.

Just don't blame powerpc for all the historical crap :-)

Cheers,
Ben.

> Linus
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 10:11 PM Linus Torvalds wrote:
>
> So instead, when you get to the actual "tlb_flush(tlb)", you do
> exactly that - flush the tlb. And the mmu_gather structure shows you
> how much you need to flush. If you see that "freed_tables" is set,
> then you know that you need to also do the special instruction to
> flush the inner level caches. The range continues to show the page
> range.

Note that this obviously works fine for a hashed table model too - you
just ignore the "freed_tables" bit entirely and continue to do whatever
you always did.

And we can ignore it on x86 too, because we just see the range, and we
invalidate the range, and that will always invalidate the mid-level
caching too.

So the new bit is literally for arm and powerpc-radix (and maybe s390),
but we want to make the actual VM interface truly generic and not have
one set of code with five different behaviors (which we _currently_
kind of have with the whole in addition to all the HAVE_RCU_TABLE_FREE
etc config options that modify how the code works).

It would be good to also cut down on the millions of functions that
each architecture can override, because Christ, it got very confusing
at times to follow just how the code flowed from generic code to
architecture-specific macros back to generic code and then
arch-specific inline helper functions.

It's a maze of underscores and "page" vs "table", and "flush" vs
"remove" etc.

But that "it would be good to really make everybody use as much of the
generic code as possible" and everybody have the same pattern, that's a
future thing. But the whole "let's just add that freed_tables thing"
would be part of trying to get people to use the same overall pattern,
even if some architectures might not care about that detail.

Linus
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt wrote:
>
> So we do need a different flush instruction for the page tables vs.
> the normal TLB pages.

Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
invalidates all the internal tlb cache nodes.

So the "new world order" is exactly that patch that PeterZ sent you,
that adds a

+	unsigned int		freed_tables : 1;

to the 'struct mmu_gather', and then makes all those
pte/pmd/pud/p4d_free_tlb() functions set that bit.

So I'm referring to the email PeterZ sent you in this thread that said:

  Nick, Will is already looking at using this to remove the synchronous
  invalidation from __p*_free_tlb() for ARM, could you have a look to see
  if PowerPC-radix could benefit from that too?

  Basically, using a patch like the below, would give your tlb_flush()
  information on if tables were removed or not.

then, in that model, you do *not* need to override these
pte/pmd/pud/p4d_free_tlb() macros at all (well, you *can* if you want
to, for doing games with the range modification, but let's say that
you don't need that right now).

So instead, when you get to the actual "tlb_flush(tlb)", you do exactly
that - flush the tlb. And the mmu_gather structure shows you how much
you need to flush. If you see that "freed_tables" is set, then you know
that you need to also do the special instruction to flush the inner
level caches. The range continues to show the page range.

Linus
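A sketch of an arch tlb_flush() consuming the proposed bit;
flush_tlb_window() and flush_page_walk_cache() are placeholder names
for the arch primitives, not existing kernel functions:

static inline void tlb_flush(struct mmu_gather *tlb)
{
        /* invalidate the leaf translations for the gathered range */
        flush_tlb_window(tlb->mm, tlb->start, tlb->end);

        /*
         * Page directories were freed in this gather: also invalidate
         * cached partial translations (the page-walk cache), which a
         * plain range flush does not cover on ARM or powerpc/radix.
         */
        if (tlb->freed_tables)
                flush_page_walk_cache(tlb->mm);
}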
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 9:33 PM Nicholas Piggin wrote: > > I think it was quite well understood and fixed here, a145abf12c9 but > again that was before I really started looking at it. You don't understand the problem. All the x86 people thought WE ALREADY DID THAT. Because we had done this all correctly over a decade ago! Nobody realized that it had been screwed up by the powerpc code, and the commit you point to was believed to be a new *powerpc*-only issue, because the semantics on powerpc had changed because of the radix tree. The semantics on x86 have never changed; they've always been the same. So why would the x86 people react to powerpc doing something that x86 had already always done? See? Nobody cared one whit about commit a145abf12c9, because it just handles a new powerpc-specific case. > I don't really understand what the issue you have with powerpc here. > powerpc hash has the page table flushing accessors which are just > no-ops, it's the generic code that fails to call them properly. Surely > there was no powerpc patch that removed those calls from generic code? Yes there was. Look where the generic code *came* from. It's powerpc code. See commit 267239116987 ("mm, powerpc: move the RCU page-table freeing into generic code"). The powerpc code was made into the generic code, because the powerpc code had to handle all those special RCU freeing things etc that others didn't. It's just that when x86 was then switched over to use the "generic" code, people didn't realize that the generic code didn't do the TLB invalidations for page tables, because they hadn't been needed on powerpc. So the powerpc code that was made generic never really was generic. The new "generic" code had a powerpc-specific quirk. That then very subtly broke x86, without the x86 people ever realizing. Because the old simple non-RCU x86 code had never had that issue; it just treated the leaf pages and the directory pages exactly the same. See? And THAT is why I talk about the powerpc code. Because what is "generic" code in 4.18 (and several releases before) isn't actually generic. And that's basically exactly the bug that the patches from PeterZ are fixing: making the "tlb_remove_table()" code always flush the tlb, the way it should have when it was made generic. Linus
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, 2018-08-22 at 20:59 -0700, Linus Torvalds wrote: > On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin wrote: > > > > powerpc/radix has no such issue, it already does this tracking. > > Yeah, I now realize that this was why you wanted to add that hacky > thing to the generic code, so that you can add the tlb_flush_pgtable() > call. > > I thought it was because powerpc had some special flush instruction > for it, and the regular tlb flush didn't do it. But no. It was because > the regular code had lost the tlb flush _entirely_, because powerpc > didn't want it. Heh :-) Well, back on hash we didn't (we do now with Radix), but I wouldn't blame us for the generic code being broken ... the RCU table freeing was in arch/powerpc at the time :-) I don't think it was us making it generic :) > > We were discussing this a couple of months ago, I wasn't aware of ARM's > > issue but I suggested x86 could go the same way as powerpc. > > The problem is that x86 _used_ to do this all correctly long long ago. > > And then we switched over to the "generic" table flushing (which > harkens back to the powerpc code). Yes, we wrote the RCU stuff to solve the races with SW walking, which is completely orthogonal to HW walking & TLB content. We didn't do the move to generic code though ;-) > Which actually turned out to be not generic at all, and did not flush > the internal pages like x86 used to (back when x86 just used > tlb_remove_page for everything). Well, having RCU do the flushing is rather generic; it makes sense whenever there's somebody doing a SW walk *and* you don't have IPIs to synchronize your flushes (ie, anybody with HW TLB invalidation broadcast basically, so ARM and us). > So as a result, x86 had unintentionally lost the TLB flush we used to > have, because tlb_remove_table() had lost the tlb flushing because of > a powerpc quirk. This is a somewhat odd way of putting the "blame" :-) But yeah, ok... > You then added it back as a hacky per-architecture hook (apparently > having realized that you never did it at all), which didn't fix the > unintentional lack of flushing on x86. > > So now we're going to do it right. No more "oh, powerpc didn't need > to flush because the hash tables weren't in the tlb at all" thing in > the generic code that then others need to work around. So we do need a different flush instruction for the page tables vs. the normal TLB pages. Cheers, Ben.
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, 22 Aug 2018 20:59:46 -0700 Linus Torvalds wrote: > On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin wrote: > > > > powerpc/radix has no such issue, it already does this tracking. > > Yeah, I now realize that this was why you wanted to add that hacky > thing to the generic code, so that you can add the tlb_flush_pgtable() > call. > > I thought it was because powerpc had some special flush instruction > for it, and the regular tlb flush didn't do it. But no. powerpc/radix does have a special instruction for it, that is why I posted the patch :) > It was because > the regular code had lost the tlb flush _entirely_, because powerpc > didn't want it. I think that was long before I started looking at the code. powerpc/hash hardware has no idea about the page tables, so yeah, they don't need it. > > We were discussing this a couple of months ago, I wasn't aware of ARM's > > issue but I suggested x86 could go the same way as powerpc. > > The problem is that x86 _used_ to do this all correctly long long ago. > > And then we switched over to the "generic" table flushing (which > harkens back to the powerpc code). > > Which actually turned out to be not generic at all, and did not flush > the internal pages like x86 used to (back when x86 just used > tlb_remove_page for everything). > > So as a result, x86 had unintentionally lost the TLB flush we used to > have, because tlb_remove_table() had lost the tlb flushing because of > a powerpc quirk. > > You then added it back as a hacky per-architecture hook (apparently > having realized that you never did it at all), which didn't fix the I think it was quite well understood and fixed here, a145abf12c9, but again that was before I really started looking at it. The hooks I added recently are for a different reason, and it's actually the opposite problem -- to work around the hacky generic code that x86 foisted on other archs. > unintentional lack of flushing on x86. > > So now we're going to do it right. No more "oh, powerpc didn't need > to flush because the hash tables weren't in the tlb at all" thing in > the generic code that then others need to work around. I don't really understand what issue you have with powerpc here. powerpc hash has the page-table flushing accessors, which are just no-ops; it's the generic code that fails to call them properly. Surely there was no powerpc patch that removed those calls from generic code? powerpc/radix, yes, it does some arch-specific things to do its page walk cache flushing, but it is a better design than the hacks x86 has in generic code, surely. I thought you basically agreed and thought x86 / generic code could move to that kind of model. Thanks, Nick
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin wrote: > > powerpc/radix has no such issue, it already does this tracking. Yeah, I now realize that this was why you wanted to add that hacky thing to the generic code, so that you can add the tlb_flush_pgtable() call. I thought it was because powerpc had some special flush instruction for it, and the regular tlb flush didn't do it. But no. It was because the regular code had lost the tlb flush _entirely_, because powerpc didn't want it. > We were discussing this a couple of months ago, I wasn't aware of ARM's > issue but I suggested x86 could go the same way as powerpc. The problem is that x86 _used_ to do this all correctly long long ago. And then we switched over to the "generic" table flushing (which harkens back to the powerpc code). Which actually turned out to be not generic at all, and did not flush the internal pages like x86 used to (back when x86 just used tlb_remove_page for everything). So as a result, x86 had unintentionally lost the TLB flush we used to have, because tlb_remove_table() had lost the tlb flushing because of a powerpc quirk. You then added it back as a hacky per-architecture hook (apparently having realized that you never did it at all), which didn't fix the unintentional lack of flushing on x86. So now we're going to do it right. No more "oh, powerpc didn't need to flush because the hash tables weren't in the tlb at all" thing in the generic code that then others need to work around. Linus
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, 22 Aug 2018 17:55:27 +0200 Peter Zijlstra wrote: > On Wed, Aug 22, 2018 at 05:30:15PM +0200, Peter Zijlstra wrote: > > ARM, which later used this, put an explicit TLB invalidate in their > > __p*_free_tlb() functions, and PowerPC-radix followed that example. > > > > +/* > > + * If we want tlb_remove_table() to imply TLB invalidates. > > + */ > > +static inline void tlb_table_invalidate(struct mmu_gather *tlb) > > +{ > > +#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE > > +	/* > > +	 * Invalidate page-table caches used by hardware walkers. Then we still > > +	 * need to RCU-sched wait while freeing the pages because software > > +	 * walkers can still be in-flight. > > +	 */ > > +	__tlb_flush_mmu_tlbonly(tlb); > > +#endif > > +} > > Nick, Will is already looking at using this to remove the synchronous > invalidation from __p*_free_tlb() for ARM, could you have a look to see > if PowerPC-radix could benefit from that too?

powerpc/radix has no such issue, it already does this tracking. We were discussing this a couple of months ago; I wasn't aware of ARM's issue, but I suggested x86 could go the same way as powerpc. It would have been good to get some feedback on some of the approaches proposed there, because it's not just PWC tracking: if you do this, you don't need those silly hacks in generic code to expand the TLB address range either. So powerpc has no fundamental problem with making this stuff generic. If you need a fix for x86 and ARM for this merge window though, I would suggest just copying what powerpc already has. Next time we can consider consolidating them all into generic code. Thanks, Nick

> Basically, using a patch like the below would give your tlb_flush()
> information on if tables were removed or not.
>
> ---
>
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -96,12 +96,22 @@ struct mmu_gather {
>  #endif
>  	unsigned long		start;
>  	unsigned long		end;
> -	/* we are in the middle of an operation to clear
> -	 * a full mm and can make some optimizations */
> -	unsigned int		fullmm : 1,
> -				/* we have performed an operation which
> -				 * requires a complete flush of the tlb */
> -				need_flush_all : 1;
> +	/*
> +	 * we are in the middle of an operation to clear
> +	 * a full mm and can make some optimizations
> +	 */
> +	unsigned int		fullmm : 1;
> +
> +	/*
> +	 * we have performed an operation which
> +	 * requires a complete flush of the tlb
> +	 */
> +	unsigned int		need_flush_all : 1;
> +
> +	/*
> +	 * we have removed page directories
> +	 */
> +	unsigned int		freed_tables : 1;
>
>  	struct mmu_gather_batch *active;
>  	struct mmu_gather_batch	local;
> @@ -136,6 +146,7 @@ static inline void __tlb_reset_range(str
>  		tlb->start = TASK_SIZE;
>  		tlb->end = 0;
>  	}
> +	tlb->freed_tables = 0;
>  }
>
>  static inline void tlb_remove_page_size(struct mmu_gather *tlb,
> @@ -269,6 +280,7 @@ static inline void tlb_remove_check_page
>  #define pte_free_tlb(tlb, ptep, address)			\
>  	do {							\
>  		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
> +		tlb->freed_tables = 1;				\
>  		__pte_free_tlb(tlb, ptep, address);		\
>  	} while (0)
>  #endif
> @@ -276,7 +288,8 @@ static inline void tlb_remove_check_page
>  #ifndef pmd_free_tlb
>  #define pmd_free_tlb(tlb, pmdp, address)			\
>  	do {							\
> -		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
> +		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
> +		tlb->freed_tables = 1;				\
>  		__pmd_free_tlb(tlb, pmdp, address);		\
>  	} while (0)
>  #endif
> @@ -286,6 +299,7 @@ static inline void tlb_remove_check_page
>  #define pud_free_tlb(tlb, pudp, address)			\
>  	do {							\
>  		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
> +		tlb->freed_tables = 1;				\
>  		__pud_free_tlb(tlb, pudp, address);		\
>  	} while (0)
>  #endif
> @@ -295,7 +309,8 @@ static inline void tlb_remove_check_page
>  #ifndef p4d_free_tlb
>  #define p4d_free_tlb(tlb, pudp, address)			\
>  	do {							\
> -		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
> +		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
> +		tlb->freed_tables = 1;				\
>  		__p4d_free_tlb(tlb, pudp, address);		\
>  	} while (0)
>  #endif
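[To spell out what Peter is proposing to Nick and Will above: once the generic wrappers set tlb->freed_tables, an architecture that currently issues a synchronous invalidate from its __p*_free_tlb() could drop it and let tlb_flush() do one deferred invalidate per mmu_gather. A hedged before/after sketch; arch_flush_walk_cache() is a made-up placeholder, not ARM's or powerpc's real primitive, and the macro signature is simplified.]

/* Before: an eager invalidate for every freed page-table page. */
#define __pte_free_tlb(tlb, ptep, address)			\
do {								\
	arch_flush_walk_cache((tlb)->mm, (address));	/* hypothetical */ \
	tlb_remove_table((tlb), (ptep));			\
} while (0)

/* After: just queue the table page. The generic pte_free_tlb()
 * wrapper has already set tlb->freed_tables, so the arch tlb_flush()
 * performs a single walk-cache invalidate for the whole batch. */
#define __pte_free_tlb(tlb, ptep, address)			\
	tlb_remove_table((tlb), (ptep))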
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 8:46 AM Peter Zijlstra wrote:
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -180,6 +180,7 @@ config X86
>  	select HAVE_PERF_REGS
>  	select HAVE_PERF_USER_STACK_DUMP
>  	select HAVE_RCU_TABLE_FREE
> +	select HAVE_RCU_TABLE_INVALIDATE	if HAVE_RCU_TABLE_FREE

This is confusing. First you select HAVE_RCU_TABLE_FREE unconditionally, and then you select HAVE_RCU_TABLE_INVALIDATE based on that unconditional variable. I can see why you do it, but that's because I see the next patch. On its own it just looks like you have a drinking problem. That said, I was waiting to see if this patch-set would get any comments before applying it, but it's been mostly crickets... So I think I'll just apply it and get this issue over and done with. Linus
Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
On Wed, Aug 22, 2018 at 05:30:15PM +0200, Peter Zijlstra wrote: > ARM, which later used this, put an explicit TLB invalidate in their > __p*_free_tlb() functions, and PowerPC-radix followed that example.

> +/*
> + * If we want tlb_remove_table() to imply TLB invalidates.
> + */
> +static inline void tlb_table_invalidate(struct mmu_gather *tlb)
> +{
> +#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
> +	/*
> +	 * Invalidate page-table caches used by hardware walkers. Then we still
> +	 * need to RCU-sched wait while freeing the pages because software
> +	 * walkers can still be in-flight.
> +	 */
> +	__tlb_flush_mmu_tlbonly(tlb);
> +#endif
> +}

Nick, Will is already looking at using this to remove the synchronous invalidation from __p*_free_tlb() for ARM, could you have a look to see if PowerPC-radix could benefit from that too?

Basically, using a patch like the below would give your tlb_flush() information on if tables were removed or not.

---

--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -96,12 +96,22 @@ struct mmu_gather {
 #endif
 	unsigned long		start;
 	unsigned long		end;
-	/* we are in the middle of an operation to clear
-	 * a full mm and can make some optimizations */
-	unsigned int		fullmm : 1,
-				/* we have performed an operation which
-				 * requires a complete flush of the tlb */
-				need_flush_all : 1;
+	/*
+	 * we are in the middle of an operation to clear
+	 * a full mm and can make some optimizations
+	 */
+	unsigned int		fullmm : 1;
+
+	/*
+	 * we have performed an operation which
+	 * requires a complete flush of the tlb
+	 */
+	unsigned int		need_flush_all : 1;
+
+	/*
+	 * we have removed page directories
+	 */
+	unsigned int		freed_tables : 1;
 
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
@@ -136,6 +146,7 @@ static inline void __tlb_reset_range(str
 		tlb->start = TASK_SIZE;
 		tlb->end = 0;
 	}
+	tlb->freed_tables = 0;
 }
 
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
@@ -269,6 +280,7 @@ static inline void tlb_remove_check_page
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 #endif
@@ -276,7 +288,8 @@ static inline void tlb_remove_check_page
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 #endif
@@ -286,6 +299,7 @@ static inline void tlb_remove_check_page
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -295,7 +309,8 @@ static inline void tlb_remove_check_page
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
[PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
Jann reported that x86 was missing required TLB invalidates when he hit the !*batch slow path in tlb_remove_table().

This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache) invalidates: the PowerPC-hash code where this originated and the Sparc-hash code where it was subsequently used did not need them. ARM, which later used this, put an explicit TLB invalidate in its __p*_free_tlb() functions, and PowerPC-radix followed that example.

But when we hooked up x86 we failed to consider this. Fix this by (optionally) hooking tlb_remove_table() into the TLB invalidate code.

NOTE: s390 also needed something like this and might now be able to use the generic code again.

Cc: sta...@kernel.org
Cc: Nicholas Piggin
Cc: David Miller
Cc: Will Deacon
Cc: Martin Schwidefsky
Cc: Michael Ellerman
Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
Reported-by: Jann Horn
Signed-off-by: Peter Zijlstra (Intel)
---
 arch/Kconfig     |  3 +++
 arch/x86/Kconfig |  1 +
 mm/memory.c      | 27 +++++++++++++++++++++++++--
 3 files changed, 29 insertions(+), 2 deletions(-)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -362,6 +362,9 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
 	bool
 
+config HAVE_RCU_TABLE_INVALIDATE
+	bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -180,6 +180,7 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE
+	select HAVE_RCU_TABLE_INVALIDATE	if HAVE_RCU_TABLE_FREE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
 	select HAVE_STACKPROTECTOR		if CC_HAS_SANE_STACKPROTECTOR
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -238,17 +238,22 @@ void arch_tlb_gather_mmu(struct mmu_gath
 	__tlb_reset_range(tlb);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static void __tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
 	if (!tlb->end)
 		return;
 
 	tlb_flush(tlb);
 	mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
+	__tlb_reset_range(tlb);
+}
+
+static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+{
+	__tlb_flush_mmu_tlbonly(tlb);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
-	__tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -330,6 +335,21 @@ bool __tlb_remove_page_size(struct mmu_g
  * See the comment near struct mmu_table_batch.
  */
 
+/*
+ * If we want tlb_remove_table() to imply TLB invalidates.
+ */
+static inline void tlb_table_invalidate(struct mmu_gather *tlb)
+{
+#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+	/*
+	 * Invalidate page-table caches used by hardware walkers. Then we still
+	 * need to RCU-sched wait while freeing the pages because software
+	 * walkers can still be in-flight.
+	 */
+	__tlb_flush_mmu_tlbonly(tlb);
+#endif
+}
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
 	/* Simply deliver the interrupt */
@@ -366,6 +386,7 @@ void tlb_table_flush(struct mmu_gather *
 	struct mmu_table_batch **batch = &tlb->batch;
 
 	if (*batch) {
+		tlb_table_invalidate(tlb);
 		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
 		*batch = NULL;
 	}
@@ -378,11 +399,13 @@ void tlb_remove_table(struct mmu_gather
 	if (*batch == NULL) {
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 		if (*batch == NULL) {
+			tlb_table_invalidate(tlb);
 			tlb_remove_table_one(table);
 			return;
 		}
 		(*batch)->nr = 0;
 	}
+
 	(*batch)->tables[(*batch)->nr++] = table;
 	if ((*batch)->nr == MAX_TABLE_BATCH)
 		tlb_table_flush(tlb);
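[As a reading aid, the ordering the patch establishes can be condensed into one trace; this is a paraphrase of the code above, not additional kernel code.]

/*
 * Batched path:
 *   pte/pmd/pud/p4d_free_tlb()
 *     -> tlb_remove_table(tlb, table)     table queued in *batch
 *   tlb_flush_mmu() or batch full
 *     -> tlb_table_flush(tlb)
 *          -> tlb_table_invalidate(tlb)   TLB + walk caches flushed FIRST
 *          -> call_rcu_sched(...)         pages freed only AFTER a grace period
 *
 * Slow path (Jann's report: the GFP_NOWAIT batch allocation fails):
 *   tlb_remove_table(tlb, table)
 *     -> tlb_table_invalidate(tlb)        flush FIRST, as above
 *     -> tlb_remove_table_one(table)      IPI-synchronize, then free
 *
 * Either way, no hardware walker can still reach a table page once it
 * is handed back to the allocator.
 */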