Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-30 Thread Benjamin Herrenschmidt
On Mon, 2018-08-27 at 19:02 +1000, Nicholas Piggin wrote:
> > More tlbies ? With the cost of the broadcasts on the fabric ? I don't
> > think so.. or I'm not understanding your point...
> 
> More tlbies are no good, but there will be some places where it works
> out much better (and fewer tlbies). Worst possible case for current code
> is a big unmap with lots of scattered page sizes. We _should_ get that
> with just a single PID flush at the end, but what we will get today is
> a bunch of PID and VA flushes.
> 
> I don't propose doing that though, I'd rather be explicit about
> tracking start and end range of each page size. Still not "optimal"
> but neither is existing single range for sparse mappings... anyway it
> will need to be profiled, but my point is we don't really fit exactly
> what x86/arm want.

If we have an arch specific part, we could just remember up to N
"large" pages there without actually flushing, and if that overflows,
upgrade to a full flush.
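
As a rough sketch of that idea, assuming a hypothetical arch-private part of
mmu_gather (every name below is invented for illustration, none of it is
existing kernel API):

#define MAX_TRACKED_LARGE_PAGES	8

struct arch_mmu_gather {
	/* addresses of "large" pages seen so far in this gather */
	unsigned long	large_vaddr[MAX_TRACKED_LARGE_PAGES];
	unsigned int	nr_large;
	bool		need_full_flush;	/* overflowed, do a full flush */
};

static inline void arch_track_large_page(struct arch_mmu_gather *ag,
					 unsigned long addr)
{
	if (ag->need_full_flush)
		return;
	if (ag->nr_large == MAX_TRACKED_LARGE_PAGES) {
		/* too many to remember individually, upgrade to full flush */
		ag->need_full_flush = true;
		return;
	}
	ag->large_vaddr[ag->nr_large++] = addr;
}

At flush time the arch either walks large_vaddr[] with size-specific
invalidates or, if need_full_flush is set, issues the single full flush.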

Cheers,
Ben.



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-30 Thread Peter Zijlstra
On Thu, Aug 30, 2018 at 12:13:50AM +, Vineet Gupta wrote:
> On 08/27/2018 04:00 AM, Peter Zijlstra wrote:
> >
> > The one obvious thing SH and ARM want is a sensible default for
> > tlb_start_vma(). (also: https://lkml.org/lkml/2004/1/15/6 )
> >
> > The below makes tlb_start_vma() default to flush_cache_range(), which
> > should be right and sufficient. The only exceptions that I found were
> > (oddly):
> >
> >   - m68k-mmu
> >   - sparc64
> >   - unicore
> >
> > Those architectures appear to have a non-NOP flush_cache_range(), but
> > their current tlb_start_vma() does not call it.
> 
> So indeed we follow DaveM's insight from 2004 about tlb_{start,end}_vma() and
> those are no-ops for ARC in the general case. For the historic VIPT aliasing
> dcache they are what they should be per the 2004 link above - I presume that is
> all hunky dory with you?

Yep, I was just confused about those 3 architectures having
flush_cache_range() but not calling it from tlb_start_vma(). The regular
VIPT aliasing thing is all good. And not having them is also fine.

> > Furthermore, I think tlb_flush() is broken on arc and parisc; in
> > particular they don't appear to have any TLB invalidate for the
> > shift_arg_pages() case, where we do not call tlb_*_vma() and fullmm=0.
> 
> Care to explain this issue a bit more ?
> And that is independent of the current discussion.
> 
> > Possibly shift_arg_pages() should be fixed instead.

So the problem is that shift_arg_pages() does not call tlb_start_vma() /
tlb_end_vma(). It also has fullmm=0. Therefore, on ARC, there will not
be a TLB invalidate at all when freeing the page-tables.

And while looking at this more, move_page_tables() also looks dodgy, I
think it always needs a TLB flush of the entire 'old' range.
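
To make the failure mode concrete, this is the shape of the problem using
parisc-style definitions (simplified from the hunks quoted further down;
the arc code differs in detail but has the same structure):

#define tlb_flush(tlb)						\
do {	if ((tlb)->fullmm)					\
		flush_tlb_mm((tlb)->mm);			\
} while (0)

#define tlb_end_vma(tlb, vma)					\
do {	if (!(tlb)->fullmm)					\
		flush_tlb_range(vma, vma->vm_start, vma->vm_end); \
} while (0)

shift_arg_pages() runs with fullmm == 0 and never calls tlb_start_vma() /
tlb_end_vma(), so neither of the two flushes above executes on that path
and stale TLB entries survive the page-table free.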

> > diff --git a/arch/arc/include/asm/tlb.h b/arch/arc/include/asm/tlb.h
> > index a9db5f62aaf3..7af2b373ebe7 100644
> > --- a/arch/arc/include/asm/tlb.h
> > +++ b/arch/arc/include/asm/tlb.h
> > @@ -23,15 +23,6 @@ do { \
> >   *
> >   * Note, read http://lkml.org/lkml/2004/1/15/6
> >   */
> > -#ifndef CONFIG_ARC_CACHE_VIPT_ALIASING
> > -#define tlb_start_vma(tlb, vma)
> > -#else
> > -#define tlb_start_vma(tlb, vma)	\
> > -do {	\
> > -   if (!tlb->fullmm)   \
> > -   flush_cache_range(vma, vma->vm_start, vma->vm_end); \
> > -} while(0)
> > -#endif
> >  
> >  #define tlb_end_vma(tlb, vma)	\
> >  do {	\
> 
> [snip..]
> 
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index e811ef7b8350..1d037fd5bb7a 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -181,19 +181,21 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
> >   * the vmas are adjusted to only cover the region to be torn down.
> >   */
> >  #ifndef tlb_start_vma
> > -#define tlb_start_vma(tlb, vma) do { } while (0)
> > +#define tlb_start_vma(tlb, vma)	\
> > +do {	\
> > +   if (!tlb->fullmm)   \
> > +   flush_cache_range(vma, vma->vm_start, vma->vm_end); \
> > +} while (0)
> >  #endif
> 
> So for non-aliasing arches to not be affected, this relies on
> flush_cache_range() being a no-op?

Yes; a cursory inspection shows this to be so, with the 'obvious' exception
of the above 3 architectures.
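
On those non-aliasing architectures the stub is the usual empty definition,
something of this shape (as in the asm-generic header), so the new
tlb_start_vma() default compiles away to nothing there:

#define flush_cache_range(vma, start, end)	do { } while (0)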

> > -#define __tlb_end_vma(tlb, vma)\
> > -   do {\
> > -   if (!tlb->fullmm && tlb->end) { \
> > -   tlb_flush(tlb); \
> > -   __tlb_reset_range(tlb); \
> > -   }   \
> > -   } while (0)
> > -
> >  #ifndef tlb_end_vma
> > -#define tlb_end_vma	__tlb_end_vma
> > +#define tlb_end_vma(tlb, vma)	\
> > +   do {\
> > +   if (!tlb->fullmm && tlb->end) { \
> > +   tlb_flush(tlb); \
> > +   __tlb_reset_range(tlb); \
> > +   }   \
> > +   } while (0)
> >  #endif
> >  
> >  #ifndef __tlb_remove_tlb_entry
> 
> And this one is for shift_arg_pages(), but it will also cause extraneous flushes
> for other cases - which are not happening currently!

No, shift_arg_pages() 

Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-29 Thread Vineet Gupta
On 08/27/2018 04:00 AM, Peter Zijlstra wrote:
>
> The one obvious thing SH and ARM want is a sensible default for
> tlb_start_vma(). (also: https://lkml.org/lkml/2004/1/15/6 )
>
> The below makes tlb_start_vma() default to flush_cache_range(), which
> should be right and sufficient. The only exceptions that I found were
> (oddly):
>
>   - m68k-mmu
>   - sparc64
>   - unicore
>
> Those architectures appear to have a non-NOP flush_cache_range(), but
> their current tlb_start_vma() does not call it.

So indeed we follow DaveM's insight from 2004 about tlb_{start,end}_vma() and
those are no-ops for ARC in the general case. For the historic VIPT aliasing
dcache they are what they should be per the 2004 link above - I presume that is all
hunky dory with you?

> Furthermore, I think tlb_flush() is broken on arc and parisc; in
> particular they don't appear to have any TLB invalidate for the
> shift_arg_pages() case, where we do not call tlb_*_vma() and fullmm=0.

Care to explain this issue a bit more ?
And that is independent of the current discussion.

> Possibly shift_arg_pages() should be fixed instead.
>
> Some archs (nds32,sparc32) avoid this by having an unconditional
> flush_tlb_mm() in tlb_flush(), which seems somewhat suboptimal if you
> have flush_tlb_range().  TLB_FLUSH_VMA() might be an option, however
> hideous it is.
>
> ---
>
> diff --git a/arch/arc/include/asm/tlb.h b/arch/arc/include/asm/tlb.h
> index a9db5f62aaf3..7af2b373ebe7 100644
> --- a/arch/arc/include/asm/tlb.h
> +++ b/arch/arc/include/asm/tlb.h
> @@ -23,15 +23,6 @@ do {   \
>   *
>   * Note, read http://lkml.org/lkml/2004/1/15/6
>   */
> -#ifndef CONFIG_ARC_CACHE_VIPT_ALIASING
> -#define tlb_start_vma(tlb, vma)
> -#else
> -#define tlb_start_vma(tlb, vma)	\
> -do {	\
> - if (!tlb->fullmm)   \
> - flush_cache_range(vma, vma->vm_start, vma->vm_end); \
> -} while(0)
> -#endif
>  
>  #define tlb_end_vma(tlb, vma)	\
>  do {	\

[snip..]

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index e811ef7b8350..1d037fd5bb7a 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -181,19 +181,21 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
>   * the vmas are adjusted to only cover the region to be torn down.
>   */
>  #ifndef tlb_start_vma
> -#define tlb_start_vma(tlb, vma) do { } while (0)
> +#define tlb_start_vma(tlb, vma)	\
> +do {	\
> + if (!tlb->fullmm)   \
> + flush_cache_range(vma, vma->vm_start, vma->vm_end); \
> +} while (0)
>  #endif

So for non-aliasing arches to not be affected, this relies on
flush_cache_range() being a no-op?

> -#define __tlb_end_vma(tlb, vma)  \
> - do {\
> - if (!tlb->fullmm && tlb->end) { \
> - tlb_flush(tlb); \
> - __tlb_reset_range(tlb); \
> - }   \
> - } while (0)
> -
>  #ifndef tlb_end_vma
> -#define tlb_end_vma  __tlb_end_vma
> +#define tlb_end_vma(tlb, vma)	\
> + do {\
> + if (!tlb->fullmm && tlb->end) { \
> + tlb_flush(tlb); \
> + __tlb_reset_range(tlb); \
> + }   \
> + } while (0)
>  #endif
>  
>  #ifndef __tlb_remove_tlb_entry

And this one is for shift_arg_pages(), but it will also cause extraneous flushes for
other cases - which are not happening currently!

-Vineet



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 09:36:50 -0400
Rik van Riel  wrote:

> On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> 
> > It could do that. It requires a tlbie that matches the page size,
> > so it means 3 sizes. I think possibly even that would be better
> > than current code, but we could do better if we had a few specific
> > fields in there.  
> 
> Would it cause a noticeable overhead to keep track
> of which page sizes were removed, and to simply flush
> the whole TLB in the (unlikely?) event that multiple
> page sizes were removed in the same munmap?
> 
> Once the unmap is so large that multiple page sizes
> were covered, you may already be looking at so many
> individual flush operations that a full flush might
> be faster.

It will take some profiling and measuring. Unmapping a small number of
huge pages plus a small number of surrounding small pages may not be
uncommon if THP is working well. That could become a lot more expensive.

> 
> Is there a point on PPC where simply flushing the
> whole TLB, and having other things be reloaded later,
> is faster than flushing every individual page mapping
> that got unmapped?

There is. For local TLB flushes that point is well over 100 individual
invalidates though. We're generally better off flushing all page sizes
for that case.
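
A sketch of how such a cutoff could be applied at flush time (the constant
and helper below are invented for illustration, the real powerpc code has
its own tunables):

#define LOCAL_FLUSH_ALL_THRESHOLD	128	/* illustrative value */

static void flush_range_or_all(struct vm_area_struct *vma,
			       unsigned long start, unsigned long end,
			       unsigned int page_shift)
{
	unsigned long nr_pages = (end - start) >> page_shift;

	if (nr_pages > LOCAL_FLUSH_ALL_THRESHOLD)
		flush_tlb_mm(vma->vm_mm);	/* one big flush, refill later */
	else
		flush_tlb_range(vma, start, end);
}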

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Rik van Riel
On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:

> It could do that. It requires a tlbie that matches the page size,
> so it means 3 sizes. I think possibly even that would be better
> than current code, but we could do better if we had a few specific
> fields in there.

Would it cause a noticeable overhead to keep track
of which page sizes were removed, and to simply flush
the whole TLB in the (unlikely?) event that multiple
page sizes were removed in the same munmap?

Once the unmap is so large that multiple page sizes
were covered, you may already be looking at so many
individual flush operations that a full flush might
be faster.
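
In sketch form (names invented for illustration), that bookkeeping could be
as small as a bitmask of the page sizes seen during the unmap:

enum { PSIZE_4K, PSIZE_64K, PSIZE_2M, PSIZE_1G, NR_PSIZES };

struct tlb_psize_track {
	unsigned int	sizes_seen;	/* one bit per page size unmapped */
};

static inline void note_unmapped_psize(struct tlb_psize_track *t, int psize)
{
	t->sizes_seen |= 1u << psize;
}

static inline bool want_full_flush(const struct tlb_psize_track *t)
{
	/* more than one bit set means mixed page sizes in this munmap */
	return t->sizes_seen & (t->sizes_seen - 1);
}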

Is there a point on PPC where simply flushing the
whole TLB, and having other things be reloaded later,
is faster than flushing every individual page mapping
that got unmapped?

-- 
All Rights Reversed.




Re: removig ia64, was: Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Jason Duerstock
I cannot speak to how widespread it has been adopted, but the linux
(kernel) package for version 4.17.17 has been successfully built and
installed for ia64 under Debian ports.  There is clearly more work to
do to get ia64 rehabilitated, but there are over 10,000 packages
currently successfully built for ia64 under Debian ports[1].

Jason

[1] https://buildd.debian.org/status/architecture.php?a=ia64=sid
On Mon, Aug 27, 2018 at 4:57 AM Christoph Hellwig  wrote:
>
> On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> > sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> > arch/ia64' leaving us with s390.
>
> Is removing ia64 a serious plan?  It is the cause for a fair share of
> oddities in dma land, and I did not have much luck getting maintainer
> replies lately, but I didn't know of a plan to get rid of it.
>
> What is the state of people still using ia64 mainline kernels vs just
> old distros in the still existing machines?


Re: removig ia64, was: Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Peter Zijlstra
On Mon, Aug 27, 2018 at 01:57:08AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> > sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> > arch/ia64' leaving us with s390.
> 
> Is removing ia64 a serious plan?

I 'joked' about it a while ago on IRC, and aegl reacted that it might
not be entirely unreasonable.

> It is the cause for a fair share of
> oddities in dma land, and I did not have much luck getting maintainer
> replies lately, but I didn't know of a plan to get rid of it.
> 
> What is the state of people still using ia64 mainline kernels vs just
> old distros in the still existing machines?

Both arjan and aegl said that the vast majority of people still running
ia64 machines run old distros.




Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Peter Zijlstra
On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> And there's only like 4 architectures that still have a custom
> mmu_gather:
> 
>   - sh
>   - arm
>   - ia64
>   - s390
> 
> sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> arch/ia64' leaving us with s390.

The one obvious thing SH and ARM want is a sensible default for
tlb_start_vma(). (also: https://lkml.org/lkml/2004/1/15/6 )

The below makes tlb_start_vma() default to flush_cache_range(), which
should be right and sufficient. The only exceptions that I found were
(oddly):

  - m68k-mmu
  - sparc64
  - unicore

Those architectures appear to have a non-NOP flush_cache_range(), but
their current tlb_start_vma() does not call it.

Furthermore, I think tlb_flush() is broken on arc and parisc; in
particular they don't appear to have any TLB invalidate for the
shift_arg_pages() case, where we do not call tlb_*_vma() and fullmm=0.

Possibly shift_arg_pages() should be fixed instead.

Some archs (nds32,sparc32) avoid this by having an unconditional
flush_tlb_mm() in tlb_flush(), which seems somewhat suboptimal if you
have flush_tlb_range().  TLB_FLUSH_VMA() might be an option, however
hideous it is.
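
For reference, the range-based alternative amounts to roughly the following
(a sketch only, built on the generic tlb->start/end fields; the real
TLB_FLUSH_VMA() macros in the arch headers differ in detail):

#define tlb_flush(tlb)							\
do {									\
	if ((tlb)->fullmm) {						\
		flush_tlb_mm((tlb)->mm);				\
	} else if ((tlb)->end) {					\
		struct vm_area_struct vma = { .vm_mm = (tlb)->mm };	\
		flush_tlb_range(&vma, (tlb)->start, (tlb)->end);	\
	}								\
} while (0)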

---

diff --git a/arch/arc/include/asm/tlb.h b/arch/arc/include/asm/tlb.h
index a9db5f62aaf3..7af2b373ebe7 100644
--- a/arch/arc/include/asm/tlb.h
+++ b/arch/arc/include/asm/tlb.h
@@ -23,15 +23,6 @@ do { \
  *
  * Note, read http://lkml.org/lkml/2004/1/15/6
  */
-#ifndef CONFIG_ARC_CACHE_VIPT_ALIASING
-#define tlb_start_vma(tlb, vma)
-#else
-#define tlb_start_vma(tlb, vma)	\
-do {	\
-   if (!tlb->fullmm)   \
-   flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-} while(0)
-#endif
 
 #define tlb_end_vma(tlb, vma)  \
 do {   \
diff --git a/arch/mips/include/asm/tlb.h b/arch/mips/include/asm/tlb.h
index b6823b9e94da..9d04b4649692 100644
--- a/arch/mips/include/asm/tlb.h
+++ b/arch/mips/include/asm/tlb.h
@@ -5,16 +5,6 @@
 #include 
 #include 
 
-/*
- * MIPS doesn't need any special per-pte or per-vma handling, except
- * we need to flush cache for area to be unmapped.
- */
-#define tlb_start_vma(tlb, vma)\
-   do {\
-   if (!tlb->fullmm)   \
-   flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-   }  while (0)
-#define tlb_end_vma(tlb, vma) do { } while (0)
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 
 /*
diff --git a/arch/nds32/include/asm/tlb.h b/arch/nds32/include/asm/tlb.h
index b35ae5eae3ab..0bf7c9482381 100644
--- a/arch/nds32/include/asm/tlb.h
+++ b/arch/nds32/include/asm/tlb.h
@@ -4,12 +4,6 @@
 #ifndef __ASMNDS32_TLB_H
 #define __ASMNDS32_TLB_H
 
-#define tlb_start_vma(tlb,vma) \
-   do {\
-   if (!tlb->fullmm)   \
-   flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-   } while (0)
-
 #define tlb_end_vma(tlb,vma)   \
do {\
if(!tlb->fullmm)\
diff --git a/arch/nios2/include/asm/tlb.h b/arch/nios2/include/asm/tlb.h
index d3bc648e08b5..9b518c6d0f62 100644
--- a/arch/nios2/include/asm/tlb.h
+++ b/arch/nios2/include/asm/tlb.h
@@ -15,16 +15,6 @@
 
 extern void set_mmu_pid(unsigned long pid);
 
-/*
- * NiosII doesn't need any special per-pte or per-vma handling, except
- * we need to flush cache for the area to be unmapped.
- */
-#define tlb_start_vma(tlb, vma)\
-   do {\
-   if (!tlb->fullmm)   \
-   flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-   }  while (0)
-
 #define tlb_end_vma(tlb, vma)  do { } while (0)
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 
diff --git a/arch/parisc/include/asm/tlb.h b/arch/parisc/include/asm/tlb.h
index 0c881e74d8a6..b1984f9cd3af 100644
--- a/arch/parisc/include/asm/tlb.h
+++ b/arch/parisc/include/asm/tlb.h
@@ -7,11 +7,6 @@ do {   if ((tlb)->fullmm)  \
flush_tlb_mm((tlb)->mm);\
 } while (0)
 
-#define tlb_start_vma(tlb, vma) \
-do {   if (!(tlb)->fullmm) \
-   flush_cache_range(vma, vma->vm_start, vma->vm_end); \
-} while (0)
-
 #define tlb_end_vma(tlb, vma)  \
 do {   if (!(tlb)->fullmm) \
flush_tlb_range(vma, vma->vm_start, 

Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 18:09:50 +1000
Benjamin Herrenschmidt  wrote:

> On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> > > Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> > > ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> > > do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> > > still correct for huge pages, inefficient maybe, but so is flushing
> > > every other page because 'sparse' transparent-huge-pages.  
> > 
> > It could do that. It requires a tlbie that matches the page size,
> > so it means 3 sizes. I think possibly even that would be better
> > than current code, but we could do better if we had a few specific
> > fields in there.  
> 
> More tlbies ? With the cost of the broadcasts on the fabric ? I don't
> think so.. or I'm not understanding your point...

More tlbies are no good, but there will be some places where it works
out much better (and fewer tlbies). Worst possible case for current code
is a big unmap with lots of scattered page sizes. We _should_ get that
with just a single PID flush at the end, but what we will get today is
a bunch of PID and VA flushes.

I don't propose doing that though, I'd rather be explicit about
tracking start and end range of each page size. Still not "optimal"
but neither is existing single range for sparse mappings... anyway it
will need to be profiled, but my point is we don't really fit exactly
what x86/arm want.
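
A sketch of "tracking start and end range of each page size" (names
invented for illustration; radix has a small, fixed set of page sizes):

enum { TLB_PSIZE_4K, TLB_PSIZE_64K, TLB_PSIZE_2M, TLB_PSIZE_1G, TLB_NR_PSIZES };

struct tlb_psize_ranges {
	unsigned long	start[TLB_NR_PSIZES];
	unsigned long	end[TLB_NR_PSIZES];	/* end == 0 means "empty" */
};

static inline void track_psize_range(struct tlb_psize_ranges *r, int psize,
				     unsigned long addr, unsigned long size)
{
	if (!r->end[psize] || addr < r->start[psize])
		r->start[psize] = addr;
	if (addr + size > r->end[psize])
		r->end[psize] = addr + size;
}

At flush time each non-empty range gets its own page-size-specific tlbie
loop, and a large or badly mixed gather can still fall back to one PID flush.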

Thanks,
Nick


removig ia64, was: Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Christoph Hellwig
On Mon, Aug 27, 2018 at 09:47:01AM +0200, Peter Zijlstra wrote:
> sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> arch/ia64' leaving us with s390.

Is removing ia64 a serious plan?  It is the cause for a fair share of
oddities in dma land, and I did not have much luck getting maintainer
replies lately, but I didn't know of a plan to get rid of it.

What is the state of people still using ia64 mainline kernels vs just
old distros in the still existing machines?


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 10:20:45 +0200
Peter Zijlstra  wrote:

> On Mon, Aug 27, 2018 at 06:09:50PM +1000, Benjamin Herrenschmidt wrote:
> 
> > Sadly our architecture requires a precise match between the page size
> > specified in the tlbie instruction and the entry in the TLB or it won't
> > be flushed.  
> 
> Argh.. OK I see. That is rather unfortunate and does seem to require
> something along the lines of tlb_remove_check_page_size_change().

Or we can do better with some more of our own data in mmu_gather,
but things that probably few or no other architectures want. I've
held off trying to put any crap in generic code because there's
other lower hanging fruit still, but I'd really rather just give
archs the ability to put their own data in there. I don't really
see a downside to it (divergence of course, but the existing
proliferation of code is much harder to follow than some data that
would be maintained and used purely by the arch, and beats having
to implement entirely your own mmu_gather).
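
A sketch of that opt-in arch data (names and the Kconfig symbol below are
hypothetical, purely for illustration):

/* Defined by the architecture when it opts in; generic code never looks inside. */
struct arch_tlb_data {
	unsigned long	flags;		/* whatever the arch wants to track */
};

struct mmu_gather {
	struct mm_struct	*mm;
	unsigned long		start, end;
	unsigned int		fullmm : 1;
	/* ... other existing generic members elided ... */
#ifdef CONFIG_ARCH_WANT_MMU_GATHER_DATA		/* hypothetical option */
	struct arch_tlb_data	arch;		/* maintained only by arch code */
#endif
};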

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Benjamin Herrenschmidt
On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> > Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> > ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> > do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> > still correct for huge pages, inefficient maybe, but so is flushing
> > every other page because 'sparse' transparent-huge-pages.
> 
> It could do that. It requires a tlbie that matches the page size,
> so it means 3 sizes. I think possibly even that would be better
> than current code, but we could do better if we had a few specific
> fields in there.

More tlbies ? With the cost of the broadcasts on the fabric ? I don't
think so.. or I'm not understanding your point...

Sadly our architecture requires a precise match between the page size
specified in the tlbie instruction and the entry in the TLB or it won't
be flushed.

Ben.



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Peter Zijlstra
On Mon, Aug 27, 2018 at 06:09:50PM +1000, Benjamin Herrenschmidt wrote:

> Sadly our architecture requires a precise match between the page size
> specified in the tlbie instruction and the entry in the TLB or it won't
> be flushed.

Argh.. OK I see. That is rather unfortunate and does seem to require
something along the lines of tlb_remove_check_page_size_change().
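
For context, the helper being referred to has roughly this shape
(paraphrased from asm-generic/tlb.h of that era, not a verbatim copy): it
forces a flush whenever the page size being removed changes mid-gather.

static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
						     unsigned int page_size)
{
	if (tlb->page_size && tlb->page_size != page_size) {
		if (!tlb->fullmm)
			tlb_flush_mmu(tlb);
	}
	tlb->page_size = page_size;
}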


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Nicholas Piggin
On Mon, 27 Aug 2018 09:47:01 +0200
Peter Zijlstra  wrote:

> On Mon, Aug 27, 2018 at 03:00:08PM +1000, Nicholas Piggin wrote:
> > On Fri, 24 Aug 2018 13:39:53 +0200
> > Peter Zijlstra  wrote:  
> > > On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:  
> 
> > > > Hurm.. look at commit:
> > > > 
> > > >   e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and 
> > > > force flush if page size change")
> > > 
> > > Ah, good, it seems that already got cleaned up a lot. But it all moved
> > > into the power code.. blergh.  
> > 
> > I lost track of what the problem is here?  
> 
> Aside from the commit above being absolute crap (which did get fixed up,
> luckily) I would really like to get rid of all arch specific mmu_gather.
> 
> We can have opt-in bits to the generic code, but the endless back and
> forth between common and arch code is an utter pain in the arse.
> 
> And there's only like 4 architectures that still have a custom
> mmu_gather:
> 
>   - sh
>   - arm
>   - ia64
>   - s390
> 
> sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
> arch/ia64' leaving us with s390.

Well I don't see a big problem in having an arch_mmu_gather field
or small bits. powerpc would actually like that rather than trying
to add things it wants into generic code (and it wants more than
just a few flags bits, ideally).

> After that everyone uses the common code and we can clean up.
> 
> > For powerpc, tlb_start_vma is not the right API to use for this because
> > it wants to deal with different page sizes within a vma.  
> 
> Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> still correct for huge pages, inefficient maybe, but so is flushing
> every other page because 'sparse' transparent-huge-pages.

It could do that. It requires a tlbie that matches the page size,
so it means 3 sizes. I think possibly even that would be better
than current code, but we could do better if we had a few specific
fields in there.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Peter Zijlstra
On Mon, Aug 27, 2018 at 03:00:08PM +1000, Nicholas Piggin wrote:
> On Fri, 24 Aug 2018 13:39:53 +0200
> Peter Zijlstra  wrote:
> > On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:

> > > Hurm.. look at commit:
> > > 
> > >   e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force 
> > > flush if page size change")  
> > 
> > Ah, good, it seems that already got cleaned up a lot. But it all moved
> > into the power code.. blergh.
> 
> I lost track of what the problem is here?

Aside from the commit above being absolute crap (which did get fixed up,
luckily) I would really like to get rid of all arch specific mmu_gather.

We can have opt-in bits to the generic code, but the endless back and
forth between common and arch code is an utter pain in the arse.

And there's only like 4 architectures that still have a custom
mmu_gather:

  - sh
  - arm
  - ia64
  - s390

sh is trivial, arm seems doable, with a bit of luck we can do 'rm -rf
arch/ia64' leaving us with s390.

After that everyone uses the common code and we can clean up.

> For powerpc, tlb_start_vma is not the right API to use for this because
> it wants to deal with different page sizes within a vma.

Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
still correct for huge pages, inefficient maybe, but so is flushing
every other page because 'sparse' transparent-huge-pages.
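
In sketch form, the x86-style behaviour described above is just a
base-page-stride loop; INVLPG drops whichever TLB entry covers the address,
huge or not:

static void flush_range_4k_stride(unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE)
		asm volatile("invlpg (%0)" : : "r" (addr) : "memory");
}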



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-26 Thread Nicholas Piggin
On Fri, 24 Aug 2018 13:39:53 +0200
Peter Zijlstra  wrote:

> On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 24, 2018 at 10:47:17AM +0200, Peter Zijlstra wrote:  
> > > On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:  
> > > > The only problem with this approach is that we've lost track of the 
> > > > granule
> > > > size by the point we get to the tlb_flush(), so we can't adjust the 
> > > > stride of
> > > > the TLB invalidations for huge mappings, which actually works nicely in 
> > > > the
> > > > synchronous case (e.g. we perform a single invalidation for a 2MB 
> > > > mapping,
> > > > rather than iterating over it at a 4k granule).
> > > > 
> > > > One thing we could do is switch to synchronous mode if we detect a 
> > > > change in
> > > > granule (i.e. treat it like a batch failure).  
> > > 
> > > We could use tlb_start_vma() to track that, I think. Shouldn't be too
> > > hard.  
> > 
> > Hurm.. look at commit:
> > 
> >   e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force 
> > flush if page size change")  
> 
> Ah, good, it seems that already got cleaned up a lot. But it all moved
> into the power code.. blergh.

I lost track of what the problem is here?

For powerpc, tlb_start_vma is not the right API to use for this because
it wants to deal with different page sizes within a vma.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Peter Zijlstra
On Fri, Aug 24, 2018 at 10:26:50AM -0700, Nadav Amit wrote:
> at 1:47 AM, Peter Zijlstra  wrote:
> 
> > On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> >> The only problem with this approach is that we've lost track of the granule
> >> size by the point we get to the tlb_flush(), so we can't adjust the stride 
> >> of
> >> the TLB invalidations for huge mappings, which actually works nicely in the
> >> synchronous case (e.g. we perform a single invalidation for a 2MB mapping,
> >> rather than iterating over it at a 4k granule).
> >> 
> >> One thing we could do is switch to synchronous mode if we detect a change 
> >> in
> >> granule (i.e. treat it like a batch failure).
> > 
> > We could use tlb_start_vma() to track that, I think. Shouldn't be too
> > hard.
> 
> Somewhat unrelated, but I use this opportunity that TLB got your attention
> for something that bothers me for some time. clear_fixmap(), which is used
> in various places (e.g., text_poke()), ends up in doing only a local TLB
> flush (in __set_pte_vaddr()).
> 
> Is that sufficient?

Urgh.. weren't the fixmaps per cpu? Bah, I remember looking at this
during PTI, but I seem to have forgotten everything again.




Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Nadav Amit
at 1:47 AM, Peter Zijlstra  wrote:

> On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
>> The only problem with this approach is that we've lost track of the granule
>> size by the point we get to the tlb_flush(), so we can't adjust the stride of
>> the TLB invalidations for huge mappings, which actually works nicely in the
>> synchronous case (e.g. we perform a single invalidation for a 2MB mapping,
>> rather than iterating over it at a 4k granule).
>> 
>> One thing we could do is switch to synchronous mode if we detect a change in
>> granule (i.e. treat it like a batch failure).
> 
> We could use tlb_start_vma() to track that, I think. Shouldn't be too
> hard.

Somewhat unrelated, but I'll use this opportunity, now that TLBs have your
attention, for something that has bothered me for some time. clear_fixmap(),
which is used in various places (e.g., text_poke()), ends up doing only a
local TLB flush (in __set_pte_vaddr()).

Is that sufficient?



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Will Deacon
Hi Peter,

On Fri, Aug 24, 2018 at 03:13:32PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 10:35:56AM +0200, Peter Zijlstra wrote:
> 
> > Anyway, it's sorted now; although I'd like to write me a fairly big
> > comment in asm-generic/tlb.h about things, before I forget again.
> 
> How's something like so? There's a little page_size thingy in this;
> mostly because I couldn't be arsed to split it for now.
> 
> Will has opinions on the page_size thing; I'll let him explain.

They're not especially strong opinions; it's just that I don't think the
page size is necessarily the right thing to track, and I'd rather remove it
altogether.

In the patches I've hacked up (I'll post shortly as an RFC), I track the
levels of page-table instead so you can relate the mmu_gather explicitly
with the page-table structure, rather than have to infer it from the page
size. For example, if an architecture could put down huge mappings at the
pte level (e.g. using a contiguous hint in the pte like we have on arm64),
then actually you want to know about the level rather than the size. You can
also track the levels using only 4 bits in the gather structure.
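
Roughly this shape, as a sketch (field and helper names are illustrative, not
necessarily what the RFC will use):

struct mmu_gather {
	/* ... */
	/* which page-table levels were cleared under this gather */
	unsigned int		cleared_ptes : 1;
	unsigned int		cleared_pmds : 1;
	unsigned int		cleared_puds : 1;
	unsigned int		cleared_p4ds : 1;
};

static inline unsigned int tlb_get_unmap_shift(struct mmu_gather *tlb)
{
	/* the smallest level seen decides the invalidation stride */
	if (tlb->cleared_ptes)
		return PAGE_SHIFT;
	if (tlb->cleared_pmds)
		return PMD_SHIFT;
	if (tlb->cleared_puds)
		return PUD_SHIFT;
	return P4D_SHIFT;
}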

Finally, both approaches have a funny corner case when a VMA contains a
mixture of granule sizes. With the "page size has changed, so flush
synchronously" approach you can theoretically end up with a lot of flushes,
where you'd have been better off just invalidating the whole mm. If you track
the levels instead and postpone a flush using the smallest level you saw, then
you're likely to hit whatever threshold you have and nuke the mm.

Will


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Peter Zijlstra
On Fri, Aug 24, 2018 at 03:13:32PM +0200, Peter Zijlstra wrote:
> + *  HAVE_RCU_TABLE_FREE
> + *
> + *  This provides tlb_remove_table(), to be used instead of tlb_remove_page()
> + *  for page directories (__p*_free_tlb()). This provides separate freeing of
> + *  the page-table pages themselves in a semi-RCU fashion (see comment below).
> + *  Useful if your architecture doesn't use IPIs for remote TLB invalidates
> + *  and therefore doesn't naturally serialize with software page-table walkers.
> + *
> + *  HAVE_RCU_TABLE_INVALIDATE
> + *
> + *  This makes HAVE_RCU_TABLE_FREE call tlb_flush_mmu_tlbonly() before freeing
> + *  the page-table pages. Required if you use HAVE_RCU_TABLE_FREE and your
> + *  architecture uses the Linux page-tables natively.

Writing that also made me think we maybe should've negated that option.
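
Negated, the invalidate would be on by default and architectures would opt
out, something like this (a sketch only; the option name here is illustrative,
not from the patch):

static inline void tlb_table_invalidate(struct mmu_gather *tlb)
{
#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
	/* flush page-walk caches before the page-table pages are freed */
	__tlb_flush_mmu_tlbonly(tlb);
#endif
}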



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Peter Zijlstra
On Fri, Aug 24, 2018 at 10:35:56AM +0200, Peter Zijlstra wrote:

> Anyway, it's sorted now; although I'd like to write me a fairly big
> comment in asm-generic/tlb.h about things, before I forget again.

How's something like so? There's a little page_size thingy in this;
mostly because I couldn't be arsed to split it for now.

Will has opinions on the page_size thing; I'll let him explain.

---

 arch/Kconfig   |   3 +
 arch/arm/include/asm/tlb.h |   3 +-
 arch/ia64/include/asm/tlb.h|   3 +-
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/include/asm/tlb.h |  17 --
 arch/s390/include/asm/tlb.h|   4 +-
 arch/sh/include/asm/tlb.h  |   4 +-
 arch/um/include/asm/tlb.h  |   4 +-
 include/asm-generic/tlb.h  | 130 -
 mm/huge_memory.c   |   4 +-
 mm/hugetlb.c   |   2 +-
 mm/madvise.c   |   2 +-
 mm/memory.c|   9 ++-
 13 files changed, 137 insertions(+), 49 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6801123932a5..053c44703539 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -365,6 +365,9 @@ config HAVE_RCU_TABLE_FREE
 config HAVE_RCU_TABLE_INVALIDATE
bool
 
+config HAVE_MMU_GATHER_PAGE_SIZE
+   bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index f854148c8d7c..d644c3c7c6f3 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -286,8 +286,7 @@ tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr
 
 #define tlb_migrate_finish(mm) do { } while (0)
 
-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
+static inline void tlb_change_page_size(struct mmu_gather *tlb,
 unsigned int page_size)
 {
 }
diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index 516355a774bf..bf8985f5f876 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -282,8 +282,7 @@ do {   \
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
tlb_remove_tlb_entry(tlb, ptep, address)
 
-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
+static inline void tlb_change_page_size(struct mmu_gather *tlb,
 unsigned int page_size)
 {
 }
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index db0b6eebbfa5..4db1072868f7 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -217,6 +217,7 @@ config PPC
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
+   select HAVE_MMU_GATHER_PAGE_SIZE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..b29a67137acf 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -27,7 +27,6 @@
 #define tlb_start_vma(tlb, vma)do { } while (0)
 #define tlb_end_vma(tlb, vma)  do { } while (0)
 #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry
-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
 
@@ -46,22 +45,6 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 #endif
 }
 
-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
-unsigned int page_size)
-{
-   if (!tlb->page_size)
-   tlb->page_size = page_size;
-   else if (tlb->page_size != page_size) {
-   if (!tlb->fullmm)
-   tlb_flush_mmu(tlb);
-   /*
-* update the page size after flush for the new
-* mmu_gather.
-*/
-   tlb->page_size = page_size;
-   }
-}
-
 #ifdef CONFIG_SMP
 static inline int mm_is_core_local(struct mm_struct *mm)
 {
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 457b7ba0fbb6..cf3d64313740 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -180,9 +180,7 @@ static inline void pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)   \
tlb_remove_tlb_entry(tlb, ptep, address)
 
-#define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
-static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
-   
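
For reference, the generic helper that replaces the powerpc-private one above
ends up looking roughly like this under the new HAVE_MMU_GATHER_PAGE_SIZE
option (a sketch based on the removed powerpc logic; the final form may differ):

static inline void tlb_change_page_size(struct mmu_gather *tlb,
					unsigned int page_size)
{
#ifdef CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
	if (tlb->page_size && tlb->page_size != page_size) {
		/* force a flush when the page size changes mid-gather */
		if (!tlb->fullmm)
			tlb_flush_mmu(tlb);
	}
	tlb->page_size = page_size;
#endif
}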


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Peter Zijlstra
On Fri, Aug 24, 2018 at 01:32:14PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 24, 2018 at 10:47:17AM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> > > The only problem with this approach is that we've lost track of the 
> > > granule
> > > size by the point we get to the tlb_flush(), so we can't adjust the 
> > > stride of
> > > the TLB invalidations for huge mappings, which actually works nicely in 
> > > the
> > > synchronous case (e.g. we perform a single invalidation for a 2MB mapping,
> > > rather than iterating over it at a 4k granule).
> > > 
> > > One thing we could do is switch to synchronous mode if we detect a change 
> > > in
> > > granule (i.e. treat it like a batch failure).
> > 
> > We could use tlb_start_vma() to track that, I think. Shouldn't be too
> > hard.
> 
> Hurm.. look at commit:
> 
>   e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force flush if page size change")

Ah, good, it seems that already got cleaned up a lot. But it all moved
into the power code.. blergh.


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Peter Zijlstra
On Fri, Aug 24, 2018 at 10:47:17AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> > The only problem with this approach is that we've lost track of the granule
> > size by the point we get to the tlb_flush(), so we can't adjust the stride 
> > of
> > the TLB invalidations for huge mappings, which actually works nicely in the
> > synchronous case (e.g. we perform a single invalidation for a 2MB mapping,
> > rather than iterating over it at a 4k granule).
> > 
> > One thing we could do is switch to synchronous mode if we detect a change in
> > granule (i.e. treat it like a batch failure).
> 
> We could use tlb_start_vma() to track that, I think. Shouldn't be too
> hard.

Hurm.. look at commit:

  e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force flush if page size change")

yuck yuck yuck. That needs fixing.


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Peter Zijlstra
On Thu, Aug 23, 2018 at 02:39:59PM +0100, Will Deacon wrote:
> The only problem with this approach is that we've lost track of the granule
> size by the point we get to the tlb_flush(), so we can't adjust the stride of
> the TLB invalidations for huge mappings, which actually works nicely in the
> synchronous case (e.g. we perform a single invalidation for a 2MB mapping,
> rather than iterating over it at a 4k granule).
> 
> One thing we could do is switch to synchronous mode if we detect a change in
> granule (i.e. treat it like a batch failure).

We could use tlb_start_vma() to track that, I think. Shouldn't be too
hard.


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-24 Thread Peter Zijlstra
On Thu, Aug 23, 2018 at 02:54:20PM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-08-22 at 20:59 -0700, Linus Torvalds wrote:

> > The problem is that x86 _used_ to do this all correctly long long ago.
> > 
> > And then we switched over to the "generic" table flushing (which
> > harkens back to the powerpc code).
> 
> Yes, we wrote the RCU stuff to solve the races with SW walking,
> which is completely orthogonal to HW walking & TLB content. We didn't
> do the move to generic code though ;-)
> 
> > Which actually turned out to be not generic at all, and did not flush
> > the internal pages like x86 used to (back when x86 just used
> > tlb_remove_page for everything).
> 
> Well, having RCU do the flushing is rather generic, it makes sense
> whenever there's somebody doing a SW walk *and* you don't have IPIs to
> synchronize your flushes (ie, anybody with HW TLB invalidation
> broadcast basically, so ARM and us).

Right, so (many many years ago) I moved it over to generic code because
Sparc-hash wanted fast_gup and I figured having multiple copies of this
stuff wasn't ideal.

Then ARM came along and used it because it does the invalidate
broadcast.

And then when we switched x86 over last year or so; because paravirt; I
had long since forgotten all details and completely overlooked this.

Worse; somewhere along the line we tried to get s390 on this and they
ran into the exact problem being fixed now. That _should_ have been a
big clue, but somehow I never got around to thinking about it properly
and they went back to a private copy of all this.

So double fail on me I suppose :-/

Anyway, it's sorted now; although I'd like to write me a fairly big
comment in asm-generic/tlb.h about things, before I forget again.


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-23 Thread Will Deacon
On Wed, Aug 22, 2018 at 05:55:27PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 22, 2018 at 05:30:15PM +0200, Peter Zijlstra wrote:
> > ARM
> > which later used this put an explicit TLB invalidate in their
> > __p*_free_tlb() functions, and PowerPC-radix followed that example.
> 
> > +/*
> > + * If we want tlb_remove_table() to imply TLB invalidates.
> > + */
> > +static inline void tlb_table_invalidate(struct mmu_gather *tlb)
> > +{
> > +#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
> > +   /*
> > +* Invalidate page-table caches used by hardware walkers. Then we still
> > +* need to RCU-sched wait while freeing the pages because software
> > +* walkers can still be in-flight.
> > +*/
> > +   __tlb_flush_mmu_tlbonly(tlb);
> > +#endif
> > +}
> 
> 
> Nick, Will is already looking at using this to remove the synchronous
> invalidation from __p*_free_tlb() for ARM, could you have a look to see
> if PowerPC-radix could benefit from that too?
> 
> Basically, using a patch like the below would give your tlb_flush()
> information on whether tables were removed or not.
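
(The patch in question is, per the rest of the thread, essentially a new bit
in struct mmu_gather that the page-directory freeing hooks set -- a sketch,
not the actual patch:)

struct mmu_gather {
	/* ... */
	unsigned int		freed_tables : 1;
};

#define pte_free_tlb(tlb, ptep, address)		\
	do {						\
		(tlb)->freed_tables = 1;		\
		__pte_free_tlb(tlb, ptep, address);	\
	} while (0)

/* likewise for pmd/pud/p4d_free_tlb() */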

Just to say that I have something up and running for arm64 based on this.
I'll post it once it's seen a bit more testing (either tomorrow or Monday).

Will


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-23 Thread Will Deacon
On Wed, Aug 22, 2018 at 10:11:41PM -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt  
> wrote:
> >
> >
> > So we do need a different flush instruction for the page tables vs. the
> > normal TLB pages.
> 
> Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> invalidates all the internal tlb cache nodes.
> 
> So the "new world order" is exactly that patch that PeterZ sent you, that 
> adds a
> 
> +   unsigned int freed_tables : 1;
> 
> to the 'struct mmu_gather', and then makes all those
> pte/pmd/pud/p4d_free_tlb() functions set that bit.
> 
> So I'm referring to the email PeterZ sent you in this thread that said:
> 
>   Nick, Will is already looking at using this to remove the synchronous
>   invalidation from __p*_free_tlb() for ARM, could you have a look to see
>   if PowerPC-radix could benefit from that too?
> 
>   Basically, using a patch like the below, would give your tlb_flush()
>   information on if tables were removed or not.
> 
> then, in that model, you do *not* need to override these
> pte/pmd/pud/p4d_free_tlb() macros at all (well, you *can* if you want
> to, for doing games with the range modification, but let's say that
> you don't need that right now).
> 
> So instead, when you get to the actual "tlb_flush(tlb)", you do
> exactly that - flush the tlb. And the mmu_gather structure shows you
> how much you need to flush. If you see that "freed_tables" is set,
> then you know that you need to also do the special instruction to
> flush the inner level caches. The range continues to show the page
> range.
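
In that model an architecture's tlb_flush() consumes the gathered range plus
the new bit, roughly like so (a sketch; the arch_flush_*() helpers are made-up
names, not real kernel interfaces):

static void tlb_flush(struct mmu_gather *tlb)
{
	if (tlb->fullmm) {
		arch_flush_mm(tlb->mm);			/* hypothetical full-mm flush */
		return;
	}

	/* invalidate the gathered VA range ... */
	arch_flush_range(tlb->mm, tlb->start, tlb->end);	/* hypothetical */

	/* ... and the page-walk cache, but only if tables were freed */
	if (tlb->freed_tables)
		arch_flush_pwc(tlb->mm);		/* hypothetical PWC flush */
}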

The only problem with this approach is that we've lost track of the granule
size by the point we get to the tlb_flush(), so we can't adjust the stride of
the TLB invalidations for huge mappings, which actually works nicely in the
synchronous case (e.g. we perform a single invalidation for a 2MB mapping,
rather than iterating over it at a 4k granule).

One thing we could do is switch to synchronous mode if we detect a change in
granule (i.e. treat it like a batch failure).

Will


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-23 Thread Nicholas Piggin
On Wed, 22 Aug 2018 17:30:15 +0200
Peter Zijlstra  wrote:

> Jann reported that x86 was missing required TLB invalidates when he
> hit the !*batch slow path in tlb_remove_table().
> 
> This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
> invalidates, the PowerPC-hash where this code originated and the
> Sparc-hash where this was subsequently used did not need that. ARM
> which later used this put an explicit TLB invalidate in their
> __p*_free_tlb() functions, and PowerPC-radix followed that example.

So this is interesting: I _think_ a145abf12c did fix this bug for
powerpc, but then it seems to have been re-broken by a46cc7a90f
because that one defers the flush back to tlb_flush time. There
was quite a lot of churn getting the radix MMU off the ground at
that point though so I'm not 100% sure.

But AFAIKS powerpc today has this same breakage, and this patch
should fix it.

I have a couple of patches that touch the same code which I'll send;
you might have some opinions on them.

Thanks,
Nick



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-23 Thread Martin Schwidefsky
On Wed, 22 Aug 2018 22:20:32 -0700
Linus Torvalds  wrote:

> On Wed, Aug 22, 2018 at 10:11 PM Linus Torvalds
>  wrote:
> >
> > So instead, when you get to the actual "tlb_flush(tlb)", you do
> > exactly that - flush the tlb. And the mmu_gather structure shows you
> > how much you need to flush. If you see that "freed_tables" is set,
> > then you know that you need to also do the special instruction to
> > flush the inner level caches. The range continues to show the page
> > range.  
> 
> Note that this obviously works fine for a hashed table model too - you
> just ignore the "freed_tables" bit entirely and continue to do
> whatever you always did.
> 
> And we can ignore it on x86 too, because we just see the range, and we
> invalidate the range, and that will always invalidate the mid-level
> caching too.
> 
> So the new bit is literally for arm and powerpc-radix (and maybe
> s390), but we want to make the actual VM interface truly generic and
> not have one set of code with five different behaviors (which we
> _currently_ kind of have with the whole in addition to all the
> HAVE_RCU_TABLE_FREE etc config options that modify how the code works).
> 
> It would be good to also cut down on the millions of functions that
> each architecture can override, because Christ, it got very confusing
> at times to follow just how the code flowed from generic code to
> architecture-specific macros back to generic code and then
> arch-specific inline helper functions.
> 
> It's a maze of underscores and "page" vs "table", and "flush" vs "remove" etc.
> 
> But that "it would be good to really make everybody to use as much of
> the generic code as possible" and everybody have the same pattern,
> that's a future thing. But the whole "let's just add that
> "freed_tables" thing would be part of trying to get people to use the
> same overall pattern, even if some architectures might not care about
> that detail.

For s390 the new freed_tables bit looks to be a step in the right
direction. Right now there is the flush_mm bit in the mm->context
that I use to keep track of the need for a global flush of the mm.
The flush_mm bit is currently used for both lazy PTE flushing and
TLB flushes for freed page table pages.

I'll look into it once the generic patch is merged. The switch to
the generic mmu_gather is certainly not a change that can be done
in a hurry; we had too many subtle TLB flush issues in the past.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-23 Thread Nicholas Piggin
On Thu, 23 Aug 2018 15:21:30 +1000
Benjamin Herrenschmidt  wrote:

> On Wed, 2018-08-22 at 22:11 -0700, Linus Torvalds wrote:
> > On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt  
> > wrote:  
> > > 
> > > 
> > > So we do need a different flush instruction for the page tables vs. the
> > > normal TLB pages.  
> > 
> > Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> > invalidates all the internal tlb cache nodes.
> > 
> > So the "new world order" is exactly that patch that PeterZ sent you, that 
> > adds a
> > 
> > +   unsigned int freed_tables : 1;
> >   
> 
>  .../...
> 
> > So instead, when you get to the actual "tlb_flush(tlb)", you do
> > exactly that - flush the tlb. And the mmu_gather structure shows you
> > how much you need to flush. If you see that "freed_tables" is set,
> > then you know that you need to also do the special instruction to
> > flush the inner level caches. The range continues to show the page
> > range.  
> 
> Yup. That looks like a generic version of the "need_flush_all" flag we
> have, which is fine by us.
> 
> Just don't blame powerpc for all the historical crap :-)

And yes we very much want to remove the x86 hacks from generic code and
have them use the sane powerpc/radix page walk cache flushing model.
That would indeed allow us to stop overriding those macros and start
sharing more code with other archs. We can help write or review code to
make sure bugs don't creep in when moving it to generic implementation.
It will be much more relevant to us now because radix is very similar to
x86.

The hack is not the powerpc override macros though, let's be clear
about that. Even x86 will be helped out by removing that crap because
it won't have to do a full TLB flush caused by massive TLB range if it
frees 0..small number of pages that happen to also free some page
tables.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 22:03:40 -0700
Linus Torvalds  wrote:

> On Wed, Aug 22, 2018 at 9:33 PM Nicholas Piggin  wrote:
> >
> > I think it was quite well understood and fixed here, a145abf12c9 but
> > again that was before I really started looking at it.  
> 
> You don't understand the problem.

More fundamentally, I think I didn't understand this fix; I think
powerpc/radix actually does have a bug here. a145abf12c9 was really
just a replacement for x86's hack of expanding the TLB invalidation
range when freeing page tables to capture the page walk cache
(powerpc/radix needs a different instruction, so that didn't work for us).

But I hadn't really looked at this fix closely rather Peter's follow up
post about making powerpc page walk cache flushing design a generic
concept.

My point in this reply was more that my patches from the other month
weren't a blundering attempt to fix this bug without realising it; they
were purely about avoiding the x86 TLB range-expanding hack (which won't
be needed if generic users all move over).

> 
> All the x86 people thought WE ALREADY DID THAT.
> 
> Because we had done this all correctly over a decade ago!
> 
> Nobody realized that it had been screwed up by the powerpc code, and

The powerpc/hash code is not screwed up though AFAIKS. You can't
take arch specific code and slap a "generic" label on it, least of all
the crazy powerpc/hash code, you of all people would agree with that :)

> the commit you point to was believed to be a new *powerpc* only issue,
> because the semantics on powerpc has changed because of the radix
> tree.
> 
> The semantics on x86 have never changed, they've always been the same.
> So why would the x86 people react to powerpc doing something that x86
> had already always done.
> 
> See?
> 
> Nobody cared one whit about commit a145abf12c9, because it just
> handles a new powerpc-specific case.
> 
> > I don't really understand what the issue you have with powerpc here.
> > powerpc hash has the page table flushing accessors which are just
> > no-ops, it's the generic code that fails to call them properly. Surely
> > there was no powerpc patch that removed those calls from generic code?  
> 
> Yes there was.
> 
> Look where the generic code *came* from.
> 
> It's powerpc code.
> 
> See commit 267239116987 ("mm, powerpc: move the RCU page-table freeing
> into generic code").
> 
> The powerpc code was made into the generic code, because the powerpc
> code had to handle all those special RCU freeing things etc that
> others didn't.
> 
> It's just that when x86 was then switched over to use the "generic"
> code, people didn't realize that the generic code didn't do the TLB
> invalidations for page tables, because they hadn't been needed on
> powerpc.

Sure, there was a minor bug in the port. Not that it was a closely
guarded secret that powerpc didn't flush page table pages, but it's a
relatively subtle issue in complex code. That happens.

> 
> So the powerpc code that was made generic, never really was. The new
> "generic" code had a powerpc-specific quirk.
> 
> That then very subtly broke x86, without the x86 people ever
> realizing. Because the old simple non-RCU x86 code had never had that
> issue, it just treated the leaf pages and the directory pages exactly
> the same.
> 
> See?
> 
> And THAT is why I talk about the powerpc code. Because what is
> "generic" code in 4.18 (and several releases before) oisn't actually
> generic.
> 
> And that's basically exactly the bug that the patches from PeterZ is
> fixing. Making the "tlb_remove_table()" code always flush the tlb, the
> way it should have when it was made generic.

It just sounded like you were blaming correct powerpc/hash code for
this. It's just a minor bug in taking that code into generic, not really
a big deal, right? Or are you saying powerpc devs or code could be doing
something better to play nicer with the rest of the archs?

Honestly trying to improve things here, and encouraged by x86 and ARM
looking to move over to a saner page walk cache tracking design and
sharing more code with powerpc/radix. I would help with reviewing
things or writing code or porting powerpc bits if I can.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 22:11 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt  
> wrote:
> > 
> > 
> > So we do need a different flush instruction for the page tables vs. the
> > normal TLB pages.
> 
> Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> invalidates all the internal tlb cache nodes.
> 
> So the "new world order" is exactly that patch that PeterZ sent you, that 
> adds a
> 
> +   unsigned int freed_tables : 1;
> 

 .../...

> So instead, when you get to the actual "tlb_flush(tlb)", you do
> exactly that - flush the tlb. And the mmu_gather structure shows you
> how much you need to flush. If you see that "freed_tables" is set,
> then you know that you need to also do the special instruction to
> flush the inner level caches. The range continues to show the page
> range.

Yup. That looks like a generic version of the "need_flush_all" flag we
have, which is fine by us.

Just don't blame powerpc for all the historical crap :-)

Cheers,
Ben.

> 
>Linus



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Linus Torvalds
On Wed, Aug 22, 2018 at 10:11 PM Linus Torvalds
 wrote:
>
> So instead, when you get to the actual "tlb_flush(tlb)", you do
> exactly that - flush the tlb. And the mmu_gather structure shows you
> how much you need to flush. If you see that "freed_tables" is set,
> then you know that you need to also do the special instruction to
> flush the inner level caches. The range continues to show the page
> range.

Note that this obviously works fine for a hashed table model too - you
just ignore the "freed_tables" bit entirely and continue to do
whatever you always did.

And we can ignore it on x86 too, because we just see the range, and we
invalidate the range, and that will always invalidate the mid-level
caching too.

So the new bit is literally for arm and powerpc-radix (and maybe
s390), but we want to make the actual VM interface truly generic and
not have one set of code with five different behaviors (which we
_currently_ kind of have, in addition to all the
HAVE_RCU_TABLE_FREE etc config options that modify how the code works).

It would be good to also cut down on the millions of functions that
each architecture can override, because Christ, it got very confusing
at times to follow just how the code flowed from generic code to
architecture-specific macros back to generic code and then
arch-specific inline helper functions.

It's a maze of underscores and "page" vs "table", and "flush" vs "remove" etc.

But that "it would be good to really make everybody to use as much of
the generic code as possible" and everybody have the same pattern,
that's a future thing. But the whole "let's just add that
"freed_tables" thing would be part of trying to get people to use the
same overall pattern, even if some architectures might not care about
that detail.

  Linus


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Linus Torvalds
On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt  wrote:
>
>
> So we do need a different flush instruction for the page tables vs. the
> normal TLB pages.

Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
invalidates all the internal tlb cache nodes.

So the "new world order" is exactly that patch that PeterZ sent you, that adds a

+   unsigned int freed_tables : 1;

to the 'struct mmu_gather', and then makes all those
pte/pmd/pud/p4d_free_tlb() functions set that bit.

So I'm referring to the email PeterZ sent you in this thread that said:

  Nick, Will is already looking at using this to remove the synchronous
  invalidation from __p*_free_tlb() for ARM, could you have a look to see
  if PowerPC-radix could benefit from that too?

  Basically, using a patch like the below, would give your tlb_flush()
  information on if tables were removed or not.

then, in that model, you do *not* need to override these
pte/pmd/pud/p4d_free_tlb() macros at all (well, you *can* if you want
to, for doing games with the range modification, but let's say that
you don't need that right now).

So instead, when you get to the actual "tlb_flush(tlb)", you do
exactly that - flush the tlb. And the mmu_gather structure shows you
how much you need to flush. If you see that "freed_tables" is set,
then you know that you need to also do the special instruction to
flush the inner level caches. The range continues to show the page
range.

   Linus
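
To make that model concrete, here is a minimal sketch of an arch tlb_flush() keyed off the new bit. This is an illustration only, not any architecture's real code; example_flush_all(), example_flush_range() and example_flush_walk_cache() are made-up names standing in for whatever the arch actually provides:

static inline void tlb_flush(struct mmu_gather *tlb)
{
	if (tlb->fullmm || tlb->need_flush_all) {
		/* flushing the whole address space also drops the walk cache */
		example_flush_all(tlb->mm);
		return;
	}

	/* invalidate the leaf translations for the gathered range */
	example_flush_range(tlb->mm, tlb->start, tlb->end);

	/* page-table pages were freed: also invalidate the page-walk cache */
	if (tlb->freed_tables)
		example_flush_walk_cache(tlb->mm);
}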


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Linus Torvalds
On Wed, Aug 22, 2018 at 9:33 PM Nicholas Piggin  wrote:
>
> I think it was quite well understood and fixed here, a145abf12c9 but
> again that was before I really started looking at it.

You don't understand the problem.

All the x86 people thought WE ALREADY DID THAT.

Because we had done this all correctly over a decade ago!

Nobody realized that it had been screwed up by the powerpc code, and
the commit you point to was believed to be a new *powerpc* only issue,
because the semantics on powerpc have changed because of the radix
tree.

The semantics on x86 have never changed, they've always been the same.
So why would the x86 people react to powerpc doing something that x86
had already always done.

See?

Nobody cared one whit about commit a145abf12c9, because it just
handles a new powerpc-specific case.

> I don't really understand what the issue you have with powerpc here.
> powerpc hash has the page table flushing accessors which are just
> no-ops, it's the generic code that fails to call them properly. Surely
> there was no powerpc patch that removed those calls from generic code?

Yes there was.

Look where the generic code *came* from.

It's powerpc code.

See commit 267239116987 ("mm, powerpc: move the RCU page-table freeing
into generic code").

The powerpc code was made into the generic code, because the powerpc
code had to handle all those special RCU freeing things etc that
others didn't.

It's just that when x86 was then switched over to use the "generic"
code, people didn't realize that the generic code didn't do the TLB
invalidations for page tables, because they hadn't been needed on
powerpc.

So the powerpc code that was made generic, never really was. The new
"generic" code had a powerpc-specific quirk.

That then very subtly broke x86, without the x86 people ever
realizing. Because the old simple non-RCU x86 code had never had that
issue, it just treated the leaf pages and the directory pages exactly
the same.

See?

And THAT is why I talk about the powerpc code. Because what is
"generic" code in 4.18 (and several releases before) oisn't actually
generic.

And that's basically exactly the bug that the patches from PeterZ is
fixing. Making the "tlb_remove_table()" code always flush the tlb, the
way it should have when it was made generic.

  Linus


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 20:59 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin  wrote:
> > 
> > powerpc/radix has no such issue, it already does this tracking.
> 
> Yeah, I now realize that this was why you wanted to add that hacky
> thing to the generic code, so that you can add the tlb_flush_pgtable()
> call.
> 
> I thought it was because powerpc had some special flush instruction
> for it, and the regular tlb flush didn't do it. But no. It was because
> the regular code had lost the tlb flush _entirely_, because powerpc
> didn't want it.

Heh :-) Well, back on hash we didn't (we do now with Radix) but I
wouldn't blame us for the generic code being broken ... the RCU table
freeing was in arch/powerpc at the time :-) I don't think it was us
making it generic :)

> > We were discussing this a couple of months ago, I wasn't aware of ARM's
> > issue but I suggested x86 could go the same way as powerpc.
> 
> The problem is that x86 _used_ to do this all correctly long long ago.
> 
> And then we switched over to the "generic" table flushing (which
> harkens back to the powerpc code).

Yes, we wrote the RCU stuff to solve the races with SW walking,
which is completely orthogonal to HW walking & TLB content. We didn't
do the move to generic code though ;-)

> Which actually turned out to be not generic at all, and did not flush
> the internal pages like x86 used to (back when x86 just used
> tlb_remove_page for everything).

Well, having RCU do the flushing is rather generic, it makes sense
whenever there's somebody doing a SW walk *and* you don't have IPIs to
synchronize your flushes (ie, anybody with HW TLB invalidation
broadcast basically, so ARM and us).

> So as a result, x86 had unintentionally lost the TLB flush we used to
> have, because tlb_remove_table() had lost the tlb flushing because of
> a powerpc quirk.

This is a somewhat odd way of putting the "blame" :-) But yeah ok...

> You then added it back as a hacky per-architecture hook (apparently
> having realized that you never did it at all), which didn't fix the
> unintentional lack of flushing on x86.
> 
> So now we're going to do it right.  No more "oh, powerpc didn't need
> to flush because the hash tables weren't in the tlb at all" thing in
> the generic code that then others need to work around.

So we do need a different flush instruction for the page tables vs. the
normal TLB pages. 

Cheers,
Ben.
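
The software walk in question is the GUP-fast style lockless walk. A minimal sketch (not the kernel's actual gup code) of that pattern shows why either an IPI or an RCU-sched grace period before the table pages are freed is enough to keep it safe:

static int example_lockless_walk(struct mm_struct *mm, unsigned long addr)
{
	unsigned long flags;
	pgd_t *pgd;
	int ret = 0;

	/*
	 * With interrupts off, this CPU can neither take the shootdown IPI
	 * nor let an RCU-sched grace period complete, so a page-table page
	 * freed via tlb_remove_table() cannot be reused while we still
	 * hold a pointer into it.
	 */
	local_irq_save(flags);
	pgd = pgd_offset(mm, addr);
	if (!pgd_none(*pgd)) {
		/* ... descend through p4d/pud/pmd/pte in the same way ... */
		ret = 1;
	}
	local_irq_restore(flags);

	return ret;
}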




Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 20:59:46 -0700
Linus Torvalds  wrote:

> On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin  wrote:
> >
> > powerpc/radix has no such issue, it already does this tracking.  
> 
> Yeah, I now realize that this was why you wanted to add that hacky
> thing to the generic code, so that you can add the tlb_flush_pgtable()
> call.
> 
> I thought it was because powerpc had some special flush instruction
> for it, and the regular tlb flush didn't do it. But no.

powerpc/radix does have a special instruction for it, that is why I
posted the patch :)

> It was because
> the regular code had lost the tlb flush _entirely_, because powerpc
> didn't want it.

I think that was long before I started looking at the code.
powerpc/hash hardware has no idea about the page tables so yeah they
don't need it.

> 
> > We were discussing this a couple of months ago, I wasn't aware of ARM's
> > issue but I suggested x86 could go the same way as powerpc.  
> 
> The problem is that x86 _used_ to do this all correctly long long ago.
> 
> And then we switched over to the "generic" table flushing (which
> harkens back to the powerpc code).
> 
> Which actually turned out to be not generic at all, and did not flush
> the internal pages like x86 used to (back when x86 just used
> tlb_remove_page for everything).
> 
> So as a result, x86 had unintentionally lost the TLB flush we used to
> have, because tlb_remove_table() had lost the tlb flushing because of
> a powerpc quirk.
> 
> You then added it back as a hacky per-architecture hook (apparently
> having realized that you never did it at all), which didn't fix the

I think it was quite well understood and fixed here, a145abf12c9 but
again that was before I really started looking at it.

The hooks I added recently are for a different reason, and it's
actually the opposite problem -- to work around the hacky generic code
that x86 foisted on other archs.

> unintentional lack of flushing on x86.
> 
> So now we're going to do it right.  No more "oh, powerpc didn't need
> to flush because the hash tables weren't in the tlb at all" thing in
> the generic code that then others need to work around.

I don't really understand the issue you have with powerpc here.
powerpc hash has the page table flushing accessors which are just
no-ops, it's the generic code that fails to call them properly. Surely
there was no powerpc patch that removed those calls from generic code?

powerpc/radix yes it does some arch specific things to do its page walk
cache flushing, but it is a better design than the hacks x86 has in
generic code, surely. I thought you basically agreed and thought x86 /
generic code could move to that kind of model.

Thanks,
Nick


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Linus Torvalds
On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin  wrote:
>
> powerpc/radix has no such issue, it already does this tracking.

Yeah, I now realize that this was why you wanted to add that hacky
thing to the generic code, so that you can add the tlb_flush_pgtable()
call.

I thought it was because powerpc had some special flush instruction
for it, and the regular tlb flush didn't do it. But no. It was because
the regular code had lost the tlb flush _entirely_, because powerpc
didn't want it.

> We were discussing this a couple of months ago, I wasn't aware of ARM's
> issue but I suggested x86 could go the same way as powerpc.

The problem is that x86 _used_ to do this all correctly long long ago.

And then we switched over to the "generic" table flushing (which
harkens back to the powerpc code).

Which actually turned out to be not generic at all, and did not flush
the internal pages like x86 used to (back when x86 just used
tlb_remove_page for everything).

So as a result, x86 had unintentionally lost the TLB flush we used to
have, because tlb_remove_table() had lost the tlb flushing because of
a powerpc quirk.

You then added it back as a hacky per-architecture hook (apparently
having realized that you never did it at all), which didn't fix the
unintentional lack of flushing on x86.

So now we're going to do it right.  No more "oh, powerpc didn't need
to flush because the hash tables weren't in the tlb at all" thing in
the generic code that then others need to work around.

  Linus


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Nicholas Piggin
On Wed, 22 Aug 2018 17:55:27 +0200
Peter Zijlstra  wrote:

> On Wed, Aug 22, 2018 at 05:30:15PM +0200, Peter Zijlstra wrote:
> > ARM
> > which later used this put an explicit TLB invalidate in their
> > __p*_free_tlb() functions, and PowerPC-radix followed that example.  
> 
> > +/*
> > + * If we want tlb_remove_table() to imply TLB invalidates.
> > + */
> > +static inline void tlb_table_invalidate(struct mmu_gather *tlb)
> > +{
> > +#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
> > +   /*
> > +* Invalidate page-table caches used by hardware walkers. Then we still
> > +* need to RCU-sched wait while freeing the pages because software
> > +* walkers can still be in-flight.
> > +*/
> > +   __tlb_flush_mmu_tlbonly(tlb);
> > +#endif
> > +}  
> 
> 
> Nick, Will is already looking at using this to remove the synchronous
> invalidation from __p*_free_tlb() for ARM, could you have a look to see
> if PowerPC-radix could benefit from that too?

powerpc/radix has no such issue, it already does this tracking.

We were discussing this a couple of months ago, I wasn't aware of ARM's
issue but I suggested x86 could go the same way as powerpc. Would have
been good to get some feedback on some of the proposed approaches there.
Because it's not just pwc tracking but if you do this you don't need
those silly hacks in generic code to expand the TLB address range
either.

So powerpc has no fundamental problem with making this stuff generic.
If you need a fix for x86 and ARM for this merge window though, I would
suggest just copying what powerpc already has. Next time we can
consider consolidating them all into generic code.

Thanks,
Nick


> 
> Basically, using a patch like the below, would give your tlb_flush()
> information on if tables were removed or not.
> 
> ---
> 
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -96,12 +96,22 @@ struct mmu_gather {
>  #endif
>   unsigned long   start;
>   unsigned long   end;
> - /* we are in the middle of an operation to clear
> -  * a full mm and can make some optimizations */
> - unsigned int fullmm : 1,
> - /* we have performed an operation which
> -  * requires a complete flush of the tlb */
> - need_flush_all : 1;
> + /*
> +  * we are in the middle of an operation to clear
> +  * a full mm and can make some optimizations
> +  */
> + unsigned int fullmm : 1;
> +
> + /*
> +  * we have performed an operation which
> +  * requires a complete flush of the tlb
> +  */
> + unsigned int need_flush_all : 1;
> +
> + /*
> +  * we have removed page directories
> +  */
> + unsigned int freed_tables : 1;
>  
>   struct mmu_gather_batch *active;
>   struct mmu_gather_batch local;
> @@ -136,6 +146,7 @@ static inline void __tlb_reset_range(str
>   tlb->start = TASK_SIZE;
>   tlb->end = 0;
>   }
> + tlb->freed_tables = 0;
>  }
>  
>  static inline void tlb_remove_page_size(struct mmu_gather *tlb,
> @@ -269,6 +280,7 @@ static inline void tlb_remove_check_page
>  #define pte_free_tlb(tlb, ptep, address) \
>   do {\
>   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __pte_free_tlb(tlb, ptep, address); \
>   } while (0)
>  #endif
> @@ -276,7 +288,8 @@ static inline void tlb_remove_check_page
>  #ifndef pmd_free_tlb
>  #define pmd_free_tlb(tlb, pmdp, address) \
>   do {\
> - __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __pmd_free_tlb(tlb, pmdp, address); \
>   } while (0)
>  #endif
> @@ -286,6 +299,7 @@ static inline void tlb_remove_check_page
>  #define pud_free_tlb(tlb, pudp, address) \
>   do {\
>   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __pud_free_tlb(tlb, pudp, address); \
>   } while (0)
>  #endif
> @@ -295,7 +309,8 @@ static inline void tlb_remove_check_page
>  #ifndef p4d_free_tlb
>  #define p4d_free_tlb(tlb, pudp, address) \
>   do {\
> - __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + __tlb_adjust_range(tlb, address, PAGE_SIZE);\
> + tlb->freed_tables = 1;  \
>   __p4d_free_tlb(tlb, pudp, address); \
>   } while (0)
>  #endif



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Linus Torvalds
On Wed, Aug 22, 2018 at 8:46 AM Peter Zijlstra  wrote:
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -180,6 +180,7 @@ config X86
> select HAVE_PERF_REGS
> select HAVE_PERF_USER_STACK_DUMP
> select HAVE_RCU_TABLE_FREE
> +   select HAVE_RCU_TABLE_INVALIDATE if HAVE_RCU_TABLE_FREE

This is confusing. First you select HAVE_RCU_TABLE_FREE
unconditionally, and then you select HAVE_RCU_TABLE_INVALIDATE based
on that unconditional variable.

I can see why you do it, but that's because I see the next patch. On
its own it just looks like you have a drinking problem.

That said, I was waiting to see if this patch-set would get any
comments before applying it, but it's been mostly crickets... So I
think I'll just apply it and get this issue over and done with.

  Linus


Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Peter Zijlstra
On Wed, Aug 22, 2018 at 05:30:15PM +0200, Peter Zijlstra wrote:
> ARM
> which later used this put an explicit TLB invalidate in their
> __p*_free_tlb() functions, and PowerPC-radix followed that example.

> +/*
> + * If we want tlb_remove_table() to imply TLB invalidates.
> + */
> +static inline void tlb_table_invalidate(struct mmu_gather *tlb)
> +{
> +#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
> + /*
> +  * Invalidate page-table caches used by hardware walkers. Then we still
> +  * need to RCU-sched wait while freeing the pages because software
> +  * walkers can still be in-flight.
> +  */
> + __tlb_flush_mmu_tlbonly(tlb);
> +#endif
> +}


Nick, Will is already looking at using this to remove the synchronous
invalidation from __p*_free_tlb() for ARM, could you have a look to see
if PowerPC-radix could benefit from that too?

Basically, using a patch like the below, would give your tlb_flush()
information on if tables were removed or not.

---

--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -96,12 +96,22 @@ struct mmu_gather {
 #endif
unsigned long   start;
unsigned long   end;
-   /* we are in the middle of an operation to clear
-* a full mm and can make some optimizations */
-   unsigned int fullmm : 1,
-   /* we have performed an operation which
-* requires a complete flush of the tlb */
-   need_flush_all : 1;
+   /*
+* we are in the middle of an operation to clear
+* a full mm and can make some optimizations
+*/
+   unsigned int fullmm : 1;
+
+   /*
+* we have performed an operation which
+* requires a complete flush of the tlb
+*/
+   unsigned int need_flush_all : 1;
+
+   /*
+* we have removed page directories
+*/
+   unsigned int freed_tables : 1;
 
struct mmu_gather_batch *active;
struct mmu_gather_batch local;
@@ -136,6 +146,7 @@ static inline void __tlb_reset_range(str
tlb->start = TASK_SIZE;
tlb->end = 0;
}
+   tlb->freed_tables = 0;
 }
 
 static inline void tlb_remove_page_size(struct mmu_gather *tlb,
@@ -269,6 +280,7 @@ static inline void tlb_remove_check_page
 #define pte_free_tlb(tlb, ptep, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pte_free_tlb(tlb, ptep, address); \
} while (0)
 #endif
@@ -276,7 +288,8 @@ static inline void tlb_remove_check_page
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pmd_free_tlb(tlb, pmdp, address); \
} while (0)
 #endif
@@ -286,6 +299,7 @@ static inline void tlb_remove_check_page
 #define pud_free_tlb(tlb, pudp, address)   \
do {\
__tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__pud_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
@@ -295,7 +309,8 @@ static inline void tlb_remove_check_page
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)   \
do {\
-   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   __tlb_adjust_range(tlb, address, PAGE_SIZE);\
+   tlb->freed_tables = 1;  \
__p4d_free_tlb(tlb, pudp, address); \
} while (0)
 #endif
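
For the producer side of the new bit, the generic page-table teardown is what ends up in these macros. A simplified sketch, close to but not verbatim free_pte_range() in mm/memory.c, of how a pte page reaches pte_free_tlb() (which with the patch above would now also set tlb->freed_tables):

static void example_free_pte_level(struct mmu_gather *tlb, pmd_t *pmd,
				   unsigned long addr)
{
	pgtable_t token = pmd_pgtable(*pmd);

	pmd_clear(pmd);			/* unhook the pte page from the pmd */
	pte_free_tlb(tlb, token, addr);	/* adjust the range, mark freed_tables,
					   and queue the page for freeing */
	mm_dec_nr_ptes(tlb->mm);
}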


[PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Peter Zijlstra
Jann reported that x86 was missing required TLB invalidates when he
hit the !*batch slow path in tlb_remove_table().

This is indeed the case; RCU_TABLE_FREE does not provide TLB (cache)
invalidates, the PowerPC-hash where this code originated and the
Sparc-hash where this was subsequently used did not need that. ARM
which later used this put an explicit TLB invalidate in their
__p*_free_tlb() functions, and PowerPC-radix followed that example.

But when we hooked up x86 we failed to consider this. Fix this by
(optionally) hooking tlb_remove_table() into the TLB invalidate code.

NOTE: s390 was also needing something like this and might now
  be able to use the generic code again.

Cc: sta...@kernel.org
Cc: Nicholas Piggin 
Cc: David Miller 
Cc: Will Deacon 
Cc: Martin Schwidefsky 
Cc: Michael Ellerman 
Fixes: 9e52fc2b50de ("x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)")
Reported-by: Jann Horn 
Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/Kconfig |3 +++
 arch/x86/Kconfig |1 +
 mm/memory.c  |   27 +--
 3 files changed, 29 insertions(+), 2 deletions(-)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -362,6 +362,9 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
bool
 
+config HAVE_RCU_TABLE_INVALIDATE
+   bool
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -180,6 +180,7 @@ config X86
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE
+   select HAVE_RCU_TABLE_INVALIDATE if HAVE_RCU_TABLE_FREE
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
select HAVE_STACKPROTECTOR  if CC_HAS_SANE_STACKPROTECTOR
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -238,17 +238,22 @@ void arch_tlb_gather_mmu(struct mmu_gath
__tlb_reset_range(tlb);
 }
 
-static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+static void __tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
if (!tlb->end)
return;
 
tlb_flush(tlb);
mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end);
+   __tlb_reset_range(tlb);
+}
+
+static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
+{
+   __tlb_flush_mmu_tlbonly(tlb);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
tlb_table_flush(tlb);
 #endif
-   __tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -330,6 +335,21 @@ bool __tlb_remove_page_size(struct mmu_g
  * See the comment near struct mmu_table_batch.
  */
 
+/*
+ * If we want tlb_remove_table() to imply TLB invalidates.
+ */
+static inline void tlb_table_invalidate(struct mmu_gather *tlb)
+{
+#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+   /*
+* Invalidate page-table caches used by hardware walkers. Then we still
+* need to RCU-sched wait while freeing the pages because software
+* walkers can still be in-flight.
+*/
+   __tlb_flush_mmu_tlbonly(tlb);
+#endif
+}
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
/* Simply deliver the interrupt */
@@ -366,6 +386,7 @@ void tlb_table_flush(struct mmu_gather *
struct mmu_table_batch **batch = &tlb->batch;
 
if (*batch) {
+   tlb_table_invalidate(tlb);
call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
*batch = NULL;
}
@@ -378,11 +399,13 @@ void tlb_remove_table(struct mmu_gather
if (*batch == NULL) {
*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
if (*batch == NULL) {
+   tlb_table_invalidate(tlb);
tlb_remove_table_one(table);
return;
}
(*batch)->nr = 0;
}
+
(*batch)->tables[(*batch)->nr++] = table;
if ((*batch)->nr == MAX_TABLE_BATCH)
tlb_table_flush(tlb);
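
For reference, the path this patch covers is entered from the arch side: with HAVE_RCU_TABLE_FREE selected, an arch's __p*_free_tlb() hands the page-table page to tlb_remove_table() instead of freeing it directly. A sketch, close to but not verbatim x86's version in arch/x86/mm/pgtable.c:

void example___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
{
	pgtable_page_dtor(pte);		/* undo the pte-page constructor bookkeeping */
	tlb_remove_table(tlb, pte);	/* defer the free; the batch path and the
					   !*batch slow path now both invalidate
					   the TLB first on x86 */
}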



