Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-02-01 Thread Andrew Cooper
On 01/02/2021 05:58, Nadav Amit wrote:
>> On Jan 31, 2021, at 4:10 AM, Andrew Cooper  wrote:
>>
>> On 31/01/2021 01:07, Andy Lutomirski wrote:
>>> Adding Andrew Cooper, who has a distressingly extensive understanding
>>> of the x86 PTE magic.
>> Pretty sure it is all learning things the hard way...
>>
>>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
 diff --git a/mm/mprotect.c b/mm/mprotect.c
 index 632d5a677d3f..b7473d2c9a1f 100644
 --- a/mm/mprotect.c
 +++ b/mm/mprotect.c
 @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct 
 mmu_gather *tlb,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, 
 ptent);
 -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
 +   if (pte_may_need_flush(oldpte, ptent))
 +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> You're choosing to avoid the flush, based on A/D bits read ahead of the
>> actual modification of the PTE.
>>
>> In this example, another thread can write into the range (sets A and D),
>> and get a suitable TLB entry which goes unflushed while the rest of the
>> kernel thinks the memory is write-protected and clean.
>>
>> The only safe way to do this is to use XCHG/etc to modify the PTE, and
>> base flush calculations on the results.  Atomic operations are ordered
>> with A/D updates from pagewalks on other CPUs, even on AMD where A
>> updates are explicitly not ordered with regular memory reads, for
>> performance reasons.
> Thanks, Andrew, for the feedback, but I think the patch does this in exactly
> the safe manner that you describe (at least on native x86, but I see a
> similar path elsewhere as well):
>
> oldpte = ptep_modify_prot_start()
> -> __ptep_modify_prot_start()
> -> ptep_get_and_clear
> -> native_ptep_get_and_clear()
> -> xchg()
>
> Note that the xchg() will clear the PTE (i.e., making it non-present), and
> no further updates of A/D are possible until ptep_modify_prot_commit() is
> called.
>
> On non-SMP setups this is not atomic (no xchg), but since we hold the lock,
> we should be safe.
>
> I guess you are right that pte_may_need_flush() deserves a comment to
> clarify that oldpte must be obtained by an atomic operation to ensure no A/D
> bits are lost (as you say).
>
> Yet, I do not see a correctness problem. Am I missing something?

No(ish) - I failed to spot that path.

But native_ptep_get_and_clear() is broken on !SMP builds.  It needs to
be an XCHG even in that case, to spot A/D updates from prefetch or
shared-virtual-memory DMA.
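
For reference, the 64-bit definitions look roughly like this (trimmed down;
the point being that a hardware A/D update which lands between the read and
the clear in the !SMP path is silently lost):

#ifdef CONFIG_SMP
static inline pte_t native_ptep_get_and_clear(pte_t *xp)
{
	/* atomic: any A/D bit set before the exchange shows up in the
	 * returned value, and nothing can be set afterwards */
	return native_make_pte(xchg(&xp->pte, 0));
}
#else
static inline pte_t native_ptep_get_and_clear(pte_t *xp)
{
	/* plain read, then clear: an A/D update in between is lost */
	pte_t ret = *xp;
	native_pte_clear(NULL, 0, xp);
	return ret;
}
#endif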

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Nadav Amit
> On Jan 31, 2021, at 4:10 AM, Andrew Cooper  wrote:
> 
> On 31/01/2021 01:07, Andy Lutomirski wrote:
>> Adding Andrew Cooper, who has a distressingly extensive understanding
>> of the x86 PTE magic.
> 
> Pretty sure it is all learning things the hard way...
> 
>> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 632d5a677d3f..b7473d2c9a1f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather 
>>> *tlb,
>>>ptent = pte_mkwrite(ptent);
>>>}
>>>ptep_modify_prot_commit(vma, addr, pte, oldpte, 
>>> ptent);
>>> -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>>> +   if (pte_may_need_flush(oldpte, ptent))
>>> +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> 
> You're choosing to avoid the flush, based on A/D bits read ahead of the
> actual modification of the PTE.
> 
> In this example, another thread can write into the range (sets A and D),
> and get a suitable TLB entry which goes unflushed while the rest of the
> kernel thinks the memory is write-protected and clean.
> 
> The only safe way to do this is to use XCHG/etc to modify the PTE, and
> base flush calculations on the results.  Atomic operations are ordered
> with A/D updates from pagewalks on other CPUs, even on AMD where A
> updates are explicitly not ordered with regular memory reads, for
> performance reasons.

Thanks, Andrew, for the feedback, but I think the patch does this in exactly
the safe manner that you describe (at least on native x86, but I see a
similar path elsewhere as well):

oldpte = ptep_modify_prot_start()
-> __ptep_modify_prot_start()
-> ptep_get_and_clear
-> native_ptep_get_and_clear()
-> xchg()

Note that the xchg() will clear the PTE (i.e., making it non-present), and
no further updates of A/D are possible until ptep_modify_prot_commit() is
called.
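
In other words, the flush decision is made on the value that the xchg()
actually returned. A simplified sketch of the relevant part of
change_pte_range(), with details elided and comments added:

	oldpte = ptep_modify_prot_start(vma, addr, pte);
					/* xchg(): reads the old A/D bits and
					 * makes the PTE non-present in one
					 * atomic step */
	ptent = pte_modify(oldpte, newprot);
					/* new PTE built from the atomically
					 * read value */
	ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
	if (pte_may_need_flush(oldpte, ptent))
					/* decision uses oldpte as returned by
					 * the xchg(), not an earlier read */
		tlb_flush_pte_range(tlb, addr, PAGE_SIZE);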

On non-SMP setups this is not atomic (no xchg), but since we hold the lock,
we should be safe.

I guess you are right that pte_may_need_flush() deserves a comment to
clarify that oldpte must be obtained by an atomic operation to ensure no A/D
bits are lost (as you say).

Yet, I do not see a correctness problem. Am I missing something?



Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Andrew Cooper
On 31/01/2021 01:07, Andy Lutomirski wrote:
> Adding Andrew Cooper, who has a distressingly extensive understanding
> of the x86 PTE magic.

Pretty sure it is all learning things the hard way...

> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 632d5a677d3f..b7473d2c9a1f 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather 
>> *tlb,
>> ptent = pte_mkwrite(ptent);
>> }
>> ptep_modify_prot_commit(vma, addr, pte, oldpte, 
>> ptent);
>> -   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> +   if (pte_may_need_flush(oldpte, ptent))
>> +   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);

You're choosing to avoid the flush, based on A/D bits read ahead of the
actual modification of the PTE.

In this example, another thread can write into the range (sets A and D),
and get a suitable TLB entry which goes unflushed while the rest of the
kernel thinks the memory is write-protected and clean.

The only safe way to do this is to use XCHG/etc to modify the PTE, and
base flush calculations on the results.  Atomic operations are ordered
with A/D updates from pagewalks on other CPUs, even on AMD where A
updates are explicitly not ordered with regular memory reads, for
performance reasons.

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-31 Thread Andrew Cooper
On 31/01/2021 02:59, Andy Lutomirski wrote:
 diff --git a/arch/x86/include/asm/tlbflush.h 
 b/arch/x86/include/asm/tlbflush.h
 index 8c87a2e0b660..a617dc0a9b06 100644
 --- a/arch/x86/include/asm/tlbflush.h
 +++ b/arch/x86/include/asm/tlbflush.h
 @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct 
 arch_tlbflush_unmap_batch *batch,

 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

 +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
 +{
 +   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
 +_PAGE_SOFTW3 | _PAGE_ACCESSED;
>>> Why is accessed ignored?  Surely clearing the accessed bit needs a
>>> flush if the old PTE is present.
>> I am just following the current scheme in the kernel (x86):
>>
>> int ptep_clear_flush_young(struct vm_area_struct *vma,
>>unsigned long address, pte_t *ptep)
>> {
>> /*
>>  * On x86 CPUs, clearing the accessed bit without a TLB flush
>>  * doesn't cause data corruption. [ It could cause incorrect
>>  * page aging and the (mistaken) reclaim of hot pages, but the
>>  * chance of that should be relatively low. ]
>>  *
> If anyone ever implements the optimization of skipping the flush when
> unmapping a !accessed page, then this will cause data corruption.
>
> If nothing else, this deserves a nice comment in the new code.

Architecturally, access bits get set as part of a pagewalk, and before the
TLB entry is made.  Once an access bit has been set, you know for certain
that a mapping for the address is, or was, present in a TLB somewhere.

I can't help but feel that this "optimisation" is shooting itself in the
foot.  It does cause incorrect page aging, and TLBs are getting bigger, so
the chance of reclaiming hot pages is getting worse and worse.

~Andrew


Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-30 Thread Andy Lutomirski
On Sat, Jan 30, 2021 at 5:17 PM Nadav Amit  wrote:
>
> > On Jan 30, 2021, at 5:07 PM, Andy Lutomirski  wrote:
> >
> > Adding Andrew Cooper, who has a distressingly extensive understanding
> > of the x86 PTE magic.
> >
> > On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
> >> From: Nadav Amit 
> >>
> >> Currently, using mprotect() to unprotect a memory region or uffd to
> >> unprotect a memory region causes a TLB flush. At least on x86, as
> >> protection is promoted, no TLB flush is needed.
> >>
> >> Add an arch-specific pte_may_need_flush() which tells whether a TLB
> >> flush is needed based on the old PTE and the new one. Implement an x86
> >> pte_may_need_flush().
> >>
> >> For x86, besides the simple logic that PTE protection promotion or
> >> changes of software bits do not require a flush, also add logic that
> >> considers the dirty-bit. If the dirty-bit is clear and write-protect is
> >> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
> >> on write, and if the bit is clear, the PTE is reread.
> >>
> >> Signed-off-by: Nadav Amit 
> >> Cc: Andrea Arcangeli 
> >> Cc: Andrew Morton 
> >> Cc: Andy Lutomirski 
> >> Cc: Dave Hansen 
> >> Cc: Peter Zijlstra 
> >> Cc: Thomas Gleixner 
> >> Cc: Will Deacon 
> >> Cc: Yu Zhao 
> >> Cc: Nick Piggin 
> >> Cc: x...@kernel.org
> >> ---
> >> arch/x86/include/asm/tlbflush.h | 44 +
> >> include/asm-generic/tlb.h   |  4 +++
> >> mm/mprotect.c   |  3 ++-
> >> 3 files changed, 50 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/include/asm/tlbflush.h 
> >> b/arch/x86/include/asm/tlbflush.h
> >> index 8c87a2e0b660..a617dc0a9b06 100644
> >> --- a/arch/x86/include/asm/tlbflush.h
> >> +++ b/arch/x86/include/asm/tlbflush.h
> >> @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct 
> >> arch_tlbflush_unmap_batch *batch,
> >>
> >> extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> >>
> >> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
> >> +{
> >> +   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
> >> +_PAGE_SOFTW3 | _PAGE_ACCESSED;
> >
> > Why is accessed ignored?  Surely clearing the accessed bit needs a
> > flush if the old PTE is present.
>
> I am just following the current scheme in the kernel (x86):
>
> int ptep_clear_flush_young(struct vm_area_struct *vma,
>unsigned long address, pte_t *ptep)
> {
> /*
>  * On x86 CPUs, clearing the accessed bit without a TLB flush
>  * doesn't cause data corruption. [ It could cause incorrect
>  * page aging and the (mistaken) reclaim of hot pages, but the
>  * chance of that should be relatively low. ]
>  *

If anyone ever implements the optimization of skipping the flush when
unmapping a !accessed page, then this will cause data corruption.

If nothing else, this deserves a nice comment in the new code.

>  * So as a performance optimization don't flush the TLB when
>  * clearing the accessed bit, it will eventually be flushed by
>  * a context switch or a VM operation anyway. [ In the rare
>  * event of it not getting flushed for a long time the delay
>  * shouldn't really matter because there's no real memory
>  * pressure for swapout to react to. ]
>  */
> return ptep_test_and_clear_young(vma, address, ptep);
> }
>
>
> >
> >> +   const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
> >> +   pteval_t oldval = pte_val(oldpte);
> >> +   pteval_t newval = pte_val(newpte);
> >> +   pteval_t diff = oldval ^ newval;
> >> +   pteval_t disable_mask = 0;
> >> +
> >> +   if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
> >> +   disable_mask = _PAGE_NX;
> >> +
> >> +   /* new is non-present: need only if old is present */
> >> +   if (pte_none(newpte))
> >> +   return !pte_none(oldpte);
> >> +
> >> +   /*
> >> +* If, excluding the ignored bits, only RW and dirty are cleared 
> >> and the
> >> +* old PTE does not have the dirty-bit set, we can avoid a flush. 
> >> This
> >> +* is possible since x86 architecture set the dirty bit atomically 
> >> while
> >
> > s/set/sets/
> >
> >> +* it caches the PTE in the TLB.
> >> +*
> >> +* The condition considers any change to RW and dirty as not 
> >> requiring
> >> +* flush if the old PTE is not dirty or not writable for 
> >> simplification
> >> +* of the code and to consider (unlikely) cases of changing 
> >> dirty-bit of
> >> +* write-protected PTE.
> >> +*/
> >> +   if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
> >> +   (!(pte_dirty(oldpte) || !pte_write(oldpte)))
> >> +   return false;
> >
> > This logic seems confusing to me.  Is your goal to say that, if the
> > 

Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-30 Thread Nadav Amit
> On Jan 30, 2021, at 5:07 PM, Andy Lutomirski  wrote:
> 
> Adding Andrew Cooper, who has a distressingly extensive understanding
> of the x86 PTE magic.
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> From: Nadav Amit 
>> 
>> Currently, using mprotect() to unprotect a memory region or uffd to
>> unprotect a memory region causes a TLB flush. At least on x86, as
>> protection is promoted, no TLB flush is needed.
>> 
>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>> flush is needed based on the old PTE and the new one. Implement an x86
>> pte_may_need_flush().
>> 
>> For x86, besides the simple logic that PTE protection promotion or
>> changes of software bits do not require a flush, also add logic that
>> considers the dirty-bit. If the dirty-bit is clear and write-protect is
>> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
>> on write, and if the bit is clear, the PTE is reread.
>> 
>> Signed-off-by: Nadav Amit 
>> Cc: Andrea Arcangeli 
>> Cc: Andrew Morton 
>> Cc: Andy Lutomirski 
>> Cc: Dave Hansen 
>> Cc: Peter Zijlstra 
>> Cc: Thomas Gleixner 
>> Cc: Will Deacon 
>> Cc: Yu Zhao 
>> Cc: Nick Piggin 
>> Cc: x...@kernel.org
>> ---
>> arch/x86/include/asm/tlbflush.h | 44 +
>> include/asm-generic/tlb.h   |  4 +++
>> mm/mprotect.c   |  3 ++-
>> 3 files changed, 50 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h 
>> b/arch/x86/include/asm/tlbflush.h
>> index 8c87a2e0b660..a617dc0a9b06 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct 
>> arch_tlbflush_unmap_batch *batch,
>> 
>> extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>> 
>> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
>> +{
>> +   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
>> +_PAGE_SOFTW3 | _PAGE_ACCESSED;
> 
> Why is accessed ignored?  Surely clearing the accessed bit needs a
> flush if the old PTE is present.

I am just following the current scheme in the kernel (x86):

int ptep_clear_flush_young(struct vm_area_struct *vma,
   unsigned long address, pte_t *ptep)
{
/*
 * On x86 CPUs, clearing the accessed bit without a TLB flush
 * doesn't cause data corruption. [ It could cause incorrect
 * page aging and the (mistaken) reclaim of hot pages, but the
 * chance of that should be relatively low. ]
 *
 * So as a performance optimization don't flush the TLB when
 * clearing the accessed bit, it will eventually be flushed by
 * a context switch or a VM operation anyway. [ In the rare
 * event of it not getting flushed for a long time the delay
 * shouldn't really matter because there's no real memory
 * pressure for swapout to react to. ]
 */
return ptep_test_and_clear_young(vma, address, ptep);
}


> 
>> +   const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
>> +   pteval_t oldval = pte_val(oldpte);
>> +   pteval_t newval = pte_val(newpte);
>> +   pteval_t diff = oldval ^ newval;
>> +   pteval_t disable_mask = 0;
>> +
>> +   if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
>> +   disable_mask = _PAGE_NX;
>> +
>> +   /* new is non-present: need only if old is present */
>> +   if (pte_none(newpte))
>> +   return !pte_none(oldpte);
>> +
>> +   /*
>> +* If, excluding the ignored bits, only RW and dirty are cleared and 
>> the
>> +* old PTE does not have the dirty-bit set, we can avoid a flush. 
>> This
>> +* is possible since x86 architecture set the dirty bit atomically 
>> while
> 
> s/set/sets/
> 
>> +* it caches the PTE in the TLB.
>> +*
>> +* The condition considers any change to RW and dirty as not 
>> requiring
>> +* flush if the old PTE is not dirty or not writable for 
>> simplification
>> +* of the code and to consider (unlikely) cases of changing 
>> dirty-bit of
>> +* write-protected PTE.
>> +*/
>> +   if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
>> +   (!(pte_dirty(oldpte) || !pte_write(oldpte)))
>> +   return false;
> 
> This logic seems confusing to me.  Is your goal to say that, if the
> old PTE was clean and writable and the new PTE is write-protected,
> then no flush is needed?

Yes.

> If so, I would believe you're right, but I'm
> not convinced you've actually implemented this.  Also, there may be
> other things going on that need flushing, e.g. a change of the address
> or an accessed bit or NX change.

The first part (diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) is supposed
to capture changes of address, NX-bit, etc.

The second part is indeed 

Re: [RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-30 Thread Andy Lutomirski
Adding Andrew Cooper, who has a distressingly extensive understanding
of the x86 PTE magic.

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>
> From: Nadav Amit 
>
> Currently, using mprotect() to unprotect a memory region or uffd to
> unprotect a memory region causes a TLB flush. At least on x86, as
> protection is promoted, no TLB flush is needed.
>
> Add an arch-specific pte_may_need_flush() which tells whether a TLB
> flush is needed based on the old PTE and the new one. Implement an x86
> pte_may_need_flush().
>
> For x86, besides the simple logic that PTE protection promotion or
> changes of software bits do not require a flush, also add logic that
> considers the dirty-bit. If the dirty-bit is clear and write-protect is
> set, no TLB flush is needed, as x86 updates the dirty-bit atomically
> on write, and if the bit is clear, the PTE is reread.
>
> Signed-off-by: Nadav Amit 
> Cc: Andrea Arcangeli 
> Cc: Andrew Morton 
> Cc: Andy Lutomirski 
> Cc: Dave Hansen 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Cc: Will Deacon 
> Cc: Yu Zhao 
> Cc: Nick Piggin 
> Cc: x...@kernel.org
> ---
>  arch/x86/include/asm/tlbflush.h | 44 +
>  include/asm-generic/tlb.h   |  4 +++
>  mm/mprotect.c   |  3 ++-
>  3 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8c87a2e0b660..a617dc0a9b06 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct 
> arch_tlbflush_unmap_batch *batch,
>
>  extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>
> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
> +{
> +   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
> +_PAGE_SOFTW3 | _PAGE_ACCESSED;

Why is accessed ignored?  Surely clearing the accessed bit needs a
flush if the old PTE is present.

> +   const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
> +   pteval_t oldval = pte_val(oldpte);
> +   pteval_t newval = pte_val(newpte);
> +   pteval_t diff = oldval ^ newval;
> +   pteval_t disable_mask = 0;
> +
> +   if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
> +   disable_mask = _PAGE_NX;
> +
> +   /* new is non-present: need only if old is present */
> +   if (pte_none(newpte))
> +   return !pte_none(oldpte);
> +
> +   /*
> +* If, excluding the ignored bits, only RW and dirty are cleared and 
> the
> +* old PTE does not have the dirty-bit set, we can avoid a flush. This
> +* is possible since x86 architecture set the dirty bit atomically 
> while

s/set/sets/

> +* it caches the PTE in the TLB.
> +*
> +* The condition considers any change to RW and dirty as not requiring
> +* flush if the old PTE is not dirty or not writable for 
> simplification
> +* of the code and to consider (unlikely) cases of changing dirty-bit 
> of
> +* write-protected PTE.
> +*/
> +   if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
> +   (!(pte_dirty(oldpte) || !pte_write(oldpte)))
> +   return false;

This logic seems confusing to me.  Is your goal to say that, if the
old PTE was clean and writable and the new PTE is write-protected,
then no flush is needed?  If so, I would believe you're right, but I'm
not convinced you've actually implemented this.  Also, there may be
other things going on that need flushing, e.g. a change of the address
or an accessed bit or NX change.

Also, CET makes this extra bizarre.

> +
> +   /*
> +* Any change of PFN and any flag other than those that we consider
> +* requires a flush (e.g., PAT, protection keys). To save flushes we 
> do
> +* not consider the access bit as it is considered by the kernel as
> +* best-effort.
> +*/
> +   return diff & ((oldval & enable_mask) |
> +  (newval & disable_mask) |
> +  ~(enable_mask | disable_mask | ignore_mask));
> +}
> +#define pte_may_need_flush pte_may_need_flush
> +
>  #endif /* !MODULE */
>
>  #endif /* _ASM_X86_TLBFLUSH_H */
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index eea113323468..c2deec0b6919 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -654,6 +654,10 @@ static inline void tlb_flush_p4d_range(struct mmu_gather 
> *tlb,
> } while (0)
>  #endif
>
> +#ifndef pte_may_need_flush
> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte) { return 
> true; }
> +#endif
> +
>  #endif /* CONFIG_MMU */
>
>  #endif /* _ASM_GENERIC__TLB_H */
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 632d5a677d3f..b7473d2c9a1f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -139,7 +139,8 

[RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

Currently, using mprotect() to unprotect a memory region or uffd to
unprotect a memory region causes a TLB flush. At least on x86, as
protection is promoted, no TLB flush is needed.

Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().

For x86, besides the simple logic that PTE protection promotion or
changes of software bits do not require a flush, also add logic that
considers the dirty-bit. If the dirty-bit is clear and write-protect is
set, no TLB flush is needed, as x86 updates the dirty-bit atomically
on write, and if the bit is clear, the PTE is reread.

Signed-off-by: Nadav Amit 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: Nick Piggin 
Cc: x...@kernel.org
---
 arch/x86/include/asm/tlbflush.h | 44 +
 include/asm-generic/tlb.h   |  4 +++
 mm/mprotect.c   |  3 ++-
 3 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..a617dc0a9b06 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -255,6 +255,50 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+   const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+_PAGE_SOFTW3 | _PAGE_ACCESSED;
+   const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
+   pteval_t oldval = pte_val(oldpte);
+   pteval_t newval = pte_val(newpte);
+   pteval_t diff = oldval ^ newval;
+   pteval_t disable_mask = 0;
+
+   if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
+   disable_mask = _PAGE_NX;
+
+   /* new is non-present: need only if old is present */
+   if (pte_none(newpte))
+   return !pte_none(oldpte);
+
+   /*
+* If, excluding the ignored bits, only RW and dirty are cleared and the
+* old PTE does not have the dirty-bit set, we can avoid a flush. This
+* is possible since x86 architecture set the dirty bit atomically while
+* it caches the PTE in the TLB.
+*
+* The condition considers any change to RW and dirty as not requiring
+* flush if the old PTE is not dirty or not writable for simplification
+* of the code and to consider (unlikely) cases of changing dirty-bit of
+* write-protected PTE.
+*/
+   if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
+   (!(pte_dirty(oldpte) || !pte_write(oldpte)))
+   return false;
+
+   /*
+* Any change of PFN and any flag other than those that we consider
+* requires a flush (e.g., PAT, protection keys). To save flushes we do
+* not consider the access bit as it is considered by the kernel as
+* best-effort.
+*/
+   return diff & ((oldval & enable_mask) |
+  (newval & disable_mask) |
+  ~(enable_mask | disable_mask | ignore_mask));
+}
+#define pte_may_need_flush pte_may_need_flush
+
 #endif /* !MODULE */
 
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index eea113323468..c2deec0b6919 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,10 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
} while (0)
 #endif
 
+#ifndef pte_may_need_flush
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte) { return true; }
+#endif
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 632d5a677d3f..b7473d2c9a1f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+   if (pte_may_need_flush(oldpte, ptent))
+   tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
-- 
2.25.1
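
To make the intended behaviour concrete, here is a stand-alone, user-space
mock of the proposed pte_may_need_flush() logic. It is purely illustrative
and not part of the patch; the bit positions are hard-coded x86-64 values,
pte_none() is simplified to a present-bit check, and CONFIG_X86_64 is
assumed, so NX is always a "disable" bit.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t pteval_t;

#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_RW	(1ULL << 1)
#define _PAGE_USER	(1ULL << 2)
#define _PAGE_ACCESSED	(1ULL << 5)
#define _PAGE_DIRTY	(1ULL << 6)
#define _PAGE_GLOBAL	(1ULL << 8)
#define _PAGE_SOFTW1	(1ULL << 9)
#define _PAGE_SOFTW2	(1ULL << 10)
#define _PAGE_SOFTW3	(1ULL << 11)
#define _PAGE_NX	(1ULL << 63)

static bool mock_pte_may_need_flush(pteval_t oldval, pteval_t newval)
{
	const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
				     _PAGE_SOFTW3 | _PAGE_ACCESSED;
	const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
	const pteval_t disable_mask = _PAGE_NX;
	pteval_t diff = oldval ^ newval;

	/* new is non-present: flush only if old was present */
	if (!(newval & _PAGE_PRESENT))
		return oldval & _PAGE_PRESENT;

	/* only RW/dirty (and ignored bits) change, and the old PTE was clean
	 * and writable: write-protecting a clean PTE needs no flush */
	if (!(diff & ~(_PAGE_RW | _PAGE_DIRTY | ignore_mask)) &&
	    !((oldval & _PAGE_DIRTY) || !(oldval & _PAGE_RW)))
		return false;

	/* clearing an "enable" bit, setting a "disable" bit, or changing any
	 * other non-ignored bit (PFN, PAT, protection keys, ...) needs a flush */
	return diff & ((oldval & enable_mask) |
		       (newval & disable_mask) |
		       ~(enable_mask | disable_mask | ignore_mask));
}

int main(void)
{
	pteval_t base = _PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED;

	/* clean + writable -> write-protected: no flush */
	assert(!mock_pte_may_need_flush(base | _PAGE_RW, base));
	/* dirty + writable -> write-protected: flush */
	assert(mock_pte_may_need_flush(base | _PAGE_RW | _PAGE_DIRTY, base));
	/* permission promotion (read-only -> writable): no flush */
	assert(!mock_pte_may_need_flush(base, base | _PAGE_RW));
	/* setting NX on a previously executable mapping: flush */
	assert(mock_pte_may_need_flush(base, base | _PAGE_NX));

	puts("mock pte_may_need_flush() behaves as described in the thread");
	return 0;
}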