Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-07-27 Thread Andy Lutomirski
> On Jul 27, 2017, at 3:53 PM, Andrew Banman  wrote:
>
>> On Thu, Jun 22, 2017 at 10:47:29AM -0700, Andy Lutomirski wrote:
>>> On Thu, Jun 22, 2017 at 7:50 AM, Borislav Petkov  wrote:
 On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
 Rewrite it entirely.  When we enter lazy mode, we simply remove the
 cpu from mm_cpumask.  This means that we need a way to figure out
>>>
>>> s/cpu/CPU/
>>
>> Done.
>>
>>>
 whether we've missed a flush when we switch back out of lazy mode.
 I use the tlb_gen machinery to track whether a context is up to
 date.

 Note to reviewers: this patch, by itself, looks a bit odd.  I'm
 using an array of length 1 containing (ctx_id, tlb_gen) rather than
 just storing tlb_gen, and making it an array isn't necessary yet.
 I'm doing this because the next few patches add PCID support, and,
 with PCID, we need ctx_id, and the array will end up with a length
 greater than 1.  Making it an array now means that there will be
 less churn and therefore less stress on your eyeballs.

 NB: This is dubious but, AFAICT, still correct on Xen and UV.
 xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
 patch changes the way that mm_cpumask() works.  This should be okay,
 since Xen *also* iterates all online CPUs to find all the CPUs it
 needs to twiddle.
>>>
>>> This whole text should be under the "---" line below if we don't want it
>>> in the commit message.
>>
>> I figured that some future reader of this patch might actually want to
>> see this text, though.
>>
>>>

 The UV tlbflush code is rather dated and should be changed.
>>
>> And I'd definitely like the UV maintainers to notice this part, now or
>> in the future :)  I don't want to personally touch the UV code with a
>> ten-foot pole, but it really should be updated by someone who has a
>> chance of getting it right and being able to test it.
>
> Noticed! We're aware of these changes and we're planning on updating this
> code in the future. Presently the BAU tlb shootdown feature is working well
> on our recent hardware.

:)

I would suggest reworking it to hook the SMP function call
infrastructure instead of the TLB shootdown code.
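
For context, the kind of generic hook Andy alludes to here (a hedged sketch with hypothetical example_* names, not code from this series) would route remote flushes through the ordinary cross-CPU function-call IPIs rather than a dedicated shootdown mechanism:

#include <linux/smp.h>
#include <linux/cpumask.h>
#include <asm/tlbflush.h>

/* Hypothetical callback, run on each target CPU via the function-call IPI. */
static void example_flush_tlb_one(void *info)
{
        /* Simplified: flush the whole local TLB regardless of *info. */
        local_flush_tlb();
}

/* Hypothetical replacement for a hardware-specific shootdown path. */
static void example_flush_tlb_others(const struct cpumask *cpumask,
                                     const struct flush_tlb_info *info)
{
        /* wait == true: return only once every target CPU has flushed. */
        smp_call_function_many(cpumask, example_flush_tlb_one, (void *)info, true);
}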


Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-07-27 Thread Andrew Banman
On Thu, Jun 22, 2017 at 10:47:29AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 7:50 AM, Borislav Petkov  wrote:
> > On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
> >> Rewrite it entirely.  When we enter lazy mode, we simply remove the
> >> cpu from mm_cpumask.  This means that we need a way to figure out
> >
> > s/cpu/CPU/
> 
> Done.
> 
> >
> >> whether we've missed a flush when we switch back out of lazy mode.
> >> I use the tlb_gen machinery to track whether a context is up to
> >> date.
> >>
> >> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
> >> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> >> just storing tlb_gen, and making it an array isn't necessary yet.
> >> I'm doing this because the next few patches add PCID support, and,
> >> with PCID, we need ctx_id, and the array will end up with a length
> >> greater than 1.  Making it an array now means that there will be
> >> less churn and therefore less stress on your eyeballs.
> >>
> >> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> >> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> >> patch changes the way that mm_cpumask() works.  This should be okay,
> >> since Xen *also* iterates all online CPUs to find all the CPUs it
> >> needs to twiddle.
> >
> > This whole text should be under the "---" line below if we don't want it
> > in the commit message.
> 
> I figured that some future reader of this patch might actually want to
> see this text, though.
> 
> >
> >>
> >> The UV tlbflush code is rather dated and should be changed.
> 
> And I'd definitely like the UV maintainers to notice this part, now or
> in the future :)  I don't want to personally touch the UV code with a
> ten-foot pole, but it really should be updated by someone who has a
> chance of getting it right and being able to test it.

Noticed! We're aware of these changes and we're planning on updating this
code in the future. Presently the BAU tlb shootdown feature is working well
on our recent hardware.

Thank you,

Andrew
HPE 

> 
> >> +
> >> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> >> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
> >
> > It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
> > does BTR straightaway.
> 
> Yeah, but I'm doing this for performance.  I think that all the
> various one-line helpers do a LOCKed op right away, and I think it's
> faster to see if we can avoid the LOCKed op by trying an ordinary read
> first.  OTOH, maybe this is misguided -- if the cacheline lives
> somewhere else and we do end up needing to update it, we'll end up
> first sharing it and then making it exclusive, which increases the
> amount of cache coherency traffic, so maybe I'm optimizing for the
> wrong thing.  What do you think?
> 
> >> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> >> - BUG();
> >> + /* Warn if we're not lazy. */
> >> + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
> >
> > We don't BUG() anymore?
> 
> We could.  But, when the whole patch series is applied, the only
> caller left is a somewhat dubious Xen optimization, and if we blindly
> continue executing, I think the worst that happens is that we OOPS
> later or that we get segfaults when we shouldn't get segfaults.
> 
> >
> >>
> >>   switch_mm(NULL, &init_mm, NULL);
> >>  }
> >> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, 
> >> struct mm_struct *next,
> >>  {
> >>   unsigned cpu = smp_processor_id();
> >>   struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> >> + u64 next_tlb_gen;
> >
> > Please sort function local variables declaration in a reverse christmas
> > tree order:
> >
> >  longest_variable_name;
> >  shorter_var_name;
> >  even_shorter;
> >  i;
> >
> >>
> >>   /*
> >> -  * NB: The scheduler will call us with prev == next when
> >> -  * switching from lazy TLB mode to normal mode if active_mm
> >> -  * isn't changing.  When this happens, there is no guarantee
> >> -  * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
> >> +  * NB: The scheduler will call us with prev == next when switching
> >> +  * from lazy TLB mode to normal mode if active_mm isn't changing.
> >> +  * When this happens, we don't assume that CR3 (and hence
> >> +  * cpu_tlbstate.loaded_mm) matches next.
> >>*
> >>* NB: leave_mm() calls us with prev == NULL and tsk == NULL.
> >>*/
> >>
> >> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> >> + /* We don't want flush_tlb_func_* to run concurrently with us. */
> >> + if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> >> + WARN_ON_ONCE(!irqs_disabled());
> >> +
> >> + VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
> >
> > Why do we need that check? Can that ever happen?
> 
> It did 

Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-23 Thread Andy Lutomirski
On Fri, Jun 23, 2017 at 6:34 AM, Boris Ostrovsky wrote:
>
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index 1d7a7213a310..f5df56fb8b5c 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1005,8 +1005,7 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
>> /* Get the "official" set of cpus referring to our pagetable. */
>> if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
>> for_each_online_cpu(cpu) {
>> -   if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
>> -   && per_cpu(xen_current_cr3, cpu) !=
>> __pa(mm->pgd))
>> +   if (per_cpu(xen_current_cr3, cpu) !=
>> __pa(mm->pgd))
>> continue;
>> smp_call_function_single(cpu,
>> drop_mm_ref_this_cpu, mm, 1);
>> }
>>
>
>
> I wonder then whether
> cpumask_copy(mask, mm_cpumask(mm));
> immediately below is needed.

Probably not.  I'll change it to cpumask_clear().  Then the two cases
in that function match better.

>
> -boris
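
As a rough sketch of the shape Andy describes (illustrative only, not the literal upstream code; xen_current_cr3 and drop_mm_ref_this_cpu are taken from the quoted patch):

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>

/*
 * Start from an empty mask instead of mm_cpumask(), then add every CPU
 * whose xen_current_cr3 still points at this mm's pagetable -- mirroring
 * the allocation-failure path that walks all online CPUs.
 */
static void example_build_drop_mask(struct mm_struct *mm, struct cpumask *mask)
{
        unsigned int cpu;

        cpumask_clear(mask);    /* was: cpumask_copy(mask, mm_cpumask(mm)) */
        for_each_online_cpu(cpu) {
                if (per_cpu(xen_current_cr3, cpu) == __pa(mm->pgd))
                        cpumask_set_cpu(cpu, mask);
        }

        /* One IPI per CPU that might still reference the pagetable. */
        smp_call_function_many(mask, drop_mm_ref_this_cpu, mm, 1);
}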


Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-23 Thread Boris Ostrovsky



diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 1d7a7213a310..f5df56fb8b5c 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1005,8 +1005,7 @@ static void xen_drop_mm_ref(struct mm_struct *mm)
/* Get the "official" set of cpus referring to our pagetable. */
if (!alloc_cpumask_var(&mask, GFP_ATOMIC)) {
for_each_online_cpu(cpu) {
-   if (!cpumask_test_cpu(cpu, mm_cpumask(mm))
-   && per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
+   if (per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
continue;
smp_call_function_single(cpu, drop_mm_ref_this_cpu, mm, 
1);
}




I wonder then whether
cpumask_copy(mask, mm_cpumask(mm));
immediately below is needed.

-boris


Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-22 Thread Borislav Petkov
On Thu, Jun 22, 2017 at 10:47:29AM -0700, Andy Lutomirski wrote:
> I figured that some future reader of this patch might actually want to
> see this text, though.

Oh, don't get me wrong: with commit messages more is more, in the
general case. That's why I said "if".

> >> The UV tlbflush code is rather dated and should be changed.
> 
> And I'd definitely like the UV maintainers to notice this part, now or
> in the future :)  I don't want to personally touch the UV code with a
> ten-foot pole, but it really should be updated by someone who has a
> chance of getting it right and being able to test it.

Ah, could be because they moved recently and have hpe addresses now.
Lemme add them.

> >> +
> >> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> >> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
> >
> > It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
> > does BTR straightaway.
> 
> Yeah, but I'm doing this for performance.  I think that all the
> various one-line helpers do a LOCKed op right away, and I think it's
> faster to see if we can avoid the LOCKed op by trying an ordinary read
> first.

Right, the test part of the operation is unlocked so if that is the
likely case, it is a win.

> OTOH, maybe this is misguided -- if the cacheline lives somewhere else
> and we do end up needing to update it, we'll end up first sharing it
> and then making it exclusive, which increases the amount of cache
> coherency traffic, so maybe I'm optimizing for the wrong thing. What
> do you think?

Yeah, but we'll have to do that anyway for the locked operation. Ok,
let's leave it split like it is.
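
For readers following along, the two variants being weighed look like this (illustrative sketch):

#include <linux/cpumask.h>

/*
 * Open-coded version: a plain (unlocked) read first, and the LOCKed
 * read-modify-write only when the bit is actually set.
 */
static inline void clear_cpu_unlocked_test_first(int cpu, struct cpumask *mask)
{
        if (cpumask_test_cpu(cpu, mask))        /* cheap unlocked test */
                cpumask_clear_cpu(cpu, mask);   /* LOCKed RMW only if needed */
}

/* One-line helper: a LOCKed BTR every time, set or not. */
static inline void clear_cpu_locked_always(int cpu, struct cpumask *mask)
{
        cpumask_test_and_clear_cpu(cpu, mask);
}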

> It did in one particular buggy incarnation.  It would also trigger if,
> say, suspend/resume corrupts CR3.  Admittedly this is unlikely, but
> I'd rather catch it.  Once PCID is on, corruption seems a bit less
> farfetched -- this assertion will catch anyone who accidentally does
> write_cr3(read_cr3_pa()).

Ok, but let's put a comment over it pls as it is not obvious when
something like that can happen.
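
A comment along these lines would cover it (wording illustrative, not the final upstream text):

        /*
         * Sanity-check that CR3 really points at real_prev's page tables.
         * This caught a bug in an earlier incarnation of this series, and
         * it would also fire if CR3 were corrupted (e.g. by a buggy
         * suspend/resume path) or, once PCID is in use, if something did
         * write_cr3(read_cr3_pa()).
         */
        VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));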

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-22 Thread Andy Lutomirski
On Thu, Jun 22, 2017 at 7:50 AM, Borislav Petkov  wrote:
> On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
>> Rewrite it entirely.  When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask.  This means that we need a way to figure out
>
> s/cpu/CPU/

Done.

>
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
>> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>> just storing tlb_gen, and making it an array isn't necessary yet.
>> I'm doing this because the next few patches add PCID support, and,
>> with PCID, we need ctx_id, and the array will end up with a length
>> greater than 1.  Making it an array now means that there will be
>> less churn and therefore less stress on your eyeballs.
>>
>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>> patch changes the way that mm_cpumask() works.  This should be okay,
>> since Xen *also* iterates all online CPUs to find all the CPUs it
>> needs to twiddle.
>
> This whole text should be under the "---" line below if we don't want it
> in the commit message.

I figured that some future reader of this patch might actually want to
see this text, though.

>
>>
>> The UV tlbflush code is rather dated and should be changed.

And I'd definitely like the UV maintainers to notice this part, now or
in the future :)  I don't want to personally touch the UV code with a
ten-foot pole, but it really should be updated by someone who has a
chance of getting it right and being able to test it.

>> +
>> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
> does BTR straightaway.

Yeah, but I'm doing this for performance.  I think that all the
various one-line helpers do a LOCKed op right away, and I think it's
faster to see if we can avoid the LOCKed op by trying an ordinary read
first.  OTOH, maybe this is misguided -- if the cacheline lives
somewhere else and we do end up needing to update it, we'll end up
first sharing it and then making it exclusive, which increases the
amount of cache coherency traffic, so maybe I'm optimizing for the
wrong thing.  What do you think?

>> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> - BUG();
>> + /* Warn if we're not lazy. */
>> + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
>
> We don't BUG() anymore?

We could.  But, when the whole patch series is applied, the only
caller left is a somewhat dubious Xen optimization, and if we blindly
continue executing, I think the worst that happens is that we OOPS
later or that we get segfaults when we shouldn't get segfaults.

>
>>
>>   switch_mm(NULL, &init_mm, NULL);
>>  }
>> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
>> mm_struct *next,
>>  {
>>   unsigned cpu = smp_processor_id();
>>   struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>> + u64 next_tlb_gen;
>
> Please sort function local variables declaration in a reverse christmas
> tree order:
>
>  longest_variable_name;
>  shorter_var_name;
>  even_shorter;
>  i;
>
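
Applied to the locals quoted from the patch, that ordering would read (illustrative):

        struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
        unsigned cpu = smp_processor_id();
        u64 next_tlb_gen;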
>>
>>   /*
>> -  * NB: The scheduler will call us with prev == next when
>> -  * switching from lazy TLB mode to normal mode if active_mm
>> -  * isn't changing.  When this happens, there is no guarantee
>> -  * that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
>> +  * NB: The scheduler will call us with prev == next when switching
>> +  * from lazy TLB mode to normal mode if active_mm isn't changing.
>> +  * When this happens, we don't assume that CR3 (and hence
>> +  * cpu_tlbstate.loaded_mm) matches next.
>>*
>>* NB: leave_mm() calls us with prev == NULL and tsk == NULL.
>>*/
>>
>> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>> + /* We don't want flush_tlb_func_* to run concurrently with us. */
>> + if (IS_ENABLED(CONFIG_PROVE_LOCKING))
>> + WARN_ON_ONCE(!irqs_disabled());
>> +
>> + VM_BUG_ON(read_cr3_pa() != __pa(real_prev->pgd));
>
> Why do we need that check? Can that ever happen?

It did in one particular buggy incarnation.  It would also trigger if,
say, suspend/resume corrupts CR3.  Admittedly this is unlikely, but
I'd rather catch it.  Once PCID is on, corruption seems a bit less
farfetched -- this assertion will catch anyone who accidentally does
write_cr3(read_cr3_pa()).

>
>>   if (real_prev == next) {
>> - /*
>> -  * There's nothing to do: we always keep the per-mm control
>> -  * regs in sync with cpu_tlbstate.loaded_mm.  Just
>> - 

Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-22 Thread Borislav Petkov
On Tue, Jun 20, 2017 at 10:22:12PM -0700, Andy Lutomirski wrote:
> x86's lazy TLB mode used to be fairly weak -- it would switch to
> init_mm the first time it tried to flush a lazy TLB.  This meant an
> unnecessary CR3 write and, if the flush was remote, an unnecessary
> IPI.
> 
> Rewrite it entirely.  When we enter lazy mode, we simply remove the
> cpu from mm_cpumask.  This means that we need a way to figure out

s/cpu/CPU/

> whether we've missed a flush when we switch back out of lazy mode.
> I use the tlb_gen machinery to track whether a context is up to
> date.
> 
> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> just storing tlb_gen, and making it an array isn't necessary yet.
> I'm doing this because the next few patches add PCID support, and,
> with PCID, we need ctx_id, and the array will end up with a length
> greater than 1.  Making it an array now means that there will be
> less churn and therefore less stress on your eyeballs.
> 
> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> patch changes the way that mm_cpumask() works.  This should be okay,
> since Xen *also* iterates all online CPUs to find all the CPUs it
> needs to twiddle.

This whole text should be under the "---" line below if we don't want it
in the commit message.

> 
> The UV tlbflush code is rather dated and should be changed.
> 
> Cc: Andrew Banman 
> Cc: Mike Travis 
> Cc: Dimitri Sivanich 
> Cc: Juergen Gross 
> Cc: Boris Ostrovsky 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/include/asm/mmu_context.h |   6 +-
>  arch/x86/include/asm/tlbflush.h|   4 -
>  arch/x86/mm/init.c |   1 -
>  arch/x86/mm/tlb.c  | 227 
> +++--
>  arch/x86/xen/mmu_pv.c  |   3 +-
>  5 files changed, 119 insertions(+), 122 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h 
> b/arch/x86/include/asm/mmu_context.h
> index e5295d485899..69a4f1ee86ac 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, 
> struct mm_struct *next)
>  
>  static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct 
> *tsk)
>  {
> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
> + int cpu = smp_processor_id();
> +
> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> + cpumask_clear_cpu(cpu, mm_cpumask(mm));

It seems we haz a helper for that: cpumask_test_and_clear_cpu() which
does BTR straightaway.

>  extern atomic64_t last_mm_ctx_id;
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 4f6c30d6ec39..87b13e51e867 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -95,7 +95,6 @@ struct tlb_state {
>* mode even if we've already switched back to swapper_pg_dir.
>*/
>   struct mm_struct *loaded_mm;
> - int state;
>  
>   /*
>* Access to this CR4 shadow and to H/W CR4 is protected by
> @@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct 
> *vma, unsigned long a)
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>const struct flush_tlb_info *info);
>  
> -#define TLBSTATE_OK  1
> -#define TLBSTATE_LAZY2
> -
>  static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
> *batch,
>   struct mm_struct *mm)
>  {
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 88ee942cb47d..7d6fa4676af9 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -812,7 +812,6 @@ void __init zone_sizes_init(void)
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
>   .loaded_mm = &init_mm,
> - .state = 0,
>   .cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
>  };
>  EXPORT_SYMBOL_GPL(cpu_tlbstate);
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 9f5ef7a5e74a..fea2b07ac7d8 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -45,8 +45,8 @@ void leave_mm(int cpu)
>   if (loaded_mm == &init_mm)
>   return;
>  
> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> - BUG();
> + /* Warn if we're not lazy. */
> + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));

We don't BUG() anymore?

>  
>   switch_mm(NULL, &init_mm, NULL);
>  }
> @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
> mm_struct *next,
>  {
>   unsigned cpu = smp_processor_id();
>   struct mm_struct *real_prev = 

Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-21 Thread Borislav Petkov
On Wed, Jun 21, 2017 at 09:04:48AM -0700, Andy Lutomirski wrote:
> I'll look at the end of the whole series and see if I can come up with
> something good.

... along with the logic what we flush when, please. I.e., the text in
struct flush_tlb_info.

Thanks.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-21 Thread Andy Lutomirski
On Wed, Jun 21, 2017 at 2:01 AM, Thomas Gleixner  wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> -/*
>> - * The flush IPI assumes that a thread switch happens in this order:
>> - * [cpu0: the cpu that switches]
>> - * 1) switch_mm() either 1a) or 1b)
>> - * 1a) thread switch to a different mm
>> - * 1a1) set cpu_tlbstate to TLBSTATE_OK
>> - *   Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
>> - *   if cpu0 was in lazy tlb mode.
>> - * 1a2) update cpu active_mm
>> - *   Now cpu0 accepts tlb flushes for the new mm.
>> - * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
>> - *   Now the other cpus will send tlb flush ipis.
>> - * 1a4) change cr3.
>> - * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
>> - *   Stop ipi delivery for the old mm. This is not synchronized with
>> - *   the other cpus, but flush_tlb_func ignore flush ipis for the wrong
>> - *   mm, and in the worst case we perform a superfluous tlb flush.
>> - * 1b) thread switch without mm change
>> - *   cpu active_mm is correct, cpu0 already handles flush ipis.
>> - * 1b1) set cpu_tlbstate to TLBSTATE_OK
>> - * 1b2) test_and_set the cpu bit in cpu_vm_mask.
>> - *   Atomically set the bit [other cpus will start sending flush ipis],
>> - *   and test the bit.
>> - * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
>> - * 2) switch %%esp, ie current
>> - *
>> - * The interrupt must handle 2 special cases:
>> - * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
>> - * - the cpu performs speculative tlb reads, i.e. even if the cpu only
>> - *   runs in kernel space, the cpu could load tlb entries for user space
>> - *   pages.
>> - *
>> - * The good news is that cpu_tlbstate is local to each cpu, no
>> - * write/read ordering problems.
>
> While the new code is really well commented, it would be a good thing to
> have a single place where all of this including the ordering constraints
> are documented.

I'll look at the end of the whole series and see if I can come up with
something good.

>
>> @@ -215,12 +200,13 @@ static void flush_tlb_func_common(const struct 
>> flush_tlb_info *f,
>>   VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
>>  loaded_mm->context.ctx_id);
>>
>> - if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>> + if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
>>   /*
>> -  * leave_mm() is adequate to handle any type of flush, and
>> -  * we would prefer not to receive further IPIs.
>> +  * We're in lazy mode -- don't flush.  We can get here on
>> +  * remote flushes due to races and on local flushes if a
>> +  * kernel thread coincidentally flushes the mm it's lazily
>> +  * still using.
>
> Ok. That's more informative.
>
> Reviewed-by: Thomas Gleixner 
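
For readers skimming the thread, a condensed sketch of the lazy-mode bookkeeping under review (illustrative types and names, not the literal patch):

#include <linux/types.h>
#include <asm/tlbflush.h>

/* Illustrative per-CPU record of how fresh this CPU's TLB is for one mm. */
struct example_tlb_ctx {
        u64 ctx_id;     /* identifies the mm this entry describes */
        u64 tlb_gen;    /* flush generation the local TLB has caught up to */
};

/*
 * While lazy, the CPU is absent from mm_cpumask() and receives no flush
 * IPIs.  On switching back to the mm, compare the recorded generation with
 * the mm's current one and flush if anything was missed in the meantime.
 */
static void example_catch_up(struct example_tlb_ctx *local,
                             u64 mm_ctx_id, u64 mm_tlb_gen)
{
        if (local->ctx_id != mm_ctx_id || local->tlb_gen < mm_tlb_gen) {
                local_flush_tlb();
                local->ctx_id = mm_ctx_id;
                local->tlb_gen = mm_tlb_gen;
        }
}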


Re: [PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-21 Thread Thomas Gleixner
On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> -/*
> - * The flush IPI assumes that a thread switch happens in this order:
> - * [cpu0: the cpu that switches]
> - * 1) switch_mm() either 1a) or 1b)
> - * 1a) thread switch to a different mm
> - * 1a1) set cpu_tlbstate to TLBSTATE_OK
> - *   Now the tlb flush NMI handler flush_tlb_func won't call leave_mm
> - *   if cpu0 was in lazy tlb mode.
> - * 1a2) update cpu active_mm
> - *   Now cpu0 accepts tlb flushes for the new mm.
> - * 1a3) cpu_set(cpu, new_mm->cpu_vm_mask);
> - *   Now the other cpus will send tlb flush ipis.
> - * 1a4) change cr3.
> - * 1a5) cpu_clear(cpu, old_mm->cpu_vm_mask);
> - *   Stop ipi delivery for the old mm. This is not synchronized with
> - *   the other cpus, but flush_tlb_func ignore flush ipis for the wrong
> - *   mm, and in the worst case we perform a superfluous tlb flush.
> - * 1b) thread switch without mm change
> - *   cpu active_mm is correct, cpu0 already handles flush ipis.
> - * 1b1) set cpu_tlbstate to TLBSTATE_OK
> - * 1b2) test_and_set the cpu bit in cpu_vm_mask.
> - *   Atomically set the bit [other cpus will start sending flush ipis],
> - *   and test the bit.
> - * 1b3) if the bit was 0: leave_mm was called, flush the tlb.
> - * 2) switch %%esp, ie current
> - *
> - * The interrupt must handle 2 special cases:
> - * - cr3 is changed before %%esp, ie. it cannot use current->{active_,}mm.
> - * - the cpu performs speculative tlb reads, i.e. even if the cpu only
> - *   runs in kernel space, the cpu could load tlb entries for user space
> - *   pages.
> - *
> - * The good news is that cpu_tlbstate is local to each cpu, no
> - * write/read ordering problems.

While the new code is really well commented, it would be a good thing to
have a single place where all of this including the ordering constraints
are documented.

> @@ -215,12 +200,13 @@ static void flush_tlb_func_common(const struct 
> flush_tlb_info *f,
>   VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
>  loaded_mm->context.ctx_id);
>  
> - if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> + if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
>   /*
> -  * leave_mm() is adequate to handle any type of flush, and
> -  * we would prefer not to receive further IPIs.
> +  * We're in lazy mode -- don't flush.  We can get here on
> +  * remote flushes due to races and on local flushes if a
> +  * kernel thread coincidentally flushes the mm it's lazily
> +  * still using.

Ok. That's more informative.

Reviewed-by: Thomas Gleixner 

