Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-19 Thread Andy Lutomirski
On Tue, Jun 13, 2017 at 11:09 PM, Juergen Gross  wrote:
> On 14/06/17 06:56, Andy Lutomirski wrote:
>> x86's lazy TLB mode used to be fairly weak -- it would switch to
>> init_mm the first time it tried to flush a lazy TLB.  This meant an
>> unnecessary CR3 write and, if the flush was remote, an unnecessary
>> IPI.
>>
>> Rewrite it entirely.  When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask.  This means that we need a way to figure out
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
>> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>> just storing tlb_gen, and making it an array isn't necessary yet.
>> I'm doing this because the next few patches add PCID support, and,
>> with PCID, we need ctx_id, and the array will end up with a length
>> greater than 1.  Making it an array now means that there will be
>> less churn and therefore less stress on your eyeballs.
>>
>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>> patch changes the way that mm_cpumask() works.  This should be okay,
>> since Xen *also* iterates all online CPUs to find all the CPUs it
>> needs to twiddle.
>
> There is an allocation failure path in xen_drop_mm_ref() which might
> be wrong with this patch. As this path should be taken only very
> rarely, I'd suggest removing the test for the mm_cpumask() bit being
> zero in this path.
>

Right, fixed.

>
> Juergen


Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-19 Thread Andy Lutomirski
On Sun, Jun 18, 2017 at 1:06 AM, Nadav Amit  wrote:
>
>> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski  wrote:
>>
>> x86's lazy TLB mode used to be fairly weak -- it would switch to
>> init_mm the first time it tried to flush a lazy TLB.  This meant an
>> unnecessary CR3 write and, if the flush was remote, an unnecessary
>> IPI.
>>
>> Rewrite it entirely.  When we enter lazy mode, we simply remove the
>> cpu from mm_cpumask.  This means that we need a way to figure out
>> whether we've missed a flush when we switch back out of lazy mode.
>> I use the tlb_gen machinery to track whether a context is up to
>> date.
>>
>> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
>> using an array of length 1 containing (ctx_id, tlb_gen) rather than
>> just storing tlb_gen, and making it an array isn't necessary yet.
>> I'm doing this because the next few patches add PCID support, and,
>> with PCID, we need ctx_id, and the array will end up with a length
>> greater than 1.  Making it an array now means that there will be
>> less churn and therefore less stress on your eyeballs.
>>
>> NB: This is dubious but, AFAICT, still correct on Xen and UV.
>> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
>> patch changes the way that mm_cpumask() works.  This should be okay,
>> since Xen *also* iterates all online CPUs to find all the CPUs it
>> needs to twiddle.
>>
>> The UV tlbflush code is rather dated and should be changed.
>>
>> Cc: Andrew Banman 
>> Cc: Mike Travis 
>> Cc: Dimitri Sivanich 
>> Cc: Juergen Gross 
>> Cc: Boris Ostrovsky 
>> Signed-off-by: Andy Lutomirski 
>> ---
>> arch/x86/include/asm/mmu_context.h |   6 +-
>> arch/x86/include/asm/tlbflush.h|   4 -
>> arch/x86/mm/init.c |   1 -
>> arch/x86/mm/tlb.c  | 242 
>> +++--
>> 4 files changed, 131 insertions(+), 122 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/mmu_context.h 
>> b/arch/x86/include/asm/mmu_context.h
>> index e5295d485899..69a4f1ee86ac 100644
>> --- a/arch/x86/include/asm/mmu_context.h
>> +++ b/arch/x86/include/asm/mmu_context.h
>> @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, 
>> struct mm_struct *next)
>>
>> static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct 
>> *tsk)
>> {
>> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
>> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
>> + int cpu = smp_processor_id();
>> +
>> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
>> + cpumask_clear_cpu(cpu, mm_cpumask(mm));
>
> The indication for laziness that was in cpu_tlbstate.state may be a better
> indication of whether the cpu needs to be cleared from the previous mm_cpumask().
> If you kept this indication, you could have used this per-cpu information in
> switch_mm_irqs_off() instead of "cpumask_test_cpu(cpu, mm_cpumask(next))",
> which might have been accessed by another core.

Hmm, fair enough.  On the other hand, this is the least of our
problems in this particular case -- the scheduler's use of mmgrab()
and mmdrop() is probably at least as bad, if not worse.  My preference
would be to get all this stuff merged and then see if we want to add
some scalability improvements on top.
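
A minimal sketch, in kernel-style C, of the alternative Nadav describes:
keep the laziness indication purely per-CPU so that switch_mm_irqs_off()
can test local state rather than the shared mm_cpumask bit, which other
cores may be writing concurrently.  The struct, field, and helper names
below are illustrative assumptions, not taken from the patch series.

/*
 * Sketch only: a per-cpu laziness flag, roughly what cpu_tlbstate.state
 * provided before this patch removed it.
 */
struct tlb_state_sketch {
	struct mm_struct *loaded_mm;
	bool is_lazy;				/* this CPU entered lazy TLB mode */
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state_sketch, tlb_sketch);

static inline void enter_lazy_tlb_sketch(struct mm_struct *mm,
					 struct task_struct *tsk)
{
	/* Touches only this CPU's data; no shared mm_cpumask cache line. */
	this_cpu_write(tlb_sketch.is_lazy, true);
}

static inline bool this_cpu_was_lazy_sketch(void)
{
	/* switch_mm_irqs_off() could test this instead of mm_cpumask(next). */
	return this_cpu_read(tlb_sketch.is_lazy);
}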


Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-18 Thread Nadav Amit

> On Jun 13, 2017, at 9:56 PM, Andy Lutomirski  wrote:
> 
> x86's lazy TLB mode used to be fairly weak -- it would switch to
> init_mm the first time it tried to flush a lazy TLB.  This meant an
> unnecessary CR3 write and, if the flush was remote, an unnecessary
> IPI.
> 
> Rewrite it entirely.  When we enter lazy mode, we simply remove the
> cpu from mm_cpumask.  This means that we need a way to figure out
> whether we've missed a flush when we switch back out of lazy mode.
> I use the tlb_gen machinery to track whether a context is up to
> date.
> 
> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> just storing tlb_gen, and making it an array isn't necessary yet.
> I'm doing this because the next few patches add PCID support, and,
> with PCID, we need ctx_id, and the array will end up with a length
> greater than 1.  Making it an array now means that there will be
> less churn and therefore less stress on your eyeballs.
> 
> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> patch changes the way that mm_cpumask() works.  This should be okay,
> since Xen *also* iterates all online CPUs to find all the CPUs it
> needs to twiddle.
> 
> The UV tlbflush code is rather dated and should be changed.
> 
> Cc: Andrew Banman 
> Cc: Mike Travis 
> Cc: Dimitri Sivanich 
> Cc: Juergen Gross 
> Cc: Boris Ostrovsky 
> Signed-off-by: Andy Lutomirski 
> ---
> arch/x86/include/asm/mmu_context.h |   6 +-
> arch/x86/include/asm/tlbflush.h|   4 -
> arch/x86/mm/init.c |   1 -
> arch/x86/mm/tlb.c  | 242 +++--
> 4 files changed, 131 insertions(+), 122 deletions(-)
> 
> diff --git a/arch/x86/include/asm/mmu_context.h 
> b/arch/x86/include/asm/mmu_context.h
> index e5295d485899..69a4f1ee86ac 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, 
> struct mm_struct *next)
> 
> static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct 
> *tsk)
> {
> - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
> - this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
> + int cpu = smp_processor_id();
> +
> + if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
> + cpumask_clear_cpu(cpu, mm_cpumask(mm));

The indication for laziness that was in cpu_tlbstate.state may be a better
indication of whether the cpu needs to be cleared from the previous mm_cpumask().
If you kept this indication, you could have used this per-cpu information in
switch_mm_irqs_off() instead of "cpumask_test_cpu(cpu, mm_cpumask(next))",
which might have been accessed by another core.
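
A hedged sketch of the (ctx_id, tlb_gen) bookkeeping the quoted commit
message relies on, using simplified stand-in names rather than the exact
structures the series adds: each mm carries a context id plus a generation
counter that is bumped whenever a flush of that mm is requested, each CPU
records the (ctx_id, tlb_gen) it has caught up to, and a CPU coming back
out of lazy mode compares the two to decide whether it missed a flush.

struct mm_ctx_sketch {
	u64 ctx_id;			/* never reused for another mm */
	atomic64_t tlb_gen;		/* bumped on every flush request */
};

struct cpu_ctx_sketch {
	u64 ctx_id;			/* context this CPU's TLB entries belong to */
	u64 tlb_gen;			/* generation this CPU has flushed up to */
};

/* A flush request advances the mm's generation. */
static inline u64 bump_mm_tlb_gen_sketch(struct mm_ctx_sketch *mm)
{
	return atomic64_inc_return(&mm->tlb_gen);
}

/* On leaving lazy mode: did we miss a flush while our cpumask bit was clear? */
static inline bool needs_flush_sketch(struct cpu_ctx_sketch *cpu,
				      struct mm_ctx_sketch *mm)
{
	if (cpu->ctx_id != mm->ctx_id)
		return true;		/* TLB holds some other context entirely */
	return cpu->tlb_gen < atomic64_read(&mm->tlb_gen);
}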




Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-14 Thread Andy Lutomirski
On Wed, Jun 14, 2017 at 3:33 PM, Dave Hansen  wrote:
> On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
>> - if (cpumask_test_cpu(cpu, &batch->cpumask))
>> + if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>> + local_irq_disable();
>>   flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
>> + local_irq_enable();
>> + }
>> +
>
> Could you talk a little about why this needs to be local_irq_disable()
> and not preempt_disable()?  Is it about the case where somebody is
> trying to call flush_tlb_func_*() from an interrupt handler?

It's to prevent flush_tlb_func_local() and flush_tlb_func_remote()
from being run concurrently, which would cause flush_tlb_func_common()
to be reentered.  Either we'd need to be very careful in
flush_tlb_func_common() to avoid races if this happened, or we could
just disable interrupts around flush_tlb_func_local().  The latter is
fast and easy.

--Andy
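
A hedged sketch of the pattern being discussed; the wrapper function and
the flush_tlb_info variable are illustrative rather than the exact hunk
from the patch.  The point is that the local call into
flush_tlb_func_common() runs with IRQs off, so the remote-flush IPI
(which ends up in flush_tlb_func_remote()) cannot land in the middle of
it and reenter the common code on the same CPU.

static void flush_local_then_others_sketch(const struct flush_tlb_info *info,
					   const struct cpumask *cpus)
{
	int cpu = get_cpu();		/* also disables preemption */

	if (cpumask_test_cpu(cpu, cpus)) {
		/* Keep the flush IPI handler out while the local flush runs. */
		local_irq_disable();
		flush_tlb_func_local((void *)info, TLB_LOCAL_SHOOTDOWN);
		local_irq_enable();
	}

	if (cpumask_any_but(cpus, cpu) < nr_cpu_ids)
		flush_tlb_others(cpus, info);

	put_cpu();
}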


Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-14 Thread Dave Hansen
On 06/13/2017 09:56 PM, Andy Lutomirski wrote:
> - if (cpumask_test_cpu(cpu, &batch->cpumask))
> + if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> + local_irq_disable();
>   flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
> + local_irq_enable();
> + }
> +

Could you talk a little about why this needs to be local_irq_disable()
and not preempt_disable()?  Is it about the case where somebody is
trying to call flush_tlb_func_*() from an interrupt handler?


Re: [PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-14 Thread Juergen Gross
On 14/06/17 06:56, Andy Lutomirski wrote:
> x86's lazy TLB mode used to be fairly weak -- it would switch to
> init_mm the first time it tried to flush a lazy TLB.  This meant an
> unnecessary CR3 write and, if the flush was remote, an unnecessary
> IPI.
> 
> Rewrite it entirely.  When we enter lazy mode, we simply remove the
> cpu from mm_cpumask.  This means that we need a way to figure out
> whether we've missed a flush when we switch back out of lazy mode.
> I use the tlb_gen machinery to track whether a context is up to
> date.
> 
> Note to reviewers: this patch, by itself, looks a bit odd.  I'm
> using an array of length 1 containing (ctx_id, tlb_gen) rather than
> just storing tlb_gen, and making it an array isn't necessary yet.
> I'm doing this because the next few patches add PCID support, and,
> with PCID, we need ctx_id, and the array will end up with a length
> greater than 1.  Making it an array now means that there will be
> less churn and therefore less stress on your eyeballs.
> 
> NB: This is dubious but, AFAICT, still correct on Xen and UV.
> xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
> patch changes the way that mm_cpumask() works.  This should be okay,
> since Xen *also* iterates all online CPUs to find all the CPUs it
> needs to twiddle.

There is an allocation failure path in xen_drop_mm_ref() which might
be wrong with this patch. As this path should be taken only very
rarely, I'd suggest removing the test for the mm_cpumask() bit being
zero in this path.


Juergen
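
For context, a sketch of the direction Juergen is suggesting for the
allocation-failure fallback in xen_drop_mm_ref().  xen_current_cr3 is
real Xen PV per-cpu state, but the callback name and the surrounding
code are illustrative assumptions, not the actual fix: with this patch a
CPU can clear its mm_cpumask() bit while still holding mm's pagetable in
CR3, so the fallback would rely on the per-cpu CR3 check alone rather
than on mm_cpumask().

static void drop_mm_ref_on_cpu(void *info);	/* stand-in for the real IPI callback */

static void xen_drop_mm_ref_fallback_sketch(struct mm_struct *mm)
{
	unsigned int cpu;

	for_each_online_cpu(cpu) {
		/* A lazy CPU may reference mm's pagetable with its bit clear. */
		if (per_cpu(xen_current_cr3, cpu) != __pa(mm->pgd))
			continue;
		smp_call_function_single(cpu, drop_mm_ref_on_cpu, mm, 1);
	}
}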


[PATCH v2 05/10] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-13 Thread Andy Lutomirski
x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB.  This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely.  When we enter lazy mode, we simply remove the
cpu from mm_cpumask.  This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.

Note to reviewers: this patch, by itself, looks a bit odd.  I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it an array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1.  Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works.  This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Cc: Andrew Banman 
Cc: Mike Travis 
Cc: Dimitri Sivanich 
Cc: Juergen Gross 
Cc: Boris Ostrovsky 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/mmu_context.h |   6 +-
 arch/x86/include/asm/tlbflush.h|   4 -
 arch/x86/mm/init.c |   1 -
 arch/x86/mm/tlb.c  | 242 +++--
 4 files changed, 131 insertions(+), 122 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index e5295d485899..69a4f1ee86ac 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, 
struct mm_struct *next)
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct 
*tsk)
 {
-   if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-   this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
+   int cpu = smp_processor_id();
+
+   if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
+   cpumask_clear_cpu(cpu, mm_cpumask(mm));
 }
 
 extern atomic64_t last_mm_ctx_id;
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f6c30d6ec39..87b13e51e867 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -95,7 +95,6 @@ struct tlb_state {
 * mode even if we've already switched back to swapper_pg_dir.
 */
struct mm_struct *loaded_mm;
-   int state;
 
/*
 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
-#define TLBSTATE_OK	1
-#define TLBSTATE_LAZY	2
-
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
*batch,
struct mm_struct *mm)
 {
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 88ee942cb47d..7d6fa4676af9 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,6 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = &init_mm,
-   .state = 0,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3b19ba748e92..fea2b07ac7d8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -45,8 +45,8 @@ void leave_mm(int cpu)
if (loaded_mm == &init_mm)
return;
 
-   if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-   BUG();
+   /* Warn if we're not lazy. */
+   WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
 
switch_mm(NULL, &init_mm, NULL);
 }
@@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+   u64 next_tlb_gen;
 
/*
-* NB: The scheduler will call us with prev == next when
-* switching from lazy TLB mode to normal mode if active_mm
-* isn't changing.  When this happens, there is no guarantee
-* that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+* NB: The scheduler will call us with prev == next when switching
+* from lazy TLB mode to normal mode if active_mm isn't changing.
+* When this happens, we don't assume that CR3 (and hence
+* cpu_tlbstate.loaded_mm) matches next.
 *
 * NB: