Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier

2017-04-24 Thread Andy Lutomirski
On Fri, Sep 9, 2016 at 12:44 AM, Peter Zijlstra wrote:
> On Thu, Sep 08, 2016 at 09:39:45PM -0700, Andy Lutomirski wrote:
>> If they're busy threads, shouldn't the yield return immediately
>> because the threads are still ready to run?  Lazy TLB won't do much
>> unless you get the kernel in some state where it's running in the
>> context of a different kernel thread and hasn't switched to
>> swapper_pg_dir.  IIRC idle works like that, but you'd need to actually
>> sleep to go idle.
>
> Right, a task doing:
>
> for (;;) sched_yield();
>
> esp. when it's the only runnable thread on the CPU, is a busy thread. It
> will not enter switch_mm(), which was where the invalidate hook was
> placed IIRC.

Hi all-

I'm guessing that this patch got abandoned, at least temporarily.  I'm
currently polishing up my PCID series, and I think it might be worth
revisiting this on top of my PCID rework.  The relevant major
infrastructure change I'm making with my PCID code is that I'm adding
an atomic64_t to each mm_context_t that gets incremented every time a
flush on that mm is requested.  With that change, we might be able to
get away with simply removing a CPU from mm_cpumask immediately when
it enters lazy mode and adding a hook to the scheduler to revalidate
the TLB state when switching mms after having been lazy.
Revalidation would just check that the counter hasn't changed.
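
As a purely illustrative sketch of that idea (the helper names, and the
last_seen_gen per-cpu field, are made up for this mail, not taken from
the actual PCID series):

/*
 * Assume mm_context_t grows "atomic64_t tlb_gen", bumped on every flush
 * request for that mm, and that cpu_tlbstate grows last_seen_gen, the
 * last generation this CPU has caught up with (hypothetical field).
 */

/* Entering lazy mode: leave mm_cpumask, stop receiving shootdown IPIs. */
static void enter_lazy_tlb_sketch(struct mm_struct *mm, unsigned int cpu)
{
	cpumask_clear_cpu(cpu, mm_cpumask(mm));
}

/* Scheduler hook when switching mms after having been lazy. */
static void revalidate_tlb_sketch(struct mm_struct *mm, unsigned int cpu)
{
	u64 gen;

	/* Rejoin the mask first so a new flush request can't be missed;
	 * real code would need the ordering spelled out carefully. */
	cpumask_set_cpu(cpu, mm_cpumask(mm));

	gen = atomic64_read(&mm->context.tlb_gen);
	if (this_cpu_read(cpu_tlbstate.last_seen_gen) != gen) {
		/* A flush was requested while this CPU was lazy. */
		local_flush_tlb();
		this_cpu_write(cpu_tlbstate.last_seen_gen, gen);
	}
}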

--Andy


Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier

2016-09-09 Thread Peter Zijlstra
On Thu, Sep 08, 2016 at 09:39:45PM -0700, Andy Lutomirski wrote:
> If they're busy threads, shouldn't the yield return immediately
> because the threads are still ready to run?  Lazy TLB won't do much
> unless you get the kernel in some state where it's running in the
> context of a different kernel thread and hasn't switched to
> swapper_pg_dir.  IIRC idle works like that, but you'd need to actually
> sleep to go idle.

Right, a task doing:

for (;;) sched_yield();

esp. when it's the only runnable thread on the CPU, is a busy thread. It
will not enter switch_mm(), which was where the invalidate hook was
placed IIRC.


Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier

2016-09-08 Thread Andy Lutomirski
On Thu, Sep 8, 2016 at 5:09 PM, Benjamin Serebrin wrote:
> Sorry for the delay, I was eaten by a grue.
>
> I found that my initial study did not actually measure the number of
> TLB shootdown IPIs sent per TLB shootdown.  I think the intuition was
> correct but I didn't actually observe what I thought I had; my
> original use of probe points was incorrect.  However, after fixing my
> methodology, I'm having trouble proving that the existing Lazy TLB
> mode is working properly.
>
>
>
> I've spent some time trying to reproduce this in a microbenchmark.
> One thread does mmap, touch page, munmap, while other threads in the
> same process are configured to either busy-spin or busy-spin and
> yield.  All threads set their own affinity to a unique cpu, and the
> system is otherwise idle.  I look at the per-cpu delta of the TLB and
> CAL lines of /proc/interrupts over the run of the microbenchmark.
>
> Let's say I have 4 spin threads that never yield.  The mmap thread
> does N unmaps.  I observe each spin-thread core receives N (+/-  small
> noise) TLB shootdown interrupts, and the total TLB interrupt count is
> 4N (+/- small noise).  This is expected behavior.
>
> Then I add some synchronization:  the unmap thread rendezvouses with
> all the spinners, and when they are all ready, the spinners busy-spin
> for D milliseconds and then yield (pthread_yield, sched_yield produce
> identical results, though I'm not confident here that this is the
> right yield).  Meanwhile, the unmap thread busy-spins for D+E
> milliseconds and then does M map/touch/unmaps.  (D, E are single-digit
> milliseconds).  The idea here is that the unmap happens a little while
> after the spinners yielded; the kernel should be in the user process'
> mm but lazy TLB mode should defer TLB flushes.  It seems that lazy
> mode on each CPU should take 1 interrupt and then suppress subsequent
> interrupts.

If they're busy threads, shouldn't the yield return immediately
because the threads are still ready to run?  Lazy TLB won't do much
unless you get the kernel in some state where it's running in the
context of a different kernel thread and hasn't switched to
swapper_pg_dir.  IIRC idle works like that, but you'd need to actually
sleep to go idle.
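
Purely to illustrate that last point (a sketch, not a suggestion about
what the benchmark must do): a spinner that actually blocks -- say with a
short nanosleep -- lets the CPU switch to the idle task and enter lazy
TLB mode, while a lone runnable thread calling sched_yield never will:

#include <sched.h>
#include <time.h>

/* Never enters lazy TLB mode: with no other runnable task on the CPU,
 * sched_yield() returns immediately and switch_mm() is never reached. */
static void spin_and_yield(void)
{
	for (;;)
		sched_yield();
}

/* Does enter lazy TLB mode: blocking lets the CPU switch to the idle
 * task, which has no mm, so the kernel stays on the old mm lazily
 * until the task wakes up again. */
static void spin_and_sleep(void)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 5 * 1000 * 1000 };

	for (;;) {
		/* busy-spin phase of the benchmark would go here */
		nanosleep(&ts, NULL);
	}
}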

--Andy


Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier

2016-09-08 Thread Benjamin Serebrin
Sorry for the delay, I was eaten by a grue.

I found that my initial study did not actually measure the number of
TLB shootdown IPIs sent per TLB shootdown.  I think the intuition was
correct but I didn't actually observe what I thought I had; my
original use of probe points was incorrect.  However, after fixing my
methodology, I'm having trouble proving that the existing Lazy TLB
mode is working properly.



I've spent some time trying to reproduce this in a microbenchmark.
One thread does mmap, touch page, munmap, while other threads in the
same process are configured to either busy-spin or busy-spin and
yield.  All threads set their own affinity to a unique cpu, and the
system is otherwise idle.  I look at the per-cpu delta of the TLB and
CAL lines of /proc/interrupts over the run of the microbenchmark.

Let's say I have 4 spin threads that never yield.  The mmap thread
does N unmaps.  I observe each spin-thread core receives N (+/-  small
noise) TLB shootdown interrupts, and the total TLB interrupt count is
4N (+/- small noise).  This is expected behavior.

Then I add some synchronization:  the unmap thread rendezvouses with
all the spinners, and when they are all ready, the spinners busy-spin
for D milliseconds and then yield (pthread_yield, sched_yield produce
identical results, though I'm not confident here that this is the
right yield).  Meanwhile, the unmap thread busy-spins for D+E
milliseconds and then does M map/touch/unmaps.  (D, E are single-digit
milliseconds).  The idea here is that the unmap happens a little while
after the spinners yielded; the kernel should be in the user process'
mm but lazy TLB mode should defer TLB flushes.  It seems that lazy
mode on each CPU should take 1 interrupt and then suppress subsequent
interrupts.
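
For reference, a minimal sketch of that kind of microbenchmark (this is
not the actual test code, and it omits the rendezvous and the D/E timing
described above):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

#define NSPIN	4
#define NUNMAP	1000

static void set_affinity(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Spinner pinned to its own CPU; either pure busy-spin or spin+yield. */
static void *spinner(void *arg)
{
	set_affinity((long)arg);
	for (;;)
		sched_yield();		/* or just spin */
	return NULL;
}

int main(void)
{
	pthread_t tid[NSPIN];
	long i;

	set_affinity(0);
	for (i = 0; i < NSPIN; i++)
		pthread_create(&tid[i], NULL, spinner, (void *)(i + 1));

	/*
	 * Each munmap triggers a TLB shootdown to every CPU still present
	 * in this mm's mm_cpumask; compare the TLB line of /proc/interrupts
	 * before and after this loop.
	 */
	for (i = 0; i < NUNMAP; i++) {
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		p[0] = 1;
		munmap(p, 4096);
	}
	return 0;
}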

I expect lazy TLB invalidation to take 1 interrupt on each spinner
CPU, per rendezvous sequence, and I expect Rik's extra-lazy version to
take 0.  I see M interrupts in all cases.  This leads me to wonder if I'm failing
to trigger lazy TLB invalidation, or if lazy TLB invalidation is not
working as intended.

I get similar results using perf record on probe points: I filter by
CPU number and count the number of IPIs sent between each pair of probe
points in the TLB flush routines.  I put probe points on
flush_tlb_mm_range and flush_tlb_mm_range%return.  To count the number
of IPIs sent in a VM that uses x2APIC physical mode, probing
native_x2apic_icr_write or __x2apic_send_IPI_dest is usually
convenient, provided it doesn't get inlined away (which sometimes
happens), since that function is called once per CPU target in the
cpu_mask passed to __x2apic_send_IPI_mask.  I then filter the perf
script output to look at the distribution of CPUs targeted per TLB
shootdown.


Rik's patch definitely looks correct, but I can't yet cite the gains.

Thanks!
Ben





On Wed, Sep 7, 2016 at 11:56 PM, Ingo Molnar wrote:
>
> * Rik van Riel wrote:
>
>> On Sat, 27 Aug 2016 16:02:25 -0700
>> Linus Torvalds wrote:
>>
>> > Yeah, with those small fixes from Ingo, I definitely don't think this
>> > looks hacky at all. This all seems to be exactly what we should always
>> > have done.
>>
>> OK, so I was too tired yesterday to do kernel hacking, and
>> missed yet another bit (xen_flush_tlb_others). Sigh.
>>
>> Otherwise, the patch is identical.
>>
>> Looking forward to Ben's test results.
>
> Gentle ping to Ben.
>
> I can also apply this without waiting for the test result, the patch looks
> sane enough to me.
>
> Thanks,
>
> Ingo


Re: [PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier

2016-09-08 Thread Ingo Molnar

* Rik van Riel wrote:

> On Sat, 27 Aug 2016 16:02:25 -0700
> Linus Torvalds wrote:
> 
> > Yeah, with those small fixes from Ingo, I definitely don't think this
> > looks hacky at all. This all seems to be exactly what we should always
> > have done.
> 
> OK, so I was too tired yesterday to do kernel hacking, and
> missed yet another bit (xen_flush_tlb_others). Sigh.
> 
> Otherwise, the patch is identical.
> 
> Looking forward to Ben's test results.

Gentle ping to Ben.

I can also apply this without waiting for the test result, the patch looks sane 
enough to me.

Thanks,

Ingo


[PATCH RFC v6] x86,mm,sched: make lazy TLB mode even lazier

2016-08-31 Thread Rik van Riel
On Sat, 27 Aug 2016 16:02:25 -0700
Linus Torvalds wrote:

> Yeah, with those small fixes from Ingo, I definitely don't think this
> looks hacky at all. This all seems to be exactly what we should always
> have done.

OK, so I was too tired yesterday to do kernel hacking, and
missed yet another bit (xen_flush_tlb_others). Sigh.

Otherwise, the patch is identical.

Looking forward to Ben's test results.

---8<---

Subject: x86,mm,sched: make lazy TLB mode even lazier

Lazy TLB mode can result in an idle CPU being woken up for a TLB
flush, when all it really needed to do was flush %CR3 before the
next context switch.

This is mostly fine on bare metal, though sub-optimal from a power
saving point of view, and deeper C-states could make TLB flushes
take a little longer than desired.

On virtual machines, the pain can be much worse, especially if a
currently non-running VCPU is woken up for a TLB invalidation
IPI, on a CPU that is busy running another task. It could take
a while before that IPI is handled, leading to performance issues.

This patch deals with the issue by introducing a third TLB state,
TLBSTATE_FLUSH, which causes %CR3 to be flushed at the next
context switch.

A CPU that transitions from TLBSTATE_LAZY to TLBSTATE_OK during
the attempted transition to TLBSTATE_FLUSH will get a TLB flush
IPI, just like a CPU that was in TLBSTATE_OK to begin with.

Nothing is done for a CPU that is already in TLBSTATE_FLUSH mode.

Signed-off-by: Rik van Riel 
Reported-by: Benjamin Serebrin 
---
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/include/asm/tlbflush.h   |  3 +-
 arch/x86/include/asm/uv/uv.h  |  6 ++--
 arch/x86/mm/tlb.c | 64 ---
 arch/x86/platform/uv/tlb_uv.c |  2 +-
 arch/x86/xen/mmu.c|  2 +-
 6 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7fa9e7740ba3..b7e695c90c43 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -225,7 +225,7 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_single)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
+   void (*flush_tlb_others)(struct cpumask *cpus,
 struct mm_struct *mm,
 unsigned long start,
 unsigned long end);
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4e5be94e079a..c3dbacbc49be 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -304,12 +304,13 @@ extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 #define flush_tlb()	flush_tlb_current_task()
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_others(struct cpumask *cpumask,
struct mm_struct *mm,
unsigned long start, unsigned long end);
 
 #define TLBSTATE_OK	1
 #define TLBSTATE_LAZY	2
+#define TLBSTATE_FLUSH	3
 
 static inline void reset_lazy_tlbstate(void)
 {
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index 062921ef34e9..7e83cc633ba1 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -13,7 +13,7 @@ extern int is_uv_system(void);
 extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
-extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 struct mm_struct *mm,
 unsigned long start,
 unsigned long end,
@@ -25,8 +25,8 @@ static inline enum uv_system_type get_uv_system_type(void) { return UV_NONE; }
 static inline int is_uv_system(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
-static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask, struct mm_struct *mm,
+static inline struct cpumask *
+uv_flush_tlb_others(struct cpumask *cpumask, struct mm_struct *mm,
unsigned long start, unsigned long end, unsigned int cpu)
 { return cpumask; }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5643fd0b1a7d..634248b38db9 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,10 +140,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
}
 #ifdef CONFIG_SMP
  else {
+   int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
+   int oldstate = *tlbstate;
+
+   if (unlikely(oldstate == TLBSTATE_LAZY)) {
+   /*
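
Purely as an illustrative sketch of the state machine described in the
changelog above -- not the remainder of the patch itself -- the
flush-side logic amounts to something like:

/*
 * For each CPU in mm_cpumask that is in lazy TLB mode, try to flag a
 * deferred flush instead of sending an IPI:
 *
 *   TLBSTATE_OK    - CPU gets the flush IPI as before.
 *   TLBSTATE_LAZY  - try cmpxchg(LAZY -> FLUSH); if the CPU raced back
 *                    to TLBSTATE_OK, fall back to sending the IPI.
 *   TLBSTATE_FLUSH - nothing to do, a flush is already pending there.
 */
static bool defer_flush_for_lazy_cpu(int cpu)
{
	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
	int old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);

	/* true: flush deferred, skip the IPI; false: IPI the CPU. */
	return old == TLBSTATE_LAZY || old == TLBSTATE_FLUSH;
}

/* At the next context switch, a CPU in TLBSTATE_FLUSH reloads %CR3
 * (flushing the TLB) instead of skipping the reload as lazy TLB mode
 * normally would. */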