Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-26 Thread Borislav Petkov
On Tue, Jun 20, 2017 at 10:22:17PM -0700, Andy Lutomirski wrote:
> PCID is a "process context ID" -- it's what other architectures call
> an address space ID.  Every non-global TLB entry is tagged with a
> PCID, only TLB entries that match the currently selected PCID are
> used, and we can switch PGDs without flushing the TLB.  x86's
> PCID is 12 bits.
> 
> This is an unorthodox approach to using PCID.  x86's PCID is far too
> short to uniquely identify a process, and we can't even really
> uniquely identify a running process because there are monster
> systems with over 4096 CPUs.  To make matters worse, past attempts
> to use all 12 PCID bits have resulted in slowdowns instead of
> speedups.
> 
> This patch uses PCID differently.  We use a PCID to identify a
> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
> binding at all; instead, we give it a fresh PCID each time it's
> loaded except in cases where we want to preserve the TLB, in which
> case we reuse a recent value.
> 
> This seems to save about 100ns on context switches between mms.

"... with my microbenchmark of ping-ponging." :)

> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/include/asm/mmu_context.h     |  3 ++
>  arch/x86/include/asm/processor-flags.h |  2 +
>  arch/x86/include/asm/tlbflush.h        | 18 +++-
>  arch/x86/mm/init.c                     |  1 +
>  arch/x86/mm/tlb.c                      | 82 ++
>  5 files changed, 86 insertions(+), 20 deletions(-)

...

> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 57b305e13c4c..a9a5aa6f45f7 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
>  #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
>  #endif
>  
> +/*
> + * 6 because 6 should be plenty and struct tlb_state will fit in
> + * two cache lines.
> + */
> +#define NR_DYNAMIC_ASIDS 6

TLB_NR_DYN_ASIDS

Properly prefixed, I guess.
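
For reference, the "fits in two cache lines" sizing in the quoted comment works
out roughly as follows.  This is a user-space model assuming 64-byte cache
lines; the struct below only approximates tlb_state, and any field not quoted
in this thread (e.g. the CR4 shadow stand-in) is a placeholder.

#include <assert.h>
#include <stdint.h>

/* Rough model of struct tlb_state, for sizing only. */
struct tlb_context_model {
        uint64_t ctx_id;
        uint64_t tlb_gen;
};                                      /* 16 bytes */

struct tlb_state_model {
        void *loaded_mm;                /* struct mm_struct * in the kernel */
        uint16_t loaded_mm_asid;
        uint16_t next_asid;
        unsigned long cr4;              /* placeholder for the CR4 shadow */
        struct tlb_context_model ctxs[6];  /* NR_DYNAMIC_ASIDS == 6 -> 96 bytes */
};

int main(void)
{
        /* ~24 bytes of scalars + 96 bytes of contexts fits in 2 * 64 bytes. */
        assert(sizeof(struct tlb_state_model) <= 128);
        return 0;
}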

The rest later, when you're done experimenting. :)

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
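
For readers skimming the thread: the mechanism under discussion boils down to
composing each CR3 write from the page-table base plus a small ASID, optionally
setting bit 63 ("no flush") so the TLB entries tagged with that ASID survive
the switch.  Below is a minimal user-space model of that composition.  It uses
the CR3_PCID_MASK and CR3_NOFLUSH values quoted later in this thread;
build_cr3() is an illustrative helper, not the kernel's code.

#include <stdint.h>
#include <stdio.h>

#define CR3_PCID_MASK 0xFFFull          /* low 12 bits select the PCID/ASID */
#define CR3_NOFLUSH   (1ull << 63)      /* keep TLB entries for this PCID   */

/* Compose a CR3 value from a page-table base and an ASID.  Setting the
 * no-flush bit tells the CPU to preserve the entries tagged with this
 * PCID; leaving it clear flushes them on the write. */
static uint64_t build_cr3(uint64_t pgd_pa, uint16_t asid, int noflush)
{
        uint64_t cr3 = pgd_pa | (asid & CR3_PCID_MASK);

        if (noflush)
                cr3 |= CR3_NOFLUSH;
        return cr3;
}

int main(void)
{
        /* Same page tables, ASID 3, preserving the old TLB entries. */
        printf("%#llx\n",
               (unsigned long long)build_cr3(0x1234000ull, 3, 1));
        return 0;
}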


Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-23 Thread Thomas Gleixner
On Thu, 22 Jun 2017, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 2:22 PM, Thomas Gleixner  wrote:
> > On Thu, 22 Jun 2017, Andy Lutomirski wrote:
> >> On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner  
> >> wrote:
> >> > Now one other optimization which should be trivial to add is to keep the 
> >> > 4
> >> > asid context entries in cpu_tlbstate and cache the last asid in thread
> >> > info. If that's still valid then use it otherwise unconditionally get a 
> >> > new
> >> > one. That avoids the whole loop machinery and thread info is cache hot in
> >> > the context switch anyway. Delta patch on top of your version below.
> >>
> >> I'm not sure I understand.  If an mm has ASID 0 on CPU 0 and ASID 1 on
> >> CPU 1 and a thread in that mm bounces back and forth between those
> >> CPUs, won't your patch cause it to flush every time?
> >
> > Yeah, I was too focussed on the non migratory case, where two tasks from
> > different processes play rapid ping pong. That's what I was looking at for
> > various reasons.
> >
> > There the cached asid really helps by avoiding the loop completely, but
> > yes, the search needs to be done for the bouncing between CPUs case.
> >
> > So maybe a combo of those might be interesting.
> >
> 
> I'm not too worried about optimizing away the loop.  It's a loop over
> four or six things that are all in cachelines that we need anyway.  I
> suspect that we'll never be able to see it in any microbenchmark, let
> alone real application.

Fair enough.



Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-22 Thread Andy Lutomirski
On Thu, Jun 22, 2017 at 2:22 PM, Thomas Gleixner  wrote:
> On Thu, 22 Jun 2017, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner  wrote:
>> > Now one other optimization which should be trivial to add is to keep the 4
>> > asid context entries in cpu_tlbstate and cache the last asid in thread
>> > info. If that's still valid then use it otherwise unconditionally get a new
>> > one. That avoids the whole loop machinery and thread info is cache hot in
>> > the context switch anyway. Delta patch on top of your version below.
>>
>> I'm not sure I understand.  If an mm has ASID 0 on CPU 0 and ASID 1 on
>> CPU 1 and a thread in that mm bounces back and forth between those
>> CPUs, won't your patch cause it to flush every time?
>
> Yeah, I was too focussed on the non migratory case, where two tasks from
> different processes play rapid ping pong. That's what I was looking at for
> various reasons.
>
> There the cached asid really helps by avoiding the loop completely, but
> yes, the search needs to be done for the bouncing between CPUs case.
>
> So maybe a combo of those might be interesting.
>

I'm not too worried about optimizing away the loop.  It's a loop over
four or six things that are all in cachelines that we need anyway.  I
suspect that we'll never be able to see it in any microbenchmark, let
alone real application.


Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-22 Thread Thomas Gleixner
On Thu, 22 Jun 2017, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner  wrote:
> > Now one other optimization which should be trivial to add is to keep the 4
> > asid context entries in cpu_tlbstate and cache the last asid in thread
> > info. If that's still valid then use it otherwise unconditionally get a new
> > one. That avoids the whole loop machinery and thread info is cache hot in
> > the context switch anyway. Delta patch on top of your version below.
> 
> I'm not sure I understand.  If an mm has ASID 0 on CPU 0 and ASID 1 on
> CPU 1 and a thread in that mm bounces back and forth between those
> CPUs, won't your patch cause it to flush every time?

Yeah, I was too focussed on the non migratory case, where two tasks from
different processes play rapid ping pong. That's what I was looking at for
various reasons.

There the cached asid really helps by avoiding the loop completely, but
yes, the search needs to be done for the bouncing between CPUs case.

So maybe a combo of those might be interesting.

Thanks,

tglx
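
For what it's worth, such a combo could look roughly like the sketch below:
consult the ASID cached in thread_info first and fall back to the scan from the
original patch only when that fast path misses (e.g. after migrating to another
CPU).  All names here (choose_asid_combo, the *_model types) are invented for
illustration; this is a stand-alone model, not a proposed kernel patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_DYNAMIC_ASIDS 6              /* value from the patch under review */

struct tlb_context_model {
        uint64_t ctx_id;                /* 0 means "slot empty" */
        uint64_t tlb_gen;
};

struct cpu_state_model {                /* per-CPU tlb_state in the kernel */
        struct tlb_context_model ctxs[NR_DYNAMIC_ASIDS];
        uint16_t next_asid;
};

struct mm_model {
        uint64_t ctx_id;
        uint64_t tlb_gen;
};

static uint16_t choose_asid_combo(struct cpu_state_model *cpu,
                                  const struct mm_model *mm,
                                  uint16_t ti_cached_asid, bool *need_flush)
{
        /* Fast path: the ASID this task used last time still maps to its mm
         * on this CPU, so no scan is needed. */
        if (ti_cached_asid < NR_DYNAMIC_ASIDS &&
            cpu->ctxs[ti_cached_asid].ctx_id == mm->ctx_id) {
                *need_flush = cpu->ctxs[ti_cached_asid].tlb_gen < mm->tlb_gen;
                return ti_cached_asid;
        }

        /* Slow path: the search loop from the original patch. */
        for (uint16_t asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
                if (cpu->ctxs[asid].ctx_id != mm->ctx_id)
                        continue;
                *need_flush = cpu->ctxs[asid].tlb_gen < mm->tlb_gen;
                return asid;
        }

        /* Miss on this CPU: hand out a fresh ASID and flush it. */
        if (cpu->next_asid >= NR_DYNAMIC_ASIDS)
                cpu->next_asid = 0;
        cpu->ctxs[cpu->next_asid].ctx_id = mm->ctx_id;
        cpu->ctxs[cpu->next_asid].tlb_gen = mm->tlb_gen;
        *need_flush = true;
        return cpu->next_asid++;
}

int main(void)
{
        static struct cpu_state_model cpu;      /* zeroed: all slots empty */
        struct mm_model mm = { .ctx_id = 7, .tlb_gen = 1 };
        bool flush;
        uint16_t asid;

        asid = choose_asid_combo(&cpu, &mm, 0, &flush);
        printf("asid %u, flush %d\n", asid, flush);     /* miss: flush */
        asid = choose_asid_combo(&cpu, &mm, asid, &flush);
        printf("asid %u, flush %d\n", asid, flush);     /* fast path: no flush */
        return 0;
}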


Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-22 Thread Andy Lutomirski
On Thu, Jun 22, 2017 at 5:21 AM, Thomas Gleixner  wrote:
> On Wed, 21 Jun 2017, Andy Lutomirski wrote:
>> On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner  wrote:
>> > That requires a conditional branch
>> >
>> > if (asid >= NR_DYNAMIC_ASIDS) {
>> > asid = 0;
>> > 
>> > }
>> >
>> > The question is whether 4 IDs would be sufficient which trades the branch
>> > for a mask operation. Or you go for 8 and spend another cache line.
>>
>> Interesting.  I'm inclined to either leave it at 6 or reduce it to 4
>> for now and to optimize later.
>
> :)
>
>> > Hmm. So this loop needs to be taken unconditionally even if the task stays
>> > on the same CPU. And of course the number of dynamic IDs has to be short in
>> > order to make this loop suck performance wise.
>> >
>> > Something like the completely dysfunctional below might be worthwhile to
>> > explore. At least arch/x86/mm/ compiles :)
>> >
>> > It gets rid of the loop search and lifts the limit of dynamic ids by
>> > trading it with a percpu variable in mm_context_t.
>>
>> That would work, but it would take a lot more memory on large systems
>> with lots of processes, and I'd also be concerned that we might run
>> out of dynamic percpu space.
>
> Yeah, did not think about the dynamic percpu space.
>
>> How about a different idea: make the percpu data structure look like a
>> 4-way set associative cache.  The ctxs array could be, say, 1024
>> entries long without using crazy amounts of memory.  We'd divide it
>> into 256 buckets, so you'd index it like ctxs[4*bucket + slot].  For
>> each mm, we choose a random bucket (from 0 through 256), and then we'd
>> just loop over the four slots in the bucket in choose_asid().  This
>> would require very slightly more arithmetic (I'd guess only one or two
>> cycles, though) but, critically, wouldn't touch any more cachelines.
>>
>> The downside of both of these approaches over the one in this patch is
>> that the chance that the percpu cacheline we need is not in the cache
>> is quite a bit higher since it's potentially a different cacheline for
>> each mm.  It would probably still be a win because avoiding the flush
>> is really quite valuable.
>>
>> What do you think?  The added code would be tiny.
>
> That might be worth a try.
>
> Now one other optimization which should be trivial to add is to keep the 4
> asid context entries in cpu_tlbstate and cache the last asid in thread
> info. If that's still valid then use it otherwise unconditionally get a new
> one. That avoids the whole loop machinery and thread info is cache hot in
> the context switch anyway. Delta patch on top of your version below.

I'm not sure I understand.  If an mm has ASID 0 on CPU 0 and ASID 1 on
CPU 1 and a thread in that mm bounces back and forth between those
CPUs, won't your patch cause it to flush every time?

--Andy
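
The objection is easy to see with a toy model: each CPU assigns ASIDs to an mm
independently, so a single value cached in thread_info can be valid on at most
one of them, and every migration to the other CPU turns into a flush.  A small
stand-alone illustration (invented names, not kernel code):

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 2

int main(void)
{
        /* The same mm happens to hold ASID 0 on CPU 0 and ASID 1 on CPU 1. */
        uint16_t asid_on_cpu[NR_CPUS] = { 0, 1 };
        uint16_t ti_cached_asid = asid_on_cpu[0];   /* last value in thread_info */
        int flushes = 0;

        for (int sw = 1; sw <= 6; sw++) {
                int cpu = sw % NR_CPUS;             /* bounce between the CPUs */

                if (ti_cached_asid != asid_on_cpu[cpu])
                        flushes++;                  /* cached ASID is stale here */
                ti_cached_asid = asid_on_cpu[cpu];
        }
        printf("flushed %d of 6 switches\n", flushes);  /* prints 6 of 6 */
        return 0;
}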


Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-22 Thread Andy Lutomirski
On Thu, Jun 22, 2017 at 9:09 AM, Nadav Amit  wrote:
> Andy Lutomirski  wrote:
>
>>
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
>>
>> DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
>>   .loaded_mm = &init_mm,
>> + .next_asid = 1,
>
> I think this is a leftover from a previous version of the patches, no? It
> does not seem necessary and may be confusing (ctx_id 0 is reserved, but not
> asid 0).

Hmm.  It's no longer needed for correctness, but init_mm still lands
in slot 0, and it seems friendly to avoid immediately stomping it.
Admittedly, this won't make any practical difference since it'll only
happen once per cpu.

>
> Other than that, if you want, you can put for the entire series:
>
> Reviewed-by: Nadav Amit 
>

Thanks!


Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-22 Thread Nadav Amit
Andy Lutomirski  wrote:

> 
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
> 
> DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
>   .loaded_mm = &init_mm,
> + .next_asid = 1,

I think this is a leftover from a previous version of the patches, no? It
does not seem necessary and may be confusing (ctx_id 0 is reserved, but not
asid 0).

Other than that, if you want, you can put for the entire series:

Reviewed-by: Nadav Amit 



Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-22 Thread Thomas Gleixner
On Wed, 21 Jun 2017, Andy Lutomirski wrote:
> On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner  wrote:
> > That requires a conditional branch
> >
> > if (asid >= NR_DYNAMIC_ASIDS) {
> > asid = 0;
> > 
> > }
> >
> > The question is whether 4 IDs would be sufficient which trades the branch
> > for a mask operation. Or you go for 8 and spend another cache line.
> 
> Interesting.  I'm inclined to either leave it at 6 or reduce it to 4
> for now and to optimize later.

:)

> > Hmm. So this loop needs to be taken unconditionally even if the task stays
> > on the same CPU. And of course the number of dynamic IDs has to be short in
> > order to make this loop suck performance wise.
> >
> > Something like the completely dysfunctional below might be worthwhile to
> > explore. At least arch/x86/mm/ compiles :)
> >
> > It gets rid of the loop search and lifts the limit of dynamic ids by
> > trading it with a percpu variable in mm_context_t.
> 
> That would work, but it would take a lot more memory on large systems
> with lots of processes, and I'd also be concerned that we might run
> out of dynamic percpu space.

Yeah, did not think about the dynamic percpu space.
 
> How about a different idea: make the percpu data structure look like a
> 4-way set associative cache.  The ctxs array could be, say, 1024
> entries long without using crazy amounts of memory.  We'd divide it
> into 256 buckets, so you'd index it like ctxs[4*bucket + slot].  For
> each mm, we choose a random bucket (from 0 through 256), and then we'd
> just loop over the four slots in the bucket in choose_asid().  This
> would require very slightly more arithmetic (I'd guess only one or two
> cycles, though) but, critically, wouldn't touch any more cachelines.
> 
> The downside of both of these approaches over the one in this patch is
> that the chance that the percpu cacheline we need is not in the cache
> is quite a bit higher since it's potentially a different cacheline for
> each mm.  It would probably still be a win because avoiding the flush
> is really quite valuable.
> 
> What do you think?  The added code would be tiny.

That might be worth a try.

Now one other optimization which should be trivial to add is to keep the 4
asid context entries in cpu_tlbstate and cache the last asid in thread
info. If that's still valid then use it otherwise unconditionally get a new
one. That avoids the whole loop machinery and thread info is cache hot in
the context switch anyway. Delta patch on top of your version below.

> (P.S. Why doesn't random_p32() try arch_random_int()?)

Could you please ask questions which do not require crystal balls for
answering?

Thanks,

tglx

8<---
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -159,8 +159,16 @@ static inline void destroy_context(struc
 extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
  struct task_struct *tsk);
 
-extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
-  struct task_struct *tsk);
+extern void __switch_mm_irqs_off(struct mm_struct *prev,
+struct mm_struct *next, u32 *last_asid);
+
+static inline void switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+   __switch_mm_irqs_off(prev, next, &tsk->thread_info.asid);
+}
+
 #define switch_mm_irqs_off switch_mm_irqs_off
 
 #define activate_mm(prev, next)\
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -54,6 +54,7 @@ struct task_struct;
 
 struct thread_info {
unsigned long   flags;  /* low level flags */
+   u32 asid;
 };
 
 #define INIT_THREAD_INFO(tsk)  \
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -83,10 +83,13 @@ static inline u64 bump_mm_tlb_gen(struct
 #endif
 
 /*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
+ * NR_DYNAMIC_ASIDS must be a power of 2. 4 makes tlb_state fit into two
+ * cache lines.
  */
-#define NR_DYNAMIC_ASIDS 6
+#define NR_DYNAMIC_ASIDS_BITS  2
+#define NR_DYNAMIC_ASIDS   (1U << NR_DYNAMIC_ASIDS_BITS)
+#define DYNAMIC_ASIDS_MASK (NR_DYNAMIC_ASIDS - 1)
+#define ASID_NEEDS_FLUSH   (1U << 16)
 
 struct tlb_context {
u64 ctx_id;
@@ -102,7 +105,8 @@ struct tlb_state {
 */
struct mm_struct *loaded_mm;
u16 loaded_mm_asid;
-   u16 next_asid;
+   u16 curr_asid;
+   u32 notask_asid;
 
/*
 * Access to this CR4 shadow and to H/W CR4 is protected by
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,7 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)

Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-21 Thread Andy Lutomirski
On Wed, Jun 21, 2017 at 6:38 AM, Thomas Gleixner  wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> This patch uses PCID differently.  We use a PCID to identify a
>> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
>> binding at all; instead, we give it a fresh PCID each time it's
>> loaded except in cases where we want to preserve the TLB, in which
>> case we reuse a recent value.
>>
>> This seems to save about 100ns on context switches between mms.
>
> Depending on the workload, I assume. For a CPU switching between a large
> number of processes consecutively it won't make a difference. In fact it
> will be slower due to the extra few cycles required for rotating the asid,
> but I doubt that this can be measured.

True.  I suspect this can be improved -- see below.

>
>> +/*
>> + * 6 because 6 should be plenty and struct tlb_state will fit in
>> + * two cache lines.
>> + */
>> +#define NR_DYNAMIC_ASIDS 6
>
> That requires a conditional branch
>
> if (asid >= NR_DYNAMIC_ASIDS) {
> asid = 0;
> 
> }
>
> The question is whether 4 IDs would be sufficient which trades the branch
> for a mask operation. Or you go for 8 and spend another cache line.

Interesting.  I'm inclined to either leave it at 6 or reduce it to 4
for now and to optimize later.

>
>>  atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>>
>> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
>> + u16 *new_asid, bool *need_flush)
>> +{
>> + u16 asid;
>> +
>> + if (!static_cpu_has(X86_FEATURE_PCID)) {
>> + *new_asid = 0;
>> + *need_flush = true;
>> + return;
>> + }
>> +
>> + for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
>> + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
>> + next->context.ctx_id)
>> + continue;
>> +
>> + *new_asid = asid;
>> + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
>> +next_tlb_gen);
>> + return;
>> + }
>
> Hmm. So this loop needs to be taken unconditionally even if the task stays
> on the same CPU. And of course the number of dynamic IDs has to be short in
> order to make this loop suck performance wise.
>
> Something like the completely dysfunctional below might be worthwhile to
> explore. At least arch/x86/mm/ compiles :)
>
> It gets rid of the loop search and lifts the limit of dynamic ids by
> trading it with a percpu variable in mm_context_t.

That would work, but it would take a lot more memory on large systems
with lots of processes, and I'd also be concerned that we might run
out of dynamic percpu space.

How about a different idea: make the percpu data structure look like a
4-way set associative cache.  The ctxs array could be, say, 1024
entries long without using crazy amounts of memory.  We'd divide it
into 256 buckets, so you'd index it like ctxs[4*bucket + slot].  For
each mm, we choose a random bucket (from 0 through 256), and then we'd
just loop over the four slots in the bucket in choose_asid().  This
would require very slightly more arithmetic (I'd guess only one or two
cycles, though) but, critically, wouldn't touch any more cachelines.

The downside of both of these approaches over the one in this patch is
that the chance that the percpu cacheline we need is not in the cache
is quite a bit higher since it's potentially a different cacheline for
each mm.  It would probably still be a win because avoiding the flush
is really quite valuable.

What do you think?  The added code would be tiny.

(P.S. Why doesn't random_p32() try arch_random_int()?)

--Andy
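
A stand-alone sketch of the 4-way set-associative ctxs layout proposed above.
Every name here (choose_asid_assoc, mm_model, the bucket field) is invented for
illustration and the eviction policy is simplified, so read it as a model of
the idea rather than as the code that was merged.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_SLOTS   4                    /* associativity */
#define NR_BUCKETS 256                  /* 1024 entries in total */

struct tlb_context_model {
        uint64_t ctx_id;                /* 0 means "slot empty" */
        uint64_t tlb_gen;
};

/* Per-CPU in the kernel; one instance is enough for the model. */
static struct tlb_context_model ctxs[NR_BUCKETS * NR_SLOTS];

struct mm_model {
        uint64_t ctx_id;
        uint64_t tlb_gen;
        uint16_t bucket;                /* picked once per mm, e.g. randomly */
};

/* Only the four slots of this mm's bucket are examined, so the lookup
 * touches no more cache lines than a flat 4-entry array would. */
static uint16_t choose_asid_assoc(const struct mm_model *mm, bool *need_flush)
{
        uint16_t base = mm->bucket * NR_SLOTS;

        for (uint16_t slot = 0; slot < NR_SLOTS; slot++) {
                struct tlb_context_model *c = &ctxs[base + slot];

                if (c->ctx_id != mm->ctx_id)
                        continue;
                *need_flush = c->tlb_gen < mm->tlb_gen;
                c->tlb_gen = mm->tlb_gen;
                return base + slot;
        }

        /* Miss: take slot 0 of the bucket (a real version would rotate). */
        ctxs[base].ctx_id = mm->ctx_id;
        ctxs[base].tlb_gen = mm->tlb_gen;
        *need_flush = true;
        return base;
}

int main(void)
{
        struct mm_model mm = { .ctx_id = 42, .tlb_gen = 1, .bucket = 17 };
        bool flush;
        uint16_t asid;

        asid = choose_asid_assoc(&mm, &flush);
        printf("asid %u, flush %d\n", asid, flush);     /* miss: flush */
        asid = choose_asid_assoc(&mm, &flush);
        printf("asid %u, flush %d\n", asid, flush);     /* hit: no flush */
        return 0;
}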


Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-21 Thread Thomas Gleixner
On Wed, 21 Jun 2017, Thomas Gleixner wrote:
> > +   for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
> > +   if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
> > +   next->context.ctx_id)
> > +   continue;
> > +
> > +   *new_asid = asid;
> > +   *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
> > +  next_tlb_gen);
> > +   return;
> > +   }
> 
> Hmm. So this loop needs to be taken unconditionally even if the task stays
> on the same CPU. And of course the number of dynamic IDs has to be short in
> order to make this loop suck performance wise.

 ...  not suck ...


Re: [PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-21 Thread Thomas Gleixner
On Tue, 20 Jun 2017, Andy Lutomirski wrote:
> This patch uses PCID differently.  We use a PCID to identify a
> recently-used mm on a per-cpu basis.  An mm has no fixed PCID
> binding at all; instead, we give it a fresh PCID each time it's
> loaded except in cases where we want to preserve the TLB, in which
> case we reuse a recent value.
> 
> This seems to save about 100ns on context switches between mms.

Depending on the workload, I assume. For a CPU switching between a large
number of processes consecutively it won't make a difference. In fact it
will be slower due to the extra few cycles required for rotating the asid,
but I doubt that this can be measured.
 
> +/*
> + * 6 because 6 should be plenty and struct tlb_state will fit in
> + * two cache lines.
> + */
> +#define NR_DYNAMIC_ASIDS 6

That requires a conditional branch 

if (asid >= NR_DYNAMIC_ASIDS) {
asid = 0;

}

The question is whether 4 IDs would be sufficient which trades the branch
for a mask operation. Or you go for 8 and spend another cache line.
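
To make the trade-off concrete, here is a small stand-alone comparison of the
two wrap-around styles.  It is only a model (the kernel keeps next_asid in the
per-CPU tlb_state); 6 is the value from the patch and 4 stands in for the
hypothetical power-of-two variant.

#include <stdint.h>
#include <stdio.h>

#define NR_DYNAMIC_ASIDS      6         /* as in the patch: needs a branch   */
#define NR_DYNAMIC_ASIDS_POW2 4         /* power of two: branch becomes mask */

static uint16_t next_asid_branch(uint16_t asid)
{
        /* Works for any count, at the cost of a (well-predicted) branch. */
        if (++asid >= NR_DYNAMIC_ASIDS)
                asid = 0;
        return asid;
}

static uint16_t next_asid_mask(uint16_t asid)
{
        /* Only valid when the number of ASIDs is a power of two. */
        return (asid + 1) & (NR_DYNAMIC_ASIDS_POW2 - 1);
}

int main(void)
{
        uint16_t a = 0, b = 0;

        for (int i = 0; i < 8; i++) {
                a = next_asid_branch(a);
                b = next_asid_mask(b);
                printf("%u %u\n", a, b);
        }
        return 0;
}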

>  atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
>  
> +static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
> + u16 *new_asid, bool *need_flush)
> +{
> + u16 asid;
> +
> + if (!static_cpu_has(X86_FEATURE_PCID)) {
> + *new_asid = 0;
> + *need_flush = true;
> + return;
> + }
> +
> + for (asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
> + if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
> + next->context.ctx_id)
> + continue;
> +
> + *new_asid = asid;
> + *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
> +next_tlb_gen);
> + return;
> + }

Hmm. So this loop needs to be taken unconditionally even if the task stays
on the same CPU. And of course the number of dynamic IDs has to be short in
order to make this loop suck performance wise.

Something like the completely dysfunctional below might be worthwhile to
explore. At least arch/x86/mm/ compiles :)

It gets rid of the loop search and lifts the limit of dynamic ids by
trading it with a percpu variable in mm_context_t.

Thanks,

tglx

8<

--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -25,6 +25,8 @@ typedef struct {
 */
atomic64_t tlb_gen;
 
+   u32 __percpu*asids;
+
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
 #endif
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -156,11 +156,23 @@ static inline void destroy_context(struc
destroy_context_ldt(mm);
 }
 
-extern void switch_mm(struct mm_struct *prev, struct mm_struct *next,
- struct task_struct *tsk);
+extern void __switch_mm(struct mm_struct *prev, struct mm_struct *next);
+
+static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
+struct task_struct *tsk)
+{
+   __switch_mm(prev, next);
+}
+
+extern void __switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next);
+
+static inline void switch_mm_irqs_off(struct mm_struct *prev,
+ struct mm_struct *next,
+ struct task_struct *tsk)
+{
+   __switch_mm_irqs_off(prev, next);
+}
 
-extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
-  struct task_struct *tsk);
 #define switch_mm_irqs_off switch_mm_irqs_off
 
 #define activate_mm(prev, next)\
@@ -299,6 +311,9 @@ static inline unsigned long __get_curren
 {
unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
 
+   if (static_cpu_has(X86_FEATURE_PCID))
+   cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || !in_atomic());
 
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,6 +35,7 @@
 /* Mask off the address space ID bits. */
 #define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull
 #define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -42,6 +43,7 @@
  */
 #define CR3_ADDR_MASK 0xFFFFFFFFull
 #define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
 #endif
 
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,15 @@ static inline u64 bump_mm_tlb_gen(struct
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * NR_DYNAMIC_ASIDS must be a power of 2. 4 makes tlb_state fit into two
+ * cache lines.
+ */
+#define NR_DYNAMIC_ASIDS_BITS  2
+#define NR_DYNAMIC_ASIDS   (1U << NR_DYNAMIC_ASIDS_B