Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-23 Thread Andy Lutomirski
On Fri, Jun 23, 2017 at 1:42 AM, Borislav Petkov  wrote:
> On Thu, Jun 22, 2017 at 11:08:38AM -0700, Andy Lutomirski wrote:
>> Yes, I agree it's confusing.  There really are three numbers.  Those
>> numbers are: the latest generation, the generation that this CPU has
>> caught up to, and the generation that the requester of the flush we're
>> currently handling has asked us to catch up to.  I don't see a way to
>> reduce the complexity.
>
> Yeah, can you pls put that clarification of what is what over it. It
> explains nicely what the check is supposed to do.

Done.  I've tried to improve a bunch of the comments in this function.

>
>> >> The flush IPI hits after a switch_mm_irqs_off() call notices the
>> >> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
>> >> increment the local tlb_gen to 2, and the IPI handler for the partial
>> >> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
>> >> == 2 and mm_tlb_gen == 3) and do a partial flush.
>> >
>> > Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
>> >
>> > That's why you have this thing in addition to the tlb_gen.
>>
>> Yes.  The idea is that we only do remote partial flushes when it's
>> 100% obvious that it's safe.
>
> So why wouldn't my simplified suggestion work then?
>
> if (f->end != TLB_FLUSH_ALL &&
>  mm_tlb_gen == local_tlb_gen + 1)
>
> 1->2 is a partial flush - gets promoted to a full one
> 2->3 is a full flush - it will get executed as one due to the f->end setting to
> TLB_FLUSH_ALL.

This could still fail in some cases, I think.  Suppose 1->2 is a
partial flush and 2->3 is a full flush.  We could have this order of
events:

 - CPU 1: Partial flush.  Increase context.tlb_gen to 2 and send IPI.
 - CPU 0: switch_mm(), observe mm_tlb_gen == 2, set local_tlb_gen to 2.
 - CPU 2: Full flush.  Increase context.tlb_gen to 3 and send IPI.
 - CPU 0: Receive partial flush IPI.  mm_tlb_gen == 3 and local_tlb_gen == 2,
so the simplified check passes: do __flush_tlb_single() and set
local_tlb_gen to 3.

Our invariant is now broken: CPU 0's percpu tlb_gen is now ahead of
its actual TLB state.

 - CPU 0: Receive full flush IPI and skip the flush.  Oops.

I think my condition makes it clear that the invariants we need hold
no matter what.
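
To make that concrete, here is a minimal userspace sketch (plain C, not
kernel code; the variable names just mirror the ones in the check) that
plugs the moment CPU 0 handles the delayed partial-flush IPI into both
conditions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* State when CPU 0 handles the *partial* flush IPI, after CPU 2
         * has already bumped the mm for its full flush. */
        uint64_t local_tlb_gen = 2;    /* caught up via switch_mm() */
        uint64_t mm_tlb_gen    = 3;    /* bumped by CPU 2's full flush */
        uint64_t new_tlb_gen   = 2;    /* what the partial request asked for */
        bool partial_request   = true; /* i.e. f->end != TLB_FLUSH_ALL */

        /* Simplified check: allows the partial flush, so CPU 0 would do
         * an INVLPG, record generation 3, and later skip the real full
         * flush for 2->3. */
        bool simplified = partial_request &&
                          mm_tlb_gen == local_tlb_gen + 1;

        /* v3 check: new_tlb_gen has already fallen behind mm_tlb_gen,
         * so the partial path is refused and the handler falls back to
         * a full flush, which also covers generation 3. */
        bool v3 = partial_request &&
                  new_tlb_gen == local_tlb_gen + 1 &&
                  new_tlb_gen == mm_tlb_gen;

        printf("simplified check does a partial flush: %d\n", simplified); /* 1 */
        printf("v3 check does a partial flush:         %d\n", v3);         /* 0 */
        return 0;
}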

>
>> It could be converted to two full flushes or to just one, I think,
>> depending on what order everything happens in.
>
> Right. One flush at the right time would be optimal.
>
>> But this approach of using three separate tlb_gen values seems to
>> cover all the bases, and I don't think it's *that* bad.
>
> Sure.
>
> As I said in IRC, let's document that complexity then so that when we
> stumble over it in the future, we at least know why it was done this
> way.

I've given it a try.  Hopefully v4 is more clear.


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-23 Thread Borislav Petkov
On Thu, Jun 22, 2017 at 11:08:38AM -0700, Andy Lutomirski wrote:
> Yes, I agree it's confusing.  There really are three numbers.  Those
> numbers are: the latest generation, the generation that this CPU has
> caught up to, and the generation that the requester of the flush we're
> currently handling has asked us to catch up to.  I don't see a way to
> reduce the complexity.

Yeah, can you pls put that clarification of what is what over it. It
explains nicely what the check is supposed to do.

> >> The flush IPI hits after a switch_mm_irqs_off() call notices the
> >> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
> >> increment the local tlb_gen to 2, and the IPI handler for the partial
> >> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
> >> == 2 and mm_tlb_gen == 3) and do a partial flush.
> >
> > Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
> >
> > That's why you have this thing in addition to the tlb_gen.
> 
> Yes.  The idea is that we only do remote partial flushes when it's
> 100% obvious that it's safe.

So why wouldn't my simplified suggestion work then?

if (f->end != TLB_FLUSH_ALL &&
 mm_tlb_gen == local_tlb_gen + 1)

1->2 is a partial flush - gets promoted to a full one
2->3 is a full flush - it will get executed as one due to the f->end setting to
TLB_FLUSH_ALL.

> It could be converted to two full flushes or to just one, I think,
> depending on what order everything happens in.

Right. One flush at the right time would be optimal.

> But this approach of using three separate tlb_gen values seems to
> cover all the bases, and I don't think it's *that* bad.

Sure.

As I said in IRC, let's document that complexity then so that when we
stumble over it in the future, we at least know why it was done this
way.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-22 Thread Andy Lutomirski
On Thu, Jun 22, 2017 at 10:22 AM, Borislav Petkov  wrote:
> On Thu, Jun 22, 2017 at 08:55:36AM -0700, Andy Lutomirski wrote:
>> > Ah, simple: we control the flushing with info.new_tlb_gen and
>> > mm->context.tlb_gen. I.e., this check:
>> >
>> >
>> > if (f->end != TLB_FLUSH_ALL &&
>> > f->new_tlb_gen == local_tlb_gen + 1 &&
>> > f->new_tlb_gen == mm_tlb_gen) {
>> >
>> > why can't we write:
>> >
>> > if (f->end != TLB_FLUSH_ALL &&
>> > mm_tlb_gen == local_tlb_gen + 1)
>> >
>> > ?
>>
>> Ah, I thought you were asking about why I needed mm_tlb_gen ==
>> local_tlb_gen + 1.  This is just an optimization, or at least I hope
>> it is.  The idea is that, if we know that another flush is coming, it
>> seems likely that it would be faster to do a full flush and increase
>> local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
>> flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
>> needing to flush again very soon.
>
> Thus the check whether f->new_tlb_gen is local_tlb_gen + 1.
>
> Btw, do you see how confusing this check is: you have new_tlb_gen from
> a variable passed from the function call IPI, local_tlb_gen which is
> the CPU's own and then there's also mm_tlb_gen which we've written into
> new_tlb_gen from:
>
> info.new_tlb_gen = bump_mm_tlb_gen(mm);
>
> which incremented mm_tlb_gen too.

Yes, I agree it's confusing.  There really are three numbers.  Those
numbers are: the latest generation, the generation that this CPU has
caught up to, and the generation that the requester of the flush we're
currently handling has asked us to catch up to.  I don't see a way to
reduce the complexity.
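
To pin those three down against the names in the code, here is a tiny
userspace model of the decision (the helper name is invented; the
generation arguments correspond to mm_tlb_gen, the per-CPU tlb_gen and
f->new_tlb_gen in flush_tlb_func_common()):

#include <stdbool.h>
#include <stdint.h>

static bool would_do_partial_flush(bool range_flush,        /* f->end != TLB_FLUSH_ALL */
                                   uint64_t mm_tlb_gen,     /* latest generation of the mm */
                                   uint64_t local_tlb_gen,  /* what this CPU has caught up to */
                                   uint64_t new_tlb_gen)    /* what this request asks us to reach */
{
        return range_flush &&
               new_tlb_gen == local_tlb_gen + 1 &&  /* we wouldn't skip a generation */
               new_tlb_gen == mm_tlb_gen;           /* and no further flush is already queued */
}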

>
>> Hmm.  I'd be nervous that there are more subtle races if we do this.
>> For example, suppose that a partial flush increments tlb_gen from 1 to
>> 2 and a full flush increments tlb_gen from 2 to 3.  Meanwhile, the CPU
>> is busy switching back and forth between mms, so the partial flush
>> sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
>> set in mm_cpumask.
>
> Lemme see if I understand this correctly: you mean, the full flush will
> exit early due to the
>
> if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
>
> test?

Yes, or at least that's what I'm imagining.

>
>> The flush IPI hits after a switch_mm_irqs_off() call notices the
>> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
>> increment the local tlb_gen to 2, and the IPI handler for the partial
>> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
>> == 2 and mm_tlb_gen == 3) and do a partial flush.
>
> Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
>
> That's why you have this thing in addition to the tlb_gen.

Yes.  The idea is that we only do remote partial flushes when it's
100% obvious that it's safe.

>
> What we end up doing in this case is promote the partial flush to a
> full one, and thus a partial and a full flush which are close together
> get converted to two full flushes.

It could be converted to two full flushes or to just one, I think,
depending on what order everything happens in.

>
>> The problem here is that it's not obvious to me that this actually
>> ends up flushing everything that's needed. Maybe all the memory
>> ordering gets this right, but I can imagine scenarios in which
>> switch_mm_irqs_off() does its flush early enough that the TLB picks up
>> an entry that was supposed to get zapped by the full flush.
>
> See above.
>
> And I don't think that having two full flushes back-to-back is going to
> cost a lot as the second one won't flush a whole lot.

From limited benchmarking on new Intel chips, a full flush is very
expensive no matter what.  I think this is silly because I suspect
that the PCID circuitry could internally simulate a full flush at very
little cost, but it seems that it doesn't.  I haven't tried to
benchmark INVLPG.

>
>> IOW it *might* be valid, but I think it would need very careful review
>> and documentation.
>
> Always.
>
> Btw, I get the sense this TLB flush avoiding scheme becomes pretty
> complex for diminishing reasons.
>
>   [ Or maybe I'm not seeing them - I'm always open to corrections. ]
>
> Especially if intermediary levels from the pagetable walker are cached
> and reestablishing the TLB entries seldom means a full walk.
>
> You should do a full fledged benchmark to see whether this whole
> complexity is even worth it, methinks.

I agree that, by itself, this patch is waay too complex to be
justifiable.  The thing is that, after quite a few false starts, I
couldn't find a clean way to get PCID and improved laziness without
this thing as a prerequisite.  Both of those features depend on having
a heuristic for when a flush can be avoided, and that heuristic must
*never* say that a flush can be skipped when it can't be skipped.
This patch gives a way to do that.
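
Stated as code, the safety rule the heuristic has to obey is only this
(userspace sketch, invented name; it mirrors the early-return case in
flush_tlb_func_common()):

#include <stdbool.h>
#include <stdint.h>

/* Skipping is only safe when this CPU has already observed every
 * generation that has been requested for this mm. */
static bool may_skip_flush(uint64_t local_tlb_gen, uint64_t mm_tlb_gen)
{
        return local_tlb_gen == mm_tlb_gen;
}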

I tried a few other approaches:

 - Keeping a cpumask of which 

Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-22 Thread Borislav Petkov
On Thu, Jun 22, 2017 at 08:55:36AM -0700, Andy Lutomirski wrote:
> > Ah, simple: we control the flushing with info.new_tlb_gen and
> > mm->context.tlb_gen. I.e., this check:
> >
> >
> > if (f->end != TLB_FLUSH_ALL &&
> > f->new_tlb_gen == local_tlb_gen + 1 &&
> > f->new_tlb_gen == mm_tlb_gen) {
> >
> > why can't we write:
> >
> > if (f->end != TLB_FLUSH_ALL &&
> > mm_tlb_gen == local_tlb_gen + 1)
> >
> > ?
> 
> Ah, I thought you were asking about why I needed mm_tlb_gen ==
> local_tlb_gen + 1.  This is just an optimization, or at least I hope
> it is.  The idea is that, if we know that another flush is coming, it
> seems likely that it would be faster to do a full flush and increase
> local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
> flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
> needing to flush again very soon.

Thus the check whether f->new_tlb_gen is local_tlb_gen + 1.

Btw, do you see how confusing this check is: you have new_tlb_gen from
a variable passed from the function call IPI, local_tlb_gen which is
the CPU's own and then there's also mm_tlb_gen which we've written into
new_tlb_gen from:

info.new_tlb_gen = bump_mm_tlb_gen(mm);

which incremented mm_tlb_gen too.

> Hmm.  I'd be nervous that there are more subtle races if we do this.
> For example, suppose that a partial flush increments tlb_gen from 1 to
> 2 and a full flush increments tlb_gen from 2 to 3.  Meanwhile, the CPU
> is busy switching back and forth between mms, so the partial flush
> sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
> set in mm_cpumask.

Lemme see if I understand this correctly: you mean, the full flush will
exit early due to the

if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {

test?

> The flush IPI hits after a switch_mm_irqs_off() call notices the
> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
> increment the local tlb_gen to 2, and the IPI handler for the partial
> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
> == 2 and mm_tlb_gen == 3) and do a partial flush.

Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.

That's why you have this thing in addition to the tlb_gen.

What we end up doing in this case is promote the partial flush to a
full one, and thus a partial and a full flush which are close together
get converted to two full flushes.

> The problem here is that it's not obvious to me that this actually
> ends up flushing everything that's needed. Maybe all the memory
> ordering gets this right, but I can imagine scenarios in which
> switch_mm_irqs_off() does its flush early enough that the TLB picks up
> an entry that was supposed to get zapped by the full flush.

See above.

And I don't think that having two full flushes back-to-back is going to
cost a lot as the second one won't flush a whole lot.

> IOW it *might* be valid, but I think it would need very careful review
> and documentation.

Always.

Btw, I get the sense this TLB flush avoiding scheme becomes pretty
complex for diminishing reasons.

  [ Or maybe I'm not seeing them - I'm always open to corrections. ]

Especially if intermediary levels from the pagetable walker are cached
and reestablishing the TLB entries seldom means a full walk.

You should do a full fledged benchmark to see whether this whole
complexity is even worth it, methinks.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-22 Thread Andy Lutomirski
On Thu, Jun 22, 2017 at 7:59 AM, Borislav Petkov  wrote:
> On Thu, Jun 22, 2017 at 07:48:21AM -0700, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov  wrote:
>> > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
>> >> > I'm certainly still missing something here:
>> >> >
>> >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
>> >> > do once
>> >> >
>> >> > bump_mm_tlb_gen(mm);
>> >> >
>> >> > and once
>> >> >
>> >> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
>> >> >
>> >> > and in both cases, the bumping is done on mm->context.tlb_gen.
>> >> >
>> >> > So why isn't that enough to do the flushing and we have to consult
>> >> > info.new_tlb_gen too?
>> >>
>> >> The issue is a possible race.  Suppose we start at tlb_gen == 1 and
>> >> then two concurrent flushes happen.  The first flush is a full flush
>> >> and sets tlb_gen to 2.  The second is a partial flush and sets tlb_gen
>> >> to 3.  If the second flush gets propagated to a given CPU first and it
>> >
>> > Maybe I'm still missing something, which is likely...
>> >
>> > but if the second flush gets propagated to the CPU first, the CPU will
>> > have local tlb_gen 1 and thus enforce a full flush anyway because we
>> > will go 1 -> 3 on that particular CPU. Or?
>> >
>>
>> Yes, exactly.  Which means I'm probably just misunderstanding your
>> original question.  Can you re-ask it?
>
> Ah, simple: we control the flushing with info.new_tlb_gen and
> mm->context.tlb_gen. I.e., this check:
>
>
> if (f->end != TLB_FLUSH_ALL &&
> f->new_tlb_gen == local_tlb_gen + 1 &&
> f->new_tlb_gen == mm_tlb_gen) {
>
> why can't we write:
>
> if (f->end != TLB_FLUSH_ALL &&
> mm_tlb_gen == local_tlb_gen + 1)
>
> ?

Ah, I thought you were asking about why I needed mm_tlb_gen ==
local_tlb_gen + 1.  This is just an optimization, or at least I hope
it is.  The idea is that, if we know that another flush is coming, it
seems likely that it would be faster to do a full flush and increase
local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
needing to flush again very soon.
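
For a concrete (made-up) set of numbers: local_tlb_gen == 1,
f->new_tlb_gen == 2 for a partial request, and mm_tlb_gen == 3 because
another flush has already been queued.  Then f->new_tlb_gen ==
local_tlb_gen + 1 but f->new_tlb_gen != mm_tlb_gen, so the check falls
through to a full flush and local_tlb_gen jumps straight to 3, rather
than doing an INVLPG now and flushing again almost immediately.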

>
> If mm_tlb_gen is + 2, then we'll do a full flush, if it is + 1, then
> partial.
>
> If the second flush, as you say is a partial one and still gets
> propagated first, the check will force a full flush anyway.
>
> When the first flush propagates after the second, we'll ignore it
> because local_tlb_gen has advanced already due to the second flush.
>
> As a matter of fact, we could simplify the logic: if local_tlb_gen is
> only mm_tlb_gen - 1, then do the requested flush type.

Hmm.  I'd be nervous that there are more subtle races if we do this.
For example, suppose that a partial flush increments tlb_gen from 1 to
2 and a full flush increments tlb_gen from 2 to 3.  Meanwhile, the CPU
is busy switching back and forth between mms, so the partial flush
sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
set in mm_cpumask.  The flush IPI hits after a switch_mm_irqs_off()
call notices the change from 1 to 2.  switch_mm_irqs_off() will do a
full flush and increment the local tlb_gen to 2, and the IPI handler
for the partial flush will see local_tlb_gen == mm_tlb_gen - 1
(because local_tlb_gen == 2 and mm_tlb_gen == 3) and do a partial
flush.  The problem here is that it's not obvious to me that this
actually ends up flushing everything that's needed.  Maybe all the
memory ordering gets this right, but I can imagine scenarios in which
switch_mm_irqs_off() does its flush early enough that the TLB picks up
an entry that was supposed to get zapped by the full flush.

IOW it *might* be valid, but I think it would need very careful review
and documentation.

--Andy


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-22 Thread Borislav Petkov
On Thu, Jun 22, 2017 at 07:48:21AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov  wrote:
> > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
> >> > I'm certainly still missing something here:
> >> >
> >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> >> > do once
> >> >
> >> > bump_mm_tlb_gen(mm);
> >> >
> >> > and once
> >> >
> >> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
> >> >
> >> > and in both cases, the bumping is done on mm->context.tlb_gen.
> >> >
> >> > So why isn't that enough to do the flushing and we have to consult
> >> > info.new_tlb_gen too?
> >>
> >> The issue is a possible race.  Suppose we start at tlb_gen == 1 and
> >> then two concurrent flushes happen.  The first flush is a full flush
> >> and sets tlb_gen to 2.  The second is a partial flush and sets tlb_gen
> >> to 3.  If the second flush gets propagated to a given CPU first and it
> >
> > Maybe I'm still missing something, which is likely...
> >
> > but if the second flush gets propagated to the CPU first, the CPU will
> > have local tlb_gen 1 and thus enforce a full flush anyway because we
> > will go 1 -> 3 on that particular CPU. Or?
> >
> 
> Yes, exactly.  Which means I'm probably just misunderstanding your
> original question.  Can you re-ask it?

Ah, simple: we control the flushing with info.new_tlb_gen and
mm->context.tlb_gen. I.e., this check:


if (f->end != TLB_FLUSH_ALL &&
f->new_tlb_gen == local_tlb_gen + 1 &&
f->new_tlb_gen == mm_tlb_gen) {

why can't we write:

if (f->end != TLB_FLUSH_ALL &&
mm_tlb_gen == local_tlb_gen + 1)

?

If mm_tlb_gen is + 2, then we'll do a full flush, if it is + 1, then
partial.

If the second flush, as you say is a partial one and still gets
propagated first, the check will force a full flush anyway.

When the first flush propagates after the second, we'll ignore it
because local_tlb_gen has advanced already due to the second flush.

As a matter of fact, we could simplify the logic: if local_tlb_gen is
only mm_tlb_gen - 1, then do the requested flush type.

If mm_tlb_gen has advanced more than 1 generation, just do a full flush
unconditionally. ... and I think we do something like that already, but the
logic could be simplified, unless I'm missing something, that is.
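
I.e., roughly (just a sketch of the suggestion, not what the patch
currently does):

if (mm_tlb_gen == local_tlb_gen + 1) {
        /* exactly one generation behind: do the requested flush type,
         * partial if f->end != TLB_FLUSH_ALL, full otherwise */
} else {
        /* more than one generation behind: full flush unconditionally */
}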

Thanks.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-22 Thread Andy Lutomirski
On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov  wrote:
> On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
>> > I'm certainly still missing something here:
>> >
>> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
>> > do once
>> >
>> > bump_mm_tlb_gen(mm);
>> >
>> > and once
>> >
>> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
>> >
>> > and in both cases, the bumping is done on mm->context.tlb_gen.
>> >
>> > So why isn't that enough to do the flushing and we have to consult
>> > info.new_tlb_gen too?
>>
>> The issue is a possible race.  Suppose we start at tlb_gen == 1 and
>> then two concurrent flushes happen.  The first flush is a full flush
>> and sets tlb_gen to 2.  The second is a partial flush and sets tlb_gen
>> to 3.  If the second flush gets propagated to a given CPU first and it
>
> Maybe I'm still missing something, which is likely...
>
> but if the second flush gets propagated to the CPU first, the CPU will
> have local tlb_gen 1 and thus enforce a full flush anyway because we
> will go 1 -> 3 on that particular CPU. Or?
>

Yes, exactly.  Which means I'm probably just misunderstanding your
original question.  Can you re-ask it?

--Andy


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-22 Thread Borislav Petkov
On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
> > I'm certainly still missing something here:
> >
> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> > do once
> >
> > bump_mm_tlb_gen(mm);
> >
> > and once
> >
> > info.new_tlb_gen = bump_mm_tlb_gen(mm);
> >
> > and in both cases, the bumping is done on mm->context.tlb_gen.
> >
> > So why isn't that enough to do the flushing and we have to consult
> > info.new_tlb_gen too?
> 
> The issue is a possible race.  Suppose we start at tlb_gen == 1 and
> then two concurrent flushes happen.  The first flush is a full flush
> and sets tlb_gen to 2.  The second is a partial flush and sets tlb_gen
> to 3.  If the second flush gets propagated to a given CPU first and it

Maybe I'm still missing something, which is likely...

but if the second flush gets propagated to the CPU first, the CPU will
have local tlb_gen 1 and thus enforce a full flush anyway because we
will go 1 -> 3 on that particular CPU. Or?

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-21 Thread Andy Lutomirski
On Wed, Jun 21, 2017 at 11:44 AM, Borislav Petkov  wrote:
> On Tue, Jun 20, 2017 at 10:22:11PM -0700, Andy Lutomirski wrote:
>> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
>> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
>> +atomic64_read(&next->context.tlb_gen));
>
> Just let it stick out:
>
> this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,  next->context.ctx_id);
> this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, 
> atomic64_read(&next->context.tlb_gen));
>
> Should be a bit better readable this way.

Done

>> + if (local_tlb_gen == mm_tlb_gen) {
>
> if (unlikely(...
>
> maybe?
>
> Sounds to me like the concurrent flushes case would be the
> uncommon one...

Agreed.

>> +
>> + WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
>> + WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
>> +
>> + /*
>> +  * If we get to this point, we know that our TLB is out of date.
>> +  * This does not strictly imply that we need to flush (it's
>> +  * possible that f->new_tlb_gen <= local_tlb_gen), but we're
>> +  * going to need to flush in the very near future, so we might
>> +  * as well get it over with.
>> +  *
>> +  * The only question is whether to do a full or partial flush.
>> +  *
>> +  * A partial TLB flush is safe and worthwhile if two conditions are
>> +  * met:
>> +  *
>> +  * 1. We wouldn't be skipping a tlb_gen.  If the requester bumped
>> +  *the mm's tlb_gen from p to p+1, a partial flush is only correct
>> +  *if we would be bumping the local CPU's tlb_gen from p to p+1 as
>> +  *well.
>> +  *
>> +  * 2. If there are no more flushes on their way.  Partial TLB
>> +  *flushes are not all that much cheaper than full TLB
>> +  *flushes, so it seems unlikely that it would be a
>> +  *performance win to do a partial flush if that won't bring
>> +  *our TLB fully up to date.
>> +  */
>> + if (f->end != TLB_FLUSH_ALL &&
>> + f->new_tlb_gen == local_tlb_gen + 1 &&
>> + f->new_tlb_gen == mm_tlb_gen) {
>
> I'm certainly still missing something here:
>
> We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> do once
>
> bump_mm_tlb_gen(mm);
>
> and once
>
> info.new_tlb_gen = bump_mm_tlb_gen(mm);
>
> and in both cases, the bumping is done on mm->context.tlb_gen.
>
> So why isn't that enough to do the flushing and we have to consult
> info.new_tlb_gen too?

The issue is a possible race.  Suppose we start at tlb_gen == 1 and
then two concurrent flushes happen.  The first flush is a full flush
and sets tlb_gen to 2.  The second is a partial flush and sets tlb_gen
to 3.  If the second flush gets propagated to a given CPU first and it
were to do an actual partial flush (INVLPG) and set the percpu tlb_gen
to 3, then the first flush won't do anything and we'll fail to flush
all the pages we need to flush.

My solution was to say that we're only allowed to do INVLPG if we're
making exactly the same change to the local tlb_gen that the requester
made to context.tlb_gen.

I'll add a comment to this effect.
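
Concretely, when the partial IPI arrives first in that race, the
handler sees

    local_tlb_gen  == 1   (this CPU saw neither bump)
    mm_tlb_gen     == 3
    f->new_tlb_gen == 3   (the partial request's own bump)

so f->new_tlb_gen == mm_tlb_gen holds but f->new_tlb_gen ==
local_tlb_gen + 1 does not: the INVLPG path is refused and the handler
does a full flush up to generation 3, which also covers the full flush
that was requested for generation 2.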

>
>> + /* Partial flush */
>>   unsigned long addr;
>>   unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
>
> < newline here.

Yup.


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-21 Thread Borislav Petkov
On Tue, Jun 20, 2017 at 10:22:11PM -0700, Andy Lutomirski wrote:
> There are two kernel features that would benefit from tracking
> how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
> it up to date in real time:
> 
>  - Lazy mm switching currently works by switching to init_mm when
>it would otherwise flush.  This is wasteful: there isn't fundamentally
>any need to update CR3 at all when going lazy or when returning from
>lazy mode, nor is there any need to receive flush IPIs at all.  Instead,
>we should just stop trying to keep the TLB coherent when we go lazy and,
>when unlazying, check whether we missed any flushes.
> 
>  - PCID will let us keep recent user contexts alive in the TLB.  If we
>start doing this, we need a way to decide whether those contexts are
>up to date.
> 
> On some paravirt systems, remote TLBs can be flushed without IPIs.
> This won't update the target CPUs' tlb_gens, which may cause
> unnecessary local flushes later on.  We can address this if it becomes
> a problem by carefully updating the target CPU's tlb_gen directly.
> 
> By itself, this patch is a very minor optimization that avoids
> unnecessary flushes when multiple TLB flushes targeting the same CPU
> race.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/include/asm/tlbflush.h | 37 +++
>  arch/x86/mm/tlb.c   | 79 +
>  2 files changed, 109 insertions(+), 7 deletions(-)

...

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 6d9d37323a43..9f5ef7a5e74a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>   }
>  
>   this_cpu_write(cpu_tlbstate.loaded_mm, next);
> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> +atomic64_read(&next->context.tlb_gen));

Just let it stick out:

this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,  next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, 
> atomic64_read(&next->context.tlb_gen));

Should be a bit better readable this way.

>  
>   WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
>   cpumask_set_cpu(cpu, mm_cpumask(next));
> @@ -194,20 +197,73 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  static void flush_tlb_func_common(const struct flush_tlb_info *f,
> bool local, enum tlb_flush_reason reason)
>  {
> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +
> + /*
> +  * Our memory ordering requirement is that any TLB fills that
> +  * happen after we flush the TLB are ordered after we read
> +  * active_mm's tlb_gen.  We don't need any explicit barrier
> +  * because all x86 flush operations are serializing and the
> +  * atomic64_read operation won't be reordered by the compiler.
> +  */
> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
> +
>   /* This code cannot presently handle being reentered. */
>   VM_WARN_ON(!irqs_disabled());
>  
> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> +loaded_mm->context.ctx_id);
> +
>   if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> + /*
> +  * leave_mm() is adequate to handle any type of flush, and
> +  * we would prefer not to receive further IPIs.
> +  */
>   leave_mm(smp_processor_id());
>   return;
>   }
>  
> - if (f->end == TLB_FLUSH_ALL) {
> - local_flush_tlb();
> - if (local)
> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> - trace_tlb_flush(reason, TLB_FLUSH_ALL);
> - } else {
> + if (local_tlb_gen == mm_tlb_gen) {

if (unlikely(... 

maybe?

Sounds to me like the concurrent flushes case would be the
uncommon one...

> + /*
> +  * There's nothing to do: we're already up to date.  This can
> +  * happen if two concurrent flushes happen -- the first IPI to
> +  * be handled can catch us all the way up, leaving no work for
> +  * the second IPI to be handled.
> +  */
> + return;
> + }


> +
> + WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
> + WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
> +
> + /*
> +  * If we get to this point, we know that our TLB is out of date.
> +  * This does not strictly imply that we need to flush (it's
> +  * possible that f->new_tlb_gen <= local_tlb_gen), but we're
> +  * going to need to flush in the very near future, so we might
> +  * as well get it over with.
> +  *
> +  * The only question is 

Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-21 Thread Borislav Petkov
On Tue, Jun 20, 2017 at 10:22:11PM -0700, Andy Lutomirski wrote:
> There are two kernel features that would benefit from tracking
> how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
> it up to date in real time:
> 
>  - Lazy mm switching currently works by switching to init_mm when
>it would otherwise flush.  This is wasteful: there isn't fundamentally
>any need to update CR3 at all when going lazy or when returning from
>lazy mode, nor is there any need to receive flush IPIs at all.  Instead,
>we should just stop trying to keep the TLB coherent when we go lazy and,
>when unlazying, check whether we missed any flushes.
> 
>  - PCID will let us keep recent user contexts alive in the TLB.  If we
>start doing this, we need a way to decide whether those contexts are
>up to date.
> 
> On some paravirt systems, remote TLBs can be flushed without IPIs.
> This won't update the target CPUs' tlb_gens, which may cause
> unnecessary local flushes later on.  We can address this if it becomes
> a problem by carefully updating the target CPU's tlb_gen directly.
> 
> By itself, this patch is a very minor optimization that avoids
> unnecessary flushes when multiple TLB flushes targeting the same CPU
> race.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/include/asm/tlbflush.h | 37 +++
>  arch/x86/mm/tlb.c               | 79 +
>  2 files changed, 109 insertions(+), 7 deletions(-)

...
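(For context -- the tlbflush.h hunk is elided above.  The per-CPU bookkeeping it introduces looks roughly like the sketch below, reconstructed from the ctxs[0].ctx_id / ctxs[0].tlb_gen accesses in the tlb.c hunk; the field layout and array size are assumptions, not the literal patch.)

	struct tlb_context {
		u64 ctx_id;	/* which mm this slot describes */
		u64 tlb_gen;	/* mm generation this CPU's TLB has reached */
	};

	struct tlb_state {
		struct mm_struct *loaded_mm;
		int state;
		/* ... existing members ... */
		struct tlb_context ctxs[1];	/* slot 0 tracks the loaded mm */
	};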

> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 6d9d37323a43..9f5ef7a5e74a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>   }
>  
>   this_cpu_write(cpu_tlbstate.loaded_mm, next);
> + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> +                atomic64_read(&next->context.tlb_gen));

Just let it stick out:

this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id,  next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, 
               atomic64_read(&next->context.tlb_gen));

Should be a bit better readable this way.

>  
>   WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
>   cpumask_set_cpu(cpu, mm_cpumask(next));
> @@ -194,20 +197,73 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  static void flush_tlb_func_common(const struct flush_tlb_info *f,
> bool local, enum tlb_flush_reason reason)
>  {
> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +
> + /*
> +  * Our memory ordering requirement is that any TLB fills that
> +  * happen after we flush the TLB are ordered after we read
> +  * active_mm's tlb_gen.  We don't need any explicit barrier
> +  * because all x86 flush operations are serializing and the
> +  * atomic64_read operation won't be reordered by the compiler.
> +  */
> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
> +
>   /* This code cannot presently handle being reentered. */
>   VM_WARN_ON(!irqs_disabled());
>  
> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> +loaded_mm->context.ctx_id);
> +
>   if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> + /*
> +  * leave_mm() is adequate to handle any type of flush, and
> +  * we would prefer not to receive further IPIs.
> +  */
>   leave_mm(smp_processor_id());
>   return;
>   }
>  
> - if (f->end == TLB_FLUSH_ALL) {
> - local_flush_tlb();
> - if (local)
> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> - trace_tlb_flush(reason, TLB_FLUSH_ALL);
> - } else {
> + if (local_tlb_gen == mm_tlb_gen) {

if (unlikely(... 

maybe?

Sounds to me like the concurrent flushes case would be the
uncommon one...
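
Spelled out, that suggestion is just (a sketch; behaviour unchanged, only the branch annotation added):

	if (unlikely(local_tlb_gen == mm_tlb_gen)) {
		/* Already caught up -- a concurrent flush got here first. */
		return;
	}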

> + /*
> +  * There's nothing to do: we're already up to date.  This can
> +  * happen if two concurrent flushes happen -- the first IPI to
> +  * be handled can catch us all the way up, leaving no work for
> +  * the second IPI to be handled.
> +  */
> + return;
> + }


> +
> + WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
> + WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
> +
> + /*
> +  * If we get to this point, we know that our TLB is out of date.
> +  * This does not strictly imply that we need to flush (it's
> +  * possible that f->new_tlb_gen <= local_tlb_gen), but we're
> +  * going to need to flush in the very near future, so we might
> +  * as well get it over with.
> +  *
> +  * The only question is whether to do a full or partial flush.
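
(The quoted hunk is cut off by the archive here.  For readers following along, one way the rest of the function could look -- using the stricter f->new_tlb_gen check argued for upthread; this is a sketch, not the literal v3 hunk:)

	if (f->end != TLB_FLUSH_ALL &&
	    f->new_tlb_gen == local_tlb_gen + 1 &&
	    f->new_tlb_gen == mm_tlb_gen) {
		/* Partial flush: only [start, end) changed since local_tlb_gen. */
		unsigned long addr = f->start;
		unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;

		while (addr < f->end) {
			__flush_tlb_single(addr);
			addr += PAGE_SIZE;
		}
		if (local)
			count_vm_tlb_events(NR_TLB_LOCAL_FLUSH_ONE, nr_pages);
		trace_tlb_flush(reason, nr_pages);
	} else {
		/* Full flush: safe for any combination of pending changes. */
		local_flush_tlb();
		if (local)
			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
		trace_tlb_flush(reason, TLB_FLUSH_ALL);
	}

	/* Either way, this CPU's TLB is now current up to mm_tlb_gen. */
	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, mm_tlb_gen);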

Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-21 Thread Andy Lutomirski
On Wed, Jun 21, 2017 at 1:32 AM, Thomas Gleixner  wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>>  struct flush_tlb_info {
>> + /*
>> +  * We support several kinds of flushes.
>> +  *
>> +  * - Fully flush a single mm.  flush_mm will be set, flush_end will be
>
> flush_mm is the *mm member in the struct, right? You might rename that as a
> preparatory step so comments and implementation match.

The comment is outdated.  Fixed now.

>
>> +  *   TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
>> +  *   IPI sender is trying to catch us up.
>> +  *
>> +  * - Partially flush a single mm.  flush_mm will be set, flush_start
>> +  *   and flush_end will indicate the range, and new_tlb_gen will be
>> +  *   set such that the changes between generation new_tlb_gen-1 and
>> +  *   new_tlb_gen are entirely contained in the indicated range.
>> +  *
>> +  * - Fully flush all mms whose tlb_gens have been updated.  flush_mm
>> +  *   will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
>> +  *   will be zero.
>> +  */
>>   struct mm_struct *mm;
>>   unsigned long start;
>>   unsigned long end;
>> + u64 new_tlb_gen;
>
> Nit. While at it could you please make that struct tabular aligned as we
> usually do in x86?

Sure.
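
For illustration, the tabular-aligned form (whitespace only, members as quoted above):

	struct flush_tlb_info {
		struct mm_struct	*mm;
		unsigned long		start;
		unsigned long		end;
		u64			new_tlb_gen;
	};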

>
>>  static void flush_tlb_func_common(const struct flush_tlb_info *f,
>> bool local, enum tlb_flush_reason reason)
>>  {
>> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
>> +
>> + /*
>> +  * Our memory ordering requirement is that any TLB fills that
>> +  * happen after we flush the TLB are ordered after we read
>> +  * active_mm's tlb_gen.  We don't need any explicit barrier
>> +  * because all x86 flush operations are serializing and the
>> +  * atomic64_read operation won't be reordered by the compiler.
>> +  */
>
> Can you please move the comment above the loaded_mm assignment?

I'll move it above the function entirely.  It's more of a general
comment about how the function works than any particular part of the
function.

>
>> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
>> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
>> +
>>   /* This code cannot presently handle being reentered. */
>>   VM_WARN_ON(!irqs_disabled());
>>
>> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
>> +loaded_mm->context.ctx_id);
>> +
>>   if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>> + /*
>> +  * leave_mm() is adequate to handle any type of flush, and
>> +  * we would prefer not to receive further IPIs.
>
> While I know what you mean, it might be useful to have a more elaborate
> explanation why this prevents new IPIs.

Added, although it just gets deleted again later in the series.
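
(Roughly: leave_mm() switches this CPU over to init_mm, and the switch path drops the CPU from mm_cpumask() of the mm being left, so the sender-side check stops targeting it.  A sketch of that check, assuming the flush_tlb_info plumbing from earlier in this series; the exact code may differ:)

	/* Only CPUs still in the mm's cpumask are asked to flush. */
	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
		flush_tlb_others(mm_cpumask(mm), &info);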

>
>> +  */
>>   leave_mm(smp_processor_id());
>>   return;
>>   }
>>
>> - if (f->end == TLB_FLUSH_ALL) {
>> - local_flush_tlb();
>> - if (local)
>> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>> - trace_tlb_flush(reason, TLB_FLUSH_ALL);
>> - } else {
>> + if (local_tlb_gen == mm_tlb_gen) {
>> + /*
>> +  * There's nothing to do: we're already up to date.  This can
>> +  * happen if two concurrent flushes happen -- the first IPI to
>> +  * be handled can catch us all the way up, leaving no work for
>> +  * the second IPI to be handled.
>
> That's not restricted to IPIs, right? A local flush / IPI combo can do that
> as well.

Indeed.  Comment fixed.

>
> Other than those nits;
>
> Reviewed-by: Thomas Gleixner 


Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-21 Thread Thomas Gleixner
On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>  struct flush_tlb_info {
> + /*
> +  * We support several kinds of flushes.
> +  *
> +  * - Fully flush a single mm.  flush_mm will be set, flush_end will be

flush_mm is the *mm member in the struct, right? You might rename that as a
preparatory step so comments and implementation match.

> +  *   TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
> +  *   IPI sender is trying to catch us up.
> +  *
> +  * - Partially flush a single mm.  flush_mm will be set, flush_start
> +  *   and flush_end will indicate the range, and new_tlb_gen will be
> +  *   set such that the changes between generation new_tlb_gen-1 and
> +  *   new_tlb_gen are entirely contained in the indicated range.
> +  *
> +  * - Fully flush all mms whose tlb_gens have been updated.  flush_mm
> +  *   will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
> +  *   will be zero.
> +  */
>   struct mm_struct *mm;
>   unsigned long start;
>   unsigned long end;
> + u64 new_tlb_gen;

Nit. While at it could you please make that struct tabular aligned as we
usually do in x86?

>  static void flush_tlb_func_common(const struct flush_tlb_info *f,
> bool local, enum tlb_flush_reason reason)
>  {
> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +
> + /*
> +  * Our memory ordering requirement is that any TLB fills that
> +  * happen after we flush the TLB are ordered after we read
> +  * active_mm's tlb_gen.  We don't need any explicit barrier
> +  * because all x86 flush operations are serializing and the
> +  * atomic64_read operation won't be reordered by the compiler.
> +  */

Can you please move the comment above the loaded_mm assignment? 

> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
> +
>   /* This code cannot presently handle being reentered. */
>   VM_WARN_ON(!irqs_disabled());
>  
> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> +loaded_mm->context.ctx_id);
> +
>   if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> + /*
> +  * leave_mm() is adequate to handle any type of flush, and
> +  * we would prefer not to receive further IPIs.

While I know what you mean, it might be useful to have a more elaborate
explanation why this prevents new IPIs.

> +  */
>   leave_mm(smp_processor_id());
>   return;
>   }
>  
> - if (f->end == TLB_FLUSH_ALL) {
> - local_flush_tlb();
> - if (local)
> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> - trace_tlb_flush(reason, TLB_FLUSH_ALL);
> - } else {
> + if (local_tlb_gen == mm_tlb_gen) {
> + /*
> +  * There's nothing to do: we're already up to date.  This can
> +  * happen if two concurrent flushes happen -- the first IPI to
> +  * be handled can catch us all the way up, leaving no work for
> +  * the second IPI to be handled.

That's not restricted to IPIs, right? A local flush / IPI combo can do that
as well.

Other than those nits;

Reviewed-by: Thomas Gleixner 

