Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Fri, Jun 23, 2017 at 1:42 AM, Borislav Petkov wrote:
> On Thu, Jun 22, 2017 at 11:08:38AM -0700, Andy Lutomirski wrote:
>> Yes, I agree it's confusing. There really are three numbers. Those
>> numbers are: the latest generation, the generation that this CPU has
>> caught up to, and the generation that the requester of the flush we're
>> currently handling has asked us to catch up to. I don't see a way to
>> reduce the complexity.
>
> Yeah, can you pls put that clarification of what is what over it. It
> explains nicely what the check is supposed to do.

Done. I've tried to improve a bunch of the comments in this function.

>> >> The flush IPI hits after a switch_mm_irqs_off() call notices the
>> >> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
>> >> increment the local tlb_gen to 2, and the IPI handler for the partial
>> >> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
>> >> == 2 and mm_tlb_gen == 3) and do a partial flush.
>> >
>> > Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
>> >
>> > That's why you have this thing in addition to the tlb_gen.
>>
>> Yes. The idea is that we only do remote partial flushes when it's
>> 100% obvious that it's safe.
>
> So why wouldn't my simplified suggestion work then?
>
>         if (f->end != TLB_FLUSH_ALL &&
>             mm_tlb_gen == local_tlb_gen + 1)
>
> 1->2 is a partial flush - gets promoted to a full one
> 2->3 is a full flush - it will get executed as one due to the f->end setting
> to TLB_FLUSH_ALL.

This could still fail in some cases, I think. Suppose 1->2 is a
partial flush and 2->3 is a full flush. We could have this order of
events:

 - CPU 1: Partial flush. Increase context.tlb_gen to 2 and send IPI.
 - CPU 0: switch_mm(), observe mm_tlb_gen == 2, set local_tlb_gen to 2.
 - CPU 2: Full flush. Increase context.tlb_gen to 3 and send IPI.
 - CPU 0: Receive partial flush IPI. local_tlb_gen == 2 and mm_tlb_gen
   == 3. Do __flush_tlb_single() and set local_tlb_gen to 3.

Our invariant is now broken: CPU 0's percpu tlb_gen is now ahead of
its actual TLB state.

 - CPU 0: Receive full flush IPI and skip the flush. Oops.

I think my condition makes it clear that the invariants we need hold
no matter what.

>> It could be converted to two full flushes or to just one, I think,
>> depending on what order everything happens in.
>
> Right. One flush at the right time would be optimal.
>
>> But this approach of using three separate tlb_gen values seems to
>> cover all the bases, and I don't think it's *that* bad.
>
> Sure.
>
> As I said in IRC, let's document that complexity then so that when we
> stumble over it in the future, we at least know why it was done this
> way.

I've given it a try. Hopefully v4 is more clear.
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Thu, Jun 22, 2017 at 11:08:38AM -0700, Andy Lutomirski wrote:
> Yes, I agree it's confusing. There really are three numbers. Those
> numbers are: the latest generation, the generation that this CPU has
> caught up to, and the generation that the requester of the flush we're
> currently handling has asked us to catch up to. I don't see a way to
> reduce the complexity.

Yeah, can you pls put that clarification of what is what over it. It
explains nicely what the check is supposed to do.

> >> The flush IPI hits after a switch_mm_irqs_off() call notices the
> >> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
> >> increment the local tlb_gen to 2, and the IPI handler for the partial
> >> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
> >> == 2 and mm_tlb_gen == 3) and do a partial flush.
> >
> > Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
> >
> > That's why you have this thing in addition to the tlb_gen.
>
> Yes. The idea is that we only do remote partial flushes when it's
> 100% obvious that it's safe.

So why wouldn't my simplified suggestion work then?

	if (f->end != TLB_FLUSH_ALL &&
	    mm_tlb_gen == local_tlb_gen + 1)

1->2 is a partial flush - gets promoted to a full one
2->3 is a full flush - it will get executed as one due to the f->end
setting to TLB_FLUSH_ALL.

> It could be converted to two full flushes or to just one, I think,
> depending on what order everything happens in.

Right. One flush at the right time would be optimal.

> But this approach of using three separate tlb_gen values seems to
> cover all the bases, and I don't think it's *that* bad.

Sure.

As I said in IRC, let's document that complexity then so that when we
stumble over it in the future, we at least know why it was done this
way.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Thu, Jun 22, 2017 at 10:22 AM, Borislav Petkov wrote:
> On Thu, Jun 22, 2017 at 08:55:36AM -0700, Andy Lutomirski wrote:
>> > Ah, simple: we control the flushing with info.new_tlb_gen and
>> > mm->context.tlb_gen. I.e., this check:
>> >
>> >         if (f->end != TLB_FLUSH_ALL &&
>> >             f->new_tlb_gen == local_tlb_gen + 1 &&
>> >             f->new_tlb_gen == mm_tlb_gen) {
>> >
>> > why can't we write:
>> >
>> >         if (f->end != TLB_FLUSH_ALL &&
>> >             mm_tlb_gen == local_tlb_gen + 1)
>> >
>> > ?
>>
>> Ah, I thought you were asking about why I needed mm_tlb_gen ==
>> local_tlb_gen + 1. This is just an optimization, or at least I hope
>> it is. The idea is that, if we know that another flush is coming, it
>> seems likely that it would be faster to do a full flush and increase
>> local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
>> flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
>> needing to flush again very soon.
>
> Thus the f->new_tlb_gen check whether it is local_tlb_gen + 1.
>
> Btw, do you see how confusing this check is: you have new_tlb_gen from
> a variable passed from the function call IPI, local_tlb_gen which is
> the CPU's own and then there's also mm_tlb_gen which we've written into
> new_tlb_gen from:
>
>         info.new_tlb_gen = bump_mm_tlb_gen(mm);
>
> which incremented mm_tlb_gen too.

Yes, I agree it's confusing. There really are three numbers. Those
numbers are: the latest generation, the generation that this CPU has
caught up to, and the generation that the requester of the flush we're
currently handling has asked us to catch up to. I don't see a way to
reduce the complexity.

>> Hmm. I'd be nervous that there are more subtle races if we do this.
>> For example, suppose that a partial flush increments tlb_gen from 1 to
>> 2 and a full flush increments tlb_gen from 2 to 3. Meanwhile, the CPU
>> is busy switching back and forth between mms, so the partial flush
>> sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
>> set in mm_cpumask.
>
> Lemme see if I understand this correctly: you mean, the full flush will
> exit early due to the
>
>         if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {
>
> test?

Yes, or at least that's what I'm imagining.

>> The flush IPI hits after a switch_mm_irqs_off() call notices the
>> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
>> increment the local tlb_gen to 2, and the IPI handler for the partial
>> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
>> == 2 and mm_tlb_gen == 3) and do a partial flush.
>
> Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.
>
> That's why you have this thing in addition to the tlb_gen.

Yes. The idea is that we only do remote partial flushes when it's
100% obvious that it's safe.

> What we end up doing in this case, is promote the partial flush to a
> full one and thus have a partial and a full flush which are close by
> converted to two full flushes.

It could be converted to two full flushes or to just one, I think,
depending on what order everything happens in.

>> The problem here is that it's not obvious to me that this actually
>> ends up flushing everything that's needed. Maybe all the memory
>> ordering gets this right, but I can imagine scenarios in which
>> switch_mm_irqs_off() does its flush early enough that the TLB picks up
>> an entry that was supposed to get zapped by the full flush.
>
> See above.
>
> And I don't think that having two full flushes back-to-back is going to
> cost a lot as the second one won't flush a whole lot.

From limited benchmarking on new Intel chips, a full flush is very
expensive no matter what. I think this is silly because I suspect
that the PCID circuitry could internally simulate a full flush at very
little cost, but it seems that it doesn't. I haven't tried to
benchmark INVLPG.

>> IOW it *might* be valid, but I think it would need very careful review
>> and documentation.
>
> Always.
>
> Btw, I get the sense this TLB flush avoiding scheme becomes pretty
> complex for diminishing reasons.
>
> [ Or maybe I'm not seeing them - I'm always open to corrections. ]
>
> Especially if intermediary levels from the pagetable walker are cached
> and reestablishing the TLB entries seldom means a full walk.
>
> You should do a full fledged benchmark to see whether this whole
> complexity is even worth it, methinks.

I agree that, by itself, this patch is way too complex to be
justifiable. The thing is that, after quite a few false starts, I
couldn't find a clean way to get PCID and improved laziness without
this thing as a prerequisite. Both of those features depend on having
a heuristic for when a flush can be avoided, and that heuristic must
*never* say that a flush can be skipped when it can't be skipped.
This patch gives a way to do that.

I tried a few other approaches:

 - Keeping a cpumask of which
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Thu, Jun 22, 2017 at 08:55:36AM -0700, Andy Lutomirski wrote:
> > Ah, simple: we control the flushing with info.new_tlb_gen and
> > mm->context.tlb_gen. I.e., this check:
> >
> >         if (f->end != TLB_FLUSH_ALL &&
> >             f->new_tlb_gen == local_tlb_gen + 1 &&
> >             f->new_tlb_gen == mm_tlb_gen) {
> >
> > why can't we write:
> >
> >         if (f->end != TLB_FLUSH_ALL &&
> >             mm_tlb_gen == local_tlb_gen + 1)
> >
> > ?
>
> Ah, I thought you were asking about why I needed mm_tlb_gen ==
> local_tlb_gen + 1. This is just an optimization, or at least I hope
> it is. The idea is that, if we know that another flush is coming, it
> seems likely that it would be faster to do a full flush and increase
> local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
> flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
> needing to flush again very soon.

Thus the f->new_tlb_gen check whether it is local_tlb_gen + 1.

Btw, do you see how confusing this check is: you have new_tlb_gen from
a variable passed from the function call IPI, local_tlb_gen which is
the CPU's own and then there's also mm_tlb_gen which we've written into
new_tlb_gen from:

	info.new_tlb_gen = bump_mm_tlb_gen(mm);

which incremented mm_tlb_gen too.

> Hmm. I'd be nervous that there are more subtle races if we do this.
> For example, suppose that a partial flush increments tlb_gen from 1 to
> 2 and a full flush increments tlb_gen from 2 to 3. Meanwhile, the CPU
> is busy switching back and forth between mms, so the partial flush
> sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
> set in mm_cpumask.

Lemme see if I understand this correctly: you mean, the full flush will
exit early due to the

	if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))) {

test?

> The flush IPI hits after a switch_mm_irqs_off() call notices the
> change from 1 to 2. switch_mm_irqs_off() will do a full flush and
> increment the local tlb_gen to 2, and the IPI handler for the partial
> flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
> == 2 and mm_tlb_gen == 3) and do a partial flush.

Why, the 2->3 flush has f->end == TLB_FLUSH_ALL.

That's why you have this thing in addition to the tlb_gen.

What we end up doing in this case, is promote the partial flush to a
full one and thus have a partial and a full flush which are close by
converted to two full flushes.

> The problem here is that it's not obvious to me that this actually
> ends up flushing everything that's needed. Maybe all the memory
> ordering gets this right, but I can imagine scenarios in which
> switch_mm_irqs_off() does its flush early enough that the TLB picks up
> an entry that was supposed to get zapped by the full flush.

See above.

And I don't think that having two full flushes back-to-back is going to
cost a lot as the second one won't flush a whole lot.

> IOW it *might* be valid, but I think it would need very careful review
> and documentation.

Always.

Btw, I get the sense this TLB flush avoiding scheme becomes pretty
complex for diminishing reasons.

[ Or maybe I'm not seeing them - I'm always open to corrections. ]

Especially if intermediary levels from the pagetable walker are cached
and reestablishing the TLB entries seldom means a full walk.

You should do a full fledged benchmark to see whether this whole
complexity is even worth it, methinks.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Thu, Jun 22, 2017 at 7:59 AM, Borislav Petkov wrote:
> On Thu, Jun 22, 2017 at 07:48:21AM -0700, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov wrote:
>> > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
>> >> > I'm certainly still missing something here:
>> >> >
>> >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
>> >> > do once
>> >> >
>> >> >         bump_mm_tlb_gen(mm);
>> >> >
>> >> > and once
>> >> >
>> >> >         info.new_tlb_gen = bump_mm_tlb_gen(mm);
>> >> >
>> >> > and in both cases, the bumping is done on mm->context.tlb_gen.
>> >> >
>> >> > So why isn't that enough to do the flushing and we have to consult
>> >> > info.new_tlb_gen too?
>> >>
>> >> The issue is a possible race. Suppose we start at tlb_gen == 1 and
>> >> then two concurrent flushes happen. The first flush is a full flush
>> >> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen
>> >> to 3. If the second flush gets propagated to a given CPU first and it
>> >
>> > Maybe I'm still missing something, which is likely...
>> >
>> > but if the second flush gets propagated to the CPU first, the CPU will
>> > have local tlb_gen 1 and thus enforce a full flush anyway because we
>> > will go 1 -> 3 on that particular CPU. Or?
>>
>> Yes, exactly. Which means I'm probably just misunderstanding your
>> original question. Can you re-ask it?
>
> Ah, simple: we control the flushing with info.new_tlb_gen and
> mm->context.tlb_gen. I.e., this check:
>
>         if (f->end != TLB_FLUSH_ALL &&
>             f->new_tlb_gen == local_tlb_gen + 1 &&
>             f->new_tlb_gen == mm_tlb_gen) {
>
> why can't we write:
>
>         if (f->end != TLB_FLUSH_ALL &&
>             mm_tlb_gen == local_tlb_gen + 1)
>
> ?

Ah, I thought you were asking about why I needed mm_tlb_gen ==
local_tlb_gen + 1. This is just an optimization, or at least I hope
it is. The idea is that, if we know that another flush is coming, it
seems likely that it would be faster to do a full flush and increase
local_tlb_gen all the way to mm_tlb_gen rather than doing a partial
flush, increasing local_tlb_gen to something less than mm_tlb_gen, and
needing to flush again very soon.

> If mm_tlb_gen is + 2, then we'll do a full flush, if it is + 1, then
> partial.
>
> If the second flush, as you say, is a partial one and still gets
> propagated first, the check will force a full flush anyway.
>
> When the first flush propagates after the second, we'll ignore it
> because local_tlb_gen has advanced already due to the second flush.
>
> As a matter of fact, we could simplify the logic: if local_tlb_gen is
> only mm_tlb_gen - 1, then do the requested flush type.

Hmm. I'd be nervous that there are more subtle races if we do this.
For example, suppose that a partial flush increments tlb_gen from 1 to
2 and a full flush increments tlb_gen from 2 to 3. Meanwhile, the CPU
is busy switching back and forth between mms, so the partial flush
sees the cpu set in mm_cpumask but the full flush doesn't see the cpu
set in mm_cpumask.

The flush IPI hits after a switch_mm_irqs_off() call notices the
change from 1 to 2. switch_mm_irqs_off() will do a full flush and
increment the local tlb_gen to 2, and the IPI handler for the partial
flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen
== 2 and mm_tlb_gen == 3) and do a partial flush.

The problem here is that it's not obvious to me that this actually
ends up flushing everything that's needed. Maybe all the memory
ordering gets this right, but I can imagine scenarios in which
switch_mm_irqs_off() does its flush early enough that the TLB picks up
an entry that was supposed to get zapped by the full flush.

IOW it *might* be valid, but I think it would need very careful review
and documentation.

--Andy
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Thu, Jun 22, 2017 at 7:59 AM, Borislav Petkov wrote: > On Thu, Jun 22, 2017 at 07:48:21AM -0700, Andy Lutomirski wrote: >> On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov wrote: >> > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote: >> >> > I'm certainly still missing something here: >> >> > >> >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we >> >> > do once >> >> > >> >> > bump_mm_tlb_gen(mm); >> >> > >> >> > and once >> >> > >> >> > info.new_tlb_gen = bump_mm_tlb_gen(mm); >> >> > >> >> > and in both cases, the bumping is done on mm->context.tlb_gen. >> >> > >> >> > So why isn't that enough to do the flushing and we have to consult >> >> > info.new_tlb_gen too? >> >> >> >> The issue is a possible race. Suppose we start at tlb_gen == 1 and >> >> then two concurrent flushes happen. The first flush is a full flush >> >> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen >> >> to 3. If the second flush gets propagated to a given CPU first and it >> > >> > Maybe I'm still missing something, which is likely... >> > >> > but if the second flush gets propagated to the CPU first, the CPU will >> > have local tlb_gen 1 and thus enforce a full flush anyway because we >> > will go 1 -> 3 on that particular CPU. Or? >> > >> >> Yes, exactly. Which means I'm probably just misunderstanding your >> original question. Can you re-ask it? > > Ah, simple: we control the flushing with info.new_tlb_gen and > mm->context.tlb_gen. I.e., this check: > > > if (f->end != TLB_FLUSH_ALL && > f->new_tlb_gen == local_tlb_gen + 1 && > f->new_tlb_gen == mm_tlb_gen) { > > why can't we write: > > if (f->end != TLB_FLUSH_ALL && > mm_tlb_gen == local_tlb_gen + 1) > > ? Ah, I thought you were asking about why I needed mm_tlb_gen == local_tlb_gen + 1. This is just an optimization, or at least I hope it is. 
The idea is that, if we know that another flush is coming, it seems likely that it would be faster to do a full flush and increase local_tlb_gen all the way to mm_tlb_gen rather than doing a partial flush, increasing local_tlb_gen to something less than mm_tlb_gen, and needing to flush again very soon.

> If mm_tlb_gen is + 2, then we'll do a full flush, if it is + 1, then
> partial.
>
> If the second flush, as you say, is a partial one and still gets
> propagated first, the check will force a full flush anyway.
>
> When the first flush propagates after the second, we'll ignore it
> because local_tlb_gen has advanced already due to the second flush.
>
> As a matter of fact, we could simplify the logic: if local_tlb_gen is
> only mm_tlb_gen - 1, then do the requested flush type.

Hmm. I'd be nervous that there are more subtle races if we do this. For example, suppose that a partial flush increments tlb_gen from 1 to 2 and a full flush increments tlb_gen from 2 to 3. Meanwhile, the CPU is busy switching back and forth between mms, so the partial flush sees the cpu set in mm_cpumask but the full flush doesn't see the cpu set in mm_cpumask.

The flush IPI hits after a switch_mm_irqs_off() call notices the change from 1 to 2. switch_mm_irqs_off() will do a full flush and increment the local tlb_gen to 2, and the IPI handler for the partial flush will see local_tlb_gen == mm_tlb_gen - 1 (because local_tlb_gen == 2 and mm_tlb_gen == 3) and do a partial flush.

The problem here is that it's not obvious to me that this actually ends up flushing everything that's needed. Maybe all the memory ordering gets this right, but I can imagine scenarios in which switch_mm_irqs_off() does its flush early enough that the TLB picks up an entry that was supposed to get zapped by the full flush.

IOW it *might* be valid, but I think it would need very careful review and documentation.

--Andy
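The three generations Andy describes can be modeled in a few lines of user-space C. This is only an illustrative sketch, not kernel code: decide_flush() and the enum are invented names, but the condition mirrors the three-way check quoted from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum flush_kind { FLUSH_NONE, FLUSH_PARTIAL, FLUSH_FULL };

/*
 * Toy model of the check in flush_tlb_func_common().  Three generations
 * are involved: mm_tlb_gen (the latest), local_tlb_gen (what this CPU
 * has caught up to), and f_new_tlb_gen (what the flush requester asked
 * this CPU to catch up to).
 */
static enum flush_kind decide_flush(bool range_flush, uint64_t f_new_tlb_gen,
                                    uint64_t local_tlb_gen, uint64_t mm_tlb_gen)
{
        if (local_tlb_gen == mm_tlb_gen)
                return FLUSH_NONE;      /* already up to date */

        /* INVLPG only if we advance local_tlb_gen by exactly one step */
        if (range_flush &&
            f_new_tlb_gen == local_tlb_gen + 1 &&
            f_new_tlb_gen == mm_tlb_gen)
                return FLUSH_PARTIAL;

        return FLUSH_FULL;
}
```

In the race above — local_tlb_gen already bumped to 2 by switch_mm_irqs_off() while mm_tlb_gen is 3 — a stale partial-flush IPI carrying new_tlb_gen == 2 falls through to a full flush, whereas the simplified `mm_tlb_gen == local_tlb_gen + 1` test on its own would permit a partial one.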
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Thu, Jun 22, 2017 at 07:48:21AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov wrote:
> > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote:
> >> > I'm certainly still missing something here:
> >> >
> >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> >> > do once
> >> >
> >> >	bump_mm_tlb_gen(mm);
> >> >
> >> > and once
> >> >
> >> >	info.new_tlb_gen = bump_mm_tlb_gen(mm);
> >> >
> >> > and in both cases, the bumping is done on mm->context.tlb_gen.
> >> >
> >> > So why isn't that enough to do the flushing and we have to consult
> >> > info.new_tlb_gen too?
> >>
> >> The issue is a possible race. Suppose we start at tlb_gen == 1 and
> >> then two concurrent flushes happen. The first flush is a full flush
> >> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen
> >> to 3. If the second flush gets propagated to a given CPU first and it
> >
> > Maybe I'm still missing something, which is likely...
> >
> > but if the second flush gets propagated to the CPU first, the CPU will
> > have local tlb_gen 1 and thus enforce a full flush anyway because we
> > will go 1 -> 3 on that particular CPU. Or?
>
> Yes, exactly. Which means I'm probably just misunderstanding your
> original question. Can you re-ask it?

Ah, simple: we control the flushing with info.new_tlb_gen and mm->context.tlb_gen. I.e., this check:

	if (f->end != TLB_FLUSH_ALL &&
	    f->new_tlb_gen == local_tlb_gen + 1 &&
	    f->new_tlb_gen == mm_tlb_gen) {

why can't we write:

	if (f->end != TLB_FLUSH_ALL &&
	    mm_tlb_gen == local_tlb_gen + 1)

?

If mm_tlb_gen is + 2, then we'll do a full flush, if it is + 1, then partial.

If the second flush, as you say, is a partial one and still gets propagated first, the check will force a full flush anyway.

When the first flush propagates after the second, we'll ignore it because local_tlb_gen has advanced already due to the second flush.

As a matter of fact, we could simplify the logic: if local_tlb_gen is only mm_tlb_gen - 1, then do the requested flush type. If mm_tlb_gen has advanced more than 1 generation, just do a full flush unconditionally.

... and I think we do something like that already but I think the logic could be simplified, unless I'm missing something, that is.

Thanks.

--
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Thu, Jun 22, 2017 at 12:24 AM, Borislav Petkov wrote: > On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote: >> > I'm certainly still missing something here: >> > >> > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we >> > do once >> > >> > bump_mm_tlb_gen(mm); >> > >> > and once >> > >> > info.new_tlb_gen = bump_mm_tlb_gen(mm); >> > >> > and in both cases, the bumping is done on mm->context.tlb_gen. >> > >> > So why isn't that enough to do the flushing and we have to consult >> > info.new_tlb_gen too? >> >> The issue is a possible race. Suppose we start at tlb_gen == 1 and >> then two concurrent flushes happen. The first flush is a full flush >> and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen >> to 3. If the second flush gets propagated to a given CPU first and it > > Maybe I'm still missing something, which is likely... > > but if the second flush gets propagated to the CPU first, the CPU will > have local tlb_gen 1 and thus enforce a full flush anyway because we > will go 1 -> 3 on that particular CPU. Or? > Yes, exactly. Which means I'm probably just misunderstanding your original question. Can you re-ask it? --Andy
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Wed, Jun 21, 2017 at 07:46:05PM -0700, Andy Lutomirski wrote: > > I'm certainly still missing something here: > > > > We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we > > do once > > > > bump_mm_tlb_gen(mm); > > > > and once > > > > info.new_tlb_gen = bump_mm_tlb_gen(mm); > > > > and in both cases, the bumping is done on mm->context.tlb_gen. > > > > So why isn't that enough to do the flushing and we have to consult > > info.new_tlb_gen too? > > The issue is a possible race. Suppose we start at tlb_gen == 1 and > then two concurrent flushes happen. The first flush is a full flush > and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen > to 3. If the second flush gets propagated to a given CPU first and it Maybe I'm still missing something, which is likely... but if the second flush gets propagated to the CPU first, the CPU will have local tlb_gen 1 and thus enforce a full flush anyway because we will go 1 -> 3 on that particular CPU. Or? -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Wed, Jun 21, 2017 at 11:44 AM, Borislav Petkov wrote:
> On Tue, Jun 20, 2017 at 10:22:11PM -0700, Andy Lutomirski wrote:
>> +	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
>> +	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
>> +		       atomic64_read(&next->context.tlb_gen));
>
> Just let it stick out:
>
> 	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> 	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> 		       atomic64_read(&next->context.tlb_gen));
>
> Should be a bit better readable this way.

Done

>> +	if (local_tlb_gen == mm_tlb_gen) {
>
> 	if (unlikely(...
>
> maybe?
>
> Sounds to me like the concurrent flushes case would be the
> uncommon one...

Agreed.

>> +
>> +	WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
>> +	WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
>> +
>> +	/*
>> +	 * If we get to this point, we know that our TLB is out of date.
>> +	 * This does not strictly imply that we need to flush (it's
>> +	 * possible that f->new_tlb_gen <= local_tlb_gen), but we're
>> +	 * going to need to flush in the very near future, so we might
>> +	 * as well get it over with.
>> +	 *
>> +	 * The only question is whether to do a full or partial flush.
>> +	 *
>> +	 * A partial TLB flush is safe and worthwhile if two conditions are
>> +	 * met:
>> +	 *
>> +	 * 1. We wouldn't be skipping a tlb_gen. If the requester bumped
>> +	 *    the mm's tlb_gen from p to p+1, a partial flush is only correct
>> +	 *    if we would be bumping the local CPU's tlb_gen from p to p+1 as
>> +	 *    well.
>> +	 *
>> +	 * 2. If there are no more flushes on their way. Partial TLB
>> +	 *    flushes are not all that much cheaper than full TLB
>> +	 *    flushes, so it seems unlikely that it would be a
>> +	 *    performance win to do a partial flush if that won't bring
>> +	 *    our TLB fully up to date.
>> +	 */
>> +	if (f->end != TLB_FLUSH_ALL &&
>> +	    f->new_tlb_gen == local_tlb_gen + 1 &&
>> +	    f->new_tlb_gen == mm_tlb_gen) {
>
> I'm certainly still missing something here:
>
> We have f->new_tlb_gen and mm_tlb_gen to control the flushing, i.e., we
> do once
>
> 	bump_mm_tlb_gen(mm);
>
> and once
>
> 	info.new_tlb_gen = bump_mm_tlb_gen(mm);
>
> and in both cases, the bumping is done on mm->context.tlb_gen.
>
> So why isn't that enough to do the flushing and we have to consult
> info.new_tlb_gen too?

The issue is a possible race. Suppose we start at tlb_gen == 1 and then two concurrent flushes happen. The first flush is a full flush and sets tlb_gen to 2. The second is a partial flush and sets tlb_gen to 3. If the second flush gets propagated to a given CPU first and it were to do an actual partial flush (INVLPG) and set the percpu tlb_gen to 3, then the first flush won't do anything and we'll fail to flush all the pages we need to flush.

My solution was to say that we're only allowed to do INVLPG if we're making exactly the same change to the local tlb_gen that the requester made to context.tlb_gen. I'll add a comment to this effect.

>
>> +		/* Partial flush */
>> 		unsigned long addr;
>> 		unsigned long nr_pages = (f->end - f->start) >> PAGE_SHIFT;
>
> < newline here.

Yup.
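The "fail to flush" race Andy describes can be replayed with a toy user-space model. This is a sketch under invented names (toy_cpu, handle_flush), not the kernel implementation; the non-strict variant — which blindly trusts f->new_tlb_gen and does the requested INVLPG — is hypothetical and included only to show the bug the strict rule prevents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NPAGES 8

struct toy_cpu {
        bool stale[NPAGES];     /* stale[i]: page i still has an old TLB entry */
        uint64_t local_tlb_gen;
};

struct toy_flush {
        bool partial;           /* models f->end != TLB_FLUSH_ALL */
        int start, end;         /* page range for a partial flush */
        uint64_t new_tlb_gen;   /* generation the requester bumped to */
};

static void flush_range(struct toy_cpu *cpu, int start, int end)
{
        for (int i = start; i < end; i++)
                cpu->stale[i] = false;
}

/* strict: INVLPG only when advancing local_tlb_gen by exactly one step */
static void handle_flush(struct toy_cpu *cpu, const struct toy_flush *f,
                         uint64_t mm_tlb_gen, bool strict)
{
        if (cpu->local_tlb_gen == mm_tlb_gen)
                return;         /* already caught up, nothing to do */

        if (f->partial &&
            (!strict || (f->new_tlb_gen == cpu->local_tlb_gen + 1 &&
                         f->new_tlb_gen == mm_tlb_gen))) {
                flush_range(cpu, f->start, f->end);
                if (f->new_tlb_gen > cpu->local_tlb_gen)
                        cpu->local_tlb_gen = f->new_tlb_gen;
        } else {
                flush_range(cpu, 0, NPAGES);    /* full flush catches us up */
                cpu->local_tlb_gen = mm_tlb_gen;
        }
}

/* Replay Andy's race: full flush bumps gen to 2, partial to 3, and the
 * gen-3 partial IPI arrives first.  Returns true if stale entries survive. */
static bool stale_after_race(bool strict)
{
        struct toy_cpu cpu = { .local_tlb_gen = 1 };
        struct toy_flush full = { .partial = false, .new_tlb_gen = 2 };
        struct toy_flush part = { .partial = true, .start = 5, .end = 6,
                                  .new_tlb_gen = 3 };

        memset(cpu.stale, true, sizeof(cpu.stale));

        handle_flush(&cpu, &part, 3, strict);   /* partial delivered first */
        handle_flush(&cpu, &full, 3, strict);   /* full delivered second */

        for (int i = 0; i < NPAGES; i++)
                if (cpu.stale[i])
                        return true;
        return false;
}
```

With the naive rule, the early gen-3 partial flush sets local_tlb_gen to 3, so the later gen-2 full flush is skipped and stale entries survive; the strict rule promotes the out-of-step partial request to a full flush instead.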
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Tue, Jun 20, 2017 at 10:22:11PM -0700, Andy Lutomirski wrote:
> There are two kernel features that would benefit from tracking
> how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
> it up to date in real time:
>
> - Lazy mm switching currently works by switching to init_mm when
>   it would otherwise flush. This is wasteful: there isn't fundamentally
>   any need to update CR3 at all when going lazy or when returning from
>   lazy mode, nor is there any need to receive flush IPIs at all. Instead,
>   we should just stop trying to keep the TLB coherent when we go lazy and,
>   when unlazying, check whether we missed any flushes.
>
> - PCID will let us keep recent user contexts alive in the TLB. If we
>   start doing this, we need a way to decide whether those contexts are
>   up to date.
>
> On some paravirt systems, remote TLBs can be flushed without IPIs.
> This won't update the target CPUs' tlb_gens, which may cause
> unnecessary local flushes later on. We can address this if it becomes
> a problem by carefully updating the target CPU's tlb_gen directly.
>
> By itself, this patch is a very minor optimization that avoids
> unnecessary flushes when multiple TLB flushes targeting the same CPU
> race.
>
> Signed-off-by: Andy Lutomirski
> ---
> arch/x86/include/asm/tlbflush.h | 37 +++
> arch/x86/mm/tlb.c | 79 +
> 2 files changed, 109 insertions(+), 7 deletions(-)

...
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 6d9d37323a43..9f5ef7a5e74a 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> 	}
>
> 	this_cpu_write(cpu_tlbstate.loaded_mm, next);
> +	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
> +	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
> +		       atomic64_read(&next->context.tlb_gen));

Just let it stick out:

	this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
	this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
		       atomic64_read(&next->context.tlb_gen));

Should be a bit better readable this way.

>
> 	WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
> 	cpumask_set_cpu(cpu, mm_cpumask(next));
> @@ -194,20 +197,73 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> static void flush_tlb_func_common(const struct flush_tlb_info *f,
> 				  bool local, enum tlb_flush_reason reason)
> {
> +	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
> +
> +	/*
> +	 * Our memory ordering requirement is that any TLB fills that
> +	 * happen after we flush the TLB are ordered after we read
> +	 * active_mm's tlb_gen. We don't need any explicit barrier
> +	 * because all x86 flush operations are serializing and the
> +	 * atomic64_read operation won't be reordered by the compiler.
> +	 */
> +	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
> +	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
> +
> 	/* This code cannot presently handle being reentered. */
> 	VM_WARN_ON(!irqs_disabled());
>
> +	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
> +		   loaded_mm->context.ctx_id);
> +
> 	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
> +		/*
> +		 * leave_mm() is adequate to handle any type of flush, and
> +		 * we would prefer not to receive further IPIs.
> +		 */
> 		leave_mm(smp_processor_id());
> 		return;
> 	}
>
> -	if (f->end == TLB_FLUSH_ALL) {
> -		local_flush_tlb();
> -		if (local)
> -			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> -		trace_tlb_flush(reason, TLB_FLUSH_ALL);
> -	} else {
> +	if (local_tlb_gen == mm_tlb_gen) {

	if (unlikely(...

maybe?

Sounds to me like the concurrent flushes case would be the uncommon one...

> +		/*
> +		 * There's nothing to do: we're already up to date. This can
> +		 * happen if two concurrent flushes happen -- the first IPI to
> +		 * be handled can catch us all the way up, leaving no work for
> +		 * the second IPI to be handled.
> +		 */
> +		return;
> +	}
> +
> +	WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
> +	WARN_ON_ONCE(f->new_tlb_gen > mm_tlb_gen);
> +
> +	/*
> +	 * If we get to this point, we know that our TLB is out of date.
> +	 * This does not strictly imply that we need to flush (it's
> +	 * possible that f->new_tlb_gen <= local_tlb_gen), but we're
> +	 * going to need to flush in the very near future, so we might
> +	 * as well get it over with.
> +	 *
> +	 * The only question is whether to do a full
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Wed, Jun 21, 2017 at 1:32 AM, Thomas Gleixner wrote:
> On Tue, 20 Jun 2017, Andy Lutomirski wrote:
>> struct flush_tlb_info {
>> +	/*
>> +	 * We support several kinds of flushes.
>> +	 *
>> +	 * - Fully flush a single mm. flush_mm will be set, flush_end will be
>
> flush_mm is the *mm member in the struct, right? You might rename that as a
> preparatory step so comments and implementation match.

The comment is outdated. Fixed now.

>
>> +	 *   TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
>> +	 *   IPI sender is trying to catch us up.
>> +	 *
>> +	 * - Partially flush a single mm. flush_mm will be set, flush_start
>> +	 *   and flush_end will indicate the range, and new_tlb_gen will be
>> +	 *   set such that the changes between generation new_tlb_gen-1 and
>> +	 *   new_tlb_gen are entirely contained in the indicated range.
>> +	 *
>> +	 * - Fully flush all mms whose tlb_gens have been updated. flush_mm
>> +	 *   will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
>> +	 *   will be zero.
>> +	 */
>> 	struct mm_struct *mm;
>> 	unsigned long start;
>> 	unsigned long end;
>> +	u64 new_tlb_gen;
>
> Nit. While at it could you please make that struct tabular aligned as we
> usually do in x86?

Sure.

>
>> static void flush_tlb_func_common(const struct flush_tlb_info *f,
>> 				  bool local, enum tlb_flush_reason reason)
>> {
>> +	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
>> +
>> +	/*
>> +	 * Our memory ordering requirement is that any TLB fills that
>> +	 * happen after we flush the TLB are ordered after we read
>> +	 * active_mm's tlb_gen. We don't need any explicit barrier
>> +	 * because all x86 flush operations are serializing and the
>> +	 * atomic64_read operation won't be reordered by the compiler.
>> +	 */
>
> Can you please move the comment above the loaded_mm assignment?

I'll move it above the function entirely. It's more of a general comment about how the function works than any particular part of the function.

>
>> +	u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen);
>> +	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen);
>> +
>> 	/* This code cannot presently handle being reentered. */
>> 	VM_WARN_ON(!irqs_disabled());
>>
>> +	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) !=
>> +		   loaded_mm->context.ctx_id);
>> +
>> 	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
>> +		/*
>> +		 * leave_mm() is adequate to handle any type of flush, and
>> +		 * we would prefer not to receive further IPIs.
>
> While I know what you mean, it might be useful to have a more elaborate
> explanation why this prevents new IPIs.

Added, although it just gets deleted again later in the series.

>
>> +		 */
>> 		leave_mm(smp_processor_id());
>> 		return;
>> 	}
>>
>> -	if (f->end == TLB_FLUSH_ALL) {
>> -		local_flush_tlb();
>> -		if (local)
>> -			count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
>> -		trace_tlb_flush(reason, TLB_FLUSH_ALL);
>> -	} else {
>> +	if (local_tlb_gen == mm_tlb_gen) {
>> +		/*
>> +		 * There's nothing to do: we're already up to date. This can
>> +		 * happen if two concurrent flushes happen -- the first IPI to
>> +		 * be handled can catch us all the way up, leaving no work for
>> +		 * the second IPI to be handled.
>
> That's not restricted to IPIs, right? A local flush / IPI combo can do that
> as well.

Indeed. Comment fixed.

>
> Other than those nits;
>
> Reviewed-by: Thomas Gleixner
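The three flush_tlb_info configurations described in the quoted comment can be captured in a small user-space sketch. The struct here is a trimmed mirror of the one under discussion, and classify() is an invented helper shown purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TLB_FLUSH_ALL ((unsigned long)-1)

struct mm_struct { int unused; };       /* stand-in for the kernel type */

/* trimmed mirror of the struct quoted in the review above */
struct flush_tlb_info {
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;
        uint64_t new_tlb_gen;
};

enum flush_req { REQ_FULL_MM, REQ_PARTIAL_MM, REQ_ALL_MMS, REQ_INVALID };

/*
 * Classify a request per the comment: full flush of one mm, partial
 * flush of one mm, or full flush of all mms with updated tlb_gens
 * (mm == NULL, end == TLB_FLUSH_ALL, new_tlb_gen == 0).
 */
static enum flush_req classify(const struct flush_tlb_info *f)
{
        if (!f->mm)
                return (f->end == TLB_FLUSH_ALL && f->new_tlb_gen == 0) ?
                        REQ_ALL_MMS : REQ_INVALID;
        if (f->end == TLB_FLUSH_ALL)
                return REQ_FULL_MM;
        return REQ_PARTIAL_MM;          /* start..end carry the range */
}
```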
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Wed, Jun 21, 2017 at 1:32 AM, Thomas Gleixner wrote: > On Tue, 20 Jun 2017, Andy Lutomirski wrote: >> struct flush_tlb_info { >> + /* >> + * We support several kinds of flushes. >> + * >> + * - Fully flush a single mm. flush_mm will be set, flush_end will be > > flush_mm is the *mm member in the struct, right? You might rename that as a > preparatory step so comments and implementation match. The comment is outdated. Fixed now. > >> + * TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the >> + * IPI sender is trying to catch us up. >> + * >> + * - Partially flush a single mm. flush_mm will be set, flush_start >> + * and flush_end will indicate the range, and new_tlb_gen will be >> + * set such that the changes between generation new_tlb_gen-1 and >> + * new_tlb_gen are entirely contained in the indicated range. >> + * >> + * - Fully flush all mms whose tlb_gens have been updated. flush_mm >> + * will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen >> + * will be zero. >> + */ >> struct mm_struct *mm; >> unsigned long start; >> unsigned long end; >> + u64 new_tlb_gen; > > Nit. While at it could you please make that struct tabular aligned as we > usually do in x86? Sure. > >> static void flush_tlb_func_common(const struct flush_tlb_info *f, >> bool local, enum tlb_flush_reason reason) >> { >> + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm); >> + >> + /* >> + * Our memory ordering requirement is that any TLB fills that >> + * happen after we flush the TLB are ordered after we read >> + * active_mm's tlb_gen. We don't need any explicit barrier >> + * because all x86 flush operations are serializing and the >> + * atomic64_read operation won't be reordered by the compiler. >> + */ > > Can you please move the comment above the loaded_mm assignment? I'll move it above the function entirely. It's more of a general comment about how the function works than any particular part of the function. 
> >> + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen); >> + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen); >> + >> /* This code cannot presently handle being reentered. */ >> VM_WARN_ON(!irqs_disabled()); >> >> + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) != >> +loaded_mm->context.ctx_id); >> + >> if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) { >> + /* >> + * leave_mm() is adequate to handle any type of flush, and >> + * we would prefer not to receive further IPIs. > > While I know what you mean, it might be useful to have a more elaborate > explanation why this prevents new IPIs. Added, although it just gets deleted again later in the series. > >> + */ >> leave_mm(smp_processor_id()); >> return; >> } >> >> - if (f->end == TLB_FLUSH_ALL) { >> - local_flush_tlb(); >> - if (local) >> - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); >> - trace_tlb_flush(reason, TLB_FLUSH_ALL); >> - } else { >> + if (local_tlb_gen == mm_tlb_gen) { >> + /* >> + * There's nothing to do: we're already up to date. This can >> + * happen if two concurrent flushes happen -- the first IPI to >> + * be handled can catch us all the way up, leaving no work for >> + * the second IPI to be handled. > > That's not restricted to IPIs, right? A local flush / IPI combo can do that > as well. Indeed. Comment fixed. > > Other than those nits; > > Reviewed-by: Thomas Gleixner
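[Editor's note: the three-generation check debated above can be modeled in plain userspace C. This is a hypothetical sketch, not the kernel code: `decide_flush()`, `flush_kind`, and the parameter names are made up for illustration, but the comparisons mirror the logic under discussion — skip the flush when this CPU is already caught up, do the remote partial flush only when it is 100% obvious it is safe (the requested generation is exactly one past the local one and is also the latest), and otherwise promote to a full flush.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the check discussed above -- NOT the kernel code.
 * There really are three numbers: mm_tlb_gen is the latest generation,
 * local_tlb_gen is the generation this CPU has caught up to, and
 * new_tlb_gen is the generation the requester of this flush has asked
 * us to catch up to.
 */
enum flush_kind { FLUSH_NONE, FLUSH_PARTIAL, FLUSH_FULL };

static enum flush_kind decide_flush(uint64_t local_tlb_gen,
				    uint64_t mm_tlb_gen,
				    uint64_t new_tlb_gen,
				    bool range_flush)
{
	/* Already up to date: an earlier flush caught us all the way up. */
	if (local_tlb_gen == mm_tlb_gen)
		return FLUSH_NONE;

	/*
	 * A partial flush is safe only when the indicated range provably
	 * covers every change between local_tlb_gen and the latest
	 * generation, i.e. we are exactly one generation behind and no
	 * other flush has raced past the requester.
	 */
	if (range_flush &&
	    new_tlb_gen == local_tlb_gen + 1 &&
	    new_tlb_gen == mm_tlb_gen)
		return FLUSH_PARTIAL;

	/* Otherwise promote to a full flush. */
	return FLUSH_FULL;
}
```

In the race Andy describes, a partial-flush IPI can arrive after another CPU has already bumped the generation again; because `new_tlb_gen != mm_tlb_gen` in that case, this model falls through to `FLUSH_FULL` rather than doing an unsafe partial flush.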
Re: [PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
On Tue, 20 Jun 2017, Andy Lutomirski wrote: > struct flush_tlb_info { > + /* > + * We support several kinds of flushes. > + * > + * - Fully flush a single mm. flush_mm will be set, flush_end will be flush_mm is the *mm member in the struct, right? You might rename that as a preparatory step so comments and implementation match. > + * TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the > + * IPI sender is trying to catch us up. > + * > + * - Partially flush a single mm. flush_mm will be set, flush_start > + * and flush_end will indicate the range, and new_tlb_gen will be > + * set such that the changes between generation new_tlb_gen-1 and > + * new_tlb_gen are entirely contained in the indicated range. > + * > + * - Fully flush all mms whose tlb_gens have been updated. flush_mm > + * will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen > + * will be zero. > + */ > struct mm_struct *mm; > unsigned long start; > unsigned long end; > + u64 new_tlb_gen; Nit. While at it could you please make that struct tabular aligned as we usually do in x86? > static void flush_tlb_func_common(const struct flush_tlb_info *f, > bool local, enum tlb_flush_reason reason) > { > + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm); > + > + /* > + * Our memory ordering requirement is that any TLB fills that > + * happen after we flush the TLB are ordered after we read > + * active_mm's tlb_gen. We don't need any explicit barrier > + * because all x86 flush operations are serializing and the > + * atomic64_read operation won't be reordered by the compiler. > + */ Can you please move the comment above the loaded_mm assignment? > + u64 mm_tlb_gen = atomic64_read(&loaded_mm->context.tlb_gen); > + u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[0].tlb_gen); > + > /* This code cannot presently handle being reentered. 
*/ > VM_WARN_ON(!irqs_disabled()); > > + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[0].ctx_id) != > +loaded_mm->context.ctx_id); > + > if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) { > + /* > + * leave_mm() is adequate to handle any type of flush, and > + * we would prefer not to receive further IPIs. While I know what you mean, it might be useful to have a more elaborate explanation why this prevents new IPIs. > + */ > leave_mm(smp_processor_id()); > return; > } > > - if (f->end == TLB_FLUSH_ALL) { > - local_flush_tlb(); > - if (local) > - count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); > - trace_tlb_flush(reason, TLB_FLUSH_ALL); > - } else { > + if (local_tlb_gen == mm_tlb_gen) { > + /* > + * There's nothing to do: we're already up to date. This can > + * happen if two concurrent flushes happen -- the first IPI to > + * be handled can catch us all the way up, leaving no work for > + * the second IPI to be handled. That's not restricted to IPIs, right? A local flush / IPI combo can do that as well. Other than those nits; Reviewed-by: Thomas Gleixner
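[Editor's note: the three kinds of flushes the struct comment enumerates can be sketched in userspace C. This is a hypothetical model, not the kernel definition: `struct mm_struct` is stubbed, and `classify()` and the `REQ_*` names are made up; the struct layout is tab-aligned as requested in the review.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_FLUSH_ALL	((unsigned long)-1)

/* Stub so the model compiles in userspace. */
struct mm_struct { int dummy; };

/* Tab-aligned as requested in the review above. */
struct flush_tlb_info {
	struct mm_struct	*mm;
	unsigned long		start;
	unsigned long		end;
	uint64_t		new_tlb_gen;
};

/* The three kinds of flushes the comment describes. */
enum flush_req { REQ_FULL_ONE_MM, REQ_PARTIAL_ONE_MM, REQ_FULL_ALL_MMS };

static enum flush_req classify(const struct flush_tlb_info *f)
{
	/* mm == NULL, end == TLB_FLUSH_ALL, new_tlb_gen == 0: flush
	 * all mms whose tlb_gens have been updated. */
	if (!f->mm)
		return REQ_FULL_ALL_MMS;

	/* mm set, end == TLB_FLUSH_ALL: fully flush that one mm,
	 * catching the target up to new_tlb_gen. */
	if (f->end == TLB_FLUSH_ALL)
		return REQ_FULL_ONE_MM;

	/* mm set, start/end give a range whose pages contain all
	 * changes between new_tlb_gen-1 and new_tlb_gen. */
	return REQ_PARTIAL_ONE_MM;
}
```

A request is classified purely from `mm` and `end`; `new_tlb_gen` then tells the receiving CPU which generation the sender wants it to reach.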