Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-15 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm:
> 
> 
>> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin  wrote:
>> 
>> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
 
> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
> 
> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  
>>> wrote:
>>> 
>>> On big systems, the mm refcount can become highly contended when doing
>>> a lot of context switching with threaded applications (particularly
>>> switching between the idle thread and an application thread).
>>> 
>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>> any remaining lazy ones.
>>> 
>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>> with as many software threads as CPUs (so each switch will go in and
>>> out of idle), upstream can achieve a rate of about 1 million context
>>> switches per second. After this patch it goes up to 118 million.
>>> 
>> 
>> I read the patch a couple of times, and I have a suggestion that could
>> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
>> refcount.  You're saying "hey, this mm has no more references, but it
>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>> those references too."  I'm wondering whether you actually need the
>> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
>> for real?  Roughly, in __mmdrop(), you would only free the page tables
>> if mm_cpumask() is empty.  And, in the code that removes a CPU from
>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>> you just removed the last bit from mm_cpumask and potentially free the
>> mm.
>> 
>> Getting the locking right here could be a bit tricky -- you need to
>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>> should free the mm, and you also need to avoid an mm with mm_users
>> hitting zero concurrently with the last remote CPU using it lazily
>> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>> cmpxchg or dec_return to make sure that only one CPU frees it.
>> 
>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>> only ever take the lock after mm_users goes to zero.
> 
> I don't think it's nonsense, it could be a good way to avoid IPIs.
> 
> I haven't seen much problem here that made me too concerned about IPIs 
> yet, so I think the simple patch may be good enough to start with
> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
> unlazying with the exit TLB flush without doing anything fancy with
> ref counting, but we'll see.
 
 I would be cautious with benchmarking here. I would expect that the
 nasty cases may affect power consumption more than performance — the 
 specific issue is IPIs hitting idle cores, and the main effects are to 
 slow down exit() a bit but also to kick the idle core out of idle. 
 Although, if the idle core is in a deep sleep, that IPI could be 
 *very* slow.
>>> 
>>> It will tend to be self-limiting to some degree (deeper idle cores
>>> would tend to have less chance of IPI) but we have bigger issues on
>>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>>> management. Power hasn't really shown up as an issue but powerpc
>>> CPUs may have their own requirements and issues there, shall we say.
>>> 
 So I think it’s worth at least giving this a try.
>>> 
>>> To be clear it's not a complete solution itself. The problem is of 
>>> course that mm cpumask gives you false negatives, so the bits
>>> won't always clean up after themselves as CPUs switch away from their
>>> lazy tlb mms.
>> 
>> ^^
>> 
>> False positives: CPU is in the mm_cpumask, but is not using the mm
>> as a lazy tlb. So there can be bits left and never freed.
>> 
>> If you closed the false positives, you're back to a shared mm cache
>> line on lazy mm context switches.
> 
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
> line :)
> 
> Can you share your benchmark?

Just testing the IPI rates (on a smaller 176-CPU system): a
kernel compile causes about 300 shootdown interrupts (not
300 broadcasts, but 300 interrupts in total).

And very short-lived fork;exec;exit things like typical scripting
commands don't typically generate any.

So yeah the 

Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-15 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm:
> 
> 
>> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin  wrote:
>> 
>> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
 
> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
> 
> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  
>>> wrote:
>>> 
>>> On big systems, the mm refcount can become highly contended when doing
>>> a lot of context switching with threaded applications (particularly
>>> switching between the idle thread and an application thread).
>>> 
>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>> any remaining lazy ones.
>>> 
>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>> with as many software threads as CPUs (so each switch will go in and
>>> out of idle), upstream can achieve a rate of about 1 million context
>>> switches per second. After this patch it goes up to 118 million.
>>> 
>> 
>> I read the patch a couple of times, and I have a suggestion that could
>> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
>> refcount.  You're saying "hey, this mm has no more references, but it
>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>> those references too."  I'm wondering whether you actually need the
>> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
>> for real?  Roughly, in __mmdrop(), you would only free the page tables
>> if mm_cpumask() is empty.  And, in the code that removes a CPU from
>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>> you just removed the last bit from mm_cpumask and potentially free the
>> mm.
>> 
>> Getting the locking right here could be a bit tricky -- you need to
>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>> should free the mm, and you also need to avoid an mm with mm_users
>> hitting zero concurrently with the last remote CPU using it lazily
>> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>> cmpxchg or dec_return to make sure that only one CPU frees it.
>> 
>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>> only ever take the lock after mm_users goes to zero.
> 
> I don't think it's nonsense, it could be a good way to avoid IPIs.
> 
> I haven't seen much problem here that made me too concerned about IPIs 
> yet, so I think the simple patch may be good enough to start with
> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
> unlazying with the exit TLB flush without doing anything fancy with
> ref counting, but we'll see.
 
 I would be cautious with benchmarking here. I would expect that the
 nasty cases may affect power consumption more than performance — the 
 specific issue is IPIs hitting idle cores, and the main effects are to 
 slow down exit() a bit but also to kick the idle core out of idle. 
 Although, if the idle core is in a deep sleep, that IPI could be 
 *very* slow.
>>> 
>>> It will tend to be self-limiting to some degree (deeper idle cores
>>> would tend to have less chance of IPI) but we have bigger issues on
>>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>>> management. Power hasn't really shown up as an issue but powerpc
>>> CPUs may have their own requirements and issues there, shall we say.
>>> 
 So I think it’s worth at least giving this a try.
>>> 
>>> To be clear it's not a complete solution itself. The problem is of 
>>> course that mm cpumask gives you false negatives, so the bits
>>> won't always clean up after themselves as CPUs switch away from their
>>> lazy tlb mms.
>> 
>> ^^
>> 
>> False positives: CPU is in the mm_cpumask, but is not using the mm
>> as a lazy tlb. So there can be bits left and never freed.
>> 
>> If you closed the false positives, you're back to a shared mm cache
>> line on lazy mm context switches.
> 
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
> line :)
> 
> Can you share your benchmark?

It's just context_switch1_threads from will-it-scale, running with half
the number of CPUs, and core affinity on an SMT processor (so each
thread from the switching pairs gets spread to its own CPU and so you
get the task->idle->task switching).

It's really just about the worst 

Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Peter Zijlstra
On Tue, Jul 14, 2020 at 05:46:05AM -0700, Andy Lutomirski wrote:
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
> line :)

I've seen patches for a 'sparse' bitmap to solve related problems.

It's basically the same code, except it multiplies everything (size,
bit-nr) by a constant to reduce the number of active bits per line.

This sadly doesn't take topology into account, but reducing contention
is still good of course.
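
To make that concrete, here is a rough sketch of such a sparse bitmap
(illustration only, not the patches Peter refers to; the helper names and
the x8 stride are assumptions made here): logical bit n is stored at
physical bit n * SPARSE_STRIDE, so neighbouring CPUs' bits land in
different cache lines at the cost of a larger bitmap.

#include <linux/bitmap.h>
#include <linux/bitops.h>

#define SPARSE_STRIDE		8	/* assumed spread factor */
#define SPARSE_BITS(nbits)	((nbits) * SPARSE_STRIDE)

/* Same code as the dense helpers, with every bit number scaled up. */
static inline void sparse_set_bit(int nr, unsigned long *map)
{
	set_bit(nr * SPARSE_STRIDE, map);
}

static inline void sparse_clear_bit(int nr, unsigned long *map)
{
	clear_bit(nr * SPARSE_STRIDE, map);
}

static inline bool sparse_test_bit(int nr, const unsigned long *map)
{
	return test_bit(nr * SPARSE_STRIDE, map);
}

/* The backing storage must be scaled by the same factor, e.g.
 *	DECLARE_BITMAP(lazy_cpus, SPARSE_BITS(NR_CPUS));
 */

As noted above, this trades memory for less false sharing and still ignores
topology; mapping CPUs that share a cache line onto the same word would
need something smarter than a constant stride.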


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Andy Lutomirski



> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>> 
 On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
 
 Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>> 
>> On big systems, the mm refcount can become highly contended when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>> 
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>> 
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
>> 
> 
> I read the patch a couple of times, and I have a suggestion that could
> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
> refcount.  You're saying "hey, this mm has no more references, but it
> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
> those references too."  I'm wondering whether you actually need the
> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
> for real?  Roughly, in __mmdrop(), you would only free the page tables
> if mm_cpumask() is empty.  And, in the code that removes a CPU from
> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
> you just removed the last bit from mm_cpumask and potentially free the
> mm.
> 
> Getting the locking right here could be a bit tricky -- you need to
> avoid two CPUs simultaneously exiting lazy TLB and thinking they
> should free the mm, and you also need to avoid an mm with mm_users
> hitting zero concurrently with the last remote CPU using it lazily
> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
> mm" and mm_count == 0 meaning "now it's dead" and using some careful
> cmpxchg or dec_return to make sure that only one CPU frees it.
> 
> Or maybe you'd need a lock or RCU for this, but the idea would be to
> only ever take the lock after mm_users goes to zero.
 
 I don't think it's nonsense, it could be a good way to avoid IPIs.
 
 I haven't seen much problem here that made me too concerned about IPIs 
 yet, so I think the simple patch may be good enough to start with
 for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
 unlazying with the exit TLB flush without doing anything fancy with
 ref counting, but we'll see.
>>> 
>>> I would be cautious with benchmarking here. I would expect that the
>>> nasty cases may affect power consumption more than performance — the 
>>> specific issue is IPIs hitting idle cores, and the main effects are to 
>>> slow down exit() a bit but also to kick the idle core out of idle. 
>>> Although, if the idle core is in a deep sleep, that IPI could be 
>>> *very* slow.
>> 
>> It will tend to be self-limiting to some degree (deeper idle cores
>> would tend to have less chance of IPI) but we have bigger issues on
>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>> management. Power hasn't really shown up as an issue but powerpc
>> CPUs may have their own requirements and issues there, shall we say.
>> 
>>> So I think it’s worth at least giving this a try.
>> 
>> To be clear it's not a complete solution itself. The problem is of 
>> course that mm cpumask gives you false negatives, so the bits
>> won't always clean up after themselves as CPUs switch away from their
>> lazy tlb mms.
> 
> ^^
> 
> False positives: CPU is in the mm_cpumask, but is not using the mm
> as a lazy tlb. So there can be bits left and never freed.
> 
> If you closed the false positives, you're back to a shared mm cache
> line on lazy mm context switches.

x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
line :)

Can you share your benchmark?

> 
> Thanks,
> Nick


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>> 
>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
>>> 
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
> 
> On big systems, the mm refcount can become highly contended when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
> 
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.
> 
 
 I read the patch a couple of times, and I have a suggestion that could
 be nonsense.  You are, effectively, using mm_cpumask() as a sort of
 refcount.  You're saying "hey, this mm has no more references, but it
 still has nonempty mm_cpumask(), so let's send an IPI and shoot down
 those references too."  I'm wondering whether you actually need the
 IPI.  What if, instead, you actually treated mm_cpumask as a refcount
 for real?  Roughly, in __mmdrop(), you would only free the page tables
 if mm_cpumask() is empty.  And, in the code that removes a CPU from
 mm_cpumask(), you would check if mm_users == 0 and, if so, check if
 you just removed the last bit from mm_cpumask and potentially free the
 mm.
 
 Getting the locking right here could be a bit tricky -- you need to
 avoid two CPUs simultaneously exiting lazy TLB and thinking they
 should free the mm, and you also need to avoid an mm with mm_users
 hitting zero concurrently with the last remote CPU using it lazily
 exiting lazy TLB.  Perhaps this could be resolved by having mm_count
 == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
 mm" and mm_count == 0 meaning "now it's dead" and using some careful
 cmpxchg or dec_return to make sure that only one CPU frees it.
 
 Or maybe you'd need a lock or RCU for this, but the idea would be to
 only ever take the lock after mm_users goes to zero.
>>> 
>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>> 
>>> I haven't seen much problem here that made me too concerned about IPIs 
>>> yet, so I think the simple patch may be good enough to start with
>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>> unlazying with the exit TLB flush without doing anything fancy with
>>> ref counting, but we'll see.
>> 
>> I would be cautious with benchmarking here. I would expect that the
>> nasty cases may affect power consumption more than performance — the 
>> specific issue is IPIs hitting idle cores, and the main effects are to 
>> slow down exit() a bit but also to kick the idle core out of idle. 
>> Although, if the idle core is in a deep sleep, that IPI could be 
>> *very* slow.
> 
> It will tend to be self-limiting to some degree (deeper idle cores
> would tend to have less chance of IPI) but we have bigger issues on
> powerpc with that, like broadcast IPIs to the mm cpumask for THP
> management. Power hasn't really shown up as an issue but powerpc
> CPUs may have their own requirements and issues there, shall we say.
> 
>> So I think it’s worth at least giving this a try.
> 
> To be clear it's not a complete solution itself. The problem is of 
> course that mm cpumask gives you false negatives, so the bits
> won't always clean up after themselves as CPUs switch away from their
> lazy tlb mms.

^^

False positives: CPU is in the mm_cpumask, but is not using the mm
as a lazy tlb. So there can be bits left and never freed.

If you closed the false positives, you're back to a shared mm cache
line on lazy mm context switches.

Thanks,
Nick


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-13 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
> 
>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
>> 
>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
 On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
 
 On big systems, the mm refcount can become highly contended when doing
 a lot of context switching with threaded applications (particularly
 switching between the idle thread and an application thread).
 
 Abandoning lazy tlb slows switching down quite a bit in the important
 user->idle->user cases, so instead implement a non-refcounted scheme
 that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
 any remaining lazy ones.
 
 On a 16-socket 192-core POWER8 system, a context switching benchmark
 with as many software threads as CPUs (so each switch will go in and
 out of idle), upstream can achieve a rate of about 1 million context
 switches per second. After this patch it goes up to 118 million.
 
>>> 
>>> I read the patch a couple of times, and I have a suggestion that could
>>> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
>>> refcount.  You're saying "hey, this mm has no more references, but it
>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>> those references too."  I'm wondering whether you actually need the
>>> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
>>> for real?  Roughly, in __mmdrop(), you would only free the page tables
>>> if mm_cpumask() is empty.  And, in the code that removes a CPU from
>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>> you just removed the last bit from mm_cpumask and potentially free the
>>> mm.
>>> 
>>> Getting the locking right here could be a bit tricky -- you need to
>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>> should free the mm, and you also need to avoid an mm with mm_users
>>> hitting zero concurrently with the last remote CPU using it lazily
>>> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
>>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>> 
>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>> only ever take the lock after mm_users goes to zero.
>> 
>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>> 
>> I haven't seen much problem here that made me too concerned about IPIs 
>> yet, so I think the simple patch may be good enough to start with
>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>> unlazying with the exit TLB flush without doing anything fancy with
>> ref counting, but we'll see.
> 
> I would be cautious with benchmarking here. I would expect that the
> nasty cases may affect power consumption more than performance — the 
> specific issue is IPIs hitting idle cores, and the main effects are to 
> slow down exit() a bit but also to kick the idle core out of idle. 
> Although, if the idle core is in a deep sleep, that IPI could be 
> *very* slow.

It will tend to be self-limiting to some degree (deeper idle cores
would tend to have less chance of IPI) but we have bigger issues on
powerpc with that, like broadcast IPIs to the mm cpumask for THP
management. Power hasn't really shown up as an issue but powerpc
CPUs may have their own requirements and issues there, shall we say.

> So I think it’s worth at least giving this a try.

To be clear it's not a complete solution itself. The problem is of 
course that mm cpumask gives you false negatives, so the bits
won't always clean up after themselves as CPUs switch away from their
lazy tlb mms.

I would suspect it _may_ help with garbage collecting some remainders
nicely after exit, but only with somewhat of a different accounting
system than powerpc uses -- we tie mm_cpumask to TLB valids, so it can
become spread over CPUs that aren't (and never have been) using that mm
as a lazy mm. I don't know that the self-culling trick would help
a great deal within that scheme.

So powerpc needs a bit more work on that side of things too, hence
looking at doing more of this in the final TLB shootdown.

There are actually a lot of other things we can do as well to reduce
IPIs: batching being a simple hammer, some kind of quiescing, testing
the remote CPU to check what active mm it is using, doing the un-lazy
at certain defined points, etc., so I'm actually not worried about IPIs
suddenly popping up and rendering the whole concept unworkable. At
some point (unless we do something pretty complex like an SRCU-type
thing, or add extra locking, e.g. to use_mm()), at least
sometimes an IPI will be required, so I think it's reasonable to
start here and introduce complexity more slowly if it's justified.
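
As a rough illustration of one of those options, "testing the remote CPU
to check what active mm it is using" could look something like the sketch
below. It assumes a per-cpu variable (cpu_lazy_mm, invented here; the
stock kernel does not publish a remote CPU's lazy mm this way) and reuses
the shoot_lazy_tlbs()/do_shoot_lazy_tlb() helpers from the patch as the
fallback.

#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/sched/mm.h>
#include <linux/smp.h>

/* Assumption: each CPU publishes its current lazy mm here. */
static DEFINE_PER_CPU(struct mm_struct *, cpu_lazy_mm);

static void shoot_lazy_tlbs_filtered(struct mm_struct *mm)
{
	cpumask_var_t targets;
	int cpu;

	if (!zalloc_cpumask_var(&targets, GFP_KERNEL)) {
		shoot_lazy_tlbs(mm);	/* fall back to IPIing the whole mask */
		return;
	}

	/* A racy peek is fine: do_shoot_lazy_tlb() re-checks on the target. */
	for_each_cpu(cpu, mm_cpumask(mm))
		if (per_cpu(cpu_lazy_mm, cpu) == mm)
			cpumask_set_cpu(cpu, targets);

	smp_call_function_many(targets, do_shoot_lazy_tlb, (void *)mm, 1);
	do_shoot_lazy_tlb(mm);
	free_cpumask_var(targets);
}

Whether maintaining such a per-cpu pointer is worth it is exactly the kind
of extra complexity deferred here until the IPIs are shown to matter.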


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-13 Thread Andy Lutomirski


> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
> 
> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>>> 
>>> On big systems, the mm refcount can become highly contended when doing
>>> a lot of context switching with threaded applications (particularly
>>> switching between the idle thread and an application thread).
>>> 
>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>> any remaining lazy ones.
>>> 
>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>> with as many software threads as CPUs (so each switch will go in and
>>> out of idle), upstream can achieve a rate of about 1 million context
>>> switches per second. After this patch it goes up to 118 million.
>>> 
>> 
>> I read the patch a couple of times, and I have a suggestion that could
>> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
>> refcount.  You're saying "hey, this mm has no more references, but it
>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>> those references too."  I'm wondering whether you actually need the
>> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
>> for real?  Roughly, in __mmdrop(), you would only free the page tables
>> if mm_cpumask() is empty.  And, in the code that removes a CPU from
>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>> you just removed the last bit from mm_cpumask and potentially free the
>> mm.
>> 
>> Getting the locking right here could be a bit tricky -- you need to
>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>> should free the mm, and you also need to avoid an mm with mm_users
>> hitting zero concurrently with the last remote CPU using it lazily
>> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>> cmpxchg or dec_return to make sure that only one CPU frees it.
>> 
>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>> only ever take the lock after mm_users goes to zero.
> 
> I don't think it's nonsense, it could be a good way to avoid IPIs.
> 
> I haven't seen much problem here that made me too concerned about IPIs 
> yet, so I think the simple patch may be good enough to start with
> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
> unlazying with the exit TLB flush without doing anything fancy with
> ref counting, but we'll see.

I would be cautious with benchmarking here. I would expect that the nasty cases 
may affect power consumption more than performance — the specific issue is IPIs 
hitting idle cores, and the main effects are to slow down exit() a bit but also 
to kick the idle core out of idle. Although, if the idle core is in a deep 
sleep, that IPI could be *very* slow.

So I think it’s worth at least giving this a try.

> 
> Thanks,
> Nick


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-13 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>>
>> On big systems, the mm refcount can become highly contended when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>>
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>>
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
>>
> 
> I read the patch a couple of times, and I have a suggestion that could
> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
> refcount.  You're saying "hey, this mm has no more references, but it
> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
> those references too."  I'm wondering whether you actually need the
> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
> for real?  Roughly, in __mmdrop(), you would only free the page tables
> if mm_cpumask() is empty.  And, in the code that removes a CPU from
> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
> you just removed the last bit from mm_cpumask and potentially free the
> mm.
> 
> Getting the locking right here could be a bit tricky -- you need to
> avoid two CPUs simultaneously exiting lazy TLB and thinking they
> should free the mm, and you also need to avoid an mm with mm_users
> hitting zero concurrently with the last remote CPU using it lazily
> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
> mm" and mm_count == 0 meaning "now it's dead" and using some careful
> cmpxchg or dec_return to make sure that only one CPU frees it.
> 
> Or maybe you'd need a lock or RCU for this, but the idea would be to
> only ever take the lock after mm_users goes to zero.

I don't think it's nonsense, it could be a good way to avoid IPIs.

I haven't seen much problem here that made me too concerned about IPIs 
yet, so I think the simple patch may be good enough to start with
for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
unlazying with the exit TLB flush without doing anything fancy with
ref counting, but we'll see.

Thanks,
Nick


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-13 Thread Andy Lutomirski
On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>
> On big systems, the mm refcount can become highly contended when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
>
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
>
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.
>

I read the patch a couple of times, and I have a suggestion that could
be nonsense.  You are, effectively, using mm_cpumask() as a sort of
refcount.  You're saying "hey, this mm has no more references, but it
still has nonempty mm_cpumask(), so let's send an IPI and shoot down
those references too."  I'm wondering whether you actually need the
IPI.  What if, instead, you actually treated mm_cpumask as a refcount
for real?  Roughly, in __mmdrop(), you would only free the page tables
if mm_cpumask() is empty.  And, in the code that removes a CPU from
mm_cpumask(), you would check if mm_users == 0 and, if so, check if
you just removed the last bit from mm_cpumask and potentially free the
mm.

Getting the locking right here could be a bit tricky -- you need to
avoid two CPUs simultaneously exiting lazy TLB and thinking they
should free the mm, and you also need to avoid an mm with mm_users
hitting zero concurrently with the last remote CPU using it lazily
exiting lazy TLB.  Perhaps this could be resolved by having mm_count
== 1 mean "mm_cpumask() might contain bits and, if so, it owns the
mm" and mm_count == 0 meaning "now it's dead" and using some careful
cmpxchg or dec_return to make sure that only one CPU frees it.

Or maybe you'd need a lock or RCU for this, but the idea would be to
only ever take the lock after mm_users goes to zero.

--Andy
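
For readers following along, a minimal sketch of the scheme described
above (not code from the patch): mm_count == 1 means "the mm_cpumask
bits, if any, own the mm", mm_count == 0 means it is dead, and a cmpxchg
ensures only one CPU does the free. The hook names are invented and the
memory-ordering argument is only sketched in comments; getting it fully
right is the tricky part acknowledged above.

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/sched/mm.h>
#include <linux/smp.h>

static void maybe_free_mm(struct mm_struct *mm)
{
	/* Only one caller can win the 1 -> 0 transition. */
	if (atomic_cmpxchg(&mm->mm_count, 1, 0) == 1)
		__mmdrop(mm);
}

/* Invented hook: a CPU clearing itself out of mm_cpumask(). */
static void lazy_tlb_exit(struct mm_struct *mm, int cpu)
{
	cpumask_clear_cpu(cpu, mm_cpumask(mm));
	/*
	 * Order the clear before re-reading mm_users; pairs with the
	 * barrier in last_user_gone().
	 */
	smp_mb__after_atomic();
	if (atomic_read(&mm->mm_users) == 0 && cpumask_empty(mm_cpumask(mm)))
		maybe_free_mm(mm);
}

/* Invented hook: called once mm_users has dropped to zero. */
static void last_user_gone(struct mm_struct *mm)
{
	smp_mb();	/* order the mm_users store before reading the mask */
	if (cpumask_empty(mm_cpumask(mm)))
		maybe_free_mm(mm);
	/* else the last lazy CPU to clear its bit frees the mm */
}

As the replies above note, this only helps if the mm_cpumask bits are
cleared eagerly, which brings back a shared cache line on lazy context
switches.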


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-12 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of July 10, 2020 7:35 pm:
> On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote:
>> On big systems, the mm refcount can become highly contended when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>> 
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>> 
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
> 
> That's mighty impressive, however:

Well, it's the usual case of "find a bouncing cache line and scale up the
machine size until you achieve your desired improvements" :) But we
are looking at some fundamental scalability issues and seeing if we can
improve a few things.

> 
>> +static void shoot_lazy_tlbs(struct mm_struct *mm)
>> +{
>> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
>> +		smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
>> +		do_shoot_lazy_tlb(mm);
>> +	}
>> +}
> 
> IIRC you (power) never clear a CPU from that mask, so for other
> workloads I can see this resulting in massive amounts of IPIs.
> 
> For instance, take as many processes as you have CPUs. For each,
> manually walk the task across all CPUs and exit. Again.
> 
> Clearly, that's an extreme, but still...

We do have some issues with that, but it does tend to be very
self-limiting: short-lived tasks that can drive lots of exits won't get
to run on a lot of cores.

It's worth keeping an eye on; it may not be too hard to mitigate the IPIs
by doing something dumb like collecting a queue of mms before killing a
batch of them.

Thanks,
Nick
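
The "collecting a queue of mms" idea above could look roughly like the
sketch below (illustration only: the lazy_reap_* names and batch size are
invented, the mmlist linkage is borrowed just for the example, and a real
version would also have to flush partial batches and then actually free
each mm afterwards).

#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/smp.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define LAZY_REAP_BATCH	16	/* assumed batch size */

static LIST_HEAD(lazy_reap_list);
static DEFINE_SPINLOCK(lazy_reap_lock);
static int lazy_reap_count;

static void do_shoot_lazy_batch(void *arg)
{
	struct list_head *batch = arg;
	struct mm_struct *mm;

	/* Runs on every CPU; drops any lazy reference to each queued mm. */
	list_for_each_entry(mm, batch, mmlist)
		do_shoot_lazy_tlb(mm);	/* helper from the patch below */
}

static void lazy_reap_flush(void)
{
	LIST_HEAD(batch);

	spin_lock(&lazy_reap_lock);
	list_splice_init(&lazy_reap_list, &batch);
	lazy_reap_count = 0;
	spin_unlock(&lazy_reap_lock);

	/* One broadcast for the whole batch instead of one IPI per mm. */
	on_each_cpu(do_shoot_lazy_batch, &batch, 1);
	/* ...then free each mm left on the batch list... */
}

static void lazy_reap_queue(struct mm_struct *mm)
{
	bool flush;

	spin_lock(&lazy_reap_lock);
	list_add_tail(&mm->mmlist, &lazy_reap_list);
	flush = ++lazy_reap_count >= LAZY_REAP_BATCH;
	spin_unlock(&lazy_reap_lock);

	if (flush)
		lazy_reap_flush();
}

The trade-off is that mms sit on the queue (and hold their memory) until
the batch fills, so this really is the "dumb" end of the design space.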


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-10 Thread Peter Zijlstra
On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote:
> On big systems, the mm refcount can become highly contended when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
> 
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.

That's mighty impressive, however:

> +static void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> +	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
> +		smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
> +		do_shoot_lazy_tlb(mm);
> +	}
> +}

IIRC you (power) never clear a CPU from that mask, so for other
workloads I can see this resulting in massive amounts of IPIs.

For instance, take as many processes as you have CPUs. For each,
manually walk the task across all CPUs and exit. Again.

Clearly, that's an extreme, but still...




[RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-09 Thread Nicholas Piggin
On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig             | 16 ++++++++++++++++
 arch/powerpc/Kconfig     |  1 +
 include/linux/sched/mm.h |  6 +++---
 kernel/fork.c            | 39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 2daf8fe6146a..edf69437a971 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,22 @@ config MMU_LAZY_TLB
help
  Enable "lazy TLB" mmu context switching for kernel threads.
 
+config MMU_LAZY_TLB_REFCOUNT
+   def_bool y
+   depends on MMU_LAZY_TLB
+   depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+config MMU_LAZY_TLB_SHOOTDOWN
+   bool
+   depends on MMU_LAZY_TLB
+   help
+ Instead of refcounting the "lazy tlb" mm struct, which can cause
+ contention with multi-threaded apps on large multiprocessor systems,
+ this option causes __mmdrop to IPI all CPUs in the mm_cpumask and
+ switch to init_mm if they were using the to-be-freed mm as the lazy
+ tlb. Architectures which do not track all possible lazy tlb CPUs in
+ mm_cpumask can not use this (without modification).
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 920c4e3ca4ef..24ac85c868db 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -225,6 +225,7 @@ config PPC
select HAVE_PERF_USER_STACK_DUMP
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_PAGE_SIZE
+   select MMU_LAZY_TLB_SHOOTDOWN
select HAVE_REGS_AND_STACK_ACCESS_API
	select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2c2b20e2ccc7..1067af8039bd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -53,19 +53,19 @@ void mmdrop(struct mm_struct *mm);
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
mmdrop(mm);
 }
 
 static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm)
 {
-   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
		mmdrop(mm); /* This depends on mmdrop providing a full smp_mb() */
else
smp_mb();
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..da0fba9e6079 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -685,6 +685,40 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()  (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		BUG_ON(current->mm);
+		switch_mm(mm, &init_mm, current);
+		current->active_mm = &init_mm;
+	}
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	BUG_ON(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+		do_shoot_lazy_tlb(mm);
+	}
+}
+
+static void check_lazy_tlbs(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+		smp_call_function(do_check_lazy_tlb, (void *)mm, 1);
+		do_check_lazy_tlb(mm);
+	}
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -695,6 +729,11 @@ void __mmdrop(struct mm_struct *mm)
	BUG_ON(mm == &init_mm);
WARN_ON_ONCE(mm == current->mm);
WARN_ON_ONCE(mm == current->active_mm);
+
+   /* Ensure no CPUs are using this as their lazy tlb mm */
+   shoot_lazy_tlbs(mm);
+   check_lazy_tlbs(mm);
+