Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-06 Thread Mathieu Desnoyers
- On Jul 5, 2019, at 5:00 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Fri, 5 Jul 2019, Thomas Gleixner wrote:
>> On Fri, 5 Jul 2019, Mathieu Desnoyers wrote:
>> > - On Jul 5, 2019, at 4:49 AM, Ingo Molnar mi...@kernel.org wrote:
>> > > * Mathieu Desnoyers  wrote:
>> > >> The semantic I am looking for here is C11's relaxed atomics.
>> > > 
>> > > What does this mean?
>> > 
>> > C11 states:
>> > 
>> > "Atomic operations specifying memory_order_relaxed are  relaxed  only  with
>> > respect
>> > to memory ordering.  Implementations must still guarantee that any given 
>> > atomic
>> > access
>> > to a particular atomic object be indivisible with respect to all other 
>> > atomic
>> > accesses
>> > to that object."
>> > 
>> > So I am concerned that num_online_cpus() as proposed in this patch
>> > try to access __num_online_cpus non-atomically, and without using
>> > READ_ONCE().
>> >
>> > 
>> > Similarly, the update-side should use WRITE_ONCE(). Protecting with a mutex
>> > does not provide mutual exclusion against concurrent readers of that 
>> > variable.
>> 
>> Again. This is nothing new. The current implementation of num_online_cpus()
>> has no guarantees whatsoever.
>> 
>> bitmap_hweight() can be hit by a concurrent update of the mask it is
>> looking at.
>> 
>> num_online_cpus() gives you only the correct number if you invoke it inside
>> a cpuhp_lock held section. So why do we need that fuzz about atomicity now?
>> 
>> It's racy and was racy forever and even if we add that READ/WRITE_ONCE muck
>> then it still wont give you a reliable answer unless you hold cpuhp_lock at
>> least for read. So fore me that READ/WRITE_ONCE is just a cosmetic and
>> misleading reality distortion.
> 
> That said. If it makes everyone happy and feel better, I'm happy to add it
> along with a bit fat comment which explains that it's just preventing a
> theoretical store/load tearing issue and does not provide any guarantees
> other than that.

One example where the lack of READ_ONCE() makes me uneasy is found
in drivers/net/wireless/intel/iwlwifi/pcie/trans.c: 
iwl_pcie_set_interrupt_capa():

static void iwl_pcie_set_interrupt_capa(struct pci_dev *pdev,
struct iwl_trans *trans)
{
struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
int max_irqs, num_irqs, i, ret;
[...]
max_irqs = min_t(u32, num_online_cpus() + 2, IWL_MAX_RX_HW_QUEUES);
for (i = 0; i < max_irqs; i++)
trans_pcie->msix_entries[i].entry = i;

max_entries is an array of IWL_MAX_RX_HW_QUEUES entries. AFAIU, if the compiler
decides to re-fetch num_online_cpus() for the loop condition after hot-plugging
of a few cpus, we end up doing an out-of bound access to the array.

The scenario would be:

- load __num_online_cpus into a register,
- compare register + 2 to IWL_MAX_RX_HW_QUEUES
  (let's say the current number of cpus is lower that IWL_MAX_RX_HW_QUEUES - 2)
- a few other cpus are brought online,
- compiler decides to re-fetch __num_online_cpus for the loop bound, because
  its value is rightfully assumed to have never changed,
- corruption and sadness follows.

I'm not saying the current compiler implementations actually generate this
for this specific function, but I'm concerned about the fact that they are
within their right to do so. It seems quite fragile to expose a kernel API
which can yield to this kind of subtle bug.

A quick kernel tree grep for both "num_online_cpus" and "min" on the same line
gives 64 results, so it's not an isolated case.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2019, Thomas Gleixner wrote:
> On Fri, 5 Jul 2019, Mathieu Desnoyers wrote:
> > - On Jul 5, 2019, at 4:49 AM, Ingo Molnar mi...@kernel.org wrote:
> > > * Mathieu Desnoyers  wrote:
> > >> The semantic I am looking for here is C11's relaxed atomics.
> > > 
> > > What does this mean?
> > 
> > C11 states:
> > 
> > "Atomic operations specifying memory_order_relaxed are  relaxed  only  with 
> >  respect
> > to memory ordering.  Implementations must still guarantee that any given 
> > atomic access
> > to a particular atomic object be indivisible with respect to all other 
> > atomic accesses
> > to that object."
> > 
> > So I am concerned that num_online_cpus() as proposed in this patch
> > try to access __num_online_cpus non-atomically, and without using
> > READ_ONCE().
> >
> > 
> > Similarly, the update-side should use WRITE_ONCE(). Protecting with a mutex
> > does not provide mutual exclusion against concurrent readers of that 
> > variable.
> 
> Again. This is nothing new. The current implementation of num_online_cpus()
> has no guarantees whatsoever. 
> 
> bitmap_hweight() can be hit by a concurrent update of the mask it is
> looking at.
> 
> num_online_cpus() gives you only the correct number if you invoke it inside
> a cpuhp_lock held section. So why do we need that fuzz about atomicity now?
> 
> It's racy and was racy forever and even if we add that READ/WRITE_ONCE muck
> then it still wont give you a reliable answer unless you hold cpuhp_lock at
> least for read. So fore me that READ/WRITE_ONCE is just a cosmetic and
> misleading reality distortion.

That said. If it makes everyone happy and feel better, I'm happy to add it
along with a bit fat comment which explains that it's just preventing a
theoretical store/load tearing issue and does not provide any guarantees
other than that.

Thanks,

tglx


Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-05 Thread Thomas Gleixner
On Fri, 5 Jul 2019, Mathieu Desnoyers wrote:
> - On Jul 5, 2019, at 4:49 AM, Ingo Molnar mi...@kernel.org wrote:
> > * Mathieu Desnoyers  wrote:
> >> The semantic I am looking for here is C11's relaxed atomics.
> > 
> > What does this mean?
> 
> C11 states:
> 
> "Atomic operations specifying memory_order_relaxed are  relaxed  only  with  
> respect
> to memory ordering.  Implementations must still guarantee that any given 
> atomic access
> to a particular atomic object be indivisible with respect to all other atomic 
> accesses
> to that object."
> 
> So I am concerned that num_online_cpus() as proposed in this patch
> try to access __num_online_cpus non-atomically, and without using
> READ_ONCE().
>
> 
> Similarly, the update-side should use WRITE_ONCE(). Protecting with a mutex
> does not provide mutual exclusion against concurrent readers of that variable.

Again. This is nothing new. The current implementation of num_online_cpus()
has no guarantees whatsoever. 

bitmap_hweight() can be hit by a concurrent update of the mask it is
looking at.

num_online_cpus() gives you only the correct number if you invoke it inside
a cpuhp_lock held section. So why do we need that fuzz about atomicity now?

It's racy and was racy forever and even if we add that READ/WRITE_ONCE muck
then it still wont give you a reliable answer unless you hold cpuhp_lock at
least for read. So fore me that READ/WRITE_ONCE is just a cosmetic and
misleading reality distortion.

Thanks,

tglx






Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-05 Thread Mathieu Desnoyers



- On Jul 5, 2019, at 4:49 AM, Ingo Molnar mi...@kernel.org wrote:

> * Mathieu Desnoyers  wrote:
> 
>> - On Jul 4, 2019, at 6:33 PM, Thomas Gleixner t...@linutronix.de wrote:
>> 
>> > On Thu, 4 Jul 2019, Mathieu Desnoyers wrote:
>> >> - On Jul 4, 2019, at 5:10 PM, Thomas Gleixner t...@linutronix.de 
>> >> wrote:
>> >> >
>> >> > num_online_cpus() is racy today vs. CPU hotplug operations as
>> >> > long as you don't hold the hotplug lock.
>> >> 
>> >> Fair point, AFAIU none of the loads performed within num_online_cpus()
>> >> seem to rely on atomic nor volatile accesses. So not using a volatile
>> >> access to load the cached value should not introduce any regression.
>> >> 
>> >> I'm concerned that some code may rely on re-fetching of the cached
>> >> value between iterations of a loop. The lack of READ_ONCE() would
>> >> let the compiler keep a lifted load within a register and never
>> >> re-fetch, unless there is a cpu_relax() or a barrier() within the
>> >> loop.
>> > 
>> > If someone really wants to write code which can handle concurrent CPU
>> > hotplug operations and rely on that information, then it's probably better
>> > to write out:
>> > 
>> > ncpus = READ_ONCE(__num_online_cpus);
>> > 
>> > explicitely along with a big fat comment.
>> > 
>> > I can't figure out why one wants to do that and how it is supposed to work,
>> > but my brain is in shutdown mode already :)
>> > 
>> > I'd rather write a proper kernel doc comment for num_online_cpus() which
>> > explains what the constraints are instead of pretending that the READ_ONCE
>> > in the inline has any meaning.
>> 
>> The other aspect I am concerned about is freedom given to the compiler
>> to perform the store to __num_online_cpus non-atomically, or the load
>> non-atomically due to memory pressure.
> 
> What connection does "memory pressure" have to what the compiler does?
> 
> Did you confuse it with "register pressure"?

My brain wires must have been crossed when I wrote that. Indeed, I mean
register pressure.

> 
>> Is that something we should be concerned about ?
> 
> Once I understand it :)
> 
>> I thought we had WRITE_ONCE and READ_ONCE to take care of that kind of
>> situation.
> 
> Store and load tearing is one of the minor properties of READ_ONCE() and
> WRITE_ONCE() - the main properties are the ordering guarantees.
> 
> Since __num_online_cpus is neither weirdly aligned nor is it written via
> constants I don't see how load/store tearing could occur. Can you outline
> such a scenario?

I'm referring to the examples provided in Documentation/core-api/atomic_ops.rst,
e.g.:

"For another example, consider the following code::

tmp_a = a;
do_something_with(tmp_a);
do_something_else_with(tmp_a);

 If the compiler can prove that do_something_with() does not store to the
 variable a, then the compiler is within its rights to manufacture an
 additional load as follows::

tmp_a = a;
do_something_with(tmp_a);
tmp_a = a;
do_something_else_with(tmp_a);"

So if instead of "a" we have num_online_cpus(), the compiler would be
within its right to re-fetch the number of online cpus between the two
calls, which seems unexpected.

> 
>> The semantic I am looking for here is C11's relaxed atomics.
> 
> What does this mean?

C11 states:

"Atomic operations specifying memory_order_relaxed are  relaxed  only  with  
respect
to memory ordering.  Implementations must still guarantee that any given atomic 
access
to a particular atomic object be indivisible with respect to all other atomic 
accesses
to that object."

So I am concerned that num_online_cpus() as proposed in this patch
try to access __num_online_cpus non-atomically, and without using
READ_ONCE().

Similarly, the update-side should use WRITE_ONCE(). Protecting with a mutex
does not provide mutual exclusion against concurrent readers of that variable.

Again based on Documentation/core-api/atomic_ops.rst:

"For a final example, consider the following code, assuming that the
 variable a is set at boot time before the second CPU is brought online
 and never changed later, so that memory barriers are not needed::

if (a)
b = 9;
else
b = 42;

 The compiler is within its rights to manufacture an additional store
 by transforming the above code into the following::

b = 42;
if (a)
b = 9;

 This could come as a fatal surprise to other code running concurrently
 that expected b to never have the value 42 if a was zero.  To prevent
 the compiler from doing this, write something like::

if (a)
WRITE_ONCE(b, 9);
else
WRITE_ONCE(b, 42);"

If the compiler decides to manufacture additional stores when updating
the __num_online_cpus cached value, then loads could observe this
intermediate value because they fetch it without holding any mutex.

Thanks,

Mathieu

> 
> Thanks,
> 
>   Ingo

-- 
Mathieu 

Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-05 Thread Ingo Molnar


* Mathieu Desnoyers  wrote:

> - On Jul 4, 2019, at 6:33 PM, Thomas Gleixner t...@linutronix.de wrote:
> 
> > On Thu, 4 Jul 2019, Mathieu Desnoyers wrote:
> >> - On Jul 4, 2019, at 5:10 PM, Thomas Gleixner t...@linutronix.de wrote:
> >> >
> >> > num_online_cpus() is racy today vs. CPU hotplug operations as
> >> > long as you don't hold the hotplug lock.
> >> 
> >> Fair point, AFAIU none of the loads performed within num_online_cpus()
> >> seem to rely on atomic nor volatile accesses. So not using a volatile
> >> access to load the cached value should not introduce any regression.
> >> 
> >> I'm concerned that some code may rely on re-fetching of the cached
> >> value between iterations of a loop. The lack of READ_ONCE() would
> >> let the compiler keep a lifted load within a register and never
> >> re-fetch, unless there is a cpu_relax() or a barrier() within the
> >> loop.
> > 
> > If someone really wants to write code which can handle concurrent CPU
> > hotplug operations and rely on that information, then it's probably better
> > to write out:
> > 
> > ncpus = READ_ONCE(__num_online_cpus);
> > 
> > explicitely along with a big fat comment.
> > 
> > I can't figure out why one wants to do that and how it is supposed to work,
> > but my brain is in shutdown mode already :)
> > 
> > I'd rather write a proper kernel doc comment for num_online_cpus() which
> > explains what the constraints are instead of pretending that the READ_ONCE
> > in the inline has any meaning.
> 
> The other aspect I am concerned about is freedom given to the compiler 
> to perform the store to __num_online_cpus non-atomically, or the load 
> non-atomically due to memory pressure.

What connection does "memory pressure" have to what the compiler does? 

Did you confuse it with "register pressure"?

> Is that something we should be concerned about ?

Once I understand it :)

> I thought we had WRITE_ONCE and READ_ONCE to take care of that kind of 
> situation.

Store and load tearing is one of the minor properties of READ_ONCE() and 
WRITE_ONCE() - the main properties are the ordering guarantees.

Since __num_online_cpus is neither weirdly aligned nor is it written via 
constants I don't see how load/store tearing could occur. Can you outline 
such a scenario?

> The semantic I am looking for here is C11's relaxed atomics.

What does this mean?

Thanks,

Ingo


Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-04 Thread Mathieu Desnoyers
- On Jul 4, 2019, at 6:33 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Thu, 4 Jul 2019, Mathieu Desnoyers wrote:
>> - On Jul 4, 2019, at 5:10 PM, Thomas Gleixner t...@linutronix.de wrote:
>> >
>> > num_online_cpus() is racy today vs. CPU hotplug operations as
>> > long as you don't hold the hotplug lock.
>> 
>> Fair point, AFAIU none of the loads performed within num_online_cpus()
>> seem to rely on atomic nor volatile accesses. So not using a volatile
>> access to load the cached value should not introduce any regression.
>> 
>> I'm concerned that some code may rely on re-fetching of the cached
>> value between iterations of a loop. The lack of READ_ONCE() would
>> let the compiler keep a lifted load within a register and never
>> re-fetch, unless there is a cpu_relax() or a barrier() within the
>> loop.
> 
> If someone really wants to write code which can handle concurrent CPU
> hotplug operations and rely on that information, then it's probably better
> to write out:
> 
> ncpus = READ_ONCE(__num_online_cpus);
> 
> explicitely along with a big fat comment.
> 
> I can't figure out why one wants to do that and how it is supposed to work,
> but my brain is in shutdown mode already :)
> 
> I'd rather write a proper kernel doc comment for num_online_cpus() which
> explains what the constraints are instead of pretending that the READ_ONCE
> in the inline has any meaning.

The other aspect I am concerned about is freedom given to the compiler
to perform the store to __num_online_cpus non-atomically, or the load
non-atomically due to memory pressure. Is that something we should be
concerned about ?

I thought we had WRITE_ONCE and READ_ONCE to take care of that kind of
situation.

The semantic I am looking for here is C11's relaxed atomics.

Thanks,

Mathieu


> 
> Thanks,
> 
>   tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-04 Thread Thomas Gleixner
On Thu, 4 Jul 2019, Mathieu Desnoyers wrote:
> - On Jul 4, 2019, at 5:10 PM, Thomas Gleixner t...@linutronix.de wrote:
> >
> > num_online_cpus() is racy today vs. CPU hotplug operations as
> > long as you don't hold the hotplug lock.
> 
> Fair point, AFAIU none of the loads performed within num_online_cpus()
> seem to rely on atomic nor volatile accesses. So not using a volatile
> access to load the cached value should not introduce any regression.
> 
> I'm concerned that some code may rely on re-fetching of the cached
> value between iterations of a loop. The lack of READ_ONCE() would
> let the compiler keep a lifted load within a register and never
> re-fetch, unless there is a cpu_relax() or a barrier() within the
> loop.

If someone really wants to write code which can handle concurrent CPU
hotplug operations and rely on that information, then it's probably better
to write out:

 ncpus = READ_ONCE(__num_online_cpus);

explicitely along with a big fat comment.

I can't figure out why one wants to do that and how it is supposed to work,
but my brain is in shutdown mode already :)

I'd rather write a proper kernel doc comment for num_online_cpus() which
explains what the constraints are instead of pretending that the READ_ONCE
in the inline has any meaning.

Thanks,

tglx


Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-04 Thread Mathieu Desnoyers
- On Jul 4, 2019, at 5:10 PM, Thomas Gleixner t...@linutronix.de wrote:

> On Thu, 4 Jul 2019, Mathieu Desnoyers wrote:
> 
>> - On Jul 4, 2019, at 4:42 PM, Thomas Gleixner t...@linutronix.de wrote:
>> 
>> > Revaluating the bitmap wheight of the online cpus bitmap in every
>> > invocation of num_online_cpus() over and over is a pretty useless
>> > exercise. Especially when num_online_cpus() is used in code pathes like the
>> > IPI delivery of x86 or the membarrier code.
>> > 
>> > Cache the number of online CPUs in the core and just return the cached
>> > variable.
>> > 
>> > Signed-off-by: Thomas Gleixner 
>> > ---
>> > include/linux/cpumask.h |   16 +++-
>> > kernel/cpu.c|   16 
>> > 2 files changed, 23 insertions(+), 9 deletions(-)
>> > 
>> > --- a/include/linux/cpumask.h
>> > +++ b/include/linux/cpumask.h
>> > @@ -95,8 +95,13 @@ extern struct cpumask __cpu_active_mask;
>> > #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
>> > #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
>> > 
>> > +extern unsigned int __num_online_cpus;
>> 
>> [...]
>> 
>> > +
>> > +void set_cpu_online(unsigned int cpu, bool online)
>> > +{
>> > +  lockdep_assert_cpus_held();
>> 
>> I don't think it is required that the cpu_hotplug lock is held
>> when reading __num_online_cpus, right ?
> 
> Errm, that's the update function. And this is better called from a hotplug
> lock held region and not from some random crappy code.

Sure, this is fine to assume this lock is held for the update.
It's the read-side I'm worried about (which does not hold the lock).

> 
>> I would have expected the increment/decrement below to be performed
>> with a WRITE_ONCE(), and use a READ_ONCE() when reading the current
>> value.
> 
> What for?
> 
> num_online_cpus() is racy today vs. CPU hotplug operations as
> long as you don't hold the hotplug lock.

Fair point, AFAIU none of the loads performed within num_online_cpus()
seem to rely on atomic nor volatile accesses. So not using a volatile
access to load the cached value should not introduce any regression.

I'm concerned that some code may rely on re-fetching of the cached
value between iterations of a loop. The lack of READ_ONCE() would
let the compiler keep a lifted load within a register and never
re-fetch, unless there is a cpu_relax() or a barrier() within the
loop.

Thoughts ?

Thanks,

Mathieu


> 
> Thanks,
> 
>   tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-04 Thread Thomas Gleixner
On Thu, 4 Jul 2019, Mathieu Desnoyers wrote:

> - On Jul 4, 2019, at 4:42 PM, Thomas Gleixner t...@linutronix.de wrote:
> 
> > Revaluating the bitmap wheight of the online cpus bitmap in every
> > invocation of num_online_cpus() over and over is a pretty useless
> > exercise. Especially when num_online_cpus() is used in code pathes like the
> > IPI delivery of x86 or the membarrier code.
> > 
> > Cache the number of online CPUs in the core and just return the cached
> > variable.
> > 
> > Signed-off-by: Thomas Gleixner 
> > ---
> > include/linux/cpumask.h |   16 +++-
> > kernel/cpu.c|   16 
> > 2 files changed, 23 insertions(+), 9 deletions(-)
> > 
> > --- a/include/linux/cpumask.h
> > +++ b/include/linux/cpumask.h
> > @@ -95,8 +95,13 @@ extern struct cpumask __cpu_active_mask;
> > #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
> > #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
> > 
> > +extern unsigned int __num_online_cpus;
> 
> [...]
> 
> > +
> > +void set_cpu_online(unsigned int cpu, bool online)
> > +{
> > +   lockdep_assert_cpus_held();
> 
> I don't think it is required that the cpu_hotplug lock is held
> when reading __num_online_cpus, right ?

Errm, that's the update function. And this is better called from a hotplug
lock held region and not from some random crappy code.

> I would have expected the increment/decrement below to be performed
> with a WRITE_ONCE(), and use a READ_ONCE() when reading the current
> value.

What for?

num_online_cpus() is racy today vs. CPU hotplug operations as
long as you don't hold the hotplug lock.

Thanks,

tglx





Re: [PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-04 Thread Mathieu Desnoyers
- On Jul 4, 2019, at 4:42 PM, Thomas Gleixner t...@linutronix.de wrote:

> Revaluating the bitmap wheight of the online cpus bitmap in every
> invocation of num_online_cpus() over and over is a pretty useless
> exercise. Especially when num_online_cpus() is used in code pathes like the
> IPI delivery of x86 or the membarrier code.
> 
> Cache the number of online CPUs in the core and just return the cached
> variable.
> 
> Signed-off-by: Thomas Gleixner 
> ---
> include/linux/cpumask.h |   16 +++-
> kernel/cpu.c|   16 
> 2 files changed, 23 insertions(+), 9 deletions(-)
> 
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -95,8 +95,13 @@ extern struct cpumask __cpu_active_mask;
> #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
> #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
> 
> +extern unsigned int __num_online_cpus;

[...]

> +
> +void set_cpu_online(unsigned int cpu, bool online)
> +{
> + lockdep_assert_cpus_held();

I don't think it is required that the cpu_hotplug lock is held
when reading __num_online_cpus, right ?

I would have expected the increment/decrement below to be performed
with a WRITE_ONCE(), and use a READ_ONCE() when reading the current
value.

Thanks,

Mathieu

> +
> + if (online) {
> + if (!cpumask_test_and_set_cpu(cpu, &__cpu_online_mask))
> + __num_online_cpus++;
> + } else {
> + if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
> + __num_online_cpus--;
> + }
> +}
> +
> /*
>  * Activate the first processor.
>   */

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH] cpu/hotplug: Cache number of online CPUs

2019-07-04 Thread Thomas Gleixner
Revaluating the bitmap wheight of the online cpus bitmap in every
invocation of num_online_cpus() over and over is a pretty useless
exercise. Especially when num_online_cpus() is used in code pathes like the
IPI delivery of x86 or the membarrier code.

Cache the number of online CPUs in the core and just return the cached
variable.

Signed-off-by: Thomas Gleixner 
---
 include/linux/cpumask.h |   16 +++-
 kernel/cpu.c|   16 
 2 files changed, 23 insertions(+), 9 deletions(-)

--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -95,8 +95,13 @@ extern struct cpumask __cpu_active_mask;
 #define cpu_present_mask  ((const struct cpumask *)&__cpu_present_mask)
 #define cpu_active_mask   ((const struct cpumask *)&__cpu_active_mask)
 
+extern unsigned int __num_online_cpus;
+
 #if NR_CPUS > 1
-#define num_online_cpus()  cpumask_weight(cpu_online_mask)
+static inline unsigned int num_online_cpus(void)
+{
+   return __num_online_cpus;
+}
 #define num_possible_cpus()cpumask_weight(cpu_possible_mask)
 #define num_present_cpus() cpumask_weight(cpu_present_mask)
 #define num_active_cpus()  cpumask_weight(cpu_active_mask)
@@ -821,14 +826,7 @@ set_cpu_present(unsigned int cpu, bool p
cpumask_clear_cpu(cpu, &__cpu_present_mask);
 }
 
-static inline void
-set_cpu_online(unsigned int cpu, bool online)
-{
-   if (online)
-   cpumask_set_cpu(cpu, &__cpu_online_mask);
-   else
-   cpumask_clear_cpu(cpu, &__cpu_online_mask);
-}
+void set_cpu_online(unsigned int cpu, bool online);
 
 static inline void
 set_cpu_active(unsigned int cpu, bool active)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2291,6 +2291,9 @@ EXPORT_SYMBOL(__cpu_present_mask);
 struct cpumask __cpu_active_mask __read_mostly;
 EXPORT_SYMBOL(__cpu_active_mask);
 
+unsigned int __num_online_cpus __read_mostly;
+EXPORT_SYMBOL(__num_online_cpus);
+
 void init_cpu_present(const struct cpumask *src)
 {
cpumask_copy(&__cpu_present_mask, src);
@@ -2306,6 +2309,19 @@ void init_cpu_online(const struct cpumas
cpumask_copy(&__cpu_online_mask, src);
 }
 
+void set_cpu_online(unsigned int cpu, bool online)
+{
+   lockdep_assert_cpus_held();
+
+   if (online) {
+   if (!cpumask_test_and_set_cpu(cpu, &__cpu_online_mask))
+   __num_online_cpus++;
+   } else {
+   if (cpumask_test_and_clear_cpu(cpu, &__cpu_online_mask))
+   __num_online_cpus--;
+   }
+}
+
 /*
  * Activate the first processor.
  */