Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu, 9 Feb 2017, Michal Hocko wrote:

> Christoph, you are completely ignoring the reality and the code. There
> is no need for stop_machine nor is it helping anything. As a matter of
> fact, synchronization with cpu hotplug is needed if you want to do
> per-cpu specific operations. get_online_cpus is the most
> straightforward and heavy weight way to do this synchronization but
> not the only one. As the patch [1] describes, we do not really need
> get_online_cpus in drain_all_pages because we can do _better_. But
> this is not in any way a generic thing applicable to other code paths.
>
> If you disagree then you are free to post patches, but the hand waving
> you are doing here is just wasting everybody's time. So please cut it
> here unless you have specific proposals to improve the current
> situation.

I am fine with the particular solution here for this particular problem.
My problem is the general way of having to synchronize via
get_online_cpus() because of cpu hotplug operations.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu 09-02-17 11:22:49, Christoph Lameter wrote:
> On Thu, 9 Feb 2017, Thomas Gleixner wrote:
>
> > You are just not getting it, really.
> >
> > The problem is that this for_each_online_cpu() is racy against a
> > concurrent hot unplug and therefore can queue stuff for a no longer
> > online cpu. That's what the mm folks tried to avoid by preventing a
> > CPU hotplug operation before entering that loop.
>
> With a stop machine action it is NOT racy because the machine goes into a
> special kernel state that guarantees that key operating system structures
> are not touched. See mm/page_alloc.c's use of that characteristic to build
> zonelists. Thus it cannot be executing for_each_online_cpu and related
> tasks (unless one does not disable preempt, but that is a given if a
> spinlock has been taken).

Christoph, you are completely ignoring the reality and the code. There is
no need for stop_machine nor is it helping anything. As a matter of fact,
synchronization with cpu hotplug is needed if you want to do per-cpu
specific operations. get_online_cpus is the most straightforward and heavy
weight way to do this synchronization but not the only one. As the patch
[1] describes, we do not really need get_online_cpus in drain_all_pages
because we can do _better_. But this is not in any way a generic thing
applicable to other code paths.

If you disagree then you are free to post patches, but the hand waving you
are doing here is just wasting everybody's time. So please cut it here
unless you have specific proposals to improve the current situation.

Thanks!

[1] http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org
--
Michal Hocko
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu, 9 Feb 2017, Christoph Lameter wrote:
> On Thu, 9 Feb 2017, Thomas Gleixner wrote:
>
> > You are just not getting it, really.
> >
> > The problem is that this for_each_online_cpu() is racy against a
> > concurrent hot unplug and therefore can queue stuff for a no longer
> > online cpu. That's what the mm folks tried to avoid by preventing a
> > CPU hotplug operation before entering that loop.
>
> With a stop machine action it is NOT racy because the machine goes into a
> special kernel state that guarantees that key operating system structures
> are not touched. See mm/page_alloc.c's use of that characteristic to build
> zonelists. Thus it cannot be executing for_each_online_cpu and related
> tasks (unless one does not disable preempt, but that is a given if a
> spinlock has been taken).

drain_all_pages() is called from preemptible context. So what are you
talking about again?

> > > Let's get rid of get_online_cpus() etc.
> >
> > And that solves what?
>
> It gets rid of future issues with serialization in paths where we need
> to lock and still do for_each_online_cpu().

There are code paths which might sleep inside the loop, so
get_online_cpus() is the only way to serialize against hotplug. Just
because the only tool you know is stop machine, it does not make
everything an atomic context where it can be applied.

> > Can you please start to understand the scope of the whole hotplug
> > machinery including the requirements for get_online_cpus() before you
> > waste everybody's time with your uninformed and half-baked proposals?
>
> It's an obvious solution to the issues that have arisen multiple times
> with get_online_cpus() within the slab allocators. The hotplug machinery
> should make things as easy as possible for other people and having these
> get_online_cpus() everywhere does complicate things.

It's not an obvious solution to everything. It's context dependent and
people have to think hard how to solve their problem within the context
they are dealing with.

Your 'get rid of get_online_cpus()' mantra does not make all of this
magically simple. Relying on the fact that the CPU online bit is cleared
via stomp_machine(), which is a horrible operation in various aspects,
only applies to a very small subset of problems. You can repeat your
mantra another thousand times and that will not make the way larger set
of problems magically disappear.

Thanks,

	tglx
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu, 9 Feb 2017, Thomas Gleixner wrote:

> You are just not getting it, really.
>
> The problem is that this for_each_online_cpu() is racy against a
> concurrent hot unplug and therefore can queue stuff for a no longer
> online cpu. That's what the mm folks tried to avoid by preventing a CPU
> hotplug operation before entering that loop.

With a stop machine action it is NOT racy because the machine goes into a
special kernel state that guarantees that key operating system structures
are not touched. See mm/page_alloc.c's use of that characteristic to build
zonelists. Thus it cannot be executing for_each_online_cpu and related
tasks (unless one does not disable preempt, but that is a given if a
spinlock has been taken).

> > Let's get rid of get_online_cpus() etc.
>
> And that solves what?

It gets rid of future issues with serialization in paths where we need to
lock and still do for_each_online_cpu().

> Can you please start to understand the scope of the whole hotplug
> machinery including the requirements for get_online_cpus() before you
> waste everybody's time with your uninformed and half-baked proposals?

It's an obvious solution to the issues that have arisen multiple times
with get_online_cpus() within the slab allocators. The hotplug machinery
should make things as easy as possible for other people and having these
get_online_cpus() everywhere does complicate things.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu, 9 Feb 2017, Christoph Lameter wrote:
> On Thu, 9 Feb 2017, Thomas Gleixner wrote:
>
> > > The stop_machine would need to ensure that all cpus cease processing
> > > before proceeding.
> >
> > Ok. I try again:
> >
> > CPU 0                          CPU 1
> > for_each_online_cpu(cpu)
> >   ==> cpu = 1
> >                                stop_machine()
> >
> > Stops processing on all CPUs by preempting the current execution and
> > forcing them into a high priority busy loop with interrupts disabled.
>
> Exactly. That means we are outside of the sections marked with
> get_online_cpus().
>
> > It does exactly what you describe. It stops processing on all other
> > cpus until release, but that does not invalidate any data on those
> > cpus.
>
> Why would it need to invalidate any data? The change of the cpu masks
> would need to be done when the machine is stopped. This sounds exactly
> like what we need and much of it is already there.

You are just not getting it, really.

The problem is that this for_each_online_cpu() is racy against a
concurrent hot unplug and therefore can queue stuff for a no longer
online cpu. That's what the mm folks tried to avoid by preventing a CPU
hotplug operation before entering that loop.

> Let's get rid of get_online_cpus() etc.

And that solves what?

Can you please start to understand the scope of the whole hotplug
machinery including the requirements for get_online_cpus() before you
waste everybody's time with your uninformed and half-baked proposals?

Thanks,

	tglx
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu, 9 Feb 2017, Thomas Gleixner wrote:

> > The stop_machine would need to ensure that all cpus cease processing
> > before proceeding.
>
> Ok. I try again:
>
> CPU 0                          CPU 1
> for_each_online_cpu(cpu)
>   ==> cpu = 1
>                                stop_machine()
>
> Stops processing on all CPUs by preempting the current execution and
> forcing them into a high priority busy loop with interrupts disabled.

Exactly. That means we are outside of the sections marked with
get_online_cpus().

> It does exactly what you describe. It stops processing on all other cpus
> until release, but that does not invalidate any data on those cpus.

Why would it need to invalidate any data? The change of the cpu masks
would need to be done when the machine is stopped. This sounds exactly
like what we need and much of it is already there.

Let's get rid of get_online_cpus() etc.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu, 9 Feb 2017, Christoph Lameter wrote:
> On Thu, 9 Feb 2017, Thomas Gleixner wrote:
>
> > And how does that solve the problem at hand? Not at all:
> >
> > CPU 0                          CPU 1
> >
> > for_each_online_cpu(cpu)
> >   ==> cpu = 1
> >                                stop_machine()
> >                                set_cpu_online(1, false)
> > queue_work(cpu1)
>
> Well that's not how I remember stop_machine works. Doesn't it stop the
> processing on all cpus? Otherwise it's not a really usable stop.
>
> The stop_machine would need to ensure that all cpus cease processing
> before proceeding.

Ok. I try again:

CPU 0                                  CPU 1
for_each_online_cpu(cpu)
  ==> cpu = 1
                                       stop_machine()

Stops processing on all CPUs by preempting the current execution and
forcing them into a high priority busy loop with interrupts disabled.

context_switch()
  stomper_thread()
    busyloop()
                                       set_cpu_online(1, false)
                                       stop_machine end()
                                       release busy looping CPUs
context_switch()

Resumes operation at the preemption point. cpu is still 1.

queue_work(cpu == 1)

It does exactly what you describe. It stops processing on all other cpus
until release, but that does not invalidate any data on those cpus. It's
been that way forever.

Thanks,

	tglx
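[Editorial aside: the interleaving above can be simulated in plain
userspace Python. This is a sketch, not kernel code; `for_each_online_cpu`
and `queue_work` merely mimic the kernel names. It shows why a
stop-the-world pause does not help: the loop variable captured before the
pause is still valid Python state afterwards, so work can be queued on the
cpu that went offline during the pause.]

```python
# Simulate a thread iterating a snapshot of the online-cpu mask while a
# stop_machine()-style pause flips the mask on "another CPU".

online = {0, 1}

def for_each_online_cpu():
    # Snapshot taken once when the loop starts, exactly like the kernel
    # loop: later mask changes do not invalidate the iterator.
    return list(online)

queued = []

def queue_work(cpu):
    queued.append(cpu)

it = iter(for_each_online_cpu())
cpu = next(it)           # cpu = 0, work for cpu 0 would be queued here
cpu = next(it)           # cpu = 1 ... and now the thread is "preempted"

# "stop_machine()" runs on the other CPU: every thread pauses, the mask
# is updated, then every thread resumes exactly where it stopped.
online.discard(1)        # set_cpu_online(1, false)

queue_work(cpu)          # cpu is still 1: work queued for an offline cpu
print(queued, 1 in online)
```

Running this prints `[1] False`: the stale `cpu` value survives the pause,
which is the race get_online_cpus() exists to prevent.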
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Thu, 9 Feb 2017, Thomas Gleixner wrote:

> And how does that solve the problem at hand? Not at all:
>
> CPU 0                          CPU 1
>
> for_each_online_cpu(cpu)
>   ==> cpu = 1
>                                stop_machine()
>                                set_cpu_online(1, false)
> queue_work(cpu1)
>
> Thanks,

Well that's not how I remember stop_machine works. Doesn't it stop the
processing on all cpus? Otherwise it's not a really usable stop.

The stop_machine would need to ensure that all cpus cease processing
before proceeding.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, 8 Feb 2017, Christoph Lameter wrote:
> On Wed, 8 Feb 2017, Thomas Gleixner wrote:
>
> > There is a world outside yours. Hotplug is actually used frequently for
> > power purposes in some scenarios.
>
> The usual case does not involve hotplug.

We do not care about your definition of "usual". The kernel serves _ALL_
use cases.

> > It will improve nothing. The stop machine context is extremely limited
> > and you cannot do complex things there at all. Not to talk about the
> > inability of taking a simple mutex which would immediately deadlock
> > the machine.
>
> You do not need to do complex things. Basically flipping some cpu mask
> bits will do it. stop machine ensures that code is not executing on the
> processors when the bits are flipped. That will ensure that there is no
> need to do any get_online_cpus() nastiness in critical VM paths since we
> are guaranteed not to be executing them.

And how does that solve the problem at hand? Not at all:

CPU 0                          CPU 1

for_each_online_cpu(cpu)
  ==> cpu = 1
                               stop_machine()
                               set_cpu_online(1, false)
queue_work(cpu1)

Thanks,

	tglx
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, 8 Feb 2017, Thomas Gleixner wrote:

> There is a world outside yours. Hotplug is actually used frequently for
> power purposes in some scenarios.

The usual case does not involve hotplug.

> It will improve nothing. The stop machine context is extremely limited
> and you cannot do complex things there at all. Not to talk about the
> inability of taking a simple mutex which would immediately deadlock the
> machine.

You do not need to do complex things. Basically flipping some cpu mask
bits will do it. stop machine ensures that code is not executing on the
processors when the bits are flipped. That will ensure that there is no
need to do any get_online_cpus() nastiness in critical VM paths since we
are guaranteed not to be executing them.

> And everything complex needs to be done _before_ that in normal
> context. Hot unplug already uses stop machine for the final removal of
> the outgoing CPU, but that's definitely not the place where you can do
> anything complex like page management.

If it already does that then why do we still need get_online_cpus()? We do
not do anything like page management. Why would we? We just need to ensure
that nothing is executing when the bits are flipped. If that is the case
then the get_online_cpus() calls are unnecessary because the bit flipping
simply cannot occur in these functions. There is nothing to serialize
against.

> If you can prepare the outgoing cpu work during the cpu offline phase
> and then just flip a bit in the stop machine part, then this might work,
> but anything else is just handwaving and proliferation of wet dreams.

Fine with that.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, 8 Feb 2017, Christoph Lameter wrote:
> On Wed, 8 Feb 2017, Michal Hocko wrote:
>
> > I have no idea what you are trying to say and how this is related to
> > the deadlock we are discussing here. We certainly do not need to add
> > stop_machine to the problem. And yeah, dropping get_online_cpus was
> > possible after considering all fallouts.
>
> This is not the first time get_online_cpus() causes problems due to the
> need to support hotplug for processors. Hotplugging is not happening
> frequently (which is low balling it; actually the frequency of the
> hotplug events on almost all systems is zero), so the constant check is
> a useless overhead and causes trouble for development. In particular,
> get_online_cpus() is often needed in sections that need to hold locks.

There is a world outside yours. Hotplug is actually used frequently for
power purposes in some scenarios.

> So let's get rid of it. The severity, frequency and rarity of processor
> hotplug events would justify only allowing adding and removal of
> processors through the stop_machine_xx mechanism. With that in place the
> processor masks can be used without synchronization and the locking
> issues all over the kernel would become simpler.
>
> It is likely that this will even improve the hotplug code because the
> easier form of synchronization (you have a piece of code that executes
> while the OS is in stop state) would allow to make more significant
> changes to the software environment. F.e. one could think about removing
> memory segments as well as maybe per cpu segments.

It will improve nothing. The stop machine context is extremely limited
and you cannot do complex things there at all. Not to talk about the
inability of taking a simple mutex which would immediately deadlock the
machine. stop machine is the last resort for things which need to be done
atomically and that operation can be done in a very restricted context.

And everything complex needs to be done _before_ that in normal context.
Hot unplug already uses stop machine for the final removal of the
outgoing CPU, but that's definitely not the place where you can do
anything complex like page management.

If you can prepare the outgoing cpu work during the cpu offline phase and
then just flip a bit in the stop machine part, then this might work, but
anything else is just handwaving and proliferation of wet dreams.

Thanks,

	tglx
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed 08-02-17 09:11:06, Christoph Lameter wrote:
> On Wed, 8 Feb 2017, Michal Hocko wrote:
>
> > > Huch? stop_machine() is horrible and heavy weight. Don't go there,
> > > there must be simpler solutions than that.
> >
> > Absolutely agreed. We are in the page allocator path so using the
> > stop_machine* is just ridiculous. And, in fact, there is a much
> > simpler solution [1]
>
> That is nonsense. stop_machine would be used when adding or removing a
> processor. There would be no need to synchronize when looping over
> active cpus anymore. get_online_cpus() etc would be removed from the hot
> path since the cpu masks are guaranteed to be stable.

I have no idea what you are trying to say and how this is related to the
deadlock we are discussing here. We certainly do not need to add
stop_machine to the problem. And yeah, dropping get_online_cpus was
possible after considering all fallouts.
--
Michal Hocko
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, Feb 08, 2017 at 02:03:32PM +, Mel Gorman wrote:
> > Yeah, we'll sort that out once it hits Linus tree and we move RT
> > forward. Though I have one complaint right away:
> >
> > +	preempt_enable_no_resched();
> >
> > This is a nono, even in mainline. You effectively disable a preemption
> > point.
>
> This came up during review on whether it should or shouldn't be a
> preemption point. Initially it was preempt_enable() but a preemption
> point didn't exist before, the reviewer pushed for it and as it was the
> allocator fast path that was unlikely to need a reschedule or preempt, I
> made the change.

Not relevant. The only acceptable use of preempt_enable_no_resched() is
if the next statement is a schedule() variant.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, 8 Feb 2017, Michal Hocko wrote:

> I have no idea what you are trying to say and how this is related to the
> deadlock we are discussing here. We certainly do not need to add
> stop_machine to the problem. And yeah, dropping get_online_cpus was
> possible after considering all fallouts.

This is not the first time get_online_cpus() causes problems due to the
need to support hotplug for processors. Hotplugging is not happening
frequently (which is low balling it; actually the frequency of the
hotplug events on almost all systems is zero), so the constant check is a
useless overhead and causes trouble for development. In particular,
get_online_cpus() is often needed in sections that need to hold locks.

So let's get rid of it. The severity, frequency and rarity of processor
hotplug events would justify only allowing adding and removal of
processors through the stop_machine_xx mechanism. With that in place the
processor masks can be used without synchronization and the locking
issues all over the kernel would become simpler.

It is likely that this will even improve the hotplug code because the
easier form of synchronization (you have a piece of code that executes
while the OS is in stop state) would allow to make more significant
changes to the software environment. F.e. one could think about removing
memory segments as well as maybe per cpu segments.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, 8 Feb 2017, Michal Hocko wrote:

> > Huch? stop_machine() is horrible and heavy weight. Don't go there,
> > there must be simpler solutions than that.
>
> Absolutely agreed. We are in the page allocator path so using the
> stop_machine* is just ridiculous. And, in fact, there is a much simpler
> solution [1]

That is nonsense. stop_machine would be used when adding or removing a
processor. There would be no need to synchronize when looping over active
cpus anymore. get_online_cpus() etc would be removed from the hot path
since the cpu masks are guaranteed to be stable.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, 7 Feb 2017, Thomas Gleixner wrote:

> > Yep. Hotplug events are pretty significant. Using stop_machine_() etc
> > would be advisable and that would avoid the taking of locks and get
> > rid of all the complexity, reduce the code size and make the overall
> > system much more reliable.
>
> Huch? stop_machine() is horrible and heavy weight. Don't go there, there
> must be simpler solutions than that.

Inserting or removing hardware is a heavy process. This would help quite
a bit with these issues for loops over active cpus.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, 8 Feb 2017, Mel Gorman wrote:

> It may be worth noting that patches in Andrew's tree no longer disable
> interrupts in the per-cpu allocator and now per-cpu draining will be
> from workqueue context. The reasoning was due to the overhead of the
> page allocator with figures included. Interrupts will bypass the per-cpu
> allocator and use the irq-safe zone->lock to allocate from the core.
> It'll collide with the RT patch. Primary patch of interest is
> http://www.ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch

Yeah, we'll sort that out once it hits Linus tree and we move RT forward.
Though I have one complaint right away:

+	preempt_enable_no_resched();

This is a nono, even in mainline. You effectively disable a preemption
point.

> The draining from workqueue context may be a problem for RT but one
> option would be to move the drain to only drain for high-order pages
> after direct reclaim combined with only draining for order-0 if
> __alloc_pages_may_oom is about to be called.

Why would the draining from workqueue context be an issue on RT?

Thanks,

	tglx
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, Feb 08, 2017 at 02:23:19PM +0100, Thomas Gleixner wrote:
> On Wed, 8 Feb 2017, Mel Gorman wrote:
> > It may be worth noting that patches in Andrew's tree no longer disable
> > interrupts in the per-cpu allocator and now per-cpu draining will be
> > from workqueue context. The reasoning was due to the overhead of the
> > page allocator with figures included. Interrupts will bypass the
> > per-cpu allocator and use the irq-safe zone->lock to allocate from the
> > core. It'll collide with the RT patch. Primary patch of interest is
> > http://www.ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
>
> Yeah, we'll sort that out once it hits Linus tree and we move RT
> forward. Though I have one complaint right away:
>
> +	preempt_enable_no_resched();
>
> This is a nono, even in mainline. You effectively disable a preemption
> point.

This came up during review on whether it should or shouldn't be a
preemption point. Initially it was preempt_enable() but a preemption
point didn't exist before, the reviewer pushed for it and as it was the
allocator fast path that was unlikely to need a reschedule or preempt, I
made the change. I can alter it before it hits mainline if you say RT is
going to have an issue with it.

> > The draining from workqueue context may be a problem for RT but one
> > option would be to move the drain to only drain for high-order pages
> > after direct reclaim combined with only draining for order-0 if
> > __alloc_pages_may_oom is about to be called.
>
> Why would the draining from workqueue context be an issue on RT?

It probably isn't. The latency of the operation is likely longer than an
IPI was but given the context it occurs in, I severely doubted it
mattered. I couldn't think of a reason why it would matter to RT but
there was no harm double checking.
--
Mel Gorman
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, Feb 08, 2017 at 01:02:07PM +0100, Thomas Gleixner wrote:
> On Wed, 8 Feb 2017, Michal Hocko wrote:
> > On Tue 07-02-17 23:25:17, Thomas Gleixner wrote:
> > > On Tue, 7 Feb 2017, Christoph Lameter wrote:
> > > > On Tue, 7 Feb 2017, Michal Hocko wrote:
> > > >
> > > > > I am always nervous when seeing hotplug locks being used in low
> > > > > level code. It has bitten us several times already and those
> > > > > deadlocks are quite hard to spot when reviewing the code and
> > > > > very rare to hit so they tend to live for a long time.
> > > >
> > > > Yep. Hotplug events are pretty significant. Using stop_machine_()
> > > > etc would be advisable and that would avoid the taking of locks
> > > > and get rid of all the complexity, reduce the code size and make
> > > > the overall system much more reliable.
> > >
> > > Huch? stop_machine() is horrible and heavy weight. Don't go there,
> > > there must be simpler solutions than that.
> >
> > Absolutely agreed. We are in the page allocator path so using the
> > stop_machine* is just ridiculous. And, in fact, there is a much
> > simpler solution [1]
> >
> > [1] http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org
>
> Well, yes. It's simple, but from an RT point of view I really don't like
> it as we have to fix it up again.
>
> On RT we solved the problem of the page allocator differently, which
> allows us to do drain_all_pages() from the caller CPU as a side effect.
> That's interesting not only for RT, it's also interesting for NOHZ FULL
> scenarios because you don't inflict the work on the other CPUs.
>
> https://git.kernel.org/cgit/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-4.9.y-rt-rebase&id=d577a017da694e29a06af057c517f2a7051eb305

It may be worth noting that patches in Andrew's tree no longer disable
interrupts in the per-cpu allocator and now per-cpu draining will be from
workqueue context. The reasoning was due to the overhead of the page
allocator with figures included. Interrupts will bypass the per-cpu
allocator and use the irq-safe zone->lock to allocate from the core.
It'll collide with the RT patch. Primary patch of interest is
http://www.ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch

The draining from workqueue context may be a problem for RT but one
option would be to move the drain to only drain for high-order pages
after direct reclaim combined with only draining for order-0 if
__alloc_pages_may_oom is about to be called.
--
Mel Gorman
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed 08-02-17 13:02:07, Thomas Gleixner wrote:
> On Wed, 8 Feb 2017, Michal Hocko wrote:
[...]
> > [1] http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org
>
> Well, yes. It's simple, but from an RT point of view I really don't like
> it as we have to fix it up again.

I thought that preempt_disable would turn into migrate_disable or
something like that, which shouldn't cause too much trouble. Or am I
missing something? Which part of the patch is so RT unfriendly?
--
Michal Hocko
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Wed, 8 Feb 2017, Michal Hocko wrote:
> On Tue 07-02-17 23:25:17, Thomas Gleixner wrote:
> > On Tue, 7 Feb 2017, Christoph Lameter wrote:
> > > On Tue, 7 Feb 2017, Michal Hocko wrote:
> > >
> > > > I am always nervous when seeing hotplug locks being used in low
> > > > level code. It has bitten us several times already and those
> > > > deadlocks are quite hard to spot when reviewing the code and very
> > > > rare to hit so they tend to live for a long time.
> > >
> > > Yep. Hotplug events are pretty significant. Using stop_machine_()
> > > etc would be advisable and that would avoid the taking of locks and
> > > get rid of all the complexity, reduce the code size and make the
> > > overall system much more reliable.
> >
> > Huch? stop_machine() is horrible and heavy weight. Don't go there,
> > there must be simpler solutions than that.
>
> Absolutely agreed. We are in the page allocator path so using the
> stop_machine* is just ridiculous. And, in fact, there is a much simpler
> solution [1]
>
> [1] http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org

Well, yes. It's simple, but from an RT point of view I really don't like
it as we have to fix it up again.

On RT we solved the problem of the page allocator differently, which
allows us to do drain_all_pages() from the caller CPU as a side effect.
That's interesting not only for RT, it's also interesting for NOHZ FULL
scenarios because you don't inflict the work on the other CPUs.

https://git.kernel.org/cgit/linux/kernel/git/rt/linux-rt-devel.git/commit/?h=linux-4.9.y-rt-rebase&id=d577a017da694e29a06af057c517f2a7051eb305

That uses local locks (an RT speciality which compiles away into
preempt/irq disable/enable when RT is disabled). Works like a charm :)

Thanks,

	tglx
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 23:25:17, Thomas Gleixner wrote:
> On Tue, 7 Feb 2017, Christoph Lameter wrote:
> > On Tue, 7 Feb 2017, Michal Hocko wrote:
> >
> > > I am always nervous when seeing hotplug locks being used in low
> > > level code. It has bitten us several times already and those
> > > deadlocks are quite hard to spot when reviewing the code and very
> > > rare to hit so they tend to live for a long time.
> >
> > Yep. Hotplug events are pretty significant. Using stop_machine_() etc
> > would be advisable and that would avoid the taking of locks and get
> > rid of all the complexity, reduce the code size and make the overall
> > system much more reliable.
>
> Huch? stop_machine() is horrible and heavy weight. Don't go there, there
> must be simpler solutions than that.

Absolutely agreed. We are in the page allocator path so using the
stop_machine* is just ridiculous. And, in fact, there is a much simpler
solution [1]

[1] http://lkml.kernel.org/r/20170207201950.20482-1-mho...@kernel.org
--
Michal Hocko
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Mon, 6 Feb 2017, Dmitry Vyukov wrote:
> On Mon, Jan 30, 2017 at 4:48 PM, Dmitry Vyukov wrote:
> Unfortunately it does not seem to help.
> Fuzzer now runs on 510948533b059f4f5033464f9f4a0c32d4ab0c08 of
> mmotm/auto-latest
> (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git):
>
> commit 510948533b059f4f5033464f9f4a0c32d4ab0c08
> Date:   Thu Feb 2 10:08:47 2017 +0100
>     mmotm: userfaultfd-non-cooperative-add-event-for-memory-unmaps-fix
>
> The commit you referenced is already there:
>
> commit 806b158031ca0b4714e775898396529a758ebc2c
> Date:   Thu Feb 2 08:53:16 2017 +0100
>     mm, page_alloc: use static global work_struct for draining per-cpu pages
>
> Chain exists of:
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(pcpu_alloc_mutex);
>                                lock(cpu_hotplug.lock);
>                                lock(pcpu_alloc_mutex);
>   lock(cpu_hotplug.dep_map);

And that's exactly what happens:

cpu_up()                         alloc_percpu()
  lock(hotplug.lock)               lock(&pcpu_alloc_mutex)
  ...
  alloc_percpu()                   drain_all_pages()
    lock(&pcpu_alloc_mutex)          get_online_cpus()
                                       lock(hotplug.lock)

Classic deadlock, i.e. you _cannot_ call get_online_cpus() while holding
pcpu_alloc_mutex. Alternatively you can forbid doing per cpu alloc/free
while holding hotplug.lock. I doubt that this will make people happy :)

Thanks,

	tglx
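[Editorial aside: the ABBA inversion tglx spells out is exactly what the
kernel's lockdep detects by recording lock acquisition order. A minimal
userspace sketch in Python (not lockdep itself; the class and lock names
are illustrative) shows how recording "A taken while B held" edges
exposes the pcpu_alloc_mutex/hotplug.lock cycle without ever having to
hit the actual hang:]

```python
# Minimal lockdep-style order checker: record which locks were taken
# while another lock was held, and complain when the reverse order shows up.

class LockOrderTracker:
    def __init__(self):
        self.held = []    # locks currently held by this "thread"
        self.after = {}   # lock -> set of locks ever acquired while it was held

    def acquire(self, lock):
        for h in self.held:
            self.after.setdefault(h, set()).add(lock)
            # If 'h' was ever acquired while 'lock' was held, the two
            # orders contradict each other: possible ABBA deadlock.
            if h in self.after.get(lock, set()):
                raise RuntimeError(
                    f"possible deadlock: {h} -> {lock} vs {lock} -> {h}")
        self.held.append(lock)

    def release(self, lock):
        self.held.remove(lock)

t = LockOrderTracker()

# Path 1: cpu_up() takes hotplug.lock, then alloc_percpu() takes
# pcpu_alloc_mutex.
t.acquire("hotplug.lock")
t.acquire("pcpu_alloc_mutex")
t.release("pcpu_alloc_mutex")
t.release("hotplug.lock")

# Path 2: alloc_percpu() takes pcpu_alloc_mutex, then drain_all_pages()
# calls get_online_cpus() which takes hotplug.lock: inverted order.
t.acquire("pcpu_alloc_mutex")
try:
    t.acquire("hotplug.lock")
except RuntimeError as e:
    print(e)
```

This prints the inversion report rather than deadlocking, which is the
whole point of order checking: the bug is caught on the first run of the
second path, even if the timing never lines up to hang for real.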
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, 7 Feb 2017, Christoph Lameter wrote:
> On Tue, 7 Feb 2017, Michal Hocko wrote:
>
> > I am always nervous when seeing hotplug locks being used in low level
> > code. It has bitten us several times already and those deadlocks are
> > quite hard to spot when reviewing the code and very rare to hit so
> > they tend to live for a long time.
>
> Yep. Hotplug events are pretty significant. Using stop_machine_() etc
> would be advisable and that would avoid the taking of locks and get rid
> of all the complexity, reduce the code size and make the overall system
> much more reliable.

Huch? stop_machine() is horrible and heavy weight. Don't go there, there
must be simpler solutions than that.

Thanks,

	tglx
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 12:03:19, Tejun Heo wrote:
> Hello,
>
> Sorry about the delay.
>
> On Tue, Feb 07, 2017 at 04:34:59PM +0100, Michal Hocko wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c3358d4f7932..b6411816787a 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2343,7 +2343,16 @@ void drain_local_pages(struct zone *zone)
> >
> >  static void drain_local_pages_wq(struct work_struct *work)
> >  {
> > +	/*
> > +	 * drain_all_pages doesn't use proper cpu hotplug protection so
> > +	 * we can race with cpu offline when the WQ can move this from
> > +	 * a cpu pinned worker to an unbound one. We can operate on a different
> > +	 * cpu which is allright but we also have to make sure to not move to
> > +	 * a different one.
> > +	 */
> > +	preempt_disable();
> >  	drain_local_pages(NULL);
> > +	preempt_enable();
> >  }
> >
> >  /*
> > @@ -2379,12 +2388,6 @@ void drain_all_pages(struct zone *zone)
> >  	}
> >
> >  	/*
> > -	 * As this can be called from reclaim context, do not reenter reclaim.
> > -	 * An allocation failure can be handled, it's simply slower
> > -	 */
> > -	get_online_cpus();
> > -
> > -	/*
> >  	 * We don't care about racing with CPU hotplug event
> >  	 * as offline notification will cause the notified
> >  	 * cpu to drain that CPU pcps and on_each_cpu_mask
> > @@ -2423,7 +2426,6 @@ void drain_all_pages(struct zone *zone)
> >  	for_each_cpu(cpu, &cpus_with_pcps)
> >  		flush_work(per_cpu_ptr(&pcpu_drain, cpu));
> >
> > -	put_online_cpus();
> >  	mutex_unlock(&pcpu_drain_mutex);
>
> I think this would work; however, a more canonical way would be
> something along the line of...
>
> drain_all_pages()
> {
> 	...
> 	spin_lock();
> 	for_each_possible_cpu() {
> 		if (this cpu should get drained) {
> 			queue_work_on(this cpu's work);
> 		}
> 	}
> 	spin_unlock();
> 	...
> }
>
> offline_hook()
> {
> 	spin_lock();
> 	this cpu should get drained = false;
> 	spin_unlock();
> 	queue_work_on(this cpu's work);
> 	flush_work(this cpu's work);
> }

I see.

> I think what workqueue should do is automatically flush in-flight CPU
> work items on CPU offline and error out on queue_work_on() on offline
> CPUs. And we now actually can do that because we have lifted the
> guarantee that queue_work() is local CPU affine some releases ago.
> I'll look into it soonish.
>
> For the time being, either approach should be fine. The more
> canonical one might be a bit less surprising but the
> preempt_disable/enable() change is short and sweet and completely fine
> for the case at hand.

Thanks for double checking!

> Please feel free to add
>
> Acked-by: Tejun Heo

Thanks!
--
Michal Hocko
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
Hello, Sorry about the delay. On Tue, Feb 07, 2017 at 04:34:59PM +0100, Michal Hocko wrote: > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c3358d4f7932..b6411816787a 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2343,7 +2343,16 @@ void drain_local_pages(struct zone *zone) > > static void drain_local_pages_wq(struct work_struct *work) > { > + /* > + * drain_all_pages doesn't use proper cpu hotplug protection so > + * we can race with cpu offline when the WQ can move this from > + * a cpu pinned worker to an unbound one. We can operate on a different > + * cpu which is allright but we also have to make sure to not move to > + * a different one. > + */ > + preempt_disable(); > drain_local_pages(NULL); > + preempt_enable(); > } > > /* > @@ -2379,12 +2388,6 @@ void drain_all_pages(struct zone *zone) > } > > /* > - * As this can be called from reclaim context, do not reenter reclaim. > - * An allocation failure can be handled, it's simply slower > - */ > - get_online_cpus(); > - > - /* >* We don't care about racing with CPU hotplug event >* as offline notification will cause the notified >* cpu to drain that CPU pcps and on_each_cpu_mask > @@ -2423,7 +2426,6 @@ void drain_all_pages(struct zone *zone) > for_each_cpu(cpu, &cpus_with_pcps) > flush_work(per_cpu_ptr(&pcpu_drain, cpu)); > > - put_online_cpus(); > mutex_unlock(&pcpu_drain_mutex); I think this would work; however, a more canonical way would be something along the line of... drain_all_pages() { ... spin_lock(); for_each_possible_cpu() { if (this cpu should get drained) { queue_work_on(this cpu's work); } } spin_unlock(); ... } offline_hook() { spin_lock(); this cpu should get drained = false; spin_unlock(); queue_work_on(this cpu's work); flush_work(this cpu's work); } I think what workqueue should do is automatically flush in-flight CPU work items on CPU offline and erroring out on queue_work_on() on offline CPUs. 
And we now actually can do that because we have lifted the guarantee that queue_work() is local CPU affine some releases ago. I'll look into it soonish. For the time being, either approach should be fine. The more canonical one might be a bit less surprising but the preempt_disable/enable() change is short and sweet and completely fine for the case at hand. Please feel free to add Acked-by: Tejun Heo Thanks. -- tejun
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, 7 Feb 2017, Michal Hocko wrote: > I am always nervous when seeing hotplug locks being used in low level > code. It has bitten us several times already and those deadlocks are > quite hard to spot when reviewing the code and very rare to hit so they > tend to live for a long time. Yep. Hotplug events are pretty significant. Using stop_machine() etc would be advisable and that would avoid the taking of locks and get rid of all the complexity, reduce the code size and make the overall system much more reliable. Thomas?
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 16:22:24, Mel Gorman wrote: > On Tue, Feb 07, 2017 at 04:34:59PM +0100, Michal Hocko wrote: > > > But we do not care about the whole cpu hotplug code. The only part we > > > really do care about is the race inside drain_pages_zone and that will > > > run in an atomic context on the specific CPU. > > > > > > You are absolutely right that using the mutex is safe as well but the > > > hotplug path is already littered with locks and adding one more to the > > > picture doesn't sound great to me. So I would really like to not use a > > > lock if that is possible and safe (with a big fat comment of course). > > > > And with the full changelog. I hope I haven't missed anything this time. > > --- > > From 8c6af3116520251cc4ec2213f0a4ed2544bb4365 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Tue, 7 Feb 2017 16:08:35 +0100 > > Subject: [PATCH] mm, page_alloc: do not depend on cpu hotplug locks inside > > the > > allocator > > > > > > > > Reported-by: Dmitry Vyukov > > Signed-off-by: Michal Hocko > > Not that I can think of. It's almost identical to the diff I posted with > the exception of the mutex in the cpu hotplug teardown path. I agree that > in the current implementation that it should be unnecessary even if I > thought it would be more robust against any other hotplug churn. I am always nervous when seeing hotplug locks being used in low level code. It has bitten us several times already and those deadlocks are quite hard to spot when reviewing the code and very rare to hit so they tend to live for a long time. > Acked-by: Mel Gorman Thanks! I will wait for Tejun to confirm my assumptions are correct and post the patch to Andrew if there are no further problems spotted. Btw. this will also get rid of another lockdep report which seems to be a false positive though http://lkml.kernel.org/r/20170203145548.gc19...@dhcp22.suse.cz -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 04:34:59PM +0100, Michal Hocko wrote: > > But we do not care about the whole cpu hotplug code. The only part we > > really do care about is the race inside drain_pages_zone and that will > > run in an atomic context on the specific CPU. > > > > You are absolutely right that using the mutex is safe as well but the > > hotplug path is already littered with locks and adding one more to the > > picture doesn't sound great to me. So I would really like to not use a > > lock if that is possible and safe (with a big fat comment of course). > > And with the full changelog. I hope I haven't missed anything this time. > --- > From 8c6af3116520251cc4ec2213f0a4ed2544bb4365 Mon Sep 17 00:00:00 2001 > From: Michal Hocko > Date: Tue, 7 Feb 2017 16:08:35 +0100 > Subject: [PATCH] mm, page_alloc: do not depend on cpu hotplug locks inside the > allocator > > > > Reported-by: Dmitry Vyukov > Signed-off-by: Michal Hocko Not that I can think of. It's almost identical to the diff I posted with the exception of the mutex in the cpu hotplug teardown path. I agree that in the current implementation that it should be unnecessary even if I thought it would be more robust against any other hotplug churn. Acked-by: Mel Gorman -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 15:19:11, Michal Hocko wrote: > On Tue 07-02-17 13:58:46, Mel Gorman wrote: > > On Tue, Feb 07, 2017 at 01:37:08PM +0100, Michal Hocko wrote: > [...] > > > Anyway, shouldn't be it sufficient to disable preemption > > > on drain_local_pages_wq? > > > > That would be sufficient for a hot-removed CPU moving the drain request > > to another CPU and avoiding any scheduling events. > > > > > The CPU hotplug callback will not preempt us > > > and so we cannot work on the same cpus, right? > > > > > > > I don't see a specific guarantee that it cannot be preempted and it > > would depend on an the exact cpu hotplug implementation which is subject > > to quite a lot of change. > > But we do not care about the whole cpu hotplug code. The only part we > really do care about is the race inside drain_pages_zone and that will > run in an atomic context on the specific CPU. > > You are absolutely right that using the mutex is safe as well but the > hotplug path is already littered with locks and adding one more to the > picture doesn't sound great to me. So I would really like to not use a > lock if that is possible and safe (with a big fat comment of course). And with the full changelog. I hope I haven't missed anything this time. 
--- >From 8c6af3116520251cc4ec2213f0a4ed2544bb4365 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Tue, 7 Feb 2017 16:08:35 +0100 Subject: [PATCH] mm, page_alloc: do not depend on cpu hotplug locks inside the allocator Dmitry has reported the following lockdep splat [] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753 [] __mutex_lock_common kernel/locking/mutex.c:521 [inline] [] mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621 [] pcpu_alloc+0xbda/0x1280 mm/percpu.c:896 [] __alloc_percpu+0x24/0x30 mm/percpu.c:1075 [] smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44 [] cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136 [] cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493 [] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057 [] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087 [] cpu_up+0x18/0x20 kernel/cpu.c:1095 [] smp_init+0xe9/0xee kernel/smp.c:564 [] kernel_init_freeable+0x439/0x690 init/main.c:1010 [] kernel_init+0x13/0x180 init/main.c:941 [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433 cpu_hotplug_begin cpu_hotplug.lock pcpu_alloc pcpu_alloc_mutex [] get_online_cpus+0x62/0x90 kernel/cpu.c:248 [] drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385 [] __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline] [] __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778 [] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980 [] __alloc_pages include/linux/gfp.h:426 [inline] [] __alloc_pages_node include/linux/gfp.h:439 [inline] [] alloc_pages_node include/linux/gfp.h:453 [inline] [] pcpu_alloc_pages mm/percpu-vm.c:93 [inline] [] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282 [] pcpu_alloc+0xe01/0x1280 mm/percpu.c:998 [] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062 [] bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline] [] array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99 [] find_and_alloc_map kernel/bpf/syscall.c:34 [inline] [] map_create kernel/bpf/syscall.c:188 [inline] [] SYSC_bpf kernel/bpf/syscall.c:870 [inline] [] SyS_bpf+0xd64/0x2500 
kernel/bpf/syscall.c:827 [] entry_SYSCALL_64_fastpath+0x1f/0xc2 pcpu_alloc pcpu_alloc_mutex drain_all_pages get_online_cpus cpu_hotplug.lock [] cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304 [] _cpu_up+0xca/0x2a0 kernel/cpu.c:1011 [] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087 [] cpu_up+0x18/0x20 kernel/cpu.c:1095 [] smp_init+0xe9/0xee kernel/smp.c:564 [] kernel_init_freeable+0x439/0x690 init/main.c:1010 [] kernel_init+0x13/0x180 init/main.c:941 [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433 cpu_hotplug_begin cpu_hotplug.lock Pulling cpu hotplug locks inside the page allocator is just too dangerous. Let's remove the dependency by dropping get_online_cpus() from drain_all_pages. This is not so simple though because now we do not have a protection against cpu hotplug which means 2 things: - the work item might be executed on a different cpu in worker from unbound pool so it doesn't run on pinned on the cpu - we have to make sure that we do not race with page_alloc_cpu_dead calling drain_pages_zone Disabling preemption in drain_local_pages_wq will solve the first problem drain_local_pages will determine its local CPU from the WQ context which will be stable after that point, page_alloc_cpu_dead is pinned to the CPU already. The later condition is achieved by disabling IRQs in drain_pages_zone. Reported-by: Dmitry Vyukov Signed-off-by: Michal Hocko --- mm/page_alloc.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c3358d4f7932..b6411816787a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2343,7 +2343,16 @@ void drain_local_pages(struct zone *zone) static void drain_local_pages_wq(st
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 13:58:46, Mel Gorman wrote: > On Tue, Feb 07, 2017 at 01:37:08PM +0100, Michal Hocko wrote: [...] > > Anyway, shouldn't be it sufficient to disable preemption > > on drain_local_pages_wq? > > That would be sufficient for a hot-removed CPU moving the drain request > to another CPU and avoiding any scheduling events. > > > The CPU hotplug callback will not preempt us > > and so we cannot work on the same cpus, right? > > > > I don't see a specific guarantee that it cannot be preempted and it > would depend on an the exact cpu hotplug implementation which is subject > to quite a lot of change. But we do not care about the whole cpu hotplug code. The only part we really do care about is the race inside drain_pages_zone and that will run in an atomic context on the specific CPU. You are absolutely right that using the mutex is safe as well but the hotplug path is already littered with locks and adding one more to the picture doesn't sound great to me. So I would really like to not use a lock if that is possible and safe (with a big fat comment of course). -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 01:37:08PM +0100, Michal Hocko wrote: > > You cannot put sleepable lock inside the preempt disabled section... > > We can make it a spinlock right? > > Scratch that! For some reason I thought that cpu notifiers are run in an > atomic context. Now that I am checking the code again it turns out I was > wrong. __cpu_notify uses __raw_notifier_call_chain so this is not an > atomic context. Indeed. > Anyway, shouldn't it be sufficient to disable preemption > on drain_local_pages_wq? That would be sufficient for a hot-removed CPU moving the drain request to another CPU and avoiding any scheduling events. > The CPU hotplug callback will not preempt us > and so we cannot work on the same cpus, right? > I don't see a specific guarantee that it cannot be preempted and it would depend on the exact cpu hotplug implementation which is subject to quite a lot of change. Hence, the mutex provides a guarantee that the hot-removed CPU teardown cannot run on the same CPU as a workqueue drain running on a CPU it was not originally scheduled for. -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On 02/07/2017 01:48 PM, Michal Hocko wrote: On Tue 07-02-17 13:43:39, Vlastimil Babka wrote: [...] > Anyway, shouldn't it be sufficient to disable preemption > on drain_local_pages_wq? The CPU hotplug callback will not preempt us > and so we cannot work on the same cpus, right? I thought the problem here was that the callback races with the work item that has been migrated to a different cpu. Once we are not working on the local cpu, disabling preempt/irq's won't help? If the worker is racing with the callback then only one of them can run on a _particular_ cpu. So they cannot race. Or am I missing something? Ah I forgot that a migrated work item will in fact run on the local cpu. So it looks like nobody should race with the callback indeed (assuming that when the callback is called, the cpu in question already isn't executing workqueue workers).
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 13:03:50, Mel Gorman wrote: > On Tue, Feb 07, 2017 at 12:43:27PM +0100, Michal Hocko wrote: > > > Right. The unbind operation can set a mask that is any allowable CPU and > > > the final process_work is not done in a context that prevents > > > preemption. > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index 3b93879990fd..7af165d308c4 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone) > > > > > > static void drain_local_pages_wq(struct work_struct *work) > > > { > > > + /* > > > + * Ordinarily a drain operation is bound to a CPU but may be unbound > > > + * after a CPU hotplug operation so it's necessary to disable > > > + * preemption for the drain to stabilise the CPU ID. > > > + */ > > > + preempt_disable(); > > > drain_local_pages(NULL); > > > + preempt_enable_no_resched(); > > > } > > > > > > /* > > [...] > > > @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) > > > { > > > > > > lru_add_drain_cpu(cpu); > > > + > > > + /* > > > + * A per-cpu drain via a workqueue from drain_all_pages can be > > > + * rescheduled onto an unrelated CPU. That allows the hotplug > > > + * operation and the drain to potentially race on the same > > > + * CPU. Serialise hotplug versus drain using pcpu_drain_mutex > > > + */ > > > + mutex_lock(&pcpu_drain_mutex); > > > drain_pages(cpu); > > > + mutex_unlock(&pcpu_drain_mutex); > > > > You cannot put sleepable lock inside the preempt disbaled section... > > We can make it a spinlock right? > > > > The CPU down callback can hold a mutex and at least he SLUB callback > already does so. 
That gives > > page_alloc_cpu_dead > mutex_lock > drain_pages > mutex_unlock > > drain_all_pages > mutex_lock > queue workqueue > drain_local_pages_wq > preempt_disable > drain_local_pages > drain_pages > preempt_enable >flush queues > mutex_unlock > > I must be blind or maybe it's rushing between multiple concerns but which > sleepable lock is of concern? I thought the cpu hotplug callback was non-preemptible. This is not the case as mentioned in other reply. The pcpu_drain_mutex in the hotplug callback is alright. Sorry about the confusion! I am still wondering whether the lock is really needed. See the other reply. -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 12:43:27PM +0100, Michal Hocko wrote: > > Right. The unbind operation can set a mask that is any allowable CPU and > > the final process_work is not done in a context that prevents > > preemption. > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 3b93879990fd..7af165d308c4 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone) > > > > static void drain_local_pages_wq(struct work_struct *work) > > { > > + /* > > +* Ordinarily a drain operation is bound to a CPU but may be unbound > > +* after a CPU hotplug operation so it's necessary to disable > > +* preemption for the drain to stabilise the CPU ID. > > +*/ > > + preempt_disable(); > > drain_local_pages(NULL); > > + preempt_enable_no_resched(); > > } > > > > /* > [...] > > @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) > > { > > > > lru_add_drain_cpu(cpu); > > + > > + /* > > +* A per-cpu drain via a workqueue from drain_all_pages can be > > +* rescheduled onto an unrelated CPU. That allows the hotplug > > +* operation and the drain to potentially race on the same > > +* CPU. Serialise hotplug versus drain using pcpu_drain_mutex > > +*/ > > + mutex_lock(&pcpu_drain_mutex); > > drain_pages(cpu); > > + mutex_unlock(&pcpu_drain_mutex); > > You cannot put sleepable lock inside the preempt disabled section... > We can make it a spinlock right? > The CPU down callback can hold a mutex and at least the SLUB callback already does so. That gives

page_alloc_cpu_dead
  mutex_lock
  drain_pages
  mutex_unlock

drain_all_pages
  mutex_lock
  queue workqueue
    drain_local_pages_wq
      preempt_disable
      drain_local_pages
        drain_pages
      preempt_enable
  flush queues
  mutex_unlock

I must be blind or maybe it's rushing between multiple concerns but which sleepable lock is of concern? -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 13:43:39, Vlastimil Babka wrote: [...] > > Anyway, shouldn't it be sufficient to disable preemption > > on drain_local_pages_wq? The CPU hotplug callback will not preempt us > > and so we cannot work on the same cpus, right? > > I thought the problem here was that the callback races with the work item > that has been migrated to a different cpu. Once we are not working on the > local cpu, disabling preempt/irq's won't help? If the worker is racing with the callback then only one of them can run on a _particular_ cpu. So they cannot race. Or am I missing something? -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On 02/07/2017 01:37 PM, Michal Hocko wrote: > @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) > { > >lru_add_drain_cpu(cpu); > + > + /* > + * A per-cpu drain via a workqueue from drain_all_pages can be > + * rescheduled onto an unrelated CPU. That allows the hotplug > + * operation and the drain to potentially race on the same > + * CPU. Serialise hotplug versus drain using pcpu_drain_mutex > + */ > + mutex_lock(&pcpu_drain_mutex); >drain_pages(cpu); > + mutex_unlock(&pcpu_drain_mutex); You cannot put sleepable lock inside the preempt disbaled section... We can make it a spinlock right? Scratch that! For some reason I thought that cpu notifiers are run in an atomic context. Now that I am checking the code again it turns out I was wrong. __cpu_notify uses __raw_notifier_call_chain so this is not an atomic context. Good. Anyway, shouldn't be it sufficient to disable preemption on drain_local_pages_wq? The CPU hotplug callback will not preempt us and so we cannot work on the same cpus, right? I thought the problem here was that the callback races with the work item that has been migrated to a different cpu. Once we are not working on the local cpu, disabling preempt/irq's won't help?
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 12:43:27, Michal Hocko wrote: > On Tue 07-02-17 11:34:35, Mel Gorman wrote: > > On Tue, Feb 07, 2017 at 11:35:52AM +0100, Michal Hocko wrote: > > > On Tue 07-02-17 10:28:09, Mel Gorman wrote: > > > > On Tue, Feb 07, 2017 at 10:49:28AM +0100, Vlastimil Babka wrote: > > > > > On 02/07/2017 10:43 AM, Mel Gorman wrote: > > > > > > If I'm reading this right, a hot-remove will set the pool > > > > > > POOL_DISASSOCIATED > > > > > > and unbound. A workqueue queued for draining get migrated during > > > > > > hot-remove > > > > > > and a drain operation will execute twice on a CPU -- one for what > > > > > > was > > > > > > queued and a second time for the CPU it was migrated from. It > > > > > > should still > > > > > > work with flush_work which doesn't appear to block forever if an > > > > > > item > > > > > > got migrated to another workqueue. The actual drain workqueue > > > > > > function is > > > > > > using the CPU ID it's currently running on so it shouldn't get > > > > > > confused. > > > > > > > > > > Is the worker that will process this migrated workqueue also > > > > > guaranteed > > > > > to be pinned to a cpu for the whole work, though? drain_local_pages() > > > > > needs that guarantee. > > > > > > > > > > > > > It should be by running on a workqueue handler bound to that CPU (queued > > > > on wq->cpu_pwqs in __queue_work) > > > > > > Are you sure? The comment in kernel/workqueue.c says > > > * While DISASSOCIATED, the cpu may be offline and all workers > > > have > > > * %WORKER_UNBOUND set and concurrency management disabled, and > > > may > > > * be executing on any CPU. The pool behaves as an unbound one. > > > > > > I might be misreadig but an unbound pool can be handled by workers which > > > are not pinned on any cpu AFAIU. > > > > Right. The unbind operation can set a mask that is any allowable CPU and > > the final process_work is not done in a context that prevents > > preemption. 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 3b93879990fd..7af165d308c4 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone) > > > > static void drain_local_pages_wq(struct work_struct *work) > > { > > + /* > > +* Ordinarily a drain operation is bound to a CPU but may be unbound > > +* after a CPU hotplug operation so it's necessary to disable > > +* preemption for the drain to stabilise the CPU ID. > > +*/ > > + preempt_disable(); > > drain_local_pages(NULL); > > + preempt_enable_no_resched(); > > } > > > > /* > [...] > > @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) > > { > > > > lru_add_drain_cpu(cpu); > > + > > + /* > > +* A per-cpu drain via a workqueue from drain_all_pages can be > > +* rescheduled onto an unrelated CPU. That allows the hotplug > > +* operation and the drain to potentially race on the same > > +* CPU. Serialise hotplug versus drain using pcpu_drain_mutex > > +*/ > > + mutex_lock(&pcpu_drain_mutex); > > drain_pages(cpu); > > + mutex_unlock(&pcpu_drain_mutex); > > You cannot put sleepable lock inside the preempt disabled section... > We can make it a spinlock right? Scratch that! For some reason I thought that cpu notifiers are run in an atomic context. Now that I am checking the code again it turns out I was wrong. __cpu_notify uses __raw_notifier_call_chain so this is not an atomic context. Anyway, shouldn't it be sufficient to disable preemption on drain_local_pages_wq? The CPU hotplug callback will not preempt us and so we cannot work on the same cpus, right? -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 12:54:48, Vlastimil Babka wrote: > On 02/07/2017 12:43 PM, Michal Hocko wrote: > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index 3b93879990fd..7af165d308c4 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone) > > > > > > static void drain_local_pages_wq(struct work_struct *work) > > > { > > > + /* > > > + * Ordinarily a drain operation is bound to a CPU but may be unbound > > > + * after a CPU hotplug operation so it's necessary to disable > > > + * preemption for the drain to stabilise the CPU ID. > > > + */ > > > + preempt_disable(); > > > drain_local_pages(NULL); > > > + preempt_enable_no_resched(); > > > } > > > > > > /* > > [...] > > > @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) > > > { > > > > > > lru_add_drain_cpu(cpu); > > > + > > > + /* > > > + * A per-cpu drain via a workqueue from drain_all_pages can be > > > + * rescheduled onto an unrelated CPU. That allows the hotplug > > > + * operation and the drain to potentially race on the same > > > + * CPU. Serialise hotplug versus drain using pcpu_drain_mutex > > > + */ > > > + mutex_lock(&pcpu_drain_mutex); > > > drain_pages(cpu); > > > + mutex_unlock(&pcpu_drain_mutex); > > > > You cannot put sleepable lock inside the preempt disbaled section... > > We can make it a spinlock right? > > Could we do flush_work() with a spinlock? Sounds bad too. We surely cannot. I thought the lock would be gone in drain_all_pages, we would deadlock with the lock there anyway... But it is true that we would need a way to only allow one caller to get in. This is getting messier and messier... > Maybe we could just use the fact that the whole drain happens with disabled > irq's and obtain the current cpu under that protection? preempt_disable should be enough, no? The CPU callback is not called from an IRQ context, right? 
--- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1ee49474207e..4a9a65479435 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2343,7 +2343,16 @@ void drain_local_pages(struct zone *zone) static void drain_local_pages_wq(struct work_struct *work) { + /* +* drain_all_pages doesn't use proper cpu hotplug protection so +* we can race with cpu offline when the WQ can move this from +* a cpu pinned worker to an unbound one. We can operate on a different +* cpu which is allright but we also have to make sure to not move to +* a different one. +*/ + preempt_disable(); drain_local_pages(NULL); + preempt_enable(); } /* @@ -2379,12 +2388,6 @@ void drain_all_pages(struct zone *zone) } /* -* As this can be called from reclaim context, do not reenter reclaim. -* An allocation failure can be handled, it's simply slower -*/ - get_online_cpus(); - - /* * We don't care about racing with CPU hotplug event * as offline notification will cause the notified * cpu to drain that CPU pcps and on_each_cpu_mask @@ -2423,7 +2426,6 @@ void drain_all_pages(struct zone *zone) for_each_cpu(cpu, &cpus_with_pcps) flush_work(per_cpu_ptr(&pcpu_drain, cpu)); - put_online_cpus(); mutex_unlock(&pcpu_drain_mutex); } -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On 02/07/2017 12:43 PM, Michal Hocko wrote: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3b93879990fd..7af165d308c4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone) static void drain_local_pages_wq(struct work_struct *work) { + /* +* Ordinarily a drain operation is bound to a CPU but may be unbound +* after a CPU hotplug operation so it's necessary to disable +* preemption for the drain to stabilise the CPU ID. +*/ + preempt_disable(); drain_local_pages(NULL); + preempt_enable_no_resched(); } /* [...] @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) { lru_add_drain_cpu(cpu); + + /* +* A per-cpu drain via a workqueue from drain_all_pages can be +* rescheduled onto an unrelated CPU. That allows the hotplug +* operation and the drain to potentially race on the same +* CPU. Serialise hotplug versus drain using pcpu_drain_mutex +*/ + mutex_lock(&pcpu_drain_mutex); drain_pages(cpu); + mutex_unlock(&pcpu_drain_mutex); You cannot put sleepable lock inside the preempt disbaled section... We can make it a spinlock right? Could we do flush_work() with a spinlock? Sounds bad too. Maybe we could just use the fact that the whole drain happens with disabled irq's and obtain the current cpu under that protection?
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 11:34:35, Mel Gorman wrote: > On Tue, Feb 07, 2017 at 11:35:52AM +0100, Michal Hocko wrote: > > On Tue 07-02-17 10:28:09, Mel Gorman wrote: > > > On Tue, Feb 07, 2017 at 10:49:28AM +0100, Vlastimil Babka wrote: > > > > On 02/07/2017 10:43 AM, Mel Gorman wrote: > > > > > If I'm reading this right, a hot-remove will set the pool > > > > > POOL_DISASSOCIATED > > > > > and unbound. A workqueue queued for draining get migrated during > > > > > hot-remove > > > > > and a drain operation will execute twice on a CPU -- one for what was > > > > > queued and a second time for the CPU it was migrated from. It should > > > > > still > > > > > work with flush_work which doesn't appear to block forever if an item > > > > > got migrated to another workqueue. The actual drain workqueue > > > > > function is > > > > > using the CPU ID it's currently running on so it shouldn't get > > > > > confused. > > > > > > > > Is the worker that will process this migrated workqueue also guaranteed > > > > to be pinned to a cpu for the whole work, though? drain_local_pages() > > > > needs that guarantee. > > > > > > > > > > It should be by running on a workqueue handler bound to that CPU (queued > > > on wq->cpu_pwqs in __queue_work) > > > > Are you sure? The comment in kernel/workqueue.c says > > * While DISASSOCIATED, the cpu may be offline and all workers have > > * %WORKER_UNBOUND set and concurrency management disabled, and may > > * be executing on any CPU. The pool behaves as an unbound one. > > > > I might be misreadig but an unbound pool can be handled by workers which > > are not pinned on any cpu AFAIU. > > Right. The unbind operation can set a mask that is any allowable CPU and > the final process_work is not done in a context that prevents > preemption. 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3b93879990fd..7af165d308c4 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone) > > static void drain_local_pages_wq(struct work_struct *work) > { > + /* > + * Ordinarily a drain operation is bound to a CPU but may be unbound > + * after a CPU hotplug operation so it's necessary to disable > + * preemption for the drain to stabilise the CPU ID. > + */ > + preempt_disable(); > drain_local_pages(NULL); > + preempt_enable_no_resched(); > } > > /* [...] > @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) > { > > lru_add_drain_cpu(cpu); > + > + /* > + * A per-cpu drain via a workqueue from drain_all_pages can be > + * rescheduled onto an unrelated CPU. That allows the hotplug > + * operation and the drain to potentially race on the same > + * CPU. Serialise hotplug versus drain using pcpu_drain_mutex > + */ > + mutex_lock(&pcpu_drain_mutex); > drain_pages(cpu); > + mutex_unlock(&pcpu_drain_mutex); You cannot put sleepable lock inside the preempt disabled section... We can make it a spinlock right? -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 11:35:52AM +0100, Michal Hocko wrote: > On Tue 07-02-17 10:28:09, Mel Gorman wrote: > > On Tue, Feb 07, 2017 at 10:49:28AM +0100, Vlastimil Babka wrote: > > > On 02/07/2017 10:43 AM, Mel Gorman wrote: > > > > If I'm reading this right, a hot-remove will set the pool > > > > POOL_DISASSOCIATED > > > > and unbound. A workqueue queued for draining get migrated during > > > > hot-remove > > > > and a drain operation will execute twice on a CPU -- one for what was > > > > queued and a second time for the CPU it was migrated from. It should > > > > still > > > > work with flush_work which doesn't appear to block forever if an item > > > > got migrated to another workqueue. The actual drain workqueue function > > > > is > > > > using the CPU ID it's currently running on so it shouldn't get confused. > > > > > > Is the worker that will process this migrated workqueue also guaranteed > > > to be pinned to a cpu for the whole work, though? drain_local_pages() > > > needs that guarantee. > > > > > > > It should be by running on a workqueue handler bound to that CPU (queued > > on wq->cpu_pwqs in __queue_work) > > Are you sure? The comment in kernel/workqueue.c says > * While DISASSOCIATED, the cpu may be offline and all workers have > * %WORKER_UNBOUND set and concurrency management disabled, and may > * be executing on any CPU. The pool behaves as an unbound one. > > I might be misreadig but an unbound pool can be handled by workers which > are not pinned on any cpu AFAIU. Right. The unbind operation can set a mask that is any allowable CPU and the final process_work is not done in a context that prevents preemption. 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3b93879990fd..7af165d308c4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2342,7 +2342,14 @@ void drain_local_pages(struct zone *zone) static void drain_local_pages_wq(struct work_struct *work) { + /* +* Ordinarily a drain operation is bound to a CPU but may be unbound +* after a CPU hotplug operation so it's necessary to disable +* preemption for the drain to stabilise the CPU ID. +*/ + preempt_disable(); drain_local_pages(NULL); + preempt_enable_no_resched(); } /* @@ -2377,13 +2384,10 @@ void drain_all_pages(struct zone *zone) mutex_lock(&pcpu_drain_mutex); } - get_online_cpus(); - /* -* We don't care about racing with CPU hotplug event -* as offline notification will cause the notified -* cpu to drain that CPU pcps and on_each_cpu_mask -* disables preemption as part of its processing +* We don't care about racing with CPU hotplug event as offline +* notification will cause the notified cpu to drain that CPU pcps +* and it is serialised against here via pcpu_drain_mutex. */ for_each_online_cpu(cpu) { struct per_cpu_pageset *pcp; @@ -2418,7 +2422,6 @@ void drain_all_pages(struct zone *zone) for_each_cpu(cpu, &cpus_with_pcps) flush_work(per_cpu_ptr(&pcpu_drain, cpu)); - put_online_cpus(); mutex_unlock(&pcpu_drain_mutex); } @@ -6711,7 +6714,16 @@ static int page_alloc_cpu_dead(unsigned int cpu) { lru_add_drain_cpu(cpu); + + /* +* A per-cpu drain via a workqueue from drain_all_pages can be +* rescheduled onto an unrelated CPU. That allows the hotplug +* operation and the drain to potentially race on the same +* CPU. Serialise hotplug versus drain using pcpu_drain_mutex +*/ + mutex_lock(&pcpu_drain_mutex); drain_pages(cpu); + mutex_unlock(&pcpu_drain_mutex); /* * Spill the event counters of the dead processor -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On 2017/02/07 7:05, Mel Gorman wrote: > On Mon, Feb 06, 2017 at 08:13:35PM +0100, Dmitry Vyukov wrote: >> On Mon, Jan 30, 2017 at 4:48 PM, Dmitry Vyukov wrote: >>> On Sun, Jan 29, 2017 at 6:22 PM, Vlastimil Babka wrote: On 29.1.2017 13:44, Dmitry Vyukov wrote: > Hello, > > I've got the following deadlock report while running syzkaller fuzzer > on f37208bc3c9c2f811460ef264909dfbc7f605a60: > > [ INFO: possible circular locking dependency detected ] > 4.10.0-rc5-next-20170125 #1 Not tainted > --- > syz-executor3/14255 is trying to acquire lock: > (cpu_hotplug.dep_map){++}, at: [] > get_online_cpus+0x37/0x90 kernel/cpu.c:239 > > but task is already holding lock: > (pcpu_alloc_mutex){+.+.+.}, at: [] > pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897 > > which lock already depends on the new lock. I suspect the dependency comes from recent changes in drain_all_pages(). They were later redone (for other reasons, but nice to have another validation) in the mmots patch [1], which AFAICS is not yet in mmotm and thus linux-next. Could you try if it helps? >>> >>> It happened only once on linux-next, so I can't verify the fix. But I >>> will watch out for other occurrences. >> >> Unfortunately it does not seem to help. > > I'm a little stuck on how to best handle this. get_online_cpus() can > halt forever if the hotplug operation is holding the mutex when calling > pcpu_alloc. One option would be to add a try_get_online_cpus() helper which > trylocks the mutex. However, given that drain is so unlikely to actually > make that make a difference when racing against parallel allocations, > I think this should be acceptable. > > Any objections? Why is the change below on top of current linux.git (8b1b41ee74f9) insufficient? I think it can eliminate a lot of IPIs without introducing lockdep warnings. 
-- diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f3e0c69..ae6e7aa 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2354,6 +2354,7 @@ void drain_local_pages(struct zone *zone) */ void drain_all_pages(struct zone *zone) { + static DEFINE_MUTEX(lock); int cpu; /* @@ -2362,6 +2363,7 @@ void drain_all_pages(struct zone *zone) */ static cpumask_t cpus_with_pcps; + mutex_lock(&lock); /* * We don't care about racing with CPU hotplug event * as offline notification will cause the notified @@ -2394,6 +2396,7 @@ void drain_all_pages(struct zone *zone) } on_each_cpu_mask(&cpus_with_pcps, (smp_call_func_t) drain_local_pages, zone, 1); + mutex_unlock(&lock); } #ifdef CONFIG_HIBERNATION -- By the way, drain_all_pages() is a sleepable context, isn't it? I don't get a soft lockup using current linux.git (8b1b41ee74f9). But I trivially get a soft lockup if I try the above change after reverting "mm, page_alloc: use static global work_struct for draining per-cpu pages" and "mm, page_alloc: drain per-cpu pages from workqueue context" on linux-next-20170207. List corruption also came in with linux-next-20170207. -- [ 32.672890] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 33.109860] Ebtables v2.0 registered [ 33.410293] nf_conntrack version 0.5.0 (16384 buckets, 65536 max) [ 33.935512] IPv6: ADDRCONF(NETDEV_UP): eno1628: link is not ready [ 33.937478] e1000: eno1628 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None [ 33.939777] IPv6: ADDRCONF(NETDEV_CHANGE): eno1628: link becomes ready [ 34.194000] Netfilter messages via NETLINK v0.30. [ 34.258828] ip_set: protocol 6 [ 38.518375] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead. 
[ 43.911609] cp (5167) used greatest stack depth: 10488 bytes left [ 48.174125] cp (5860) used greatest stack depth: 10224 bytes left [ 101.075280] [ cut here ] [ 101.076743] WARNING: CPU: 0 PID: 9766 at lib/list_debug.c:25 __list_add_valid+0x46/0xa0 [ 101.079095] list_add corruption. next->prev should be prev (ea00013dfce0), but was 88006d3e1d40. (next=88006d3e1d40). [ 101.081863] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd ppdev glue_he
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 10:42:49AM +, Mel Gorman wrote: > On Tue, Feb 07, 2017 at 10:23:31AM +0100, Vlastimil Babka wrote: > > > cpu offlining. I have to check the code but my impression was that WQ > > > code will ignore the cpu requested by the work item when the cpu is > > > going offline. If the offline happens while the worker function already > > > executes then it has to wait as we run with preemption disabled so we > > > should be safe here. Or am I missing something obvious? > > > > Tejun suggested an alternative solution to avoiding get_online_cpus() in > > this thread: > > https://lkml.kernel.org/r/<20170123170329.ga7...@htj.duckdns.org> > > But it would look like the following as it could be serialised against > pcpu_drain_mutex as the cpu hotplug teardown callback is allowed to sleep. > Bah, this is obviously unsafe. It's guaranteed to deadlock. -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 10:23:31AM +0100, Vlastimil Babka wrote: > > cpu offlining. I have to check the code but my impression was that WQ > > code will ignore the cpu requested by the work item when the cpu is > > going offline. If the offline happens while the worker function already > > executes then it has to wait as we run with preemption disabled so we > > should be safe here. Or am I missing something obvious? > > Tejun suggested an alternative solution to avoiding get_online_cpus() in > this thread: > https://lkml.kernel.org/r/<20170123170329.ga7...@htj.duckdns.org> But it would look like the following as it could be serialised against pcpu_drain_mutex as the cpu hotplug teardown callback is allowed to sleep. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3b93879990fd..8cd8b1bbe00c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2319,9 +2319,17 @@ static void drain_pages(unsigned int cpu) { struct zone *zone; + /* +* A per-cpu drain via a workqueue from drain_all_pages can be +* rescheduled onto an unrelated CPU. That allows the hotplug +* operation and the drain to potentially race on the same +* CPU. Serialise hotplug versus drain using pcpu_drain_mutex +*/ + mutex_lock(&pcpu_drain_mutex); for_each_populated_zone(zone) { drain_pages_zone(cpu, zone); } + mutex_unlock(&pcpu_drain_mutex); } /* @@ -2377,13 +2385,10 @@ void drain_all_pages(struct zone *zone) mutex_lock(&pcpu_drain_mutex); } - get_online_cpus(); - /* -* We don't care about racing with CPU hotplug event -* as offline notification will cause the notified -* cpu to drain that CPU pcps and on_each_cpu_mask -* disables preemption as part of its processing +* We don't care about racing with CPU hotplug event as offline +* notification will cause the notified cpu to drain that CPU pcps +* and it is serialised against here via pcpu_drain_mutex. 
*/ for_each_online_cpu(cpu) { struct per_cpu_pageset *pcp; @@ -2418,7 +2423,6 @@ void drain_all_pages(struct zone *zone) for_each_cpu(cpu, &cpus_with_pcps) flush_work(per_cpu_ptr(&pcpu_drain, cpu)); - put_online_cpus(); mutex_unlock(&pcpu_drain_mutex); } -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 10:28:09, Mel Gorman wrote: > On Tue, Feb 07, 2017 at 10:49:28AM +0100, Vlastimil Babka wrote: > > On 02/07/2017 10:43 AM, Mel Gorman wrote: > > > If I'm reading this right, a hot-remove will set the pool > > > POOL_DISASSOCIATED > > > and unbound. A workqueue queued for draining get migrated during > > > hot-remove > > > and a drain operation will execute twice on a CPU -- one for what was > > > queued and a second time for the CPU it was migrated from. It should still > > > work with flush_work which doesn't appear to block forever if an item > > > got migrated to another workqueue. The actual drain workqueue function is > > > using the CPU ID it's currently running on so it shouldn't get confused. > > > > Is the worker that will process this migrated workqueue also guaranteed > > to be pinned to a cpu for the whole work, though? drain_local_pages() > > needs that guarantee. > > > > It should be by running on a workqueue handler bound to that CPU (queued > on wq->cpu_pwqs in __queue_work) Are you sure? The comment in kernel/workqueue.c says * While DISASSOCIATED, the cpu may be offline and all workers have * %WORKER_UNBOUND set and concurrency management disabled, and may * be executing on any CPU. The pool behaves as an unbound one. I might be misreading but an unbound pool can be handled by workers which are not pinned on any cpu AFAIU. -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 10:49:28AM +0100, Vlastimil Babka wrote: > On 02/07/2017 10:43 AM, Mel Gorman wrote: > > If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED > > and unbound. A workqueue queued for draining get migrated during hot-remove > > and a drain operation will execute twice on a CPU -- one for what was > > queued and a second time for the CPU it was migrated from. It should still > > work with flush_work which doesn't appear to block forever if an item > > got migrated to another workqueue. The actual drain workqueue function is > > using the CPU ID it's currently running on so it shouldn't get confused. > > Is the worker that will process this migrated workqueue also guaranteed > to be pinned to a cpu for the whole work, though? drain_local_pages() > needs that guarantee. > It should be by running on a workqueue handler bound to that CPU (queued on wq->cpu_pwqs in __queue_work) -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 10:49:28, Vlastimil Babka wrote: > On 02/07/2017 10:43 AM, Mel Gorman wrote: > > If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED > > and unbound. A workqueue queued for draining get migrated during hot-remove > > and a drain operation will execute twice on a CPU -- one for what was > > queued and a second time for the CPU it was migrated from. It should still > > work with flush_work which doesn't appear to block forever if an item > > got migrated to another workqueue. The actual drain workqueue function is > > using the CPU ID it's currently running on so it shouldn't get confused. > > Is the worker that will process this migrated workqueue also guaranteed > to be pinned to a cpu for the whole work, though? drain_local_pages() > needs that guarantee. Yeah I guess you are right. This would mean that drain_local_pages_wq should do preempt_{disable,enable} around drain_local_pages. > > > Tejun, did I miss anything? Does a workqueue item queued on a CPU being > > offline get unbound and a caller can still flush it safely? In this > > specific case, it's ok that the workqueue item does not run on the CPU it > > was queued on. I guess we need to do one more step and ensure that our (rebound) worker doesn't race with the page_alloc_cpu_notify. I guess we can just cmpxchg pcp->count in drain_pages_zone to ensure the exclusivity. Not as simple as I originally thought but doable I guess and definitely better than making a subtle dependency on the hotplug locks which is just a PITA to maintain. -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue 07-02-17 10:23:31, Vlastimil Babka wrote: > On 02/07/2017 09:48 AM, Michal Hocko wrote: > > On Mon 06-02-17 22:05:30, Mel Gorman wrote: > >>> Unfortunately it does not seem to help. > >> > >> I'm a little stuck on how to best handle this. get_online_cpus() can > >> halt forever if the hotplug operation is holding the mutex when calling > >> pcpu_alloc. One option would be to add a try_get_online_cpus() helper which > >> trylocks the mutex. However, given that drain is so unlikely to actually > >> make that make a difference when racing against parallel allocations, > >> I think this should be acceptable. > >> > >> Any objections? > >> > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >> index 3b93879990fd..a3192447e906 100644 > >> --- a/mm/page_alloc.c > >> +++ b/mm/page_alloc.c > >> @@ -3432,7 +3432,17 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, > >> unsigned int order, > >> */ > >>if (!page && !drained) { > >>unreserve_highatomic_pageblock(ac, false); > >> - drain_all_pages(NULL); > >> + > >> + /* > >> + * Only drain from contexts allocating for user allocations. > >> + * Kernel allocations could be holding a CPU hotplug-related > >> + * mutex, particularly hot-add allocating per-cpu structures > >> + * while hotplug-related mutex's are held which would prevent > >> + * get_online_cpus ever returning. > >> + */ > >> + if (gfp_mask & __GFP_HARDWALL) > >> + drain_all_pages(NULL); > >> + > > > > This wouldn't work AFAICS. If you look at the lockdep splat, the path > > which reverses the locking order (takes pcpu_alloc_mutex prior to > > cpu_hotplug.lock is bpf_array_alloc_percpu which is GFP_USER and thus > > __GFP_HARDWALL. > > > > I believe we shouldn't pull any dependency on the hotplug locks inside > > the allocator. This is just too fragile! Can we simply drop the > > get_online_cpus()? Why do we need it, anyway? 
Say we are racing with the > > It was added after I noticed in review that queue_work_on() has a > comment that caller must ensure that cpu can't go away, and wondered > about it. Ohh, I hadn't noticed the comment. Thanks for pointing it out. I still do not see what a missing get_online_cpus would mean for queuing, though. > Also noted that a similar lru_add_drain_all() does it too. > > > cpu offlining. I have to check the code but my impression was that WQ > > code will ignore the cpu requested by the work item when the cpu is > > going offline. If the offline happens while the worker function already > > executes then it has to wait as we run with preemption disabled so we > > should be safe here. Or am I missing something obvious? > > Tejun suggested an alternative solution to avoiding get_online_cpus() in > this thread: > https://lkml.kernel.org/r/<20170123170329.ga7...@htj.duckdns.org> OK, so we have page_alloc_cpu_notify which also does drain_pages so all we have to do to make sure they do not race is to synchronize there. -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On 02/07/2017 10:43 AM, Mel Gorman wrote: > If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED > and unbound. A workqueue queued for draining get migrated during hot-remove > and a drain operation will execute twice on a CPU -- one for what was > queued and a second time for the CPU it was migrated from. It should still > work with flush_work which doesn't appear to block forever if an item > got migrated to another workqueue. The actual drain workqueue function is > using the CPU ID it's currently running on so it shouldn't get confused. Is the worker that will process this migrated workqueue also guaranteed to be pinned to a cpu for the whole work, though? drain_local_pages() needs that guarantee. > Tejun, did I miss anything? Does a workqueue item queued on a CPU being > offline get unbound and a caller can still flush it safely? In this > specific case, it's ok that the workqueue item does not run on the CPU it > was queued on. >
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 10:23:31AM +0100, Vlastimil Babka wrote: > > > cpu offlining. I have to check the code but my impression was that WQ > > code will ignore the cpu requested by the work item when the cpu is > > going offline. If the offline happens while the worker function already > > executes then it has to wait as we run with preemption disabled so we > > should be safe here. Or am I missing something obvious? > > Tejun suggested an alternative solution to avoiding get_online_cpus() in > this thread: > https://lkml.kernel.org/r/<20170123170329.ga7...@htj.duckdns.org> I was hoping it would not really be necessary to sync against the cpu offline callback if we know we don't really have to guarantee we run on the correct CPU. -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Tue, Feb 07, 2017 at 09:48:56AM +0100, Michal Hocko wrote: > > + > > + /* > > +* Only drain from contexts allocating for user allocations. > > +* Kernel allocations could be holding a CPU hotplug-related > > +* mutex, particularly hot-add allocating per-cpu structures > > +* while hotplug-related mutex's are held which would prevent > > +* get_online_cpus ever returning. > > +*/ > > + if (gfp_mask & __GFP_HARDWALL) > > + drain_all_pages(NULL); > > + > > This wouldn't work AFAICS. If you look at the lockdep splat, the path > which reverses the locking order (takes pcpu_alloc_mutex prior to > cpu_hotplug.lock is bpf_array_alloc_percpu which is GFP_USER and thus > __GFP_HARDWALL. > You're right, I looked at the wrong caller. > I believe we shouldn't pull any dependency on the hotplug locks inside > the allocator. This is just too fragile! Can we simply drop the > get_online_cpus()? Why do we need it, anyway? To stop the CPU being queued going offline. Another alternative is bringing back try_get_online_cpus() which, as Peter pointed out to me offline, was removed in commit 02ef3c4a2aae65a1632b27770bfea3f83ca06772, although it'd be a shame for just this case. > Say we are racing with the > cpu offlining. I have to check the code but my impression was that WQ > code will ignore the cpu requested by the work item when the cpu is > going offline. If the offline happens while the worker function already > executes then it has to wait as we run with preemption disabled so we > should be safe here. Or am I missing something obvious? It may be safe but I'd prefer to get confirmation from Tejun. If it happens that a drain on a particular CPU gets missed in this path, it's not the end of the world as the CPU offlining drains the pages in another path so nothing gets lost. It would also be acceptable if one CPU was drained twice because the workqueue was unbound from a CPU being hot-removed and moved during a hotplug operation. 
If I'm reading this right, a hot-remove will set the pool POOL_DISASSOCIATED and unbound. A workqueue queued for draining gets migrated during hot-remove and a drain operation will execute twice on a CPU -- one for what was queued and a second time for the CPU it was migrated from. It should still work with flush_work which doesn't appear to block forever if an item got migrated to another workqueue. The actual drain workqueue function is using the CPU ID it's currently running on so it shouldn't get confused. Tejun, did I miss anything? Does a workqueue item queued on a CPU being offline get unbound and a caller can still flush it safely? In this specific case, it's ok that the workqueue item does not run on the CPU it was queued on. -- Mel Gorman SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On 02/07/2017 09:48 AM, Michal Hocko wrote: > On Mon 06-02-17 22:05:30, Mel Gorman wrote: >>> Unfortunately it does not seem to help. >> >> I'm a little stuck on how to best handle this. get_online_cpus() can >> halt forever if the hotplug operation is holding the mutex when calling >> pcpu_alloc. One option would be to add a try_get_online_cpus() helper which >> trylocks the mutex. However, given that drain is so unlikely to actually >> make that make a difference when racing against parallel allocations, >> I think this should be acceptable. >> >> Any objections? >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 3b93879990fd..a3192447e906 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -3432,7 +3432,17 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned >> int order, >> */ >> if (!page && !drained) { >> unreserve_highatomic_pageblock(ac, false); >> -drain_all_pages(NULL); >> + >> +/* >> + * Only drain from contexts allocating for user allocations. >> + * Kernel allocations could be holding a CPU hotplug-related >> + * mutex, particularly hot-add allocating per-cpu structures >> + * while hotplug-related mutex's are held which would prevent >> + * get_online_cpus ever returning. >> + */ >> +if (gfp_mask & __GFP_HARDWALL) >> +drain_all_pages(NULL); >> + > > This wouldn't work AFAICS. If you look at the lockdep splat, the path > which reverses the locking order (takes pcpu_alloc_mutex prior to > cpu_hotplug.lock is bpf_array_alloc_percpu which is GFP_USER and thus > __GFP_HARDWALL. > > I believe we shouldn't pull any dependency on the hotplug locks inside > the allocator. This is just too fragile! Can we simply drop the > get_online_cpus()? Why do we need it, anyway? Say we are racing with the It was added after I noticed in review that queue_work_on() has a comment that caller must ensure that cpu can't go away, and wondered about it. Also noted that a similar lru_add_drain_all() does it too. > cpu offlining. 
I have to check the code but my impression was that WQ > code will ignore the cpu requested by the work item when the cpu is > going offline. If the offline happens while the worker function already > executes then it has to wait as we run with preemption disabled so we > should be safe here. Or am I missing something obvious? Tejun suggested an alternative solution to avoiding get_online_cpus() in this thread: https://lkml.kernel.org/r/<20170123170329.ga7...@htj.duckdns.org>
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Mon 06-02-17 22:05:30, Mel Gorman wrote: > On Mon, Feb 06, 2017 at 08:13:35PM +0100, Dmitry Vyukov wrote: > > On Mon, Jan 30, 2017 at 4:48 PM, Dmitry Vyukov wrote: > > > On Sun, Jan 29, 2017 at 6:22 PM, Vlastimil Babka wrote: > > >> On 29.1.2017 13:44, Dmitry Vyukov wrote: > > >>> Hello, > > >>> > > >>> I've got the following deadlock report while running syzkaller fuzzer > > >>> on f37208bc3c9c2f811460ef264909dfbc7f605a60: > > >>> > > >>> [ INFO: possible circular locking dependency detected ] > > >>> 4.10.0-rc5-next-20170125 #1 Not tainted > > >>> --- > > >>> syz-executor3/14255 is trying to acquire lock: > > >>> (cpu_hotplug.dep_map){++}, at: [] > > >>> get_online_cpus+0x37/0x90 kernel/cpu.c:239 > > >>> > > >>> but task is already holding lock: > > >>> (pcpu_alloc_mutex){+.+.+.}, at: [] > > >>> pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897 > > >>> > > >>> which lock already depends on the new lock. > > >> > > >> I suspect the dependency comes from recent changes in drain_all_pages(). > > >> They > > >> were later redone (for other reasons, but nice to have another > > >> validation) in > > >> the mmots patch [1], which AFAICS is not yet in mmotm and thus > > >> linux-next. Could > > >> you try if it helps? > > > > > > It happened only once on linux-next, so I can't verify the fix. But I > > > will watch out for other occurrences. > > > > Unfortunately it does not seem to help. > > I'm a little stuck on how to best handle this. get_online_cpus() can > halt forever if the hotplug operation is holding the mutex when calling > pcpu_alloc. One option would be to add a try_get_online_cpus() helper which > trylocks the mutex. However, given that drain is so unlikely to actually > make that make a difference when racing against parallel allocations, > I think this should be acceptable. > > Any objections? 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 3b93879990fd..a3192447e906 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3432,7 +3432,17 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned > int order, >*/ > if (!page && !drained) { > unreserve_highatomic_pageblock(ac, false); > - drain_all_pages(NULL); > + > + /* > + * Only drain from contexts allocating for user allocations. > + * Kernel allocations could be holding a CPU hotplug-related > + * mutex, particularly hot-add allocating per-cpu structures > + * while hotplug-related mutex's are held which would prevent > + * get_online_cpus ever returning. > + */ > + if (gfp_mask & __GFP_HARDWALL) > + drain_all_pages(NULL); > + This wouldn't work AFAICS. If you look at the lockdep splat, the path which reverses the locking order (takes pcpu_alloc_mutex prior to cpu_hotplug.lock) is bpf_array_alloc_percpu which is GFP_USER and thus __GFP_HARDWALL. I believe we shouldn't pull any dependency on the hotplug locks inside the allocator. This is just too fragile! Can we simply drop the get_online_cpus()? Why do we need it, anyway? Say we are racing with the cpu offlining. I have to check the code but my impression was that WQ code will ignore the cpu requested by the work item when the cpu is going offline. If the offline happens while the worker function already executes then it has to wait as we run with preemption disabled so we should be safe here. Or am I missing something obvious? -- Michal Hocko SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Mon 06-02-17 20:13:35, Dmitry Vyukov wrote: [...] > Fuzzer now runs on 510948533b059f4f5033464f9f4a0c32d4ab0c08 of > mmotm/auto-latest > (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git): > > commit 510948533b059f4f5033464f9f4a0c32d4ab0c08 > Date: Thu Feb 2 10:08:47 2017 +0100 > mmotm: userfaultfd-non-cooperative-add-event-for-memory-unmaps-fix > > The commit you referenced is already there: > > commit 806b158031ca0b4714e775898396529a758ebc2c > Date: Thu Feb 2 08:53:16 2017 +0100 > mm, page_alloc: use static global work_struct for draining per-cpu pages > > But I still got: > > [ INFO: possible circular locking dependency detected ] > 4.9.0 #6 Not tainted > --- > syz-executor1/8199 is trying to acquire lock: > (cpu_hotplug.dep_map){++}, at: [] > get_online_cpus+0x37/0x90 kernel/cpu.c:246 > but task is already holding lock: > (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbda/0x1280 > mm/percpu.c:896 > which lock already depends on the new lock. > the original was too hard to read so here is the reformatted output. 
> the existing dependency chain (in reverse order) is: > > [ 403.953319] [] validate_chain > kernel/locking/lockdep.c:2265 [inline] > [ 403.953319] [] __lock_acquire+0x2149/0x3430 > kernel/locking/lockdep.c:3338 > [ 403.961232] [] lock_acquire+0x2a1/0x630 > kernel/locking/lockdep.c:3753 > [ 403.968788] [] __mutex_lock_common > kernel/locking/mutex.c:521 [inline] > [ 403.968788] [] mutex_lock_nested+0x24e/0xff0 > kernel/locking/mutex.c:621 > [ 403.976782] [] pcpu_alloc+0xbda/0x1280 mm/percpu.c:896 > [ 403.984266] [] __alloc_percpu+0x24/0x30 mm/percpu.c:1075 > [ 403.991873] [] smpcfd_prepare_cpu+0x73/0xd0 > kernel/smp.c:44 > [ 403.999799] [] cpuhp_invoke_callback+0x254/0x1480 > kernel/cpu.c:136 > [ 404.008253] [] cpuhp_up_callbacks+0x81/0x2a0 > kernel/cpu.c:493 > [ 404.016365] [] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057 > [ 404.023507] [] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087 > [ 404.030647] [] cpu_up+0x18/0x20 kernel/cpu.c:1095 > [ 404.037523] [] smp_init+0xe9/0xee kernel/smp.c:564 > [ 404.044559] [] kernel_init_freeable+0x439/0x690 > init/main.c:1010 > [ 404.052811] [] kernel_init+0x13/0x180 init/main.c:941 > [ 404.060198] [] ret_from_fork+0x2a/0x40 > arch/x86/entry/entry_64.S:433 cpu_hotplug_begin cpu_hotplug.lock pcpu_alloc pcpu_alloc_mutex > [ 404.072827] [] validate_chain > kernel/locking/lockdep.c:2265 [inline] > [ 404.072827] [] __lock_acquire+0x2149/0x3430 > kernel/locking/lockdep.c:3338 > [ 404.080733] [] lock_acquire+0x2a1/0x630 > kernel/locking/lockdep.c:3753 > [ 404.088311] [] __mutex_lock_common > kernel/locking/mutex.c:521 [inline] > [ 404.088311] [] mutex_lock_nested+0x24e/0xff0 > kernel/locking/mutex.c:621 > [ 404.096318] [] cpu_hotplug_begin+0x206/0x2e0 > kernel/cpu.c:304 > [ 404.104321] [] _cpu_up+0xca/0x2a0 kernel/cpu.c:1011 > [ 404.111357] [] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087 > [ 404.118480] [] cpu_up+0x18/0x20 kernel/cpu.c:1095 > [ 404.125360] [] smp_init+0xe9/0xee kernel/smp.c:564 > [ 404.132393] [] kernel_init_freeable+0x439/0x690 > 
init/main.c:1010 > [ 404.140668] [] kernel_init+0x13/0x180 init/main.c:941 > [ 404.148079] [] ret_from_fork+0x2a/0x40 > arch/x86/entry/entry_64.S:433 cpu_hotplug_begin cpu_hotplug.lock > [ 404.160977] [] check_prev_add > kernel/locking/lockdep.c:1828 [inline] > [ 404.160977] [] check_prevs_add+0xa8d/0x1c00 > kernel/locking/lockdep.c:1938 > [ 404.168898] [] validate_chain > kernel/locking/lockdep.c:2265 [inline] > [ 404.168898] [] __lock_acquire+0x2149/0x3430 > kernel/locking/lockdep.c:3338 > [ 404.176844] [] lock_acquire+0x2a1/0x630 > kernel/locking/lockdep.c:3753 > [ 404.184416] [] get_online_cpus+0x62/0x90 kernel/cpu.c:248 > [ 404.192103] [] drain_all_pages+0xf8/0x710 > mm/page_alloc.c:2385 > [ 404.199880] [] __alloc_pages_direct_reclaim > mm/page_alloc.c:3440 [inline] > [ 404.199880] [] __alloc_pages_slowpath+0x8fd/0x2370 > mm/page_alloc.c:3778 > [ 404.208406] [] __alloc_pages_nodemask+0x8f5/0xc60 > mm/page_alloc.c:3980 > [ 404.216851] [] __alloc_pages include/linux/gfp.h:426 > [inline] > [ 404.216851] [] __alloc_pages_node > include/linux/gfp.h:439 [inline] > [ 404.216851] [] alloc_pages_node include/linux/gfp.h:453 > [inline] > [ 404.216851] [] pcpu_alloc_pages mm/percpu-vm.c:93 > [inline] > [ 404.216851] [] pcpu_populate_chunk+0x1e1/0x900 > mm/percpu-vm.c:282 > [ 404.225015] [] pcpu_alloc+0xe01/0x1280 mm/percpu.c:998 > [ 404.232482] [] __alloc_percpu_gfp+0x27/0x30 > mm/percpu.c:1062 > [ 404.240389] [] bpf_array_alloc_percpu > kernel/bpf/arraymap.c:34 [inline] > [ 404.240389] [] array_map_alloc+0x532/0x710 > kernel/bpf/arraymap.c:99 > [ 404.248224] [] find_and_alloc_map > kernel/bpf/syscall.c:34 [inline] > [ 404.248224] [] map_create kernel/bpf/syscall.c:188 > [inline] > [ 404.248224] [] SYSC_bpf kernel/bpf/syscall.c:870 [inline] > [ 404.248224] [
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Mon, Feb 06, 2017 at 08:13:35PM +0100, Dmitry Vyukov wrote:
> On Mon, Jan 30, 2017 at 4:48 PM, Dmitry Vyukov wrote:
> > On Sun, Jan 29, 2017 at 6:22 PM, Vlastimil Babka wrote:
> >> On 29.1.2017 13:44, Dmitry Vyukov wrote:
> >>> Hello,
> >>>
> >>> I've got the following deadlock report while running syzkaller fuzzer
> >>> on f37208bc3c9c2f811460ef264909dfbc7f605a60:
> >>>
> >>> [ INFO: possible circular locking dependency detected ]
> >>> 4.10.0-rc5-next-20170125 #1 Not tainted
> >>> ---
> >>> syz-executor3/14255 is trying to acquire lock:
> >>>  (cpu_hotplug.dep_map){++}, at: [] get_online_cpus+0x37/0x90 kernel/cpu.c:239
> >>>
> >>> but task is already holding lock:
> >>>  (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
> >>>
> >>> which lock already depends on the new lock.
> >>
> >> I suspect the dependency comes from recent changes in drain_all_pages().
> >> They were later redone (for other reasons, but nice to have another
> >> validation) in the mmots patch [1], which AFAICS is not yet in mmotm and
> >> thus linux-next. Could you try if it helps?
> >
> > It happened only once on linux-next, so I can't verify the fix. But I
> > will watch out for other occurrences.
>
> Unfortunately it does not seem to help.

I'm a little stuck on how best to handle this. get_online_cpus() can halt
forever if the hotplug operation is holding the mutex when calling
pcpu_alloc. One option would be to add a try_get_online_cpus() helper which
trylocks the mutex. However, given that the drain is so unlikely to actually
make a difference when racing against parallel allocations, I think this
should be acceptable. Any objections?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b93879990fd..a3192447e906 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3432,7 +3432,17 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (!page && !drained) {
 		unreserve_highatomic_pageblock(ac, false);
-		drain_all_pages(NULL);
+
+		/*
+		 * Only drain from contexts allocating for user allocations.
+		 * Kernel allocations could be holding a CPU hotplug-related
+		 * mutex, particularly hot-add allocating per-cpu structures
+		 * while hotplug-related mutexes are held which would prevent
+		 * get_online_cpus ever returning.
+		 */
+		if (gfp_mask & __GFP_HARDWALL)
+			drain_all_pages(NULL);
+
 		drained = true;
 		goto retry;
 	}

--
Mel Gorman
SUSE Labs
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Mon, Jan 30, 2017 at 4:48 PM, Dmitry Vyukov wrote:
> On Sun, Jan 29, 2017 at 6:22 PM, Vlastimil Babka wrote:
>> On 29.1.2017 13:44, Dmitry Vyukov wrote:
>>> Hello,
>>>
>>> I've got the following deadlock report while running syzkaller fuzzer
>>> on f37208bc3c9c2f811460ef264909dfbc7f605a60:
>>>
>>> [ INFO: possible circular locking dependency detected ]
>>> 4.10.0-rc5-next-20170125 #1 Not tainted
>>> ---
>>> syz-executor3/14255 is trying to acquire lock:
>>>  (cpu_hotplug.dep_map){++}, at: [] get_online_cpus+0x37/0x90 kernel/cpu.c:239
>>>
>>> but task is already holding lock:
>>>  (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
>>>
>>> which lock already depends on the new lock.
>>
>> I suspect the dependency comes from recent changes in drain_all_pages(). They
>> were later redone (for other reasons, but nice to have another validation) in
>> the mmots patch [1], which AFAICS is not yet in mmotm and thus linux-next.
>> Could you try if it helps?
>
> It happened only once on linux-next, so I can't verify the fix. But I
> will watch out for other occurrences.

Unfortunately it does not seem to help.
Fuzzer now runs on 510948533b059f4f5033464f9f4a0c32d4ab0c08 of mmotm/auto-latest
(git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git):

commit 510948533b059f4f5033464f9f4a0c32d4ab0c08
Date:   Thu Feb 2 10:08:47 2017 +0100
    mmotm: userfaultfd-non-cooperative-add-event-for-memory-unmaps-fix

The commit you referenced is already there:

commit 806b158031ca0b4714e775898396529a758ebc2c
Date:   Thu Feb 2 08:53:16 2017 +0100
    mm, page_alloc: use static global work_struct for draining per-cpu pages

But I still got:

[ INFO: possible circular locking dependency detected ]
4.9.0 #6 Not tainted
---
syz-executor1/8199 is trying to acquire lock:
 (cpu_hotplug.dep_map){++}, at: [] get_online_cpus+0x37/0x90 kernel/cpu.c:246

but task is already holding lock:
 (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbda/0x1280 mm/percpu.c:896

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (pcpu_alloc_mutex){+.+.+.}:
[ 403.953319] [] validate_chain kernel/locking/lockdep.c:2265 [inline]
[ 403.953319] [] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[ 403.961232] [] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[ 403.968788] [] __mutex_lock_common kernel/locking/mutex.c:521 [inline]
[ 403.968788] [] mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
[ 403.976782] [] pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
[ 403.984266] [] __alloc_percpu+0x24/0x30 mm/percpu.c:1075
[ 403.991873] [] smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
[ 403.999799] [] cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
[ 404.008253] [] cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
[ 404.016365] [] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
[ 404.023507] [] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
[ 404.030647] [] cpu_up+0x18/0x20 kernel/cpu.c:1095
[ 404.037523] [] smp_init+0xe9/0xee kernel/smp.c:564
[ 404.044559] [] kernel_init_freeable+0x439/0x690 init/main.c:1010
[ 404.052811] [] kernel_init+0x13/0x180 init/main.c:941
[ 404.060198] [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

-> #1 (cpu_hotplug.lock){+.+.+.}:
[ 404.072827] [] validate_chain kernel/locking/lockdep.c:2265 [inline]
[ 404.072827] [] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[ 404.080733] [] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[ 404.088311] [] __mutex_lock_common kernel/locking/mutex.c:521 [inline]
[ 404.088311] [] mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
[ 404.096318] [] cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
[ 404.104321] [] _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
[ 404.111357] [] do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
[ 404.118480] [] cpu_up+0x18/0x20 kernel/cpu.c:1095
[ 404.125360] [] smp_init+0xe9/0xee kernel/smp.c:564
[ 404.132393] [] kernel_init_freeable+0x439/0x690 init/main.c:1010
[ 404.140668] [] kernel_init+0x13/0x180 init/main.c:941
[ 404.148079] [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

-> #0 (cpu_hotplug.dep_map){++}:
[ 404.160977] [] check_prev_add kernel/locking/lockdep.c:1828 [inline]
[ 404.160977] [] check_prevs_add+0xa8d/0x1c00 kernel/locking/lockdep.c:1938
[ 404.168898] [] validate_chain kernel/locking/lockdep.c:2265 [inline]
[ 404.168898] [] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[ 404.176844] [] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[ 404.184416] [] get_online_cpus+0x62/0x90 kernel/cpu.c:248
[ 404.192103] [] drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
[ 404.199880] [] __alloc_pages_direct_reclaim mm/page_alloc.c:
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On Sun, Jan 29, 2017 at 6:22 PM, Vlastimil Babka wrote: > On 29.1.2017 13:44, Dmitry Vyukov wrote: >> Hello, >> >> I've got the following deadlock report while running syzkaller fuzzer >> on f37208bc3c9c2f811460ef264909dfbc7f605a60: >> >> [ INFO: possible circular locking dependency detected ] >> 4.10.0-rc5-next-20170125 #1 Not tainted >> --- >> syz-executor3/14255 is trying to acquire lock: >> (cpu_hotplug.dep_map){++}, at: [] >> get_online_cpus+0x37/0x90 kernel/cpu.c:239 >> >> but task is already holding lock: >> (pcpu_alloc_mutex){+.+.+.}, at: [] >> pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897 >> >> which lock already depends on the new lock. > > I suspect the dependency comes from recent changes in drain_all_pages(). They > were later redone (for other reasons, but nice to have another validation) in > the mmots patch [1], which AFAICS is not yet in mmotm and thus linux-next. > Could > you try if it helps? It happened only once on linux-next, so I can't verify the fix. But I will watch out for other occurrences.
Re: mm: deadlock between get_online_cpus/pcpu_alloc
On 29.1.2017 13:44, Dmitry Vyukov wrote:
> Hello,
>
> I've got the following deadlock report while running syzkaller fuzzer
> on f37208bc3c9c2f811460ef264909dfbc7f605a60:
>
> [ INFO: possible circular locking dependency detected ]
> 4.10.0-rc5-next-20170125 #1 Not tainted
> ---
> syz-executor3/14255 is trying to acquire lock:
>  (cpu_hotplug.dep_map){++}, at: [] get_online_cpus+0x37/0x90 kernel/cpu.c:239
>
> but task is already holding lock:
>  (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
>
> which lock already depends on the new lock.

I suspect the dependency comes from recent changes in drain_all_pages(). They
were later redone (for other reasons, but nice to have another validation) in
the mmots patch [1], which AFAICS is not yet in mmotm and thus linux-next.
Could you try if it helps?

Vlastimil

[1] http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-use-static-global-work_struct-for-draining-per-cpu-pages.patch

>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (pcpu_alloc_mutex){+.+.+.}:
>
> [] validate_chain kernel/locking/lockdep.c:2265 [inline]
> [] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
> [] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
> [] __mutex_lock_common kernel/locking/mutex.c:757 [inline]
> [] __mutex_lock+0x382/0x25c0 kernel/locking/mutex.c:894
> [] mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:909
> [] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
> [] __alloc_percpu+0x24/0x30 mm/percpu.c:1076
> [] smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:47
> [] cpuhp_invoke_callback+0x256/0x1480 kernel/cpu.c:136
> [] cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:425
> [] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:940
> [] do_cpu_up+0x73/0xa0 kernel/cpu.c:970
> [] cpu_up+0x18/0x20 kernel/cpu.c:978
> [] smp_init+0x148/0x160 kernel/smp.c:565
> [] kernel_init_freeable+0x43e/0x695 init/main.c:1026
> [] kernel_init+0x13/0x180 init/main.c:955
> [] ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
>
> -> #1 (cpu_hotplug.lock){+.+.+.}:
>
> [] validate_chain kernel/locking/lockdep.c:2265 [inline]
> [] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
> [] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
> [] __mutex_lock_common kernel/locking/mutex.c:757 [inline]
> [] __mutex_lock+0x382/0x25c0 kernel/locking/mutex.c:894
> [] mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:909
> [] cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:297
> [] _cpu_up+0xca/0x2a0 kernel/cpu.c:894
> [] do_cpu_up+0x73/0xa0 kernel/cpu.c:970
> [] cpu_up+0x18/0x20 kernel/cpu.c:978
> [] smp_init+0x148/0x160 kernel/smp.c:565
> [] kernel_init_freeable+0x43e/0x695 init/main.c:1026
> [] kernel_init+0x13/0x180 init/main.c:955
> [] ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
>
> -> #0 (cpu_hotplug.dep_map){++}:
>
> [] check_prev_add kernel/locking/lockdep.c:1828 [inline]
> [] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1938
> [] validate_chain kernel/locking/lockdep.c:2265 [inline]
> [] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
> [] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
> [] get_online_cpus+0x62/0x90 kernel/cpu.c:241
> [] drain_all_pages.part.98+0x8c/0x8f0 mm/page_alloc.c:2371
> [] drain_all_pages mm/page_alloc.c:2364 [inline]
> [] __alloc_pages_direct_reclaim mm/page_alloc.c:3435 [inline]
> [] __alloc_pages_slowpath+0x966/0x23d0 mm/page_alloc.c:3773
> [] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3975
> [] __alloc_pages include/linux/gfp.h:426 [inline]
> [] __alloc_pages_node include/linux/gfp.h:439 [inline]
> [] alloc_pages_node include/linux/gfp.h:453 [inline]
> [] pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
> [] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
> [] pcpu_alloc+0xe15/0x1290 mm/percpu.c:999
> [] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1063
> [] bpf_array_alloc_percpu kernel/bpf/arraymap.c:33 [inline]
> [] array_map_alloc+0x543/0x700 kernel/bpf/arraymap.c:94
> [] find_and_alloc_map kernel/bpf/syscall.c:37 [inline]
> [] map_create kernel/bpf/syscall.c:228 [inline]
> [] SYSC_bpf kernel/bpf/syscall.c:1040 [inline]
> [] SyS_bpf+0x108d/0x27c0 kernel/bpf/syscall.c:997
> [] entry_SYSCALL_64_fastpath+0x1f/0xc2
>
> other info that might help us debug this:
>
> Chain exists of:
>   cpu_hotplug.dep_map --> cpu_hotplug.lock --> pcpu_alloc_mutex
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>   lock(pcpu_alloc_mutex);
>                                lock(cpu_hotplug.lock);
>                                lock(pcpu_alloc_mutex);
>   lock(cpu_hotplug.dep_map);
>
> *** DEADLOCK ***
>
> 1 lock held by syz-executor3/14255:
>  #0: (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
>
> stack backtrace:
> CPU: 1 PID: 14255 Comm: syz-executor3 Not tainted 4.10.0-rc5-next-20170125 #1
> Hardware name: Google Goo
mm: deadlock between get_online_cpus/pcpu_alloc
Hello,

I've got the following deadlock report while running syzkaller fuzzer on
f37208bc3c9c2f811460ef264909dfbc7f605a60:

[ INFO: possible circular locking dependency detected ]
4.10.0-rc5-next-20170125 #1 Not tainted
---
syz-executor3/14255 is trying to acquire lock:
 (cpu_hotplug.dep_map){++}, at: [] get_online_cpus+0x37/0x90 kernel/cpu.c:239

but task is already holding lock:
 (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (pcpu_alloc_mutex){+.+.+.}:
[] validate_chain kernel/locking/lockdep.c:2265 [inline]
[] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[] __mutex_lock_common kernel/locking/mutex.c:757 [inline]
[] __mutex_lock+0x382/0x25c0 kernel/locking/mutex.c:894
[] mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:909
[] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897
[] __alloc_percpu+0x24/0x30 mm/percpu.c:1076
[] smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:47
[] cpuhp_invoke_callback+0x256/0x1480 kernel/cpu.c:136
[] cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:425
[] _cpu_up+0x1e3/0x2a0 kernel/cpu.c:940
[] do_cpu_up+0x73/0xa0 kernel/cpu.c:970
[] cpu_up+0x18/0x20 kernel/cpu.c:978
[] smp_init+0x148/0x160 kernel/smp.c:565
[] kernel_init_freeable+0x43e/0x695 init/main.c:1026
[] kernel_init+0x13/0x180 init/main.c:955
[] ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

-> #1 (cpu_hotplug.lock){+.+.+.}:
[] validate_chain kernel/locking/lockdep.c:2265 [inline]
[] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[] __mutex_lock_common kernel/locking/mutex.c:757 [inline]
[] __mutex_lock+0x382/0x25c0 kernel/locking/mutex.c:894
[] mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:909
[] cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:297
[] _cpu_up+0xca/0x2a0 kernel/cpu.c:894
[] do_cpu_up+0x73/0xa0 kernel/cpu.c:970
[] cpu_up+0x18/0x20 kernel/cpu.c:978
[] smp_init+0x148/0x160 kernel/smp.c:565
[] kernel_init_freeable+0x43e/0x695 init/main.c:1026
[] kernel_init+0x13/0x180 init/main.c:955
[] ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

-> #0 (cpu_hotplug.dep_map){++}:
[] check_prev_add kernel/locking/lockdep.c:1828 [inline]
[] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1938
[] validate_chain kernel/locking/lockdep.c:2265 [inline]
[] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
[] lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
[] get_online_cpus+0x62/0x90 kernel/cpu.c:241
[] drain_all_pages.part.98+0x8c/0x8f0 mm/page_alloc.c:2371
[] drain_all_pages mm/page_alloc.c:2364 [inline]
[] __alloc_pages_direct_reclaim mm/page_alloc.c:3435 [inline]
[] __alloc_pages_slowpath+0x966/0x23d0 mm/page_alloc.c:3773
[] __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3975
[] __alloc_pages include/linux/gfp.h:426 [inline]
[] __alloc_pages_node include/linux/gfp.h:439 [inline]
[] alloc_pages_node include/linux/gfp.h:453 [inline]
[] pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
[] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
[] pcpu_alloc+0xe15/0x1290 mm/percpu.c:999
[] __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1063
[] bpf_array_alloc_percpu kernel/bpf/arraymap.c:33 [inline]
[] array_map_alloc+0x543/0x700 kernel/bpf/arraymap.c:94
[] find_and_alloc_map kernel/bpf/syscall.c:37 [inline]
[] map_create kernel/bpf/syscall.c:228 [inline]
[] SYSC_bpf kernel/bpf/syscall.c:1040 [inline]
[] SyS_bpf+0x108d/0x27c0 kernel/bpf/syscall.c:997
[] entry_SYSCALL_64_fastpath+0x1f/0xc2

other info that might help us debug this:

Chain exists of:
  cpu_hotplug.dep_map --> cpu_hotplug.lock --> pcpu_alloc_mutex

Possible unsafe locking scenario:

       CPU0                    CPU1
  lock(pcpu_alloc_mutex);
                               lock(cpu_hotplug.lock);
                               lock(pcpu_alloc_mutex);
  lock(cpu_hotplug.dep_map);

*** DEADLOCK ***

1 lock held by syz-executor3/14255:
 #0: (pcpu_alloc_mutex){+.+.+.}, at: [] pcpu_alloc+0xbfe/0x1290 mm/percpu.c:897

stack backtrace:
CPU: 1 PID: 14255 Comm: syz-executor3 Not tainted 4.10.0-rc5-next-20170125 #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:15 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1202
 check_prev_add kernel/locking/lockdep.c:1828 [inline]
 check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1938
 validate_chain kernel/locking/lockdep.c:2265 [inline]
 __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3338
 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
 get_online_cpus+0x62/0x90 kernel/cpu.c:241
 drain_all_pages.part.98+0x8c/0x8f0 mm/page_alloc.c:2371
 drain_all_pages mm/page_alloc.c:2364 [inline]
 __alloc