Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-09-01 Thread Peter Zijlstra
On Thu, Sep 01, 2016 at 10:17:57PM +0800, Boqun Feng wrote:
> On Thu, Sep 01, 2016 at 08:57:38AM +0200, Peter Zijlstra wrote:
> Could there be some code that relies on the full barrier semantics of
> schedule() to provide transitivity?

Could, sure, who knows. RCU might, although Paul typically sticks in
smp_mb just to be safe.

The kernel coming apart when you remove that sync would prove this
fairly quick though.

Like Ben said, it not coming apart doesn't prove anything. It coming
apart does however prove something, namely that it is required :-)

> IIUC, the Program Order Guarantee you stated before try_to_wake_up()
> only has ordering effect on wakers and wakees.

Right, the scheduler only does RCpc guarantees.
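
To make the transitivity question concrete, here is a rough userspace sketch (C11 atomics, purely illustrative, not kernel code) of the classic IRIW shape: with acquire/release only (RCpc-style) the two readers are allowed to disagree on the order of the two independent stores, whereas making every access seq_cst (the analogue of full smp_mb()/sync everywhere) forbids that outcome. Whether the relaxed outcome is ever observed depends on the hardware (POWER yes, x86 no), and a naive loop like this is a poor substitute for proper litmus tools:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r1, r2, r3, r4;

static void *w_x(void *a) { atomic_store_explicit(&x, 1, memory_order_release); return NULL; }
static void *w_y(void *a) { atomic_store_explicit(&y, 1, memory_order_release); return NULL; }

/* two independent readers, each looking at the stores in opposite order */
static void *r_xy(void *a)
{
        r1 = atomic_load_explicit(&x, memory_order_acquire);
        r2 = atomic_load_explicit(&y, memory_order_acquire);
        return NULL;
}

static void *r_yx(void *a)
{
        r3 = atomic_load_explicit(&y, memory_order_acquire);
        r4 = atomic_load_explicit(&x, memory_order_acquire);
        return NULL;
}

int main(void)
{
        for (int i = 0; i < 100000; i++) {
                pthread_t t[4];
                atomic_store(&x, 0);
                atomic_store(&y, 0);
                pthread_create(&t[0], NULL, w_x, NULL);
                pthread_create(&t[1], NULL, w_y, NULL);
                pthread_create(&t[2], NULL, r_xy, NULL);
                pthread_create(&t[3], NULL, r_yx, NULL);
                for (int j = 0; j < 4; j++)
                        pthread_join(t[j], NULL);
                /* the readers disagreeing on the store order is allowed
                 * under acquire/release, forbidden under seq_cst */
                if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0)
                        printf("non-transitive outcome at iteration %d\n", i);
        }
        return 0;
}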


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-09-01 Thread Boqun Feng
On Thu, Sep 01, 2016 at 08:57:38AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 07:47:10AM +1000, Benjamin Herrenschmidt wrote:
> 
> > > OK, for giggles, could you (or Balbir) check what happens if you take
> > > that sync out?
> 
> > The problem is no amount of testing can tell you it works for sure :-)
> 
> It breaking does prove the negative though, so still interesting.
> 
> > I would be nervous not having a real full sync in _switch. All we have
> > along the scheduler path is lwsync's and our isync based load construct
> > for spin_lock, I'm not sure what other assumptions we have around that
> > sync in there...
> 
> Only one way to find out  ;-)
> 
> I'm not saying you should commit that change, just curious if (and how
> fast) it would come apart.
> 
> At the very least we could update the comment that goes with that sync.

Could there be some code that relies on the full barrier semantics of
schedule() to provide transitivity?

IIUC, the Program Order Guarantee you stated before try_to_wake_up()
only has ordering effect on wakers and wakees.

Regards,
Boqun


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-09-01 Thread Alexey Kardashevskiy
On 01/09/16 11:48, Alexey Kardashevskiy wrote:
> On 31/08/16 17:28, Peter Zijlstra wrote:
>> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>>> On 30/08/16 22:19, Peter Zijlstra wrote:
 On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>
>
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

 As Nick says (good to see you're back Nick!), this is unrelated to
 rwsems.

 This is true for pretty much every blocking wait loop out there, they
 all do:

for (;;) {
        current->state = UNINTERRUPTIBLE;
        smp_mb();
        if (cond)
                break;
        schedule();
}
current->state = RUNNING;

 Which, if the wakeup is spurious, is just the pattern you need.
>>>
>>> Yes, true! My bad, Alexey had seen the same basic pattern; I should have been
>>> clearer in my commit log. Should I resend the patch?
>>
>> Yes please.
>>
 There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
 to PPC, is _not_ MB. It is however sufficient for this case.

>>>
>>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
>>> different thread.
>>
>> Right, although even without that, there is sufficient ordering, as the
>> rq unlock from the wakeup, coupled with the rq lock from the schedule
>> already form a load-store barrier.
>>
 Now, this has been present for a fair while, I suspect ever since we
 reworked the wakeup path to not use rq->lock twice. Curious you only now
 hit it.

>>>
>>> Yes, I just hit it a week or two back and I needed to collect data to
>>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>>> I needed a system with a large number of threads and low memory running stress-ng.
>>> Reproducing the problem takes an unpredictable amount of time.
>>
>> What hardware do you see this on, is it shiny new Power8 chips which
>> have never before seen deep queues or something. Or is it 'regular' old
>> Power7 like stuff?
> 
> I am seeing it on POWER8 with KVM and 2 guests, each having 3 virtio-net
> devices with vhost enabled, all virtio-net devices are connected to the
> same virtual bridge on the host (via /dev/tap*) and are doing lots of
> traffic, just between these 2 guests.
> 
> I remember doing the same test on POWER7 more than 2 years ago and finding
> missing barriers in virtio but nothing like this one. But POWER7 is
> seriously slower than POWER8 so it seems that nobody bothered with loading
> it that much.
> 
> I wonder how to reproduce the bug quicker, as sometimes it works for days with
> no fault but sometimes it fails within the first 30 minutes (the backtraces from
> the 2 stuck CPUs are the same though). Does anyone have an idea (kernel hacks,
> taskset, type of traffic, etc.)? As Nick suggested, I changed cpus_share_cache()
> to return "false" so ttwu_queue() would always go via the ttwu_queue_remote()
> path, but this did not make any difference.


Below are the backtraces of the 2 stuck CPUs (they could not be stopped in xmon
like the others; I got these via the BMC). I am also adding Michael on cc:, just in case...

0:mon> t 0xc00fd4ab79b0
[c00fd4ab79b0] c01330c4 do_raw_spin_lock+0x1f4/0x260
[c00fd4ab79f0] c0a368e8 _raw_spin_lock_irqsave+0x98/0xd0
[c00fd4ab7a30] c011ecbc remove_wait_queue+0x2c/0x70
[c00fd4ab7a70] d000166e0364 vhost_poll_stop+0x34/0x60 [vhost]
[c00fd4ab7aa0] d000167a08c8 handle_rx+0x168/0x8d0 [vhost_net]
[c00fd4ab7c80] d000166e04e0 vhost_worker+0x120/0x1d0 [vhost]
[c00fd4ab7d00] c00ebad8 kthread+0x118/0x140
[c00fd4ab7e30] c00098e8 ret_from_kernel_thread+0x5c/0x74

0:mon> t 0xc01fff41b5a0
[c01fff41b5a0] c00fc3ac try_to_wake_up+0x5c/0x6c0
[c01fff41b630] d000166e0fcc vhost_work_queue+0x6c/0x90 [vhost]
[c01fff41b660] d000166e205c vhost_poll_wakeup+0x3c/0x50 [vhost]
[c01fff41b680] c011e6d4 __wake_up_common+0x84/0xf0
[c01fff41b6e0] c011ee98 __wake_up_sync_key+0x68/0xa0
[c01fff41b730] c08defd0 sock_def_readable+0x80/0x180
[c01fff41b760] d000167438ac tun_net_xmit+0x52c/0x5d0 [tun]
[c01fff41b7b0] c0906dd8 dev_hard_start_xmit+0x178/0x450
[c01fff41b870] c0938d6c sch_direct_xmit+0x11c/0x250
[c01fff41b910] c0907840 __dev_queue_xmit+0x560/0x8c0
[c01fff41b9c0] d00016553c80 br_dev_queue_push_xmit+0xb0/0x250 [bridge]
[c01fff41ba00] d00016553e54 br_forward_finish+0x34/0xe0 [bridge]
[c01fff41ba80] d00016554044 __br_forward.isra.0+0x144/0x220 [bridge]
[c01fff41baf0] d00016555e78 br_handle_frame_finish+0x148/0x570 [bridge]
[c01fff41bb80] d0001655654c br_handle_frame+0x24c/0x400 [bridge]
[c01fff41bc10] c08fed28 __netif_receive_skb_core+0x498/0xc60
[c01fff41bcf0] 

Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-09-01 Thread Peter Zijlstra
On Thu, Sep 01, 2016 at 07:47:10AM +1000, Benjamin Herrenschmidt wrote:

> > OK, for giggles, could you (or Balbir) check what happens if you take
> > that sync out?

> The problem is no amount of testing can tell you it works for sure :-)

It breaking does prove the negative though, so still interesting.

> I would be nervous not having a real full sync in _switch. All we have
> along the scheduler path is lwsync's and our isync based load construct
> for spin_lock, I'm not sure what other assumptions we have around that
> sync in there...

Only one way to find out  ;-)

I'm not saying you should commit that change, just curious if (and how
fast) it would come apart.

At the very least we could update the comment that goes with that sync.


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-09-01 Thread Balbir Singh


On 01/09/16 07:47, Benjamin Herrenschmidt wrote:
> On Wed, 2016-08-31 at 15:31 +0200, Peter Zijlstra wrote:
>> On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt
>> wrote:
>>
>>>
>>> On powerpc we have a sync deep in _switch to achieve that.
>>
>> OK, for giggles, could you (or Balbir) check what happens if you take
>> that sync out?
>>
>> There should be enough serialization in the generic code to cover the
>> case that code mentions.
>>
>> ARM64 has a stronger barrier in its context switch code, but that's
>> because they need to sync against external agents (like their TLB and
>> cache) and no amount of generic locking is going to cover that.
> 
> The problem is no amount of testing can tell you it works for sure :-)
> 
> I would be nervous not having a real full sync in _switch. All we have
> along the scheduler path is lwsync's and our isync based load construct
> for spin_lock, I'm not sure what other assumptions we have around that
> sync in there...
> 

I would agree; I am not sure of the assumptions either.

Balbir Singh.


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Alexey Kardashevskiy
On 31/08/16 17:28, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:


 The origin of the issue I've seen seems to be related to
 rwsem spin lock stealing. Basically I see the system deadlock'd in the
 following state
>>>
>>> As Nick says (good to see you're back Nick!), this is unrelated to
>>> rwsems.
>>>
>>> This is true for pretty much every blocking wait loop out there, they
>>> all do:
>>>
>>> for (;;) {
>>>         current->state = UNINTERRUPTIBLE;
>>>         smp_mb();
>>>         if (cond)
>>>                 break;
>>>         schedule();
>>> }
>>> current->state = RUNNING;
>>>
>>> Which, if the wakeup is spurious, is just the pattern you need.
>>
>> Yes, true! My bad, Alexey had seen the same basic pattern; I should have been
>> clearer in my commit log. Should I resend the patch?
> 
> Yes please.
> 
>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>> to PPC, is _not_ MB. It is however sufficient for this case.
>>>
>>
>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
>> different thread.
> 
> Right, although even without that, there is sufficient ordering, as the
> rq unlock from the wakeup, coupled with the rq lock from the schedule
> already form a load-store barrier.
> 
>>> Now, this has been present for a fair while, I suspect ever since we
>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>> hit it.
>>>
>>
>> Yes, I just hit it a week or two back and I needed to collect data to
>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>> I needed a system with a large number of threads and low memory running stress-ng.
>> Reproducing the problem takes an unpredictable amount of time.
> 
> What hardware do you see this on, is it shiny new Power8 chips which
> have never before seen deep queues or something. Or is it 'regular' old
> Power7 like stuff?

I am seeing it on POWER8 with KVM and 2 guests, each having 3 virtio-net
devices with vhost enabled, all virtio-net devices are connected to the
same virtual bridge on the host (via /dev/tap*) and are doing lots of
traffic, just between these 2 guests.

I remember doing the same test on POWER7 more than 2 years ago and finding
missing barriers in virtio but nothing like this one. But POWER7 is
seriously slower than POWER8 so it seems that nobody bothered with loading
it that much.

I wonder how to reproduce the bug quicker, as sometimes it works for days with no
fault but sometimes it fails within the first 30 minutes (the backtraces from the
2 stuck CPUs are the same though). Does anyone have an idea (kernel hacks, taskset,
type of traffic, etc.)? As Nick suggested, I changed cpus_share_cache() to
return "false" so ttwu_queue() would always go via the ttwu_queue_remote() path,
but this did not make any difference.
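
For reference, the hack was essentially the following (a sketch from memory, not a patch to apply; the stock body compares the per-CPU LLC ids):

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
 bool cpus_share_cache(int this_cpu, int that_cpu)
 {
-	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
+	/* debug hack: pretend no two CPUs share a cache so that
+	 * ttwu_queue() always takes the ttwu_queue_remote() IPI path */
+	return false;
 }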



-- 
Alexey


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Benjamin Herrenschmidt
On Wed, 2016-08-31 at 15:31 +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt
> wrote:
> 
> > 
> > On powerpc we have a sync deep in _switch to achieve that.
> 
> OK, for giggles, could you (or Balbir) check what happens if you take
> that sync out?
> 
> There should be enough serialization in the generic code to cover the
> case that code mentions.
> 
> ARM64 has a stronger barrier in its context switch code, but that's
> because they need to sync against external agents (like their TLB and
> cache) and no amount of generic locking is going to cover that.

The problem is no amount of testing can tell you it works for sure :-)

I would be nervous not having a real full sync in _switch. All we have
along the scheduler path is lwsync's and our isync based load construct
for spin_lock, I'm not sure what other assumptions we have around that
sync in there...

Cheers,
Ben.



Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Peter Zijlstra
On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt wrote:

> On powerpc we have a sync deep in _switch to achieve that.

OK, for giggles, could you (or Balbir) check what happens if you take
that sync out?

There should be enough serialization in the generic code to cover the
case that code mentions.

ARM64 has a stronger barrier in its context switch code, but that's
because they need to sync against external agents (like their TLB and
cache) and no amount of generic locking is going to cover that.




Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Benjamin Herrenschmidt
On Wed, 2016-08-31 at 09:18 +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt
> wrote:
> > 
> > It's always been a requirement that if you actually context switch
> > a
> > full mb() is implied ...
> 
> > 
> > On powerpc we have a sync deep in _switch to achieve that.
> 
> OK, fair enough. I must've missed it in the x86 switch_to, must be
> one
> of those implied serializing instructions I'm not too familiar with.
> 
> > 
> > (though that isn't the case if you don't actually
> > switch, ie, you are back to RUNNING before you even hit schedule).
> 
> Right, which invalidates the claim that schedule() implies a full mb,

Right, it's only a full mb if you actually schedule to another process :-)

> > 
> > This is necessary so that a process who wakes up on a different CPU
> > sees
> > all of its own load/stores.
> 
> Don't actually think its needed for that, see the comment from
> 8643cda549ca4, the scheduler has enough barriers to guarantee
> Program-Order for tasks without that.
> 


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Benjamin Herrenschmidt
On Wed, 2016-08-31 at 09:20 +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 07:25:01AM +1000, Benjamin Herrenschmidt wrote:
> > 
> > On Tue, 2016-08-30 at 15:04 +0200, Oleg Nesterov wrote:
> > > 
> > > 
> > > Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> > > leak into the critical section.
> > > 
> > > But context switch should imply mb() we can rely on?
> > 
> > Between setting of ->on_rq and returning to the task so it can
> > change its state back to [UN]INTERRUPTIBLE, there will be at least one
> > write barrier (spin unlock of the rq),
> 
> spin-unlock is _not_ a write barrier, its a RELEASE barrier, and is not
> sufficient for this.

Ah yes, well, it's an lwsync so it's a wmb for us :-).
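
For reference, roughly the mapping behind that remark (a simplified sketch, not a
verbatim copy of arch/powerpc/include/asm/barrier.h; the macro names below are
illustrative, not the kernel's own):

#define ppc_sync()    __asm__ __volatile__ ("sync"   : : : "memory") /* smp_mb(): the full barrier, the one in _switch */
#define ppc_lwsync()  __asm__ __volatile__ ("lwsync" : : : "memory") /* smp_wmb()/smp_rmb(), and the RELEASE half of unlock */

/* lwsync orders every combination except store->load, which is why an
 * unlock behaves as a write barrier on powerpc yet is still only a
 * RELEASE, not a full smp_mb(). */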

> > possibly even a full barrier
> > (context switch). The write barrier is enough so I didn't dig to make
> > sure we always context switch in the scenario we're looking at but I
> > think we do.
> 
> There is enough, you just need to pair the RELEASE with an ACQUIRE to
> get a full load-store barrier.

Right so I *think* there will be at least the release of the rq_lock by
the IPI followed by schedule itself taking and releasing it again, but
I can't vouch for it. As I said, I didn't dig deeper on that side of
things as for us a spin_unlock is a write barrier and for the write
side that concerns me here it's sufficient ;-) It's the read side that
has a problem.

That said you may want to investigate more to make sure there is no way
out of schedule where that spin_unlock is the only thing between
setting on_rq and coming out (which leads to setting the task state).

I suspect there will be at least one more re-acquisition & release of
the rq lock but I may be wrong.

Cheers,
Ben.



Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Benjamin Herrenschmidt
On Wed, 2016-08-31 at 09:28 +0200, Peter Zijlstra wrote:
> 
> What hardware do you see this on, is it shiny new Power8 chips which
> have never before seen deep queues or something. Or is it 'regular'
> old Power7 like stuff?

Power8 which isn't *that* new these days...

Cheers,
Ben.



Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Balbir Singh


On 31/08/16 17:28, Peter Zijlstra wrote:
> On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
>> On 30/08/16 22:19, Peter Zijlstra wrote:
>>> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:


 The origin of the issue I've seen seems to be related to
 rwsem spin lock stealing. Basically I see the system deadlock'd in the
 following state
>>>
>>> As Nick says (good to see you're back Nick!), this is unrelated to
>>> rwsems.
>>>
>>> This is true for pretty much every blocking wait loop out there, they
>>> all do:
>>>
>>> for (;;) {
>>>         current->state = UNINTERRUPTIBLE;
>>>         smp_mb();
>>>         if (cond)
>>>                 break;
>>>         schedule();
>>> }
>>> current->state = RUNNING;
>>>
>>> Which, if the wakeup is spurious, is just the pattern you need.
>>
>> Yes, true! My bad, Alexey had seen the same basic pattern; I should have been
>> clearer in my commit log. Should I resend the patch?
> 
> Yes please.
> 

Done, just now. I've tried to generalize the issue, but I've kept the example.

>>> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
>>> to PPC, is _not_ MB. It is however sufficient for this case.
>>>
>>
>> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
>> different thread.
> 
> Right, although even without that, there is sufficient ordering, as the
> rq unlock from the wakeup, coupled with the rq lock from the schedule
> already form a load-store barrier.
> 
>>> Now, this has been present for a fair while, I suspect ever since we
>>> reworked the wakeup path to not use rq->lock twice. Curious you only now
>>> hit it.
>>>
>>
>> Yes, I just hit it a week or two back and I needed to collect data to
>> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
>> I needed a system with a large number of threads and low memory running stress-ng.
>> Reproducing the problem takes an unpredictable amount of time.
> 
> What hardware do you see this on, is it shiny new Power8 chips which
> have never before seen deep queues or something. Or is it 'regular' old
> Power7 like stuff?
> 

I don't think the issue is processor-specific, it is probabilistic, but
I've not tested on Power7. I am seeing it on a Power8 system.


Balbir Singh


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Peter Zijlstra
On Wed, Aug 31, 2016 at 01:41:33PM +1000, Balbir Singh wrote:
> On 30/08/16 22:19, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
> >>
> >>
> >> The origin of the issue I've seen seems to be related to
> >> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> >> following state
> > 
> > As Nick says (good to see you're back Nick!), this is unrelated to
> > rwsems.
> > 
> > This is true for pretty much every blocking wait loop out there, they
> > all do:
> > 
> > for (;;) {
> >         current->state = UNINTERRUPTIBLE;
> >         smp_mb();
> >         if (cond)
> >                 break;
> >         schedule();
> > }
> > current->state = RUNNING;
> > 
> > Which, if the wakeup is spurious, is just the pattern you need.
> 
> Yes, true! My bad, Alexey had seen the same basic pattern; I should have been
> clearer in my commit log. Should I resend the patch?

Yes please.

> > There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
> > to PPC, is _not_ MB. It is however sufficient for this case.
> > 
> 
> The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
> different thread.

Right, although even without that, there is sufficient ordering, as the
rq unlock from the wakeup, coupled with the rq lock from the schedule
already form a load-store barrier.

> > Now, this has been present for a fair while, I suspect ever since we
> > reworked the wakeup path to not use rq->lock twice. Curious you only now
> > hit it.
> > 
> 
> Yes, I just hit it a week or two back and I needed to collect data to
> explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
> I needed a system with a large number of threads and low memory running stress-ng.
> Reproducing the problem takes an unpredictable amount of time.

What hardware do you see this on, is it shiny new Power8 chips which
have never before seen deep queues or something. Or is it 'regular' old
Power7 like stuff?


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Peter Zijlstra
On Wed, Aug 31, 2016 at 07:25:01AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2016-08-30 at 15:04 +0200, Oleg Nesterov wrote:
> > 
> > Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> > leak into the critical section.
> > 
> > But context switch should imply mb() we can rely on?
> 
> Between setting of ->on_rq and returning to the task so it can
> change its state back to [UN]INTERRUPTIBLE, there will be at least one
> write barrier (spin unlock of the rq),

spin-unlock is _not_ a write barrier, its a RELEASE barrier, and is not
sufficient for this.

> possibly even a full barrier
> (context switch). The write barrier is enough so I didn't dig to make
> sure we always context switch in the scenario we're looking at but I
> think we do.

There is enough, you just need to pair the RELEASE with an ACQUIRE to
get a full load-store barrier.
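
Sketched in userspace terms (C11 atomics, purely illustrative, not the scheduler
code itself): once the "lock" acquire observes the "unlock" release, everything
before the release, including the ->on_rq store, is guaranteed to be visible
afterwards:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int on_rq;    /* stands in for p->on_rq */
static atomic_int rq_lock;  /* 1 == "rq->lock released by the wakeup path" */

static void *wakeup_side(void *arg)
{
        atomic_store_explicit(&on_rq, 1, memory_order_relaxed);    /* [S] p->on_rq = 1 */
        atomic_store_explicit(&rq_lock, 1, memory_order_release);  /* UNLOCK rq->lock (RELEASE) */
        return NULL;
}

static void *schedule_side(void *arg)
{
        /* LOCK rq->lock (ACQUIRE): spin until the release above is observed */
        while (!atomic_load_explicit(&rq_lock, memory_order_acquire))
                ;
        /* the ACQUIRE pairs with the RELEASE, so the earlier on_rq store
         * must be visible here even though it was only a relaxed store */
        assert(atomic_load_explicit(&on_rq, memory_order_relaxed) == 1);
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, wakeup_side, NULL);
        pthread_create(&b, NULL, schedule_side, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}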


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-31 Thread Peter Zijlstra
On Wed, Aug 31, 2016 at 07:28:18AM +1000, Benjamin Herrenschmidt wrote:
> It's always been a requirement that if you actually context switch a
> full mb() is implied ...

> On powerpc we have a sync deep in _switch to achieve that.

OK, fair enough. I must've missed it in the x86 switch_to, must be one
of those implied serializing instructions I'm not too familiar with.

> (though that isn't the case if you don't actually
> switch, ie, you are back to RUNNING before you even hit schedule).

Right, which invalidates the claim that schedule() implies a full mb,

> This is necessary so that a process who wakes up on a different CPU sees
> all of its own load/stores.

Don't actually think its needed for that, see the comment from
8643cda549ca4, the scheduler has enough barriers to guarantee
Program-Order for tasks without that.




Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Balbir Singh


On 30/08/16 22:19, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
>>
>>
>> The origin of the issue I've seen seems to be related to
>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>> following state
> 
> As Nick says (good to see you're back Nick!), this is unrelated to
> rwsems.
> 
> This is true for pretty much every blocking wait loop out there, they
> all do:
> 
>   for (;;) {
>           current->state = UNINTERRUPTIBLE;
>           smp_mb();
>           if (cond)
>                   break;
>           schedule();
>   }
>   current->state = RUNNING;
> 
> Which, if the wakeup is spurious, is just the pattern you need.

Yes, true! My bad, Alexey had seen the same basic pattern; I should have been
clearer in my commit log. Should I resend the patch?

> 
>> +++ b/kernel/sched/core.c
>> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int 
>> state, int wake_flags)
>>  success = 1; /* we're going to change ->state */
>>  cpu = task_cpu(p);
>>  
>> +/*
>> + * Ensure we see on_rq and p_state consistently
>> + *
>> + * For example in __rwsem_down_write_failed(), we have
>> + *[S] ->on_rq = 1   [L] ->state
>> + *MB RMB
> 
> There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
> to PPC, is _not_ MB. It is however sufficient for this case.
> 

The MB comes from the __switch_to() in schedule(). Ben mentioned it in a 
different thread.

>> + *[S] ->state = TASK_UNINTERRUPTIBLE[L] ->on_rq
>> + * In the absence of the RMB p->on_rq can be observed to be 0
>> + * and we end up spinning indefinitely in while (p->on_cpu)
>> + */
> 
> 
>   /*
>* Ensure we load p->on_rq _after_ p->state, otherwise it would
>* be possible to, falsely, observe p->on_rq == 0 and get stuck
>* in smp_cond_load_acquire() below.
>*
>* sched_ttwu_pending() try_to_wake_up()
>*   [S] p->on_rq = 1;  [L] P->state
>*   UNLOCK rq->lock
>*
>* schedule()   RMB
>*   LOCK rq->lock
>*   UNLOCK rq->lock
>*
>* [task p]
>*   [S] p->state = UNINTERRUPTIBLE [L] p->on_rq
>*
>* Pairs with the UNLOCK+LOCK on rq->lock from the
>* last wakeup of our task and the schedule that got our task
>* current.
>*/
> 
>> +smp_rmb();
>>  if (p->on_rq && ttwu_remote(p, wake_flags))
>>  goto stat;
>>  
> 
> 
> Now, this has been present for a fair while, I suspect ever since we
> reworked the wakeup path to not use rq->lock twice. Curious you only now
> hit it.
> 

Yes, I just hit it a week or two back and I needed to collect data to
explain why p->on_rq got to 0. Hitting it requires extreme stress -- for me
I needed a system with a large number of threads and low memory running stress-ng.
Reproducing the problem takes an unpredictable amount of time.

Balbir Singh.
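
The shape in the comment above ([S] p->on_rq / [L] p->state on the waker side,
[S] p->state / [L] p->on_rq on the task side) is the classic store-buffering
pattern. A rough userspace sketch (C11 atomics, illustrative only, not kernel
code; a naive harness like this rarely observes the reordering, litmus tools do
a much better job) of the outcome the smp_rmb() plus the UNLOCK+LOCK pairing is
there to forbid:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int on_rq, state;   /* stand-ins for p->on_rq and p->state */
static int seen_state, seen_on_rq;

static void *waker(void *arg)
{
        atomic_store_explicit(&on_rq, 1, memory_order_relaxed);           /* [S] p->on_rq = 1 */
        seen_state = atomic_load_explicit(&state, memory_order_relaxed);  /* [L] p->state     */
        return NULL;
}

static void *task(void *arg)
{
        atomic_store_explicit(&state, 1, memory_order_relaxed);           /* [S] p->state = UNINTERRUPTIBLE */
        seen_on_rq = atomic_load_explicit(&on_rq, memory_order_relaxed);  /* [L] p->on_rq                   */
        return NULL;
}

int main(void)
{
        for (int i = 0; i < 100000; i++) {
                pthread_t a, b;
                atomic_store(&on_rq, 0);
                atomic_store(&state, 0);
                pthread_create(&a, NULL, waker, NULL);
                pthread_create(&b, NULL, task, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                /* both sides missing the other's store is allowed with
                 * only relaxed ordering; it is what leads to the stuck
                 * while (p->on_cpu) spin described in the thread */
                if (seen_state == 0 && seen_on_rq == 0)
                        printf("store-buffering outcome at iteration %d\n", i);
        }
        return 0;
}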


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Balbir Singh


On 30/08/16 22:58, Oleg Nesterov wrote:
> On 08/30, Balbir Singh wrote:
>>
>> The origin of the issue I've seen seems to be related to
>> rwsem spin lock stealing. Basically I see the system deadlock'd in the
>> following state
>>
>> I have a system with multiple threads and
>>
>> Most of the threads are stuck doing
>>
>> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
>> [67272.593915] LR = _raw_spin_lock_irqsave+0x9c/0x130
>> [67272.700996] [c00012857ae0] [c012453c] rwsem_wake+0xcc/0x110
>> [67272.749283] [c00012857b20] [c01215d8] up_write+0x78/0x90
>> [67272.788965] [c00012857b50] [c028153c] 
>> unlink_anon_vmas+0x15c/0x2c0
>> [67272.798782] [c00012857bc0] [c026f5c0] free_pgtables+0xf0/0x1c0
>> [67272.842528] [c00012857c10] [c027c9a0] exit_mmap+0x100/0x1a0
>> [67272.872947] [c00012857cd0] [c00b4a98] mmput+0xa8/0x1b0
>> [67272.898432] [c00012857d00] [c00bc50c] do_exit+0x33c/0xc30
>> [67272.944721] [c00012857dc0] [c00bcee4] do_group_exit+0x64/0x100
>> [67272.969014] [c00012857e00] [c00bcfac] SyS_exit_group+0x2c/0x30
>> [67272.978971] [c00012857e30] [c0009204] system_call+0x38/0xb4
>> [67272.999016] Instruction dump:
>>
>> They are spinning on the sem->wait_lock; the holder of sem->wait_lock has
>> irqs disabled and is doing
>>
>> [c0037930fb30] c00f724c try_to_wake_up+0x6c/0x570
>> [c0037930fbb0] c0124328 __rwsem_do_wake+0x1f8/0x260
>> [c0037930fc00] c01244b4 rwsem_wake+0x84/0x110
>> [c0037930fc40] c0121598 up_write+0x78/0x90
>> [c0037930fc70] c0281a54 anon_vma_fork+0x184/0x1d0
>> [c0037930fcc0] c00b68e0 copy_process.isra.5+0x14c0/0x1870
>> [c0037930fda0] c00b6e68 _do_fork+0xa8/0x4b0
>> [c0037930fe30] c0009460 ppc_clone+0x8/0xc
>>
>> The offset of try_to_wake_up is actually misleading, it is actually stuck
>> doing the following in try_to_wake_up
>>
>> while (p->on_cpu)
>>  cpu_relax();
>>
>> Analysis
>>
>> The issue is triggered, due to the following race
>>
>> CPU1                                  CPU2
>>
>> while () {
>>   if (cond)
>>     break;
>>   do {
>>     schedule();
>>     set_current_state(TASK_UN..)
>>   } while (!cond);
>>                                       rwsem_wake()
>>                                         spin_lock_irqsave(wait_lock)
>>   raw_spin_lock_irqsave(wait_lock)     wake_up_process()
>> }                                      try_to_wake_up()
>> set_current_state(TASK_RUNNING);         ..
>> list_del();
>>
>> CPU2 wakes up CPU1, but before it can get the wait_lock and set
>> current state to TASK_RUNNING the following occurs
>>
>> CPU3
>> (stole the rwsem before waiter can be woken up from queue)
>> up_write()
>> rwsem_wake()
>> raw_spin_lock_irqsave(wait_lock)
>> if (!list_empty)
>>   wake_up_process()
>>   try_to_wake_up()
>>   raw_spin_lock_irqsave(p->pi_lock)
>>   ..
>>   if (p->on_rq && ttwu_wakeup())
>>   ..
>>   while (p->on_cpu)
>>     cpu_relax()
>>   ..
>>
>> CPU3 tries to wake up the task on CPU1 again since it finds
>> it on the wait_queue; CPU1 is spinning on the wait_lock, but CPU3 got it
>> immediately after CPU2 did.
>>
>> CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and
>> the task is spinning on the wait_lock. Interestingly, since p->on_rq
>> is checked under the pi_lock, I've noticed that try_to_wake_up() finds
>> p->on_rq to be 0. This was the most confusing bit of the analysis,
>> but p->on_rq is changed under the runqueue lock (rq->lock), so the p->on_rq
>> check is not reliable without this fix IMHO. The race is visible
>> (based on the analysis) only when ttwu_queue() does a remote wakeup
>> via ttwu_queue_remote(), in which case the p->on_rq change is not
>> done under the pi_lock.
>>
>> The result is that after a while the entire system locks up on
>> the raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
>>
>> Reproduction of the issue
>>
>> The issue can be reproduced after a long run on my system with 80
>> threads, tweaking available memory down to very low and running the
>> stress-ng mmapfork memory test. It usually takes a long time to
>> reproduce. I am trying to work on a test case that can reproduce
>> the issue faster, but that's work in progress. I am still testing the
>> changes on my system in a loop and the tests seem OK thus far.
>>
>> Big thanks to Benjamin and Nick for helping debug this as well.
>> Ben helped catch the missing barrier; Nick caught every missing
>> bit in my theory.
>>
>> Cc: Peter Zijlstra 
>> Cc: Nicholas Piggin 
>> Cc: Benjamin Herrenschmidt 
>>
>> Signed-off-by: Balbir Singh 
>> ---
>>  kernel/sched/core.c | 11 +++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2a906f2..582c684 100644
>> --- 

Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Benjamin Herrenschmidt
On Tue, 2016-08-30 at 20:34 +0200, Peter Zijlstra wrote:
> 
> I'm not actually sure it does. There is the comment from 8643cda549ca4
> which explain the program order guarantees.
> 
> But I'm not sure who or what would simply a full smp_mb() when you call
> schedule() -- I mean, its true on x86, but that's 'trivial'.

It's always been a requirement that if you actually context switch, a
full mb() is implied (though that isn't the case if you don't actually
switch, i.e., you are back to RUNNING before you even hit schedule).

On powerpc we have a sync deep in _switch to achieve that.

This is necessary so that a process that wakes up on a different CPU sees
all of its own loads/stores.
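
To make the "full mb()" point concrete, here is a minimal user-space
sketch, using C11 atomics with seq_cst fences as a stand-in for the
sync/smp_mb() in the switch path (names and the two-thread shape are
purely illustrative; this models the ordering, it is not kernel code).
With both fences present the store-buffering outcome r0 == r1 == 0 is
forbidden; acquire/release ordering alone would allow it.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;

/* Thread 0: store to x, full fence (the smp_mb()/sync analogue), read y. */
static void *t0(void *arg)
{
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	r0 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

/* Thread 1: store to y, full fence, read x. */
static void *t1(void *arg)
{
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, t0, NULL);
	pthread_create(&b, NULL, t1, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* With the fences above, r0 == 0 && r1 == 0 cannot be observed. */
	printf("r0=%d r1=%d\n", r0, r1);
	return 0;
}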

> > I mean, I thought that the LOAD/STORE's done by some task can't
> > be re-ordered with LOAD/STORE's done by another task which was
> > running on the same CPU. Wrong?
> 
> If so, I'm not sure how :/
> 
> So smp_mb__before_spinlock() stops stores from @prev, and the ACQUIRE
> from spin_lock(&rq->lock) stops both loads/stores from @next, but afaict
> nothing stops the loads from @prev seeing stores from @next.
> 
> Also not sure this matters though, if they're threads in the same
> process its a data race already and nobody cares. If they're not threads
> in the same process, they're separated by address space and can't 'see'
> each other anyway.

The architecture switch_to() has to do the right thing.

Cheers,
Ben.



Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Benjamin Herrenschmidt
On Tue, 2016-08-30 at 15:04 +0200, Oleg Nesterov wrote:
> 
> Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> leak into the critical section.
> 
> But context switch should imply mb() we can rely on?

Between setting of ->on_rq and returning to the task so it can
change its state back to [UN]INTERRUPTIBLE, there will be at least one
write barrier (spin unlock of the rq), possibly even a full barrier
(context switch). The write barrier is enough so I didn't dig to make
sure we always context switch in the scenario we're looking at but I
think we do.

Cheers,
Ben.



Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Peter Zijlstra
On Tue, Aug 30, 2016 at 06:57:47PM +0200, Oleg Nesterov wrote:
> On 08/30, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2016 at 03:04:27PM +0200, Oleg Nesterov wrote:

> > > But context switch should imply mb() we can rely on?
> >
> > Not sure it should, on x86 switch_mm does a CR3 write and that is
> > serializing, but switch_to() doesn't need to do anything iirc.
> 
> Documentation/memory-barriers.txt says
> 
>   schedule() and similar imply full memory barriers.
> 
> and I (wrongly?) interpreted this as if this is also true for 2
> different threads.

I'm not actually sure it does. There is the comment from 8643cda549ca4
which explains the program order guarantees.

But I'm not sure who or what would imply a full smp_mb() when you call
schedule() -- I mean, it's true on x86, but that's 'trivial'.

> I mean, I thought that the LOAD/STORE's done by some task can't
> be re-ordered with LOAD/STORE's done by another task which was
> running on the same CPU. Wrong?

If so, I'm not sure how :/

So smp_mb__before_spinlock() stops stores from @prev, and the ACQUIRE
from spin_lock(&rq->lock) stops both loads/stores from @next, but afaict
nothing stops the loads from @prev seeing stores from @next.

Also not sure this matters though, if they're threads in the same
process its a data race already and nobody cares. If they're not threads
in the same process, they're separated by address space and can't 'see'
each other anyway.


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Oleg Nesterov
On 08/30, Peter Zijlstra wrote:
> On Tue, Aug 30, 2016 at 03:04:27PM +0200, Oleg Nesterov wrote:
> > On 08/30, Peter Zijlstra wrote:
> > >
> > >   /*
> > >* Ensure we load p->on_rq _after_ p->state, otherwise it would
> > >* be possible to, falsely, observe p->on_rq == 0 and get stuck
> > >* in smp_cond_load_acquire() below.
> > >*
> > >* sched_ttwu_pending() try_to_wake_up()
> > >*   [S] p->on_rq = 1;  [L] P->state
> > >*   UNLOCK rq->lock
> > >*
> > >* schedule()   RMB
> > >*   LOCK rq->lock
> > >*   UNLOCK rq->lock
> > >*
> > >* [task p]
> > >*   [S] p->state = UNINTERRUPTIBLE [L] p->on_rq
> > >*
> > >* Pairs with the UNLOCK+LOCK on rq->lock from the
> > >* last wakeup of our task and the schedule that got our task
> > >* current.
> > >*/
> >
> > Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> > leak into the critical section.
>
> How so? That's LOCK+UNLOCK which is leaky; UNLOCK+LOCK is a read/write
> barrier (just not an MB because it lacks full transitivity).

Ah, I have wrongly read the "Pairs with the UNLOCK+LOCK" as
"Pairs with the LOCK+UNLOCK". And didn't notice this even after I
copy-and-pasted this part.

> > But context switch should imply mb() we can rely on?
>
> Not sure it should, on x86 switch_mm does a CR3 write and that is
> serializing, but switch_to() doesn't need to do anything iirc.

Documentation/memory-barriers.txt says

schedule() and similar imply full memory barriers.

and I (wrongly?) interpreted this as if this is also true for 2
> different threads.

I mean, I thought that the LOAD/STORE's done by some task can't
be re-ordered with LOAD/STORE's done by another task which was
running on the same CPU. Wrong?

Oleg.



Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Peter Zijlstra
On Tue, Aug 30, 2016 at 03:04:27PM +0200, Oleg Nesterov wrote:
> On 08/30, Peter Zijlstra wrote:
> >
> > /*
> >  * Ensure we load p->on_rq _after_ p->state, otherwise it would
> >  * be possible to, falsely, observe p->on_rq == 0 and get stuck
> >  * in smp_cond_load_acquire() below.
> >  *
> >  * sched_ttwu_pending() try_to_wake_up()
> >  *   [S] p->on_rq = 1;  [L] P->state
> >  *   UNLOCK rq->lock
> >  *
> >  * schedule()   RMB
> >  *   LOCK rq->lock
> >  *   UNLOCK rq->lock
> >  *
> >  * [task p]
> >  *   [S] p->state = UNINTERRUPTIBLE [L] p->on_rq
> >  *
> >  * Pairs with the UNLOCK+LOCK on rq->lock from the
> >  * last wakeup of our task and the schedule that got our task
> >  * current.
> >  */
> 
> Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
> leak into the critical section.

How so? That's LOCK+UNLOCK which is leaky; UNLOCK+LOCK is a read/write
barrier (just not an MB because it lacks full transitivity).
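
As a rough user-space model of that UNLOCK+LOCK point (a toy spinlock
built from C11 acquire/release, with illustrative names; not the kernel's
spinlock): a store made before the UNLOCK is guaranteed to be visible
after a later LOCK of the same lock, but this is weaker than a full MB --
it does not, for instance, forbid a store-buffering outcome between the
two sides.

#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>

static atomic_int lock = 1;      /* starts "held" by the writer thread */
static int data;                 /* ordered by the lock hand-off */

static void toy_lock(void)                        /* LOCK: acquire */
{
	while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
		;                                 /* spin until we get it */
}

static void toy_unlock(void)                      /* UNLOCK: release */
{
	atomic_store_explicit(&lock, 0, memory_order_release);
}

static void *writer(void *arg)
{
	data = 1;            /* plain store made before the UNLOCK ... */
	toy_unlock();
	return NULL;
}

static void *reader(void *arg)
{
	toy_lock();          /* ... must be visible after the later LOCK */
	assert(data == 1);   /* release->acquire on the same lock guarantees it */
	toy_unlock();
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}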

> But context switch should imply mb() we can rely on?

Not sure it should, on x86 switch_mm does a CR3 write and that is
serializing, but switch_to() doesn't need to do anything iirc.




Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Oleg Nesterov
On 08/30, Peter Zijlstra wrote:
>
>   /*
>* Ensure we load p->on_rq _after_ p->state, otherwise it would
>* be possible to, falsely, observe p->on_rq == 0 and get stuck
>* in smp_cond_load_acquire() below.
>*
>* sched_ttwu_pending() try_to_wake_up()
>*   [S] p->on_rq = 1;  [L] P->state
>*   UNLOCK rq->lock
>*
>* schedule()   RMB
>*   LOCK rq->lock
>*   UNLOCK rq->lock
>*
>* [task p]
>*   [S] p->state = UNINTERRUPTIBLE [L] p->on_rq
>*
>* Pairs with the UNLOCK+LOCK on rq->lock from the
>* last wakeup of our task and the schedule that got our task
>* current.
>*/

Confused... how this connects to UNLOCK+LOCK on rq->lock? A LOAD can
leak into the critical section.

But context switch should imply mb() we can rely on?

Oleg.



Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Oleg Nesterov
On 08/30, Balbir Singh wrote:
>
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state
> 
> I have a system with multiple threads and
> 
> Most of the threads are stuck doing
> 
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915] LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c00012857ae0] [c012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c00012857b20] [c01215d8] up_write+0x78/0x90
> [67272.788965] [c00012857b50] [c028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c00012857bc0] [c026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c00012857c10] [c027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c00012857cd0] [c00b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c00012857d00] [c00bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c00012857dc0] [c00bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c00012857e00] [c00bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c00012857e30] [c0009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
> 
> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
> irq's disabled and is doing
> 
> [c0037930fb30] c00f724c try_to_wake_up+0x6c/0x570
> [c0037930fbb0] c0124328 __rwsem_do_wake+0x1f8/0x260
> [c0037930fc00] c01244b4 rwsem_wake+0x84/0x110
> [c0037930fc40] c0121598 up_write+0x78/0x90
> [c0037930fc70] c0281a54 anon_vma_fork+0x184/0x1d0
> [c0037930fcc0] c00b68e0 copy_process.isra.5+0x14c0/0x1870
> [c0037930fda0] c00b6e68 _do_fork+0xa8/0x4b0
> [c0037930fe30] c0009460 ppc_clone+0x8/0xc
> 
> The offset of try_to_wake_up is actually misleading, it is actually stuck
> doing the following in try_to_wake_up
> 
> while (p->on_cpu)
>   cpu_relax();
> 
> Analysis
> 
> The issue is triggered, due to the following race
> 
> CPU1                                  CPU2
>
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
>                                        rwsem_wake()
>                                          spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)       wake_up_process()
> }                                        try_to_wake_up()
> set_current_state(TASK_RUNNING);         ..
> list_del();
> 
> CPU2 wakes up CPU1, but before it can get the wait_lock and set
> current state to TASK_RUNNING the following occurs
> 
> CPU3
> (stole the rwsem before waiter can be woken up from queue)
> up_write()
> rwsem_wake()
> raw_spin_lock_irqsave(wait_lock)
> if (!list_empty)
>   wake_up_process()
>   try_to_wake_up()
>   raw_spin_lock_irqsave(p->pi_lock)
>   ..
>   if (p->on_rq && ttwu_wakeup())
>   ..
>   while (p->on_cpu)
> cpu_relax()
>   ..
> 
> CPU3 tries to wake up the task on CPU1 again since it finds
> it on the wait_queue. CPU1 is spinning on the wait_lock, but CPU3
> got it immediately after CPU2.
> 
> CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and
> the task is spinning on the wait_lock. Interestingly, since p->on_rq
> is checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0. This was the most confusing bit of the analysis:
> p->on_rq is changed under the runqueue lock, rq->lock, so the p->on_rq
> check is not reliable without this fix IMHO. The race is visible
> (based on the analysis) only when ttwu_queue() does a remote wakeup
> via ttwu_queue_remote, in which case the p->on_rq change is not
> done under the pi_lock.
> 
> The result is that after a while the entire system locks up on
> the raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
> 
> Reproduction of the issue
> 
> The issue can be reproduced after a long run on my system with 80
> threads, with available memory tweaked down very low, running the
> stress-ng mmapfork memory test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce
> the issue faster, but that's work in progress. I am still testing the
> changes on my system in a loop and the tests seem OK thus far.
> 
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier; Nick caught every missing
> bit in my theory.
> 
> Cc: Peter Zijlstra 
> Cc: Nicholas Piggin 
> Cc: Benjamin Herrenschmidt 
> 
> Signed-off-by: Balbir Singh 
> ---
>  kernel/sched/core.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2a906f2..582c684 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> state, int wake_flags)
>  

Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Peter Zijlstra
On Tue, Aug 30, 2016 at 06:49:37PM +1000, Balbir Singh wrote:
> 
> 
> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

As Nick says (good to see you're back Nick!), this is unrelated to
rwsems.

This is true for pretty much every blocking wait loop out there; they
all do:

	for (;;) {
		current->state = UNINTERRUPTIBLE;
		smp_mb();
		if (cond)
			break;
		schedule();
	}
	current->state = RUNNING;

Which, if the wakeup is spurious, is just the pattern you need.
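
For reference, the same loop as it is commonly spelled with the scheduler
helpers; set_current_state() is what supplies the ->state store plus the
smp_mb() shown above, and __set_current_state() is the non-barrier
variant (a sketch of the idiom, not a quote of any particular caller):

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE); /* store ->state, then smp_mb() */
		if (cond)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);       /* plain store, no barrier */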

> +++ b/kernel/sched/core.c
> @@ -2016,6 +2016,17 @@ try_to_wake_up(struct task_struct *p, unsigned int 
> state, int wake_flags)
>   success = 1; /* we're going to change ->state */
>   cpu = task_cpu(p);
>  
> + /*
> +  * Ensure we see on_rq and p_state consistently
> +  *
> +  * For example in __rwsem_down_write_failed(), we have
> +  *[S] ->on_rq = 1   [L] ->state
> +  *MB RMB

There isn't an MB there. The best I can do is UNLOCK+LOCK, which, thanks
to PPC, is _not_ MB. It is however sufficient for this case.

> +  *[S] ->state = TASK_UNINTERRUPTIBLE[L] ->on_rq
> +  * In the absence of the RMB p->on_rq can be observed to be 0
> +  * and we end up spinning indefinitely in while (p->on_cpu)
> +  */


/*
 * Ensure we load p->on_rq _after_ p->state, otherwise it would
 * be possible to, falsely, observe p->on_rq == 0 and get stuck
 * in smp_cond_load_acquire() below.
 *
 * sched_ttwu_pending() try_to_wake_up()
 *   [S] p->on_rq = 1;  [L] P->state
 *   UNLOCK rq->lock
 *
 * schedule()   RMB
 *   LOCK rq->lock
 *   UNLOCK rq->lock
 *
 * [task p]
 *   [S] p->state = UNINTERRUPTIBLE [L] p->on_rq
 *
 * Pairs with the UNLOCK+LOCK on rq->lock from the
 * last wakeup of our task and the schedule that got our task
 * current.
 */

> + smp_rmb();
>   if (p->on_rq && ttwu_remote(p, wake_flags))
>   goto stat;
>  


Now, this has been present for a fair while, I suspect ever since we
reworked the wakeup path to not use rq->lock twice. Curious you only now
hit it.


Re: [RFC][PATCH] Fix a race between rwsem and the scheduler

2016-08-30 Thread Nicholas Piggin
On Tue, 30 Aug 2016 18:49:37 +1000
Balbir Singh  wrote:

> The origin of the issue I've seen seems to be related to
> rwsem spin lock stealing. Basically I see the system deadlock'd in the
> following state

BTW, this is not really to do with rwsems, but purely a scheduler bug.
It was seen in other callers too that follow a similar pattern, and even
when those don't follow the exact pattern that leads to deadlock, it can
lead to a long busywait in the waker until the wakee next sleeps or
gets preempted.

> 
> I have a system with multiple threads and
> 
> Most of the threads are stuck doing
> 
> [67272.593915] --- interrupt: e81 at _raw_spin_lock_irqsave+0xa4/0x130
> [67272.593915] LR = _raw_spin_lock_irqsave+0x9c/0x130
> [67272.700996] [c00012857ae0] [c012453c] rwsem_wake+0xcc/0x110
> [67272.749283] [c00012857b20] [c01215d8] up_write+0x78/0x90
> [67272.788965] [c00012857b50] [c028153c] unlink_anon_vmas+0x15c/0x2c0
> [67272.798782] [c00012857bc0] [c026f5c0] free_pgtables+0xf0/0x1c0
> [67272.842528] [c00012857c10] [c027c9a0] exit_mmap+0x100/0x1a0
> [67272.872947] [c00012857cd0] [c00b4a98] mmput+0xa8/0x1b0
> [67272.898432] [c00012857d00] [c00bc50c] do_exit+0x33c/0xc30
> [67272.944721] [c00012857dc0] [c00bcee4] do_group_exit+0x64/0x100
> [67272.969014] [c00012857e00] [c00bcfac] SyS_exit_group+0x2c/0x30
> [67272.978971] [c00012857e30] [c0009204] system_call+0x38/0xb4
> [67272.999016] Instruction dump:
> 
> They are spinning on the sem->wait_lock, the holder of sem->wait_lock has
> irq's disabled and is doing
> 
> [c0037930fb30] c00f724c try_to_wake_up+0x6c/0x570
> [c0037930fbb0] c0124328 __rwsem_do_wake+0x1f8/0x260
> [c0037930fc00] c01244b4 rwsem_wake+0x84/0x110
> [c0037930fc40] c0121598 up_write+0x78/0x90
> [c0037930fc70] c0281a54 anon_vma_fork+0x184/0x1d0
> [c0037930fcc0] c00b68e0 copy_process.isra.5+0x14c0/0x1870
> [c0037930fda0] c00b6e68 _do_fork+0xa8/0x4b0
> [c0037930fe30] c0009460 ppc_clone+0x8/0xc
> 
> The offset of try_to_wake_up is actually misleading, it is actually stuck
> doing the following in try_to_wake_up
> 
> while (p->on_cpu)
>   cpu_relax();
> 
> Analysis
> 
> The issue is triggered, due to the following race
> 
> CPU1                                  CPU2
>
> while () {
>   if (cond)
>     break;
>   do {
>     schedule();
>     set_current_state(TASK_UN..)
>   } while (!cond);
>                                        rwsem_wake()
>                                          spin_lock_irqsave(wait_lock)
>   raw_spin_lock_irqsave(wait_lock)       wake_up_process()
> }                                        try_to_wake_up()
> set_current_state(TASK_RUNNING);         ..
> list_del();
> 
> CPU2 wakes up CPU1, but before it can get the wait_lock and set
> current state to TASK_RUNNING the following occurs
> 
> CPU3
> (stole the rwsem before waiter can be woken up from queue)
> up_write()
> rwsem_wake()
> raw_spin_lock_irqsave(wait_lock)
> if (!list_empty)
>   wake_up_process()
>   try_to_wake_up()
>   raw_spin_lock_irqsave(p->pi_lock)
>   ..
>   if (p->on_rq && ttwu_wakeup())
>   ..
>   while (p->on_cpu)
> cpu_relax()
>   ..
> 
> CPU3 tries to wake up the task on CPU1 again since it finds
> it on the wait_queue. CPU1 is spinning on the wait_lock, but CPU3
> got it immediately after CPU2.
> 
> CPU3 checks the state of p on CPU1; it is TASK_UNINTERRUPTIBLE and
> the task is spinning on the wait_lock. Interestingly, since p->on_rq
> is checked under pi_lock, I've noticed that try_to_wake_up() finds
> p->on_rq to be 0. This was the most confusing bit of the analysis:
> p->on_rq is changed under the runqueue lock, rq->lock, so the p->on_rq
> check is not reliable without this fix IMHO. The race is visible
> (based on the analysis) only when ttwu_queue() does a remote wakeup
> via ttwu_queue_remote, in which case the p->on_rq change is not
> done under the pi_lock.
> 
> The result is that after a while the entire system locks up on
> the raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
> 
> Reproduction of the issue
> 
> The issue can be reproduced after a long run on my system with 80
> threads, with available memory tweaked down very low, running the
> stress-ng mmapfork memory test. It usually takes a long time to
> reproduce. I am trying to work on a test case that can reproduce
> the issue faster, but that's work in progress. I am still testing the
> changes on my system in a loop and the tests seem OK thus far.
> 
> Big thanks to Benjamin and Nick for helping debug this as well.
> Ben helped catch the missing barrier; Nick caught every missing
> bit in my theory.
> 
> Cc: Peter Zijlstra 
> Cc: Nicholas Piggin 
> Cc: Benjamin Herrenschmidt 
> 
> Signed-off-by: Balbir Singh 
> ---
>  kernel/sched/core.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git