Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Waiman Long
On 11/07/2017 02:41 PM, Peter Zijlstra wrote:
> On Tue, Nov 07, 2017 at 11:39:02AM -0500, Waiman Long wrote:
>> On 11/07/2017 07:58 AM, Peter Zijlstra wrote:
>>> On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:
 This patch combines the best attributes of an unfair lock and a
 pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
 mode. A lock waiter goes into the unfair mode when there are waiters
 in the wait queue but the pending bit isn't set. Otherwise, it will
 go into the queued mode waiting in the queue for its turn.
>>> You forgot to test a starvation case. And you also forgot to describe
>>> how they cannot happen. I really cannot remember how all this is
>>> supposed to work.
>> Lock starvation shouldn't happen. The pending bit is set by the queue
>> head to indicate its readiness before spinning on the lock. Once the
>> pending bit is made visible to all the CPUs, no one can steal the lock
>> and they will all queued up in the wait queue.
> So the queue path of queued_spin_lock_slowpath() does
> queued_spin_trylock() which, for PV, ends up being that
> pv_queued_spin_steal_lock(), which you changed to spin util PENDING.
>
> Now PENDING is set by pv_wait_head_or_lock(), but that is far after
> queued_spin_trylock().
>
> The way I'm reading this is that we'll never get there... because we'll
> all be spinning in pv_queued_spin_steal_lock().
>
> So how will we fail pv_queued_spin_steal_lock() in order to then end up
> in pv_wait_head_or_lock() to set PENDING such that
> pv_queued_spin_steal_lock() will fail?
>
> The comment with your new pv_queued_spin_steal_lock() should very
> clearly spell out how this works.

I did add a comment to say that the lock waiter will stay in unfair mode
only when there are WAITERS and the pending bit is not set. So if no CPU
is in the wait queue, it will just do one trylock as is before the patch
and move on to the regular slowpath. I will update the comments to
clarify that I mean waiters in the wait queue.

>
>> I ran my locking microbenchmark on a 2-socket 36-core system (HT off)
>> with # of locking threads equals to the number of vCPUs in a kvm VM. The
>> table below show the the min/mean/max per-thread locking ops done in 5
>> seconds:
>>
>> #vCPUsmin/avg/max lockops
>> 36822,135/881,063/950,363
>> 54542,435/581,664/625,937
>> 72397,500/428,177/499,299
>>
>> It is obvious that there was no lock starvation here.
> It is however not obvious if that is the worst case; at the very least
> you should compare to the test-and-set variant which has known
> starvation. If this test doesn't show starvation with the test-and-set
> then this test is bad.

Sure, I will add the unfair lock case for comparison.

Cheers,
Longman




Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Waiman Long
On 11/07/2017 02:41 PM, Peter Zijlstra wrote:
> On Tue, Nov 07, 2017 at 11:39:02AM -0500, Waiman Long wrote:
>> On 11/07/2017 07:58 AM, Peter Zijlstra wrote:
>>> On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:
 This patch combines the best attributes of an unfair lock and a
 pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
 mode. A lock waiter goes into the unfair mode when there are waiters
 in the wait queue but the pending bit isn't set. Otherwise, it will
 go into the queued mode waiting in the queue for its turn.
>>> You forgot to test a starvation case. And you also forgot to describe
>>> how they cannot happen. I really cannot remember how all this is
>>> supposed to work.
>> Lock starvation shouldn't happen. The pending bit is set by the queue
>> head to indicate its readiness before spinning on the lock. Once the
>> pending bit is made visible to all the CPUs, no one can steal the lock
>> and they will all queued up in the wait queue.
> So the queue path of queued_spin_lock_slowpath() does
> queued_spin_trylock() which, for PV, ends up being that
> pv_queued_spin_steal_lock(), which you changed to spin util PENDING.
>
> Now PENDING is set by pv_wait_head_or_lock(), but that is far after
> queued_spin_trylock().
>
> The way I'm reading this is that we'll never get there... because we'll
> all be spinning in pv_queued_spin_steal_lock().
>
> So how will we fail pv_queued_spin_steal_lock() in order to then end up
> in pv_wait_head_or_lock() to set PENDING such that
> pv_queued_spin_steal_lock() will fail?
>
> The comment with your new pv_queued_spin_steal_lock() should very
> clearly spell out how this works.

I did add a comment to say that the lock waiter will stay in unfair mode
only when there are WAITERS and the pending bit is not set. So if no CPU
is in the wait queue, it will just do one trylock as is before the patch
and move on to the regular slowpath. I will update the comments to
clarify that I mean waiters in the wait queue.

>
>> I ran my locking microbenchmark on a 2-socket 36-core system (HT off)
>> with # of locking threads equals to the number of vCPUs in a kvm VM. The
>> table below show the the min/mean/max per-thread locking ops done in 5
>> seconds:
>>
>> #vCPUsmin/avg/max lockops
>> 36822,135/881,063/950,363
>> 54542,435/581,664/625,937
>> 72397,500/428,177/499,299
>>
>> It is obvious that there was no lock starvation here.
> It is however not obvious if that is the worst case; at the very least
> you should compare to the test-and-set variant which has known
> starvation. If this test doesn't show starvation with the test-and-set
> then this test is bad.

Sure, I will add the unfair lock case for comparison.

Cheers,
Longman




Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Peter Zijlstra
On Tue, Nov 07, 2017 at 11:39:02AM -0500, Waiman Long wrote:
> On 11/07/2017 07:58 AM, Peter Zijlstra wrote:
> > On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:

> >> This patch combines the best attributes of an unfair lock and a
> >> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> >> mode. A lock waiter goes into the unfair mode when there are waiters
> >> in the wait queue but the pending bit isn't set. Otherwise, it will
> >> go into the queued mode waiting in the queue for its turn.

> > You forgot to test a starvation case. And you also forgot to describe
> > how they cannot happen. I really cannot remember how all this is
> > supposed to work.
> 
> Lock starvation shouldn't happen. The pending bit is set by the queue
> head to indicate its readiness before spinning on the lock. Once the
> pending bit is made visible to all the CPUs, no one can steal the lock
> and they will all queued up in the wait queue.

So the queue path of queued_spin_lock_slowpath() does
queued_spin_trylock() which, for PV, ends up being that
pv_queued_spin_steal_lock(), which you changed to spin util PENDING.

Now PENDING is set by pv_wait_head_or_lock(), but that is far after
queued_spin_trylock().

The way I'm reading this is that we'll never get there... because we'll
all be spinning in pv_queued_spin_steal_lock().

So how will we fail pv_queued_spin_steal_lock() in order to then end up
in pv_wait_head_or_lock() to set PENDING such that
pv_queued_spin_steal_lock() will fail?

The comment with your new pv_queued_spin_steal_lock() should very
clearly spell out how this works.

> I ran my locking microbenchmark on a 2-socket 36-core system (HT off)
> with # of locking threads equals to the number of vCPUs in a kvm VM. The
> table below show the the min/mean/max per-thread locking ops done in 5
> seconds:
> 
> #vCPUsmin/avg/max lockops
> 36822,135/881,063/950,363
> 54542,435/581,664/625,937
> 72397,500/428,177/499,299
> 
> It is obvious that there was no lock starvation here.

It is however not obvious if that is the worst case; at the very least
you should compare to the test-and-set variant which has known
starvation. If this test doesn't show starvation with the test-and-set
then this test is bad.


Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Peter Zijlstra
On Tue, Nov 07, 2017 at 11:39:02AM -0500, Waiman Long wrote:
> On 11/07/2017 07:58 AM, Peter Zijlstra wrote:
> > On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:

> >> This patch combines the best attributes of an unfair lock and a
> >> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> >> mode. A lock waiter goes into the unfair mode when there are waiters
> >> in the wait queue but the pending bit isn't set. Otherwise, it will
> >> go into the queued mode waiting in the queue for its turn.

> > You forgot to test a starvation case. And you also forgot to describe
> > how they cannot happen. I really cannot remember how all this is
> > supposed to work.
> 
> Lock starvation shouldn't happen. The pending bit is set by the queue
> head to indicate its readiness before spinning on the lock. Once the
> pending bit is made visible to all the CPUs, no one can steal the lock
> and they will all queued up in the wait queue.

So the queue path of queued_spin_lock_slowpath() does
queued_spin_trylock() which, for PV, ends up being that
pv_queued_spin_steal_lock(), which you changed to spin util PENDING.

Now PENDING is set by pv_wait_head_or_lock(), but that is far after
queued_spin_trylock().

The way I'm reading this is that we'll never get there... because we'll
all be spinning in pv_queued_spin_steal_lock().

So how will we fail pv_queued_spin_steal_lock() in order to then end up
in pv_wait_head_or_lock() to set PENDING such that
pv_queued_spin_steal_lock() will fail?

The comment with your new pv_queued_spin_steal_lock() should very
clearly spell out how this works.

> I ran my locking microbenchmark on a 2-socket 36-core system (HT off)
> with # of locking threads equals to the number of vCPUs in a kvm VM. The
> table below show the the min/mean/max per-thread locking ops done in 5
> seconds:
> 
> #vCPUsmin/avg/max lockops
> 36822,135/881,063/950,363
> 54542,435/581,664/625,937
> 72397,500/428,177/499,299
> 
> It is obvious that there was no lock starvation here.

It is however not obvious if that is the worst case; at the very least
you should compare to the test-and-set variant which has known
starvation. If this test doesn't show starvation with the test-and-set
then this test is bad.


Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Waiman Long
On 11/07/2017 07:58 AM, Peter Zijlstra wrote:
> On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:
>> Currently, all the lock waiters entering the slowpath will do one
>> lock stealing attempt to acquire the lock. That helps performance,
>> especially in VMs with over-committed vCPUs. However, the current
>> pvqspinlocks still don't perform as good as unfair locks in many cases.
>> On the other hands, unfair locks do have the problem of lock starvation
>> that pvqspinlocks don't have.
>>
>> This patch combines the best attributes of an unfair lock and a
>> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
>> mode. A lock waiter goes into the unfair mode when there are waiters
>> in the wait queue but the pending bit isn't set. Otherwise, it will
>> go into the queued mode waiting in the queue for its turn.
>>
>> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
>> (make -j) was done in a VM with unpinned vCPUs 3 times with the
>> best time selected and  is the number of vCPUs available. The build
>> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
>> with various number of vCPUs are as follows:
>>
>>   vCPUspvqlock hybrid pvqlockunfair lock
>>   ---- -----
>> 30  342.1s 329.1s  329.1s
>> 36  314.1s 305.3s  307.3s
>> 45  345.0s 302.1s  306.6s
>> 54  365.4s 308.6s  307.8s
>> 72  358.9s 293.6s  303.9s
>>108  343.0s 285.9s  304.2s
>>
>> The hybrid pvqspinlock performs better or comparable to the unfair
>> lock.
>>
>> By turning on QUEUED_LOCK_STAT, the table below showed the number
>> of lock acquisitions in unfair mode and queue mode after a kernel
>> build with various number of vCPUs.
>>
>>   vCPUsqueued mode  unfair mode
>>   ----  ---
>> 30  9,130,518  294,954
>> 36 10,856,614  386,809
>> 45  8,467,264   11,475,373
>> 54  6,409,987   19,670,855
>> 72  4,782,063   25,712,180
>>
>> It can be seen that as the VM became more and more over-committed,
>> the ratio of locks acquired in unfair mode increases. This is all
>> done automatically to get the best overall performance as possible.
>>
>> The table below shows the kernel build times on a smaller 2-socket
>> 16-core 32-thread E5-2620 v4 system.
>>
>>   vCPUspvqlock hybrid pvqlockunfair lock
>>   ---- -----
>> 16  436.8s 433.4s  435.6s
>> 36  366.2s 364.8s  364.5s
>> 48  423.6s 376.3s  370.2s
>> 64  433.1s 376.6s  376.8s
>>
>> Again, the performance of the hybrid pvqspinlock was comparable to
>> that of the unfair lock.
> You forgot to test a starvation case. And you also forgot to describe
> how they cannot happen. I really cannot remember how all this is
> supposed to work.

Lock starvation shouldn't happen. The pending bit is set by the queue
head to indicate its readiness before spinning on the lock. Once the
pending bit is made visible to all the CPUs, no one can steal the lock
and they will all queued up in the wait queue.

I ran my locking microbenchmark on a 2-socket 36-core system (HT off)
with # of locking threads equals to the number of vCPUs in a kvm VM. The
table below show the the min/mean/max per-thread locking ops done in 5
seconds:

#vCPUsmin/avg/max lockops
36822,135/881,063/950,363
54542,435/581,664/625,937
72397,500/428,177/499,299

It is obvious that there was no lock starvation here.

Cheers,
Longman





Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Waiman Long
On 11/07/2017 07:58 AM, Peter Zijlstra wrote:
> On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:
>> Currently, all the lock waiters entering the slowpath will do one
>> lock stealing attempt to acquire the lock. That helps performance,
>> especially in VMs with over-committed vCPUs. However, the current
>> pvqspinlocks still don't perform as good as unfair locks in many cases.
>> On the other hands, unfair locks do have the problem of lock starvation
>> that pvqspinlocks don't have.
>>
>> This patch combines the best attributes of an unfair lock and a
>> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
>> mode. A lock waiter goes into the unfair mode when there are waiters
>> in the wait queue but the pending bit isn't set. Otherwise, it will
>> go into the queued mode waiting in the queue for its turn.
>>
>> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
>> (make -j) was done in a VM with unpinned vCPUs 3 times with the
>> best time selected and  is the number of vCPUs available. The build
>> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
>> with various number of vCPUs are as follows:
>>
>>   vCPUspvqlock hybrid pvqlockunfair lock
>>   ---- -----
>> 30  342.1s 329.1s  329.1s
>> 36  314.1s 305.3s  307.3s
>> 45  345.0s 302.1s  306.6s
>> 54  365.4s 308.6s  307.8s
>> 72  358.9s 293.6s  303.9s
>>108  343.0s 285.9s  304.2s
>>
>> The hybrid pvqspinlock performs better or comparable to the unfair
>> lock.
>>
>> By turning on QUEUED_LOCK_STAT, the table below showed the number
>> of lock acquisitions in unfair mode and queue mode after a kernel
>> build with various number of vCPUs.
>>
>>   vCPUsqueued mode  unfair mode
>>   ----  ---
>> 30  9,130,518  294,954
>> 36 10,856,614  386,809
>> 45  8,467,264   11,475,373
>> 54  6,409,987   19,670,855
>> 72  4,782,063   25,712,180
>>
>> It can be seen that as the VM became more and more over-committed,
>> the ratio of locks acquired in unfair mode increases. This is all
>> done automatically to get the best overall performance as possible.
>>
>> The table below shows the kernel build times on a smaller 2-socket
>> 16-core 32-thread E5-2620 v4 system.
>>
>>   vCPUspvqlock hybrid pvqlockunfair lock
>>   ---- -----
>> 16  436.8s 433.4s  435.6s
>> 36  366.2s 364.8s  364.5s
>> 48  423.6s 376.3s  370.2s
>> 64  433.1s 376.6s  376.8s
>>
>> Again, the performance of the hybrid pvqspinlock was comparable to
>> that of the unfair lock.
> You forgot to test a starvation case. And you also forgot to describe
> how they cannot happen. I really cannot remember how all this is
> supposed to work.

Lock starvation shouldn't happen. The pending bit is set by the queue
head to indicate its readiness before spinning on the lock. Once the
pending bit is made visible to all the CPUs, no one can steal the lock
and they will all queued up in the wait queue.

I ran my locking microbenchmark on a 2-socket 36-core system (HT off)
with # of locking threads equals to the number of vCPUs in a kvm VM. The
table below show the the min/mean/max per-thread locking ops done in 5
seconds:

#vCPUsmin/avg/max lockops
36822,135/881,063/950,363
54542,435/581,664/625,937
72397,500/428,177/499,299

It is obvious that there was no lock starvation here.

Cheers,
Longman





Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Peter Zijlstra
On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:
> Currently, all the lock waiters entering the slowpath will do one
> lock stealing attempt to acquire the lock. That helps performance,
> especially in VMs with over-committed vCPUs. However, the current
> pvqspinlocks still don't perform as good as unfair locks in many cases.
> On the other hands, unfair locks do have the problem of lock starvation
> that pvqspinlocks don't have.
> 
> This patch combines the best attributes of an unfair lock and a
> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> mode. A lock waiter goes into the unfair mode when there are waiters
> in the wait queue but the pending bit isn't set. Otherwise, it will
> go into the queued mode waiting in the queue for its turn.
> 
> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
> (make -j) was done in a VM with unpinned vCPUs 3 times with the
> best time selected and  is the number of vCPUs available. The build
> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
> with various number of vCPUs are as follows:
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 30  342.1s 329.1s  329.1s
> 36  314.1s 305.3s  307.3s
> 45  345.0s 302.1s  306.6s
> 54  365.4s 308.6s  307.8s
> 72  358.9s 293.6s  303.9s
>108  343.0s 285.9s  304.2s
> 
> The hybrid pvqspinlock performs better or comparable to the unfair
> lock.
> 
> By turning on QUEUED_LOCK_STAT, the table below showed the number
> of lock acquisitions in unfair mode and queue mode after a kernel
> build with various number of vCPUs.
> 
>   vCPUsqueued mode  unfair mode
>   ----  ---
> 30  9,130,518  294,954
> 36 10,856,614  386,809
> 45  8,467,264   11,475,373
> 54  6,409,987   19,670,855
> 72  4,782,063   25,712,180
> 
> It can be seen that as the VM became more and more over-committed,
> the ratio of locks acquired in unfair mode increases. This is all
> done automatically to get the best overall performance as possible.
> 
> The table below shows the kernel build times on a smaller 2-socket
> 16-core 32-thread E5-2620 v4 system.
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 16  436.8s 433.4s  435.6s
> 36  366.2s 364.8s  364.5s
> 48  423.6s 376.3s  370.2s
> 64  433.1s 376.6s  376.8s
> 
> Again, the performance of the hybrid pvqspinlock was comparable to
> that of the unfair lock.

You forgot to test a starvation case. And you also forgot to describe
how they cannot happen. I really cannot remember how all this is
supposed to work.


Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Peter Zijlstra
On Fri, Nov 03, 2017 at 11:35:31AM -0400, Waiman Long wrote:
> Currently, all the lock waiters entering the slowpath will do one
> lock stealing attempt to acquire the lock. That helps performance,
> especially in VMs with over-committed vCPUs. However, the current
> pvqspinlocks still don't perform as good as unfair locks in many cases.
> On the other hands, unfair locks do have the problem of lock starvation
> that pvqspinlocks don't have.
> 
> This patch combines the best attributes of an unfair lock and a
> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> mode. A lock waiter goes into the unfair mode when there are waiters
> in the wait queue but the pending bit isn't set. Otherwise, it will
> go into the queued mode waiting in the queue for its turn.
> 
> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
> (make -j) was done in a VM with unpinned vCPUs 3 times with the
> best time selected and  is the number of vCPUs available. The build
> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
> with various number of vCPUs are as follows:
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 30  342.1s 329.1s  329.1s
> 36  314.1s 305.3s  307.3s
> 45  345.0s 302.1s  306.6s
> 54  365.4s 308.6s  307.8s
> 72  358.9s 293.6s  303.9s
>108  343.0s 285.9s  304.2s
> 
> The hybrid pvqspinlock performs better or comparable to the unfair
> lock.
> 
> By turning on QUEUED_LOCK_STAT, the table below showed the number
> of lock acquisitions in unfair mode and queue mode after a kernel
> build with various number of vCPUs.
> 
>   vCPUsqueued mode  unfair mode
>   ----  ---
> 30  9,130,518  294,954
> 36 10,856,614  386,809
> 45  8,467,264   11,475,373
> 54  6,409,987   19,670,855
> 72  4,782,063   25,712,180
> 
> It can be seen that as the VM became more and more over-committed,
> the ratio of locks acquired in unfair mode increases. This is all
> done automatically to get the best overall performance as possible.
> 
> The table below shows the kernel build times on a smaller 2-socket
> 16-core 32-thread E5-2620 v4 system.
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 16  436.8s 433.4s  435.6s
> 36  366.2s 364.8s  364.5s
> 48  423.6s 376.3s  370.2s
> 64  433.1s 376.6s  376.8s
> 
> Again, the performance of the hybrid pvqspinlock was comparable to
> that of the unfair lock.

You forgot to test a starvation case. And you also forgot to describe
how they cannot happen. I really cannot remember how all this is
supposed to work.


Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Paolo Bonzini
On 03/11/2017 16:35, Waiman Long wrote:
> Currently, all the lock waiters entering the slowpath will do one
> lock stealing attempt to acquire the lock. That helps performance,
> especially in VMs with over-committed vCPUs. However, the current
> pvqspinlocks still don't perform as good as unfair locks in many cases.
> On the other hands, unfair locks do have the problem of lock starvation
> that pvqspinlocks don't have.
> 
> This patch combines the best attributes of an unfair lock and a
> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> mode. A lock waiter goes into the unfair mode when there are waiters
> in the wait queue but the pending bit isn't set. Otherwise, it will
> go into the queued mode waiting in the queue for its turn.
> 
> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
> (make -j) was done in a VM with unpinned vCPUs 3 times with the
> best time selected and  is the number of vCPUs available. The build
> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
> with various number of vCPUs are as follows:
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 30  342.1s 329.1s  329.1s
> 36  314.1s 305.3s  307.3s
> 45  345.0s 302.1s  306.6s
> 54  365.4s 308.6s  307.8s
> 72  358.9s 293.6s  303.9s
>108  343.0s 285.9s  304.2s
> 
> The hybrid pvqspinlock performs better or comparable to the unfair
> lock.
> 
> By turning on QUEUED_LOCK_STAT, the table below showed the number
> of lock acquisitions in unfair mode and queue mode after a kernel
> build with various number of vCPUs.
> 
>   vCPUsqueued mode  unfair mode
>   ----  ---
> 30  9,130,518  294,954
> 36 10,856,614  386,809
> 45  8,467,264   11,475,373
> 54  6,409,987   19,670,855
> 72  4,782,063   25,712,180
> 
> It can be seen that as the VM became more and more over-committed,
> the ratio of locks acquired in unfair mode increases. This is all
> done automatically to get the best overall performance as possible.
> 
> The table below shows the kernel build times on a smaller 2-socket
> 16-core 32-thread E5-2620 v4 system.
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 16  436.8s 433.4s  435.6s
> 36  366.2s 364.8s  364.5s
> 48  423.6s 376.3s  370.2s
> 64  433.1s 376.6s  376.8s
> 
> Again, the performance of the hybrid pvqspinlock was comparable to
> that of the unfair lock.
> 
> Signed-off-by: Waiman Long 
> ---
>  kernel/locking/qspinlock_paravirt.h | 28 +---
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock_paravirt.h 
> b/kernel/locking/qspinlock_paravirt.h
> index 4355568..405a923 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -60,21 +60,35 @@ struct pv_node {
>  #include "qspinlock_stat.h"
>  
>  /*
> + * Hybrid PV queued/unfair lock
> + *
>   * By replacing the regular queued_spin_trylock() with the function below,
>   * it will be called once when a lock waiter enter the PV slowpath before
> - * being queued. By allowing one lock stealing attempt here when the pending
> - * bit is off, it helps to reduce the performance impact of lock waiter
> - * preemption without the drawback of lock starvation.
> + * being queued. By allowing lock stealing attempts here when there are
> + * waiters but the pending bit is off, it helps to reduce the performance
> + * impact of lock waiter preemption without the drawback of lock starvation.
>   */
>  #define queued_spin_trylock(l)   pv_queued_spin_steal_lock(l)
>  static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>  {
>   struct __qspinlock *l = (void *)lock;
>  
> - if (!(atomic_read(>val) & _Q_LOCKED_PENDING_MASK) &&
> - (cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
> - qstat_inc(qstat_pv_lock_stealing, true);
> - return true;
> + /*
> +  * Stay in unfair lock mode as long as waiters are present but
> +  * the pending bit isn't set.
> +  */
> + for (;;) {
> + int val = atomic_read(>val);
> +
> + if (!(val & _Q_LOCKED_PENDING_MASK) &&
> +(cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
> + qstat_inc(qstat_pv_lock_stealing, true);
> + return true;
> + }
> + if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
> + break;
> +
> + cpu_relax();
>   }
>  
>   return false;
> 

Awesome!  Thanks Waiman.

Paolo


Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Paolo Bonzini
On 03/11/2017 16:35, Waiman Long wrote:
> Currently, all the lock waiters entering the slowpath will do one
> lock stealing attempt to acquire the lock. That helps performance,
> especially in VMs with over-committed vCPUs. However, the current
> pvqspinlocks still don't perform as good as unfair locks in many cases.
> On the other hands, unfair locks do have the problem of lock starvation
> that pvqspinlocks don't have.
> 
> This patch combines the best attributes of an unfair lock and a
> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> mode. A lock waiter goes into the unfair mode when there are waiters
> in the wait queue but the pending bit isn't set. Otherwise, it will
> go into the queued mode waiting in the queue for its turn.
> 
> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
> (make -j) was done in a VM with unpinned vCPUs 3 times with the
> best time selected and  is the number of vCPUs available. The build
> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
> with various number of vCPUs are as follows:
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 30  342.1s 329.1s  329.1s
> 36  314.1s 305.3s  307.3s
> 45  345.0s 302.1s  306.6s
> 54  365.4s 308.6s  307.8s
> 72  358.9s 293.6s  303.9s
>108  343.0s 285.9s  304.2s
> 
> The hybrid pvqspinlock performs better or comparable to the unfair
> lock.
> 
> By turning on QUEUED_LOCK_STAT, the table below showed the number
> of lock acquisitions in unfair mode and queue mode after a kernel
> build with various number of vCPUs.
> 
>   vCPUsqueued mode  unfair mode
>   ----  ---
> 30  9,130,518  294,954
> 36 10,856,614  386,809
> 45  8,467,264   11,475,373
> 54  6,409,987   19,670,855
> 72  4,782,063   25,712,180
> 
> It can be seen that as the VM became more and more over-committed,
> the ratio of locks acquired in unfair mode increases. This is all
> done automatically to get the best overall performance as possible.
> 
> The table below shows the kernel build times on a smaller 2-socket
> 16-core 32-thread E5-2620 v4 system.
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 16  436.8s 433.4s  435.6s
> 36  366.2s 364.8s  364.5s
> 48  423.6s 376.3s  370.2s
> 64  433.1s 376.6s  376.8s
> 
> Again, the performance of the hybrid pvqspinlock was comparable to
> that of the unfair lock.
> 
> Signed-off-by: Waiman Long 
> ---
>  kernel/locking/qspinlock_paravirt.h | 28 +---
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/locking/qspinlock_paravirt.h 
> b/kernel/locking/qspinlock_paravirt.h
> index 4355568..405a923 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -60,21 +60,35 @@ struct pv_node {
>  #include "qspinlock_stat.h"
>  
>  /*
> + * Hybrid PV queued/unfair lock
> + *
>   * By replacing the regular queued_spin_trylock() with the function below,
>   * it will be called once when a lock waiter enter the PV slowpath before
> - * being queued. By allowing one lock stealing attempt here when the pending
> - * bit is off, it helps to reduce the performance impact of lock waiter
> - * preemption without the drawback of lock starvation.
> + * being queued. By allowing lock stealing attempts here when there are
> + * waiters but the pending bit is off, it helps to reduce the performance
> + * impact of lock waiter preemption without the drawback of lock starvation.
>   */
>  #define queued_spin_trylock(l)   pv_queued_spin_steal_lock(l)
>  static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>  {
>   struct __qspinlock *l = (void *)lock;
>  
> - if (!(atomic_read(>val) & _Q_LOCKED_PENDING_MASK) &&
> - (cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
> - qstat_inc(qstat_pv_lock_stealing, true);
> - return true;
> + /*
> +  * Stay in unfair lock mode as long as waiters are present but
> +  * the pending bit isn't set.
> +  */
> + for (;;) {
> + int val = atomic_read(>val);
> +
> + if (!(val & _Q_LOCKED_PENDING_MASK) &&
> +(cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
> + qstat_inc(qstat_pv_lock_stealing, true);
> + return true;
> + }
> + if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
> + break;
> +
> + cpu_relax();
>   }
>  
>   return false;
> 

Awesome!  Thanks Waiman.

Paolo


Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Juergen Gross
On 03/11/17 16:35, Waiman Long wrote:
> Currently, all the lock waiters entering the slowpath will do one
> lock stealing attempt to acquire the lock. That helps performance,
> especially in VMs with over-committed vCPUs. However, the current
> pvqspinlocks still don't perform as good as unfair locks in many cases.
> On the other hands, unfair locks do have the problem of lock starvation
> that pvqspinlocks don't have.
> 
> This patch combines the best attributes of an unfair lock and a
> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> mode. A lock waiter goes into the unfair mode when there are waiters
> in the wait queue but the pending bit isn't set. Otherwise, it will
> go into the queued mode waiting in the queue for its turn.
> 
> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
> (make -j) was done in a VM with unpinned vCPUs 3 times with the
> best time selected and  is the number of vCPUs available. The build
> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
> with various number of vCPUs are as follows:
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 30  342.1s 329.1s  329.1s
> 36  314.1s 305.3s  307.3s
> 45  345.0s 302.1s  306.6s
> 54  365.4s 308.6s  307.8s
> 72  358.9s 293.6s  303.9s
>108  343.0s 285.9s  304.2s
> 
> The hybrid pvqspinlock performs better or comparable to the unfair
> lock.
> 
> By turning on QUEUED_LOCK_STAT, the table below showed the number
> of lock acquisitions in unfair mode and queue mode after a kernel
> build with various number of vCPUs.
> 
>   vCPUsqueued mode  unfair mode
>   ----  ---
> 30  9,130,518  294,954
> 36 10,856,614  386,809
> 45  8,467,264   11,475,373
> 54  6,409,987   19,670,855
> 72  4,782,063   25,712,180
> 
> It can be seen that as the VM became more and more over-committed,
> the ratio of locks acquired in unfair mode increases. This is all
> done automatically to get the best overall performance as possible.
> 
> The table below shows the kernel build times on a smaller 2-socket
> 16-core 32-thread E5-2620 v4 system.
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 16  436.8s 433.4s  435.6s
> 36  366.2s 364.8s  364.5s
> 48  423.6s 376.3s  370.2s
> 64  433.1s 376.6s  376.8s
> 
> Again, the performance of the hybrid pvqspinlock was comparable to
> that of the unfair lock.
> 
> Signed-off-by: Waiman Long 

Reviewed-by: Juergen Gross 


Juergen


Re: [PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-07 Thread Juergen Gross
On 03/11/17 16:35, Waiman Long wrote:
> Currently, all the lock waiters entering the slowpath will do one
> lock stealing attempt to acquire the lock. That helps performance,
> especially in VMs with over-committed vCPUs. However, the current
> pvqspinlocks still don't perform as good as unfair locks in many cases.
> On the other hands, unfair locks do have the problem of lock starvation
> that pvqspinlocks don't have.
> 
> This patch combines the best attributes of an unfair lock and a
> pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
> mode. A lock waiter goes into the unfair mode when there are waiters
> in the wait queue but the pending bit isn't set. Otherwise, it will
> go into the queued mode waiting in the queue for its turn.
> 
> On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
> (make -j) was done in a VM with unpinned vCPUs 3 times with the
> best time selected and  is the number of vCPUs available. The build
> times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
> with various number of vCPUs are as follows:
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 30  342.1s 329.1s  329.1s
> 36  314.1s 305.3s  307.3s
> 45  345.0s 302.1s  306.6s
> 54  365.4s 308.6s  307.8s
> 72  358.9s 293.6s  303.9s
>108  343.0s 285.9s  304.2s
> 
> The hybrid pvqspinlock performs better or comparable to the unfair
> lock.
> 
> By turning on QUEUED_LOCK_STAT, the table below showed the number
> of lock acquisitions in unfair mode and queue mode after a kernel
> build with various number of vCPUs.
> 
>   vCPUsqueued mode  unfair mode
>   ----  ---
> 30  9,130,518  294,954
> 36 10,856,614  386,809
> 45  8,467,264   11,475,373
> 54  6,409,987   19,670,855
> 72  4,782,063   25,712,180
> 
> It can be seen that as the VM became more and more over-committed,
> the ratio of locks acquired in unfair mode increases. This is all
> done automatically to get the best overall performance as possible.
> 
> The table below shows the kernel build times on a smaller 2-socket
> 16-core 32-thread E5-2620 v4 system.
> 
>   vCPUspvqlock hybrid pvqlockunfair lock
>   ---- -----
> 16  436.8s 433.4s  435.6s
> 36  366.2s 364.8s  364.5s
> 48  423.6s 376.3s  370.2s
> 64  433.1s 376.6s  376.8s
> 
> Again, the performance of the hybrid pvqspinlock was comparable to
> that of the unfair lock.
> 
> Signed-off-by: Waiman Long 

Reviewed-by: Juergen Gross 


Juergen


[PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-03 Thread Waiman Long
Currently, all the lock waiters entering the slowpath will do one
lock stealing attempt to acquire the lock. That helps performance,
especially in VMs with over-committed vCPUs. However, the current
pvqspinlocks still don't perform as good as unfair locks in many cases.
On the other hands, unfair locks do have the problem of lock starvation
that pvqspinlocks don't have.

This patch combines the best attributes of an unfair lock and a
pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
mode. A lock waiter goes into the unfair mode when there are waiters
in the wait queue but the pending bit isn't set. Otherwise, it will
go into the queued mode waiting in the queue for its turn.

On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
(make -j) was done in a VM with unpinned vCPUs 3 times with the
best time selected and  is the number of vCPUs available. The build
times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
with various number of vCPUs are as follows:

  vCPUspvqlock hybrid pvqlockunfair lock
  ---- -----
30  342.1s 329.1s  329.1s
36  314.1s 305.3s  307.3s
45  345.0s 302.1s  306.6s
54  365.4s 308.6s  307.8s
72  358.9s 293.6s  303.9s
   108  343.0s 285.9s  304.2s

The hybrid pvqspinlock performs better or comparable to the unfair
lock.

By turning on QUEUED_LOCK_STAT, the table below showed the number
of lock acquisitions in unfair mode and queue mode after a kernel
build with various number of vCPUs.

  vCPUsqueued mode  unfair mode
  ----  ---
30  9,130,518  294,954
36 10,856,614  386,809
45  8,467,264   11,475,373
54  6,409,987   19,670,855
72  4,782,063   25,712,180

It can be seen that as the VM became more and more over-committed,
the ratio of locks acquired in unfair mode increases. This is all
done automatically to get the best overall performance as possible.

The table below shows the kernel build times on a smaller 2-socket
16-core 32-thread E5-2620 v4 system.

  vCPUspvqlock hybrid pvqlockunfair lock
  ---- -----
16  436.8s 433.4s  435.6s
36  366.2s 364.8s  364.5s
48  423.6s 376.3s  370.2s
64  433.1s 376.6s  376.8s

Again, the performance of the hybrid pvqspinlock was comparable to
that of the unfair lock.

Signed-off-by: Waiman Long 
---
 kernel/locking/qspinlock_paravirt.h | 28 +---
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 4355568..405a923 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -60,21 +60,35 @@ struct pv_node {
 #include "qspinlock_stat.h"
 
 /*
+ * Hybrid PV queued/unfair lock
+ *
  * By replacing the regular queued_spin_trylock() with the function below,
  * it will be called once when a lock waiter enter the PV slowpath before
- * being queued. By allowing one lock stealing attempt here when the pending
- * bit is off, it helps to reduce the performance impact of lock waiter
- * preemption without the drawback of lock starvation.
+ * being queued. By allowing lock stealing attempts here when there are
+ * waiters but the pending bit is off, it helps to reduce the performance
+ * impact of lock waiter preemption without the drawback of lock starvation.
  */
 #define queued_spin_trylock(l) pv_queued_spin_steal_lock(l)
 static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
 {
struct __qspinlock *l = (void *)lock;
 
-   if (!(atomic_read(>val) & _Q_LOCKED_PENDING_MASK) &&
-   (cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
-   qstat_inc(qstat_pv_lock_stealing, true);
-   return true;
+   /*
+* Stay in unfair lock mode as long as waiters are present but
+* the pending bit isn't set.
+*/
+   for (;;) {
+   int val = atomic_read(>val);
+
+   if (!(val & _Q_LOCKED_PENDING_MASK) &&
+  (cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
+   qstat_inc(qstat_pv_lock_stealing, true);
+   return true;
+   }
+   if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
+   break;
+
+   cpu_relax();
}
 
return false;
-- 
1.8.3.1



[PATCH] locking/pvqspinlock: Hybrid PV queued/unfair locks

2017-11-03 Thread Waiman Long
Currently, all the lock waiters entering the slowpath will do one
lock stealing attempt to acquire the lock. That helps performance,
especially in VMs with over-committed vCPUs. However, the current
pvqspinlocks still don't perform as good as unfair locks in many cases.
On the other hands, unfair locks do have the problem of lock starvation
that pvqspinlocks don't have.

This patch combines the best attributes of an unfair lock and a
pvqspinlock into a hybrid lock with 2 modes - queued mode & unfair
mode. A lock waiter goes into the unfair mode when there are waiters
in the wait queue but the pending bit isn't set. Otherwise, it will
go into the queued mode waiting in the queue for its turn.

On a 2-socket 36-core E5-2699 v3 system (HT off), a kernel build
(make -j) was done in a VM with unpinned vCPUs 3 times with the
best time selected and  is the number of vCPUs available. The build
times of the original pvqspinlock, hybrid pvqspinlock and unfair lock
with various number of vCPUs are as follows:

  vCPUspvqlock hybrid pvqlockunfair lock
  ---- -----
30  342.1s 329.1s  329.1s
36  314.1s 305.3s  307.3s
45  345.0s 302.1s  306.6s
54  365.4s 308.6s  307.8s
72  358.9s 293.6s  303.9s
   108  343.0s 285.9s  304.2s

The hybrid pvqspinlock performs better or comparable to the unfair
lock.

By turning on QUEUED_LOCK_STAT, the table below showed the number
of lock acquisitions in unfair mode and queue mode after a kernel
build with various number of vCPUs.

  vCPUsqueued mode  unfair mode
  ----  ---
30  9,130,518  294,954
36 10,856,614  386,809
45  8,467,264   11,475,373
54  6,409,987   19,670,855
72  4,782,063   25,712,180

It can be seen that as the VM became more and more over-committed,
the ratio of locks acquired in unfair mode increases. This is all
done automatically to get the best overall performance as possible.

The table below shows the kernel build times on a smaller 2-socket
16-core 32-thread E5-2620 v4 system.

  vCPUspvqlock hybrid pvqlockunfair lock
  ---- -----
16  436.8s 433.4s  435.6s
36  366.2s 364.8s  364.5s
48  423.6s 376.3s  370.2s
64  433.1s 376.6s  376.8s

Again, the performance of the hybrid pvqspinlock was comparable to
that of the unfair lock.

Signed-off-by: Waiman Long 
---
 kernel/locking/qspinlock_paravirt.h | 28 +---
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 4355568..405a923 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -60,21 +60,35 @@ struct pv_node {
 #include "qspinlock_stat.h"
 
 /*
+ * Hybrid PV queued/unfair lock
+ *
  * By replacing the regular queued_spin_trylock() with the function below,
  * it will be called once when a lock waiter enter the PV slowpath before
- * being queued. By allowing one lock stealing attempt here when the pending
- * bit is off, it helps to reduce the performance impact of lock waiter
- * preemption without the drawback of lock starvation.
+ * being queued. By allowing lock stealing attempts here when there are
+ * waiters but the pending bit is off, it helps to reduce the performance
+ * impact of lock waiter preemption without the drawback of lock starvation.
  */
 #define queued_spin_trylock(l) pv_queued_spin_steal_lock(l)
 static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
 {
struct __qspinlock *l = (void *)lock;
 
-   if (!(atomic_read(>val) & _Q_LOCKED_PENDING_MASK) &&
-   (cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
-   qstat_inc(qstat_pv_lock_stealing, true);
-   return true;
+   /*
+* Stay in unfair lock mode as long as waiters are present but
+* the pending bit isn't set.
+*/
+   for (;;) {
+   int val = atomic_read(>val);
+
+   if (!(val & _Q_LOCKED_PENDING_MASK) &&
+  (cmpxchg_acquire(>locked, 0, _Q_LOCKED_VAL) == 0)) {
+   qstat_inc(qstat_pv_lock_stealing, true);
+   return true;
+   }
+   if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
+   break;
+
+   cpu_relax();
}
 
return false;
-- 
1.8.3.1