[webkit-dev] PSA: you should use WTF::Lock and WTF::Condition instead of WTF::SpinLock, WTF::Mutex, WTF::ThreadCondition, std::mutex, std::condition_variable, or std::condition_variable_any

Filip Pizlo Wed, 19 Aug 2015 13:09:00 -0700

Hi everyone,

Over the past two weeks or so I’ve been doing some work to improve the locking 
primitives that we use in WebKit.  These new primitives have landed, and they 
are simply called Lock and Condition.  You should use Lock instead of SpinLock, 
Mutex, or std::mutex, because it combines the best qualities of those other 
locks into one simple implementation.  You should use Condition instead of 
WTF::ThreadCondition, std::condition_variable, or std::condition_variable_any 
because Condition is always smaller and faster.  Also, Condition has a richer 
API for timed wait, which allows you to use either std::chrono time or the 
double-based time from <wtf/CurrentTime.h>.  It can even use our notion of 
monotonic time.


Prior to this change, we had to choose between WTF::SpinLock, which was fast, 
and WTF::Mutex/std::mutex, which didn’t waste CPU cycles during contention.  
After this change, we no longer have to make such choices: the new Lock is fast 
and doesn’t waste CPU cycles.

I’ve already landed changes to all of WebKit that replace all uses of those 
other locking primitives with Lock/Condition.

The specific benefits of these new primitives are:

- Lock and Condition take 1 byte of space.  That’s the total space that they 
will ever consume.  Contention is handled using thread local data structures 
managed by WTF::ParkingLot and that space usage is O(number of threads).  By 
comparison, WTF::Mutex and std::mutex take 64 bytes on Darwin.  They are 
usually take a lot of space on all OSes.  WTF::SpinLock takes 4 bytes.  So, 
Lock is 4x more compact than SpinLock and 64x more compact than 
std::mutex/WTF::Mutex.

- Lock is “adaptive”: it will not waste CPU time or power when a lock is held 
for a long time.  SpinLock will peg a CPU at 100% while it’s trying to acquire 
a lock, which makes SpinLock unsuitable for any critical section that may be 
held for a while.  This means never using SpinLock in critical sections that do 
I/O or that may block on other locks.  Lock doesn’t have this problem; like 
std::mutex and WTF::Mutex, it has the ability to block threads indefinitely 
while waking them up as soon as the lock is available again.  WTF::Lock behaves 
like a spinlock so long as no thread waits for too long (currently, too long = 
40 spins), and otherwise turns into a queue-based lock with barging (a hot 
thread that calls lock() just as another thread is dequeued may barge in, which 
increases throughput and somewhat avoids the convoy problem) and random 
fairness (in the long term, every contending thread has an equal shot at 
getting the lock) and thundering herd avoidance (unlock could cause one thread 
to try to contend for the lock, but it won’t ever wake up threads only to have 
them go back to sleep).

- Lock is very fast in the uncontended case.  Locking and unlocking fast paths 
are just inline CAS instructions (or LL/SC sequences on ARM and friends).  This 
makes Lock about 3x faster than some system mutexes, and within 2x of a spin 
lock.  For Lock performance details, see 
http://trac.webkit.org/changeset/188169/trunk/Source/WTF/ChangeLog 
<http://trac.webkit.org/changeset/188169/trunk/Source/WTF/ChangeLog>, 
http://trac.webkit.org/changeset/188323/trunk/Source/WTF/ChangeLog 
<http://trac.webkit.org/changeset/188323/trunk/Source/WTF/ChangeLog>, and 
http://trac.webkit.org/changeset/188374/trunk/Source/WTF/ChangeLog 
<http://trac.webkit.org/changeset/188374/trunk/Source/WTF/ChangeLog>.

- Lock is very fast in the case of microcontention.  Microcontention is when 
you have multiple threads all piling up on a lock that is held for a short 
time.  You may remember that this case is important in WebKit, in particular in 
our parallel GC: https://trac.webkit.org/changeset/117478 
<https://trac.webkit.org/changeset/117478>, 
https://www.webkit.org/blog/2136/on-spinlocks-and-sleep/ 
<https://www.webkit.org/blog/2136/on-spinlocks-and-sleep/>.  Well, Lock takes 
all of the good ideas from the WTF::SpinLock, which was our previous best 
answer for microcontention.  Lock is up to 100x faster than WTF::Mutex and 
std::mutex in case of microcontention on some OSes.

- Condition::notifyOne()/notifyAll() are just a fast inlined load-and-branch in 
the case that nobody is waiting on the condition.  This feature, combined with 
the fact that Condition can be used with Lock, and the fact that Lock is very 
fast to acquire and release, means that producer/consumer scenarios can run up 
to 58x faster with Lock/Condition than with Mutex/ThreadCondition or 
std::mutex/std::condition_variable.  For Condition performance details, see 
http://trac.webkit.org/changeset/188594/trunk/Source/WTF/ChangeLog 
<http://trac.webkit.org/changeset/188594/trunk/Source/WTF/ChangeLog> and 
http://trac.webkit.org/changeset/188605/trunk/Source/WTF/ChangeLog 
<http://trac.webkit.org/changeset/188605/trunk/Source/WTF/ChangeLog>.

The algorithms used for Lock and Condition have novelty in them, but the 
overall heuristics are consistent with what I take to be existing best 
practices as documented in other places.  The way that Lock handles contention 
is most similar to how Jikes RVM did it.  That algorithm is best documented in 
http://www.filpizlo.com/papers/pizlo-pppj2011-fable.pdf 
<http://www.filpizlo.com/papers/pizlo-pppj2011-fable.pdf> (though ignore all of 
the stuff about biasing and inflation of “thin” locks - WTF::Lock doesn’t do 
any of that).  That basic algorithm had been around for a while and took in 
many influences (see that paper for citations to other relevant stuff).

FWIW, in parallel and concurrent code in JSC, switching over to Lock/Condition 
from the other primitives gave us speed-ups on SunSpider and Kraken, as well as 
speed-ups on some Octane subtests.  Both parallel GC and concurrent/parallel 
JIT appeared to be happy with this change.

Another nice outcome of this work is that the underlying queueing mechanism 
used by both Lock and Condition is available for implementing other locking 
protocols.  It’s called ParkingLot, because its job is to allow you to “park” 
threads that shouldn’t run until someone else “unparks” them.  Any valid memory 
address can be used as a handle for a ParkingLot queue.  ParkingLot can be used 
to implement any locking protocol that uses futexes 
(https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf 
<https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf>) - just use 
“parkConditionally” whenever you would have used “FUTEX_WAIT” and use 
“unparkOne” when you would have used “FUTEX_WAKE”, but it can also do a lot 
more than that, since it allows you to pass std::function callbacks that run 
while ParkingLot holds its internal queue lock.  This is what enables Lock to 
avoid the thundering herd and enables Condition to have a lock-free notify fast 
path when nobody is waiting.  In theory, you can use ParkingLot to implement 
completely fair locks, counting semaphores, read-write locks, and other things. 
 It’ll also come in handy for implementing new locking primitives for things 
like WebAssembly.

TL;DR.  Don’t use WTF::SpinLock, WTF::Mutex, std::mutex, WTF::ThreadCondition, 
std::condition_variable, or std::condition_variable_any.  They waste CPU time 
and they waste memory.  Use WTF::Lock and WTF::Condition instead.

-Filip

_______________________________________________
webkit-dev mailing list
[email protected]
https://lists.webkit.org/mailman/listinfo/webkit-dev

[webkit-dev] PSA: you should use WTF::Lock and WTF::Condition instead of WTF::SpinLock, WTF::Mutex, WTF::ThreadCondition, std::mutex, std::condition_variable, or std::condition_variable_any

Reply via email to