Hi Filip, very interesting to see you change out such a fundamental algorithm! It's amazing that it works. Reading the code, it seems very comparable to Windows' critical sections [1]. Is that the case? Looking at github and msdn, 4000 seems to be the most common number of spins.
I believe your code would also useful be outside of WebKit. Have you given any thought on breaking it off and making it part of the operating system? 1: https://msdn.microsoft.com/en-us/library/windows/desktop/ms683476(v=vs.85).aspx On Wed, Aug 19, 2015 at 1:08 PM, Filip Pizlo <[email protected]> wrote: > Hi everyone, > > Over the past two weeks or so I’ve been doing some work to improve the > locking primitives that we use in WebKit. These new primitives have > landed, and they are simply called Lock and Condition. You should use Lock > instead of SpinLock, Mutex, or std::mutex, because it combines the best > qualities of those other locks into one simple implementation. You should > use Condition instead of WTF::ThreadCondition, std::condition_variable, or > std::condition_variable_any because Condition is always smaller and > faster. Also, Condition has a richer API for timed wait, which allows you > to use either std::chrono time or the double-based time from > <wtf/CurrentTime.h>. It can even use our notion of monotonic time. > > Prior to this change, we had to choose between WTF::SpinLock, which was > fast, and WTF::Mutex/std::mutex, which didn’t waste CPU cycles during > contention. After this change, we no longer have to make such choices: the > new Lock is fast and doesn’t waste CPU cycles. > > I’ve already landed changes to all of WebKit that replace all uses of > those other locking primitives with Lock/Condition. > > The specific benefits of these new primitives are: > > - Lock and Condition take 1 byte of space. That’s the total space that > they will ever consume. Contention is handled using thread local data > structures managed by WTF::ParkingLot and that space usage is O(number of > threads). By comparison, WTF::Mutex and std::mutex take 64 bytes on > Darwin. They are usually take a lot of space on all OSes. WTF::SpinLock > takes 4 bytes. So, Lock is 4x more compact than SpinLock and 64x more > compact than std::mutex/WTF::Mutex. > > - Lock is “adaptive”: it will not waste CPU time or power when a lock is > held for a long time. SpinLock will peg a CPU at 100% while it’s trying to > acquire a lock, which makes SpinLock unsuitable for any critical section > that may be held for a while. This means never using SpinLock in critical > sections that do I/O or that may block on other locks. Lock doesn’t have > this problem; like std::mutex and WTF::Mutex, it has the ability to block > threads indefinitely while waking them up as soon as the lock is available > again. WTF::Lock behaves like a spinlock so long as no thread waits for > too long (currently, too long = 40 spins), and otherwise turns into a > queue-based lock with barging (a hot thread that calls lock() just as > another thread is dequeued may barge in, which increases throughput and > somewhat avoids the convoy problem) and random fairness (in the long term, > every contending thread has an equal shot at getting the lock) and > thundering herd avoidance (unlock could cause one thread to try to contend > for the lock, but it won’t ever wake up threads only to have them go back > to sleep). > > - Lock is very fast in the uncontended case. Locking and unlocking fast > paths are just inline CAS instructions (or LL/SC sequences on ARM and > friends). This makes Lock about 3x faster than some system mutexes, and > within 2x of a spin lock. For Lock performance details, see > http://trac.webkit.org/changeset/188169/trunk/Source/WTF/ChangeLog, > http://trac.webkit.org/changeset/188323/trunk/Source/WTF/ChangeLog, and > http://trac.webkit.org/changeset/188374/trunk/Source/WTF/ChangeLog. > > - Lock is very fast in the case of microcontention. Microcontention is > when you have multiple threads all piling up on a lock that is held for a > short time. You may remember that this case is important in WebKit, in > particular in our parallel GC: https://trac.webkit.org/changeset/117478, > https://www.webkit.org/blog/2136/on-spinlocks-and-sleep/. Well, Lock > takes all of the good ideas from the WTF::SpinLock, which was our previous > best answer for microcontention. Lock is up to 100x faster than WTF::Mutex > and std::mutex in case of microcontention on some OSes. > > - Condition::notifyOne()/notifyAll() are just a fast inlined > load-and-branch in the case that nobody is waiting on the condition. This > feature, combined with the fact that Condition can be used with Lock, and > the fact that Lock is very fast to acquire and release, means that > producer/consumer scenarios can run up to 58x faster with Lock/Condition > than with Mutex/ThreadCondition or std::mutex/std::condition_variable. For > Condition performance details, see > http://trac.webkit.org/changeset/188594/trunk/Source/WTF/ChangeLog and > http://trac.webkit.org/changeset/188605/trunk/Source/WTF/ChangeLog. > > The algorithms used for Lock and Condition have novelty in them, but the > overall heuristics are consistent with what I take to be existing best > practices as documented in other places. The way that Lock handles > contention is most similar to how Jikes RVM did it. That algorithm is best > documented in http://www.filpizlo.com/papers/pizlo-pppj2011-fable.pdf (though > ignore all of the stuff about biasing and inflation of “thin” locks - > WTF::Lock doesn’t do any of that). That basic algorithm had been around > for a while and took in many influences (see that paper for citations to > other relevant stuff). > > FWIW, in parallel and concurrent code in JSC, switching over to > Lock/Condition from the other primitives gave us speed-ups on SunSpider and > Kraken, as well as speed-ups on some Octane subtests. Both parallel GC and > concurrent/parallel JIT appeared to be happy with this change. > > Another nice outcome of this work is that the underlying queueing > mechanism used by both Lock and Condition is available for implementing > other locking protocols. It’s called ParkingLot, because its job is to > allow you to “park” threads that shouldn’t run until someone else “unparks” > them. Any valid memory address can be used as a handle for a ParkingLot > queue. ParkingLot can be used to implement any locking protocol that uses > futexes (https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf) - > just use “parkConditionally” whenever you would have used “FUTEX_WAIT” and > use “unparkOne” when you would have used “FUTEX_WAKE”, but it can also do a > lot more than that, since it allows you to pass std::function callbacks > that run while ParkingLot holds its internal queue lock. This is what > enables Lock to avoid the thundering herd and enables Condition to have a > lock-free notify fast path when nobody is waiting. In theory, you can use > ParkingLot to implement completely fair locks, counting semaphores, > read-write locks, and other things. It’ll also come in handy for > implementing new locking primitives for things like WebAssembly. > > TL;DR. Don’t use WTF::SpinLock, WTF::Mutex, std::mutex, > WTF::ThreadCondition, std::condition_variable, or > std::condition_variable_any. They waste CPU time and they waste memory. > Use WTF::Lock and WTF::Condition instead. > > -Filip > > > > _______________________________________________ > webkit-dev mailing list > [email protected] > https://lists.webkit.org/mailman/listinfo/webkit-dev > >
_______________________________________________ webkit-dev mailing list [email protected] https://lists.webkit.org/mailman/listinfo/webkit-dev

