CVSROOT:        /cvs
Module name:    src
Changes by:     [email protected]    2025/11/05 17:45:31

Modified files:
        sys/kern       : kern_lock.c subr_witness.c 
        sys/sys        : mutex.h 

Log message:
replace the cas spinlock in kernel mutexes with a "parking" lock.

this is motivated because cas based locks are unfair, meaning the
algorithm makes no effort to give cpus access to the critical
section in the order that they tried to acquire it. cas based locks
can also generate a lot of work for the cache subsystem because
every waiting cpu ends up hammering the same cacheline.

the combination of these effects for heavily contended mutexes can
get some systems into a situation where they don't make progress,
and are effectively livelocked.

this parking mutex mitigates these problems.

it's called parking because it was very heavily influenced by what's
described in https://webkit.org/blog/6161/locking-in-webkit/. the
big influence is that the lock itself only has to record its state,
while the machinery for waiting for the lock is external to the lock.

in practice this means that cpus waiting to acquire a mutex use
memory on their stack as an entry in a list of waiting cpus, and
then each cpu spins on a word in its own entry. this stops the
cache subsystem having to spend all its time pulling memory around
between cpus. when the critical section is released, this list is
inspected and the first entry on it has its wait word cleared. this
means cpus are woken up one at a time and in order, which provides
fairness.

it looks very similar to a "classic" futex based lock except the
syscall backend has been cut out and replaced with a hash of lists
and spinning.

the main benefits of this lock are:

- cpus spin on their own memory

this is kind of the whole point of the exercise.

- the size of struct mutex doesn't increase

the mutex itself only has to record ownership (which could be a single
bit, but we still record which cpu owns the lock for diagnostic
purposes) and whether it's contended. if it is contended, then the
address of the mutex is used to find a parking lot for cpus to
coordinate in.

- uncontended mutex acquisition/release is the same cost as our current
implementation.

this means if we do the work to reduce contention on our locks, we
don't pay an overhead from the more complicated mutex implementation.

there are others, but these are the important ones.

this was tested by many, but of note is jca@ who tested the following:

- amd64 Intel Gen 12 laptop big.little 10 cores
- amd64 AMD VM 8 cores
- arm64 mac mini m2 pro 10 cores (3 core clusters iirc)
- sparc64 LDOM 16 vcpus
- riscv64 Hifive Unmatched 4 cores
- riscv64 StarFive VF2 1.2a 4 cores

phessler@ has been running this on ports build machines.
claudio@ tested a variety of amd64 and sparc64 systems.
tb@ and djm@ have been running this for months on various systems
without trouble.

i've been working on a variety of systems including an 80 core ampere
system, an m1 mac mini, and 64 thread amd64 systems. the arm64 systems
in particular were brutal because arm64 has a weak memory model.

the algorithm is being examined by researchers in formal methods
too. special thanks to ian hayes, rob colvin, brijesh dongol, roger
su, and daniel wright for their work so far. it's helped me sleep
better at night.

bluhm@ and kettenis@ have done performance testing, which shows
pluses and minuses, but they are happy for this to go in so we can
get some experience with it.

ok jca@ claudio@ mpi@
