Hello, I was looking at the rw lock code out of curiosity and noticed you always do membar_enter, which on an MP-enabled amd64 kernel translates to mfence. This makes the entire business a little bit slower.
Interestingly you already have the relevant macros for amd64:

#define membar_enter_after_atomic() __membar("")
#define membar_exit_before_atomic() __membar("")

And there is even a default variant for archs which don't provide their own, so the switch should be easy.

Grabbing the lock for reading is an rw_cas to a higher value. On failure you explicitly re-read from the lock. This is slower than necessary in the presence of concurrent read lock/unlock, since cas returns the value it found and you can use that instead. Also, the read lock fast path does not have to descend to the slow path after a single cas failure, but this probably does not matter right now.

The actual question I have here is whether you have played with adaptive spinning instead of instantly putting threads to sleep, at least for cases when the kernel lock is not held. This can be as simplistic as spinning for as long as the lock is owned by a running thread. For cases where curproc holds the kernel lock, you can perhaps drop it for spinning purposes and reacquire it later, although I have no idea if this one is going to help anything. Definitely worth testing imo.

A side note: the global locks I found are not annotated in any manner with respect to exclusive cacheline placement. In particular, netlock in a 6.2 kernel shares its cacheline with if_input_task_locked:
----------------
ffffffff81aca608 D if_input_task_locked
ffffffff81aca630 D netlock
----------------
The standard boilerplate to deal with this is to annotate the variable with aligned() and place it in a dedicated section. FreeBSD and NetBSD contain the necessary crappery to copy-paste, including linker script support.

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>