Ping? Is your objection still standing? If yes, can you address my responses?

Recap:

- Proposal is to change one word in the membar_enter man page from

	Any store preceding membar_enter() will happen before all
	memory operations following it.

  to

	Any load preceding membar_enter() will happen before all
	memory operations following it.

  In other words, document membar_enter as load-before-load/store,
  i.e., as load-before-load and load-before-store -- not as
  store-before-load/store. This will secondarily allow us to remove
  a lot of confusing verbiage in the man page about membar_ops and
  load-acquire operations.

- Every use of membar_enter in tree needs load-before-load/store, not
  store-before-load/store. So we're already relying on the proposed
  semantics, not the documented semantics. I'd like to add some more
  load-before-load/store uses, in places where atomic_load_acquire
  doesn't quite work.

- Store-before-load is a _weird_ ordering that generally occurs only
  in exotic protocols like Dekker's algorithm, which we should not be
  encouraging in tree.

- The one-word difference is immaterial for ordering atomic-r/m/w and
  then load/store (or, equivalently, ll/sc and then load/store) -- so
  the change doesn't affect mutex_enter-type operations implemented
  with, e.g., atomic_cas.

- Our implementation of membar_enter on all CPUs (except riscv, which
  has never been released) already implements the proposed semantics,
  but _does not_ implement the documented semantics on amd64, i386,
  powerpc, sparc, and sparc64, and it's been this way for fifteen
  years since it was introduced.

- Store-before-load is often much more expensive than
  load-before-load/store or load/store-before-store:

  . On x86 and SPARC TSO, store-before-load needs the most expensive
    memory fence instruction (MFENCE), whereas load-before-load/store
    and load/store-before-store don't require any fence at all.

  . On Armv8, store-before-load needs DMB ISH, but
    load-before-load/store needs only the cheaper DMB ISHLD.

  . On powerpc, store-before-load needs SYNC, but
    load-before-load/store and load/store-before-store only need the
    cheaper LWSYNC. (Load-before-load/store might actually only need
    the even cheaper ISYNC, not 100% sure.)

  So even for ordering atomic r/m/w, where there's no semantic
  difference, it's cheaper to use load-before-load/store than to use
  store-before-load/store.
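
To make the load-before-load/store pattern concrete, here is a minimal sketch in portable C11 rather than with the kernel's membar_ops. It assumes that atomic_thread_fence(memory_order_acquire) is a fair stand-in for the proposed membar_enter semantics and that a release fence gives load/store-before-store; that mapping is my assumption for illustration, not a statement about the in-tree implementation. The names payload, ready, publish and try_consume are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: plain data guarded by an atomic flag. */
static int payload;
static atomic_bool ready;

/*
 * Producer: write the data, then set the flag.  The release fence is
 * load/store-before-store: everything above it is ordered before the
 * store to `ready' below it.
 */
void
publish(int value)
{
	payload = value;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/*
 * Consumer: load the flag, then read the data.  The acquire fence is
 * load-before-load/store (assumed here to match the proposed
 * membar_enter wording): the load of `ready' above it is ordered
 * before every load and store below it.
 */
bool
try_consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return false;
	atomic_thread_fence(memory_order_acquire);
	*out = payload;
	return true;
}
```

Nothing in this pattern needs a store ordered before a later load, so under the cost comparison above the consumer side should not need MFENCE, DMB ISH, or SYNC.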