On 15/12/17(Fri) 22:03, Mateusz Guzik wrote:
> [...]
> However, contended behaviour is a regression compared to the asm
> variant.
Now that I have checked the files in, could you generate a diff with
your suggestions?
> From what I gather this is a step towards unifying all mutex
> implementations, hence the membar_* use.
Exactly.
> > +void
> > +__mtx_enter(struct mutex *mtx)
> > +{
> > +#ifdef MP_LOCKDEBUG
> > + int nticks = __mp_lock_spinout;
> > +#endif
> > +
> > + while (__mtx_enter_try(mtx) == 0) {
> > + CPU_BUSY_CYCLE();
>
> So this is effectively __mtx_enter_try with a single pause in between.
>
> > +}
> > +
> > +int
> > +__mtx_enter_try(struct mutex *mtx)
> > +{
> > + struct cpu_info *owner, *ci = curcpu();
> > + int s;
> > +
> > + if (mtx->mtx_wantipl != IPL_NONE)
> > + s = splraise(mtx->mtx_wantipl);
> > +
>
> This is at least one read.
>
> > + owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +#ifdef DIAGNOSTIC
> > + if (__predict_false(owner == ci))
> > + panic("mtx %p: locking against myself", mtx);
> > +#endif
> > + if (owner == NULL) {
> > + membar_enter_after_atomic();
> > + if (mtx->mtx_wantipl != IPL_NONE)
> > + mtx->mtx_oldipl = s;
>
> This repeats the read done earlier.
>
> Since the caller loops, this is effectively a very inefficient lock of
> the form:
> 	while (!atomic_cas_ptr(...))
> 		CPU_BUSY_CYCLE();
>
> + some reads in-between
>
> Assembly code would spin waiting for the lock to become free before
> playing with the spl level and attempting to lock. I don't know how
> contended your mutexes are right now, but this will have to be changed
> quite quickly, at least to match the current asm behaviour, as more of
> the kernel gets unlocked. Going for ticket locks or backoff later will probably
> work fine enough for the foreseeable future and will postpone the need
> to invest into anything fancy.
>
> That said, proposed restructure is as follows (pseudo-code, no debug):
>
> void
> __mtx_enter(struct mutex *mtx)
> {
> 	struct cpu_info *owner, *ci = curcpu();
> 	int want_ipl, s;
>
> 	/* mark mtx_wantipl as volatile or add a read cast through
> 	 * one to force a read *here* and no re-reads later */
> 	want_ipl = mtx->mtx_wantipl;
>
> 	for (;;) {
> 		if (want_ipl != IPL_NONE)
> 			s = splraise(want_ipl);
> 		owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> 		if (owner == NULL) {
> 			membar_enter_after_atomic();
> 			if (want_ipl != IPL_NONE)
> 				mtx->mtx_oldipl = s;
> 			break;
> 		}
>
> 		if (want_ipl != IPL_NONE)
> 			splx(s);
>
> 		do {
> 			CPU_BUSY_CYCLE();
> 		} while (mtx->mtx_owner != NULL);
> 	}
> }
>
> I don't think partial duplication of try_enter can be avoided without
> serious ugliness.
Duplication is not a problem since we're going to have a single MI file
for all (most?) archs.
> > +void
> > +__mtx_leave(struct mutex *mtx)
> > +{
> > + int s;
> > +
> > + MUTEX_ASSERT_LOCKED(mtx);
> > +
> > +#ifdef DIAGNOSTIC
> > + curcpu()->ci_mutex_level--;
> > +#endif
> > +
> > + s = mtx->mtx_oldipl;
> > +#ifdef MULTIPROCESSOR
> > + membar_exit_before_atomic();
> > +#endif
> > + mtx->mtx_owner = NULL;
> > + if (mtx->mtx_wantipl != IPL_NONE)
> > + splx(s);
> > +}
>
> It should be somewhat faster to read mtx_wantipl while the lock is still
> held - it is less likely that someone else has dirtied the cacheline.
> Then mtx_oldipl can be read only when it is actually needed.
>
> This function does not do atomic ops, so membar_exit_before_atomic does
> not really fit, even though it happens to do the exact same thing the
> "correctly" named primitive would - just provide a compiler barrier on
> amd64/i386.
>
> From what I gather you are trying to mimic illumos nomenclature, but
> they don't have an equivalent afaics. (perhaps Solaris grew one in the
> meantime?)
>
> In FreeBSD an appropriate routine is named atomic_thread_fence_rel (see
> amd64/include/atomic.h) and I suggest just borrowing the api.
API changes are subject to bikeshedding, so I'd suggest we concentrate on
the nice optimizations/improvements you're pointing out.
> Side note: you could probably shrink the ipl vars to make the lock
> smaller. It might even be doable to fit everything into the lock word,
> but I don't know how much sense playing with that would make.
One reason to have MI locks is also to make debugging easier. I don't
know if shrinking the size of the structure goes in the same direction.
Time will tell.