My last diff (11 May 2021) still has a potential problem with memory
barriers.  I will mail a new diff if I think of a fix.

On Tue, 11 May 2021 11:08:55 -0500
Dale Rahn <[email protected]> wrote:

> This structure really should be cache-line aligned, which should prevent it
> from spilling across a page boundary.

There is both a struct __ppc_lock and a function __ppc_lock(); my G5
got stuck at boot when the function crossed a page boundary.  It might
help to align parts of the function, but if I did so, I might
need to write the entire function in asm and count instructions.

The potential problem in my last diff is here inside __ppc_lock():

  15eb8c:       7c c0 19 2d     stwcx.  r6,0,r3
  15eb90:       40 c2 ff f0     bne+    15eb80 <__ppc_lock+0x40>
  15eb94:       7c 08 38 40     cmplw   r8,r7
  15eb98:       41 82 00 28     beq-    15ebc0 <__ppc_lock+0x80>
  ...
  15ebc0:       4c 00 01 2c     isync

After "stwcx." grabs the lock, we need a "bc; isync" memory barrier,
which is the "bne+" and "isync" in the above disassembly.  A page
fault might act like an "isync", so if we can do the "bne+" before
the page fault, we might be fine.

The problem is that, if the "stwcx." and "bne+" are in different pages
(a 1 in 1024 chance?), then I would get a page fault before the "bne+",
so my code would skip the necessary memory barrier.  I suspect I can
fix it by adding another membar_enter() somewhere.

> The powerpc pmap was originally designed to have  8 'way' locks so that
> only a single way would get locked, thus (as long as one doesn't have way
> more than 8 cores) any core should be able to choose some way to replace
> and get the work done. This also would have allowed limited recursion.
> However at some point those were taken out. Note that before locking one of
> the ways of the hashtable, it would hold the lock on the pmap that it was
> inserting the mapping from (which could be the kernel). Not certain if the
> kernel pmap lock is recursive or not.

That's a good history lesson for me; I'm too new.

The pmap lock is "struct mutex pm_mtx" (powerpc/include/pmap.h), so
it isn't recursive.  The code in pmap.c pte_spill_r() doesn't hold the
pmap lock if the page is in the direct map.  The kernel text, including
the function __ppc_lock(), is in the direct map.

> The point was to allow multiple cores to access different regions of the
> hashtable at the same time (minimal contention). However it would also
> allow a core to reenter the lock_try on a different way where it should be
> able to succeed (as long as the recursion doesn't go 8 deep?)

If I understand the code, the max recursion is 2 levels.  The kernel
can grab the hash lock, get a page fault, then the page fault handler
can do a 2nd grab.  The handler runs in real mode (no address
translation) and would never cause another page fault and a 3rd grab.

If we had the 8 locks, then it might deadlock if 8 cpus each
hold a lock and try to grab a 2nd lock (but a macppc has at most
4 cpus: PPC_MAXPROCS in powerpc/include/cpu.h).

--George

> On Tue, May 11, 2021 at 12:42 AM George Koehler <[email protected]> wrote:
> 
> > I made a mistake in my last diff by deleting the memory barriers.
> > Here is a diff that keeps membar_enter() and membar_exit().
> >
> > With my last diff, my G5 froze after 15 hours of uptime, and again
> > after 10 hours of uptime.  I'm not sure whether the missing memory
> > barriers caused the freeze, but I will be running this diff and hoping
> > for fewer freezes.
> >
> > On Sat, 8 May 2021 18:59:35 +0200 (CEST)
> > Mark Kettenis <[email protected]> wrote:
> >
> > > Good find!  On powerpc64 I avoid the issue because I guarantee that
> > > the kernel mappings are never evicted from the hash.  But doing so on
> > > powerpc would require more serious development.  I'm not sure we
> > > really need a ticket lock for this, but since you already did the
> > > work, let's stick with it for now.
> >
> > __ppc_lock (before and after this diff) doesn't give tickets like
> > __mp_lock does, but __ppc_lock and __mp_lock are both recursive locks.
> > I don't know an easy way to avoid the recursion, if some of the kernel
> > mappings might not be in the hash.  My __ppc_lock diff should be easy
> > if I don't make mistakes like wrong memory barriers.
> >
> > --George
> >
> > Index: arch/powerpc/include/mplock.h
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/powerpc/include/mplock.h,v
> > retrieving revision 1.4
> > diff -u -p -r1.4 mplock.h
> > --- arch/powerpc/include/mplock.h       15 Apr 2020 08:09:00 -0000      1.4
> > +++ arch/powerpc/include/mplock.h       10 May 2021 23:33:00 -0000
> > @@ -30,13 +30,13 @@
> >  #define __USE_MI_MPLOCK
> >
> >  /*
> > + * __ppc_lock exists because pte_spill_r() can't use __mp_lock.
> >   * Really simple spinlock implementation with recursive capabilities.
> >   * Correctness is paramount, no fancyness allowed.
> >   */
> >
> >  struct __ppc_lock {
> > -       volatile struct cpu_info *mpl_cpu;
> > -       volatile long           mpl_count;
> > +       volatile unsigned int   mpl_bolt;
> >  };
> >
> >  #ifndef _LOCORE
> > @@ -44,10 +44,6 @@ struct __ppc_lock {
> >  void __ppc_lock_init(struct __ppc_lock *);
> >  void __ppc_lock(struct __ppc_lock *);
> >  void __ppc_unlock(struct __ppc_lock *);
> > -int __ppc_release_all(struct __ppc_lock *);
> > -int __ppc_release_all_but_one(struct __ppc_lock *);
> > -void __ppc_acquire_count(struct __ppc_lock *, int);
> > -int __ppc_lock_held(struct __ppc_lock *, struct cpu_info *);
> >
> >  #endif
> >
> > Index: arch/powerpc/powerpc/lock_machdep.c
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/powerpc/powerpc/lock_machdep.c,v
> > retrieving revision 1.9
> > diff -u -p -r1.9 lock_machdep.c
> > --- arch/powerpc/powerpc/lock_machdep.c 15 Apr 2020 08:09:00 -0000      1.9
> > +++ arch/powerpc/powerpc/lock_machdep.c 10 May 2021 23:33:00 -0000
> > @@ -1,6 +1,7 @@
> >  /*     $OpenBSD: lock_machdep.c,v 1.9 2020/04/15 08:09:00 mpi Exp $    */
> >
> >  /*
> > + * Copyright (c) 2021 George Koehler <[email protected]>
> >   * Copyright (c) 2007 Artur Grabowski <[email protected]>
> >   *
> >   * Permission to use, copy, modify, and distribute this software for any
> > @@ -22,15 +23,29 @@
> >  #include <sys/atomic.h>
> >
> >  #include <machine/cpu.h>
> > -#include <machine/psl.h>
> >
> >  #include <ddb/db_output.h>
> >
> > +/*
> > + * If __ppc_lock() crosses a page boundary in the kernel text, then it
> > + * may cause a page fault (on G5 with ppc_nobat), and pte_spill_r()
> > + * would recursively call __ppc_lock().  The lock must be in a valid
> > + * state when the page fault happens.
> > + *
> > + * This lock has 2 fields, "cpuid" and "count", packed in a 32-bit
> > + * "bolt", so we can use 32-bit atomic ops to ensure that our lock is
> > + * always in a valid state.
> > + */
> > +#define BOLT(cpuid, count)     ((cpuid) << 24 | (count) & 0x00ffffff)
> > +#define BOLT_CPUID(bolt)       ((bolt) >> 24)
> > +#define BOLT_COUNT(bolt)       ((bolt) & 0x00ffffff)
> > +#define BOLT_LOCKED(bolt)      (BOLT_COUNT(bolt) != 0)
> > +#define BOLT_UNLOCKED(bolt)    (BOLT_COUNT(bolt) == 0)
> > +
> >  void
> >  __ppc_lock_init(struct __ppc_lock *lock)
> >  {
> > -       lock->mpl_cpu = NULL;
> > -       lock->mpl_count = 0;
> > +       lock->mpl_bolt = 0;
> >  }
> >
> >  #if defined(MP_LOCKDEBUG)
> > @@ -46,12 +61,12 @@ static __inline void
> >  __ppc_lock_spin(struct __ppc_lock *mpl)
> >  {
> >  #ifndef MP_LOCKDEBUG
> > -       while (mpl->mpl_count != 0)
> > +       while (BOLT_LOCKED(mpl->mpl_bolt))
> >                 CPU_BUSY_CYCLE();
> >  #else
> >         int nticks = __mp_lock_spinout;
> >
> > -       while (mpl->mpl_count != 0 && --nticks > 0)
> > +       while (BOLT_LOCKED(mpl->mpl_bolt) && --nticks > 0)
> >                 CPU_BUSY_CYCLE();
> >
> >         if (nticks == 0) {
> > @@ -64,33 +79,23 @@ __ppc_lock_spin(struct __ppc_lock *mpl)
> >  void
> >  __ppc_lock(struct __ppc_lock *mpl)
> >  {
> > -       /*
> > -        * Please notice that mpl_count gets incremented twice for the
> > -        * first lock. This is on purpose. The way we release the lock
> > -        * in mp_unlock is to decrement the mpl_count and then check if
> > -        * the lock should be released. Since mpl_count is what we're
> > -        * spinning on, decrementing it in mpl_unlock to 0 means that
> > -        * we can't clear mpl_cpu, because we're no longer holding the
> > -        * lock. In theory mpl_cpu doesn't need to be cleared, but it's
> > -        * safer to clear it and besides, setting mpl_count to 2 on the
> > -        * first lock makes most of this code much simpler.
> > -        */
> > +       int id = curcpu()->ci_cpuid;
> >
> >         while (1) {
> > -               int s;
> > -
> > -               s = ppc_intr_disable();
> > -               if (atomic_cas_ulong(&mpl->mpl_count, 0, 1) == 0) {
> > -                       membar_enter();
> > -                       mpl->mpl_cpu = curcpu();
> > -               }
> > +               unsigned int bolt = mpl->mpl_bolt;
> >
> > -               if (mpl->mpl_cpu == curcpu()) {
> > -                       mpl->mpl_count++;
> > -                       ppc_intr_enable(s);
> > +               if (BOLT_UNLOCKED(bolt)) {
> > +                       /* Try to grab the lock.  count = 1 */
> > +                       if (atomic_cas_uint(&mpl->mpl_bolt, bolt,
> > +                           BOLT(id, 1)) == bolt) {
> > +                               membar_enter();
> > +                               break;
> > +                       }
> > +               } else if (BOLT_CPUID(bolt) == id) {
> > +                       /* We hold the lock.  count++ */
> > +                       mpl->mpl_bolt = BOLT(id, BOLT_COUNT(bolt) + 1);
> >                         break;
> >                 }
> > -               ppc_intr_enable(s);
> >
> >                 __ppc_lock_spin(mpl);
> >         }
> > @@ -99,72 +104,18 @@ __ppc_lock(struct __ppc_lock *mpl)
> >  void
> >  __ppc_unlock(struct __ppc_lock *mpl)
> >  {
> > -       int s;
> > +       unsigned int bolt = mpl->mpl_bolt;
> >
> >  #ifdef MP_LOCKDEBUG
> > -       if (mpl->mpl_cpu != curcpu()) {
> > +       if (BOLT_CPUID(bolt) != curcpu()->ci_cpuid ||
> > +           BOLT_COUNT(bolt) == 0) {
> >                 db_printf("__ppc_unlock(%p): not held lock\n", mpl);
> >                 db_enter();
> >         }
> >  #endif
> >
> > -       s = ppc_intr_disable();
> > -       if (--mpl->mpl_count == 1) {
> > -               mpl->mpl_cpu = NULL;
> > -               membar_exit();
> > -               mpl->mpl_count = 0;
> > -       }
> > -       ppc_intr_enable(s);
> > -}
> > -
> > -int
> > -__ppc_release_all(struct __ppc_lock *mpl)
> > -{
> > -       int rv = mpl->mpl_count - 1;
> > -       int s;
> > -
> > -#ifdef MP_LOCKDEBUG
> > -       if (mpl->mpl_cpu != curcpu()) {
> > -               db_printf("__ppc_release_all(%p): not held lock\n", mpl);
> > -               db_enter();
> > -       }
> > -#endif
> > -
> > -       s = ppc_intr_disable();
> > -       mpl->mpl_cpu = NULL;
> > +       /* count-- */
> > +       bolt = BOLT(BOLT_CPUID(bolt), BOLT_COUNT(bolt) - 1);
> >         membar_exit();
> > -       mpl->mpl_count = 0;
> > -       ppc_intr_enable(s);
> > -
> > -       return (rv);
> > -}
> > -
> > -int
> > -__ppc_release_all_but_one(struct __ppc_lock *mpl)
> > -{
> > -       int rv = mpl->mpl_count - 2;
> > -
> > -#ifdef MP_LOCKDEBUG
> > -       if (mpl->mpl_cpu != curcpu()) {
> > -               db_printf("__ppc_release_all_but_one(%p): not held lock\n", mpl);
> > -               db_enter();
> > -       }
> > -#endif
> > -
> > -       mpl->mpl_count = 2;
> > -
> > -       return (rv);
> > -}
> > -
> > -void
> > -__ppc_acquire_count(struct __ppc_lock *mpl, int count)
> > -{
> > -       while (count--)
> > -               __ppc_lock(mpl);
> > -}
> > -
> > -int
> > -__ppc_lock_held(struct __ppc_lock *mpl, struct cpu_info *ci)
> > -{
> > -       return mpl->mpl_cpu == ci;
> > +       mpl->mpl_bolt = bolt;
> >  }
> >
> >


-- 
George Koehler <[email protected]>
