On Mon, 24 Jun 2013, Konstantin Belousov wrote:

On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote:
The case that can't be fixed by rereading the counters is when fetching
code runs in between the stores.  If the stores are on a another CPU
that is currently executing them, then we can keep checking that the
counters don't change for the maximum time that any pair of stores
take to complete.  This time is quite long and difficult to determine
(but it can be bounded by hard-disabling interrupts in the storer, and
limited by prefetching).  If the stores are on the same CPU, then we
have no good way to detect that another thread is in the middle of
them, or to switch back to that thread.  So we currently prevent this
case using slow methods.

We are already explicit about the fetched value potentially not
reflecting any real moment value, but indeed, having the fetch result
off by 2^32 is not acceptable.

We need atomic 64bit loads to fix this, which is provided by the same
cmpxchg8b instruction on the 8-byte aligned elements.  Intel guarantees
that 8-byte aligned loads and stores are atomic.

The following is the prototype for the x86. The other 64bit
architectures are handled exactly like amd64. For 32bit !x86 arches,
i.e. arm, mips32 and powerpc32, some more investigation is needed.

That works of course, but is too complicated, machine-dependent and slow
for me.
   (I just checked all the pcpu.h implementations and found that
    all of them except amd64 and i386 still use the quick and racy
    PCPU_PTR(member) += (value) for PCPU_ADD().  This has 1 sure
    race and 1 compiler/machine-dependent race:
    - PCPU_PTR() is only usable if preemption (resulting in rescheduling
      to run on a different CPU) is prevented in the caller.  Otherwise,
      the pointer may become invalid immediately iafter it is initialized.
    - the compiler may generate a racy load-add-store operation for +=.
      Using PCPU_PTR() makes this a smaller problem than using a thread-
      local pointer.  It prevents the store going to a different CPU
      than the load.
    PCPU_GET() and PCPU_SET() have the first of these problems on all
    arches except amd64 and i386.  Even curthread wouldn't work in the
    SMP case if it has the default implementation of PCPU_GET(curthread).
    However, it has a non-default implementation on all arches except
    arm and mips.  So even curthread seems to be broken for arm and mips.)

   The non-optimized <machine/counter.h> avoids these problems using a
   critical section.  I'm not sure that works.  Anyway, counter.h shouldn't
   have to understand pcpu better than pcpu.h does in order to work.

   The arm implementation of atomic.h is also relevant and is especially
   interesting.  I looked at it after mdf@ pointed out ARM_RAS_START.
   This seems to only be used by arm in userland.  It is even larger and
   slower and more complicated than critical sections, so it doesn't seem
   to be useful for counters.  In the kernel, arm has several
   implementations depending on machine capabilities.  The ifdefs are
   hard to understand.  On some machine, it seems to use the equivalent
   of a cmpxchg loop.  On others, it just disables interrupts.  Disabling
   interrupts is presumably better than a critical section, and userland
   has to use ARM_RAS_START for some arches because neither disabling
   interrupts not critical sections are available in userland.  None
   of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
   avoids using atomic ops.  None of this care helps avoid the races in
   counter.h or make counter.h efficient, since counter.h uses its own
   methods.  To reach the same quality as atomic.h, counter.h would have
   to copy half of the ifdefs and methods in atomic.h.

The slowness that I care about is only in the counter update.  Anything
complicated there will be slow.

My current best design:
- use ordinary mutexes to protect counter fetches in non-per-CPU contexts.
- use native-sized or always 32-bit counters.  Counter updates are done
  by a single addl on i386.  Fix pcpu.h on arches other than amd64 and
  i386 and use the same method as there.
- counter fetches add the native-sized pcpu counters to 64-bit non-pcpu
  counters, when the native-size counters are in danger of overflowing
  or always, under the mutex.  Transferring data uses an ordinary
  atomic_cmpset.  To avoid ifdefs, always use u_int pcpu counters.
  The 64-bit non-pcpu counters can easily be changed to pcpu counters
  if the API is extended to support pcpu data.
- run a daemon every few minutes to fetch all the counters, so that
  the native-sized counters are in no danger of overflowing on systems
  that don't run statistics programs often enough to fetch the counters
  to actually use.

...
diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
index a98ed40..3222881 100644
--- a/sys/kern/subr_counter.c
+++ b/sys/kern/subr_counter.c
...
@@ -49,14 +51,8 @@ counter_u64_zero(counter_u64_t c)
uint64_t
counter_u64_fetch(counter_u64_t c)
{
-       uint64_t r;
-       int i;

-       r = 0;
-       for (i = 0; i < mp_ncpus; i++)
-               r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
-
-       return (r);
+       return (counter_u64_fetch_inline(c));
}

With my design:

        extern struct mtx counter_locks[];
        extern uint64_t counters[];
        uint64_t r;
        volatile u_int *p;
        u_int v;
        int cnum;

        cnum = ctonum(c);
        mtx_lock(&counter_locks[cnum]);     /* might not need 1 per counter */
        r = counters[cnum];
        for (i = 0; i < mp_ncpus; i++) {
                p = (u_int *)((char *)c + sizeof(struct pcpu) * i);
                v = *p;                 /* don't care if it is stale */
                if (v >= 0x80000000) {
                        /* Transfer only when above critical level. */
                        while (atomic_cmpset_rel_int(p, v, 0) == 0)
                                v = *p; /* still don't care if it is stale */
                        counters[cnum] += v;
                }
                r += v;
        }
        mtx_unlock(&counter_locks[cnum]);
        return (r);

Mutexes give some slowness in the fetching code, but fetches eare expected
to be relatively rare.

I added the complication to usually avoid atomic ops at the last minute.
The complication might not be worth it.

Mutexes can be avoided in the usual case too if there is an atomic op
to load the 64-bit counters, but I don't want to depend on that.  i386
already has atomic load_acq and store_rel for 64-bit accesses, but
these are not available on all arches and I consider their existence
on i386 to be a bug.  On i386. they use cmpxch8b if available, else
disable interrupts.  The store_rel one is never used, and the load_acq
one is used twice to handle a tiny problem with TSC frequencies >=
2**32 Hz.  You didn't even use them here.  It seems to be just a
micro-optimization to not use them (in the cmpxch8b case, it avoids
the lock prefix for the SMP case, and in the critical section case,
it allows putting the critical section around the whole loop).

Bruce
_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"

Reply via email to