Re: FYI: mb(9) API is finally going away

2019-11-29 Thread Taylor R Campbell
The mb(9) API is now gone, except within sys/arch/mips, where it is
used inside __cpu_simple_lock -- there were too many ifdefs across
components for me to disentangle, and the logic seems to be completely
bonkers: e.g., on Octeon, __cpu_simple_unlock issues a memory barrier
_after_ the store that releases the lock, which makes no sense.
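To illustrate the expected ordering (a sketch only; the function name is
made up, and membar_exit() stands in for whatever release barrier the
port uses):

#include <sys/atomic.h>		/* membar_exit() */

/*
 * Sketch only: the release barrier must come _before_ the store that
 * frees the lock, so the critical section's memory operations are
 * visible to the next lock holder before the lock appears free.
 */
static inline void
sketch_simple_unlock(volatile unsigned int *lockp)
{
	membar_exit();	/* release barrier first ... */
	*lockp = 0;	/* ... then the store that releases the lock */
}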


Re: __{read,write}_once

2019-11-29 Thread Taylor R Campbell
> Date: Sun, 24 Nov 2019 19:25:52 +
> From: Taylor R Campbell 
> 
> This thread is not converging on consensus, so we're discussing the
> semantics and naming of these operations as core and will come back
> with a decision by the end of the week.

We (core) carefully read the thread, and discussed this and the
related Linux READ_ONCE/WRITE_ONCE macros as well as the C11 atomic
API.

   For maxv: Please add conditional definitions in <sys/atomic.h>
   according to what KCSAN needs, and use atomic_load/store_relaxed
   for counters and other integer objects in the rest of your patch.
   (I didn't see any pointer loads there.)  For uvm's lossy counters,
   please use atomic_store_relaxed(p, 1 + atomic_load_relaxed(p)) and
   not an __add_once macro -- since these should really be per-CPU
   counters, we don't want to endorse this pattern by making it
   pretty.
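For illustration, that pattern looks something like this (sketch only;
the counter name is made up):

#include <sys/atomic.h>

/* Hypothetical lossy statistics counter. */
static unsigned long nlookups;

static void
count_lookup(void)
{
	/*
	 * Untorn, unfused load and store, but no atomic read/modify/write:
	 * concurrent increments may be lost, which is acceptable for a
	 * lossy counter.  (Per-CPU counters would be the better fix.)
	 */
	atomic_store_relaxed(&nlookups, 1 + atomic_load_relaxed(&nlookups));
}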

* Summary

We added a few macros to <sys/atomic.h> for the purpose,
atomic_load_<ordering>(p) and atomic_store_<ordering>(p, v).  The
orderings are relaxed, acquire, consume, and release, and are intended
to match C11 semantics.  See the new atomic_loadstore(9) man page for
reference.

Currently they are defined in terms of volatile loads and stores, but
we should eventually use the C11 atomic API instead in order to
provide the intended atomicity guarantees under all compilers without
having to rely on the folklore interpretations of volatile.
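To make the distinction concrete, here is a rough sketch of both styles
of definition -- illustrative only, not the macros actually in the tree:

/*
 * Volatile-based: relies on the compiler never tearing or fusing
 * volatile accesses of machine-word size.
 */
#define	sketch_load_relaxed(p)						\
	(*(const volatile __typeof__(*(p)) *)(p))

/*
 * C11-based: requires the object to be declared _Atomic, but the
 * standard, rather than folklore, guarantees the access is untorn
 * and unfused.
 */
#include <stdatomic.h>

static inline unsigned int
sketch_load_relaxed_c11(const _Atomic unsigned int *p)
{
	return atomic_load_explicit(p, memory_order_relaxed);
}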

* Details

There are four main properties involved in the operations under
discussion:

1. No tearing.  A 32-bit write can't be split into two separate 16-bit
   writes, for instance.

   * In _some_ cases, namely aligned pointers to sufficiently small
     objects, Linux READ_ONCE/WRITE_ONCE guarantee no tearing.

   * C11 atomic_load/store guarantees no tearing -- although on large
     objects it may involve locks, requiring the C11 type qualifier
     _Atomic and changing the ABI.

   This was the primary motivation for maxv's original question.

2. No fusing.  Consecutive writes can't be combined into one, for
   instance; likewise, a write followed by a read of the same object
   can't skip the read and simply return the value that was just
   written.  (Tearing and fusing are both sketched just after this
   list.)

   * Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store
     guarantee no fusing.

3. Data-dependent memory ordering.  If you read a pointer, and then
   dereference the pointer (maybe plus some offset), the reads happen
   in that order.

   * Linux's READ_ONCE guarantees this by issuing the analogue of
     membar_datadep_consumer on DEC Alpha, and nothing on other CPUs.

   * C11's atomic_load guarantees this with seq_cst, acquire, or
     consume memory ordering.

4. Cost.  There's no need to incur the cost of read/modify/write
   atomic operations, and for many purposes, no need to incur the cost
   of memory-ordering barriers.
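Here is a small sketch of the tearing and fusing hazards in items 1
and 2, with made-up shared variables and plain (non-atomic_*) accesses:

#include <sys/types.h>

extern uint64_t stamp;	/* shared with another thread */
extern int done;	/* shared with another thread */

void
plain_access_hazards(void)
{
	/*
	 * Tearing (property 1): on a 32-bit platform the compiler may
	 * emit two 32-bit stores here, so a concurrent reader can see
	 * a half-updated value.
	 */
	stamp = 0x0123456789abcdefULL;

	/*
	 * Fusing (property 2): the compiler may load `done' once,
	 * hoist the load out of the loop, and spin forever on the
	 * stale value.
	 */
	while (!done)
		continue;
}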

To express these, we've decided to add a few macros that are similar
to Linux's READ_ONCE/WRITE_ONCE and C11's atomic_load/store_explicit
but are less error-prone and less cumbersome:

#include <sys/atomic.h>

- atomic_load_relaxed(p) is like *p, but guarantees no tearing and no
  fusing.  No ordering relative to memory operations on other objects
  is guaranteed.  

- atomic_store_relaxed(p, v) is like *p = v, but guarantees no tearing
  and no fusing.  No ordering relative to memory operations on other
  objects is guaranteed.

- atomic_store_release(p, v) and atomic_load_acquire(p) are,
  respectively, like *p = v and *p, but guarantee no tearing and no
  fusing.  They _also_ guarantee for logic like

        Thread A                        Thread B

        stuff();
        atomic_store_release(p, v);
                                        u = atomic_load_acquire(p);
                                        things();

  that _if_ the atomic_load_acquire(p) in thread B witnesses the state
  of the object at p set by atomic_store_release(p, v) in thread A,
  then all memory operations in stuff() happen before any memory
  operations in things().

  No guarantees if only one thread participates -- the store-release
  and load-acquire _must_ be paired.

- atomic_load_consume(p) is like atomic_load_acquire(p), but it only
  guarantees ordering for data-dependent memory references.  Like
  atomic_load_acquire, it must be paired with atomic_store_release.
  However, on most CPUs, it is as _cheap_ as atomic_load_relaxed.
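Putting release/acquire/consume together, a minimal publication sketch
(all names made up):

#include <sys/atomic.h>

struct message {
	int	m_type;
	int	m_value;
};

static struct message *msgp;	/* shared; NULL until published */

/* Producer: fill in the record, then publish the pointer. */
static void
publish(struct message *m)
{
	m->m_type = 1;
	m->m_value = 42;
	/* All stores above happen before the pointer becomes visible. */
	atomic_store_release(&msgp, m);
}

/* Consumer: if the pointer is seen, the record's contents are seen too. */
static int
receive(void)
{
	struct message *m;

	/*
	 * load-consume suffices because the only later accesses are
	 * data-dependent dereferences of the loaded pointer;
	 * atomic_load_acquire would also be correct, at possibly
	 * greater cost on some CPUs.
	 */
	m = atomic_load_consume(&msgp);
	if (m == NULL)
		return -1;
	return m->m_value;
}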

The atomic load/store operations are defined _only_ on objects no
larger than the architecture can load or store in a single memory
operation -- so, for example, on 32-bit platforms they cannot be used
on 64-bit quantities; attempts to do so will lead to compile-time
errors.  They are also defined _only_ on aligned pointers -- using
them on unaligned pointers may lead to run-time crashes, even on
architectures without strict alignment requirements.
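For example (a made-up structure; the _LP64 guard stands in for
`platforms with atomic 64-bit loads and stores'):

#include <sys/types.h>
#include <sys/atomic.h>

struct stats {
	uint32_t	s_flags;	/* fine everywhere */
	uint64_t	s_bytes;	/* only where 64-bit loads/stores
					 * are atomic */
};

uint32_t
stats_flags(struct stats *sp)
{
	return atomic_load_relaxed(&sp->s_flags);
}

#ifdef _LP64
uint64_t
stats_bytes(struct stats *sp)
{
	/* On a 32-bit platform this would fail to compile. */
	return atomic_load_relaxed(&sp->s_bytes);
}
#endif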

* Why the names atomic_{load,store}_<ordering>?

- Atomic.  Although `atomic' may suggest `expensive' to some people
  (and I'm guilty of making that connection in the past), what's
  really expensive is atomic _read/modify/write_ operations and
  _memory ordering guarantees_.

  Merely preventing tearing and fusing, by contrast, costs little.

FYI: mb(9) API is finally going away

2019-11-29 Thread Taylor R Campbell
FYI: The mb(9) API -- consisting of the mb_read, mb_write, and
mb_memory memory barriers -- was incomplete for users (failed to cover
important use cases) and incompletely defined (not defined on some
platforms like x86).  It was intended to be removed over a decade ago
in favour of the Solaris-style membar_*; only a few MD users in-tree
remain.

I'm about to remove mb(9) altogether.  This is a step in modernizing
our memory ordering interfaces.  If you were using it for some reason,
let me know and I can help you find the appropriate replacement.
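The replacements come from membar_ops(9); here is a minimal sketch of
the usual producer/consumer pairing (variable names made up):

#include <sys/atomic.h>		/* membar_producer(), membar_consumer() */

extern int payload;	/* hypothetical shared data */
extern int ready;	/* hypothetical shared flag */

void
producer(void)
{
	payload = 42;
	membar_producer();	/* make the payload visible first ... */
	ready = 1;		/* ... then set the flag */
}

int
consumer(void)
{
	if (ready == 0)
		return -1;
	membar_consumer();	/* read the flag before the payload */
	return payload;
}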


Re: Why NetBSD x86's bus_space_barrier does not use [sml]fence?

2019-11-29 Thread Andrew Doran
On Fri, Nov 29, 2019 at 10:39:09PM +0900, Shoichi Yamaguchi wrote:

> FreeBSD and OpenBSD use memory fences (mfence, sfence, lfence) to
> implement bus_space_barrier().
> On the other hand, NetBSD does not use them.
> 
> Do you know the background about current implementation?
> 
> I found an issue caused by this implementation difference while porting
> ixl(4).

Are you using BUS_SPACE_MAP_PREFETCHABLE?

If yes, I think there is the possibility of reordering.  There should
probably be a fence.

If no, the CALL instruction that calls bus_space_barrier() produces a write
to the stack when storing the return address.  On x86, stores are totally
ordered, and loads are never reordered around stores.  No further barrier
is needed.  This is the idea anyway; sometimes reality does not match..
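For context, a sketch of the kind of access bus_space_barrier(9) is
meant to order (register offsets and names are made up):

#include <sys/types.h>
#include <sys/bus.h>

#define	MYDEV_CTRL	0x00	/* hypothetical register offsets */
#define	MYDEV_STATUS	0x04
#define	MYDEV_REGSIZE	0x08
#define	MYDEV_CTRL_GO	0x01

static uint32_t
mydev_kick(bus_space_tag_t bst, bus_space_handle_t bsh)
{
	bus_space_write_4(bst, bsh, MYDEV_CTRL, MYDEV_CTRL_GO);
	bus_space_barrier(bst, bsh, 0, MYDEV_REGSIZE,
	    BUS_SPACE_BARRIER_WRITE | BUS_SPACE_BARRIER_READ);
	return bus_space_read_4(bst, bsh, MYDEV_STATUS);
}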

Andrew


Why NetBSD x86's bus_space_barrier does not use [sml]fence?

2019-11-29 Thread Shoichi Yamaguchi
Hi, all

FreeBSD and OpenBSD use memory fences (mfence, sfence, lfence) to
implement bus_space_barrier().
On the other hand, NetBSD does not use them.

Do you know the background of the current implementation?

I found an issue caused by this implementation difference while porting
ixl(4).