Re: bus_space_barrier semantics

2022-07-16 Thread Martin Husemann
On Sat, Jul 16, 2022 at 01:42:38PM -0700, Jason Thorpe wrote:

> Didn't OpenBSD remove the offset / length arguments?  Anyway, I'm not
> particularly concerned about this part, but it would be a nice wart to
> rid ourselves of.

Since these calls (now) should be relatively rare, and conversion would
be to just remove arguments when we port a driver, that sounds like
an acceptable difference to me.

Martin


Re: bus_space_barrier semantics

2022-07-16 Thread Jason Thorpe



> On Jul 16, 2022, at 11:15 AM, Taylor R Campbell  wrote:
> 
> Thoughts?

110% on-board with all of this.

> P.S.  I also considered eliminating the offset/length argument,
> because no implementations take advantage of them, so I started (with
> maya@'s help wrangling coccinelle) to draft a patch to remove them.
> However, the other BSDs have the same API, so changing this would make
> it more of a pain to share drivers.

Didn’t OpenBSD remove the offset / length arguments?  Anyway, I’m not 
particularly concerned about this part, but it would be a nice wart to rid 
ourselves of.

-- thorpej




bus_space_barrier semantics

2022-07-16 Thread Taylor R Campbell
tl;dr -- Proposal:
- Clarify bus_space_barrier is needed only if prefetchable/cacheable.
- Define BUS_SPACE_BARRIER_READ as acquire barrier, like membar_acquire.
- Define BUS_SPACE_BARRIER_WRITE as release barrier, like membar_release.


Many people have been confused by the semantics of bus_space_barrier,
and the bus_space(9) man page is internally inconsistent.

But on every machine I've examined, bus_space is implemented so that
bus_space_read/write are observed by the device in program order
already -- except if you enable BUS_SPACE_MAP_PREFETCHABLE or
BUS_SPACE_MAP_CACHEABLE.

Only a handful of drivers other than framebuffer drivers even use
bus_space_barrier -- the vast majority of drivers using bus_space lack
bus_space_barrier calls and would undoubtedly fail to function at all
if the machine were reordering their bus_space_reads/writes.

I propose we explicitly adopt and document the following semantics, to
match most existing usage and implementations:

- bus_space_barrier is needed only for mappings with
  BUS_SPACE_MAP_PREFETCHABLE or BUS_SPACE_MAP_CACHEABLE.
  bus_space_read/write on non-prefetchable, non-cacheable mappings on
  a single device are issued in program order and never fused or torn.

- bus_space_barrier(t, h, o, l, BUS_SPACE_BARRIER_READ) means that any
  program-prior bus_space_read (on any space) has returned data from
  the device or storage before any program-later bus_space_read or
  bus_space_write on t/h/o/l (tag, handle, offset, length), or memory
  access via the l bytes from (char *)bus_space_vaddr(t, h) + o.

  If h was mapped non-prefetchable and non-cacheable this may be a
  noop.

  This functions similarly to membar_acquire, but for I/O.

- bus_space_barrier(t, h, o, l, BUS_SPACE_BARRIER_WRITE) means that
  any program-prior bus_space_read or bus_space_write on t/h/o/l, or
  memory access via the l bytes from (char *)bus_space_vaddr(t, h) +
  o, has returned data from or been observed by the device or storage
  before before any program-later bus_space_write (on any space) is
  observed by the device.

  If h was mapped non-prefetchable and non-cacheable this may be a
  noop.

  This functions similarly to membar_release, but for I/O.

- bus_space_barrier(t, h, o, l, BUS_SPACE_BARRIER_READ |
  BUS_SPACE_BARRIER_WRITE) means that all program-prior bus_space_read
  or bus_space_write on any space must complete before any
  program-later bus_space_read or bus_space_write on any space.

  Note that this is independent of t/h/o/l; there's no way the
  arguments can elide any action here.

  This functions similarly to membar_sync, but for I/O.

- No MI notion of `ordering' vs `completion'.

  In practical terms this distinction is relevant only for software
  timing measurement (when do machine instructions retire/commit on
  the CPU?) or highly machine-dependent things (when do speculative
  TLB translation table walks happen? when does a machine-check
  exception trigger?), not for device driver correctness (in what
  order do reads and writes hit the device?).

  Currently there is no such MI difference in the API but the man page
  makes the distinction nevertheless in its prose.

Drivers that don't use BUS_SPACE_MAP_PREFETCHABLE or
BUS_SPACE_MAP_CACHEABLE never have to worry about barriers -- and,
since `cacheable' is almost always, if not always, used for ROMs,
bus_space_barrier is relevant essentially only for drivers that use
BUS_SPACE_MAP_PREFETCHABLE, e.g. for framebuffers or command queues.
(Drivers may still have to worry about bus_dmamap_sync if they use
bus_dma(9), of course.)

The usage model is something like:

/* Map MMIO registers and command/response regions on attach */
bus_space_map(sc->sc_regt, ..., 0, &sc->sc_regh);
bus_space_map(sc->sc_memt, ..., BUS_SPACE_MAP_PREFETCHABLE, 
&sc->sc_memh);

/* Submit a command */
i = sc->sc_nextcmdslot;
bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i), cmd);
bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i) + 4, arg1);
bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i) + 8, arg2);
...
bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i) + 4*n, argn);
bus_space_barrier(sc->sc_memt, sc->sc_memh, CMDSLOT(i), 4*n,
BUS_SPACE_BARRIER_WRITE);
bus_space_write_4(sc->sc_regt, sc->sc_regh, CMDTAIL, i);
sc->sc_nextcmdslot = (i + 1) % sc->sc_ncmdslots;

/* Obtain a response */
i = bus_space_read_4(sc->sc_regt, sc->sc_regh, RESPTAIL);
bus_space_barrier_4(sc->sc_memt, sc->sc_memh, RESPSLOT(i), 4*n,
BUS_SPACE_BARRIER_READ);
... bus_space_read_4(sc->sc_memt, sc->sc_memh, RESPSLOT(i) + 4*j) ...

The practical consequences of this proposal are:

1. We remove a lot of verbiage (and internal inconsistency) in
   bus_space(9).

2. We can omit needless bus_space_barrier calls in drivers that don't
   use BUS_SPACE_MAP_PREFETCHABLE (or BUS_SPACE_MAP_CA