tl;dr -- Proposal:

- Clarify bus_space_barrier is needed only if prefetchable/cacheable.
- Define BUS_SPACE_BARRIER_READ as acquire barrier, like membar_acquire.
- Define BUS_SPACE_BARRIER_WRITE as release barrier, like membar_release.
Many people have been confused by the semantics of bus_space_barrier, and the bus_space(9) man page is internally inconsistent.  But on every machine I've examined, bus_space is implemented so that bus_space_read/write are observed by the device in program order already -- except if you enable BUS_SPACE_MAP_PREFETCHABLE or BUS_SPACE_MAP_CACHEABLE.

Only a handful of drivers other than framebuffer drivers even use bus_space_barrier -- the vast majority of drivers using bus_space lack bus_space_barrier calls and would undoubtedly fail to function at all if the machine were reordering their bus_space_reads/writes.

I propose we explicitly adopt and document the following semantics, to match most existing usage and implementations:

- bus_space_barrier is needed only for mappings with BUS_SPACE_MAP_PREFETCHABLE or BUS_SPACE_MAP_CACHEABLE.  bus_space_read/write on non-prefetchable, non-cacheable mappings on a single device are issued in program order and never fused or torn.

- bus_space_barrier(t, h, o, l, BUS_SPACE_BARRIER_READ) means that any program-prior bus_space_read (on any space) has returned data from the device or storage before any program-later bus_space_read or bus_space_write on t/h/o/l (tag, handle, offset, length), or memory access via the l bytes from (char *)bus_space_vaddr(t, h) + o.  If h was mapped non-prefetchable and non-cacheable this may be a noop.

  This functions similarly to membar_acquire, but for I/O.

- bus_space_barrier(t, h, o, l, BUS_SPACE_BARRIER_WRITE) means that any program-prior bus_space_read or bus_space_write on t/h/o/l, or memory access via the l bytes from (char *)bus_space_vaddr(t, h) + o, has returned data from or been observed by the device or storage before any program-later bus_space_write (on any space) is observed by the device.  If h was mapped non-prefetchable and non-cacheable this may be a noop.

  This functions similarly to membar_release, but for I/O.
- bus_space_barrier(t, h, o, l, BUS_SPACE_BARRIER_READ | BUS_SPACE_BARRIER_WRITE) means that all program-prior bus_space_reads or bus_space_writes on any space must complete before any program-later bus_space_read or bus_space_write on any space.  Note that this is independent of t/h/o/l; there's no way the arguments can elide any action here.

  This functions similarly to membar_sync, but for I/O.

- No MI notion of `ordering' vs `completion'.  In practical terms this distinction is relevant only for software timing measurement (when do machine instructions retire/commit on the CPU?) or highly machine-dependent things (when do speculative TLB translation table walks happen?  when does a machine-check exception trigger?), not for device driver correctness (in what order do reads and writes hit the device?).  Currently there is no such MI difference in the API, but the man page makes the distinction nevertheless in its prose.

Drivers that don't use BUS_SPACE_MAP_PREFETCHABLE or BUS_SPACE_MAP_CACHEABLE never have to worry about barriers -- and, since `cacheable' is almost always, if not always, used for ROMs, bus_space_barrier is relevant essentially only for drivers that use BUS_SPACE_MAP_PREFETCHABLE, e.g. for framebuffers or command queues.  (Drivers may still have to worry about bus_dmamap_sync if they use bus_dma(9), of course.)

The usage model is something like:

	/* Map MMIO registers and command/response regions on attach */
	bus_space_map(sc->sc_regt, ..., 0, &sc->sc_regh);
	bus_space_map(sc->sc_memt, ..., BUS_SPACE_MAP_PREFETCHABLE,
	    &sc->sc_memh);

	/* Submit a command */
	i = sc->sc_nextcmdslot;
	bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i), cmd);
	bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i) + 4, arg1);
	bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i) + 8, arg2);
	...
	bus_space_write_4(sc->sc_memt, sc->sc_memh, CMDSLOT(i) + 4*n, argn);
	bus_space_barrier(sc->sc_memt, sc->sc_memh, CMDSLOT(i), 4*n,
	    BUS_SPACE_BARRIER_WRITE);
	bus_space_write_4(sc->sc_regt, sc->sc_regh, CMDTAIL, i);
	sc->sc_nextcmdslot = (i + 1) % sc->sc_ncmdslots;

	/* Obtain a response */
	i = bus_space_read_4(sc->sc_regt, sc->sc_regh, RESPTAIL);
	bus_space_barrier(sc->sc_memt, sc->sc_memh, RESPSLOT(i), 4*n,
	    BUS_SPACE_BARRIER_READ);
	...
	... bus_space_read_4(sc->sc_memt, sc->sc_memh, RESPSLOT(i) + 4*j) ...

The practical consequences of this proposal are:

1. We remove a lot of verbiage (and internal inconsistency) in bus_space(9).

2. We can omit needless bus_space_barrier calls in drivers that don't use BUS_SPACE_MAP_PREFETCHABLE (or BUS_SPACE_MAP_CACHEABLE).  Most of these calls, outside framebuffer drivers, were added just because the man page said so, not because it was ever necessary on any machine as part of the bus_space semantics, and then they got copied & pasted from time to time.

   In some cases this change may speed up device drivers by avoiding unnecessary cache/store-buffer flushes for regular memory that is not relevant to device I/O.

3. We may need to strengthen the BUS_SPACE_BARRIER_WRITE case on some platforms.  For instance, on aarch64, it is currently implemented by DSB ST, which orders stores but not loads.  Arm CPUs have been observed to reorder load;store to store;load on normal memory[1], so it would not be surprising if they did so for device memory too, in which case under the proposed semantics BUS_SPACE_BARRIER_WRITE requires the costlier DSB SY.  (In contrast, on x86, SFENCE suffices here because x86 never reorders load;store to store;load under any circumstances[2].)

Alternatives:

(a) We could define BUS_SPACE_BARRIER_WRITE to be write-before-write only, and BUS_SPACE_BARRIER_READ to be read-before-read only.
But if any device uses prefetchable mappings for bidirectional communication between the device and CPU, it really needs read-before-read/write and read/write-before-write.  And write-before-read is usually the most expensive type of barrier, but very seldom is it necessary, so forcing driver authors to reach for it (via read/write-before-read/write, i.e., BSB_READ|BSB_WRITE) when all they need is read/write-before-write isn't great.

(b) We could introduce finer-grained flags: BUS_SPACE_BARRIER_READ_READ, BUS_SPACE_BARRIER_READ_WRITE, &c., and maybe define BUS_SPACE_BARRIER_READ as an alias for BUS_SPACE_BARRIER_READ_READ and BUS_SPACE_BARRIER_WRITE as an alias for BUS_SPACE_BARRIER_WRITE_WRITE.

But I don't think the extra API complexity -- for something used so seldom -- is worthwhile, and in my experience supporting arbitrary combinations of {read,write}-before-{read,write} ordering is counterproductive because it's confusing and most combinations are useless in practice.

(c) On top of (a), we could define a new BUS_SPACE_BARRIER_ACQUIRE to be read-before-read/write and BUS_SPACE_BARRIER_RELEASE to be read/write-before-write, and maybe BUS_SPACE_BARRIER_SYNC to be r/w-before-r/w as an alias for BSB_READ|BSB_WRITE; this way we could dispense with the other useless combinations.  But again I'm not sure the complexity is worthwhile.

Maybe it will turn out that there are no devices which use prefetchable mappings for bidirectional communication, in which case (a) is probably the best option, and requires no additional implementation work (although we could still stand to audit all the bus_space_barrier implementations!).

Thoughts?

P.S.  I also considered eliminating the offset/length arguments, because no implementations take advantage of them, so I started (with maya@'s help wrangling coccinelle) to draft a patch to remove them.  However, the other BSDs have the same API, so changing this would make it more of a pain to share drivers.
[1] Luc Maranget, Susmit Sarkar, and Peter Sewell, `A Tutorial Introduction to the ARM and POWER Relaxed Memory Models', Revision 120, 2012-10-10, Sec. 7.1 Load-Buffering (LB) Examples: Observed Behaviour, p. 24.
https://web.archive.org/web/20220601003130/https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf#page=24

[2] AMD64 Architecture Programmer's Manual, Volume 2: System Programming, AMD Publication No. 24593, Revision 3.38, 2021-10, Sec. 7.4.2 Memory Barrier Interaction with Memory Types, Table 7-3, p. 196.
https://web.archive.org/web/20220625040004/https://www.amd.com/system/files/TechDocs/24593.pdf#page=256