Module Name:    src
Committed By:   riastradh
Date:           Fri Aug 12 13:44:12 UTC 2022

Modified Files:
        src/sys/arch/x86/x86: bus_dma.c

Log Message:
x86: Adjust fences issued in bus_dmamap_sync after bouncing.

And expand the comment on the lfence for POSTREAD before bouncing.

Net change:

op                      before bounce       after bounce
                                        old             new
PREREAD                 nop             lfence          sfence
PREWRITE                nop             mfence          sfence
PREREAD|PREWRITE        nop             mfence          sfence
POSTREAD                lfence          lfence          nop[*]
POSTWRITE               nop             mfence          nop
POSTREAD|POSTWRITE      lfence          mfence          nop[*]

The case of PREREAD is as follows:

1. loads and stores before DMA buffer may be allocated for the purpose
2. bus_dmamap_sync(BUS_DMASYNC_PREREAD)
3. store to register or DMA descriptor to trigger DMA

The register or DMA descriptor may be in any type of memory (or I/O).

lfence at (2) is _not enough_ to ensure stores at (1) have completed
before the store in (3) in case the register or DMA descriptor lives
in wc/wc+ memory, or the store to it is non-temporal: in that case,
it may executed early before all the stores in (1) have completed.

On the other hand, lfence at (2) is _not needed_ to ensure loads in
(1) have completed before the store in (3), because x86 never
reorders load;store to store;load.  So we may need to enforce
store/store ordering, but not any other ordering, hence sfence.

The case of PREWRITE is as follows:

1. stores to DMA buffer (and loads from it, before allocated)
2. bus_dmamap_sync(BUS_DMASYNC_PREWRITE)
3. store to register or DMA descriptor to trigger DMA

Ensuring prior loads have completed is not necessary because x86
never reorders load;store to store;load (and in any case, the device
isn't changing the DMA buffer, so it's safe to read over and over
again).  But we must ensure the stores in (1) have completed before
the store in (3).  So we need sfence, in case either the DMA buffer
or the register or the DMA descriptor is in wc/wc+ memory or either
store is non-temporal.  But we don't need mfence.

The case of POSTREAD is as follows:

1. load from register or DMA descriptor notifying DMA completion
2. bus_dmamap_sync(BUS_DMASYNC_POSTREAD)
   (a) lfence [*]
   (b) if bouncing, memcpy(userbuf, bouncebuf, ...)
   (c) ???
3. loads from DMA buffer to use data, and stores to reuse buffer

This certainly needs an lfence to prevent the loads at (3) from being
executed early -- but bus_dmamap_sync already issues lfence in that
case at 2(a), before it conditionally loads from the bounce buffer
into the user's buffer.  So we don't need any _additional_ fence
_after_ bouncing at 2(c).

The case of POSTWRITE is as follows:

1. load from register or DMA descriptor notifying DMA completion
2. bus_dmamap_sync(BUS_DMASYNC_POSTWRITE)
3. loads and stores to reuse buffer

Stores at (3) will never be executed early because x86 never reorders
load;store to store;load for any memory types.  Loads at (3) are
harmless because the device isn't changing the buffer -- it's
supposed to be fixed from the time of PREWRITE to the time of
POSTWRITE as far as the CPU can witness.

Proposed on port-amd64 last month:

https://mail-index.netbsd.org/port-amd64/2022/07/16/msg003593.html

Reference:

AMD64 Architecture Programmer's Manual, Volume 2: System Programming,
24593--Rev. 3.38--November 2021, Sec. 7.4.2 Memory Barrier Interaction
with Memory Types, Table 7-3, p. 196.
https://www.amd.com/system/files/TechDocs/24593.pdf


To generate a diff of this commit:
cvs rdiff -u -r1.85 -r1.86 src/sys/arch/x86/x86/bus_dma.c

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Modified files:

Index: src/sys/arch/x86/x86/bus_dma.c
diff -u src/sys/arch/x86/x86/bus_dma.c:1.85 src/sys/arch/x86/x86/bus_dma.c:1.86
--- src/sys/arch/x86/x86/bus_dma.c:1.85	Wed Jul 13 00:12:20 2022
+++ src/sys/arch/x86/x86/bus_dma.c	Fri Aug 12 13:44:12 2022
@@ -1,4 +1,4 @@
-/*	$NetBSD: bus_dma.c,v 1.85 2022/07/13 00:12:20 riastradh Exp $	*/
+/*	$NetBSD: bus_dma.c,v 1.86 2022/08/12 13:44:12 riastradh Exp $	*/
 
 /*-
  * Copyright (c) 1996, 1997, 1998, 2007, 2020 The NetBSD Foundation, Inc.
@@ -31,7 +31,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: bus_dma.c,v 1.85 2022/07/13 00:12:20 riastradh Exp $");
+__KERNEL_RCSID(0, "$NetBSD: bus_dma.c,v 1.86 2022/08/12 13:44:12 riastradh Exp $");
 
 /*
  * The following is included because _bus_dma_uiomove is derived from
@@ -789,6 +789,13 @@ _bus_dmamap_unload(bus_dma_tag_t t, bus_
 
 /*
  * Synchronize a DMA map.
+ *
+ * Reference:
+ *
+ *	AMD64 Architecture Programmer's Manual, Volume 2: System
+ *	Programming, 24593--Rev. 3.38--November 2021, Sec. 7.4.2 Memory
+ *	Barrier Interaction with Memory Types, Table 7-3, p. 196.
+ *	https://web.archive.org/web/20220625040004/https://www.amd.com/system/files/TechDocs/24593.pdf#page=256
  */
 static void
 _bus_dmamap_sync(bus_dma_tag_t t, bus_dmamap_t map, bus_addr_t offset,
@@ -816,11 +823,17 @@ _bus_dmamap_sync(bus_dma_tag_t t, bus_dm
 #endif
 
 	/*
-	 * The caller has been alerted to DMA completion by reading a
-	 * register or DMA descriptor, and is about to read out of the
-	 * DMA memory buffer that the device filled.  LFENCE ensures
-	 * that these happen in order, so that the caller doesn't
-	 * proceed to read any stale data from cache or speculation.
+	 * BUS_DMASYNC_POSTREAD: The caller has been alerted to DMA
+	 * completion by reading a register or DMA descriptor, and the
+	 * caller is about to read out of the DMA memory buffer that
+	 * the device just filled.
+	 *
+	 * => LFENCE ensures that these happen in order so that the
+	 *    caller, or the bounce buffer logic here, doesn't proceed
+	 *    to read any stale data from cache or speculation.  x86
+	 *    never reorders loads from wp/wt/wb or uc memory, but it
+	 *    may execute loads from wc/wc+ memory early, e.g. with
+	 *    BUS_SPACE_MAP_PREFETCHABLE.
 	 */
 	if (ops & BUS_DMASYNC_POSTREAD)
 		x86_lfence();
@@ -954,20 +967,46 @@ _bus_dmamap_sync(bus_dma_tag_t t, bus_dm
 		break;
 	}
 end:
-	if (ops & (BUS_DMASYNC_PREWRITE|BUS_DMASYNC_POSTWRITE)) {
-		/*
-		 * from the memory POV a load can be reordered before a store
-		 * (a load can fetch data from the write buffers, before
-		 * data hits the cache or memory), a mfence avoids it.
-		 */
-		x86_mfence();
-	} else if (ops & (BUS_DMASYNC_PREREAD|BUS_DMASYNC_POSTREAD)) {
-		/*
-		 * all past reads should have completed at before this point,
-		 * and future reads should not have started yet.
-		 */
-		x86_lfence();
-	}
+	/*
+	 * BUS_DMASYNC_PREREAD: The caller may have previously been
+	 * using a DMA memory buffer, with loads and stores, and is
+	 * about to trigger DMA by writing to a register or DMA
+	 * descriptor.
+	 *
+	 * => SFENCE ensures that the stores happen in order, in case
+	 *    the latter one is non-temporal or to wc/wc+ memory and
+	 *    thus may be executed early.  x86 never reorders
+	 *    load;store to store;load for any memory type, so no
+	 *    barrier is needed for prior loads.
+	 *
+	 * BUS_DMASYNC_PREWRITE: The caller has just written to a DMA
+	 * memory buffer, or we just wrote to to the bounce buffer,
+	 * data that the device needs to use, and the caller is about
+	 * to trigger DMA by writing to a register or DMA descriptor.
+	 *
+	 * => SFENCE ensures that these happen in order so that any
+	 *    buffered stores are visible to the device before the DMA
+	 *    is triggered.  x86 never reorders (non-temporal) stores
+	 *    to wp/wt/wb or uc memory, but it may reorder two stores
+	 *    if one is to wc/wc+ memory, e.g. if the DMA descriptor is
+	 *    mapped with BUS_SPACE_MAP_PREFETCHABLE.
+	 */
+	if (ops & (BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE))
+		x86_sfence();
+
+	/*
+	 * BUS_DMASYNC_POSTWRITE: The caller has been alerted to DMA
+	 * completion by reading a register or DMA descriptor, and the
+	 * caller may proceed to reuse the DMA memory buffer, with
+	 * loads and stores.
+	 *
+	 * => No barrier is needed.  Since the DMA memory buffer is not
+	 *    changing (we're sending data to the device, not receiving
+	 *    data from the device), prefetched loads are safe.  x86
+	 *    never reoreders load;store to store;load for any memory
+	 *    type, so early execution of stores prior to witnessing
+	 *    the DMA completion is not possible.
+	 */
 }
 
 /*

Reply via email to