Re: [Mesa-dev] [PATCH 2/5] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

2018-04-10 Thread Chris Wilson
Quoting Scott D Phillips (2018-04-10 16:33:18)
> Chris Wilson  writes:
> 
> > Quoting Chris Wilson (2018-04-05 20:54:54)
> > > Quoting Scott D Phillips (2018-04-03 21:05:42)
> 
> [...]
> 
> > > Ok, was hoping to see how you choose to use the streaming load, but I
> > > guess that's the next patch.
> > > 
> > > Reviewed-by: Chris Wilson 
> >
> > Oh, one point Eric Anholt mentioned on another thread about movntdqa is
> > that stale data inside the internal buffer is not automatically
> > invalidated. We may need to emit an explicit mfence before the copies if
> > we are in doubt. A single mfence per tiled-copy is probably not enough to
> > worry about optimising away.
> 
> Looking around, I found this erratum about movntdqa not honoring the
> ordering guarantees of locked instructions (VLP31 in the pdf):
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf
> 
> So I added this code near the top of tiled_to_linear():
> 
> if (mem_copy == (mem_copy_fn)_mesa_streaming_load_memcpy) {
>    /* Various Atom processors have errata where the movntdqa instruction
>     * (which is used in streaming_load_memcpy) may incorrectly be reordered
>     * before locked instructions. To work around that, we put an lfence
>     * here to manually wait for preceding loads to be completed.
>     */
>    __builtin_ia32_lfence();
> }
> 
> By my hazy understanding, it seems that an mfence won't suffice where the
> erratum says you need the lfence. Do I have that right, or should this be
> an mfence?

An lfence is a weaker version of mfence. We are not using locked
instructions for serialising access within the data, or at least not
from the perspective of serialising it with the GPU. Certainly it's not
been an issue for the kernel. *touch wood*

Note you can use _mm_*fence() to keep using a consistent instruction set.
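
For example, a minimal sketch of what I mean (untested and purely
illustrative; the wrapper name is made up, only the intrinsics are real):

#include <emmintrin.h> /* _mm_lfence() and _mm_mfence() come with SSE2 */

static inline void
fence_before_streaming_loads(void)
{
   /* Order all earlier loads before the movntdqa loads that follow;
    * equivalent to the __builtin_ia32_lfence() in your snippet. */
   _mm_lfence();
}
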
-Chris


Re: [Mesa-dev] [PATCH 2/5] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

2018-04-10 Thread Scott D Phillips
Chris Wilson  writes:

> Quoting Chris Wilson (2018-04-05 20:54:54)
> > Quoting Scott D Phillips (2018-04-03 21:05:42)

[...]

> > Ok, was hoping to see how you choose to use the streaming load, but I
> > guess that's the next patch.
> > 
> > Reviewed-by: Chris Wilson 
>
> Oh, one point Eric Anholt mentioned on another thread about movntdqa is
> that stale data inside the internal buffer is not automatically
> invalidated. We may need to emit an explicit mfence before the copies if
> we are in doubt. A single mfence per tiled-copy is probably not enough to
> worry about optimising away.

Looking around, I found this erratum about movntdqa not honoring the
ordering guarantees of locked instructions (VLP31 in the pdf):

https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf

So I added this code near the top of tiled_to_linear():

if (mem_copy == (mem_copy_fn)_mesa_streaming_load_memcpy) {
   /* Various Atom processors have errata where the movntdqa instruction
    * (which is used in streaming_load_memcpy) may incorrectly be reordered
    * before locked instructions. To work around that, we put an lfence
    * here to manually wait for preceding loads to be completed.
    */
   __builtin_ia32_lfence();
}

By my hazy understanding, it seems that an mfence won't suffice where the
erratum says you need the lfence. Do I have that right, or should this be
an mfence?


Re: [Mesa-dev] [PATCH 2/5] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

2018-04-05 Thread Chris Wilson
Quoting Chris Wilson (2018-04-05 20:54:54)
> Quoting Scott D Phillips (2018-04-03 21:05:42)
> > The reference for MOVNTDQA says:
> > 
> > For WC memory type, the nontemporal hint may be implemented by
> > loading a temporary internal buffer with the equivalent of an
> > aligned cache line without filling this data to the cache.
> > [...] Subsequent MOVNTDQA reads to unread portions of the WC
> > cache line will receive data from the temporary internal
> > buffer if data is available.
> > 
> > This hidden cache line sized temporary buffer can improve the
> > read performance from wc maps.
> > ---
> >  src/mesa/drivers/dri/i965/Makefile.am  |  7 
> >  src/mesa/drivers/dri/i965/Makefile.sources |  6 ++-
> >  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 52 
> > ++
> >  src/mesa/drivers/dri/i965/meson.build  | 18 +++--
> >  4 files changed, 78 insertions(+), 5 deletions(-)
> > 
> > diff --git a/src/mesa/drivers/dri/i965/Makefile.am 
> > b/src/mesa/drivers/dri/i965/Makefile.am
> > index 889d4c68a2b..ff47add93f4 100644
> > --- a/src/mesa/drivers/dri/i965/Makefile.am
> > +++ b/src/mesa/drivers/dri/i965/Makefile.am
> > @@ -92,8 +92,14 @@ libi965_gen11_la_CFLAGS = $(AM_CFLAGS) 
> > -DGEN_VERSIONx10=110
> >  
> >  noinst_LTLIBRARIES = \
> > libi965_dri.la \
> > +   libintel_tiled_memcpy.la \
> > $(I965_PERGEN_LIBS)
> >  
> > +libintel_tiled_memcpy_la_SOURCES = \
> > +   $(intel_tiled_memcpy_FILES)
> > +libintel_tiled_memcpy_la_CFLAGS = \
> > +   $(AM_CFLAGS) $(SSE41_CFLAGS)
> > +
> >  libi965_dri_la_SOURCES = \
> > $(i965_FILES) \
> > $(i965_oa_GENERATED_FILES)
> > @@ -104,6 +110,7 @@ libi965_dri_la_LIBADD = \
> > $(top_builddir)/src/intel/compiler/libintel_compiler.la \
> > $(top_builddir)/src/intel/blorp/libblorp.la \
> > $(I965_PERGEN_LIBS) \
> > +   libintel_tiled_memcpy.la \
> > $(LIBDRM_LIBS)
> 
> Makes sense.
> 
> > diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c 
> > b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> > index 7c6bde990d6..d076351b322 100644
> > --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> > +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> > @@ -36,6 +36,10 @@
> >  #include "brw_context.h"
> >  #include "intel_tiled_memcpy.h"
> >  
> > +#if defined(USE_SSE41)
> > +#include "main/streaming-load-memcpy.h"
> > +#include 
> > +#endif
> >  #if defined(__SSSE3__)
> >  #include 
> >  #elif defined(__SSE2__)
> > @@ -213,6 +217,30 @@ rgba8_copy_aligned_src(void *dst, const void *src, 
> > size_t bytes)
> > return dst;
> >  }
> >  
> > +#if defined(USE_SSE41)
> > +static ALWAYS_INLINE void*
> 
> Space in that void*? (but don't quote me on mesa/i965 preferred style!)
> 
> > +_memcpy_streaming_load(void *dest, const void *src, size_t count)
> > +{
> > +   if (count == 16) {
> > +  __m128i val = _mm_stream_load_si128((__m128i *)src);
> > +  _mm_store_si128((__m128i *)dest, val);
> > +  return dest;
> > +   } else if (count == 64) {
> > +  __m128i val0 = _mm_stream_load_si128(((__m128i *)src) + 0);
> > +  __m128i val1 = _mm_stream_load_si128(((__m128i *)src) + 1);
> > +  __m128i val2 = _mm_stream_load_si128(((__m128i *)src) + 2);
> > +  __m128i val3 = _mm_stream_load_si128(((__m128i *)src) + 3);
> > +  _mm_store_si128(((__m128i *)dest) + 0, val0);
> > +  _mm_store_si128(((__m128i *)dest) + 1, val1);
> > +  _mm_store_si128(((__m128i *)dest) + 2, val2);
> > +  _mm_store_si128(((__m128i *)dest) + 3, val3);
> > +  return dest;
> > +   } else {
> 
> assert(count < 16); or assert(count < 64) ?
> 
> Might as well remind the reader (and caller?!) that this is only for
> copying the residuals.
> 
> > +  return memcpy(dest, src, count);
> > +   }
> > +}
> > +#endif
> > +
> >  /**
> >   * Each row from y0 to y1 is copied in three parts: [x0,x1), [x1,x2), 
> > [x2,x3).
> >   * These ranges are in bytes, i.e. pixels * bytes-per-pixel.
> > @@ -677,6 +705,12 @@ xtiled_to_linear_faster(uint32_t x0, uint32_t x1, 
> > uint32_t x2, uint32_t x3,
> >   return xtiled_to_linear(0, 0, xtile_width, xtile_width, 0, 
> > xtile_height,
> >   dst, src, dst_pitch, swizzle_bit,
> >   rgba8_copy, rgba8_copy_aligned_src);
> > +#if defined(USE_SSE41)
> > +  else if (mem_copy == (mem_copy_fn)_mesa_streaming_load_memcpy)
> > + return xtiled_to_linear(0, 0, xtile_width, xtile_width, 0, 
> > xtile_height,
> > + dst, src, dst_pitch, swizzle_bit, memcpy,
> > + _memcpy_streaming_load);
> 
> Please group memcpy and _memcpy_streaming_load (put the line break before
> them to keep them on the same line).
> 
> > +#endif
> >else
> >   unreachable("not reached");
> > } else {
> > @@ -687,6 +721,12 @@ xtiled_to_linear_faster(uint32_t x0, uint32_t x1, 
> > 

Re: [Mesa-dev] [PATCH 2/5] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

2018-04-05 Thread Chris Wilson
Quoting Scott D Phillips (2018-04-03 21:05:42)
> The reference for MOVNTDQA says:
> 
> For WC memory type, the nontemporal hint may be implemented by
> loading a temporary internal buffer with the equivalent of an
> aligned cache line without filling this data to the cache.
> [...] Subsequent MOVNTDQA reads to unread portions of the WC
> cache line will receive data from the temporary internal
> buffer if data is available.
> 
> This hidden cache line sized temporary buffer can improve the
> read performance from wc maps.
> ---
>  src/mesa/drivers/dri/i965/Makefile.am  |  7 
>  src/mesa/drivers/dri/i965/Makefile.sources |  6 ++-
>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 52 
> ++
>  src/mesa/drivers/dri/i965/meson.build  | 18 +++--
>  4 files changed, 78 insertions(+), 5 deletions(-)
> 
> diff --git a/src/mesa/drivers/dri/i965/Makefile.am 
> b/src/mesa/drivers/dri/i965/Makefile.am
> index 889d4c68a2b..ff47add93f4 100644
> --- a/src/mesa/drivers/dri/i965/Makefile.am
> +++ b/src/mesa/drivers/dri/i965/Makefile.am
> @@ -92,8 +92,14 @@ libi965_gen11_la_CFLAGS = $(AM_CFLAGS) -DGEN_VERSIONx10=110
>  
>  noinst_LTLIBRARIES = \
> libi965_dri.la \
> +   libintel_tiled_memcpy.la \
> $(I965_PERGEN_LIBS)
>  
> +libintel_tiled_memcpy_la_SOURCES = \
> +   $(intel_tiled_memcpy_FILES)
> +libintel_tiled_memcpy_la_CFLAGS = \
> +   $(AM_CFLAGS) $(SSE41_CFLAGS)
> +
>  libi965_dri_la_SOURCES = \
> $(i965_FILES) \
> $(i965_oa_GENERATED_FILES)
> @@ -104,6 +110,7 @@ libi965_dri_la_LIBADD = \
> $(top_builddir)/src/intel/compiler/libintel_compiler.la \
> $(top_builddir)/src/intel/blorp/libblorp.la \
> $(I965_PERGEN_LIBS) \
> +   libintel_tiled_memcpy.la \
> $(LIBDRM_LIBS)

Makes sense.

> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c 
> b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> index 7c6bde990d6..d076351b322 100644
> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> @@ -36,6 +36,10 @@
>  #include "brw_context.h"
>  #include "intel_tiled_memcpy.h"
>  
> +#if defined(USE_SSE41)
> +#include "main/streaming-load-memcpy.h"
> +#include 
> +#endif
>  #if defined(__SSSE3__)
>  #include 
>  #elif defined(__SSE2__)
> @@ -213,6 +217,30 @@ rgba8_copy_aligned_src(void *dst, const void *src, 
> size_t bytes)
> return dst;
>  }
>  
> +#if defined(USE_SSE41)
> +static ALWAYS_INLINE void*

Space in that void*? (but don't quote me on mesa/i965 preferred style!)
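
That is, just (purely illustrative, assuming a space before the '*' really is
the preferred style here):

static ALWAYS_INLINE void *
_memcpy_streaming_load(void *dest, const void *src, size_t count)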

> +_memcpy_streaming_load(void *dest, const void *src, size_t count)
> +{
> +   if (count == 16) {
> +  __m128i val = _mm_stream_load_si128((__m128i *)src);
> +  _mm_store_si128((__m128i *)dest, val);
> +  return dest;
> +   } else if (count == 64) {
> +  __m128i val0 = _mm_stream_load_si128(((__m128i *)src) + 0);
> +  __m128i val1 = _mm_stream_load_si128(((__m128i *)src) + 1);
> +  __m128i val2 = _mm_stream_load_si128(((__m128i *)src) + 2);
> +  __m128i val3 = _mm_stream_load_si128(((__m128i *)src) + 3);
> +  _mm_store_si128(((__m128i *)dest) + 0, val0);
> +  _mm_store_si128(((__m128i *)dest) + 1, val1);
> +  _mm_store_si128(((__m128i *)dest) + 2, val2);
> +  _mm_store_si128(((__m128i *)dest) + 3, val3);
> +  return dest;
> +   } else {

assert(count < 16); or assert(count < 64) ?

Might as well remind the reader (and caller?!) that this is only for
copying the residuals.
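
Something like this is all I mean (just a sketch; whether 16 or 64 is the
right bound depends on how the callers split the rows):

   } else {
      /* Only residual (smaller-than-chunk) copies should land here; the
       * aligned 16/64 byte cases are handled above. */
      assert(count < 16);
      return memcpy(dest, src, count);
   }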

> +  return memcpy(dest, src, count);
> +   }
> +}
> +#endif
> +
>  /**
>   * Each row from y0 to y1 is copied in three parts: [x0,x1), [x1,x2), 
> [x2,x3).
>   * These ranges are in bytes, i.e. pixels * bytes-per-pixel.
> @@ -677,6 +705,12 @@ xtiled_to_linear_faster(uint32_t x0, uint32_t x1, 
> uint32_t x2, uint32_t x3,
>   return xtiled_to_linear(0, 0, xtile_width, xtile_width, 0, 
> xtile_height,
>   dst, src, dst_pitch, swizzle_bit,
>   rgba8_copy, rgba8_copy_aligned_src);
> +#if defined(USE_SSE41)
> +  else if (mem_copy == (mem_copy_fn)_mesa_streaming_load_memcpy)
> + return xtiled_to_linear(0, 0, xtile_width, xtile_width, 0, 
> xtile_height,
> + dst, src, dst_pitch, swizzle_bit, memcpy,
> + _memcpy_streaming_load);

Please group memcpy and _memcpy_streaming_load (put the line break before
them to keep them on the same line).
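
i.e. something along these lines (same call as above, just regrouped so both
copy functions sit together on one line):

 return xtiled_to_linear(0, 0, xtile_width, xtile_width, 0, xtile_height,
 dst, src, dst_pitch, swizzle_bit,
 memcpy, _memcpy_streaming_load);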

> +#endif
>else
>   unreachable("not reached");
> } else {
> @@ -687,6 +721,12 @@ xtiled_to_linear_faster(uint32_t x0, uint32_t x1, 
> uint32_t x2, uint32_t x3,
>   return xtiled_to_linear(x0, x1, x2, x3, y0, y1,
>   dst, src, dst_pitch, swizzle_bit,
>   rgba8_copy, rgba8_copy_aligned_src);
> +#if defined(USE_SSE41)
> +  else if (mem_copy == 

[Mesa-dev] [PATCH 2/5] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

2018-04-03 Thread Scott D Phillips
The reference for MOVNTDQA says:

For WC memory type, the nontemporal hint may be implemented by
loading a temporary internal buffer with the equivalent of an
aligned cache line without filling this data to the cache.
[...] Subsequent MOVNTDQA reads to unread portions of the WC
cache line will receive data from the temporary internal
buffer if data is available.

This hidden cache line sized temporary buffer can improve the
read performance from wc maps.
---
 src/mesa/drivers/dri/i965/Makefile.am  |  7 
 src/mesa/drivers/dri/i965/Makefile.sources |  6 ++-
 src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 52 ++
 src/mesa/drivers/dri/i965/meson.build  | 18 +++--
 4 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/Makefile.am 
b/src/mesa/drivers/dri/i965/Makefile.am
index 889d4c68a2b..ff47add93f4 100644
--- a/src/mesa/drivers/dri/i965/Makefile.am
+++ b/src/mesa/drivers/dri/i965/Makefile.am
@@ -92,8 +92,14 @@ libi965_gen11_la_CFLAGS = $(AM_CFLAGS) -DGEN_VERSIONx10=110
 
 noinst_LTLIBRARIES = \
libi965_dri.la \
+   libintel_tiled_memcpy.la \
$(I965_PERGEN_LIBS)
 
+libintel_tiled_memcpy_la_SOURCES = \
+   $(intel_tiled_memcpy_FILES)
+libintel_tiled_memcpy_la_CFLAGS = \
+   $(AM_CFLAGS) $(SSE41_CFLAGS)
+
 libi965_dri_la_SOURCES = \
$(i965_FILES) \
$(i965_oa_GENERATED_FILES)
@@ -104,6 +110,7 @@ libi965_dri_la_LIBADD = \
$(top_builddir)/src/intel/compiler/libintel_compiler.la \
$(top_builddir)/src/intel/blorp/libblorp.la \
$(I965_PERGEN_LIBS) \
+   libintel_tiled_memcpy.la \
$(LIBDRM_LIBS)
 
 BUILT_SOURCES = $(i965_oa_GENERATED_FILES)
diff --git a/src/mesa/drivers/dri/i965/Makefile.sources 
b/src/mesa/drivers/dri/i965/Makefile.sources
index 3479ceb9d16..ab7db3be7ed 100644
--- a/src/mesa/drivers/dri/i965/Makefile.sources
+++ b/src/mesa/drivers/dri/i965/Makefile.sources
@@ -110,11 +110,13 @@ i965_FILES = \
intel_tex_image.c \
intel_tex_obj.h \
intel_tex_validate.c \
-   intel_tiled_memcpy.c \
-   intel_tiled_memcpy.h \
intel_upload.c \
libdrm_macros.h
 
+intel_tiled_memcpy_FILES = \
+   intel_tiled_memcpy.c \
+   intel_tiled_memcpy.h
+
 i965_gen4_FILES = \
genX_blorp_exec.c \
genX_state_upload.c
diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c 
b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
index 7c6bde990d6..d076351b322 100644
--- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
+++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
@@ -36,6 +36,10 @@
 #include "brw_context.h"
 #include "intel_tiled_memcpy.h"
 
+#if defined(USE_SSE41)
+#include "main/streaming-load-memcpy.h"
+#include 
+#endif
 #if defined(__SSSE3__)
 #include 
 #elif defined(__SSE2__)
@@ -213,6 +217,30 @@ rgba8_copy_aligned_src(void *dst, const void *src, size_t 
bytes)
return dst;
 }
 
+#if defined(USE_SSE41)
+static ALWAYS_INLINE void*
+_memcpy_streaming_load(void *dest, const void *src, size_t count)
+{
+   if (count == 16) {
+  __m128i val = _mm_stream_load_si128((__m128i *)src);
+  _mm_store_si128((__m128i *)dest, val);
+  return dest;
+   } else if (count == 64) {
+  __m128i val0 = _mm_stream_load_si128(((__m128i *)src) + 0);
+  __m128i val1 = _mm_stream_load_si128(((__m128i *)src) + 1);
+  __m128i val2 = _mm_stream_load_si128(((__m128i *)src) + 2);
+  __m128i val3 = _mm_stream_load_si128(((__m128i *)src) + 3);
+  _mm_store_si128(((__m128i *)dest) + 0, val0);
+  _mm_store_si128(((__m128i *)dest) + 1, val1);
+  _mm_store_si128(((__m128i *)dest) + 2, val2);
+  _mm_store_si128(((__m128i *)dest) + 3, val3);
+  return dest;
+   } else {
+  return memcpy(dest, src, count);
+   }
+}
+#endif
+
 /**
  * Each row from y0 to y1 is copied in three parts: [x0,x1), [x1,x2), [x2,x3).
  * These ranges are in bytes, i.e. pixels * bytes-per-pixel.
@@ -677,6 +705,12 @@ xtiled_to_linear_faster(uint32_t x0, uint32_t x1, uint32_t 
x2, uint32_t x3,
  return xtiled_to_linear(0, 0, xtile_width, xtile_width, 0, 
xtile_height,
  dst, src, dst_pitch, swizzle_bit,
  rgba8_copy, rgba8_copy_aligned_src);
+#if defined(USE_SSE41)
+  else if (mem_copy == (mem_copy_fn)_mesa_streaming_load_memcpy)
+ return xtiled_to_linear(0, 0, xtile_width, xtile_width, 0, 
xtile_height,
+ dst, src, dst_pitch, swizzle_bit, memcpy,
+ _memcpy_streaming_load);
+#endif
   else
  unreachable("not reached");
} else {
@@ -687,6 +721,12 @@ xtiled_to_linear_faster(uint32_t x0, uint32_t x1, uint32_t 
x2, uint32_t x3,
  return xtiled_to_linear(x0, x1, x2, x3, y0, y1,
  dst, src, dst_pitch, swizzle_bit,
  rgba8_copy, rgba8_copy_aligned_src);
+#if