Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-07 Thread Timothy Arceri
On Thu, 2014-11-06 at 19:30 -0500, Frank Henigman wrote:
> I tested your patch with the teximage program in mesa demos, the
> same thing I used to benchmark when I developed this code.
> As Matt and Chad point out, the odd-looking _faster functions are
> there for a reason.  Your change causes a huge slowdown.

Yes, I should have known better than to assume it was leftover code. I
didn't know that gcc could inline memcpy like that, very nice. In fact I
was reading a blog just last week that said msvc was better than
gcc for memcpy because gcc was reliant on a library implementation. A
good reminder not to believe everything you read on the internet.

Anyway, I've had another go at it and the performance regression should
be fixed. In my testing I couldn't spot any real difference. The main
downside is that the ssse3 code can't be inlined, so there will be a small
trade-off compared to the current way of building with ssse3 enabled.

Also, thanks for pointing out teximage; I didn't know the mesa demos
contained perf tools.
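
For readers following the thread, here is a minimal sketch of the kind of runtime check being discussed. It is not the code from the patch. It assumes GCC or Clang (for `__builtin_cpu_supports` and the `target` function attribute), and the names `swap_rb_scalar`, `swap_rb_ssse3`, `pick_swap_rb` and `check_paths_agree` are hypothetical. Unlike the removed `rgba8_copy_16` macro, it uses unaligned loads and stores, so it makes no alignment assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <tmmintrin.h>
#define HAVE_SSSE3_PATH 1
#else
#define HAVE_SSSE3_PATH 0
#endif

typedef void *(*copy_fn)(void *dst, const void *src, size_t bytes);

/* Portable fallback: swap R and B in each 4-byte pixel. */
static void *
swap_rb_scalar(void *dst, const void *src, size_t bytes)
{
   uint8_t *d = dst;
   const uint8_t *s = src;

   while (bytes >= 4) {
      d[0] = s[2];
      d[1] = s[1];
      d[2] = s[0];
      d[3] = s[3];
      d += 4;
      s += 4;
      bytes -= 4;
   }
   return dst;
}

#if HAVE_SSSE3_PATH
/* Only this function is compiled with SSSE3 enabled. */
static __attribute__((target("ssse3"))) void *
swap_rb_ssse3(void *dst, const void *src, size_t bytes)
{
   /* Same byte permutation as rgba8_permutation in the patch. */
   const __m128i perm = _mm_setr_epi8(2, 1, 0, 3, 6, 5, 4, 7,
                                      10, 9, 8, 11, 14, 13, 12, 15);
   uint8_t *d = dst;
   const uint8_t *s = src;

   while (bytes >= 16) {
      __m128i px = _mm_loadu_si128((const __m128i *)s);
      _mm_storeu_si128((__m128i *)d, _mm_shuffle_epi8(px, perm));
      d += 16;
      s += 16;
      bytes -= 16;
   }
   swap_rb_scalar(d, s, bytes);   /* leftover pixels */
   return dst;
}
#endif

/* Decide once at run time which implementation to call. */
static copy_fn
pick_swap_rb(void)
{
#if HAVE_SSSE3_PATH
   if (__builtin_cpu_supports("ssse3"))
      return swap_rb_ssse3;
#endif
   return swap_rb_scalar;
}

/* Both paths must produce identical bytes. */
static void
check_paths_agree(void)
{
   uint8_t src[20], out[20], ref[20];

   for (int i = 0; i < 20; i++)
      src[i] = (uint8_t)i;
   swap_rb_scalar(ref, src, sizeof src);
   pick_swap_rb()(out, src, sizeof src);
   assert(memcmp(out, ref, sizeof out) == 0);
}
```

Because the chosen function is reached through a pointer, the compiler generally cannot inline it into the tile-copy loops, which is the trade-off mentioned above.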



___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


[Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Timothy Arceri
Also cleans up some if statements in the *faster functions.

Callgrind cpu usage results from pts benchmarks:

For ytile_copy_faster()

Nexuiz 1.6.1: 2.16% -> 1.20%

Signed-off-by: Timothy Arceri t_arc...@yahoo.com.au
---
 src/mesa/Makefile.am                           |  8 +++
 src/mesa/drivers/dri/i965/intel_tex_subimage.c | 82 ++
 src/mesa/main/fast_rgba8_copy.c                | 78 
 src/mesa/main/fast_rgba8_copy.h                | 37 
 4 files changed, 141 insertions(+), 64 deletions(-)
 create mode 100644 src/mesa/main/fast_rgba8_copy.c
 create mode 100644 src/mesa/main/fast_rgba8_copy.h

diff --git a/src/mesa/Makefile.am b/src/mesa/Makefile.am
index e71bccb..2402096 100644
--- a/src/mesa/Makefile.am
+++ b/src/mesa/Makefile.am
@@ -107,6 +107,10 @@ AM_CXXFLAGS = $(LLVM_CFLAGS) $(VISIBILITY_CXXFLAGS)
 
 ARCH_LIBS =
 
+if SSSE3_SUPPORTED
+ARCH_LIBS += libmesa_ssse3.la
+endif
+
 if SSE41_SUPPORTED
 ARCH_LIBS += libmesa_sse41.la
 endif
@@ -154,6 +158,10 @@ libmesa_sse41_la_SOURCES = \
main/streaming-load-memcpy.c
 libmesa_sse41_la_CFLAGS = $(AM_CFLAGS) -msse4.1
 
+libmesa_ssse3_la_SOURCES = \
+   main/fast_rgba8_copy.c
+libmesa_ssse3_la_CFLAGS = $(AM_CFLAGS) -mssse3
+
 pkgconfigdir = $(libdir)/pkgconfig
 pkgconfig_DATA = gl.pc
 
diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
index cb5738a..0deeb75 100644
--- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
+++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
@@ -27,6 +27,7 @@
  **/
 
 #include "main/bufferobj.h"
+#include "main/fast_rgba8_copy.h"
 #include "main/image.h"
 #include "main/macros.h"
 #include "main/mtypes.h"
@@ -42,9 +43,7 @@
 #include "intel_mipmap_tree.h"
 #include "intel_blit.h"
 
-#ifdef __SSSE3__
-#include <tmmintrin.h>
-#endif
+#include "x86/common_x86_asm.h"
 
 #define FILE_DEBUG_FLAG DEBUG_TEXTURE
 
@@ -175,18 +174,6 @@ err:
    return false;
 }
 
-#ifdef __SSSE3__
-static const uint8_t rgba8_permutation[16] =
-   { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
-
-/* NOTE: dst must be 16 byte aligned */
-#define rgba8_copy_16(dst, src)                     \
-   *(__m128i *)(dst) = _mm_shuffle_epi8(            \
-      (__m128i) _mm_loadu_ps((float *)(src)),       \
-      *(__m128i *) rgba8_permutation                \
-   )
-#endif
-
 /**
  * Copy RGBA to BGRA - swap R and B.
  */
@@ -196,29 +183,6 @@ rgba8_copy(void *dst, const void *src, size_t bytes)
    uint8_t *d = dst;
    uint8_t const *s = src;
 
-#ifdef __SSSE3__
-   /* Fast copying for tile spans.
-    *
-    * As long as the destination texture is 16 aligned,
-    * any 16 or 64 spans we get here should also be 16 aligned.
-    */
-
-   if (bytes == 16) {
-      assert(!(((uintptr_t)dst) & 0xf));
-      rgba8_copy_16(d+ 0, s+ 0);
-      return dst;
-   }
-
-   if (bytes == 64) {
-      assert(!(((uintptr_t)dst) & 0xf));
-      rgba8_copy_16(d+ 0, s+ 0);
-      rgba8_copy_16(d+16, s+16);
-      rgba8_copy_16(d+32, s+32);
-      rgba8_copy_16(d+48, s+48);
-      return dst;
-   }
-#endif
-
    while (bytes >= 4) {
       d[0] = s[2];
       d[1] = s[1];
@@ -352,19 +316,8 @@ xtile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                   mem_copy_fn mem_copy)
 {
    if (x0 == 0 && x3 == xtile_width && y0 == 0 && y1 == xtile_height) {
-      if (mem_copy == memcpy)
-         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
-   } else {
-      if (mem_copy == memcpy)
-         return xtile_copy(x0, x1, x2, x3, y0, y1,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return xtile_copy(x0, x1, x2, x3, y0, y1,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
+      return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
+                        dst, src, src_pitch, swizzle_bit, mem_copy);
    }
    xtile_copy(x0, x1, x2, x3, y0, y1,
               dst, src, src_pitch, swizzle_bit, mem_copy);
@@ -388,19 +341,8 @@ ytile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                   mem_copy_fn mem_copy)
 {
    if (x0 == 0 && x3 == ytile_width && y0 == 0 && y1 == ytile_height) {
-      if (mem_copy == memcpy)
-         return ytile_copy(0, 0, ytile_width, ytile_width, 0, ytile_height,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return ytile_copy(0, 0, ytile_width, ytile_width, 0, ytile_height,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
-   } else {
-      if (mem_copy == memcpy)

Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Matt Turner
On Thu, Nov 6, 2014 at 4:20 AM, Timothy Arceri t_arc...@yahoo.com.au wrote:

Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Timothy Arceri
On Thu, 2014-11-06 at 10:03 -0800, Matt Turner wrote:

Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Matt Turner
On Thu, Nov 6, 2014 at 1:22 PM, Timothy Arceri t_arc...@yahoo.com.au wrote:

Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Ian Romanick
On 11/06/2014 02:12 PM, Matt Turner wrote:
> On Thu, Nov 6, 2014 at 1:22 PM, Timothy Arceri t_arc...@yahoo.com.au wrote:
>> On Thu, 2014-11-06 at 10:03 -0800, Matt Turner wrote:
>>> On Thu, Nov 6, 2014 at 4:20 AM, Timothy Arceri t_arc...@yahoo.com.au wrote:
>>>> +#include <assert.h>
>>>> +#include <stdint.h>
>>>> +#include <stddef.h>
>>>
>>> I don't think you need these three includes for a single prototype.
>>
>> Right I can move assert to the .c
>
> Presumably one of the others can be removed as well? I don't know what
> defines size_t.

stddef.h since C89 at least.

>>>> +
>>>> +/* Fast copying for tile spans.
>>>> + *
>>>> + * As long as the destination texture is 16 aligned,
>>>> + * any 16 or 64 spans we get here should also be 16 aligned.
>>>> + */
>>>> +void *
>>>> +_mesa_fast_rgba8_copy(void *dst, const void *src, size_t n);
>>>> --
>>>> 1.9.3



Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Chad Versace

On Thu 06 Nov 2014, Timothy Arceri wrote:
> Also cleans up some if statements in the *faster functions.

I have comments about the cleanup below.

> diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
> index cb5738a..0deeb75 100644
> --- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
> +++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c

/**
 * Copy texture data from linear to X tile layout, faster.
 *
 * Same as \ref xtile_copy but faster, because it passes constant parameters
 * for common cases, allowing the compiler to inline code optimized for those
 * cases.
 *
 * \copydoc tile_copy_fn
 */
static FLATTEN void
xtile_copy_faster(...)

> @@ -352,19 +316,8 @@ xtile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
>                    mem_copy_fn mem_copy)
>  {
>     if (x0 == 0 && x3 == xtile_width && y0 == 0 && y1 == xtile_height) {
> -      if (mem_copy == memcpy)
> -         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
> -                           dst, src, src_pitch, swizzle_bit, memcpy);
> -      else if (mem_copy == rgba8_copy)
> -         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
> -                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
> -   } else {
> -      if (mem_copy == memcpy)
> -         return xtile_copy(x0, x1, x2, x3, y0, y1,
> -                           dst, src, src_pitch, swizzle_bit, memcpy);
> -      else if (mem_copy == rgba8_copy)
> -         return xtile_copy(x0, x1, x2, x3, y0, y1,
> -                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
> +      return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
> +                        dst, src, src_pitch, swizzle_bit, mem_copy);
>     }
>     xtile_copy(x0, x1, x2, x3, y0, y1,
>                dst, src, src_pitch, swizzle_bit, mem_copy);

The cleanup of this if tree concerns me. According to the function
comment, the original author of this function, fjhenigman, clearly
created the weird 'if' tree with the intention that the compiler would
inline code optimized for those cases.

Without one of the following, I object to this cleanup:
   - Frank's approval, or
   - Proof that gcc never does the desired optimizations, or
   - Proof that this change does not harm Chrome's texture upload
     performance.
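
To illustrate the idiom the function comment describes, here is a minimal sketch, not the Mesa code. The names `copy_rows`, `copy_rows_faster`, `swap_rb_copy`, `FULL_ROWS` and `FULL_ROW_BYTES` are hypothetical stand-ins for the tile copy, its `_faster` wrapper, `rgba8_copy` and the tile dimensions. Each branch repeats the same call with literal arguments, so an optimizing compiler that inlines the worker has the chance to emit a specialized copy for that case. Whether it actually does depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*mem_copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical stand-in for rgba8_copy: swap R and B in each pixel. */
static void *
swap_rb_copy(void *dst, const void *src, size_t bytes)
{
   uint8_t *d = dst;
   const uint8_t *s = src;

   while (bytes >= 4) {
      d[0] = s[2];
      d[1] = s[1];
      d[2] = s[0];
      d[3] = s[3];
      d += 4;
      s += 4;
      bytes -= 4;
   }
   return dst;
}

/* Generic worker: every parameter, including the copy routine, is a
 * run-time value here. */
static inline void
copy_rows(char *dst, const char *src, size_t rows, size_t row_bytes,
          size_t src_pitch, mem_copy_fn mem_copy)
{
   for (size_t y = 0; y < rows; y++)
      mem_copy(dst + y * row_bytes, src + y * src_pitch, row_bytes);
}

enum { FULL_ROWS = 8, FULL_ROW_BYTES = 64 };   /* hypothetical full tile */

/* Wrapper: looks redundant, but each call site below passes constants. */
static void
copy_rows_faster(char *dst, const char *src, size_t rows, size_t row_bytes,
                 size_t src_pitch, mem_copy_fn mem_copy)
{
   if (rows == FULL_ROWS && row_bytes == FULL_ROW_BYTES) {
      if (mem_copy == memcpy) {
         copy_rows(dst, src, FULL_ROWS, FULL_ROW_BYTES, src_pitch, memcpy);
         return;
      }
      if (mem_copy == swap_rb_copy) {
         copy_rows(dst, src, FULL_ROWS, FULL_ROW_BYTES, src_pitch,
                   swap_rb_copy);
         return;
      }
   }
   /* Uncommon case: fall back to the fully generic call. */
   copy_rows(dst, src, rows, row_bytes, src_pitch, mem_copy);
}

/* The wrapper must behave exactly like the generic worker. */
static void
check_wrapper_matches_worker(void)
{
   char src[FULL_ROWS * FULL_ROW_BYTES], a[sizeof src], b[sizeof src];

   for (size_t i = 0; i < sizeof src; i++)
      src[i] = (char)i;
   copy_rows_faster(a, src, FULL_ROWS, FULL_ROW_BYTES, FULL_ROW_BYTES,
                    swap_rb_copy);
   copy_rows(b, src, FULL_ROWS, FULL_ROW_BYTES, FULL_ROW_BYTES,
             swap_rb_copy);
   assert(memcmp(a, b, sizeof a) == 0);
}
```

Collapsing the branches into a single call that forwards `mem_copy` keeps the behaviour identical but removes the constant at the call site, which is the concern raised above.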



Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Frank Henigman
I tested your patch with the teximage program in mesa demos, the
same thing I used to benchmark when I developed this code.
As Matt and Chad point out, the odd-looking _faster functions are
there for a reason.  Your change causes a huge slowdown.
I tested on a Sandy Bridge system with an Intel(R) Celeron(R) CPU 857 @
1.20GHz.  Mesa compiled with -O2.

original code:
  TexSubImage(RGBA/ubyte 256 x 256): 9660.4 images/sec, 2415.1 MB/sec
  TexSubImage(RGBA/ubyte 1024 x 1024): 821.2 images/sec, 3284.7 MB/sec
  TexSubImage(RGBA/ubyte 4096 x 4096): 76.3 images/sec, 4884.9 MB/sec

  TexSubImage(BGRA/ubyte 256 x 256): 11307.1 images/sec, 2826.8 MB/sec
  TexSubImage(BGRA/ubyte 1024 x 1024): 944.6 images/sec, 3778.6 MB/sec
  TexSubImage(BGRA/ubyte 4096 x 4096): 76.7 images/sec, 4908.3 MB/sec

  TexSubImage(L/ubyte 256 x 256): 17847.5 images/sec, 1115.5 MB/sec
  TexSubImage(L/ubyte 1024 x 1024): 3068.2 images/sec, 3068.2 MB/sec
  TexSubImage(L/ubyte 4096 x 4096): 224.6 images/sec, 3593.0 MB/sec

your code:
  TexSubImage(RGBA/ubyte 256 x 256): 3271.6 images/sec, 817.9 MB/sec
  TexSubImage(RGBA/ubyte 1024 x 1024): 232.3 images/sec, 929.2 MB/sec
  TexSubImage(RGBA/ubyte 4096 x 4096): 47.5 images/sec, 3038.6 MB/sec

  TexSubImage(BGRA/ubyte 256 x 256): 2426.5 images/sec, 606.6 MB/sec
  TexSubImage(BGRA/ubyte 1024 x 1024): 164.1 images/sec, 656.4 MB/sec
  TexSubImage(BGRA/ubyte 4096 x 4096): 13.4 images/sec, 854.8 MB/sec

  TexSubImage(L/ubyte 256 x 256): 9514.5 images/sec, 594.7 MB/sec
  TexSubImage(L/ubyte 1024 x 1024): 864.1 images/sec, 864.1 MB/sec
  TexSubImage(L/ubyte 4096 x 4096): 59.7 images/sec, 955.2 MB/sec

This is just one run, not an average, but you can see it's slower
across the board, by up to a factor of around 6.
Also I couldn't configure the build after your patch.  I think you
left out a change to configure.ac to define SSSE3_SUPPORTED.
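
As a cross-check of the figures above, the sketch below recomputes the MB/sec column and the slowdown ratios from the posted images/sec values. It assumes 1 MB = 2^20 bytes, 4 bytes per pixel for RGBA/BGRA and 1 for L; the helper names `mb_per_sec`, `slowdown` and `check_posted_rows` are hypothetical.

```c
#include <assert.h>

/* MB/sec implied by an upload rate, taking 1 MB = 2^20 bytes. */
static double
mb_per_sec(double images_per_sec, unsigned width, unsigned height,
           unsigned bytes_per_pixel)
{
   double bytes_per_image = (double)width * height * bytes_per_pixel;
   return images_per_sec * bytes_per_image / (1024.0 * 1024.0);
}

/* How many times slower the patched build is for one row. */
static double
slowdown(double original_rate, double patched_rate)
{
   return original_rate / patched_rate;
}

/* Compare against rows taken from the posted results. */
static void
check_posted_rows(void)
{
   /* RGBA 256 x 256, original code: 9660.4 images/sec, 2415.1 MB/sec */
   double rgba = mb_per_sec(9660.4, 256, 256, 4);
   assert(rgba > 2415.0 && rgba < 2415.2);

   /* L 256 x 256, original code: 17847.5 images/sec, 1115.5 MB/sec */
   double lum = mb_per_sec(17847.5, 256, 256, 1);
   assert(lum > 1115.4 && lum < 1115.6);

   /* BGRA 1024 x 1024: 944.6 vs 164.1 images/sec */
   double worst = slowdown(944.6, 164.1);
   assert(worst > 5.7 && worst < 5.8);

   /* RGBA 4096 x 4096: 76.3 vs 47.5 images/sec */
   double mildest = slowdown(76.3, 47.5);
   assert(mildest > 1.6 && mildest < 1.7);
}
```

On these numbers the largest ratio is the BGRA 1024 x 1024 row at a little under 5.8, and the smallest is RGBA 4096 x 4096 at roughly 1.6.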



Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy

2014-11-06 Thread Frank Henigman
On Thu, Nov 6, 2014 at 7:30 PM, Frank Henigman fjhenig...@google.com wrote:

> Also I couldn't configure the build after your patch.  I think you
> left out a change to configure.ac to define SSSE3_SUPPORTED.

Ah, that was in patch 1/2.