Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
On Thu, 2014-11-06 at 19:30 -0500, Frank Henigman wrote:

I tested your patch with the teximage program in mesa demos, the same thing I used to benchmark when I developed this code. As Matt and Chad point out, the odd-looking _faster functions are there for a reason. Your change causes a huge slowdown.

Yes, I should have known better than to assume it was leftover code. I didn't know that gcc could inline memcpy like that, very nice. In fact I was reading a blog just last week that was saying msvc was better than gcc for memcpy because gcc was reliant on a library implementation. A good reminder not to believe everything you read on the internet.

Anyway, I've had another go at it and the performance regression should be fixed. In my testing I couldn't spot any real difference. The main downside is that the ssse3 code can't be inlined, so there will be a small trade-off compared to the current way of building with ssse3 enabled.

Also, thanks for pointing out teximage; I didn't know the mesa demos contained perf tools.

I tested on a sandybridge system with an Intel(R) Celeron(R) CPU 857 @ 1.20GHz. Mesa compiled with -O2.

original code:
TexSubImage(RGBA/ubyte 256 x 256): 9660.4 images/sec, 2415.1 MB/sec
TexSubImage(RGBA/ubyte 1024 x 1024): 821.2 images/sec, 3284.7 MB/sec
TexSubImage(RGBA/ubyte 4096 x 4096): 76.3 images/sec, 4884.9 MB/sec
TexSubImage(BGRA/ubyte 256 x 256): 11307.1 images/sec, 2826.8 MB/sec
TexSubImage(BGRA/ubyte 1024 x 1024): 944.6 images/sec, 3778.6 MB/sec
TexSubImage(BGRA/ubyte 4096 x 4096): 76.7 images/sec, 4908.3 MB/sec
TexSubImage(L/ubyte 256 x 256): 17847.5 images/sec, 1115.5 MB/sec
TexSubImage(L/ubyte 1024 x 1024): 3068.2 images/sec, 3068.2 MB/sec
TexSubImage(L/ubyte 4096 x 4096): 224.6 images/sec, 3593.0 MB/sec

your code:
TexSubImage(RGBA/ubyte 256 x 256): 3271.6 images/sec, 817.9 MB/sec
TexSubImage(RGBA/ubyte 1024 x 1024): 232.3 images/sec, 929.2 MB/sec
TexSubImage(RGBA/ubyte 4096 x 4096): 47.5 images/sec, 3038.6 MB/sec
TexSubImage(BGRA/ubyte 256 x 256): 2426.5 images/sec, 606.6 MB/sec
TexSubImage(BGRA/ubyte 1024 x 1024): 164.1 images/sec, 656.4 MB/sec
TexSubImage(BGRA/ubyte 4096 x 4096): 13.4 images/sec, 854.8 MB/sec
TexSubImage(L/ubyte 256 x 256): 9514.5 images/sec, 594.7 MB/sec
TexSubImage(L/ubyte 1024 x 1024): 864.1 images/sec, 864.1 MB/sec
TexSubImage(L/ubyte 4096 x 4096): 59.7 images/sec, 955.2 MB/sec

This is just one run, not an average, but you can see it's slower across the board, by up to a factor of around 6.

Also I couldn't configure the build after your patch. I think you left out a change to configure.ac to define SSSE3_SUPPORTED.

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
[Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
Also cleans up some if statements in the *faster functions.

Callgrind cpu usage results from pts benchmarks:

For ytile_copy_faster()

Nexuiz 1.6.1: 2.16% -> 1.20%

Signed-off-by: Timothy Arceri t_arc...@yahoo.com.au
---
 src/mesa/Makefile.am                           |  8 +++
 src/mesa/drivers/dri/i965/intel_tex_subimage.c | 82 ++
 src/mesa/main/fast_rgba8_copy.c                | 78
 src/mesa/main/fast_rgba8_copy.h                | 37
 4 files changed, 141 insertions(+), 64 deletions(-)
 create mode 100644 src/mesa/main/fast_rgba8_copy.c
 create mode 100644 src/mesa/main/fast_rgba8_copy.h

diff --git a/src/mesa/Makefile.am b/src/mesa/Makefile.am
index e71bccb..2402096 100644
--- a/src/mesa/Makefile.am
+++ b/src/mesa/Makefile.am
@@ -107,6 +107,10 @@ AM_CXXFLAGS = $(LLVM_CFLAGS) $(VISIBILITY_CXXFLAGS)
 ARCH_LIBS =
+if SSSE3_SUPPORTED
+ARCH_LIBS += libmesa_ssse3.la
+endif
+
 if SSE41_SUPPORTED
 ARCH_LIBS += libmesa_sse41.la
 endif
@@ -154,6 +158,10 @@ libmesa_sse41_la_SOURCES = \
     main/streaming-load-memcpy.c
 libmesa_sse41_la_CFLAGS = $(AM_CFLAGS) -msse4.1
+libmesa_ssse3_la_SOURCES = \
+    main/fast_rgba8_copy.c
+libmesa_ssse3_la_CFLAGS = $(AM_CFLAGS) -mssse3
+
 pkgconfigdir = $(libdir)/pkgconfig
 pkgconfig_DATA = gl.pc
diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
index cb5738a..0deeb75 100644
--- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
+++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
@@ -27,6 +27,7 @@
  **/
 #include "main/bufferobj.h"
+#include "main/fast_rgba8_copy.h"
 #include "main/image.h"
 #include "main/macros.h"
 #include "main/mtypes.h"
@@ -42,9 +43,7 @@
 #include "intel_mipmap_tree.h"
 #include "intel_blit.h"
-#ifdef __SSSE3__
-#include <tmmintrin.h>
-#endif
+#include "x86/common_x86_asm.h"
 #define FILE_DEBUG_FLAG DEBUG_TEXTURE
@@ -175,18 +174,6 @@ err:
    return false;
 }
-#ifdef __SSSE3__
-static const uint8_t rgba8_permutation[16] =
-   { 2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15 };
-
-/* NOTE: dst must be 16 byte aligned */
-#define rgba8_copy_16(dst, src) \
-   *(__m128i *)(dst) = _mm_shuffle_epi8(\
-      (__m128i) _mm_loadu_ps((float *)(src)), \
-      *(__m128i *) rgba8_permutation\
-   )
-#endif
-
 /**
  * Copy RGBA to BGRA - swap R and B.
  */
@@ -196,29 +183,6 @@ rgba8_copy(void *dst, const void *src, size_t bytes)
    uint8_t *d = dst;
    uint8_t const *s = src;
-#ifdef __SSSE3__
-   /* Fast copying for tile spans.
-    *
-    * As long as the destination texture is 16 aligned,
-    * any 16 or 64 spans we get here should also be 16 aligned.
-    */
-
-   if (bytes == 16) {
-      assert(!(((uintptr_t)dst) & 0xf));
-      rgba8_copy_16(d+ 0, s+ 0);
-      return dst;
-   }
-
-   if (bytes == 64) {
-      assert(!(((uintptr_t)dst) & 0xf));
-      rgba8_copy_16(d+ 0, s+ 0);
-      rgba8_copy_16(d+16, s+16);
-      rgba8_copy_16(d+32, s+32);
-      rgba8_copy_16(d+48, s+48);
-      return dst;
-   }
-#endif
-
    while (bytes >= 4) {
       d[0] = s[2];
       d[1] = s[1];
@@ -352,19 +316,8 @@ xtile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                   mem_copy_fn mem_copy)
 {
    if (x0 == 0 && x3 == xtile_width && y0 == 0 && y1 == xtile_height) {
-      if (mem_copy == memcpy)
-         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
-   } else {
-      if (mem_copy == memcpy)
-         return xtile_copy(x0, x1, x2, x3, y0, y1,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return xtile_copy(x0, x1, x2, x3, y0, y1,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
+      return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
+                        dst, src, src_pitch, swizzle_bit, mem_copy);
    }
    xtile_copy(x0, x1, x2, x3, y0, y1,
               dst, src, src_pitch, swizzle_bit, mem_copy);
@@ -388,19 +341,8 @@ ytile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                   mem_copy_fn mem_copy)
 {
    if (x0 == 0 && x3 == ytile_width && y0 == 0 && y1 == ytile_height) {
-      if (mem_copy == memcpy)
-         return ytile_copy(0, 0, ytile_width, ytile_width, 0, ytile_height,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return ytile_copy(0, 0, ytile_width, ytile_width, 0, ytile_height,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
-   } else {
-      if (mem_copy == memcpy)
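For readers unfamiliar with the idea behind the patch title, here is a minimal sketch of a runtime SSSE3 check, written independently of the patch. It is not the code in fast_rgba8_copy.c (which is not shown in this thread); the names rgba8_copy_scalar, rgba8_copy_ssse3 and pick_rgba8_copy are hypothetical, and the GCC/Clang builtin __builtin_cpu_supports() stands in for whatever CPU detection Mesa actually uses. The per-function target attribute is likewise an assumption; the patch instead builds a separate file with -mssse3.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <tmmintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

typedef void *(*mem_copy_fn)(void *dst, const void *src, size_t bytes);

/* Portable path: swap R and B in every 4-byte pixel. */
static void *
rgba8_copy_scalar(void *dst, const void *src, size_t bytes)
{
   uint8_t *d = dst;
   const uint8_t *s = src;

   while (bytes >= 4) {
      d[0] = s[2];
      d[1] = s[1];
      d[2] = s[0];
      d[3] = s[3];
      d += 4;
      s += 4;
      bytes -= 4;
   }
   return dst;
}

#ifdef HAVE_X86_DISPATCH
/* SSSE3 path: one byte shuffle handles 16 bytes (four pixels).
 * Unaligned loads and stores are used, so no alignment is assumed here.
 */
__attribute__((target("ssse3")))
static void *
rgba8_copy_ssse3(void *dst, const void *src, size_t bytes)
{
   const __m128i perm = _mm_setr_epi8(2, 1, 0, 3, 6, 5, 4, 7,
                                      10, 9, 8, 11, 14, 13, 12, 15);
   uint8_t *d = dst;
   const uint8_t *s = src;

   while (bytes >= 16) {
      __m128i px = _mm_loadu_si128((const __m128i *)s);
      _mm_storeu_si128((__m128i *)d, _mm_shuffle_epi8(px, perm));
      d += 16;
      s += 16;
      bytes -= 16;
   }
   /* Leftover pixels go through the scalar loop. */
   rgba8_copy_scalar(d, s, bytes);
   return dst;
}
#endif

/* Decide at run time which copy function to hand out. */
static mem_copy_fn
pick_rgba8_copy(void)
{
#ifdef HAVE_X86_DISPATCH
   __builtin_cpu_init();
   if (__builtin_cpu_supports("ssse3"))
      return rgba8_copy_ssse3;
#endif
   return rgba8_copy_scalar;
}
```

A caller would fetch the pointer once with pick_rgba8_copy() and reuse it. Note that a function reached through such a pointer cannot be inlined into its caller, which is relevant to the performance discussion in the replies.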
Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
On Thu, Nov 6, 2014 at 4:20 AM, Timothy Arceri t_arc...@yahoo.com.au wrote:
Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
On Thu, 2014-11-06 at 10:03 -0800, Matt Turner wrote:
On Thu, Nov 6, 2014 at 4:20 AM, Timothy Arceri t_arc...@yahoo.com.au wrote:
Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
On Thu, Nov 6, 2014 at 1:22 PM, Timothy Arceri t_arc...@yahoo.com.au wrote:
On Thu, 2014-11-06 at 10:03 -0800, Matt Turner wrote:
On Thu, Nov 6, 2014 at 4:20 AM, Timothy Arceri t_arc...@yahoo.com.au wrote:
Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
On 11/06/2014 02:12 PM, Matt Turner wrote:
On Thu, Nov 6, 2014 at 1:22 PM, Timothy Arceri t_arc...@yahoo.com.au wrote:
On Thu, 2014-11-06 at 10:03 -0800, Matt Turner wrote:
On Thu, Nov 6, 2014 at 4:20 AM, Timothy Arceri t_arc...@yahoo.com.au wrote:

+#include <assert.h>
+#include <stdint.h>
+#include <stddef.h>

I don't think you need these three includes for a single prototype.

Right, I can move assert to the .c

Presumably one of the others can be removed as well? I don't know what defines size_t.

stddef.h since C89 at least.

+
+/* Fast copying for tile spans.
+ *
+ * As long as the destination texture is 16 aligned,
+ * any 16 or 64 spans we get here should also be 16 aligned.
+ */
+void *
+_mesa_fast_rgba8_copy(void *dst, const void *src, size_t n);
--
1.9.3
Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
On Thu 06 Nov 2014, Timothy Arceri wrote:

Also cleans up some if statements in the *faster functions.

I have comments about the cleanup below.

diff --git a/src/mesa/drivers/dri/i965/intel_tex_subimage.c b/src/mesa/drivers/dri/i965/intel_tex_subimage.c
index cb5738a..0deeb75 100644
--- a/src/mesa/drivers/dri/i965/intel_tex_subimage.c
+++ b/src/mesa/drivers/dri/i965/intel_tex_subimage.c

/**
 * Copy texture data from linear to X tile layout, faster.
 *
 * Same as \ref xtile_copy but faster, because it passes constant parameters
 * for common cases, allowing the compiler to inline code optimized for those
 * cases.
 *
 * \copydoc tile_copy_fn
 */
static FLATTEN void
xtile_copy_faster(...)

@@ -352,19 +316,8 @@ xtile_copy_faster(uint32_t x0, uint32_t x1, uint32_t x2, uint32_t x3,
                   mem_copy_fn mem_copy)
 {
    if (x0 == 0 && x3 == xtile_width && y0 == 0 && y1 == xtile_height) {
-      if (mem_copy == memcpy)
-         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
-   } else {
-      if (mem_copy == memcpy)
-         return xtile_copy(x0, x1, x2, x3, y0, y1,
-                           dst, src, src_pitch, swizzle_bit, memcpy);
-      else if (mem_copy == rgba8_copy)
-         return xtile_copy(x0, x1, x2, x3, y0, y1,
-                           dst, src, src_pitch, swizzle_bit, rgba8_copy);
+      return xtile_copy(0, 0, xtile_width, xtile_width, 0, xtile_height,
+                        dst, src, src_pitch, swizzle_bit, mem_copy);
    }
    xtile_copy(x0, x1, x2, x3, y0, y1,
               dst, src, src_pitch, swizzle_bit, mem_copy);

The cleanup of this if tree concerns me. According to the function comment, the original author of this function, fjhenigman, clearly created the weird 'if' tree with the intention that the compiler would inline code optimized for those cases.

Without one of the following, I object to this cleanup:
- Frank's approval, or
- Proof that gcc never does the desired optimizations, or
- Proof that this change does not harm Chrome's texture upload performance.
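To illustrate the point about the 'if' tree, here is a small standalone sketch of the same idiom. It is not the Mesa code: copy_rows, copy_rows_faster, swap_rb_copy and SPAN are hypothetical names, and always_inline is used here where the Mesa function comment mentions FLATTEN. The idea is that comparing the function pointer against known functions and then passing the literal function name (and literal sizes) gives the compiler a call site where everything is a compile-time constant, so it can inline and specialize the copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void *(*mem_copy_fn)(void *dst, const void *src, size_t bytes);

enum { SPAN = 64 };   /* hypothetical span size for the fast path */

/* Copy 4-byte pixels, swapping bytes 0 and 2 of each pixel. */
static void *
swap_rb_copy(void *dst, const void *src, size_t bytes)
{
   uint8_t *d = dst;
   const uint8_t *s = src;

   while (bytes >= 4) {
      d[0] = s[2];
      d[1] = s[1];
      d[2] = s[0];
      d[3] = s[3];
      d += 4;
      s += 4;
      bytes -= 4;
   }
   return dst;
}

/* Generic worker: forced inline so each call site gets its own copy. */
static inline __attribute__((always_inline)) void
copy_rows(uint8_t *dst, const uint8_t *src, size_t span, size_t rows,
          mem_copy_fn mem_copy)
{
   for (size_t y = 0; y < rows; y++)
      mem_copy(dst + y * span, src + y * span, span);
}

static void
copy_rows_faster(uint8_t *dst, const uint8_t *src, size_t span, size_t rows,
                 mem_copy_fn mem_copy)
{
   if (span == SPAN) {
      /* The branches look redundant, but each call below names the copy
       * function and the span as constants, so the compiler can inline
       * the copy with a known size.
       */
      if (mem_copy == memcpy) {
         copy_rows(dst, src, SPAN, rows, memcpy);
         return;
      }
      if (mem_copy == swap_rb_copy) {
         copy_rows(dst, src, SPAN, rows, swap_rb_copy);
         return;
      }
   }
   /* General case: an indirect call through the pointer. */
   copy_rows(dst, src, span, rows, mem_copy);
}
```

Collapsing the branches into a single call that forwards mem_copy keeps the behavior identical but leaves the compiler with an opaque pointer, so whether the fast path still gets specialized depends on the optimizer. That is the risk being raised here.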
Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
I tested your patch with the teximage program in mesa demos, the same thing I used to benchmark when I developed this code. As Matt and Chad point out, the odd-looking _faster functions are there for a reason. Your change causes a huge slowdown.

I tested on a sandybridge system with an Intel(R) Celeron(R) CPU 857 @ 1.20GHz. Mesa compiled with -O2.

original code:
TexSubImage(RGBA/ubyte 256 x 256): 9660.4 images/sec, 2415.1 MB/sec
TexSubImage(RGBA/ubyte 1024 x 1024): 821.2 images/sec, 3284.7 MB/sec
TexSubImage(RGBA/ubyte 4096 x 4096): 76.3 images/sec, 4884.9 MB/sec
TexSubImage(BGRA/ubyte 256 x 256): 11307.1 images/sec, 2826.8 MB/sec
TexSubImage(BGRA/ubyte 1024 x 1024): 944.6 images/sec, 3778.6 MB/sec
TexSubImage(BGRA/ubyte 4096 x 4096): 76.7 images/sec, 4908.3 MB/sec
TexSubImage(L/ubyte 256 x 256): 17847.5 images/sec, 1115.5 MB/sec
TexSubImage(L/ubyte 1024 x 1024): 3068.2 images/sec, 3068.2 MB/sec
TexSubImage(L/ubyte 4096 x 4096): 224.6 images/sec, 3593.0 MB/sec

your code:
TexSubImage(RGBA/ubyte 256 x 256): 3271.6 images/sec, 817.9 MB/sec
TexSubImage(RGBA/ubyte 1024 x 1024): 232.3 images/sec, 929.2 MB/sec
TexSubImage(RGBA/ubyte 4096 x 4096): 47.5 images/sec, 3038.6 MB/sec
TexSubImage(BGRA/ubyte 256 x 256): 2426.5 images/sec, 606.6 MB/sec
TexSubImage(BGRA/ubyte 1024 x 1024): 164.1 images/sec, 656.4 MB/sec
TexSubImage(BGRA/ubyte 4096 x 4096): 13.4 images/sec, 854.8 MB/sec
TexSubImage(L/ubyte 256 x 256): 9514.5 images/sec, 594.7 MB/sec
TexSubImage(L/ubyte 1024 x 1024): 864.1 images/sec, 864.1 MB/sec
TexSubImage(L/ubyte 4096 x 4096): 59.7 images/sec, 955.2 MB/sec

This is just one run, not an average, but you can see it's slower across the board, by up to a factor of around 6.

Also I couldn't configure the build after your patch. I think you left out a change to configure.ac to define SSSE3_SUPPORTED.

On Thu, Nov 6, 2014 at 6:08 PM, Chad Versace chad.vers...@intel.com wrote:
On Thu 06 Nov 2014, Timothy Arceri wrote:

Also cleans up some if statements in the *faster functions.
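As a rough check on the benchmark figures above, the two helpers below (hypothetical names, not part of teximage) compute the slowdown ratio between two runs and the throughput implied by an images/sec figure. The throughput formula assumes MB means 2^20 bytes and that RGBA/BGRA use 4 bytes per pixel and L uses 1; that is an inference from the numbers, which agree with it to within rounding, not something stated in the thread.

```c
#include <stddef.h>

/* Throughput implied by an upload rate: bytes per image times images per
 * second, expressed in units of 2^20 bytes (assumed meaning of "MB").
 */
static double
upload_mb_per_sec(double images_per_sec, size_t width, size_t height,
                  size_t bytes_per_pixel)
{
   double bytes_per_image = (double)width * (double)height *
                            (double)bytes_per_pixel;
   return images_per_sec * bytes_per_image / (1024.0 * 1024.0);
}

/* How many times slower the patched run is than the original run. */
static double
slowdown(double original_rate, double patched_rate)
{
   return original_rate / patched_rate;
}
```

For example, upload_mb_per_sec(9660.4, 256, 256, 4) gives 2415.1, matching the first RGBA line. The largest ratio is for BGRA 1024 x 1024, slowdown(944.6, 164.1), which is about 5.76, and the smallest is for RGBA 4096 x 4096, slowdown(76.3, 47.5), about 1.61. That is consistent with "up to a factor of around 6".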
Re: [Mesa-dev] [PATCH 2/2] i965: add runtime check for SSSE3 rgba8_copy
On Thu, Nov 6, 2014 at 7:30 PM, Frank Henigman fjhenig...@google.com wrote:

Also I couldn't configure the build after your patch. I think you left out a change to configure.ac to define SSSE3_SUPPORTED.

Ah, that was in patch 1/2.