Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On Mon, Jun 25, 2012 at 7:45 PM, Matt Turner wrote:
> On Mon, Jun 25, 2012 at 1:00 AM, Siarhei Siamashka wrote:
>> OK, I got a 7-bit variant of SSE2 bilinear scaling working. It shows
>> quite a good speed boost thanks to the PMADDWD instruction, which can
>> be used now.
>
> Looking forward to seeing the patch. I'll be really interested to
> compare performance on Loongson and iwMMXt when I can switch the
> scaling functions over to multiply-add.

Sent to the list. Took a bit of time to test it on different hardware.

-- 
Best regards,
Siarhei Siamashka
___
Pixman mailing list
Pixman@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/pixman
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On Mon, Jun 25, 2012 at 1:00 AM, Siarhei Siamashka wrote:
> On Mon, Jun 18, 2012 at 9:09 PM, Søren Sandmann wrote:
>> Siarhei Siamashka writes:
>>
>>> This is also a very useful test, but it effectively requires having
>>> an alternative double precision implementation for all the pixman
>>> functionality to be verified. For bilinear scaling it means that at
>>> least various types of repeats need to be handled, etc. And this
>>> sounds like a lot of work.
>>
>> Yeah, I agree that it's a lot of work and that dropping to 7 bits is
>> easier in the short term.
>
> OK, I got a 7-bit variant of SSE2 bilinear scaling working. It shows
> quite a good speed boost thanks to the PMADDWD instruction, which can
> be used now.

Looking forward to seeing the patch. I'll be really interested to
compare performance on Loongson and iwMMXt when I can switch the
scaling functions over to multiply-add.
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On Mon, Jun 18, 2012 at 9:09 PM, Søren Sandmann wrote:
> Siarhei Siamashka writes:
>
>> This is also a very useful test, but it effectively requires having
>> an alternative double precision implementation for all the pixman
>> functionality to be verified. For bilinear scaling it means that at
>> least various types of repeats need to be handled, etc. And this
>> sounds like a lot of work.
>
> Yeah, I agree that it's a lot of work and that dropping to 7 bits is
> easier in the short term.

OK, I got a 7-bit variant of SSE2 bilinear scaling working. It shows
quite a good speed boost thanks to the PMADDWD instruction, which can
be used now.

-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
Siarhei Siamashka writes:

> This is also a very useful test, but it effectively requires having
> an alternative double precision implementation for all the pixman
> functionality to be verified. For bilinear scaling it means that at
> least various types of repeats need to be handled, etc. And this
> sounds like a lot of work.

Yeah, I agree that it's a lot of work and that dropping to 7 bits is
easier in the short term.

> There are also some alternative variants. For example, allow a custom
> prefix for public symbols in pixman (so that several pixman instances
> can be loaded into a test application at the same time). Or even
> update the existing pixman tests to add xlib support and compare the
> locally rendered results with XRender. The latter seems particularly
> useful, because it could also be used for XRender implementation
> validation in various hardware accelerated drivers (and
> complement/retire rendercheck).

Yet another variant is to get the single precision floating point
pipeline working instead of the current 16 bit one, and then use it as
the reference implementation.

Søren
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On Sun, Jun 17, 2012 at 8:27 AM, Bill Spitzak wrote:
> On 06/16/2012 07:08 AM, Siarhei Siamashka wrote:
>
>>> An alternative idea is instead of changing the algorithm across the
>>> board, we could stop requiring bit exact results. The main piece of work
>>> here is to change the test suite so that it will accept pixels up to
>>> some maximum relative error. There is already some support for this: the
>>> 'composite' test is using the 'pixel_checker_t' to compare the pixman
>>> output with perfect pixels computed in double precision.
>>>
>>> This latter idea is ultimately more useful because it will allow much
>>> more flexibility in the kinds of SIMD instruction sets we can support.
>>
>> This is also a very useful test, but it effectively requires having
>> an alternative double precision implementation for all the pixman
>> functionality to be verified.
>
> I don't understand this.

The 'composite' test alone has limited utility. It checks the
correctness of composite operations performed with just a single pixel.
But in order to provide better coverage for the functionality used by
real applications, we must also test different image sizes (the inner
loops of composite functions do unrolling, and bugs may potentially be
introduced both in the main loop body and in the handling of
leading/trailing pixels). Additionally, when skipping fully transparent
pixels, SIMD optimized code skips whole groups of them at once, etc.
There are lots of corner cases which need to be checked. But it's
easier to demonstrate with an example.
Let's try to add a bug to the 'sse2_combine_add_u' function:

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index 70f8b77..fbea4f6 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -1348,7 +1348,7 @@ sse2_combine_add_u (pixman_implementation_t *imp,
     {
         __m128i s;
 
-        s = combine4 ((__m128i*)ps, (__m128i*)pm);
+        s = _mm_setzero_si128 ();
 
         save_128_aligned (
             (__m128i*)pd, _mm_adds_epu8 (s, load_128_aligned ((__m128i*)pd)));

The patch above introduces a bug into the code in the "while (w >= 4)"
loop. Let's see how it is handled by the pixman test suite:

PASS: a1-trap-test
PASS: pdf-op-test
PASS: region-test
PASS: region-translate-test
PASS: fetch-test
PASS: oob-test
PASS: trap-crasher
PASS: alpha-loop
PASS: scaling-crash-test
PASS: scaling-helpers-test
PASS: gradient-crash-test
region_contains test passed (checksum=D2BF8C73)
PASS: region-contains-test
Wrong alpha value at (0, 0). Should be 0xff; got 0xf7. Source was 0x65, original dest was 0xf7
src: a8r8g8b8, alpha: none, origin 0 0
dst: a8r8g8b8, alpha: none, origin: 0 0
FAIL: alphamap
PASS: stress-test
composite traps test failed! (checksum=BE93DA05, expected E3112106)
FAIL: composite-traps-test
blitters test failed! (checksum=C8682A01, expected A364B5BF)
FAIL: blitters-test
glyph test failed! (checksum=B1B638A1, expected 1B7696A2)
FAIL: glyph-test
scaling test failed! (checksum=64788A7E, expected 80DF1CB2)
FAIL: scaling-test
affine test passed (checksum=1EF2175A)
PASS: affine-test
PASS: composite
5 of 20 tests failed

The 'composite' test did not detect anything wrong, as expected.
Now let's break the same 'sse2_combine_add_u' function completely by
inserting "return" at its very beginning:

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index 70f8b77..25c7aa0 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -1331,6 +1331,8 @@ sse2_combine_add_u (pixman_implementation_t *imp,
     const uint32_t* ps = src;
     const uint32_t* pm = mask;
 
+return;
+
     while (w && (unsigned long)pd & 15)
     {
         s = combine1 (ps, pm);

Now the 'composite' test can see that there is a problem:

PASS: a1-trap-test
PASS: pdf-op-test
PASS: region-test
PASS: region-translate-test
PASS: fetch-test
PASS: oob-test
PASS: trap-crasher
PASS: alpha-loop
PASS: scaling-crash-test
PASS: scaling-helpers-test
PASS: gradient-crash-test
region_contains test passed (checksum=D2BF8C73)
PASS: region-contains-test
Wrong alpha value at (0, 0). Should be 0xff; got 0xf7. Source was 0x65, original dest was 0xf7
src: a8r8g8b8, alpha: none, origin 0 0
dst: a8r8g8b8, alpha: none, origin: 0 0
FAIL: alphamap
PASS: stress-test
composite traps test failed! (checksum=4B0E22E6, expected E3112106)
FAIL: composite-traps-test
blitters test failed! (checksum=E95FFC20, expected A364B5BF)
FAIL: blitters-test
glyph test failed! (checksum=FDF0BD54, expected 1B7696A2)
FAIL: glyph-test
scaling test failed! (checksum=55981EC2, expected 80DF1CB2)
FAIL: scaling-test
affine test passed (checksum=1EF2175A)
PASS: affine-test
Test 3145752 failed
Operator: ADD
Test 4194328 failed
Operator: ADD
Source: r3g3b2, 1x1
Destination: x4r4g4b4, 1x1
Source: a1r1g1b1, 1x1
Destination: a2r2g2b2, 1x1
              R     G     B     A         Rounded
Source color: 1.000 1.000 1.000 0.000     1.000 1.000 1.000 0.00
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On 06/16/2012 07:08 AM, Siarhei Siamashka wrote:

>> An alternative idea is instead of changing the algorithm across the
>> board, we could stop requiring bit exact results. The main piece of work
>> here is to change the test suite so that it will accept pixels up to
>> some maximum relative error. There is already some support for this: the
>> 'composite' test is using the 'pixel_checker_t' to compare the pixman
>> output with perfect pixels computed in double precision.
>>
>> This latter idea is ultimately more useful because it will allow much
>> more flexibility in the kinds of SIMD instruction sets we can support.
>
> This is also a very useful test, but it effectively requires having
> an alternative double precision implementation for all the pixman
> functionality to be verified.

I don't understand this. The current tests are checking for equality
with an image. The approximate tests will just check for approximate
equality with the same image. I fail to see why the image has to
somehow be "more correct".
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On Fri, Jun 15, 2012 at 10:51 PM, Søren Sandmann wrote:
> Matt Turner writes:
>
>> Also, are we planning to change the bilinear scaling algorithm for
>> 0.28 so that we can use pmaddubsw?
>
> I wouldn't object to a patch that dropped precision to 7 bits for all
> bilinear code, but it would require changes at least to the general
> code, the fast path code, the NEON code and the SSE2 code.

This is really a trivial change. The only difficulty is to enable it
and test on all the supported platforms simultaneously. Using qemu and
the --enable-static-testprogs option allows running the basic tests
even without having all the hardware. Though MIPS DSP ASE support is
only being added to qemu at the moment:

http://lists.gnu.org/archive/html/qemu-devel/2012-03/msg04990.html

> An alternative idea is instead of changing the algorithm across the
> board, we could stop requiring bit exact results. The main piece of work
> here is to change the test suite so that it will accept pixels up to
> some maximum relative error. There is already some support for this: the
> 'composite' test is using the 'pixel_checker_t' to compare the pixman
> output with perfect pixels computed in double precision.
>
> This latter idea is ultimately more useful because it will allow much
> more flexibility in the kinds of SIMD instruction sets we can support.

This is also a very useful test, but it effectively requires having an
alternative double precision implementation for all the pixman
functionality to be verified. For bilinear scaling it means that at
least various types of repeats need to be handled, etc. And this
sounds like a lot of work.

There are also some alternative variants. For example, allow a custom
prefix for public symbols in pixman (so that several pixman instances
can be loaded into a test application at the same time). Or even update
the existing pixman tests to add xlib support and compare the locally
rendered results with XRender. The latter seems particularly useful,
because it could also be used for XRender implementation validation in
various hardware accelerated drivers (and complement/retire
rendercheck).

-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
Matt Turner writes:

> The registers -- yes. The 8-byte aligned loads and stores I'm not
> sure. Can you do 8-byte aligned loads and stores to/from SSE
> registers?

I believe movq can use SSE registers.

> Indeed, runtime generation would be great. Something like LLVM or orc
> would be interesting options. I'm not sure I'm up to that kind of
> project yet/now though.
>
> I think adding pixman-sse*.c files is a reasonable measure for now.
> Think it's okay to split the static inline support functions from
> pixman-sse2.c out into a header to be shared with the other
> pixman-sse*.c files?

Sounds reasonable to me.

> Also, are we planning to change the bilinear scaling algorithm for
> 0.28 so that we can use pmaddubsw?

I wouldn't object to a patch that dropped precision to 7 bits for all
bilinear code, but it would require changes at least to the general
code, the fast path code, the NEON code and the SSE2 code.

An alternative idea is instead of changing the algorithm across the
board, we could stop requiring bit exact results. The main piece of
work here is to change the test suite so that it will accept pixels up
to some maximum relative error. There is already some support for this:
the 'composite' test is using the 'pixel_checker_t' to compare the
pixman output with perfect pixels computed in double precision.

This latter idea is ultimately more useful because it will allow much
more flexibility in the kinds of SIMD instruction sets we can support.

Søren
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
Sorry it's taken so long to get back to this.

On Wed, May 9, 2012 at 12:57 PM, Søren Sandmann wrote:
> Matt Turner writes:
>
> I still think MMX has no use on modern systems. The SSE2 implementation
> used to have such MMX loops, but they were removed in
> f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
> compilers that would miscompile the emms instruction.
>
> Can't the MMX loop you added be done with SSE registers and instructions
> as well?

The registers -- yes. The 8-byte aligned loads and stores I'm not sure.
Can you do 8-byte aligned loads and stores to/from SSE registers?

>> Porting the pmadd algorithm to SSE4.1 gave another (very large)
>> improvement.
>>
>> fast:    src__0565 = L1: 655.18  L2: 675.94  M:642.31 ( 23.44%) HT:403.00 VT:286.45 R:307.61 RT:150.59 (1675Kops/s)
>> mmx:     src__0565 = L1:2050.45 L2:1988.97 M:1586.16 ( 57.34%) HT:529.12 VT:374.28 R:412.09 RT:177.35 (1913Kops/s)
>> sse2:    src__0565 = L1:1518.61 L2:1493.10 M:1279.18 ( 46.24%) HT:433.65 VT:314.48 R:349.14 RT:151.84 (1685Kops/s)
>> sse2mmx: src__0565 = L1:1544.91 L2:1520.83 M:1307.79 ( 47.01%) HT:447.82 VT:326.81 R:379.60 RT:174.07 (1878Kops/s)
>> sse4:    src__0565 = L1:4654.11 L2:4202.98 M:1885.01 ( 69.35%) HT:540.65 VT:421.04 R:427.73 RT:161.45 (1773Kops/s)
>> sse4mmx: src__0565 = L1:4786.27 L2:4255.13 M:1920.18 ( 69.93%) HT:581.42 VT:447.99 R:482.27 RT:193.15 (2049Kops/s)
>>
>> I'd like to isolate exactly what the performance improvement given by
>> the only SSE4.1 instruction (i.e., _mm_packus_epi32) is before declaring
>> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
>> to pack the two xmm registers together in pack_565_2packedx128_128,
>> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.
>
> Would it work to subtract 0x8000, then use packssdw, then add 0x8000?

I couldn't make that work, but I asked on StackOverflow and got a nice
solution:

http://stackoverflow.com/questions/11024652/simulating-packusdw-functionality-with-sse2

>> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
>> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
>> compile-time option (or recompile the whole file to get a few
>> improvements from SSE4.1).
>>
>> It seems like we need a generic solution that would say for each
>> compositing function
>> - this is what you do for 1-byte;
>> - this is what you do for 8-bytes if you have MMX;
>> - this is what you do for 16-bytes if you have SSE2;
>> - this is what you do for 16-bytes if you have SSE3;
>> - this is what you do for 16-bytes if you have SSE4.1.
>> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>>
>> Does this seem like a reasonable approach? *How* to do it -- suggestions
>> welcome.
>
> I think ideally we would generate this code at runtime. It's just not
> feasible to generate code for all combinations of instruction sets at
> build time and libpixman.so is already rather large. Generating the code
> at runtime has the additional advantages that it is not limited to a
> fixed set of fast paths and that it can make use of more details of the
> operation, such as the precise alignment for palignr generation.
>
> There are various ways to go about this, ranging from simple-minded
> stitching-together of pre-written snippets to a full shader compiler. A
> full shader compiler is obviously a big project, but maybe a simple
> stitch-together kind of thing wouldn't actually be that hard using
> something like this:
>
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph
>
> That said, runtime code-generation is still a big project, and it does
> make sense to make use of some of the newer instruction sets.
>
> We do have support for fallbacks, so as Makoto-san says, just adding new
> pixman-ssse3.c and pixman-sse41.c files with duplicated code for the
> particular operations that benefit from ssse3 and sse4.1 might be the
> simplest way to proceed.

Indeed, runtime generation would be great. Something like LLVM or orc
would be interesting options. I'm not sure I'm up to that kind of
project yet/now though.

I think adding pixman-sse*.c files is a reasonable measure for now.
Think it's okay to split the static inline support functions from
pixman-sse2.c out into a header to be shared with the other
pixman-sse*.c files?

Also, are we planning to change the bilinear scaling algorithm for 0.28
so that we can use pmaddubsw?

Thanks,
Matt
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On Wed, May 9, 2012 at 7:57 PM, Søren Sandmann wrote:
> Matt Turner writes:
>
>> I started porting my src__0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>> SSE3: lddqu - for unaligned loads across cache lines
>
> I don't really understand that instruction. Isn't it identical to
> movdqu? Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?
>
>> SSSE3: palignr - for unaligned loads (but requires software
>> pipelining...)
>> pmaddubsw - maybe?
>
> pmaddubsw would be very useful for bilinear interpolation if we drop
> coordinate precision to 7 bits instead of the current 8. One example way
> to use it is to put 7-bit values of (1 - x, x, 1 - x, x) in a register,
> and interleave the top-left and top-right pixels in another. pmaddubsw
> on those two registers will then produce a linear interpolation between
> the two top pixels. A similar thing can be done for the bottom pixels,
> and then the intermediate results can be interleaved and combined using
> pmaddwd.

I would say that improving bilinear scaling performance on x86 is
really important for pixman in order to remain competitive. The
following link might be a good source of inspiration:

http://www.hackermusings.com/2012/05/firefoxs-graphics-performance-on-x11/

The comments with the azure backend performance numbers are
particularly interesting. For example, one of them mentions 12fps with
xrender disabled (using pixman?) vs. 15fps with azure canvas enabled
(using skia?) for FishIETank. Needless to say, it would be nice to
improve pixman performance by 30% or more.

And here are some benchmarks for the firefox-fishtank trace with
pixman-0.25.2, comparing NEON vs. SSE2 for ARM Cortex-A8 and Intel Atom
(both are superscalar dual-issue in-order cores):

=== ARM Cortex-A8 @1GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp"
[  0]  image  firefox-fishtank  359.228  359.436  0.43%  3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mthumb"
[  0]  image  firefox-fishtank  347.195  347.773  0.12%  3/3

=== Intel Atom N450 @1.67GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom"
[  0]  image  firefox-fishtank  308.439  308.881  0.09%  3/3

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom -m32"
[  0]  image  firefox-fishtank  309.457  309.568  0.07%  3/3

CC=gcc-4.5.3 CFLAGS="-O2"
[  0]  image  firefox-fishtank  345.906  346.156  0.04%  3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mtune=generic"
[  0]  image  firefox-fishtank  345.367  345.900  0.09%  3/3

The results for gcc-4.7.0 were nearly the same. Currently a 1GHz ARM
Cortex-A8 is almost as fast as a 1.67GHz Atom. The ARM NEON bilinear
code is using 8-bit multiplications. Atom could use PMADDUBSW to also
benefit from 8-bit multiplications and improve performance per MHz.

-- 
Best regards,
Siarhei Siamashka
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
On 2012-05-09, at 12:57 PM, Søren Sandmann wrote:
> Matt Turner writes:
>
>> I started porting my src__0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>> SSE3: lddqu - for unaligned loads across cache lines
>
> I don't really understand that instruction. Isn't it identical to
> movdqu? Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?

"The instructions movdqu, movups, movupd and lddqu are all able to read
unaligned vectors. lddqu is faster than the alternatives on P4E and PM
processors, but requires the SSE3 instruction set. The unaligned read
instructions are relatively slow on older processors, but faster on
Nehalem, Sandy Bridge and on future AMD and Intel processors."

From http://www.agner.org/optimize/optimizing_assembly.pdf

-Jeff
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
Matt Turner writes:

> I started porting my src__0565 MMX function to SSE2, and in the
> process started thinking about using SSE3+. The useful instructions
> added post SSE2 that I see are
>
> SSE3: lddqu - for unaligned loads across cache lines

I don't really understand that instruction. Isn't it identical to
movdqu? Or is the idea that lddqu is faster than movdqu for cache line
splits, but slower for plain old, non-cache split unaligned loads?

> SSSE3: palignr - for unaligned loads (but requires software
> pipelining...)
> pmaddubsw - maybe?

pmaddubsw would be very useful for bilinear interpolation if we drop
coordinate precision to 7 bits instead of the current 8. One example
way to use it is to put 7-bit values of (1 - x, x, 1 - x, x) in a
register, and interleave the top-left and top-right pixels in another.
pmaddubsw on those two registers will then produce a linear
interpolation between the two top pixels. A similar thing can be done
for the bottom pixels, and then the intermediate results can be
interleaved and combined using pmaddwd.

> SSE4.1: pextr*, pinsr*, pcmpeqq, ptest
> packusdw - for 888 -> 565 packing
>
> I first wrote a basic src__0565 for SSE2 and discovered that the
> performance was worse than MMX (which we've been saying has no use in
> modern systems -- oops!). I figured the cool pmadd algorithm of MMX was
> the cause, but I wondered if 16-byte SSE chunks are too large
> sometimes.
>
> I added an 8-byte MMX loop before and after the main 16-byte SSE loop
> and got a nice improvement.

I still think MMX has no use on modern systems. The SSE2 implementation
used to have such MMX loops, but they were removed in
f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
compilers that would miscompile the emms instruction.

Can't the MMX loop you added be done with SSE registers and
instructions as well?

> Porting the pmadd algorithm to SSE4.1 gave another (very large)
> improvement.
>
> fast:    src__0565 = L1: 655.18  L2: 675.94  M:642.31 ( 23.44%) HT:403.00 VT:286.45 R:307.61 RT:150.59 (1675Kops/s)
> mmx:     src__0565 = L1:2050.45 L2:1988.97 M:1586.16 ( 57.34%) HT:529.12 VT:374.28 R:412.09 RT:177.35 (1913Kops/s)
> sse2:    src__0565 = L1:1518.61 L2:1493.10 M:1279.18 ( 46.24%) HT:433.65 VT:314.48 R:349.14 RT:151.84 (1685Kops/s)
> sse2mmx: src__0565 = L1:1544.91 L2:1520.83 M:1307.79 ( 47.01%) HT:447.82 VT:326.81 R:379.60 RT:174.07 (1878Kops/s)
> sse4:    src__0565 = L1:4654.11 L2:4202.98 M:1885.01 ( 69.35%) HT:540.65 VT:421.04 R:427.73 RT:161.45 (1773Kops/s)
> sse4mmx: src__0565 = L1:4786.27 L2:4255.13 M:1920.18 ( 69.93%) HT:581.42 VT:447.99 R:482.27 RT:193.15 (2049Kops/s)
>
> I'd like to isolate exactly what the performance improvement given by
> the only SSE4.1 instruction (i.e., _mm_packus_epi32) is before declaring
> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
> to pack the two xmm registers together in pack_565_2packedx128_128,
> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.

Would it work to subtract 0x8000, then use packssdw, then add 0x8000?

> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
> compile-time option (or recompile the whole file to get a few
> improvements from SSE4.1).
>
> It seems like we need a generic solution that would say for each
> compositing function
> - this is what you do for 1-byte;
> - this is what you do for 8-bytes if you have MMX;
> - this is what you do for 16-bytes if you have SSE2;
> - this is what you do for 16-bytes if you have SSE3;
> - this is what you do for 16-bytes if you have SSE4.1.
> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>
> Does this seem like a reasonable approach? *How* to do it -- suggestions
> welcome.

I think ideally we would generate this code at runtime. It's just not
feasible to generate code for all combinations of instruction sets at
build time and libpixman.so is already rather large. Generating the
code at runtime has the additional advantages that it is not limited to
a fixed set of fast paths and that it can make use of more details of
the operation, such as the precise alignment for palignr generation.

There are various ways to go about this, ranging from simple-minded
stitching-together of pre-written snippets to a full shader compiler. A
full shader compiler is obviously a big project, but maybe a simple
stitch-together kind of thing wouldn't actually be that hard using
something like this:

http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph

That said, runtime code-generation is still a big project, and it does
make sense to make use of some of the newer instruction sets.

We do have support for fallbacks, so as Makoto-san says, just adding
new pixman-ssse3.c and pixman-sse41.c files with duplicated code for
the particular operations that benefit from ssse3 and sse4.1 might be
the simplest way to proceed.

Søren
Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1
Hi, Matt.

The Win64 MSVC target doesn't support MMX intrinsics. If you add MMX
code to pixman-sse2.c, please add USE_X86_MMX macro checks for all of
it. And when using MMX, you have to call _mm_empty() after the MMX code
is finished.

I think that you should split the SSE4.1 code out to another file
(pixman-sse41.c?). You know, gcc needs the -msse4.1 option for it.

-- Makoto

(2012/05/03 12:42), Matt Turner wrote:

I started porting my src__0565 MMX function to SSE2, and in the
process started thinking about using SSE3+. The useful instructions
added post SSE2 that I see are

SSE3: lddqu - for unaligned loads across cache lines

SSSE3: palignr - for unaligned loads (but requires software
pipelining...)
pmaddubsw - maybe?

SSE4.1: pextr*, pinsr*, pcmpeqq, ptest
packusdw - for 888 -> 565 packing

I first wrote a basic src__0565 for SSE2 and discovered that the
performance was worse than MMX (which we've been saying has no use in
modern systems -- oops!). I figured the cool pmadd algorithm of MMX was
the cause, but I wondered if 16-byte SSE chunks are too large
sometimes.

I added an 8-byte MMX loop before and after the main 16-byte SSE loop
and got a nice improvement. Porting the pmadd algorithm to SSE4.1 gave
another (very large) improvement.

fast:    src__0565 = L1: 655.18  L2: 675.94  M:642.31 ( 23.44%) HT:403.00 VT:286.45 R:307.61 RT:150.59 (1675Kops/s)
mmx:     src__0565 = L1:2050.45 L2:1988.97 M:1586.16 ( 57.34%) HT:529.12 VT:374.28 R:412.09 RT:177.35 (1913Kops/s)
sse2:    src__0565 = L1:1518.61 L2:1493.10 M:1279.18 ( 46.24%) HT:433.65 VT:314.48 R:349.14 RT:151.84 (1685Kops/s)
sse2mmx: src__0565 = L1:1544.91 L2:1520.83 M:1307.79 ( 47.01%) HT:447.82 VT:326.81 R:379.60 RT:174.07 (1878Kops/s)
sse4:    src__0565 = L1:4654.11 L2:4202.98 M:1885.01 ( 69.35%) HT:540.65 VT:421.04 R:427.73 RT:161.45 (1773Kops/s)
sse4mmx: src__0565 = L1:4786.27 L2:4255.13 M:1920.18 ( 69.93%) HT:581.42 VT:447.99 R:482.27 RT:193.15 (2049Kops/s)

I'd like to isolate exactly what the performance improvement given by
the only SSE4.1 instruction (i.e., _mm_packus_epi32) is before declaring
SSE4.1 a fantastic improvement. If you can come up with a reasonable way
to pack the two xmm registers together in pack_565_2packedx128_128,
please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.

This got me wondering how to proceed. I'd rather not duplicate a bunch
of code from pixman-mmx.c, and I'd rather not add #ifdef USE_SSE41 to
pixman-sse2.c and make it a compile-time option (or recompile the whole
file to get a few improvements from SSE4.1).

It seems like we need a generic solution that would say for each
compositing function
- this is what you do for 1-byte;
- this is what you do for 8-bytes if you have MMX;
- this is what you do for 16-bytes if you have SSE2;
- this is what you do for 16-bytes if you have SSE3;
- this is what you do for 16-bytes if you have SSE4.1.
and then construct the functions for generic/MMX/SSE2/SSE4 at build
time.

Does this seem like a reasonable approach? *How* to do it -- suggestions
welcome.

---
 pixman/pixman-sse2.c | 152 ++
 1 files changed, 152 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index e217ca3..763c7b3 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -30,8 +30,12 @@
 #include <config.h>
 #endif
 
+#include <mmintrin.h> /* for MMX intrinsics */
 #include <xmmintrin.h> /* for _mm_shuffle_pi16 and _MM_SHUFFLE */
 #include <emmintrin.h> /* for SSE2 intrinsics */
+#if USE_SSE41
+#include <smmintrin.h> /* for SSE4.1 intrinsics */
+#endif
 
 #include "pixman-private.h"
 #include "pixman-combine32.h"
 #include "pixman-inlines.h"
@@ -53,6 +57,9 @@ static __m128i mask_blue;
 static __m128i mask_565_fix_rb;
 static __m128i mask_565_fix_g;
 
+static __m128i mask_565_rb;
+static __m128i mask_565_pack_multiplier;
+
 static force_inline __m128i
 unpack_32_1x128 (uint32_t data)
 {
@@ -120,7 +127,59 @@ pack_2x128_128 (__m128i lo, __m128i hi)
     return _mm_packus_epi16 (lo, hi);
 }
 
+#if USE_X86_MMX
+#define MC(x) ((__m64)mmx_ ## x)
+
+static force_inline __m64
+pack_4xpacked565 (__m64 a, __m64 b)
+{
+    static const uint64_t mmx_565_pack_multiplier = 0x2000000420000004ULL;
+    static const uint64_t mmx_packed_565_rb = 0x00f800f800f800f8ULL;
+    static const uint64_t mmx_packed_565_g = 0x0000fc000000fc00ULL;
+
+    __m64 rb0 = _mm_and_si64 (a, MC (packed_565_rb));
+    __m64 rb1 = _mm_and_si64 (b, MC (packed_565_rb));
+
+    __m64 t0 = _mm_madd_pi16 (rb0, MC (565_pack_multiplier));
+    __m64 t1 = _mm_madd_pi16 (rb1, MC (565_pack_multiplier));
+
+    __m64 g0 = _mm_and_si64 (a, MC (packed_565_g));
+    __m64 g1 = _mm_and_si64 (b, MC (packed_565_g));
+
+    t0 = _mm_or_si64 (t0, g0);
+    t1 = _mm_or_si64 (t1, g1);
+
+    t0 = _mm_srli_si64 (t0, 5);
+    t1 = _mm_slli_si64 (t1, 11);
+    return _mm_shuffle_pi16 (