Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-25 Thread Siarhei Siamashka
On Mon, Jun 25, 2012 at 7:45 PM, Matt Turner  wrote:
> On Mon, Jun 25, 2012 at 1:00 AM, Siarhei Siamashka
>  wrote:
>> OK, I got a 7-bit variant of SSE2 bilinear scaling working. It shows
>> quite a good speed boost thanks to the PMADDWD instruction, which can
>> now be used.
>
> Looking forward to seeing the patch. I'll be really interested to
> compare performance on Loongson and iwMMXt when I can switch the
> scaling functions over to multiply-add.

Sent to the list. Took a bit of time to test it on different hardware.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-25 Thread Matt Turner
On Mon, Jun 25, 2012 at 1:00 AM, Siarhei Siamashka
 wrote:
> On Mon, Jun 18, 2012 at 9:09 PM, Søren Sandmann  wrote:
>> Siarhei Siamashka  writes:
>>
>>> This is also a very useful test, but it effectively requires an
>>> alternative double-precision implementation of all the pixman
>>> functionality to be verified. For bilinear scaling it means that at
>>> least the various types of repeats need to be handled, etc. And this
>>> sounds like a lot of work.
>>
>> Yeah, I agree that it's a lot of work and that dropping to 7 bits is
>> easier in the short term.
>
> OK, I got a 7-bit variant of SSE2 bilinear scaling working. It shows
> quite a good speed boost thanks to the PMADDWD instruction, which can
> now be used.

Looking forward to seeing the patch. I'll be really interested to
compare performance on Loongson and iwMMXt when I can switch the
scaling functions over to multiply-add.


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-24 Thread Siarhei Siamashka
On Mon, Jun 18, 2012 at 9:09 PM, Søren Sandmann  wrote:
> Siarhei Siamashka  writes:
>
>> This is also a very useful test, but it effectively requires an
>> alternative double-precision implementation of all the pixman
>> functionality to be verified. For bilinear scaling it means that at
>> least the various types of repeats need to be handled, etc. And this
>> sounds like a lot of work.
>
> Yeah, I agree that it's a lot of work and that dropping to 7 bits is
> easier in the short term.

OK, I got a 7-bit variant of SSE2 bilinear scaling working. It shows
quite a good speed boost thanks to the PMADDWD instruction, which can
now be used. (With 7-bit weights the horizontal intermediates stay
within 255 * 128 = 32640, so they fit the signed 16-bit lanes that
PMADDWD multiplies; with the old 8-bit weights they could reach
255 * 256 and did not fit.)

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-18 Thread Søren Sandmann
Siarhei Siamashka  writes:

> This is also a very useful test, but it effectively requires an
> alternative double-precision implementation of all the pixman
> functionality to be verified. For bilinear scaling it means that at
> least the various types of repeats need to be handled, etc. And this
> sounds like a lot of work.

Yeah, I agree that it's a lot of work and that dropping to 7 bits is
easier in the short term.

>
> There are also some alternative variants. For example, allow a custom
> prefix for public symbols in pixman (so that several pixman instances
> can be loaded into the test application at the same time). Or even
> update the existing pixman tests to add xlib support and compare the
> locally rendered results with xrender. The latter seems particularly
> useful, because it could also be used for xrender implementation
> validation in various hardware accelerated drivers (and
> complement/retire rendercheck).

Yet another variant is to get the single precision floating point
pipeline working instead of the current 16-bit one, and then use it as
the reference implementation.


Søren


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-17 Thread Siarhei Siamashka
On Sun, Jun 17, 2012 at 8:27 AM, Bill Spitzak  wrote:
> On 06/16/2012 07:08 AM, Siarhei Siamashka wrote:
>
>>> An alternative idea is that, instead of changing the algorithm across
>>> the board, we could stop requiring bit-exact results. The main piece of
>>> work here is to change the test suite so that it will accept pixels up
>>> to some maximum relative error. There is already some support for this:
>>> the 'composite' test uses 'pixel_checker_t' to compare the pixman
>>> output with perfect pixels computed in double precision.
>>>
>>> This latter idea is ultimately more useful because it will allow much
>>> more flexibility in the kinds of SIMD instruction sets we can support.
>>
>>
>> This is also a very useful test, but it effectively requires an
>> alternative double-precision implementation of all the pixman
>> functionality to be verified.
>
>
> I don't understand this.

The 'composite' test alone has limited utility. It checks the
correctness of composite operations performed on just a single
pixel. But in order to provide better coverage for the functionality
used by real applications, we must also test different image sizes
(the inner loops of composite functions are unrolled, and bugs may
potentially be introduced both in the main loop body and in the
handling of leading/trailing pixels). Additionally, when skipping
fully transparent pixels, SIMD optimized code skips whole groups of
them at once, etc. There are lots of corner cases which need to be
checked.
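
To make that concrete, here is a schematic of the head/body/tail
structure such SSE2 combiners use (illustrative only; the real
sse2_combine_add_u also handles a mask). A bug can hide in any of the
three parts, while a single-pixel test only ever exercises the head:

#include <emmintrin.h>
#include <stdint.h>

static uint32_t
add_un_8888 (uint32_t a, uint32_t b) /* per-byte saturating add */
{
    uint32_t r = 0;
    int i;
    for (i = 0; i < 32; i += 8)
    {
        uint32_t t = ((a >> i) & 0xff) + ((b >> i) & 0xff);
        r |= (t > 0xff ? 0xff : t) << i;
    }
    return r;
}

static void
combine_add_u_sketch (uint32_t *pd, const uint32_t *ps, int w)
{
    /* head: single pixels until the destination is 16-byte aligned */
    while (w && ((uintptr_t) pd & 15))
    {
        *pd = add_un_8888 (*pd, *ps);
        pd++, ps++, w--;
    }

    /* body: four pixels (16 bytes) per unrolled iteration */
    while (w >= 4)
    {
        __m128i s = _mm_loadu_si128 ((const __m128i *) ps);
        __m128i d = _mm_load_si128 ((__m128i *) pd);
        _mm_store_si128 ((__m128i *) pd, _mm_adds_epu8 (s, d));
        pd += 4, ps += 4, w -= 4;
    }

    /* tail: up to three leftover pixels */
    while (w--)
    {
        *pd = add_un_8888 (*pd, *ps);
        pd++, ps++;
    }
}
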
It's easiest to demonstrate with an example. Let's add a
bug to the 'sse2_combine_add_u' function:

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index 70f8b77..fbea4f6 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -1348,7 +1348,7 @@ sse2_combine_add_u (pixman_implementation_t *imp,
 {
__m128i s;

-   s = combine4 ((__m128i*)ps, (__m128i*)pm);
+   s = _mm_setzero_si128 ();

save_128_aligned (
(__m128i*)pd, _mm_adds_epu8 (s, load_128_aligned  ((__m128i*)pd)));

The patch above just introduces a bug into the code of the "while (w >=
4)" loop. Let's see how it is handled by the pixman test suite:

PASS: a1-trap-test
PASS: pdf-op-test
PASS: region-test
PASS: region-translate-test
PASS: fetch-test
PASS: oob-test
PASS: trap-crasher
PASS: alpha-loop
PASS: scaling-crash-test
PASS: scaling-helpers-test
PASS: gradient-crash-test
region_contains test passed (checksum=D2BF8C73)
PASS: region-contains-test

Wrong alpha value at (0, 0). Should be 0xff; got 0xf7. Source was
0x65, original dest was 0xf7
src: a8r8g8b8, alpha: none, origin 0 0
dst: a8r8g8b8, alpha: none, origin: 0 0

FAIL: alphamap
PASS: stress-test
composite traps test failed! (checksum=BE93DA05, expected E3112106)
FAIL: composite-traps-test
blitters test failed! (checksum=C8682A01, expected A364B5BF)
FAIL: blitters-test
glyph test failed! (checksum=B1B638A1, expected 1B7696A2)
FAIL: glyph-test
scaling test failed! (checksum=64788A7E, expected 80DF1CB2)
FAIL: scaling-test
affine test passed (checksum=1EF2175A)
PASS: affine-test
PASS: composite
=
5 of 20 tests failed

As expected, the 'composite' test did not detect anything wrong. Now
let's break the same 'sse2_combine_add_u' function completely by
inserting a "return" at its very beginning:

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index 70f8b77..25c7aa0 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -1331,6 +1331,8 @@ sse2_combine_add_u (pixman_implementation_t *imp,
 const uint32_t* ps = src;
 const uint32_t* pm = mask;

+return;
+
 while (w && (unsigned long)pd & 15)
 {
s = combine1 (ps, pm);

Now the 'composite' test can see that there is a problem:

PASS: a1-trap-test
PASS: pdf-op-test
PASS: region-test
PASS: region-translate-test
PASS: fetch-test
PASS: oob-test
PASS: trap-crasher
PASS: alpha-loop
PASS: scaling-crash-test
PASS: scaling-helpers-test
PASS: gradient-crash-test
region_contains test passed (checksum=D2BF8C73)
PASS: region-contains-test

Wrong alpha value at (0, 0). Should be 0xff; got 0xf7. Source was
0x65, original dest was 0xf7
src: a8r8g8b8, alpha: none, origin 0 0
dst: a8r8g8b8, alpha: none, origin: 0 0

FAIL: alphamap
PASS: stress-test
composite traps test failed! (checksum=4B0E22E6, expected E3112106)
FAIL: composite-traps-test
blitters test failed! (checksum=E95FFC20, expected A364B5BF)
FAIL: blitters-test
glyph test failed! (checksum=FDF0BD54, expected 1B7696A2)
FAIL: glyph-test
scaling test failed! (checksum=55981EC2, expected 80DF1CB2)
FAIL: scaling-test
affine test passed (checksum=1EF2175A)
PASS: affine-test
==== Test 3145752 failed ====
Operator:      ADD
Source:        r3g3b2, 1x1
Destination:   x4r4g4b4, 1x1

==== Test 4194328 failed ====
Operator:      ADD
Source:        a1r1g1b1, 1x1
Destination:   a2r2g2b2, 1x1

               R     G     B     A     Rounded
Source color:  1.000 1.000 1.000 0.000 1.000 1.000 1.000 0.00

Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-16 Thread Bill Spitzak

On 06/16/2012 07:08 AM, Siarhei Siamashka wrote:


>> An alternative idea is that, instead of changing the algorithm across
>> the board, we could stop requiring bit-exact results. The main piece of
>> work here is to change the test suite so that it will accept pixels up
>> to some maximum relative error. There is already some support for this:
>> the 'composite' test uses 'pixel_checker_t' to compare the pixman
>> output with perfect pixels computed in double precision.
>>
>> This latter idea is ultimately more useful because it will allow much
>> more flexibility in the kinds of SIMD instruction sets we can support.
>
> This is also a very useful test, but it effectively requires an
> alternative double-precision implementation of all the pixman
> functionality to be verified.


I don't understand this.

The current tests are checking for equality with an image. The
approximate tests would just check for approximate equality with the
same image. I fail to see why the image has to somehow be "more correct".



Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-16 Thread Siarhei Siamashka
On Fri, Jun 15, 2012 at 10:51 PM, Søren Sandmann  wrote:
> Matt Turner  writes:
>> Also, are we planning to change the bilinear scaling algorithm for
>> 0.28 so that we can use pmaddubsw?
>
> I wouldn't object to a patch that dropped precision to 7 bits for all
> bilinear code, but it would require changes at least to the general
> code, the fast path code, the NEON code and the SSE2 code.

This is really a trivial change. The only difficulty is to enable and
test it on all the supported platforms simultaneously. Using qemu and
the --enable-static-testprogs option makes it possible to run the basic
tests even without having all the hardware. Though MIPS DSP ASE support
is only now being added to qemu:
http://lists.gnu.org/archive/html/qemu-devel/2012-03/msg04990.html

> An alternative idea is that, instead of changing the algorithm across
> the board, we could stop requiring bit-exact results. The main piece of
> work here is to change the test suite so that it will accept pixels up
> to some maximum relative error. There is already some support for this:
> the 'composite' test uses 'pixel_checker_t' to compare the pixman
> output with perfect pixels computed in double precision.
>
> This latter idea is ultimately more useful because it will allow much
> more flexibility in the kinds of SIMD instruction sets we can support.

This is also a very useful test, but it effectively requires an
alternative double-precision implementation of all the pixman
functionality to be verified. For bilinear scaling it means that at
least the various types of repeats need to be handled, etc. And this
sounds like a lot of work.

There are also some alternative variants. For example, allow a custom
prefix for public symbols in pixman (so that several pixman instances
can be loaded into the test application at the same time). Or even
update the existing pixman tests to add xlib support and compare the
locally rendered results with xrender. The latter seems particularly
useful, because it could also be used for xrender implementation
validation in various hardware accelerated drivers (and
complement/retire rendercheck).

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-15 Thread Søren Sandmann
Matt Turner  writes:

> The registers -- yes. The 8-byte aligned loads and stores I'm not
> sure. Can you do 8-byte aligned loads and stores to/from SSE
> registers?

I believe movq can use SSE registers.
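
In intrinsics terms (a sketch, not pixman code), such 8-byte accesses
map onto the movq forms that SSE2 already provides:

#include <emmintrin.h>
#include <stdint.h>

/* movq to/from an XMM register: _mm_loadl_epi64 loads 64 bits into
 * the low half of a register (zeroing the high half), and
 * _mm_storel_epi64 writes the low 64 bits back out. Neither requires
 * 16-byte alignment. */
static void
copy_2_pixels (uint32_t *dst, const uint32_t *src)
{
    _mm_storel_epi64 ((__m128i *) dst,
                      _mm_loadl_epi64 ((const __m128i *) src));
}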

> Indeed, runtime generation would be great. Something like LLVM or orc
> would be interesting options. I'm not sure I'm up to that kind of
> project yet/now though.
>
> I think adding pixman-sse*.c files is a reasonable measure for now.
> Think it's okay to split the static inline support functions from
> pixman-sse2.c out into a header to be shared with the other
> pixman-sse*.c files?

Sounds reasonable to me.

> Also, are we planning to change the bilinear scaling algorithm for
> 0.28 so that we can use pmaddubsw?

I wouldn't object to a patch that dropped precision to 7 bits for all
bilinear code, but it would require changes at least to the general
code, the fast path code, the NEON code and the SSE2 code.

An alternative idea is that, instead of changing the algorithm across
the board, we could stop requiring bit-exact results. The main piece of
work here is to change the test suite so that it will accept pixels up
to some maximum relative error. There is already some support for this:
the 'composite' test uses 'pixel_checker_t' to compare the pixman
output with perfect pixels computed in double precision.

This latter idea is ultimately more useful because it will allow much
more flexibility in the kinds of SIMD instruction sets we can support.


Søren


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-06-14 Thread Matt Turner
Sorry it's taken so long to get back to this.

On Wed, May 9, 2012 at 12:57 PM, Søren Sandmann  wrote:
> Matt Turner  writes:
> I still think MMX has no use on modern systems. The SSE2 implementation
> used to have such MMX loops, but they were removed in
> f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
> compilers that would miscompile the emms instruction.
>
> Can't the MMX loop you added be done with SSE registers and instructions
> as well?

The registers -- yes. The 8-byte aligned loads and stores I'm not
sure. Can you do 8-byte aligned loads and stores to/from SSE
registers?

>> Porting the pmadd algorithm to SSE4.1 gave another (very large)
>> improvement.
>>
>> fast: src_8888_0565 = L1: 655.18  L2: 675.94  M:642.31  ( 23.44%) HT:403.00  
>> VT:286.45  R:307.61  RT:150.59 (1675Kops/s)
>> mmx:  src_8888_0565 = L1:2050.45  L2:1988.97  M:1586.16 ( 57.34%) HT:529.12  
>> VT:374.28  R:412.09  RT:177.35 (1913Kops/s)
>> sse2: src_8888_0565 = L1:1518.61  L2:1493.10  M:1279.18 ( 46.24%) HT:433.65  
>> VT:314.48  R:349.14  RT:151.84 (1685Kops/s)
>> sse2mmx:src_8888_0565 = L1:1544.91  L2:1520.83  M:1307.79 ( 47.01%) 
>> HT:447.82  VT:326.81  R:379.60  RT:174.07 (1878Kops/s)
>> sse4: src_8888_0565 = L1:4654.11  L2:4202.98  M:1885.01 ( 69.35%) HT:540.65  
>> VT:421.04  R:427.73  RT:161.45 (1773Kops/s)
>> sse4mmx:src_8888_0565 = L1:4786.27  L2:4255.13  M:1920.18 ( 69.93%)
>> HT:581.42  VT:447.99  R:482.27  RT:193.15 (2049Kops/s)
>>
>> I'd like to isolate exactly what performance improvement the only
>> SSE4.1 instruction used (i.e., _mm_packus_epi32) gives before declaring
>> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
>> to pack the two xmm registers together in pack_565_2packedx128_128,
>> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.
>
> Would it work to subtract 0x8000, then use packssdw, then add 0x8000?

I couldn't make that work, but I asked on StackOverflow and got a nice
solution: 
http://stackoverflow.com/questions/11024652/simulating-packusdw-functionality-with-sse2

>> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
>> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
>> compile-time option (or recompile the whole file to get a few
>> improvements from SSE4.1).
>>
>> It seems like we need a generic solution that would say for each
>> compositing function
>>       - this is what you do for 1-byte;
>>       - this is what you do for 8-bytes if you have MMX;
>>       - this is what you do for 16-bytes if you have SSE2;
>>       - this is what you do for 16-bytes if you have SSE3;
>>       - this is what you do for 16-bytes if you have SSE4.1.
>> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>>
>> Does this seem like a reasonable approach? *How* to do it -- suggestions
>> welcome.
>
> I think ideally we would generate this code at runtime. It's just not
> feasible to generate code for all combinations of instruction sets at
> build time and libpixman.so is already rather large. Generating the code
> at runtime has the additional advantages that it is not limited to a
> fixed set of fast paths and that it can make use of more details of the
> operation such as the precise alignment for palignr generation.
>
> There are various ways to go about this, ranging from simple-minded
> stitching-together of pre-written snippets to a full shader compiler. A
> full shader compiler is obviously a big project, but maybe a simple
> stich-together kind of thing wouldn't actually be that hard using
> something like this:
>
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
> http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph
>
> That said, runtime code-generation is still a big project, and it does
> make sense to make use of some of the newer instruction sets.
>
> We do have support for fallbacks, so as Makoto-san says, just adding new
> pixman-ssse3.c and pixman-sse41.c files with duplicated code for the
> particular operations that benefit from ssse3 and sse4.1 might be the
> simplest way to proceed.

Indeed, runtime generation would be great. Something like LLVM or orc
would be interesting options. I'm not sure I'm up to that kind of
project yet/now though.

I think adding pixman-sse*.c files is a reasonable measure for now.
Think it's okay to split the static inline support functions from
pixman-sse2.c out into a header to be shared with the other
pixman-sse*.c files?

Also, are we planning to change the bilinear scaling algorithm for
0.28 so that we can use pmaddubsw?

Thanks,
Matt


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-05-13 Thread Siarhei Siamashka
On Wed, May 9, 2012 at 7:57 PM, Søren Sandmann  wrote:
> Matt Turner  writes:
>
>> I started porting my src_8888_0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>>       SSE3:   lddqu - for unaligned loads across cache lines
>
> I don't really understand that instruction. Isn't it identical to
> movdqu?  Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?
>
>>       SSSE3:  palignr - for unaligned loads (but requires software
>>                         pipelining...)
>>               pmaddubsw - maybe?
>
> pmaddubsw would be very useful for bilinear interpolation if we drop
> coordinate precision to 7 bits instead of the current 8. One example way
> to use it is to put 7-bit values of (1 - x, x, 1 - x, x) in a register,
> and interleave the top-left and top-right pixels in another. pmaddubsw on
> those two registers will then produce a linear interpolation between the
> two top pixels. A similar thing can be done for the bottom pixels, and
> then the intermediate results can be interleaved and combined using
> pmaddwd.

I would say that improving bilinear scaling performance on x86 is
really important for pixman in order to remain competitive. The
following link might be a good source of inspiration:
http://www.hackermusings.com/2012/05/firefoxs-graphics-performance-on-x11/

The comments with the azure backend performance numbers are
particularly interesting. For example, one of them mentions 12fps with
xrender disabled (using pixman?) vs. 15fps with azure canvas enabled
(using skia?) for FishIETank. Needless to say, it would be nice to
improve pixman performance by 30% or more.

And here are some benchmarks for the firefox-fishtank trace with
pixman-0.25.2, comparing NEON vs. SSE2 on ARM Cortex-A8 and Intel
Atom (both are superscalar dual-issue in-order cores):

=== ARM Cortex-A8 @1GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp"
[  0]image firefox-fishtank  359.228  359.436   0.43%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mthumb"
[  0]image firefox-fishtank  347.195  347.773   0.12%    3/3


=== Intel Atom N450 @1.67GHz ===

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom"
[  0]image firefox-fishtank  308.439  308.881   0.09%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -g -march=atom -mtune=atom -m32"
[  0]image firefox-fishtank  309.457  309.568   0.07%    3/3

CC=gcc-4.5.3 CFLAGS="-O2"
[  0]image firefox-fishtank  345.906  346.156   0.04%    3/3

CC=gcc-4.5.3 CFLAGS="-O2 -mtune=generic"
[  0]image firefox-fishtank  345.367  345.900   0.09%    3/3

The results for gcc-4.7.0 were nearly the same. Currently 1GHz ARM
Cortex-A8 is almost as fast as 1.67GHz Atom. ARM NEON bilinear code is
using 8-bit multiplications. Atom could use PMADDUBSW to also benefit
from 8-bit multiplications and improve performance per MHz.

-- 
Best regards,
Siarhei Siamashka


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-05-09 Thread Jeff Muizelaar

On 2012-05-09, at 12:57 PM, Søren Sandmann wrote:

> Matt Turner  writes:
> 
>> I started porting my src_8888_0565 MMX function to SSE2, and in the
>> process started thinking about using SSE3+. The useful instructions
>> added post SSE2 that I see are
>>  SSE3:   lddqu - for unaligned loads across cache lines
> 
> I don't really understand that instruction. Isn't it identical to
> movdqu?  Or is the idea that lddqu is faster than movdqu for cache line
> splits, but slower for plain old, non-cache split unaligned loads?

"The instructions movdqu, movups, movupd and lddqu are all able to read 
unaligned vectors. lddqu is faster than the alternatives on P4E and PM 
processors, but requires the SSE3 instruction set. The unaligned read 
instructions are relatively slow on older processors, but faster on Nehalem, 
Sandy Bridge and on future AMD and Intel processors."

From http://www.agner.org/optimize/optimizing_assembly.pdf
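
(In intrinsics terms the swap is a single call; a sketch, assuming
SSE3 is available at compile time:)

#include <pmmintrin.h> /* SSE3 */

/* _mm_lddqu_si128 (lddqu) is an unaligned 16-byte load that, per the
 * quote above, beats _mm_loadu_si128 (movdqu) on P4E/PM for loads
 * that split a cache line, and matches it on newer cores. */
static __m128i
load_16_unaligned (const void *p)
{
    return _mm_lddqu_si128 ((const __m128i *) p);
}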

-Jeff


Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-05-09 Thread Søren Sandmann
Matt Turner  writes:

> I started porting my src_8888_0565 MMX function to SSE2, and in the
> process started thinking about using SSE3+. The useful instructions
> added post SSE2 that I see are
>   SSE3:   lddqu - for unaligned loads across cache lines

I don't really understand that instruction. Isn't it identical to
movdqu?  Or is the idea that lddqu is faster than movdqu for cache line
splits, but slower for plain old, non-cache split unaligned loads?

>   SSSE3:  palignr - for unaligned loads (but requires software
> pipelining...)
>   pmaddubsw - maybe?

pmaddubsw would be very useful for bilinear interpolation if we drop
coordinate precision to 7 bits instead of the current 8. One example way
to use it is to put 7-bit values of (1 - x, x, 1 - x, x) in a register,
and interleave the top-left and top-right pixels in another. pmaddubsw on
those two registers will then produce a linear interpolation between the
two top pixels. A similar thing can be done for the bottom pixels, and
then the intermediate results can be interleaved and combined using
pmaddwd.
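
A minimal single-pixel sketch of that scheme for a8r8g8b8 pixels
(illustrative only, not pixman code; it assumes 7-bit weights strictly
between 0 and 128, since the endpoint weight 128 does not fit in the
signed bytes pmaddubsw consumes and would need special-casing):

#include <stdint.h>
#include <tmmintrin.h> /* SSSE3: _mm_maddubs_epi16 */

static uint32_t
bilinear_pixel_sketch (uint32_t tl, uint32_t tr, /* top pixels */
                       uint32_t bl, uint32_t br, /* bottom pixels */
                       int wx, int wy)           /* weights in (0, 128) */
{
    /* interleave left/right pixels byte-wise: B B' G G' R R' A A' */
    __m128i top = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) tl),
                                     _mm_cvtsi32_si128 ((int) tr));
    __m128i bot = _mm_unpacklo_epi8 (_mm_cvtsi32_si128 ((int) bl),
                                     _mm_cvtsi32_si128 ((int) br));
    /* (128 - wx, wx) byte pairs; products stay <= 255 * 128 = 32640 */
    __m128i xw = _mm_set1_epi16 ((short) ((wx << 8) | (128 - wx)));

    /* pmaddubsw: per channel, (128 - wx) * left + wx * right */
    __m128i t = _mm_maddubs_epi16 (top, xw);
    __m128i b = _mm_maddubs_epi16 (bot, xw);

    /* interleave the intermediate top/bottom results and combine
     * vertically with pmaddwd: (128 - wy) * top + wy * bottom */
    __m128i tb = _mm_unpacklo_epi16 (t, b);
    __m128i yw = _mm_set1_epi32 ((wy << 16) | (128 - wy));
    __m128i r  = _mm_madd_epi16 (tb, yw);

    /* drop the 7 + 7 fractional bits and repack to 8-bit channels */
    r = _mm_srli_epi32 (r, 14);
    r = _mm_packs_epi32 (r, r);
    r = _mm_packus_epi16 (r, r);
    return (uint32_t) _mm_cvtsi128_si32 (r);
}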

>   SSE4.1: pextr*, pinsr*
>           pcmpeqq, ptest
>           packusdw - for 888 -> 565 packing
>
> I first wrote a basic src_8888_0565 for SSE2 and discovered that the
> performance was worse than MMX (which we've been saying has no use in
> modern systems -- oops!). I figured the cool pmadd algorithm of MMX was
> the cause, but I wondered if 16-byte SSE chunks are too large
> sometimes.
>
> I added an 8-byte MMX loop before and after the main 16-byte SSE loop
> and got a nice improvement.
>

I still think MMX has no use on modern systems. The SSE2 implementation
used to have such MMX loops, but they were removed in
f88ae14c15040345a12ff0488c7b23d25639e49b because there were issues with
compilers that would miscompile the emms instruction.

Can't the MMX loop you added be done with SSE registers and instructions
as well?

> Porting the pmadd algorithm to SSE4.1 gave another (very large)
> improvement.
>
> fast: src_8888_0565 = L1: 655.18  L2: 675.94  M:642.31  ( 23.44%) HT:403.00  
> VT:286.45  R:307.61  RT:150.59 (1675Kops/s)
> mmx:  src_8888_0565 = L1:2050.45  L2:1988.97  M:1586.16 ( 57.34%) HT:529.12  
> VT:374.28  R:412.09  RT:177.35 (1913Kops/s)
> sse2: src_8888_0565 = L1:1518.61  L2:1493.10  M:1279.18 ( 46.24%) HT:433.65  
> VT:314.48  R:349.14  RT:151.84 (1685Kops/s)
> sse2mmx:src_8888_0565 = L1:1544.91  L2:1520.83  M:1307.79 ( 47.01%) HT:447.82 
>  VT:326.81  R:379.60  RT:174.07 (1878Kops/s)
> sse4: src_8888_0565 = L1:4654.11  L2:4202.98  M:1885.01 ( 69.35%) HT:540.65  
> VT:421.04  R:427.73  RT:161.45 (1773Kops/s)
> sse4mmx:src_8888_0565 = L1:4786.27  L2:4255.13  M:1920.18 ( 69.93%)
> HT:581.42  VT:447.99  R:482.27  RT:193.15 (2049Kops/s)
>
> I'd like to isolate exactly what performance improvement the only
> SSE4.1 instruction used (i.e., _mm_packus_epi32) gives before declaring
> SSE4.1 a fantastic improvement. If you can come up with a reasonable way
> to pack the two xmm registers together in pack_565_2packedx128_128,
> please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.

Would it work to subtract 0x8000, then use packssdw, then add 0x8000?
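
For reference, a minimal sketch of that bias trick (not pixman code;
it assumes the 32-bit inputs cannot be so negative that subtracting
the bias wraps, which holds for intermediate 565 packing results):

#include <emmintrin.h>

/* Emulate SSE4.1 _mm_packus_epi32 (packusdw) with SSE2: bias the
 * inputs into signed 16-bit range, pack with signed saturation,
 * then undo the bias in 16-bit arithmetic. */
static __m128i
packus_epi32_sse2 (__m128i a, __m128i b)
{
    const __m128i bias32 = _mm_set1_epi32 (0x8000);
    const __m128i bias16 = _mm_set1_epi16 ((short) 0x8000);

    a = _mm_sub_epi32 (a, bias32); /* [0, 65535] -> [-32768, 32767] */
    b = _mm_sub_epi32 (b, bias32);
    /* packssdw saturates to [-32768, 32767], i.e. to [0, 65535]
     * once the bias is added back */
    return _mm_add_epi16 (_mm_packs_epi32 (a, b), bias16);
}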

> I'd rather not duplicate a bunch of code from pixman-mmx.c, and I'd
> rather not add #ifdef USE_SSE41 to pixman-sse2.c and make it a
> compile-time option (or recompile the whole file to get a few
> improvements from SSE4.1).
>
> It seems like we need a generic solution that would say for each
> compositing function
>   - this is what you do for 1-byte;
>   - this is what you do for 8-bytes if you have MMX;
>   - this is what you do for 16-bytes if you have SSE2;
>   - this is what you do for 16-bytes if you have SSE3;
>   - this is what you do for 16-bytes if you have SSE4.1.
> and then construct the functions for generic/MMX/SSE2/SSE4 at build time.
>
> Does this seem like a reasonable approach? *How* to do it -- suggestions
> welcome.

I think ideally we would generate this code at runtime. It's just not
feasible to generate code for all combinations of instruction sets at
build time and libpixman.so is already rather large. Generating the code
at runtime has the additional advantages that it is not limited to a
fixed set of fast paths and that it can make use of more details of the
operation such as the precise alignment for palignr generation.

There are various ways to go about this, ranging from simple-minded
stitching-together of pre-written snippets to a full shader compiler. A
full shader compiler is obviously a big project, but maybe a simple
stitch-together kind of thing wouldn't actually be that hard using
something like this:

http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.h?h=graph
http://cgit.freedesktop.org/~sandmann/genrender/tree/codex86.c?h=graph

That said, runtime code-generation is still a big project, and it does
make sense to make use of some of the newer instruction sets.

We do have support for fallbacks, so as Makoto-san says, just adding new
pixman-ssse3.c and pixman-sse41.c files with duplicated code for the
particular operations that benefit from ssse3 and sse4.1 might be the
simplest way to proceed.

Re: [Pixman] [PATCH] sse2: Using MMX and SSE 4.1

2012-05-08 Thread Makoto Kato

Hi, Matt.

The Win64 MSVC target doesn't support MMX intrinsics. If you add MMX
code to pixman-sse2.c, please guard it all with USE_X86_MMX macro checks.


And when using MMX, you have to call _mm_empty() after the MMX code is finished.
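
A schematic of that pattern (illustrative, not actual pixman code):

#include <stdint.h>
#if USE_X86_MMX
#include <mmintrin.h>
#endif

/* Compile the MMX path only when USE_X86_MMX is set (Win64 MSVC has
 * no MMX intrinsics), and issue emms via _mm_empty () before any x87
 * floating point code can run again. */
static void
add_bytes_saturating (uint8_t *dst, const uint8_t *src, int n)
{
    int i = 0;
#if USE_X86_MMX
    for (; i + 8 <= n; i += 8)
    {
        __m64 s = *(const __m64 *) (src + i);
        __m64 d = *(const __m64 *) (dst + i);
        *(__m64 *) (dst + i) = _mm_adds_pu8 (s, d);
    }
    _mm_empty (); /* clear MMX state before leaving */
#endif
    for (; i < n; i++)
    {
        unsigned t = dst[i] + src[i];
        dst[i] = t > 255 ? 255 : (uint8_t) t;
    }
}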

I think that you should split the SSE4.1 code into another file
(pixman-sse41.c?). You know, gcc needs the -msse4.1 option for it.



-- Makoto

(2012/05/03 12:42), Matt Turner wrote:

I started porting my src_8888_0565 MMX function to SSE2, and in the
process started thinking about using SSE3+. The useful instructions
added post SSE2 that I see are
SSE3:   lddqu - for unaligned loads across cache lines
SSSE3:  palignr - for unaligned loads (but requires software
  pipelining...)
pmaddubsw - maybe?
SSE4.1: pextr*, pinsr*
pcmpeqq, ptest
packusdw - for 888 ->  565 packing

I first wrote a basic src_8888_0565 for SSE2 and discovered that the
performance was worse than MMX (which we've been saying has no use in
modern systems -- oops!). I figured the cool pmadd algorithm of MMX was
the cause, but I wondered if 16-byte SSE chunks are too large sometimes.

I added an 8-byte MMX loop before and after the main 16-byte SSE loop
and got a nice improvement.

Porting the pmadd algorithm to SSE4.1 gave another (very large)
improvement.

fast:   src_8888_0565 = L1: 655.18  L2: 675.94  M:642.31  ( 23.44%) HT:403.00  
VT:286.45  R:307.61  RT:150.59 (1675Kops/s)
mmx:    src_8888_0565 = L1:2050.45  L2:1988.97  M:1586.16 ( 57.34%) HT:529.12  
VT:374.28  R:412.09  RT:177.35 (1913Kops/s)
sse2:   src_8888_0565 = L1:1518.61  L2:1493.10  M:1279.18 ( 46.24%) HT:433.65  
VT:314.48  R:349.14  RT:151.84 (1685Kops/s)
sse2mmx:src_8888_0565 = L1:1544.91  L2:1520.83  M:1307.79 ( 47.01%) HT:447.82  
VT:326.81  R:379.60  RT:174.07 (1878Kops/s)
sse4:   src_8888_0565 = L1:4654.11  L2:4202.98  M:1885.01 ( 69.35%) HT:540.65  
VT:421.04  R:427.73  RT:161.45 (1773Kops/s)
sse4mmx:src_8888_0565 = L1:4786.27  L2:4255.13  M:1920.18 ( 69.93%) HT:581.42  
VT:447.99  R:482.27  RT:193.15 (2049Kops/s)

I'd like to isolate exactly what performance improvement the only
SSE4.1 instruction used (i.e., _mm_packus_epi32) gives before declaring
SSE4.1 a fantastic improvement. If you can come up with a reasonable way
to pack the two xmm registers together in pack_565_2packedx128_128,
please tell me. Shuffle only works on hi/lo 8-bytes, so it'd be a pain.

This got me wondering how to proceed. I'd rather not duplicate a bunch
of code from pixman-mmx.c, and I'd rather not add #ifdef USE_SSE41 to
pixman-sse2.c and make it a compile-time option (or recompile the whole
file to get a few improvements from SSE4.1).

It seems like we need a generic solution that would say for each
compositing function
- this is what you do for 1-byte;
- this is what you do for 8-bytes if you have MMX;
- this is what you do for 16-bytes if you have SSE2;
- this is what you do for 16-bytes if you have SSE3;
- this is what you do for 16-bytes if you have SSE4.1.
and then construct the functions for generic/MMX/SSE2/SSE4 at build time.

Does this seem like a reasonable approach? *How* to do it -- suggestions
welcome.
---
  pixman/pixman-sse2.c |  152 ++
  1 files changed, 152 insertions(+), 0 deletions(-)

diff --git a/pixman/pixman-sse2.c b/pixman/pixman-sse2.c
index e217ca3..763c7b3 100644
--- a/pixman/pixman-sse2.c
+++ b/pixman/pixman-sse2.c
@@ -30,8 +30,12 @@
  #include <config.h>
  #endif

+#include <mmintrin.h>
  #include <xmmintrin.h> /* for _mm_shuffle_pi16 and _MM_SHUFFLE */
  #include <emmintrin.h> /* for SSE2 intrinsics */
+#if USE_SSE41
+#include <smmintrin.h>
+#endif
  #include "pixman-private.h"
  #include "pixman-combine32.h"
  #include "pixman-inlines.h"
@@ -53,6 +57,9 @@ static __m128i mask_blue;
  static __m128i mask_565_fix_rb;
  static __m128i mask_565_fix_g;

+static __m128i mask_565_rb;
+static __m128i mask_565_pack_multiplier;
+
  static force_inline __m128i
  unpack_32_1x128 (uint32_t data)
  {
@@ -120,7 +127,59 @@ pack_2x128_128 (__m128i lo, __m128i hi)
  return _mm_packus_epi16 (lo, hi);
  }

+#if USE_X86_MMX
+#define MC(x) ((__m64)mmx_ ## x)
+
+static force_inline __m64
+pack_4xpacked565 (__m64 a, __m64 b)
+{
+static const uint64_t mmx_565_pack_multiplier = 0x2000000420000004ULL;
+static const uint64_t mmx_packed_565_rb = 0x00f800f800f800f8ULL;
+static const uint64_t mmx_packed_565_g = 0x0000fc000000fc00ULL;
+
+__m64 rb0 = _mm_and_si64 (a, MC (packed_565_rb));
+__m64 rb1 = _mm_and_si64 (b, MC (packed_565_rb));
+
+__m64 t0 = _mm_madd_pi16 (rb0, MC (565_pack_multiplier));
+__m64 t1 = _mm_madd_pi16 (rb1, MC (565_pack_multiplier));
+
+__m64 g0 = _mm_and_si64 (a, MC (packed_565_g));
+__m64 g1 = _mm_and_si64 (b, MC (packed_565_g));
+
+t0 = _mm_or_si64 (t0, g0);
+t1 = _mm_or_si64 (t1, g1);
+
+t0 = _mm_srli_si64 (t0, 5);
+t1 = _mm_slli_si64 (t1, 11);
+return _mm_shuffle_pi16 (
+_mm_or_si64 (t0, t1), _MM_SHUFFLE (3, 1, 2, 0));
+}
+#endif