Re: [FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)

2017-08-07 Thread Martin Vignali
>
> 64 runs seems way too few runs for reliable result.
> Please try to run ffmpeg encoding for about a minute.
>
> About the patch:
>
> > +%include "libavutil/x86/x86util.asm"
> > +
> > +SECTION .text
>
> Please include "x86inc.asm" explicitly.
>
> > +INIT_XMM sse2
> > +cglobal reorder_pixels, 3,5,3, src, dst, size
> > +
> > +shrr2,1;half_size
>
> It's better to use the labels you define in the cglobal,
> If I count correctly, "r2" would be "sizeq".
> ("srcq" and "dstq" would be the correct labels for the pointers).
>
>
> > +;calc loop count for simd reorder
> > +movr3,r2;
> > +shrr3,4;calc loop count simd
>
> Align the instruction at 4 spaces.
> Align the first operands so that the ending commas are vertically aligned.
> For the follow up operands, just add one space after the comma.
>
> Put some spaces before the comment, to separate it from the operands.
> BTW, this is not C, you don't need to end every line with ";"
> e.g.
> > +mov  r3, r2
> > +shr  r3, 4  ;calc loop count simd
>
>
> > +;jump to afterloop2 if loop count simd is 0
> > +cmpr3,0;
> > +jleafterloop2;
>
> If you only check for size==0, then
> the "shr" has already set the correct Z flag.
>
> However, if r3/size==1, you'd still read
> 16 bytes at once in the first loop.
> Aka, overread.
>
>
> > +;simd loop
> > +loop1:
> > +movdqaxmm0,[r0];load first 16 bytes
>
> Use "m0" instead.
>
>
> > +lea r4, [r0 + r2];
> > +
> > +movdquxmm1, [r4]; unaligned load
>
> r4 doesn't seem to be used for anything else.
> Just move the address calculation directly into
> the "movdqu", it can take it.
>
> > +movdqa xmm2,xmm0;copy xmm0
> > +
> > +punpcklbw xmm2,   xmm1;
> > +movdqa [r1],   xmm2;
> > +add r1, 16;
>
> use "mmsize" instead of 16.
>
> > +movdqa xmm2,   xmm0;
> > +punpckhbw xmm2,   xmm1;
> > +movdqa[r1],   xmm2;
> > +addr1,16;
>
> You can actually avoid one of the "add"
> if you use [r1+mmsize] and add 32 at the second
> "add" (or 2*mmsize).
>
> > +decr3;
> > +addr0,16;
>
> > +; test repeat
> > +cmpr3,   0;
> > +jgeloop1;
>
> First, "dec" instructions are avoided,
> because they do a partial update
> on the flag register and
> this causes interdependence.
>
> Second, you can use the flags from
> the "sub" to check if it has reached
> zero or has gone negative.
>
>
> > +afterloop2:
> > +;scalar part
> > +
> > +mov r3, r2;
> > +and r3, 15 ;half_size % 16
> > +lea r4, [r0 + r2];
> > +
> > +;initial condition loop
> > +cmpr3,0;
> > +jleend;
> > +
> > +loop2:
> > +mov r1, r0;
> > +inc r1;
> > +mov r1, r4;
> > +inc r1;
> > +inc r0;
> > +inc r4;
> > +dec r3;
> > +; test repeat
> > +cmpr3,   0;
> > +jgloop2;
>
> O_o
> This loop does not read or write to memory.
>
> You can replace the second "cmp" in this
> loop with unconditional jump to the
> "initial condition loop" compare.
> (aka move loop2: two instruction above).
>
> This is how "for(;;)" is usually implemented in C compilers.
>
> Be sure to test your function with different sizes,
> to salt your output buffers and check for underwrites, overwrites
> etc...
>
> Best Regards.
>
>
> Thanks for your comments,

i will take a look, on all of this.

Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)

2017-08-05 Thread Ivan Kalvachev
On 8/5/17, Martin Vignali  wrote:
> Hello,
>
> Based on pull request in the official openexr library :
> https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf
>
> i try to port the reorder part (used in zip and rle decompression), to asm
> Tested on OS X
> pass fate test for me
>
> I test with this command :
> ./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null -;./ffmpeg -i
> ./fate-suite/exr/rgba_slice_zip16.exr -f null - -cpuflags 0
>
> the result of the timer are :
> scalar : 960355 decicycles in reorder_pixels_zip,  64 runs,  0
> skips
> sse :   84763 decicycles in reorder_pixels_zip,  64 runs,  0
> skips
>
> Comments Welcome
>
> Martin
> Jokyo Images

64 runs seems way too few runs for reliable result.
Please try to run ffmpeg encoding for about a minute.

About the patch:

> +%include "libavutil/x86/x86util.asm"
> +
> +SECTION .text

Please include "x86inc.asm" explicitly.

> +INIT_XMM sse2
> +cglobal reorder_pixels, 3,5,3, src, dst, size
> +
> +shrr2,1;half_size

It's better to use the labels you define in the cglobal,
If I count correctly, "r2" would be "sizeq".
("srcq" and "dstq" would be the correct labels for the pointers).


> +;calc loop count for simd reorder
> +movr3,r2;
> +shrr3,4;calc loop count simd

Align the instruction at 4 spaces.
Align the first operands so that the ending commas are vertically aligned.
For the follow up operands, just add one space after the comma.

Put some spaces before the comment, to separate it from the operands.
BTW, this is not C, you don't need to end every line with ";"
e.g.
> +mov  r3, r2
> +shr  r3, 4  ;calc loop count simd


> +;jump to afterloop2 if loop count simd is 0
> +cmpr3,0;
> +jleafterloop2;

If you only check for size==0, then
the "shr" has already set the correct Z flag.

However, if r3/size==1, you'd still read
16 bytes at once in the first loop.
Aka, overread.


> +;simd loop
> +loop1:
> +movdqaxmm0,[r0];load first 16 bytes

Use "m0" instead.


> +lea r4, [r0 + r2];
> +
> +movdquxmm1, [r4]; unaligned load

r4 doesn't seem to be used for anything else.
Just move the address calculation directly into
the "movdqu", it can take it.

> +movdqa xmm2,xmm0;copy xmm0
> +
> +punpcklbw xmm2,   xmm1;
> +movdqa [r1],   xmm2;
> +add r1, 16;

use "mmsize" instead of 16.

> +movdqa xmm2,   xmm0;
> +punpckhbw xmm2,   xmm1;
> +movdqa[r1],   xmm2;
> +addr1,16;

You can actually avoid one of the "add"
if you use [r1+mmsize] and add 32 at the second
"add" (or 2*mmsize).

> +decr3;
> +addr0,16;

> +; test repeat
> +cmpr3,   0;
> +jgeloop1;

First, "dec" instructions are avoided,
because they do a partial update
on the flag register and
this causes interdependence.

Second, you can use the flags from
the "sub" to check if it has reached
zero or has gone negative.


> +afterloop2:
> +;scalar part
> +
> +mov r3, r2;
> +and r3, 15 ;half_size % 16
> +lea r4, [r0 + r2];
> +
> +;initial condition loop
> +cmpr3,0;
> +jleend;
> +
> +loop2:
> +mov r1, r0;
> +inc r1;
> +mov r1, r4;
> +inc r1;
> +inc r0;
> +inc r4;
> +dec r3;
> +; test repeat
> +cmpr3,   0;
> +jgloop2;

O_o
This loop does not read or write to memory.

You can replace the second "cmp" in this
loop with unconditional jump to the
"initial condition loop" compare.
(aka move loop2: two instruction above).

This is how "for(;;)" is usually implemented in C compilers.

Be sure to test your function with different sizes,
to salt your output buffers and check for underwrites, overwrites
etc...

Best Regards.
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)

2017-08-05 Thread Martin Vignali
Hello,

Based on pull request in the official openexr library :
https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf

i try to port the reorder part (used in zip and rle decompression), to asm
Tested on OS X
pass fate test for me

I test with this command :
./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null -;./ffmpeg -i
./fate-suite/exr/rgba_slice_zip16.exr -f null - -cpuflags 0

the result of the timer are :
scalar : 960355 decicycles in reorder_pixels_zip,  64 runs,  0 skips
sse :   84763 decicycles in reorder_pixels_zip,  64 runs,  0
skips

Comments Welcome

Martin
Jokyo Images


0001-libavcodec-exr-add-sse2-simd-for-reorder_pixels.patch
Description: Binary data
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel