Re: [FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)
> > 64 runs seems way too few runs for reliable result. > Please try to run ffmpeg encoding for about a minute. > > About the patch: > > > +%include "libavutil/x86/x86util.asm" > > + > > +SECTION .text > > Please include "x86inc.asm" explicitly. > > > +INIT_XMM sse2 > > +cglobal reorder_pixels, 3,5,3, src, dst, size > > + > > +shrr2,1;half_size > > It's better to use the labels you define in the cglobal, > If I count correctly, "r2" would be "sizeq". > ("srcq" and "dstq" would be the correct labels for the pointers). > > > > +;calc loop count for simd reorder > > +movr3,r2; > > +shrr3,4;calc loop count simd > > Align the instruction at 4 spaces. > Align the first operands so that the ending commas are vertically aligned. > For the follow up operands, just add one space after the comma. > > Put some spaces before the comment, to separate it from the operands. > BTW, this is not C, you don't need to end every line with ";" > e.g. > > +mov r3, r2 > > +shr r3, 4 ;calc loop count simd > > > > +;jump to afterloop2 if loop count simd is 0 > > +cmpr3,0; > > +jleafterloop2; > > If you only check for size==0, then > the "shr" has already set the correct Z flag. > > However, if r3/size==1, you'd still read > 16 bytes at once in the first loop. > Aka, overread. > > > > +;simd loop > > +loop1: > > +movdqaxmm0,[r0];load first 16 bytes > > Use "m0" instead. > > > > +lea r4, [r0 + r2]; > > + > > +movdquxmm1, [r4]; unaligned load > > r4 doesn't seem to be used for anything else. > Just move the address calculation directly into > the "movdqu", it can take it. > > > +movdqa xmm2,xmm0;copy xmm0 > > + > > +punpcklbw xmm2, xmm1; > > +movdqa [r1], xmm2; > > +add r1, 16; > > use "mmsize" instead of 16. > > > +movdqa xmm2, xmm0; > > +punpckhbw xmm2, xmm1; > > +movdqa[r1], xmm2; > > +addr1,16; > > You can actually avoid one of the "add" > if you use [r1+mmsize] and add 32 at the second > "add" (or 2*mmsize). > > > +decr3; > > +addr0,16; > > > +; test repeat > > +cmpr3, 0; > > +jgeloop1; > > First, "dec" instructions are avoided, > because they do a partial update > on the flag register and > this causes interdependence. > > Second, you can use the flags from > the "sub" to check if it has reached > zero or has gone negative. > > > > +afterloop2: > > +;scalar part > > + > > +mov r3, r2; > > +and r3, 15 ;half_size % 16 > > +lea r4, [r0 + r2]; > > + > > +;initial condition loop > > +cmpr3,0; > > +jleend; > > + > > +loop2: > > +mov r1, r0; > > +inc r1; > > +mov r1, r4; > > +inc r1; > > +inc r0; > > +inc r4; > > +dec r3; > > +; test repeat > > +cmpr3, 0; > > +jgloop2; > > O_o > This loop does not read or write to memory. > > You can replace the second "cmp" in this > loop with unconditional jump to the > "initial condition loop" compare. > (aka move loop2: two instruction above). > > This is how "for(;;)" is usually implemented in C compilers. > > Be sure to test your function with different sizes, > to salt your output buffers and check for underwrites, overwrites > etc... > > Best Regards. > > > Thanks for your comments, i will take a look, on all of this. Martin ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)
On 8/5/17, Martin Vignaliwrote: > Hello, > > Based on pull request in the official openexr library : > https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf > > i try to port the reorder part (used in zip and rle decompression), to asm > Tested on OS X > pass fate test for me > > I test with this command : > ./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null -;./ffmpeg -i > ./fate-suite/exr/rgba_slice_zip16.exr -f null - -cpuflags 0 > > the result of the timer are : > scalar : 960355 decicycles in reorder_pixels_zip, 64 runs, 0 > skips > sse : 84763 decicycles in reorder_pixels_zip, 64 runs, 0 > skips > > Comments Welcome > > Martin > Jokyo Images 64 runs seems way too few runs for reliable result. Please try to run ffmpeg encoding for about a minute. About the patch: > +%include "libavutil/x86/x86util.asm" > + > +SECTION .text Please include "x86inc.asm" explicitly. > +INIT_XMM sse2 > +cglobal reorder_pixels, 3,5,3, src, dst, size > + > +shrr2,1;half_size It's better to use the labels you define in the cglobal, If I count correctly, "r2" would be "sizeq". ("srcq" and "dstq" would be the correct labels for the pointers). > +;calc loop count for simd reorder > +movr3,r2; > +shrr3,4;calc loop count simd Align the instruction at 4 spaces. Align the first operands so that the ending commas are vertically aligned. For the follow up operands, just add one space after the comma. Put some spaces before the comment, to separate it from the operands. BTW, this is not C, you don't need to end every line with ";" e.g. > +mov r3, r2 > +shr r3, 4 ;calc loop count simd > +;jump to afterloop2 if loop count simd is 0 > +cmpr3,0; > +jleafterloop2; If you only check for size==0, then the "shr" has already set the correct Z flag. However, if r3/size==1, you'd still read 16 bytes at once in the first loop. Aka, overread. > +;simd loop > +loop1: > +movdqaxmm0,[r0];load first 16 bytes Use "m0" instead. > +lea r4, [r0 + r2]; > + > +movdquxmm1, [r4]; unaligned load r4 doesn't seem to be used for anything else. Just move the address calculation directly into the "movdqu", it can take it. > +movdqa xmm2,xmm0;copy xmm0 > + > +punpcklbw xmm2, xmm1; > +movdqa [r1], xmm2; > +add r1, 16; use "mmsize" instead of 16. > +movdqa xmm2, xmm0; > +punpckhbw xmm2, xmm1; > +movdqa[r1], xmm2; > +addr1,16; You can actually avoid one of the "add" if you use [r1+mmsize] and add 32 at the second "add" (or 2*mmsize). > +decr3; > +addr0,16; > +; test repeat > +cmpr3, 0; > +jgeloop1; First, "dec" instructions are avoided, because they do a partial update on the flag register and this causes interdependence. Second, you can use the flags from the "sub" to check if it has reached zero or has gone negative. > +afterloop2: > +;scalar part > + > +mov r3, r2; > +and r3, 15 ;half_size % 16 > +lea r4, [r0 + r2]; > + > +;initial condition loop > +cmpr3,0; > +jleend; > + > +loop2: > +mov r1, r0; > +inc r1; > +mov r1, r4; > +inc r1; > +inc r0; > +inc r4; > +dec r3; > +; test repeat > +cmpr3, 0; > +jgloop2; O_o This loop does not read or write to memory. You can replace the second "cmp" in this loop with unconditional jump to the "initial condition loop" compare. (aka move loop2: two instruction above). This is how "for(;;)" is usually implemented in C compilers. Be sure to test your function with different sizes, to salt your output buffers and check for underwrites, overwrites etc... Best Regards. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
[FFmpeg-devel] libavcodec/exr : add SSE2 simd for reorder pixels (WIP)
Hello, Based on pull request in the official openexr library : https://github.com/openexr/openexr/pull/229/commits/4198128397c033d4f69e5cc0833195da500c31cf i try to port the reorder part (used in zip and rle decompression), to asm Tested on OS X pass fate test for me I test with this command : ./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null -;./ffmpeg -i ./fate-suite/exr/rgba_slice_zip16.exr -f null - -cpuflags 0 the result of the timer are : scalar : 960355 decicycles in reorder_pixels_zip, 64 runs, 0 skips sse : 84763 decicycles in reorder_pixels_zip, 64 runs, 0 skips Comments Welcome Martin Jokyo Images 0001-libavcodec-exr-add-sse2-simd-for-reorder_pixels.patch Description: Binary data ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel