Re: [FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2

2019-03-06 Thread Mike Stoner
Thanks for the feedback.  You are right, I can use VPERMQ to free up a register.  I can also remove the PAND mask by doing PSLLD/PSRLD.  That eliminates the need for an x86-64 block. I tried the naive 'unrolled' version with no permute, and it was much slower, about the same as the AVX/SSSE3
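The PAND-to-shift substitution mentioned above can be illustrated in scalar C. This is only a sketch of the general idea, not the patch's code: the snippet does not show which mask was removed, and the field position (bits 10..19) and the names `field_mask` and `field_shifts` are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Extract the middle 10-bit field (bits 10..19) of a 32-bit word with a
 * shift followed by a mask. This corresponds to the PSRLD + PAND style,
 * which needs a mask constant held in a register or in memory. */
static uint32_t field_mask(uint32_t w)
{
    return (w >> 10) & 0x3FFu;
}

/* Extract the same field with two shifts and no mask constant, which
 * corresponds to the PSLLD + PSRLD style: the left shift by 12 discards
 * bits 20..31, and the logical right shift by 22 discards bits 0..9. */
static uint32_t field_shifts(uint32_t w)
{
    return (uint32_t)(w << 12) >> 22;
}
```

Both functions return the same value for every 32-bit input; the shift-only form trades the mask constant for one extra shift instruction.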

Re: [FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2

2019-03-04 Thread James Darnley
On 2019-03-01 18:41, Michael Stoner wrote:
> The AVX2 code leverages VPERMD to process 12 pixels/iteration. This is my
> first patch submission so any comments are greatly appreciated.
>
> -Mike
>
> Tested on Skylake (Win32 & Win64)
> 1920x1080 input frame
> =
> C code -

Re: [FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2

2019-03-04 Thread James Darnley
On 2019-03-03 15:44, Martin Vignali wrote:
> Hello,
>
> ...
>
> Not directly related to this patch, but it can be interesting for testing
> purpose to write a checkasm test for the v210 func decoding.
> So it's more easy to check the perf for "each" cpu flags, and be sure, the
> various width

Re: [FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2

2019-03-03 Thread Martin Vignali
Hello, a few comments. You can use the VBROADCASTI128 macro instead of changing the size of the constants (VBROADCASTI128 loads 128 bits when using XMM, and broadcasts the 128 bits to both lanes when using YMM). The %if ARCH_X86_64 part seems strange; it seems to be useful only for AVX2, not for SSE/AVX.
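The broadcast behaviour described above (one 128-bit constant duplicated into both lanes of a 256-bit register) can be modelled in plain C. This is a byte-array model for illustration only; `broadcast128` is a hypothetical helper, not an FFmpeg function, and no SIMD instructions are involved.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit broadcast: the 16-byte constant is
 * copied into the low lane (bytes 0..15) and the high lane (bytes 16..31)
 * of a 32-byte "register". The constant itself stays 16 bytes in size. */
static void broadcast128(uint8_t dst[32], const uint8_t src[16])
{
    memcpy(dst, src, 16);      /* low 128-bit lane  */
    memcpy(dst + 16, src, 16); /* high 128-bit lane */
}
```

The point of the suggestion is that the constants can keep their 128-bit size and still serve the XMM and YMM code paths.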

[FFmpeg-devel] [PATCH] Added ff_v210_planar_unpack_aligned_avx2

2019-03-01 Thread Michael Stoner
The AVX2 code leverages VPERMD to process 12 pixels/iteration. This is my first patch submission so any comments are greatly appreciated.

-Mike

Tested on Skylake (Win32 & Win64)
1920x1080 input frame
=
C code - 440 fps
SSSE3  - 920 fps
AVX    - 930 fps
AVX2   - 1040 fps
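For readers unfamiliar with the format, here is a scalar C sketch of what a v210 planar unpack does for one group of 6 pixels. It is not the patch's code: `rl32` and `v210_unpack_6px` are hypothetical names, and the layout assumed is the usual v210 packing, where each little-endian 32-bit word carries three 10-bit samples and four words (16 bytes) hold 6 pixels of 4:2:2 video. Two such groups fill 32 bytes, the width of one YMM register, which matches the 12 pixels/iteration figure.

```c
#include <stdint.h>

/* Read one little-endian 32-bit word from a byte pointer. */
static uint32_t rl32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Unpack one 16-byte v210 group into planar 10-bit samples:
 * 6 luma (y), 3 Cb (u) and 3 Cr (v). Sample order per word, from the
 * low bits upward: Cb Y Cr | Y Cb Y | Cr Y Cb | Y Cr Y. */
static void v210_unpack_6px(const uint8_t *src,
                            uint16_t *y, uint16_t *u, uint16_t *v)
{
    uint32_t w0 = rl32(src),     w1 = rl32(src + 4);
    uint32_t w2 = rl32(src + 8), w3 = rl32(src + 12);

    u[0] = w0 & 0x3FF; y[0] = (w0 >> 10) & 0x3FF; v[0] = (w0 >> 20) & 0x3FF;
    y[1] = w1 & 0x3FF; u[1] = (w1 >> 10) & 0x3FF; y[2] = (w1 >> 20) & 0x3FF;
    v[1] = w2 & 0x3FF; y[3] = (w2 >> 10) & 0x3FF; u[2] = (w2 >> 20) & 0x3FF;
    y[4] = w3 & 0x3FF; v[2] = (w3 >> 10) & 0x3FF; y[5] = (w3 >> 20) & 0x3FF;
}
```

A SIMD version performs the same field extraction on many words at once and then has to reorder the results so that each plane's samples end up contiguous, which is where a cross-lane permute such as VPERMD becomes useful.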