Pushed, using vpermq (faster for me),
with "%if HAVE_AVX2_EXTERNAL" added around INIT_YMM...
Thanks
Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
2017-12-13 17:18 GMT+01:00 Henrik Gramner:
> On Wed, Dec 13, 2017 at 6:07 AM, Martin Vignali wrote:
> > +vpermq m1, [srcq + xq - mmsize + %3], 0x4e; flip each lane at load
> > +vpermq m2, [srcq + xq - 2 * mmsize + %3], 0x4e; flip each lane at load
>
> Would doing 2x 128-bit movu + 2x vinserti128 be faster?
On Wed, Dec 13, 2017 at 6:07 AM, Martin Vignali wrote:
> +vpermq m1, [srcq + xq - mmsize + %3], 0x4e; flip each lane at load
> +vpermq m2, [srcq + xq - 2 * mmsize + %3], 0x4e; flip each lane at load
Would doing 2x 128-bit movu + 2x vinserti128 be faster?
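
For reference, the 0x4e immediate in the quoted vpermq lines can be checked with a small scalar model in C. This is only a sketch of the instruction's qword selection, not code from the patch, and permute_qwords is a hypothetical name. Each 2-bit field of the immediate picks the source qword for one destination qword, so 0x4e (binary 01 00 11 10) yields {src[2], src[3], src[0], src[1]}, i.e. the two 128-bit halves are swapped. Reversing the elements inside each half is presumably done by a separate shuffle that the quoted lines do not show.

```c
#include <stdint.h>

/* Scalar model of the immediate form of AVX2 vpermq.
 * Destination qword i is taken from source qword number
 * (imm8 >> (2 * i)) & 3. The name is hypothetical. */
static void permute_qwords(uint64_t dst[4], const uint64_t src[4], uint8_t imm8)
{
    uint64_t tmp[4];

    /* Select into a temporary so that dst may alias src. */
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];

    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

Calling permute_qwords(dst, src, 0x4e) on src = {a, b, c, d} gives dst = {c, d, a, b}.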
__
Hello,
Attached is a patch that merges the byte and short hflip asm functions
into a single macro and adds an AVX2 version.
Checkasm results (Kaby Lake, x86_64, macOS 10.12):
hflip_byte_c: 30.9
hflip_byte_ssse3: 30.4
hflip_byte_avx2: 21.9
hflip_short_c: 31.6
hflip_short_ssse3: 30.4
hflip_short_avx2: 22.4
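
For readers unfamiliar with the operation being benchmarked: a horizontal flip reverses the order of the samples in a row, with 8-bit samples in the byte case and 16-bit samples in the short case. A minimal scalar sketch in C follows. The function names and signatures are hypothetical and are not taken from the patch or from the existing C versions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: reverse a row of w 8-bit samples.
 * src and dst must not overlap. */
static void hflip_byte_ref(const uint8_t *src, uint8_t *dst, size_t w)
{
    for (size_t i = 0; i < w; i++)
        dst[i] = src[w - 1 - i];
}

/* Hypothetical reference: reverse a row of w 16-bit samples.
 * src and dst must not overlap. */
static void hflip_short_ref(const uint16_t *src, uint16_t *dst, size_t w)
{
    for (size_t i = 0; i < w; i++)
        dst[i] = src[w - 1 - i];
}
```

The two loops differ only in element size, which is the kind of duplication a single macro parameterised on the sample width can remove.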
Martin
0001-avfi