Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
2018-01-17 21:13 GMT+01:00 Martin Vignali:

> Hello,
>
> New patch attached, with modifications to average, grain extract,
> multiply, screen, and grain merge.
>
> -- blend Average --
> Prev patch:
> average_c: 15605.4
> average_sse2: 1205.9
> average_avx2: 772.4
>
> New patch:
> average_c: 15604.4
> average_sse2: 490.9
> average_avx2: 265.2
>
> With 3-operand instructions, using
>
> %if cpuflag(avx)
>     pxor m0, m2, [topq + xq]
>     pxor m1, m2, [bottomq + xq]
> %else
>     movu m0, [topq + xq]
>     movu m1, [bottomq + xq]
>     pxor m0, m2
>     pxor m1, m2
> %endif
>
> average_c: 15615.5
> average_sse2: 456.2
> average_avx: 553.7
> average_avx2: 387.0
>
> And for grain extract, multiply, screen, and grain merge,
> processing mmsize bytes per loop iteration (instead of mmsize / 2):
>
> -- Grain extract --
> Prev:
> grainextract_c: 22182.9
> grainextract_sse2: 1158.9
> grainextract_avx2: 777.6
>
> New:
> grainextract_c: 22206.5
> grainextract_sse2: 964.8
> grainextract_avx2: 485.3
>
> -- Multiply --
> Prev:
> multiply_c: 41347.8
> multiply_sse2: 1376.0
> multiply_avx2: 840.0
>
> New:
> multiply_c: 40432.5
> multiply_sse2: 1248.0
> multiply_avx2: 635.0
>
> -- Screen --
> Prev:
> screen_c: 21635.8
> screen_sse2: 1801.5
> screen_avx2: 1069.8
>
> New:
> screen_c: 21617.0
> screen_sse2: 1625.7
> screen_avx2: 840.2
>
> -- Grain merge --
> Prev:
> grainmerge_c: 25233.5
> grainmerge_sse2: 1158.0
> grainmerge_avx2: 775.7
>
> New:
> grainmerge_c: 25246.7
> grainmerge_sse2: 967.4
> grainmerge_avx2: 487.7
>
> Martin

Pushed

Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
Hello,

New patch attached, with modifications to average, grain extract,
multiply, screen, and grain merge.

-- blend Average --
Prev patch:
average_c: 15605.4
average_sse2: 1205.9
average_avx2: 772.4

New patch:
average_c: 15604.4
average_sse2: 490.9
average_avx2: 265.2

With 3-operand instructions, using

%if cpuflag(avx)
    pxor m0, m2, [topq + xq]
    pxor m1, m2, [bottomq + xq]
%else
    movu m0, [topq + xq]
    movu m1, [bottomq + xq]
    pxor m0, m2
    pxor m1, m2
%endif

average_c: 15615.5
average_sse2: 456.2
average_avx: 553.7
average_avx2: 387.0

And for grain extract, multiply, screen, and grain merge,
processing mmsize bytes per loop iteration (instead of mmsize / 2):

-- Grain extract --
Prev:
grainextract_c: 22182.9
grainextract_sse2: 1158.9
grainextract_avx2: 777.6

New:
grainextract_c: 22206.5
grainextract_sse2: 964.8
grainextract_avx2: 485.3

-- Multiply --
Prev:
multiply_c: 41347.8
multiply_sse2: 1376.0
multiply_avx2: 840.0

New:
multiply_c: 40432.5
multiply_sse2: 1248.0
multiply_avx2: 635.0

-- Screen --
Prev:
screen_c: 21635.8
screen_sse2: 1801.5
screen_avx2: 1069.8

New:
screen_c: 21617.0
screen_sse2: 1625.7
screen_avx2: 840.2

-- Grain merge --
Prev:
grainmerge_c: 25233.5
grainmerge_sse2: 1158.0
grainmerge_avx2: 775.7

New:
grainmerge_c: 25246.7
grainmerge_sse2: 967.4
grainmerge_avx2: 487.7

Martin

0001-avfilter-x86-vf_blend-avfilter-x86-vf_blend-add-AVX2.patch
Description: Binary data
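To compare the two patch revisions, the checkasm timings in this message can be turned into ratios. The following is only a small sketch of that arithmetic; `speedup` is a hypothetical helper name, not something provided by checkasm or FFmpeg.

```c
#include <assert.h>

/* Hypothetical helper (not part of checkasm): how many times faster
 * `after` is than `before`. Both timings must use the same unit and
 * lower is better, so the ratio is before / after. */
double speedup(double before, double after)
{
    assert(before > 0.0 && after > 0.0);
    return before / after;
}
```

With the average figures from this message, `speedup(772.4, 265.2)` is roughly 2.9 (previous AVX2 version versus the new one) and `speedup(15604.4, 265.2)` is roughly 58.8 (C reference versus the new AVX2 version).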
Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
On Tue, Jan 16, 2018 at 11:33 PM, Martin Vignali wrote:
> BLEND_INIT grainextract, 4

You could also try doing twice as much per iteration which might be
more efficient, especially in avx2 since it avoids cross-lane
shuffles. Applies to some other ones as well.

E.g. something like:

    pxor m4, m4
    VBROADCASTI128 m5, [pw_128]
.loop:
    movu m1, [topq + xq]
    movu m3, [bottomq + xq]
    punpcklbw m0, m1, m4
    punpckhbw m1, m4
    punpcklbw m2, m3, m4
    punpckhbw m3, m4
    paddw m0, m5
    paddw m1, m5
    psubw m0, m2
    psubw m1, m3
    packuswb m0, m1
    mova [dstq + xq], m0
    add xq, mmsize
    jl .loop

> BLEND_INIT average, 3

pavgb should probably be more efficient than unpacking to words. It
does round up so some bitflipping shenanigans might be required if you
want to round down.

E.g. something like:

    pcmpeqb m2, m2
.loop:
    movu m0, [topq + xq]
    movu m1, [bottomq + xq]
    pxor m0, m2
    pxor m1, m2
    pavgb m0, m1
    pxor m0, m2
    mova [dstq + xq], m0
    add xq, mmsize
    jl .loop

(optionally combining movu+pxor into a 3-arg pxor with avx since
memory operands can be unaligned in VEX-encoded instructions).
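The bit-flipping trick for pavgb can be checked in scalar C. This is a minimal sketch that models one byte lane, not the SIMD code itself; the function names are made up for illustration. It relies on pavgb computing (a + b + 1) >> 1 per byte: inverting both inputs and the result turns that into (a + b) >> 1.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane of pavgb: average, rounded up. */
uint8_t avg_round_up(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Round-down average: invert both inputs (the pxor with all-ones),
 * average with rounding up, then invert the result. */
uint8_t avg_round_down(uint8_t a, uint8_t b)
{
    uint8_t na = (uint8_t)~a;
    uint8_t nb = (uint8_t)~b;
    return (uint8_t)~avg_round_up(na, nb);
}

/* Exhaustive self-check against the plain (a + b) >> 1 reference
 * for all 65536 input pairs. */
void check_avg_trick(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(avg_round_down((uint8_t)a, (uint8_t)b) == ((a + b) >> 1));
}
```

For example, the inputs 1 and 2 give 2 with the rounding-up average and 1 with the inverted variant.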
Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
2018-01-16 23:00 GMT+01:00 James Darnley:

> On 2018-01-16 22:26, Martin Vignali wrote:
> > diff --git a/libavutil/x86/x86util.asm b/libavutil/x86/x86util.asm
> > index d7cd996842..9db2d90e57 100644
> > --- a/libavutil/x86/x86util.asm
> > +++ b/libavutil/x86/x86util.asm
> > @@ -335,7 +335,7 @@
> >  %endmacro
> >
> >  %macro ABS2 4
> > -%if cpuflag(ssse3)
> > +%if cpuflag(ssse3)||cpuflag(avx2)
> >      pabsw %1, %1
> >      pabsw %2, %2
> >  %elif cpuflag(mmxext) ; a, b, tmp0, tmp1
>
> Why? AVX2 implies all earlier flags.

Yes, you're right. I don't remember why I added it; I'll drop the
first patch.

> > +;%1 dst, %2 src %3 xm fill by zero (only use in SSE2)
> > +%macro PMOVZXBW 3
> > +%if cpuflag(avx2)
> > +    vpmovzxbw %1, %2
> > +%else; SSE2
> > +    movh %1, %2
> > +    punpcklbw %1, %3
> > +%endif
> > +%endmacro
>
> Are you aware that SSE4.1 added the packed move sign/zero extend
> instructions? I don't suggest that you make an SSE4 but if you use many
> 3-operand instructions an AVX version might be worthwhile.

Yes. I tested the SSE4 pmovzxbw, but it is slower for me most of the
time (if I run the test 4 times, for example, it is only slightly
faster in one test and slower in the others).
I also tested using AVX, which is also slower for me (but only a few
lines use three operands).

Patch attached, if someone wants to test on another CPU/OS.

> > @@ -85,4 +102,25 @@ av_cold void ff_blend_init_x86(FilterParams *param, int is_16bit)
> >          case BLEND_NEGATION: param->blend = ff_blend_negation_ssse3; break;
> >          }
> >      }
> > +    if (EXTERNAL_AVX2_FAST(cpu_flags) && param->opacity == 1 && !is_16bit) {
> > +        switch (param->mode) {
> > +        case BLEND_ADDITION: param->blend = ff_blend_addition_avx2; break;
> > +        case BLEND_GRAINMERGE: param->blend = ff_blend_grainmerge_avx2; break;
> > +        case BLEND_AND: param->blend = ff_blend_and_avx2; break;
> > +        case BLEND_AVERAGE: param->blend = ff_blend_average_avx2; break;
> > +        case BLEND_DARKEN: param->blend = ff_blend_darken_avx2; break;
> > +        case BLEND_GRAINEXTRACT: param->blend = ff_blend_grainextract_avx2; break;
> > +        case BLEND_HARDMIX: param->blend = ff_blend_hardmix_avx2; break;
> > +        case BLEND_LIGHTEN: param->blend = ff_blend_lighten_avx2; break;
> > +        case BLEND_MULTIPLY: param->blend = ff_blend_multiply_avx2; break;
> > +        case BLEND_OR: param->blend = ff_blend_or_avx2; break;
> > +        case BLEND_PHOENIX: param->blend = ff_blend_phoenix_avx2; break;
> > +        case BLEND_SCREEN: param->blend = ff_blend_screen_avx2; break;
> > +        case BLEND_SUBTRACT: param->blend = ff_blend_subtract_avx2; break;
> > +        case BLEND_XOR: param->blend = ff_blend_xor_avx2; break;
> > +        case BLEND_DIFFERENCE: param->blend = ff_blend_difference_avx2; break;
> > +        case BLEND_EXTREMITY: param->blend = ff_blend_extremity_avx2; break;
> > +        case BLEND_NEGATION: param->blend = ff_blend_negation_avx2; break;
> > +        }
> > +    }
> >  }
>
> If you're going to align things vertically then do it for every line.

ok

Martin

0001-avfilter-x86-vf_blend-for-testing.patch
Description: Binary data
Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
On 2018-01-16 22:26, Martin Vignali wrote:
> diff --git a/libavutil/x86/x86util.asm b/libavutil/x86/x86util.asm
> index d7cd996842..9db2d90e57 100644
> --- a/libavutil/x86/x86util.asm
> +++ b/libavutil/x86/x86util.asm
> @@ -335,7 +335,7 @@
>  %endmacro
>
>  %macro ABS2 4
> -%if cpuflag(ssse3)
> +%if cpuflag(ssse3)||cpuflag(avx2)
>      pabsw %1, %1
>      pabsw %2, %2
>  %elif cpuflag(mmxext) ; a, b, tmp0, tmp1

Why? AVX2 implies all earlier flags.

> +;%1 dst, %2 src %3 xm fill by zero (only use in SSE2)
> +%macro PMOVZXBW 3
> +%if cpuflag(avx2)
> +    vpmovzxbw %1, %2
> +%else; SSE2
> +    movh %1, %2
> +    punpcklbw %1, %3
> +%endif
> +%endmacro

Are you aware that SSE4.1 added the packed move sign/zero extend
instructions? I don't suggest that you make an SSE4 but if you use many
3-operand instructions an AVX version might be worthwhile.

> @@ -85,4 +102,25 @@ av_cold void ff_blend_init_x86(FilterParams *param, int is_16bit)
>          case BLEND_NEGATION: param->blend = ff_blend_negation_ssse3; break;
>          }
>      }
> +    if (EXTERNAL_AVX2_FAST(cpu_flags) && param->opacity == 1 && !is_16bit) {
> +        switch (param->mode) {
> +        case BLEND_ADDITION: param->blend = ff_blend_addition_avx2; break;
> +        case BLEND_GRAINMERGE: param->blend = ff_blend_grainmerge_avx2; break;
> +        case BLEND_AND: param->blend = ff_blend_and_avx2; break;
> +        case BLEND_AVERAGE: param->blend = ff_blend_average_avx2; break;
> +        case BLEND_DARKEN: param->blend = ff_blend_darken_avx2; break;
> +        case BLEND_GRAINEXTRACT: param->blend = ff_blend_grainextract_avx2; break;
> +        case BLEND_HARDMIX: param->blend = ff_blend_hardmix_avx2; break;
> +        case BLEND_LIGHTEN: param->blend = ff_blend_lighten_avx2; break;
> +        case BLEND_MULTIPLY: param->blend = ff_blend_multiply_avx2; break;
> +        case BLEND_OR: param->blend = ff_blend_or_avx2; break;
> +        case BLEND_PHOENIX: param->blend = ff_blend_phoenix_avx2; break;
> +        case BLEND_SCREEN: param->blend = ff_blend_screen_avx2; break;
> +        case BLEND_SUBTRACT: param->blend = ff_blend_subtract_avx2; break;
> +        case BLEND_XOR: param->blend = ff_blend_xor_avx2; break;
> +        case BLEND_DIFFERENCE: param->blend = ff_blend_difference_avx2; break;
> +        case BLEND_EXTREMITY: param->blend = ff_blend_extremity_avx2; break;
> +        case BLEND_NEGATION: param->blend = ff_blend_negation_avx2; break;
> +        }
> +    }
>  }

If you're going to align things vertically then do it for every line.
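The remark that AVX2 implies all earlier flags can be modelled with cumulative bit masks. The sketch below is a simplified illustration in C, assuming each flag's mask ORs in the mask of the level below it. The flag names, bit positions and function names are hypothetical and are not the values used by the assembler macros.

```c
#include <assert.h>

/* Illustrative cumulative masks: every level carries the bits of
 * all lower levels. Bit positions are made up for this sketch. */
enum {
    FLAG_MMXEXT = 1 << 0,
    FLAG_SSE2   = (1 << 1) | FLAG_MMXEXT,
    FLAG_SSSE3  = (1 << 2) | FLAG_SSE2,
    FLAG_AVX    = (1 << 3) | FLAG_SSSE3,
    FLAG_AVX2   = (1 << 4) | FLAG_AVX
};

/* Nonzero when every bit required by `wanted` is set in `active`. */
int has_cpuflag(int active, int wanted)
{
    return (active & wanted) == wanted;
}

/* Shape of the ABS2 condition before the quoted change. */
int abs2_cond_original(int active)
{
    return has_cpuflag(active, FLAG_SSSE3);
}

/* Shape of the ABS2 condition after the quoted change. */
int abs2_cond_patched(int active)
{
    return has_cpuflag(active, FLAG_SSSE3) || has_cpuflag(active, FLAG_AVX2);
}

/* Both conditions agree for every possible flag set, so the extra
 * AVX2 test adds nothing under this model. */
void check_avx2_test_is_redundant(void)
{
    for (int active = 0; active < 32; active++)
        assert(abs2_cond_original(active) == abs2_cond_patched(active));
}
```

Under this model `has_cpuflag(FLAG_AVX2, FLAG_SSSE3)` is true, which is why the added `||cpuflag(avx2)` changes nothing.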
[FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)
Hello,

Following Henrik Gramner's comments (in the discussion
"avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)"),
attached are new patches that add an AVX2 version of each 8-bit
function (except divide).

001: avutil: add ABS2 for AVX2
002: avfilter: add an AVX2 version of most of the functions

When the processing stays in 8 bits, the AVX2 version is a simple
modification (VBROADCASTI128 for constant loading).

When the processing uses 16-bit intermediates, I added two macros.
For the load part: PMOVZXBW, which loads mmsize/2 bytes and expands
them to 16 bits (the SSE4 version seems to be slower than the SSE2
"emulation" most of the time).

Since the AVX2 path doesn't need a zero-filled vector register, I added
an if/else at the start of each blend macro and changed the indices of
the vector registers:

%macro GRAINEXTRACT 0
%if cpuflag(avx2)
BLEND_INIT grainextract, 3
%else ; SSE2
BLEND_INIT grainextract, 4
    pxor m3, m3
%endif

For the store part, I added a PACKUSWB_AND_STORE macro, which
simplifies the code of each blend macro.

Passes the FATE tests for me.

Checkasm result (x86_64, Kaby Lake):

./tests/checkasm/checkasm --test=vf_blend --bench
benchmarking with native FFmpeg timers
nop: 35.7
checkasm: using random seed 3558581064
SSE2:
 - vf_blend.8bit [OK]
SSSE3:
 - vf_blend.8bit [OK]
AVX2:
 - vf_blend.8bit [OK]
checkasm: all 37 tests passed
addition_c: 20523.3
addition_sse2: 441.8
addition_avx2: 383.3
and_c: 14490.3
and_sse2: 485.8
and_avx2: 205.8
average_c: 15600.5
average_sse2: 1206.0
average_avx2: 773.0
darken_c: 27218.0
darken_sse2: 397.3
darken_avx2: 194.3
difference_c: 20607.8
difference_sse2: 980.8
difference_ssse3: 968.0
difference_avx2: 487.0
extremity_c: 17286.0
extremity_sse2: 1174.0
extremity_ssse3: 981.8
extremity_avx2: 550.0
grainextract_c: 22145.3
grainextract_sse2: 1158.5
grainextract_avx2: 771.5
grainmerge_c: 24505.5
grainmerge_sse2: 1158.8
grainmerge_avx2: 774.5
hardmix_c: 16505.5
hardmix_sse2: 490.8
hardmix_avx2: 388.8
lighten_c: 27153.0
lighten_sse2: 485.0
lighten_avx2: 251.3
multiply_c: 16459.8
multiply_sse2: 1382.5
multiply_avx2: 844.0
negation_c: 32143.8
negation_sse2: 1369.0
negation_ssse3: 1175.3
negation_avx2: 522.5
or_c: 13359.5
or_sse2: 397.3
or_avx2: 195.8
phoenix_c: 31159.8
phoenix_sse2: 551.0
phoenix_avx2: 310.5
screen_c: 25372.3
screen_sse2: 1804.0
screen_avx2: 1069.0
subtract_c: 16782.5
subtract_sse2: 478.8
subtract_avx2: 236.5
xor_c: 15374.8
xor_sse2: 491.3
xor_avx2: 237.0

Martin

0001-avutil-x86-x86util-add-ABS2-for-AVX2.patch
Description: Binary data
0002-avfilter-x86-vf_blend-add-AVX2-version-for-each-func.patch
Description: Binary data
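The load / 16-bit compute / pack-and-store pattern described in this message can be modelled in scalar C. This is a rough sketch only: the function names are hypothetical, and the grain extract arithmetic (top + 128 - bottom, saturated to 8 bits) is read off the assembly loop suggested in Henrik Gramner's reply in this thread rather than taken from the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one lane of packuswb: a signed 16-bit value is
 * narrowed to unsigned 8 bits with saturation. */
uint8_t pack_saturate_u8(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Grain extract over one row: widen both inputs to 16 bits, add 128,
 * subtract, then pack back to bytes with saturation. */
void grainextract_row(uint8_t *dst, const uint8_t *top,
                      const uint8_t *bottom, size_t width)
{
    assert(dst && top && bottom);
    for (size_t x = 0; x < width; x++) {
        int16_t t = top[x];     /* zero-extension, as PMOVZXBW does per lane */
        int16_t b = bottom[x];
        dst[x] = pack_saturate_u8((int16_t)(t + 128 - b));
    }
}
```

The 16-bit intermediate is what makes the saturation meaningful: 255 + 128 would wrap in 8 bits, while in 16 bits it is clamped to 255 on the way out.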