Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)

2018-01-28 Thread Martin Vignali
2018-01-17 21:13 GMT+01:00 Martin Vignali :

> Hello,
>
>
> New patch in attach

Pushed

Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)

2018-01-17 Thread Martin Vignali
Hello,


New patch attached, with modifications to average, grain extract,
multiply, screen and grain merge.


-- blend Average --
Prev patch :
average_c: 15605.4
average_sse2: 1205.9
average_avx2: 772.4

New patch :
average_c: 15604.4
average_sse2: 490.9
average_avx2: 265.2

With 3-operand instructions, using:
%if cpuflag(avx)
pxor m0, m2, [topq + xq]
pxor m1, m2, [bottomq + xq]
%else
movu   m0, [topq + xq]
movu   m1, [bottomq + xq]
pxor   m0, m2
pxor   m1, m2
%endif

average_c: 15615.5
average_sse2: 456.2
average_avx: 553.7
average_avx2: 387.0


And for grain extract, multiply, screen and grain merge,
processing mmsize bytes per loop iteration (instead of mmsize / 2):

-- Grain extract --
Prev :
grainextract_c: 22182.9
grainextract_sse2: 1158.9
grainextract_avx2: 777.6

New :
grainextract_c: 22206.5
grainextract_sse2: 964.8
grainextract_avx2: 485.3

-- Multiply --
Prev :
multiply_c: 41347.8
multiply_sse2: 1376.0
multiply_avx2: 840.0

New :
multiply_c: 40432.5
multiply_sse2: 1248.0
multiply_avx2: 635.0

-- Screen --
Prev :
screen_c: 21635.8
screen_sse2: 1801.5
screen_avx2: 1069.8

New :
screen_c: 21617.0
screen_sse2: 1625.7
screen_avx2: 840.2

-- Grain merge --
Prev :
grainmerge_c: 25233.5
grainmerge_sse2: 1158.0
grainmerge_avx2: 775.7

New :
grainmerge_c: 25246.7
grainmerge_sse2: 967.4
grainmerge_avx2: 487.7


Martin


0001-avfilter-x86-vf_blend-avfilter-x86-vf_blend-add-AVX2.patch
Description: Binary data


Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)

2018-01-17 Thread Henrik Gramner
On Tue, Jan 16, 2018 at 11:33 PM, Martin Vignali wrote:
> BLEND_INIT grainextract, 4

You could also try doing twice as much per iteration which might be
more efficient, especially in avx2 since it avoids cross-lane
shuffles. Applies to some other ones as well.

E.g. something like:

pxor   m4, m4
VBROADCASTI128 m5, [pw_128]

.loop:
movu   m1, [topq + xq]
movu   m3, [bottomq + xq]
punpcklbw  m0, m1, m4
punpckhbw  m1, m4
punpcklbw  m2, m3, m4
punpckhbw  m3, m4
paddw  m0, m5
paddw  m1, m5
psubw  m0, m2
psubw  m1, m3
packuswb   m0, m1
mova  [dstq + xq], m0
add    xq, mmsize
jl .loop
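
For anyone who wants to check the lane-ordering argument without an AVX2 machine, here is a minimal scalar sketch in C of the loop body above. It assumes grain extract is clip(128 + top - bottom) and models a 256-bit register as two independent 128-bit lanes; every function name below is made up for the illustration and is not part of FFmpeg.

```c
#include <assert.h>
#include <stdint.h>

enum { LANE = 16, LANES = 2, VEC = LANE * LANES }; /* 256 bits = 2 lanes */

/* packuswb saturates signed 16-bit words to unsigned bytes. */
uint8_t sat_u8(int16_t w)
{
    return (uint8_t)(w < 0 ? 0 : w > 255 ? 255 : w);
}

/* punpcklbw / punpckhbw against a zero register, per 128-bit lane:
 * lo receives bytes 0..7 of each lane, hi receives bytes 8..15. */
void unpack_lanes(const uint8_t *src, int16_t *lo, int16_t *hi)
{
    for (int lane = 0; lane < LANES; lane++)
        for (int i = 0; i < 8; i++) {
            lo[lane * 8 + i] = src[lane * LANE + i];
            hi[lane * 8 + i] = src[lane * LANE + 8 + i];
        }
}

/* packuswb per lane: the low half of each lane comes from lo,
 * the high half from hi, which undoes the split made by the unpacks. */
void pack_lanes(const int16_t *lo, const int16_t *hi, uint8_t *dst)
{
    for (int lane = 0; lane < LANES; lane++)
        for (int i = 0; i < 8; i++) {
            dst[lane * LANE + i]     = sat_u8(lo[lane * 8 + i]);
            dst[lane * LANE + 8 + i] = sat_u8(hi[lane * 8 + i]);
        }
}

/* One loop iteration: unpack both halves, paddw 128, psubw, packuswb. */
void grainextract_vec(const uint8_t *top, const uint8_t *bottom, uint8_t *dst)
{
    int16_t t_lo[VEC / 2], t_hi[VEC / 2], b_lo[VEC / 2], b_hi[VEC / 2];

    unpack_lanes(top, t_lo, t_hi);
    unpack_lanes(bottom, b_lo, b_hi);
    for (int i = 0; i < VEC / 2; i++) {
        t_lo[i] = (int16_t)(t_lo[i] + 128 - b_lo[i]);
        t_hi[i] = (int16_t)(t_hi[i] + 128 - b_hi[i]);
    }
    pack_lanes(t_lo, t_hi, dst);
}

/* Compares against the scalar formula; returns the number of bytes checked. */
int grainextract_selfcheck(void)
{
    uint8_t top[VEC], bottom[VEC], dst[VEC];

    for (int i = 0; i < VEC; i++) {
        top[i]    = (uint8_t)(i * 37 + 5);
        bottom[i] = (uint8_t)(250 - i * 11);
    }
    grainextract_vec(top, bottom, dst);
    for (int i = 0; i < VEC; i++) {
        int v = 128 + top[i] - bottom[i];
        assert(dst[i] == (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v));
    }
    return VEC;
}
```

Because the unpacks and the pack both work within a lane, the bytes come out in source order with no permute in between, which is the point of handling both halves in one iteration.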

> BLEND_INIT average, 3

pavgb should probably be more efficient than unpacking to words. It
does round up so some bitflipping shenanigans might be required if you
want to round down.

E.g. something like:

pcmpeqb m2, m2

.loop:
movu   m0, [topq + xq]
movu   m1, [bottomq + xq]
pxor   m0, m2
pxor   m1, m2
pavgb  m0, m1
pxor   m0, m2
mova  [dstq + xq], m0
add    xq, mmsize
jl .loop

(optionally combining movu+pxor into a 3-arg pxor with avx since
memory operands can be unaligned in VEX-encoded instructions).
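
The bit-flipping trick can be checked exhaustively with a small scalar model in C. This is only a sketch: it assumes the reference average is (a + b) >> 1 (round down) and models pavgb as (a + b + 1) >> 1 per byte; the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Per-byte model of pavgb: average with rounding up. */
uint8_t pavgb_model(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* pxor with all-ones on both inputs, pavgb, pxor with all-ones again.
 * Complementing turns "round up" into "round down". */
uint8_t average_floor(uint8_t a, uint8_t b)
{
    uint8_t na = (uint8_t)(a ^ 0xFF);
    uint8_t nb = (uint8_t)(b ^ 0xFF);
    return (uint8_t)(pavgb_model(na, nb) ^ 0xFF);
}

/* Checks every pair of byte values; returns the number of pairs checked. */
int average_floor_selfcheck(void)
{
    int checked = 0;

    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++) {
            assert(average_floor((uint8_t)a, (uint8_t)b) == ((a + b) >> 1));
            checked++;
        }
    return checked;
}
```

The reason it works: with s = a + b, the complemented average is 255 - ((511 - s) >> 1), which equals s / 2 for even s and (s - 1) / 2 for odd s.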


Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)

2018-01-16 Thread Martin Vignali
2018-01-16 23:00 GMT+01:00 James Darnley :

> On 2018-01-16 22:26, Martin Vignali wrote:
> > diff --git a/libavutil/x86/x86util.asm b/libavutil/x86/x86util.asm
> > index d7cd996842..9db2d90e57 100644
> > --- a/libavutil/x86/x86util.asm
> > +++ b/libavutil/x86/x86util.asm
> > @@ -335,7 +335,7 @@
> >  %endmacro
> >
> >  %macro ABS2 4
> > -%if cpuflag(ssse3)
> > +%if cpuflag(ssse3)||cpuflag(avx2)
> >  pabsw   %1, %1
> >  pabsw   %2, %2
> >  %elif cpuflag(mmxext) ; a, b, tmp0, tmp1
>
> Why?  AVX2 implies all earlier flags.
>

Yes, you're right. I don't remember why I added it; I'll drop the first patch.


>
> > +;%1 dst, %2 src %3 xm fill by zero (only use in SSE2)
> > +%macro PMOVZXBW 3
> > +%if cpuflag(avx2)
> > +vpmovzxbw %1, %2
> > +%else; SSE2
> > + movh  %1, %2
> > + punpcklbw %1, %3
> > +%endif
> > +%endmacro
>
> Are you aware that SSE4.1 added the packed move sign/zero extend
> instructions?  I don't suggest that you make an SSE4 but if you use many
> 3-operand instructions an AVX version might be worthwhile.
>

Yes. I tested the SSE4 pmovzxbw, but it is slower for me most of the time
(for example, if I run the test 4 times, it is only slightly faster in one
run and slower in the others).
I also tested an AVX version, which is also slower for me (but few lines
use three operands).

Patch attached, in case someone wants to test on another CPU/OS.


>
> > @@ -85,4 +102,25 @@ av_cold void ff_blend_init_x86(FilterParams *param, int is_16bit)
> >  case BLEND_NEGATION:   param->blend = ff_blend_negation_ssse3;  break;
> >  }
> >  }
> > +if (EXTERNAL_AVX2_FAST(cpu_flags) && param->opacity == 1 && !is_16bit) {
> > +switch (param->mode) {
> > +case BLEND_ADDITION: param->blend = ff_blend_addition_avx2; break;
> > +case BLEND_GRAINMERGE: param->blend = ff_blend_grainmerge_avx2; break;
> > +case BLEND_AND:  param->blend = ff_blend_and_avx2; break;
> > +case BLEND_AVERAGE:  param->blend = ff_blend_average_avx2; break;
> > +case BLEND_DARKEN:   param->blend = ff_blend_darken_avx2;  break;
> > +case BLEND_GRAINEXTRACT: param->blend = ff_blend_grainextract_avx2; break;
> > +case BLEND_HARDMIX:  param->blend = ff_blend_hardmix_avx2; break;
> > +case BLEND_LIGHTEN:  param->blend = ff_blend_lighten_avx2; break;
> > +case BLEND_MULTIPLY: param->blend = ff_blend_multiply_avx2; break;
> > +case BLEND_OR:   param->blend = ff_blend_or_avx2;  break;
> > +case BLEND_PHOENIX:  param->blend = ff_blend_phoenix_avx2; break;
> > +case BLEND_SCREEN:   param->blend = ff_blend_screen_avx2;  break;
> > +case BLEND_SUBTRACT: param->blend = ff_blend_subtract_avx2; break;
> > +case BLEND_XOR:  param->blend = ff_blend_xor_avx2; break;
> > +case BLEND_DIFFERENCE: param->blend = ff_blend_difference_avx2; break;
> > +case BLEND_EXTREMITY:  param->blend = ff_blend_extremity_avx2; break;
> > +case BLEND_NEGATION:   param->blend = ff_blend_negation_avx2;  break;
> > +}
> > +}
> >  }
>
> If you're going to align things vertically then do it for every line.
>

ok

Martin


0001-avfilter-x86-vf_blend-for-testing.patch
Description: Binary data


Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)

2018-01-16 Thread James Darnley
On 2018-01-16 22:26, Martin Vignali wrote:
> diff --git a/libavutil/x86/x86util.asm b/libavutil/x86/x86util.asm
> index d7cd996842..9db2d90e57 100644
> --- a/libavutil/x86/x86util.asm
> +++ b/libavutil/x86/x86util.asm
> @@ -335,7 +335,7 @@
>  %endmacro
>  
>  %macro ABS2 4
> -%if cpuflag(ssse3)
> +%if cpuflag(ssse3)||cpuflag(avx2)
>  pabsw   %1, %1
>  pabsw   %2, %2
>  %elif cpuflag(mmxext) ; a, b, tmp0, tmp1

Why?  AVX2 implies all earlier flags.
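
A small model of that point, written in C for illustration. It assumes, as I understand x86inc, that each flag set is defined as its own bit OR'ed with the set it extends, and that cpuflag(x) tests whether all bits of x are present. The bit values and names here are hypothetical, not the ones x86inc actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cumulative flag sets: each one includes everything before it. */
enum {
    FLAG_SSE2  = (1 << 0),
    FLAG_SSSE3 = (1 << 1) | FLAG_SSE2,
    FLAG_AVX   = (1 << 2) | FLAG_SSSE3,
    FLAG_AVX2  = (1 << 3) | FLAG_AVX
};

/* Model of cpuflag(x): true when every bit of `flag` is set in `active`. */
int has_cpuflag(unsigned active, unsigned flag)
{
    assert(flag != 0);
    return (active & flag) == flag;
}

/* The patched condition, cpuflag(ssse3)||cpuflag(avx2). */
int abs2_condition_patched(unsigned active)
{
    return has_cpuflag(active, FLAG_SSSE3) || has_cpuflag(active, FLAG_AVX2);
}

/* The original condition, cpuflag(ssse3). */
int abs2_condition_original(unsigned active)
{
    return has_cpuflag(active, FLAG_SSSE3);
}
```

Under this model both conditions give the same answer for every flag set, so the extra `||cpuflag(avx2)` changes nothing.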

> +;%1 dst, %2 src %3 xm fill by zero (only use in SSE2)
> +%macro PMOVZXBW 3
> +%if cpuflag(avx2)
> +vpmovzxbw %1, %2
> +%else; SSE2
> + movh  %1, %2
> + punpcklbw %1, %3
> +%endif
> +%endmacro

Are you aware that SSE4.1 added the packed move sign/zero extend
instructions?  I don't suggest that you make an SSE4 but if you use many
3-operand instructions an AVX version might be worthwhile.

> @@ -85,4 +102,25 @@ av_cold void ff_blend_init_x86(FilterParams *param, int is_16bit)
>  case BLEND_NEGATION:   param->blend = ff_blend_negation_ssse3;   break;
>  }
>  }
> +if (EXTERNAL_AVX2_FAST(cpu_flags) && param->opacity == 1 && !is_16bit) {
> +switch (param->mode) {
> +case BLEND_ADDITION: param->blend = ff_blend_addition_avx2; break;
> +case BLEND_GRAINMERGE: param->blend = ff_blend_grainmerge_avx2; break;
> +case BLEND_AND:  param->blend = ff_blend_and_avx2;  break;
> +case BLEND_AVERAGE:  param->blend = ff_blend_average_avx2;  break;
> +case BLEND_DARKEN:   param->blend = ff_blend_darken_avx2;   break;
> +case BLEND_GRAINEXTRACT: param->blend = ff_blend_grainextract_avx2; break;
> +case BLEND_HARDMIX:  param->blend = ff_blend_hardmix_avx2;  break;
> +case BLEND_LIGHTEN:  param->blend = ff_blend_lighten_avx2;  break;
> +case BLEND_MULTIPLY: param->blend = ff_blend_multiply_avx2; break;
> +case BLEND_OR:   param->blend = ff_blend_or_avx2;   break;
> +case BLEND_PHOENIX:  param->blend = ff_blend_phoenix_avx2;  break;
> +case BLEND_SCREEN:   param->blend = ff_blend_screen_avx2;   break;
> +case BLEND_SUBTRACT: param->blend = ff_blend_subtract_avx2; break;
> +case BLEND_XOR:  param->blend = ff_blend_xor_avx2;  break;
> +case BLEND_DIFFERENCE: param->blend = ff_blend_difference_avx2; break;
> +case BLEND_EXTREMITY:  param->blend = ff_blend_extremity_avx2;  break;
> +case BLEND_NEGATION:   param->blend = ff_blend_negation_avx2;   break;
> +}
> +}
>  }

If you're going to align things vertically then do it for every line.






[FFmpeg-devel] avfilter/x86/vf_blend : add avx2 for 8b func (v2)

2018-01-16 Thread Martin Vignali
Hello,

Following Henrik Gramner's comments (in the discussion "avfilter/x86/vf_blend :
add avx2 version for 8b func (WIP)"), attached are new patches that add an
AVX2 version of each 8-bit function (except divide).

001 : avutil : add ABS2 for avx2
002 : avfilter : add AVX2 version

For most of the functions, where the processing stays in 8 bits, the AVX2
version is a simple modification: VBROADCASTI128 for constant loading.

Where the processing uses 16-bit intermediates, I added two macros.

For the load part:
PMOVZXBW : loads mmsize/2 bytes and expands them to 16-bit words
(the SSE4 version seems to be slower most of the time than the SSE2
"emulation").
Since the AVX2 version doesn't need a zero-filled vector register,
I added an if/else at the start of each blend macro and changed the indices
of the vector registers:

%macro GRAINEXTRACT 0
%if cpuflag(avx2)
BLEND_INIT grainextract, 3
%else ; SSE2
BLEND_INIT grainextract, 4
pxor   m3, m3
%endif
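
To illustrate why the SSE2 path needs the extra zeroed register while the AVX2 path does not, here is a rough scalar sketch in C of the two ways PMOVZXBW widens bytes to words. It assumes little-endian word layout, as on x86; the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* SSE2 path (movh + punpcklbw with a zeroed register): each source byte is
 * interleaved with a zero byte, and the byte pair is read as one word. */
void widen_interleave(const uint8_t *src, uint16_t *dst, int n)
{
    for (int i = 0; i < n; i++) {
        uint8_t low  = src[i]; /* byte taken from the data register */
        uint8_t high = 0;      /* byte taken from the zero register */
        dst[i] = (uint16_t)(low | (high << 8));
    }
}

/* AVX2 path (vpmovzxbw): direct zero extension, no zero register needed. */
void widen_direct(const uint8_t *src, uint16_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Checks both paths on every byte value; returns the number of values checked. */
int widen_selfcheck(void)
{
    uint8_t  src[256];
    uint16_t a[256], b[256];

    for (int i = 0; i < 256; i++)
        src[i] = (uint8_t)i;
    widen_interleave(src, a, 256);
    widen_direct(src, b, 256);
    for (int i = 0; i < 256; i++)
        assert(a[i] == b[i] && a[i] == i);
    return 256;
}
```

Both paths produce the same words, which is why the blend macros only differ in how many registers BLEND_INIT reserves and in the pxor that clears the spare one.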


For the store part, I added a PACKUSWB_AND_STORE macro,
which simplifies the code of each blend macro.

FATE passes for me.

Checkasm results (x86_64, Kaby Lake):
./tests/checkasm/checkasm --test=vf_blend --bench

benchmarking with native FFmpeg timers
nop: 35.7
checkasm: using random seed 3558581064
SSE2:
 - vf_blend.8bit [OK]
SSSE3:
 - vf_blend.8bit [OK]
AVX2:
 - vf_blend.8bit [OK]
checkasm: all 37 tests passed
addition_c: 20523.3
addition_sse2: 441.8
addition_avx2: 383.3
and_c: 14490.3
and_sse2: 485.8
and_avx2: 205.8
average_c: 15600.5
average_sse2: 1206.0
average_avx2: 773.0
darken_c: 27218.0
darken_sse2: 397.3
darken_avx2: 194.3
difference_c: 20607.8
difference_sse2: 980.8
difference_ssse3: 968.0
difference_avx2: 487.0
extremity_c: 17286.0
extremity_sse2: 1174.0
extremity_ssse3: 981.8
extremity_avx2: 550.0
grainextract_c: 22145.3
grainextract_sse2: 1158.5
grainextract_avx2: 771.5
grainmerge_c: 24505.5
grainmerge_sse2: 1158.8
grainmerge_avx2: 774.5
hardmix_c: 16505.5
hardmix_sse2: 490.8
hardmix_avx2: 388.8
lighten_c: 27153.0
lighten_sse2: 485.0
lighten_avx2: 251.3
multiply_c: 16459.8
multiply_sse2: 1382.5
multiply_avx2: 844.0
negation_c: 32143.8
negation_sse2: 1369.0
negation_ssse3: 1175.3
negation_avx2: 522.5
or_c: 13359.5
or_sse2: 397.3
or_avx2: 195.8
phoenix_c: 31159.8
phoenix_sse2: 551.0
phoenix_avx2: 310.5
screen_c: 25372.3
screen_sse2: 1804.0
screen_avx2: 1069.0
subtract_c: 16782.5
subtract_sse2: 478.8
subtract_avx2: 236.5
xor_c: 15374.8
xor_sse2: 491.3
xor_avx2: 237.0

Martin


0001-avutil-x86-x86util-add-ABS2-for-AVX2.patch
Description: Binary data


0002-avfilter-x86-vf_blend-add-AVX2-version-for-each-func.patch
Description: Binary data