Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)

2017-12-18 Thread Martin Vignali
2017-12-17 19:41 GMT+01:00 Henrik Gramner:

> On Thu, Dec 14, 2017 at 11:16 AM, Martin Vignali wrote:
> > 2017-12-13 17:37 GMT+01:00 Henrik Gramner:
> >> You could also do vextracti128 + 128-bit packuswb instead of 256-bit
> >> packuswb + vpermq.
> >>
> > Sorry, I don't understand this part.
> > Do you mean a 128-bit packuswb + movh for each lane,
> > or something else?
>
> packuswb  m0, m0
> vpermq    m0, m0, q3120
>
> vs.
>
> vextracti128 xm1, m0, 1
> packuswb xm0, xm1
>
> Uses a 128-bit op instead of a 256-bit one which is generally
> preferable whenever possible.
>
>
Thanks!
I will send a new patch using this approach.

Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)

2017-12-17 Thread Henrik Gramner
On Thu, Dec 14, 2017 at 11:16 AM, Martin Vignali wrote:
> 2017-12-13 17:37 GMT+01:00 Henrik Gramner:
>> You could also do vextracti128 + 128-bit packuswb instead of 256-bit
>> packuswb + vpermq.
>>
> Sorry, I don't understand this part.
> Do you mean a 128-bit packuswb + movh for each lane,
> or something else?

packuswb  m0, m0
vpermq    m0, m0, q3120

vs.

vextracti128 xm1, m0, 1
packuswb xm0, xm1

Uses a 128-bit op instead of a 256-bit one which is generally
preferable whenever possible.
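
The equivalence of the two store sequences can be checked with a small scalar model. This is only a Python sketch of the instruction semantics, not the assembly from the patch: a register is modelled as a list of signed 16-bit words (input) or bytes (output), and the helper names (sat_u8, packuswb_128, store_256, store_128, ...) are made up for illustration. It assumes x86inc's q3120 selects source qwords 0, 2, 1, 3 for destination qwords 0 to 3.

```python
def sat_u8(w):
    """Saturate a signed 16-bit word to 0..255, as packuswb does."""
    return max(0, min(255, w))

def packuswb_128(a, b):
    """128-bit packuswb: a and b hold 8 signed words each; result is 16 bytes."""
    return [sat_u8(w) for w in a] + [sat_u8(w) for w in b]

def packuswb_256(a, b):
    """256-bit packuswb works on each 128-bit lane independently."""
    return packuswb_128(a[:8], b[:8]) + packuswb_128(a[8:], b[8:])

def vpermq(reg_bytes, order):
    """Permute four qwords; order[i] is the source qword of destination qword i."""
    q = [reg_bytes[8 * i:8 * i + 8] for i in range(4)]
    return [byte for i in order for byte in q[i]]

Q3120 = (0, 2, 1, 3)  # assumption: q3120 lists the qword indices from high to low

def store_256(m0):
    """packuswb m0, m0 + vpermq m0, m0, q3120; return the low 128 bits."""
    return vpermq(packuswb_256(m0, m0), Q3120)[:16]

def store_128(m0):
    """vextracti128 xm1, m0, 1 + packuswb xm0, xm1."""
    xm1 = m0[8:]                      # high lane of m0
    return packuswb_128(m0[:8], xm1)  # low lane packed with high lane

# 16 words of test data, including values that saturate in both directions.
words = [-5, 0, 1, 127, 128, 255, 256, 300, 7, 8, 9, 10, 1000, -1, 254, 42]
assert store_256(words) == store_128(words)
```

Both variants leave the same 16 packed bytes in the low 128 bits, so the second one produces the same store with only 128-bit pack work.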


Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)

2017-12-14 Thread Martin Vignali
2017-12-13 17:37 GMT+01:00 Henrik Gramner:

> On Sat, Dec 9, 2017 at 1:11 PM, Martin Vignali wrote:
> > the idea in AVX2 is to load 128 bits of data (2x 64 bits),
> > then shuffle across lanes so that each 64-bit half ends up in the low
> > part of one lane, keeping the rest of the processing similar
> > to the SSE version
>
> What about using pmovzxbw instead of movu + vpermq + punpcklbw?
>

You're right, this is faster (tested on grainextract, the first function that
uses intermediate 16-bit processing).

With the vpermq load:

grainextract_c: 22162.2
grainextract_sse2: 1160.9
grainextract_avx2: 1154.2


With vpmovzxbw:

grainextract_c: 22165.7
grainextract_sse2: 1155.7
grainextract_avx2: 772.9
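
To put a number on "faster", the ratio between the two AVX2 timings can be computed directly. A minimal Python sketch using only the checkasm figures quoted above; the dictionary names are illustrative.

```python
# checkasm timings for grainextract, copied from the two runs above.
vpermq_load = {"c": 22162.2, "sse2": 1160.9, "avx2": 1154.2}
vpmovzxbw_load = {"c": 22165.7, "sse2": 1155.7, "avx2": 772.9}

# Gain of the AVX2 version from switching the load to vpmovzxbw.
avx2_gain = vpermq_load["avx2"] / vpmovzxbw_load["avx2"]

# AVX2 relative to SSE2 within each run.
vs_sse2_before = vpermq_load["sse2"] / vpermq_load["avx2"]
vs_sse2_after = vpmovzxbw_load["sse2"] / vpmovzxbw_load["avx2"]

print(f"avx2 gain from vpmovzxbw: {avx2_gain:.2f}x")
print(f"avx2 vs sse2, vpermq load: {vs_sse2_before:.2f}x")
print(f"avx2 vs sse2, vpmovzxbw:   {vs_sse2_after:.2f}x")
```

With the vpermq load the AVX2 version is on par with SSE2; with vpmovzxbw it is roughly 1.5 times faster on this run.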


>
> > for the store, the idea is similar in the opposite way (shuffle before
> > store)
>
> You could also do vextracti128 + 128-bit packuswb instead of 256-bit
> packuswb + vpermq.
>
>
Sorry, I don't understand this part.
Do you mean a 128-bit packuswb + movh for each lane,
or something else?

Martin


Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)

2017-12-13 Thread Henrik Gramner
On Sat, Dec 9, 2017 at 1:11 PM, Martin Vignali wrote:
> the idea in AVX2 is to load 128 bits of data (2x 64 bits),
> then shuffle across lanes so that each 64-bit half ends up in the low part
> of one lane, keeping the rest of the processing similar
> to the SSE version

What about using pmovzxbw instead of movu + vpermq + punpcklbw?
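
Why a single pmovzxbw can replace the three-instruction load can be seen in a scalar model. This is a Python sketch of the instruction semantics only, with registers modelled as lists of 32 bytes. The vpermq order (0, 2, 1, 3) and all helper names are assumptions for illustration, since the exact shuffle used in the patch is not shown in this thread.

```python
def movu_128(mem):
    """movu xm0, [mem]: load 16 bytes; the upper half is modelled as zero."""
    return list(mem[:16]) + [0] * 16

def vpermq(reg, order):
    """Permute four qwords; order[i] is the source qword of destination qword i."""
    q = [reg[8 * i:8 * i + 8] for i in range(4)]
    return [byte for i in order for byte in q[i]]

def punpcklbw_256(a, b):
    """256-bit punpcklbw: per lane, interleave the low 8 bytes of a and b."""
    out = []
    for lane in (0, 16):
        for i in range(8):
            out += [a[lane + i], b[lane + i]]
    return out

def as_words(reg):
    """Reinterpret 32 bytes as 16 little-endian 16-bit words."""
    return [reg[i] | (reg[i + 1] << 8) for i in range(0, 32, 2)]

def load_shuffle(mem):
    """movu + vpermq + punpcklbw with a zero register."""
    m0 = movu_128(mem)
    m0 = vpermq(m0, (0, 2, 1, 3))  # assumed order: qword 1 moves to the high lane
    return as_words(punpcklbw_256(m0, [0] * 32))

def load_pmovzxbw(mem):
    """pmovzxbw m0, [mem]: zero-extend 16 bytes to 16 words."""
    return list(mem[:16])

data = [(37 * i + 11) % 256 for i in range(16)]  # arbitrary byte values
assert load_shuffle(data) == load_pmovzxbw(data)
```

Both paths end with the 16 source bytes zero-extended to words in memory order, which is what the 16-bit processing that follows expects.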

> for the store, the idea is similar in the opposite way (shuffle before
> store)

You could also do vextracti128 + 128-bit packuswb instead of 256-bit
packuswb + vpermq.


[FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)

2017-12-09 Thread Martin Vignali
Hello,

Attached are patches that add an AVX2 version of each 8-bit function (except divide):

001 : avutil : add ABS2 for avx2
002 : avfilter : add AVX2 version

For most of the functions, where the processing stays in 8 bits, the AVX2
version is a simple modification: VBROADCASTi128 is used for constant loading.



When the processing uses intermediate 16 bits (where the SSE load uses movh,
a 64-bit load), I created macros (someone will probably have a better idea
for the names of these new macros).
The idea in AVX2 is to load 128 bits of data (2x 64 bits), then shuffle
across lanes so that each 64-bit half ends up in the low part of one lane,
keeping the rest of the processing similar to the SSE version.

For the store, the idea is the same in reverse (shuffle before the store).

The speed improvement is not very significant for these functions
(grainextract, multiply, screen, average, grainmerge); I'm not sure the
AVX2 version is needed for them (except for screen).


Checkasm result (x86_64, kaby lake)
./tests/checkasm/checkasm --test=vf_blend --bench
benchmarking with native FFmpeg timers
nop: 36.2
checkasm: using random seed 2027036350
SSE2:
 - vf_blend.8bit [OK]
SSSE3:
 - vf_blend.8bit [OK]
AVX2:
 - vf_blend.8bit [OK]
checkasm: all 37 tests passed
addition_c: 21882.7
addition_sse2: 483.9
addition_avx2: 250.9
and_c: 15336.7
and_sse2: 421.9
and_avx2: 196.7
average_c: 15640.7
average_sse2: 1160.7
average_avx2: 1155.7
darken_c: 27204.7
darken_sse2: 486.7
darken_avx2: 251.9
difference_c: 17101.9
difference_sse2: 981.2
difference_ssse3: 965.4
difference_avx2: 514.2
extremity_c: 27748.9
extremity_sse2: 1174.4
extremity_ssse3: 983.7
extremity_avx2: 520.4
grainextract_c: 22755.9
grainextract_sse2: 1158.2
grainextract_avx2: 1152.9
grainmerge_c: 26173.9
grainmerge_sse2: 1156.9
grainmerge_avx2: 1153.9
hardmix_c: 15676.9
hardmix_sse2: 458.4
hardmix_avx2: 268.7
lighten_c: 27137.4
lighten_sse2: 422.2
lighten_avx2: 194.2
multiply_c: 16449.9
multiply_sse2: 1378.9
multiply_avx2: 1158.7
negation_c: 17372.9
negation_sse2: 1439.4
negation_ssse3: 1172.4
negation_avx2: 520.4
or_c: 14116.2
or_sse2: 483.9
or_avx2: 236.4
phoenix_c: 30905.9
phoenix_sse2: 553.7
phoenix_avx2: 388.7
screen_c: 20414.7
screen_sse2: 1803.9
screen_avx2: 1257.4
subtract_c: 20596.2
subtract_sse2: 439.7
subtract_avx2: 403.7
xor_c: 15380.7
xor_sse2: 445.7
xor_avx2: 405.2

Comments welcome.

Martin


0001-avutil-x86-x86util-add-ABS2-for-AVX2.patch
Description: Binary data


0002-avfilter-x86-vf_blend-add-AVX2-version.patch
Description: Binary data