Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)
2017-12-17 19:41 GMT+01:00 Henrik Gramner:

> On Thu, Dec 14, 2017 at 11:16 AM, Martin Vignali wrote:
> > 2017-12-13 17:37 GMT+01:00 Henrik Gramner:
> >> You could also do vextracti128 + 128-bit packuswb instead of 256-bit
> >> packuswb + vpermq.
> >
> > Sorry don't understand this part
> > do you mean 128 bit packuswb + movh for each lane ?
> > or something else ?
>
> packuswb m0, m0
> vpermq   m0, m0, q3120
>
> vs.
>
> vextracti128 xm1, m0, 1
> packuswb     xm0, xm1
>
> Uses a 128-bit op instead of a 256-bit one which is generally
> preferable whenever possible.

Thanks ! I will send a new patch, using this way.

Martin
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)
On Thu, Dec 14, 2017 at 11:16 AM, Martin Vignali wrote:
> 2017-12-13 17:37 GMT+01:00 Henrik Gramner:
>> You could also do vextracti128 + 128-bit packuswb instead of 256-bit
>> packuswb + vpermq.
>
> Sorry don't understand this part
> do you mean 128 bit packuswb + movh for each lane ?
> or something else ?

packuswb m0, m0
vpermq   m0, m0, q3120

vs.

vextracti128 xm1, m0, 1
packuswb     xm0, xm1

Uses a 128-bit op instead of a 256-bit one which is generally
preferable whenever possible.
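To make the equivalence of the two store sequences above easier to check, here is a minimal scalar sketch in portable C. It is only a model of the instruction semantics, not code from the patch. The helper names (`sat_u8`, `packuswb128`, `store_pack_vpermq`, `store_extract_pack`, `check_store_sequences`) are hypothetical, and it assumes the x86inc `q3120` shorthand means destination qwords 0,1,2,3 take source qwords 0,2,1,3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Saturate a signed 16-bit word to an unsigned byte, as packuswb does. */
static uint8_t sat_u8(int16_t w)
{
    return w < 0 ? 0 : w > 255 ? 255 : (uint8_t)w;
}

/* Model of a 128-bit "packuswb dst, src": the 8 words of dst become
 * bytes 0..7, the 8 words of src become bytes 8..15. */
static void packuswb128(uint8_t out[16], const int16_t dst[8], const int16_t src[8])
{
    for (int i = 0; i < 8; i++) {
        out[i]     = sat_u8(dst[i]);
        out[i + 8] = sat_u8(src[i]);
    }
}

/* Sequence A: 256-bit "packuswb m0, m0" then "vpermq m0, m0, q3120". */
static void store_pack_vpermq(uint8_t out[16], const int16_t m0[16])
{
    /* Assumption: q3120 selects source qwords 0, 2, 1, 3 in that order. */
    static const int sel[4] = { 0, 2, 1, 3 };
    uint8_t packed[32], permuted[32];

    /* A 256-bit packuswb operates on each 128-bit lane independently. */
    packuswb128(packed,      m0,     m0);
    packuswb128(packed + 16, m0 + 8, m0 + 8);

    /* vpermq moves whole 64-bit qwords across lanes. */
    for (int q = 0; q < 4; q++)
        memcpy(permuted + 8 * q, packed + 8 * sel[q], 8);

    /* Only the low 128 bits are the 16 output pixels. */
    memcpy(out, permuted, 16);
}

/* Sequence B: "vextracti128 xm1, m0, 1" then 128-bit "packuswb xm0, xm1". */
static void store_extract_pack(uint8_t out[16], const int16_t m0[16])
{
    const int16_t *xm1 = m0 + 8;   /* upper lane of m0 */
    packuswb128(out, m0, xm1);
}

/* Run both models on words that cover negative, in-range and
 * above-255 values, and compare the 16 resulting bytes. */
static int check_store_sequences(void)
{
    int16_t m0[16];
    uint8_t a[16], b[16];

    for (int i = 0; i < 16; i++)
        m0[i] = (int16_t)(i * 40 - 100);   /* -100 .. 500 */

    store_pack_vpermq(a, m0);
    store_extract_pack(b, m0);
    assert(memcmp(a, b, 16) == 0);
    return memcmp(a, b, 16) == 0;
}
```

Calling `check_store_sequences()` from a small `main` exercises both models on the same input. Under the stated assumption about `q3120`, both produce the same 16 bytes, which is why the second sequence can replace the first while staying in 128-bit registers.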
Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)
2017-12-13 17:37 GMT+01:00 Henrik Gramner:

> On Sat, Dec 9, 2017 at 1:11 PM, Martin Vignali wrote:
> > the idea in AVX2 is to load 128bits of data (2x 64 bits)
> > then shuffle across lane, the two 64 bits in the low part of each lane, to
> > keep the rest of the process similar
> > to the sse version
>
> What about using pmovzxbw instead of movu + vpermq + punpcklbw?

You're right, this is faster (tested on grainextract, the first function that uses intermediate 16-bit processing):

vpermq load
grainextract_c: 22162.2
grainextract_sse2: 1160.9
grainextract_avx2: 1154.2

vpmovzxbw
grainextract_c: 22165.7
grainextract_sse2: 1155.7
grainextract_avx2: 772.9

> > for the store, the idea is similar in the opposite way (shuffle before
> > store)
>
> You could also do vextracti128 + 128-bit packuswb instead of 256-bit
> packuswb + vpermq.

Sorry, I don't understand this part.
Do you mean 128-bit packuswb + movh for each lane, or something else?

Martin
Re: [FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)
On Sat, Dec 9, 2017 at 1:11 PM, Martin Vignali wrote:
> the idea in AVX2 is to load 128bits of data (2x 64 bits)
> then shuffle across lane, the two 64 bits in the low part of each lane, to
> keep the rest of the process similar
> to the sse version

What about using pmovzxbw instead of movu + vpermq + punpcklbw?

> for the store, the idea is similar in the opposite way (shuffle before
> store)

You could also do vextracti128 + 128-bit packuswb instead of 256-bit
packuswb + vpermq.
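The load suggestion can be checked the same way. The sketch below is a scalar C model, not code from the patch, of the two ways to widen 16 source bytes into 16 words. The exact vpermq immediate used in the patch is not shown in this thread, so the model assumes `q3120` (source qwords 0, 2, 1, 3). All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of "punpcklbw lane, zero" on one 128-bit lane: interleaving the
 * low 8 bytes with zero bytes zero-extends them to 8 words. */
static void punpcklbw_zero128(uint16_t out[8], const uint8_t lane[16])
{
    for (int i = 0; i < 8; i++)
        out[i] = lane[i];
}

/* Sequence A: movu (128-bit load) + vpermq + punpcklbw with zero. */
static void load_vpermq_unpack(uint16_t out[16], const uint8_t src[16])
{
    /* Assumption: the permute is q3120, i.e. source qwords 0, 2, 1, 3. */
    static const int sel[4] = { 0, 2, 1, 3 };
    uint8_t reg[32] = { 0 }, perm[32];

    memcpy(reg, src, 16);                 /* upper 128 bits stay zero */
    for (int q = 0; q < 4; q++)           /* one 64-bit half per lane */
        memcpy(perm + 8 * q, reg + 8 * sel[q], 8);

    punpcklbw_zero128(out,     perm);       /* low lane  */
    punpcklbw_zero128(out + 8, perm + 16);  /* high lane */
}

/* Sequence B: "pmovzxbw m0, [src]" zero-extends 16 bytes to 16 words. */
static void load_pmovzxbw(uint16_t out[16], const uint8_t src[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = src[i];
}

/* Compare both models on 16 distinct byte values. */
static int check_load_sequences(void)
{
    uint8_t  src[16];
    uint16_t a[16], b[16];

    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(200 + i);

    load_vpermq_unpack(a, src);
    load_pmovzxbw(b, src);
    assert(memcmp(a, b, sizeof(a)) == 0);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

In this model both sequences leave the same 16 words in the register, so the single pmovzxbw replaces three instructions without changing the rest of the 16-bit processing.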
[FFmpeg-devel] avfilter/x86/vf_blend : add avx2 version for 8b func (WIP)
Hello,

Attached are patches adding an AVX2 version of each 8-bit function (except divide):

001 : avutil : add ABS2 for avx2
002 : avfilter : add AVX2 version

For most of the functions, the AVX2 version is a simple modification (VBROADCASTi128 for constant loading) when the processing stays in 8 bits.

When the processing uses intermediate 16 bits (the load uses movh, a 64-bit load), I created a macro (someone will probably have a better idea for the name of these new macros). The idea in AVX2 is to load 128 bits of data (2x 64 bits), then shuffle across lanes so that each 64-bit half sits in the low part of a lane, to keep the rest of the process similar to the SSE version.

For the store, the idea is similar in the opposite direction (shuffle before store).

The speed improvement is not very significant for these functions (grainextract, multiply, screen, average, grainmerge). I'm not sure the AVX2 version is needed for them (except for screen).

Checkasm result (x86_64, Kaby Lake):

./tests/checkasm/checkasm --test=vf_blend --bench
benchmarking with native FFmpeg timers
nop: 36.2
checkasm: using random seed 2027036350
SSE2:
 - vf_blend.8bit [OK]
SSSE3:
 - vf_blend.8bit [OK]
AVX2:
 - vf_blend.8bit [OK]
checkasm: all 37 tests passed
addition_c: 21882.7
addition_sse2: 483.9
addition_avx2: 250.9
and_c: 15336.7
and_sse2: 421.9
and_avx2: 196.7
average_c: 15640.7
average_sse2: 1160.7
average_avx2: 1155.7
darken_c: 27204.7
darken_sse2: 486.7
darken_avx2: 251.9
difference_c: 17101.9
difference_sse2: 981.2
difference_ssse3: 965.4
difference_avx2: 514.2
extremity_c: 27748.9
extremity_sse2: 1174.4
extremity_ssse3: 983.7
extremity_avx2: 520.4
grainextract_c: 22755.9
grainextract_sse2: 1158.2
grainextract_avx2: 1152.9
grainmerge_c: 26173.9
grainmerge_sse2: 1156.9
grainmerge_avx2: 1153.9
hardmix_c: 15676.9
hardmix_sse2: 458.4
hardmix_avx2: 268.7
lighten_c: 27137.4
lighten_sse2: 422.2
lighten_avx2: 194.2
multiply_c: 16449.9
multiply_sse2: 1378.9
multiply_avx2: 1158.7
negation_c: 17372.9
negation_sse2: 1439.4
negation_ssse3: 1172.4
negation_avx2: 520.4
or_c: 14116.2
or_sse2: 483.9
or_avx2: 236.4
phoenix_c: 30905.9
phoenix_sse2: 553.7
phoenix_avx2: 388.7
screen_c: 20414.7
screen_sse2: 1803.9
screen_avx2: 1257.4
subtract_c: 20596.2
subtract_sse2: 439.7
subtract_avx2: 403.7
xor_c: 15380.7
xor_sse2: 445.7
xor_avx2: 405.2

Comments welcome

Martin

Attachments:
0001-avutil-x86-x86util-add-ABS2-for-AVX2.patch (Description: Binary data)
0002-avfilter-x86-vf_blend-add-AVX2-version.patch (Description: Binary data)
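To put numbers on the remark that some functions barely benefit, the following small C sketch computes the SSE2-to-AVX2 ratio for a handful of entries copied from the checkasm output above. The struct and function names are hypothetical, and the selection of rows is only illustrative.

```c
#include <assert.h>
#include <stdio.h>

/* One benchmark row: checkasm timings for the SSE2 and AVX2 versions. */
struct bench {
    const char *name;
    double sse2;
    double avx2;
};

/* Values copied from the checkasm run in this message (Kaby Lake). */
static const struct bench results[] = {
    { "addition",     483.9,  250.9 },
    { "average",     1160.7, 1155.7 },
    { "grainextract", 1158.2, 1152.9 },
    { "grainmerge",  1156.9, 1153.9 },
    { "multiply",    1378.9, 1158.7 },
    { "screen",      1803.9, 1257.4 },
};

/* Ratio above 1.0 means the AVX2 version is faster than SSE2. */
static double speedup(const struct bench *b)
{
    assert(b->avx2 > 0.0);
    return b->sse2 / b->avx2;
}

/* Print one line per function with its SSE2/AVX2 ratio. */
static void print_speedups(void)
{
    size_t n = sizeof(results) / sizeof(results[0]);
    for (size_t i = 0; i < n; i++)
        printf("%-13s %.2fx\n", results[i].name, speedup(&results[i]));
}
```

Calling `print_speedups()` from `main` lists the ratios. Functions that stay in 8 bits, such as addition, come out close to 2x, while average, grainextract and grainmerge stay within about 1% of the SSE2 timing in this first version of the patch.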