Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function
On 2017-11-10 22:13, James Darnley wrote:
> The IRC log should appear at the link below.
>> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html

Of course, when I try to predict what number an email will get based on the past few, it ends up being out of order. The ffmpeg-devel log I was referring to is here:
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004652.html
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function
2017-11-10 22:13 GMT+01:00 James Darnley:

> On 2017-11-10 14:32, James Darnley wrote:
> > I mentioned previously that using ZMM registers will cause the CPU to
> > reduce its frequency.
> >
> > Gramner said on IRC that a user should spend 20-30% of time in
> > AVX-512/ZMM code for it to be a net gain in speed.
> > From ffmpeg-devel IRC on 2017-10-26
> >> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-October/004622.html
> >> [18:49:26 CEST] J_Darnley: be aware that using zmm registers
> >> induces significant frequency drops which reduces performance of everything
> >> else, so if you want to use 512-bit vectors you better go all in on it to
> >> make up for it. you probably want to spend at least 20-30% of overall
> >> runtime in avx-512 code
> >> [18:50:00 CEST] the alternative is to stay in 256-bit mode
> >> and just leverage new instructions and opmasks
> >
> > This means any cycles you might save by using longer registers, fewer
> > instructions, better instructions, whatever, will be lost because the
> > frequency drops meaning it takes longer to execute overall.
>
> Some details about this can be found in one of Intel's documents:
>
> Intel® 64 and IA-32 Architectures Optimization Reference Manual
> Order Number: 248966-038
> October 2017
> > https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
>
> Specifically section 15.26 "SKYLAKE SERVER POWER MANAGEMENT"
>
> Earlier on the ffmpeg-devel IRC channel I posted a link to Cloudflare's
> blog in which they discuss the effects of running just a few (my words)
> AVX-512/ZMM instructions.
> > https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/
>
> In the worst cases on some of the new processors the frequency drop can
> be 1GHz. In Cloudflare's case just spending about 2.5% of time in a
> cryptography function using AVX-512 was causing a 10% drop in their
> overall performance (requests served per second).
>
> After seeing this and the discussion on IRC I won't commit any of the
> function patches. The functions are not very impressive and are likely
> to make everything else slower.
>
> The IRC log should appear at the link below.
> > https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html

Thanks for the detailed explanation.

Martin
Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function
On 2017-11-10 14:32, James Darnley wrote:
> I mentioned previously that using ZMM registers will cause the CPU to
> reduce its frequency.
>
> Gramner said on IRC that a user should spend 20-30% of time in
> AVX-512/ZMM code for it to be a net gain in speed.
> From ffmpeg-devel IRC on 2017-10-26
>> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-October/004622.html
>> [18:49:26 CEST] J_Darnley: be aware that using zmm registers
>> induces significant frequency drops which reduces performance of everything
>> else, so if you want to use 512-bit vectors you better go all in on it to
>> make up for it. you probably want to spend at least 20-30% of overall
>> runtime in avx-512 code
>> [18:50:00 CEST] the alternative is to stay in 256-bit mode and
>> just leverage new instructions and opmasks
>
> This means any cycles you might save by using longer registers, fewer
> instructions, better instructions, whatever, will be lost because the
> frequency drops meaning it takes longer to execute overall.

Some details about this can be found in one of Intel's documents:

Intel® 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-038
October 2017
> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf

Specifically section 15.26 "SKYLAKE SERVER POWER MANAGEMENT"

Earlier on the ffmpeg-devel IRC channel I posted a link to Cloudflare's blog in which they discuss the effects of running just a few (my words) AVX-512/ZMM instructions.
> https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

In the worst cases on some of the new processors the frequency drop can be 1 GHz. In Cloudflare's case just spending about 2.5% of time in a cryptography function using AVX-512 was causing a 10% drop in their overall performance (requests served per second).

After seeing this and the discussion on IRC I won't commit any of the function patches. The functions are not very impressive and are likely to make everything else slower.

The IRC log should appear at the link below.
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html
Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function
On 2017-11-09 20:42, Martin Vignali wrote:
> I don't want to block this patch, but as you said (in your previous
> version) that this version is not faster, I'm not sure it's worth
> applying.
> You already made "real" AVX-512 versions of other functions in order to
> check the rest of your patches.

I will not apply any of the new AVX-512/ZMM function patches because they need proper testing in a real world situation. Sorry but I don't have time to see whether these few naive length extensions are better. I have my own work to see whether AVX-512/ZMM provides a speed-up. If that pans out then FFmpeg will benefit because some of the work will trickle back to it.

I mentioned previously that using ZMM registers will cause the CPU to reduce its frequency.

Gramner said on IRC that a user should spend 20-30% of time in AVX-512/ZMM code for it to be a net gain in speed.
From ffmpeg-devel IRC on 2017-10-26
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-October/004622.html
> [18:49:26 CEST] J_Darnley: be aware that using zmm registers
> induces significant frequency drops which reduces performance of everything
> else, so if you want to use 512-bit vectors you better go all in on it to
> make up for it. you probably want to spend at least 20-30% of overall runtime
> in avx-512 code
> [18:50:00 CEST] the alternative is to stay in 256-bit mode and just
> leverage new instructions and opmasks

This means any cycles you might save by using longer registers, fewer instructions, better instructions, whatever, will be lost because the frequency drops meaning it takes longer to execute overall.

I don't have time to perform that sort of in-depth testing. I will post the checkasm benchmark results for the 3 patches though.

> $ ./tests/checkasm/checkasm --bench --test=v210enc
> benchmarking with native FFmpeg timers
> nop: 26.0
> checkasm: using random seed 3018512312
> SSSE3:
>  - v210enc.planar_pack [OK]
> AVX:
>  - v210enc.planar_pack [OK]
> AVX2:
>  - v210enc.planar_pack [OK]
> AVX-512:
>  - v210enc.planar_pack [OK]
> checkasm: all 6 tests passed
> v210_planar_pack_8_c: 1726.5
> v210_planar_pack_8_ssse3: 308.5
> v210_planar_pack_8_avx: 313.5
> v210_planar_pack_8_avx2: 213.5
> v210_planar_pack_10_c: 1424.0
> v210_planar_pack_10_ssse3: 301.0
> v210_planar_pack_10_avx2: 227.5
> v210_planar_pack_10_avx512: 229.5
Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function
2017-11-09 12:58 GMT+01:00 James Darnley:

> ---
>  libavcodec/x86/v210enc.asm    | 5 +
>  libavcodec/x86/v210enc_init.c | 7 +++
>  2 files changed, 12 insertions(+)
>
> diff --git a/libavcodec/x86/v210enc.asm b/libavcodec/x86/v210enc.asm
> index 965f2bea3c..5068af27f8 100644
> --- a/libavcodec/x86/v210enc.asm
> +++ b/libavcodec/x86/v210enc.asm
> @@ -103,6 +103,11 @@ INIT_YMM avx2
>  v210_planar_pack_10
>  %endif
>
> +%if HAVE_AVX512_EXTERNAL
> +INIT_YMM avx512
> +v210_planar_pack_10
> +%endif
> +
>  %macro v210_planar_pack_8 0
>
>  ; v210_planar_pack_8(const uint8_t *y, const uint8_t *u, const uint8_t *v, uint8_t *dst, ptrdiff_t width)
> diff --git a/libavcodec/x86/v210enc_init.c b/libavcodec/x86/v210enc_init.c
> index e997b4b67a..e8aac373a0 100644
> --- a/libavcodec/x86/v210enc_init.c
> +++ b/libavcodec/x86/v210enc_init.c
> @@ -32,6 +32,9 @@ void ff_v210_planar_pack_10_ssse3(const uint16_t *y, const uint16_t *u,
>  void ff_v210_planar_pack_10_avx2(const uint16_t *y, const uint16_t *u,
>                                   const uint16_t *v, uint8_t *dst,
>                                   ptrdiff_t width);
> +void ff_v210_planar_pack_10_avx512(const uint16_t *y, const uint16_t *u,
> +                                   const uint16_t *v, uint8_t *dst,
> +                                   ptrdiff_t width);
>
>  av_cold void ff_v210enc_init_x86(V210EncContext *s)
>  {
> @@ -51,4 +54,8 @@ av_cold void ff_v210enc_init_x86(V210EncContext *s)
>          s->sample_factor_10 = 2;
>          s->pack_line_10 = ff_v210_planar_pack_10_avx2;
>      }
> +
> +    if (EXTERNAL_AVX512(cpu_flags)) {
> +        s->pack_line_10 = ff_v210_planar_pack_10_avx512;
> +    }
>  }
> --

I don't want to block this patch, but as you said (in your previous version) that this version is not faster, I'm not sure it's worth applying.
You already made "real" AVX-512 versions of other functions in order to check the rest of your patches.

Martin
[FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function
---
 libavcodec/x86/v210enc.asm    | 5 +
 libavcodec/x86/v210enc_init.c | 7 +++
 2 files changed, 12 insertions(+)

diff --git a/libavcodec/x86/v210enc.asm b/libavcodec/x86/v210enc.asm
index 965f2bea3c..5068af27f8 100644
--- a/libavcodec/x86/v210enc.asm
+++ b/libavcodec/x86/v210enc.asm
@@ -103,6 +103,11 @@ INIT_YMM avx2
 v210_planar_pack_10
 %endif

+%if HAVE_AVX512_EXTERNAL
+INIT_YMM avx512
+v210_planar_pack_10
+%endif
+
 %macro v210_planar_pack_8 0

 ; v210_planar_pack_8(const uint8_t *y, const uint8_t *u, const uint8_t *v, uint8_t *dst, ptrdiff_t width)
diff --git a/libavcodec/x86/v210enc_init.c b/libavcodec/x86/v210enc_init.c
index e997b4b67a..e8aac373a0 100644
--- a/libavcodec/x86/v210enc_init.c
+++ b/libavcodec/x86/v210enc_init.c
@@ -32,6 +32,9 @@ void ff_v210_planar_pack_10_ssse3(const uint16_t *y, const uint16_t *u,
 void ff_v210_planar_pack_10_avx2(const uint16_t *y, const uint16_t *u,
                                  const uint16_t *v, uint8_t *dst,
                                  ptrdiff_t width);
+void ff_v210_planar_pack_10_avx512(const uint16_t *y, const uint16_t *u,
+                                   const uint16_t *v, uint8_t *dst,
+                                   ptrdiff_t width);

 av_cold void ff_v210enc_init_x86(V210EncContext *s)
 {
@@ -51,4 +54,8 @@ av_cold void ff_v210enc_init_x86(V210EncContext *s)
         s->sample_factor_10 = 2;
         s->pack_line_10 = ff_v210_planar_pack_10_avx2;
     }
+
+    if (EXTERNAL_AVX512(cpu_flags)) {
+        s->pack_line_10 = ff_v210_planar_pack_10_avx512;
+    }
 }
--
2.15.0