Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function

2017-11-13 Thread James Darnley
On 2017-11-10 22:13, James Darnley wrote:
> The IRC log should appear at the link below.
>> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html

Of course, when I try to predict what number an email will get based on
the past few, it ends up being out of order.

The ffmpeg-devel log I was referring to is here:
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004652.html

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function

2017-11-13 Thread Martin Vignali
2017-11-10 22:13 GMT+01:00 James Darnley :

> On 2017-11-10 14:32, James Darnley wrote:
> > I mentioned previously that using ZMM registers will cause the CPU to
> > reduce its frequency.
> >
> > Gramner said on IRC that a user should spend 20-30% of time in
> > AVX-512/ZMM code for it to be a net gain in speed.
> > From ffmpeg-devel IRC on 2017-10-26
> >> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-October/004622.html
> >> [18:49:26 CEST]  J_Darnley: be aware that using zmm registers
> >> induces significant frequency drops which reduces performance of everything
> >> else, so if you want to use 512-bit vectors you better go all in on it to
> >> make up for it. you probably want to spend at least 20-30% of overall
> >> runtime in avx-512 code
> >> [18:50:00 CEST]  the alternative is to stay in 256-bit mode and
> >> just leverage new instructions and opmasks
> >
> > This means any cycles you might save by using longer registers, fewer
> > instructions, better instructions, whatever, will be lost because the
> > frequency drops meaning it takes longer to execute overall.
>
> Some details about this can be found in one of Intel's documents: Intel®
> 64 and IA-32 Architectures Optimization Reference Manual
> Order Number: 248966-038
> October 2017
> > https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
> Specifically section 15.26 "SKYLAKE SERVER POWER MANAGEMENT"
>
> Earlier on the ffmpeg-devel IRC channel I posted a link to Cloudflare's
> blog in which they discuss the effects of running just a few (my words)
> AVX-512/ZMM instructions.
> > https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/
>
> In the worst cases on some of the new processors the frequency drop can
> be 1GHz.  In Cloudflare's case just spending about 2.5% of time in a
> cryptography function using AVX-512 was causing a 10% drop in their
> overall performance (requests served per second).
>
> After seeing this and the discussion on IRC I won't commit any of the
> function patches.  The functions are not very impressive and are likely
> to make everything else slower.
>
> The IRC log should appear at the link below.
> > https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html
>

Thanks for the detailed explanations.

Martin


Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function

2017-11-10 Thread James Darnley
On 2017-11-10 14:32, James Darnley wrote:
> I mentioned previously that using ZMM registers will cause the CPU to
> reduce its frequency.
> 
> Gramner said on IRC that a user should spend 20-30% of time in
> AVX-512/ZMM code for it to be a net gain in speed.
> From ffmpeg-devel IRC on 2017-10-26
>> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-October/004622.html
>> [18:49:26 CEST]  J_Darnley: be aware that using zmm registers 
>> induces significant frequency drops which reduces performance of everything 
>> else, so if you want to use 512-bit vectors you better go all in on it to 
>> make up for it. you probably want to spend at least 20-30% of overall 
>> runtime in avx-512 code
>> [18:50:00 CEST]  the alternative is to stay in 256-bit mode and 
>> just leverage new instructions and opmasks
> 
> This means any cycles you might save by using longer registers, fewer
> instructions, better instructions, whatever, will be lost because the
> frequency drops meaning it takes longer to execute overall.

Some details about this can be found in one of Intel's documents: Intel®
64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-038
October 2017
> https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf
Specifically section 15.26 "SKYLAKE SERVER POWER MANAGEMENT"

Earlier on the ffmpeg-devel IRC channel I posted a link to Cloudflare's
blog in which they discuss the effects of running just a few (my words)
AVX-512/ZMM instructions.
> https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

In the worst cases on some of the new processors the frequency drop can
be 1GHz.  In Cloudflare's case just spending about 2.5% of time in a
cryptography function using AVX-512 was causing a 10% drop in their
overall performance (requests served per second).

After seeing this and the discussion on IRC I won't commit any of the
function patches.  The functions are not very impressive and are likely
to make everything else slower.

The IRC log should appear at the link below.
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-November/004651.html



Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function

2017-11-10 Thread James Darnley
On 2017-11-09 20:42, Martin Vignali wrote:
> I don't want to block this patch, but
> as you said (in your previous version) that this version is not faster,
> I'm not sure it's worth applying.
> You already made "real" AVX-512 versions of other funcs, in order to check
> the rest of your patches.

I will not apply any of the new AVX-512/ZMM function patches because
they need proper testing in a real world situation.  Sorry but I don't
have time to see whether these few naive length extensions are better.
I have my own work to see whether AVX-512/ZMM provides a speed-up.  If
that pans out then FFmpeg will benefit because some of the work will
trickle back to it.

I mentioned previously that using ZMM registers will cause the CPU to
reduce its frequency.

Gramner said on IRC that a user should spend at least 20-30% of overall
runtime in AVX-512/ZMM code for it to be a net gain in speed.
From ffmpeg-devel IRC on 2017-10-26
> https://lists.ffmpeg.org/pipermail/ffmpeg-devel-irc/2017-October/004622.html
> [18:49:26 CEST]  J_Darnley: be aware that using zmm registers 
> induces significant frequency drops which reduces performance of everything 
> else, so if you want to use 512-bit vectors you better go all in on it to 
> make up for it. you probably want to spend at least 20-30% of overall runtime 
> in avx-512 code
> [18:50:00 CEST]  the alternative is to stay in 256-bit mode and just 
> leverage new instructions and opmasks

This means any cycles you might save by using longer registers, fewer
instructions, better instructions, whatever, will be lost because the
frequency drops, meaning everything takes longer to execute overall.
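
As a rough illustration of that trade-off, the sketch below uses a simplified
model that is an assumption and not a measurement: once 512-bit code is in
use, the whole process is taken to run at a reduced clock. The names p, s and
r, and both function names, are hypothetical and appear nowhere in FFmpeg.

```c
#include <math.h>

/*
 * Simplified model (an assumption, not a measurement):
 *   p = fraction of baseline runtime spent in code that gets vectorised
 *   s = speed-up of that code at equal clock (s > 1)
 *   r = reduced clock / baseline clock (0 < r <= 1)
 *
 * Runtime relative to the baseline; values below 1.0 are a net gain.
 */
static double relative_runtime(double p, double s, double r)
{
    return ((1.0 - p) + p / s) / r;
}

/*
 * Fraction p at which relative_runtime() equals 1.0, i.e. break-even.
 * Returns NAN when the inputs are outside the model's range.
 * A result above 1.0 means break-even cannot be reached at all.
 */
static double break_even_fraction(double s, double r)
{
    if (s <= 1.0 || r <= 0.0 || r > 1.0)
        return NAN;
    return (1.0 - r) / (1.0 - 1.0 / s);
}
```

With made-up inputs s = 2.0 and r = 0.9, break_even_fraction() gives about
0.2, which is in the same range as the 20-30% figure quoted above. Real
frequency behaviour is more complicated than a single ratio, so this only
shows the shape of the argument.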

I don't have time to perform that sort of in-depth testing.

I will post the checkasm benchmark results for the 3 patches though.

> $ ./tests/checkasm/checkasm --bench --test=v210enc
> benchmarking with native FFmpeg timers
> nop: 26.0
> checkasm: using random seed 3018512312
> SSSE3:
>  - v210enc.planar_pack [OK]
> AVX:
>  - v210enc.planar_pack [OK]
> AVX2:
>  - v210enc.planar_pack [OK]
> AVX-512:
>  - v210enc.planar_pack [OK]
> checkasm: all 6 tests passed
> v210_planar_pack_8_c: 1726.5
> v210_planar_pack_8_ssse3: 308.5
> v210_planar_pack_8_avx: 313.5
> v210_planar_pack_8_avx2: 213.5
> v210_planar_pack_10_c: 1424.0
> v210_planar_pack_10_ssse3: 301.0
> v210_planar_pack_10_avx2: 227.5
> v210_planar_pack_10_avx512: 229.5
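
To make the comparison explicit, here is a small sketch that turns the 10-bit
timings above into ratios. The struct and function names are hypothetical
helpers; only the four numbers come from the checkasm output, where a lower
value means a faster function.

```c
#include <stddef.h>

struct bench {
    const char *name;
    double      time;   /* checkasm timing, lower is faster */
};

/* v210_planar_pack_10 timings copied from the checkasm output above. */
static const struct bench pack_10[] = {
    { "c",      1424.0 },
    { "ssse3",   301.0 },
    { "avx2",    227.5 },
    { "avx512",  229.5 },
};

/* Ratio of two timings; above 1.0 means `candidate` is the faster one. */
static double speedup(double reference, double candidate)
{
    return reference / candidate;
}

/* Index of the entry with the smallest timing. */
static size_t fastest(const struct bench *b, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (b[i].time < b[best].time)
            best = i;
    return best;
}
```

On these numbers the AVX2 entry is the fastest, and
speedup(227.5, 229.5) is slightly below 1.0, so the AVX-512 entry is
marginally slower than AVX2 in this run.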



Re: [FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function

2017-11-09 Thread Martin Vignali
2017-11-09 12:58 GMT+01:00 James Darnley :

> ---
>  libavcodec/x86/v210enc.asm    | 5 +++++
>  libavcodec/x86/v210enc_init.c | 7 +++++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/libavcodec/x86/v210enc.asm b/libavcodec/x86/v210enc.asm
> index 965f2bea3c..5068af27f8 100644
> --- a/libavcodec/x86/v210enc.asm
> +++ b/libavcodec/x86/v210enc.asm
> @@ -103,6 +103,11 @@ INIT_YMM avx2
>  v210_planar_pack_10
>  %endif
>
> +%if HAVE_AVX512_EXTERNAL
> +INIT_YMM avx512
> +v210_planar_pack_10
> +%endif
> +
>  %macro v210_planar_pack_8 0
>
>  ; v210_planar_pack_8(const uint8_t *y, const uint8_t *u, const uint8_t *v, uint8_t *dst, ptrdiff_t width)
> diff --git a/libavcodec/x86/v210enc_init.c b/libavcodec/x86/v210enc_init.c
> index e997b4b67a..e8aac373a0 100644
> --- a/libavcodec/x86/v210enc_init.c
> +++ b/libavcodec/x86/v210enc_init.c
> @@ -32,6 +32,9 @@ void ff_v210_planar_pack_10_ssse3(const uint16_t *y, const uint16_t *u,
>  void ff_v210_planar_pack_10_avx2(const uint16_t *y, const uint16_t *u,
>   const uint16_t *v, uint8_t *dst,
>   ptrdiff_t width);
> +void ff_v210_planar_pack_10_avx512(const uint16_t *y, const uint16_t *u,
> +   const uint16_t *v, uint8_t *dst,
> +   ptrdiff_t width);
>
>  av_cold void ff_v210enc_init_x86(V210EncContext *s)
>  {
> @@ -51,4 +54,8 @@ av_cold void ff_v210enc_init_x86(V210EncContext *s)
>  s->sample_factor_10 = 2;
>  s->pack_line_10 = ff_v210_planar_pack_10_avx2;
>  }
> +
> +if (EXTERNAL_AVX512(cpu_flags)) {
> +s->pack_line_10 = ff_v210_planar_pack_10_avx512;
> +}
>  }
> --
>
>
I don't want to block this patch, but
as you said (in your previous version) that this version is not faster,
I'm not sure it's worth applying.
You already made "real" AVX-512 versions of other funcs, in order to check
the rest of your patches.

Martin


[FFmpeg-devel] [PATCH 08/11] avcodec/v210enc: add AVX-512 10-bit line pack function

2017-11-09 Thread James Darnley
---
 libavcodec/x86/v210enc.asm    | 5 +++++
 libavcodec/x86/v210enc_init.c | 7 +++++++
 2 files changed, 12 insertions(+)

diff --git a/libavcodec/x86/v210enc.asm b/libavcodec/x86/v210enc.asm
index 965f2bea3c..5068af27f8 100644
--- a/libavcodec/x86/v210enc.asm
+++ b/libavcodec/x86/v210enc.asm
@@ -103,6 +103,11 @@ INIT_YMM avx2
 v210_planar_pack_10
 %endif
 
+%if HAVE_AVX512_EXTERNAL
+INIT_YMM avx512
+v210_planar_pack_10
+%endif
+
 %macro v210_planar_pack_8 0
 
 ; v210_planar_pack_8(const uint8_t *y, const uint8_t *u, const uint8_t *v, uint8_t *dst, ptrdiff_t width)
diff --git a/libavcodec/x86/v210enc_init.c b/libavcodec/x86/v210enc_init.c
index e997b4b67a..e8aac373a0 100644
--- a/libavcodec/x86/v210enc_init.c
+++ b/libavcodec/x86/v210enc_init.c
@@ -32,6 +32,9 @@ void ff_v210_planar_pack_10_ssse3(const uint16_t *y, const uint16_t *u,
 void ff_v210_planar_pack_10_avx2(const uint16_t *y, const uint16_t *u,
                                  const uint16_t *v, uint8_t *dst,
                                  ptrdiff_t width);
+void ff_v210_planar_pack_10_avx512(const uint16_t *y, const uint16_t *u,
+                                   const uint16_t *v, uint8_t *dst,
+                                   ptrdiff_t width);
 
 av_cold void ff_v210enc_init_x86(V210EncContext *s)
 {
@@ -51,4 +54,8 @@ av_cold void ff_v210enc_init_x86(V210EncContext *s)
         s->sample_factor_10 = 2;
         s->pack_line_10 = ff_v210_planar_pack_10_avx2;
     }
+
+    if (EXTERNAL_AVX512(cpu_flags)) {
+        s->pack_line_10 = ff_v210_planar_pack_10_avx512;
+    }
 }
-- 
2.15.0
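
For readers unfamiliar with this init pattern, the following is a minimal,
self-contained sketch of how the dispatch in the patch behaves: each CPU
check overwrites the function pointer chosen by the previous one, so the
last matching block wins. Everything here is a stand-in: the flag values,
struct enc_ctx, init_dispatch() and the stub functions are hypothetical, and
the default sample factor of 1 is an assumption about the C path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not FFmpeg's AV_CPU_FLAG_* values. */
enum { FLAG_SSSE3 = 1 << 0, FLAG_AVX2 = 1 << 1, FLAG_AVX512 = 1 << 2 };

typedef void (*pack_fn)(const uint16_t *y, const uint16_t *u,
                        const uint16_t *v, uint8_t *dst, ptrdiff_t width);

/* Stand-ins for the real routines: each one only writes an identifying tag. */
#define STUB(name, tag)                                                  \
    static void name(const uint16_t *y, const uint16_t *u,               \
                     const uint16_t *v, uint8_t *dst, ptrdiff_t width)   \
    { (void)y; (void)u; (void)v; (void)width; dst[0] = (tag); }
STUB(pack_c,      0)
STUB(pack_ssse3,  1)
STUB(pack_avx2,   2)
STUB(pack_avx512, 3)

struct enc_ctx {
    pack_fn pack_line_10;
    int     sample_factor_10;
};

static void init_dispatch(struct enc_ctx *s, int cpu_flags)
{
    /* Assumed defaults for the plain C path. */
    s->pack_line_10     = pack_c;
    s->sample_factor_10 = 1;

    if (cpu_flags & FLAG_SSSE3)
        s->pack_line_10 = pack_ssse3;

    if (cpu_flags & FLAG_AVX2) {
        s->sample_factor_10 = 2;
        s->pack_line_10     = pack_avx2;
    }

    /* As in the patch, only the pointer changes here; sample_factor_10
     * keeps whatever the AVX2 branch set. */
    if (cpu_flags & FLAG_AVX512)
        s->pack_line_10 = pack_avx512;
}
```

In this sketch a CPU reporting all three flags ends up with pack_avx512 and a
sample factor of 2, which mirrors the patch: the new block reuses the
256-bit (INIT_YMM) function body and leaves the factor set by the AVX2 block.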
