Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
On 2017-11-27 17:50, Henrik Gramner wrote:
> On Sun, Nov 26, 2017 at 11:51 PM, James Darnley wrote:
>> -pd_0_int_min: times 2 dd 0, -2147483648
>> -pq_int_min:   times 2 dq -2147483648
>> -pq_int_max:   times 2 dq 2147483647
>> +pd_0_int_min: times 4 dd 0, -2147483648
>> +pq_int_min:   times 4 dq -2147483648
>> +pq_int_max:   times 4 dq 2147483647
>
> Using 128-bit broadcasts is preferable over duplicating the constants
> to 256-bit unless there's a good reason for doing so, since it wastes
> less cache and is faster on AMD CPUs.

At first I thought this sounded like a possible candidate for x86-64-only
optimisation because I have run out of registers on x86.  However, that
shortage is in the inner loop, while these constants are used in the outer
loop or just once, so I have some room.

Do you want to block the patch set while I change this, or can it be left
for another time?  Thanks for the suggestion anyway.

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
>> Using 128-bit broadcasts is preferable over duplicating the constants
>> to 256-bit unless there's a good reason for doing so, since it wastes
>> less cache and is faster on AMD CPUs.
>
> What would that reason be? Afaik broadcasts are expensive, since they
> both load from memory then splat data across lanes. Using them inside
> loops doesn't sound like a good idea. But I guess you have more
> experience testing with more varied chips than I do.

128-bit broadcasts from memory are done in the load unit for free on all
AVX2-capable CPUs.

> Also, by AMD CPUs you mean Ryzen? Because on Bulldozer-based CPUs we
> purposely disabled functions using ymm regs.

Yes. 128-bit broadcasts have twice the throughput compared to 256-bit
loads on Ryzen since it only has 128-bit load units.
Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
On 11/27/2017 1:50 PM, Henrik Gramner wrote: > On Sun, Nov 26, 2017 at 11:51 PM, James Darnley> wrote: >> -pd_0_int_min: times 2 dd 0, -2147483648 >> -pq_int_min: times 2 dq -2147483648 >> -pq_int_max: times 2 dq 2147483647 >> +pd_0_int_min: times 4 dd 0, -2147483648 >> +pq_int_min: times 4 dq -2147483648 >> +pq_int_max: times 4 dq 2147483647 > > Using 128-bit broadcasts is preferable over duplicating the constants > to 256-bit unless there's a good reason for doing so since it wastes > less cache and is faster on AMD CPU:s. What would that reason be? Afaik broadcasts are expensive, since they both load from memory then splat data across lanes. Using them inside loops doesn't sound like a good idea. But i guess you have more experience testing with more varied chips than i do. Also, by AMD cpus you mean Ryzen? Because on Bulldozer-based CPUs we purposely disabled functions using ymm regs. ___ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
On Sun, Nov 26, 2017 at 11:51 PM, James Darnley wrote:
> -pd_0_int_min: times 2 dd 0, -2147483648
> -pq_int_min:   times 2 dq -2147483648
> -pq_int_max:   times 2 dq 2147483647
> +pd_0_int_min: times 4 dd 0, -2147483648
> +pq_int_min:   times 4 dq -2147483648
> +pq_int_max:   times 4 dq 2147483647

Using 128-bit broadcasts is preferable over duplicating the constants
to 256-bit unless there's a good reason for doing so, since it wastes
less cache and is faster on AMD CPUs.
Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
On 11/26/2017 8:13 PM, Rostislav Pehlivanov wrote:
> On 26 November 2017 at 22:51, James Darnley wrote:
>
>> When compared to the SSE4.2 version, runtime is reduced by 1 to 26%.
>> The function itself is around 2 times faster.
>> ---
>>  libavcodec/x86/flac_dsp_gpl.asm | 56 +++--
>>  libavcodec/x86/flacdsp_init.c   |  5 +++-
>>  2 files changed, 47 insertions(+), 14 deletions(-)
>>
>> diff --git a/libavcodec/x86/flac_dsp_gpl.asm b/libavcodec/x86/flac_dsp_gpl.asm
>> index 91989ce560..749e66dec8 100644
>> --- a/libavcodec/x86/flac_dsp_gpl.asm
>> +++ b/libavcodec/x86/flac_dsp_gpl.asm
>> @@ -22,11 +22,11 @@
>>
>>  %include "libavutil/x86/x86util.asm"
>>
>> -SECTION_RODATA
>> +SECTION_RODATA 32
>>
>> -pd_0_int_min: times 2 dd 0, -2147483648
>> -pq_int_min:   times 2 dq -2147483648
>> -pq_int_max:   times 2 dq 2147483647
>> +pd_0_int_min: times 4 dd 0, -2147483648
>> +pq_int_min:   times 4 dq -2147483648
>> +pq_int_max:   times 4 dq 2147483647
>>
>>  SECTION .text
>>
>> @@ -123,7 +123,10 @@ RET
>>  %endmacro
>>
>>  %macro PMINSQ 3
>> -    pcmpgtq %3, %2, %1
>> +    mova    %3, %2
>> +    ; We cannot use the 3-operand format because the memory location cannot be
>> +    ; the second operand, only the third.
>> +    pcmpgtq %3, %1
>
> I don't get it, how did it work before then?
>
>>      pand    %1, %3
>>      pandn   %3, %2
>>      por     %1, %3
>> @@ -177,11 +180,11 @@ lea resq, [resq+orderq*4]
>>      lea     smpq,   [smpq+orderq*4]
>>      lea     coefsq, [coefsq+orderq*4]
>>      sub     length, orderd
>> -    movd    m3, r5m
>> +    movd    xm3, r5m
>>      neg     orderq
>>
>>      movu    m4, [pd_0_int_min]  ; load 1 bit
>> -    psrad   m4, m3              ; turn that into shift+1 bits
>> +    psrad   m4, xm3             ; turn that into shift+1 bits
>>      pslld   m4, 1               ; reduce that
>>      mova    [rsp], m4           ; save sign extend mask
>>
>> @@ -197,8 +200,20 @@ mova [rsp], m4 ; save sign extend mask
>>      xor     negj, negj
>>
>>  .looporder1:
>> +%if cpuflag(avx)
>> +    vbroadcastss m2, [coefsq+posj*4]
>> +%else
>>      movd    m2, [coefsq+posj*4]  ; c = coefs[j]
>>      SPLATD  m2
>> +%endif
>> +%if cpuflag(avx)
>> +    vpmuldq m1, m2, [smpq+negj*4-4]
>> +    vpmuldq m5, m2, [smpq+negj*4-4+mmsize]
>> +    vpmuldq m7, m2, [smpq+negj*4-4+mmsize*2]
>> +    vpaddq  m0, m1
>> +    vpaddq  m4, m5
>> +    vpaddq  m6, m7
>
> Why force VEX encoding for these instructions, on avx no less?

It's avx2 and using ymm regs, not avx.
Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
On 11/26/2017 7:51 PM, James Darnley wrote:
> When compared to the SSE4.2 version, runtime is reduced by 1 to 26%.
> The function itself is around 2 times faster.
> ---
>  libavcodec/x86/flac_dsp_gpl.asm | 56 +++--
>  libavcodec/x86/flacdsp_init.c   |  5 +++-
>  2 files changed, 47 insertions(+), 14 deletions(-)
>
> diff --git a/libavcodec/x86/flac_dsp_gpl.asm b/libavcodec/x86/flac_dsp_gpl.asm
> index 91989ce560..749e66dec8 100644
> --- a/libavcodec/x86/flac_dsp_gpl.asm
> +++ b/libavcodec/x86/flac_dsp_gpl.asm
> @@ -22,11 +22,11 @@
>
>  %include "libavutil/x86/x86util.asm"
>
> -SECTION_RODATA
> +SECTION_RODATA 32
>
> -pd_0_int_min: times 2 dd 0, -2147483648
> -pq_int_min:   times 2 dq -2147483648
> -pq_int_max:   times 2 dq 2147483647
> +pd_0_int_min: times 4 dd 0, -2147483648
> +pq_int_min:   times 4 dq -2147483648
> +pq_int_max:   times 4 dq 2147483647
>
>  SECTION .text
>
> @@ -123,7 +123,10 @@ RET
>  %endmacro
>
>  %macro PMINSQ 3
> -    pcmpgtq %3, %2, %1
> +    mova    %3, %2
> +    ; We cannot use the 3-operand format because the memory location cannot be
> +    ; the second operand, only the third.
> +    pcmpgtq %3, %1
>      pand    %1, %3
>      pandn   %3, %2
>      por     %1, %3
> @@ -177,11 +180,11 @@ lea resq, [resq+orderq*4]
>      lea     smpq,   [smpq+orderq*4]
>      lea     coefsq, [coefsq+orderq*4]
>      sub     length, orderd
> -    movd    m3, r5m
> +    movd    xm3, r5m
>      neg     orderq
>
>      movu    m4, [pd_0_int_min]  ; load 1 bit
> -    psrad   m4, m3              ; turn that into shift+1 bits
> +    psrad   m4, xm3             ; turn that into shift+1 bits
>      pslld   m4, 1               ; reduce that
>      mova    [rsp], m4           ; save sign extend mask
>
> @@ -197,8 +200,20 @@ mova [rsp], m4 ; save sign extend mask
>      xor     negj, negj
>
>  .looporder1:
> +%if cpuflag(avx)

Either avx2, or check instead for mmsize == 32

> +    vbroadcastss m2, [coefsq+posj*4]

vpbroadcastd. Or just use the VPBROADCASTD macro to cover both the avx2
and sse4 cases without ifdeffery.

> +%else
>      movd    m2, [coefsq+posj*4]  ; c = coefs[j]
>      SPLATD  m2
> +%endif
> +%if cpuflag(avx)
> +    vpmuldq m1, m2, [smpq+negj*4-4]
> +    vpmuldq m5, m2, [smpq+negj*4-4+mmsize]
> +    vpmuldq m7, m2, [smpq+negj*4-4+mmsize*2]
> +    vpaddq  m0, m1
> +    vpaddq  m4, m5
> +    vpaddq  m6, m7
> +%else
>      movu    m1, [smpq+negj*4-4]  ; s = smp[i-j-1]
>      movu    m5, [smpq+negj*4-4+mmsize]
>      movu    m7, [smpq+negj*4-4+mmsize*2]
> @@ -212,14 +227,15 @@ mova [rsp], m4 ; save sign extend mask
>      paddq   m0, m1               ; p += c * s
>      paddq   m4, m5
>      paddq   m6, m7
> +%endif
>
>      dec     negj
>      inc     posj
>      jnz .looporder1
>
> -    HACK_PSRAQ m0, m3, [rsp], m2   ; p >>= shift
> -    HACK_PSRAQ m4, m3, [rsp], m2
> -    HACK_PSRAQ m6, m3, [rsp], m2
> +    HACK_PSRAQ m0, xm3, [rsp], m2  ; p >>= shift
> +    HACK_PSRAQ m4, xm3, [rsp], m2
> +    HACK_PSRAQ m6, xm3, [rsp], m2
>      CLIPQ   m0, [pq_int_min], [pq_int_max], m2  ; clip(p >> shift)
>      CLIPQ   m4, [pq_int_min], [pq_int_max], m2
>      CLIPQ   m6, [pq_int_min], [pq_int_max], m2
> @@ -241,8 +257,20 @@ mova [rsp], m4 ; save sign extend mask
>      xor     negj, negj
>
>  .looporder2:
> +%if cpuflag(avx)
> +    vbroadcastss m2, [coefsq+posj*4]

Same

> +%else
>      movd    m2, [coefsq+posj*4]  ; c = coefs[j]
>      SPLATD  m2
> +%endif
> +%if cpuflag(avx)
> +    vpmuldq m1, m2, [smpq+negj*4]
> +    vpmuldq m5, m2, [smpq+negj*4+mmsize]
> +    vpmuldq m7, m2, [smpq+negj*4+mmsize*2]
> +    vpaddq  m0, m1
> +    vpaddq  m4, m5
> +    vpaddq  m6, m7
> +%else
>      movu    m1, [smpq+negj*4]  ; s = smp[i-j-1]
>      movu    m5, [smpq+negj*4+mmsize]
>      movu    m7, [smpq+negj*4+mmsize*2]
> @@ -252,14 +280,15 @@ mova [rsp], m4 ; save sign extend mask
>      paddq   m0, m1             ; p += c * s
>      paddq   m4, m5
>      paddq   m6, m7
> +%endif
>
>      dec     negj
>      inc     posj
>      jnz .looporder2
>
> -    HACK_PSRAQ m0, m3, [rsp], m2   ; p >>= shift
> -    HACK_PSRAQ m4, m3, [rsp], m2
> -    HACK_PSRAQ m6, m3, [rsp], m2
> +    HACK_PSRAQ m0, xm3, [rsp], m2  ; p >>= shift
> +    HACK_PSRAQ m4, xm3, [rsp], m2
> +    HACK_PSRAQ m6, xm3, [rsp], m2
>      CLIPQ   m0, [pq_int_min], [pq_int_max], m2  ; clip(p >> shift)
>      CLIPQ   m4, [pq_int_min], [pq_int_max], m2
>      CLIPQ   m6, [pq_int_min], [pq_int_max], m2
> @@ -300,3 +329,4 @@ FUNCTION_BODY_32
>
>  INIT_YMM avx2
>  FUNCTION_BODY_16
> +FUNCTION_BODY_32
> diff --git a/libavcodec/x86/flacdsp_init.c b/libavcodec/x86/flacdsp_init.c
> index f827186c26..fbe70894a0 100644
> --- a/libavcodec/x86/flacdsp_init.c
> +++ b/libavcodec/x86/flacdsp_init.c
> @@
Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
On 2017-11-27 00:13, Rostislav Pehlivanov wrote:
> On 26 November 2017 at 22:51, James Darnley wrote:
>> @@ -123,7 +123,10 @@ RET
>>  %endmacro
>>
>>  %macro PMINSQ 3
>> -    pcmpgtq %3, %2, %1
>> +    mova    %3, %2
>> +    ; We cannot use the 3-operand format because the memory location cannot be
>> +    ; the second operand, only the third.
>> +    pcmpgtq %3, %1
>
> I don't get it, how did it work before then?

Easy: 3-operand instructions were never generated through it, meaning the
operation was always emulated with a move.

>> @@ -197,8 +200,20 @@ mova [rsp], m4 ; save sign extend mask
>>      xor     negj, negj
>>
>>  .looporder1:
>> +%if cpuflag(avx)
>> +    vbroadcastss m2, [coefsq+posj*4]
>> +%else
>>      movd    m2, [coefsq+posj*4]  ; c = coefs[j]
>>      SPLATD  m2
>> +%endif
>> +%if cpuflag(avx)
>> +    vpmuldq m1, m2, [smpq+negj*4-4]
>> +    vpmuldq m5, m2, [smpq+negj*4-4+mmsize]
>> +    vpmuldq m7, m2, [smpq+negj*4-4+mmsize*2]
>> +    vpaddq  m0, m1
>> +    vpaddq  m4, m5
>> +    vpaddq  m6, m7
>
> Why force VEX encoding for these instructions, on avx no less?

Not sure. Legacy code written before I knew what I was doing? Perhaps some
issue arose with the assembler or x86inc at that time and this is how I
worked around it.
Re: [FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
On 26 November 2017 at 22:51, James Darnley wrote:
> When compared to the SSE4.2 version, runtime is reduced by 1 to 26%.
> The function itself is around 2 times faster.
> ---
>  libavcodec/x86/flac_dsp_gpl.asm | 56 +++--
>  libavcodec/x86/flacdsp_init.c   |  5 +++-
>  2 files changed, 47 insertions(+), 14 deletions(-)
>
> diff --git a/libavcodec/x86/flac_dsp_gpl.asm b/libavcodec/x86/flac_dsp_gpl.asm
> index 91989ce560..749e66dec8 100644
> --- a/libavcodec/x86/flac_dsp_gpl.asm
> +++ b/libavcodec/x86/flac_dsp_gpl.asm
> @@ -22,11 +22,11 @@
>
>  %include "libavutil/x86/x86util.asm"
>
> -SECTION_RODATA
> +SECTION_RODATA 32
>
> -pd_0_int_min: times 2 dd 0, -2147483648
> -pq_int_min:   times 2 dq -2147483648
> -pq_int_max:   times 2 dq 2147483647
> +pd_0_int_min: times 4 dd 0, -2147483648
> +pq_int_min:   times 4 dq -2147483648
> +pq_int_max:   times 4 dq 2147483647
>
>  SECTION .text
>
> @@ -123,7 +123,10 @@ RET
>  %endmacro
>
>  %macro PMINSQ 3
> -    pcmpgtq %3, %2, %1
> +    mova    %3, %2
> +    ; We cannot use the 3-operand format because the memory location cannot be
> +    ; the second operand, only the third.
> +    pcmpgtq %3, %1

I don't get it, how did it work before then?

>      pand    %1, %3
>      pandn   %3, %2
>      por     %1, %3
> @@ -177,11 +180,11 @@ lea resq, [resq+orderq*4]
>      lea     smpq,   [smpq+orderq*4]
>      lea     coefsq, [coefsq+orderq*4]
>      sub     length, orderd
> -    movd    m3, r5m
> +    movd    xm3, r5m
>      neg     orderq
>
>      movu    m4, [pd_0_int_min]  ; load 1 bit
> -    psrad   m4, m3              ; turn that into shift+1 bits
> +    psrad   m4, xm3             ; turn that into shift+1 bits
>      pslld   m4, 1               ; reduce that
>      mova    [rsp], m4           ; save sign extend mask
>
> @@ -197,8 +200,20 @@ mova [rsp], m4 ; save sign extend mask
>      xor     negj, negj
>
>  .looporder1:
> +%if cpuflag(avx)
> +    vbroadcastss m2, [coefsq+posj*4]
> +%else
>      movd    m2, [coefsq+posj*4]  ; c = coefs[j]
>      SPLATD  m2
> +%endif
> +%if cpuflag(avx)
> +    vpmuldq m1, m2, [smpq+negj*4-4]
> +    vpmuldq m5, m2, [smpq+negj*4-4+mmsize]
> +    vpmuldq m7, m2, [smpq+negj*4-4+mmsize*2]
> +    vpaddq  m0, m1
> +    vpaddq  m4, m5
> +    vpaddq  m6, m7

Why force VEX encoding for these instructions, on avx no less?

> +%else
>      movu    m1, [smpq+negj*4-4]  ; s = smp[i-j-1]
>      movu    m5, [smpq+negj*4-4+mmsize]
>      movu    m7, [smpq+negj*4-4+mmsize*2]
> @@ -212,14 +227,15 @@ mova [rsp], m4 ; save sign extend mask
>      paddq   m0, m1               ; p += c * s
>      paddq   m4, m5
>      paddq   m6, m7
> +%endif
>
>      dec     negj
>      inc     posj
>      jnz .looporder1
>
> -    HACK_PSRAQ m0, m3, [rsp], m2   ; p >>= shift
> -    HACK_PSRAQ m4, m3, [rsp], m2
> -    HACK_PSRAQ m6, m3, [rsp], m2
> +    HACK_PSRAQ m0, xm3, [rsp], m2  ; p >>= shift
> +    HACK_PSRAQ m4, xm3, [rsp], m2
> +    HACK_PSRAQ m6, xm3, [rsp], m2
>      CLIPQ   m0, [pq_int_min], [pq_int_max], m2  ; clip(p >> shift)
>      CLIPQ   m4, [pq_int_min], [pq_int_max], m2
>      CLIPQ   m6, [pq_int_min], [pq_int_max], m2
> @@ -241,8 +257,20 @@ mova [rsp], m4 ; save sign extend mask
>      xor     negj, negj
>
>  .looporder2:
> +%if cpuflag(avx)
> +    vbroadcastss m2, [coefsq+posj*4]
> +%else
>      movd    m2, [coefsq+posj*4]  ; c = coefs[j]
>      SPLATD  m2
> +%endif
> +%if cpuflag(avx)
> +    vpmuldq m1, m2, [smpq+negj*4]
> +    vpmuldq m5, m2, [smpq+negj*4+mmsize]
> +    vpmuldq m7, m2, [smpq+negj*4+mmsize*2]
> +    vpaddq  m0, m1
> +    vpaddq  m4, m5
> +    vpaddq  m6, m7
> +%else
>      movu    m1, [smpq+negj*4]  ; s = smp[i-j-1]
>      movu    m5, [smpq+negj*4+mmsize]
>      movu    m7, [smpq+negj*4+mmsize*2]
> @@ -252,14 +280,15 @@ mova [rsp], m4 ; save sign extend mask
>      paddq   m0, m1             ; p += c * s
>      paddq   m4, m5
>      paddq   m6, m7
> +%endif
>
>      dec     negj
>      inc     posj
>      jnz .looporder2
>
> -    HACK_PSRAQ m0, m3, [rsp], m2   ; p >>= shift
> -    HACK_PSRAQ m4, m3, [rsp], m2
> -    HACK_PSRAQ m6, m3, [rsp], m2
> +    HACK_PSRAQ m0, xm3, [rsp], m2  ; p >>= shift
> +    HACK_PSRAQ m4, xm3, [rsp], m2
> +    HACK_PSRAQ m6, xm3, [rsp], m2
>      CLIPQ   m0, [pq_int_min], [pq_int_max], m2  ; clip(p >> shift)
>      CLIPQ   m4, [pq_int_min], [pq_int_max], m2
>      CLIPQ   m6, [pq_int_min], [pq_int_max], m2
> @@ -300,3 +329,4 @@ FUNCTION_BODY_32
>
>  INIT_YMM avx2
>  FUNCTION_BODY_16
> +FUNCTION_BODY_32
> diff --git a/libavcodec/x86/flacdsp_init.c b/libavcodec/x86/flacdsp_init.c
> index f827186c26..fbe70894a0 100644
> --- a/libavcodec/x86/flacdsp_init.c
> +++ b/libavcodec/x86/flacdsp_init.c
> @@ -30,6 +30,7 @@ void
[FFmpeg-devel] [PATCH 7/8] lavc/flacenc: add AVX2 version of the 32-bit LPC encoder
When compared to the SSE4.2 version, runtime is reduced by 1 to 26%.
The function itself is around 2 times faster.
---
 libavcodec/x86/flac_dsp_gpl.asm | 56 +++--
 libavcodec/x86/flacdsp_init.c   |  5 +++-
 2 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/libavcodec/x86/flac_dsp_gpl.asm b/libavcodec/x86/flac_dsp_gpl.asm
index 91989ce560..749e66dec8 100644
--- a/libavcodec/x86/flac_dsp_gpl.asm
+++ b/libavcodec/x86/flac_dsp_gpl.asm
@@ -22,11 +22,11 @@
 
 %include "libavutil/x86/x86util.asm"
 
-SECTION_RODATA
+SECTION_RODATA 32
 
-pd_0_int_min: times 2 dd 0, -2147483648
-pq_int_min:   times 2 dq -2147483648
-pq_int_max:   times 2 dq 2147483647
+pd_0_int_min: times 4 dd 0, -2147483648
+pq_int_min:   times 4 dq -2147483648
+pq_int_max:   times 4 dq 2147483647
 
 SECTION .text
 
@@ -123,7 +123,10 @@ RET
 %endmacro
 
 %macro PMINSQ 3
-    pcmpgtq %3, %2, %1
+    mova    %3, %2
+    ; We cannot use the 3-operand format because the memory location cannot be
+    ; the second operand, only the third.
+    pcmpgtq %3, %1
     pand    %1, %3
     pandn   %3, %2
     por     %1, %3
@@ -177,11 +180,11 @@ lea resq, [resq+orderq*4]
    lea     smpq,   [smpq+orderq*4]
    lea     coefsq, [coefsq+orderq*4]
    sub     length, orderd
-   movd    m3, r5m
+   movd    xm3, r5m
    neg     orderq
 
    movu    m4, [pd_0_int_min]  ; load 1 bit
-   psrad   m4, m3              ; turn that into shift+1 bits
+   psrad   m4, xm3             ; turn that into shift+1 bits
    pslld   m4, 1               ; reduce that
    mova    [rsp], m4           ; save sign extend mask
 
@@ -197,8 +200,20 @@ mova [rsp], m4 ; save sign extend mask
    xor     negj, negj
 
.looporder1:
+%if cpuflag(avx)
+    vbroadcastss m2, [coefsq+posj*4]
+%else
    movd    m2, [coefsq+posj*4]  ; c = coefs[j]
    SPLATD  m2
+%endif
+%if cpuflag(avx)
+    vpmuldq m1, m2, [smpq+negj*4-4]
+    vpmuldq m5, m2, [smpq+negj*4-4+mmsize]
+    vpmuldq m7, m2, [smpq+negj*4-4+mmsize*2]
+    vpaddq  m0, m1
+    vpaddq  m4, m5
+    vpaddq  m6, m7
+%else
    movu    m1, [smpq+negj*4-4]  ; s = smp[i-j-1]
    movu    m5, [smpq+negj*4-4+mmsize]
    movu    m7, [smpq+negj*4-4+mmsize*2]
@@ -212,14 +227,15 @@ mova [rsp], m4 ; save sign extend mask
    paddq   m0, m1               ; p += c * s
    paddq   m4, m5
    paddq   m6, m7
+%endif
 
    dec     negj
    inc     posj
    jnz .looporder1
 
-   HACK_PSRAQ m0, m3, [rsp], m2   ; p >>= shift
-   HACK_PSRAQ m4, m3, [rsp], m2
-   HACK_PSRAQ m6, m3, [rsp], m2
+   HACK_PSRAQ m0, xm3, [rsp], m2  ; p >>= shift
+   HACK_PSRAQ m4, xm3, [rsp], m2
+   HACK_PSRAQ m6, xm3, [rsp], m2
    CLIPQ   m0, [pq_int_min], [pq_int_max], m2  ; clip(p >> shift)
    CLIPQ   m4, [pq_int_min], [pq_int_max], m2
    CLIPQ   m6, [pq_int_min], [pq_int_max], m2
@@ -241,8 +257,20 @@ mova [rsp], m4 ; save sign extend mask
    xor     negj, negj
 
.looporder2:
+%if cpuflag(avx)
+    vbroadcastss m2, [coefsq+posj*4]
+%else
    movd    m2, [coefsq+posj*4]  ; c = coefs[j]
    SPLATD  m2
+%endif
+%if cpuflag(avx)
+    vpmuldq m1, m2, [smpq+negj*4]
+    vpmuldq m5, m2, [smpq+negj*4+mmsize]
+    vpmuldq m7, m2, [smpq+negj*4+mmsize*2]
+    vpaddq  m0, m1
+    vpaddq  m4, m5
+    vpaddq  m6, m7
+%else
    movu    m1, [smpq+negj*4]  ; s = smp[i-j-1]
    movu    m5, [smpq+negj*4+mmsize]
    movu    m7, [smpq+negj*4+mmsize*2]
@@ -252,14 +280,15 @@ mova [rsp], m4 ; save sign extend mask
    paddq   m0, m1             ; p += c * s
    paddq   m4, m5
    paddq   m6, m7
+%endif
 
    dec     negj
    inc     posj
    jnz .looporder2
 
-   HACK_PSRAQ m0, m3, [rsp], m2   ; p >>= shift
-   HACK_PSRAQ m4, m3, [rsp], m2
-   HACK_PSRAQ m6, m3, [rsp], m2
+   HACK_PSRAQ m0, xm3, [rsp], m2  ; p >>= shift
+   HACK_PSRAQ m4, xm3, [rsp], m2
+   HACK_PSRAQ m6, xm3, [rsp], m2
    CLIPQ   m0, [pq_int_min], [pq_int_max], m2  ; clip(p >> shift)
    CLIPQ   m4, [pq_int_min], [pq_int_max], m2
    CLIPQ   m6, [pq_int_min], [pq_int_max], m2
@@ -300,3 +329,4 @@ FUNCTION_BODY_32
 
 INIT_YMM avx2
 FUNCTION_BODY_16
+FUNCTION_BODY_32
diff --git a/libavcodec/x86/flacdsp_init.c b/libavcodec/x86/flacdsp_init.c
index f827186c26..fbe70894a0 100644
--- a/libavcodec/x86/flacdsp_init.c
+++ b/libavcodec/x86/flacdsp_init.c
@@ -30,6 +30,7 @@ void ff_flac_lpc_32_xop(int32_t *samples, const int coeffs[32], int order,
 void ff_flac_enc_lpc_16_sse4(int32_t *, const int32_t *, int, int, const int32_t *, int);
 void ff_flac_enc_lpc_16_avx2(int32_t *, const int32_t *, int, int, const int32_t *, int);
 void ff_flac_enc_lpc_32_sse42(int32_t *, const int32_t *, int, int, const int32_t *, int);
+void ff_flac_enc_lpc_32_avx2(int32_t *, const int32_t *, int, int, const int32_t *, int);
 
 #define DECORRELATE_FUNCS(fmt, opt)