Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-06 Thread Henrik Gramner
On Tue, Sep 6, 2016 at 11:39 AM, Anton Khirnov  wrote:
>> Use 3-arg maxps instead of mova.
>
> Isn't that AVX-only?

It is, x86inc will simply convert it to mova+minps when assembling it
as non-AVX code but it reduces the line count. It's certainly not
worth to go into bikeshedding territory about it however, so if you
prefer to use plain mova:s just keep them.

>> Otherwise LGTM, you could make an AVX version using ymm registers
>> as well in a separate patch if you want to, just need to make sure
>> the buffers are aligned.
>
> This function is only used in two rather obscure places, so probably
> not worth it

Fair enough.
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-06 Thread Anton Khirnov
Quoting Henrik Gramner (2016-09-05 15:15:14)
> On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnov  wrote:
> > +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max
> > +%if ARCH_X86_32
> > +VBROADCASTSS m0, minm
> > +VBROADCASTSS m1, maxm
> > +%else
> > +VBROADCASTSS m0, m0
> > +VBROADCASTSS m1, m1
> > +%endif
> 
> This will fail on WIN64, to deal with the somewhat silly calling
> conventions on that platform you need to do something like
> VBROADCASTSS m0, m3
> VBROADCASTSS m1, maxm
> (not tested, I don't have access to a Windows machine at the moment).
> 
> > +movsxdifnidn lenq, lend
> > +shl lenq, 2
> > +
> > +.loop
> > +sub lenq, 4 * mmsize
> 
> Move the subtraction just before the branch (jg) to allow macro-op
> fusion on modern Intel CPUs.
> 
> > +
> > +mova m2, [srcq + lenq + 0 * mmsize]
> > +mova m3, [srcq + lenq + 1 * mmsize]
> > +mova m4, [srcq + lenq + 2 * mmsize]
> > +mova m5, [srcq + lenq + 3 * mmsize]
> > +
> > +maxps m2, m0
> > +maxps m3, m0
> > +maxps m4, m0
> > +maxps m5, m0
> 
> Use 3-arg maxps instead of mova.

Isn't that AVX-only?

> 
> > +minps m2, m1
> > +minps m3, m1
> > +minps m4, m1
> > +minps m5, m1
> > +
> > +mova [dstq + lenq + 0 * mmsize], m2
> > +mova [dstq + lenq + 1 * mmsize], m3
> > +mova [dstq + lenq + 2 * mmsize], m4
> > +mova [dstq + lenq + 3 * mmsize], m5
> > +
> > +jg .loop
> > +
> > +RET
> 
> Otherwise LGTM, you could make an AVX version using ymm registers as
> well in a separate patch if you want to, just need to make sure the
> buffers are aligned.

This function is only used in two rather obscure places, so probably not
worth it

-- 
Anton Khirnov
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-06 Thread Diego Biurrun
On Mon, Sep 05, 2016 at 01:02:43PM +0200, Anton Khirnov wrote:
> ---
>  libavcodec/x86/Makefile   |  1 -
>  libavcodec/x86/audiodsp.asm   | 42 +++
>  libavcodec/x86/audiodsp_mmx.c | 58 
> ---
>  3 files changed, 42 insertions(+), 59 deletions(-)
>  delete mode 100644 libavcodec/x86/audiodsp_mmx.c

Changes to the init file are missing.

Diego
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-05 Thread James Almer
On 9/5/2016 11:40 AM, James Almer wrote:
> On 9/5/2016 10:15 AM, Henrik Gramner wrote:
>> On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnov  wrote:
 +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max
 +%if ARCH_X86_32
 +VBROADCASTSS m0, minm
 +VBROADCASTSS m1, maxm
 +%else
 +VBROADCASTSS m0, m0
 +VBROADCASTSS m1, m1
 +%endif
>> This will fail on WIN64, to deal with the somewhat silly calling
>> conventions on that platform you need to do something like
>> VBROADCASTSS m0, m3
>> VBROADCASTSS m1, maxm
>> (not tested, I don't have access to a Windows machine at the moment).
> 
> %if WIN64
> SWAP 0, 2
> SWAP 1, 3
> %endif

Nevermind, misread the prototype.

You can use "SWAP 0, 3" for min, but for max you probably need to fetch
it from stack as you suggested.

> 
> Before the x86_64 vbroadcastss should be enough.
> 

___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-05 Thread James Almer
On 9/5/2016 8:02 AM, Anton Khirnov wrote:
> ---
>  libavcodec/x86/Makefile   |  1 -
>  libavcodec/x86/audiodsp.asm   | 42 +++
>  libavcodec/x86/audiodsp_mmx.c | 58 
> ---
>  3 files changed, 42 insertions(+), 59 deletions(-)
>  delete mode 100644 libavcodec/x86/audiodsp_mmx.c

Missing the changes to audiodsp_init.c to make the function depend on
EXTERNAL_SSE instead of INLINE_SSE.

> 
> diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile
> index 906bd0ea..98c27fa 100644
> --- a/libavcodec/x86/Makefile
> +++ b/libavcodec/x86/Makefile
> @@ -58,7 +58,6 @@ OBJS-$(CONFIG_VP9_DECODER) += x86/vp9dsp_init.o
>  
>  # GCC inline assembly optimizations
>  # subsystems
> -MMX-OBJS-$(CONFIG_AUDIODSP)+= x86/audiodsp_mmx.o
>  MMX-OBJS-$(CONFIG_FDCTDSP) += x86/fdct.o
>  MMX-OBJS-$(CONFIG_HPELDSP) += x86/fpel_mmx.o\
>x86/hpeldsp_mmx.o
> diff --git a/libavcodec/x86/audiodsp.asm b/libavcodec/x86/audiodsp.asm
> index 0e3019c..3169249 100644
> --- a/libavcodec/x86/audiodsp.asm
> +++ b/libavcodec/x86/audiodsp.asm
> @@ -136,3 +136,45 @@ VECTOR_CLIP_INT32 11, 1, 1, 0
>  %else
>  VECTOR_CLIP_INT32 6, 1, 0, 0
>  %endif
> +
> +; void ff_vector_clipf_sse(float *dst, const float *src,
> +;  int len, float min, float max)
> +INIT_XMM sse
> +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max
> +%if ARCH_X86_32
> +VBROADCASTSS m0, minm
> +VBROADCASTSS m1, maxm
> +%else

You should SWAP the regs here for Win64 as i mentioned in another email.

> +VBROADCASTSS m0, m0
> +VBROADCASTSS m1, m1
> +%endif
> +
> +movsxdifnidn lenq, lend
> +shl lenq, 2

shl lend, 2

Saves you one instruction by zeroing the high 32 bits implicitly, so you
don't need the movsxd anymore.

___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-05 Thread James Almer
On 9/5/2016 10:15 AM, Henrik Gramner wrote:
> On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnov  wrote:
>> > +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max
>> > +%if ARCH_X86_32
>> > +VBROADCASTSS m0, minm
>> > +VBROADCASTSS m1, maxm
>> > +%else
>> > +VBROADCASTSS m0, m0
>> > +VBROADCASTSS m1, m1
>> > +%endif
> This will fail on WIN64, to deal with the somewhat silly calling
> conventions on that platform you need to do something like
> VBROADCASTSS m0, m3
> VBROADCASTSS m1, maxm
> (not tested, I don't have access to a Windows machine at the moment).

%if WIN64
SWAP 0, 2
SWAP 1, 3
%endif

Before the x86_64 vbroadcastss should be enough.
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-05 Thread Henrik Gramner
On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnov  wrote:
> +shl lenq, 2

You could also skip this shift and just use 4*lenq instead in the
memory operands, multiplying by 2, 4, or 8 in memory args is free.
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel


Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse

2016-09-05 Thread Henrik Gramner
On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnov  wrote:
> +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max
> +%if ARCH_X86_32
> +VBROADCASTSS m0, minm
> +VBROADCASTSS m1, maxm
> +%else
> +VBROADCASTSS m0, m0
> +VBROADCASTSS m1, m1
> +%endif

This will fail on WIN64, to deal with the somewhat silly calling
conventions on that platform you need to do something like
VBROADCASTSS m0, m3
VBROADCASTSS m1, maxm
(not tested, I don't have access to a Windows machine at the moment).

> +movsxdifnidn lenq, lend
> +shl lenq, 2
> +
> +.loop
> +sub lenq, 4 * mmsize

Move the subtraction just before the branch (jg) to allow macro-op
fusion on modern Intel CPUs.

> +
> +mova m2, [srcq + lenq + 0 * mmsize]
> +mova m3, [srcq + lenq + 1 * mmsize]
> +mova m4, [srcq + lenq + 2 * mmsize]
> +mova m5, [srcq + lenq + 3 * mmsize]
> +
> +maxps m2, m0
> +maxps m3, m0
> +maxps m4, m0
> +maxps m5, m0

Use 3-arg maxps instead of mova.

> +minps m2, m1
> +minps m3, m1
> +minps m4, m1
> +minps m5, m1
> +
> +mova [dstq + lenq + 0 * mmsize], m2
> +mova [dstq + lenq + 1 * mmsize], m3
> +mova [dstq + lenq + 2 * mmsize], m4
> +mova [dstq + lenq + 3 * mmsize], m5
> +
> +jg .loop
> +
> +RET

Otherwise LGTM, you could make an AVX version using ymm registers as
well in a separate patch if you want to, just need to make sure the
buffers are aligned.
___
libav-devel mailing list
libav-devel@libav.org
https://lists.libav.org/mailman/listinfo/libav-devel