Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
On Tue, Sep 6, 2016 at 11:39 AM, Anton Khirnovwrote: >> Use 3-arg maxps instead of mova. > > Isn't that AVX-only? It is, x86inc will simply convert it to mova+minps when assembling it as non-AVX code but it reduces the line count. It's certainly not worth to go into bikeshedding territory about it however, so if you prefer to use plain mova:s just keep them. >> Otherwise LGTM, you could make an AVX version using ymm registers >> as well in a separate patch if you want to, just need to make sure >> the buffers are aligned. > > This function is only used in two rather obscure places, so probably > not worth it Fair enough. ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
Quoting Henrik Gramner (2016-09-05 15:15:14) > On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnovwrote: > > +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max > > +%if ARCH_X86_32 > > +VBROADCASTSS m0, minm > > +VBROADCASTSS m1, maxm > > +%else > > +VBROADCASTSS m0, m0 > > +VBROADCASTSS m1, m1 > > +%endif > > This will fail on WIN64, to deal with the somewhat silly calling > conventions on that platform you need to do something like > VBROADCASTSS m0, m3 > VBROADCASTSS m1, maxm > (not tested, I don't have access to a Windows machine at the moment). > > > +movsxdifnidn lenq, lend > > +shl lenq, 2 > > + > > +.loop > > +sub lenq, 4 * mmsize > > Move the subtraction just before the branch (jg) to allow macro-op > fusion on modern Intel CPUs. > > > + > > +mova m2, [srcq + lenq + 0 * mmsize] > > +mova m3, [srcq + lenq + 1 * mmsize] > > +mova m4, [srcq + lenq + 2 * mmsize] > > +mova m5, [srcq + lenq + 3 * mmsize] > > + > > +maxps m2, m0 > > +maxps m3, m0 > > +maxps m4, m0 > > +maxps m5, m0 > > Use 3-arg maxps instead of mova. Isn't that AVX-only? > > > +minps m2, m1 > > +minps m3, m1 > > +minps m4, m1 > > +minps m5, m1 > > + > > +mova [dstq + lenq + 0 * mmsize], m2 > > +mova [dstq + lenq + 1 * mmsize], m3 > > +mova [dstq + lenq + 2 * mmsize], m4 > > +mova [dstq + lenq + 3 * mmsize], m5 > > + > > +jg .loop > > + > > +RET > > Otherwise LGTM, you could make an AVX version using ymm registers as > well in a separate patch if you want to, just need to make sure the > buffers are aligned. This function is only used in two rather obscure places, so probably not worth it -- Anton Khirnov ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
On Mon, Sep 05, 2016 at 01:02:43PM +0200, Anton Khirnov wrote: > --- > libavcodec/x86/Makefile | 1 - > libavcodec/x86/audiodsp.asm | 42 +++ > libavcodec/x86/audiodsp_mmx.c | 58 > --- > 3 files changed, 42 insertions(+), 59 deletions(-) > delete mode 100644 libavcodec/x86/audiodsp_mmx.c Changes to the init file are missing. Diego ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
On 9/5/2016 11:40 AM, James Almer wrote: > On 9/5/2016 10:15 AM, Henrik Gramner wrote: >> On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnovwrote: +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max +%if ARCH_X86_32 +VBROADCASTSS m0, minm +VBROADCASTSS m1, maxm +%else +VBROADCASTSS m0, m0 +VBROADCASTSS m1, m1 +%endif >> This will fail on WIN64, to deal with the somewhat silly calling >> conventions on that platform you need to do something like >> VBROADCASTSS m0, m3 >> VBROADCASTSS m1, maxm >> (not tested, I don't have access to a Windows machine at the moment). > > %if WIN64 > SWAP 0, 2 > SWAP 1, 3 > %endif Nevermind, misread the prototype. You can use "SWAP 0, 3" for min, but for max you probably need to fetch it from stack as you suggested. > > Before the x86_64 vbroadcastss should be enough. > ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
On 9/5/2016 8:02 AM, Anton Khirnov wrote: > --- > libavcodec/x86/Makefile | 1 - > libavcodec/x86/audiodsp.asm | 42 +++ > libavcodec/x86/audiodsp_mmx.c | 58 > --- > 3 files changed, 42 insertions(+), 59 deletions(-) > delete mode 100644 libavcodec/x86/audiodsp_mmx.c Missing the changes to audiodsp_init.c to make the function depend on EXTERNAL_SSE instead of INLINE_SSE. > > diff --git a/libavcodec/x86/Makefile b/libavcodec/x86/Makefile > index 906bd0ea..98c27fa 100644 > --- a/libavcodec/x86/Makefile > +++ b/libavcodec/x86/Makefile > @@ -58,7 +58,6 @@ OBJS-$(CONFIG_VP9_DECODER) += x86/vp9dsp_init.o > > # GCC inline assembly optimizations > # subsystems > -MMX-OBJS-$(CONFIG_AUDIODSP)+= x86/audiodsp_mmx.o > MMX-OBJS-$(CONFIG_FDCTDSP) += x86/fdct.o > MMX-OBJS-$(CONFIG_HPELDSP) += x86/fpel_mmx.o\ >x86/hpeldsp_mmx.o > diff --git a/libavcodec/x86/audiodsp.asm b/libavcodec/x86/audiodsp.asm > index 0e3019c..3169249 100644 > --- a/libavcodec/x86/audiodsp.asm > +++ b/libavcodec/x86/audiodsp.asm > @@ -136,3 +136,45 @@ VECTOR_CLIP_INT32 11, 1, 1, 0 > %else > VECTOR_CLIP_INT32 6, 1, 0, 0 > %endif > + > +; void ff_vector_clipf_sse(float *dst, const float *src, > +; int len, float min, float max) > +INIT_XMM sse > +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max > +%if ARCH_X86_32 > +VBROADCASTSS m0, minm > +VBROADCASTSS m1, maxm > +%else You should SWAP the regs here for Win64 as i mentioned in another email. > +VBROADCASTSS m0, m0 > +VBROADCASTSS m1, m1 > +%endif > + > +movsxdifnidn lenq, lend > +shl lenq, 2 shl lend, 2 Saves you one instruction by zeroing the high 32 bits implicitly, so you don't need the movsxd anymore. ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
On 9/5/2016 10:15 AM, Henrik Gramner wrote: > On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnovwrote: >> > +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max >> > +%if ARCH_X86_32 >> > +VBROADCASTSS m0, minm >> > +VBROADCASTSS m1, maxm >> > +%else >> > +VBROADCASTSS m0, m0 >> > +VBROADCASTSS m1, m1 >> > +%endif > This will fail on WIN64, to deal with the somewhat silly calling > conventions on that platform you need to do something like > VBROADCASTSS m0, m3 > VBROADCASTSS m1, maxm > (not tested, I don't have access to a Windows machine at the moment). %if WIN64 SWAP 0, 2 SWAP 1, 3 %endif Before the x86_64 vbroadcastss should be enough. ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnovwrote: > +shl lenq, 2 You could also skip this shift and just use 4*lenq instead in the memory operands, multiplying by 2, 4, or 8 in memory args is free. ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel
Re: [libav-devel] [PATCH 9/9] audiodsp/x86: yasmify vector_clipf_sse
On Mon, Sep 5, 2016 at 1:02 PM, Anton Khirnovwrote: > +cglobal vector_clipf, 3, 3, 6, dst, src, len, min, max > +%if ARCH_X86_32 > +VBROADCASTSS m0, minm > +VBROADCASTSS m1, maxm > +%else > +VBROADCASTSS m0, m0 > +VBROADCASTSS m1, m1 > +%endif This will fail on WIN64, to deal with the somewhat silly calling conventions on that platform you need to do something like VBROADCASTSS m0, m3 VBROADCASTSS m1, maxm (not tested, I don't have access to a Windows machine at the moment). > +movsxdifnidn lenq, lend > +shl lenq, 2 > + > +.loop > +sub lenq, 4 * mmsize Move the subtraction just before the branch (jg) to allow macro-op fusion on modern Intel CPUs. > + > +mova m2, [srcq + lenq + 0 * mmsize] > +mova m3, [srcq + lenq + 1 * mmsize] > +mova m4, [srcq + lenq + 2 * mmsize] > +mova m5, [srcq + lenq + 3 * mmsize] > + > +maxps m2, m0 > +maxps m3, m0 > +maxps m4, m0 > +maxps m5, m0 Use 3-arg maxps instead of mova. > +minps m2, m1 > +minps m3, m1 > +minps m4, m1 > +minps m5, m1 > + > +mova [dstq + lenq + 0 * mmsize], m2 > +mova [dstq + lenq + 1 * mmsize], m3 > +mova [dstq + lenq + 2 * mmsize], m4 > +mova [dstq + lenq + 3 * mmsize], m5 > + > +jg .loop > + > +RET Otherwise LGTM, you could make an AVX version using ymm registers as well in a separate patch if you want to, just need to make sure the buffers are aligned. ___ libav-devel mailing list libav-devel@libav.org https://lists.libav.org/mailman/listinfo/libav-devel