Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On 18 October 2016 at 21:04, Michael Niedermayer wrote:
> On Tue, Oct 18, 2016 at 05:33:13PM +0100, Rostislav Pehlivanov wrote:
> > On 18 October 2016 at 16:32, James Almer wrote:
> > > On 10/18/2016 12:07 PM, Rostislav Pehlivanov wrote:
> > > > [...]
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On Tue, Oct 18, 2016 at 05:33:13PM +0100, Rostislav Pehlivanov wrote:
> On 18 October 2016 at 16:32, James Almer wrote:
> > On 10/18/2016 12:07 PM, Rostislav Pehlivanov wrote:
> > > [...]
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On 18 October 2016 at 16:32, James Almer wrote:
> On 10/18/2016 12:07 PM, Rostislav Pehlivanov wrote:
> > [...]
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On 10/18/2016 12:07 PM, Rostislav Pehlivanov wrote:
> diff --git a/libavcodec/aacenc.c b/libavcodec/aacenc.c
> index ee3cbf8..622f0ba 100644
> --- a/libavcodec/aacenc.c
> +++ b/libavcodec/aacenc.c
> @@ -1033,6 +1033,12 @@ static av_cold int aac_encode_init(AVCodecContext *avctx)
>      ff_lpc_init(&s->lpc, 2*avctx->frame_size, TNS_MAX_ORDER, FF_LPC_TYPE_LEVINSON);
>      s->random_state = 0x1f2e3d4c;
>
> +    s->abs_pow34 = &abs_pow34_v;
> +    s->quant_bands = &quantize_bands;

No need for & in these.

> +
> +    if (ARCH_X86)
> +        ff_aac_dsp_init_x86(s);
> +
>      if (HAVE_MIPSDSP)
>          ff_aac_coder_init_mips(s);

[...]

> diff --git a/libavcodec/x86/aacencdsp.asm b/libavcodec/x86/aacencdsp.asm
> new file mode 100644
> index 0000000..dd7b022
> --- /dev/null
> +++ b/libavcodec/x86/aacencdsp.asm
> @@ -0,0 +1,88 @@
> +;******************************************************************************
> +;* SIMD optimized AAC encoder DSP functions
> +;*
> +;* Copyright (C) 2016 Rostislav Pehlivanov
> +;*
> +;* This file is part of FFmpeg.
> +;*
> +;* FFmpeg is free software; you can redistribute it and/or
> +;* modify it under the terms of the GNU Lesser General Public
> +;* License as published by the Free Software Foundation; either
> +;* version 2.1 of the License, or (at your option) any later version.
> +;*
> +;* FFmpeg is distributed in the hope that it will be useful,
> +;* but WITHOUT ANY WARRANTY; without even the implied warranty of
> +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> +;* Lesser General Public License for more details.
> +;*
> +;* You should have received a copy of the GNU Lesser General Public
> +;* License along with FFmpeg; if not, write to the Free Software
> +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> +;******************************************************************************
> +
> +%include "libavutil/x86/x86util.asm"
> +
> +SECTION_RODATA
> +
> +float_abs_mask: times 4 dd 0x7fffffff
> +
> +SECTION .text
> +
> +;***************************************************************************
> +;void ff_abs_pow34(float *out, const float *in, const int size);
> +;***************************************************************************
> +INIT_XMM sse
> +cglobal abs_pow34, 3, 3, 3, out, in, size
> +    mova      m2, [float_abs_mask]
> +    shl       sizeq, 2
> +    add       inq, sizeq
> +    add       outq, sizeq
> +    neg       sizeq
> +.loop:
> +    movaps    m0, [inq+sizeq]
> +    andps     m0, m2

Remove the movaps and do

andps m0, m2, [inq+sizeq]

instead. Sorry I didn't notice this last time.

> +    sqrtps    m1, m0
> +    mulps     m0, m1
> +    sqrtps    m0, m0
> +    mova      [outq+sizeq], m0
> +    add       sizeq, mmsize
> +    jl        .loop
> +    RET
> +
> +;***************************************************************************
> +;void ff_aac_quantize_bands(int *out, const float *in, const float *scaled,
> +;                           int size, int is_signed, int maxval, const float Q34,
> +;                           const float rounding)
> +;***************************************************************************
> +INIT_XMM sse2
> +cglobal aac_quantize_bands, 5, 5, 6, out, in, scaled, size, is_signed, maxval, Q34, rounding
> +%if UNIX64 == 0
> +    movss     m0, Q34m
> +    movss     m1, roundingm
> +    cvtsi2ss  m3, maxvald
> +%else
> +    cvtsi2ss  m3, dword maxvalm
> +%endif

The other way around. Unix64 is the one that has maxval in a register
regardless of how you init the function, whereas win64 and any x86_32 target
have it on the stack.

> +    shufps    m0, m0, 0
> +    shufps    m1, m1, 0
> +    shufps    m3, m3, 0
> +    shl       is_signedd, 31
> +    movd      m4, is_signedd
> +    shufps    m4, m4, 0
> +    shl       sized, 2
> +    add       inq, sizeq
> +    add       outq, sizeq
> +    add       scaledq, sizeq
> +    neg       sizeq
> +.loop:
> +    mulps     m2, m0, [scaledq+sizeq]
> +    addps     m2, m1
> +    minps     m2, m3
> +    movaps    m5, [inq+sizeq]
> +    andps     m5, m4

Same as in abs_pow34, remove the movaps and do

andps m5, m4, [inq+sizeq]

> +    orps      m2, m5
> +    cvttps2dq m2, m2
> +    mova      [outq+sizeq], m2
> +    add       sizeq, mmsize
> +    jl        .loop
> +    RET
> diff --git a/libavcodec/x86/aacencdsp_init.c b/libavcodec/x86/aacencdsp_init.c
> new file mode 100644
> index 0000000..aefaa15
> --- /dev/null
> +++ b/libavcodec/x86/aacencdsp_init.c
> @@ -0,0 +1,43 @@
> +/*
> + * AAC encoder assembly optimizations
> + * Copyright (C) 2016 Rostislav Pehlivanov
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without ev
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On 18 October 2016 at 14:51, Michael Niedermayer wrote:
> On Tue, Oct 18, 2016 at 09:02:19AM +0100, Rostislav Pehlivanov wrote:
> > On 17 October 2016 at 23:43, Michael Niedermayer wrote:
> > > On Mon, Oct 17, 2016 at 10:24:48PM +0100, Rostislav Pehlivanov wrote:
> > > > [...]
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On Tue, Oct 18, 2016 at 09:02:19AM +0100, Rostislav Pehlivanov wrote:
> On 17 October 2016 at 23:43, Michael Niedermayer wrote:
> > On Mon, Oct 17, 2016 at 10:24:48PM +0100, Rostislav Pehlivanov wrote:
> > > [...]
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On 17 October 2016 at 23:43, Michael Niedermayer wrote:
> On Mon, Oct 17, 2016 at 10:24:48PM +0100, Rostislav Pehlivanov wrote:
> > Should fix segfaults on x86-32
> >
> > Performance improvements:
> >
> > quant_bands:
> > with: 681 decicycles in quant_bands, 8388453 runs, 155 skips
> > without: 1190 decicycles in quant_bands, 8388386 runs, 222 skips
> > Around 42% for the function
> >
> > Twoloop coder:
> >
> > abs_pow34:
> > with/without: 7.82s/8.17s
> > Around 4% for the entire encoder
> >
> > Both:
> > with/without: 7.15s/8.17s
> > Around 12% for the entire encoder
> >
> > Fast coder:
> >
> > abs_pow34:
> > with/without: 3.40s/3.77s
> > Around 10% for the entire encoder
> >
> > Both:
> > with/without: 3.02s/3.77s
> > Around 20% faster for the entire encoder
> >
> > Signed-off-by: Rostislav Pehlivanov
> > ---
> >  libavcodec/aaccoder.c            | 27 +++--
> >  libavcodec/aaccoder_trellis.h    |  2 +-
> >  libavcodec/aaccoder_twoloop.h    |  2 +-
> >  libavcodec/aacenc.c              |  4 ++
> >  libavcodec/aacenc.h              |  6 +++
> >  libavcodec/aacenc_is.c           |  6 +--
> >  libavcodec/aacenc_ltp.c          |  4 +-
> >  libavcodec/aacenc_pred.c         |  6 +--
> >  libavcodec/aacenc_quantization.h |  4 +-
> >  libavcodec/aacenc_utils.h        |  4 +-
> >  libavcodec/x86/Makefile          |  2 +
> >  libavcodec/x86/aacencdsp.asm     | 87 ++++
> >  libavcodec/x86/aacencdsp_init.c  | 43
> >  13 files changed, 170 insertions(+), 27 deletions(-)
> >  create mode 100644 libavcodec/x86/aacencdsp.asm
> >  create mode 100644 libavcodec/x86/aacencdsp_init.c
>
> fate passes on linux32/64 x86, mingw32/64 x86
>
> build fails on arm:
>
> libavcodec/libavcodec.a(aacenc.o): In function `aac_encode_init':
> ffmpeg/arm/src/libavcodec/aacenc.c:1038: undefined reference to `ff_aac_dsp_init_x86'
> collect2: ld returned 1 exit status
> make: *** [ffserver_g] Error 1
> make: *** Waiting for unfinished jobs
> libavcodec/libavcodec.a(aacenc.o): In function `aac_encode_init':
> ffmpeg/arm/src/libavcodec/aacenc.c:1038: undefined reference to `ff_aac_dsp_init_x86'
> collect2: ld returned 1 exit status
> make: *** [ffprobe_g] Error 1
> libavcodec/libavcodec.a(aacenc.o): In function `aac_encode_init':
> ffmpeg/arm/src/libavcodec/aacenc.c:1038: undefined reference to `ff_aac_dsp_init_x86'
> collect2: ld returned 1 exit status
> make: *** [ffmpeg_g] Error 1
>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> While the State exists there can be no freedom; when there is freedom there
> will be no State. -- Vladimir Lenin
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Attaching a new version with the fixes from James Almer, which should also fix non-x86 compilation.

From d92003e23d82bc40fd85712538983209a7704248 Mon Sep 17 00:00:00 2001
From: Rostislav Pehlivanov
Date: Sat, 8 Oct 2016 15:59:14 +0100
Subject: [PATCH] aacenc: add SIMD optimizations for abs_pow34 and quantization

Performance improvements:

quant_bands:
with: 681 decicycles in quant_bands, 8388453 runs, 155 skips
without: 1190 decicycles in quant_bands, 8388386 runs, 222 skips
Around 42% for the function

Twoloop coder:

abs_pow34:
with/without: 7.82s/8.17s
Around 4% for the entire encoder

Both:
with/without: 7.15s/8.17s
Around 12% for the entire encoder

Fast coder:

abs_pow34:
with/without: 3.40s/3.77s
Around 10% for the entire encoder

Both:
with/without: 3.02s/3.77s
Around 20% faster for the entire encoder

Signed-off-by: Rostislav Pehlivanov
---
 libavcodec/aaccoder.c            | 27 ++--
 libavcodec/aaccoder_trellis.h    |  2 +-
 libavcodec/aaccoder_twoloop.h    |  2 +-
 libavcodec/aacenc.c              |  4 ++
 libavcodec/aacenc.h              |  6 +++
 libavcodec/aacenc_is.c           |  6 +--
 libavcodec/aacenc_ltp.c          |  4 +-
 libavcodec/aacenc_pred.c         |  6 +--
 libavcodec/aacenc_quantization.h |  4 +-
 libavcodec/aacenc_utils.h        |  2 +-
 libavcodec/x86/Makefile          |  2 +
 libavcodec/x86/aacencdsp.asm     | 88
 libavcodec/x86/aacencdsp_init.c  | 43
 13 files changed, 170 insertions(+), 26 deletions(-)
 create mode 100644 libavcodec/x86/aacencdsp.asm
 create mode 100644 libavcodec/x86/aacencdsp_init.c

diff --git a/libavcodec/aaccoder.c b/libavcodec/aaccoder.c
index 35787e8..9f3b4ed 100644
--- a/libavcodec/aaccoder.c
+++ b/libavcodec/aaccoder.c
@@ -88,7 +88,7 @@ static void encode_window_bands_info(AACEncContext *s, SingleChannelElement *sce
         float next_minrd = INFINITY;
         int next_mincb = 0;

-        abs_pow34_v(s->scoefs, sce->coeffs, 1024);
+        s->abs_pow34(s->scoefs, sce->coeffs, 1024);
         start = win*128;
         for (cb = 0; cb < CB_TOT_ALL; cb++) {
             path[0][cb].cost = 0.0f;
@@ -299
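The "Around N%" figures in the commit message line up with the reduction in time relative to the "without" run, (without - with) / without, rather than with a speedup ratio. A minimal C check of the quoted numbers; the helper names are ours, not from the patch.

```c
#include <assert.h>

/* Fraction of time (or cycles) saved relative to the "without" baseline. */
static double time_reduction(double with, double without)
{
    assert(without > 0.0);
    return (without - with) / without;
}

/* The same measurement expressed as a speedup factor. */
static double speedup(double with, double without)
{
    assert(with > 0.0);
    return without / with;
}
```

Applied to the quoted figures: quant_bands (681 vs 1190 decicycles) gives 0.428, about a 1.75x speedup; Twoloop abs_pow34 (7.82s vs 8.17s) gives 0.043; Twoloop both (7.15s vs 8.17s) gives 0.125, about 1.14x; Fast abs_pow34 (3.40s vs 3.77s) gives 0.098; Fast both (3.02s vs 3.77s) gives 0.199, about 1.25x. So "around 20% faster" for the fast coder means roughly 20% less encoding time, which corresponds to about a 1.25x speedup.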
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On 10/17/2016 6:24 PM, Rostislav Pehlivanov wrote: > diff --git a/libavcodec/aacenc_utils.h b/libavcodec/aacenc_utils.h > index ff9188a..f5cf77d 100644 > --- a/libavcodec/aacenc_utils.h > +++ b/libavcodec/aacenc_utils.h > @@ -37,7 +37,7 @@ > #define ROUND_TO_ZERO 0.1054f > #define C_QUANT 0.4054f > > -static inline void abs_pow34_v(float *out, const float *in, const int size) > +static inline void abs_pow34_v(float *out, const float *in, const int64_t > size) Why int64_t? There's no need for values that big... > { > int i; > for (i = 0; i < size; i++) { ...And it's certainly not correct, seeing this for loop here. > diff --git a/libavcodec/x86/aacencdsp.asm b/libavcodec/x86/aacencdsp.asm > new file mode 100644 > index 000..ff4019f > --- /dev/null > +++ b/libavcodec/x86/aacencdsp.asm > @@ -0,0 +1,87 @@ > +;** > +;* SIMD optimized AAC encoder DSP functions > +;* > +;* Copyright (C) 2016 Rostislav Pehlivanov > +;* > +;* This file is part of FFmpeg. > +;* > +;* FFmpeg is free software; you can redistribute it and/or > +;* modify it under the terms of the GNU Lesser General Public > +;* License as published by the Free Software Foundation; either > +;* version 2.1 of the License, or (at your option) any later version. > +;* > +;* FFmpeg is distributed in the hope that it will be useful, > +;* but WITHOUT ANY WARRANTY; without even the implied warranty of > +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > +;* Lesser General Public License for more details. 
> +;*
> +;* You should have received a copy of the GNU Lesser General Public
> +;* License along with FFmpeg; if not, write to the Free Software
> +;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> +;**
> +
> +%include "libavutil/x86/x86util.asm"
> +
> +SECTION_RODATA
> +
> +float_abs_mask: times 4 dd 0x7fffffff
> +
> +SECTION .text
> +
> +;***
> +;void ff_abs_pow34_sse(float *out, const float *in, const int64_t size);
> +;***
> +INIT_XMM sse
> +cglobal abs_pow34, 3, 3, 3, out, in, size
> +    mova m2, [float_abs_mask]
> +    shl sizeq, 2
> +    add inq, sizeq
> +    add outq, sizeq
> +    neg sizeq
> +.loop:
> +    mova m0, [inq+sizeq]
> +    andps m0, m2
> +    sqrtps m1, m0
> +    mulps m0, m1
> +    sqrtps m0, m0
> +    mova [outq+sizeq], m0
> +    add sizeq, mmsize
> +    jl .loop
> +    RET
> +
> +;***
> +;void ff_aac_quantize_bands_sse2(int *out, const float *in, const float *scaled,
> +;                                int size, int is_signed, int maxval, const float Q34,
> +;                                const float rounding)
> +;***
> +INIT_XMM sse2
> +cglobal aac_quantize_bands, 6, 6, 6, out, in, scaled, size, is_signed, maxval, Q34, rounding

5, 5, 6

No need to load maxval into a gpr on x86_32 and win64 when cvtsi2ss can
also load it directly from memory.

> +%if UNIX64 == 0
> +    movss m0, Q34m
> +    movss m1, roundingm
> +%endif
> +    SPLATD m0
> +    SPLATD m1

shufps m0, m0, 0
shufps m1, m1, 0

On sse2, SPLATD will expand to pshufd, so better stay in float domain.
If you add an AVX/FMA3 version of this function, you could instead use
vbroadcastss in them, and save yourself the movss.

> +    cvtsi2ss m3, maxvald
> +    SPLATD m3

%if UNIX64
cvtsi2ss m3, maxvald
%else
cvtsi2ss m3, dword maxvalm
%endif
shufps m3, m3, 0

After you made the function 5, 5, 6.

> +    shl is_signedd, 31
> +    movd m4, is_signedd
> +    SPLATD m4

Use shufps here since the instruction using this register in the loop
should be a float one.

> +    shl sizeq, 2

Even if it doesn't come from stack on win64, better use "sized" anyway.
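Both `float_abs_mask` and the `shl is_signedd, 31` step operate on the IEEE-754 single-precision bit pattern. ANDing a lane with 0x7fffffff clears the sign bit, which is what `andps` with the abs mask does. Shifting a 0/1 flag left by 31 yields either 0 or the sign bit alone, which the quantization loop later uses to copy the sign of `in[i]`. The scalar C sketch below shows the same bit operations on one lane. The helper names are hypothetical, and `memcpy` is used because it is the well-defined way to reinterpret float bits in C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits as uint32_t and back (well-defined via memcpy). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return u;
}

static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof(f));
    return f;
}

/* One lane of `andps m0, [float_abs_mask]`: clear bit 31. */
static float abs_by_mask(float x)
{
    return bits_float(float_bits(x) & 0x7fffffffu);
}

/* Same value as `shl is_signedd, 31`: 0 when is_signed == 0, else 0x80000000.
 * Assumes is_signed is 0 or 1. */
static uint32_t sign_select_mask(int is_signed)
{
    assert(is_signed == 0 || is_signed == 1);
    return (uint32_t)is_signed << 31;
}
```

In the SIMD code, the broadcast (`SPLATD` or `shufps m, m, 0`) copies this one 32-bit value into all four lanes, so every lane sees the same mask.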
> +    add inq, sizeq
> +    add outq, sizeq
> +    add scaledq, sizeq
> +    neg sizeq
> +.loop:
> +    mova m2, [scaledq+sizeq]
> +    mulps m2, m0

mulps m2, m0, [scaledq+sizeq]

> +    addps m2, m1

You could combine the mulps and addps into a single fmaddps if you add an
FMA3 version. Something like

movaps m2, [scaledq+sizeq]
fmaddps m2, m2, m0, m1

The movaps is needed because, unlike FMA4's fmaddps which is
non-destructive, FMA3 needs dst to be the same as one of the three src
registers.

> +    minps m2, m3
> +    mova m5, [inq+sizeq]

movaps m5, [inq+sizeq]

> +    pand m5, m4

andps m5, m4

> +    orps m2, m5
> +    cvttps2dq m2, m2
> +    mova [outq+sizeq], m2
> +    add sizeq, mmsize
> +    jl .loop
> +    RET

Can't you make these functions also process eight floats p
Re: [FFmpeg-devel] [PATCH v3] aacenc: add SIMD optimizations for abs_pow34 and quantization
On Mon, Oct 17, 2016 at 10:24:48PM +0100, Rostislav Pehlivanov wrote:
> Should fix segfaults on x86-32
>
> Performance improvements:
>
> quant_bands:
> with:    681 decicycles in quant_bands, 8388453 runs, 155 skips
> without: 1190 decicycles in quant_bands, 8388386 runs, 222 skips
> Around 42% for the function
>
> Twoloop coder:
>
> abs_pow34:
> with/without: 7.82s/8.17s
> Around 4% for the entire encoder
>
> Both:
> with/without: 7.15s/8.17s
> Around 12% for the entire encoder
>
> Fast coder:
>
> abs_pow34:
> with/without: 3.40s/3.77s
> Around 10% for the entire encoder
>
> Both:
> with/without: 3.02s/3.77s
> Around 20% faster for the entire encoder
>
> Signed-off-by: Rostislav Pehlivanov
> ---
>  libavcodec/aaccoder.c            | 27 +++--
>  libavcodec/aaccoder_trellis.h    |  2 +-
>  libavcodec/aaccoder_twoloop.h    |  2 +-
>  libavcodec/aacenc.c              |  4 ++
>  libavcodec/aacenc.h              |  6 +++
>  libavcodec/aacenc_is.c           |  6 +--
>  libavcodec/aacenc_ltp.c          |  4 +-
>  libavcodec/aacenc_pred.c         |  6 +--
>  libavcodec/aacenc_quantization.h |  4 +-
>  libavcodec/aacenc_utils.h        |  4 +-
>  libavcodec/x86/Makefile          |  2 +
>  libavcodec/x86/aacencdsp.asm     | 87
>  libavcodec/x86/aacencdsp_init.c  | 43
>  13 files changed, 170 insertions(+), 27 deletions(-)
>  create mode 100644 libavcodec/x86/aacencdsp.asm
>  create mode 100644 libavcodec/x86/aacencdsp_init.c

fate passes on linux32/64 x86, mingw32/64 x86

build fails on arm:

libavcodec/libavcodec.a(aacenc.o): In function `aac_encode_init':
ffmpeg/arm/src/libavcodec/aacenc.c:1038: undefined reference to `ff_aac_dsp_init_x86'
collect2: ld returned 1 exit status
make: *** [ffserver_g] Error 1
make: *** Waiting for unfinished jobs
libavcodec/libavcodec.a(aacenc.o): In function `aac_encode_init':
ffmpeg/arm/src/libavcodec/aacenc.c:1038: undefined reference to `ff_aac_dsp_init_x86'
collect2: ld returned 1 exit status
make: *** [ffprobe_g] Error 1
libavcodec/libavcodec.a(aacenc.o): In function `aac_encode_init':
ffmpeg/arm/src/libavcodec/aacenc.c:1038: undefined reference to `ff_aac_dsp_init_x86'
collect2: ld returned 1 exit status
make: *** [ffmpeg_g] Error 1

[...]

--
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

While the State exists there can be no freedom; when there is freedom there
will be no State. -- Vladimir Lenin

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
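The ARM link error shows `aac_encode_init()` referencing `ff_aac_dsp_init_x86`, a symbol that only exists when the x86 files are built. One way to keep non-x86 builds from ever referencing it is to guard the call. FFmpeg commonly writes `if (ARCH_X86) ...`, which relies on the compiler removing the dead call. A preprocessor `#if` removes the reference before compilation, so it does not depend on dead-code elimination at all. The sketch below shows the `#if` variant; `MY_ARCH_X86`, `Ctx`, `ctx_init` and `dsp_init_x86` are hypothetical names, not FFmpeg's configure macros or functions.

```c
#include <assert.h>

/* Hypothetical config macro; a configure script would normally set it. */
#ifndef MY_ARCH_X86
#define MY_ARCH_X86 0
#endif

typedef struct Ctx {
    int simd_level; /* 0 = plain C routines */
} Ctx;

#if MY_ARCH_X86
/* Declared (and defined in an x86-only file) only when building for x86. */
void dsp_init_x86(Ctx *c);
#endif

static void ctx_init(Ctx *c)
{
    c->simd_level = 0;          /* always install the C defaults first */
#if MY_ARCH_X86
    dsp_init_x86(c);            /* the symbol is referenced only in x86 builds */
#endif
}
```

Built with `MY_ARCH_X86` left at 0, this compiles and links even though `dsp_init_x86` is defined nowhere, which is the property the ARM build above needs.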