Re: [FFmpeg-devel] [PATCH] checkasm/hevc_mc : add hevc_mc for checkasm

2018-04-17 Thread Shengbin Meng


> On Apr 9, 2018, at 10:12, Yingming Fan  wrote:
> 
> From: Yingming Fan 
> 
> ---
> Hi, there.
> I plan to submit our arm32 NEON code for qpel and epel.
> Before that, I will submit the hevc_mc checkasm code.
> This hevc_mc checkasm code checks every qpel and epel function, including 8, 
> 10 and 12 bit.
> It passed testing with 'checkasm --test=hevc_mc' on Linux x86_64, macOS 
> x86_64 and Linux arm64.
> It also passed FATE. 
> 
> tests/checkasm/Makefile   |   2 +-
> tests/checkasm/checkasm.c |   1 +
> tests/checkasm/checkasm.h |   1 +
> tests/checkasm/hevc_mc.c  | 547 ++
> tests/fate/checkasm.mak   |   1 +
> 5 files changed, 551 insertions(+), 1 deletion(-)
> create mode 100644 tests/checkasm/hevc_mc.c
> 
> diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
> index 0233d2f989..e6c94cd676 100644
> --- a/tests/checkasm/Makefile
> +++ b/tests/checkasm/Makefile
> @@ -23,7 +23,7 @@ AVCODECOBJS-$(CONFIG_EXR_DECODER)   += exrdsp.o
> AVCODECOBJS-$(CONFIG_HUFFYUV_DECODER)   += huffyuvdsp.o
> AVCODECOBJS-$(CONFIG_JPEG2000_DECODER)  += jpeg2000dsp.o
> AVCODECOBJS-$(CONFIG_PIXBLOCKDSP)   += pixblockdsp.o
> -AVCODECOBJS-$(CONFIG_HEVC_DECODER)  += hevc_add_res.o hevc_idct.o 
> hevc_sao.o
> +AVCODECOBJS-$(CONFIG_HEVC_DECODER)  += hevc_add_res.o hevc_idct.o 
> hevc_sao.o hevc_mc.o
> AVCODECOBJS-$(CONFIG_UTVIDEO_DECODER)   += utvideodsp.o
> AVCODECOBJS-$(CONFIG_V210_ENCODER)  += v210enc.o
> AVCODECOBJS-$(CONFIG_VP9_DECODER)   += vp9dsp.o
> diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
> index 20ce56932f..b95efc674d 100644
> --- a/tests/checkasm/checkasm.c
> +++ b/tests/checkasm/checkasm.c
> @@ -117,6 +117,7 @@ static const struct {
> { "hevc_add_res", checkasm_check_hevc_add_res },
> { "hevc_idct", checkasm_check_hevc_idct },
> { "hevc_sao", checkasm_check_hevc_sao },
> +{ "hevc_mc", checkasm_check_hevc_mc },
> #endif
> #if CONFIG_HUFFYUV_DECODER
> { "huffyuvdsp", checkasm_check_huffyuvdsp },
> diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
> index dcab74de06..5a4a612da7 100644
> --- a/tests/checkasm/checkasm.h
> +++ b/tests/checkasm/checkasm.h
> @@ -58,6 +58,7 @@ void checkasm_check_h264qpel(void);
> void checkasm_check_hevc_add_res(void);
> void checkasm_check_hevc_idct(void);
> void checkasm_check_hevc_sao(void);
> +void checkasm_check_hevc_mc(void);
> void checkasm_check_huffyuvdsp(void);
> void checkasm_check_jpeg2000dsp(void);
> void checkasm_check_llviddsp(void);
> diff --git a/tests/checkasm/hevc_mc.c b/tests/checkasm/hevc_mc.c
> new file mode 100644
> index 00..018f322c11
> --- /dev/null
> +++ b/tests/checkasm/hevc_mc.c
> @@ -0,0 +1,547 @@
> +/*
> + * Copyright (c) 2018 Yingming Fan 
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
> + * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
> + */
> +
> +#include 
> +
> +#include "libavutil/intreadwrite.h"
> +
> +#include "libavcodec/avcodec.h"
> +
> +#include "libavcodec/hevcdsp.h"
> +
> +#include "checkasm.h"
> +
> +static const uint32_t pixel_mask[3] = { 0xffffffff, 0x03ff03ff, 0x0fff0fff };
> +static const uint32_t idx_width_map[8][2] = {{1, 4}, {3, 8}, {4, 12}, {5, 
> 16}, {6, 24}, {7, 32}, {8, 48}, {9, 64}};
Why not include block widths 2 and 6? I notice that there is already some 
optimized code for them.
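
To illustrate the suggestion, here is a rough sketch of what the table could look like with the two small widths added. The index values 0 and 2 for widths 2 and 6 are my assumption, chosen to fill the gaps in the existing numbering (1, 3, 4, ...). `idx_width_map_full` and `width_for_idx` are hypothetical names and are not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extended table of {index, block width} pairs. The entries
 * {0, 2} and {2, 6} are assumed; the other eight are copied from the patch. */
static const uint32_t idx_width_map_full[10][2] = {
    {0, 2}, {1, 4}, {2, 6}, {3, 8}, {4, 12},
    {5, 16}, {6, 24}, {7, 32}, {8, 48}, {9, 64}
};

/* Return the block width for a function-table index, or 0 if unknown. */
static uint32_t width_for_idx(uint32_t idx)
{
    size_t i;
    size_t n = sizeof(idx_width_map_full) / sizeof(idx_width_map_full[0]);

    for (i = 0; i < n; i++)
        if (idx_width_map_full[i][0] == idx)
            return idx_width_map_full[i][1];
    return 0;
}
```

With such a table the test loop would also cover the 2- and 6-pixel-wide functions, provided the checked function pointers are set for those indices.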

> +#define SIZEOF_PIXEL ((bit_depth + 7) / 8)
> +#define PIXEL_STRIDE (AV_INPUT_BUFFER_PADDING_SIZE + MAX_PB_SIZE + 
> AV_INPUT_BUFFER_PADDING_SIZE)
> +#define BUF_SIZE ((MAX_PB_SIZE+4+4) * PIXEL_STRIDE * 2)
> +
> +#define randomize_buffers(buf0, buf1, size) \
> +do {\
> +uint32_t mask = pixel_mask[(bit_depth - 8) >> 1];   \
> +int k;  \
> +for (k = 0; k < size; k += 4) { \
> +uint32_t r = rnd() & mask;  \
> +AV_WN32A(buf0 + k, r);  \
> +AV_WN32A(buf1 + k, r);  \
> +}   \
> +} while (0)
> +
> +#define randomize_buffers2(buf0, buf1, size)\
> +do {  

Re: [FFmpeg-devel] [PATCH v3] avcodec/arm/hevcdsp_sao : add NEON optimization for sao

2018-04-08 Thread Shengbin Meng
LGTM.

Regards,
Shengbin Meng

> On 27 Mar 2018, at 20:43, Yingming Fan  wrote:
> 
> From: Meng Wang 
> 
> Signed-off-by: Meng Wang 
> ---
> This v3 patch removes the unused code 'stride_dst /= sizeof(uint8_t);' compared 
> to v1. V1 had this code because we referred to the hevc dsp template code.
> It also removes type casts like 'uint8_t *dst = (uint8_t *)_dst;' compared to v2.
> 
> As the FFmpeg hevc decoder has no SAO NEON optimization, we add sao_band and 
> sao_edge NEON code in this patch.
> I already submitted a patch called 'checkasm/hevc_sao : add hevc_sao for 
> checkasm' several days ago.
> The results below were printed by hevc_sao checkasm on an armv7 device, a Nexus 5. 
> From the results we can see that hevc_sao_band speeds up ~2x and hevc_sao_edge 
> ~4x. 
> FATE also passed on armv7 Linux, x86_64 Linux and macOS.
> 
> hevc_sao_band_8x8_8_c: 804.9
> hevc_sao_band_8x8_8_neon: 452.4
> hevc_sao_band_16x16_8_c: 2638.1
> hevc_sao_band_16x16_8_neon: 1169.9
> hevc_sao_band_32x32_8_c: 9259.9
> hevc_sao_band_32x32_8_neon: 3956.1
> hevc_sao_band_48x48_8_c: 20344.6
> hevc_sao_band_48x48_8_neon: 8649.6
> hevc_sao_band_64x64_8_c: 35684.6
> hevc_sao_band_64x64_8_neon: 15213.1
> hevc_sao_edge_8x8_8_c: 1761.6
> hevc_sao_edge_8x8_8_neon: 414.6
> hevc_sao_edge_16x16_8_c: 6844.4
> hevc_sao_edge_16x16_8_neon: 1589.9
> hevc_sao_edge_32x32_8_c: 27156.4
> hevc_sao_edge_32x32_8_neon: 6116.6
> hevc_sao_edge_48x48_8_c: 60004.6
> hevc_sao_edge_48x48_8_neon: 13686.4
> hevc_sao_edge_64x64_8_c: 106708.1
> hevc_sao_edge_64x64_8_neon: 24240.1
> 
> libavcodec/arm/Makefile|   3 +-
> libavcodec/arm/hevcdsp_init_neon.c |  59 
> libavcodec/arm/hevcdsp_sao_neon.S  | 181 +
> 3 files changed, 242 insertions(+), 1 deletion(-)
> create mode 100644 libavcodec/arm/hevcdsp_sao_neon.S
> 
> diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
> index 1eeac5449e..9c164f82ae 100644
> --- a/libavcodec/arm/Makefile
> +++ b/libavcodec/arm/Makefile
> @@ -136,7 +136,8 @@ NEON-OBJS-$(CONFIG_DCA_DECODER)+= 
> arm/synth_filter_neon.o
> NEON-OBJS-$(CONFIG_HEVC_DECODER)   += arm/hevcdsp_init_neon.o   \
>   arm/hevcdsp_deblock_neon.o\
>   arm/hevcdsp_idct_neon.o   \
> -  arm/hevcdsp_qpel_neon.o
> +  arm/hevcdsp_qpel_neon.o   \
> +  arm/hevcdsp_sao_neon.o
> NEON-OBJS-$(CONFIG_RV30_DECODER)   += arm/rv34dsp_neon.o
> NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
>   arm/rv40dsp_neon.o
> diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
> b/libavcodec/arm/hevcdsp_init_neon.c
> index a4628d2a93..201a088dac 100644
> --- a/libavcodec/arm/hevcdsp_init_neon.c
> +++ b/libavcodec/arm/hevcdsp_init_neon.c
> @@ -21,8 +21,16 @@
> #include "libavutil/attributes.h"
> #include "libavutil/arm/cpu.h"
> #include "libavcodec/hevcdsp.h"
> +#include "libavcodec/avcodec.h"
> #include "hevcdsp_arm.h"
> 
> +void ff_hevc_sao_band_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src,
> +  ptrdiff_t stride_dst, ptrdiff_t stride_src,
> +  int16_t *sao_offset_val, int 
> sao_left_class,
> +  int width, int height);
> +void ff_hevc_sao_edge_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src, 
> ptrdiff_t stride_dst, int16_t *sao_offset_val,
> +  int eo, int width, int height);
> +
> void ff_hevc_v_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
> _beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
> void ff_hevc_h_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
> _beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
> void ff_hevc_v_loop_filter_chroma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
> *_tc, uint8_t *_no_p, uint8_t *_no_q);
> @@ -142,6 +150,47 @@ QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v2_neon_8);
> QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v3_neon_8);
> #undef QPEL_FUNC_UW
> 
> +void ff_hevc_sao_band_filter_neon_8(uint8_t *dst, uint8_t *src, ptrdiff_t 
> stride_dst, ptrdiff_t stride_src, int width, int height, int16_t 
> *offset_table);
> +
> +void ff_hevc_sao_band_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src,
> +  ptrdiff_t stride_dst, ptrdiff_t stride_src,
> +  int16_t *sao_offset_val, int 
> sao_left_class,
> + 

Re: [FFmpeg-devel] [PATCH v2] avcodec/arm/hevcdsp_sao : add NEON optimization for sao

2018-03-25 Thread Shengbin Meng


> On 22 Mar 2018, at 20:51, Yingming Fan  wrote:
> 
> From: Meng Wang 
> 
> Signed-off-by: Meng Wang 
> ---
> This v2 patch removes the unused code 'stride_dst /= sizeof(uint8_t);' compared 
> to v1. V1 had this code because we referred to the hevc dsp template code.
> 
> As the FFmpeg hevc decoder has no SAO NEON optimization, we add sao_band and 
> sao_edge NEON code in this patch.
> I already submitted a patch called 'checkasm/hevc_sao : add hevc_sao for 
> checkasm' several days ago.
> The results below were printed by hevc_sao checkasm on an armv7 device, a Nexus 5. 
> From the results we can see that hevc_sao_band speeds up ~2x and hevc_sao_edge 
> ~4x. 
> FATE was also tested on an armv7 device and macOS.
> 
> hevc_sao_band_8x8_8_c: 804.9
> hevc_sao_band_8x8_8_neon: 452.4
> hevc_sao_band_16x16_8_c: 2638.1
> hevc_sao_band_16x16_8_neon: 1169.9
> hevc_sao_band_32x32_8_c: 9259.9
> hevc_sao_band_32x32_8_neon: 3956.1
> hevc_sao_band_48x48_8_c: 20344.6
> hevc_sao_band_48x48_8_neon: 8649.6
> hevc_sao_band_64x64_8_c: 35684.6
> hevc_sao_band_64x64_8_neon: 15213.1
> hevc_sao_edge_8x8_8_c: 1761.6
> hevc_sao_edge_8x8_8_neon: 414.6
> hevc_sao_edge_16x16_8_c: 6844.4
> hevc_sao_edge_16x16_8_neon: 1589.9
> hevc_sao_edge_32x32_8_c: 27156.4
> hevc_sao_edge_32x32_8_neon: 6116.6
> hevc_sao_edge_48x48_8_c: 60004.6
> hevc_sao_edge_48x48_8_neon: 13686.4
> hevc_sao_edge_64x64_8_c: 106708.1
> hevc_sao_edge_64x64_8_neon: 24240.1
> 
> libavcodec/arm/Makefile|   3 +-
> libavcodec/arm/hevcdsp_init_neon.c |  59 
> libavcodec/arm/hevcdsp_sao_neon.S  | 181 +
> 3 files changed, 242 insertions(+), 1 deletion(-)
> create mode 100644 libavcodec/arm/hevcdsp_sao_neon.S
> 
> diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
> index 1eeac5449e..9c164f82ae 100644
> --- a/libavcodec/arm/Makefile
> +++ b/libavcodec/arm/Makefile
> @@ -136,7 +136,8 @@ NEON-OBJS-$(CONFIG_DCA_DECODER)+= 
> arm/synth_filter_neon.o
> NEON-OBJS-$(CONFIG_HEVC_DECODER)   += arm/hevcdsp_init_neon.o   \
>   arm/hevcdsp_deblock_neon.o\
>   arm/hevcdsp_idct_neon.o   \
> -  arm/hevcdsp_qpel_neon.o
> +  arm/hevcdsp_qpel_neon.o   \
> +  arm/hevcdsp_sao_neon.o
> NEON-OBJS-$(CONFIG_RV30_DECODER)   += arm/rv34dsp_neon.o
> NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
>   arm/rv40dsp_neon.o
> diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
> b/libavcodec/arm/hevcdsp_init_neon.c
> index a4628d2a93..af68e24f93 100644
> --- a/libavcodec/arm/hevcdsp_init_neon.c
> +++ b/libavcodec/arm/hevcdsp_init_neon.c
> @@ -21,8 +21,16 @@
> #include "libavutil/attributes.h"
> #include "libavutil/arm/cpu.h"
> #include "libavcodec/hevcdsp.h"
> +#include "libavcodec/avcodec.h"
> #include "hevcdsp_arm.h"
> 
> +void ff_hevc_sao_band_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src,
> +  ptrdiff_t stride_dst, ptrdiff_t stride_src,
> +  int16_t *sao_offset_val, int 
> sao_left_class,
> +  int width, int height);
> +void ff_hevc_sao_edge_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src, 
> ptrdiff_t stride_dst, int16_t *sao_offset_val,
> +  int eo, int width, int height);
> +
> void ff_hevc_v_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
> _beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
> void ff_hevc_h_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
> _beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
> void ff_hevc_v_loop_filter_chroma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
> *_tc, uint8_t *_no_p, uint8_t *_no_q);
> @@ -142,6 +150,47 @@ QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v2_neon_8);
> QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v3_neon_8);
> #undef QPEL_FUNC_UW
> 
> +void ff_hevc_sao_band_filter_neon_8(uint8_t *dst, uint8_t *src, ptrdiff_t 
> stride_dst, ptrdiff_t stride_src, int width, int height, int16_t 
> *offset_table);
> +
> +void ff_hevc_sao_band_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src,
> +  ptrdiff_t stride_dst, ptrdiff_t stride_src,
> +  int16_t *sao_offset_val, int 
> sao_left_class,
> +  int width, int height) {
> +uint8_t *dst = (uint8_t *)_dst;
> +uint8_t *src = (uint8_t *)_src;
This conversion is also not needed since we are only handling 8-bit pixels here.

> +int16_t offset_table[32] = {0};
> +int k;
> +
> +for (k = 0; k < 4; k++) {
> +offset_table[(k + sao_left_class) & 31] = sao_offset_val[k + 1];
> +}
> +
> +ff_hevc_sao_band_filter_neon_8(dst, src, stride_dst, stride_src, width, 
> height, offset_table);
> +}
> +
> +void ff_hevc_sao_edge_f

Re: [FFmpeg-devel] [PATCH] avcodec/arm/hevcdsp_sao : add NEON optimization for sao

2018-03-22 Thread Shengbin Meng
The code looks good to me. I think the wrapper is fine, because that part of 
the code is not suitable for NEON assembly.

But you can remove the use of `sizeof(uint8_t)`, as suggested by Carl.
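
For reference, the C part of the band wrapper boils down to building a 32-entry offset table, which is easier to read in C than in assembly. A minimal standalone sketch of that step follows. `build_band_offset_table` is a hypothetical helper name, not a function in the patch, and the strides are simply left in bytes because the pixels are 8-bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill a 32-entry band offset table the way the quoted wrapper does:
 * four consecutive bands starting at left_class receive the offsets
 * sao_offset_val[1..4], and the band index wraps modulo 32. */
static void build_band_offset_table(int16_t offset_table[32],
                                    const int16_t *sao_offset_val,
                                    int left_class)
{
    int k;

    assert(left_class >= 0 && left_class < 32);
    memset(offset_table, 0, 32 * sizeof(offset_table[0]));
    for (k = 0; k < 4; k++)
        offset_table[(k + left_class) & 31] = sao_offset_val[k + 1];
}
```

For example, with a left class of 30 the four offsets land in bands 30, 31, 0 and 1. No `sizeof(uint8_t)` division is needed anywhere, since a stride in bytes is already a stride in pixels.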

Shengbin Meng

> On 19 Mar 2018, at 12:41, Yingming Fan  wrote:
> 
> Hi, is there any review of this patch? What's your opinion about the wrapper we 
> used in this patch?
> 
> Yingming Fan
> 
>> On 11 Mar 2018, at 8:59 PM, Yingming Fan  wrote:
>> 
>> 
>>> On 11 Mar 2018, at 8:54 PM, Carl Eugen Hoyos  wrote:
>>> 
>>> 2018-03-08 8:03 GMT+01:00 Yingming Fan :
>>>> From: Meng Wang 
>>> 
>>>> +stride_dst /= sizeof(uint8_t);
>>>> +stride_src /= sizeof(uint8_t);
>>> 
>>> FFmpeg requires sizeof(uint8_t) to be 1, please simplify
>>> your patch accordingly.
>>> 
>>> Why is the wrapper function needed?
>> 
>> We use a wrapper because the code in it does not need to be written in assembly, 
>> and C code is more readable.
>> 
>>> 
>>> Carl Eugen
>>> ___
>>> ffmpeg-devel mailing list
>>> ffmpeg-devel@ffmpeg.org
>>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>> 
> 



Re: [FFmpeg-devel] [PATCH] avcodec/arm/hevcdsp_sao : add NEON optimization for sao

2018-03-22 Thread Shengbin Meng
Hi,

With the checkasm benchmark, I see a speedup of ~3x for band mode and ~6x for 
edge mode on my device (the device has an aarch64 CPU, but I configured ffmpeg 
with `--arch=arm`). FATE passed as well.

Results of a checkasm run:

$./tests/checkasm/checkasm --test=hevc_sao --bench
$ sudo ./tests/checkasm/checkasm --test=hevc_sao --bench
benchmarking with Linux Perf Monitoring API
nop: 49.8
checkasm: using random seed 1088726844
NEON:
 - hevc_sao.sao_band [OK]
 - hevc_sao.sao_edge [OK]
checkasm: all 10 tests passed
hevc_sao_band_8x8_8_c: 578.0
hevc_sao_band_8x8_8_neon: 215.3
hevc_sao_band_16x16_8_c: 2004.3
hevc_sao_band_16x16_8_neon: 680.8
hevc_sao_band_32x32_8_c: 8363.5
hevc_sao_band_32x32_8_neon: 2579.3
hevc_sao_band_48x48_8_c: 18268.3
hevc_sao_band_48x48_8_neon: 5653.3
hevc_sao_band_64x64_8_c: 32001.8
hevc_sao_band_64x64_8_neon: 9952.0
hevc_sao_edge_8x8_8_c: 1211.0
hevc_sao_edge_8x8_8_neon: 217.5
hevc_sao_edge_16x16_8_c: 4708.5
hevc_sao_edge_16x16_8_neon: 767.0
hevc_sao_edge_32x32_8_c: 18673.0
hevc_sao_edge_32x32_8_neon: 2967.3
hevc_sao_edge_48x48_8_c: 41936.3
hevc_sao_edge_48x48_8_neon: 6642.8
hevc_sao_edge_64x64_8_c: 74015.8
hevc_sao_edge_64x64_8_neon: 11781.8
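
The speedup figures quoted above are simply the ratio of the C timing to the NEON timing for the same block size. A small sketch of that arithmetic, where `speedup` is a hypothetical helper and not part of checkasm:

```c
#include <assert.h>

/* Speedup of the NEON version over the C version, computed from two
 * checkasm timings of the same function and block size. */
static double speedup(double c_time, double neon_time)
{
    assert(neon_time > 0.0);
    return c_time / neon_time;
}
```

For the 64x64 rows this gives roughly 32001.8 / 9952.0 ≈ 3.2 for band mode and 74015.8 / 11781.8 ≈ 6.3 for edge mode.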

Regards
Shengbin

> On 11 Mar 2018, at 10:27, Yingming Fan  wrote:
> 
> Hi, there. 
> I have already pushed a patch which adds hevc_sao checkasm, and the patch was 
> accepted.
> You can verify this optimization by running checkasm on an ARM device: 
> `checkasm --test=hevc_sao --bench`.
> hevc_sao_band speeds up ~2x and hevc_sao_edge ~4x. FATE also passed 
> on the ARM platform.
> 
> Yingming Fan
> 
>> On 8 Mar 2018, at 3:03 PM, Yingming Fan  wrote:
>> 
>> From: Meng Wang 
>> 
>> Signed-off-by: Meng Wang 
>> ---
>> As the FFmpeg hevc decoder has no SAO NEON optimization, we add sao_band and 
>> sao_edge NEON code in this patch.
>> I already submitted a patch called 'checkasm/hevc_sao : add hevc_sao for 
>> checkasm' several days ago.
>> The results below were printed by hevc_sao checkasm on an armv7 device, a Nexus 5. 
>> From the results we can see that hevc_sao_band speeds up ~2x and hevc_sao_edge 
>> ~4x. 
>> FATE was also tested on an armv7 device and macOS.
>> 
>> hevc_sao_band_8x8_8_c: 804.9
>> hevc_sao_band_8x8_8_neon: 452.4
>> hevc_sao_band_16x16_8_c: 2638.1
>> hevc_sao_band_16x16_8_neon: 1169.9
>> hevc_sao_band_32x32_8_c: 9259.9
>> hevc_sao_band_32x32_8_neon: 3956.1
>> hevc_sao_band_48x48_8_c: 20344.6
>> hevc_sao_band_48x48_8_neon: 8649.6
>> hevc_sao_band_64x64_8_c: 35684.6
>> hevc_sao_band_64x64_8_neon: 15213.1
>> hevc_sao_edge_8x8_8_c: 1761.6
>> hevc_sao_edge_8x8_8_neon: 414.6
>> hevc_sao_edge_16x16_8_c: 6844.4
>> hevc_sao_edge_16x16_8_neon: 1589.9
>> hevc_sao_edge_32x32_8_c: 27156.4
>> hevc_sao_edge_32x32_8_neon: 6116.6
>> hevc_sao_edge_48x48_8_c: 60004.6
>> hevc_sao_edge_48x48_8_neon: 13686.4
>> hevc_sao_edge_64x64_8_c: 106708.1
>> hevc_sao_edge_64x64_8_neon: 24240.1
>> 
>> libavcodec/arm/Makefile|   3 +-
>> libavcodec/arm/hevcdsp_init_neon.c |  63 +
>> libavcodec/arm/hevcdsp_sao_neon.S  | 181 
>> +
>> 3 files changed, 246 insertions(+), 1 deletion(-)
>> create mode 100644 libavcodec/arm/hevcdsp_sao_neon.S
>> 
>> diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
>> index 1eeac5449e..2ee913e8a8 100644
>> --- a/libavcodec/arm/Makefile
>> +++ b/libavcodec/arm/Makefile
>> @@ -136,7 +136,8 @@ NEON-OBJS-$(CONFIG_DCA_DECODER)+= 
>> arm/synth_filter_neon.o
>> NEON-OBJS-$(CONFIG_HEVC_DECODER)   += arm/hevcdsp_init_neon.o   \
>>  arm/hevcdsp_deblock_neon.o\
>>  arm/hevcdsp_idct_neon.o   \
>> -  arm/hevcdsp_qpel_neon.o
>> +  arm/hevcdsp_qpel_neon.o   \
>> +  arm/hevcdsp_sao_neon.o
>> NEON-OBJS-$(CONFIG_RV30_DECODER)   += arm/rv34dsp_neon.o
>> NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
>>  arm/rv40dsp_neon.o
>> diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
>> b/libavcodec/arm/hevcdsp_init_neon.c
>> index a4628d2a93..3c480f12f8 100644
>> --- a/libavcodec/arm/hevcdsp_init_neon.c
>> +++ b/libavcodec/arm/hevcdsp_init_neon.c
>> @@ -21,8 +21,16 @@
>> #include "libavutil/attributes.h"
>> #include "libavutil/arm/cpu.h"
>> #include "libavcodec/hevcdsp.h"
>> +#include "libavcodec/avcodec.h"
>> #include "hevcdsp_arm.h"
>> 
>> +void ff_hevc_sao_band_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src,
>> +  ptrdiff_t stride_dst, ptrdiff_t 
>> stride_src,
>> +  int16_t *sao_offset_val, int 
>> sao_left_class,
>> +  int width, int height);
>> +void ff_hevc_sao_edge_filter_neon_8_wrapper(uint8_t *_dst, uint8_t *_src, 
>> ptrdiff_t stride_dst, int16_t *sao_offset_val,
>> +  int eo, i

Re: [FFmpeg-devel] [PATCH 1/6] avcodec/hevcdsp: Add NEON optimization for qpel weighted mode

2017-11-22 Thread Shengbin Meng


> On 22 Nov 2017, at 20:26, Michael Niedermayer  wrote:
> 
> On Wed, Nov 22, 2017 at 07:12:01PM +0800, Shengbin Meng wrote:
>> From: Meng Wang 
>> 
>> Signed-off-by: Meng Wang 
>> ---
>> libavcodec/arm/hevcdsp_init_neon.c |  66 +
>> libavcodec/arm/hevcdsp_qpel_neon.S | 509 
>> +
>> 2 files changed, 575 insertions(+)
>> 
>> diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
>> b/libavcodec/arm/hevcdsp_init_neon.c
> 
> This seems not to apply to git master

I looked into that, and it seems someone has added commits about hevc decoding 
to git master after n3.4.
My patches are based on n3.4, so some conflicts occur. I checked, and the 
conflict is mainly due to the following change in master:

ff_hevcdsp_init_neon => ff_hevc_dsp_init_neon  (in 
libavcodec/arm/hevcdsp_init_neon.c, a common function name was changed).

It is a small conflict, though, and should be easy to resolve. Anyway, I have 
updated the patches to v2, which are all based on master, for your 
convenience. They should merge cleanly.

And since master already contains optimization code for IDCT (even for 32x32 
blocks, a great plus!), our IDCT work has been removed from the v2 patches.

Thank you.

Regards,
Shengbin

> Applying: avcodec/hevcdsp: Add NEON optimization for qpel weighted mode
> Using index info to reconstruct a base tree...
> M   libavcodec/arm/hevcdsp_init_neon.c
> Falling back to patching base and 3-way merge...
> Auto-merging libavcodec/arm/hevcdsp_init_neon.c
> CONFLICT (content): Merge conflict in libavcodec/arm/hevcdsp_init_neon.c
> error: Failed to merge in the changes.
> Patch failed at 0001 avcodec/hevcdsp: Add NEON optimization for qpel weighted 
> mode
> The copy of the patch that failed is found in: .git/rebase-apply/patch
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
> 
> 
> [...]
> -- 
> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
> 
> I am the wisest man alive, for I know one thing, and that is that I know
> nothing. -- Socrates



[FFmpeg-devel] [PATCH v2 4/5] avcodec/hevcdsp: Use pre-load (pld) to optimize data loading

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/hevcdsp_epel_neon.S | 10 ++
 libavcodec/arm/hevcdsp_qpel_neon.S | 24 
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/libavcodec/arm/hevcdsp_epel_neon.S 
b/libavcodec/arm/hevcdsp_epel_neon.S
index d0d93e8033..03e6504481 100644
--- a/libavcodec/arm/hevcdsp_epel_neon.S
+++ b/libavcodec/arm/hevcdsp_epel_neon.S
@@ -306,6 +306,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16{q7}, [r0], r1
 regshuffle_d4
@@ -320,6 +321,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16d14, [r0], r1
 regshuffle_d4
@@ -357,6 +359,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.8d0, [r0], r1
@@ -372,6 +375,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.32d0[0], [r0], r1
@@ -396,6 +400,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16{q0}, [r8], r9
 vqadd.s16  q0, q7
@@ -415,6 +420,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16  d0, [r8], r9
 vqadd.s16d0, d14
@@ -465,6 +471,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs  r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14   // extending signed 4x16bit data to 4x32 
bit
 vmovl.s16 q13, d15
@@ -490,6 +497,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 4x32 
bit
 vmul.s32  q14, q12, q6
@@ -535,6 +543,7 @@
 cmp   r5, #2
 beq   2f
 8:  subsr4,   #1
+pld [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 
4x32 bit
 vmovl.s16 q13, d15
@@ -569,6 +578,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16q12, d14
 vmul.s32 q14, q12, q6
diff --git a/libavcodec/arm/hevcdsp_qpel_neon.S 
b/libavcodec/arm/hevcdsp_qpel_neon.S
index 71ecc00b6e..b507fbc13b 100644
--- a/libavcodec/arm/hevcdsp_qpel_neon.S
+++ b/libavcodec/arm/hevcdsp_qpel_neon.S
@@ -231,6 +231,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16{q7}, [r0], r1
 regshuffle_d8
@@ -245,6 +246,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16d14, [r0], r1
 regshuffle_d8
@@ -273,6 +275,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.8d0, [r0], r1
@@ -288,6 +291,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.32d0[0], [r0], r1
@@ -301,6 +305,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16{q0}, [r8], r9
 vqadd.s16  q0, q7
@@ -320,6 +325,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16  d0, [r8], r9
 vqadd.s16d0, d14
@@ -358,6 +364,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs  r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14 // extending signed 4x16bit data to 4x32 bit
 vmovl.s16 q13, d15
@@ -383,6 +390,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 4x32 
bit
 vmul.s32  q14, q12, q6
@@ -412,6 +420,7 @@
 cmp r5,   #4
 beq 4f
 8:  subsr4,   #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 
4x32 bit
 vmovl.s16 q13, d15
@@ -446,6 +455,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16q12, d14
 vmul.s32 q14, q12, q6
@@ -1524,8 +1534,9 @@ function ff_hevc_put_qpel_bi_uw_pixels_neon_8, export=1
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
-vshll.u8   q7 , d8, #6// src[x] << 6 and move long to 8x16bi

[FFmpeg-devel] [PATCH v2 5/5] avcodec/hevcdsp: Add NEON optimization for sao

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/Makefile|   3 +-
 libavcodec/arm/hevcdsp_init_neon.c |  62 +
 libavcodec/arm/hevcdsp_sao_neon.S  | 181 +
 3 files changed, 245 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/arm/hevcdsp_sao_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 1acda0b1f8..fc4c0147c5 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -137,7 +137,8 @@ NEON-OBJS-$(CONFIG_HEVC_DECODER)   += 
arm/hevcdsp_init_neon.o   \
   arm/hevcdsp_deblock_neon.o\
   arm/hevcdsp_idct_neon.o   \
   arm/hevcdsp_qpel_neon.o   \
-  arm/hevcdsp_epel_neon.o
+  arm/hevcdsp_epel_neon.o   \
+ arm/hevcdsp_sao_neon.o
 NEON-OBJS-$(CONFIG_RV30_DECODER)   += arm/rv34dsp_neon.o
 NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
   arm/rv40dsp_neon.o
diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
b/libavcodec/arm/hevcdsp_init_neon.c
index 4e57422ad4..f7efff28e1 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -23,6 +23,13 @@
 #include "libavcodec/hevcdsp.h"
 #include "hevcdsp_arm.h"
 
+void ff_hevc_sao_band_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src,
+  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+  int16_t *sao_offset_val, int sao_left_class,
+  int width, int height);
+void ff_hevc_sao_edge_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src, 
ptrdiff_t stride_dst, int16_t *sao_offset_val,
+  int eo, int width, int height);
+
 void ff_hevc_v_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
_beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
 void ff_hevc_h_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
_beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
 void ff_hevc_v_loop_filter_chroma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
*_tc, uint8_t *_no_p, uint8_t *_no_q);
@@ -414,6 +421,51 @@ EPEL_FUNC_WT(ff_hevc_put_epel_wt_h6v7_neon_8);
 EPEL_FUNC_WT(ff_hevc_put_epel_wt_h7v7_neon_8);
 #undef EPEL_FUNC_WT
 
+void ff_hevc_sao_band_filter_neon_8(uint8_t *dst, uint8_t *src, ptrdiff_t 
stride_dst, ptrdiff_t stride_src, int width, int height, int16_t *offset_table);
+
+void ff_hevc_sao_band_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src,
+  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+  int16_t *sao_offset_val, int sao_left_class,
+  int width, int height) {
+uint8_t *dst = (uint8_t *)_dst;
+uint8_t *src = (uint8_t *)_src;
+int16_t offset_table[32] = {0};
+int k;
+
+stride_dst /= sizeof(uint8_t);
+stride_src /= sizeof(uint8_t);
+
+for (k = 0; k < 4; k++) {
+offset_table[(k + sao_left_class) & 31] = sao_offset_val[k + 1];
+}
+
+ff_hevc_sao_band_filter_neon_8(dst, src, stride_dst, stride_src, width, 
height, offset_table);
+}
+
+void ff_hevc_sao_edge_filter_neon_8(uint8_t *dst, uint8_t *src, ptrdiff_t 
stride_dst, ptrdiff_t stride_src, int width, int height,
+int a_stride, int b_stride, int16_t 
*sao_offset_val, uint8_t *edge_idx);
+
+void ff_hevc_sao_edge_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src, 
ptrdiff_t stride_dst, int16_t *sao_offset_val,
+  int eo, int width, int height) {
+static uint8_t edge_idx[] = { 1, 2, 0, 3, 4 };
+static const int8_t pos[4][2][2] = {
+{ { -1,  0 }, {  1, 0 } }, // horizontal
+{ {  0, -1 }, {  0, 1 } }, // vertical
+{ { -1, -1 }, {  1, 1 } }, // 45 degree
+{ {  1, -1 }, { -1, 1 } }, // 135 degree
+};
+uint8_t *dst = (uint8_t *)_dst;
+uint8_t *src = (uint8_t *)_src;
+int a_stride, b_stride;
+ptrdiff_t stride_src = (2*64 + 32) / sizeof(uint8_t);
+stride_dst /= sizeof(uint8_t);
+
+a_stride = pos[eo][0][0] + pos[eo][0][1] * stride_src;
+b_stride = pos[eo][1][0] + pos[eo][1][1] * stride_src;
+
+ff_hevc_sao_edge_filter_neon_8(dst, src, stride_dst, stride_src, width, 
height, a_stride, b_stride, sao_offset_val, edge_idx);
+}
+
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t 
srcstride,
int height, intptr_t mx, intptr_t my, int 
width) {
 
@@ -505,6 +557,16 @@ av_cold void ff_hevc_dsp_init_neon(HEVCDSPContext *c, 
const int bit_depth)
 c->hevc_h_loop_filter_luma = ff_hevc_h_loop_filter_luma_neon;
 c->hevc_v_loop_filter_chroma   = ff_hevc_v_loop_filter_chroma_neon;
 c->he

[FFmpeg-devel] [PATCH v2 3/5] avcodec/hevcdsp: Add NEON optimization for whole-pixel interpolation

2017-11-22 Thread Shengbin Meng
New code is written for qpel, and the qpel code is then reused for epel,
because whole-pixel interpolation in qpel and epel is identical.
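
As a rough C illustration of why the whole-pixel case can be shared: when no fractional offset is involved, no filter taps are applied, so the intermediate value is just the source pixel widened to 16 bits and shifted left by 6 for 8-bit input. The sketch below assumes that behaviour; `put_pixels_8` is a hypothetical name, and strides are counted in elements of each buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Whole-pixel "interpolation" for 8-bit input: each sample is widened to
 * 16 bits and shifted left by 6, with no filter taps involved. The same
 * routine therefore serves both the qpel and the epel function tables. */
static void put_pixels_8(int16_t *dst, ptrdiff_t dststride,
                         const uint8_t *src, ptrdiff_t srcstride,
                         int width, int height)
{
    int x, y;

    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++)
            dst[x] = (int16_t)(src[x] << 6);
        dst += dststride;
        src += srcstride;
    }
}
```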

Signed-off-by: Shengbin Meng 
---
 libavcodec/arm/hevcdsp_init_neon.c | 107 ++
 libavcodec/arm/hevcdsp_qpel_neon.S | 177 +
 2 files changed, 284 insertions(+)

diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
b/libavcodec/arm/hevcdsp_init_neon.c
index 7d85c29d6a..4e57422ad4 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -89,6 +89,10 @@ static void (*put_hevc_epel_uw_neon[8][8])(uint8_t *dst, 
ptrdiff_t dststride, ui
int width, int height, int16_t* 
src2, ptrdiff_t src2stride);
 static void (*put_hevc_epel_wt_neon[8][8])(uint8_t *_dst, ptrdiff_t 
_dststride, uint8_t *_src, ptrdiff_t _srcstride,
int width, int height, int denom, 
int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t src2stride);
+static void (*put_hevc_qpel_bi_uw_pixels_neon[1])(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *_src, ptrdiff_t _srcstride,
+  int width, int height, int16_t* 
src2, ptrdiff_t src2stride);
+static void (*put_hevc_qpel_wt_pixels_neon[1])(uint8_t *_dst, ptrdiff_t 
_dststride, uint8_t *_src, ptrdiff_t _srcstride,
+  int width, int height, 
int denom, int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t 
src2stride);
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t 
srcstride,
int height, intptr_t mx, intptr_t my, int 
width);
 void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, 
uint8_t *src, ptrdiff_t srcstride,
@@ -119,6 +123,17 @@ void ff_hevc_put_epel_bi_w_neon_wrapper(uint8_t *dst, 
ptrdiff_t dststride, uint8
  int16_t *src2,
  int height, int denom, int wx0, 
int wx1,
  int ox0, int ox1, intptr_t mx, 
intptr_t my, int width);
+void ff_hevc_put_qpel_bi_uw_pixels_neon_wrapper(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *src, ptrdiff_t srcstride,
+   int16_t *src2,
+   int height, intptr_t mx, 
intptr_t my, int width);
+void ff_hevc_put_qpel_uni_wt_pixels_neon_wrapper(uint8_t *dst,  ptrdiff_t 
dststride,
+  uint8_t *src, ptrdiff_t 
srcstride,
+  int height, int denom, int wx, 
int ox,
+  intptr_t mx, intptr_t my, int 
width);
+void ff_hevc_put_qpel_bi_wt_pixels_neon_wrapper(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *src, ptrdiff_t srcstride,
+int16_t *src2,
+int height, int denom, int 
wx0, int wx1,
+int ox0, int ox1, intptr_t mx, 
intptr_t my, int width);
 
 #define QPEL_FUNC(name) \
 void name(int16_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t 
srcstride, \
@@ -172,6 +187,7 @@ QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h2v3_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v1_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v2_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v3_neon_8);
+QPEL_FUNC_UW(ff_hevc_put_qpel_bi_uw_pixels_neon_8);
 #undef QPEL_FUNC_UW
 
 #define QPEL_FUNC_WT(name) \
@@ -192,6 +208,7 @@ QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v3_neon_8);
 QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v1_neon_8);
 QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v2_neon_8);
 QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_pixels_neon_8);
 #undef QPEL_FUNC_WT
 
 
@@ -459,6 +476,27 @@ void ff_hevc_put_epel_bi_w_neon_wrapper(uint8_t *dst, 
ptrdiff_t dststride, uint8
 put_hevc_epel_wt_neon[my][mx](dst, dststride, src, srcstride, width, 
height, denom, wx1, ox1, wx0, ox0, src2, MAX_PB_SIZE);
 }
 
+
+void ff_hevc_put_qpel_bi_uw_pixels_neon_wrapper(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *src, ptrdiff_t srcstride,
+   int16_t *src2,
+   int height, intptr_t mx, 
intptr_t my, int width) {
+put_hevc_qpel_bi_uw_pixels_neon[0](dst, dststride, src, srcstride, width, 
height, src2, MAX_PB_SIZE);
+}
+
+void ff_hevc_put_qpel_uni_wt_pixels_neon_wrapper(uint8_t *dst,  ptrdiff_t 
dststride,
+ uint8_t *src, ptrdiff_t 
srcstride,
+ int height, int denom, 
int wx, int ox,
+ intptr_t mx, intptr_t my, 
int width) {
+put_hevc_qpel_wt_pixels_neon[0](dst, dststride, src, srcstride

[FFmpeg-devel] [PATCH v2 2/5] avcodec/hevcdsp: Add NEON optimization for epel

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/Makefile|3 +-
 libavcodec/arm/hevcdsp_epel_neon.S | 2068 
 libavcodec/arm/hevcdsp_init_neon.c |  458 
 3 files changed, 2528 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/arm/hevcdsp_epel_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 1eeac5449e..1acda0b1f8 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -136,7 +136,8 @@ NEON-OBJS-$(CONFIG_DCA_DECODER)+= 
arm/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_HEVC_DECODER)   += arm/hevcdsp_init_neon.o   \
   arm/hevcdsp_deblock_neon.o\
   arm/hevcdsp_idct_neon.o   \
-  arm/hevcdsp_qpel_neon.o
+  arm/hevcdsp_qpel_neon.o   \
+  arm/hevcdsp_epel_neon.o
 NEON-OBJS-$(CONFIG_RV30_DECODER)   += arm/rv34dsp_neon.o
 NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
   arm/rv40dsp_neon.o
diff --git a/libavcodec/arm/hevcdsp_epel_neon.S 
b/libavcodec/arm/hevcdsp_epel_neon.S
new file mode 100644
index 00..d0d93e8033
--- /dev/null
+++ b/libavcodec/arm/hevcdsp_epel_neon.S
@@ -0,0 +1,2068 @@
+/*
+ * Copyright (c) 2017 Meng Wang 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/arm/asm.S"
+#include "neon.S"
+
+.macro regshuffle_d4
+vmov d16, d17
+vmov d17, d18
+vmov d18, d19
+.endm
+
+.macro regshuffle_q4
+vmov q0, q1
+vmov q1, q2
+vmov q2, q3
+.endm
+
+.macro vextin4
+pld   [r2]
+vld1.8 {q11}, [r2], r3
+vext.8 d16, d22, d23, #1
+vext.8 d17, d22, d23, #2
+vext.8 d18, d22, d23, #3
+vext.8 d19, d22, d23, #4
+.endm
+
+.macro loadin4
+pld   [r2]
+vld1.8 {d16}, [r2], r3
+pld   [r2]
+vld1.8 {d17}, [r2], r3
+pld   [r2]
+vld1.8 {d18}, [r2], r3
+pld   [r2]
+vld1.8 {d19}, [r2], r3
+.endm
+
+.macro epel_filter_1_32b
+vmov.i16   d16, #58
+vmov.i16   d17, #10
+vmull.s16  q9,  d2,  d16   // 58*b0
+vmull.s16  q10, d3,  d16   // 58*b1
+vmull.s16  q11, d4,  d17   // 10*c0
+vmull.s16  q12, d5,  d17   // 10*c1
+vadd.s32   q11, q9
+vadd.s32   q12, q10
+vaddl.s16  q9,  d0,  d6
+vaddl.s16  q10, d1,  d7
+vshl.s32   q13, q9,  #1   // 2*a + 2*d
+vshl.s32   q14, q10, #1
+vsub.s32   q11, q13       // -2*a + 58*b + 10*c - 2*d
+vsub.s32   q12, q14
+vqshrn.s32 d16, q11, #6
+vqshrn.s32 d17, q12, #6   // out=q8
+.endm
+
+.macro epel_filter_2_32b
+vmov.i16   d16, #54
+vmull.s16  q9,  d2, d16   // 54*b0
+vmull.s16  q10, d3, d16   // 54*b1
+vshll.s16  q11, d4, #4   // 16*c0
+vshll.s16  q12, d5, #4   // 16*c1
+vadd.s32   q9,  q11
+vadd.s32   q10, q12
+vshll.s16  q11, d0, #2   // 4*a0
+vshll.s16  q12, d1, #2   // 4*a1
+vshll.s16  q13, d6, #1   // 2*d0
+vshll.s16  q14, d7, #1   // 2*d1
+vadd.s32   q11, q13
+vadd.s32   q12, q14
+vsub.s32   q9,  q11   // -4*a + 54*b + 16*c - 2*d
+vsub.s32   q10, q12
+vqshrn.s32 d16, q9,  #6
+vqshrn.s32 d17, q10, #6   // out=q8
+.endm
+
+.macro epel_filter_3_32b
+vmov.i16   d16, #46
+vmull.s16  q9,  d2, d16   // 46*b0
+vmull.s16  q10, d3, d16   // 46*b1
+vshll.s16  q11, d4, #5
+vshll.s16  q12, d5, #5
+vshll.s16  q13, d4, #2
+vshll.s16  q14, d5, #2
+vsub.s32   q11, q13   // 28*c0
+vsub.s32   q12, q14   // 28*c1
+vadd.s32   q9,  q11   // 46*b0 + 28*c0
+vadd.s32   q10, q12   // 46*b1 + 28*c1
+vshll.s16  q11, d6, #2   // 4*d0
+vshll.s16  q12, d7, #2   // 4*d1
+vmov.i16   d16, #6
+vmull.s16  q13, d0, d16   // 6*a0
+vmull.s16  q14, d1, d16   // 6*a1
+vadd.s32   q11, q13
+ 
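
For readers following the macro comments above: the three epel_filter_*_32b macros appear to apply the 4-tap chroma filters (-2, 58, 10, -2), (-4, 54, 16, -2) and (-6, 46, 28, -4) to four neighbouring 16-bit samples a, b, c, d, then shift right by 6 with saturation (vqshrn.s32 #6). The scalar C sketch below restates that arithmetic. The names epel_taps, sat16 and epel_filter_32b are hypothetical and not part of the patch, and the signs in the third row follow the standard HEVC chroma filter table because the macro is cut off in this excerpt.

```c
#include <stdint.h>

/* 4-tap chroma (epel) coefficients for fractional positions 1..3.
 * Rows 1 and 2 are read from the macro comments; row 3 is an assumption
 * taken from the HEVC filter table, since the macro is truncated here. */
static const int8_t epel_taps[3][4] = {
    { -2, 58, 10, -2 },
    { -4, 54, 16, -2 },
    { -6, 46, 28, -4 },
};

/* Saturate a 32-bit value to the int16_t range, like a saturating narrow. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical scalar equivalent of one lane of epel_filter_<frac>_32b:
 * 32-bit accumulate, truncating shift by 6, saturate to 16 bits.
 * frac must be 1, 2 or 3. */
static int16_t epel_filter_32b(int frac, int16_t a, int16_t b,
                               int16_t c, int16_t d)
{
    const int8_t *t = epel_taps[frac - 1];
    int32_t sum = t[0] * a + t[1] * b + t[2] * c + t[3] * d;
    return sat16(sum >> 6);
}
```

Each row of taps sums to 64, so a flat input passes through unchanged after the shift by 6.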

[FFmpeg-devel] [PATCH v2 1/5] avcodec/hevcdsp: Add NEON optimization for qpel weighted mode

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/hevcdsp_init_neon.c |  67 +
 libavcodec/arm/hevcdsp_qpel_neon.S | 509 +
 2 files changed, 576 insertions(+)

diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
b/libavcodec/arm/hevcdsp_init_neon.c
index a4628d2a93..183162803e 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -81,6 +81,8 @@ static void (*put_hevc_qpel_neon[4][4])(int16_t *dst, 
ptrdiff_t dststride, uint8
int height, int width);
 static void (*put_hevc_qpel_uw_neon[4][4])(uint8_t *dst, ptrdiff_t dststride, 
uint8_t *_src, ptrdiff_t _srcstride,
int width, int height, int16_t* src2, 
ptrdiff_t src2stride);
+static void (*put_hevc_qpel_wt_neon[4][4])(uint8_t *_dst, ptrdiff_t 
_dststride, uint8_t *_src, ptrdiff_t _srcstride,
+   int width, int height, int denom, int wx1, 
int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t src2stride);
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t 
srcstride,
int height, intptr_t mx, intptr_t my, int 
width);
 void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, 
uint8_t *src, ptrdiff_t srcstride,
@@ -88,6 +90,15 @@ void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, 
ptrdiff_t dststride, uint8_
 void ff_hevc_put_qpel_bi_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, 
uint8_t *src, ptrdiff_t srcstride,
int16_t *src2,
int height, intptr_t mx, intptr_t my, 
int width);
+void ff_hevc_put_qpel_uni_w_neon_wrapper(uint8_t *dst,  ptrdiff_t dststride,
+ uint8_t *src, ptrdiff_t srcstride,
+ int height, int denom, int wx, int ox,
+ intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_bi_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, 
uint8_t *src, ptrdiff_t srcstride,
+int16_t *src2,
+int height, int denom, int wx0, int 
wx1,
+int ox0, int ox1, intptr_t mx, 
intptr_t my, int width);
+
 #define QPEL_FUNC(name) \
 void name(int16_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t 
srcstride, \
int height, int width)
@@ -142,6 +153,26 @@ QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v2_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v3_neon_8);
 #undef QPEL_FUNC_UW
 
+#define QPEL_FUNC_WT(name) \
+void name(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t 
_srcstride, \
+int width, int height, int denom, int wx1, int ox1, int wx0, int ox0, 
int16_t* src2, ptrdiff_t src2stride);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v3_neon_8);
+#undef QPEL_FUNC_WT
+
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t 
srcstride,
int height, intptr_t mx, intptr_t my, int 
width) {
 
@@ -160,6 +191,21 @@ void ff_hevc_put_qpel_bi_neon_wrapper(uint8_t *dst, 
ptrdiff_t dststride, uint8_t
 put_hevc_qpel_uw_neon[my][mx](dst, dststride, src, srcstride, width, 
height, src2, MAX_PB_SIZE);
 }
 
+void ff_hevc_put_qpel_uni_w_neon_wrapper(uint8_t *dst,  ptrdiff_t dststride,
+  uint8_t *src, ptrdiff_t 
srcstride,
+  int height, int denom, int wx, 
int ox,
+  intptr_t mx, intptr_t my, int 
width) {
+put_hevc_qpel_wt_neon[my][mx](dst, dststride, src, srcstride, width, 
height, denom, wx, ox, 0, 0, NULL, 0);
+}
+
+void ff_hevc_put_qpel_bi_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, 
uint8_t *src, ptrdiff_t srcstride,
+ int16_t *src2,
+ int height, int denom, int wx0, 
int wx1,
+ int ox0, int ox1, intptr_t mx, 
intptr_t my, int width) {
+put_hevc_qpel_wt_neon[my][mx](dst, dststride, src, srcstride, width, 
height, denom, wx1, ox1, wx0, ox0, src2, MAX_PB_SIZE);
+}
+
+
 av_cold void ff
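
As a rough guide to what the uni-weighted path computes per sample, here is a scalar C sketch for the 8-bit whole-pixel case, assuming the usual HEVC weighted-prediction formula with shift = denom + 14 - 8. It is not the NEON code from this patch; uni_weight_8bit and clip_u8 are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference for one uni-weighted 8-bit sample:
 * scale to 14-bit precision, multiply by the weight wx, round,
 * shift back down, then add the offset ox and clip. */
static uint8_t uni_weight_8bit(uint8_t sample, int denom, int wx, int ox)
{
    int shift  = denom + 14 - 8;       /* at least 6 for denom >= 0 */
    int offset = 1 << (shift - 1);     /* rounding term */
    int v = (((sample << 6) * wx + offset) >> shift) + ox;
    return clip_u8(v);
}
```

With denom = 6, wx = 64 and ox = 0 the weight is exactly 1.0, so the sample is returned unchanged.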

[FFmpeg-devel] [PATCH 6/6] avcodec/hevcdsp: Add NEON optimization for idct16x16

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/hevcdsp_idct_neon.S | 241 +
 libavcodec/arm/hevcdsp_init_neon.c |   2 +
 2 files changed, 243 insertions(+)

diff --git a/libavcodec/arm/hevcdsp_idct_neon.S 
b/libavcodec/arm/hevcdsp_idct_neon.S
index e39d00634b..272abf279c 100644
--- a/libavcodec/arm/hevcdsp_idct_neon.S
+++ b/libavcodec/arm/hevcdsp_idct_neon.S
@@ -451,6 +451,247 @@ function ff_hevc_transform_8x8_neon_8, export=1
 bx lr
 endfunc
 
+/* 16x16 even line combine, input: q3-q10  output: q8-q15 */
+.macro tr8_combine
+vsub.s32   q12, q3, q10  // e_8[3] - o_8[3], dst[4]
+vadd.s32   q11, q3, q10  // e_8[3] + o_8[3], dst[3]
+
+vsub.s32   q13, q6, q9   // e_8[2] - o_8[2], dst[5]
+vadd.s32   q10, q6, q9   // e_8[2] + o_8[2], dst[2]
+
+vsub.s32   q14, q5, q8   // e_8[1] - o_8[1], dst[6]
+vadd.s32   q9,  q5, q8   // e_8[1] + o_8[1], dst[1]
+
+vsub.s32   q15, q4, q7   // e_8[0] - o_8[0], dst[7]
+vadd.s32   q8,  q4, q7   // e_8[0] + o_8[0], dst[0]
+.endm
+
+.macro tr16_begin in0, in1, in2, in3, in4, in5, in6, in7
+vmull.s16  q2, \in0, d2[1]   // 90 * src1
+vmull.s16  q3, \in0, d2[0]   // 87 * src1
+vmull.s16  q4, \in0, d2[3]   // 80 * src1
+vmull.s16  q5, \in0, d2[2]   // 70 * src1
+vmull.s16  q6, \in0, d3[1]   // 57 * src1
+vmull.s16  q7, \in0, d3[0]   // 43 * src1
+vmull.s16  q8, \in0, d3[3]   // 25 * src1
+vmull.s16  q9, \in0, d3[2]   //  9 * src1
+
+vmlal.s16  q2, \in1, d2[0]   // 87 * src3
+vmlal.s16  q3, \in1, d3[1]   // 57 * src3
+vmlal.s16  q4, \in1, d3[2]   //  9 * src3
+vmlsl.s16  q5, \in1, d3[0]   //-43 * src3
+vmlsl.s16  q6, \in1, d2[3]   //-80 * src3
+vmlsl.s16  q7, \in1, d2[1]   //-90 * src3
+vmlsl.s16  q8, \in1, d2[2]   //-70 * src3
+vmlsl.s16  q9, \in1, d3[3]   //-25 * src3
+
+vmlal.s16  q2, \in2, d2[3]   // 80 * src5
+vmlal.s16  q3, \in2, d3[2]   //  9 * src5
+vmlsl.s16  q4, \in2, d2[2]   //-70 * src5
+vmlsl.s16  q5, \in2, d2[0]   //-87 * src5
+vmlsl.s16  q6, \in2, d3[3]   //-25 * src5
+vmlal.s16  q7, \in2, d3[1]   // 57 * src5
+vmlal.s16  q8, \in2, d2[1]   // 90 * src5
+vmlal.s16  q9, \in2, d3[0]   // 43 * src5
+
+vmlal.s16  q2, \in3, d2[2]   // 70 * src7
+vmlsl.s16  q3, \in3, d3[0]   //-43 * src7
+vmlsl.s16  q4, \in3, d2[0]   //-87 * src7
+vmlal.s16  q5, \in3, d3[2]   //  9 * src7
+vmlal.s16  q6, \in3, d2[1]   // 90 * src7
+vmlal.s16  q7, \in3, d3[3]   // 25 * src7
+vmlsl.s16  q8, \in3, d2[3]   //-80 * src7
+vmlsl.s16  q9, \in3, d3[1]   //-57 * src7
+
+vmlal.s16  q2, \in4, d3[1]   // 57 * src9
+vmlsl.s16  q3, \in4, d2[3]   //-80 * src9
+vmlsl.s16  q4, \in4, d3[3]   //-25 * src9
+vmlal.s16  q5, \in4, d2[1]   // 90 * src9
+vmlsl.s16  q6, \in4, d3[2]   // -9 * src9
+vmlsl.s16  q7, \in4, d2[0]   //-87 * src9
+vmlal.s16  q8, \in4, d3[0]   // 43 * src9
+vmlal.s16  q9, \in4, d2[2]   // 70 * src9
+
+vmlal.s16  q2, \in5, d3[0]   // 43 * src11
+vmlsl.s16  q3, \in5, d2[1]   //-90 * src11
+vmlal.s16  q4, \in5, d3[1]   // 57 * src11
+vmlal.s16  q5, \in5, d3[3]   // 25 * src11
+vmlsl.s16  q6, \in5, d2[0]   //-87 * src11
+vmlal.s16  q7, \in5, d2[2]   // 70 * src11
+vmlal.s16  q8, \in5, d3[2]   //  9 * src11
+vmlsl.s16  q9, \in5, d2[3]   //-80 * src11
+
+vmlal.s16  q2, \in6, d3[3]   // 25 * src13
+vmlsl.s16  q3, \in6, d2[2]   //-70 * src13
+vmlal.s16  q4, \in6, d2[1]   // 90 * src13
+vmlsl.s16  q5, \in6, d2[3]   //-80 * src13
+vmlal.s16  q6, \in6, d3[0]   // 43 * src13
+vmlal.s16  q7, \in6, d3[2]   //  9 * src13
+vmlsl.s16  q8, \in6, d3[1]   //-57 * src13
+vmlal.s16  q9, \in6, d2[0]   // 87 * src13
+
+
+vmlal.s16  q2, \in7, d3[2]   //  9 * src15
+vmlsl.s16  q3, \in7, d3[3]   //-25 * src15
+vmlal.s16  q4, \in7, d3[0]   // 43 * src15
+vmlsl.s16  q5, \in7, d3[1]   //-57 * src15
+vmlal.s16  q6, \in7, d2[2]   // 70 * src15
+vmlsl.s16  q7, \in7, d2[3]   //-80 * src15
+vmlal.s16  q8, \in7, d2[0]   // 87 * src15
+vmlsl.s16  q9, \in7, d2[1]   //-90 * src15
+.endm
+
+.macro tr16_end shift
+vpop   {q2-q3}
+vadd.s32   q4, q8,  q2
+vsub.s32   q5, q8,  q2
+vqrshrn.s32 d12, q4, \shift
+vqrshrn.s32 d15, q5, \shift
+
+vadd.s32   q4, q9,  q3
+vsub.s32   q5, q9,  q3
+vqrshrn.s32 d13, q4, \shift
+vqrshrn.s32 d14, q5, \shift
+
+vpop   {q2-q3}
+vadd.s32   q4, q10, q2
+vsub.s32   q5, q10, q2
+vqrshrn.s32 d16, q4, \shift
+vqrshrn.s32 d19, q5, \shift
+
+vadd.s32
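
The tr16_begin macro above accumulates eight 32-bit sums (q2 to q9) from the eight odd-indexed inputs src1, src3, ..., src15. The C sketch below restates the same multiply-accumulate pattern in scalar form, with the coefficient table transcribed from the macro comments. The names tr16_odd and tr16_odd_part are hypothetical, and the mapping of q2..q9 to outputs o[0]..o[7] is an assumption made for illustration.

```c
#include <stdint.h>

/* Coefficients as given in the tr16_begin comments.
 * Row j belongs to input src(2*j+1); column k is the sum held in q(2+k). */
static const int8_t tr16_odd[8][8] = {
    { 90,  87,  80,  70,  57,  43,  25,   9 },
    { 87,  57,   9, -43, -80, -90, -70, -25 },
    { 80,   9, -70, -87, -25,  57,  90,  43 },
    { 70, -43, -87,   9,  90,  25, -80, -57 },
    { 57, -80, -25,  90,  -9, -87,  43,  70 },
    { 43, -90,  57,  25, -87,  70,   9, -80 },
    { 25, -70,  90, -80,  43,   9, -57,  87 },
    {  9, -25,  43, -57,  70, -80,  87, -90 },
};

/* Hypothetical scalar version of the odd-part accumulation:
 * o[k] = sum over j of tr16_odd[j][k] * src_odd[j]. */
static void tr16_odd_part(const int16_t src_odd[8], int32_t o[8])
{
    for (int k = 0; k < 8; k++) {
        int32_t sum = 0;
        for (int j = 0; j < 8; j++)
            sum += tr16_odd[j][k] * src_odd[j];
        o[k] = sum;
    }
}
```

The table is symmetric, which gives a quick sanity check when transcribing the vmlal/vmlsl signs.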

[FFmpeg-devel] [PATCH 2/6] avcodec/hevcdsp: Add NEON optimization for epel

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/Makefile|3 +-
 libavcodec/arm/hevcdsp_epel_neon.S | 2068 
 libavcodec/arm/hevcdsp_init_neon.c |  459 
 3 files changed, 2529 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/arm/hevcdsp_epel_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 1eeac5449e..1acda0b1f8 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -136,7 +136,8 @@ NEON-OBJS-$(CONFIG_DCA_DECODER)+= 
arm/synth_filter_neon.o
 NEON-OBJS-$(CONFIG_HEVC_DECODER)   += arm/hevcdsp_init_neon.o   \
   arm/hevcdsp_deblock_neon.o\
   arm/hevcdsp_idct_neon.o   \
-  arm/hevcdsp_qpel_neon.o
+  arm/hevcdsp_qpel_neon.o   \
+  arm/hevcdsp_epel_neon.o
 NEON-OBJS-$(CONFIG_RV30_DECODER)   += arm/rv34dsp_neon.o
 NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
   arm/rv40dsp_neon.o
diff --git a/libavcodec/arm/hevcdsp_epel_neon.S 
b/libavcodec/arm/hevcdsp_epel_neon.S
new file mode 100644
index 00..d0d93e8033
--- /dev/null
+++ b/libavcodec/arm/hevcdsp_epel_neon.S
@@ -0,0 +1,2068 @@
+/*
+ * Copyright (c) 2017 Meng Wang 
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/arm/asm.S"
+#include "neon.S"
+
+.macro regshuffle_d4
+vmov d16, d17
+vmov d17, d18
+vmov d18, d19
+.endm
+
+.macro regshuffle_q4
+vmov q0, q1
+vmov q1, q2
+vmov q2, q3
+.endm
+
+.macro vextin4
+pld   [r2]
+vld1.8 {q11}, [r2], r3
+vext.8 d16, d22, d23, #1
+vext.8 d17, d22, d23, #2
+vext.8 d18, d22, d23, #3
+vext.8 d19, d22, d23, #4
+.endm
+
+.macro loadin4
+pld   [r2]
+vld1.8 {d16}, [r2], r3
+pld   [r2]
+vld1.8 {d17}, [r2], r3
+pld   [r2]
+vld1.8 {d18}, [r2], r3
+pld   [r2]
+vld1.8 {d19}, [r2], r3
+.endm
+
+.macro epel_filter_1_32b
+vmov.i16   d16, #58
+vmov.i16   d17, #10
+vmull.s16  q9,  d2,  d16   // 58*b0
+vmull.s16  q10, d3,  d16   // 58*b1
+vmull.s16  q11, d4,  d17   // 10*c0
+vmull.s16  q12, d5,  d17   // 10*c1
+vadd.s32   q11, q9
+vadd.s32   q12, q10
+vaddl.s16  q9,  d0,  d6
+vaddl.s16  q10, d1,  d7
+vshl.s32   q13, q9,  #1   // 2*a + 2*d
+vshl.s32   q14, q10, #1
+vsub.s32   q11, q13       // -2*a + 58*b + 10*c - 2*d
+vsub.s32   q12, q14
+vqshrn.s32 d16, q11, #6
+vqshrn.s32 d17, q12, #6   // out=q8
+.endm
+
+.macro epel_filter_2_32b
+vmov.i16   d16, #54
+vmull.s16  q9,  d2, d16   // 54*b0
+vmull.s16  q10, d3, d16   // 54*b1
+vshll.s16  q11, d4, #4   // 16*c0
+vshll.s16  q12, d5, #4   // 16*c1
+vadd.s32   q9,  q11
+vadd.s32   q10, q12
+vshll.s16  q11, d0, #2   // 4*a0
+vshll.s16  q12, d1, #2   // 4*a1
+vshll.s16  q13, d6, #1   // 2*d0
+vshll.s16  q14, d7, #1   // 2*d1
+vadd.s32   q11, q13
+vadd.s32   q12, q14
+vsub.s32   q9,  q11   // -4*a + 54*b + 16*c - 2*d
+vsub.s32   q10, q12
+vqshrn.s32 d16, q9,  #6
+vqshrn.s32 d17, q10, #6   // out=q8
+.endm
+
+.macro epel_filter_3_32b
+vmov.i16   d16, #46
+vmull.s16  q9,  d2, d16   // 46*b0
+vmull.s16  q10, d3, d16   // 46*b1
+vshll.s16  q11, d4, #5
+vshll.s16  q12, d5, #5
+vshll.s16  q13, d4, #2
+vshll.s16  q14, d5, #2
+vsub.s32   q11, q13   // 28*c0
+vsub.s32   q12, q14   // 28*c1
+vadd.s32   q9,  q11   // 46*b0 + 28*c0
+vadd.s32   q10, q12   // 46*b1 + 28*c1
+vshll.s16  q11, d6, #2   // 4*d0
+vshll.s16  q12, d7, #2   // 4*d1
+vmov.i16   d16, #6
+vmull.s16  q13, d0, d16   // 6*a0
+vmull.s16  q14, d1, d16   // 6*a1
+vadd.s32   q11, q13
+ 

[FFmpeg-devel] [PATCH 4/6] avcodec/hevcdsp: Use pre-load (pld) to optimize data loading

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/hevcdsp_epel_neon.S | 10 ++
 libavcodec/arm/hevcdsp_qpel_neon.S | 24 
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/libavcodec/arm/hevcdsp_epel_neon.S 
b/libavcodec/arm/hevcdsp_epel_neon.S
index d0d93e8033..03e6504481 100644
--- a/libavcodec/arm/hevcdsp_epel_neon.S
+++ b/libavcodec/arm/hevcdsp_epel_neon.S
@@ -306,6 +306,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16 {q7}, [r0], r1
 regshuffle_d4
@@ -320,6 +321,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16 d14, [r0], r1
 regshuffle_d4
@@ -357,6 +359,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.8 d0, [r0], r1
@@ -372,6 +375,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.32 d0[0], [r0], r1
@@ -396,6 +400,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16 {q0}, [r8], r9
 vqadd.s16  q0, q7
@@ -415,6 +420,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16  d0, [r8], r9
 vqadd.s16 d0, d14
@@ -465,6 +471,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs  r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14   // extending signed 4x16bit data to 4x32 bit
 vmovl.s16 q13, d15
@@ -490,6 +497,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 4x32 bit
 vmul.s32  q14, q12, q6
@@ -535,6 +543,7 @@
 cmp   r5, #2
 beq   2f
 8:  subs    r4,   #1
+pld [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 4x32 bit
 vmovl.s16 q13, d15
@@ -569,6 +578,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14
 vmul.s32 q14, q12, q6
diff --git a/libavcodec/arm/hevcdsp_qpel_neon.S 
b/libavcodec/arm/hevcdsp_qpel_neon.S
index 71ecc00b6e..b507fbc13b 100644
--- a/libavcodec/arm/hevcdsp_qpel_neon.S
+++ b/libavcodec/arm/hevcdsp_qpel_neon.S
@@ -231,6 +231,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16 {q7}, [r0], r1
 regshuffle_d8
@@ -245,6 +246,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vst1.16 d14, [r0], r1
 regshuffle_d8
@@ -273,6 +275,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.8 d0, [r0], r1
@@ -288,6 +291,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vqrshrun.s16   d0, q7, #6
 vst1.32 d0[0], [r0], r1
@@ -301,6 +305,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16 {q0}, [r8], r9
 vqadd.s16  q0, q7
@@ -320,6 +325,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vld1.16  d0, [r8], r9
 vqadd.s16 d0, d14
@@ -358,6 +364,7 @@
 cmp   r5, #4
 beq   4f
 8:  subs  r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14 // extending signed 4x16bit data to 4x32 bit
 vmovl.s16 q13, d15
@@ -383,6 +390,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 4x32 bit
 vmul.s32  q14, q12, q6
@@ -412,6 +420,7 @@
 cmp r5,   #4
 beq 4f
 8:  subs    r4,   #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14  // extending signed 4x16bit data to 4x32 bit
 vmovl.s16 q13, d15
@@ -446,6 +455,7 @@
 mov r2, r7
 b 0b
 4:  subs r4, #1
+pld   [r2]
 \filter
 vmovl.s16 q12, d14
 vmul.s32 q14, q12, q6
@@ -1524,8 +1534,9 @@ function ff_hevc_put_qpel_bi_uw_pixels_neon_8, export=1
 cmp   r5, #4
 beq   4f
 8:  subs r4, #1
-vshll.u8   q7 , d8, #6// src[x] << 6 and move long to 8x16bi

[FFmpeg-devel] [PATCH 3/6] avcodec/hevcdsp: Add NEON optimization for whole-pixel interpolation

2017-11-22 Thread Shengbin Meng
New code is written for qpel and then reused for epel, because
whole-pixel interpolation is identical in qpel and epel.
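
To illustrate why the whole-pixel case can be shared: with both fractional offsets at zero, no filter taps are involved, so the luma and chroma paths reduce to the same scaling of each sample (the assembly comment in this series describes it as src[x] << 6 for 8-bit input). The following is a minimal scalar C sketch of that idea, not the NEON code from this patch; pel_pixels_8bit is a hypothetical name and dst_stride is assumed to be in int16_t elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for 8-bit whole-pixel interpolation.
 * Each sample is only scaled to the 14-bit intermediate precision,
 * i.e. src[x] << (14 - 8); there is nothing specific to qpel or epel. */
static void pel_pixels_8bit(int16_t *dst, ptrdiff_t dst_stride,
                            const uint8_t *src, ptrdiff_t src_stride,
                            int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)(src[x] << 6);
        dst += dst_stride;   /* stride in int16_t elements (assumption) */
        src += src_stride;   /* stride in bytes */
    }
}
```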

Signed-off-by: Shengbin Meng 
---
 libavcodec/arm/hevcdsp_init_neon.c | 106 ++
 libavcodec/arm/hevcdsp_qpel_neon.S | 177 +
 2 files changed, 283 insertions(+)

diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
b/libavcodec/arm/hevcdsp_init_neon.c
index 9d885a62a9..6171863113 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -71,6 +71,10 @@ static void (*put_hevc_epel_uw_neon[8][8])(uint8_t *dst, 
ptrdiff_t dststride, ui
int width, int height, int16_t* 
src2, ptrdiff_t src2stride);
 static void (*put_hevc_epel_wt_neon[8][8])(uint8_t *_dst, ptrdiff_t 
_dststride, uint8_t *_src, ptrdiff_t _srcstride,
int width, int height, int denom, 
int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t src2stride);
+static void (*put_hevc_qpel_bi_uw_pixels_neon[1])(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *_src, ptrdiff_t _srcstride,
+  int width, int height, int16_t* 
src2, ptrdiff_t src2stride);
+static void (*put_hevc_qpel_wt_pixels_neon[1])(uint8_t *_dst, ptrdiff_t 
_dststride, uint8_t *_src, ptrdiff_t _srcstride,
+  int width, int height, 
int denom, int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t 
src2stride);
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t 
srcstride,
int height, intptr_t mx, intptr_t my, int 
width);
 void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, 
uint8_t *src, ptrdiff_t srcstride,
@@ -101,6 +105,17 @@ void ff_hevc_put_epel_bi_w_neon_wrapper(uint8_t *dst, 
ptrdiff_t dststride, uint8
  int16_t *src2,
  int height, int denom, int wx0, 
int wx1,
  int ox0, int ox1, intptr_t mx, 
intptr_t my, int width);
+void ff_hevc_put_qpel_bi_uw_pixels_neon_wrapper(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *src, ptrdiff_t srcstride,
+   int16_t *src2,
+   int height, intptr_t mx, 
intptr_t my, int width);
+void ff_hevc_put_qpel_uni_wt_pixels_neon_wrapper(uint8_t *dst,  ptrdiff_t 
dststride,
+  uint8_t *src, ptrdiff_t 
srcstride,
+  int height, int denom, int wx, 
int ox,
+  intptr_t mx, intptr_t my, int 
width);
+void ff_hevc_put_qpel_bi_wt_pixels_neon_wrapper(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *src, ptrdiff_t srcstride,
+int16_t *src2,
+int height, int denom, int 
wx0, int wx1,
+int ox0, int ox1, intptr_t mx, 
intptr_t my, int width);
 
 #define QPEL_FUNC(name) \
 void name(int16_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t 
srcstride, \
@@ -154,6 +169,7 @@ QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h2v3_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v1_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v2_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v3_neon_8);
+QPEL_FUNC_UW(ff_hevc_put_qpel_bi_uw_pixels_neon_8);
 #undef QPEL_FUNC_UW
 
 #define QPEL_FUNC_WT(name) \
@@ -174,6 +190,7 @@ QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v3_neon_8);
 QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v1_neon_8);
 QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v2_neon_8);
 QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_pixels_neon_8);
 #undef QPEL_FUNC_WT
 
 
@@ -441,6 +458,26 @@ void ff_hevc_put_epel_bi_w_neon_wrapper(uint8_t *dst, 
ptrdiff_t dststride, uint8
 put_hevc_epel_wt_neon[my][mx](dst, dststride, src, srcstride, width, 
height, denom, wx1, ox1, wx0, ox0, src2, MAX_PB_SIZE);
 }
 
+void ff_hevc_put_qpel_bi_uw_pixels_neon_wrapper(uint8_t *dst, ptrdiff_t 
dststride, uint8_t *src, ptrdiff_t srcstride,
+   int16_t *src2,
+   int height, intptr_t mx, 
intptr_t my, int width) {
+put_hevc_qpel_bi_uw_pixels_neon[0](dst, dststride, src, srcstride, width, 
height, src2, MAX_PB_SIZE);
+}
+
+void ff_hevc_put_qpel_uni_wt_pixels_neon_wrapper(uint8_t *dst,  ptrdiff_t 
dststride,
+ uint8_t *src, ptrdiff_t 
srcstride,
+ int height, int denom, 
int wx, int ox,
+ intptr_t mx, intptr_t my, 
int width) {
+put_hevc_qpel_wt_pixels_neon[0](dst, dststride, src, srcstride, width

[FFmpeg-devel] [PATCH 5/6] avcodec/hevcdsp: Add NEON optimization for sao

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/Makefile|   3 +-
 libavcodec/arm/hevcdsp_init_neon.c |  62 +
 libavcodec/arm/hevcdsp_sao_neon.S  | 181 +
 3 files changed, 245 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/arm/hevcdsp_sao_neon.S

diff --git a/libavcodec/arm/Makefile b/libavcodec/arm/Makefile
index 1acda0b1f8..fc4c0147c5 100644
--- a/libavcodec/arm/Makefile
+++ b/libavcodec/arm/Makefile
@@ -137,7 +137,8 @@ NEON-OBJS-$(CONFIG_HEVC_DECODER)   += 
arm/hevcdsp_init_neon.o   \
   arm/hevcdsp_deblock_neon.o\
   arm/hevcdsp_idct_neon.o   \
   arm/hevcdsp_qpel_neon.o   \
-  arm/hevcdsp_epel_neon.o
+  arm/hevcdsp_epel_neon.o   \
+ arm/hevcdsp_sao_neon.o
 NEON-OBJS-$(CONFIG_RV30_DECODER)   += arm/rv34dsp_neon.o
 NEON-OBJS-$(CONFIG_RV40_DECODER)   += arm/rv34dsp_neon.o\
   arm/rv40dsp_neon.o
diff --git a/libavcodec/arm/hevcdsp_init_neon.c 
b/libavcodec/arm/hevcdsp_init_neon.c
index 6171863113..33cc44ef40 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -23,6 +23,13 @@
 #include "libavcodec/hevcdsp.h"
 #include "hevcdsp_arm.h"
 
+void ff_hevc_sao_band_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src,
+  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+  int16_t *sao_offset_val, int sao_left_class,
+  int width, int height);
+void ff_hevc_sao_edge_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src, 
ptrdiff_t stride_dst, int16_t *sao_offset_val,
+  int eo, int width, int height);
+
 void ff_hevc_v_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
_beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
 void ff_hevc_h_loop_filter_luma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
_beta, int *_tc, uint8_t *_no_p, uint8_t *_no_q);
 void ff_hevc_v_loop_filter_chroma_neon(uint8_t *_pix, ptrdiff_t _stride, int 
*_tc, uint8_t *_no_p, uint8_t *_no_q);
@@ -396,6 +403,51 @@ EPEL_FUNC_WT(ff_hevc_put_epel_wt_h6v7_neon_8);
 EPEL_FUNC_WT(ff_hevc_put_epel_wt_h7v7_neon_8);
 #undef EPEL_FUNC_WT
 
+void ff_hevc_sao_band_filter_neon_8(uint8_t *dst, uint8_t *src, ptrdiff_t 
stride_dst, ptrdiff_t stride_src, int width, int height, int16_t *offset_table);
+
+void ff_hevc_sao_band_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src,
+  ptrdiff_t stride_dst, ptrdiff_t stride_src,
+  int16_t *sao_offset_val, int sao_left_class,
+  int width, int height) {
+uint8_t *dst = (uint8_t *)_dst;
+uint8_t *src = (uint8_t *)_src;
+int16_t offset_table[32] = {0};
+int k;
+
+stride_dst /= sizeof(uint8_t);
+stride_src /= sizeof(uint8_t);
+
+for (k = 0; k < 4; k++) {
+offset_table[(k + sao_left_class) & 31] = sao_offset_val[k + 1];
+}
+
+ff_hevc_sao_band_filter_neon_8(dst, src, stride_dst, stride_src, width, 
height, offset_table);
+}
+
+void ff_hevc_sao_edge_filter_neon_8(uint8_t *dst, uint8_t *src, ptrdiff_t 
stride_dst, ptrdiff_t stride_src, int width, int height,
+int a_stride, int b_stride, int16_t 
*sao_offset_val, uint8_t *edge_idx);
+
+void ff_hevc_sao_edge_filter_neon_wrapper(uint8_t *_dst, uint8_t *_src, 
ptrdiff_t stride_dst, int16_t *sao_offset_val,
+  int eo, int width, int height) {
+static uint8_t edge_idx[] = { 1, 2, 0, 3, 4 };
+static const int8_t pos[4][2][2] = {
+{ { -1,  0 }, {  1, 0 } }, // horizontal
+{ {  0, -1 }, {  0, 1 } }, // vertical
+{ { -1, -1 }, {  1, 1 } }, // 45 degree
+{ {  1, -1 }, { -1, 1 } }, // 135 degree
+};
+uint8_t *dst = (uint8_t *)_dst;
+uint8_t *src = (uint8_t *)_src;
+int a_stride, b_stride;
+ptrdiff_t stride_src = (2*64 + 32) / sizeof(uint8_t);
+stride_dst /= sizeof(uint8_t);
+
+a_stride = pos[eo][0][0] + pos[eo][0][1] * stride_src;
+b_stride = pos[eo][1][0] + pos[eo][1][1] * stride_src;
+
+ff_hevc_sao_edge_filter_neon_8(dst, src, stride_dst, stride_src, width, 
height, a_stride, b_stride, sao_offset_val, edge_idx);
+}
+
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t 
srcstride,
int height, intptr_t mx, intptr_t my, int 
width) {
 
@@ -486,6 +538,16 @@ av_cold void ff_hevcdsp_init_neon(HEVCDSPContext *c, const 
int bit_depth)
 c->hevc_h_loop_filter_luma = ff_hevc_h_loop_filter_luma_neon;
 c->hevc_v_loop_filter_chroma   = ff_hevc_v_loop_filter_chroma_neon;
 c->hev
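
For context on ff_hevc_sao_band_filter_neon_wrapper above: it expands the four signalled offsets into a 32-entry table indexed by band, starting at sao_left_class and wrapping modulo 32. The scalar C sketch below shows the same table construction followed by a per-pixel step. That per-pixel step is an assumption (band index = sample >> 3 for 8-bit input, then clip), since the assembly file is not shown in this excerpt, and sao_band_filter_8bit_ref and clip_u8 are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar reference for 8-bit SAO band filtering. */
static void sao_band_filter_8bit_ref(uint8_t *dst, const uint8_t *src,
                                     ptrdiff_t stride_dst, ptrdiff_t stride_src,
                                     const int16_t *sao_offset_val,
                                     int sao_left_class, int width, int height)
{
    int16_t offset_table[32] = {0};

    /* Same table construction as in the wrapper: four consecutive bands,
     * wrapping around at 32, take offsets 1..4 of sao_offset_val. */
    for (int k = 0; k < 4; k++)
        offset_table[(k + sao_left_class) & 31] = sao_offset_val[k + 1];

    /* Assumed per-pixel step: 256 levels / 32 bands = 8 levels per band. */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8(src[x] + offset_table[src[x] >> 3]);
        dst += stride_dst;
        src += stride_src;
    }
}
```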

[FFmpeg-devel] [PATCH 0/6] Optimize HEVC decoding on ARM (32bit) platform

2017-11-22 Thread Shengbin Meng
Our tests show that the CPU cycle count is reduced for each module:
~48% for qpel weight
~17% for epel
~71% for sao edge mode
~48% for sao band mode
~60% for idct of 16x16 block
Overall, decoding is 20-30% faster (measured as an increase in FPS).

We also compared the decoding results to make sure they are the same
before and after the optimization.

These patches are based on the n3.4 release.

Meng Wang (5):
  avcodec/hevcdsp: Add NEON optimization for qpel weighted mode
  avcodec/hevcdsp: Add NEON optimization for epel
  avcodec/hevcdsp: Use pre-load (pld) to optimize data loading
  avcodec/hevcdsp: Add NEON optimization for sao
  avcodec/hevcdsp: Add NEON optimization for idct16x16

Shengbin Meng (1):
  avcodec/hevcdsp: Add NEON optimization for whole-pixel interpolation

 libavcodec/arm/Makefile|4 +-
 libavcodec/arm/hevcdsp_epel_neon.S | 2078 
 libavcodec/arm/hevcdsp_idct_neon.S |  241 +
 libavcodec/arm/hevcdsp_init_neon.c |  695 
 libavcodec/arm/hevcdsp_qpel_neon.S |  702 
 libavcodec/arm/hevcdsp_sao_neon.S  |  181 
 6 files changed, 3900 insertions(+), 1 deletion(-)
 create mode 100644 libavcodec/arm/hevcdsp_epel_neon.S
 create mode 100644 libavcodec/arm/hevcdsp_sao_neon.S

-- 
2.13.6 (Apple Git-96)

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


[FFmpeg-devel] [PATCH 1/6] avcodec/hevcdsp: Add NEON optimization for qpel weighted mode

2017-11-22 Thread Shengbin Meng
From: Meng Wang 

Signed-off-by: Meng Wang 
---
 libavcodec/arm/hevcdsp_init_neon.c |  66 +
 libavcodec/arm/hevcdsp_qpel_neon.S | 509 +
 2 files changed, 575 insertions(+)

diff --git a/libavcodec/arm/hevcdsp_init_neon.c b/libavcodec/arm/hevcdsp_init_neon.c
index 1a3912c609..2559c92095 100644
--- a/libavcodec/arm/hevcdsp_init_neon.c
+++ b/libavcodec/arm/hevcdsp_init_neon.c
@@ -63,6 +63,8 @@ static void (*put_hevc_qpel_neon[4][4])(int16_t *dst, ptrdiff_t dststride, uint8
     int height, int width);
 static void (*put_hevc_qpel_uw_neon[4][4])(uint8_t *dst, ptrdiff_t dststride, uint8_t *_src, ptrdiff_t _srcstride,
     int width, int height, int16_t* src2, ptrdiff_t src2stride);
+static void (*put_hevc_qpel_wt_neon[4][4])(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride,
+   int width, int height, int denom, int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t src2stride);
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t srcstride,
     int height, intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
@@ -70,6 +72,15 @@ void ff_hevc_put_qpel_uni_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_
 void ff_hevc_put_qpel_bi_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
     int16_t *src2,
     int height, intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_uni_w_neon_wrapper(uint8_t *dst,  ptrdiff_t dststride,
+ uint8_t *src, ptrdiff_t srcstride,
+ int height, int denom, int wx, int ox,
+ intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_qpel_bi_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
+int16_t *src2,
+int height, int denom, int wx0, int wx1,
+int ox0, int ox1, intptr_t mx, intptr_t my, int width);
+
 #define QPEL_FUNC(name) \
 void name(int16_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride, \
     int height, int width)
@@ -124,6 +135,26 @@ QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v2_neon_8);
 QPEL_FUNC_UW(ff_hevc_put_qpel_uw_h3v3_neon_8);
 #undef QPEL_FUNC_UW
 
+#define QPEL_FUNC_WT(name) \
+void name(uint8_t *_dst, ptrdiff_t _dststride, uint8_t *_src, ptrdiff_t _srcstride, \
+int width, int height, int denom, int wx1, int ox1, int wx0, int ox0, int16_t* src2, ptrdiff_t src2stride);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h1v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h2v3_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v1_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v2_neon_8);
+QPEL_FUNC_WT(ff_hevc_put_qpel_wt_h3v3_neon_8);
+#undef QPEL_FUNC_WT
+
 void ff_hevc_put_qpel_neon_wrapper(int16_t *dst, uint8_t *src, ptrdiff_t srcstride,
     int height, intptr_t mx, intptr_t my, int width) {
 
@@ -142,6 +173,20 @@ void ff_hevc_put_qpel_bi_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t
     put_hevc_qpel_uw_neon[my][mx](dst, dststride, src, srcstride, width, height, src2, MAX_PB_SIZE);
 }
 
+void ff_hevc_put_qpel_uni_w_neon_wrapper(uint8_t *dst,  ptrdiff_t dststride,
+  uint8_t *src, ptrdiff_t srcstride,
+  int height, int denom, int wx, int ox,
+  intptr_t mx, intptr_t my, int width) {
+    put_hevc_qpel_wt_neon[my][mx](dst, dststride, src, srcstride, width, height, denom, wx, ox, 0, 0, NULL, 0);
+}
+
+void ff_hevc_put_qpel_bi_w_neon_wrapper(uint8_t *dst, ptrdiff_t dststride, uint8_t *src, ptrdiff_t srcstride,
+ int16_t *src2,
+ int height, int denom, int wx0, int wx1,
+ int ox0, int ox1, intptr_t mx, intptr_t my, int width) {
+    put_hevc_qpel_wt_neon[my][mx](dst, dststride, src, srcstride, width, height, denom, wx1, ox1, wx0, ox0, src2, MAX_PB_SIZE);
+}
+
 av_cold void ff_h
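
The two weighted wrappers in the patch above forward to a single table indexed by [my][mx], and the uni-weighted one passes zero weights and a NULL src2 for the second prediction. The following is only a minimal sketch of that dispatch pattern in plain C. Every demo_* name and the stub kernels are hypothetical; they return a tag instead of filtering pixels, and how the real NEON kernels treat a NULL src2 is not shown in the quoted patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel signature; the returned tag only makes the dispatch observable. */
typedef int (*demo_wt_fn)(int wx1, int ox1, int wx0, int ox0, const int16_t *src2);

/* One stub per (horizontal H, vertical V) quarter-pel phase. */
#define DEMO_KERNEL(H, V)                                          \
static int demo_wt_h##H##v##V(int wx1, int ox1, int wx0, int ox0,  \
                              const int16_t *src2)                 \
{                                                                  \
    (void)wx1; (void)ox1; (void)wx0; (void)ox0;                    \
    return (src2 ? 100 : 0) + (V) * 10 + (H);                      \
}

DEMO_KERNEL(0, 0) DEMO_KERNEL(1, 0) DEMO_KERNEL(2, 0) DEMO_KERNEL(3, 0)
DEMO_KERNEL(0, 1) DEMO_KERNEL(1, 1) DEMO_KERNEL(2, 1) DEMO_KERNEL(3, 1)
DEMO_KERNEL(0, 2) DEMO_KERNEL(1, 2) DEMO_KERNEL(2, 2) DEMO_KERNEL(3, 2)
DEMO_KERNEL(0, 3) DEMO_KERNEL(1, 3) DEMO_KERNEL(2, 3) DEMO_KERNEL(3, 3)

/* Rows are the vertical phase (my), columns the horizontal phase (mx). */
static const demo_wt_fn demo_wt_table[4][4] = {
    { demo_wt_h0v0, demo_wt_h1v0, demo_wt_h2v0, demo_wt_h3v0 },
    { demo_wt_h0v1, demo_wt_h1v1, demo_wt_h2v1, demo_wt_h3v1 },
    { demo_wt_h0v2, demo_wt_h1v2, demo_wt_h2v2, demo_wt_h3v2 },
    { demo_wt_h0v3, demo_wt_h1v3, demo_wt_h2v3, demo_wt_h3v3 },
};

/* Uni-weighted: second weight/offset are zero and there is no second source. */
static int demo_uni_w(int wx, int ox, int mx, int my)
{
    return demo_wt_table[my][mx](wx, ox, 0, 0, NULL);
}

/* Bi-weighted: same table, both weight/offset pairs and src2 are forwarded. */
static int demo_bi_w(const int16_t *src2, int wx0, int wx1,
                     int ox0, int ox1, int mx, int my)
{
    return demo_wt_table[my][mx](wx1, ox1, wx0, ox0, src2);
}
```

Sharing one table keeps the number of assembly entry points at one per phase pair, at the cost of each kernel having to handle both the uni and bi case.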

Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-21 Thread Shengbin Meng

> On 19 Nov 2017, at 01:35, Rafal Dabrowa  wrote:
> 
> 
> This is a proposal of performance optimizations for 8-bit
> hevc video decoding on aarch64 platform with neon (simd) extension.

Nice to see the work for aarch64! 

We are also in the process of doing NEON optimization for HEVC decoding. 
(http://ffmpeg.org/pipermail/ffmpeg-devel/2017-October/218233.html)

We are about to finish the 32-bit ARM work and are ready to send some patches 
out. It looks like we can join forces on aarch64 :) What do you think?

> 
> The patch contains optimizations for the most heavily used qpel, epel, sao and
> idct functions. Among the functions available for optimization, two are
> heavily used but not optimized in this patch: hevc_v_loop_filter_luma_8
> and hevc_h_loop_filter_luma_8. I have no idea how they could be optimized,
> so I left them without optimizations.
> 

I see that a loop filter optimization already exists in the 32-bit ARM code. Why 
not use that algorithm?


Regards,
Shengbin


[FFmpeg-devel] HEVC ARM optimization

2017-10-20 Thread Shengbin Meng
Hi,

I’d like to know if anyone is doing, or is interested in, ARM optimization for 
the native HEVC decoder in FFmpeg.

Some time-consuming operations in HEVC decoding have not been optimized with 
NEON yet, e.g. qpel and epel interpolation, SAO, and IDCT of large blocks.
I have some optimization code here, and I am considering submitting it to 
FFmpeg so that we can develop it together.

Regards,
Shengbin
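
For context on the qpel interpolation mentioned above, here is a minimal scalar sketch of the horizontal 8-tap luma filter for 8-bit input, which is the kind of loop a NEON version would vectorize. The tap values are the HEVC luma interpolation coefficients; the demo_* names, the stride convention (dststride counted in int16_t elements) and the absence of a shift for 8-bit input are assumptions of this sketch, not code from any patch in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* HEVC luma taps for the 1/4, 2/4 and 3/4 sample positions (each row sums to 64). */
static const int8_t demo_qpel_taps[3][8] = {
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/*
 * Horizontal pass, 8-bit input, 16-bit intermediate output.
 * mx must be 1, 2 or 3. Each source row needs 3 readable samples to the
 * left and 4 to the right of the block.
 */
static void demo_qpel_h(int16_t *dst, ptrdiff_t dststride,
                        const uint8_t *src, ptrdiff_t srcstride,
                        int width, int height, int mx)
{
    const int8_t *f = demo_qpel_taps[mx - 1];

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += f[k] * src[x + k - 3];
            dst[x] = (int16_t)sum;   /* no shift applied in this 8-bit sketch */
        }
        src += srcstride;
        dst += dststride;
    }
}
```

On a flat row every output is 64 times the input sample, which makes a quick sanity check for any optimized version.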