Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2018-04-08 Thread Yingming Fan
Hi Rafal,

It’s very nice to see your work on hevc arm64 neon optimization.

You should first check that your code passes FATE. The main purpose is to
verify that it decodes all the hevc-conformance bitstreams correctly. For more
about FATE, please refer to https://www.ffmpeg.org/fate.html

You can also use checkasm to benchmark your arm64 neon code. For example,
'checkasm --test=hevc_sao --bench' benchmarks the sao functions.

You should also split your patch into smaller ones, for example separate
patches for sao, mc and idct. Each patch should report the speed-up ratio
measured with checkasm.

Yingming Fan

___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel


Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-25 Thread Clément Bœsch
On Sat, Nov 18, 2017 at 06:35:48PM +0100, Rafal Dabrowa wrote:
> 
> This is a proposal of performance optimizations for 8-bit
> hevc video decoding on aarch64 platform with neon (simd) extension.
> 
> I'm testing my optimizations on NanoPi M3 device. I'm using
> mainly "Big Buck Bunny" video file in format 1280x720 for testing.
> The video file was pulled from libde265.org page, see
> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
> The movie duration is 00:10:34.53.
> 
> Overall performance gain is about 2x. Without optimizations the movie
> playback stops in practice after a few seconds. With
> optimizations the file is played smoothly 99% of the time.
> 
> For performance testing the following command was used:
> 
> time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
> 
> The video file was pre-read before test to minimize disk reads during testing.
> Program execution time without optimization was as follows:
> 
> real  11m48.576s
> user  43m8.111s
> sys   0m12.469s
> 
> Execution time with optimizations:
> 
> real  6m17.046s
> user  21m19.792s
> sys   0m14.724s
> 

Can you post the results of checkasm --bench for hevc?

Did you run it to check for any calling convention violation?

> 
> The patch contains optimizations for most heavily used qpel, epel, sao and
> idct functions.  Among the functions provided for optimization there are two
> intensively used, but not optimized in this patch: hevc_v_loop_filter_luma_8
> and hevc_h_loop_filter_luma_8. I have no idea how they could be optimized
> hence I left them without optimizations.
> 

You may want to check x86/hevc_deblock.asm then (no idea if these are
implemented).

[...]
> +function ff_hevc_put_hevc_pel_pixels4_8_neon, export=1
> +mov x7, 128
> +1:  ld1 { v0.s }[0], [x1], x2
> +ushll   v4.8h, v0.8b, 6

> +st1 { v4.d }[0], [x0], x7

using #128 not possible?

> +subs    x3, x3, 1
> +b.ne    1b
> +ret

here and below: no use of the x6 register?

A few comments on the style:

- please use a consistent spacing (current function mismatches with later
  code), preferably using a relatively large number of spaces as common
  ground (check the other sources)
- we use capitalized size suffixes (B, H, ...); and IIRC the lower case
  forms are problematic with some assemblers, but don't quote me on that.
- we don't use spaces between {}

> +endfunc
> +
> +function ff_hevc_put_hevc_pel_pixels6_8_neon, export=1
> +mov x7, 120
> +1:  ld1 { v0.8b }, [x1], x2
> +ushll   v4.8h, v0.8b, 6

> +st1 { v4.d }[0], [x0], 8

I think you need to use # as prefix for the immediates

> +st1 { v4.s }[2], [x0], x7

I assume you can't use #120?

Have you checked if using #128 here and decrementing x0 afterward isn't
faster?

[...]
> +function ff_hevc_put_hevc_pel_bi_pixels32_8_neon, export=1
> +mov x10, 128
> +1:  ld1 { v0.16b, v1.16b }, [x2], x3    // src
> +ushll   v16.8h, v0.8b, 6
> +ushll2  v17.8h, v0.16b, 6
> +ushll   v18.8h, v1.8b, 6
> +ushll2  v19.8h, v1.16b, 6
> +ld1 { v20.8h, v21.8h, v22.8h, v23.8h }, [x4], x10   // src2
> +sqadd   v16.8h, v16.8h, v20.8h
> +sqadd   v17.8h, v17.8h, v21.8h
> +sqadd   v18.8h, v18.8h, v22.8h
> +sqadd   v19.8h, v19.8h, v23.8h

> +sqrshrun    v0.8b,  v16.8h, 7
> +sqrshrun2   v0.16b, v17.8h, 7
> +sqrshrun    v1.8b,  v18.8h, 7
> +sqrshrun2   v1.16b, v19.8h, 7

Does pairing help here?

sqrshrun    v0.8b,  v16.8h, 7
sqrshrun    v1.8b,  v18.8h, 7
sqrshrun2   v0.16b, v17.8h, 7
sqrshrun2   v1.16b, v19.8h, 7
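
For readers following the arithmetic, here is a scalar C sketch of what the quoted ushll / sqadd / sqrshrun sequence appears to compute for each pixel. The function names are hypothetical and this is a model for reasoning about the code, not part of the patch or of FFmpeg. It assumes an arithmetic right shift for negative values, which is what mainstream compilers provide.

```c
#include <stdint.h>

/* Saturate a 32-bit value to the signed 16-bit range, as sqadd does per lane. */
static int16_t sat_s16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of one output pixel (hypothetical helper, not FFmpeg API):
 *   ushll    #6 : widen the 8-bit source and shift left by 6
 *   sqadd       : saturating add of the 16-bit intermediate src2
 *   sqrshrun #7 : rounding shift right by 7, saturate to unsigned 8 bit */
static uint8_t bi_pel_pixel_model(uint8_t src, int16_t src2)
{
    int32_t sum = sat_s16(((int32_t)src << 6) + src2);
    int32_t rounded = (sum + 64) >> 7;   /* 64 == 1 << (7 - 1) */
    if (rounded < 0)   return 0;
    if (rounded > 255) return 255;
    return (uint8_t)rounded;
}

/* Row version: width pixels, each computed independently of the others. */
static void bi_pel_row_model(uint8_t *dst, const uint8_t *src,
                             const int16_t *src2, int width)
{
    for (int x = 0; x < width; x++)
        dst[x] = bi_pel_pixel_model(src[x], src2[x]);
}
```

Since every output lane depends only on its own inputs, reordering the narrowing instructions can only affect scheduling, not results, as long as each sqrshrun still comes before the sqrshrun2 that writes the upper half of the same register (sqrshrun clears that upper half).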

[...]
> +sqrshrun    v0.8b,  v16.8h, 7
> +sqrshrun2   v0.16b, v17.8h, 7
> +sqrshrun    v1.8b,  v18.8h, 7
> +sqrshrun2   v1.16b, v19.8h, 7
> +sqrshrun    v2.8b,  v20.8h, 7
> +sqrshrun2   v2.16b, v21.8h, 7
> +sqrshrun    v3.8b,  v22.8h, 7
> +sqrshrun2   v3.16b, v23.8h, 7

Again, this might be a good candidate for attempting to shuffle the
instructions and see if it helps (there are many other places, I picked
one randomly).

> +.Lepel_filters:

const/endconst + align might be better for all these labels

[...]
> +function ff_hevc_put_hevc_epel_hv12_8_neon, export=1
> +add x10, x3, 3
> +lsl x10, x10, 7
> +sub sp, sp, x10 // tmp_array
> +stp x0, x3, [sp, -16]!
> +stp x5, x30, [sp, -16]!
> +add x0, sp, 32
> +sub x1, x1, x2
> +add x3, x3, 3
> +bl  ff_hevc_put_hevc_epel_h12_8_neon
> +ldp x5, x30, [sp], 16
> +ldp x0, x3, [sp], 16
> +load_epel_filterh x5, x4
> +mov x5, 112
> +mov x10, 128
> +ld1 { v16.8h, v17.8h }, [sp], x10
> +ld1 { v18.8h, v19.8h }, [sp], x10
> +ld1 { v20.8h, v21.8h }, [sp], x10
> +1:  ld1 { v22.8h, v23.8h }, [sp], x10
> +calc_epelh  v4, v16, v1

Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-21 Thread Rafal Dabrowa

On 11/21/2017 11:51 AM, Shengbin Meng wrote:
>> On 19 Nov 2017, at 01:35, Rafal Dabrowa wrote:
>>
>> This is a proposal of performance optimizations for 8-bit
>> hevc video decoding on aarch64 platform with neon (simd) extension.
>
> Nice to see the work for aarch64!
>
> We are also in the process of doing NEON optimization for HEVC decoding.
> (http://ffmpeg.org/pipermail/ffmpeg-devel/2017-October/218233.html)
>
> Now we are just about to finish arm 32-bit work and ready to send some
> patches out. Looks like for aarch64 we can join force:) What do you think?

Why not. I started to work on aarch64 because my device has a VPU, but
the VPU does not support hevc. Hence h264, even in full HD, plays
smoothly, but hevc playback is poor. I was curious how much hevc
decoding could be optimized. I optimized one function, then another one...

Currently I'm focused on reducing the patch size. But I'm open to cooperation.

>> The patch contains optimizations for most heavily used qpel, epel, sao
>> and idct functions.  Among the functions provided for optimization there
>> are two intensively used, but not optimized in this patch:
>> hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8. I have no idea
>> how they could be optimized hence I left them without optimizations.
>
> I see that optimization for loop filter already exists for arm 32-bit
> code. Why not use that algorithm?

Maybe... Although optimization for aarch64 is a different story. I have
noticed that gcc with the -O3 option on aarch64 produces really good code. I
was surprised how much the execution time is reduced in some cases.
Sometimes it is hard to optimize code better than the compiler does.



Rafal Dabrowa


Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-21 Thread Shengbin Meng

> On 19 Nov 2017, at 01:35, Rafal Dabrowa wrote:
> 
> 
> This is a proposal of performance optimizations for 8-bit
> hevc video decoding on aarch64 platform with neon (simd) extension.

Nice to see the work for aarch64! 

We are also in the process of doing NEON optimization for HEVC decoding.
(http://ffmpeg.org/pipermail/ffmpeg-devel/2017-October/218233.html)

Now we are just about to finish the arm 32-bit work and are ready to send some
patches out. Looks like for aarch64 we can join forces :) What do you think?

> 
> The patch contains optimizations for most heavily used qpel, epel, sao and
> idct functions.  Among the functions provided for optimization there are two
> intensively used, but not optimized in this patch: hevc_v_loop_filter_luma_8
> and hevc_h_loop_filter_luma_8. I have no idea how they could be optimized
> hence I left them without optimizations.
> 

I see that optimization for loop filter already exists for arm 32-bit code. Why 
not use that algorithm?


Regards,
Shengbin


Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-19 Thread Rafal Dabrowa

On 11/18/2017 07:41 PM, James Almer wrote:
> On 11/18/2017 3:31 PM, Rostislav Pehlivanov wrote:
>>> On 18 November 2017 at 17:35, Rafal Dabrowa wrote:
>>>
>>> This is a proposal of performance optimizations for 8-bit
>>> hevc video decoding on aarch64 platform with neon (simd) extension.
>>>
>>> I'm testing my optimizations on NanoPi M3 device. I'm using
>>> mainly "Big Buck Bunny" video file in format 1280x720 for testing.
>>> The video file was pulled from libde265.org page, see
>>> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
>>> The movie duration is 00:10:34.53.
>>>
>>> Overall performance gain is about 2x. Without optimizations the movie
>>> playback stops in practice after a few seconds. With
>>> optimizations the file is played smoothly 99% of the time.
>>>
>>> For performance testing the following command was used:
>>>
>>> time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
>>>
>>> The video file was pre-read before test to minimize disk reads during
>>> testing.
>>> Program execution time without optimization was as follows:
>>>
>>> real  11m48.576s
>>> user  43m8.111s
>>> sys   0m12.469s
>>>
>>> Execution time with optimizations:
>>>
>>> real  6m17.046s
>>> user  21m19.792s
>>> sys   0m14.724s
>>>
>>> The patch contains optimizations for most heavily used qpel, epel, sao and
>>> idct functions.  Among the functions provided for optimization there are two
>>> intensively used, but not optimized in this patch:
>>> hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8. I have no idea
>>> how they could be optimized hence I left them without optimizations.
>>>
>>> Signed-off-by: Rafal Dabrowa
>>> ---
>>>  libavcodec/aarch64/Makefile               |    5
>>>  libavcodec/aarch64/hevcdsp_epel_8.S       | 3949
>>>  libavcodec/aarch64/hevcdsp_idct_8.S       | 1980
>>>  libavcodec/aarch64/hevcdsp_init_aarch64.c |  170
>>>  libavcodec/aarch64/hevcdsp_qpel_8.S       | 5666
>>>  libavcodec/aarch64/hevcdsp_sao_8.S        |  166
>>>  libavcodec/hevcdsp.c                      |    2
>>>  libavcodec/hevcdsp.h                      |    1
>>>  8 files changed, 11939 insertions(+)
>>>  create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
>>>  create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
>>>  create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
>>>  create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
>>>  create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S



>> Very nice.
>> The way we test SIMD is to put START_TIMER; and
>> STOP_TIMER("function_name"); (they're located in libavutil/timer.h)
>> around where the function gets called in the C code, then we do a run
>> with the C code (no SIMD) and a separate run with whatever SIMD
>> optimizations we're implementing. We take the last printed value of both
>> runs and that's what's used to measure speedup.
>>
>> I don't think there's a need yet to split the patch into multiple patches
>> for each individual version, though; that's usually only done if some
>> function's C implementation is faster than the SIMD code.
>
> It would be nice however to at least split it into two patches, one for
> MC and one for SAO.

Could you explain which functions are MC?

I can split the patch into a few, but a dependency between the patches
is unavoidable because the non-optimized function pointers are all
replaced with the optimized ones together, in one function body.
One of the patches must add that function and the call to it.

> Also, no way to use macros in aarch64 asm files? ~11k lines of code is a
> lot to add, and I'm sure a sizable portion is duplicated with only some
> small differences between functions.

I used macros sparingly because code without macros is
easier to understand and to improve. Sometimes even the order
of assembly instructions is important. But, of course, I can reduce
the code size using macros if the patch will be accepted. I didn't know
whether you were interested in the patch at all.


Regarding performance testing: I wrapped every function with another
one which calls START_TIMER and STOP_TIMER. It looks like these macros
aren't reentrant, so I needed to force the program to run in a single thread.
Without this I got strange results that differed widely between runs, for
example:


22190 UNITS in put_hevc_qpel_uni_h12_8,   16232 runs,    152 skips
1126 UNITS in put_hevc_qpel_uni_h12_8,   12001 runs,   4383 skips

Forcing single-threaded mode was not easy; the -filter_threads
option didn't help.

Below is the outcome. Meaning of the columns:

FUNCTION - the function to optimize
UNITS_NOOPT - last UNITS result in run without optimization
OPT - last UNITS result in run with optimization
CALLS - sum of runs and skips
NSKIPS - number of skips in non-optimized version
OSKIPS - number of skips in optimized version


FUNCTION                UNITS_NOOPT     OPT    CALLS  NSKIPS  OSKIPS
--------------------------------------------------------------------
idct_16x16_8                 113074   24079  2097152       0       0
idct_32x32_8                 587447  100434   524288       0       0
put_hevc_epel_bi_h4_8   7651 36

Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-18 Thread James Almer
On 11/18/2017 3:31 PM, Rostislav Pehlivanov wrote:
>>
>>
>>
>> On 18 November 2017 at 17:35, Rafal Dabrowa wrote:
>>
>> [...]
> 
> 
> 
> Very nice.
> The way we test SIMD is to put START_TIMER; and
> STOP_TIMER("function_name"); (they're located in libavutil/timer.h)
> around where the function gets called in the C code, then we do a run
> with the C code (no SIMD) and a separate run with whatever SIMD
> optimizations we're implementing. We take the last printed value of both
> runs and that's what's used to measure speedup.
> 
> I don't think there's a need yet to split the patch into multiple patches
> for each individual version, though; that's usually only done if some
> function's C implementation is faster than the SIMD code.

It would be nice however to at least split it into two patches, one for
MC and one for SAO.

Also, no way to use macros in aarch64 asm files? ~11k lines of code is a
lot to add, and I'm sure a sizable portion is duplicated with only some
small differences between functions.


Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-18 Thread Rostislav Pehlivanov
>
>
>
> On 18 November 2017 at 17:35, Rafal Dabrowa wrote:
>
> [...]



Very nice.
The way we test SIMD is to put START_TIMER; and
STOP_TIMER("function_name"); (they're located in libavutil/timer.h)
around where the function gets called in the C code, then we do a run
with the C code (no SIMD) and a separate run with whatever SIMD
optimizations we're implementing. We take the last printed value of both
runs and that's what's used to measure speedup.

I don't think there's a need yet to split the patch into multiple patches
for each individual version, though; that's usually only done if some
function's C implementation is faster than the SIMD code.
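
The comparison described above comes down to averaging per-call durations in each run and dividing one average by the other. Below is a minimal self-contained sketch of that bookkeeping. The struct, the function names and the 8x outlier rule are illustrative assumptions; this is not the code in libavutil/timer.h.

```c
#include <stdint.h>

/* Hypothetical accumulator, loosely modelled on the "runs"/"skips" counters
 * printed in timer reports; not the libavutil implementation. */
typedef struct TimerStats {
    uint64_t sum;    /* total of accepted durations */
    unsigned runs;   /* accepted samples */
    unsigned skips;  /* samples rejected as outliers */
} TimerStats;

/* Record one measured duration. Assumed outlier rule: after the first two
 * samples, anything at least 8x the running average is counted as a skip
 * (e.g. an interrupt or context switch landed inside the timed region). */
static void timer_stats_add(TimerStats *s, uint64_t duration)
{
    if (s->runs < 2 || duration * s->runs < 8 * s->sum) {
        s->sum += duration;
        s->runs++;
    } else {
        s->skips++;
    }
}

/* Average accepted duration, or 0 if nothing was recorded. */
static uint64_t timer_stats_avg(const TimerStats *s)
{
    return s->runs ? s->sum / s->runs : 0;
}

/* Speedup of the optimized run relative to the C run. */
static double timer_speedup(const TimerStats *c_run, const TimerStats *simd_run)
{
    uint64_t simd = timer_stats_avg(simd_run);
    return simd ? (double)timer_stats_avg(c_run) / (double)simd : 0.0;
}
```

An accumulator like this is shared state, so updating it from several decoder threads without synchronization would give inconsistent counts; the sketch assumes a single-threaded run.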


Re: [FFmpeg-devel] [PATCH] 8-bit hevc decoding optimization on aarch64 with neon

2017-11-18 Thread Carl Eugen Hoyos
2017-11-18 18:35 GMT+01:00 Rafal Dabrowa:

> For performance testing the following command was used:
>
> time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null

An alternative is:
./ffmpeg -benchmark -i ~/bbb-1280x720-cfg06.mkv -f null -

> The video file was pre-read before test to minimize disk reads during testing.
> Program execution time without optimization was as follows:
>
> real  11m48.576s
> user  43m8.111s
> sys   0m12.469s
>
> Execution time with optimizations:
>
> real  6m17.046s
> user  21m19.792s
> sys   0m14.724s

Looks impressive.
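
As a quick check of the "about 2x" figure, the quoted real and user times can be converted to seconds and divided. A small sketch follows; the function names are hypothetical and the parser only handles the "NmS.SSSs" shape shown above.

```c
#include <stdio.h>

/* Parse a time(1)-style duration such as "11m48.576s" into seconds.
 * Returns a negative value if the string does not match that shape. */
static double parse_time_output(const char *s)
{
    int minutes;
    double seconds;
    if (sscanf(s, "%dm%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Ratio before/after, or a negative value if either string fails to parse. */
static double speedup(const char *before, const char *after)
{
    double b = parse_time_output(before);
    double a = parse_time_output(after);
    return (a > 0.0 && b >= 0.0) ? b / a : -1.0;
}
```

With the numbers quoted above, speedup("11m48.576s", "6m17.046s") comes to roughly 1.88 for wall-clock time and speedup("43m8.111s", "21m19.792s") to roughly 2.02 for user CPU time.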


> +av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
> +{
> +    int cpu_flags = av_get_cpu_flags();
> +
> +    if (have_neon(cpu_flags) && bit_depth == 8) {
> +        NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels);
> +        NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h);
> +        NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v);
> +        NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv);
> +        NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 0, epel_uni_v);
> +        NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv);
> +        NEON8_FNASSIGN(c->put_hevc_epel_bi, 0, 0, pel_bi_pixels);
> +        NEON8_FNASSIGN(c->put_hevc_epel_bi, 0, 1, epel_bi_h);
> +        NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 0, epel_bi_v);
> +        NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv);
> +        NEON8_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels);
> +        NEON8_FNASSIGN(c->put_hevc_qpel, 0, 1, qpel_h);
> +        NEON8_FNASSIGN(c->put_hevc_qpel, 1, 0, qpel_v);
> +        NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv);
> +        NEON8_FNASSIGN(c->put_hevc_qpel_uni, 0, 1, qpel_uni_h);
> +        NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 0, qpel_uni_v);
> +        NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv);
> +        NEON8_FNASSIGN(c->put_hevc_qpel_bi, 0, 0, pel_bi_pixels);
> +        NEON8_FNASSIGN(c->put_hevc_qpel_bi, 0, 1, qpel_bi_h);
> +        NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 0, qpel_bi_v);
> +        NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv);

I wonder if it would have made sense to test and send the patches
in smaller portions, so that those with possible improvements
can be identified.
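
To illustrate what sending smaller portions could look like for an init function of this kind, here is a self-contained sketch of a table of function pointers filled by an assignment macro. Everything in it (DspSketch, FNASSIGN_SKETCH, the stub functions, the three-entry width table) is hypothetical; the definition of NEON8_FNASSIGN is not shown in the quoted excerpt and may differ.

```c
#include <stddef.h>

#define NUM_WIDTHS 3   /* illustrative; the real table has more entries */

typedef void (*mc_fn)(void);

/* Hypothetical context: function pointers indexed [width][vertical][horizontal]. */
typedef struct DspSketch {
    mc_fn put_epel[NUM_WIDTHS][2][2];
} DspSketch;

/* Stand-ins for the per-width assembly routines. */
static void sketch_epel_h4(void)  {}
static void sketch_epel_h8(void)  {}
static void sketch_epel_h16(void) {}

/* One macro call fills every width slot for a given (v, h) pair by pasting
 * the width into the function name. */
#define FNASSIGN_SKETCH(member, v, h, fn)   \
    do {                                    \
        member[0][v][h] = sketch_##fn##4;   \
        member[1][v][h] = sketch_##fn##8;   \
        member[2][v][h] = sketch_##fn##16;  \
    } while (0)

/* Only the slots covered by an assignment line are overridden; the rest
 * keep whatever the caller put there (NULL in this sketch). */
static void dsp_sketch_init(DspSketch *c, int have_simd)
{
    if (have_simd)
        FNASSIGN_SKETCH(c->put_epel, 0, 1, epel_h);
}
```

Under these assumptions, a first patch could add the init function with a single assignment line and later patches could each add their own lines, leaving the untouched slots on the existing C implementations.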

Thank you, Carl Eugen