[libav-devel] [PATCH 04/19] aarch64: vp8: Fix assembling with armasm64

2019-02-01 Thread Martin Storsjö
---
 libavcodec/aarch64/vp8dsp_neon.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/vp8dsp_neon.S b/libavcodec/aarch64/vp8dsp_neon.S
index f371ea7..14a9d11 100644
--- a/libavcodec/aarch64/vp8dsp_neon.S
+++ b/libavcodec/aarch64/vp8dsp_neon.S
@@ -28,7 +28,7

[libav-devel] [PATCH 19/19] aarch64: vp8: Optimize vp8_idct_add_neon for aarch64

2019-02-01 Thread Martin Storsjö
The previous version was a fairly exact translation of the arm version. This version does some unnecessary arithmetic (it performs more operations on vectors that are only half filled; it does 4 uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead of packing data together (which
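The uaddw and sqxtun pair named above is the final "add residual to destination pixels" step of the IDCT. A minimal scalar sketch of what that pair computes per lane is given here; it is not the patch's code, the function names and parameters are hypothetical, and it assumes the residual is small enough that the 16-bit lane does not wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of sqxtun: narrow a signed value to unsigned 8-bit,
 * saturating to the range [0, 255]. */
static uint8_t sat_narrow_u8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Scalar model of uaddw followed by sqxtun for one row of pixels.
 * add_residual_row and its parameters are hypothetical names used
 * only for illustration. */
static void add_residual_row(uint8_t *dst, const int16_t *residual, size_t n)
{
    assert(dst != NULL && residual != NULL);
    for (size_t i = 0; i < n; i++) {
        /* uaddw: zero-extend the 8-bit pixel and add it to the 16-bit lane.
         * Assumption: the sum stays within the 16-bit range. */
        int sum = residual[i] + dst[i];
        /* sqxtun: clamp back to an unsigned 8-bit pixel. */
        dst[i] = sat_narrow_u8(sum);
    }
}
```

In the NEON version each instruction handles a whole vector of lanes at once, which is why the number of uaddw and sqxtun instructions can be traded against the cost of packing rows into fully filled vectors.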

[libav-devel] [PATCH 01/19] libavcodec: vp8 neon optimizations for aarch64

2019-02-01 Thread Martin Storsjö
From: Magnus Röös

Partial port of the ARM Neon for aarch64.

Benchmarks from fate:

benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
 - vp8dsp.idct [OK]
 - vp8dsp.mc [OK]
 - vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed

[libav-devel] [PATCH 03/19] aarch64: vp8: Fix assembling with clang

2019-02-01 Thread Martin Storsjö
This also partially fixes assembling with MS armasm64 (via gas-preprocessor).
---
 libavcodec/aarch64/vp8dsp_neon.S | 124 +++
 1 file changed, 62 insertions(+), 62 deletions(-)

diff --git a/libavcodec/aarch64/vp8dsp_neon.S b/libavcodec/aarch64/vp8dsp_neon.S

[libav-devel] [PATCH 18/19] aarch64: vp8: Skip saturating in shrn in ff_vp8_idct_add_neon

2019-02-01 Thread Martin Storsjö
The original arm version didn't do saturation here. This probably doesn't make any difference for performance, but reduces the differences.
---
 libavcodec/aarch64/vp8dsp_neon.S | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/aarch64/vp8dsp_neon.S
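To illustrate the difference between a plain narrowing shift (shrn) and a saturating narrowing shift, here is a scalar sketch for one 32-bit lane narrowed to 16 bits. The helper names are hypothetical, and the snippet above does not say which saturating instruction or shift amount the old code used, so the shift is left as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Portable arithmetic shift right (rounds toward negative infinity). */
static int32_t asr32(int32_t v, unsigned shift)
{
    return v < 0 ? ~(~v >> shift) : v >> shift;
}

/* Model of a plain narrowing shift: shift right, then keep only the
 * low 16 bits of the result (out-of-range values wrap around). */
static int16_t shift_narrow_wrap(int32_t v, unsigned shift)
{
    assert(shift >= 1 && shift <= 16);
    uint16_t low = (uint16_t)((uint32_t)asr32(v, shift) & 0xFFFFu);
    return low >= 0x8000u ? (int16_t)((int32_t)low - 0x10000) : (int16_t)low;
}

/* Model of a saturating narrowing shift: shift right, then clamp the
 * result to the signed 16-bit range. */
static int16_t shift_narrow_sat(int32_t v, unsigned shift)
{
    assert(shift >= 1 && shift <= 16);
    int32_t s = asr32(v, shift);
    if (s > INT16_MAX)
        return INT16_MAX;
    if (s < INT16_MIN)
        return INT16_MIN;
    return (int16_t)s;
}
```

The two functions agree whenever the shifted value fits in 16 bits, so dropping the saturation only changes results for inputs that would overflow the narrow lane.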

[libav-devel] [PATCH 12/19] aarch64: vp8: Port vp8_luma_dc_wht and vp8_idct_dc_add4uv from arm version

2019-02-01 Thread Martin Storsjö
                          Cortex A53     A72     A73
vp8_luma_dc_wht_c:             115.7    75.7    90.7
vp8_luma_dc_wht_neon:           60.7    41.2    45.7
vp8_idct_dc_add4uv_c:          376.1   262.9   282.5
vp8_idct_dc_add4uv_neon:        52.0    29.0    37.0
---
 libavcodec/aarch64/vp8dsp_init_aarch64.c | 3 +

[libav-devel] [PATCH 11/19] aarch64: vp8: Fix a typo in a comment

2019-02-01 Thread Martin Storsjö
---
 libavcodec/aarch64/vp8dsp_neon.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/vp8dsp_neon.S b/libavcodec/aarch64/vp8dsp_neon.S
index c19ab0d..2b5b049 100644
--- a/libavcodec/aarch64/vp8dsp_neon.S
+++ b/libavcodec/aarch64/vp8dsp_neon.S
@@ -743,7 +743,7

[libav-devel] [PATCH 13/19] aarch64: vp8: Port missing epel8 functions from arm version

2019-02-01 Thread Martin Storsjö
                          Cortex A53     A72     A73
vp8_put_epel8_h4_c:           2594.8  1159.6  1374.8
vp8_put_epel8_h4_neon:         506.4   244.2   314.0
vp8_put_epel8_h6_c:           3445.8  1677.1  1811.3
vp8_put_epel8_h6_neon:         634.4   371.7   433.0
vp8_put_epel8_v4_c:           2614.0  1174.8  1378.0

[libav-devel] [PATCH 15/19] aarch64: vp8: Port bilin functions from arm version

2019-02-01 Thread Martin Storsjö
                          Cortex A53     A72     A73
vp8_put_bilin4_h_c:            303.8   102.2   161.8
vp8_put_bilin4_h_neon:         100.0    40.9    41.2
vp8_put_bilin4_hv_c:           322.8   201.0   305.9
vp8_put_bilin4_hv_neon:        156.8    72.6    77.0
vp8_put_bilin4_v_c:            304.7   101.7   166.5

[libav-devel] [PATCH 10/19] aarch64: vp8: Reorder the function pointer inits to match the arm original

2019-02-01 Thread Martin Storsjö
---
 libavcodec/aarch64/vp8dsp_init_aarch64.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/aarch64/vp8dsp_init_aarch64.c b/libavcodec/aarch64/vp8dsp_init_aarch64.c
index 3fb254a..da54efd 100644
--- a/libavcodec/aarch64/vp8dsp_init_aarch64.c
+++

[libav-devel] [PATCH 05/19] aarch64: vp8: Fix linking for iOS

2019-02-01 Thread Martin Storsjö
The mach-o relocations don't allow a negative offset to a symbol; use the third movrel parameter to handle this issue transparently.
---
 libavcodec/aarch64/vp8dsp_neon.S | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/libavcodec/aarch64/vp8dsp_neon.S

[libav-devel] [PATCH 02/19] aarch64: vp8: Fix the include guard

2019-02-01 Thread Martin Storsjö
From: Carl Eugen Hoyos
---
 libavcodec/aarch64/vp8dsp.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vp8dsp.h b/libavcodec/aarch64/vp8dsp.h
index 8a0c8fb..40d0cae 100644
--- a/libavcodec/aarch64/vp8dsp.h
+++ b/libavcodec/aarch64/vp8dsp.h
@@ -16,8

[libav-devel] [PATCH 09/19] aarch64: vp8: Move the vp8dsp makefile entries to the right places

2019-02-01 Thread Martin Storsjö
Even if NEON is disabled, the init functions should be built, as they are called as long as ARCH_AARCH64 is set. These functions are part of a generic DSP subsystem, not tied directly to one decoder. (They should be built if the vp7 decoder is enabled, even if the vp8 decoder is disabled.)

[libav-devel] [PATCH 06/19] aarch64: vp8: Use the proper aarch64 form for conditional branches

2019-02-01 Thread Martin Storsjö
The previous form does seem to assemble with current tools, but I think it might fail with some older aarch64 tools.
---
 libavcodec/aarch64/vp8dsp_neon.S | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/libavcodec/aarch64/vp8dsp_neon.S

[libav-devel] [PATCH 08/19] aarch64: vp8: Remove superfluous includes

2019-02-01 Thread Martin Storsjö
---
 libavcodec/aarch64/vp8dsp_init_aarch64.c | 4
 1 file changed, 4 deletions(-)

diff --git a/libavcodec/aarch64/vp8dsp_init_aarch64.c b/libavcodec/aarch64/vp8dsp_init_aarch64.c
index f93bcfa..3fb254a 100644
--- a/libavcodec/aarch64/vp8dsp_init_aarch64.c
+++

[libav-devel] [PATCH 16/19] arm: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2

2019-02-01 Thread Martin Storsjö
This makes it similar to put_epel16_v6, and gives a 10-25% speedup of this function.

Before:                    Cortex A7      A8      A9     A53     A72
vp8_put_epel16_h6v6_neon:     3058.0  2218.5  2459.8  2183.0  1572.2
After:
vp8_put_epel16_h6v6_neon:     2670.8  1934.2  2244.4  1729.4

[libav-devel] [PATCH 17/19] aarch64: vp8: Optimize put_epel16_h6v6 with vp8_epel8_v6_y2

2019-02-01 Thread Martin Storsjö
This makes it similar to put_epel16_v6, and gives a large speedup on Cortex A53, a minor speedup on A72 and a very minor slowdown on A73.

Before:                   Cortex A53     A72     A73
vp8_put_epel16_h6v6_neon:     2211.4  1586.5  1431.7
After:
vp8_put_epel16_h6v6_neon:     1736.9  1522.0  1448.1
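The "large", "minor" and "very minor" wording can be checked against the before/after numbers with a short calculation. The helper below is a hypothetical illustration, not part of the patch; it returns the relative change in percent, where a negative value means the new version is faster.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Negative means the timing went down (a speedup).
 * percent_change is a hypothetical helper for illustration. */
static double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* vp8_put_epel16_h6v6_neon timings quoted in the message above. */
static const double epel16_before[3] = { 2211.4, 1586.5, 1431.7 }; /* A53, A72, A73 */
static const double epel16_after[3]  = { 1736.9, 1522.0, 1448.1 }; /* A53, A72, A73 */
```

Applied to the figures above, this gives roughly 21% less time on A53, roughly 4% less on A72, and roughly 1% more on A73, which matches the description in the message.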

[libav-devel] [PATCH 14/19] aarch64: vp8: Port epel4 functions from arm version

2019-02-01 Thread Martin Storsjö
                          Cortex A53     A72     A73
vp8_put_epel4_h4_c:            631.4   291.7   367.8
vp8_put_epel4_h4_neon:         241.0   131.0   155.7
vp8_put_epel4_h4v4_c:          967.5   529.3   667.7
vp8_put_epel4_h4v4_neon:       429.3   241.8   279.7
vp8_put_epel4_h4v6_c:         1374.7   657.5   864.5

[libav-devel] [PATCH 07/19] vp8dsp: Move the aarch64 dsp init call into alphabetical order

2019-02-01 Thread Martin Storsjö
---
 libavcodec/vp8dsp.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/vp8dsp.c b/libavcodec/vp8dsp.c
index 3c8d1c8..ac9a6af 100644
--- a/libavcodec/vp8dsp.c
+++ b/libavcodec/vp8dsp.c
@@ -679,14 +679,14 @@ av_cold void ff_vp78dsp_init(VP8DSPContext *dsp)