---
libavcodec/aarch64/vp8dsp_neon.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavcodec/aarch64/vp8dsp_neon.S b/libavcodec/aarch64/vp8dsp_neon.S
index f371ea7..14a9d11 100644
--- a/libavcodec/aarch64/vp8dsp_neon.S
+++ b/libavcodec/aarch64/vp8dsp_neon.S
@@ -28,7 +28,7
The previous version was a pretty exact translation of the arm
version. This version does do some unnecessary arithmetic (it does
more operations on vectors that are only half filled; it does 4
uaddw and 4 sqxtun instead of 2 of each), but it reduces the overhead
of packing data together (which
From: Magnus Röös
Partial port of the ARM Neon for aarch64.
Benchmarks from fate:
benchmarking with Linux Perf Monitoring API
nop: 58.6
checkasm: using random seed 1760970128
NEON:
- vp8dsp.idct [OK]
- vp8dsp.mc [OK]
- vp8dsp.loopfilter [OK]
checkasm: all 21 tests passed
This also partially fixes assembling with MS armasm64 (via
gas-preprocessor).
---
libavcodec/aarch64/vp8dsp_neon.S | 124 +++
1 file changed, 62 insertions(+), 62 deletions(-)
diff --git a/libavcodec/aarch64/vp8dsp_neon.S b/libavcodec/aarch64/vp8dsp_neon.S
The original arm version didn't do saturation here. This probably
doesn't make any difference for performance, but reduces the
differences.
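
The excerpt does not show which instruction the patch changes, so the following is only a scalar C sketch of the general difference between a wrapping and a saturating 16-bit add. The names wrap_add16 and sat_add16 are hypothetical and do not come from the patch.

```c
#include <stdint.h>

/* Wrapping 16-bit add: the sum is reduced modulo 2^16, which is what a
 * plain (non-saturating) vector add does in each 16-bit lane. */
static int16_t wrap_add16(int16_t a, int16_t b)
{
    uint16_t u = (uint16_t)((uint16_t)a + (uint16_t)b);
    /* Map the unsigned bit pattern back to a signed value explicitly. */
    return (int16_t)(u >= 0x8000 ? (int32_t)u - 0x10000 : (int32_t)u);
}

/* Saturating 16-bit add: the sum is clamped to [INT16_MIN, INT16_MAX]. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX)
        return INT16_MAX;
    if (s < INT16_MIN)
        return INT16_MIN;
    return (int16_t)s;
}
```

The two functions agree whenever the exact sum fits in 16 bits, so the choice only changes results for inputs that would overflow.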
---
libavcodec/aarch64/vp8dsp_neon.S | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/libavcodec/aarch64/vp8dsp_neon.S
                       Cortex A53      A72      A73
vp8_luma_dc_wht_c:            115.7     75.7     90.7
vp8_luma_dc_wht_neon:          60.7     41.2     45.7
vp8_idct_dc_add4uv_c:         376.1    262.9    282.5
vp8_idct_dc_add4uv_neon:       52.0     29.0     37.0
---
libavcodec/aarch64/vp8dsp_init_aarch64.c | 3 +
---
libavcodec/aarch64/vp8dsp_neon.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavcodec/aarch64/vp8dsp_neon.S b/libavcodec/aarch64/vp8dsp_neon.S
index c19ab0d..2b5b049 100644
--- a/libavcodec/aarch64/vp8dsp_neon.S
+++ b/libavcodec/aarch64/vp8dsp_neon.S
@@ -743,7 +743,7
Cortex A53 A72 A73
vp8_put_epel8_h4_c: 2594.8 1159.6 1374.8
vp8_put_epel8_h4_neon: 506.4 244.2 314.0
vp8_put_epel8_h6_c: 3445.8 1677.1 1811.3
vp8_put_epel8_h6_neon: 634.4 371.7 433.0
vp8_put_epel8_v4_c: 2614.0 1174.8 1378.0
                      Cortex A53      A72      A73
vp8_put_bilin4_h_c:          303.8    102.2    161.8
vp8_put_bilin4_h_neon:       100.0     40.9     41.2
vp8_put_bilin4_hv_c:         322.8    201.0    305.9
vp8_put_bilin4_hv_neon:      156.8     72.6     77.0
vp8_put_bilin4_v_c:          304.7    101.7    166.5
---
libavcodec/aarch64/vp8dsp_init_aarch64.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/libavcodec/aarch64/vp8dsp_init_aarch64.c
b/libavcodec/aarch64/vp8dsp_init_aarch64.c
index 3fb254a..da54efd 100644
--- a/libavcodec/aarch64/vp8dsp_init_aarch64.c
+++
The mach-o relocations don't allow a negative offset to a symbol;
use the third movrel parameter to handle this issue transparently.
---
libavcodec/aarch64/vp8dsp_neon.S | 14 +++---
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/libavcodec/aarch64/vp8dsp_neon.S
From: Carl Eugen Hoyos
---
libavcodec/aarch64/vp8dsp.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/libavcodec/aarch64/vp8dsp.h b/libavcodec/aarch64/vp8dsp.h
index 8a0c8fb..40d0cae 100644
--- a/libavcodec/aarch64/vp8dsp.h
+++ b/libavcodec/aarch64/vp8dsp.h
@@ -16,8
Even if NEON is disabled, the init functions should still be built,
since they are called whenever ARCH_AARCH64 is set.
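
A minimal sketch of why the init function has to exist whenever the architecture is selected: the generic init calls it unconditionally for that architecture, and the NEON decision is made inside it. All names here (DemoDSPContext, demo_dsp_init, DEMO_ARCH_AARCH64, DEMO_CPU_FLAG_NEON) are hypothetical stand-ins, not the real FFmpeg symbols.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ARCH_AARCH64  1  /* hypothetical: set by the build system */
#define DEMO_CPU_FLAG_NEON 1  /* hypothetical CPU feature flag */

typedef void (*demo_add_fn)(uint8_t *dst, const uint8_t *src, size_t n);

typedef struct DemoDSPContext {
    demo_add_fn add;
} DemoDSPContext;

/* Portable fallback. */
static void demo_add_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Stands in for an assembly routine; same behaviour in this sketch. */
static void demo_add_neon(uint8_t *dst, const uint8_t *src, size_t n)
{
    demo_add_c(dst, src, n);
}

/* Always built for this architecture; only the assignment depends on
 * whether NEON is available. */
static void demo_dsp_init_aarch64(DemoDSPContext *c, int cpu_flags)
{
    if (cpu_flags & DEMO_CPU_FLAG_NEON)
        c->add = demo_add_neon;
}

/* Generic init: installs the C version, then lets the arch init override. */
static void demo_dsp_init(DemoDSPContext *c, int cpu_flags)
{
    c->add = demo_add_c;
#if DEMO_ARCH_AARCH64
    demo_dsp_init_aarch64(c, cpu_flags);
#endif
}
```

If demo_dsp_init_aarch64 were compiled only when NEON is enabled, the call in demo_dsp_init would fail to link in a build for this architecture without NEON.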
These functions are part of a generic DSP subsystem, not tied directly
to one decoder. (They should be built if the vp7 decoder is enabled,
even if the vp8 decoder is disabled.)
The previous form also does seem to assemble on current tools,
but I think it might fail on some older aarch64 tools.
---
libavcodec/aarch64/vp8dsp_neon.S | 28 ++--
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/libavcodec/aarch64/vp8dsp_neon.S
---
libavcodec/aarch64/vp8dsp_init_aarch64.c | 4
1 file changed, 4 deletions(-)
diff --git a/libavcodec/aarch64/vp8dsp_init_aarch64.c
b/libavcodec/aarch64/vp8dsp_init_aarch64.c
index f93bcfa..3fb254a 100644
--- a/libavcodec/aarch64/vp8dsp_init_aarch64.c
+++
This makes it similar to put_epel16_v6, and gives a 10-25%
speedup of this function.
Before:                  Cortex A7       A8       A9      A53      A72
vp8_put_epel16_h6v6_neon:   3058.0   2218.5   2459.8   2183.0   1572.2
After:
vp8_put_epel16_h6v6_neon:   2670.8   1934.2   2244.4   1729.4
This makes it similar to put_epel16_v6, and gives a large speedup
on Cortex A53, a minor speedup on A72 and a very minor slowdown on
A73.
Before: Cortex A53 A72 A73
vp8_put_epel16_h6v6_neon: 2211.4 1586.5 1431.7
After:
vp8_put_epel16_h6v6_neon: 1736.9 1522.0 1448.1
                       Cortex A53      A72      A73
vp8_put_epel4_h4_c:           631.4    291.7    367.8
vp8_put_epel4_h4_neon:        241.0    131.0    155.7
vp8_put_epel4_h4v4_c:         967.5    529.3    667.7
vp8_put_epel4_h4v4_neon:      429.3    241.8    279.7
vp8_put_epel4_h4v6_c:        1374.7    657.5    864.5
---
libavcodec/vp8dsp.c | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/libavcodec/vp8dsp.c b/libavcodec/vp8dsp.c
index 3c8d1c8..ac9a6af 100644
--- a/libavcodec/vp8dsp.c
+++ b/libavcodec/vp8dsp.c
@@ -679,14 +679,14 @@ av_cold void ff_vp78dsp_init(VP8DSPContext *dsp)