From 56a22d5ea62fe1d86f4032c0858832bb80d88972 Mon Sep 17 00:00:00 2001
From: Gerda Zsejke More <gerdazsejke.m...@arm.com>
Date: Sun, 27 Apr 2025 10:32:45 +0200
Subject: [PATCH] AArch64: Optimize pixel_avg_pp_12x16_neon

Optimize pixel_avg_pp_12x16_neon by using more suitable load and
store instructions.

Using LD1 to load a single 32-bit lane is a constructive operation:
the new value for lane 0 has to be merged with the existing contents
of the rest of the vector. Using LDR instead makes the load wholly
destructive, since LDR zeroes the rest of the vector, which removes
the false dependency on the previous register value.
---
 source/common/aarch64/mc-a.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S
index 8c2878b3e..130bf1a4a 100644
--- a/source/common/aarch64/mc-a.S
+++ b/source/common/aarch64/mc-a.S
@@ -73,13 +73,13 @@ function PFX(pixel_avg_pp_12x16_neon)
     sub             x3, x3, #4
     sub             x5, x5, #4
 .rept 16
-    ld1             {v0.s}[0], [x2], #4
+    ldr             s0, [x2], #4
     ld1             {v1.8b}, [x2], x3
-    ld1             {v2.s}[0], [x4], #4
+    ldr             s2, [x4], #4
     ld1             {v3.8b}, [x4], x5
     urhadd          v4.8b, v0.8b, v2.8b
     urhadd          v5.8b, v1.8b, v3.8b
-    st1             {v4.s}[0], [x0], #4
+    str             s4, [x0], #4
     st1             {v5.8b}, [x0], x1
 .endr
     ret
-- 
2.39.5 (Apple Git-154)
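For readers who want to convince themselves that the two load forms are interchangeable here, the following is a minimal scalar C sketch of the register semantics described in the commit message. It is not part of the patch and not x265 code: the `vreg` type and all `model_*` names are hypothetical, and the model only captures which bytes each instruction writes, not timing or dependency tracking in a real core.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit NEON register as 16 bytes. */
typedef struct { uint8_t b[16]; } vreg;

/* Models `ld1 {vN.s}[0], [ptr]`: writes bytes 0..3 and keeps bytes 4..15,
 * so the result depends on the previous contents of the register. */
vreg model_ld1_lane_s0(vreg old, const uint8_t *src)
{
    memcpy(old.b, src, 4);
    return old;
}

/* Models `ldr sN, [ptr]`: writes bytes 0..3 and zeroes bytes 4..15,
 * so the result does not depend on the previous register contents. */
vreg model_ldr_s(const uint8_t *src)
{
    vreg r = {{0}};
    memcpy(r.b, src, 4);
    return r;
}

/* Models `urhadd vD.8b, vA.8b, vB.8b`: unsigned rounding average of the
 * low 8 bytes; the upper 8 bytes of the destination are zeroed. */
vreg model_urhadd_8b(vreg a, vreg b)
{
    vreg r = {{0}};
    for (int i = 0; i < 8; i++)
        r.b[i] = (uint8_t)((a.b[i] + b.b[i] + 1) >> 1);
    return r;
}

/* Returns 1 when the four bytes that a 32-bit store of the result would
 * write are the same for the LD1-lane and the LDR load forms. */
int stored_bytes_match(vreg stale0, vreg stale2,
                       const uint8_t *p0, const uint8_t *p1)
{
    vreg a = model_urhadd_8b(model_ld1_lane_s0(stale0, p0),
                             model_ld1_lane_s0(stale2, p1));
    vreg b = model_urhadd_8b(model_ldr_s(p0), model_ldr_s(p1));
    return memcmp(a.b, b.b, 4) == 0;
}

/* Small self-check using arbitrary stale register contents. */
void run_demo(void)
{
    const uint8_t p0[4] = {10, 20, 30, 40};
    const uint8_t p1[4] = {11, 21, 31, 41};
    vreg stale;
    memset(stale.b, 0xAA, sizeof stale.b);
    assert(stored_bytes_match(stale, stale, p0, p1));
}
```

In this model only bytes 0..3 of the averaged result are ever stored, so whatever the lane load leaves in bytes 4..15 cannot affect the output; the LDR form simply never reads the old register value.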