Hi,
Thanks for improving the instructions, it looks good to me.

Regards,
Chen

At 2025-05-07 14:49:51, "Gerda Zsejke More" <gerdazsejke.m...@arm.com> wrote:
>Optimize pixel_avg_pp_12x16_neon by using more suitable load and
>store instructions. Using LD1 for the 32-bit lane is a constructive
>operation - needing to merge the new value for lane 0 with the
>existing top half of the vector. Using LDR turns this into a wholly
>destructive operation since LDR zeros the rest of the vector -
>removing the false dependency.
>---
> source/common/aarch64/mc-a.S | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S
>index 8c2878b3e..130bf1a4a 100644
>--- a/source/common/aarch64/mc-a.S
>+++ b/source/common/aarch64/mc-a.S
>@@ -73,13 +73,13 @@ function PFX(pixel_avg_pp_12x16_neon)
>     sub x3, x3, #4
>     sub x5, x5, #4
> .rept 16
>-    ld1 {v0.s}[0], [x2], #4
>+    ldr s0, [x2], #4
>     ld1 {v1.8b}, [x2], x3
>-    ld1 {v2.s}[0], [x4], #4
>+    ldr s2, [x4], #4
>     ld1 {v3.8b}, [x4], x5
>     urhadd v4.8b, v0.8b, v2.8b
>     urhadd v5.8b, v1.8b, v3.8b
>-    st1 {v4.s}[0], [x0], #4
>+    str s4, [x0], #4
>     st1 {v5.8b}, [x0], x1
> .endr
>     ret
>--
>2.39.5 (Apple Git-154)
>
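
For anyone following the thread, the merge-versus-zero behaviour described in the commit message can be modelled roughly as below. This is only a simplified Python sketch, not the NEON code: a 16-byte list stands in for a vector register, and the helper names ld1_lane_s0 and ldr_s are made up for illustration. It shows why the LD1 lane form needs the old register value as an input while the LDR form does not.

```python
def ld1_lane_s0(reg, mem, addr):
    """Rough model of `ld1 {vN.s}[0], [xM]`: replace bytes 0-3, keep bytes 4-15."""
    out = list(reg)                 # result is built from the OLD register contents
    out[0:4] = mem[addr:addr + 4]   # only lane 0 (4 bytes) is overwritten
    return out


def ldr_s(mem, addr):
    """Rough model of `ldr sN, [xM]`: load bytes 0-3 and zero bytes 4-15."""
    # No old register value is read, so there is nothing to wait for.
    return list(mem[addr:addr + 4]) + [0] * 12


mem = list(range(100, 116))   # hypothetical pixel data
old = [0xAA] * 16             # stale contents left in the register

merged = ld1_lane_s0(old, mem, 0)
fresh = ldr_s(mem, 0)

print(merged[:4] == fresh[:4])   # the 4 bytes that get stored are identical
print(merged[4:], fresh[4:])     # the remaining bytes differ: stale vs zero
```

Since the routine only stores the low 4 bytes of v4, both forms give the same output; the difference is only in whether the load has to wait on the previous value of the register.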
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel