Hi Chen,

Thanks for the feedback.
I tried using LDR with register offsets and unrolling by 2; the performance of the two approaches is almost the same (<= 0.03x deviation, up or down, across the different block sizes). A sketch of the unrolled variant I tried is at the end of this message.

Regards,
Li

From: x265-devel <x265-devel-boun...@videolan.org> on behalf of chen <chenm...@163.com>
Date: Friday, 2025. June 20. at 7:09
To: Development for x265 <x265-devel@videolan.org>
Cc: nd <n...@arm.com>
Subject: Re: [x265] [PATCH] AArch64: Optimize pixel_avg_pp_4xh

The code looks good to me.

By the way: LDR supports register indirect addressing, so how about unroll(2) to reduce the number of ADD operations?

At 2025-06-19 22:58:53, "Li Zhang" <li.zha...@arm.com> wrote:
>Use LDR and STR instead of LD1 to lane in the pixel_avg_pp_4xh assembly
>implementation. The new approach is a wholly destructive operation and
>removes a false dependency on the existing register contents.
>
>The change provides up to 2.5x speed up.
>---
> source/common/aarch64/mc-a.S | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/mc-a.S b/source/common/aarch64/mc-a.S
>index 130bf1a4a..ff18713fa 100644
>--- a/source/common/aarch64/mc-a.S
>+++ b/source/common/aarch64/mc-a.S
>@@ -38,10 +38,13 @@
> .macro pixel_avg_pp_4xN_neon h
> function PFX(pixel_avg_pp_4x\h\()_neon)
> .rept \h
>-    ld1 {v0.s}[0], [x2], x3
>-    ld1 {v1.s}[0], [x4], x5
>+    ldr s0, [x2]
>+    ldr s1, [x4]
>+    add x2, x2, x3
>+    add x4, x4, x5
>     urhadd v2.8b, v0.8b, v1.8b
>-    st1 {v2.s}[0], [x0], x1
>+    str s2, [x0]
>+    add x0, x0, x1
> .endr
>     ret
> endfunc
>--
>2.39.5 (Apple Git-154)
>
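For reference, an unroll(2) variant along these lines could look like the sketch below. This is a sketch only, not the exact code I benchmarked and not the committed patch; it assumes \h is even (which holds for the 4x4/4x8/4x16 sizes this macro generates) and uses a hypothetical macro name to avoid clashing with the existing one.

    // Sketch only: unroll by 2, second row addressed via LDR/STR register
    // offsets, so each pointer needs one shifted ADD per pair of rows.
    .macro pixel_avg_pp_4xN_unroll2_neon h
    function PFX(pixel_avg_pp_4x\h\()_neon)
    .rept \h / 2
        ldr     s0, [x2]            // src0, row 0
        ldr     s2, [x2, x3]        // src0, row 1 (register offset)
        ldr     s1, [x4]            // src1, row 0
        ldr     s3, [x4, x5]        // src1, row 1 (register offset)
        add     x2, x2, x3, lsl #1  // advance src0 by two strides
        add     x4, x4, x5, lsl #1  // advance src1 by two strides
        urhadd  v4.8b, v0.8b, v1.8b
        urhadd  v5.8b, v2.8b, v3.8b
        str     s4, [x0]            // dst, row 0
        str     s5, [x0, x1]        // dst, row 1 (register offset)
        add     x0, x0, x1, lsl #1  // advance dst by two strides
    .endr
        ret
    endfunc
    .endm

This halves the pointer-increment ADDs (3 per two rows instead of 6), but as noted above the measured difference stayed within roughly 0.03x either way.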