Hi Chen,

Thanks for the comment.

The optimization guide recommends LDP + STP for memory copy loops.
Older compilers sometimes struggle to generate optimal code from the
vld1q_<x>_x2 intrinsics.
Using two separate vld1q_<x> loads is the form most likely to get most
compilers to generate something optimal (LDP + STP).
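
For reference, the two forms under discussion can be sketched roughly as below. This is only an illustration, not code from the patch series: the helper names copy_row_two_loads and copy_row_x2 are hypothetical, width is assumed to be a multiple of 16, and the memcpy fallback is there only so the sketch also builds on non-AArch64 hosts. Which instructions a given compiler actually emits for each form depends on the compiler and its version, so the disassembly should be checked rather than assumed.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper: copy one row of int16_t, 16 elements per iteration,
// using two separate 128-bit loads and stores.
// Assumption: width is a multiple of 16.
static void copy_row_two_loads(int16_t *dst, const int16_t *src, int width)
{
    for (int w = 0; w < width; w += 16)
    {
#if defined(__aarch64__)
        // Two independent loads from adjacent addresses; a compiler may
        // pair these into a single LDP.
        int16x8_t a0 = vld1q_s16(src + w + 0);
        int16x8_t a1 = vld1q_s16(src + w + 8);
        // Two independent stores to adjacent addresses; a compiler may
        // pair these into a single STP.
        vst1q_s16(dst + w + 0, a0);
        vst1q_s16(dst + w + 8, a1);
#else
        // Portable fallback for non-AArch64 builds of this sketch.
        std::memcpy(dst + w, src + w, 16 * sizeof(int16_t));
#endif
    }
}

// Hypothetical helper: the same copy written with the _x2 intrinsics.
// Assumption: width is a multiple of 16.
static void copy_row_x2(int16_t *dst, const int16_t *src, int width)
{
    for (int w = 0; w < width; w += 16)
    {
#if defined(__aarch64__)
        // One intrinsic loads two consecutive 128-bit vectors.
        int16x8x2_t a = vld1q_s16_x2(src + w);
        // One intrinsic stores both vectors back to back.
        vst1q_s16_x2(dst + w, a);
#else
        // Portable fallback for non-AArch64 builds of this sketch.
        std::memcpy(dst + w, src + w, 16 * sizeof(int16_t));
#endif
    }
}
```

Both helpers copy the same data; the only difference is how the loads and stores are expressed to the compiler.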

Regards,
Li

From: chen <chenm...@163.com>
Date: Tuesday, 20 May 2025 at 5:18
To: Development for x265 <x265-devel@videolan.org>
Cc: nd <n...@arm.com>, Li Zhang <li.zha...@arm.com>
Subject: Re: [x265] [PATCH 0/8] AArch64: Clean up and optimize block copy primitives

Hi Li,



Thanks for the improvement patches.

They look good to me; just one small comment below.

In most of the functions:
+ int16x8_t a0 = vld1q_s16(src + w + 0);
+ int16x8_t a1 = vld1q_s16(src + w + 8);
How does the performance compare to vld1q_s16_x2?


Regards,

Chen



At 2025-05-20 00:41:39, "Li Zhang" <li.zha...@arm.com> wrote:

>Hello,
>
>This patch series optimizes and implements several AArch64 block copy
>primitives using Neon intrinsics. It also cleans up and removes the Neon
>and SVE assembly implementations that are either slower or offer no
>performance benefit.
>
>Many thanks,
>Li
>
>Li Zhang (8):
>  AArch64: Optimize blockcopy_pp_neon intrinsics implementation
>  AArch64: Optimize blockcopy_ps Neon intrinsics implementation
>  AArch64: Implement blockcopy_ss primitives using Neon intrinsics
>  AArch64: Implement blockcopy_sp primitives using Neon intrinsics
>  AArch64: Optimize cpy1Dto2D_shl Neon intrinsics implementation
>  AArch64: Optimize cpy2Dto1D_shl Neon intrinsics implementation
>  AArch64: Implement cpy2Dto1D_shr using Neon intrinsics
>  AArch64: Implement cpy1Dto2D_shr using Neon intrinsics
>
> source/common/CMakeLists.txt              |    2 +-
> source/common/aarch64/asm-primitives.cpp  |  180 ---
> source/common/aarch64/blockcopy8-common.S |   54 -
> source/common/aarch64/blockcopy8-sve.S    | 1346 ---------------------
> source/common/aarch64/blockcopy8.S        | 1049 ----------------
> source/common/aarch64/pixel-prim.cpp      |  358 +++++-
> 6 files changed, 305 insertions(+), 2684 deletions(-)
> delete mode 100644 source/common/aarch64/blockcopy8-common.S
>
>--
>2.39.5 (Apple Git-154)
>
>_______________________________________________
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel