Hi Li,
Thanks for the improved patches. They look good to me; just one small comment below.

In most of the functions:

+        int16x8_t a0 = vld1q_s16(src + w + 0);
+        int16x8_t a1 = vld1q_s16(src + w + 8);

How does the performance compare to vld1q_s16_x2?

Regards,
Chen

At 2025-05-20 00:41:39, "Li Zhang" <li.zha...@arm.com> wrote:
>Hello,
>
>This patch series optimizes and implements several AArch64 block copy
>primitives using Neon intrinsics. It also cleans up and removes the Neon
>and SVE assembly implementations that are either slower or offer no
>performance benefit.
>
>Many thanks,
>Li
>
>Li Zhang (8):
>  AArch64: Optimize blockcopy_pp_neon intrinsics implementation
>  AArch64: Optimize blockcopy_ps Neon intrinsics implementation
>  AArch64: Implement blockcopy_ss primitives using Neon intrinsics
>  AArch64: Implement blockcopy_sp primitives using Neon intrinsics
>  AArch64: Optimize cpy1Dto2D_shl Neon intrinsics implementation
>  AArch64: Optimize cpy2Dto1D_shl Neon intrinsics implementation
>  AArch64: Implement cpy2Dto1D_shr using Neon intrinsics
>  AArch64: Implement cpy1Dto2D_shr using Neon intrinsics
>
> source/common/CMakeLists.txt              |    2 +-
> source/common/aarch64/asm-primitives.cpp  |  180 ---
> source/common/aarch64/blockcopy8-common.S |   54 -
> source/common/aarch64/blockcopy8-sve.S    | 1346 ---------------------
> source/common/aarch64/blockcopy8.S        | 1049 ----------------
> source/common/aarch64/pixel-prim.cpp      |  358 +++++-
> 6 files changed, 305 insertions(+), 2684 deletions(-)
> delete mode 100644 source/common/aarch64/blockcopy8-common.S
>
>--
>2.39.5 (Apple Git-154)
>
>_______________________________________________
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
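To make the vld1q_s16_x2 question concrete, here is a minimal sketch of the two load patterns side by side. The helper names copy16_two_loads and copy16_x2_load are hypothetical and not taken from the patch, and the memcpy fallback is only there so the snippet also builds on non-AArch64 hosts. Both variants produce the same result; whether the x2 form is any faster depends on the compiler and the core, so it would need to be measured.

```c
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy 16 int16_t values using two separate
 * 128-bit loads, as in the quoted patch lines. */
void copy16_two_loads(int16_t *dst, const int16_t *src)
{
#if defined(__aarch64__)
    int16x8_t a0 = vld1q_s16(src + 0);
    int16x8_t a1 = vld1q_s16(src + 8);
    vst1q_s16(dst + 0, a0);
    vst1q_s16(dst + 8, a1);
#else
    /* Portable fallback (assumption: only for building off-target). */
    memcpy(dst, src, 16 * sizeof(int16_t));
#endif
}

/* Hypothetical helper: the same copy with a single vld1q_s16_x2,
 * which loads two consecutive vectors into an int16x8x2_t. */
void copy16_x2_load(int16_t *dst, const int16_t *src)
{
#if defined(__aarch64__)
    int16x8x2_t a = vld1q_s16_x2(src);
    vst1q_s16(dst + 0, a.val[0]);
    vst1q_s16(dst + 8, a.val[1]);
#else
    /* Portable fallback (assumption: only for building off-target). */
    memcpy(dst, src, 16 * sizeof(int16_t));
#endif
}
```

In the patch the source pointer would be src + w rather than src; the offset is left out here to keep the sketch short.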