You are welcome. On your CPU, ldp is still slower, so we can keep the original version and improve it again in the future. This version looks good to me; thank you for your contribution.

At 2021-06-24 10:01:40, "Pop, Sebastian" <s...@amazon.com> wrote:

Thanks again Chen for your careful review and recommendations.

I added the following change to the attached patch as we get better performance:

--- a/source/common/aarch64/ipfilter8.S
+++ b/source/common/aarch64/ipfilter8.S
@@ -35,14 +35,14 @@ function x265_filterPixelToShort_4x4_neon
     movi            v2.8h, #0xe0, lsl #8
     ld1             {v0.s}[0], [x0], x1
     ld1             {v0.s}[1], [x0], x1
-    ld1             {v1.s}[2], [x0], x1
-    ld1             {v1.s}[3], [x0], x1
     ushll           v3.8h, v0.8b, #6
-    ushll2          v4.8h, v1.16b, #6
     add             v3.8h, v3.8h, v2.8h
-    add             v4.8h, v4.8h, v2.8h
     st1             {v3.d}[0], [x2], x3
     st1             {v3.d}[1], [x2], x3
+    ld1             {v1.s}[0], [x0], x1
+    ld1             {v1.s}[1], [x0], x1
+    ushll           v4.8h, v1.8b, #6
+    add             v4.8h, v4.8h, v2.8h
     st1             {v4.d}[0], [x2], x3
     st1             {v4.d}[1], [x2], x3
     ret

Before:
convert_p2s[  4x4]  1.20x  4.99  6.01

After:
convert_p2s[  4x4]  1.38x  4.20  5.78

I tried the ldp with post-increment as you recommended. Performance is slightly lower with the change:

 function x265_filterPixelToShort_64x\h\()_neon
     add             x3, x3, x3
     sub             x3, x3, #0x40
+    sub             x1, x1, #0x20
     movi            v4.8h, #0xe0, lsl #8
     mov             x9, #\r
 .loop_filterP2S_64x\h:
     subs            x9, x9, #1
 .rept 2
-    ld1             {v0.16b-v3.16b}, [x0], x1
+    ldp             q0, q1, [x0], #0x20
+    ldp             q0, q1, [x0]
+    add             x0, x0, x1
     ushll           v16.8h, v0.8b, #6
     ushll2          v17.8h, v0.16b, #6
     ushll           v18.8h, v1.8b, #6

Before:
convert_p2s[64x16]  1.46x  105.52  154.47
convert_p2s[64x32]  1.47x  212.06  312.14
convert_p2s[64x48]  1.47x  318.75  467.61
convert_p2s[64x64]  1.46x  425.61  622.36

After:
convert_p2s[64x16]  1.42x  108.41  154.37
convert_p2s[64x32]  1.45x  215.18  312.12
convert_p2s[64x48]  1.44x  325.01  468.76
convert_p2s[64x64]  1.44x  432.46  622.36
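
For anyone reading the diffs without the rest of the file, here is a minimal scalar sketch of what the convert_p2s kernels appear to compute. It is inferred only from the instructions quoted above (ushll by 6, then an add of 0xe000 in each 16-bit lane). The name p2s_ref, its signature, and the strides counted in elements are assumptions for illustration; this is not code from the x265 tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference for the pixel-to-short conversion.
 * Strides are assumed to be in elements, not bytes. */
static void p2s_ref(const uint8_t *src, ptrdiff_t src_stride,
                    int16_t *dst, ptrdiff_t dst_stride,
                    int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            /* ushll #6 widens the 8-bit pixel and shifts it left by 6.
             * Adding 0xe000 in a 16-bit lane equals subtracting 8192. */
            dst[x] = (int16_t)(((int)src[x] << 6) - 8192);
        }
        src += src_stride;
        dst += dst_stride;
    }
}
```

With 8-bit input the result always lies in [-8192, 8128], so it fits in int16_t without wrapping. A 4x4 block would be converted with p2s_ref(src, src_stride, dst, dst_stride, 4, 4).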

_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel