Hi,

> You didn't see improve because you still use USHR, after CMEQ, we get 0 or -1
> depends on result, we can sum of these -1 to get totally number of non-zero
> coeffs, it reduce 3 instructions to 2.
You are right. With this change I see a lot of improvement:

@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon
 .rept 2
     ld1             {v0.8b}, [x1], x2
     ld1             {v1.8b}, [x1], x2
-    clz             v2.4h, v0.4h
-    clz             v3.4h, v1.4h
-    ushr            v2.4h, v2.4h, #4
-    ushr            v3.4h, v3.4h, #4
-    add             v2.4h, v2.4h, v3.4h
-    add             v4.4h, v4.4h, v2.4h
     st1             {v0.8b}, [x0], #8
     st1             {v1.8b}, [x0], #8
+    cmeq            v0.4h, v0.4h, #0
+    cmeq            v1.4h, v1.4h, #0
+    add             v4.4h, v4.4h, v0.4h
+    add             v4.4h, v4.4h, v1.4h
 .endr
     uaddlv          s4, v4.4h
-    fmov            w12, s4
-    mov             w11, #16
-    sub             w0, w11, w12
+    umov            w12, v4.h[0]
+    sxth            w12, w12
+    add             x0, x12, #16
     ret
endfunc

Before:
copy_cnt[4x4]     13.93x    7.50      104.56
copy_cnt[8x8]     31.20x    12.70     396.33
copy_cnt[16x16]   43.22x    36.00     1556.03
copy_cnt[32x32]   47.39x    129.34    6129.63

After:
copy_cnt[4x4]     14.76x    7.12      105.12
copy_cnt[8x8]     37.56x    10.60     398.25
copy_cnt[16x16]   52.57x    29.74     1563.60
copy_cnt[32x32]   62.22x    98.37     6120.29

> +    xtn             v0.8b, v0.8h
> +    xtn2            v0.16b, v1.8h
> equal to
>     tbl             v0, {v0,v1}, v2

You are right. With this change I see a lot of improvement:

Before:
copy_sp[16x16]           85.13x    18.78     1599.19
copy_sp[32x32]           96.31x    65.07     6266.88
copy_sp[64x64]           98.81x    252.38    24937.40
[i422] copy_sp[16x32]    91.93x    34.32     3154.89
[i422] copy_sp[32x64]    99.54x    128.29    12769.10

After:
copy_sp[16x16]           96.23x    16.42     1579.74
copy_sp[32x32]           104.33x   57.84     6034.24
copy_sp[64x64]           110.79x   221.66    24558.72
[i422] copy_sp[16x32]    97.74x    31.89     3116.46
[i422] copy_sp[32x64]    111.37x   112.39    12517.52

Please see the amended patch.

Thanks,
Sebastian
0001-arm64-port-count_nonzero-blkfill-and-copy_-ss-sp-ps.patch
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel