Hi,

Some comments,

+.macro SAD_X_END_64 x
+    uaddlp          v16.4s, v16.8h

The dynamic range is 64*255 = 16320, which fits in 14 bits, so we do not need to extend to 32 bits here.

+    uaddlp          v17.4s, v17.8h
+    uaddlp          v18.4s, v18.8h
+    uaddlp          v20.4s, v20.8h
+    uaddlp          v21.4s, v21.8h
+    uaddlp          v22.4s, v22.8h
+    add             v16.4s, v16.4s, v20.4s
+    add             v17.4s, v17.4s, v21.4s
+    add             v18.4s, v18.4s, v22.4s
+    trn2            v20.2d, v16.2d, v16.2d
+    trn2            v21.2d, v17.2d, v17.2d
+    trn2            v22.2d, v18.2d, v18.2d
+    add             v16.2s, v16.2s, v20.2s
+    add             v17.2s, v17.2s, v21.2s
+    add             v18.2s, v18.2s, v22.2s
+    uaddlp          v16.1d, v16.2s

ADD+TRN2+ADD generates the sum of v16+v20 in V.2s, followed by UADDLP into V.1d.
Given the dynamic range analyzed above, we can replace this with:

    ADD    v16, v20    ; 15 bits (omitting the instructions for v17 = v17 + v21, etc.)
    ADD    v16, v17    ; 16 bits (omitting the other registers)
    UADDLV s0, v16

+    uaddlp          v17.1d, v17.2s
+    uaddlp          v18.1d, v18.2s
+    st1             {v16.s}[0], [x6], #4
+    st1             {v17.s}[0], [x6], #4
+    st1             {v18.s}[0], [x6], #4

I guess STP could store two results in one cycle.

Regards,
Min Chen

2021-07-22 14:30:50, "Pop, Sebastian" <s...@amazon.com> wrote:

Hi,

the attached patch ports the following kernels to arm64:

sad_x3[  4x4]  12.23x    13.79    168.68
sad_x4[  4x4]  14.12x    15.82    223.43
sad_x3[  8x8]  35.05x    17.45    611.47
sad_x4[  8x8]  38.48x    21.18    814.95
sad_x3[  8x4]  27.19x    11.46    311.48
sad_x4[  8x4]  30.40x    13.60    413.37
sad_x3[  4x8]  14.16x    22.99    325.37
sad_x4[  4x8]  15.82x    27.39    433.23
sad_x3[16x16]  40.94x    57.94   2371.97
sad_x4[16x16]  43.63x    72.44   3160.44
sad_x3[ 16x8]  38.84x    30.54   1186.15
sad_x4[ 16x8]  39.23x    40.16   1575.43
sad_x3[ 8x16]  38.74x    31.43   1217.71
sad_x4[ 8x16]  41.48x    39.01   1618.17
sad_x3[ 16x4]  31.82x    18.88    600.72
sad_x4[ 16x4]  36.35x    21.87    795.00
sad_x3[16x12]  40.27x    43.87   1766.74
sad_x4[16x12]  42.58x    55.94   2381.75
sad_x3[ 4x16]  15.34x    42.16    646.67
sad_x4[ 4x16]  17.08x    51.06    872.12
sad_x3[12x16]  29.45x    61.06   1798.28
sad_x4[12x16]  30.39x    78.94   2399.17
sad_x3[32x32]  42.85x   216.39   9272.65
sad_x4[32x32]  42.53x   294.98  12544.76
sad_x3[32x16]  42.09x   110.35   4644.86
sad_x4[32x16]  41.71x   151.05   6301.01
sad_x3[16x32]  44.19x   106.99   4728.04
sad_x4[16x32]  44.72x   139.94   6257.96
sad_x3[ 32x8]  40.10x    58.16   2332.47
sad_x4[ 32x8]  41.17x    76.65   3155.96
sad_x3[32x24]  42.69x   162.76   6947.64
sad_x4[32x24]  42.08x   223.88   9421.46
sad_x3[ 8x32]  41.86x    57.89   2423.47
sad_x4[ 8x32]  45.26x    71.56   3239.07
sad_x3[24x32]  45.10x   155.22   6999.53
sad_x4[24x32]  45.30x   205.87   9325.60
sad_x3[64x64]  39.87x   925.36  36892.50
sad_x4[64x64]  40.80x  1214.79  49557.66
sad_x3[64x32]  39.40x   468.08  18444.51
sad_x4[64x32]  40.71x   609.27  24803.74
sad_x3[32x64]  43.48x   426.05  18522.95
sad_x4[32x64]  43.31x   577.80  25024.14
sad_x3[64x16]  38.67x   238.72   9231.84
sad_x4[64x16]  40.36x   308.10  12435.08
sad_x3[64x48]  39.70x   695.95  27628.87
sad_x4[64x48]  40.74x   912.56  37173.46
sad_x3[16x64]  44.85x   208.19   9337.52
sad_x4[16x64]  45.46x   274.68  12487.54
sad_x3[48x64]  42.68x   653.74  27903.74
sad_x4[48x64]  44.67x   835.79  37336.87

Ok to commit?

Thanks,
Sebastian
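
The dynamic-range argument in the review comments above can be checked with a small model. The following is a minimal sketch in C, not the patch's AArch64 code. It assumes, as the review states, that each 16-bit lane holds at most 64*255 on entry to the macro. The names bits_needed, reduce_narrow_first, reduce_wide_first, worst_case_matches and check_review_bounds are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent v (0 needs 0 bits). */
static unsigned bits_needed(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Proposed order: add two 8-lane accumulators in 16-bit lanes (wrapping
 * modulo 2^16, like a .8h ADD), then widen while summing across lanes. */
static uint32_t reduce_narrow_first(const uint16_t a[8], const uint16_t b[8])
{
    uint32_t total = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t lane = (uint16_t)(a[i] + b[i]);
        total += lane;
    }
    return total;
}

/* Reference order: widen every lane to 32 bits before any addition. */
static uint32_t reduce_wide_first(const uint16_t a[8], const uint16_t b[8])
{
    uint32_t total = 0;
    for (int i = 0; i < 8; i++)
        total += (uint32_t)a[i] + (uint32_t)b[i];
    return total;
}

/* Returns 1 if both orders agree when every lane holds lane_max. */
static int worst_case_matches(uint16_t lane_max)
{
    uint16_t a[8], b[8];
    for (int i = 0; i < 8; i++) {
        a[i] = lane_max;
        b[i] = lane_max;
    }
    return reduce_narrow_first(a, b) == reduce_wide_first(a, b);
}

/* Asserts the bit widths quoted in the review: 14, 15 and 16 bits. */
static void check_review_bounds(void)
{
    const uint32_t lane_max = 64u * 255u;      /* assumed per-lane bound */
    assert(bits_needed(lane_max) == 14);       /* on entry to the macro  */
    assert(bits_needed(2u * lane_max) == 15);  /* after one 16-bit ADD   */
    assert(bits_needed(4u * lane_max) == 16);  /* after a second ADD     */
}
```

With every lane at 16320, both orders give 8 * 32640 = 261120, so under the stated bound the 16-bit adds lose nothing. If the per-lane bound were larger (for example 40000), the 16-bit add would wrap and the two orders would disagree, which is why the argument depends on the 64*255 figure.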
_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel