Starting from b3ce1c32da303c92241e85bd69298ab9903c0126, We observed that the video streams generated by x265 compiled with CLANG and GCC, using the same parameters and input video, are inconsistent.
Encoding logs for CLANG compiled x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74146.30 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67231.66 Encoding logs for GCC complied x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74022.17 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67220.26 This because a macro LOAD_DIFF_16x4_sve2 was introduced in commit b3ce1c32da303c92241e85bd69298ab9903c0126 for AArch64 (This macro has been renamed to LOAD_DIFF_16x4_sve now) and is used by SATD kernels. The macro LOAD_DIFF_16x4_sve uses unprotected callee-saved SVE registers z8 to z15. Unfortunately, in the caller function MotionEstimate of SATD, the CLANG generates some code to use s10. Here is a simplified code snippet: fmov s10, #0.50000000 bl subpelCompare -> will invoke SATD fadd s0, s0, s10 Thus, the content of s10 used in MotionEstimate is overwritten by the SATD function, which affects the correctness of subsequent video encoding. But we did not observe similar code in GCC to access s8 - s15 in MotionEstimate. At the same time, we can influence CLANG by modifying -mpu=neoverse-N2, in this case CLANG will not use s10 in MotionEstimate, thereby making the video stream consistent. However, such compiler dependent behavior is unreliable. Therefore, in this patch, we avoid using z8 to z15 and extra stack operations by reusing z0 to z7 in LOAD_DIFF_16x4_sve. Here are the test results after applying the patch: Encoding logs for CLANG compiled x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74022.17 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67220.26 Encoding logs for GCC compiled x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74022.17 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67220.26 Change-Id: Ic0e0c0706b99e53b138cceb758ee6ce148130e4b --- source/common/aarch64/pixel-util-sve.S | 34 +++++++++++++------------- 1 file changed, 17 insertions(+), 17 deletions(-)diff --git a/source/common/aarch64/pixel-util-sve.S b/source/common/aarch64/pixel-util-sve.S
index 3d073d42e..106ba903a 100644 --- a/source/common/aarch64/pixel-util-sve.S +++ b/source/common/aarch64/pixel-util-sve.S @@ -190,27 +190,27 @@ endfunc ld1b {z7.h}, p0/z, [x2, x11] add x0, x0, x1 add x2, x2, x3 - ld1b {z29.h}, p0/z, [x0] - ld1b {z9.h}, p0/z, [x0, x11] - ld1b {z10.h}, p0/z, [x2] - ld1b {z11.h}, p0/z, [x2, x11] - add x0, x0, x1 - add x2, x2, x3 - ld1b {z12.h}, p0/z, [x0] - ld1b {z13.h}, p0/z, [x0, x11] - ld1b {z14.h}, p0/z, [x2] - ld1b {z15.h}, p0/z, [x2, x11] - add x0, x0, x1 - add x2, x2, x3 - sub \v0\().h, z0.h, z2.h sub \v4\().h, z1.h, z3.h sub \v1\().h, z4.h, z6.h sub \v5\().h, z5.h, z7.h - sub \v2\().h, z29.h, z10.h - sub \v6\().h, z9.h, z11.h - sub \v3\().h, z12.h, z14.h - sub \v7\().h, z13.h, z15.h + + ld1b {z0.h}, p0/z, [x0] + ld1b {z1.h}, p0/z, [x0, x11] + ld1b {z2.h}, p0/z, [x2] + ld1b {z3.h}, p0/z, [x2, x11] + add x0, x0, x1 + add x2, x2, x3 + ld1b {z4.h}, p0/z, [x0] + ld1b {z5.h}, p0/z, [x0, x11] + ld1b {z6.h}, p0/z, [x2] + ld1b {z7.h}, p0/z, [x2, x11] + add x0, x0, x1 + add x2, x2, x3 + sub \v2\().h, z0.h, z2.h + sub \v6\().h, z1.h, z3.h + sub \v3\().h, z4.h, z6.h + sub \v7\().h, z5.h, z7.h .endm // one vertical hadamard pass and two horizontal -- 2.34.1
From 2b01d1cebd89ebdfe34bc92bd44b444d83594110 Mon Sep 17 00:00:00 2001 From: Wei Chen <wei.c...@arm.com> Date: Thu, 5 Sep 2024 11:14:39 +0800 Subject: [PATCH] AArch64: Fixed the inconsistent bitstreams for GCC and CLANG Starting from b3ce1c32da303c92241e85bd69298ab9903c0126, We observed that the video streams generated by x265 compiled with CLANG and GCC, using the same parameters and input video, are inconsistent. Encoding logs for CLANG compiled x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74146.30 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67231.66 Encoding logs for GCC complied x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74022.17 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67220.26 This because a macro LOAD_DIFF_16x4_sve2 was introduced in commit b3ce1c32da303c92241e85bd69298ab9903c0126 for AArch64 (This macro has been renamed to LOAD_DIFF_16x4_sve now) and is used by SATD kernels. The macro LOAD_DIFF_16x4_sve uses unprotected callee-saved SVE registers z8 to z15. Unfortunately, in the caller function MotionEstimate of SATD, the CLANG generates some code to use s10. Here is a simplified code snippet: fmov s10, #0.50000000 bl subpelCompare -> will invoke SATD fadd s0, s0, s10 Thus, the content of s10 used in MotionEstimate is overwritten by the SATD function, which affects the correctness of subsequent video encoding. But we did not observe similar code in GCC to access s8 - s15 in MotionEstimate. At the same time, we can influence CLANG by modifying -mpu=neoverse-N2, in this case CLANG will not use s10 in MotionEstimate, thereby making the video stream consistent. However, such compiler dependent behavior is unreliable. Therefore, in this patch, we avoid using z8 to z15 and extra stack operations by reusing z0 to z7 in LOAD_DIFF_16x4_sve. Here are the test results after applying the patch: Encoding logs for CLANG compiled x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74022.17 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67220.26 Encoding logs for GCC compiled x265: x265 [info]: frame I: 2, Avg QP:37.79 kb/s: 72220.75 x265 [info]: frame P: 20, Avg QP:37.75 kb/s: 74022.17 x265 [info]: frame B: 76, Avg QP:39.75 kb/s: 67220.26 Change-Id: Ic0e0c0706b99e53b138cceb758ee6ce148130e4b --- source/common/aarch64/pixel-util-sve.S | 34 +++++++++++++------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/source/common/aarch64/pixel-util-sve.S b/source/common/aarch64/pixel-util-sve.S index 3d073d42e..106ba903a 100644 --- a/source/common/aarch64/pixel-util-sve.S +++ b/source/common/aarch64/pixel-util-sve.S @@ -190,27 +190,27 @@ endfunc ld1b {z7.h}, p0/z, [x2, x11] add x0, x0, x1 add x2, x2, x3 - ld1b {z29.h}, p0/z, [x0] - ld1b {z9.h}, p0/z, [x0, x11] - ld1b {z10.h}, p0/z, [x2] - ld1b {z11.h}, p0/z, [x2, x11] - add x0, x0, x1 - add x2, x2, x3 - ld1b {z12.h}, p0/z, [x0] - ld1b {z13.h}, p0/z, [x0, x11] - ld1b {z14.h}, p0/z, [x2] - ld1b {z15.h}, p0/z, [x2, x11] - add x0, x0, x1 - add x2, x2, x3 - sub \v0\().h, z0.h, z2.h sub \v4\().h, z1.h, z3.h sub \v1\().h, z4.h, z6.h sub \v5\().h, z5.h, z7.h - sub \v2\().h, z29.h, z10.h - sub \v6\().h, z9.h, z11.h - sub \v3\().h, z12.h, z14.h - sub \v7\().h, z13.h, z15.h + + ld1b {z0.h}, p0/z, [x0] + ld1b {z1.h}, p0/z, [x0, x11] + ld1b {z2.h}, p0/z, [x2] + ld1b {z3.h}, p0/z, [x2, x11] + add x0, x0, x1 + add x2, x2, x3 + ld1b {z4.h}, p0/z, [x0] + ld1b {z5.h}, p0/z, [x0, x11] + ld1b {z6.h}, p0/z, [x2] + ld1b {z7.h}, p0/z, [x2, x11] + add x0, x0, x1 + add x2, x2, x3 + sub \v2\().h, z0.h, z2.h + sub \v6\().h, z1.h, z3.h + sub \v3\().h, z4.h, z6.h + sub \v7\().h, z5.h, z7.h .endm // one vertical hadamard pass and two horizontal -- 2.34.1
_______________________________________________ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel