Hi,

This patch series fixes bugs in the Arm SVE 16x16 and 32x32 DCT implementations, and also recovers part of the performance regression caused by the fix. Both SVE DCT implementations are still significantly faster than the equivalent Neon paths.

Note that the DCT unit tests did not expose these bugs. They were found after differences were observed between the encoded output videos on Arm and x86 for the veryslow, slower and slow encoding presets. With these patches applied, the encoded output matches for all speed presets.

Thanks,
Jonathan

Jonathan Wright (2):
  AArch64: Fix SVE 16x16 and 32x32 DCT implementations
  AArch64: Specialize passes of 16x16 and 32x32 SVE DCTs

 source/common/aarch64/dct-prim-sve.cpp | 338 ++++++++++++++++++++++---
 1 file changed, 306 insertions(+), 32 deletions(-)

-- 
2.39.5 (Apple Git-154)

_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel