All the patches of this series have been pushed to the master branch. *__________________________* *Karam Singh* *Ph.D. IIT Guwahati* Senior Software (Video Coding) Engineer Mobile: +91 8011279030 Block 9A, 6th floor, DLF Cyber City Manapakkam, Chennai 600 089
On Wed, Aug 28, 2024 at 10:48 AM chen <chenm...@163.com> wrote: > Hi Hari, > > > Thank for the update, it looks good to me now. > > > For the DCT optimize, we may keep your version, if we write hand-write > version future, we may try again. > > > Regards, > Chen > > > At 2024-08-27 23:10:18, "Hari Limaye" <hari.lim...@arm.com> wrote: > >Hi Chen, > > > >Thank you for reviewing the patches. Please see replies below - we have > >pushed this updated series to address your comments. > > > >>1. These function will share for 8/10/12 bpp, if move into X265_NS, it will > >>make duplicated copy > > > >If I understand correctly, you are referring to the duplication that will > >occur when building the "Multi-library interface"? I think this duplication > >exists already in the current multi bitdepth build, its just that the > >symbols are defined with internal linkage. By moving things into the X265_NS > >we don't add any duplicated copies, although we do at least make the > >functions reference-able from other code. Also if I understand correctly, > >this should not be an issue for the C implementations - as when building the > >Multi-library interface you should set EXPORT_C_API=OFF for all but one of > >the builds. > > > >>2. add new section "namespace X265_NS" before these functions are better > >>than move, it affects code history record. > > > >Updated patch to do this. > > > >>partialButterfly8_neon > >For size 8, butterfly E & O is necessary, but EE/EO is not a good idea, Odd > >spends 8 operators per line, Even spends 4 operators plus 1 temporary store > >and 2 prepare operators, totally 7 operators with dependency link, looks no > >performance benefits, especally SVE SVDOT may get more performance with Odd > >method. > > > >Regarding EE/EO not being a good idea in 8x8: register pressure is also a > >consideration here, as well as how the end of the first pass interacts with > >the start of the second (in terms of data positioning once they have been > >inlined - since everything stays in registers rather than being stored > >to/loaded from memory.) The current implementation performs better than the > >suggested implementations as computing EE and EO reduces register pressure > >(an issue 16-bit dot product cannot mitigate) and also allows the compiler > >to optimize more effectively across both passes after they have been inlined. > > > >>Code style mismatch in different code section, one line is better. > >>+ int32x4_t t01 = vpaddq_s32(vmull_s16(c1, O[j + 0]), > >>+ vmull_s16(c1, O[j + 1])); > >>*** vs > >>+ t01 = vpaddq_s32(vmull_s16(c3, O[j + 0]), vmull_s16(c3, O[j + 1])); > > > >The reason for the mismatch here is because the second line goes over the > >character limit of 80. This formatting is produced automatic formatting > >software (uncrustify). > > > >>dct8_neon > >>Better inline two of partialButterfly8_neon, it reduce some operators, such > >>as int32x4_t c0 = vld1q_s32(t8_even[0]); > >>16x16 and 32x2 are similar > > > >The compiler currently inlines these - we have updated patches to be more > >explicit by using the `inline` keyword for the butterfly helper functions. > > > >>const table may share in between Neon and Sve code > > > >Updated with this done. > > > >Many thanks, > >Hari > > > >Hari Limaye (5): > > Move C DCT implementations into X265_NS > > AArch64: Move Neon DCT implementations into X265_NS > > AArch64: Optimise partialButterfly8_neon > > AArch64: Optimise partialButterfly16_neon > > AArch64: Optimise partialButterfly32_neon > > > >Jonathan Wright (4): > > AArch64: Move Neon-SVE bridge helpers into dedicated header > > AArch64: Add SVE implementation of 8x8 DCT > > AArch64: Add SVE implementation of 16x16 DCT > > AArch64: Add SVE implementation of 32x32 DCT > > > > source/common/CMakeLists.txt | 2 +- > > source/common/aarch64/asm-primitives.cpp | 1 + > > source/common/aarch64/dct-prim-sve.cpp | 491 ++++++++++++++++++++ > > source/common/aarch64/dct-prim.cpp | 557 +++++++++++++++-------- > > source/common/aarch64/dct-prim.h | 40 ++ > > source/common/aarch64/neon-sve-bridge.h | 67 +++ > > source/common/aarch64/sao-prim.h | 32 +- > > source/common/dct.cpp | 22 +- > > 8 files changed, 980 insertions(+), 232 deletions(-) > > create mode 100644 source/common/aarch64/dct-prim-sve.cpp > > create mode 100644 source/common/aarch64/neon-sve-bridge.h > > > >-- > >2.42.1 > > _______________________________________________ > x265-devel mailing list > x265-devel@videolan.org > https://mailman.videolan.org/listinfo/x265-devel >
_______________________________________________ x265-devel mailing list x265-devel@videolan.org https://mailman.videolan.org/listinfo/x265-devel