Re: [x265] [PATCH v2 0/9] AArch64: Optimise DCT primitives

Karam Singh Sun, 01 Sep 2024 04:17:27 -0700

All the patches of this series have been pushed to the master branch.
*__________________________*
*Karam Singh*
*Ph.D. IIT Guwahati*
Senior Software (Video Coding) Engineer
Mobile: +91 8011279030
Block 9A, 6th floor, DLF Cyber City
Manapakkam, Chennai 600 089



On Wed, Aug 28, 2024 at 10:48 AM chen <chenm...@163.com> wrote:

> Hi Hari,
>
>
> Thank for the update, it looks good to me now.
>
>
> For the DCT optimize, we may keep your version, if we write hand-write
> version future, we may try again.
>
>
> Regards,
> Chen
>
>
> At 2024-08-27 23:10:18, "Hari Limaye" <hari.lim...@arm.com> wrote:
> >Hi Chen,
> >
> >Thank you for reviewing the patches. Please see replies below - we have 
> >pushed this updated series to address your comments.
> >
> >>1. These function will share for 8/10/12 bpp, if move into X265_NS, it will 
> >>make duplicated copy
> >
> >If I understand correctly, you are referring to the duplication that will 
> >occur when building the "Multi-library interface"? I think this duplication 
> >exists already in the current multi bitdepth build, its just that the 
> >symbols are defined with internal linkage. By moving things into the X265_NS 
> >we don't add any duplicated copies, although we do at least make the 
> >functions reference-able from other code. Also if I understand correctly, 
> >this should not be an issue for the C implementations - as when building the 
> >Multi-library interface you should set EXPORT_C_API=OFF for all but one of 
> >the builds.
> >
> >>2. add new section "namespace X265_NS" before these functions are better 
> >>than move, it affects code history record.
> >
> >Updated patch to do this.
> >
> >>partialButterfly8_neon
> >For size 8, butterfly E & O is necessary, but EE/EO is not a good idea, Odd 
> >spends 8 operators per line, Even spends 4 operators plus 1 temporary store 
> >and 2 prepare operators, totally 7 operators with dependency link, looks no 
> >performance benefits, especally SVE SVDOT may get more performance with Odd 
> >method.
> >
> >Regarding EE/EO not being a good idea in 8x8: register pressure is also a 
> >consideration here, as well as how the end of the first pass interacts with 
> >the start of the second (in terms of data positioning once they have been 
> >inlined - since everything stays in registers rather than being stored 
> >to/loaded from memory.) The current implementation performs better than the 
> >suggested implementations as computing EE and EO reduces register pressure 
> >(an issue 16-bit dot product cannot mitigate) and also allows the compiler 
> >to optimize more effectively across both passes after they have been inlined.
> >
> >>Code style mismatch in different code section, one line is better.
> >>+        int32x4_t t01 = vpaddq_s32(vmull_s16(c1, O[j + 0]),
> >>+                                   vmull_s16(c1, O[j + 1]));
> >>*** vs
> >>+        t01 = vpaddq_s32(vmull_s16(c3, O[j + 0]), vmull_s16(c3, O[j + 1]));
> >
> >The reason for the mismatch here is because the second line goes over the 
> >character limit of 80. This formatting is produced automatic formatting 
> >software (uncrustify).
> >
> >>dct8_neon
> >>Better inline two of partialButterfly8_neon, it reduce some operators, such 
> >>as int32x4_t c0 = vld1q_s32(t8_even[0]);
> >>16x16 and 32x2 are similar
> >
> >The compiler currently inlines these - we have updated patches to be more 
> >explicit by using the `inline` keyword for the butterfly helper functions.
> >
> >>const table may share in between Neon and Sve code
> >
> >Updated with this done.
> >
> >Many thanks,
> >Hari
> >
> >Hari Limaye (5):
> >  Move C DCT implementations into X265_NS
> >  AArch64: Move Neon DCT implementations into X265_NS
> >  AArch64: Optimise partialButterfly8_neon
> >  AArch64: Optimise partialButterfly16_neon
> >  AArch64: Optimise partialButterfly32_neon
> >
> >Jonathan Wright (4):
> >  AArch64: Move Neon-SVE bridge helpers into dedicated header
> >  AArch64: Add SVE implementation of 8x8 DCT
> >  AArch64: Add SVE implementation of 16x16 DCT
> >  AArch64: Add SVE implementation of 32x32 DCT
> >
> > source/common/CMakeLists.txt             |   2 +-
> > source/common/aarch64/asm-primitives.cpp |   1 +
> > source/common/aarch64/dct-prim-sve.cpp   | 491 ++++++++++++++++++++
> > source/common/aarch64/dct-prim.cpp       | 557 +++++++++++++++--------
> > source/common/aarch64/dct-prim.h         |  40 ++
> > source/common/aarch64/neon-sve-bridge.h  |  67 +++
> > source/common/aarch64/sao-prim.h         |  32 +-
> > source/common/dct.cpp                    |  22 +-
> > 8 files changed, 980 insertions(+), 232 deletions(-)
> > create mode 100644 source/common/aarch64/dct-prim-sve.cpp
> > create mode 100644 source/common/aarch64/neon-sve-bridge.h
> >
> >--
> >2.42.1
>
> _______________________________________________
> x265-devel mailing list
> x265-devel@videolan.org
> https://mailman.videolan.org/listinfo/x265-devel
>

_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel

Re: [x265] [PATCH v2 0/9] AArch64: Optimise DCT primitives

Reply via email to