Re: [x265] [PATCH 0/8] AArch64 SAD/SADxN Optimisations

2024-05-29 Thread chen
Hi Hari,




Thank you for the information.

My A77 document looks older and does not show uOps, so we can keep your LDR+ADD
in the patch, thanks.




Regards,
Chen

At 2024-05-29 19:24:16, "Hari Limaye"  wrote:
>Hi Chen,
>
>Thank you for clarifying.
>
>From the Arm CPU Software Optimisation Guides, LD1R requires an extra micro-op 
>for the broadcast compared to the regular load (LDR). Benchmarking shows that 
>using LD1R in the sad functions of width 4 is ~20% slower than using the LDR, 
>ADD sequence.
>
>Many thanks,
>
>Hari
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH 0/8] AArch64 SAD/SADxN Optimisations

2024-05-28 Thread chen
Hi Hari,




Thank you for explaining in more detail.

It is my fault: in my last comment I did not point out that we may replace LD1 with LD1R.

The instruction `ld1 {v0.s}[0], [x0], x1` is not good here because the partial
register access creates a false dependency, but `ld1r {v0.2s}, [x0], x1` may avoid
the issue.

Could you please take a look at the performance with LD1R?




Regards,

Chen

At 2024-05-28 18:03:43, "Hari Limaye"  wrote:
>Hi Chen,
>
>Thank you for reviewing the patches.
>
>>In this case, replacing LD1 with LDR+ADD does not give any benefit
>
>Here, the existing instruction `ld1  {v0.s}[0], [x0], x1` is a 
>read-modify-write operation and so creates a false dependency on the previous 
>value of the register. Replacing this initial load with an LDR instruction 
>removes this issue, as it is a completely destructive operation.
>
>The speed-test results for the block sizes with width 4, when compared to the 
>existing Neon code on a Neoverse V1 machine:
>
>sad[4x4]     | 2.94x
>sad[4x8]     | 3.47x
>sad[4x16]    | 2.49x
>sad_x3[4x4]  | 1.94x
>sad_x3[4x8]  | 1.59x
>sad_x3[4x16] | 1.46x
>sad_x4[4x4]  | 1.59x
>sad_x4[4x8]  | 1.45x
>sad_x4[4x16] | 1.27x
>
>Many thanks,
>
>Hari
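
A small C model may help make the false-dependency argument above concrete. It is only a sketch: the `vreg` type and the function names are hypothetical stand-ins for a 128-bit Neon register and the two load forms, not x265 code. The lane form `ld1 {v0.s}[0]` has to merge into the old register contents, whereas `ldr s0` writes the whole register (the upper lanes become zero), so its result does not depend on any earlier value.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit Neon register as four 32-bit lanes. */
typedef struct { uint32_t s[4]; } vreg;

/* Models `ld1 {v.s}[lane], [addr]`: only one lane is replaced, so the
 * result is a function of the previous register value (read-modify-write). */
vreg ld1_lane(vreg old, int lane, uint32_t loaded)
{
    old.s[lane] = loaded;
    return old;
}

/* Models `ldr s, [addr]`: lane 0 is loaded and the remaining lanes are
 * zeroed, so the previous register value is never an input. */
vreg ldr_s(uint32_t loaded)
{
    vreg r = { { loaded, 0, 0, 0 } };
    return r;
}

/* Two-row start in the style of the patched SAD_START_4:
 * `ldr` for row 0, then a lane insert for row 1. */
vreg start_4_ldr(uint32_t row0, uint32_t row1)
{
    return ld1_lane(ldr_s(row0), 1, row1);
}

/* The same start with two lane inserts: `stale` is an extra input. */
vreg start_4_ld1(vreg stale, uint32_t row0, uint32_t row1)
{
    return ld1_lane(ld1_lane(stale, 0, row0), 1, row1);
}
```

In this model `start_4_ld1` cannot begin until `stale` is known, even though the lanes it carries over are never meaningful to the SAD computation; `start_4_ldr` has no such input. How much that matters on a given core is a question for benchmarks like the ones quoted above.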


Re: [x265] [PATCH 0/8] AArch64 SAD/SADxN Optimisations

2024-05-24 Thread chen
Hi Hari,




These 8 patches look good; my only comment is on the code below.




=

.macro SAD_START_4 f

-ld1 {v0.s}[0], [x0], x1
+ldr s0, [x0]
+ldr s1, [x2]
+add x0, x0, x1
+add x2, x2, x3
 ld1 {v0.s}[1], [x0], x1
-ld1 {v1.s}[0], [x2], x3
 ld1 {v1.s}[1], [x2], x3
 \f  v16.8h, v0.8b, v1.8b
 .endm

According to the document:

LDR  latency 5/-, throughput 2
ADD  latency 2, throughput 2
LD1  latency 7, throughput 2  (latency may be optimised to 5)

In this case, replacing LD1 with LDR+ADD does not give any benefit.

btw: the same comment applies to SAD_X_START_4.




=



At 2024-05-24 01:12:04, "Hari Limaye"  wrote:
>Hi,
>
>This patch-series optimises the Neon implementations of SAD/SADxN
>primitives, adds new Armv8.4 Neon DotProd implementations, and performs some
>refactoring to AArch64 code.
>
>This series is based on the previously submitted refactoring patch-series
>(AArch64 saoCuStats Optimisations).
>
>Geometric mean of performance uplift when compiled with LLVM 17 on a Neoverse
>V1 machine (higher is better):
>
>Existing Neon -> Optimised Neon: 1.45x
>Optimised Neon -> Armv8.4 Neon DotProd: 1.03x
>
>Many thanks,
>
>Hari
>
>Hari Limaye (8):
>  AArch64: Optimise Neon assembly implementations of SAD
>  AArch64: Optimise Neon assembly implementations of SADxN
>  AArch64: Remove SVE2 SAD/SADxN primitives
>  AArch64: Clean up CMake feature detection
>  AArch64: Add Armv8.4 Neon DotProd feature detection
>  AArch64: Refactor setup of optimised assembly primitives
>  AArch64: Add Armv8.4 Neon DotProd implementations of SAD
>  AArch64: Add Armv8.4 Neon DotProd implementations of SADxN
>
> build/README.txt                         |   8 +
> source/CMakeLists.txt                    |  89 ++-
> source/cmake/FindNEON_DOTPROD.cmake      |  21 +
> source/common/CMakeLists.txt             |   6 +-
> source/common/aarch64/asm-primitives.cpp | 832 ++-
> source/common/aarch64/fun-decls.h        |  21 +
> source/common/aarch64/sad-a-common.S     | 514 --
> source/common/aarch64/sad-a-sve2.S       | 511 --
> source/common/aarch64/sad-a.S            | 506 +-
> source/common/aarch64/sad-neon-dotprod.S | 302
> source/common/cpu.cpp                    |  19 +-
> source/test/testbench.cpp                |   3 +-
> source/x265.h                            |  11 +-
> 13 files changed, 958 insertions(+), 1885 deletions(-)
> create mode 100644 source/cmake/FindNEON_DOTPROD.cmake
> delete mode 100644 source/common/aarch64/sad-a-common.S
> delete mode 100644 source/common/aarch64/sad-a-sve2.S
> create mode 100644 source/common/aarch64/sad-neon-dotprod.S
>
>--
>2.42.1


Re: [x265] [PATCH 0/7] AArch64 saoCuStats Optimisations

2024-05-22 Thread chen
Hi Hari,




The new patches look good to me now, thank you.




Regards,

Chen

At 2024-05-23 03:09:26, "Hari Limaye"  wrote:
>Hi Chen,
>
>Thank you for reviewing the patches.
>
>>In signOf_neon
>>>+ // signOf(a - b) = -(a > b) | (b > a)
>>The comment is not clear; I suggest
>>-(a > b ? -1 : 0) | ( a < b)
>
>I have posted updated versions of patches 3, 4, 6 to make these comments more 
>clear with respect to the possible outputs of Neon comparison instructions.
>
>>In saoCuStatsBO_neon
>>It is a memory bandwidth optimisation only. Interval memory access depends
>>strongly on CPU pipeline design and compiler, so it is not generic; I am not
>>sure how it behaves on other kinds of CPUs.
>
>Yes it is primarily a memory bandwidth optimisation - we have tested with 
>recent GCC and Clang on a range of Neoverse CPUs and find it to be faster than 
>the C implementation.
>
>>In saoCuStatsE*_neon
>>No comments. It looks like vmulq_s16+vmlaq_s16 uses 1 instruction fewer than
>>vandq_s16+vandq_s16+vaddq_s16 or tbl/tbx, so it is mostly faster on modern CPUs.
>
>Yes, we found that this instruction sequence was faster than the alternatives, 
>for the Neon implementation.
>
>Many thanks,
>
>Hari
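
The confusion over the `signOf` comment comes from the two meanings of a comparison result. Read as plain C, `-(a > b) | (b > a)` would give -1 for a > b, because C comparisons yield 1; Neon comparison instructions instead yield an all-ones mask (-1) for true. The sketch below emulates the mask convention in scalar C; the function names are hypothetical and this is not the x265 implementation.

```c
#include <stdint.h>

/* Neon-style comparison: all bits set (-1) when true, 0 when false. */
int16_t cmp_gt_mask(int16_t a, int16_t b)
{
    return (int16_t)(a > b ? -1 : 0);
}

/* signOf(a - b) written with mask-valued comparisons:
 *   a > b   ->  -(-1) | 0   =  1
 *   a < b   ->  -(0)  | -1  = -1
 *   a == b  ->   0    | 0   =  0 */
int16_t sign_of_diff(int16_t a, int16_t b)
{
    int16_t gt = cmp_gt_mask(a, b);
    int16_t lt = cmp_gt_mask(b, a);
    return (int16_t)(-gt | lt);
}
```

With the mask convention spelled out, the expression in the original comment and the expected sign agree for all three cases.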


Re: [x265] [PATCH 0/7] AArch64 saoCuStats Optimisations

2024-05-21 Thread chen
Hi Hari,




Thanks for the new ARM patches.

In signOf_neon
>+ // signOf(a - b) = -(a > b) | (b > a)
The comment is not clear; I suggest
-(a > b ? -1 : 0) | ( a < b)

In saoCuStatsBO_neon
It is a memory bandwidth optimisation only. Interval memory access depends
strongly on CPU pipeline design and compiler, so it is not generic; I am not
sure how it behaves on other kinds of CPUs.

In saoCuStatsE*_neon
No comments. It looks like vmulq_s16+vmlaq_s16 uses 1 instruction fewer than
vandq_s16+vandq_s16+vaddq_s16 or tbl/tbx, so it is mostly faster on modern CPUs.

In saoCuStats*_sve, saoCuStats*_sve2
No comments, since the algorithm is similar to the Neon one.



Regards,
Chen

At 2024-05-21 00:14:35, "Hari Limaye"  wrote:

>Hi,
>
>This patch-series adds AArch64 Neon, SVE, and SVE2 implementations of
>the saoCuStats function primitives for low and high bitdepth.
>
>This series is based on the previously submitted refactoring patch
>series.
>
>Performance numbers:
>
>C -> Neon on Neoverse V1:
>Low bitdepth:
>saoCuStatsBO | 1.09x
>saoCuStatsE0 | 2.67x
>saoCuStatsE1 | 2.82x
>saoCuStatsE2 | 2.93x
>saoCuStatsE3 | 3.26x
>
>High bitdepth:
>saoCuStatsBO | 1.09x
>saoCuStatsE0 | 2.39x
>saoCuStatsE1 | 2.67x
>saoCuStatsE2 | 2.47x
>saoCuStatsE3 | 2.86x
>
>Neon -> SVE on Neoverse V1:
>Low bitdepth:
>saoCuStatsE0 | 1.12x
>saoCuStatsE1 | 1.15x
>saoCuStatsE2 | 1.21x
>saoCuStatsE3 | 1.14x
>
>High bitdepth:
>saoCuStatsE0 | 1.19x
>saoCuStatsE1 | 1.28x
>saoCuStatsE2 | 1.19x
>saoCuStatsE3 | 1.12x
>
>SVE -> SVE2 on Neoverse V2:
>Low bitdepth:
>saoCuStatsE0 | 1.08x
>saoCuStatsE1 | 1.06x
>saoCuStatsE2 | 1.06x
>saoCuStatsE3 | 1.09x
>
>High bitdepth:
>saoCuStatsE0 | 1.03x
>saoCuStatsE1 | 1.10x
>saoCuStatsE2 | 1.08x
>saoCuStatsE3 | 1.09x
>
>Many thanks,
>
>Hari
>
>Hari Limaye (7):
>  Test: Relax constraints of check_saoCuStatsE*
>  Move duplicated signOf function to common header
>  AArch64: Add Neon saoCuStats primitives for low bitdepth
>  AArch64: Add Neon saoCuStats primitives for high bitdepth
>  AArch64: Add check for arm_neon_sve_bridge.h
>  AArch64: Add SVE saoCuStats primitives
>  AArch64: Add SVE2 saoCuStats primitives
>
> source/CMakeLists.txt                     |  35 +-
> source/common/CMakeLists.txt              |  19 +-
> source/common/aarch64/asm-primitives.cpp  |  14 +
> source/common/aarch64/loopfilter-prim.cpp |  19 +-
> source/common/aarch64/sao-prim-sve.cpp    | 271 +++
> source/common/aarch64/sao-prim-sve2.cpp   | 317 ++
> source/common/aarch64/sao-prim.cpp        | 380 ++
> source/common/aarch64/sao-prim.h          | 100 ++
> source/common/common.h                    |   6 +
> source/common/loopfilter.cpp              |  16 +-
> source/encoder/sao.cpp                    |  74 ++---
> source/test/pixelharness.cpp              |  11 +-
> 12 files changed, 1187 insertions(+), 75 deletions(-)
> create mode 100644 source/common/aarch64/sao-prim-sve.cpp
> create mode 100644 source/common/aarch64/sao-prim-sve2.cpp
> create mode 100644 source/common/aarch64/sao-prim.cpp
> create mode 100644 source/common/aarch64/sao-prim.h
>
>-- 
>2.42.1
>


Re: [x265] [PATCH 01/12] AArch64: Fix costCoeffNxN test on Apple Silicon

2024-05-06 Thread chen
Hi Hari Limaye,




Thank you for fixing the AArch64 build issues; these 12 patches look good to me.




Regards,

Chen

At 2024-05-03 05:19:36, "Hari Limaye"  wrote:
>The assembly routine x265_costCoeffNxN_neon is buggy and produces an
>incorrect result on Apple Silicon, causing the pixel testbench to fail
>on these platforms.
>
>x265_costCoeffNxN assumes that parameter `int subPosBase`, the second
>parameter of type `int` passed on the stack, is at position `sp + 8`;
>this assumption is consistent with the AArch64 PCS, as arguments smaller
>than 8 bytes are widened to 8 bytes (aapcs64 6.8.2 C.16).
>However arm64e diverges from AAPCS64: 'Function arguments may consume
>slots on the stack that are not multiples of 8 bytes'.
>---
> source/common/aarch64/asm.S| 12 +++-
> source/common/aarch64/pixel-util.S |  4 ++--
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/asm.S b/source/common/aarch64/asm.S
>index ce0668103..742978631 100644
>--- a/source/common/aarch64/asm.S
>+++ b/source/common/aarch64/asm.S
>@@ -72,6 +72,16 @@
>
> #define PFX_C(name) JOIN(JOIN(JOIN(EXTERN_ASM, X265_NS), _), name)
>
>+// Alignment of stack arguments of size less than 8 bytes.
>+#ifdef __APPLE__
>+#define STACK_ARG_ALIGNMENT 4
>+#else
>+#define STACK_ARG_ALIGNMENT 8
>+#endif
>+
>+// Get offset from SP of stack argument at index `idx`.
>+#define STACK_ARG_OFFSET(idx) (idx * STACK_ARG_ALIGNMENT)
>+
> #ifdef __APPLE__
> .macro endfunc
> ELF .size \name, . - \name
>@@ -184,4 +194,4 @@ ELF .size   \name, . - \name
> vtrn \t3, \t4, \s3, \s4
> .endm
>
>-#endif
>\ No newline at end of file
>+#endif
>diff --git a/source/common/aarch64/pixel-util.S b/source/common/aarch64/pixel-util.S
>index 9b3c11504..378c6891c 100644
>--- a/source/common/aarch64/pixel-util.S
>+++ b/source/common/aarch64/pixel-util.S
>@@ -2311,7 +2311,7 @@ endfunc
> //uint8_t *baseCtx,      // x6
> //int offset,            // x7
> //int scanPosSigOff,     // sp
>-//int subPosBase)        // sp + 8
>+//int subPosBase)        // sp + 8, or sp + 4 on APPLE
> function PFX(costCoeffNxN_neon)
> // abs(coeff)
> add x2, x2, x2
>@@ -2410,7 +2410,7 @@ function PFX(costCoeffNxN_neon)
> add x4, x4, x15
> str h2, [x13]  // absCoeff[numNonZero] = tmpCoeff[blkPos]
>
>-ldr x9, [sp, #8]   // subPosBase
>+ldr x9, [sp, #STACK_ARG_OFFSET(1)]   // subPosBase
> uxth w9, w9
> cmp w9, #0
> cset x2, eq
>--
>2.42.1
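
The offset arithmetic behind the patch can be sketched in C. The helper below mirrors the `STACK_ARG_OFFSET` macro but takes the alignment as a parameter so both conventions can be compared side by side; the function and enum names are hypothetical. The simple index-times-alignment rule only fits when every stack argument has the same size, as with the two `int` arguments of this function.

```c
/* Stack slot size used for small (< 8 byte) arguments, per the patch:
 * 8 bytes under AAPCS64, 4 bytes on Apple platforms for `int`. */
enum {
    AAPCS64_STACK_ARG_ALIGNMENT = 8,
    APPLE_STACK_ARG_ALIGNMENT = 4
};

/* Hypothetical helper mirroring STACK_ARG_OFFSET(idx): byte offset from
 * SP of the stack argument at index `idx`. */
int stack_arg_offset(int idx, int stack_arg_alignment)
{
    return idx * stack_arg_alignment;
}
```

For `subPosBase` (index 1) this gives `sp + 8` under AAPCS64 and `sp + 4` on Apple platforms, matching the updated comment in pixel-util.S.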
>


Re: [x265] Fwd: NASM 2.15.03 (MYS2/MinGW) throws a huge amount of macro warnings

2023-05-23 Thread chen
Hello,


Could you please try my local patch?


Regards,
Min Chen
2023-05-21 17:27:35,"Mario *LigH* Rohkrämer"  
>Almost 3 years later, NASM version 2.16.01, and still no solution, 
>nobody is responsible for "just" warnings.
>
>-- 
>
>Fun and success!
>
>Mario *LigH* Rohkrämer
>maito:cont...@ligh.de
>


0001-ASM-Improve-source-common-x86-x86inc.asm-to-cleanup-.patch
Description: Binary data


Re: [x265] ARM patches

2022-11-05 Thread chen
Hi,


I don't have an OS X environment, so I can only guess at the reason.
GCC and LLVM use different symbol prefixes.
In the x86 assembly code we use "[private_prefix %+ _entropyStateBits]" to
handle this difference, but the aarch64 code uses
"movrel  x1, x265_entropyStateBits" directly; I did not notice this small
difference in the previous code review.




Regards,

Min
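
The linker error quoted in this thread suggests the defined symbol carries a leading underscore (`_x265_entropyStateBits`) while the assembly refers to the bare name. A minimal C sketch of that mismatch follows, using token pasting in the spirit of the `PFX_C`/`JOIN` macros in asm.S. All macro and function names here are hypothetical, and the exact-match check is only a stand-in for what a linker does when resolving a reference.

```c
#include <string.h>

/* Token pasting and stringification helpers. */
#define JOIN_(a, b) a##b
#define JOIN(a, b)  JOIN_(a, b)
#define STR_(x)     #x
#define STR(x)      STR_(x)

/* Symbol name without and with a leading underscore prefix. */
#define SYM_PLAIN(ns, name)      JOIN(JOIN(ns, _), name)
#define SYM_UNDERSCORE(ns, name) JOIN(_, SYM_PLAIN(ns, name))

const char *plain_symbol(void)
{
    return STR(SYM_PLAIN(x265, entropyStateBits));
}

const char *underscore_symbol(void)
{
    return STR(SYM_UNDERSCORE(x265, entropyStateBits));
}

/* Stand-in for symbol resolution: a reference only resolves when the
 * referenced name matches the defined name exactly. */
int names_resolve(const char *referenced, const char *defined)
{
    return strcmp(referenced, defined) == 0;
}
```

Routing the assembly reference through the same prefixing macro as the definition, rather than hard-coding `x265_entropyStateBits`, would keep the two names in step on platforms that add the underscore.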




At 2022-11-05 23:42:38, "Nomis101"  wrote:
>Hi,
>
>same here, I'm seeing the same error as below if I try to build x265 (latest 
>master) for macOS arm64 
>on macOS Ventura (Xcode 14.1).
>
>
>
>
>Am 04.11.22 um 08:39 schrieb Damiano Galassi:
>> Hi,
>> 
>> I’m getting the following error when trying to build on macOS arm64 since 
>> these patches:
>> 
>> Undefined symbols for architecture arm64:
>>"x265_entropyStateBits", referenced from:
>>_x265_costCoeffNxN_neon in libx265.a(pixel-util.S.o)
>>_x265_10bit_costCoeffNxN_neon in libx265.a(pixel-util.S.o)
>>_x265_12bit_costCoeffNxN_neon in libx265.a(pixel-util.S.o)
>>   (maybe you meant: _x265_entropyStateBits)
>> ld: symbol(s) not found for architecture arm64
>> clang: error: linker command failed with exit code 1 (use -v to see 
>> invocation)
>> 
>>> Il giorno 29 ott 2022, alle ore 20:59, Pop, Sebastian  ha 
>>> scritto:
>>>
>>> Hello Mahesh,
>>>
>>> +x265-devel mailing-list
>>>
>>> Please find attached the last patches from my local git development tree 
>>> that are not yet part of 
>>> the public x265 git repo.
>>> Could you please run smoke tests and integrate those patches to the public 
>>> x265?
>>> Please let me know if you want me to address any further issue.
>>>
>>> Thanks,
>>> Sebastian
>>> 
>>> *From:*Mahesh Pittala >> >
>>> *Sent:*Friday, October 28, 2022 2:41:51 AM
>>> *To:*Pop, Sebastian
>>> *Cc:*Swathi Gurumani; Santhoshini Sekar; Gopi Satykrishna Akisetty
>>> *Subject:*[EXTERNAL] ARM patches
>>>
>>>
>>> Hello Sebastian,
>>>
>>> I have observed a few ARM patches locally which are not pushed to the 
>>> public x265 repo, we ran 
>>> smoke tests and it was successful.
>>>
>>> Can you please share it to x265 videoLAN so that we can push it ? If you 
>>> have updated patches 
>>> please share them
>>>
>>> Thanks,
>>> Mahesh
>>>
>>>
>>>
>>>
>>> <0007-arm64-disable-scanPosLast_neon-on-Apple-processors.patch>
>>> <0006-arm64-remove-two-fmov-instructions.patch>
>>> <0005-arm64-do-not-use-FP-register-v15.patch>
>>> <0004-arm64-use-better-addressing-modes-with-ld1-st1.patch>
>>> <0003-arm64-register-several-ASM-routines.patch>
>>> <0002-arm64-Register-the-assembly-routines-x265_satd_-_neo.patch>
>>> <0001-arm64-port-costCoeffNxN.patch>


Re: [x265] PING [PATCH] aarch64: replace ldr pseudo-instruction with adrp+add

2022-10-18 Thread chen
Hi Song,


This is a long-standing bug. '-DPIC' is only added to the compile options when
Nasm is enabled; the bug can be fixed simply with the patch below (NOT TESTED,
since I don't have an ARM environment).
After the patch, the user can activate this compile option with ENABLE_PIC=ON.


Could the x265 team please help verify the patch?


Regards,
Min Chen


diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index 13e4750de..80f8e59a9 100755
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -266,6 +266,9 @@ if(GCC)
 add_definitions(-DHAVE_NEON)
 endif()
 endif()
+if(ENABLE_PIC)
+list(APPEND ARM_ARGS -DPIC)
+endif()
 add_definitions(${ARM_ARGS})
 if(FPROFILE_GENERATE)
 if(INTEL_CXX)


 2022-10-18 14:37:10,"Fangrui Song"  
On Thu, Oct 6, 2022 at 1:39 AM chen  wrote:


Hi Song,




I mean the current tree already supports this adrp+add mode with the compile
option -DPIC, so we do not need to patch the code.




Regards,
Min Chen


Hi Min, IIUC -DPIC is only defined for x86-64 
(source/cmake/CMakeASM_NASMInformation.cmake).
AArch64 does not get -DPIC.

2022-10-06 06:42:33,"Fangrui Song"  

Hi Min, sorry but I just saw your question. I do not understand the request.
adrp+add is just strictly superior to ldr (which uses a constant pool and does 
not decrease code size) and avoids text relocations (which are prohibited in 
many systems).  `ldr \rd, =\val+\offset` should just be removed. 
adrp+add also works on Mach-O systems as well.
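
The address arithmetic behind the `adrp` + `add :lo12:` pair can be sketched in C. `adrp` yields the 4 KiB page of the target, encoded as a page distance from the page of the instruction itself, and the `add` supplies the low 12 bits. The function names below are hypothetical and the addresses are passed in explicitly; this is an illustration of the arithmetic, not assembler or linker code.

```c
#include <stdint.h>

/* 4 KiB page base of an address: what ADRP materialises for its target. */
uint64_t page_of(uint64_t addr)
{
    return addr & ~UINT64_C(0xFFF);
}

/* Low 12 bits of an address: what the :lo12: operator selects for ADD. */
uint64_t lo12_of(uint64_t addr)
{
    return addr & UINT64_C(0xFFF);
}

/* Page distance encoded by ADRP, relative to the page of the instruction. */
int64_t adrp_page_delta(uint64_t pc, uint64_t target)
{
    return ((int64_t)page_of(target) - (int64_t)page_of(pc)) / 4096;
}

/* Address produced by `adrp rd, sym` followed by `add rd, rd, :lo12:sym`. */
uint64_t adrp_add(uint64_t pc, uint64_t target)
{
    uint64_t rd = page_of(pc) + (uint64_t)(adrp_page_delta(pc, target) * 4096);
    return rd + lo12_of(target);
}
```

Because the encoded page distance is unchanged when code and data are moved together by a page-aligned load bias, nothing in the text section needs patching at load time. The `ldr rd, =sym` form instead stores the absolute address in a literal pool, which is where the R_AARCH64_ABS64 relocation in the quoted error comes from.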
 
At 2022-09-24 15:21:35, "Fangrui Song"  wrote:
>Ping.  This breaks the lld build and some binutils configurations that default
>to disallowing text relocations.
>
>On 2022-08-29, Fangrui Song wrote:
>>On 2022-08-29, Fangrui Song wrote:
>>>On 2022-08-30, chen wrote:
>>>>Hi Song,
>>>>
>>>>
>>>>Thank you for your patch.
>>>>
>>>>
>>>>However, syntax of ':lo12:' depends on compiler, so more general LDR is 
>>>>better in here.
>>>>
>>>>
>>>>Regards,
>>>>Min Chen
>>>
>>>:lo12: is standard aarch64 assembly syntax.
>>>Which aarch64 compiler supported by x265 does not support :lo12:?
>>
>>Note that LDR has another problem that it produced an absolute relocation 
>>R_AARCH64_ABS64. It will trigger an error
>>when text relocations are disabled (default in ld.lld. See 
>>https://maskray.me/blog/2020-12-19-lld-and-gnu-linker-incompatibilities#:~:text=Text%20relocations)
>>
>>```
>>% : && /usr/bin/c++ -fPIC -O3 -DNDEBUG  -Wl,-Bsymbolic,-znoexecstack -shared 
>>...
>>ld: error: relocation R_AARCH64_ABS64 cannot be used against local
>>symbol; recompile with -fPIC
>>>>>defined in sad-a.S.o
>>>>>>>> referenced by sad-a.S.o:(.text+0x9B00)
>>```
>>
>>adrp+add work fine.
>>
>>(I am a maintainer of lld/ELF.)
>>
>>>>At 2022-08-30 02:33:37, "Fangrui Song"  wrote:
>>>>>The ldr pseudo-instruction uses a literal pool, which is less efficient
>>>>>and does not decrease the code size.
>>>>>---
>>>>>source/common/aarch64/asm.S | 4 +---
>>>>>1 file changed, 1 insertion(+), 3 deletions(-)
>>>>>
>>>>>diff --git a/source/common/aarch64/asm.S b/source/common/aarch64/asm.S
>>>>>index 399c37cf2..2506f50aa 100644
>>>>>--- a/source/common/aarch64/asm.S
>>>>>+++ b/source/common/aarch64/asm.S
>>>>>@@ -130,11 +130,9 @@ ELF .size   \name, . - \name
>>>>>   adrp\rd, \val+(\offset)
>>>>>   add \rd, \rd, :lo12:\val+(\offset)
>>>>> .endif
>>>>>-#elif defined(PIC)
>>>>>+#else
>>>>>   adrp\rd, \val+(\offset)
>>>>>   add \rd, \rd, :lo12:\val+(\offset)
>>>>>-#else
>>>>>-ldr \rd, =\val+\offset
>>>>>#endif
>>>>>.endm
>>>>>
>>>>>--
>>>>>2.37.2.672.g94769d06f0-goog
>>>>>

0001-AARCH64-Support-DPIC-as-compiler-option.patch
Description: Binary data


Re: [x265] PING [PATCH] aarch64: replace ldr pseudo-instruction with adrp+add

2022-10-06 Thread chen
Hi Song,




I mean the current tree already supports this adrp+add mode with the compile
option -DPIC, so we do not need to patch the code.




Regards,
Min Chen




2022-10-06 06:42:33,"Fangrui Song"  

Hi Min, sorry but I just saw your question. I do not understand the request.
adrp+add is just strictly superior to ldr (which uses a constant pool and does 
not decrease code size) and avoids text relocations (which are prohibited in 
many systems).  `ldr \rd, =\val+\offset` should just be removed. 
adrp+add also works on Mach-O systems as well.
 


Re: [x265] PING [PATCH] aarch64: replace ldr pseudo-instruction with adrp+add

2022-09-24 Thread chen
Hi Song,


Sorry for delay.
The new patch just cleans up some code and assumes that defined(PIC) is already
declared on aarch64, so I still think we do not need to do it this way;
declaring -DPIC on the command line is more general.
In the previous sad-a.S error message the compile option is '-fPIC'; I guess we
need '-fPIC -DPIC'.


Regards, Min Chen




Re: [x265] [PATCH] aarch64: replace ldr pseudo-instruction with adrp+add

2022-08-30 Thread chen
Hi Song,


In your patch, it looks like you just change the default mode from absolute to
relative addressing.
I understand that newer OSes, such as Android 6 and above, iOS, etc., may require
executable code without text relocations.
But we can also generate these relative addresses with the cmake option
ENABLE_PIC. I suggest keeping the current code, so users can configure it
themselves.


btw: ENABLE_PIC looks like it only works with GCC, so I think we need to take a
look at this option on the Apple platform.


Regards,
Min Chen
At 2022-08-30 13:42:39, "Fangrui Song"  wrote:
>On 2022-08-29, Fangrui Song wrote:
>>On 2022-08-30, chen wrote:
>>>Hi Song,
>>>
>>>
>>>Thank you for your patch.
>>>
>>>
>>>However, syntax of ':lo12:' depends on compiler, so more general LDR is 
>>>better in here.
>>>
>>>
>>>Regards,
>>>Min Chen
>>
>>:lo12: is standard aarch64 assembly syntax.
>>Which aarch64 compiler supported by x265 does not support :lo12:?
>
>Note that LDR has another problem that it produced an absolute relocation 
>R_AARCH64_ABS64. It will trigger an error
>when text relocations are disabled (default in ld.lld. See 
>https://maskray.me/blog/2020-12-19-lld-and-gnu-linker-incompatibilities#:~:text=Text%20relocations)
>
>```
>% : && /usr/bin/c++ -fPIC -O3 -DNDEBUG  -Wl,-Bsymbolic,-znoexecstack -shared 
>...
>ld: error: relocation R_AARCH64_ABS64 cannot be used against local
>symbol; recompile with -fPIC
>>>> defined in sad-a.S.o
>>>> >>> referenced by sad-a.S.o:(.text+0x9B00)
>```
>
>adrp+add work fine.
>
>(I am a maintainer of lld/ELF.)
>
>>>At 2022-08-30 02:33:37, "Fangrui Song"  wrote:
>>>>The ldr pseudo-instruction uses a literal pool, which is less efficient
>>>>and does not decrease the code size.
>>>>---
>>>>source/common/aarch64/asm.S | 4 +---
>>>>1 file changed, 1 insertion(+), 3 deletions(-)
>>>>
>>>>diff --git a/source/common/aarch64/asm.S b/source/common/aarch64/asm.S
>>>>index 399c37cf2..2506f50aa 100644
>>>>--- a/source/common/aarch64/asm.S
>>>>+++ b/source/common/aarch64/asm.S
>>>>@@ -130,11 +130,9 @@ ELF .size   \name, . - \name
>>>>adrp\rd, \val+(\offset)
>>>>add \rd, \rd, :lo12:\val+(\offset)
>>>>  .endif
>>>>-#elif defined(PIC)
>>>>+#else
>>>>adrp\rd, \val+(\offset)
>>>>add \rd, \rd, :lo12:\val+(\offset)
>>>>-#else
>>>>-ldr \rd, =\val+\offset
>>>>#endif
>>>>.endm
>>>>
>>>>--
>>>>2.37.2.672.g94769d06f0-goog
>>>>


Re: [x265] [PATCH] aarch64: replace ldr pseudo-instruction with adrp+add

2022-08-29 Thread chen
Hi Song,


Thank you for your patch.


However, syntax of ':lo12:' depends on compiler, so more general LDR is better 
in here.


Regards,
Min Chen
At 2022-08-30 02:33:37, "Fangrui Song"  wrote:
>The ldr pseudo-instruction uses a literal pool, which is less efficient
>and does not decrease the code size.
>---
> source/common/aarch64/asm.S | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
>diff --git a/source/common/aarch64/asm.S b/source/common/aarch64/asm.S
>index 399c37cf2..2506f50aa 100644
>--- a/source/common/aarch64/asm.S
>+++ b/source/common/aarch64/asm.S
>@@ -130,11 +130,9 @@ ELF .size   \name, . - \name
> adrp\rd, \val+(\offset)
> add \rd, \rd, :lo12:\val+(\offset)
>   .endif
>-#elif defined(PIC)
>+#else
> adrp\rd, \val+(\offset)
> add \rd, \rd, :lo12:\val+(\offset)
>-#else
>-ldr \rd, =\val+\offset
> #endif
> .endm
> 
>-- 
>2.37.2.672.g94769d06f0-goog
>


Re: [x265] Some more Arm64 patches to bring performance up on Graviton processors

2022-03-25 Thread chen
Hello,


A few small comments:


+function PFX(cpy2Dto1D_shl_64x64_neon)
+cpy2Dto1D_shl_start
+mov w12, #32
+.loop_cpy2Dto1D_shl_64:
+sub w12, w12, #1
+.rept 2
+ldp q2, q3, [x1]
+ldp q4, q5, [x1, #32]
[MC] Why not LD1? same as STP





-#if X86_64

+#if X86_64 || defined(__aarch64__)

[MC] This is right, but to be more generic we could check sizeof(long*)==8




The others are fine.


Regards,
Min Chen
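
The pointer-width check suggested above can be sketched in C as follows. One caveat: `sizeof` cannot be evaluated inside `#if`, so a preprocessor guard like the one in the patch needs a different spelling, for example a comparison on `UINTPTR_MAX`. The function and macro names are hypothetical, and the `UINTPTR_MAX` variant assumes `uintptr_t` has the same width as a pointer, which holds on mainstream platforms.

```c
#include <stdint.h>

/* Expression form of the check suggested in the review. */
int pointers_are_64bit(void)
{
    return sizeof(long *) == 8;
}

/* Preprocessor form: compare UINTPTR_MAX against the largest 32-bit value. */
#if UINTPTR_MAX > 0xFFFFFFFFu
#define HYPOTHETICAL_ARCH_64BIT 1
#else
#define HYPOTHETICAL_ARCH_64BIT 0
#endif

int pointers_are_64bit_pp(void)
{
    return HYPOTHETICAL_ARCH_64BIT;
}
```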







2022-03-25 00:24:01,"Pop, Sebastian"  

Hi,





Please find attached a few more changes that bring up the performance of x265 
on Arm64 processors.


Patches tested on Graviton2 aarch64-linux.


Ok to commit?





Thanks,


Sebastian


Re: [x265] [arm64] port costCoeffNxN

2022-03-10 Thread chen
Hi Sebastian,


Sorry for delay. this version looks good either.


To optimize the loop further, we first need to modify the algorithm.
For example, we access the array as below:
=
absCoeff[numNonZero] = tmpCoeff[blkPos];
numNonZero += sig;
=
This code prevents parallel processing because each store depends on the 
unpredictable value of sig. If we allow the data in absCoeff to be stored 
sparsely, we can process all 16 elements in parallel.
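
A minimal sketch of the two layouts discussed above, assuming tmpCoeff already holds absolute values and that a scan table supplies blkPos. The function names and the blkPosTable parameter are hypothetical; only absCoeff, tmpCoeff, blkPos, numNonZero and sig come from the quoted snippet.

```cpp
#include <cstdint>

// Dense layout, as in the quoted snippet: the store index numNonZero depends
// on every earlier sig, so the iterations form a serial chain.
int packDense(const uint16_t* tmpCoeff, const uint16_t* blkPosTable, int count, uint16_t* absCoeff)
{
    int numNonZero = 0;
    for (int i = 0; i < count; i++)
    {
        uint16_t blkPos = blkPosTable[i];
        int sig = (tmpCoeff[blkPos] != 0);
        absCoeff[numNonZero] = tmpCoeff[blkPos];
        numNonZero += sig;
    }
    return numNonZero;
}

// Sparse layout (assumed): slot i always holds the coefficient of scan
// position i, zeros included, so each iteration is independent of the others.
int packSparse(const uint16_t* tmpCoeff, const uint16_t* blkPosTable, int count, uint16_t* absCoeff)
{
    int numNonZero = 0;
    for (int i = 0; i < count; i++)
    {
        uint16_t blkPos = blkPosTable[i];
        absCoeff[i] = tmpCoeff[blkPos];
        numNonZero += (tmpCoeff[blkPos] != 0);
    }
    return numNonZero;
}
```

In the sparse form the stores no longer depend on numNonZero, which is what would let a vector implementation handle all 16 positions at once; the consumers of absCoeff would then have to skip the zeros.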


Regards,
Min Chen



At 2022-03-05 04:24:09, "Pop, Sebastian"  wrote:

Thanks Min Chen for your feedback.


Please see attached a patch that avoids one transfer from NEON to gpr by using 
`str h2, [x13]`.


I'm not sure how to optimize the loop, however I see that x86 avx2+bmi has a 
much shorter loop.


Do you recommend doing as the avx2 implementation?





Thanks,


Sebastian





From: x265-devel  on behalf of chen 

Sent: Wednesday, March 2, 2022 10:20 PM
To: Development for x265
Subject: RE: [EXTERNAL] [x265] [arm64] port costCoeffNxN
 


Hi Sebastian,


Thank you for your contribution, the code looks good.


Just a small comment for future performance improvements:
"fmov w12, s2" is expensive because the data crosses between the Neon and 
integer register files, especially since it is inside the loop.
There are also some deep-seated data organization and algorithm problems. For 
example, we spend many instructions on absCoeff[numNonZero]; if we allow 
zeros to remain inside the array, we can save many instructions.


Regards,
Min Chen




At 2022-03-02 07:28:15, "Pop, Sebastian"  wrote:

Hi,





the attached patch fixes the registration of costCoeffNxN function hook and 
removes the early return that I used for testing.





Sebastian






Re: [x265] Wrong version info?

2022-03-10 Thread chen
Hi Roger,


Both version numbers look right.
The stable branch is 1 commit ahead of tag 3.5, and the master branch is 
further ahead (34 commits).
So the version numbers are 3.5+1 and 3.5+34; the remaining part is the git hash.


Regards,
Min Chen
At 2022-03-11 12:41:44, "Roger Pack"  wrote:
>Hello.
>As a note if I cross compile "for windows" from origin/master 64 bit,
>I get this:
>
>x265.exe --version
>x265 [info]: HEVC encoder version 3.5+34-7a5709048
>x265 [info]: build info [Windows][GCC 10.2.0][32 bit][noasm] 8bit+10bit+12bit
>
>From origin/stable I get:
>
>x265.exe --version
>x265 [info]: HEVC encoder version 3.5+1-ce882936d
>x265 [info]: build info [Windows][GCC 10.2.0][64 bit] 8bit+10bit+12bit
>
>Which is right.
>
>This is using cmake ex:  -DCMAKE_C_COMPILER=${cross_prefix}gcc etc.
>https://github.com/rdp/ffmpeg-windows-build-helpers/blob/ff1f2e9337fc81675f6fcc3d16d01778b3688ae8/cross_compile_ffmpeg.sh#L554
>
>Thanks!


Re: [x265] [arm64] port costCoeffNxN

2022-03-02 Thread chen
Hi Sebastian,


Thank you for your contribution, the code looks good.


Just a small comment for future performance improvements:
"fmov w12, s2" is expensive because the data crosses between the Neon and 
integer register files, especially since it is inside the loop.
There are also some deep-seated data organization and algorithm problems. For 
example, we spend many instructions on absCoeff[numNonZero]; if we allow 
zeros to remain inside the array, we can save many instructions.


Regards,
Min Chen




At 2022-03-02 07:28:15, "Pop, Sebastian"  wrote:

Hi,





the attached patch fixes the registration of costCoeffNxN function hook and 
removes the early return that I used for testing.





Sebastian


Re: [x265] [arm64] Status and combined patch

2022-01-28 Thread chen
Hi Sebastian,


Thank you for your contribution, I have no further comments for now.




Regards,
Min Chen










At 2022-01-29 02:35:24, "Pop, Sebastian" wrote:

Hi,





> [MC] how about CMHI with a vector register that hold zeros?


This works wonderfully, thanks for the suggestion!
Performance improves to:


   scanPosLast  5.56x  768.92  4278.01




Re: [x265] [arm64] Status and combined patch

2022-01-27 Thread chen
Hi Sebastian,


Thank you for the further explanation; my comments are inline.



At 2022-01-28 10:08:36, "Pop, Sebastian"  wrote:

Hi Min Chen,





Thank you for your review comments, that helped improve the performance of 
scanPosLast on arm64:





   scanPosLast  5.46x  782.47  4275.92

I think I addressed all the changes you requested with the exception of the two 
below:


> +// get sign
> +cmeq    v5.16b, v3.16b, #0  //  equal to zero
> +mvn v5.16b, v5.16b  // v5 = non-zero
> [MC] Why not replace cmeq+mvn by cmgt?


[SP] We cannot replace the sequence with cmgt.
cmgt #0 is "Compare signed Greater than zero".
cmgt #0 would only select positive values.
We need all non-zero values, i.e., negative and positive values.


[MC] This is my fault, I forgot that CMGT #0 is a signed comparison. How about 
CMHI with a vector register that holds zeros?
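
A scalar sketch of one byte lane may help to see why the unsigned compare works where the signed one does not. The three function names are hypothetical; each returns 0xFF for "true" and 0x00 for "false", as the NEON comparisons do per lane.

```cpp
#include <cstdint>

// CMEQ #0 followed by MVN: true for every non-zero byte.
uint8_t cmeqZeroThenMvn(uint8_t x)
{
    return (uint8_t)~(x == 0 ? 0xFF : 0x00);
}

// CMGT #0: signed compare, true only for positive bytes (0x01..0x7F).
uint8_t cmgtZero(uint8_t x)
{
    return (int8_t)x > 0 ? 0xFF : 0x00;
}

// CMHI against a zero register: unsigned compare, true for every non-zero byte.
uint8_t cmhiZero(uint8_t x)
{
    return x > 0 ? 0xFF : 0x00;
}
```

For a byte such as 0xFF (that is, -1 as a signed value), cmgtZero reports false while cmhiZero and the cmeq+mvn pair both report true.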


> +// val - w13 = pmovmskb(v3)
> +and v3.16b, v3.16b, v28.16b
> +mov d4, v3.d[1]
> +addv    b13, v3.8b
> +addv    b14, v4.8b
> [MC] Does ADDV support .16b?


[SP] I cannot use the .16b variant of ADDV.
The data in v3.16b is ANDed with a mask in v28.16b:
and v3.16b, v3.16b, v28.16b
The mask in v28 is:
.byte 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x1, 0x2, 0x4, 0x8, 0x10, 
0x20, 0x40, 0x80
This is used to select which byte gets counted in which position.
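
The masking step can be modelled in scalar code as below. This is a sketch, assuming that every input byte is a comparison result (0x00 or 0xFF) and that the two 8-bit sums are afterwards combined into one 16-bit value; the combining step and the name moveMask16 are assumptions, since the quoted lines stop at the two ADDV instructions.

```cpp
#include <cstdint>

// Scalar model of "and v3, v3, v28" followed by two 8-byte ADDV reductions.
uint16_t moveMask16(const uint8_t lanes[16])
{
    static const uint8_t mask[16] = {
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
    };
    uint8_t low = 0, high = 0;
    for (int i = 0; i < 8; i++)
    {
        low  = (uint8_t)(low  + (lanes[i]     & mask[i]));      // addv b13, v3.8b
        high = (uint8_t)(high + (lanes[i + 8] & mask[i + 8]));  // addv b14, v4.8b
    }
    // Assumed final step: merge both halves into one 16-bit bit mask.
    return (uint16_t)(low | (high << 8));
}
```

Because each masked byte contributes a distinct power of two, the byte-wide sums cannot overflow: the largest possible value of either half is 0xFF.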


To use an ADDV .16b I would need to encode the position of the bytes
in 16 bits instead of 8 bits, i.e., the mask would be:
.byte 0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80, 0x100, 0x200, 0x400, 0x800, 
0x1000, 0x2000, 0x4000, 0x8000
however that would require the data to be in 16bit vector elements and NEON 
vectors would be 8h which is half too short.


Another solution I was considering is to decrease the vector factor for the 
loop from 16 to 8.

That would simplify the code for pmovmskb, however the scalar code would be 
less efficient, as it would only deal with half the bytes.

Do you think I should try out with a lower vector factor 8?



[MC] Two alternatives I tried use shll and ushl to reduce the number of addv 
instructions and run on two parallel data paths, but they still take 7 
instructions, so we can keep your current version here.





Re: [x265] [arm64] Status and combined patch

2022-01-21 Thread chen
Hi Sebastian,


Thank you for your contribution. I reviewed it and made some comments; could 
you please take a look?


Regards,
Min Chen




At 2022-01-19 23:25:30, "Pop, Sebastian"  wrote:

Hi Gopi,





Please find attached a patch that ports scanPosLast to arm64 NEON.





 scanPosLast  5.08x  842.11  4277.83





When encoding a video where scanPosLast was accounting for 4.66% of the total 
samples,

with the patch the function now accounts for 1.4% of the total samples.





I still see costCoeffNxN_c at 3.5% on some profiles, and I will send a patch to 
implement it for arm64.





Would it be possible to commit all the arm64 NEON patches to x265 git?


How can I help to speed up the process?





Thanks,


Sebastian








From: x265-devel  on behalf of Pop, Sebastian 

Sent: Thursday, December 9, 2021 4:48 PM
To: Gopi Satykrishna Akisetty; Development for x265
Subject: Re: [x265] [arm64] Status and combined patch
 

Hi,





Attached is a patch for weight_pp and weight_sp for arm64.





 weight_pp  4.66x  182.14  849.07
 weight_sp  1.16x  621.23  718.51



Sebastian







From: x265-devel  on behalf of Pop, Sebastian 

Sent: Monday, November 15, 2021 5:43 PM
To: Gopi Satykrishna Akisetty; Development for x265
Subject: Re: [x265] [arm64] Status and combined patch
 

Hi,





Here is a patch to implement 8bit normFact on arm64.





normFact[8x8]    6.98x  11.99   83.66
normFact[16x16]  6.40x  53.95   345.39
normFact[32x32]  5.54x  245.17  1359.08
normFact[64x64]  5.45x  996.32  5433.85

Sebastian








From: x265-devel  on behalf of Pop, Sebastian 

Sent: Monday, November 15, 2021 4:58 PM
To: Gopi Satykrishna Akisetty; Development for x265
Subject: Re: [x265] [arm64] Status and combined patch
 

Hi,





Here is a patch to implement 8bit ssimDist on top of the previous patches.


Tested on arm64-linux.





ssimDist[4x4]    3.66x  8.67     31.72
ssimDist[8x8]    4.69x  27.65    129.62
ssimDist[16x16]  5.00x  106.38   531.60
ssimDist[32x32]  6.98x  434.51   3034.55
ssimDist[64x64]  6.72x  1792.07  12046.95



Sebastian





From: x265-devel  on behalf of Pop, Sebastian 

Sent: Monday, October 25, 2021 7:08 PM
To: Gopi Satykrishna Akisetty; Development for x265
Subject: Re: [x265] [arm64] Status and combined patch
 

Hi Gopi,





Please find attached the updated patches to fix an issue in sad_x4[12x16] where 
I was using v31 uninitialized.


The patch now passes TestBench and produces the same output on the following 
command:


./x265 --input=/home/ubuntu/old_town_cross_444_720p50.y4m --preset slower --crf 
4 --cu-lossless --no-info --hash=1 --psnr --ssim -o out.hevc





I have also tested the patch with ./build/linux/multilib.sh.





Sebastian





From: Pop, Sebastian
Sent: Friday, October 22, 2021 10:30 AM
To: Gopi Satykrishna Akisetty
Cc: Siva Viswanathan; Janani T E; Liwei Wang
Subject: Re: [EXTERNAL] [x265] [arm64] Status and combined patch
 

Thanks Gopi for the clarification.


I will make sure the values in the following fields remain the same with and 
without the patches:


"222539.85 kb/s, Avg QP:11.68, Global PSNR: 47.406, SSIM Mean Y: 0.9957770 
(23.744 dB)"





From: Gopi Satykrishna Akisetty 
Sent: Friday, October 22, 2021 10:19 AM
To: Pop, Sebastian
Cc: Siva Viswanathan; Janani T E; Liwei Wang
Subject: RE: [EXTERNAL] [x265] [arm64] Status and combined patch
 


Hi Sebastian,


The bitstream generated on the master tip is not the same as the bitstream 
generated after applying the eight patches. You can get this info from the logs 
where bitrate, PSNR, SSIM values are printed. 
For ex:
encoded 500 frames in 1823.56s (0.27 fps), 222539.85 kb/s, Avg QP:11.68, Global 
PSNR: 47.406, SSIM Mean Y: 0.9957770 (23.744 dB)

vs
encoded 500 frames in 1595.87s (0.31 fps), 222530.92 kb/s, Avg QP:11.68, Global 
PSNR: 47.405, SSIM Mean Y: 0.9957767 (23.743 dB)



Thanks,
Gopi.


On Fri, Oct 22, 2021 at 8:43 PM Pop, Sebastian  wrote:


Hi Gopi,


Could you please let me know exactly what I need to pay attention to in the 
diff between logs on "Master Tip" and logs "after applying 8 patches".


i.e., which numbers in the diff need to be exactly the same.




Thanks,

Sebastian 





From: Gopi Satykrishna Akisetty 
Sent: Friday, October 22, 2021 9:25 AM
To: Pop, Sebastian
Cc: Siva Viswanathan; Janani T E; Liwei Wang
Subject: RE: [EXTERNAL] [x265] [arm64] Status and combined patch
 


Hi Sebastian,


We are seeing some output changes after applying the 8 patches shared above by 
you. I have attached some sample 

Re: [x265] x265 bug report

2021-11-24 Thread chen
Hi Nathan,


Ah, I had the same question a couple of years ago.


The root cause lies in the type of the intermediate variables: the sign bit 
will affect the high part of the combined variable.
So if you change sum_t to sum2_t for variables A and B, you will get the correct result.
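
The effect can be reproduced with the sketch below, assuming the 8-bit build where sum_t is 16 bits and sum2_t is 32 bits (so BITS_PER_SUM is 16). The helpers packNarrow and packWide are hypothetical names used only to contrast the two ways of building the packed value; abs2 is copied from the report.

```cpp
#include <cstdint>

// Assumed type widths for an 8-bit build.
typedef uint16_t sum_t;
typedef uint32_t sum2_t;
static const int BITS_PER_SUM = 8 * sizeof(sum_t);

// in: a pseudo-simd number of the form x+(y<<16)
// return: abs(x)+(abs(y)<<16)
inline sum2_t abs2(sum2_t a)
{
    sum2_t s = ((a >> (BITS_PER_SUM - 1)) & (((sum2_t)1 << BITS_PER_SUM) + 1)) * ((sum_t)-1);
    return (a + s) ^ s;
}

// Packing with 16-bit intermediates, as in the report: the low half x = -1
// becomes 0xffff and its sign never borrows from the high half.
inline sum2_t packNarrow(int x, int y)
{
    sum_t a = (sum_t)x;
    sum_t b = (sum_t)y;
    return ((sum2_t)b << BITS_PER_SUM) + a;
}

// Packing in the wide type: a negative x borrows from the high half, which is
// the x+(y<<16) form that abs2() expects.
inline sum2_t packWide(int x, int y)
{
    sum2_t a = (sum2_t)x;
    sum2_t b = (sum2_t)y;
    return (b << BITS_PER_SUM) + a;
}
```

With x = -1 and y = -2, packNarrow gives 0xfffeffff and abs2 returns 0x00010001, matching the report, while packWide gives 0xfffdffff and abs2 returns the expected 0x00020001.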


Regards,
Min Chen

At 2021-11-25 12:36:18, "nathan" <13022198...@163.com> wrote:

Hi guys,
At source/common/pixel.cpp, line 203,
a function seems to have an error:
// in: a pseudo-simd number of the form x+(y<<16)
// return: abs(x)+(abs(y)<<16)
inline sum2_t abs2(sum2_t a)
{
    sum2_t s = ((a >> (BITS_PER_SUM - 1)) & (((sum2_t)1 << BITS_PER_SUM) + 1)) * 
               ((sum_t)-1);


    return (a + s) ^ s;
}
See my test:
int main()
{
    sum_t A = -1;                         // 0xffff
    sum_t B = -2;                         // 0xfffe
    sum2_t C = (B << BITS_PER_SUM) + A;   // 0xfffeffff
    sum2_t D = abs2(C);                   // 0x00010001
    sum2_t E = (2 << BITS_PER_SUM) + 1;   // 0x00020001
}
According to the function description, D should be equal to E, but they differ 
in my test.
(Please check whether the test is OK, in case I missed something.)



abs2() is used to calculate SATD, which affects the choices made in motion 
estimation but not the correctness of the encoder.


(Can someone tell me where I can submit this bug report? Thanks!)
BR-x265,
 Nathan







Re: [x265] missing files while compile x265-devel

2021-09-13 Thread chen
Hi,


stdint.h is a standard C/C++ header file.
sys/time.h depends on the OS, but I guess it is not necessary at run time; it 
is mostly used for collecting performance data.
memory.h may be dropped if the memory management functions, such as malloc(), 
are declared in other headers.


Regards,
Min Chen

At 2021-09-13 20:23:24, "yehuda marko"  wrote:

 

Hello,

 

I am in the process of migrating https://github.com/videolan/x265 to an RTOS 
called VxWorks 5.5.

 

Q1.

I get file not found for the following files:

 

Memory.h

Stdint.h

Sys/time.h

 

Can you provide those files?

 

Q2.

The toolchain I am using targets the PowerPC architecture. What flags do I need 
to add to use the PPC source files?

 

 

Re,

Yehuda Marko

 

yehuda.ma...@scaleil.com +972544373003

ScaleIL

 

 




Re: [x265] [arm64] port ssim_4x4x2_core

2021-08-07 Thread chen
Hi,


Code looks good.
The only comment is that UADALP is slower; we can adjust the order of the sums to avoid it.




Regards,
Min Chen




At 2021-08-07 02:01:13, "Pop, Sebastian" wrote:

Hi,

the attached patch ports to arm64 the following kernel:

 

ssim_4x4x2_core  30.69x   13.39   410.85

 

Ok to commit?

 

Thanks,

Sebastian



Re: [x265] [arm64] port scale1D_128to64 and scale2D_64to32

2021-07-30 Thread chen
I have no ideas for significantly improving performance; the macro helps code 
readability.
Some minor comments:
Moving the SUB to follow the LD1 would hide memory access latency, as would 
interleaving the ST1 with the next LD1, etc.
But in these cases the code becomes less readable, so I do not suggest these 
adjustments.


Regards,
Min Chen

At 2021-07-31 12:14:29, "Pop, Sebastian"  wrote:

Hi,

 

Please let me know if you have ideas on how to make this code faster.

I tried to remove the stall by fetching more memory earlier, still no change in 
performance:

 

// void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride)

function x265_scale2D_64to32_neon

mov w12, #15

ld1 {v0.16b-v3.16b}, [x1], x2

ld1 {v4.16b-v7.16b}, [x1], x2

.loop_scale2D:

sub w12, w12, #1

ld1 {v20.16b-v23.16b}, [x1], x2

ld1 {v24.16b-v27.16b}, [x1], x2

scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7

ld1 {v0.16b-v3.16b}, [x1], x2

ld1 {v4.16b-v7.16b}, [x1], x2

scale2D_1 v20, v21, v22, v23, v24, v25, v26, v27

cbnz    w12, .loop_scale2D

ld1 {v20.16b-v23.16b}, [x1], x2

ld1 {v24.16b-v27.16b}, [x1], x2

scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7

scale2D_1 v20, v21, v22, v23, v24, v25, v26, v27

ret

endfunc

 

.macro scale2D_1 v0, v1, v2, v3, v4, v5, v6, v7

uaddlp  \v0\().8h, \v0\().16b

uaddlp  \v1\().8h, \v1\().16b

uaddlp  \v2\().8h, \v2\().16b

uaddlp  \v3\().8h, \v3\().16b

uaddlp  \v4\().8h, \v4\().16b

uaddlp  \v5\().8h, \v5\().16b

uaddlp  \v6\().8h, \v6\().16b

uaddlp  \v7\().8h, \v7\().16b

add \v0\().8h, \v0\().8h, \v4\().8h

add \v1\().8h, \v1\().8h, \v5\().8h

add \v2\().8h, \v2\().8h, \v6\().8h

add \v3\().8h, \v3\().8h, \v7\().8h

uqrshrn \v0\().8b, \v0\().8h, #2

uqrshrn2 \v0\().16b, \v1\().8h, #2

uqrshrn \v1\().8b, \v2\().8h, #2

uqrshrn2 \v1\().16b, \v3\().8h, #2

st1 {\v0\().16b-\v1\().16b}, [x0], #32

.endm
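
For reference, the arithmetic performed by the macro above can be written in scalar form as below. This is a sketch under the assumption of 8-bit pixels and a destination that is written contiguously (32 bytes per output row); the function names are hypothetical and this is not the x265 C reference implementation.

```cpp
#include <cstdint>

// One output pixel: average of a 2x2 block with rounding.
uint8_t scale2x2(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    unsigned top = a + b;                 // uaddlp on the first row
    unsigned bottom = c + d;              // uaddlp on the second row
    unsigned sum = top + bottom;          // add
    unsigned r = (sum + 2) >> 2;          // uqrshrn #2: rounding shift right by 2
    return (uint8_t)(r > 255 ? 255 : r);  // saturating narrow (cannot trigger here)
}

// Scalar loop over a 64x64 source producing a 32x32 destination.
void scale2D_64to32_model(uint8_t* dst, const uint8_t* src, intptr_t stride)
{
    for (int y = 0; y < 32; y++)
    {
        for (int x = 0; x < 32; x++)
        {
            const uint8_t* p = src + 2 * y * stride + 2 * x;
            dst[y * 32 + x] = scale2x2(p[0], p[1], p[stride], p[stride + 1]);
        }
    }
}
```

The largest possible sum is 4 * 255 = 1020, so the rounded result never exceeds 255 and the saturation in UQRSHRN is only a safeguard.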

 

The only change that I did is to further optimize for code size by re-rolling 
the loop that was unrolled 2x.

No change in performance, and 2x smaller code.

 

Sebastian



Re: [x265] [arm64] port scale1D_128to64 and scale2D_64to32

2021-07-30 Thread chen
Hi, 


The code looks good.
There is little performance change because of a pipeline stall: two LD1 
instructions cannot hide the latency penalty. It is not a big problem, and we 
save code size.
Could you please make a standalone patch? I guess a patch on top of a patch is 
not a good idea.


Regards,
Min Chen

At 2021-07-31 02:27:36, "Pop, Sebastian"  wrote:

A small change to save a few bytes in code size.

I replaced the 4 LD1 2 regs with 2 LD1 4 regs.

No performance change.



Re: [x265] [arm64] port scale1D_128to64 and scale2D_64to32

2021-07-29 Thread chen
+ld2 {v0.16b,v1.16b}, [x1], #32

+ld2 {v2.16b,v3.16b}, [x1], x2

+ld2 {v4.16b,v5.16b}, [x1], #32

+ld2 {v6.16b,v7.16b}, [x1], x2

+uaddl   v16.8h, v0.8b, v1.8b

+uaddl2  v17.8h, v0.16b, v1.16b

LD2+UADDL is equivalent to LD1+UADDLP
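
The equivalence can be checked with a scalar model of both sequences. The function names are hypothetical; each processes 32 source bytes into 16 widened sums.

```cpp
#include <cstdint>

// LD2 de-interleaves 32 bytes into even and odd elements;
// UADDL/UADDL2 then add the two registers with widening to 16 bits.
void ld2ThenUaddl(const uint8_t src[32], uint16_t out[16])
{
    uint8_t even[16], odd[16];
    for (int i = 0; i < 16; i++)
    {
        even[i] = src[2 * i];
        odd[i]  = src[2 * i + 1];
    }
    for (int i = 0; i < 16; i++)
        out[i] = (uint16_t)(even[i] + odd[i]);
}

// LD1 loads the bytes in order; UADDLP adds each adjacent pair with widening.
void ld1ThenUaddlp(const uint8_t src[32], uint16_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint16_t)(src[2 * i] + src[2 * i + 1]);
}
```

Both produce the same 16 sums in the same order, so the de-interleaving load is not needed for this kernel.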




btw: excuse me, the other patches need more time; I will probably review them over the weekend.


Regards,
Min Chen


At 2021-07-30 06:13:34, "Pop, Sebastian" wrote:

Hi,

the attached patch ports to arm64 the following kernels:

 

   scale1D_128to64  68.89x   12.06   830.58

scale2D_64to32  62.21x   220.95  13744.77

 

Ok to commit?

 

Thanks,

Sebastian



Re: [x265] [arm64] port addAvg

2021-07-27 Thread chen
Hi Sebastian,


Looks good now, thanks.


Regards,
Min chen

At 2021-07-27 23:50:19, "Pop, Sebastian"  wrote:

Thanks Min Chen for your reviews.

In the attached patch I used dup instead of memory load, and I rescheduled some 
of the instructions to avoid pipeline stalls.

 

Sebastian


Re: [x265] [arm64] port addAvg

2021-07-27 Thread chen
Hi,


Just a few comments.


+.macro addAvg_start

+lsl x3, x3, #1

+lsl x4, x4, #1

+movrel  x11, addAvg_offset

+ld1 {v30.8h}, [x11]
All the values in addAvg_offset are 0x40, so why not use DUP?



+add v0.8h, v0.8h, v1.8h

+saddl   v16.4s, v0.4h, v30.4h
Using v0 immediately may cause a pipeline stall.



+saddl2  v17.4s, v0.8h, v30.8h

+add v2.8h, v2.8h, v3.8h

+saddl   v18.4s, v2.4h, v30.4h

+saddl2  v19.4s, v2.8h, v30.8h




At 2021-07-27 09:01:32, "Pop, Sebastian" wrote:

Hi,

the attached patch ports to arm64 the following kernels:

 

 addAvg[  4x4]  22.03x   9.87    217.35

 addAvg[  8x8]  41.06x   21.01   862.77

 [i420]  addAvg[  4x4]  21.07x   10.31   217.20

 [i422]  addAvg[  4x8]  23.19x   17.87   414.44

 addAvg[  8x4]  35.10x   12.46   437.40

 [i420]  addAvg[  4x2]  13.23x   8.01    105.94

 addAvg[  4x8]  23.17x   17.89   414.54

 addAvg[16x16]  50.38x   63.28   3187.50

 [i420]  addAvg[  8x8]  38.47x   21.93   843.59

 [i422]  addAvg[ 8x16]  44.45x   38.55   1713.69

 addAvg[ 16x8]  47.63x   33.70   1605.09

 [i420]  addAvg[  8x4]  34.13x   12.86   439.01

 [i422]  addAvg[  8x8]  39.22x   21.87   857.94

 addAvg[ 8x16]  42.08x   40.88   1720.30

 [i420]  addAvg[  4x8]  23.03x   17.93   413.10

 [i422]  addAvg[ 4x16]  24.58x   32.44   797.45

 addAvg[ 16x4]  44.62x   18.13   809.08

 [i420]  addAvg[  8x2]  28.08x   8.17    229.29

 [i422]  addAvg[  8x4]  34.00x   12.82   435.82

 addAvg[16x12]  50.69x   48.05   2435.74

 [i420]  addAvg[  8x6]  38.48x   17.07   656.91

 [i422]  addAvg[ 8x12]  42.95x   30.00   1288.53

 addAvg[ 4x16]  25.31x   31.73   802.95

 addAvg[12x16]  35.76x   67.70   2421.01

 [i420]  addAvg[  6x8]  19.93x   30.26   603.15

 [i422]  addAvg[ 6x16]  20.47x   57.31   1172.97

 addAvg[32x32]  48.23x   254.84  12291.57

 [i420]  addAvg[16x16]  49.59x   63.82   3164.65

 [i422]  addAvg[16x32]  51.79x   123.15  6377.69

 addAvg[32x16]  49.46x   128.27  6343.50

 [i420]  addAvg[ 16x8]  48.03x   33.75   1620.91

 [i422]  addAvg[16x16]  50.35x   62.86   3164.73

 addAvg[16x32]  51.75x   122.50  6339.62

 [i420]  addAvg[ 8x16]  43.78x   38.62   1690.74

 [i422]  addAvg[ 8x32]  45.53x   72.44   3298.22

 addAvg[ 32x8]  47.93x   65.87   3156.92

 [i420]  addAvg[ 16x4]  43.43x   18.64   809.56

 [i422]  addAvg[ 16x8]  47.47x   33.64   1596.84

 addAvg[32x24]  49.16x   191.04  9392.00

 [i420]  addAvg[16x12]  49.27x   48.68   2398.20

 [i422]  addAvg[16x24]  50.96x   93.21   4750.37

 addAvg[ 8x32]  45.61x   72.32   3298.91

 [i420]  addAvg[ 4x16]  24.65x   32.30   796.37

 [i422]  addAvg[ 4x32]  25.97x   60.57   1572.78

 addAvg[24x32]  46.28x   204.88  9481.85

 [i420]  addAvg[12x16]  35.58x   68.07   2422.33

 [i422]  addAvg[12x32]  37.35x   130.66  4879.55

 addAvg[64x64]  45.30x   1066.50 48309.83

 [i420]  addAvg[32x32]  48.17x   255.22  12293.77

 [i422]  addAvg[32x64]  48.67x   505.28  24591.01

 addAvg[64x32]  45.22x   535.51  24215.25

 [i420]  addAvg[32x16]  48.63x   130.26  6334.18

 [i422]  addAvg[32x32]  48.33x   255.33  12341.31

 addAvg[32x64]  48.88x   504.10  24641.61

 [i420]  addAvg[16x32]  51.87x   123.09  6384.44

 [i422]  addAvg[16x64]  53.21x   242.70  12914.20

 addAvg[64x16]  44.87x   270.22  12125.58

 [i420]  addAvg[ 32x8]  46.57x   66.57   3100.05

 [i422]  addAvg[32x16]  48.76x   129.97  6336.97

 addAvg[64x48]  46.57x   800.90  37301.68

 [i420]  addAvg[32x24]  49.21x   192.49  9473.39

 [i422]  addAvg[32x48]  49.02x   379.97  18627.41

 addAvg[16x64]  53.24x   242.72  12922.55

 [i420]  addAvg[ 8x32]  44.63x   74.53   3326.18

 [i422]  addAvg[ 8x64]  48.12x   138.94  6686.57

 addAvg[48x64]  47.97x   754.41  36187.82

 [i420]  addAvg[24x32]  45.60x   205.26  9360.26

 [i422]  addAvg[24x64]  45.69x   408.96  18684.47

 

Ok to commit?

 

Thanks,

Sebastian


Re: [x265] [arm64] port cpy2Dto1D_{shl, shr} and cpy1Dto2D_{shl, shr}

2021-07-27 Thread chen
Looks good, thanks.




At 2021-07-27 02:53:10, "Pop, Sebastian" wrote:

Hi,

the attached patch ports to arm64 the following kernels:

 

cpy2Dto1D_shl[4x4]  15.69x   6.73    105.60

cpy2Dto1D_shr[4x4]  12.97x   6.65    86.28

cpy2Dto1D_shl[8x8]  43.32x   8.85    383.16

cpy2Dto1D_shr[8x8]  34.56x   9.75    336.91

  cpy2Dto1D_shl[16x16]  52.93x   21.95   1161.97

  cpy2Dto1D_shr[16x16]  52.10x   27.88   1452.72

  cpy2Dto1D_shl[32x32]  68.29x   89.12   6085.54

  cpy2Dto1D_shr[32x32]  38.55x   105.73  4076.24

 

cpy1Dto2D_shl[4x4]  19.04x   5.63    107.16

cpy1Dto2D_shl_aligned[4x4]  19.22x   5.60    107.68

cpy1Dto2D_shr[4x4]  15.32x   6.52    99.89

cpy1Dto2D_shl[8x8]  47.59x   8.27    393.34

cpy1Dto2D_shl_aligned[8x8]  47.22x   8.28    390.90

cpy1Dto2D_shr[8x8]  36.68x   9.74    357.15

  cpy1Dto2D_shl[16x16]  71.02x   21.51   1527.64

cpy1Dto2D_shl_aligned[16x16]  69.37x   21.71   1506.23

  cpy1Dto2D_shr[16x16]  39.06x   28.23   1102.52

  cpy1Dto2D_shl[32x32]  68.19x   89.34   6092.00

cpy1Dto2D_shl_aligned[32x32]  70.01x   89.26   6248.95

  cpy1Dto2D_shr[32x32]  56.47x   105.90  5979.45

 

Ok to commit?

 

Thanks,

Sebastian


Re: [x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}

2021-07-24 Thread chen
Hi,


@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon
..

+uaddlv  s4, v4.4h

Unsigned?




+umov    w12, v4.h[0]

+sxth    w12, w12

+add x0, x12, #16




The SXTH is unnecessary because the count of zeros must be in the range [0,16], so 
W12 is in the range [-16,0].

Please also keep in mind that W0 is the low part of X0, and that the result in register S4 is an int32.
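
One way to read the remark about "Unsigned?" is that a signed widening reduction would make the extract-and-sign-extend step redundant. The scalar sketch below contrasts the two; the use of SADDLV in the second function is an assumption about the intended alternative, and both function names are hypothetical.

```cpp
#include <cstdint>

// lanes[] models the four 16-bit accumulators of v4.4h; each lane holds minus
// the number of zero coefficients it has seen (CMEQ contributes -1 per zero).

// UADDLV + UMOV h[0] + SXTH: the unsigned sum carries into the bits above
// bit 15, so the low half-word must be extracted and sign-extended.
int nonZeroViaUnsigned(const int16_t lanes[4])
{
    uint32_t s4 = 0;
    for (int i = 0; i < 4; i++)
        s4 += (uint16_t)lanes[i];          // uaddlv s4, v4.4h
    int16_t w12 = (int16_t)(s4 & 0xFFFF);  // umov w12, v4.h[0]; sxth w12, w12
    return w12 + 16;                       // add x0, x12, #16
}

// SADDLV (assumed alternative): the signed sum is already a valid int32.
int nonZeroViaSigned(const int16_t lanes[4])
{
    int32_t s4 = 0;
    for (int i = 0; i < 4; i++)
        s4 += lanes[i];                    // saddlv s4, v4.4h
    return s4 + 16;                        // fmov w12, s4; add w0, w12, #16
}
```

Both variants return the same count for any lane values in [-16, 0]; the signed form simply needs fewer instructions after the reduction.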




The rest of the patch looks good.




Regards,

Min Chen

At 2021-07-25 13:31:06, "Pop, Sebastian"  wrote:

Hi,

 

> You didn't see an improvement because you still use USHR. After CMEQ we get 0 or -1

> depending on the result, so we can sum these -1 values directly to derive the number

> of non-zero coeffs; that reduces 3 instructions to 2.

 

You are right.  With this change I see a lot of improvement:

 

@@ -508,19 +508,17 @@ function x265_copy_cnt_4_neon

.rept 2

 ld1 {v0.8b}, [x1], x2

 ld1 {v1.8b}, [x1], x2

-clz v2.4h, v0.4h

-clz v3.4h, v1.4h

-ushr    v2.4h, v2.4h, #4

-ushr    v3.4h, v3.4h, #4

-add v2.4h, v2.4h, v3.4h

-add v4.4h, v4.4h, v2.4h

 st1 {v0.8b}, [x0], #8

 st1 {v1.8b}, [x0], #8

+cmeq    v0.4h, v0.4h, #0

+cmeq    v1.4h, v1.4h, #0

+add v4.4h, v4.4h, v0.4h

+add v4.4h, v4.4h, v1.4h

.endr

 uaddlv  s4, v4.4h

-fmov    w12, s4

-mov w11, #16

-sub w0, w11, w12

+umov    w12, v4.h[0]

+sxth    w12, w12

+add x0, x12, #16

 ret

endfunc

 

 

Before:

 copy_cnt[4x4]  13.93x   7.50    104.56

 copy_cnt[8x8]  31.20x   12.70   396.33

   copy_cnt[16x16]  43.22x   36.00   1556.03

   copy_cnt[32x32]  47.39x   129.34  6129.63

 

After:

 copy_cnt[4x4]  14.76x   7.12    105.12

 copy_cnt[8x8]  37.56x   10.60   398.25

   copy_cnt[16x16]  52.57x   29.74   1563.60

   copy_cnt[32x32]  62.22x   98.37   6120.29

 

 

> +xtn v0.8b, v0.8h

> +xtn2    v0.16b, v1.8h

> equal to

> tbl v0, {v0,v1}, v2

 

You are right.  With this change I see a lot of improvement:

 

Before:

copy_sp[16x16]  85.13x   18.78   1599.19

copy_sp[32x32]  96.31x   65.07   6266.88

copy_sp[64x64]  98.81x   252.38  24937.40

[i422] copy_sp[16x32]  91.93x   34.32   3154.89

[i422] copy_sp[32x64]  99.54x   128.29  12769.10

 

After:

copy_sp[16x16]  96.23x   16.42   1579.74

copy_sp[32x32]  104.33x  57.84   6034.24

copy_sp[64x64]  110.79x  221.66  24558.72

[i422] copy_sp[16x32]  97.74x   31.89   3116.46

[i422] copy_sp[32x64]  111.37x  112.39  12517.52

 

Please see the amended patch.

 

Thanks,

Sebastian


Re: [x265] [arm64] port sad_x{3,4}

2021-07-23 Thread chen
Hi,


That's my fault, I missed that part of the SAD code, so your code is fine as it is, 
thank you.


Regards,
Min Chen

At 2021-07-24 03:54:46, "Pop, Sebastian"  wrote:

Hi Min Chen,

thanks for your reviews.

 

> +.macro SAD_X_END_64 x

> +uaddlp  v16.4s, v16.8h

> The dynamic range is 64*255 = 16320 -> 14 bits, so we do not need to extend to 
> 32 bits here

> 

> +uaddlp  v17.4s, v17.8h

> +uaddlp  v18.4s, v18.8h

> +uaddlp  v20.4s, v20.8h

> +uaddlp  v21.4s, v21.8h

> +uaddlp  v22.4s, v22.8h

> +add v16.4s, v16.4s, v20.4s

> +add v17.4s, v17.4s, v21.4s

> +add v18.4s, v18.4s, v22.4s

> +trn2    v20.2d, v16.2d, v16.2d

> +trn2    v21.2d, v17.2d, v17.2d

> +trn2    v22.2d, v18.2d, v18.2d

> +add v16.2s, v16.2s, v20.2s

> 

> +add v17.2s, v17.2s, v21.2s

> +add v18.2s, v18.2s, v22.2s

> +uaddlp  v16.1d, v16.2s

> ADD+TRN2+ADD generates the sum of v16+v20 in V.2s, followed by UADDLP into V.1d

> 

> Given the dynamic range analysis above, we can replace it with

> ADD v16, v20   ; 15-bits

> (ignore inst for V17=V17+V21, etc)

> ADD v16, V17  ; 16-bits

> (ignore other registers)

> ADDLV s0,v16

 

Following your recommendation I tried the following code to delay widening to

the last step with uaddlv.  This code does not pass correctness tests.

 

.macro SAD_X_END_64 x

add v16.8h, v16.8h, v20.8h

add v17.8h, v17.8h, v21.8h

add v18.8h, v18.8h, v22.8h

trn2    v20.2d, v16.2d, v16.2d

trn2    v21.2d, v17.2d, v17.2d

trn2    v22.2d, v18.2d, v18.2d

add v16.4h, v16.4h, v20.4h

add v17.4h, v17.4h, v21.4h

add v18.4h, v18.4h, v22.4h

uaddlv  s16, v16.4h

uaddlv  s17, v17.4h

uaddlv  s18, v18.4h

stp s16, s17, [x6], #8

.if \x == 3

str s18, [x6]

.elseif \x == 4

add v19.8h, v19.8h, v23.8h

trn2    v23.2d, v19.2d, v19.2d

add v19.2s, v19.2s, v23.2s

uaddlv  s19, v19.4h

stp s18, s19, [x6]

.endif

ret

.endm

 

As we start executing the above code, the values observed in each lane of v16 to

v23 are already 16-bit.  For example,

 

(gdb) p $v16.h.u

$21 = {65024, 65024, 65024, 65024, 65024, 65024, 65024, 65024}

 

Each lane of v16 accumulates 4 differences of range 255:

uabal   \v1\().8h, v0.8b, v4.8b

uabal   \v1\().8h, v1.8b, v5.8b

uabal   \v1\().8h, v2.8b, v6.8b

uabal   \v1\().8h, v3.8b, v7.8b

and this is in a loop of 64 iterations.

So the dynamic range for each vector element is 4*64*255 = 65280 -> 16-bits

We need to widen arithmetic in the first step as in the original patch,

and we cannot postpone widening to the last step of the reduction.
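
The overflow can be illustrated with the lane value observed in gdb. This is a sketch with hypothetical helper names: a single lane still fits in 16 bits, but the first 16-bit add of two lanes wraps around.

```cpp
#include <cstdint>

// Largest value a single 16-bit lane can reach in the 64-row kernel:
// 4 UABAL accumulations per row, 64 rows, absolute difference at most 255.
constexpr uint32_t kMaxLane = 4u * 64u * 255u;   // 65280, still fits in 16 bits

// Adding two lanes in 16 bits (add .8h) wraps around modulo 65536.
uint16_t addLanes16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}

// Widening first (uaddlp .4s from .8h) keeps the full sum.
uint32_t addLanes32(uint16_t a, uint16_t b)
{
    return (uint32_t)a + (uint32_t)b;
}
```

With two lanes of 65024 each, the 16-bit add yields 64512 instead of 130048, which is why the reduction has to widen before the first addition.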

 

> I guess STP may store two result in a cycle

 

Please see attached the amended patch that uses store pairs.

I have seen a small performance improvement with this change.

 

Thanks,

Sebastian


Re: [x265] [arm64] port avg_pp

2021-07-23 Thread chen
Looks good




At 2021-07-24 07:04:03, "Pop, Sebastian" wrote:

Hi,

 

the attached patch ports to arm64 the following kernels:

 

 

 avg_pp[  4x4]  8.50x    8.85    75.21

avg_pp_aligned[  4x4]  8.49x    8.89    75.46

 avg_pp[  8x8]  29.12x   11.61   338.01

avg_pp_aligned[  8x8]  30.20x   11.42   344.78

 avg_pp[  8x4]  27.12x   7.34    199.01

avg_pp_aligned[  8x4]  27.18x   7.40    201.11

 avg_pp[  4x8]  9.63x    14.89   143.37

avg_pp_aligned[  4x8]  10.65x   14.94   159.20

 avg_pp[16x16]  50.41x   22.63   1140.85

avg_pp_aligned[16x16]  49.74x   22.45   1116.51

 avg_pp[ 16x8]  66.87x   11.27   753.83

avg_pp_aligned[ 16x8]  68.10x   11.16   759.76

 avg_pp[ 8x16]  25.07x   22.85   572.83

avg_pp_aligned[ 8x16]  24.71x   22.69   560.73

 avg_pp[ 16x4]  41.45x   7.34    304.42

avg_pp_aligned[ 16x4]  48.04x   7.43    356.89

 avg_pp[16x12]  63.50x   16.99   1078.53

avg_pp_aligned[16x12]  45.91x   16.87   774.56

 avg_pp[ 4x16]  10.80x   26.74   288.84

avg_pp_aligned[ 4x16]  10.90x   26.69   290.97

 avg_pp[12x16]  30.99x   31.28   969.46

avg_pp_aligned[12x16]  26.61x   31.61   841.17

 avg_pp[32x32]  92.92x   55.84   5189.14

avg_pp_aligned[32x32]  71.96x   55.72   4009.62

 avg_pp[32x16]  93.70x   28.91   2709.20

avg_pp_aligned[32x16]  68.55x   29.06   1992.17

 avg_pp[16x32]  65.12x   45.81   2983.30

avg_pp_aligned[16x32]  51.43x   45.51   2340.67

 avg_pp[ 32x8]  93.24x   15.82   1475.04

avg_pp_aligned[ 32x8]  76.66x   15.88   1217.75

 avg_pp[32x24]  70.85x   42.36   3001.17

avg_pp_aligned[32x24]  70.10x   42.46   2976.72

 avg_pp[ 8x32]  31.19x   45.98   1434.10

avg_pp_aligned[ 8x32]  27.80x   45.73   1271.58

 avg_pp[24x32]  50.96x   75.62   3853.13

avg_pp_aligned[24x32]  50.17x   75.71   3798.44

 avg_pp[64x64]  74.94x   221.76  16617.97

avg_pp_aligned[64x64]  71.24x   221.74  15797.84

 avg_pp[64x32]  82.22x   112.25  9229.40

avg_pp_aligned[64x32]  70.60x   112.25  7925.30

 avg_pp[32x64]  79.00x   110.78  8751.21

avg_pp_aligned[32x64]  71.68x   110.70  7934.54

 avg_pp[64x16]  87.17x   57.66   5026.56

avg_pp_aligned[64x16]  68.42x   57.66   3945.34

 avg_pp[64x48]  87.96x   166.85  14676.53

avg_pp_aligned[64x48]  71.82x   166.86  11983.28

 avg_pp[16x64]  48.84x   92.63   4523.80

avg_pp_aligned[16x64]  43.73x   92.32   4037.08

 avg_pp[48x64]  96.16x   143.53  13801.49

avg_pp_aligned[48x64]  83.02x   143.73  11932.26

 

Ok to commit?

 

Thanks,

Sebastian



Re: [x265] [arm64] port count_nonzero, blkfill, and copy_{ss, sp, ps}

2021-07-23 Thread chen



At 2021-07-24 05:23:44, "Pop, Sebastian"  wrote:

Hi,

 

> +fmov    w12, s4

> +neg w12, w12

> +add w0, w12, #16

> (-w12) + 16 is equal to 16 - w12; loading #16 into w0 may execute in parallel with 
> FMOV.

 

I see a small improvement with this change.  Please see attached a patch.

 

> +clz v2.4h, v0.4h

> +clz v3.4h, v1.4h

> +ushr    v2.4h, v2.4h, #4

> +ushr    v3.4h, v3.4h, #4

> +add v2.4h, v2.4h, v3.4h

> clz+ushr+add is slower than cmeq+add in either execution throughput or cycles.

 

I do not see any improvement with this change applied to 
x265_copy_cnt_{4,8,16,32}:

 

@@ -508,14 +508,14 @@ function x265_copy_cnt_4_neon

.rept 2

 ld1 {v0.8b}, [x1], x2

 ld1 {v1.8b}, [x1], x2

-clz v2.4h, v0.4h

-clz v3.4h, v1.4h

-ushr v2.4h, v2.4h, #4

-ushr v3.4h, v3.4h, #4

-add v2.4h, v2.4h, v3.4h

-add v4.4h, v4.4h, v2.4h

 st1 {v0.8b}, [x0], #8

 st1 {v1.8b}, [x0], #8

+cmeq v0.4h, v0.4h, #0

+cmeq v1.4h, v1.4h, #0

+ushr v0.4h, v0.4h, #15

+ushr v1.4h, v1.4h, #15

+add v4.4h, v4.4h, v0.4h

+add v4.4h, v4.4h, v1.4h

.endr

 uaddlv  s4, v4.4h

 fmov w12, s4

 

Before this change, the time is slightly better:

 

 copy_cnt[4x4]  13.84x   7.53    104.19

 copy_cnt[8x8]  31.37x   12.44   390.16

   copy_cnt[16x16]  43.34x   35.83   1553.07

   copy_cnt[32x32]  47.40x   129.28  6127.89

 

than after the change:

 

 copy_cnt[4x4]  13.91x   7.50    104.25

 copy_cnt[8x8]  31.09x   12.57   390.92

   copy_cnt[16x16]  43.12x   36.04   1554.11

   copy_cnt[32x32]  47.38x   129.34  6128.81

 

Neoverse-N1 SWOG says:

https://documentation-service.arm.com/static/5f05e93dcafe527e86f61acd

 

CLZ  latency 2, throughput 2

CMEQ latency 2, throughput 1

 

Changing CLZ to CMEQ has less parallelism with a lower throughput.







You didn't see an improvement because you still use USHR. After CMEQ, each lane 
is 0 or -1 depending on the result, so we can sum these -1 values directly to 
derive the total number of non-zero coeffs; that reduces 3 instructions to 2.
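
To make the CMEQ idea concrete, here is a rough scalar model in Python. It is 
only a sketch and not part of any patch: the function names are made up for this 
illustration, and the lanes are modelled as plain 16-bit integers. It shows that 
accumulating the 0 / -1 masks directly, with no USHR, gives the same count as the 
CLZ + USHR sequence, provided the final scalar fix-up adds the (negative) sum to 
the block size.

```python
MASK16 = 0xFFFF

def cmeq_zero(lanes):
    """Model of CMEQ Vd.4H, Vn.4H, #0: all-ones (-1) where the lane is zero, else 0."""
    return [MASK16 if x == 0 else 0 for x in lanes]

def count_nonzero_clz(coeffs):
    """Model of the CLZ + USHR #4 + ADD sequence on 16-bit lanes."""
    zeros = 0
    for x in coeffs:
        clz = 16 - (x & MASK16).bit_length()  # CLZ of a 16-bit lane, 16 for zero
        zeros += clz >> 4                     # USHR #4: 1 only when the lane is zero
    return len(coeffs) - zeros                # scalar fix-up: N - zeros

def count_nonzero_cmeq(coeffs):
    """Model of CMEQ + ADD: sum the 0 / -1 masks with wrapping 16-bit adds."""
    acc = 0
    for mask in cmeq_zero(coeffs):
        acc = (acc + mask) & MASK16           # wrapping 16-bit ADD of 0 or -1
    signed = acc - 0x10000 if acc & 0x8000 else acc  # reinterpret as int16
    return len(coeffs) + signed               # N + (-zeros)

print(count_nonzero_cmeq([0, 5, -3, 0]))
```

With this formulation the NEG in the tail is no longer needed either, since the 
accumulated value is already negative.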




 

> The copy_s* looks good; my only comment is that the TBL instruction is faster than 
> XTN/XTN2

> 

 

Neoverse-N1 SWOG says TBL is as fast as XTN:

 

TBL (with 1 or 2 table regs) latency 2 throughput 2

XTN latency 2 throughput 2




+xtn v0.8b, v0.8h

+xtn2 v0.16b, v1.8h

is equivalent to
tbl v0, {v0,v1}, v2
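
As a hedged illustration of why the two forms are interchangeable, the Python 
model below treats the registers as little-endian byte lists. The index vector 
(0, 2, 4, ..., 30) held in v2 is an assumption of this sketch, since the mail 
does not spell it out, and the helper names are invented for the example.

```python
def to_bytes_le(halfwords):
    """Lay out 8 halfwords as 16 little-endian bytes, as a .16b view of a .8h register."""
    out = []
    for h in halfwords:
        out += [h & 0xFF, (h >> 8) & 0xFF]
    return out

def xtn_pair(v0_h, v1_h):
    """Model of XTN v0.8b, v0.8h followed by XTN2 v0.16b, v1.8h (keep low bytes)."""
    return [h & 0xFF for h in v0_h] + [h & 0xFF for h in v1_h]

def tbl2(table0, table1, idx):
    """Model of TBL with a two-register table: out-of-range indices yield 0."""
    table = table0 + table1
    return [table[i] if i < 32 else 0 for i in idx]

# Assumed contents of v2: select the low byte of every halfword.
EVEN_BYTES = list(range(0, 32, 2))

v0 = [0x1234, 0x00FF, 0xABCD, 0x0001, 0x8000, 0x7F7F, 0x0102, 0xFFFF]
v1 = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555, 0x6666, 0x7777, 0x8888]
print(tbl2(to_bytes_le(v0), to_bytes_le(v1), EVEN_BYTES) == xtn_pair(v0, v1))
```

The TBL form needs the index vector loaded once outside the loop, which is the 
cost to weigh against saving one instruction per pair of registers.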







 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port sad_x{3,4}

2021-07-22 Thread chen
Hi,




Some comments,




+.macro SAD_X_END_64 x

+uaddlp  v16.4s, v16.8h
The dynamic range is 64*255 = 16320 -> 14 bits, so we do not need to extend to 
32 bits here



+uaddlp  v17.4s, v17.8h

+uaddlp  v18.4s, v18.8h

+uaddlp  v20.4s, v20.8h

+uaddlp  v21.4s, v21.8h

+uaddlp  v22.4s, v22.8h

+add v16.4s, v16.4s, v20.4s

+add v17.4s, v17.4s, v21.4s

+add v18.4s, v18.4s, v22.4s

+trn2 v20.2d, v16.2d, v16.2d

+trn2 v21.2d, v17.2d, v17.2d

+trn2 v22.2d, v18.2d, v18.2d

+add v16.2s, v16.2s, v20.2s



+add v17.2s, v17.2s, v21.2s
+add v18.2s, v18.2s, v22.2s
+uaddlp  v16.1d, v16.2s
ADD+TRN2+ADD generates the sum of v16+v20 in V.2s, followed by UADDLP into V.1d


Given the dynamic range analysis above, we can replace it with
ADD v16, v20   ; 15-bits
(ignoring the instructions for V17=V17+V21, etc.)
ADD v16, V17  ; 16-bits
(ignoring the other registers)
UADDLV s0,v16
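
The bit-width bookkeeping behind this suggestion can be checked with a few lines 
of Python. This is only a worked-arithmetic sketch: it takes the 64*255 per-lane 
bound from the comment above as given, and the variable names are invented here.

```python
MAX_ABS_DIFF = 255          # largest |a - b| for 8-bit pixels

def bits_needed(value):
    """Number of bits required to hold an unsigned value."""
    return value.bit_length()

per_lane = 64 * MAX_ABS_DIFF        # bound quoted in the review
after_first_add = 2 * per_lane      # e.g. v16 + v20
after_second_add = 4 * per_lane     # one more ADD of two such sums

for name, value in [("per lane", per_lane),
                    ("after first ADD", after_first_add),
                    ("after second ADD", after_second_add)]:
    print(name, value, bits_needed(value), "bits")

# The whole chain still fits an unsigned 16-bit lane.
print(after_second_add <= 0xFFFF)
```

So under that bound two 16-bit ADDs are safe, and only the final across-lane 
reduction has to widen.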




+uaddlp  v17.1d, v17.2s
+uaddlp  v18.1d, v18.2s


+st1 {v16.s}[0], [x6], #4
+st1 {v17.s}[0], [x6], #4

+st1 {v18.s}[0], [x6], #4

I guess STP may store two results in one cycle




Regards,
Min Chen




At 2021-07-22 14:30:50, "Pop, Sebastian" wrote:

Hi,

 

the attached patch ports to arm64 the following kernels:

 

 sad_x3[  4x4]  12.23x   13.79   168.68

 sad_x4[  4x4]  14.12x   15.82   223.43

 sad_x3[  8x8]  35.05x   17.45   611.47

 sad_x4[  8x8]  38.48x   21.18   814.95

 sad_x3[  8x4]  27.19x   11.46   311.48

 sad_x4[  8x4]  30.40x   13.60   413.37

 sad_x3[  4x8]  14.16x   22.99   325.37

 sad_x4[  4x8]  15.82x   27.39   433.23

 sad_x3[16x16]  40.94x   57.94   2371.97

 sad_x4[16x16]  43.63x   72.44   3160.44

 sad_x3[ 16x8]  38.84x   30.54   1186.15

 sad_x4[ 16x8]  39.23x   40.16   1575.43

 sad_x3[ 8x16]  38.74x   31.43   1217.71

 sad_x4[ 8x16]  41.48x   39.01   1618.17

 sad_x3[ 16x4]  31.82x   18.88   600.72

 sad_x4[ 16x4]  36.35x   21.87   795.00

 sad_x3[16x12]  40.27x   43.87   1766.74

 sad_x4[16x12]  42.58x   55.94   2381.75

 sad_x3[ 4x16]  15.34x   42.16   646.67

 sad_x4[ 4x16]  17.08x   51.06   872.12

 sad_x3[12x16]  29.45x   61.06   1798.28

 sad_x4[12x16]  30.39x   78.94   2399.17

 sad_x3[32x32]  42.85x   216.39  9272.65

 sad_x4[32x32]  42.53x   294.98  12544.76

 sad_x3[32x16]  42.09x   110.35  4644.86

 sad_x4[32x16]  41.71x   151.05  6301.01

 sad_x3[16x32]  44.19x   106.99  4728.04

 sad_x4[16x32]  44.72x   139.94  6257.96

 sad_x3[ 32x8]  40.10x   58.16   2332.47

 sad_x4[ 32x8]  41.17x   76.65   3155.96

 sad_x3[32x24]  42.69x   162.76  6947.64

sad_x4[32x24]  42.08x   223.88  9421.46

 sad_x3[ 8x32]  41.86x   57.89   2423.47

 sad_x4[ 8x32]  45.26x   71.56   3239.07

 sad_x3[24x32]  45.10x   155.22  6999.53

 sad_x4[24x32]  45.30x   205.87  9325.60

 sad_x3[64x64]  39.87x   925.36  36892.50

 sad_x4[64x64]  40.80x   1214.79 49557.66

 sad_x3[64x32]  39.40x   468.08  18444.51

 sad_x4[64x32]  40.71x   609.27  24803.74

 sad_x3[32x64]  43.48x   426.05  18522.95

 sad_x4[32x64]  43.31x   577.80  25024.14

 sad_x3[64x16]  38.67x   238.72  9231.84

 sad_x4[64x16]  40.36x   308.10  12435.08

 sad_x3[64x48]  39.70x   695.95  27628.87

 sad_x4[64x48]  40.74x   912.56  37173.46

 sad_x3[16x64]  44.85x   208.19  9337.52

 sad_x4[16x64]  45.46x   274.68  12487.54

 sad_x3[48x64]  42.68x   653.74  27903.74

 sad_x4[48x64]  44.67x   835.79  37336.87

 

Ok to commit?

 

Thanks,

Sebastian

 ___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port sad

2021-07-20 Thread chen
Hi Sebastian,


The code looks good.


I guess you don't see much performance improvement for two reasons:
1. Replacing 5 instructions with another 5 instructions may give similar 
performance on an OOO CPU, but different performance on an in-order CPU.
2. There is a potential pipeline stall that may affect performance.


Regards,
Min Chen



At 2021-07-20 12:45:03, "Pop, Sebastian"  wrote:

Thanks Min Chen for your reviews.

I tried your suggestion to remove one of the FP->GPR transfers.

With the following patch I do not see any improvement for the 64x routines, and 
the number of instructions remains the same:

 

--- a/source/common/aarch64/sad-a.S

+++ b/source/common/aarch64/sad-a.S

@@ -137,14 +137,14 @@

 add v16.8h, v16.8h, v17.8h

 add v17.8h, v18.8h, v19.8h

 add v16.8h, v16.8h, v17.8h

-uaddlv  s0,  v16.8h

-fmov w0,  s0

+uaddlp  v16.4s, v16.8h

Using v16 immediately afterwards in an ADD instruction may cause a pipeline stall




 add v18.8h, v20.8h, v21.8h

 add v19.8h, v22.8h, v23.8h

 add v17.8h, v18.8h, v19.8h

-uaddlv  s1,  v17.8h

-fmov w1,  s1

-add w0, w0, w1

+uaddlp  v17.4s, v17.8h

+add v16.4s, v16.4s, v17.4s

+uaddlv  d0, v16.4s

+fmov x0, d0

 ret

.endm

 

Please see the amended patch with your recommended change.

 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port sad

2021-07-17 Thread chen
Hi Sebastian,



Thank you for your code.


First, sorry for the delay: I was very busy with my family and my toy hardware 
codec last week, so I only have a little spare time during the weekend.
Next, I didn't look at all of the functions, but I made some comments on 
64x64.


In this function, unroll=8 (4*2) will give good performance on an Out-Of-Order (OOO) 
CPU, but may lose performance due to cache misses and related issues on a low-end 
CPU such as Cortex-A53. Of course, this is not a problem for this version of the 
patch.


In 64x64, the sum is calculated by the code below.
==

+.macro SAD_END_64

+add v16.8h, v16.8h, v17.8h

+add v17.8h, v18.8h, v19.8h

+add v16.8h, v16.8h, v17.8h

+uaddlv  s0,  v16.8h

+fmov w0,  s0

+add v18.8h, v20.8h, v21.8h

+add v19.8h, v22.8h, v23.8h

+add v17.8h, v18.8h, v19.8h

+uaddlv  s1,  v17.8h

+fmov w1,  s1

+add w0, w0, w1

+ret

+.endm

==


You use two UADDLV instructions to avoid overflow; how about summing these partial 
registers in the NEON domain to save one UADDLV?
e.g.
UADDLP v16,v16
UADDLP v17,v17
ADD v16,v17
UADDLV s0,v16
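
A small Python model may help show why the widening step comes before the vector 
ADD. It is a sketch only: lanes are plain integers, the helper names are made up 
for this example, and the lane value 40000 is just a convenient number that does 
not fit in 16 bits once doubled.

```python
def uaddlp_8h_to_4s(v):
    """Model of UADDLP Vd.4s, Vn.8h: add adjacent 16-bit lanes into 32-bit lanes."""
    return [v[i] + v[i + 1] for i in range(0, 8, 2)]

def uaddlv(v):
    """Model of UADDLV: widening sum across all lanes."""
    return sum(v)

def sad_end_two_uaddlv(v16, v17):
    """Existing tail: two UADDLV reductions, then a scalar add."""
    return uaddlv(v16) + uaddlv(v17)

def sad_end_neon_sum(v16, v17):
    """Suggested tail: UADDLP, UADDLP, ADD .4s, then a single UADDLV."""
    a = uaddlp_8h_to_4s(v16)
    b = uaddlp_8h_to_4s(v17)
    s = [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]
    return uaddlv(s)

def sad_end_naive_16bit(v16, v17):
    """What would happen with a plain ADD .8h first: 16-bit lanes wrap around."""
    s = [(x + y) & 0xFFFF for x, y in zip(v16, v17)]
    return uaddlv(s)

v16 = [40000] * 8
v17 = [40000] * 8
print(sad_end_two_uaddlv(v16, v17), sad_end_neon_sum(v16, v17),
      sad_end_naive_16bit(v16, v17))
```

Both correct variants agree, while the naive 16-bit add loses the carry, which is 
the overflow the two UADDLV instructions were guarding against.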


Regards,
Min Chen

At 2021-07-17 04:44:05, "Pop, Sebastian" wrote:

Hi,

the attached patch ports to arm64 the following kernels:

 

sad[  4x4]  10.11x   6.50    65.72

sad[  8x8]  28.95x   8.50    246.00

sad[  8x4]  23.03x   5.45    125.43

sad[  4x8]  12.09x   10.64   128.68

sad[16x16]  53.37x   19.19   1024.05

sad[ 16x8]  43.09x   11.62   500.84

sad[ 8x16]  31.03x   16.87   523.44

sad[ 16x4]  39.73x   6.27    249.10

sad[16x12]  50.55x   15.10   763.44

sad[ 4x16]  14.23x   19.39   275.91

sad[12x16]  33.68x   22.95   772.81

sad[32x32]  62.10x   64.84   4026.97

sad[32x16]  59.82x   33.74   2018.56

sad[16x32]  57.94x   35.01   2028.17

sad[ 32x8]  53.98x   18.77   1013.48

sad[32x24]  61.29x   49.36   3024.90

sad[ 8x32]  31.84x   32.49   1034.56

sad[24x32]  53.61x   56.39   3022.97

sad[64x64]  65.24x   255.86  16692.29

sad[64x32]  61.77x   131.16  8100.90

sad[32x64]  62.31x   128.90  8031.79

sad[64x16]  60.28x   67.35   4060.31

sad[64x48]  62.53x   193.59  12104.64

sad[16x64]  61.10x   66.13   4040.26

sad[48x64]  61.75x   194.68  12022.14

 

Ok to commit?

 

Thanks,

Sebastian

 ___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port LUMA_VPP_4xN

2021-07-07 Thread chen
Hi Sebastian,


It looks good, thanks.


Regards,
Min Chen

At 2021-07-08 02:20:01, "Pop, Sebastian"  wrote:

Attached the amended patch with movi.

That improved performance, thanks!

 

I have seen the cmp/br pattern several times.

We can do the reordering tuning after all the interpolate functions are ported.

 

Sebastian

 

From: x265-devel  on behalf of chen 

Reply-To: Development for x265 
Date: Tuesday, July 6, 2021 at 9:10 PM
To: Development for x265 
Subject: RE: [EXTERNAL] [x265] [arm64] port LUMA_VPP_4xN

 


 

Looks good to me.

 

There are some small improvements that could be made in a future version.

For example,

 

+mov w12, #32

+dup v16.4s, w12

is equivalent to

MOVI v16.4s,#32

 

We may get more performance by reordering the compares and branches

+cmp x4, #0

+b.eq 0f

+cmp x4, #1

+b.eq 1f

+cmp x4, #2

+b.eq 2f

+cmp x4, #3

+b.eq 3f

+0:

 

At 2021-07-07 00:01:17, "Pop, Sebastian"  wrote:

Thanks for your careful reviews.

I addressed the problems for eor and rodata.

Please see the attached patch.

 

Sebastian

 

From: x265-devel  on behalf of chen 

Reply-To: Development for x265 
Date: Friday, July 2, 2021 at 8:11 PM
To: Development for x265 
Subject: RE: [EXTERNAL] [x265] [arm64] port LUMA_VPP_4xN

 


 

Hi,

 

I put my comments inline. thanks.

 

btw: I found another possible improvement in this patch.

+eor v17.16b, v17.16b, v17.16b

The register clear could be replaced by MOVI

At 2021-07-03 02:43:07, "Pop, Sebastian"  wrote:

Hi,

thanks for your review.

 

> +#ifdef __MACH__

> +#   define MACH

> +#else

> +#   define MACH #

> This is not a good idea for bypassing .const_data

 

MACH uses ".const_data" directive, which is invalid for ELF.

For ELF the directive is ".rodata":

 

> ELF .section .rodata

> MACH .const_data

 

[MC] I mean you could declare MACH_RODATA or a similar macro that is empty on ELF 
but expands to something on Mach-O; I guess that is better than using '#' to bypass 
the unnecessary statement.

 

> +ushll   v0.8h, v0.8b, #0

> ...

> +mul v16.8h, v0.8h, v24.8h

> Why not MULL?

 

That would not work for the rest of the computation.

Part of the data in v0 gets used in the next computation,

and then I would have to split mla into a mull + add.

 

[MC] This depends on your algorithm. In your code below, you combine row1 & row2 
and multiply by coeff[0]; however, it also works as 8b x 8b with UMULL.

However, that is a slightly more complex algorithm, so we can keep this version 
and improve it in the future.

*** Code

> +mul v16.8h, v0.8h, v24.8h

> +ext v21.16b, v0.16b, v1.16b, #8

> +mul v17.8h, v21.8h, v24.8h

> +mov v0.16b, v1.16b

*** End






> +orr v0.16b, v1.16b, v1.16b

> This is equivalent to MOV; I guess the compiler will emit the right instruction on 
> ARM64

 

I replaced orr with mov instructions.

 

> +// sum row[0-7]

> +dup v18.2d, v16.d[1]

> +dup v19.2d, v17.d[1]

> +add v16.4h, v16.4h, v18.4h

> +add v17.4h, v17.4h, v19.4h

> +trn1 v16.2d, v16.2d, v17.2d

> How about ADDP?

 

I replaced the above 5 instructions with the following 3 and the performance 
improved.

 

trn1 v20.2d, v16.2d, v17.2d

trn2 v21.2d, v16.2d, v17.2d

add v16.8h, v20.8h, v21.8h
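
For readers who want to convince themselves that the 3-instruction form computes 
the same lanes as the original 5, here is a rough Python model. It is a sketch 
under simple assumptions: each register is a list of eight 16-bit lanes, the .2d 
operations move groups of four lanes, and the function names are invented for 
this illustration.

```python
def dup_d1(v):
    """Model of DUP Vd.2d, Vn.d[1]: broadcast the upper 64 bits (lanes 4..7)."""
    return v[4:] + v[4:]

def add_4h(a, b):
    """Model of ADD Vd.4h: add the low four lanes, upper 64 bits are zeroed."""
    return [(x + y) & 0xFFFF for x, y in zip(a[:4], b[:4])] + [0] * 4

def add_8h(a, b):
    """Model of ADD Vd.8h: lane-wise wrapping 16-bit add."""
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]

def trn1_2d(a, b):
    """Model of TRN1 Vd.2d: low 64 bits of a, then low 64 bits of b."""
    return a[:4] + b[:4]

def trn2_2d(a, b):
    """Model of TRN2 Vd.2d: high 64 bits of a, then high 64 bits of b."""
    return a[4:] + b[4:]

def five_instructions(v16, v17):
    v18 = dup_d1(v16)
    v19 = dup_d1(v17)
    v16 = add_4h(v16, v18)
    v17 = add_4h(v17, v19)
    return trn1_2d(v16, v17)

def three_instructions(v16, v17):
    v20 = trn1_2d(v16, v17)
    v21 = trn2_2d(v16, v17)
    return add_8h(v20, v21)

a = [1, 2, 3, 4, 50, 60, 70, 80]
b = [100, 200, 300, 400, 5, 6, 7, 8]
print(five_instructions(a, b) == three_instructions(a, b))
```

The shorter form works because both transposes are independent of each other, so 
they can issue together before the single full-width ADD.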

 

Please see attached the amended patch.

 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port LUMA_VPP_4xN

2021-07-06 Thread chen
Looks good to me.


There are some small improvements that could be made in a future version.
For example,


+mov w12, #32
+dup v16.4s, w12
is equivalent to
MOVI v16.4s,#32


We may get more performance by reordering the compares and branches
+cmp x4, #0
+b.eq 0f
+cmp x4, #1
+b.eq 1f
+cmp x4, #2
+b.eq 2f
+cmp x4, #3
+b.eq 3f
+0:



At 2021-07-07 00:01:17, "Pop, Sebastian"  wrote:

Thanks for your careful reviews.

I addressed the problems for eor and rodata.

Please see the attached patch.

 

Sebastian

 

From: x265-devel  on behalf of chen 

Reply-To: Development for x265 
Date: Friday, July 2, 2021 at 8:11 PM
To: Development for x265 
Subject: RE: [EXTERNAL] [x265] [arm64] port LUMA_VPP_4xN

 


 

Hi,

 

I put my comments inline. thanks.

 

btw: I found another possible improvement in this patch.

+eor v17.16b, v17.16b, v17.16b

The register clear could be replaced by MOVI

At 2021-07-03 02:43:07, "Pop, Sebastian"  wrote:

Hi,

thanks for your review.

 

> +#ifdef __MACH__

> +#   define MACH

> +#else

> +#   define MACH #

> This is not a good idea for bypassing .const_data

 

MACH uses ".const_data" directive, which is invalid for ELF.

For ELF the directive is ".rodata":

 

> ELF .section .rodata

> MACH .const_data

 

[MC] I mean you could declare MACH_RODATA or a similar macro that is empty on ELF 
but expands to something on Mach-O; I guess that is better than using '#' to bypass 
the unnecessary statement.

 

> +ushll   v0.8h, v0.8b, #0

> ...

> +mul v16.8h, v0.8h, v24.8h

> Why not MULL?

 

That would not work for the rest of the computation.

Part of the data in v0 gets used in the next computation,

and then I would have to split mla into a mull + add.

 

[MC] This depends on your algorithm. In your code below, you combine row1 & row2 
and multiply by coeff[0]; however, it also works as 8b x 8b with UMULL.

However, that is a slightly more complex algorithm, so we can keep this version 
and improve it in the future.

*** Code

> +mul v16.8h, v0.8h, v24.8h

> +ext v21.16b, v0.16b, v1.16b, #8

> +mul v17.8h, v21.8h, v24.8h

> +mov v0.16b, v1.16b

*** End





> +orr v0.16b, v1.16b, v1.16b

> This is equivalent to MOV; I guess the compiler will emit the right instruction on 
> ARM64

 

I replaced orr with mov instructions.

 

> +// sum row[0-7]

> +dup v18.2d, v16.d[1]

> +dup v19.2d, v17.d[1]

> +add v16.4h, v16.4h, v18.4h

> +add v17.4h, v17.4h, v19.4h

> +trn1 v16.2d, v16.2d, v17.2d

> How about ADDP?

 

I replaced the above 5 instructions with the following 3 and the performance 
improved.

 

trn1 v20.2d, v16.2d, v17.2d

trn2 v21.2d, v16.2d, v17.2d

add v16.8h, v20.8h, v21.8h

 

Please see attached the amended patch.

 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port LUMA_VPP_4xN

2021-07-02 Thread chen
Hi,


I put my comments inline. thanks.


btw: I found another possible improvement in this patch.
+eor v17.16b, v17.16b, v17.16b
The register clear could be replaced by MOVI

At 2021-07-03 02:43:07, "Pop, Sebastian"  wrote:

Hi,

thanks for your review.

 

> +#ifdef __MACH__

> +#   define MACH

> +#else

> +#   define MACH #

> This is not a good idea for bypassing .const_data

 

MACH uses ".const_data" directive, which is invalid for ELF.

For ELF the directive is ".rodata":

 

> ELF .section .rodata

> MACH .const_data




[MC] I mean you could declare MACH_RODATA or a similar macro that is empty on ELF 
but expands to something on Mach-O; I guess that is better than using '#' to bypass 
the unnecessary statement.

 

> +ushll   v0.8h, v0.8b, #0

> ...

> +mul v16.8h, v0.8h, v24.8h

> Why not MULL?

 

That would not work for the rest of the computation.

Part of the data in v0 gets used in the next computation,

and then I would have to split mla into a mull + add.




[MC] This depends on your algorithm. In your code below, you combine row1 & row2 
and multiply by coeff[0]; however, it also works as 8b x 8b with UMULL.

However, that is a slightly more complex algorithm, so we can keep this version 
and improve it in the future.

*** Code

> +mul v16.8h, v0.8h, v24.8h

> +ext v21.16b, v0.16b, v1.16b, #8

> +mul v17.8h, v21.8h, v24.8h

> +mov v0.16b, v1.16b

*** End




> +orr v0.16b, v1.16b, v1.16b

> This is equivalent to MOV; I guess the compiler will emit the right instruction on 
> ARM64

 

I replaced orr with mov instructions.

 

> +// sum row[0-7]

> +dup v18.2d, v16.d[1]

> +dup v19.2d, v17.d[1]

> +add v16.4h, v16.4h, v18.4h

> +add v17.4h, v17.4h, v19.4h

> +trn1 v16.2d, v16.2d, v17.2d

> How about ADDP?

 

I replaced the above 5 instructions with the following 3 and the performance 
improved.

 

trn1 v20.2d, v16.2d, v17.2d

trn2 v21.2d, v16.2d, v17.2d

add v16.8h, v20.8h, v21.8h

 

Please see attached the amended patch.

 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port LUMA_VPP_4xN

2021-07-01 Thread chen
Hello,



Thanks for your patch; I have some comments.



+#ifdef __MACH__

+#   define MACH

+#else

+#   define MACH #

This is not a good idea for bypassing .const_data


+ld1 {v0.s}[0], [x0], x1
+ld1 {v0.s}[1], [x0], x1
+ushll   v0.8h, v0.8b, #0
...
+// row[0-1]
+mul v16.8h, v0.8h, v24.8h
Why not MULL?

+ext v21.16b, v0.16b, v1.16b, #8

+mul v17.8h, v21.8h, v24.8h

+orr v0.16b, v1.16b, v1.16b

This is equivalent to MOV; I guess the compiler will emit the right instruction on 
ARM64



+// sum row[0-7]

+dup v18.2d, v16.d[1]

+dup v19.2d, v17.d[1]

+add v16.4h, v16.4h, v18.4h

+add v17.4h, v17.4h, v19.4h

How about ADDP?




At 2021-07-02 01:18:42, "Pop, Sebastian" wrote:

Hi,

the attached patch ports to arm64 the following kernels:

 

luma_vpp[  4x4] 18.77x   27.66   519.22

luma_vpp[  4x8] 22.73x   45.35   1030.72

luma_vpp[ 4x16] 25.10x   82.32   2066.41

 

Ok to commit?

 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port filterPixelToShort

2021-06-24 Thread chen
I have no comments on this patch, thanks.

At 2021-06-25 01:45:03, "Pop, Sebastian" wrote:

Added one missing function:

 

convert_p2s[48x64]  1.56x    300.44  469.25

 

 ___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port filterPixelToShort

2021-06-23 Thread chen
The patch looks good; no more modifications are necessary, thanks.


btw: I guess there are two reasons you didn't see a change with CBNZ. One is 
that 'sub x9' is in the first part of the loop; I would rather move such 
independent instructions into pipeline stall slots. The second is that the loop 
count is not large enough, since this is a small function.

At 2021-06-24 10:34:02, "Pop, Sebastian"  wrote:

Also added cbnz, no perf change.

 ___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port filterPixelToShort

2021-06-23 Thread chen
It looks good to me, thanks.


btw: ARM64 has the new instructions CBZ / CBNZ.

At 2021-06-24 10:11:32, "Pop, Sebastian"  wrote:

I added the following change in the attached patch.

It has better performance with ldp, as it allows the instructions to be rescheduled 
independently:

 

function x265_filterPixelToShort_64x\h\()_neon

 add x3, x3, x3

 sub x3, x3, #0x40

+sub x1, x1, #0x20

 movi v4.8h, #0xe0, lsl #8

 mov x9, #\r

.loop_filterP2S_64x\h:

 subs x9, x9, #1

.rept 2

-ld1 {v0.16b-v3.16b}, [x0], x1

+ldp q0, q1, [x0], #0x20

 ushll   v16.8h, v0.8b, #6

 ushll2  v17.8h, v0.16b, #6

 ushll   v18.8h, v1.8b, #6

 ushll2  v19.8h, v1.16b, #6

-ushll   v20.8h, v2.8b, #6

-ushll2  v21.8h, v2.16b, #6

-ushll   v22.8h, v3.8b, #6

-ushll2  v23.8h, v3.16b, #6

 add v16.8h, v16.8h, v4.8h

 add v17.8h, v17.8h, v4.8h

 add v18.8h, v18.8h, v4.8h

 add v19.8h, v19.8h, v4.8h

+st1 {v16.16b-v19.16b}, [x2], #0x40

+

+ldp q2, q3, [x0]

+add x0, x0, x1

+ushll   v20.8h, v2.8b, #6

+ushll2  v21.8h, v2.16b, #6

+ushll   v22.8h, v3.8b, #6

+ushll2  v23.8h, v3.16b, #6

 add v20.8h, v20.8h, v4.8h

 add v21.8h, v21.8h, v4.8h

 add v22.8h, v22.8h, v4.8h

 add v23.8h, v23.8h, v4.8h

-st1 {v16.16b-v19.16b}, [x2], #0x40

 st1 {v20.16b-v23.16b}, [x2], x3

.endr

 bgt .loop_filterP2S_64x\h

 

Before:

convert_p2s[64x16]  1.46x  105.51  154.37

convert_p2s[64x32]  1.47x  212.07  312.12

convert_p2s[64x48]  1.46x  318.76  466.80

convert_p2s[64x64]  1.47x  425.34  623.56

 

After:

convert_p2s[64x16]  1.47x  105.24  154.46

convert_p2s[64x32]  1.50x  207.42  312.09

convert_p2s[64x48]  1.49x  312.30  466.27

convert_p2s[64x64]  1.50x  415.77  623.56
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port filterPixelToShort

2021-06-23 Thread chen
You are welcome.
On your CPU ldp is still slower, so we can keep the original version and improve it 
again in the future.
This version looks good to me; thank you for your contribution.



At 2021-06-24 10:01:40, "Pop, Sebastian"  wrote:

Thanks again Chen for your careful review and recommendations.

 

I added the following change to the attached patch as we get better performance:

 

--- a/source/common/aarch64/ipfilter8.S

+++ b/source/common/aarch64/ipfilter8.S

@@ -35,14 +35,14 @@ function x265_filterPixelToShort_4x4_neon

 movi v2.8h, #0xe0, lsl #8

 ld1 {v0.s}[0], [x0], x1

 ld1 {v0.s}[1], [x0], x1

-ld1 {v1.s}[2], [x0], x1

-ld1 {v1.s}[3], [x0], x1

 ushll   v3.8h, v0.8b, #6

-ushll2  v4.8h, v1.16b, #6

 add v3.8h, v3.8h, v2.8h

-add v4.8h, v4.8h, v2.8h

 st1 {v3.d}[0], [x2], x3

 st1 {v3.d}[1], [x2], x3

+ld1 {v1.s}[0], [x0], x1

+ld1 {v1.s}[1], [x0], x1

+ushll   v4.8h, v1.8b, #6

+add v4.8h, v4.8h, v2.8h

 st1 {v4.d}[0], [x2], x3

 st1 {v4.d}[1], [x2], x3

 ret

 

Before:

convert_p2s[  4x4]   1.20x  4.99  6.01

 

After:

convert_p2s[  4x4]   1.38x  4.20  5.78

 

I tried the ldp with post-increment as you recommended.

Performance is slightly lower with the change:

 

function x265_filterPixelToShort_64x\h\()_neon

 add x3, x3, x3

 sub x3, x3, #0x40

+sub x1, x1, #0x20

 movi v4.8h, #0xe0, lsl #8

 mov x9, #\r

.loop_filterP2S_64x\h:

 subs x9, x9, #1

.rept 2

-ld1 {v0.16b-v3.16b}, [x0], x1

+ldp q0, q1, [x0], #0x20

+ldp q0, q1, [x0]

+add x0, x0, x1

 ushll   v16.8h, v0.8b, #6

 ushll2  v17.8h, v0.16b, #6

 ushll   v18.8h, v1.8b, #6

 

Before:

convert_p2s[64x16]  1.46x  105.52  154.47

convert_p2s[64x32]  1.47x  212.06  312.14

convert_p2s[64x48]  1.47x  318.75  467.61

convert_p2s[64x64]  1.46x  425.61  622.36

 

After:

convert_p2s[64x16]  1.42x  108.41  154.37

convert_p2s[64x32]  1.45x  215.18  312.12

convert_p2s[64x48]  1.44x  325.01  468.76

convert_p2s[64x64]  1.44x  432.46  622.36

 ___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port filterPixelToShort

2021-06-23 Thread chen
Could you please also try the suggestions from my last email? Thanks.

At 2021-06-24 09:09:09, "Pop, Sebastian"  wrote:

> +.macro filterPixelToShort_64xN h

> +function x265_filterPixelToShort_64x\h\()_neon

> +add x3, x3, x3

> +sub x3, x3, #0x40

> +movi v4.8h, #0xe0, lsl #8

> +.rept \h

> I guess unrolling N times is not a good idea: the code section becomes too large, 
> which will most probably cause cache flushes and misses.

 

Please see attached the amended patch to include the loop.

Ok to commit?

 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [arm64] port filterPixelToShort

2021-06-23 Thread chen
Thanks for your response; comments inline.

At 2021-06-24 08:57:20, "Pop, Sebastian"  wrote:

Hi Chen,

 

Thanks for your review!

 

> +function x265_filterPixelToShort_4x4_neon

> +add x3, x3, x3

> +movi v2.8h, #0xe0, lsl #8

> Does your compiler not handle the constant 0xe000 automatically? It is more 
> readable

 

GNU assembler errors with that immediate:

ipfilter8.S:35: Error: immediate value out of range -128 to 255 at operand 2 -- 
`movi v2.8h,#0xe000'




Looks like an old binutils issue, so we can keep your original version.
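
For context on what the two spellings of the constant mean, here is a small 
Python sketch of the per-pixel arithmetic in these kernels as it appears in the 
quoted code (USHLL by 6, then ADD of the MOVI constant). The helper names are 
made up for this illustration, and reading the 16-bit result as signed is an 
assumption of the sketch.

```python
def movi_8h_lsl8(imm8):
    """Model of MOVI Vd.8h, #imm8, lsl #8: the 8-bit immediate shifted into the high byte."""
    return (imm8 << 8) & 0xFFFF

def to_int16(x):
    """Reinterpret a 16-bit lane as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def convert_p2s(pixel):
    """USHLL #6 on an 8-bit pixel, then a wrapping 16-bit ADD of the constant."""
    widened = (pixel << 6) & 0xFFFF
    return to_int16(widened + movi_8h_lsl8(0xE0))

print(hex(movi_8h_lsl8(0xE0)), to_int16(0xE000))
print(convert_p2s(0), convert_p2s(128), convert_p2s(255))
```

In other words, `#0xe0, lsl #8` is the same 16-bit value as 0xe000, which as a 
signed lane is an offset of -8192.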




 

> 

> +ld1 {v0.s}[0], [x0], x1

> +ld1 {v0.s}[1], [x0], x1

> +ld1 {v1.s}[2], [x0], x1

> Why not v0.s?

 

It is slightly faster to use an independent register for the upper part:

 

when using {v0.s}[2] and {v0.s}[3]

convert_p2s[  4x4]  1.13x    5.35    6.03

 

performance is lower than when using {v1.s}[2] and {v1.s}[3]

convert_p2s[  4x4]  1.21x    4.99    6.03




Yes, here an independent register may be faster, but we can load into the lower part 
and use ushll later; using the high part of a register directly may create a false 
register dependency path.




> 

> +ld1 {v1.s}[3], [x0], x1

> 

> +.macro filterPixelToShort_32xN h

> +function x265_filterPixelToShort_32x\h\()_neon

> +add x3, x3, x3

> +movi v6.8h, #0xe0, lsl #8

> +.rept \h

> +ld1 {v0.16b-v1.16b}, [x0], x1

> ldp may provide more bandwidth

 

ld1 could be replaced with ldp + add, like this:

 

ldp q0, q1, [x0]

add x0, x0, x1

 

convert_p2s[ 32x8]  1.39x    26.62   37.07

convert_p2s[32x16]  1.42x    53.19   75.58

convert_p2s[32x24]  1.41x    80.23   113.11

convert_p2s[32x32]  1.42x    107.08  151.63

convert_p2s[32x64]  1.41x    215.11  303.37

 

Performance with ldp + add is lower than with ld1:

 

convert_p2s[ 32x8]  1.48x    25.00   37.06

convert_p2s[32x16]  1.49x    50.64   75.56

convert_p2s[32x24]  1.48x    76.46   113.31

convert_p2s[32x32]  1.49x    101.97  151.63

convert_p2s[32x64]  1.48x    205.15  303.31




An ldp immediately followed by an add may cause a pipeline stall or a similar issue; 
if there is no better choice, we can keep the original version.




> 

> +.macro filterPixelToShort_64xN h

> +function x265_filterPixelToShort_64x\h\()_neon

> +add x3, x3, x3

> +sub x3, x3, #0x40

> +movi v4.8h, #0xe0, lsl #8

> +.rept \h

> I guess unrolling N times is not a good idea: the code section becomes too large, 
> which will most probably cause cache flushes and misses.

 

Performance is slightly lower with a loop, i.e., with this change:

 

--- a/source/common/aarch64/ipfilter8.S

+++ b/source/common/aarch64/ipfilter8.S

@@ -173,12 +173,15 @@ filterPixelToShort_32xN 24

filterPixelToShort_32xN 32

filterPixelToShort_32xN 64

 

-.macro filterPixelToShort_64xN h

+.macro filterPixelToShort_64xN h r

function x265_filterPixelToShort_64x\h\()_neon

 add x3, x3, x3

sub x3, x3, #0x40

 movi v4.8h, #0xe0, lsl #8

-.rept \h

+mov x9, #\r

+.loop_filterP2S_64x\h:

+subs x9, x9, #1

+.rept 2

 ld1 {v0.16b-v3.16b}, [x0], x1

 ushll   v16.8h, v0.8b, #6

 ushll2  v17.8h, v0.16b, #6

@@ -199,14 +202,15 @@ function x265_filterPixelToShort_64x\h\()_neon

 st1 {v16.16b-v19.16b}, [x2], #0x40

 st1 {v20.16b-v23.16b}, [x2], x3

.endr

+bgt .loop_filterP2S_64x\h

 ret

endfunc

.endm

 

-filterPixelToShort_64xN 16

-filterPixelToShort_64xN 32

-filterPixelToShort_64xN 48

-filterPixelToShort_64xN 64

+filterPixelToShort_64xN 16 8

+filterPixelToShort_64xN 32 16

+filterPixelToShort_64xN 48 24

+filterPixelToShort_64xN 64 32

 

.macro qpel_filter_0_32b

 movi v24.8h, #64

 

With the above change adding a loop I get

convert_p2s[64x16]  1.46x  105.52  154.34

convert_p2s[64x32]  1.47x  212.07  311.71

convert_p2s[64x48]  1.47x  318.75  468.04

convert_p2s[64x64]  1.46x  425.61  622.25

 

whereas with the fully unrolled version performance is slightly higher:

convert_p2s[64x16]  1.48x  104.14  154.36

convert_p2s[64x32]  1.49x  209.43  312.13

convert_p2s[64x48]  1.48x  315.33  466.37

convert_p2s[64x64]1.49x  4

Re: [x265] [arm64] port filterPixelToShort

2021-06-23 Thread chen
Hi Sebastian,


Thanks for your patch.
I have some comments.



+function x265_filterPixelToShort_4x4_neon

+add x3, x3, x3

+movi v2.8h, #0xe0, lsl #8

Does your compiler not handle the constant 0xe000 automatically? It is more readable


+ld1 {v0.s}[0], [x0], x1
+ld1 {v0.s}[1], [x0], x1
+ld1 {v1.s}[2], [x0], x1
Why not v0.s?


+ld1 {v1.s}[3], [x0], x1



+.macro filterPixelToShort_32xN h

+function x265_filterPixelToShort_32x\h\()_neon

+add x3, x3, x3

+movi v6.8h, #0xe0, lsl #8

+.rept \h

+ld1 {v0.16b-v1.16b}, [x0], x1

ldp may provide more bandwidth


+.macro filterPixelToShort_64xN h
+function x265_filterPixelToShort_64x\h\()_neon
+add x3, x3, x3
+sub x3, x3, #0x40
+movi v4.8h, #0xe0, lsl #8
+.rept \h
I guess unrolling N times is not a good idea: the code section becomes too large, 
which will most probably cause cache flushes and misses.


+ld1 {v0.16b-v3.16b}, [x0], x1
+ushll   v16.8h, v0.8b, #6
+ushll2  v17.8h, v0.16b, #6
+ushll   v18.8h, v1.8b, #6
+ushll2  v19.8h, v1.16b, #6
+ushll   v20.8h, v2.8b, #6
+ushll2  v21.8h, v2.16b, #6
+ushll   v22.8h, v3.8b, #6
+ushll2  v23.8h, v3.16b, #6
+add v16.8h, v16.8h, v4.8h
+add v17.8h, v17.8h, v4.8h
+add v18.8h, v18.8h, v4.8h
+add v19.8h, v19.8h, v4.8h
+add v20.8h, v20.8h, v4.8h
+add v21.8h, v21.8h, v4.8h
+add v22.8h, v22.8h, v4.8h
+add v23.8h, v23.8h, v4.8h
+st1 {v16.16b-v19.16b}, [x2], #0x40
ldp may reduce pipeline stalls and provide more bandwidth


+st1 {v20.16b-v23.16b}, [x2], x3
+.endr
+ret
+endfunc
+.endm












At 2021-06-24 07:52:22, "Pop, Sebastian" wrote:

Hi,

 

The attached patch ports filterPixelToShort to arm64.

Tested on graviton2 arm64-linux.

 

convert_p2s[  4x4]  1.21x    4.98    6.03
convert_p2s[  8x8]  2.20x    6.20    13.65
convert_p2s[16x16]  1.54x    25.24   38.94
convert_p2s[32x32]  1.49x    101.99  151.63
convert_p2s[64x64]  1.48x    420.31  622.36
convert_p2s[  8x4]  2.18x    3.05    6.64
convert_p2s[  4x8]  1.91x    6.01    11.49
convert_p2s[ 16x8]  1.47x    12.19   17.92
convert_p2s[ 8x16]  1.95x    13.30   25.94
convert_p2s[32x16]  1.49x    50.63   75.58
convert_p2s[16x32]  1.56x    49.92   77.66
convert_p2s[64x32]  1.49x    209.43  312.13
convert_p2s[32x64]  1.48x    205.16  304.53
convert_p2s[16x12]  1.65x    17.62   29.08
convert_p2s[12x16]  6.22x    24.07   149.61
convert_p2s[ 16x4]  1.60x    5.37    8.59
convert_p2s[ 4x16]  1.75x    13.58   23.73
convert_p2s[32x24]  1.48x    76.47   113.22
convert_p2s[24x32]  2.69x    78.12   210.52
convert_p2s[ 32x8]  1.48x    25.00   37.06
convert_p2s[ 8x32]  1.63x    29.10   47.46
convert_p2s[64x48]  1.48x    314.74  466.77
convert_p2s[64x16]  1.48x    104.13  154.48
convert_p2s[16x64]  1.58x    98.66   155.67

 

Ok to commit?

 

Thanks,

Sebastian___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] NASM 2.15.03 (MYS2/MinGW) throws a huge amount of macro warnings

2021-04-09 Thread chen
Hello,


Sorry for the delay.
I have fixed these warnings with the new nasm version in my local tree, but I 
don't know how to merge the fix into the current x265 tree; please wait for the 
x265 team to fix these issues.


Regards,
Min Chen

At 2021-04-09 15:48:37, "Mario *LigH* Rohkrämer"  wrote:
>Repeating my request from 7 (seven) months ago.
>
>Attaching a log so you may better understand the amount I'm talking about.
>
>I doubt you prefer me posting it inline in the mail body. The log 
>contained in the ZIP archive is 9 MB large.
>
>
>Nomis101 schrieb am 04.09.2020 um 20:43:
>> Hi Min,
>> 
>> thanks for looking into it. My patch is just a plain copy of that what 
>> was done for x264
>> https://code.videolan.org/videolan/x264/-/merge_requests/31/diffs?commit_id=d78c1e83a1a9d34857eb53294282b3fbca3aba18
>>  
>> 
>> and for FFmpeg
>> https://github.com/FFmpeg/FFmpeg/commit/bb3490e7f9645babab4cf84fdb2b2dd4922d81a6
>>  
>> 
>> 
>> And then I checked that it builds. If you need some modifications, I 
>> hope somebody can help out here, because I'm not such an ASM person. :-(
>> 
>> Regards
>> 
>> 
>> 
>> Am 04.09.20 um 01:45 schrieb chen:
>>> Hi,
>>>
>>>
>>> I quickly reviewed the patch. It looks like DEFINE_ARGS_INTERNAL can be
>>> modified rather than removed and inlined; the other parts have a similar issue.
>>>
>>>
>>> btw: the last macro parameter includes all of the parameters that remain
>>> after it.
>>>
>>> For example, with
>>>
>>> %macro test 1-3+
>>>
>>>
>>> %3 will include %4 and everything after it.
>>>
>>>
>>> Regards,
>>>
>>> Min Chen
>>>
>>>
>>> At 2020-09-03 23:56:13, "Nomis101"  wrote:
>>>> Am 03.09.20 um 15:28 schrieb Mario *LigH* Rohkrämer:
>>>>> In the meantime, MSYS2 provides NASM 2.15.04; same output.
>>>>>
>>>>
>>>> I had a patch for this in this list. Maybe you could try if this 
>>>> patch will fix the issue for you.
>>>> https://mailman.videolan.org/pipermail/x265-devel/2020-July/013062.html
>>>>
>
>
>-- 
>
>Fun and success!
>
>Mario *LigH* Rohkrämer
>maito:cont...@ligh.de


Re: [x265] NASM 2.15.03 (MYS2/MinGW) throws a huge amount of macro warnings

2020-09-03 Thread chen
Hi,




I quickly reviewed the patch. It looks like DEFINE_ARGS_INTERNAL can be modified
rather than removed and inlined; the other parts have a similar issue.


btw: the last macro parameter includes all of the parameters that remain after it.

For example, with

%macro test 1-3+


%3 will include %4 and everything after it.




Regards,

Min Chen


At 2020-09-03 23:56:13, "Nomis101"  wrote:
>Am 03.09.20 um 15:28 schrieb Mario *LigH* Rohkrämer:
>> In the meantime, MSYS2 provides NASM 2.15.04; same output.
>> 
>
>I had a patch for this in this list. Maybe you could try if this patch will 
>fix the issue for you.
>https://mailman.videolan.org/pipermail/x265-devel/2020-July/013062.html
>


Re: [x265] arm64 neon asm optimizations

2020-08-26 Thread chen
Hi Gopi, thank you for helping to review these patches.

At 2020-08-27 00:42:11, "Gopi Satykrishna Akisetty" 
 wrote:

Hi Min,


On Thu, Aug 20, 2020 at 7:48 AM chen  wrote:

Hi Damiano,


Thank you for the information.


I took a quick look. It is based on intrinsics, so the performance depends
strongly on the compiler.
However, it is a good starting point for improving our ARM64 performance.


Can any x265 team member work on it?
Yes, I will start going through the changes and can help in reviewing the 
patches.
I am also glad to help review during my unpaid leave, which starts in September
2020.


Regards,
Min Chen


At 2020-08-20 00:01:52, "Damiano Galassi"  wrote:
>Hi, Apple contributed to the HandBrake project a x265 patch
>with a bunch of neon asm to improve x265 performance on Apple’s upcoming ARM 
>Macs,
>but I don’t have the expertise or the time to review these changes, all I can 
>do is send this patch to you.
>The code contributed by Apple is under the MIT license and without any
>"copyright assertion on it".
>
>The NEON asm speedup is almost 2x, so I hope someone will try to integrate
>it.
>The additions are viewable on 
>https://github.com/HandBrake/HandBrake/commit/67a490a171764eafe8eb8afb72fa1dff763a3275
>
>Regards,
>Damiano Galassi


Re: [x265] arm64 neon asm optimizations

2020-08-19 Thread chen
Hi Damiano,


Thank you for the information.


I took a quick look. It is based on intrinsics, so the performance depends
strongly on the compiler.
However, it is a good starting point for improving our ARM64 performance.


Can any x265 team member work on it?
I am also glad to help review during my unpaid leave, which starts in September
2020.


Regards,
Min Chen


At 2020-08-20 00:01:52, "Damiano Galassi"  wrote:
>Hi, Apple contributed to the HandBrake project a x265 patch
>with a bunch of neon asm to improve x265 performance on Apple’s upcoming ARM 
>Macs,
>but I don’t have the expertise or the time to review these changes, all I can 
>do is send this patch to you.
>The code contributed by Apple is under the MIT license and without any
>"copyright assertion on it".
>
>The NEON asm speedup is almost 2x, so I hope someone will try to integrate
>it.
>The additions are viewable on 
>https://github.com/HandBrake/HandBrake/commit/67a490a171764eafe8eb8afb72fa1dff763a3275
>
>Regards,
>Damiano Galassi


Re: [x265] [PATCH] Add aarch64 support - Part 2

2020-03-17 Thread chen
Hi Xiyuan,


I have forwarded the email to you directly.


Regards,
Min Chen

At 2020-03-18 09:38:18, "Xiyuan Wang" wrote:

Hi Chen,
We didn't receive your reply about Part-1. Can you resend it? Maybe the
content was too large and the mailing list blocked it. You can just quote the
code where you have questions.


Thanks.


chen wrote on Wed, Mar 18, 2020 at 9:07 AM:









On Tue, Mar 17, 2020 at 2:59 PM Suyimeng  wrote:


 

From: x265-devel [mailto:x265-devel-boun...@videolan.org] On Behalf Of Gopi 
Satykrishna Akisetty
Sent: Tuesday, March 17, 2020 4:53 PM
To: Development for x265 
Subject: Re: [x265] [PATCH] Add aarch64 support - Part 2

 

diff --git a/source/common/pixel.cpp b/source/common/pixel.cpp
index 99b84449c..e4f890cd5 100644
--- a/source/common/pixel.cpp
+++ b/source/common/pixel.cpp
@@ -5,6 +5,7 @@
  *  Mandar Gurav 
  *  Mahesh Pittala 
  *  Min Chen 
+ *  Hongbin Liu
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -265,6 +266,10 @@ int satd4(const pixel* pix1, intptr_t stride_pix1, const 
pixel* pix2, intptr_t s
 {
 int satd = 0;

+#if ENABLE_ASSEMBLY && X265_ARCH_ARM64
+pixelcmp_t satd_4x4 = x265_pixel_satd_4x4_neon;
+#endif

Is there any specific reason why the above code is added? Is this a kind of
temporary fix for an output mismatch between the C and asm code?

No, the C and asm outputs match. Currently we have completed only some of the
satd primitives. This is a workaround that speeds up all satd primitives with
asm code. Maybe it is bad code style.

If I understand correctly, you are trying to use a combination of C and asm
code for all the other kernel sizes for which you have not completed an asm
implementation yet?

Yes, you are right.

OK. If this code block is going to be removed in future patches, where you
will implement the asm for the remaining satd kernels, then this patch is
good to be pushed.


Before pushing the patches, I want to double-check: what about the response to
my review of Part-1?
I am not sure whether I missed those emails or my post is still pending.




Re: [x265] [PATCH] Add aarch64 support - Part 2

2020-03-17 Thread chen







On Tue, Mar 17, 2020 at 2:59 PM Suyimeng  wrote:


 

From: x265-devel [mailto:x265-devel-boun...@videolan.org] On Behalf Of Gopi 
Satykrishna Akisetty
Sent: Tuesday, March 17, 2020 4:53 PM
To: Development for x265 
Subject: Re: [x265] [PATCH] Add aarch64 support - Part 2

 

diff --git a/source/common/pixel.cpp b/source/common/pixel.cpp
index 99b84449c..e4f890cd5 100644
--- a/source/common/pixel.cpp
+++ b/source/common/pixel.cpp
@@ -5,6 +5,7 @@
  *  Mandar Gurav 
  *  Mahesh Pittala 
  *  Min Chen 
+ *  Hongbin Liu
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -265,6 +266,10 @@ int satd4(const pixel* pix1, intptr_t stride_pix1, const 
pixel* pix2, intptr_t s
 {
 int satd = 0;

+#if ENABLE_ASSEMBLY && X265_ARCH_ARM64
+pixelcmp_t satd_4x4 = x265_pixel_satd_4x4_neon;
+#endif

Is there any specific reason why the above code is added? Is this a kind of
temporary fix for an output mismatch between the C and asm code?

No, the C and asm outputs match. Currently we have completed only some of the
satd primitives. This is a workaround that speeds up all satd primitives with
asm code. Maybe it is bad code style.

If I understand correctly, you are trying to use a combination of C and asm
code for all the other kernel sizes for which you have not completed an asm
implementation yet?

Yes, you are right.

OK. If this code block is going to be removed in future patches, where you
will implement the asm for the remaining satd kernels, then this patch is
good to be pushed.


Before pushing the patches, I want to double-check: what about the response to
my review of Part-1?
I am not sure whether I missed those emails or my post is still pending.



Re: [x265] [PATCH]Add: Auto AQ Mode

2020-02-27 Thread chen



At 2020-02-27 16:59:18, "Niranjan Bala"  wrote:

+double computeBrightnessIntensity(pixel *inPlane, int width, int height, 
intptr_t stride)
+{
+pixel* rowStart = inPlane;
Declaring this pointer const (and restrict) may be better.

+double count = 0;
Why declare this as double?

+
+for (int i = 0; i < height; i++)
+{
+for (int j = 0; j < width; j++)
+{
+if (rowStart[j] > BRIGHTNESS_THRESHOLD)
+count++;
+}
+rowStart += stride;
+}
+
+/* Returns the brightness percentage of the input plane */
+return (count / (width * height)) * 100;
+}
+
+double computeEdgeIntensity(pixel *inPlane, int width, int height, intptr_t 
stride)
+{
+pixel* rowStart = inPlane;
+double count = 0;
+
+for (int i = 0; i < height; i++)
+{
+for (int j = 0; j < width; j++)
+{
+if (rowStart[j] > 0)
+count++;
+}
+rowStart += stride;
+}
+
+/* Returns the edge percentage of the input plane */
+return (count / (width * height)) * 100;
100 is an integer literal, but it is multiplied with a double.
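
The three review comments above can be put together in one small sketch. This is not the patch author's code: the name brightnessPercent, the 8-bit pixel typedef and the threshold value are assumptions made only for illustration. The input pointer is const, the counter is an exact integer, and the conversion to double happens once with a floating-point literal.

```cpp
#include <cstdint>

typedef uint8_t pixel;                          // assumption: 8-bit build
static const pixel BRIGHTNESS_THRESHOLD = 120;  // assumption: placeholder value

// Percentage of samples above the threshold. The plane is only read, so the
// pointer is const. The counter stays an integer and is converted to double
// once, in the final expression.
double brightnessPercent(const pixel* plane, int width, int height, intptr_t stride)
{
    uint32_t count = 0;
    const pixel* row = plane;

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            count += (row[x] > BRIGHTNESS_THRESHOLD) ? 1u : 0u;
        row += stride;
    }

    // 100.0 is a double literal, so no implicit int-to-double promotion is needed.
    return 100.0 * count / ((double)width * height);
}
```

The same shape would apply to computeEdgeIntensity, with the comparison changed to row[x] > 0.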




[x265] [PATCH] Improve all_angs_pred_c by remove unnecessary transpose

2019-11-04 Thread chen
From 7e495390396d6a55f95ad4649e46b56fd7d2ef1c Mon Sep 17 00:00:00 2001
From: Min Chen 
Date: Mon, 4 Nov 2019 16:21:20 +0800
Subject: [PATCH] Improve all_angs_pred_c by remove unnecessary transpose


---
 source/common/intrapred.cpp | 22 +++---
 1 file changed, 3 insertions(+), 19 deletions(-)


diff --git a/source/common/intrapred.cpp b/source/common/intrapred.cpp
index 0b65ccf..2fb4eb5 100644
--- a/source/common/intrapred.cpp
+++ b/source/common/intrapred.cpp
@@ -99,7 +99,7 @@ void planar_pred_c(pixel* dst, intptr_t dstStride, const 
pixel* srcPix, int /*di
 dst[y * dstStride + x] = (pixel) (((blkSize - 1 - x) * left[y] + 
(blkSize - 1 -y) * above[x] + (x + 1) * topRight + (y + 1) * bottomLeft + 
blkSize) >> (log2Size + 1));
 }
 
-template
+template
 void intra_pred_ang_c(pixel* dst, intptr_t dstStride, const pixel *srcPix0, 
int dirMode, int bFilter)
 {
 int width2 = width << 1;
@@ -189,7 +189,7 @@ void intra_pred_ang_c(pixel* dst, intptr_t dstStride, const 
pixel *srcPix0, int
 }
 
 // Flip for horizontal.
-if (horMode)
+if (!disableTranspose && horMode)
 {
 for (int y = 0; y < width - 1; y++)
 {
@@ -212,24 +212,8 @@ void all_angs_pred_c(pixel *dest, pixel *refPix, pixel 
*filtPix, int bLuma)
 pixel *srcPix  = (g_intraFilterFlags[mode] & size ? filtPix  : refPix);
 pixel *out = dest + ((mode - 2) << (log2Size * 2));
 
-intra_pred_ang_c(out, size, srcPix, mode, bLuma);
-
 // Optimize code don't flip buffer
-bool modeHor = (mode < 18);
-
-// transpose the block if this is a horizontal mode
-if (modeHor)
-{
-for (int k = 0; k < size - 1; k++)
-{
-for (int l = k + 1; l < size; l++)
-{
-pixel tmp = out[k * size + l];
-out[k * size + l] = out[l * size + k];
-out[l * size + k] = tmp;
-}
-}
-}
+intra_pred_ang_c(out, size, srcPix, mode, bLuma);
 }
 }
 }
-- 
2.9.0.windows.1



0001-Improve-all_angs_pred_c-by-remove-unnecessary-transp.patch
Description: Binary data
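
As the diff reads, the predictor used to flip horizontal modes and the caller then transposed the block back, so the two passes cancelled out. A minimal sketch of that idea follows; predictBlock and transposeBlock are hypothetical stand-ins, not the x265 functions, and the template parameter list is an assumption because the archive lost the original one.

```cpp
#include <cstdint>
#include <utility>

typedef uint8_t pixel;  // assumption: 8-bit build

// In-place transpose of a size x size block.
template<int size>
void transposeBlock(pixel* blk)
{
    for (int k = 0; k < size - 1; k++)
        for (int l = k + 1; l < size; l++)
            std::swap(blk[k * size + l], blk[l * size + k]);
}

// Hypothetical stand-in for a predictor: it writes a simple ramp and finishes
// with a flip unless the caller opts out at compile time.
template<int size, bool disableTranspose>
void predictBlock(pixel* dst)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * size + x] = (pixel)(y * size + x);

    if (!disableTranspose)
        transposeBlock<size>(dst);
}
```

Calling predictBlock<4, false> followed by transposeBlock<4> gives the same block as a single predictBlock<4, true> call, which is why both passes can be dropped.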


Re: [x265] [x265 patch] Adaptive Frame Duplication

2019-09-22 Thread chen


At 2019-09-23 12:50:22, "Akil"  wrote:

# HG changeset patch
# User Akil Ayyappan
# Date 1568370446 -19800
#  Fri Sep 13 15:57:26 2019 +0530
# Node ID 531f6b03eed0a40a38d3589dec03f14743293146
# Parent  c4b098f973e6b0ee4aee3bf0d7b54da4e2734d42
Adaptive Frame duplication
+uint32_t y = 0;
+
+/* Consume rows in ever narrower chunks of height */
+for (int size = BLOCK_64x64; size >= BLOCK_4x4 && y < height; size--)
+{
+uint32_t rowHeight = 1 << (size + 2);
+
+for (; y + rowHeight <= height; y += rowHeight)
+{
+uint32_t y1, x = 0;
+
+/* Consume each row using the largest square blocks possible */
+if (size == BLOCK_64x64 && !(stride & 31))
+for (; x + 64 <= width; x += 64)
+ssd += primitives.cu[BLOCK_64x64].sse_pp(fenc + x, stride, 
rec + x, stride);
+
+if (size >= BLOCK_32x32 && !(stride & 15))
+for (; x + 32 <= width; x += 32)
+for (y1 = 0; y1 + 32 <= rowHeight; y1 += 32)
+ssd += primitives.cu[BLOCK_32x32].sse_pp(fenc + y1 * 
stride + x, stride, rec + y1 * stride + x, stride);
+
+if (size >= BLOCK_16x16)
+for (; x + 16 <= width; x += 16)
+for (y1 = 0; y1 + 16 <= rowHeight; y1 += 16)
+ssd += primitives.cu[BLOCK_16x16].sse_pp(fenc + y1 * 
stride + x, stride, rec + y1 * stride + x, stride);
+
+if (size >= BLOCK_8x8)
+for (; x + 8 <= width; x += 8)
+for (y1 = 0; y1 + 8 <= rowHeight; y1 += 8)
+ssd += primitives.cu[BLOCK_8x8].sse_pp(fenc + y1 * 
stride + x, stride, rec + y1 * stride + x, stride);
+
+for (; x + 4 <= width; x += 4)
+for (y1 = 0; y1 + 4 <= rowHeight; y1 += 4)
+ssd += primitives.cu[BLOCK_4x4].sse_pp(fenc + y1 * stride 
+ x, stride, rec + y1 * stride + x, stride);
+
+fenc += stride * rowHeight;
+rec += stride * rowHeight;
+}
+}
+
+return ssd;
+}



You try to process blocks that are as large as possible; however, this coding
style is less readable.
I suggest putting this trick in the optimized version rather than inside the C model.
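
For comparison, a plain C model of the same sum could look like the sketch below. The function name ssdReference and the 8-bit pixel typedef are assumptions; it only illustrates the "simple reference, tricks elsewhere" split suggested above.

```cpp
#include <cstdint>

typedef uint8_t pixel;  // assumption: 8-bit build

// Straightforward sum of squared differences over a width x height region.
// No block-size dispatch: one loop per dimension, easy to check by eye.
uint64_t ssdReference(const pixel* fenc, const pixel* rec, intptr_t stride,
                      uint32_t width, uint32_t height)
{
    uint64_t ssd = 0;
    for (uint32_t y = 0; y < height; y++)
    {
        for (uint32_t x = 0; x < width; x++)
        {
            const int diff = fenc[x] - rec[x];
            ssd += (uint64_t)(diff * diff);
        }
        fenc += stride;
        rec += stride;
    }
    return ssd;
}
```

An optimized version can then tile the picture into the largest available sse_pp blocks and be validated against this loop.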





Re: [x265] why fail to build x265 with vmaf

2019-09-08 Thread chen
Could you please try #include ?


At 2019-09-09 10:18:13, "qw"  wrote:
Hi,


The latest vmaf source code is used, but I still fail to build x265. Below is 
the error message:


Scanning dependencies of target common
[  1%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/pixel-a.asm.o
[  2%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/const-a.asm.o
[  3%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/cpu-a.asm.o
[  4%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/ssd-a.asm.o
[  6%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/mc-a.asm.o
[  7%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/mc-a2.asm.o
[  8%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/pixel-util8.asm.o
[  9%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/blockcopy8.asm.o
[ 10%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/pixeladd8.asm.o
[ 12%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/dct8.asm.o
[ 13%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/seaintegral.asm.o
[ 14%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/sad-a.asm.o
[ 15%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/intrapred8.asm.o
[ 17%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/intrapred8_allangs.asm.o
[ 18%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/v4-ipfilter8.asm.o
[ 19%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/h-ipfilter8.asm.o
[ 20%] Building ASM_NASM object common/CMakeFiles/common.dir/x86/ipfilter8.asm.o
[ 21%] Building ASM_NASM object 
common/CMakeFiles/common.dir/x86/loopfilter.asm.o
[ 23%] Building CXX object common/CMakeFiles/common.dir/x86/asm-primitives.cpp.o
[ 24%] Building CXX object common/CMakeFiles/common.dir/vec/vec-primitives.cpp.o
[ 25%] Building CXX object common/CMakeFiles/common.dir/vec/dct-sse3.cpp.o
[ 26%] Building CXX object common/CMakeFiles/common.dir/vec/dct-ssse3.cpp.o
[ 28%] Building CXX object common/CMakeFiles/common.dir/vec/dct-sse41.cpp.o
[ 29%] Building CXX object common/CMakeFiles/common.dir/primitives.cpp.o
[ 30%] Building CXX object common/CMakeFiles/common.dir/pixel.cpp.o
[ 31%] Building CXX object common/CMakeFiles/common.dir/dct.cpp.o
[ 32%] Building CXX object common/CMakeFiles/common.dir/lowpassdct.cpp.o
[ 34%] Building CXX object common/CMakeFiles/common.dir/ipfilter.cpp.o
[ 35%] Building CXX object common/CMakeFiles/common.dir/intrapred.cpp.o
[ 36%] Building CXX object common/CMakeFiles/common.dir/loopfilter.cpp.o
[ 37%] Building CXX object common/CMakeFiles/common.dir/constants.cpp.o
[ 39%] Building CXX object common/CMakeFiles/common.dir/cpu.cpp.o
[ 40%] Building CXX object common/CMakeFiles/common.dir/version.cpp.o
[ 41%] Building CXX object common/CMakeFiles/common.dir/threading.cpp.o
[ 42%] Building CXX object common/CMakeFiles/common.dir/threadpool.cpp.o
[ 43%] Building CXX object common/CMakeFiles/common.dir/wavefront.cpp.o
[ 45%] Building CXX object common/CMakeFiles/common.dir/md5.cpp.o
[ 46%] Building CXX object common/CMakeFiles/common.dir/bitstream.cpp.o
[ 47%] Building CXX object common/CMakeFiles/common.dir/yuv.cpp.o
[ 48%] Building CXX object common/CMakeFiles/common.dir/shortyuv.cpp.o
[ 50%] Building CXX object common/CMakeFiles/common.dir/picyuv.cpp.o
[ 51%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
[ 52%] Building CXX object common/CMakeFiles/common.dir/param.cpp.o
[ 53%] Building CXX object common/CMakeFiles/common.dir/frame.cpp.o
[ 54%] Building CXX object common/CMakeFiles/common.dir/framedata.cpp.o
[ 56%] Building CXX object common/CMakeFiles/common.dir/cudata.cpp.o
[ 57%] Building CXX object common/CMakeFiles/common.dir/slice.cpp.o
[ 58%] Building CXX object common/CMakeFiles/common.dir/lowres.cpp.o
[ 59%] Building CXX object common/CMakeFiles/common.dir/piclist.cpp.o
[ 60%] Building CXX object common/CMakeFiles/common.dir/predict.cpp.o
[ 62%] Building CXX object common/CMakeFiles/common.dir/scalinglist.cpp.o
[ 63%] Building CXX object common/CMakeFiles/common.dir/quant.cpp.o
[ 64%] Building CXX object common/CMakeFiles/common.dir/deblock.cpp.o
[ 64%] Built target common
Scanning dependencies of target encoder
[ 65%] Building CXX object encoder/CMakeFiles/encoder.dir/analysis.cpp.o
[ 67%] Building CXX object encoder/CMakeFiles/encoder.dir/search.cpp.o
[ 68%] Building CXX object encoder/CMakeFiles/encoder.dir/bitcost.cpp.o
[ 69%] Building CXX object encoder/CMakeFiles/encoder.dir/motion.cpp.o
[ 70%] Building CXX object encoder/CMakeFiles/encoder.dir/slicetype.cpp.o
[ 71%] Building CXX object encoder/CMakeFiles/encoder.dir/frameencoder.cpp.o
[ 73%] Building CXX object encoder/CMakeFiles/encoder.dir/framefilter.cpp.o
[ 74%] Building CXX object encoder/CMakeFiles/encoder.dir/level.cpp.o
[ 75%] Building CXX object encoder/CMakeFiles/encoder.dir/nal.cpp.o
[ 76%] Building CXX object encoder/CMakeFiles/encoder.dir/sei.cpp.o
[ 78%] Building CXX object 

Re: [x265] [x265 patch] New AQ mode with Variance and Edge information

2019-07-15 Thread chen
Thank you.


For the constant, let me explain a little more.
We can declare
const intptr_t row_n2 = (rowNum - 2)*stride;


Now,


src[(rowNum - 2)*stride + (colNum - 2)]
src[(rowNum - 2)*stride + (colNum - 1)] 


==>


src[row_n2 + (colNum - 2)]
src[row_n2 + (colNum - 1)]


This is a little easier to read.
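
A compilable sketch of the rewrite above, limited to the top row of the 5x5 window. The function name topRowSum is hypothetical; the weights 2 4 5 4 2 are the first row of the filter in the patch. The row offset is a per-pixel constant: it is computed once per (rowNum) and reused for all five taps.

```cpp
#include <cstdint>

typedef uint8_t pixel;  // assumption: 8-bit build

// Weighted sum of the top row of a 5x5 window centred on (rowNum, colNum).
int topRowSum(const pixel* src, intptr_t stride, int rowNum, int colNum)
{
    // Hoisted row offset: written once, read five times.
    const intptr_t row_n2 = (rowNum - 2) * stride;

    return 2 * src[row_n2 + (colNum - 2)]
         + 4 * src[row_n2 + (colNum - 1)]
         + 5 * src[row_n2 + colNum]
         + 4 * src[row_n2 + (colNum + 1)]
         + 2 * src[row_n2 + (colNum + 2)];
}
```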


At 2019-07-15 13:58:53, "Akil"  wrote:

Thanks for your suggestions, Chen. Have added the matrix in comments. That 
should make the code more readable. Regarding the last point, I think 
(rowNum+X)*stride cannot be replaced by a constant since it tends to change 
every time.


On Fri, Jul 12, 2019 at 7:27 AM chen  wrote:


On Wed, Jul 10, 2019 at 3:41 PM Akil  wrote:

# HG changeset patch
# User Akil Ayyappan
# Date 1561035091 -19800
#  Thu Jun 20 18:21:31 2019 +0530
# Node ID d25c33cc2b748401c5e908af445a0a110e26c3cf
# Parent  4f6dde51a5db4f9229bddb60db176f16ac98f505
AQ: New AQ mode with Variance and Edge information


+//Applying Gaussian filter on the picture
+src = (pixel*)curFrame->m_fencPic->m_picOrg[0];
+refPic = pic2 + curFrame->m_fencPic->m_lumaMarginY * stride + 
curFrame->m_fencPic->m_lumaMarginX;
+pixel pixelValue = 0;
+for (int rowNum = 0; rowNum < height; rowNum++)
+{
+for (int colNum = 0; colNum < width; colNum++)
+{
+if ((rowNum >= 2) && (colNum >= 2) && (rowNum != height - 2) && 
(colNum != width - 2)) //Ignoring the border pixels of the picture
+{
+pixelValue = ((2 * src[(rowNum - 2)*stride + (colNum - 2)] + 4 
* src[(rowNum - 2)*stride + (colNum - 1)] + 5 * src[(rowNum - 2)*stride + 
(colNum)] + 4 * src[(rowNum - 2)*stride + (colNum + 1)] + 2 * src[(rowNum - 
2)*stride + (colNum + 2)] +
+4 * src[(rowNum - 1)*stride + (colNum - 2)] + 9 * 
src[(rowNum - 1)*stride + (colNum - 1)] + 12 * src[(rowNum - 1)*stride + 
(colNum)] + 9 * src[(rowNum - 1)*stride + (colNum + 1)] + 4 * src[(rowNum - 
1)*stride + (colNum + 2)] +
+5 * src[(rowNum)*stride + (colNum - 2)] + 12 * 
src[(rowNum)*stride + (colNum - 1)] + 15 * src[(rowNum)*stride + (colNum)] + 12 
* src[(rowNum)*stride + (colNum + 1)] + 5 * src[(rowNum)*stride + (colNum + 2)] 
+
+4 * src[(rowNum + 1)*stride + (colNum - 2)] + 9 * 
src[(rowNum + 1)*stride + (colNum - 1)] + 12 * src[(rowNum + 1)*stride + 
(colNum)] + 9 * src[(rowNum + 1)*stride + (colNum + 1)] + 4 * src[(rowNum + 
1)*stride + (colNum + 2)] +
+2 * src[(rowNum + 2)*stride + (colNum - 2)] + 4 * 
src[(rowNum + 2)*stride + (colNum - 1)] + 5 * src[(rowNum + 2)*stride + 
(colNum)] + 4 * src[(rowNum + 2)*stride + (colNum + 1)] + 2 * src[(rowNum + 
2)*stride + (colNum + 2)]) / 159);
+refPic[(rowNum*stride) + colNum] = pixelValue;
+}
+}
+}



Could you please modify this a little?
Indent it, or give the coefficient matrix as a comment; it will be more readable.
Moreover, (rowNum+X)*stride can be replaced by a constant; this does not affect
the performance of the compiled code, but it helps humans read the code.







--

Regards,
Akil R


Re: [x265] [x265 patch] New AQ mode with Variance and Edge information

2019-07-11 Thread chen

On Wed, Jul 10, 2019 at 3:41 PM Akil  wrote:

# HG changeset patch
# User Akil Ayyappan
# Date 1561035091 -19800
#  Thu Jun 20 18:21:31 2019 +0530
# Node ID d25c33cc2b748401c5e908af445a0a110e26c3cf
# Parent  4f6dde51a5db4f9229bddb60db176f16ac98f505
AQ: New AQ mode with Variance and Edge information


+//Applying Gaussian filter on the picture
+src = (pixel*)curFrame->m_fencPic->m_picOrg[0];
+refPic = pic2 + curFrame->m_fencPic->m_lumaMarginY * stride + 
curFrame->m_fencPic->m_lumaMarginX;
+pixel pixelValue = 0;
+for (int rowNum = 0; rowNum < height; rowNum++)
+{
+for (int colNum = 0; colNum < width; colNum++)
+{
+if ((rowNum >= 2) && (colNum >= 2) && (rowNum != height - 2) && 
(colNum != width - 2)) //Ignoring the border pixels of the picture
+{
+pixelValue = ((2 * src[(rowNum - 2)*stride + (colNum - 2)] + 4 
* src[(rowNum - 2)*stride + (colNum - 1)] + 5 * src[(rowNum - 2)*stride + 
(colNum)] + 4 * src[(rowNum - 2)*stride + (colNum + 1)] + 2 * src[(rowNum - 
2)*stride + (colNum + 2)] +
+4 * src[(rowNum - 1)*stride + (colNum - 2)] + 9 * 
src[(rowNum - 1)*stride + (colNum - 1)] + 12 * src[(rowNum - 1)*stride + 
(colNum)] + 9 * src[(rowNum - 1)*stride + (colNum + 1)] + 4 * src[(rowNum - 
1)*stride + (colNum + 2)] +
+5 * src[(rowNum)*stride + (colNum - 2)] + 12 * 
src[(rowNum)*stride + (colNum - 1)] + 15 * src[(rowNum)*stride + (colNum)] + 12 
* src[(rowNum)*stride + (colNum + 1)] + 5 * src[(rowNum)*stride + (colNum + 2)] 
+
+4 * src[(rowNum + 1)*stride + (colNum - 2)] + 9 * 
src[(rowNum + 1)*stride + (colNum - 1)] + 12 * src[(rowNum + 1)*stride + 
(colNum)] + 9 * src[(rowNum + 1)*stride + (colNum + 1)] + 4 * src[(rowNum + 
1)*stride + (colNum + 2)] +
+2 * src[(rowNum + 2)*stride + (colNum - 2)] + 4 * 
src[(rowNum + 2)*stride + (colNum - 1)] + 5 * src[(rowNum + 2)*stride + 
(colNum)] + 4 * src[(rowNum + 2)*stride + (colNum + 1)] + 2 * src[(rowNum + 
2)*stride + (colNum + 2)]) / 159);
+refPic[(rowNum*stride) + colNum] = pixelValue;
+}
+}
+}



Could you please modify this a little?
Indent it, or give the coefficient matrix as a comment; it will be more readable.
Moreover, (rowNum+X)*stride can be replaced by a constant; this does not affect
the performance of the compiled code, but it helps humans read the code.
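
One way to make the coefficient matrix visible is to store it as a table and loop over it. This is only a sketch under assumptions: the names kGauss5x5, kGaussSum and gauss5x5At are hypothetical, the pixel type is taken to be 8-bit, and the weights are copied from the expression in the patch (they sum to 159, the divisor used there).

```cpp
#include <cstdint>

typedef uint8_t pixel;  // assumption: 8-bit build

// 5x5 filter weights as they appear in the patch, one row per line.
static const int kGauss5x5[5][5] = {
    { 2,  4,  5,  4, 2 },
    { 4,  9, 12,  9, 4 },
    { 5, 12, 15, 12, 5 },
    { 4,  9, 12,  9, 4 },
    { 2,  4,  5,  4, 2 },
};
static const int kGaussSum = 159;  // sum of all 25 weights

// Filtered value at (rowNum, colNum). The caller must keep the position at
// least two samples away from every picture border.
pixel gauss5x5At(const pixel* src, intptr_t stride, int rowNum, int colNum)
{
    int sum = 0;
    for (int dy = -2; dy <= 2; dy++)
    {
        // Row pointer computed once per row of the window.
        const pixel* row = src + (rowNum + dy) * stride;
        for (int dx = -2; dx <= 2; dx++)
            sum += kGauss5x5[dy + 2][dx + 2] * row[colNum + dx];
    }
    return (pixel)(sum / kGaussSum);
}
```

A flat input comes back unchanged, and a single sample of value 159 at the centre yields 15, the centre weight.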



Re: [x265] how to build x265 that supports both 8bit and 10bit

2019-05-18 Thread chen
The debug info affects the compiler's code generation, so a little performance
is lost, but we can ignore that since it is not much.
 
Regards,
Min


At 2019-05-18 22:09:06, "qw"  wrote:

If I want to build x265 as a release build with debug info, I will choose the
option CMAKE_BUILD_TYPE=RelWithDebInfo.


Is that right? And will this option never affect x265 performance?




Thanks!


Regards


Andrew





At 2019-05-17 20:01:16, "Mario *LigH* Rohkrämer"  wrote:
>Regarding no optimization: You can generate cmake files with the option 
>"-DENABLE_ASSEMBLY=OFF", which happens for High Bit Depth builds for a 
>32 bit target anyway. Then it contains only C/C++ code.
>
>
>qw schrieb am 17.05.2019 um 13:00:
>> How to build x265 without any optimization and with debug information?
>> 
>> Thanks!
>> 
>> Regard
>> 
>> Andrew
>
>
>-- 
>
>Fun and success!
>
>Mario *LigH* Rohkrämer
>maito:cont...@ligh.de







Re: [x265] how to build x265 that supports both 8bit and 10bit

2019-05-16 Thread chen
Hi,


Could you please try it with multilib.bat?
It is Steve's idea: we build the library twice with different bit depths and
combine these libraries into one library that supports both.


Regards,
Min Chen


At 2019-05-17 11:06:13, "qw"  wrote:
hi,


I read x265 source code, and find one function, as shown below:




int x265_param_apply_profile(x265_param *param, const char *profile)
{
if (!param || !profile)
return 0;


/* Check if profile bit-depth requirement is exceeded by internal bit depth 
*/
bool bInvalidDepth = false;
#if X265_DEPTH > 8
if (!strcmp(profile, "main") || !strcmp(profile, "mainstillpicture") || 
!strcmp(profile, "msp") ||
!strcmp(profile, "main444-8") || !strcmp(profile, "main-intra") ||
!strcmp(profile, "main444-intra") || !strcmp(profile, 
"main444-stillpicture"))
bInvalidDepth = true;
#endif
#if X265_DEPTH > 10
if (!strcmp(profile, "main10") || !strcmp(profile, "main422-10") || 
!strcmp(profile, "main444-10") ||
!strcmp(profile, "main10-intra") || !strcmp(profile, 
"main422-10-intra") || !strcmp(profile, "main444-10-intra"))
bInvalidDepth = true;
#endif
#if X265_DEPTH > 12
if (!strcmp(profile, "main12") || !strcmp(profile, "main422-12") || 
!strcmp(profile, "main444-12") ||
!strcmp(profile, "main12-intra") || !strcmp(profile, 
"main422-12-intra") || !strcmp(profile, "main444-12-intra"))
bInvalidDepth = true;
#endif


if (bInvalidDepth)
{
x265_log(param, X265_LOG_ERROR, "%s profile not supported, internal bit 
depth %d.\n", profile, X265_DEPTH);
return -1;
}


It seems that the logic will report error, when x265 is built with X265_DEPTH = 
10 and profile is of 8bit.


How to make x265 support both 8bit and 10bit?


Thanks!


Regards


Andrew
















Re: [x265] [PATCH x265] Add AVX2 assembly code for normFactor primitive.

2019-03-07 Thread chen
Just say it works.


First of all,
the expected algorithm is the square of (x >> shift).
It is an 8-bit by 8-bit multiplication (I assume we are talking about 8bpp;
16bpp is similar), so the result is 16 bits.
The function works at CU level, so blockSize is at most 64, or 6 bits per dimension.
So we can determine that the maximum dynamic range is 16+6+6 = 28 bits.


In this way, the uint64_t output is unnecessary in 8bpp mode.
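
The bound can be checked with a few lines of arithmetic. The helper below is hypothetical and assumes the worst case of shift == 0 with every sample at 255.

```cpp
#include <cstdint>

// Largest value the sum of squares can reach for 8-bit samples and a square
// block of blockSize x blockSize. Assumption: shift == 0 (worst case).
uint64_t normFactWorstCase8bpp(uint32_t blockSize)
{
    const uint64_t maxSquare = 255u * 255u;     // 65025, fits in 16 bits
    return maxSquare * blockSize * blockSize;   // 64 * 64 = 2^12 terms at most
}
```

For blockSize = 64 this gives 65025 * 4096 = 266342400, which is below 2^28 = 268435456, so a 32-bit accumulator is enough in 8bpp mode.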


Moreover, PMOVZXBD+VPMULDQ can be replaced by PMOVZXBW+PMADDWD (please remember
that PMADDUBSW treats only one of its inputs as unsigned);
this may increase processing throughput 3~4 times.
I don't know why VPMULLD was not used; it would almost double the performance.


Further, the VPSRLDQ is unnecessary; it is only there because VPMULDQ was chosen:


+    vpmuldq    m2,  m1,  m1
+    vpsrldq    m1,  m1,  4
+    vpmuldq    m1,  m1,  m1




Regards,
Min


At 2019-03-07 17:36:19, "Dinesh Kumar Reddy"  
wrote:

+static void normFact_c(const pixel* src, uint32_t blockSize, int shift, 
uint64_t *z_k)
+{
+*z_k = 0;
+for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)
+{
+for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)
+{
+uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;
+*z_k += temp * temp;
+}
+}
+}
+
diff -r d12a4caf7963 -r 19f27e0c8a6f source/common/x86/pixel-a.asm
--- a/source/common/x86/pixel-a.asm	Wed Feb 27 12:35:02 2019 +0530
+++ b/source/common/x86/pixel-a.asm	Mon Mar 04 15:36:38 2019 +0530
@@ -388,6 +388,16 @@
 vpaddq m7, m6
 %endmacro
 
+%macro NORM_FACT_COL 1
+    vpsrld     m1,  m0,  SSIMRD_SHIFT
+    vpmuldq    m2,  m1,  m1
+    vpsrldq    m1,  m1,  4
+    vpmuldq    m1,  m1,  m1
+
+    vpaddq     m1,  m2
+    vpaddq     m3,  m1
+%endmacro
+
 ; FIXME avoid the spilling of regs to hold 3*stride.
 ; for small blocks on x86_32, modify pixel pointer instead.
 
@@ -16303,3 +16313,266 @@
 movq   [r4], xm4
 movq   [r6], xm7
 RET
+
+
+;static void normFact_c(const pixel* src, uint32_t blockSize, int shift, 
uint64_t *z_k)
+;{
+;*z_k = 0;
+;for (uint32_t block_yy = 0; block_yy < blockSize; block_yy += 1)
+;{
+;for (uint32_t block_xx = 0; block_xx < blockSize; block_xx += 1)
+;{
+;uint32_t temp = src[block_yy * blockSize + block_xx] >> shift;
+;*z_k += temp * temp;
+;}
+;}
+;}
+;--
+; void normFact_c(const pixel* src, uint32_t blockSize, int shift, uint64_t 
*z_k)
+;--
+INIT_YMM avx2
+cglobal normFact8, 4, 5, 6
+    mov        r4d,  8
+    vpxor      m3,   m3        ;z_k
+    vpxor      m5,   m5
+.row:
+%if HIGH_BIT_DEPTH
+    vpmovzxwd  m0,   [r0]      ;src
+%elif BIT_DEPTH == 8
+    vpmovzxbd  m0,   [r0]
+%else
+    %error Unsupported BIT_DEPTH!
+%endif



[x265] Original C++ code used for sad functions' assembly code in COST_MV?

2018-09-04 Thread Jeffrey Chen
Hi, I would like to configure the sad function in COST_MV for another
platform. However, the assembly code would not be supported on the other
platform. Where can I find the original programming language code that was
made into the assembly language code?


[x265] reference patch to remove unnecessary pow(x,2)

2018-07-27 Thread chen
This patch removes unnecessary pow() and abs() calls.



0001-improve-pow-x-2.patch
Description: Binary data
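
The attachment itself is not reproduced in the archive, so the following is only a sketch of the kind of rewrite the subject line describes, with hypothetical function names. Squaring already discards the sign, so the abs() is redundant, and pow(x, 2) can be a plain multiplication.

```cpp
#include <cmath>

// Before: a library call plus a redundant absolute value.
double squaredDiffSlow(double a, double b)
{
    return std::pow(std::fabs(a - b), 2);
}

// After: one subtraction and one multiplication, same result.
double squaredDiffFast(double a, double b)
{
    const double d = a - b;
    return d * d;
}
```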


[x265] non-optimized chroma p2s[] on ARM platform

2018-07-17 Thread chen
I found that p.chroma[X265_CSP_I420].pu[i].p2s is not initialized on the ARM 
platform, so all of these functions run as the C model. I guess they could reuse 
NEON's convert_p2s[*].

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] bug in IntraPred [DC]

2018-07-04 Thread chen
Please ignore my previous email. The initial value of dcVal is width, so this 
module has no bug. Sorry for the disturbance.
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] bug in IntraPred [DC]

2018-07-04 Thread chen
Hi,


There is a long-standing bug in our intra prediction DC mode; see the details:




 HM 


// Function for calculating DC value of the reference samples used in Intra 
prediction
//NOTE: Bit-Limit - 25-bit source
Pel TComPrediction::predIntraGetPredValDC( const Pel* pSrc, Int iSrcStride, 
UInt iWidth, UInt iHeight)
{
  assert(iWidth > 0 && iHeight > 0);
  Int iInd, iSum = 0;
  Pel pDcVal;


  for (iInd = 0;iInd < iWidth;iInd++)
  {
iSum += pSrc[iInd-iSrcStride];
  }
  for (iInd = 0;iInd < iHeight;iInd++)
  {
iSum += pSrc[iInd*iSrcStride-1];
  }


  pDcVal = (iSum + iWidth) / (iWidth + iHeight);


  return pDcVal;
}




*** x265 **
dcVal = dcVal / (width + width);


*
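For illustration only, here is a minimal sketch comparing the two rounding forms. It assumes, as the follow-up message in this thread states, that the x265 accumulator dcVal starts at width before the neighbours are summed; the surrounding x265 code is not quoted here, and the function names dcValHM and dcValSeeded are hypothetical. Only square blocks (width == height) are covered.

```cpp
#include <cassert>
#include <cstdint>

// HM form: plain sum of the neighbours, rounding term added at the end.
inline int dcValHM(const uint8_t* above, const uint8_t* left, int width)
{
    assert(width > 0);
    int sum = 0;
    for (int i = 0; i < width; i++)
        sum += above[i] + left[i];
    return (sum + width) / (width + width);
}

// Seeded form (assumption): the accumulator starts at width, so the
// rounding term is already included when the division happens.
inline int dcValSeeded(const uint8_t* above, const uint8_t* left, int width)
{
    assert(width > 0);
    int dcVal = width;
    for (int i = 0; i < width; i++)
        dcVal += above[i] + left[i];
    return dcVal / (width + width);
}
```

Under that assumption the two functions return the same value for any input, so the bare division by (width + width) does not by itself indicate a missing rounding term.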




I have double-checked my original x265 tree and it is not affected there, so 
I guess we need to fix it in the current tree.


Regards,
Min

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Another performance issue on ARM code

2018-06-11 Thread chen
I found some issues in the ARM code. I did not point them out in time; that is my failure.
For example, this garbage code in x265_pixel_add_ps_4x4_neon:


vmov.u16    q10, #255
veor.u16    q11, q11
veor.u16    d3, d3
veor.u16    d5, d5


By the way, the ARM build has been broken since the AVX512 patches were integrated.

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] Code performance issue

2018-06-01 Thread chen
There is a series of performance issues, such as:


uint32_t sum = (uint32_t)pow((outOfBound >> 2), 2);


Do you just want the square of a small integer?
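A minimal sketch of the integer-only alternative, assuming outOfBound is an unsigned 32-bit value (its real type is not shown in this message). The helper names squareU32 and outOfBoundCost are hypothetical and not part of x265.

```cpp
#include <cassert>
#include <cstdint>

// Square a small non-negative integer with one integer multiply.
// The 64-bit intermediate keeps the product exact for any 32-bit input.
inline uint32_t squareU32(uint32_t v)
{
    uint64_t wide = static_cast<uint64_t>(v) * v;
    assert(wide <= UINT32_MAX);   // the caller is expected to pass a small value
    return static_cast<uint32_t>(wide);
}

// Integer-only equivalent of: (uint32_t)pow((outOfBound >> 2), 2)
inline uint32_t outOfBoundCost(uint32_t outOfBound)
{
    return squareU32(outOfBound >> 2);
}
```

This avoids the conversion to double and the call into the math library, and it also sidesteps overload ambiguity of pow() with integer arguments on older compilers.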

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH 300 of 307] x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2

2018-04-06 Thread chen
Sorry, I missed a line; resending with an additional comment.

At 2018-04-07 01:27:34, "chen" <chenm...@163.com> wrote:


At 2018-04-06 21:17:37, mythr...@multicorewareinc.com wrote:
># HG changeset patch
># User Jayashree
># Date 1517283539 28800
>#  Mon Jan 29 19:38:59 2018 -0800
># Node ID 3c6e5ce07dbca7f967e4b5b62fe450979da3bf81
># Parent  624c83571d1df840e1206c46e589044fbf87ff32
>x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
>
>count_nonzero[16x16]   18.88x ->  23.04x
>
>+;-
>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);
>+;-
>+INIT_ZMM avx512
>+cglobal count_nonzero_16x16, 1,4,2
>+mov r1, 0x
>+kmovq   k2, r1



https://www.cs.utexas.edu/~hunt/class/2017-spring/cs350c/documents/Intel-x86-Docs/64-ia-32-architectures-instruction-set-extensions-reference-manual.pdf
2.5.1.1 Opmask Register K0
The only exception to the opmask rules described above is that opmask k0 can 
not be used as a predicate operand.
Opmask k0 cannot be encoded as a predicate operand for a vector operation; the 
encoding value that would select
opmask k0 will instead selects an implicit opmask value of 0x, 
thereby effectively disabling
masking. Opmask register k0 can still be used for any instruction that takes 
opmask register(s) as operand(s)
(either source or destination).



>+xor r3, r3
>+pxor    m0, m0
>+
>+%assign x 0

>+%rep 4
This is unrolled only 4 times, so the unroll is unnecessary here.
I suggest loading all of the bytes at the same time; the memory latency can then 
be hidden behind the calculation instructions.


>+movu    m1, [r0 + x]

>+vpacksswb   m1, [r0 + x + 64]
>+%assign x x+128
>+vpcmpb  k1 {k2}, m1, m0, 0100b
Could you please declare a new macro/const? It is difficult for developers to 
know that '0100b' (4) means NE (per Intel's documentation).
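As an illustration of the naming idea, here is a sketch of the VPCMP immediate predicates as named constants, together with a scalar reference for what the kernel counts. It is written in C++ rather than in the assembler's macro syntax; the identifiers VpcmpPredicate, VPCMP_* and countNonZeroRef are hypothetical and do not exist in x265.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Comparison predicates encoded in the imm8 operand of VPCMP[U]B/W/D/Q.
enum VpcmpPredicate : uint8_t
{
    VPCMP_EQ    = 0,
    VPCMP_LT    = 1,
    VPCMP_LE    = 2,
    VPCMP_FALSE = 3,
    VPCMP_NE    = 4,   // the '0100b' used in the patch
    VPCMP_NLT   = 5,
    VPCMP_NLE   = 6,
    VPCMP_TRUE  = 7
};

// Scalar reference: count the coefficients that are not equal to zero.
inline int countNonZeroRef(const int16_t* quantCoeff, size_t numCoeff)
{
    assert(quantCoeff != nullptr);
    int count = 0;
    for (size_t i = 0; i < numCoeff; i++)
        count += (quantCoeff[i] != 0);
    return count;
}
```

In the assembly source the same effect could be had with a single named constant for the NE predicate, so the compare reads as "not equal" instead of a bare bit pattern.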


>+kmovq   r1, k1
>+popcnt  r2, r1
>+add r3d, r2d
>+%endrep
>+mov eax, r3d
>+
>+RET
>+

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH 300 of 307] x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2

2018-04-06 Thread chen

At 2018-04-06 21:17:37, mythr...@multicorewareinc.com wrote:
># HG changeset patch
># User Jayashree
># Date 1517283539 28800
>#  Mon Jan 29 19:38:59 2018 -0800
># Node ID 3c6e5ce07dbca7f967e4b5b62fe450979da3bf81
># Parent  624c83571d1df840e1206c46e589044fbf87ff32
>x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
>
>count_nonzero[16x16]   18.88x ->  23.04x
>
>+;-
>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);
>+;-
>+INIT_ZMM avx512
>+cglobal count_nonzero_16x16, 1,4,2
>+mov r1, 0x
>+kmovq   k2, r1



https://www.cs.utexas.edu/~hunt/class/2017-spring/cs350c/documents/Intel-x86-Docs/64-ia-32-architectures-instruction-set-extensions-reference-manual.pdf
2.5.1.1 Opmask Register K0
The only exception to the opmask rules described above is that opmask k0 can 
not be used as a predicate operand.
Opmask k0 cannot be encoded as a predicate operand for a vector operation; the 
encoding value that would select
opmask k0 will instead selects an implicit opmask value of 0x, 
thereby effectively disabling
masking. Opmask register k0 can still be used for any instruction that takes 
opmask register(s) as operand(s)
(either source or destination).



>+xor r3, r3
>+pxor    m0, m0
>+
>+%assign x 0

>+%rep 4
This is unrolled only 4 times, so the unroll is unnecessary here.
I suggest loading all of the bytes at the same time; the memory latency can then 
be hidden behind the calculation instructions.


>+movu    m1, [r0 + x]

>+vpacksswb   m1, [r0 + x + 64]
>+%assign x x+128
>+vpcmpb  k1 {k2}, m1, m0, 0100b
>+kmovq   r1, k1
>+popcnt  r2, r1
>+add r3d, r2d
>+%endrep
>+mov eax, r3d
>+
>+RET
>+

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] unsigned promotion prevents encoding frames with negative strides

2018-01-17 Thread chen
Hi,


Thank you for reporting this bug.
I think the root cause is not sizeof(); a negative stride is invalid in the 
encoder/decoder core.
To avoid such invalid input parameters, x264 inserts a middle layer that 
converts color spaces and images, but x265 does not.


Of course, crashing is the worst way to report invalid input parameters; we will 
fix it soon.


Thanks,
Min

At 2018-01-18 00:24:19, "Vittorio Giovara"  wrote:

Hi,

I'm developing an app which flips a frame in this way



frame->stride[0] *= -1;
frame->stride[1] *= -1;
frame->stride[2] *= -1;
frame->data[0]= frame->data[1] + frame->stride[0];
frame->data[1]= frame->data[2] + frame->stride[1];
frame->data[2]= frame->data[2] + (((frame->height >> 
desc->log2_chroma_h) - 1) * -frame->stride[2]);



and then proceeds to encode it with either x264 or x265.


When feeding this frame to x265_encode(), I get a crash in x265_upShift_16_avx2 
at picyuv.cpp:322. While debugging I noticed that there is something wrong in 
the division for the input stride: for a 640x480 yuv 10 bit frame (with 
negative strides) the computed value is 9223372036854775168 instead of -640.


I believe the problem lies in the operation pic.stride[0] / sizeof(*yShort) 
where the result of every division is promoted to unsigned since sizeof() 
returns size_t (aka unsigned long). Unfortunately this issue is spread across 
the entire codebase, affecting every arithmetic operation whenever a stride is 
computed or updated. I think this was introduced four years ago in 
https://bitbucket.org/multicoreware/x265/commits/eadec14402d6.



The solution would be to properly cast every sizeof() operation to ssize_t or 
int, or modify the internal functions to operate on bytes instead of pixels. 
The same frame fed to x264 is correctly encoded just fine.
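A small sketch of the promotion described above, assuming a byte stride of -1280 (640 pixels of 16-bit samples, flipped). The function names pixelStrideNaive and pixelStrideSigned are hypothetical; only the arithmetic mirrors the pic.stride[0] / sizeof(*yShort) expression.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Naive form: sizeof() yields size_t, so the signed stride is converted to
// unsigned before the division and a negative stride becomes a huge value.
inline uint64_t pixelStrideNaive(intptr_t strideBytes)
{
    return static_cast<uint64_t>(strideBytes / sizeof(uint16_t));
}

// Signed form: cast sizeof() to a signed type so the division stays signed.
inline intptr_t pixelStrideSigned(intptr_t strideBytes)
{
    return strideBytes / static_cast<intptr_t>(sizeof(uint16_t));
}
```

On a platform with a 64-bit size_t, pixelStrideNaive(-1280) evaluates to 9223372036854775168, which is the value reported above, while pixelStrideSigned(-1280) gives -640.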


Can anybody verify and apply a proper fix? Thank you

--

Vittorio
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] Use atomic bit test and set/reset operations on x86

2018-01-10 Thread chen

At 2018-01-11 00:06:29, "Andrey Semashev" <andrey.semas...@gmail.com> wrote:
>On 01/10/18 18:53, chen wrote:
>> Hi Andrey,
>> 
>> Our code rule prohibit inline assembly, especially the patch used GCC 
>> extension syntax.
>
>Ok, I see.
>
>> the "lock" prefix will lock the CPU bus, it will be greater penalty on 
>> the multi-core system.
>
>Just for the record, the lock prefix is implemented much more 
>efficiently nowadays and involves CPU cache management rather than bus 
>locking. It used to lock the memory bus on early CPUs (I want to say 
>before Pentium, but I'm not sure which exact architecture changed this). 
>In any case, the patch does not introduce new lock instructions but it 
>replaces "lock; cmpxchg" loops that are normally generated for the 

>atomic AND and OR operations with a single instruction.


https://htor.inf.ethz.ch/publications/img/atomic-bench.pdf
In this paper, the authors explain that lock (SWP) costs only a small
performance drop on modern CPUs, but they only tested systems with few cores
(Xeon Phi loses more, and it is a single-socket CPU). On a multi-socket system,
cache coherency maintenance will be very expensive.
However, an intrinsic may benefit more from the compiler, which can decide
which method is the best choice for the target platform.
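To make the "let the compiler choose" point concrete, here is a sketch that expresses the two bit operations with C++11 std::atomic instead of inline assembly. This is a swapped-in technique, not the x265 ATOMIC_* macros and not the patch's code; the function names are hypothetical. Whether a given compiler lowers these to lock bts/btr or to a cmpxchg loop is up to that compiler and target.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically set bit `bit` and return its previous value.
inline bool atomicBitTestAndSet(std::atomic<uint32_t>& word, unsigned bit)
{
    assert(bit < 32);
    const uint32_t mask = uint32_t(1) << bit;
    return (word.fetch_or(mask) & mask) != 0;
}

// Atomically clear bit `bit` and return its previous value.
inline bool atomicBitTestAndReset(std::atomic<uint32_t>& word, unsigned bit)
{
    assert(bit < 32);
    const uint32_t mask = uint32_t(1) << bit;
    return (word.fetch_and(~mask) & mask) != 0;
}
```

When the return value is ignored, the compiler is free to emit a plain locked OR/AND, which corresponds to the "_VOID" variants proposed in the patch.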



Re: [x265] [PATCH] Use atomic bit test and set/reset operations on x86

2018-01-10 Thread chen
Hi Andrey,


Our coding rules prohibit inline assembly, and this patch in particular uses GCC 
extension syntax.
The "lock" prefix will lock the CPU bus, which carries a greater penalty on 
multi-core systems.
Thanks,
Min


At 2018-01-10 23:30:06, "Andrey Semashev"  wrote:
>Any feedback on this one?
>
>I've been using it for quite some time locally. It does seem to work 
>slightly faster on my Sandy Bridge machine (it should be a few percents 
>of gain in fps, although I didn't save the benchmark numbers).
>
>On 01/01/18 15:28, Andrey Semashev wrote:
>> # HG changeset patch
>> # User Andrey Semashev 
>> # Date 1514809583 -10800
>> #  Mon Jan 01 15:26:23 2018 +0300
>> # Branch atomic_bit_opsv2
>> # Node ID 81529b6bd6adc8eb31162daeee44399dc1f95999
>> # Parent  ff02513b92c000c3bb3dcc51deb79af57f5358d5
>> Use atomic bit test and set/reset operations on x86.
>> 
>> The 'lock bts/btr' instructions are potentially more efficient than the
>> 'lock cmpxchg' loops which are emitted to implement ATOMIC_AND and ATOMIC_OR
>> on x86. The commit adds new macros ATOMIC_BTS and ATOMIC_BTR which atomically
>> set/reset the specified bit in the integer and return the previous value of
>> the modified bit.
>> 
>> Since in many places of the code the result is not needed, two more macros 
>> are
>> provided as well: ATOMIC_BTS_VOID and ATOMIC_BTR_VOID. The effect of these
>> macros is the same except that they don't return the previous value. These
>> macros may generate a slightly more efficient code.
>> 
>> diff -r ff02513b92c0 -r 81529b6bd6ad source/common/threading.h
>> --- a/source/common/threading.h  Fri Dec 22 18:23:24 2017 +0530
>> +++ b/source/common/threading.h  Mon Jan 01 15:26:23 2018 +0300
>> @@ -80,6 +80,91 @@
>>   #define ATOMIC_ADD(ptr, val)  __sync_fetch_and_add((volatile int32_t*)ptr, 
>> val)
>>   #define GIVE_UP_TIME()usleep(0)
>>   
>> +#if defined(__x86_64__) || defined(__i386__)
>> +
>> +namespace X265_NS {
>> +
>> +inline __attribute__((always_inline)) void 
>> atomic_bit_test_and_set_void(uint32_t* ptr, uint32_t bit)
>> +{
>> +__asm__ __volatile__
>> +(
>> +"lock; btsl %[bit], %[mem]\n\t"
>> +: [mem] "+m" (*ptr)
>> +: [bit] "Kq" (bit)
>> +: "memory"
>> +);
>> +}
>> +
>> +inline __attribute__((always_inline)) void 
>> atomic_bit_test_and_reset_void(uint32_t* ptr, uint32_t bit)
>> +{
>> +__asm__ __volatile__
>> +(
>> +"lock; btrl %[bit], %[mem]\n\t"
>> +: [mem] "+m" (*ptr)
>> +: [bit] "Kq" (bit)
>> +: "memory"
>> +);
>> +}
>> +
>> +inline __attribute__((always_inline)) bool 
>> atomic_bit_test_and_set(uint32_t* ptr, uint32_t bit)
>> +{
>> +bool res;
>> +#if defined(__GCC_ASM_FLAG_OUTPUTS__)
>> +__asm__ __volatile__
>> +(
>> +"lock; btsl %[bit], %[mem]\n\t"
>> +: [mem] "+m" (*ptr), [res] "=@ccc" (res)
>> +: [bit] "Kq" (bit)
>> +: "memory"
>> +);
>> +#else
>> +res = false; // to avoid false dependency on the higher part of the 
>> result register
>> +__asm__ __volatile__
>> +(
>> +"lock; btsl %[bit], %[mem]\n\t"
>> +"setc %[res]\n\t"
>> +: [mem] "+m" (*ptr), [res] "+q" (res)
>> +: [bit] "Kq" (bit)
>> +: "memory"
>> +);
>> +#endif
>> +return res;
>> +}
>> +
>> +inline __attribute__((always_inline)) bool 
>> atomic_bit_test_and_reset(uint32_t* ptr, uint32_t bit)
>> +{
>> +bool res;
>> +#if defined(__GCC_ASM_FLAG_OUTPUTS__)
>> +__asm__ __volatile__
>> +(
>> +"lock; btrl %[bit], %[mem]\n\t"
>> +: [mem] "+m" (*ptr), [res] "=@ccc" (res)
>> +: [bit] "Kq" (bit)
>> +: "memory"
>> +);
>> +#else
>> +res = false; // to avoid false dependency on the higher part of the 
>> result register
>> +__asm__ __volatile__
>> +(
>> +"lock; btrl %[bit], %[mem]\n\t"
>> +"setc %[res]\n\t"
>> +: [mem] "+m" (*ptr), [res] "+q" (res)
>> +: [bit] "Kq" (bit)
>> +: "memory"
>> +);
>> +#endif
>> +return res;
>> +}
>> +
>> +}
>> +
>> +#define ATOMIC_BTS_VOID(ptr, bit)  
>> atomic_bit_test_and_set_void((uint32_t*)(ptr), (bit))
>> +#define ATOMIC_BTR_VOID(ptr, bit)  
>> atomic_bit_test_and_reset_void((uint32_t*)(ptr), (bit))
>> +#define ATOMIC_BTS(ptr, bit)  atomic_bit_test_and_set((uint32_t*)(ptr), 
>> (bit))
>> +#define ATOMIC_BTR(ptr, bit)  atomic_bit_test_and_reset((uint32_t*)(ptr), 
>> (bit))
>> +
>> +#endif // defined(__x86_64__) || defined(__i386__)
>> +
>>   #elif defined(_MSC_VER)   /* Windows atomic intrinsics */
>>   
>>   #include 
>> @@ -93,8 +178,26 @@
>>   #define ATOMIC_AND(ptr, mask) _InterlockedAnd((volatile LONG*)ptr, 
>> (LONG)mask)
>>   #define GIVE_UP_TIME()Sleep(0)
>>   
>> +#if defined(_M_IX86) || defined(_M_X64)
>> +#define ATOMIC_BTS(ptr, bit)  (!!_interlockedbittestandset((long*)(ptr), 
>> (bit)))
>> +#define ATOMIC_BTR(ptr, bit)  

Re: [x265] [PATCH] intra: sse4 version of strong intrasmoothing

2017-11-29 Thread chen
SSSE3 pmulhrsw can also improve on the pmullw+paddw+psraw sequence.
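One possible reading of this remark, as a scalar sketch: a single pmulhrsw lane computes (((a * b) >> 14) + 1) >> 1, so pre-scaling the multiplier by 512 reproduces a multiply, add 32, shift right by 6 sequence. The function names are hypothetical, and whether this fits the patch depends on the multiplier range: the pre-scaled multiplier must fit in a signed 16-bit lane, which holds for 0..63 but not for 64.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 16-bit lane of SSSE3 PMULHRSW.
// Assumes >> on negative values is an arithmetic shift (true on mainstream
// compilers, guaranteed from C++20).
inline int16_t pmulhrswLane(int16_t a, int16_t b)
{
    int32_t t = (static_cast<int32_t>(a) * b) >> 14;
    return static_cast<int16_t>((t + 1) >> 1);
}

// Multiply, add rounding offset, shift: the pmullw + paddw + psraw pattern,
// computed here in 32-bit so the product cannot wrap.
inline int16_t mulRoundShift6(int16_t x, int16_t m)
{
    assert(m >= 0 && m < 64);
    return static_cast<int16_t>((static_cast<int32_t>(x) * m + 32) >> 6);
}
```

For 0 <= m < 64, pmulhrswLane(x, m << 9) returns the same value as mulRoundShift6(x, m), which replaces three instructions with one per vector.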




At 2017-11-28 23:57:50, "Ximing Cheng"  wrote:
># HG changeset patch
># User Ximing Cheng 
># Date 1511862059 -28800
>#  Tue Nov 28 17:40:59 2017 +0800
># Node ID 9cd0cf6e2fd88604d939138e539dd481ec429ab3
># Parent  b24454f3ff6de650aab6835e291837fc4e2a4466
>intra: sse4 version of strong intrasmoothing

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] intra: sse4 version of strong intrasmoothing

2017-11-28 Thread chen
I have a few comments.

At 2017-11-28 23:57:50, "Ximing Cheng"  wrote:
>diff -r b24454f3ff6d -r 9cd0cf6e2fd8 source/common/x86/const-a.asm
>--- a/source/common/x86/const-a.asmWed Nov 22 22:00:48 2017 +0530
>+++ b/source/common/x86/const-a.asmTue Nov 28 17:40:59 2017 +0800
>@@ -114,6 +114,10 @@
> const multiH3,  times  1 dw  25,  26,  27,  28,  29,  30,  31,  32
> const multiL,   times  1 dw   1,   2,   3,   4,   5,   6,   7,   
> 8,   9,  10,  11,  12,  13,  14,  15,  16
> const multiH2,  times  1 dw  17,  18,  19,  20,  21,  22,  23,  
> 24,  25,  26,  27,  28,  29,  30,  31,  32
>+const multiH3_1,times  1 dw  33,  34,  35,  36,  37,  38,  39,  
>40,  41,  42,  43,  44,  45,  46,  47,  48

>+const multiH3_2,times  1 dw  41,  42,  43,  44,  45,  46,  47,  48
Please check for alignment issues on the constants above.


>+const multiH4,  times  1 dw  49,  50,  51,  52,  53,  54,  55,  
>56,  57,  58,  59,  60,  61,  62,  63,  64
>+const multiH4_1,times  1 dw  57,  58,  59,  60,  61,  62,  63,  64
> const pw_planar16_mul,  times  1 dw  15,  14,  13,  12,  11,  10,   9,   
> 8,   7,   6,   5,   4,   3,   2,   1,   0
> const pw_planar32_mul,  times  1 dw  31,  30,  29,  28,  27,  26,  25,  
> 24,  23,  22,  21,  20,  19,  18,  17,  16
> const pw_FFF0,   dw 0x00

>diff -r b24454f3ff6d -r 9cd0cf6e2fd8 source/common/x86/intrapred8.asm
>--- a/source/common/x86/intrapred8.asm Wed Nov 22 22:00:48 2017 +0530
>+++ b/source/common/x86/intrapred8.asm Tue Nov 28 17:40:59 2017 +0800
>@@ -543,6 +543,10 @@
> cextern multiH
> cextern multiH2
> cextern multiH3
>+cextern multiH3_1
>+cextern multiH3_2
>+cextern multiH4
>+cextern multiH4_1
> cextern multi_2Row
> cextern trans8_shuf
> cextern pw_planar16_mul
>@@ -22313,11 +22317,142 @@
> mov [r1 + 64], r3b  ; LeftLast
> RET
> 
>-INIT_XMM sse4
>-cglobal intra_filter_32x32, 2,4,6
>-mov r2b, byte [r0 +  64]; topLast
>-mov r3b, byte [r0 + 128]; LeftLast
>-
>+; this function add strong intra filter
>+INIT_XMM sse4
>+cglobal intra_filter_32x32, 3,8,7
>+movzx   r3d, byte [r0 +  64]; topLast
>+movzx   r4d, byte [r0 + 128]; LeftLast
>+
>+; strong intra filter is disabled
>+cmp r2m, byte 0
>+jz  .normal_filter32
>+; decide to do strong intra filter
>+movzx   r5d, byte [r0]  ; topLeft
>+movzx   r6d, byte [r0 + 32] ; topMiddle
>+
>+; threshold = 8
>+mov r2d, r3d
>+add r2d, r5d; (topLast + topLeft)
>+shl r6d, 1  ; 2 * topMiddle
>+mov r7d, r2d
>+sub r2d, r6d; (topLast + topLeft) - 2 
>* topMiddle
>+sub r6d, r7d; 2 * topMiddle - 
>(topLast + topLeft)
>+cmovg   r2d, r6d
>+cmp r2d, 8
>+; bilinearAbove is false
>+jns .normal_filter32
>+
>+movzx   r6d, byte [r0 + 96] ; leftMiddle
>+mov r2d, r5d
>+add r2d, r4d
>+shl r6d, 1
>+mov r7d, r2d
>+sub r2d, r6d
>+sub r6d, r7d
>+cmovg   r2d, r6d
>+cmp r2d, 8
>+; bilinearLeft is false
>+jns .normal_filter32
>+
>+; do strong intra filter shift = 6
>+mov r2d, r5d
>+shl r2d, 6
>+add r2d, 32 ; init
>+mov r6d, r4d
>+sub r6d, r5d; deltaL
>+mov r7d, r3d
>+sub r7d, r5d; deltaR
>+
>+movd    m0, r2d
>+pshuflw m0, m0, 0
>+movlhps m0, m0
>+mova    m4, m0
>+
>+
>+movd    m1, r7d
>+pshuflw m1, m1, 0
>+movlhps m1, m1
>+pmullw  m2, m1, [multiL]; [ 1  2  3  4  5  6  7  
>8]

>+pmullw  m3, m1, [multiH]; [ 9 10 11 12 13 14 15 
>16]
What is stored in the high part of m2?
Moreover, X * 9 = X * 1 + X * 8, so how about storing X * 8 in the unused m7 to 
reduce memory load operations (3 cycles latency)?
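The X * 9 = X * 1 + X * 8 idea can be sketched in scalar form as follows. This is not the SIMD code from the patch; it only models the arithmetic, with init = (corner << 6) + 32 and delta = last - corner as in the quoted assembly. The function names strongFilterDirect and strongFilterStep8 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Direct form: one multiply per output sample.
inline void strongFilterDirect(uint8_t* dst, int corner, int last, int count)
{
    const int init  = (corner << 6) + 32;
    const int delta = last - corner;
    for (int i = 0; i < count; i++)
        dst[i] = static_cast<uint8_t>((init + delta * (i + 1)) >> 6);
}

// Incremental form: compute the first group of 8 once, then reach each
// following group by adding delta * 8 instead of loading new multipliers.
inline void strongFilterStep8(uint8_t* dst, int corner, int last, int count)
{
    assert(count % 8 == 0);
    const int delta  = last - corner;
    const int delta8 = delta * 8;
    int base[8];
    for (int k = 0; k < 8; k++)
        base[k] = (corner << 6) + 32 + delta * (k + 1);
    for (int i = 0; i < count; i += 8)
    {
        for (int k = 0; k < 8; k++)
        {
            dst[i + k] = static_cast<uint8_t>(base[k] >> 6);
            base[k] += delta8;   // X * (n + 8) = X * n + X * 8
        }
    }
}
```

In vector terms, base corresponds to one register of eight products and delta8 to a broadcast register, so each further group costs one add instead of a memory load plus a multiply.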


>+ paddw m5, m0, m2
>+ paddw m6, m4, m3
>+ psraw m5, 6
>+ psraw m6, 6
>+ packuswb m5, m6
>+ movu [r1 + 1], m5
>+
>+ pmullw m2, m1, [multiH2] ; [17 18 19 20 21 22 23 24]
>+ pmullw m3, m1, [multiH3] ; [25 26 27 28 29 30 31 32]
>+ paddw m5, m0, m2
>+ paddw m6, m4, m3
>+ psraw m5, 6
>+ psraw m6, 6
>+ packuswb m5, m6
>+ movu [r1 + 17], m5
>+
>+ pmullw m2, m1, [multiH3_1] ; [33 - 40]
>+ pmullw m3, m1, [multiH3_2] ; [41 - 48]
>+ paddw m5, m0, m2
>+ paddw m6, m4, m3
>+ psraw m5, 6
>+ psraw m6, 6
>+ packuswb 

Re: [x265] [PATCH] intra: sse4 version of strong intra smoothing

2017-11-20 Thread chen
>diff -r a7c2f80c18af -r 973560d58dfb source/common/x86/intrapred8.asm
>--- a/source/common/x86/intrapred8.asm Mon Nov 20 14:31:22 2017 +0530
>+++ b/source/common/x86/intrapred8.asm Tue Nov 21 03:10:14 2017 +0800
>@@ -22313,11 +22313,144 @@
> mov [r1 + 64], r3b  ; LeftLast
> RET
> 
>-INIT_XMM sse4
>-cglobal intra_filter_32x32, 2,4,6
>-mov r2b, byte [r0 +  64]; topLast
>-mov r3b, byte [r0 + 128]; LeftLast
>-
>+; this function add strong intra filter
>+INIT_XMM sse4
>+cglobal intra_filter_32x32, 3,8,7
>+xor r3d, r3d ; R9
>+xor r4d, r4d ; R10
>+mov r3b, byte [r0 +  64] ; topLast
>+mov r4b, byte [r0 + 128] ; LeftLast

xor+mov = movzx. The xor (clear to zero) does not cost a cycle, but it does 
affect the instruction decode rate.


>+
>+; strong intra filter is diabled
>+cmp r2m, byte 0
>+jz  .normal_filter32
>+; decide to do strong intra filter
>+xor r5d, r5d ; R11
>+xor r6d, r6d ; RAX
>+xor r7d, r7d ; RDI
>+mov r5b, byte [r0]   ; topLeft
>+mov r6b, byte [r0 + 96]  ; leftMiddle
>+mov r7b, byte [r0 + 32]  ; topMiddle
>+
>+; threshold = 8
>+mov r2d, r3d ; R8
>+add r2d, r5d ; (topLast + topLeft)
>+shl r7d, 1   ; 2 * topMiddle

>+sub r2d, r7d
(A+B) - 2 * C  <==> (A-C) + (B-C)


>+mov r7d, r2d ; backup r2d
>+sar r7d, 31
>+xor r2d, r7d
>+sub r2d, r7d ; abs(r2d)

>+cmp r2d, 8
; how about this, or the cdq instruction?
; abs(x-y)
mov eax, X
sub eax, Y
sub Y, X
cmovg eax, Y
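The threshold test being reviewed here can be written in scalar form as below, as a sketch only. It uses the identity (A+B) - 2*C = (A-C) + (B-C) noted above; the function name isNearlyLinear is hypothetical, and the threshold of 8 is the value used in the quoted patch.

```cpp
#include <cassert>
#include <cstdlib>

// True when |first + last - 2 * middle| < threshold.
inline bool isNearlyLinear(int first, int last, int middle, int threshold)
{
    assert(threshold > 0);
    // (A + B) - 2 * C  ==  (A - C) + (B - C)
    const int d = (first - middle) + (last - middle);
    return std::abs(d) < threshold;
}
```

The strong filter is selected only when this holds for both the top edge (topLeft, topLast, topMiddle) and the left edge (topLeft, leftLast, leftMiddle).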




>+; bilinearAbove is false
>+jns .normal_filter32
>+
>+mov r2d, r5d
>+add r2d, r4d
>+shl r6d, 1
>+sub r2d, r6d
>+mov r6d, r2d
>+sar r6d, 31
>+xor r2d, r6d
>+sub r2d, r6d
>+cmp r2d, 8
>+; bilinearLeft is false
>+jns .normal_filter32
>+
>+; do strong intra filter shift = 6
>+mov r2d, r5d
>+shl r2d, 6
>+add r2d, 32  ; init
>+mov r6d, r4d

>+sub r6w, r5w ; deltaL size is word
A partial-register stall may occur here.


>+mov r7d, r3d
>+sub r7w, r5w ; deltaR size is word
>+movd    xmm0, r2d

>+vpbroadcastw    xmm0, xmm0
SSE4?


>+mova    xmm4, xmm0
>+

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] [PATCH] fix build error on VS2008 ( ambiguous on pow() )

2017-06-28 Thread chen
From 360c25c6198e7aaa3a9f0ad611d99f94a1ea6347 Mon Sep 17 00:00:00 2001
From: Min Chen <chenm...@163.com>
Date: Wed, 28 Jun 2017 11:54:05 -0500
Subject: [PATCH] fix build error on VS2008 ( ambiguous on pow() )


---
 source/encoder/slicetype.cpp |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)


diff --git a/source/encoder/slicetype.cpp b/source/encoder/slicetype.cpp
index b013335..d7638a4 100644
--- a/source/encoder/slicetype.cpp
+++ b/source/encoder/slicetype.cpp
@@ -1819,7 +1819,8 @@ void Lookahead::calcMotionAdaptiveQuantFrame(Lowres 
**frames, int p0, int p1, in
 MV *mvs = frames[b]->lowresMvs[list][listDist[list]];
 int32_t x = mvs[cuIndex].x;
 int32_t y = mvs[cuIndex].y;
-displacement += sqrt(pow(abs(x), 2) + pow(abs(y), 2));
+// NOTE: the dynamic range of abs(x) and abs(y) is 15-bits
+displacement += sqrt((double)(abs(x) * abs(x)) + 
(double)(abs(y) * abs(y)));
 }
 else
 displacement += 0.0;
-- 
1.7.9.msysgit.0

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] avx2: 'integral4v' asm code -> 7.48x faster than 'C' version

2017-05-08 Thread chen
Hi Guillaume,


Our development platform is Visual Studio, and its compiler can't auto-vectorize.
We also can't assume that users have an advanced compiler on their computers.
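For reference, the loop under discussion looks roughly like this in plain C++. This sketch is derived from the assembly in the quoted patch and the integral_init4v_c signature in its comment, not copied from the x265 source; the name integralInit4vRef is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference: each element of the first row becomes the difference
// between the element four rows below it and itself.
inline void integralInit4vRef(uint32_t* sum4, intptr_t stride)
{
    assert(stride > 0);
    for (intptr_t x = 0; x < stride; x++)
        sum4[x] = sum4[x + 4 * stride] - sum4[x];
}
```

A loop of this shape is a reasonable candidate for auto-vectorization on compilers that support it, which is the point raised in the quoted question; the hand-written kernel removes the dependence on the compiler.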


Regards,
Min


At 2017-05-08 19:36:24,"Guillaume POIRIER"  wrote:
>Hello Praveen Tiwari,
>
>Just for curiosity, when comparing your code's performance with the
>plain C version, did you give a chance to the compiler to vectorize
>the code itself?
>Such a trivial loop should not be difficult to handle for the compiler
>I think...
>
>Cheers,
>
>Guillaume
>
>
>On Mon, May 8, 2017 at 6:31 AM,   wrote:
>> # HG changeset patch
>> # User Praveen Tiwari 
>> # Date 1493905428 -19800
>> #  Thu May 04 19:13:48 2017 +0530
>> # Node ID 41611825c2f4661536500e1306db7d8c4bf7fd07
>> # Parent  48502979a4b21f6982dcdacbf7796bf5d9fb395c
>> avx2: 'integral4v' asm code -> 7.48x faster than 'C' version
>>
>>integral_init4v  7.48x  202.53  1515.14
>>
>> diff -r 48502979a4b2 -r 41611825c2f4 source/common/x86/seaintegral.asm
>> --- a/source/common/x86/seaintegral.asm Wed May 03 11:26:26 2017 +0530
>> +++ b/source/common/x86/seaintegral.asm Thu May 04 19:13:48 2017 +0530
>> @@ -32,8 +32,19 @@
>>  ;void integral_init4v_c(uint32_t *sum4, intptr_t stride)
>>  
>> ;-
>>  INIT_YMM avx2
>> -cglobal integral4v, 2, 2, 0
>> -
>> +cglobal integral4v, 2, 3, 2
>> +mov r2, r1
>> +shl r2, 4
>> +
>> +.loop
>> +movu    m0, [r0]
>> +movu    m1, [r0 + r2]
>> +psubd   m1, m0
>> +movu    [r0], m1
>> +add r0, 32
>> +sub r1, 8
>> +cmp r1, 0
>> +jnz .loop
>>  RET
>>
>>  
>> ;-
>> ___
>> x265-devel mailing list
>> x265-devel@videolan.org
>> https://mailman.videolan.org/listinfo/x265-devel
>
>
>
>-- 
>Wearing a Rolex is like driving an Audi: It says you've got some
>money, but nothing to say.
>John Lefèvre
>___
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] x265 crashes/gets stuck when giving more than '--slices 16'

2017-03-14 Thread chen
Good morning Michael,

I restricted the number of slices because we have a limited number of output 
NAL buffers.
Every slice needs an independent NAL, but the SPS/PPS/VPS will also allocate at 
least one NAL, so I limited the slice count to (MAX_NAL_UNITS - 1).
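A minimal sketch of that limit, assuming MAX_NAL_UNITS is 16 as quoted from source/encoder/nal.h later in this thread. The function name clampSliceCount is hypothetical, and the real check in x265 may be structured differently.

```cpp
#include <algorithm>
#include <cassert>

const int MAX_NAL_UNITS = 16;   // value quoted in this thread from nal.h

// Clamp the requested slice count to min(rows, MAX_NAL_UNITS - 1),
// keeping one NAL buffer free for the parameter sets.
inline int clampSliceCount(int requested, int numRows)
{
    assert(requested > 0 && numRows > 0);
    const int limit = std::min(numRows, MAX_NAL_UNITS - 1);
    return std::min(requested, limit);
}
```

With these values, a request for 20 slices on a frame with more than 15 rows is reduced to 15, which matches the warning shown in the quoted report.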


Best regards,
Min

At 2017-03-14 15:23:03,"Michael Lackner"  
wrote:
>Good morning!
>
>I applied the patch and rebuilt the code (2.3+9-820f4327ddac in this case).
>
>Invoking x265 on a 8292x3428 video with '--slices 20' now gives the following 
>warning:
>
>x265 [warning]: maxSlices can not be more than min(rows, MAX_NAL_UNITS-1), 
>force set to 15
>
>And it shows:
>
>x265 [info]: Slices  : 15
>
>Just a laymans' question: I can see in source/encoder/nal.h, that 
>MAX_NAL_UNITS is 16.
>Should maxSlices really be MAX_NAL_UNITS-1 and not MAX_NAL_UNITS in this case? 
>I'm just
>asking because encoding with 16 slices seemed to work just fine before the 
>patch.
>
>Or are there issues with that?
>
>Thanks!
>
>Best,
>Michael
>
>-- 
>Michael Lackner
>Lehrstuhl für Informationstechnologie (CiT)
>Montanuniversität Leoben
>Tel.: +43 (0)3842/402-1505 | Mail: michael.lack...@unileoben.ac.at
>Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech
>
>On 03/14/2017 05:15 AM, Pradeep Ramachandran wrote:
>> The limit check was checking against the wrong variable - it should've
>> checked against slicesLimit and not against numRows.
>> Can you please check if https://patches.videolan.org/patch/15905/ solves
>> your issue?
>> 
>> Pradeep Ramachandran, PhD
>> Solution Architect at www.multicorewareinc.com/
>> Adjunct Faculty at www.cse.iitm.ac.in/
>> pradeeprama.info/
>> Ph:   +91 99627 82018
>> 
>> On Mon, Mar 6, 2017 at 4:30 PM, Michael Lackner <
>> michael.lack...@unileoben.ac.at> wrote:
>> 
>>> Greetings,
>>>
>>> During my experimentation with creating a x265-based benchmark, I found
>>> out that x265
>>> segfaults and/or freezes when giving it more than 16 frame slices to
>>> encode (e.g. --slices
>>> 17 or --slices 20 or --slices 32 or similar).
>>>
>>> It seems easy to reproduce as well. In my case, I'm feeding raw 16-bit YUV
>>> 4:4:4 content
>>> to it (16-bit per channel, 48-bit per pixel, no chroma subsampling). The
>>> resolution is
>>> 8192×3428, an upscale of the free movie "Tears of Steel".
>>>
>>> It works perfectly fine with <=16 slices on all my systems though, here's
>>> the ones I've
>>> tested it on:
>>>
>>> Environments / software versions:
>>>
>>>   * x265 2.2+22-20217c8af8ac
>>>
>>>  CentOS 6.8 Linux x86_64, x265 built by GCC 6.2.0 + yasm 1.3.0
>>>   => Segmentation fault
>>>
>>>   * x265 2.3+9-820f4327ddac
>>>
>>> o CentOS 6.8 Linux x86_64, x265 built by GCC 6.2.0 + yasm 1.3.0
>>>   => Segmentation fault
>>>
>>> o FreeBSD 10.3 UNIX x86_64, x265 built by clang 3.8.1 + yasm 1.3.0
>>>   => Process locks up in STOP state and becomes unkillable, even
>>>  to SIGKILL
>>>
>>> o Windows XP Professional x64 Edition (NT5.2), x265 built by
>>>   MSVC2010 SP1, cl.exe 16.00.40219.01 + yasm 1.3.0
>>>   => Terminates with %ERRORLEVEL% still set to 0, so this crash is
>>>   completely uncaught
>>>
>>> As said, it works fine on all platforms with --slices [1..16].
>>>
>>> Here's some sample output of the crash from Linux, it's supposed to be a
>>> 2-pass encode:
>>>
>>> 29163 Segmentation fault  (core dumped) nice -n19 x265 ./raw8k.yuv
>>> --frames 500
>>> --input-depth 16 --dither --input-res 8192x3428 --input-csp i444 -D 10
>>> --fps 24 --slices
>>> 20 -p veryslow --pmode --pme --open-gop --ref 6 --bframes 16 --b-pyramid
>>> --weightb
>>> --max-merge 5 --b-intra --bitrate 5 --rect --amp --aq-mode 2 --no-sao
>>> --qcomp 0.75
>>> --no-strong-intra-smoothing --psy-rd 1.6 --psy-rdoq 5.0 --rdoq-level 1
>>> --tu-inter-depth 4
>>> --tu-intra-depth 4 --ctu 32 --max-tu-size 32 --pass 1 --slow-firstpass
>>> --stats v.stats
>>> --sar 1 --range full -o pass1.h265
>>>
>>> I don't have any debug builds of x265 right now and I don't really know
>>> how to even do any
>>> debugging, but if you can tell me if you need anything, I can always try
>>> to build a debug
>>> version to generate more helpful output / dump files.
>>>
>>> Or maybe 16 frame slices are supposed to be the maximum, but it's just not
>>> handled
>>> correctly yet?!
>>>
>>> Is there any information available on that behavior?
>>>
>>> --
>>> Michael Lackner
>>> Lehrstuhl für Informationstechnologie (CiT)
>>> Montanuniversität Leoben
>>> Tel.: +43 (0)3842/402-1505 | Mail: michael.lack...@unileoben.ac.at
>>> Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech
>>> ___
>>> x265-devel mailing list
>>> x265-devel@videolan.org
>>> https://mailman.videolan.org/listinfo/x265-devel
>___
>x265-devel mailing list
>x265-devel@videolan.org

[x265] fix logic timing bug

2016-11-23 Thread chen
# HG changeset patch
# User Min Chen <chenm...@163.com>
# Date 1479924604 21600
# Node ID c5ea19f5852aadd42bedd1d9fe4eb4b350a31e73
# Parent  a895b6344a82f2b5a0f8bc4ba7a913f0c40d114d
fix logic timing bug
---
 source/encoder/framefilter.cpp |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)


diff -r a895b6344a82 -r c5ea19f5852a source/encoder/framefilter.cpp
--- a/source/encoder/framefilter.cppWed Nov 16 18:50:28 2016 +0530
+++ b/source/encoder/framefilter.cppWed Nov 23 12:10:04 2016 -0600
@@ -499,16 +499,18 @@
 if (!ctu->m_bFirstRowInSlice)
 processPostRow(row - 1);
 
-if (ctu->m_bLastRowInSlice)
-processPostRow(row);
-
 // NOTE: slices parallelism will be execute out-of-order
 int numRowFinished = 0;
 if (m_frame->m_reconRowFlag)
 {
 for (numRowFinished = 0; numRowFinished < m_numRows; numRowFinished++)
+{
 if (!m_frame->m_reconRowFlag[numRowFinished].get())
 break;
+
+if (numRowFinished == row)
+continue;
+}
 }
 
 if (numRowFinished == m_numRows)
@@ -525,6 +527,9 @@
 m_parallelFilter[0].m_sao.rdoSaoUnitRowEnd(saoParam, 
encData.m_slice->m_sps->numCUsInFrame);
 }
 }
+
+if (ctu->m_bLastRowInSlice)
+processPostRow(row);
 }
 
 void FrameFilter::processPostRow(int row)

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] cleanup debug code

2016-11-16 Thread chen
# HG changeset patch
# User Min Chen <min.c...@multicorewareinc.com>
# Date 1479317016 21600
# Node ID 99a4a2d29d5c2b997745b06e5954a03bc080478f
# Parent  4c1652f3884fba9fab4c589dd057b12e6bf33d5b
cleanup debug code
---
 source/encoder/sao.cpp |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)


diff -r 4c1652f3884f -r 99a4a2d29d5c source/encoder/sao.cpp
--- a/source/encoder/sao.cppTue Nov 15 11:16:04 2016 +0530
+++ b/source/encoder/sao.cppWed Nov 16 11:23:36 2016 -0600
@@ -1206,12 +1206,10 @@
 void SAO::rdoSaoUnitRowEnd(const SAOParam* saoParam, int numctus)
 {
 if (!saoParam->bSaoFlag[0])
-{
 m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = 1.0;
-}
 else
 {
-assert(m_numNoSao[0] <= numctus);
+X265_CHECK(m_numNoSao[0] <= numctus, "m_numNoSao check failure!");
 m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[0] / 
((double)numctus);
 }
 

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] [slices] restrict mv never beyond boundary in both slices and non-slices mode

2016-11-01 Thread chen
# HG changeset patch
# User Min Chen <min.c...@multicorewareinc.com>
# Date 1478030336 18000
# Node ID 201758801366fb5e5b59710d87f4b8da911d6b73
# Parent  5fe7ac3068ebedc3d58451518c54c501e3c41103
[slices] restrict mv never beyond boundary in both slices and non-slices mode
---
 source/encoder/motion.cpp |   57 +++--
 1 files changed, 29 insertions(+), 28 deletions(-)


diff -r 5fe7ac3068eb -r 201758801366 source/encoder/motion.cpp
--- a/source/encoder/motion.cppTue Nov 01 14:58:53 2016 -0500
+++ b/source/encoder/motion.cppTue Nov 01 14:58:56 2016 -0500
@@ -278,13 +278,13 @@
 costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \
 costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \
 costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \
-if ((g_maxSlices == 1) | ((omv.y + m0y >= mvmin.y) & (omv.y + m0y <= 
mvmax.y))) \
+if ((omv.y + m0y >= mvmin.y) & (omv.y + m0y <= mvmax.y)) \
 COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \
-if ((g_maxSlices == 1) | ((omv.y + m1y >= mvmin.y) & (omv.y + m1y <= 
mvmax.y))) \
+if ((omv.y + m1y >= mvmin.y) & (omv.y + m1y <= mvmax.y)) \
 COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \
-if ((g_maxSlices == 1) | ((omv.y + m2y >= mvmin.y) & (omv.y + m2y <= 
mvmax.y))) \
+if ((omv.y + m2y >= mvmin.y) & (omv.y + m2y <= mvmax.y)) \
 COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \
-if ((g_maxSlices == 1) | ((omv.y + m3y >= mvmin.y) & (omv.y + m3y <= 
mvmax.y))) \
+if ((omv.y + m3y >= mvmin.y) & (omv.y + m3y <= mvmax.y)) \
 COPY2_IF_LT(bcost, costs[3], bmv, omv + MV(m3x, m3y)); \
 }
 
@@ -631,6 +631,7 @@
 {
 bcost = cost;
 bmv = 0;
+bmv.y = X265_MAX(X265_MIN(0, mvmax.y), mvmin.y);
 }
 }
 
@@ -663,9 +664,9 @@
 do
 {
 COST_MV_X4_DIR(0, -1, 0, 1, -1, 0, 1, 0, costs);
-if ((g_maxSlices == 1) | ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= 
mvmax.y)))
+if ((bmv.y - 1 >= mvmin.y) & (bmv.y - 1 <= mvmax.y))
 COPY1_IF_LT(bcost, (costs[0] << 4) + 1);
-if ((g_maxSlices == 1) | ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= 
mvmax.y)))
+if ((bmv.y + 1 >= mvmin.y) & (bmv.y + 1 <= mvmax.y))
 COPY1_IF_LT(bcost, (costs[1] << 4) + 3);
 COPY1_IF_LT(bcost, (costs[2] << 4) + 4);
 COPY1_IF_LT(bcost, (costs[3] << 4) + 12);
@@ -704,18 +705,18 @@
   /* equivalent to the above, but eliminates duplicate candidates */
 COST_MV_X3_DIR(-2, 0, -1, 2,  1, 2, costs);
 bcost <<= 3;
-if ((g_maxSlices == 1) | ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y)))
+if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y))
 COPY1_IF_LT(bcost, (costs[0] << 3) + 2);
-if ((g_maxSlices == 1) | ((bmv.y + 2 >= mvmin.y) & (bmv.y + 2 <= 
mvmax.y)))
+if ((bmv.y + 2 >= mvmin.y) & (bmv.y + 2 <= mvmax.y))
 {
 COPY1_IF_LT(bcost, (costs[1] << 3) + 3);
 COPY1_IF_LT(bcost, (costs[2] << 3) + 4);
 }
 
 COST_MV_X3_DIR(2, 0,  1, -2, -1, -2, costs);
-if ((g_maxSlices == 1) | ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y)))
+if ((bmv.y >= mvmin.y) & (bmv.y <= mvmax.y))
 COPY1_IF_LT(bcost, (costs[0] << 3) + 5);
-if ((g_maxSlices == 1) | ((bmv.y - 2 >= mvmin.y) & (bmv.y - 2 <= 
mvmax.y)))
+if ((bmv.y - 2 >= mvmin.y) & (bmv.y - 2 <= mvmax.y))
 {
 COPY1_IF_LT(bcost, (costs[1] << 3) + 6);
 COPY1_IF_LT(bcost, (costs[2] << 3) + 7);
@@ -725,7 +726,7 @@
 {
 int dir = (bcost & 7) - 2;
 
-if ((g_maxSlices == 1) | ((bmv.y + hex2[dir + 1].y >= mvmin.y) & 
(bmv.y + hex2[dir + 1].y <= mvmax.y)))
+if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + hex2[dir + 
1].y <= mvmax.y))
 {
 bmv += hex2[dir + 1];
 
@@ -738,13 +739,13 @@
 costs);
 bcost &= ~7;
 
-if ((g_maxSlices == 1) | ((bmv.y + hex2[dir + 0].y >= 
mvmin.y) & (bmv.y + hex2[dir + 0].y <= mvmax.y)))
+if ((bmv.y + hex2[dir + 0].y >= mvmin.y) & (bmv.y + 
hex2[dir + 0].y <= mvmax.y))
 COPY1_IF_LT(bcost, (costs[0] << 3) + 1);
 
-if ((g_maxSlices == 1) | ((bmv.y + hex2[dir + 1].y >= 
mvmin.y) & (bmv.y + hex2[dir + 1].y <= mvmax.y)))
+if ((bmv.y + hex2[dir + 1].y >= mvmin.y) & (bmv.y + 
hex2[dir + 1].y <= mvmax.

Re: [x265] [slices] fix multi-slices output non-determination bug

2016-11-01 Thread chen

 2016-11-01 11:40:45,"Pradeep Ramachandran" <prad...@multicorewareinc.com> :



On Mon, Oct 31, 2016 at 11:03 PM, chen <chenm...@163.com> wrote:

# HG changeset patch
# User Min Chen <min.c...@multicorewareinc.com>
# Date 1477935084 18000
# Node ID 9be03f08789954f772a50f26485a9c96ca745497
# Parent  b08109b3701e9b86010c5a5ed0ad7b3d6a051911
[slices] fix multi-slices output non-determination bug
---
 source/common/common.h  |2 +-
 source/encoder/analysis.cpp |8 +-
 source/encoder/frameencoder.cpp |   15 ++---
 source/encoder/motion.cpp   |  116 +++---
 source/encoder/sao.cpp  |7 ++
 source/encoder/search.cpp   |3 +
 6 files changed, 104 insertions(+), 47 deletions(-)


diff -r b08109b3701e -r 9be03f087899 source/common/common.h
--- a/source/common/common.hFri Oct 28 10:28:15 2016 +0800
+++ b/source/common/common.hMon Oct 31 12:31:24 2016 -0500
@@ -176,7 +176,7 @@
 
 #define X265_MIN(a, b) ((a) < (b) ? (a) : (b))
 #define X265_MAX(a, b) ((a) > (b) ? (a) : (b))
-#define COPY1_IF_LT(x, y) if ((y) < (x)) (x) = (y);
+#define COPY1_IF_LT(x, y) {if ((y) < (x)) (x) = (y);}
 #define COPY2_IF_LT(x, y, a, b) \
 if ((y) < (x)) \
 { \
diff -r b08109b3701e -r 9be03f087899 source/encoder/analysis.cpp
--- a/source/encoder/analysis.cppFri Oct 28 10:28:15 2016 +0800
+++ b/source/encoder/analysis.cppMon Oct 31 12:31:24 2016 -0500
@@ -1942,12 +1942,12 @@
 if (m_param->maxSlices > 1)
 {
 // NOTE: First row in slice can't negative
-if ((candMvField[i][0].mv.y < m_sliceMinY) | 
(candMvField[i][1].mv.y < m_sliceMinY))
+if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < 
m_sliceMinY)
 continue;
 
 // Last row in slice can't reference beyond bound since it is 
another slice area
 // TODO: we may beyond bound in future since these area have a 
chance to finish because we use parallel slices. Necessary prepare research on 
load balance
-if ((candMvField[i][0].mv.y > m_sliceMaxY) | 
(candMvField[i][1].mv.y > m_sliceMaxY))
+if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > 
m_sliceMaxY)
 continue;
 }
 
@@ -2072,12 +2072,12 @@
 if (m_param->maxSlices > 1)
 {
 // NOTE: First row in slice can't negative
-if ((candMvField[i][0].mv.y < m_sliceMinY) | 
(candMvField[i][1].mv.y < m_sliceMinY))
+if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < 
m_sliceMinY)
 continue;
 
 // Last row in slice can't reference beyond bound since it is 
another slice area
 // TODO: we may beyond bound in future since these area have a 
chance to finish because we use parallel slices. Necessary prepare research on 
load balance
-if ((candMvField[i][0].mv.y > m_sliceMaxY) | 
(candMvField[i][1].mv.y > m_sliceMaxY))
+if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > 
m_sliceMaxY)
 continue;
 }
 
diff -r b08109b3701e -r 9be03f087899 source/encoder/frameencoder.cpp
--- a/source/encoder/frameencoder.cppFri Oct 28 10:28:15 2016 +0800
+++ b/source/encoder/frameencoder.cppMon Oct 31 12:31:24 2016 -0500
@@ -123,7 +123,7 @@
 int range  = m_param->searchRange;   /* fpel search */
 range += !!(m_param->searchMethod < 2);  /* diamond/hex range check lag */
 range += NTAPS_LUMA / 2; /* subpel filter half-length */
-range += 2 + MotionEstimate::hpelIterationCount(m_param->subpelRefine) / 
2; /* subpel refine steps */
+range += 2 + (MotionEstimate::hpelIterationCount(m_param->subpelRefine) + 
1) / 2; /* subpel refine steps */
 m_refLagRows = /*(m_param->maxSlices > 1 ? 1 : 0) +*/ 1 + ((range + 
g_maxCUSize - 1) / g_maxCUSize);
 
 // NOTE: 2 times of numRows because both Encoder and Filter in same queue
@@ -654,8 +654,7 @@
 const uint32_t sliceEndRow = m_sliceBaseRow[sliceId + 1] - 1;
 const uint32_t row = sliceStartRow + rowInSlice;
 
-if (row >= m_numRows)
-break;
+X265_CHECK(row < m_numRows, "slices row fault was detected");
 
 if (row > sliceEndRow)
 continue;
@@ -674,7 +673,7 @@
 refpic->m_reconRowFlag[rowIdx].waitForChange(0);
 
 if ((bUseWeightP || bUseWeightB) && 
m_mref[l][ref].isWeighted)
-m_mref[l][ref].applyWeight(row + m_refLagRows, 
m_numRows, sliceEndRow + 1, sliceId);
+m_mref[l][ref].applyWeight(rowIdx, m_numRows, 
sliceEndRow, sliceId

[x265] [slices] fix multi-slices output non-determination bug

2016-10-31 Thread chen
# HG changeset patch
# User Min Chen <min.c...@multicorewareinc.com>
# Date 1477935084 18000
# Node ID 9be03f08789954f772a50f26485a9c96ca745497
# Parent  b08109b3701e9b86010c5a5ed0ad7b3d6a051911
[slices] fix multi-slices output non-determination bug
---
 source/common/common.h  |2 +-
 source/encoder/analysis.cpp |8 +-
 source/encoder/frameencoder.cpp |   15 ++---
 source/encoder/motion.cpp   |  116 +++---
 source/encoder/sao.cpp  |7 ++
 source/encoder/search.cpp   |3 +
 6 files changed, 104 insertions(+), 47 deletions(-)


diff -r b08109b3701e -r 9be03f087899 source/common/common.h
--- a/source/common/common.hFri Oct 28 10:28:15 2016 +0800
+++ b/source/common/common.hMon Oct 31 12:31:24 2016 -0500
@@ -176,7 +176,7 @@
 
 #define X265_MIN(a, b) ((a) < (b) ? (a) : (b))
 #define X265_MAX(a, b) ((a) > (b) ? (a) : (b))
-#define COPY1_IF_LT(x, y) if ((y) < (x)) (x) = (y);
+#define COPY1_IF_LT(x, y) {if ((y) < (x)) (x) = (y);}
 #define COPY2_IF_LT(x, y, a, b) \
 if ((y) < (x)) \
 { \
diff -r b08109b3701e -r 9be03f087899 source/encoder/analysis.cpp
--- a/source/encoder/analysis.cppFri Oct 28 10:28:15 2016 +0800
+++ b/source/encoder/analysis.cppMon Oct 31 12:31:24 2016 -0500
@@ -1942,12 +1942,12 @@
 if (m_param->maxSlices > 1)
 {
 // NOTE: First row in slice can't negative
-if ((candMvField[i][0].mv.y < m_sliceMinY) | 
(candMvField[i][1].mv.y < m_sliceMinY))
+if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < 
m_sliceMinY)
 continue;
 
 // Last row in slice can't reference beyond bound since it is 
another slice area
 // TODO: we may beyond bound in future since these area have a 
chance to finish because we use parallel slices. Necessary prepare research on 
load balance
-if ((candMvField[i][0].mv.y > m_sliceMaxY) | 
(candMvField[i][1].mv.y > m_sliceMaxY))
+if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > 
m_sliceMaxY)
 continue;
 }
 
@@ -2072,12 +2072,12 @@
 if (m_param->maxSlices > 1)
 {
 // NOTE: First row in slice can't negative
-if ((candMvField[i][0].mv.y < m_sliceMinY) | 
(candMvField[i][1].mv.y < m_sliceMinY))
+if (X265_MIN(candMvField[i][0].mv.y, candMvField[i][1].mv.y) < 
m_sliceMinY)
 continue;
 
 // Last row in slice can't reference beyond bound since it is 
another slice area
 // TODO: we may beyond bound in future since these area have a 
chance to finish because we use parallel slices. Necessary prepare research on 
load balance
-if ((candMvField[i][0].mv.y > m_sliceMaxY) | 
(candMvField[i][1].mv.y > m_sliceMaxY))
+if (X265_MAX(candMvField[i][0].mv.y, candMvField[i][1].mv.y) > 
m_sliceMaxY)
 continue;
 }
 
diff -r b08109b3701e -r 9be03f087899 source/encoder/frameencoder.cpp
--- a/source/encoder/frameencoder.cppFri Oct 28 10:28:15 2016 +0800
+++ b/source/encoder/frameencoder.cppMon Oct 31 12:31:24 2016 -0500
@@ -123,7 +123,7 @@
 int range  = m_param->searchRange;   /* fpel search */
 range += !!(m_param->searchMethod < 2);  /* diamond/hex range check lag */
 range += NTAPS_LUMA / 2; /* subpel filter half-length */
-range += 2 + MotionEstimate::hpelIterationCount(m_param->subpelRefine) / 
2; /* subpel refine steps */
+range += 2 + (MotionEstimate::hpelIterationCount(m_param->subpelRefine) + 
1) / 2; /* subpel refine steps */
 m_refLagRows = /*(m_param->maxSlices > 1 ? 1 : 0) +*/ 1 + ((range + 
g_maxCUSize - 1) / g_maxCUSize);
 
 // NOTE: 2 times of numRows because both Encoder and Filter in same queue
@@ -654,8 +654,7 @@
 const uint32_t sliceEndRow = m_sliceBaseRow[sliceId + 1] - 1;
 const uint32_t row = sliceStartRow + rowInSlice;
 
-if (row >= m_numRows)
-break;
+X265_CHECK(row < m_numRows, "slices row fault was detected");
 
 if (row > sliceEndRow)
 continue;
@@ -674,7 +673,7 @@
 refpic->m_reconRowFlag[rowIdx].waitForChange(0);
 
 if ((bUseWeightP || bUseWeightB) && 
m_mref[l][ref].isWeighted)
-m_mref[l][ref].applyWeight(row + m_refLagRows, 
m_numRows, sliceEndRow + 1, sliceId);
+m_mref[l][ref].applyWeight(rowIdx, m_numRows, 
sliceEndRow, sliceId);
 }
 }
 
@@ -714,7 +713,7 @@
 refpic->m_reconRowFlag[rowIdx].waitForChange(0);
 
 if ((bUseWeightP

[x265] [PATCH] [slices] allow number of slices more than rows (Issue #300-3)

2016-10-27 Thread chen
From e697fcd5fa0d36b33d42d01c2845ca36533dbd96 Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Thu, 27 Oct 2016 11:11:09 -0500
Subject: [PATCH] [slices] allow number of slices more than rows (Issue #300-3)


---
 source/common/param.cpp|2 --
 source/encoder/encoder.cpp |   13 +
 source/encoder/nal.h   |1 +
 3 files changed, 14 insertions(+), 2 deletions(-)


diff --git a/source/common/param.cpp b/source/common/param.cpp
index cd62310..d8ffa16 100644
--- a/source/common/param.cpp
+++ b/source/common/param.cpp
@@ -1249,8 +1249,6 @@ int x265_check_params(x265_param* param)
 "qpmin exceeds supported range (0 to 69)");
 CHECK(param->log2MaxPocLsb < 4,
 "maximum of the picture order count can not be less than 4");
-CHECK(1 > param->maxSlices || param->maxSlices > ((param->sourceHeight + 
param->maxCUSize - 1) / param->maxCUSize),
-"The slices can not be more than number of rows");
 return check_failed;
 }
 
diff --git a/source/encoder/encoder.cpp b/source/encoder/encoder.cpp
index 424018b..3734c24 100644
--- a/source/encoder/encoder.cpp
+++ b/source/encoder/encoder.cpp
@@ -2120,6 +2120,19 @@ void Encoder::configure(x265_param *p)
 p->log2MaxPocLsb = 4;
 }
 
+if (p->maxSlices < 1)
+{
+x265_log(p, X265_LOG_WARNING, "maxSlices can not be less than 1, force 
set to 1\n");
+p->maxSlices = 1;
+}
+
+const uint32_t numRows = (p->sourceHeight + p->maxCUSize - 1) / 
p->maxCUSize;
+const uint32_t slicesLimit = X265_MIN(numRows, NALList::MAX_NAL_UNITS - 1);
+if (p->maxSlices > numRows)
+{
+x265_log(p, X265_LOG_WARNING, "maxSlices can not be more than 
min(rows, MAX_NAL_UNITS-1), force set to %d\n", slicesLimit);
+p->maxSlices = slicesLimit;
+}
 }
 
 void Encoder::allocAnalysis(x265_analysis_data* analysis)
diff --git a/source/encoder/nal.h b/source/encoder/nal.h
index 15e542d..35f6961 100644
--- a/source/encoder/nal.h
+++ b/source/encoder/nal.h
@@ -34,6 +34,7 @@ class Bitstream;
 
 class NALList
 {
+public:
 static const int MAX_NAL_UNITS = 16;
 
 public:
-- 
1.7.9.msysgit.0

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] [PPC] support option --no-asm to disable Altivec

2016-10-25 Thread chen
All of his original files are in another patch; it is very large, so the 
mailing list is holding it until you approve it.


At 2016-10-25 11:59:45,"Pradeep Ramachandran" <prad...@multicorewareinc.com> 
wrote:



On Tue, Oct 25, 2016 at 2:59 AM, chen <chenm...@163.com> wrote:

From d23527c6204921b782ef8bc2f1a69de88163202a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Mon, 24 Oct 2016 16:27:35 -0500
Subject: [PATCH] [PPC] support option --no-asm to disable Altivec


On what parent was this patch created? These don't apply on the current tip.
Also, these don't look like regular hg patches that come on the mailing list; 
can you please fix and send?
 


---
 source/CMakeLists.txt|2 +-
 source/common/cpu.cpp|   17 -
 source/common/primitives.cpp |   11 +++
 source/common/version.cpp|4 +---
 source/x265.h|3 +++
 5 files changed, 28 insertions(+), 9 deletions(-)


diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index 18cad9a..9e8e5ab 100644
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -426,7 +426,7 @@ if(POWER)
 
 option(CPU_POWER8 "Enable CPU POWER8 profiling instrumentation" ON)
 if(CPU_POWER8)
-add_definitions(-mcpu=power8)
+add_definitions(-mcpu=power8 -DX265_ARCH_POWER8=1)
 endif()
 endif()
 
diff --git a/source/common/cpu.cpp b/source/common/cpu.cpp
index 0dafe48..5bd1e0f 100644
--- a/source/common/cpu.cpp
+++ b/source/common/cpu.cpp
@@ -99,6 +99,10 @@ const cpu_name_t cpu_names[] =
 { "ARMv6",   X265_CPU_ARMV6 },
 { "NEON",X265_CPU_NEON },
 { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
+
+#elif X265_ARCH_POWER8
+{ "Altivec", X265_CPU_ALTIVEC },
+
 #endif // if X265_ARCH_X86
 { "", 0 },
 };
@@ -363,7 +367,18 @@ uint32_t cpu_detect(void)
 return flags;
 }
 
-#else // if X265_ARCH_X86
+#elif X265_ARCH_POWER8
+
+uint32_t cpu_detect(void)
+{
+#if HAVE_ALTIVEC
+return X265_CPU_ALTIVEC;
+#else
+return 0;
+#endif
+}
+
+#else // if X265_ARCH_POWER8
 
 uint32_t cpu_detect(void)
 {
diff --git a/source/common/primitives.cpp b/source/common/primitives.cpp
index 569f8ad..e000a2f 100644
--- a/source/common/primitives.cpp
+++ b/source/common/primitives.cpp
@@ -244,10 +244,13 @@ void x265_setup_primitives(x265_param *param)
 setupAssemblyPrimitives(primitives, param->cpuid);
 #endif
 #if HAVE_ALTIVEC
-setupPixelPrimitives_altivec(primitives);  // pixel_altivec.cpp, 
overwrite the initialization for altivec optimizated functions
-setupDCTPrimitives_altivec(primitives);// dct_altivec.cpp, 
overwrite the initialization for altivec optimizated functions
-setupFilterPrimitives_altivec(primitives); // ipfilter.cpp, 
overwrite the initialization for altivec optimizated functions
-setupIntraPrimitives_altivec(primitives); // intrapred_altivec.cpp, 
overwrite the initialization for altivec optimizated functions
+if (param->cpuid & X265_CPU_ALTIVEC)
+{
+setupPixelPrimitives_altivec(primitives);   // 
pixel_altivec.cpp, overwrite the initialization for altivec optimizated 
functions
+setupDCTPrimitives_altivec(primitives); // 
dct_altivec.cpp, overwrite the initialization for altivec optimizated functions
+setupFilterPrimitives_altivec(primitives);  // ipfilter.cpp, 
overwrite the initialization for altivec optimizated functions
+setupIntraPrimitives_altivec(primitives);   // 
intrapred_altivec.cpp, overwrite the initialization for altivec optimizated 
functions
+}
 #endif
 
 setupAliasPrimitives(primitives);
diff --git a/source/common/version.cpp b/source/common/version.cpp
index dd114a3..e4d7554 100644
--- a/source/common/version.cpp
+++ b/source/common/version.cpp
@@ -77,10 +77,8 @@
 #define BITS"[32 bit]"
 #endif
 
-#if defined(ENABLE_ASSEMBLY)
+#if defined(ENABLE_ASSEMBLY) || HAVE_ALTIVEC
 #define ASM ""
-#elif HAVE_ALTIVEC
-#define ASM "[altivec]"
 #else
 #define ASM "[noasm]"
 #endif
diff --git a/source/x265.h b/source/x265.h
index 6ef27de..e6a8b01 100644
--- a/source/x265.h
+++ b/source/x265.h
@@ -335,6 +335,9 @@ typedef enum
 #define X265_CPU_NEON0x002  /* ARM NEON */
 #define X265_CPU_FAST_NEON_MRC   0x004  /* Transfer from NEON to ARM 
register is fast (Cortex-A9) */
 
+/* IBM Power8 */
+#define X265_CPU_ALTIVEC 0x001
+
 #define X265_MAX_SUBPEL_LEVEL   7
 
 /* Log level */
-- 
1.7.9.msysgit.0



___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel





[x265] [PATCH] [PPC] GPL v2 copyright header

2016-10-24 Thread chen
From 1bea85513646e4d9d992bbe326a9cb3275ec313a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Mon, 24 Oct 2016 16:38:55 -0500
Subject: [PATCH] [PPC] GPL v2 copyright header


---
 source/common/ppc/dct_altivec.cpp   |   24 
 source/common/ppc/intrapred_altivec.cpp |   28 
 source/common/ppc/ipfilter_altivec.cpp  |   28 
 3 files changed, 72 insertions(+), 8 deletions(-)


diff --git a/source/common/ppc/dct_altivec.cpp 
b/source/common/ppc/dct_altivec.cpp
index 0d33be4..7542a8e 100644
--- a/source/common/ppc/dct_altivec.cpp
+++ b/source/common/ppc/dct_altivec.cpp
@@ -1,3 +1,27 @@
+/*
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Roger Moussalli <rmous...@us.ibm.com>
+ *  Min Chen <min.c...@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ */
+
 #include "common.h"
 #include "primitives.h"
 #include "contexts.h"   // costCoeffNxN_c
diff --git a/source/common/ppc/intrapred_altivec.cpp 
b/source/common/ppc/intrapred_altivec.cpp
index f2b1c5e..d27f5b6 100644
--- a/source/common/ppc/intrapred_altivec.cpp
+++ b/source/common/ppc/intrapred_altivec.cpp
@@ -1,12 +1,32 @@
-#include 
+/*
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Roger Moussalli <rmous...@us.ibm.com>
+ *  Min Chen <min.c...@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ */
 
+#include 
 #include 
-
 #include 
-
 #include 
 #include 
-
 #include 
 #include 
 #include 
diff --git a/source/common/ppc/ipfilter_altivec.cpp 
b/source/common/ppc/ipfilter_altivec.cpp
index 3468968..55ee76a 100644
--- a/source/common/ppc/ipfilter_altivec.cpp
+++ b/source/common/ppc/ipfilter_altivec.cpp
@@ -1,14 +1,34 @@
-#include 
+/*
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Roger Moussalli <rmous...@us.ibm.com>
+ *  Min Chen <min.c...@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ */
 
+#include 
 #incl

[x265] [PATCH] [PPC] support option --no-asm to disable Altivec

2016-10-24 Thread chen
From d23527c6204921b782ef8bc2f1a69de88163202a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Mon, 24 Oct 2016 16:27:35 -0500
Subject: [PATCH] [PPC] support option --no-asm to disable Altivec


---
 source/CMakeLists.txt|2 +-
 source/common/cpu.cpp|   17 -
 source/common/primitives.cpp |   11 +++
 source/common/version.cpp|4 +---
 source/x265.h|3 +++
 5 files changed, 28 insertions(+), 9 deletions(-)


diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index 18cad9a..9e8e5ab 100644
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -426,7 +426,7 @@ if(POWER)
 
 option(CPU_POWER8 "Enable CPU POWER8 profiling instrumentation" ON)
 if(CPU_POWER8)
-add_definitions(-mcpu=power8)
+add_definitions(-mcpu=power8 -DX265_ARCH_POWER8=1)
 endif()
 endif()
 
diff --git a/source/common/cpu.cpp b/source/common/cpu.cpp
index 0dafe48..5bd1e0f 100644
--- a/source/common/cpu.cpp
+++ b/source/common/cpu.cpp
@@ -99,6 +99,10 @@ const cpu_name_t cpu_names[] =
 { "ARMv6",   X265_CPU_ARMV6 },
 { "NEON",X265_CPU_NEON },
 { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
+
+#elif X265_ARCH_POWER8
+{ "Altivec", X265_CPU_ALTIVEC },
+
 #endif // if X265_ARCH_X86
 { "", 0 },
 };
@@ -363,7 +367,18 @@ uint32_t cpu_detect(void)
 return flags;
 }
 
-#else // if X265_ARCH_X86
+#elif X265_ARCH_POWER8
+
+uint32_t cpu_detect(void)
+{
+#if HAVE_ALTIVEC
+return X265_CPU_ALTIVEC;
+#else
+return 0;
+#endif
+}
+
+#else // if X265_ARCH_POWER8
 
 uint32_t cpu_detect(void)
 {
diff --git a/source/common/primitives.cpp b/source/common/primitives.cpp
index 569f8ad..e000a2f 100644
--- a/source/common/primitives.cpp
+++ b/source/common/primitives.cpp
@@ -244,10 +244,13 @@ void x265_setup_primitives(x265_param *param)
 setupAssemblyPrimitives(primitives, param->cpuid);
 #endif
 #if HAVE_ALTIVEC
-setupPixelPrimitives_altivec(primitives);  // pixel_altivec.cpp, 
overwrite the initialization for altivec optimizated functions
-setupDCTPrimitives_altivec(primitives);// dct_altivec.cpp, 
overwrite the initialization for altivec optimizated functions
-setupFilterPrimitives_altivec(primitives); // ipfilter.cpp, 
overwrite the initialization for altivec optimizated functions
-setupIntraPrimitives_altivec(primitives); // intrapred_altivec.cpp, 
overwrite the initialization for altivec optimizated functions
+if (param->cpuid & X265_CPU_ALTIVEC)
+{
+setupPixelPrimitives_altivec(primitives);   // 
pixel_altivec.cpp, overwrite the initialization for altivec optimizated 
functions
+setupDCTPrimitives_altivec(primitives); // 
dct_altivec.cpp, overwrite the initialization for altivec optimizated functions
+setupFilterPrimitives_altivec(primitives);  // ipfilter.cpp, 
overwrite the initialization for altivec optimizated functions
+setupIntraPrimitives_altivec(primitives);   // 
intrapred_altivec.cpp, overwrite the initialization for altivec optimizated 
functions
+}
 #endif
 
 setupAliasPrimitives(primitives);
diff --git a/source/common/version.cpp b/source/common/version.cpp
index dd114a3..e4d7554 100644
--- a/source/common/version.cpp
+++ b/source/common/version.cpp
@@ -77,10 +77,8 @@
 #define BITS"[32 bit]"
 #endif
 
-#if defined(ENABLE_ASSEMBLY)
+#if defined(ENABLE_ASSEMBLY) || HAVE_ALTIVEC
 #define ASM ""
-#elif HAVE_ALTIVEC
-#define ASM "[altivec]"
 #else
 #define ASM "[noasm]"
 #endif
diff --git a/source/x265.h b/source/x265.h
index 6ef27de..e6a8b01 100644
--- a/source/x265.h
+++ b/source/x265.h
@@ -335,6 +335,9 @@ typedef enum
 #define X265_CPU_NEON0x002  /* ARM NEON */
 #define X265_CPU_FAST_NEON_MRC   0x004  /* Transfer from NEON to ARM 
register is fast (Cortex-A9) */
 
+/* IBM Power8 */
+#define X265_CPU_ALTIVEC 0x001
+
 #define X265_MAX_SUBPEL_LEVEL   7
 
 /* Log level */
-- 
1.7.9.msysgit.0

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] Tile support in x265

2016-10-12 Thread chen
Thank you for helping to reply to that message.

I am the developer for WPP and slices. Motion vectors are now restricted to 
the slice boundary, and I will apply the same restriction to tiles. In future, 
we will add a new user option to allow MVs beyond the boundary.
Please note that this is a low-priority task in the company; we will focus on 
other funded tasks for the next couple of months.
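
The restriction described above can be pictured as a clamp on the vertical MV component. The following is only an illustrative sketch, not code taken from x265: `MotionVector`, `clampToSlice` and `insideSlice` are hypothetical names, and `minY`/`maxY` are assumed to be the allowed vertical range for the current slice, in the same units as the vector.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a motion vector; not the x265 MV class.
struct MotionVector
{
    int x;
    int y;
};

// Clamp only the vertical component into [minY, maxY]; x is left untouched.
inline MotionVector clampToSlice(MotionVector mv, int minY, int maxY)
{
    assert(minY <= maxY);
    mv.y = std::max(std::min(mv.y, maxY), minY);
    return mv;
}

// True when a candidate's vertical component stays inside the slice range.
inline bool insideSlice(const MotionVector& mv, int minY, int maxY)
{
    return mv.y >= minY && mv.y <= maxY;
}
```

A search loop could then skip any candidate for which `insideSlice` is false, and start from a clamped zero vector rather than an unconditional (0, 0).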


Regards,
Min

At 2016-10-12 16:45:32,"Mario *LigH* Rohkrämer"  wrote:
>I am no developer; but as far as I remember from conversations of the last  
>month:
>
>Not yet, so far there is a mature implementation of WPP, and slices are in  
>development; tiles would be the next step "soon(tm)", but I don't know  
>about the schedule.
>
>
>Am 12.10.2016, 10:32 Uhr, schrieb Kammachi-Sreedhar Kashyap  
>(Nokia-TECH/Tampere) :
>
>>
>> Does x265 provide support for tiles (where motion vectors are bounded  
>> within the tile boundary)?
>>
>> Regards
>> KS
>
>
>-- 
>
>Fun and success!
>Mario *LigH* Rohkrämer
>mailto:cont...@ligh.de
> 
>___
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] x265-devel Digest, Vol 40, Issue 26

2016-09-28 Thread chen
Hello Xuefeng,

I understand your concept. I only meant that your algorithm's output is close 
to the average of the slice base QPs, but it gets there through a more 
complicated computation.

Regards,
Min


At 2016-09-28 13:59:13,xuefeng <xuef...@multicorewareinc.com> wrote:

Hello Min,
Thanks for your reply.


a)
deltaQp = sliceQp - ppsQp
ppsQp is saved in the PPS and deltaQp is saved in the bitstream.
In fact, I do not calculate a new QP; I calculate the best offset to apply to 
ppsQp and deltaQp without changing sliceQp.
For example, deltaQp = 14 (cost 4 bits), sliceQp = 40, and ppsQp = 26.
If we set deltaQp = 14 - 14 = 0 (cost 1 bit) and ppsQp = 26 + 14 = 40, 
sliceQp is still 40. Then we save 3 - 1 = 2 bits in the bitstream for each slice.


b)
I have changed the QP range to account for the range extension. Please see this patch.

Regards,
 
 Xuefeng Jiang
 
xuef...@multicorewareinc.com
 
Message: 3
Date: Wed, 28 Sep 2016 00:02:33 +0800 (CST)
From: chen  <chenm...@163.com>
To: "Development for x265" <x265-devel@videolan.org>
Subject: Re: [x265] Optimize slice QP in PPS for x265
Message-ID: <2e196b48.110e.1576c622432.coremail.chenm...@163.com>
Content-Type: text/plain; charset="gbk"
 
Hello Xuefeng,
 
 
Your idea is good: in a low-bitrate environment, the MVs and headers are the 
most important part of the bitstream.
I took a look at your code, and it seems to have some problems.
 
 
You calculate the correlation between sliceQp and the QP range (which is 
[0, 51] without range extension), so you will get a constant correlation array 
for every QP value.
In the end, your algorithm outputs a QP close to the average value of sliceQp.
That is correct; it just spends more time on computation.
 
 
Regards,
Min
 
 
At 2016-09-27 14:46:14,xuefeng <xuef...@multicorewareinc.com> wrote:
 
All,
hello!
 
 
x265 sets the slice QP in the PPS to 26.  Bits can be saved by calculating a closer 
approximation to the actual slice QP values used to encode the bitstream at 
different quality levels. The delta QP in each slice header is large, especially 
at low bit rates and quality levels.
 
 
My test command is as follows.
--repeat-headers --hash 1  --input-res 1280x720 --keyint 30 --min-keyint 30 
--input "Johnny_1280x720.y4m" --fps 30 --output "test_new.mp4"
 
 
There is a patch in the attachment for this method based on "Changeset: 11587 
(d20b78d6d138)".
There is x265 encoding output showing that the bitrate goes down, and HM 
decoding output showing that the MD5 and QP are unchanged with this 
method.
 
 
 
 
 
 
Regards,
 Xuefeng Jiang
xuef...@multicorewareinc.com
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] Optimize slice QP in PPS for x265

2016-09-27 Thread chen
Hello Xuefeng,


Your idea is good: in a low-bitrate environment, the MVs and headers are the 
most significant part of the bitstream.
I took a look at your code, and it seems to have some problems.


You calculate the correlation between sliceQp and the QP range (which is [0, 51] 
without range extension), so you will get a constant correlation array for 
every QP value.
In the end, your algorithm outputs a QP close to the average value of sliceQp.
That is correct; it just spends more time on computation.


Regards,
Min


At 2016-09-27 14:46:14,xuefeng  wrote:

All,
hello! 


x265 sets the slice QP in the PPS to 26.  Bits can be saved by calculating a closer 
approximation to the actual slice QP values used to encode the bitstream at 
different quality levels. The delta QP in each slice header is large, especially 
at low bit rates and quality levels.


My test command is as follows.
--repeat-headers --hash 1  --input-res 1280x720 --keyint 30 --min-keyint 30 
--input "Johnny_1280x720.y4m" --fps 30 --output "test_new.mp4" 


There is a patch in the attachment for this method based on "Changeset: 11587 
(d20b78d6d138)".
There is x265 encoding output showing that the bitrate goes down, and HM 
decoding output showing that the MD5 and QP are unchanged with this 
method.






Regards,
 
 Xuefeng Jiang
 
xuef...@multicorewareinc.com
 ___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] frameFilter: check for reconRowFlag

2016-09-27 Thread chen
This patch introduces a logic bug: m_reconRowFlag and numRowFinished are used to 
enable the SAO filter once all rows have finished.
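
One possible reading of this concern, sketched outside the encoder: the count of 
finished rows gates the post-processing step, so if the flag array is skipped 
the count stays at 0 and the "all rows finished" condition can never become 
true. The functions countFinishedRows and allRowsFinished are hypothetical 
stand-ins, and std::vector<bool> only models the per-row flags; none of this is 
the actual x265 code.

```cpp
#include <cassert>
#include <vector>

// Count leading rows whose recon flag is set, stopping at the first
// unfinished row. A null pointer models a missing flag array, which is
// the case the guarded version of the loop skips.
int countFinishedRows(const std::vector<bool>* reconRowFlag, int numRows)
{
    int numRowFinished = 0;
    if (reconRowFlag)
    {
        for (numRowFinished = 0; numRowFinished < numRows; numRowFinished++)
            if (!(*reconRowFlag)[numRowFinished])
                break;
    }
    return numRowFinished;
}

// The final filter stage runs only when every row has finished.
bool allRowsFinished(const std::vector<bool>* reconRowFlag, int numRows)
{
    return countFinishedRows(reconRowFlag, numRows) == numRows;
}
```

In this model, a missing flag array means allRowsFinished() returns false for 
any frame with at least one row, so the stage that depends on it is never 
reached.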


At 2016-09-27 19:17:16,as...@multicorewareinc.com wrote:
># HG changeset patch
># User Ashok Kumar Mishra
># Date 1474974965 -19800
>#  Tue Sep 27 16:46:05 2016 +0530
># Node ID 5fa48115cfaa9022a72c84337b46df366c063ad0
># Parent  c0d91c2b40484664c3420abfffa10fa9cb707598
>frameFilter: check for reconRowFlag
>
>diff -r c0d91c2b4048 -r 5fa48115cfaa source/encoder/framefilter.cpp
>--- a/source/encoder/framefilter.cpp   Tue Sep 27 14:37:25 2016 +0530
>+++ b/source/encoder/framefilter.cpp   Tue Sep 27 16:46:05 2016 +0530
>@@ -503,10 +503,13 @@
> processPostRow(row);
> 
> // NOTE: slices parallelism will be execute out-of-order
>-int numRowFinished;
>-for(numRowFinished = 0; numRowFinished < m_numRows; numRowFinished++)
>-if (!m_frame->m_reconRowFlag[numRowFinished].get())
>-break;
>+int numRowFinished = 0;
>+if (m_frame->m_reconRowFlag)
>+{
>+for (numRowFinished = 0; numRowFinished < m_numRows; numRowFinished++)
>+if (!m_frame->m_reconRowFlag[numRowFinished].get())
>+break;
>+}
> 
> if (numRowFinished == m_numRows)
> {
>___
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)

2016-09-24 Thread chen
Hi Aasaipriya,


I think I know why our results differ: I use the standard tree rather than the 
multilib tree. I will try the multilib tree next Monday.


My tip is "rc: fix non-IDR slicetype in multi-pass".


Thanks,
Min


At 2016-09-24 19:49:30,"Aasaipriya Chandran" <aasaipr...@multicorewareinc.com> 
wrote:

Min,


Yes. To confirm the output mismatch, I checked with a very basic 
command line: BasketballDrive_1920x1080_50.y4m --preset ultrafast.
The outputs mismatch in all three builds (8/10/12-bit) on 
vc12-x86_64. I built each one separately and compared.
As Praveen said, only the binary file sizes remained the same; the outputs mismatched.
 
I tested on changeset d20b78d6d138, with your patch applied on top of it.


Can you tell us which tip you tested on, and whether you compared in any other way?




Thanks,
Aasaipriya


On Fri, Sep 23, 2016 at 4:04 PM, Praveen Tiwari <prav...@multicorewareinc.com> 
wrote:

Hi Min,
 Can you please verify with VC12? I double-checked, and I am getting 
different output with this patch. The 8-bit encoded file is the same size but a 
different binary (compared using Beyond Compare); for 10 and 12 bit, both the 
size and the binary differ. I applied your patch, built once (like an 8-bit 
build), collected the outputs for all depths (8, 10 and 12), and compared them 
with three separate builds of x265, i.e. 8-bit, 10-bit and 12-bit.


Regards,
Praveen  




On Fri, Sep 23, 2016 at 2:47 AM, chen <chenm...@163.com> wrote:

Hi Praveen,


I tested your command line on my VS2008 build.
I built the three-bit-depth version and compared it with the single-bit-depth 
versions, and the outputs still match for both 10 and 12 bit.


Regards,
Min

At 2016-09-22 14:39:50,"Praveen Tiwari" <prav...@multicorewareinc.com> wrote:

Hi Min,


 After this patch the outputs change. I tested the following command line for 
10-bit and 12-bit outputs.


--input=NebutaFestival_2560x1600_60_10bit_crop.yuv --input-res=2560x1600 
--fps=60  --numa-pools="NULL" --output-depth=12 --hash=1 -o  NFOut12.hevc









Regards,
Praveen


On Thu, Sep 15, 2016 at 1:55 AM, chen <chenm...@163.com> wrote:

From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Wed, 14 Sep 2016 15:23:38 -0500
Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)


---
 source/CMakeLists.txt |   40 +++-
 1 files changed, 39 insertions(+), 1 deletions(-)


diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index dd19d28..c2c2f7f 100644
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
"${PROJECT_BINARY_DIR}/x265_config.h")
 
+
 SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}")
 
 # System architecture detection
@@ -396,6 +397,39 @@ if(WIN32)
 endif(WINXP_SUPPORT)
 endif()
 
+
+if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?setParamAspectRatio@x265@@YAXPEAUx265_param@@HH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?getParamAspectRatio@x265@@YAXPEAUx265_param@@AEAH1@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log_file@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_10bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_12bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265@@YA_JXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_picturePlaneSize@x265@@YAI@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_report_simd@x265@@YAXPEAUx265_param@@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_fopen@x265@@YAPEAU_iobuf@@PEBD0@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_malloc@x265@@YAPEAX_K@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265@@YAXPEAX@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_atoi@x265@@YAHPEBDAEA_N@Z\n")
+f

Re: [x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)

2016-09-23 Thread chen
Hi Praveen,


I tried again with VC12 (VS2013), and the output matched.


BTW: I enabled the WINXP_SUPPORT configuration since I haven't installed the Win8 SDK.


Regards,
Min


At 2016-09-23 18:34:50,"Praveen Tiwari" <prav...@multicorewareinc.com> wrote:

Hi Min,
 Can you please verify with VC12? I double-checked, and I am getting 
different output with this patch. The 8-bit encoded file is the same size but a 
different binary (compared using Beyond Compare); for 10 and 12 bit, both the 
size and the binary differ. I applied your patch, built once (like an 8-bit 
build), collected the outputs for all depths (8, 10 and 12), and compared them 
with three separate builds of x265, i.e. 8-bit, 10-bit and 12-bit.


Regards,
Praveen  




On Fri, Sep 23, 2016 at 2:47 AM, chen <chenm...@163.com> wrote:

Hi Praveen,


I tested your command line on my VS2008 build.
I built the three-bit-depth version and compared it with the single-bit-depth 
versions, and the outputs still match for both 10 and 12 bit.


Regards,
Min

At 2016-09-22 14:39:50,"Praveen Tiwari" <prav...@multicorewareinc.com> wrote:

Hi Min,


 After this patch the outputs change. I tested the following command line for 
10-bit and 12-bit outputs.


--input=NebutaFestival_2560x1600_60_10bit_crop.yuv --input-res=2560x1600 
--fps=60  --numa-pools="NULL" --output-depth=12 --hash=1 -o  NFOut12.hevc









Regards,
Praveen


On Thu, Sep 15, 2016 at 1:55 AM, chen <chenm...@163.com> wrote:

From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Wed, 14 Sep 2016 15:23:38 -0500
Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)


---
 source/CMakeLists.txt |   40 +++-
 1 files changed, 39 insertions(+), 1 deletions(-)


diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index dd19d28..c2c2f7f 100644
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
"${PROJECT_BINARY_DIR}/x265_config.h")
 
+
 SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}")
 
 # System architecture detection
@@ -396,6 +397,39 @@ if(WIN32)
 endif(WINXP_SUPPORT)
 endif()
 
+
+if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?setParamAspectRatio@x265@@YAXPEAUx265_param@@HH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?getParamAspectRatio@x265@@YAXPEAUx265_param@@AEAH1@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log_file@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_10bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_12bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265@@YA_JXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_picturePlaneSize@x265@@YAI@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_report_simd@x265@@YAXPEAUx265_param@@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_fopen@x265@@YAPEAU_iobuf@@PEBD0@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_malloc@x265@@YAPEAX_K@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265@@YAXPEAX@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_atoi@x265@@YAHPEBDAEA_N@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?start@Thread@x265@@QEAA_NXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?stop@Thread@x265@@QEAAXXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??0Thread@x265@@QEAA@XZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??1Thread@x265@@UEAA@XZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?g_maxCUDepth@x265@@3IA\n")
+if(WINXP_SUPPORT)
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_init@x265@@YAHPEAUConditionVariable@1@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_wait@x265@@YAHPEAUConditionVariable@1@PEAU_RTL_CRITICAL_SECT

Re: [x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)

2016-09-22 Thread chen
Hi Praveen,


I tested your command line on my VS2008 build.
I built the three-bit-depth version and compared it with the single-bit-depth 
versions, and the outputs still match for both 10 and 12 bit.


Regards,
Min

At 2016-09-22 14:39:50,"Praveen Tiwari" <prav...@multicorewareinc.com> wrote:

Hi Min,


 After this patch the outputs change. I tested the following command line for 
10-bit and 12-bit outputs.


--input=NebutaFestival_2560x1600_60_10bit_crop.yuv --input-res=2560x1600 
--fps=60  --numa-pools="NULL" --output-depth=12 --hash=1 -o  NFOut12.hevc









Regards,
Praveen


On Thu, Sep 15, 2016 at 1:55 AM, chen <chenm...@163.com> wrote:

From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Wed, 14 Sep 2016 15:23:38 -0500
Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)


---
 source/CMakeLists.txt |   40 +++-
 1 files changed, 39 insertions(+), 1 deletions(-)


diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index dd19d28..c2c2f7f 100644
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
"${PROJECT_BINARY_DIR}/x265_config.h")
 
+
 SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}")
 
 # System architecture detection
@@ -396,6 +397,39 @@ if(WIN32)
 endif(WINXP_SUPPORT)
 endif()
 
+
+if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?setParamAspectRatio@x265@@YAXPEAUx265_param@@HH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?getParamAspectRatio@x265@@YAXPEAUx265_param@@AEAH1@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log_file@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_10bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_12bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265@@YA_JXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_picturePlaneSize@x265@@YAI@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_report_simd@x265@@YAXPEAUx265_param@@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_fopen@x265@@YAPEAU_iobuf@@PEBD0@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_malloc@x265@@YAPEAX_K@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265@@YAXPEAX@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_atoi@x265@@YAHPEBDAEA_N@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?start@Thread@x265@@QEAA_NXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?stop@Thread@x265@@QEAAXXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??0Thread@x265@@QEAA@XZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??1Thread@x265@@UEAA@XZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?g_maxCUDepth@x265@@3IA\n")
+if(WINXP_SUPPORT)
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_init@x265@@YAHPEAUConditionVariable@1@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_wait@x265@@YAHPEAUConditionVariable@1@PEAU_RTL_CRITICAL_SECTION@@K@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_destroy@x265@@YAXPEAUConditionVariable@1@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_broadcast@x265@@YAXPEAUConditionVariable@1@@Z\n")
+endif()
+endif()
+
 include(version) # determine X265_VERSION and X265_LATEST_TAG
 include_directories(. common encoder "${PROJECT_BINARY_DIR}")
 
@@ -608,7 +642,11 @@ if(ENABLE_CLI)
 if(WIN32 OR NOT ENABLE_SHARED OR INTEL_CXX)
 # The CLI cannot link to the shared library on Windows, it
 # requires internal APIs not exported from the DLL
-target_link_libraries(cli x265-static ${PLATFORM_LIBS})
+if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
+ 

[x265] [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)

2016-09-14 Thread chen
From ea50e494473623ed0dbff2907194aaf268dc449a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Wed, 14 Sep 2016 15:23:38 -0500
Subject: [PATCH] [multi-lib] Support 8+10+12 bits in single DLL (Workaround)


---
 source/CMakeLists.txt |   40 +++-
 1 files changed, 39 insertions(+), 1 deletions(-)


diff --git a/source/CMakeLists.txt b/source/CMakeLists.txt
index dd19d28..c2c2f7f 100644
--- a/source/CMakeLists.txt
+++ b/source/CMakeLists.txt
@@ -36,6 +36,7 @@ configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
"${PROJECT_BINARY_DIR}/x265_config.h")
 
+
 SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}")
 
 # System architecture detection
@@ -396,6 +397,39 @@ if(WIN32)
 endif(WINXP_SUPPORT)
 endif()
 
+
+if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?setParamAspectRatio@x265@@YAXPEAUx265_param@@HH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?getParamAspectRatio@x265@@YAXPEAUx265_param@@AEAH1@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log_file@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?general_log@x265@@YAXPEBUx265_param@@PEBDH1ZZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_10bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_get_94@x265_12bit@@YAPEBUx265_api@@H@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_10bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_api_query@x265_12bit@@YAPEBUx265_api@@HHPEAH@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_mdate@x265@@YA_JXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_picturePlaneSize@x265@@YAI@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_ssim2dB@x265@@YANN@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_report_simd@x265@@YAXPEAUx265_param@@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_fopen@x265@@YAPEAU_iobuf@@PEBD0@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_malloc@x265@@YAPEAX_K@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?x265_free@x265@@YAXPEAX@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?x265_atoi@x265@@YAHPEBDAEA_N@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?start@Thread@x265@@QEAA_NXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?stop@Thread@x265@@QEAAXXZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??0Thread@x265@@QEAA@XZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "??1Thread@x265@@UEAA@XZ\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def "?g_maxCUDepth@x265@@3IA\n")
+if(WINXP_SUPPORT)
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_init@x265@@YAHPEAUConditionVariable@1@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_wait@x265@@YAHPEAUConditionVariable@1@PEAU_RTL_CRITICAL_SECTION@@K@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_destroy@x265@@YAXPEAUConditionVariable@1@@Z\n")
+file(APPEND ${PROJECT_BINARY_DIR}/x265.def 
"?cond_broadcast@x265@@YAXPEAUConditionVariable@1@@Z\n")
+endif()
+endif()
+
 include(version) # determine X265_VERSION and X265_LATEST_TAG
 include_directories(. common encoder "${PROJECT_BINARY_DIR}")
 
@@ -608,7 +642,11 @@ if(ENABLE_CLI)
 if(WIN32 OR NOT ENABLE_SHARED OR INTEL_CXX)
 # The CLI cannot link to the shared library on Windows, it
 # requires internal APIs not exported from the DLL
-target_link_libraries(cli x265-static ${PLATFORM_LIBS})
+if(ENABLE_SHARED AND LINKED_10BIT AND LINKED_12BIT)
+target_link_libraries(cli x265-shared ${PLATFORM_LIBS})
+else()
+target_link_libraries(cli x265-static ${PLATFORM_LIBS})
+endif()
 else()
 target_link_libraries(cli x265-shared ${PLATFORM_LIBS})
 endif()
-- 
1.7.9.msysgit.0

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] [PATCH] [slice] fix help information defaule value mistake

2016-09-13 Thread chen
From ea93a3ddb7e8c7e106955acef56f6df72a15587a Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Tue, 13 Sep 2016 10:59:09 -0500
Subject: [PATCH] [slice] fix help information defaule value mistake


---
 source/x265cli.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/source/x265cli.h b/source/x265cli.h
index 2bd853f..99ac5c9 100644
--- a/source/x265cli.h
+++ b/source/x265cli.h
@@ -302,7 +302,7 @@ static void showHelp(x265_param *param)
 H0(" '-' implies no threads on node, '+' 
implies one thread per core on node\n");
 H0("-F/--frame-threads  Number of concurrently encoded 
frames. 0: auto-determined by core count\n");
 H0("   --[no-]wppEnable Wavefront Parallel Processing. 
Default %s\n", OPT(param->bEnableWavefront));
-H0("   --[no-]slicesEnable Multiple Slices feature. 
Default %s\n", OPT(param->maxSlices));
+H0("   --[no-]slicesEnable Multiple Slices feature. 
Default %d\n", param->maxSlices);
 H0("   --[no-]pmode  Parallel mode analysis. Default 
%s\n", OPT(param->bDistributeModeAnalysis));
 H0("   --[no-]pmeParallel motion estimation. Default 
%s\n", OPT(param->bDistributeMotionEstimation));
 H0("   --[no-]asm <bool|int|string>  Override CPU detection. Default: 
auto\n");
-- 
1.7.9.msysgit.0

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


Re: [x265] [PATCH 1 of 2] [slice] slice feature in help menu

2016-09-13 Thread chen
Thank you for pointing out my mistake. I forgot to check the default value 
field; I have fixed this bug now.
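
The difference between the two help lines can be reproduced with a small 
sketch. It assumes the CLI's OPT helper behaves like the macro below (any 
non-zero value reads as "enabled"); that definition, the shortened help 
strings, and the function names helpLineAsBool and helpLineAsInt are 
illustrative assumptions, not copied from x265cli.h.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Assumed shape of the helper: it collapses any integer to a boolean label.
#define OPT(value) ((value) ? "enabled" : "disabled")

// Formats the help line the way the original patch did, with %s and OPT().
std::string helpLineAsBool(int maxSlices)
{
    char buf[128];
    std::snprintf(buf, sizeof(buf), "--slices  Default %s\n", OPT(maxSlices));
    return buf;
}

// Formats the help line the way the fix does, printing the number itself.
std::string helpLineAsInt(int maxSlices)
{
    char buf[128];
    std::snprintf(buf, sizeof(buf), "--slices  Default %d\n", maxSlices);
    return buf;
}
```

For an integer option, the first form hides the actual default behind 
"enabled", which is the ambiguity reported in the message quoted below; the 
second form prints the value.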

At 2016-09-13 15:12:33,"Mario *LigH* Rohkrämer" <cont...@ligh.de> wrote:
>Am 07.09.2016, 22:27 Uhr, schrieb Min Chen <chenm...@163.com>:
>
>> +H0("   --[no-]slicesEnable Multiple Slices  
>> feature. Default %s\n", OPT(param->maxSlices));
>
>The result is:
>
>--[no-]slicesEnable Multiple Slices feature. Default  
>enabled
>
>Apparently, OPT(param->maxSlices) is preferably interpreted as boolean,  
>despite requesting an integer number as parameter when enabled. I would  
>guess the default number is 1?
>
>Should there be a less ambiguous format to represent a "default value, if  
>not disabled", or do I think too "German" here? ;-)
>
>-- 
>
>Fun and success!
>Mario *LigH* Rohkrämer
>mailto:cont...@ligh.de
> 
>___
>x265-devel mailing list
>x265-devel@videolan.org
>https://mailman.videolan.org/listinfo/x265-devel
___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel


[x265] [PATCH] [slice] verify untest path and enable it

2016-09-12 Thread chen
From dc6d861fd8f91c90e6bbdee366cfb7df5fdf183f Mon Sep 17 00:00:00 2001
From: Min Chen <min.c...@multicorewareinc.com>
Date: Mon, 12 Sep 2016 13:18:32 -0500
Subject: [PATCH] [slice] verify untest path and enable it


---
 source/encoder/frameencoder.cpp |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/source/encoder/frameencoder.cpp b/source/encoder/frameencoder.cpp
index 256eed8..65370ba 100644
--- a/source/encoder/frameencoder.cpp
+++ b/source/encoder/frameencoder.cpp
@@ -666,7 +666,7 @@ void FrameEncoder::compressFrame()
 refpic->m_reconRowFlag[rowIdx].waitForChange(0);
 
 if ((bUseWeightP || bUseWeightB) && 
m_mref[l][ref].isWeighted)
-m_mref[list][ref].applyWeight(i + m_refLagRows, 
m_numRows, m_numRows, 0xbaadbaad);
+m_mref[list][ref].applyWeight(i + m_refLagRows, 
m_numRows, m_numRows, 0);
 }
 }
 
-- 
1.7.9.msysgit.0

___
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel

