Sorry, I missed a line; resending with an additional comment.

At 2018-04-07 01:27:34, "chen" <> wrote:

At 2018-04-06 21:17:37, wrote:
># HG changeset patch
># User Jayashree
># Date 1517283539 28800
>#      Mon Jan 29 19:38:59 2018 -0800
># Node ID 3c6e5ce07dbca7f967e4b5b62fe450979da3bf81
># Parent  624c83571d1df840e1206c46e589044fbf87ff32
>x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
>count_nonzero[16x16]   18.88x ->  23.04x
>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);
>+INIT_ZMM avx512
>+cglobal count_nonzero_16x16, 1,4,2
>+    mov             r1, 0xFFFFFFFFFFFFFFFF
>+    kmovq           k2, r1
Opmask Register K0
The only exception to the opmask rules described above is that opmask k0 cannot
be used as a predicate operand.
Opmask k0 cannot be encoded as a predicate operand for a vector operation; the
encoding value that would select opmask k0 will instead select an implicit
opmask value of 0xFFFFFFFFFFFFFFFF, thereby effectively disabling masking.
Opmask register k0 can still be used for any instruction that takes opmask
register(s) as operand(s) (either source or destination).
So k2 looks unnecessary here: an unmasked compare already behaves like an
all-ones mask.

>+    xor             r3, r3
>+    pxor            m0, m0
>+%assign x 0

>+%rep 4
The block is repeated only 4 times, so the %rep unroll is unnecessary here.
I suggest issuing all the loads up front, so that the memory latency can be
hidden behind the compute instructions.

>+    movu            m1, [r0 + x]

>+    vpacksswb       m1, [r0 + x + 64]
>+%assign x x+128
>+    vpcmpb          k1 {k2}, m1, m0, 00000100b
Could you please declare a macro/constant for this? It is hard for developers
to see that '00000100b' (4) means NE (per Intel's documentation).
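To illustrate the naming idea, here is a rough scalar C model of the imm8
predicate used by vpcmpb. The enum names (CMP_EQ, CMP_NE, ...) and the helper
cmp_pred_i8 are hypothetical, made up for this mail; they are not from x265 or
from Intel's headers. Only the numeric values follow the predicate table in
Intel's documentation.

```c
#include <stdint.h>

/* Hypothetical names for imm8[2:0] of vpcmpb (signed byte compare). */
enum cmp_pred {
    CMP_EQ    = 0, /* a == b            */
    CMP_LT    = 1, /* a <  b            */
    CMP_LE    = 2, /* a <= b            */
    CMP_FALSE = 3, /* always false      */
    CMP_NE    = 4, /* a != b            */
    CMP_NLT   = 5, /* !(a < b),  a >= b */
    CMP_NLE   = 6, /* !(a <= b), a >  b */
    CMP_TRUE  = 7  /* always true       */
};

/* Scalar model of one element of the compare; returns the mask bit (0 or 1). */
static int cmp_pred_i8(int8_t a, int8_t b, enum cmp_pred p)
{
    switch (p) {
    case CMP_EQ:    return a == b;
    case CMP_LT:    return a <  b;
    case CMP_LE:    return a <= b;
    case CMP_FALSE: return 0;
    case CMP_NE:    return a != b;
    case CMP_NLT:   return a >= b;
    case CMP_NLE:   return a >  b;
    case CMP_TRUE:  return 1;
    }
    return 0; /* unreachable for valid predicates */
}
```

With a name like this, the compare against zero reads as "NE" at the call site
instead of a bare bit pattern. In the asm file the equivalent would presumably
be a one-line %define for the value 4.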

>+    kmovq           r1, k1
>+    popcnt          r2, r1
>+    add             r3d, r2d
>+    mov             eax, r3d
>+    RET
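For checking a reworked version of the kernel, a scalar reference may help.
The sketch below is only my model of what the quoted instructions compute
(pack to bytes with signed saturation, compare against zero, count the set
mask bits); it is not code from x265, and the names sat_i16_to_i8 and
count_nonzero_16x16_ref are hypothetical.

```c
#include <stdint.h>

/* Signed saturation of one 16-bit value to 8 bits, as vpacksswb does
 * per element. A non-zero input never becomes zero (256 -> 127, not 0). */
static int8_t sat_i16_to_i8(int16_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Hypothetical scalar reference: number of non-zero coefficients in a
 * 16x16 block (256 int16_t values). */
static int count_nonzero_16x16_ref(const int16_t *quantCoeff)
{
    int count = 0;
    for (int i = 0; i < 16 * 16; i++)
        count += sat_i16_to_i8(quantCoeff[i]) != 0;
    return count;
}
```

vpacksswb interleaves elements per 128-bit lane, but that does not matter for
a count, so the reference simply walks the block in order.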

x265-devel mailing list
