User Jayashree
Date 1517283539 28800
Mon Jan 29 19:38:59 2018 -0800
x86: AVX512 'count_nonzero_16x16' avx-512 kernel, 22% speedup over avx2
>count_nonzero[16x16]   18.88x ->  23.04x
>+; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff);
>+INIT_ZMM avx512
>+cglobal count_nonzero_16x16, 1,4,2
>+    mov             r1, 0xFFFFFFFFFFFFFFFF
>+    kmovq           k2, r1

Opmask Register K0:
The only exception to the opmask rules described above is that opmask k0 cannot
be used as a predicate operand. Opmask k0 cannot be encoded as a predicate
operand for a vector operation; the encoding value that would select opmask k0
will instead select an implicit opmask value of 0xFFFFFFFFFFFFFFFF, thereby
effectively disabling masking. Opmask register k0 can still be used for any
instruction that takes opmask register(s) as operand(s) (either source or
destination).

In other words, loading all-ones into k2 should not be needed here; the compare
can be issued without a predicate.
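
To make the point concrete, here is a minimal scalar sketch in C of a 64-lane byte "not equal" compare, with and without a predicate. It is only a model of the semantics, not the kernel itself; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 64-lane byte "not equal" compare with no predicate,
   i.e. the form whose predicate field encodes opmask k0. */
static uint64_t cmp_ne_bytes(const int8_t a[64], const int8_t b[64])
{
    uint64_t k = 0;
    for (int i = 0; i < 64; i++)
        if (a[i] != b[i])
            k |= (uint64_t)1 << i;
    return k;
}

/* Same compare under a predicate: a result bit is set only when the
   matching mask bit is set and the two lanes differ. */
static uint64_t cmp_ne_bytes_masked(uint64_t mask,
                                    const int8_t a[64], const int8_t b[64])
{
    return cmp_ne_bytes(a, b) & mask;
}

/* Hypothetical self-check: an all-ones predicate changes nothing. */
static void check_all_ones_mask(const int8_t a[64], const int8_t b[64])
{
    assert(cmp_ne_bytes_masked(UINT64_MAX, a, b) == cmp_ne_bytes(a, b));
}
```

Under this model the all-ones predicate and the unpredicated form always agree, which is why the mov/kmovq pair looks redundant.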

>+    xor             r3, r3
>+    pxor            m0, m0
>+%assign x 0

>+%rep 4
This only unrolls 4 times, so the %rep is unnecessary here.
I suggest issuing all of the loads up front, so that the memory latency can be
hidden behind the compute instructions.

>+    movu            m1, [r0 + x]

>+    vpacksswb       m1, [r0 + x + 64]
>+%assign x x+128
>+    vpcmpb          k1 {k2}, m1, m0, 00000100b
Could you please declare a new macro/constant for this? It is hard for
developers to tell that '00000100b' (4) means NE (per Intel's documentation).
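
As a rough sketch of what I mean, written in C for readability: the VPCMP_* names below are hypothetical, and the values are the 3-bit comparison predicate encodings from Intel's documentation. The same names could be defined as constants in the assembly source.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdint.h>

/* Hypothetical names for the 3-bit immediate of the VPCMP* instructions. */
enum vpcmp_pred {
    VPCMP_EQ    = 0,  /* equal                          */
    VPCMP_LT    = 1,  /* less than                      */
    VPCMP_LE    = 2,  /* less than or equal             */
    VPCMP_FALSE = 3,  /* always false                   */
    VPCMP_NE    = 4,  /* not equal (the 00000100b case) */
    VPCMP_NLT   = 5,  /* not less than                  */
    VPCMP_NLE   = 6,  /* not less than or equal         */
    VPCMP_TRUE  = 7   /* always true                    */
};

/* Scalar model of one signed-byte lane of the compare. */
static int vpcmp_lane(int8_t a, int8_t b, enum vpcmp_pred p)
{
    switch (p) {
    case VPCMP_EQ:    return a == b;
    case VPCMP_LT:    return a <  b;
    case VPCMP_LE:    return a <= b;
    case VPCMP_FALSE: return 0;
    case VPCMP_NE:    return a != b;
    case VPCMP_NLT:   return a >= b;
    case VPCMP_NLE:   return a >  b;
    case VPCMP_TRUE:  return 1;
    }
    return 0;
}
```

With names like these, the compare against zero reads as "NE" at the call site, and a check such as assert(vpcmp_lane(1, 0, VPCMP_NE) == 1) documents the intent.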

>+    kmovq           r1, k1
>+    popcnt          r2, r1
>+    add             r3d, r2d
>+    mov             eax, r3d
>+    RET
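
For checking the kernel, a scalar model may help. The sketch below is an assumption about the intended behaviour based on the prototype in the patch comment: a plain reference count, plus a model of the kernel's data flow (four steps of 64 coefficients, narrowed with signed saturation, compared against zero, mask bits counted). The function names are illustrative only, and the model ignores the lane order produced by the pack, which does not affect the count.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdint.h>

/* Plain reference: count the nonzero coefficients in a 16x16 block. */
static int count_nonzero_16x16_ref(const int16_t *quantCoeff)
{
    int count = 0;
    for (int i = 0; i < 16 * 16; i++)
        count += quantCoeff[i] != 0;
    return count;
}

/* Signed saturation from 16 to 8 bits, one lane of a saturating pack.
   A nonzero input always stays nonzero, unlike plain truncation. */
static int8_t sat16_to_8(int16_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Model of the kernel's data flow: 4 steps, 64 coefficients per step. */
static int count_nonzero_16x16_model(const int16_t *quantCoeff)
{
    int total = 0;
    for (int step = 0; step < 4; step++) {
        uint64_t mask = 0;
        for (int lane = 0; lane < 64; lane++)
            if (sat16_to_8(quantCoeff[step * 64 + lane]) != 0)
                mask |= (uint64_t)1 << lane;
        /* Count the set bits by clearing the lowest one each pass. */
        for (; mask; mask &= mask - 1)
            total++;
    }
    return total;
}
```

A coefficient such as 256 is a useful test value: truncating it to 8 bits would give 0, while saturation gives 127, so both functions above count it.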

