[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2024-04-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

Richard Biener  changed:

   What|Removed |Added

   Assignee|rguenth at gcc dot gnu.org |unassigned at gcc dot gnu.org
 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #11 from Richard Biener  ---
I think "fixed" as far as we can get, esp. w/o considering all possible vector
sizes.

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2024-02-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #10 from Richard Biener  ---
So this is now fixed if you use --param vect-partial-vector-usage=2; there is
at the moment no way to get masking vs. not masking costed against each other.
In theory vect_analyze_loop_costing and vect_estimate_min_profitable_iters
could do both and we could delay vect_determine_partial_vectors_and_peeling.
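
For reference, a minimal sketch of the kind of byte-averaging loop involved
and how to exercise the masked code path; the file name, the flags other than
the --param, and the exact loop body are assumptions, not taken verbatim from
this PR:

/* avg.c - representative byte-averaging loop.  Compile e.g. with
     gcc -O3 -march=znver4 --param vect-partial-vector-usage=2 -S avg.c
   to get a fully masked AVX512 loop (the -march choice is illustrative).  */
void
avg_loop (unsigned char *__restrict dst, const unsigned char *__restrict src1,
          const unsigned char *__restrict src2, int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = (src1[i] + src2[i] + 1) >> 1;   /* maps to vpavgb */
}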

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-06-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

Richard Biener  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
 Status|NEW |ASSIGNED

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-06-13 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #9 from rguenther at suse dot de  ---
On Tue, 13 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410
> 
> --- Comment #8 from Hongtao.liu  ---
> 
> > Can x86 do this?  We'd want to apply this to a scalar, so move ivtmp
> > to xmm, apply pack_usat or as you say below, the non-existing us_trunc
> > and then broadcast.
> 
> I see, we don't have a scalar version.  Also the vector instruction doesn't
> look very fast.
> 
> https://uops.info/html-instr/VPMOVSDB_XMM_XMM.html

Uh, yeah.  Well, Zen4 looks reasonable though latency could be better.

Preliminary performance data also shows masked epilogues are a
mixed bag.  I'll finish off the implementation and then we'll see
if we can selectively enable it for the profitable cases somehow.

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-06-12 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #8 from Hongtao.liu  ---

> Can x86 do this?  We'd want to apply this to a scalar, so move ivtmp
> to xmm, apply pack_usat or as you say below, the non-existing us_trunc
> and then broadcast.

I see, we don't have a scalar version.  Also the vector instruction doesn't
look very fast.

https://uops.info/html-instr/VPMOVSDB_XMM_XMM.html

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-06-12 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #7 from rguenther at suse dot de  ---
On Mon, 12 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410
> 
> --- Comment #6 from Hongtao.liu  ---
> 
> > and the key thing to optimize is
> > 
> >   ivtmp_78 = ivtmp_77 + 4294967232; // -64
> >   _79 = MIN_EXPR <ivtmp_78, 255>;
> >   _80 = (unsigned char) _79;
> >   _81 = {_80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> > _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> > _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> > _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> > _80, _80, _80, _80, _80, _80};
> > 
> > that is we want to broadcast a saturated (to vector element precision) 
> > value.
> 
> Yes, the backend needs to support vec_pack_ssat_m / vec_pack_usat_m.

Can x86 do this?  We'd want to apply this to a scalar, so move ivtmp
to xmm, apply pack_usat or as you say below, the non-existing us_trunc
and then broadcast.
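
For illustration, a hedged intrinsics sketch of the "move the scalar IV to
xmm, saturate-narrow, broadcast" route using the vector VPMOVUSDB (the
unsigned-saturating truncation); the helper name is made up and this is not
code from the PR:

#include <immintrin.h>

/* Sketch only: saturate a scalar unsigned IV to 8 bits via vpmovusdb and
   broadcast it into a zmm register (needs AVX512F/VL/BW).  */
static inline __m512i
broadcast_usat8 (unsigned int iv)
{
  __m128i t = _mm_cvtsi32_si128 ((int) iv);  /* move the scalar into xmm  */
  __m128i b = _mm_cvtusepi32_epi8 (t);       /* vpmovusdb: u32 -> u8 with
                                                unsigned saturation       */
  return _mm512_broadcastb_epi8 (b);         /* vpbroadcastb              */
}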

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-06-11 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #6 from Hongtao.liu  ---

> and the key thing to optimize is
> 
>   ivtmp_78 = ivtmp_77 + 4294967232; // -64
>   _79 = MIN_EXPR <ivtmp_78, 255>;
>   _80 = (unsigned char) _79;
>   _81 = {_80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
> _80, _80, _80, _80, _80, _80};
> 
> that is we want to broadcast a saturated (to vector element precision) value.

Yes, the backend needs to support vec_pack_ssat_m / vec_pack_usat_m.
But I didn't find an optab for ss_truncate or us_truncate, which might be used
by the BB vectorizer.
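
As a point of reference, a minimal C sketch of what a scalar us_truncate
(unsigned saturating truncation, SImode to QImode) would compute; the function
name is illustrative only:

/* Unsigned saturating truncation, 32 -> 8 bits.  */
static inline unsigned char
us_trunc_u32_to_u8 (unsigned int x)
{
  return x > 255 ? 255 : (unsigned char) x;
}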

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-06-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #5 from Richard Biener  ---
Btw, in the case where we can use the same mask compare type as the type of
the IV (so we know we can represent all required values), we can elide the
saturation.  So for example

void foo (double * __restrict a, double *b, double *c, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}

can produce

        testl   %ecx, %ecx
        jle     .L5
        vmovdqa .LC0(%rip), %ymm3
        vpbroadcastd    %ecx, %ymm2
        xorl    %eax, %eax
        subl    $8, %ecx
        vpcmpud $6, %ymm3, %ymm2, %k1
        .p2align 4
        .p2align 3
.L3:
        vmovupd (%rsi,%rax), %zmm1{%k1}
        vmovupd (%rdx,%rax), %zmm0{%k1}
        movl    %ecx, %r8d
        vaddpd  %zmm1, %zmm0, %zmm2{%k1}{z}
        addl    $8, %r8d
        vmovupd %zmm2, (%rdi,%rax){%k1}
        vpbroadcastd    %ecx, %ymm2
        addq    $64, %rax
        subl    $8, %ecx
        vpcmpud $6, %ymm3, %ymm2, %k1
        cmpl    $8, %r8d
        ja      .L3
        vzeroupper
.L5:
        ret

That should work as long as the data size is larger than or matches the IV
size, which is hopefully the case for all FP testcases.  The trick is going to
be to make this visible to costing - I'm not sure we get to decide whether
to use masking or not when we do not want to decide between vector sizes
(the x86 backend picks the first successful one).  For SVE it's either
masking (with SVE modes) or not masking (with NEON modes), so it's
decided based on mode rather than as an additional knob.

Performance-wise the above is likely still slower than an unmasked loop
plus a masked epilog, but it would actually save on code size for -Os
or -O2.  Of course for code size we might want to stick to SSE/AVX
for the smaller encoding.

Note we have to watch out for all-zero masks on masked stores since
those are very slow (for a reason unknown to me); when we have a stmt
split into multiple vector stmts it's not uncommon (especially for the epilog)
to have one of them end up with an all-zero mask.  For the loop case and
.MASK_STORE we emit branchy code for this, but we might want to avoid
the situation by costing (and not use a masked loop/epilog in that
case).
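
To illustrate the branchy handling of a possibly all-zero .MASK_STORE mask
mentioned above, a conceptual intrinsics-level sketch (this is not the IL or
the exact code GCC emits; the helper name is made up):

#include <immintrin.h>

/* Skip the masked store entirely when the mask is all-zero, since an
   all-zero-mask store is very slow.  */
static inline void
masked_store_guarded (unsigned char *dst, __mmask64 k, __m512i data)
{
  if (k != 0)                                /* branch around the store */
    _mm512_mask_storeu_epi8 (dst, k, data);
}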

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-06-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #4 from Richard Biener  ---
Adding fully masked AVX512 and AVX512 with a masked epilog data:

 size  scalar     128     256     512    512e    512f
    1    9.42   11.32    9.35   11.17   15.13   16.89
    2    5.72    6.53    6.66    6.66    7.62    8.56
    3    4.49    5.10    5.10    5.74    5.08    5.73
    4    4.10    4.33    4.29    5.21    3.79    4.25
    6    3.78    3.85    3.86    4.76    2.54    2.85
    8    3.64    1.89    3.76    4.50    1.92    2.16
   12    3.56    2.21    3.75    4.26    1.26    1.42
   16    3.36    0.83    1.06    4.16    0.95    1.07
   20    3.39    1.42    1.33    4.07    0.75    0.85
   24    3.23    0.66    1.72    4.22    0.62    0.70
   28    3.18    1.09    2.04    4.20    0.54    0.61
   32    3.16    0.47    0.41    0.41    0.47    0.53
   34    3.16    0.67    0.61    0.56    0.44    0.50
   38    3.19    0.95    0.95    0.82    0.40    0.45
   42    3.09    0.58    1.21    1.13    0.36    0.40

text sizes are not much different:

scalar     128     256     512    512e    512f
  1389    1837    2125    1629    1721    1689

The AVX2 size is large because we completely peel the scalar epilogue,
and the same holds for the SSE case.  The scalar epilogue of the 512 loop
iterates 32 times (too many for peeling); the masked loop/epilogue are quite
large due to the EVEX-encoded instructions, so the savings from eliding the
scalar/vector epilogues do not show.

The AVX512 masked epilogue case now looks like:

        .p2align 3
.L5:
        vmovdqu8        (%r8,%rax), %zmm0
        vpavgb  (%rsi,%rax), %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rdi,%rax)
        addq    $64, %rax
        cmpq    %rcx, %rax
        jne     .L5
        movl    %edx, %ecx
        andl    $-64, %ecx
        testb   $63, %dl
        je      .L19
.L4:
        movl    %ecx, %eax
        subl    %ecx, %edx
        movl    $255, %ecx
        cmpl    %ecx, %edx
        cmova   %ecx, %edx
        vpbroadcastb    %edx, %zmm0
        vpcmpub $6, .LC0(%rip), %zmm0, %k1
        vmovdqu8        (%rsi,%rax), %zmm0{%k1}{z}
        vmovdqu8        (%r8,%rax), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rdi,%rax){%k1}
.L19:
        vzeroupper
        ret

where there's a missed optimization around the saturation to 255.

The fully masked AVX512 loop is

        vmovdqa64       .LC0(%rip), %zmm3
        movl    $255, %eax
        cmpl    %eax, %ecx
        cmovbe  %ecx, %eax
        vpbroadcastb    %eax, %zmm0
        vpcmpub $6, %zmm3, %zmm0, %k1
        .p2align 4
        .p2align 3
.L4:
        vmovdqu8        (%rsi,%rax), %zmm1{%k1}
        vmovdqu8        (%r8,%rax), %zmm2{%k1}
        movl    %r10d, %edx
        movl    $255, %ecx
        subl    %eax, %edx
        cmpl    %ecx, %edx
        cmova   %ecx, %edx
        vpavgb  %zmm2, %zmm1, %zmm0
        vmovdqu8        %zmm0, (%rdi,%rax){%k1}
        vpbroadcastb    %edx, %zmm0
        addq    $64, %rax
        movl    %r9d, %edx
        subl    %eax, %edx
        vpcmpub $6, %zmm3, %zmm0, %k1
        cmpl    $64, %edx
        ja      .L4
        vzeroupper
        ret

which is a much larger loop body due to the mask creation.  At least
that interleaves nicely (dependence-wise) with the loop control and the
vectorized stmts.  What needs to be optimized somehow is what IVOPTs
makes out of combining the decreasing remaining-scalar-iterations IV with
the IV required for the memory accesses.  Without IVOPTs the body looks
like

.L4:
        vmovdqu8        (%rsi), %zmm1{%k1}
        vmovdqu8        (%rdx), %zmm2{%k1}
        movl    $255, %eax
        movl    %ecx, %r8d
        subl    $64, %ecx
        addq    $64, %rsi
        addq    $64, %rdx
        vpavgb  %zmm2, %zmm1, %zmm0
        vmovdqu8        %zmm0, (%rdi){%k1}
        addq    $64, %rdi
        cmpl    %eax, %ecx
        cmovbe  %ecx, %eax
        vpbroadcastb    %eax, %zmm0
        vpcmpub $6, %zmm3, %zmm0, %k1
        cmpl    $64, %r8d
        ja      .L4

and the key thing to optimize is

  ivtmp_78 = ivtmp_77 + 4294967232; // -64
  _79 = MIN_EXPR <ivtmp_78, 255>;
  _80 = (unsigned char) _79;
  _81 = {_80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
_80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
_80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
_80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80, _80,
_80, _80};

that is we want to broadcast a saturated (to vector element precision) value.
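
As an aside, a hedged sketch of one conceivable alternative: derive the mask
directly from the remaining iteration count instead of saturating,
broadcasting and comparing (this is not what GCC generates here; the helper
name is made up):

#include <immintrin.h>

/* Mask covering the first 'remaining' of the 64 byte lanes.  */
static inline __mmask64
tail_mask_u8 (unsigned long long remaining)
{
  unsigned long long bits
    = remaining >= 64 ? ~0ULL : (1ULL << remaining) - 1;
  return _cvtu64_mask64 (bits);              /* kmovq */
}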

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-01-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #3 from Richard Biener  ---
The naive "bad" code-gen produces

size  512-masked
  2 12.19
  4 6.09
  6 4.06
  8 3.04
 12 2.03
 14 1.52
 16 1.21
 20 1.01
 24 0.87
 32 0.76
 34 0.71
 38 0.64
 42 0.58

on alberti (you seem to have used the same machine).  So the AVX512 "stupid"
code-gen is faster for 6+ elements, and I guess optimizing it should then
outperform scalar also for 4 elements.  The exact matches for 8 elements on
128-bit and 16 on 256-bit are hard to beat of course, likewise the single- or
two-iteration case.

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-01-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

--- Comment #2 from Richard Biener  ---
The naive masked epilogue (--param vect-partial-vector-usage=1 and support
for whilesiult as in a prototype I have) then looks like

        leal    -1(%rdx), %eax
        cmpl    $62, %eax
        jbe     .L11

.L11:
        xorl    %ecx, %ecx
        jmp     .L4

.L4:
        movl    %ecx, %eax
        subl    %ecx, %edx
        addq    %rax, %rsi
        addq    %rax, %rdi
        addq    %r8, %rax
        cmpl    $64, %edx
        jl      .L8
        kxorq   %k1, %k1, %k1
        kxnorq  %k1, %k1, %k1
.L7:
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}
.L21:
        vzeroupper
        ret

.L8:
        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        jmp     .L7

RTL isn't good at jump threading the mess caused by my ad-hoc whileult
RTL expansion - representing this at a higher level is probably the way
to go.  What you basically should get for the epilogue (also used
when the main vectorized loop isn't entered) is:

        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}

that is, a compare of a vector of { niter, niter, ... } with { 0, 1, 2, 3, ... }
producing the mask (which has a latency of 3 according to Agner) and then
simply the vectorized code, masked.  You could probably hand-code that in
assembly if you'd be interested in the (optimal) performance outcome.
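
For anyone who wants to try that experiment, a hedged C-with-intrinsics sketch
of such a masked epilogue for the averaging loop (the function name, the
assumption that the residual count n is below 64, and the use of intrinsics
rather than hand-written assembly are all mine):

#include <immintrin.h>
#include <stdint.h>

/* Build the mask by comparing { 0, 1, 2, ... } against the broadcast
   residual count, then run the averaging loop body once under that mask.
   Assumes 0 <= n < 64.  */
static void
avg_tail_masked (uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
  uint8_t idx[64];
  for (int i = 0; i < 64; ++i)
    idx[i] = (uint8_t) i;
  __m512i iota = _mm512_loadu_si512 (idx);
  __m512i bound = _mm512_set1_epi8 ((char) n);
  __mmask64 k = _mm512_cmplt_epu8_mask (iota, bound);   /* lane < n */
  __m512i va = _mm512_maskz_loadu_epi8 (k, a);
  __m512i vb = _mm512_maskz_loadu_epi8 (k, b);
  __m512i avg = _mm512_avg_epu8 (va, vb);                /* vpavgb */
  _mm512_mask_storeu_epi8 (dst, k, avg);
}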

For now we probably want to have the main loop traditionally vectorized
without masking because Intel has poor mask support and AMD has bad
latency on the mask-producing compares.  But having a masked vectorized
epilog avoids the need for a scalar epilog, saving code size, and
avoids the need to vectorize that multiple times (or choosing SSE vectors
here).  For Zen4 the above will of course utilize two 512-bit op halves
even when one is fully masked (at least I suppose that this is the case).

[Bug middle-end/108410] x264 averaging loop not optimized well for avx512

2023-01-16 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

Richard Biener  changed:

   What|Removed |Added

 Blocks||53947
   Last reconfirmed||2023-01-16
 Target||x86_64-*-*
   Keywords||missed-optimization
 CC||rguenth at gcc dot gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #1 from Richard Biener  ---
One issue is that we perform at most one epilogue loop vectorization, so with
AVX512 we vectorize the epilogue with AVX2 but its epilogue remains
unvectorized.  With AVX512 we'd want to use a fully masked AVX512 epilogue
instead.

I started working on fully masked vectorization support for AVX512 but
got distracted.

Another option would be to use SSE vectorization for the epilogue
(note for SSE we vectorize the epilogue with 64-bit half-SSE vectors!),
which would mean giving the target (some) control over the mode used
for vectorizing the epilogue.  That is, in vect_analyze_loop change

  /* For epilogues start the analysis from the first mode.  The motivation
 behind starting from the beginning comes from cases where the VECTOR_MODES
 array may contain length-agnostic and length-specific modes.  Their
 ordering is not guaranteed, so we could end up picking a mode for the main
 loop that is after the epilogue's optimal mode.  */
  vector_modes[0] = autodetected_vector_mode;

to go through a target hook (possibly first produce a "candidate mode" set
and allow the target to prune that).  This might be an "easy" fix for the
AVX512 issue for low-trip loops.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations