[Bug middle-end/108410] x264 averaging loop not optimized well for avx512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410

Richard Biener changed:

           What    |Removed                    |Added
   ----------------------------------------------------------------------
   Assignee        |rguenth at gcc dot gnu.org |unassigned at gcc dot gnu.org
   Resolution      |---                        |FIXED
   Status          |ASSIGNED                   |RESOLVED

--- Comment #11 from Richard Biener ---
I think "fixed" as far as we can get, esp. w/o considering all possible
vector sizes.
--- Comment #10 from Richard Biener ---
So this is now fixed if you use --param vect-partial-vector-usage=2; there is
at the moment no way to get masking and not masking costed against each other.
In theory vect_analyze_loop_costing and vect_estimate_min_profitable_iters
could do both, and we could delay vect_determine_partial_vectors_and_peeling.
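For reference, a minimal way to exercise the masked code generation named in
this comment (the source file name is hypothetical; the --param is the one
quoted above):

        gcc -O3 -march=znver4 --param vect-partial-vector-usage=2 -S avg.c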
Richard Biener changed:

           What    |Removed                       |Added
   ----------------------------------------------------------------------
   Assignee        |unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
   Status          |NEW                           |ASSIGNED
--- Comment #9 from rguenther at suse dot de ---
On Tue, 13 Jun 2023, crazylht at gmail dot com wrote:

> --- Comment #8 from Hongtao.liu ---
> > Can x86 do this?  We'd want to apply this to a scalar, so move ivtmp
> > to xmm, apply pack_usat or, as you say below, the non-existing us_trunc,
> > and then broadcast.
>
> I see, we don't have a scalar version.  Also the vector instruction looks
> not very fast.
>
> https://uops.info/html-instr/VPMOVSDB_XMM_XMM.html

Uh, yeah.  Well, Zen4 looks reasonable, though latency could be better.

Preliminary performance data also shows masked epilogues are a mixed bag.
I'll finish off the implementation and then we'll see if we can selectively
enable it for the profitable cases somehow.
--- Comment #8 from Hongtao.liu ---
> Can x86 do this?  We'd want to apply this to a scalar, so move ivtmp
> to xmm, apply pack_usat or, as you say below, the non-existing us_trunc,
> and then broadcast.

I see, we don't have a scalar version.  Also the vector instruction looks
not very fast:

https://uops.info/html-instr/VPMOVSDB_XMM_XMM.html
--- Comment #7 from rguenther at suse dot de ---
On Mon, 12 Jun 2023, crazylht at gmail dot com wrote:

> --- Comment #6 from Hongtao.liu ---
> > and the key thing to optimize is
> >
> >   ivtmp_78 = ivtmp_77 + 4294967232; // -64
> >   _79 = MIN_EXPR <ivtmp_78, 255>;
> >   _80 = (unsigned char) _79;
> >   _81 = {_80, _80, _80, ...};  // broadcast to all 64 lanes
> >
> > that is, we want to broadcast a saturated (to vector element precision)
> > value.
>
> Yes, the backend needs to support vec_pack_ssat_m, vec_pack_usat_m.

Can x86 do this?  We'd want to apply this to a scalar, so move ivtmp
to xmm, apply pack_usat or, as you say below, the non-existing us_trunc,
and then broadcast.
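The sequence proposed here could look roughly like this with intrinsics (a
sketch only: VPMOVUSDB is the unsigned-saturating sibling of the VPMOVSDB
instruction linked above, and the helper name is invented):

        #include <immintrin.h>

        /* Sketch of the move-to-xmm, saturate-truncate, broadcast idea:
           _mm_cvtusepi32_epi8 (VPMOVUSDB) does the unsigned saturating
           32->8 bit truncation per lane.  */
        static __m512i broadcast_sat_u8 (unsigned int ivtmp)
        {
          __m128i v = _mm_cvtsi32_si128 ((int) ivtmp);  /* scalar -> xmm */
          __m128i s = _mm_cvtusepi32_epi8 (v);          /* vpmovusdb */
          return _mm512_broadcastb_epi8 (s);            /* vpbroadcastb */
        }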
--- Comment #6 from Hongtao.liu ---
> and the key thing to optimize is
>
>   ivtmp_78 = ivtmp_77 + 4294967232; // -64
>   _79 = MIN_EXPR <ivtmp_78, 255>;
>   _80 = (unsigned char) _79;
>   _81 = {_80, _80, _80, ...};  // broadcast to all 64 lanes
>
> that is, we want to broadcast a saturated (to vector element precision)
> value.

Yes, the backend needs to support vec_pack_ssat_m, vec_pack_usat_m.  But I
didn't find an optab for ss_truncate or us_truncate, which might be used by
the BB vectorizer.
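To make the missing operation concrete, this is what a us_truncate from 32
to 8 bits computes per element, in C (a sketch; the function name is made
up, it simply mirrors the MIN_EXPR plus truncation in the GIMPLE above):

        /* Unsigned saturating truncation, 32 -> 8 bits: clamp to the
           maximum of the narrow type, then truncate.  */
        static unsigned char us_trunc_u32_u8 (unsigned int x)
        {
          return (unsigned char) (x > 255 ? 255 : x);
        }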
--- Comment #5 from Richard Biener ---
Btw, for the case where we can use the same mask compare type as the type of
the IV (so we know we can represent all required values), we can elide the
saturation.  So for example

void foo (double * __restrict a, double *b, double *c, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + c[i];
}

can produce

        testl   %ecx, %ecx
        jle     .L5
        vmovdqa .LC0(%rip), %ymm3
        vpbroadcastd    %ecx, %ymm2
        xorl    %eax, %eax
        subl    $8, %ecx
        vpcmpud $6, %ymm3, %ymm2, %k1
        .p2align 4
        .p2align 3
.L3:
        vmovupd (%rsi,%rax), %zmm1{%k1}
        vmovupd (%rdx,%rax), %zmm0{%k1}
        movl    %ecx, %r8d
        vaddpd  %zmm1, %zmm0, %zmm2{%k1}{z}
        addl    $8, %r8d
        vmovupd %zmm2, (%rdi,%rax){%k1}
        vpbroadcastd    %ecx, %ymm2
        addq    $64, %rax
        subl    $8, %ecx
        vpcmpud $6, %ymm3, %ymm2, %k1
        cmpl    $8, %r8d
        ja      .L3
        vzeroupper
.L5:
        ret

That should work as long as the data size is larger than or matches the IV
size, which is hopefully the case for all FP testcases.

The trick is going to be to make this visible to costing - I'm not sure we
get to decide whether to use masking or not when we do not want to decide
between vector sizes (the x86 backend picks the first successful one).  For
SVE it's either masking (with SVE modes) or not masking (with NEON modes),
so it's decided based on mode rather than as an additional knob.

Performance-wise the above is likely still slower than not using masking
plus a masked epilog, but it would actually save on code-size for -Os or
-O2.  Of course for code-size we might want to stick to SSE/AVX for the
smaller encoding.

Note we have to watch out for all-zero masks for masked stores since those
are very slow (for a reason unknown to me); when a stmt is split to multiple
vector stmts it's not uncommon (esp. for the epilog) to have one of them
with an all-zero bit mask.  For the loop case and .MASK_STORE we emit
branchy code for this, but we might want to avoid the situation by costing
(and not using a masked loop/epilog in that case).
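For illustration, a rough intrinsics rendering of the loop shown above (a
sketch under assumptions: the function name is invented, and unlike the
compiler output it rebuilds the active-lane count from i instead of keeping
a separate decreasing counter):

        #include <immintrin.h>

        /* Fully masked AVX512 loop for foo: a lane is active iff its
           index is below the remaining iteration count, mirroring the
           vpcmpud mask creation in the asm above.  */
        void foo_masked (double *restrict a, const double *b,
                         const double *c, int n)
        {
          const __m256i iota = _mm256_setr_epi32 (0, 1, 2, 3, 4, 5, 6, 7);
          for (int i = 0; i < n; i += 8)
            {
              __m256i rem = _mm256_set1_epi32 (n - i);
              __mmask8 k = _mm256_cmplt_epu32_mask (iota, rem);
              __m512d vb = _mm512_maskz_loadu_pd (k, b + i);
              __m512d vc = _mm512_maskz_loadu_pd (k, c + i);
              _mm512_mask_storeu_pd (a + i, k, _mm512_add_pd (vb, vc));
            }
        }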
--- Comment #4 from Richard Biener ---
Adding fully masked AVX512 and AVX512 with a masked epilog data:

  size  scalar    128    256    512   512e   512f
     1    9.42  11.32   9.35  11.17  15.13  16.89
     2    5.72   6.53   6.66   6.66   7.62   8.56
     3    4.49   5.10   5.10   5.74   5.08   5.73
     4    4.10   4.33   4.29   5.21   3.79   4.25
     6    3.78   3.85   3.86   4.76   2.54   2.85
     8    3.64   1.89   3.76   4.50   1.92   2.16
    12    3.56   2.21   3.75   4.26   1.26   1.42
    16    3.36   0.83   1.06   4.16   0.95   1.07
    20    3.39   1.42   1.33   4.07   0.75   0.85
    24    3.23   0.66   1.72   4.22   0.62   0.70
    28    3.18   1.09   2.04   4.20   0.54   0.61
    32    3.16   0.47   0.41   0.41   0.47   0.53
    34    3.16   0.67   0.61   0.56   0.44   0.50
    38    3.19   0.95   0.95   0.82   0.40   0.45
    42    3.09   0.58   1.21   1.13   0.36   0.40

text sizes are not much different:

  1389  1837  2125  1629  1721  1689

The AVX2 size is large because we completely peel the scalar epilogue, same
for the SSE case.  The scalar epilogue of the 512 loop iterates 32 times
(too many for peeling); the masked loop/epilogue are quite large due to the
EVEX encoded instructions, so the saved scalar/vector epilogues do not show.

The AVX512 masked epilogue case now looks like:

        .p2align 3
.L5:
        vmovdqu8        (%r8,%rax), %zmm0
        vpavgb  (%rsi,%rax), %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rdi,%rax)
        addq    $64, %rax
        cmpq    %rcx, %rax
        jne     .L5
        movl    %edx, %ecx
        andl    $-64, %ecx
        testb   $63, %dl
        je      .L19
.L4:
        movl    %ecx, %eax
        subl    %ecx, %edx
        movl    $255, %ecx
        cmpl    %ecx, %edx
        cmova   %ecx, %edx
        vpbroadcastb    %edx, %zmm0
        vpcmpub $6, .LC0(%rip), %zmm0, %k1
        vmovdqu8        (%rsi,%rax), %zmm0{%k1}{z}
        vmovdqu8        (%r8,%rax), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rdi,%rax){%k1}
.L19:
        vzeroupper
        ret

where there's a missed optimization around the saturation to 255.  The
fully masked AVX512 loop is

        vmovdqa64       .LC0(%rip), %zmm3
        movl    $255, %eax
        cmpl    %eax, %ecx
        cmovbe  %ecx, %eax
        vpbroadcastb    %eax, %zmm0
        vpcmpub $6, %zmm3, %zmm0, %k1
        .p2align 4
        .p2align 3
.L4:
        vmovdqu8        (%rsi,%rax), %zmm1{%k1}
        vmovdqu8        (%r8,%rax), %zmm2{%k1}
        movl    %r10d, %edx
        movl    $255, %ecx
        subl    %eax, %edx
        cmpl    %ecx, %edx
        cmova   %ecx, %edx
        vpavgb  %zmm2, %zmm1, %zmm0
        vmovdqu8        %zmm0, (%rdi,%rax){%k1}
        vpbroadcastb    %edx, %zmm0
        addq    $64, %rax
        movl    %r9d, %edx
        subl    %eax, %edx
        vpcmpub $6, %zmm3, %zmm0, %k1
        cmpl    $64, %edx
        ja      .L4
        vzeroupper
        ret

which is a much larger loop body due to the mask creation.  At least that
interleaves nicely (dependence wise) with the loop control and vectorized
stmts.  What needs to be optimized somehow is what IVOPTs makes out of the
decreasing remaining scalar iters IV combined with the IV required for the
memory accesses.  Without IVOPTs the body looks like

.L4:
        vmovdqu8        (%rsi), %zmm1{%k1}
        vmovdqu8        (%rdx), %zmm2{%k1}
        movl    $255, %eax
        movl    %ecx, %r8d
        subl    $64, %ecx
        addq    $64, %rsi
        addq    $64, %rdx
        vpavgb  %zmm2, %zmm1, %zmm0
        vmovdqu8        %zmm0, (%rdi){%k1}
        addq    $64, %rdi
        cmpl    %eax, %ecx
        cmovbe  %ecx, %eax
        vpbroadcastb    %eax, %zmm0
        vpcmpub $6, %zmm3, %zmm0, %k1
        cmpl    $64, %r8d
        ja      .L4

and the key thing to optimize is

  ivtmp_78 = ivtmp_77 + 4294967232; // -64
  _79 = MIN_EXPR <ivtmp_78, 255>;
  _80 = (unsigned char) _79;
  _81 = {_80, _80, _80, _80, _80, _80, _80, _80,
         _80, _80, _80, _80, _80, _80, _80, _80,
         _80, _80, _80, _80, _80, _80, _80, _80,
         _80, _80, _80, _80, _80, _80, _80, _80,
         _80, _80, _80, _80, _80, _80, _80, _80,
         _80, _80, _80, _80, _80, _80, _80, _80,
         _80, _80, _80, _80, _80, _80, _80, _80,
         _80, _80, _80, _80, _80, _80, _80, _80};

that is, we want to broadcast a saturated (to vector element precision)
value.
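A C rendering of that per-iteration mask recomputation may make the cost
clearer (a sketch; the helper name and the in/out IV parameter are invented,
iota stands for the {0,...,63} constant in .LC0):

        /* What the fully masked loop redoes every iteration: step the
           remaining-iterations IV down by 64, saturate it to the element
           precision, broadcast, and compare against {0,...,63}.  */
        static __mmask64 next_mask (unsigned int *ivtmp, __m512i iota)
        {
          *ivtmp -= 64;                                  /* ivtmp_78 */
          unsigned int m = *ivtmp < 255 ? *ivtmp : 255;  /* MIN_EXPR */
          __m512i b = _mm512_set1_epi8 ((char) m);       /* _80, _81 */
          return _mm512_cmpgt_epu8_mask (b, iota);       /* vpcmpub $6 */
        }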
--- Comment #3 from Richard Biener ---
The naive "bad" code-gen produces

  size  512-masked
     2       12.19
     4        6.09
     6        4.06
     8        3.04
    12        2.03
    14        1.52
    16        1.21
    20        1.01
    24        0.87
    32        0.76
    34        0.71
    38        0.64
    42        0.58

on alberti (you seem to have used the same machine).  So the AVX512 "stupid"
code-gen is faster for 6+ elements, and I guess optimizing it should then
outperform scalar also for 4 elements.  The exact matches for 8 on 128 and
16 on 256 are hard to beat of course, likewise the single or two iteration
case.
--- Comment #2 from Richard Biener ---
The naive masked epilogue (--param vect-partial-vector-usage=1 and support
for whilesiult as in a prototype I have) then looks like

        leal    -1(%rdx), %eax
        cmpl    $62, %eax
        jbe     .L11
.L11:
        xorl    %ecx, %ecx
        jmp     .L4
.L4:
        movl    %ecx, %eax
        subl    %ecx, %edx
        addq    %rax, %rsi
        addq    %rax, %rdi
        addq    %r8, %rax
        cmpl    $64, %edx
        jl      .L8
        kxorq   %k1, %k1, %k1
        kxnorq  %k1, %k1, %k1
.L7:
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}
.L21:
        vzeroupper
        ret
.L8:
        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        jmp     .L7

RTL isn't good at jump threading the mess caused by my ad-hoc whileult RTL
expansion - representing this at a higher level is probably the way to go.
What you should basically get for the epilogue (also used when the main
vectorized loop isn't entered) is:

        vmovdqa64       .LC0(%rip), %zmm1
        vpbroadcastb    %edx, %zmm0
        vpcmpb  $1, %zmm0, %zmm1, %k1
        vmovdqu8        (%rsi), %zmm0{%k1}{z}
        vmovdqu8        (%rdi), %zmm1{%k1}{z}
        vpavgb  %zmm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rax){%k1}

that is, a compare of a vector of { niter, niter, ... } with { 0, 1, 2, 3,
... } producing the mask (that has a latency of 3 according to Agner) and
then simply the vectorized code, masked.  You could probably hand-code that
in assembly if you are interested in the (optimal) performance outcome.

For now we probably want to have the main loop traditionally vectorized
without masking, because Intel has poor mask support and AMD has bad latency
on the mask producing compares.  But having a masked vectorized epilog
avoids the need for a scalar epilog, saving code-size, and avoids the need
to vectorize that multiple times (or choosing SSE vectors here).

For Zen4 the above will of course utilize two 512bit op halves even when
one is fully masked (well, I suppose at least that this is the case).
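As a starting point for such an experiment, a hedged intrinsics sketch of
that masked epilogue (the function name is invented, and it uses an unsigned
compare with saturation where the asm above uses a signed byte compare; both
select the lanes below the remainder):

        #include <immintrin.h>

        /* Masked averaging epilogue: build the mask by comparing
           {0,...,63} with the broadcast remainder, then run the vector
           body once under that mask.  */
        static void avg_epilogue (unsigned char *dst,
                                  const unsigned char *s1,
                                  const unsigned char *s2,
                                  unsigned int remaining)
        {
          char iota[64];                       /* stands in for .LC0 */
          for (int i = 0; i < 64; ++i)
            iota[i] = (char) i;
          __m512i idx = _mm512_loadu_si512 (iota);
          unsigned int m = remaining < 255 ? remaining : 255;
          __mmask64 k = _mm512_cmplt_epu8_mask (idx,
                                                _mm512_set1_epi8 ((char) m));
          __m512i a = _mm512_maskz_loadu_epi8 (k, s1);
          __m512i b = _mm512_maskz_loadu_epi8 (k, s2);
          _mm512_mask_storeu_epi8 (dst, k, _mm512_avg_epu8 (a, b));
        }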
Richard Biener changed:

           What        |Removed     |Added
   ----------------------------------------------------------------------
   Blocks              |            |53947
   Last reconfirmed    |            |2023-01-16
   Target              |            |x86_64-*-*
   Keywords            |            |missed-optimization
   CC                  |            |rguenth at gcc dot gnu.org
   Ever confirmed      |0           |1
   Status              |UNCONFIRMED |NEW

--- Comment #1 from Richard Biener ---
One issue is that we perform at most one epilogue loop vectorization, so
with AVX512 we vectorize the epilogue with AVX2, but its epilogue remains
unvectorized.  With AVX512 we'd want to use a fully masked epilogue using
AVX512 instead.  I started working on fully masked vectorization support
for AVX512 but got distracted.

Another option would be to use SSE vectorization for the epilogue (note for
SSE we vectorize the epilogue with 64bit half-SSE vectors!), which would
mean giving the target (some) control over the mode used for vectorizing
the epilogue.  That is, in vect_analyze_loop change

      /* For epilogues start the analysis from the first mode.  The
         motivation behind starting from the beginning comes from cases
         where the VECTOR_MODES array may contain length-agnostic and
         length-specific modes.  Their ordering is not guaranteed, so we
         could end up picking a mode for the main loop that is after the
         epilogue's optimal mode.  */
      vector_modes[0] = autodetected_vector_mode;

to go through a target hook (possibly first producing a "candidate mode"
set and allowing the target to prune that).  This might be an "easy" fix
for the AVX512 issue for low-trip loops.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations