[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #17 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:e7a7dbb5ca5dd69689f1a462ba7620180acfe8b0

commit r12-6342-ge7a7dbb5ca5dd69689f1a462ba7620180acfe8b0
Author: liuhongt
Date:   Mon Dec 20 11:13:38 2021 +0800

    Allow propagations from inner loop to outer loop.

    NULL is considered as an outer loop of any other loop.

    gcc/ChangeLog:

            PR rtl-optimization/103750
            * fwprop.c (forward_propagate_into): Allow propagations from
            inner loop to outer loop.

    gcc/testsuite/ChangeLog:

            * g++.target/i386/pr103750-fwprop-1.C: New test.
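For context, the relaxed check in forward_propagate_into should look roughly
like the following (a sketch reconstructed from the commit description,
assuming flow_loop_nested_p is the nesting test used and that a NULL
loop_father is treated as the outermost "loop"; treat it as an approximation,
not the verbatim patch):

  /* Allow propagations into a loop only for reg-to-reg copies, since
     replacing one register by another shouldn't increase the cost.
     Propagations from an inner loop to its outer loop are also OK,
     since they can only move the computation towards the loop exit.  */
  struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
  struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
  if ((reg_prop_only
       || (def_loop != use_loop
           && !flow_loop_nested_p (use_loop, def_loop)))
      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
    return false;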
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #16 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:1a7ce8570997eb1596c803443d20687b43fa2e47

commit r12-6103-g1a7ce8570997eb1596c803443d20687b43fa2e47
Author: liuhongt
Date:   Wed Dec 22 16:48:54 2021 +0800

    Combine vpcmpuw + zero_extend to vpcmpuw.

    vcmp{ps,ph,pd} and vpcmp{,u}{b,w,d,q} implicitly clear the upper bits
    of the destination.

    gcc/ChangeLog:

            PR target/103750
            * config/i386/sse.md (*_cmp3_zero_extend): New pre_reload
            define_insn_and_split.
            (*_cmp3_zero_extend): Ditto.
            (*_ucmp3_zero_extend): Ditto.
            (*_ucmp3_zero_extend): Ditto.
            (*_cmp3_zero_extend_2): Ditto.
            (*_cmp3_zero_extend_2): Ditto.
            (*_ucmp3_zero_extend_2): Ditto.
            (*_ucmp3_zero_extend_2): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/avx512bw-pr103750-1.c: New test.
            * gcc.target/i386/avx512bw-pr103750-2.c: New test.
            * gcc.target/i386/avx512f-pr103750-1.c: New test.
            * gcc.target/i386/avx512f-pr103750-2.c: New test.
            * gcc.target/i386/avx512fp16-pr103750-1.c: New test.
            * gcc.target/i386/avx512fp16-pr103750-2.c: New test.
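As an illustration of what the new patterns catch, a minimal sketch
(hypothetical, not one of the committed testcases; the dg- directives and
option set are guesses):

/* { dg-do compile } */
/* { dg-options "-O2 -mavx512vl -mavx512bw" } */
/* { dg-final { scan-assembler-not "kmov" } } */

#include <immintrin.h>

unsigned int
foo (__m256i a, __m256i b)
{
  /* vpcmpuw implicitly clears the upper bits of the destination mask
     register, so widening the __mmask16 result to unsigned int should
     fold into the compare instead of emitting a separate kmov.  */
  return _mm256_cmpeq_epu16_mask (a, b);
}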
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #15 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #14)
> Created attachment 52032 [details]
> update patch
>
> Update patch, now GCC can generate optimal code

The current fix adds define_insn_and_splits for three things:

1. Combine vpcmpuw and zero_extend into vpcmpuw.
2. Canonicalize the vpcmpuw pattern so that CSE can replace a duplicate
   vpcmpuw with just a kmov.
3. Use DImode as the destination of the zero_extend so that cprop_hardreg
   can eliminate the redundant kmov.

But the sink issue still exists: for the testcase in PR103774 there is a
memory_operand in the vpcmpuw, combine fails due to a cost increase, and
the redundant kmov remains in the loop.
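To make that remaining case concrete, a hypothetical reduction (this is not
the actual PR103774 testcase; the intrinsics are chosen only to show the
shape of the problem):

#include <immintrin.h>

unsigned int
foo (const __m256i *p, __m256i max)
{
  /* The load can be folded into vpcmpuw as a memory operand; the
     combined compare-plus-zero_extend then looks more expensive to
     combine, and the zero-extending kmov of the mask survives.  */
  __m256i data = _mm256_loadu_si256 (p);
  return _mm256_cmpgt_epu16_mask (data, max);
}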
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

Hongtao.liu changed:

           What              |Removed |Added
---------------------------------------------
 Attachment #52031 is       |0       |1
            obsolete        |        |

--- Comment #14 from Hongtao.liu ---
Created attachment 52032
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52032&action=edit
update patch

Update patch, now GCC can generate optimal code.

For #c0:

.L4:
        vmovdqu (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k1
        vpcmpuw $0, %ymm0, %ymm2, %k0
        kortestw        %k0, %k1
        je      .L10
        kortestw        %k1, %k1
        je      .L5
        kmovd   %k1, %eax

For #c6:

.L4:
        vmovdqu (%rdi), %ymm2
        vmovdqu 32(%rdi), %ymm1
        vpcmpuw $0, %ymm0, %ymm2, %k3
        vpcmpuw $0, %ymm0, %ymm1, %k0
        kortestd        %k0, %k3
        je      .L10
        kortestw        %k3, %k3
        je      .L5
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #13 from Hongtao.liu ---
Created attachment 52031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52031&action=edit
untested patch

The attached patch can optimize #c0 to

        vmovdqu (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k1
        vpcmpuw $0, %ymm0, %ymm2, %k0
        kmovw   %k1, %k2
        kortestw        %k0, %k1
        je      .L10

and #c6 to

.L4:
        vmovdqu (%rdi), %ymm2
        vmovdqu 32(%rdi), %ymm1
        vpcmpuw $0, %ymm0, %ymm2, %k3
        vpcmpuw $0, %ymm0, %ymm1, %k0
        kmovw   %k3, %k1
        kmovw   %k0, %k2
        kortestd        %k2, %k1
        je      .L10

It should be much better than the original version, but still a little
suboptimal: the first kmovw should be sunk to the exit edge, and the latter
two kmovws should be eliminated.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #12 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Thiago Macieira from comment #6)
> > It got worse. Now I'm seeing:
> >
> > .L807:
> >         vmovdqu16       (%rsi), %ymm2
> >         vmovdqu16       32(%rsi), %ymm3
> >         vpcmpuw $6, %ymm0, %ymm2, %k2
> >         vpcmpuw $6, %ymm0, %ymm3, %k3
> >         kmovw   %k2, %eax
> >         kmovw   %k3, %edx
> >         kmovd   %eax, %k4
> >         kmovd   %edx, %k5

I guess there are other uses of %eax/%edx below; that's why the RA didn't
choose the k alternative.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #11 from Hongtao.liu ---
(In reply to Thiago Macieira from comment #6)
> It got worse. Now I'm seeing:
>
> .L807:
>         vmovdqu16       (%rsi), %ymm2
>         vmovdqu16       32(%rsi), %ymm3
>         vpcmpuw $6, %ymm0, %ymm2, %k2
>         vpcmpuw $6, %ymm0, %ymm3, %k3
>         kmovw   %k2, %eax
>         kmovw   %k3, %edx
>         kmovd   %eax, %k4
>         kmovd   %edx, %k5
>         kortestd        %k5, %k4
>         je      .L814

(define_insn "*zero_extend<mode>si2"
  [(set (match_operand:SI 0 "register_operand" "=r,*r,*k")
        (zero_extend:SI
          (match_operand:SWI12 1 "nonimmediate_operand" "m,*k,*km")))]
  "!(TARGET_ZERO_EXTEND_WITH_AND && optimize_function_for_speed_p (cfun))"

zero_extendhisi2 is supported with a k alternative, so it should be
optimized to

        vmovdqu16       (%rsi), %ymm2
        vmovdqu16       32(%rsi), %ymm3
        vpcmpuw $6, %ymm0, %ymm2, %k2
        vpcmpuw $6, %ymm0, %ymm3, %k3
        kmovw   %k2, %k4
        kmovw   %k3, %k5
        kortestd        %k5, %k4

And considering that vpcmpuw implicitly zero-extends %k2, it can be further
optimized to

        vmovdqu16       (%rsi), %ymm2
        vmovdqu16       32(%rsi), %ymm3
        vpcmpuw $6, %ymm0, %ymm2, %k2
        vpcmpuw $6, %ymm0, %ymm3, %k3
        kortestd        %k3, %k2

> Code snippet:
>
> auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U)
> {
>     if constexpr (sizeof(Char) == 1) {
>         __m256i mval = _mm256_set1_epi8(maxval);
>         __m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
>         return _mm256_cmpgt_epu8_mask(data, mval);
>     } else if constexpr (sizeof(Char) == 2) {
>         __m256i mval = _mm256_set1_epi16(maxval);
>         __m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
>         return _mm256_cmpgt_epu16_mask(data, mval);
>     } else if constexpr (sizeof(Char) == 4) {
>         __m256i mval = _mm256_set1_epi32(maxval);
>         __m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
>         return _mm256_cmpgt_epu32_mask(data, mval);
>     }
> };
> /*...*/
> auto mask1 = loadAndCompare(n1);
> auto mask2 = loadAndCompare(n2);
>
> I can make a compilable version if you need me to

Yes, please.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #10 from Hongtao.liu ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to Thiago Macieira from comment #0)
> > Testcase:
> ...
> > The assembly for this produces:
> >
> >         vmovdqu16       (%rdi), %ymm1
> >         vmovdqu16       32(%rdi), %ymm2
> >         vpcmpuw $0, %ymm0, %ymm1, %k0
> >         vpcmpuw $0, %ymm0, %ymm2, %k1
> >         kmovw   %k0, %edx
> >         kmovw   %k1, %eax
> >         kortestw        %k1, %k0
> >         je      .L10
> >
> > Those two KMOVW instructions aren't required for the check that
> > follows. They're also dispatched on port 0, same as the KORTESTW,
> > meaning the KORTEST can't be dispatched until those two have executed,
> > thus introducing a 2-cycle delay in this loop.
>
> These are not NOP moves but zero-extensions.
>
>         vmovdqu16       (%rdi), %ymm1   # 93 [c=17 l=6]  movv16hi_internal/2
>         vmovdqu16       32(%rdi), %ymm2 # 94 [c=21 l=7]  movv16hi_internal/2
>         vpcmpuw $0, %ymm0, %ymm1, %k0   # 21 [c=4 l=7]  avx512vl_ucmpv16hi3
>         vpcmpuw $0, %ymm0, %ymm2, %k1   # 27 [c=4 l=7]  avx512vl_ucmpv16hi3
>         kmovw   %k0, %edx               # 30 [c=4 l=4]  *zero_extendhisi2/1
>         kmovw   %k1, %eax               # 29 [c=4 l=4]  *zero_extendhisi2/1
>         kortestw        %k1, %k0        # 31 [c=4 l=4]  kortesthi
>
> since for some reason tree optimizers give us:
>
>   _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
>   _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
>   _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
>   _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
>   _2 = (int) _27;
>   _3 = (int) _29;
>   _15 = __builtin_ia32_kortestzhi (_3, _2);
>
> Is there any way to avoid the zero_extensions for _2 = (int) _27 and
> _3 = (int) _29?

That's because __builtin_ia32_kortestzhi is defined to accept two short
parameters. Also, the ABI doesn't ask for clearing the upper bits; e.g. for
the testcase

int
__attribute__((noipa))
foo (short a)
{
  return a;
}

int
foo1 (short a)
{
  return foo (a);
}

GCC generates

_Z3foos:
        movswl  %di, %eax
        ret
_Z4foo1s:
        movswl  %di, %edi
        jmp     _Z3foos

and the movswl in foo1 seems redundant.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #9 from Uroš Bizjak ---
(In reply to Thiago Macieira from comment #0)
> Testcase:
...
> The assembly for this produces:
>
>         vmovdqu16       (%rdi), %ymm1
>         vmovdqu16       32(%rdi), %ymm2
>         vpcmpuw $0, %ymm0, %ymm1, %k0
>         vpcmpuw $0, %ymm0, %ymm2, %k1
>         kmovw   %k0, %edx
>         kmovw   %k1, %eax
>         kortestw        %k1, %k0
>         je      .L10
>
> Those two KMOVW instructions aren't required for the check that follows.
> They're also dispatched on port 0, same as the KORTESTW, meaning the
> KORTEST can't be dispatched until those two have executed, thus
> introducing a 2-cycle delay in this loop.

These are not NOP moves but zero-extensions.

        vmovdqu16       (%rdi), %ymm1   # 93 [c=17 l=6]  movv16hi_internal/2
        vmovdqu16       32(%rdi), %ymm2 # 94 [c=21 l=7]  movv16hi_internal/2
        vpcmpuw $0, %ymm0, %ymm1, %k0   # 21 [c=4 l=7]  avx512vl_ucmpv16hi3
        vpcmpuw $0, %ymm0, %ymm2, %k1   # 27 [c=4 l=7]  avx512vl_ucmpv16hi3
        kmovw   %k0, %edx               # 30 [c=4 l=4]  *zero_extendhisi2/1
        kmovw   %k1, %eax               # 29 [c=4 l=4]  *zero_extendhisi2/1
        kortestw        %k1, %k0        # 31 [c=4 l=4]  kortesthi

since for some reason tree optimizers give us:

  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);

Is there any way to avoid the zero_extensions for _2 = (int) _27 and
_3 = (int) _29?

> Clang generates:
>
> .LBB0_2:                              # =>This Inner Loop Header: Depth=1
>         vpcmpeqw        (%rdi), %ymm0, %k0
>         vpcmpeqw        32(%rdi), %ymm0, %k1
>         kortestw        %k0, %k1
>         jne     .LBB0_3
>
> ICC inserts one KMOVW, but not the other.
>
> Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
>
> LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
> It shows the Clang loop runs on average 2.0 cycles per loop, whereas the
> GCC code is 3 cycles/loop.
>
> LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0
> cycles per loop, because it can run in parallel with the second load,
> given that the loads are ports 2 and 3.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #8 from Thiago Macieira ---
Update again: it looks like the issue was the next line I didn't paste, which
was performing _kortestz_mask32_u8 on an __mmask16. The type mismatch was
causing this problem. If I use the correct _kortestz_maskXX_u8, I'm getting:

        vmovdqu8        (%rsi), %ymm2
        vmovdqu8        32(%rsi), %ymm3
        vpcmpub $6, %ymm0, %ymm2, %k0
        vpcmpub $6, %ymm0, %ymm3, %k1
        kortestd        %k1, %k0
        je      .L794

        vmovdqu16       (%rsi), %ymm2
        vmovdqu16       32(%rsi), %ymm3
        vpcmpuw $6, %ymm0, %ymm2, %k0
        vpcmpuw $6, %ymm0, %ymm3, %k1
        kortestw        %k1, %k0
        je      .L807

So it looks like GCC is not completely wrong, but it could be more lenient
(Clang is). You can lower the severity of this issue.
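For illustration, a minimal before/after sketch of that mismatch
(hypothetical code, not the original source; compile with something like
-O2 -mavx512vl -mavx512bw -mavx512dq):

#include <immintrin.h>

int
matched (__m256i a, __m256i b)
{
  __mmask16 k0 = _mm256_cmpgt_epu16_mask (a, b);
  __mmask16 k1 = _mm256_cmpgt_epu16_mask (b, a);
  /* 16-bit kortest on 16-bit masks: no widening kmovs needed.  */
  return _kortestz_mask16_u8 (k0, k1);
}

int
mismatched (__m256i a, __m256i b)
{
  __mmask16 k0 = _mm256_cmpgt_epu16_mask (a, b);
  __mmask16 k1 = _mm256_cmpgt_epu16_mask (b, a);
  /* 32-bit kortest on 16-bit masks: the masks must be zero-extended
     first, which is where the extra kmovs come from.  */
  return _kortestz_mask32_u8 (k0, k1);
}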
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #7 from Thiago Macieira ---
I should add that the same is not happening for Char == char, where the
returned type is __mmask32 (unsigned):

        vmovdqu8        (%rsi), %ymm2
        vmovdqu8        32(%rsi), %ymm3
        vpcmpub $6, %ymm0, %ymm2, %k0
        vpcmpub $6, %ymm0, %ymm3, %k1
        kortestd        %k1, %k0
        je      .L792
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #6 from Thiago Macieira ---
It got worse. Now I'm seeing:

.L807:
        vmovdqu16       (%rsi), %ymm2
        vmovdqu16       32(%rsi), %ymm3
        vpcmpuw $6, %ymm0, %ymm2, %k2
        vpcmpuw $6, %ymm0, %ymm3, %k3
        kmovw   %k2, %eax
        kmovw   %k3, %edx
        kmovd   %eax, %k4
        kmovd   %edx, %k5
        kortestd        %k5, %k4
        je      .L814

Code snippet:

auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U)
{
    if constexpr (sizeof(Char) == 1) {
        __m256i mval = _mm256_set1_epi8(maxval);
        __m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
        return _mm256_cmpgt_epu8_mask(data, mval);
    } else if constexpr (sizeof(Char) == 2) {
        __m256i mval = _mm256_set1_epi16(maxval);
        __m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
        return _mm256_cmpgt_epu16_mask(data, mval);
    } else if constexpr (sizeof(Char) == 4) {
        __m256i mval = _mm256_set1_epi32(maxval);
        __m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
        return _mm256_cmpgt_epu32_mask(data, mval);
    }
};
/*...*/
auto mask1 = loadAndCompare(n1);
auto mask2 = loadAndCompare(n2);

I can make a compilable version if you need me to.
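For reference, a hypothetical self-contained version of the snippet above
(a reconstruction, not the poster's actual file: the hasAbove wrapper, the
loop bounds, and the explicit instantiation are invented, and the
deliberately mismatched _kortestz_mask32_u8 on __mmask16 results reproduces
the type mismatch identified in comment #8):

// Hypothetical standalone reduction; compile with e.g.
//   g++ -std=c++17 -O2 -mavx512vl -mavx512bw
#include <immintrin.h>
#include <cstddef>

template <typename Char>
bool hasAbove(const Char *n, size_t count, Char maxval)
{
    auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U) {
        if constexpr (sizeof(Char) == 1) {
            __m256i mval = _mm256_set1_epi8(maxval);
            __m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
            return _mm256_cmpgt_epu8_mask(data, mval);
        } else if constexpr (sizeof(Char) == 2) {
            __m256i mval = _mm256_set1_epi16(maxval);
            __m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
            return _mm256_cmpgt_epu16_mask(data, mval);
        }
    };
    constexpr size_t N = 32 / sizeof(Char);   // Chars per 256-bit vector
    for (const Char *e = n + count; n + 2 * N <= e; n += 2 * N) {
        auto mask1 = loadAndCompare(n);
        auto mask2 = loadAndCompare(n + N);
        // Deliberately 32-bit kortest even when the masks are __mmask16:
        // this is the width mismatch that forced the extra kmovs.
        if (!_kortestz_mask32_u8(mask1, mask2))
            return true;
    }
    return false;
}

template bool hasAbove(const char16_t *, size_t, char16_t);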
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #5 from Thiago Macieira ---
Maybe this is running afoul of GCC's assumption that a simple
register-register move is free? I've seen it save a constant in an opmask
register, but kmov{d,q} is not free the way mov{l,q} is.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #4 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > It fails here:
> >
> > /* Allow propagations into a loop only for reg-to-reg copies, since
> >    replacing one register by another shouldn't increase the cost.  */
> > struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
> > struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
> > if ((reg_prop_only || def_loop != use_loop)
> >     && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
> >   return false;
> >
> > But def_loop is an inner loop of use_loop; it should be OK to propagate
> > from the inner loop to the outer loop.
>
> I guess def_loop != use_loop is used here with the assumption that such
> "invariants" should have been sunk into the same loop.

There seems to be no pass at the RTL level that does this kind of sinking.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #3 from Hongtao.liu ---
(In reply to Hongtao.liu from comment #2)
> It fails here:
>
> /* Allow propagations into a loop only for reg-to-reg copies, since
>    replacing one register by another shouldn't increase the cost.  */
> struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
> struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
> if ((reg_prop_only || def_loop != use_loop)
>     && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
>   return false;
>
> But def_loop is an inner loop of use_loop; it should be OK to propagate
> from the inner loop to the outer loop.

I guess def_loop != use_loop is used here with the assumption that such
"invariants" should have been sunk into the same loop.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #2 from Hongtao.liu ---
It fails here:

  /* Allow propagations into a loop only for reg-to-reg copies, since
     replacing one register by another shouldn't increase the cost.  */
  struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
  struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
  if ((reg_prop_only || def_loop != use_loop)
      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
    return false;

But def_loop is an inner loop of use_loop; it should be OK to propagate from
the inner loop to the outer loop.
[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #1 from Hongtao.liu ---
The kmovw here is a zero_extend, and at the GIMPLE level it is not redundant
in the loop:

  _31 = MEM[(const __m256i_u * {ref-all})n_5];
  _30 = MEM[(const __m256i_u * {ref-all})n_5 + 32B];
  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);
  _25 = (unsigned char) _15;
  if (_25 != 0)

But at the RTL level _28/_29 are propagated into the kortest; the
zero-extensions become partially redundant yet fail to be sunk or eliminated.

(insn 29 27 30 3 (set (reg:SI 83 [ _2 ])
        (zero_extend:SI (reg:HI 111))) "test.c":24:32 147 {*zero_extendhisi2}
     (nil))
(insn 30 29 31 3 (set (reg:SI 116 [ _29 ])
        (zero_extend:SI (reg:HI 106))) "test.c":24:32 147 {*zero_extendhisi2}
     (nil))
(insn 58 56 60 8 (parallel [
            (set (reg:HI 120)
                (unspec:HI [
                        (subreg:HI (reg:SI 83 [ _2 ]) 0)
                    ] UNSPEC_TZCNT))
            (clobber (reg:CC 17 flags))
(insn 52 51 55 7 (parallel [
            (set (reg/v:SI 88 [ idx ])
                (ctz:SI (reg:SI 116 [ _29 ])))
            (clobber (reg:CC 17 flags))