[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2022-01-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #17 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:e7a7dbb5ca5dd69689f1a462ba7620180acfe8b0

commit r12-6342-ge7a7dbb5ca5dd69689f1a462ba7620180acfe8b0
Author: liuhongt 
Date:   Mon Dec 20 11:13:38 2021 +0800

Allow propagations from inner loop to outer loop.

NULL is considered as an outer loop of any other loop.

gcc/ChangeLog:

PR rtl-optimization/103750
* fwprop.c (forward_propagate_into): Allow propagations from
inner loop to outer loop.

gcc/testsuite/ChangeLog:

* g++.target/i386/pr103750-fwprop-1.C: New test.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-22 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #16 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:1a7ce8570997eb1596c803443d20687b43fa2e47

commit r12-6103-g1a7ce8570997eb1596c803443d20687b43fa2e47
Author: liuhongt 
Date:   Wed Dec 22 16:48:54 2021 +0800

Combine vpcmpuw + zero_extend to vpcmpuw.

vcmp{ps,ph,pd} and vpcmp{,u}{b,w,d,q} implicitly clear the upper bits
of dest.

gcc/ChangeLog:

PR target/103750
* config/i386/sse.md
(*_cmp3_zero_extend):
New pre_reload define_insn_and_split.
(*_cmp3_zero_extend):
Ditto.
(*_ucmp3_zero_extend):
Ditto.
(*_ucmp3_zero_extend):
Ditto.
(*_cmp3_zero_extend_2):
Ditto.
(*_cmp3_zero_extend_2):
Ditto.
(*_ucmp3_zero_extend_2):
Ditto.
(*_ucmp3_zero_extend_2):
Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx512bw-pr103750-1.c: New test.
* gcc.target/i386/avx512bw-pr103750-2.c: New test.
* gcc.target/i386/avx512f-pr103750-1.c: New test.
* gcc.target/i386/avx512f-pr103750-2.c: New test.
* gcc.target/i386/avx512fp16-pr103750-1.c: New test.
* gcc.target/i386/avx512fp16-pr103750-2.c: New test.
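
As an illustration (a minimal sketch, not one of the tests added by this
commit; assumes -O2 -mavx512bw -mavx512vl, and the function name is made up),
this is the kind of source that previously produced the redundant kmovw
zero-extensions and that these patterns now clean up:

#include <immintrin.h>

/* The (int) casts implied by passing the two __mmask16 values to
   _mm512_kortestz (__builtin_ia32_kortestzhi) used to show up as separate
   kmovw zero-extensions; after this commit they can be folded into the
   compares, leaving just the two vpcmpuw and the kortestw.  */
int
no_elements_equal (__m256i a, __m256i b, __m256i val)
{
  __mmask16 k0 = _mm256_cmpeq_epu16_mask (a, val);
  __mmask16 k1 = _mm256_cmpeq_epu16_mask (b, val);
  return _mm512_kortestz (k0, k1);
}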

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-20 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #15 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #14)
> Created attachment 52032 [details]
> update patch
> 
> Update patch, Now gcc can generate optimal code
> 

The current fix adds define_insn_and_split patterns for 3 things:
1. Combine vpcmpuw and zero_extend into vpcmpuw.
2. Canonicalize the vpcmpuw pattern so CSE can replace a duplicate vpcmpuw with
just a kmov.
3. Use DImode as the dest of the zero_extend so cprop_hardreg can eliminate the
redundant kmov.

But the sink issue still exists, i.e. for the testcase in PR103774 there's a
memory_operand in vpcmpuw, combine fails due to the cost increase, and the
redundant kmov remains in the loop.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-19 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

Hongtao.liu  changed:

   What|Removed |Added

  Attachment #52031|0   |1
is obsolete||

--- Comment #14 from Hongtao.liu  ---
Created attachment 52032
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52032&action=edit
update patch

Updated patch. Now GCC can generate optimal code.

For #c0:

.L4:
vmovdqu (%rdi), %ymm1
vmovdqu16   32(%rdi), %ymm2
vpcmpuw $0, %ymm0, %ymm1, %k1
vpcmpuw $0, %ymm0, %ymm2, %k0
kortestw%k0, %k1
je  .L10
kortestw%k1, %k1
je  .L5
kmovd   %k1, %eax



For #c6:

.L4:
vmovdqu (%rdi), %ymm2
vmovdqu 32(%rdi), %ymm1
vpcmpuw $0, %ymm0, %ymm2, %k3
vpcmpuw $0, %ymm0, %ymm1, %k0
kortestd%k0, %k3
je  .L10
kortestw%k3, %k3
je  .L5

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-19 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #13 from Hongtao.liu  ---
Created attachment 52031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52031&action=edit
untested patch.

The attached patch can optimize #c0 to

vmovdqu (%rdi), %ymm1
vmovdqu16   32(%rdi), %ymm2
vpcmpuw $0, %ymm0, %ymm1, %k1
vpcmpuw $0, %ymm0, %ymm2, %k0
kmovw   %k1, %k2
kortestw%k0, %k1
je  .L10


and #c6 to

.L4:
vmovdqu (%rdi), %ymm2
vmovdqu 32(%rdi), %ymm1
vpcmpuw $0, %ymm0, %ymm2, %k3
vpcmpuw $0, %ymm0, %ymm1, %k0
kmovw   %k3, %k1
kmovw   %k0, %k2
kortestd%k2, %k1
je  .L10


It should be much better than the original version, but still a little
suboptimal: the first kmovw should be sunk to the exit edge, and the latter 2
kmovw should be eliminated.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-19 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #12 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Thiago Macieira from comment #6)
> > It got worse. Now I'm seeing:
> > 
> > .L807:
> > vmovdqu16   (%rsi), %ymm2
> > vmovdqu16   32(%rsi), %ymm3
> > vpcmpuw $6, %ymm0, %ymm2, %k2
> > vpcmpuw $6, %ymm0, %ymm3, %k3
> > kmovw   %k2, %eax
> > kmovw   %k3, %edx
> > kmovd   %eax, %k4
> > kmovd   %edx, %k5
I guess there are other uses of %eax/%edx below; that's why the RA didn't
choose the k alternative.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-19 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #11 from Hongtao.liu  ---
(In reply to Thiago Macieira from comment #6)
> It got worse. Now I'm seeing:
> 
> .L807:
> vmovdqu16   (%rsi), %ymm2
> vmovdqu16   32(%rsi), %ymm3
> vpcmpuw $6, %ymm0, %ymm2, %k2
> vpcmpuw $6, %ymm0, %ymm3, %k3
> kmovw   %k2, %eax
> kmovw   %k3, %edx
> kmovd   %eax, %k4
> kmovd   %edx, %k5
> kortestd%k5, %k4
> je  .L814
> 
(define_insn "*zero_extend<mode>si2"
  [(set (match_operand:SI 0 "register_operand" "=r,*r,*k")
(zero_extend:SI
  (match_operand:SWI12 1 "nonimmediate_operand" "m,*k,*km")))]
  "!(TARGET_ZERO_EXTEND_WITH_AND && optimize_function_for_speed_p (cfun))"

zero_extendhisi2 supports a k alternative, so this should be optimized to

 vmovdqu16   (%rsi), %ymm2
 vmovdqu16   32(%rsi), %ymm3
 vpcmpuw $6, %ymm0, %ymm2, %k2
 vpcmpuw $6, %ymm0, %ymm3, %k3
 kmovw   %k2, %k4
 kmovw   %k3, %k5
 kortestd%k5, %k4

And considering that vpcmpuw implicitly zero-extends %k2, it can be further
optimized to

 vmovdqu16   (%rsi), %ymm2
 vmovdqu16   32(%rsi), %ymm3
 vpcmpuw $6, %ymm0, %ymm2, %k2
 vpcmpuw $6, %ymm0, %ymm3, %k3
 kortestd%k3, %k2

> Code snippet:
> 
> auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U)
> {
> if constexpr (sizeof(Char) == 1) {
> __m256i mval = _mm256_set1_epi8(maxval);
> __m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
> return _mm256_cmpgt_epu8_mask(data, mval);
> } else if constexpr (sizeof(Char) == 2) {
> __m256i mval = _mm256_set1_epi16(maxval);
> __m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
> return _mm256_cmpgt_epu16_mask(data, mval);
> } else if constexpr (sizeof(Char) == 4) {
> __m256i mval = _mm256_set1_epi32(maxval);
> __m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
> return _mm256_cmpgt_epu32_mask(data, mval);
> }
> };
> /*...*/
> auto mask1 = loadAndCompare(n1);
> auto mask2 = loadAndCompare(n2);
> 
> I can make a compilable version if you need me to

Yes, please.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-19 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #10 from Hongtao.liu  ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to Thiago Macieira from comment #0)
> > Testcase:
> ...
> > The assembly for this produces:
> > 
> > vmovdqu16   (%rdi), %ymm1
> > vmovdqu16   32(%rdi), %ymm2
> > vpcmpuw $0, %ymm0, %ymm1, %k0
> > vpcmpuw $0, %ymm0, %ymm2, %k1
> > kmovw   %k0, %edx
> > kmovw   %k1, %eax
> > kortestw%k1, %k0
> > je  .L10
> > 
> > Those two KMOVW instructions aren't required for the check that follows.
> > They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> > can't be dispatched until those two have executed, thus introducing a
> > 2-cycle delay in this loop.
> 
> These are not NOP moves but zero-extensions.
> 
> vmovdqu16   (%rdi), %ymm1   # 93[c=17 l=6] 
> movv16hi_internal/2
> vmovdqu16   32(%rdi), %ymm2 # 94[c=21 l=7] 
> movv16hi_internal/2
> vpcmpuw $0, %ymm0, %ymm1, %k0   # 21[c=4 l=7] 
> avx512vl_ucmpv16hi3
> vpcmpuw $0, %ymm0, %ymm2, %k1   # 27[c=4 l=7] 
> avx512vl_ucmpv16hi3
> kmovw   %k0, %edx   # 30[c=4 l=4]  *zero_extendhisi2/1
> kmovw   %k1, %eax   # 29[c=4 l=4]  *zero_extendhisi2/1
> kortestw%k1, %k0# 31[c=4 l=4]  kortesthi
> 
> since for some reason tree optimizers give us:
> 
>   _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
>   _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
>   _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
>   _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
>   _2 = (int) _27;
>   _3 = (int) _29;
>   _15 = __builtin_ia32_kortestzhi (_3, _2);
> 
> 

Is there any way to avoid the zero extension for
>   _2 = (int) _27;
>   _3 = (int) _29;

since __builtin_ia32_kortestzhi is defined to accept 2 short parameters? Also,
the ABI doesn't require clearing the upper bits.

E.g. for the testcase
int
__attribute__((noipa))
foo (short a)
{
return a;
}

int
foo1 (short a)
{
return foo (a);
}


_Z3foos:
movswl  %di, %eax
ret
_Z4foo1s:
movswl  %di, %edi
jmp _Z3foos


The movswl in foo1 seems redundant.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #9 from Uroš Bizjak  ---
(In reply to Thiago Macieira from comment #0)
> Testcase:
...
> The assembly for this produces:
> 
> vmovdqu16   (%rdi), %ymm1
> vmovdqu16   32(%rdi), %ymm2
> vpcmpuw $0, %ymm0, %ymm1, %k0
> vpcmpuw $0, %ymm0, %ymm2, %k1
> kmovw   %k0, %edx
> kmovw   %k1, %eax
> kortestw%k1, %k0
> je  .L10
> 
> Those two KMOVW instructions aren't required for the check that follows.
> They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> can't be dispatched until those two have executed, thus introducing a
> 2-cycle delay in this loop.

These are not NOP moves but zero-extensions.

vmovdqu16   (%rdi), %ymm1   # 93[c=17 l=6]  movv16hi_internal/2
vmovdqu16   32(%rdi), %ymm2 # 94[c=21 l=7]  movv16hi_internal/2
vpcmpuw $0, %ymm0, %ymm1, %k0   # 21[c=4 l=7]  avx512vl_ucmpv16hi3
vpcmpuw $0, %ymm0, %ymm2, %k1   # 27[c=4 l=7]  avx512vl_ucmpv16hi3
kmovw   %k0, %edx   # 30[c=4 l=4]  *zero_extendhisi2/1
kmovw   %k1, %eax   # 29[c=4 l=4]  *zero_extendhisi2/1
kortestw%k1, %k0# 31[c=4 l=4]  kortesthi

since for some reason tree optimizers give us:

  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);


> Clang generates:
> 
> .LBB0_2:# =>This Inner Loop Header: Depth=1
> vpcmpeqw(%rdi), %ymm0, %k0
> vpcmpeqw32(%rdi), %ymm0, %k1
> kortestw%k0, %k1
> jne .LBB0_3
> 
> ICC inserts one KMOVW, but not the other.
> 
> Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
> 
> LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
> It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
> code is 3 cycles/loop.
> 
> LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles
> per loop, because it can run in parallel with the second load, given that
> the loads are ports 2 and 3.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #8 from Thiago Macieira  ---
Update again: it looks like the issue was the next line I didn't paste, which
was performing _kortestz_mask32_u8 on an __mmask16. The type mismatch was
causing this problem.

If I use the correct _kortestz_maskXX_u8, I'm getting:

vmovdqu8(%rsi), %ymm2
vmovdqu832(%rsi), %ymm3
vpcmpub $6, %ymm0, %ymm2, %k0
vpcmpub $6, %ymm0, %ymm3, %k1
kortestd%k1, %k0
je  .L794


vmovdqu16   (%rsi), %ymm2
vmovdqu16   32(%rsi), %ymm3
vpcmpuw $6, %ymm0, %ymm2, %k0
vpcmpuw $6, %ymm0, %ymm3, %k1
kortestw%k1, %k0
je  .L807

So it looks like GCC is not completely wrong, but it could be more lenient
(Clang is). You can lower the severity of this issue.
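
(A minimal sketch of the type-matched usage for the 16-bit case, assuming Char
is char16_t and -O2 -mavx512bw -mavx512vl; the function name is made up:)

#include <immintrin.h>

// With the kortest intrinsic matching the __mmask16 width, the masks stay
// in %k registers and GCC emits vpcmpuw + kortestw without the kmov round
// trip through GPRs.
bool any_above_max (const char16_t *ptr, char16_t maxval)
{
    __m256i mval = _mm256_set1_epi16 (maxval);
    __m256i lo = _mm256_loadu_si256 ((const __m256i *) ptr);
    __m256i hi = _mm256_loadu_si256 ((const __m256i *) (ptr + 16));
    __mmask16 k0 = _mm256_cmpgt_epu16_mask (lo, mval);
    __mmask16 k1 = _mm256_cmpgt_epu16_mask (hi, mval);
    return !_kortestz_mask16_u8 (k0, k1);
}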

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #7 from Thiago Macieira  ---
I should add that the same is not happening for Char == char, where the
returned type is an __mmask32 (unsigned):

vmovdqu8(%rsi), %ymm2
vmovdqu832(%rsi), %ymm3
vpcmpub $6, %ymm0, %ymm2, %k0
vpcmpub $6, %ymm0, %ymm3, %k1
kortestd%k1, %k0
je  .L792

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #6 from Thiago Macieira  ---
It got worse. Now I'm seeing:

.L807:
vmovdqu16   (%rsi), %ymm2
vmovdqu16   32(%rsi), %ymm3
vpcmpuw $6, %ymm0, %ymm2, %k2
vpcmpuw $6, %ymm0, %ymm3, %k3
kmovw   %k2, %eax
kmovw   %k3, %edx
kmovd   %eax, %k4
kmovd   %edx, %k5
kortestd%k5, %k4
je  .L814

Code snippet:

auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U) {
if constexpr (sizeof(Char) == 1) {
__m256i mval = _mm256_set1_epi8(maxval);
__m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
return _mm256_cmpgt_epu8_mask(data, mval);
} else if constexpr (sizeof(Char) == 2) {
__m256i mval = _mm256_set1_epi16(maxval);
__m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
return _mm256_cmpgt_epu16_mask(data, mval);
} else if constexpr (sizeof(Char) == 4) {
__m256i mval = _mm256_set1_epi32(maxval);
__m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
return _mm256_cmpgt_epu32_mask(data, mval);
}
};
/*...*/
auto mask1 = loadAndCompare(n1);
auto mask2 = loadAndCompare(n2);

I can make a compilable version if you need me to.
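
(For reference, a possible self-contained version of the above snippet, under
assumptions: Char is char16_t, and the masks feed the mismatched
_kortestz_mask32_u8 that comment #8 later identifies as the culprit; the
wrapper name is made up. Compile with -O2 -mavx512bw -mavx512vl:)

#include <immintrin.h>

// Standalone sketch for Char == char16_t; the 32-bit kortest on __mmask16
// values is the type mismatch that forces the masks through GPRs.
bool any_above (const char16_t *n1, const char16_t *n2, char16_t maxval)
{
    auto loadAndCompare = [maxval](const char16_t *ptr, unsigned mask = ~0U) {
        __m256i mval = _mm256_set1_epi16 (maxval);
        __m256i data = _mm256_maskz_loadu_epi16 (mask, ptr);
        return _mm256_cmpgt_epu16_mask (data, mval);
    };
    auto mask1 = loadAndCompare (n1);
    auto mask2 = loadAndCompare (n2);
    return !_kortestz_mask32_u8 (mask1, mask2);
}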

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #5 from Thiago Macieira  ---
Maybe this is running afoul of GCC's thinking that a simple register-register
move is free? I've seen it save a constant in an opmask register, but kmov{d,q}
is not free like mov{l,q} is.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #4 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #3)
> (In reply to Hongtao.liu from comment #2)
> > Failed here
> > 
> >   /* Allow propagations into a loop only for reg-to-reg copies, since
> >  replacing one register by another shouldn't increase the cost.  */
> >   struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
> >   struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
> >   if ((reg_prop_only || def_loop != use_loop)
> >   && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
> > return false;
> > 
> > But def_loop is an inner loop of use_loop; it should be OK to propagate
> > from the inner loop to the outer loop.
> 
> I guess def_loop != use_loop is used here with the assumption that those
> "invariants" should be sunk into the same loop.

There seems to be no pass at the RTL level that does this kind of sinking.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #3 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #2)
> Failed here
> 
>   /* Allow propagations into a loop only for reg-to-reg copies, since
>  replacing one register by another shouldn't increase the cost.  */
>   struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
>   struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
>   if ((reg_prop_only || def_loop != use_loop)
>   && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
> return false;
> 
> But def_loop is an inner loop of use_loop; it should be OK to propagate from
> the inner loop to the outer loop.

I guess def_loop != use_loop is used here with the assumption that those
"invariants" should be sunk into the same loop.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #2 from Hongtao.liu  ---
Failed here

  /* Allow propagations into a loop only for reg-to-reg copies, since
 replacing one register by another shouldn't increase the cost.  */
  struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
  struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
  if ((reg_prop_only || def_loop != use_loop)
  && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
return false;

But def_loop is an inner loop of use_loop; it should be OK to propagate from
the inner loop to the outer loop.
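
A sketch of how the check above could be relaxed (assuming
flow_loop_nested_p (outer, inner) from cfgloop.h; the actual fix in r12-6342
may differ in details, e.g. in how the NULL loop_father case is treated):

  /* Allow propagations into a loop only for reg-to-reg copies, since
     replacing one register by another shouldn't increase the cost.
     Propagating from an inner loop to an enclosing outer loop (or to a
     block with no loop_father at all) is also cheap, so allow it too.  */
  struct loop *def_loop = def_insn->bb ()->cfg_bb ()->loop_father;
  struct loop *use_loop = use->bb ()->cfg_bb ()->loop_father;
  bool inner_to_outer
    = (use_loop == NULL
       || (def_loop != NULL && flow_loop_nested_p (use_loop, def_loop)));
  if ((reg_prop_only || (def_loop != use_loop && !inner_to_outer))
      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
    return false;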

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #1 from Hongtao.liu  ---
The kmovw here is a zero_extend, and at the GIMPLE level it's not redundant in
the loop.

  _31 = MEM[(const __m256i_u * {ref-all})n_5];
  _30 = MEM[(const __m256i_u * {ref-all})n_5 + 32B];
  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);
  _25 = (unsigned char) _15;
  if (_25 != 0)

but at the RTL level, _28/_29 are propagated into kortest and become partially
redundant, yet fail to be sunk or eliminated.

(insn 29 27 30 3 (set (reg:SI 83 [ _2 ])
(zero_extend:SI (reg:HI 111))) "test.c":24:32 147 {*zero_extendhisi2}
 (nil))
(insn 30 29 31 3 (set (reg:SI 116 [ _29 ])
(zero_extend:SI (reg:HI 106))) "test.c":24:32 147 {*zero_extendhisi2}
 (nil))

(insn 58 56 60 8 (parallel [
(set (reg:HI 120)
(unspec:HI [
(subreg:HI (reg:SI 83 [ _2 ]) 0)
] UNSPEC_TZCNT))
(clobber (reg:CC 17 flags))

(insn 52 51 55 7 (parallel [
(set (reg/v:SI 88 [ idx ])
(ctz:SI (reg:SI 116 [ _29 ])))
(clobber (reg:CC 17 flags))