Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
On Tue, Oct 1, 2013 at 3:50 PM, Jan Hubicka wrote:
>> > Hi Wei Mi,
>> >
>> > Have you checked in your patch?
>> >
>> > --
>> > H.J.
>>
>> No, I haven't. Honza wants me to wait for his testing on AMD hardware.
>> http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01603.html
>
> I only wanted to separate it from the changes in generic so the regular
> testers can pick it up separately. So just go ahead and check it in.
>
> Honza

Thanks, checked in as r203095.

Wei Mi.
Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
> > Hi Wei Mi,
> >
> > Have you checked in your patch?
> >
> > --
> > H.J.
>
> No, I haven't. Honza wants me to wait for his testing on AMD hardware.
> http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01603.html

I only wanted to separate it from the changes in generic so the regular
testers can pick it up separately. So just go ahead and check it in.

Honza
Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
> Hi Wei Mi,
>
> Have you checked in your patch?
>
> --
> H.J.

No, I haven't. Honza wants me to wait for his testing on AMD hardware.
http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01603.html
Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
On Sun, Sep 22, 2013 at 2:29 AM, Uros Bizjak wrote:
> On Wed, Sep 18, 2013 at 3:45 PM, Zamyatin, Igor wrote:
>> Ccing Uros. Changes in i386.md could be related to the fix for PR57954.
>
>> -----Original Message-----
>> From: Wei Mi [mailto:[email protected]]
>> Sent: Thursday, September 12, 2013 2:51 AM
>> To: GCC Patches
>> Cc: David Li; Zamyatin, Igor
>> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
>>
>> For the following testcase 1.c, on westmere and sandybridge, performance
>> with the option -mtune-ctrl=^use_vector_fp_converts is better (improves
>> from 3.46s to 2.83s). It means cvtss2sd is often better than
>> unpcklps+cvtps2pd on recent x86 platforms.
>>
>> 1.c:
>> float total = 0.2;
>> int k = 5;
>>
>> int main() {
>>   int i;
>>
>>   for (i = 0; i < 10; i++) {
>>     total += (0.5 + k);
>>   }
>>
>>   return total == 0.3;
>> }
>>
>> assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts:
>> .L2:
>>     unpcklps %xmm0, %xmm0
>>     subl     $1, %eax
>>     cvtps2pd %xmm0, %xmm0
>>     addsd    %xmm1, %xmm0
>>     unpcklpd %xmm0, %xmm0
>>     cvtpd2ps %xmm0, %xmm0
>>     jne      .L2
>>
>> assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts:
>> .L2:
>>     cvtss2sd %xmm0, %xmm0
>>     subl     $1, %eax
>>     addsd    %xmm1, %xmm0
>>     cvtsd2ss %xmm0, %xmm0
>>     jne      .L2
>>
>> But for testcase 2.c (thanks to Igor Zamyatin for the testcase),
>> performance with the option -mtune-ctrl=^use_vector_fp_converts is worse.
>> Analysis of the assembly shows the performance degradation comes from a
>> partial reg stall caused by cvtsd2ss. Adding "pxor %xmm0, %xmm0" before
>> "cvtsd2ss b(,%rdx,8), %xmm0" gets the performance back.
>>
>> 2.c:
>> double b[1024];
>>
>> float a[1024];
>>
>> int main()
>> {
>>   int i;
>>   for (i = 0; i < 1024 * 1024 * 256; i++)
>>     a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
>>   return (int)a[512];
>> }
>>
>> without -mtune-ctrl=^use_vector_fp_converts:
>> .L2:
>>     movl     %eax, %edx
>>     addl     $1, %eax
>>     andl     $1023, %edx
>>     cmpl     $268435456, %eax
>>     movsd    b(,%rdx,8), %xmm0
>>     cvtpd2ps %xmm0, %xmm0      ==> no partial reg stall because of movsd
>>     mulss    a(,%rdx,4), %xmm0
>>     movss    %xmm0, a(,%rdx,4)
>>     jne      .L2
>>
>> with -mtune-ctrl=^use_vector_fp_converts:
>> .L2:
>>     movl     %eax, %edx
>>     addl     $1, %eax
>>     andl     $1023, %edx
>>     cmpl     $268435456, %eax
>>     cvtsd2ss b(,%rdx,8), %xmm0 ==> partial reg stall; needs "pxor %xmm0, %xmm0" before this insn
>>     mulss    a(,%rdx,4), %xmm0
>>     movss    %xmm0, a(,%rdx,4)
>>     jne      .L2
>>
>> So the patch turns off use_vector_fp_converts for m_CORE_ALL to use
>> cvtss2sd/cvtsd2ss directly, and adds "pxor %xmmreg, %xmmreg" before
>> cvtss2sd/cvtsd2ss to break the partial reg stall (similar to what r201308
>> does for cvtsi2ss/cvtsi2sd). Bootstrap and regression tests pass. OK for
>> trunk?
>>
>> Thanks,
>> Wei Mi.
>>
>> 2013-09-11  Wei Mi
>>
>>         * config/i386/x86-tune.def (DEF_TUNE): Remove m_CORE_ALL.
>>         * config/i386/i386.md: Add define_peephole2 to break partial
>>         reg stall for cvtss2sd/cvtsd2ss.
>
> You don't need reload_completed in peephole2 patterns.
>
> Otherwise the patch is OK.
>
> Thanks,
> Uros.

Hi Wei Mi,

Have you checked in your patch?

--
H.J.
Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
On Wed, Sep 18, 2013 at 3:45 PM, Zamyatin, Igor wrote:
> Ccing Uros. Changes in i386.md could be related to the fix for PR57954.

> -----Original Message-----
> From: Wei Mi [mailto:[email protected]]
> Sent: Thursday, September 12, 2013 2:51 AM
> To: GCC Patches
> Cc: David Li; Zamyatin, Igor
> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
>
> For the following testcase 1.c, on westmere and sandybridge, performance
> with the option -mtune-ctrl=^use_vector_fp_converts is better (improves
> from 3.46s to 2.83s). It means cvtss2sd is often better than
> unpcklps+cvtps2pd on recent x86 platforms.
>
> 1.c:
> float total = 0.2;
> int k = 5;
>
> int main() {
>   int i;
>
>   for (i = 0; i < 10; i++) {
>     total += (0.5 + k);
>   }
>
>   return total == 0.3;
> }
>
> assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     unpcklps %xmm0, %xmm0
>     subl     $1, %eax
>     cvtps2pd %xmm0, %xmm0
>     addsd    %xmm1, %xmm0
>     unpcklpd %xmm0, %xmm0
>     cvtpd2ps %xmm0, %xmm0
>     jne      .L2
>
> assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     cvtss2sd %xmm0, %xmm0
>     subl     $1, %eax
>     addsd    %xmm1, %xmm0
>     cvtsd2ss %xmm0, %xmm0
>     jne      .L2
>
> But for testcase 2.c (thanks to Igor Zamyatin for the testcase),
> performance with the option -mtune-ctrl=^use_vector_fp_converts is worse.
> Analysis of the assembly shows the performance degradation comes from a
> partial reg stall caused by cvtsd2ss. Adding "pxor %xmm0, %xmm0" before
> "cvtsd2ss b(,%rdx,8), %xmm0" gets the performance back.
>
> 2.c:
> double b[1024];
>
> float a[1024];
>
> int main()
> {
>   int i;
>   for (i = 0; i < 1024 * 1024 * 256; i++)
>     a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
>   return (int)a[512];
> }
>
> without -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     movl     %eax, %edx
>     addl     $1, %eax
>     andl     $1023, %edx
>     cmpl     $268435456, %eax
>     movsd    b(,%rdx,8), %xmm0
>     cvtpd2ps %xmm0, %xmm0      ==> no partial reg stall because of movsd
>     mulss    a(,%rdx,4), %xmm0
>     movss    %xmm0, a(,%rdx,4)
>     jne      .L2
>
> with -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     movl     %eax, %edx
>     addl     $1, %eax
>     andl     $1023, %edx
>     cmpl     $268435456, %eax
>     cvtsd2ss b(,%rdx,8), %xmm0 ==> partial reg stall; needs "pxor %xmm0, %xmm0" before this insn
>     mulss    a(,%rdx,4), %xmm0
>     movss    %xmm0, a(,%rdx,4)
>     jne      .L2
>
> So the patch turns off use_vector_fp_converts for m_CORE_ALL to use
> cvtss2sd/cvtsd2ss directly, and adds "pxor %xmmreg, %xmmreg" before
> cvtss2sd/cvtsd2ss to break the partial reg stall (similar to what r201308
> does for cvtsi2ss/cvtsi2sd). Bootstrap and regression tests pass. OK for
> trunk?
>
> Thanks,
> Wei Mi.
>
> 2013-09-11  Wei Mi
>
>         * config/i386/x86-tune.def (DEF_TUNE): Remove m_CORE_ALL.
>         * config/i386/i386.md: Add define_peephole2 to break partial
>         reg stall for cvtss2sd/cvtsd2ss.

You don't need reload_completed in peephole2 patterns.

Otherwise the patch is OK.

Thanks,
Uros.
Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
Ping.

> -----Original Message-----
> From: Wei Mi [mailto:[email protected]]
> Sent: Thursday, September 12, 2013 2:51 AM
> To: GCC Patches
> Cc: David Li; Zamyatin, Igor
> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
>
> For the following testcase 1.c, on westmere and sandybridge, performance
> with the option -mtune-ctrl=^use_vector_fp_converts is better (improves
> from 3.46s to 2.83s). It means cvtss2sd is often better than
> unpcklps+cvtps2pd on recent x86 platforms.
>
> 1.c:
> float total = 0.2;
> int k = 5;
>
> int main() {
>   int i;
>
>   for (i = 0; i < 10; i++) {
>     total += (0.5 + k);
>   }
>
>   return total == 0.3;
> }
>
> assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     unpcklps %xmm0, %xmm0
>     subl     $1, %eax
>     cvtps2pd %xmm0, %xmm0
>     addsd    %xmm1, %xmm0
>     unpcklpd %xmm0, %xmm0
>     cvtpd2ps %xmm0, %xmm0
>     jne      .L2
>
> assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     cvtss2sd %xmm0, %xmm0
>     subl     $1, %eax
>     addsd    %xmm1, %xmm0
>     cvtsd2ss %xmm0, %xmm0
>     jne      .L2
>
> But for testcase 2.c (thanks to Igor Zamyatin for the testcase),
> performance with the option -mtune-ctrl=^use_vector_fp_converts is worse.
> Analysis of the assembly shows the performance degradation comes from a
> partial reg stall caused by cvtsd2ss. Adding "pxor %xmm0, %xmm0" before
> "cvtsd2ss b(,%rdx,8), %xmm0" gets the performance back.
>
> 2.c:
> double b[1024];
>
> float a[1024];
>
> int main()
> {
>   int i;
>   for (i = 0; i < 1024 * 1024 * 256; i++)
>     a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
>   return (int)a[512];
> }
>
> without -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     movl     %eax, %edx
>     addl     $1, %eax
>     andl     $1023, %edx
>     cmpl     $268435456, %eax
>     movsd    b(,%rdx,8), %xmm0
>     cvtpd2ps %xmm0, %xmm0      ==> no partial reg stall because of movsd
>     mulss    a(,%rdx,4), %xmm0
>     movss    %xmm0, a(,%rdx,4)
>     jne      .L2
>
> with -mtune-ctrl=^use_vector_fp_converts:
> .L2:
>     movl     %eax, %edx
>     addl     $1, %eax
>     andl     $1023, %edx
>     cmpl     $268435456, %eax
>     cvtsd2ss b(,%rdx,8), %xmm0 ==> partial reg stall; needs "pxor %xmm0, %xmm0" before this insn
>     mulss    a(,%rdx,4), %xmm0
>     movss    %xmm0, a(,%rdx,4)
>     jne      .L2
>
> So the patch turns off use_vector_fp_converts for m_CORE_ALL to use
> cvtss2sd/cvtsd2ss directly, and adds "pxor %xmmreg, %xmmreg" before
> cvtss2sd/cvtsd2ss to break the partial reg stall (similar to what r201308
> does for cvtsi2ss/cvtsi2sd). Bootstrap and regression tests pass. OK for
> trunk?
>
> Thanks,
> Wei Mi.
>
> 2013-09-11  Wei Mi
>
>         * config/i386/x86-tune.def (DEF_TUNE): Remove m_CORE_ALL.
>         * config/i386/i386.md: Add define_peephole2 to break partial
>         reg stall for cvtss2sd/cvtsd2ss.
>
> Index: config/i386/x86-tune.def
> ===================================================================
> --- config/i386/x86-tune.def    (revision 201963)
> +++ config/i386/x86-tune.def    (working copy)
> @@ -189,7 +189,7 @@ DEF_TUNE (X86_TUNE_NOT_VECTORMODE, "not_
>  /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
>     from FP to FP. */
>  DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
> -          m_CORE_ALL | m_AMDFAM10 | m_GENERIC)
> +          m_AMDFAM10 | m_GENERIC)
>  /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
>     from integer to FP. */
>  DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
> Index: config/i386/i386.md
> ===================================================================
> --- config/i386/i386.md (revision 201963)
> +++ config/i386/i386.md (working copy)
> @@ -5075,6 +5075,63 @@
>    emit_move_insn (operands[0], CONST0_RTX (mode));
>  })
>
> +;; Break partial reg stall for cvtsd2ss.
> +
> +(define_peephole2
> +  [(set (match_operand:SF 0 "register_operand")
> +        (float_truncate:SF
> +          (match_operand:DF 1 "nonimmediate_operand")))]
> +  "TARGET_SSE2 && TARGET_SSE_MATH
> +   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +   && optimize_function_for_speed_p (cfun)
> +   && reload_completed && SSE_REG_P (operands[0])
> +   && peep2_reg_dead_p (0, operands[0])
> +   && (!SSE_REG_P (operands[1])
> +       || REGNO (operands[0]) != REGNO (operands[1]))"
> +  [(set (match_dup 0)
> +        (vec_merge:V4SF
> +          (vec_duplicate:V4SF
> +            (float_truncate:V2SF
> +              (match_dup 1)))
> +          (match_dup 0)
> +          (const_int 1)))]
> +{
> +  operands[0] = simplify_gen_subreg (V4SFmode, operands[0],
> +                                     SFmode, 0);
> +  operands[1] = simplify_gen_subreg (V2DFmode, operands[1],
> +                                     DFmode, 0);
> +  emit_move_insn (operands[0], CONST0_RTX (V4SFmode));
> +})
RE: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
Ccing Uros. Changes in i386.md could be related to the fix for PR57954.

Thanks,
Igor

-----Original Message-----
From: Wei Mi [mailto:[email protected]]
Sent: Thursday, September 12, 2013 2:51 AM
To: GCC Patches
Cc: David Li; Zamyatin, Igor
Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL

For the following testcase 1.c, on westmere and sandybridge, performance
with the option -mtune-ctrl=^use_vector_fp_converts is better (improves
from 3.46s to 2.83s). It means cvtss2sd is often better than
unpcklps+cvtps2pd on recent x86 platforms.

1.c:
float total = 0.2;
int k = 5;

int main() {
  int i;

  for (i = 0; i < 10; i++) {
    total += (0.5 + k);
  }

  return total == 0.3;
}

assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts:
.L2:
    unpcklps %xmm0, %xmm0
    subl     $1, %eax
    cvtps2pd %xmm0, %xmm0
    addsd    %xmm1, %xmm0
    unpcklpd %xmm0, %xmm0
    cvtpd2ps %xmm0, %xmm0
    jne      .L2

assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts:
.L2:
    cvtss2sd %xmm0, %xmm0
    subl     $1, %eax
    addsd    %xmm1, %xmm0
    cvtsd2ss %xmm0, %xmm0
    jne      .L2

But for testcase 2.c (thanks to Igor Zamyatin for the testcase),
performance with the option -mtune-ctrl=^use_vector_fp_converts is worse.
Analysis of the assembly shows the performance degradation comes from a
partial reg stall caused by cvtsd2ss. Adding "pxor %xmm0, %xmm0" before
"cvtsd2ss b(,%rdx,8), %xmm0" gets the performance back.

2.c:
double b[1024];

float a[1024];

int main()
{
  int i;
  for (i = 0; i < 1024 * 1024 * 256; i++)
    a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
  return (int)a[512];
}

without -mtune-ctrl=^use_vector_fp_converts:
.L2:
    movl     %eax, %edx
    addl     $1, %eax
    andl     $1023, %edx
    cmpl     $268435456, %eax
    movsd    b(,%rdx,8), %xmm0
    cvtpd2ps %xmm0, %xmm0      ==> no partial reg stall because of movsd
    mulss    a(,%rdx,4), %xmm0
    movss    %xmm0, a(,%rdx,4)
    jne      .L2

with -mtune-ctrl=^use_vector_fp_converts:
.L2:
    movl     %eax, %edx
    addl     $1, %eax
    andl     $1023, %edx
    cmpl     $268435456, %eax
    cvtsd2ss b(,%rdx,8), %xmm0 ==> partial reg stall; needs "pxor %xmm0, %xmm0" before this insn
    mulss    a(,%rdx,4), %xmm0
    movss    %xmm0, a(,%rdx,4)
    jne      .L2

So the patch turns off use_vector_fp_converts for m_CORE_ALL to use
cvtss2sd/cvtsd2ss directly, and adds "pxor %xmmreg, %xmmreg" before
cvtss2sd/cvtsd2ss to break the partial reg stall (similar to what r201308
does for cvtsi2ss/cvtsi2sd). Bootstrap and regression tests pass. OK for
trunk?

Thanks,
Wei Mi.

2013-09-11  Wei Mi

        * config/i386/x86-tune.def (DEF_TUNE): Remove m_CORE_ALL.
        * config/i386/i386.md: Add define_peephole2 to break partial
        reg stall for cvtss2sd/cvtsd2ss.

Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 201963)
+++ config/i386/x86-tune.def    (working copy)
@@ -189,7 +189,7 @@ DEF_TUNE (X86_TUNE_NOT_VECTORMODE, "not_
 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
    from FP to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
-          m_CORE_ALL | m_AMDFAM10 | m_GENERIC)
+          m_AMDFAM10 | m_GENERIC)
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md (revision 201963)
+++ config/i386/i386.md (working copy)
@@ -5075,6 +5075,63 @@
   emit_move_insn (operands[0], CONST0_RTX (mode));
 })

+;; Break partial reg stall for cvtsd2ss.
+
+(define_peephole2
+  [(set (match_operand:SF 0 "register_operand")
+        (float_truncate:SF
+          (match_operand:DF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 && TARGET_SSE_MATH
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+   && optimize_function_for_speed_p (cfun)
+   && reload_completed && SSE_REG_P (operands[0])
+   && peep2_reg_dead_p (0, operands[0])
+   && (!SSE_REG_P (operands[1])
+       || REGNO (operands[0]) != REGNO (operands[1]))"
+  [(set (match_dup 0)
+        (vec_merge:V4SF
+          (vec_duplicate:V4SF
+            (float_truncate:V2SF
+              (match_dup 1)))
+          (match_dup 0)
+          (const_int 1)))]
+{
+  operands[0] = simplify_gen_subreg (V4SFmode, operands[0],
+                                     SFmode, 0);
+  operands[1] = simplify_gen_subreg (V2DFmode, operands[1],
+                                     DFmode, 0);
+  emit_move_insn (operands[0], CONST0_RTX (V4SFmode));
+})
+
+;; Break partial reg stall for cvtss2sd.
+
+(define_peephole2
+  [(set (match_operand
