Re: Revisit Core tuning flags
>> > http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html
>
> This patch seems reasonable. (In fact I have pretty much the same in my
> tree.) use_vector_fp_converts is actually trying to solve the same
> problem on AMD hardware - you need to type the whole register when
> converting. So it may work well for AMD chips too, or maybe the
> difference is that Intel chips somehow handle "cvtpd2ps %xmm0, %xmm0"
> well even though the upper half of xmm0 is ill defined, while AMD chips
> don't.
>
> The patch seems OK. I do not see a reason for the
>   && peep2_reg_dead_p (0, operands[0])
> test. The reg has to be dead since it is the full destination of the
> operation.

Ok, I see. I will delete it.

> Let's wait a few days before committing so we know the effect of the
> individual changes. I will test it on AMD hardware and we can decide on
> generic tuning then.
>
> Honza

Ok, thanks.

Wei Mi.
Re: Revisit Core tuning flags
> On Sat, Sep 21, 2013 at 3:51 PM, Xinliang David Li wrote:
> > On Sat, Sep 21, 2013 at 12:54 PM, Jan Hubicka wrote:
> >> Hi,
> >> this is an updated version of the patch discussed at
> >> http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00841.html
> >>
> >> It makes CORE tuning follow the optimization guidelines more closely.
> >> In particular it removes some tuning flags for features I implemented
> >> years back specifically for K7/K8 chips that ended up in Core tuning
> >> because it was based on generic. Incrementally I plan to drop some of
> >> these from generic, too.
> >>
> >> Compared to the previous version of the patch I left out the INC_DEC
> >> change, even though Core i7+ should resolve dependencies on partial
> >> flags correctly. The optimization manual still seems to suggest not
> >> using them:
> >>
> >> Assembly/Compiler Coding Rule 33. (M impact, H generality)
> >> INC and DEC instructions should be replaced with ADD or SUB
> >> instructions, because ADD and SUB overwrite all flags, whereas INC
> >> and DEC do not, therefore creating false dependencies on earlier
> >> instructions that set the flags.
> >>
> >> The other change dropped is use_vector_fp_converts, which seems to
> >> improve Core performance.
> >
> > I did not see this in your patch, but Wei has this tuning in this patch:
> >
>
> Sorry, I meant to ask why this part was dropped?

Because I wanted to go with the obvious changes first.

> David

> > http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html

This patch seems reasonable. (In fact I have pretty much the same in my
tree.) use_vector_fp_converts is actually trying to solve the same
problem on AMD hardware - you need to type the whole register when
converting. So it may work well for AMD chips too, or maybe the
difference is that Intel chips somehow handle "cvtpd2ps %xmm0, %xmm0"
well even though the upper half of xmm0 is ill defined, while AMD chips
don't.

The patch seems OK. I do not see a reason for the
  && peep2_reg_dead_p (0, operands[0])
test. The reg has to be dead since it is the full destination of the
operation.

Let's wait a few days before committing so we know the effect of the
individual changes. I will test it on AMD hardware and we can decide on
generic tuning then.

Honza
Re: Revisit Core tuning flags
On Sat, Sep 21, 2013 at 3:51 PM, Xinliang David Li wrote:
> On Sat, Sep 21, 2013 at 12:54 PM, Jan Hubicka wrote:
>> Hi,
>> this is an updated version of the patch discussed at
>> http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00841.html
>>
>> It makes CORE tuning follow the optimization guidelines more closely.
>> In particular it removes some tuning flags for features I implemented
>> years back specifically for K7/K8 chips that ended up in Core tuning
>> because it was based on generic. Incrementally I plan to drop some of
>> these from generic, too.
>>
>> Compared to the previous version of the patch I left out the INC_DEC
>> change, even though Core i7+ should resolve dependencies on partial
>> flags correctly. The optimization manual still seems to suggest not
>> using them:
>>
>> Assembly/Compiler Coding Rule 33. (M impact, H generality)
>> INC and DEC instructions should be replaced with ADD or SUB
>> instructions, because ADD and SUB overwrite all flags, whereas INC
>> and DEC do not, therefore creating false dependencies on earlier
>> instructions that set the flags.
>>
>> The other change dropped is use_vector_fp_converts, which seems to
>> improve Core performance.
>
> I did not see this in your patch, but Wei has this tuning in this patch:
>

Sorry, I meant to ask why this part was dropped?

David

> http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html
>
> thanks,
>
> David
>
>
>>
>> I benchmarked the patch on SPEC2k, and earlier it was benchmarked on
>> SPEC2k6; the performance difference seems to be in the noise. It causes
>> about a 0.3% code size reduction. The main motivation for the patch is
>> to drop some codegen oddities that do not make sense on modern chips.
>>
>> Bootstrapped/regtested x86_64-linux, will commit it shortly.
>> Honza
>>
>> * x86-tune.def (partial_reg_stall): Disable for CoreI7 and newer.
>> (sse_typeless_stores): Enable for core.
>> (sse_load0_by_pxor): Likewise.
>> (four_jump_limit): Disable for core.
>> (pad_returns): Likewise.
>> (avoid_vector_decode): Likewise.
>> (fuse_cmp_and_branch): Enable for cores.
>> * i386.c (x86_accumulate_outgoing_args): Disable for cores.
>>
>> Index: x86-tune.def
>> ===
>> *** x86-tune.def (revision 202812)
>> --- x86-tune.def (working copy)
>> *** DEF_TUNE (X86_TUNE_MOVX, "movx",
>> *** 52,58
>>      and can happen in caller/callee saving sequences.  */
>> DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
>> DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
>> !          m_CORE_ALL | m_GENERIC)
>> /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>    on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
>> --- 52,58
>>      and can happen in caller/callee saving sequences.  */
>> DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
>> DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
>> !          m_CORE2 | m_GENERIC)
>> /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>    on 16-bit immediate moves into memory on Core2 and Corei7.  */
>> DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
>> *** DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INS
>> *** 125,132
>>      maintain just lower part of scalar values in proper format leaving the
>>      upper part undefined.  */
>> DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
>> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
>>      m_AMD_MULTIPLE)
>> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor", m_PPRO |
>>      m_P4_NOCONA)
>> DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
>>          m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE |
>>          m_GENERIC)
>> DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
>> --- 125,134
>>      maintain just lower part of scalar values in proper format leaving the
>>      upper part undefined.  */
>> DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
>> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
>> !          m_AMD_MULTIPLE | m_CORE_ALL)
>> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
>> !          m_PPRO | m_P4_NOCONA | m_CORE_ALL)
>> DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
>>          m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE |
>>          m_GENERIC)
>> DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
>> *** DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSION
>> *** 144,150
>> /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>    than 4 branch instructions in the 16 byte window.  */
>> DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
>> !          m_PPRO | m_P4_NOCO
Re: Revisit Core tuning flags
On Sat, Sep 21, 2013 at 12:54 PM, Jan Hubicka wrote:
> Hi,
> this is an updated version of the patch discussed at
> http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00841.html
>
> It makes CORE tuning follow the optimization guidelines more closely.
> In particular it removes some tuning flags for features I implemented
> years back specifically for K7/K8 chips that ended up in Core tuning
> because it was based on generic. Incrementally I plan to drop some of
> these from generic, too.
>
> Compared to the previous version of the patch I left out the INC_DEC
> change, even though Core i7+ should resolve dependencies on partial
> flags correctly. The optimization manual still seems to suggest not
> using them:
>
> Assembly/Compiler Coding Rule 33. (M impact, H generality)
> INC and DEC instructions should be replaced with ADD or SUB
> instructions, because ADD and SUB overwrite all flags, whereas INC and
> DEC do not, therefore creating false dependencies on earlier
> instructions that set the flags.
>
> The other change dropped is use_vector_fp_converts, which seems to
> improve Core performance.

I did not see this in your patch, but Wei has this tuning in this patch:

http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html

thanks,

David

>
> I benchmarked the patch on SPEC2k, and earlier it was benchmarked on
> SPEC2k6; the performance difference seems to be in the noise. It causes
> about a 0.3% code size reduction. The main motivation for the patch is
> to drop some codegen oddities that do not make sense on modern chips.
>
> Bootstrapped/regtested x86_64-linux, will commit it shortly.
> Honza
>
> * x86-tune.def (partial_reg_stall): Disable for CoreI7 and newer.
> (sse_typeless_stores): Enable for core.
> (sse_load0_by_pxor): Likewise.
> (four_jump_limit): Disable for core.
> (pad_returns): Likewise.
> (avoid_vector_decode): Likewise.
> (fuse_cmp_and_branch): Enable for cores.
> * i386.c (x86_accumulate_outgoing_args): Disable for cores.
>
> Index: x86-tune.def
> ===
> *** x86-tune.def (revision 202812)
> --- x86-tune.def (working copy)
> *** DEF_TUNE (X86_TUNE_MOVX, "movx",
> *** 52,58
>      and can happen in caller/callee saving sequences.  */
> DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
> DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
> !          m_CORE_ALL | m_GENERIC)
> /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>    on 16-bit immediate moves into memory on Core2 and Corei7.  */
> DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
> --- 52,58
>      and can happen in caller/callee saving sequences.  */
> DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
> DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
> !          m_CORE2 | m_GENERIC)
> /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>    on 16-bit immediate moves into memory on Core2 and Corei7.  */
> DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
> *** DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INS
> *** 125,132
>      maintain just lower part of scalar values in proper format leaving the
>      upper part undefined.  */
> DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
>      m_AMD_MULTIPLE)
> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor", m_PPRO |
>      m_P4_NOCONA)
> DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
>          m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE |
>          m_GENERIC)
> DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
> --- 125,134
>      maintain just lower part of scalar values in proper format leaving the
>      upper part undefined.  */
> DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
> !          m_AMD_MULTIPLE | m_CORE_ALL)
> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
> !          m_PPRO | m_P4_NOCONA | m_CORE_ALL)
> DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
>          m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE |
>          m_GENERIC)
> DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
> *** DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSION
> *** 144,150
> /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>    than 4 branch instructions in the 16 byte window.  */
> DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
> !          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE
>            | m_GENERIC)
> DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
>          m_PENT | m_PPRO | m_CORE_ALL | m_ATOM | m_SLM | m_K6_GEODE
> --- 146,152
> /* X86_
