Re: Revisit Core tuning flags

2013-09-22 Thread Wei Mi
>> > http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html
>
> This patch seems reasonable (in fact I have pretty much the same in my tree).
> use_vector_fp_converts is actually trying to solve the same problem on AMD
> hardware - you need to type the whole register when converting.
> So it may work well for AMD chips too, or maybe the difference is that
> Intel chips somehow handle "cvtpd2ps %xmm0, %xmm0" well even though
> the upper half of xmm0 is ill defined, while AMD chips don't.
>
> The patch seems OK. I do not see a reason for the
>   && peep2_reg_dead_p (0, operands[0])
> test.  The register has to be dead since it is the full destination of the operation.

Ok, I see. I will delete it.

>
> Let's wait a few days before committing so we know the effect of
> individual changes.  I will test it on AMD hardware and we can decide on
> generic tuning then.
>
> Honza

Ok, thanks.

Wei Mi.


Re: Revisit Core tuning flags

2013-09-22 Thread Jan Hubicka
> On Sat, Sep 21, 2013 at 3:51 PM, Xinliang David Li  wrote:
> > On Sat, Sep 21, 2013 at 12:54 PM, Jan Hubicka  wrote:
> >> Hi,
> >> this is an updated version of the patch discussed at
> >> http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00841.html
> >>
> >> It makes CORE tuning follow the optimization guidelines more closely.
> >> In particular it removes some tuning flags for features I implemented years
> >> back specifically for K7/K8 chips that ended up in Core tuning because
> >> it was based on generic. Incrementally I plan to drop some of these from
> >> generic, too.
> >>
> >> Compared to the previous version of the patch I left out the INC_DEC change,
> >> even though Core i7+ should resolve dependencies on partial flags correctly.
> >> The optimization manual still seems to suggest not using them:
> >>
> >> Assembly/Compiler Coding Rule 33. (M impact, H generality)
> >> INC and DEC instructions should be replaced with ADD or SUB instructions,
> >> because ADD and SUB overwrite all flags, whereas INC and DEC do not,
> >> therefore creating false dependencies on earlier instructions that set the flags.
> >>
> >> The other change dropped is use_vector_fp_converts, which seems to improve
> >> Core performance.
> >
> > I did not see this in your patch, but Wei has this tuning in this patch:
> >
> 
> Sorry, I meant to ask: why drop this part?

Because I wanted to go with the obvious changes first.
> 
> David
> 
> > http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html

This patch seems reasonable (in fact I have pretty much the same in my tree).
use_vector_fp_converts is actually trying to solve the same problem on AMD
hardware - you need to type the whole register when converting.
So it may work well for AMD chips too, or maybe the difference is that
Intel chips somehow handle "cvtpd2ps %xmm0, %xmm0" well even though
the upper half of xmm0 is ill defined, while AMD chips don't.
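
[Editor's note: the conversion under discussion can be seen in a one-line C function. This is an illustrative sketch, not part of the patch: with tunings that enable use_vector_fp_converts, GCC may narrow a double to float via the packed cvtpd2ps, which defines the whole destination register, rather than the scalar cvtsd2ss, which leaves the upper half untouched and can create a false dependency on the register's previous contents.]

```c
#include <assert.h>

/* Illustrative sketch (not from the patch): a scalar double->float
   truncation.  Depending on -mtune, GCC may emit either the scalar
   cvtsd2ss or, with use_vector_fp_converts, the packed cvtpd2ps,
   which writes the full destination register and thereby avoids a
   partial-register dependency on its old contents.  */
static float narrow(double d)
{
    return (float) d;
}
```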

The patch seems OK. I do not see a reason for the
  && peep2_reg_dead_p (0, operands[0])
test.  The register has to be dead since it is the full destination of the operation.

Let's wait a few days before committing so we know the effect of
individual changes.  I will test it on AMD hardware and we can decide on
generic tuning then.

Honza
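
[Editor's note: the Intel coding rule quoted earlier in the thread contrasts INC/DEC with ADD/SUB. A minimal GNU C sketch, assuming x86-64 and illustrative only (not from the patch), of the two equivalent forms:]

```c
#include <assert.h>

/* Sketch assuming x86-64 GNU C (not from the patch).  Both helpers add 1,
   but "incq" writes only a subset of EFLAGS (CF is left untouched), which
   is what creates the partial-flag dependency the coding rule warns about,
   while "addq $1" overwrites all arithmetic flags.  */
#if defined(__x86_64__) && defined(__GNUC__)
static long bump_inc(long x)
{
    __asm__ ("incq %0" : "+r" (x));      /* partial EFLAGS write: CF preserved */
    return x;
}

static long bump_add(long x)
{
    __asm__ ("addq $1, %0" : "+r" (x));  /* full EFLAGS write */
    return x;
}
#else
/* Portable fallback so the sketch still compiles on other targets.  */
static long bump_inc(long x) { return x + 1; }
static long bump_add(long x) { return x + 1; }
#endif
```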


Re: Revisit Core tuning flags

2013-09-21 Thread Xinliang David Li
On Sat, Sep 21, 2013 at 3:51 PM, Xinliang David Li  wrote:
> On Sat, Sep 21, 2013 at 12:54 PM, Jan Hubicka  wrote:
>> Hi,
>> this is an updated version of the patch discussed at
>> http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00841.html
>>
>> It makes CORE tuning follow the optimization guidelines more closely.
>> In particular it removes some tuning flags for features I implemented years
>> back specifically for K7/K8 chips that ended up in Core tuning because
>> it was based on generic. Incrementally I plan to drop some of these from
>> generic, too.
>>
>> Compared to the previous version of the patch I left out the INC_DEC change,
>> even though Core i7+ should resolve dependencies on partial flags correctly.
>> The optimization manual still seems to suggest not using them:
>>
>> Assembly/Compiler Coding Rule 33. (M impact, H generality)
>> INC and DEC instructions should be replaced with ADD or SUB instructions,
>> because ADD and SUB overwrite all flags, whereas INC and DEC do not,
>> therefore creating false dependencies on earlier instructions that set the flags.
>>
>> The other change dropped is use_vector_fp_converts, which seems to improve
>> Core performance.
>
> I did not see this in your patch, but Wei has this tuning in this patch:
>

Sorry, I meant to ask: why drop this part?

David

> http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html
>
> thanks,
>
> David
>
>
>>
>> I benchmarked the patch on SPEC2k, and earlier it was benchmarked on 2k6,
>> and the performance difference seems to be in the noise.  It causes about 0.3% code
>> size reduction.  The main motivation for the patch is to drop some codegen
>> oddities that do not make sense on modern chips.
>>
>> Bootstrapped/regtested x86_64-linux, will commit it shortly.
>> Honza
>>
>> * x86-tune.def (partial_reg_stall): Disable for CoreI7 and newer.
>> (sse_typeless_stores): Enable for core
>> (sse_load0_by_pxor): Likewise.
>> (four_jump_limit): Disable for core.
>> (pad_returns): Likewise.
>> (avoid_vector_decode): Likewise.
>> (fuse_cmp_and_branch): Enable for cores.
>> * i386.c (x86_accumulate_outgoing_args): Disable for cores.
>> Index: x86-tune.def
>> ===
>> *** x86-tune.def(revision 202812)
>> --- x86-tune.def(working copy)
>> *** DEF_TUNE (X86_TUNE_MOVX, "movx",
>> *** 52,58 
>>  and can happen in caller/callee saving sequences.  */
>>   DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
>>   DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
>> !   m_CORE_ALL | m_GENERIC)
>>   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>* on 16-bit immediate moves into memory on Core2 and Corei7.  */
>>   DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
>> --- 52,58 
>>  and can happen in caller/callee saving sequences.  */
>>   DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
>>   DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
>> !   m_CORE2 | m_GENERIC)
>>   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>>* on 16-bit immediate moves into memory on Core2 and Corei7.  */
>>   DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
>> *** DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INS
>> *** 125,132 
>>  maintain just lower part of scalar values in proper format leaving the
>>  upper part undefined.  */
>>   DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
>> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores", 
>> m_AMD_MULTIPLE)
>> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor", m_PPRO | 
>> m_P4_NOCONA)
>>   DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
>> m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | 
>> m_GENERIC)
>>   DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
>> --- 125,134 
>>  maintain just lower part of scalar values in proper format leaving the
>>  upper part undefined.  */
>>   DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
>> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
>> ! m_AMD_MULTIPLE | m_CORE_ALL)
>> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
>> ! m_PPRO | m_P4_NOCONA | m_CORE_ALL)
>>   DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
>> m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | 
>> m_GENERIC)
>>   DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
>> *** DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSION
>> *** 144,150 
>>   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>>  than 4 branch instructions in the 16 byte window.  */
>>   DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
>> !   m_PPRO | m_P4_NOCO

Re: Revisit Core tuning flags

2013-09-21 Thread Xinliang David Li
On Sat, Sep 21, 2013 at 12:54 PM, Jan Hubicka  wrote:
> Hi,
> this is an updated version of the patch discussed at
> http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00841.html
>
> It makes CORE tuning follow the optimization guidelines more closely.
> In particular it removes some tuning flags for features I implemented years
> back specifically for K7/K8 chips that ended up in Core tuning because
> it was based on generic. Incrementally I plan to drop some of these from
> generic, too.
>
> Compared to the previous version of the patch I left out the INC_DEC change,
> even though Core i7+ should resolve dependencies on partial flags correctly.
> The optimization manual still seems to suggest not using them:
>
> Assembly/Compiler Coding Rule 33. (M impact, H generality)
> INC and DEC instructions should be replaced with ADD or SUB instructions,
> because ADD and SUB overwrite all flags, whereas INC and DEC do not, therefore
> creating false dependencies on earlier instructions that set the flags.
>
> The other change dropped is use_vector_fp_converts, which seems to improve
> Core performance.

I did not see this in your patch, but Wei has this tuning in this patch:

http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html

thanks,

David


>
> I benchmarked the patch on SPEC2k, and earlier it was benchmarked on 2k6,
> and the performance difference seems to be in the noise.  It causes about 0.3% code
> size reduction.  The main motivation for the patch is to drop some codegen
> oddities that do not make sense on modern chips.
>
> Bootstrapped/regtested x86_64-linux, will commit it shortly.
> Honza
>
> * x86-tune.def (partial_reg_stall): Disable for CoreI7 and newer.
> (sse_typeless_stores): Enable for core
> (sse_load0_by_pxor): Likewise.
> (four_jump_limit): Disable for core.
> (pad_returns): Likewise.
> (avoid_vector_decode): Likewise.
> (fuse_cmp_and_branch): Enable for cores.
> * i386.c (x86_accumulate_outgoing_args): Disable for cores.
> Index: x86-tune.def
> ===
> *** x86-tune.def(revision 202812)
> --- x86-tune.def(working copy)
> *** DEF_TUNE (X86_TUNE_MOVX, "movx",
> *** 52,58 
>  and can happen in caller/callee saving sequences.  */
>   DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
>   DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
> !   m_CORE_ALL | m_GENERIC)
>   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>* on 16-bit immediate moves into memory on Core2 and Corei7.  */
>   DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
> --- 52,58 
>  and can happen in caller/callee saving sequences.  */
>   DEF_TUNE (X86_TUNE_PARTIAL_REG_STALL, "partial_reg_stall", m_PPRO)
>   DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
> !   m_CORE2 | m_GENERIC)
>   /* X86_TUNE_LCP_STALL: Avoid an expensive length-changing prefix stall
>* on 16-bit immediate moves into memory on Core2 and Corei7.  */
>   DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall", m_CORE_ALL | m_GENERIC)
> *** DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INS
> *** 125,132 
>  maintain just lower part of scalar values in proper format leaving the
>  upper part undefined.  */
>   DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores", 
> m_AMD_MULTIPLE)
> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor", m_PPRO | 
> m_P4_NOCONA)
>   DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
> m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | 
> m_GENERIC)
>   DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
> --- 125,134 
>  maintain just lower part of scalar values in proper format leaving the
>  upper part undefined.  */
>   DEF_TUNE (X86_TUNE_SSE_SPLIT_REGS, "sse_split_regs", m_ATHLON_K8)
> ! DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
> ! m_AMD_MULTIPLE | m_CORE_ALL)
> ! DEF_TUNE (X86_TUNE_SSE_LOAD0_BY_PXOR, "sse_load0_by_pxor",
> ! m_PPRO | m_P4_NOCONA | m_CORE_ALL)
>   DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
> m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM | m_AMD_MULTIPLE | 
> m_GENERIC)
>   DEF_TUNE (X86_TUNE_PROLOGUE_USING_MOVE, "prologue_using_move",
> *** DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSION
> *** 144,150 
>   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
>  than 4 branch instructions in the 16 byte window.  */
>   DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
> !   m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_ATOM | m_SLM| m_AMD_MULTIPLE
> | m_GENERIC)
>   DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
> m_PENT | m_PPRO | m_CORE_ALL | m_ATOM | m_SLM | m_K6_GEODE
> --- 146,152 
>   /* X86_