*Ping^1* [PATCH v3] ipa-cp: Fix PGO regression caused by r278808

2020-03-03 Thread luoxhu
Hi Honza,

The performance regression still exists. For exchange2, the performance
is about 28% slower for option:
"-fprofile-generate/-fprofile-use --param ipa-cp-eval-threshold=0 --param 
ipa-cp-unit-growth=80 -fno-inline".

r278808:
commit ad06966f6677d55c11214d9c7b6d5518f915e341
Author: hubicka 
Date:   Thu Nov 28 14:16:29 2019 +

* ipa-cp.c (update_profiling_info): Fix scaling.


Fix v3 patch and logs are here:
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00764.html


Thanks
Xionghu

On 2020/1/14 14:45, luoxhu wrote:
> Hi,
> 
> On 2020/1/3 00:58, Jan Hubicka wrote:
>>> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
>>> index 14064ae0034..947bf7c7199 100644
>>> --- a/gcc/ipa-cp.c
>>> +++ b/gcc/ipa-cp.c
>>> @@ -4272,6 +4272,31 @@ update_profiling_info (struct cgraph_node 
>>> *orig_node,
>>>     false);
>>>     new_sum = stats.count_sum;
>>> +  class ipa_node_params *info = IPA_NODE_REF (orig_node);
>>> +  if (info && info->node_is_self_scc)
>>> +    {
>>> +  profile_count self_recursive_count;
>>> +
>>> +  /* The self recursive edge is on the orig_node.  */
>>> +  for (cs = orig_node->callees; cs; cs = cs->next_callee)
>>> +    if (ipa_edge_within_scc (cs))
>>> +  {
>>> +    cgraph_node *callee = cs->callee->function_symbol ();
>>> +    if (callee && cs->caller == cs->callee)
>>> +  {
>>> +    self_recursive_count = cs->count;
>>> +    break;
>>> +  }
>>> +  }
>>
>> What happens here when there are multiple self recursive calls or when
>> the SCC has two mutually recursive functions?
>>
>> I am still confused by the logic of this function.  I will take a deeper
>> look at your previous email.
>>> +
>>> +  /* Proportion count for self recursive node from all callers.  */
>>> +  new_sum
>>> +    = (orig_sum + new_sum).apply_scale (self_recursive_count, orig_sum);
>>> +
>>> +  /* Proportion count for specialized cloned node.  */
>>> +  new_sum = new_sum.apply_scale (1, param_ipa_cp_max_recursive_depth);
>>> +    }
>>> +
>>>     if (orig_node_count < orig_sum + new_sum)
>>>   {
>>>     if (dump_file)
>>> diff --git a/gcc/params.opt b/gcc/params.opt
>>> index d88ae0c468b..40a03b04580 100644
>>> --- a/gcc/params.opt
>>> +++ b/gcc/params.opt
>>> @@ -199,7 +199,7 @@ Common Joined UInteger 
>>> Var(param_ipa_cp_loop_hint_bonus) Init(64) Param
>>>   Compile-time bonus IPA-CP assigns to candidates which make loop bounds 
>>> or strides known.
>>>   -param=ipa-cp-max-recursive-depth=
>>> -Common Joined UInteger Var(param_ipa_cp_max_recursive_depth) Init(8) Param
>>> +Common Joined UInteger Var(param_ipa_cp_max_recursive_depth) Init(8) 
>>> IntegerRange(1, 10) Param
>>>   Maximum depth of recursive cloning for self-recursive function.
>>
>> The values stats from 0 but I also wonder why 10 is a good upper bound?
>> If the function calls itself with one type of value (like depth-1) then
>> we may clone well over 10 times, if it calls itself with two different
>> sets then 8 is already quite high I would say...
>>
>> I suppose the call probabilities will eventually drop to be very low,
>> but I am not quite sure about that because we do not handle any IP
>> frequency propagation.  Do we have some way to treat wide trees? Or do
>> we clone only all self recursive calls are the same?
>>
>> Honza
> 
> Update v3 patch.  This regression could only be reproduced when built with
> "-fprofile-generate/-fprofile-use --param ipa-cp-eval-threshold=0 --param
> ipa-cp-unit-growth=80" on exchange_2, on some platforms -fno-inline may be
> also needed, I attached 3 files (compressed to exchange.tar.gz)
> exchange2_gcc.use.orig.wpa.071i.cp.old, exchange2_gcc.use.orig.wpa.071i.cp.new
> and cp.diff to show the profile count difference of specialized node
> digits_2.constprop/152 to digits_2.constprop/159 without/with this patch.
> 
> Profile count is decreasing slowly with this patch instead of keeping very
> small from the first clone (very low count as cold will cause complete unroll
> not working), it still differs from really execution (exchange2.png), but this
> average model takes the recursive edge as feedback.  Thanks.
> 
> 
> v3:
> 1. Enable proportion orig_sum to the new nodes for self recursive node (k 
> means
>    multiple self recursive calls):
>    new_sum = (orig_sum + new_sum) * 1 / k \
>    * self_recursive_probability * (1 / param_ipa_cp_max_recursive_depth).
> 2. Two mutually recursive functions are not supported in self recursive
>    clone yet so also not supported in update_profiling_info here.
> 3. Improve value range for param_ipa_cp_max_recursive_depth to (0, 8).
>    If it calls itself two different sets, usually recursive boudary limit
>    will stop the specialize first, otherwise it is slow even without
>    recursive specialize.
> 
> The performance of exchange2 built with PGO will decrease ~28% by r278808
> due to profile count set incorrectly.  The cloned nodes are updated to a
> very small 

Ping^1: [PATCH v3] ipa-cp: Fix PGO regression caused by r278808

2020-02-09 Thread luoxhu
Ping,

attachment of 
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00764/exchange2.tar.gz
shows the profile count difference on cloned nodes digits_2.constprop.[0...8]
without/with this patch.  Thanks!

Xiong Hu


On 2020/1/14 14:45, luoxhu wrote:
> Hi,
> 
> On 2020/1/3 00:58, Jan Hubicka wrote:
>>> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
>>> index 14064ae0034..947bf7c7199 100644
>>> --- a/gcc/ipa-cp.c
>>> +++ b/gcc/ipa-cp.c
>>> @@ -4272,6 +4272,31 @@ update_profiling_info (struct cgraph_node 
>>> *orig_node,
>>>     false);
>>>     new_sum = stats.count_sum;
>>> +  class ipa_node_params *info = IPA_NODE_REF (orig_node);
>>> +  if (info && info->node_is_self_scc)
>>> +    {
>>> +  profile_count self_recursive_count;
>>> +
>>> +  /* The self recursive edge is on the orig_node.  */
>>> +  for (cs = orig_node->callees; cs; cs = cs->next_callee)
>>> +    if (ipa_edge_within_scc (cs))
>>> +  {
>>> +    cgraph_node *callee = cs->callee->function_symbol ();
>>> +    if (callee && cs->caller == cs->callee)
>>> +  {
>>> +    self_recursive_count = cs->count;
>>> +    break;
>>> +  }
>>> +  }
>>
>> What happens here when there are multiple self recursive calls or when
>> the SCC has two mutually recursive functions?
>>
>> I am still confused by the logic of this function.  I will take a deeper
>> look at your previous email.
>>> +
>>> +  /* Proportion count for self recursive node from all callers.  */
>>> +  new_sum
>>> +    = (orig_sum + new_sum).apply_scale (self_recursive_count, orig_sum);
>>> +
>>> +  /* Proportion count for specialized cloned node.  */
>>> +  new_sum = new_sum.apply_scale (1, param_ipa_cp_max_recursive_depth);
>>> +    }
>>> +
>>>     if (orig_node_count < orig_sum + new_sum)
>>>   {
>>>     if (dump_file)
>>> diff --git a/gcc/params.opt b/gcc/params.opt
>>> index d88ae0c468b..40a03b04580 100644
>>> --- a/gcc/params.opt
>>> +++ b/gcc/params.opt
>>> @@ -199,7 +199,7 @@ Common Joined UInteger 
>>> Var(param_ipa_cp_loop_hint_bonus) Init(64) Param
>>>   Compile-time bonus IPA-CP assigns to candidates which make loop bounds 
>>> or strides known.
>>>   -param=ipa-cp-max-recursive-depth=
>>> -Common Joined UInteger Var(param_ipa_cp_max_recursive_depth) Init(8) Param
>>> +Common Joined UInteger Var(param_ipa_cp_max_recursive_depth) Init(8) 
>>> IntegerRange(1, 10) Param
>>>   Maximum depth of recursive cloning for self-recursive function.
>>
>> The values stats from 0 but I also wonder why 10 is a good upper bound?
>> If the function calls itself with one type of value (like depth-1) then
>> we may clone well over 10 times, if it calls itself with two different
>> sets then 8 is already quite high I would say...
>>
>> I suppose the call probabilities will eventually drop to be very low,
>> but I am not quite sure about that because we do not handle any IP
>> frequency propagation.  Do we have some way to treat wide trees? Or do
>> we clone only all self recursive calls are the same?
>>
>> Honza
> 
> Update v3 patch.  This regression could only be reproduced when built with
> "-fprofile-generate/-fprofile-use --param ipa-cp-eval-threshold=0 --param
> ipa-cp-unit-growth=80" on exchange_2, on some platforms -fno-inline may be
> also needed, I attached 3 files (compressed to exchange.tar.gz)
> exchange2_gcc.use.orig.wpa.071i.cp.old, exchange2_gcc.use.orig.wpa.071i.cp.new
> and cp.diff to show the profile count difference of specialized node
> digits_2.constprop/152 to digits_2.constprop/159 without/with this patch.
> 
> Profile count is decreasing slowly with this patch instead of keeping very
> small from the first clone (very low count as cold will cause complete unroll
> not working), it still differs from really execution (exchange2.png), but this
> average model takes the recursive edge as feedback.  Thanks.
> 
> 
> v3:
> 1. Enable proportion orig_sum to the new nodes for self recursive node (k 
> means
>    multiple self recursive calls):
>    new_sum = (orig_sum + new_sum) * 1 / k \
>    * self_recursive_probability * (1 / param_ipa_cp_max_recursive_depth).
> 2. Two mutually recursive functions are not supported in self recursive
>    clone yet so also not supported in update_profiling_info here.
> 3. Improve value range for param_ipa_cp_max_recursive_depth to (0, 8).
>    If it calls itself two different sets, usually recursive boudary limit
>    will stop the specialize first, otherwise it is slow even without
>    recursive specialize.
> 
> The performance of exchange2 built with PGO will decrease ~28% by r278808
> due to profile count set incorrectly.  The cloned nodes are updated to a
> very small count caused later pass cunroll fail to unroll the recursive
> function in exchange2.
> 
> digits_2 ->
> digits_2.constprop.0, digits_2.constprop.1, etc.
> 
> gcc/ChangeLog:
> 
>  2020-01-14  Luo Xiong Hu  
> 
>  * ipa-cp.c (update_profiling_info): Check self_scc node.
>  * params.opt