Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-26 Thread Richard Biener
On Wed, Oct 26, 2016 at 11:28 AM, Richard Biener
 wrote:
> On Tue, Oct 25, 2016 at 4:31 PM, Martin Liška  wrote:
>> On 10/24/2016 03:51 PM, Richard Biener wrote:
>>
>>> It's quite ad-hoc :/  The IFN will also be a memory optimization
>>> barrier unless you add special support
>>> for it in the alias oracle - so the performance measurement needs to
>>> be taken with a grain of salt
>>> (same is true for all atomics of course... - I have some local patches
>>> to improve things here).
>>
>> Good, then please ping me with the patches you have and I'll integrate them.

This is what I have in my tree (apparently only points-to changes; I
suppose general alias changes would be controversial as the builtins
would lose their "compiler memory barrier" behavior).

Richard.

>>>
>>> The way you implement process_sm_for_coverage_counter is more like a
>>> final value replacement.
>>> You could maybe use create_iv for the loop counter or even wind up
>>> computing the final value
>>> (number of iterations) only after the loop, avoiding the IV completely
>>> (eventually SCEV cprop
>>> saves you here afterwards).
>>
>> Or maybe we can simply pass loop->niter as the argument of the
>> UPDATE_COVERAGE_COUNTER function?
>
> Yes, that's what I said.
>
> Richard.
>
>> Martin
>>
>>>
>>> Richard.
>>
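The final-value replacement discussed above can be sketched in plain C (a hand-written analogue; the counter and function names are hypothetical, since GCC performs this transformation at the GIMPLE level, not in source):

```c
#include <stdint.h>

/* Global arc counter, analogous to what -fprofile-generate emits
   (name hypothetical).  */
uint64_t arc_counter;

/* Naive atomic instrumentation: one atomic RMW per loop iteration.  */
void work_naive (int n)
{
  for (int i = 0; i < n; i++)
    __atomic_fetch_add (&arc_counter, 1, __ATOMIC_RELAXED);
}

/* Final-value replacement: the iteration count (loop->niter) is known
   after the loop, so a single atomic update suffices and no induction
   variable for the counter is needed inside the loop at all.  */
void work_final_value (int n)
{
  if (n > 0)
    __atomic_fetch_add (&arc_counter, (uint64_t) n, __ATOMIC_RELAXED);
}
```

Both functions leave the counter in the same state; the second performs one atomic operation instead of n.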




Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-26 Thread Richard Biener
On Tue, Oct 25, 2016 at 4:31 PM, Martin Liška  wrote:
> On 10/24/2016 03:51 PM, Richard Biener wrote:
>
>> It's quite ad-hoc :/  The IFN will also be a memory optimization
>> barrier unless you add special support
>> for it in the alias oracle - so the performance measurement needs to
>> be taken with a grain of salt
>> (same is true for all atomics of course... - I have some local patches
>> to improve things here).
>
> Good, then please ping me with the patches you have and I'll integrate them.
>
>>
>> The way you implement process_sm_for_coverage_counter is more like a
>> final value replacement.
>> You could maybe use create_iv for the loop counter or even wind up
>> computing the final value
>> (number of iterations) only after the loop, avoiding the IV completely
>> (eventually SCEV cprop
>> saves you here afterwards).
>
> Or maybe we can simply pass loop->niter as the argument of the
> UPDATE_COVERAGE_COUNTER function?

Yes, that's what I said.

Richard.

> Martin
>
>>
>> Richard.
>


Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-25 Thread Martin Liška
On 10/24/2016 03:51 PM, Richard Biener wrote:

> It's quite ad-hoc :/  The IFN will also be a memory optimization
> barrier unless you add special support
> for it in the alias oracle - so the performance measurement needs to
> be taken with a grain of salt
> (same is true for all atomics of course... - I have some local patches
> to improve things here).

Good, then please ping me with the patches you have and I'll integrate them.

> 
> The way you implement process_sm_for_coverage_counter is more like a
> final value replacement.
> You could maybe use create_iv for the loop counter or even wind up
> computing the final value
> (number of iterations) only after the loop, avoiding the IV completely
> (eventually SCEV cprop
> saves you here afterwards).

Or maybe we can simply pass loop->niter as the argument of the
UPDATE_COVERAGE_COUNTER function?

Martin

> 
> Richard.



Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-24 Thread Martin Liška
On 10/17/2016 02:03 PM, Richard Biener wrote:
> On Mon, Oct 17, 2016 at 1:46 PM, Martin Liška  wrote:
>> On 10/13/2016 11:43 AM, Richard Biener wrote:
>>> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška  wrote:
 On 10/04/2016 11:45 AM, Richard Biener wrote:
> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška  wrote:
>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška  wrote:
 On 08/18/2016 06:06 PM, Richard Biener wrote:
> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek 
>  wrote:
>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
 I'd prefer to make updates atomic in multi-threaded applications.
 The best proxy we have for that is -pthread.

 Is it slower, most definitely, but odds are we're giving folks
 garbage data otherwise, which in many ways is even worse.
>>>
>>> It will likely be catastrophically slower in some cases.
>>>
>>> Catastrophically as in too slow to be usable.
>>>
>>> An atomic instruction is a lot more expensive than a single
>> increment. Also
>>> they sometimes are really slow depending on the state of the 
>>> machine.
>>
>> Can't we just have thread-local copies of all the counters (perhaps
>> using
>> __thread pointer as base) and just atomically merge at thread
>> termination?
>
> I suggested that as well but of course it'll have its own class of 
> issues (short lived threads, so we need to somehow re-use counters 
> from terminated threads, large number of threads and thus using too 
> much memory for the counters)
>
> Richard.

 Hello.

 I've put the approach on my TODO list; let's see whether it's doable
 in a reasonable amount of time.

 I've just finished some measurements to illustrate the slow-down of the
 -fprofile-update=atomic approach.
 All numbers are: no profile, -fprofile-generate, -fprofile-generate
 -fprofile-update=atomic
 c-ray benchmark (utilizing 8 threads, -O3): 1.7s, 15.5s, 38.1s
 unrar (utilizing 8 threads, -O3): 3.6s, 11.6s, 38s
 tramp3d (1 thread, -O3): 18.0s, 46.6s, 168s

 So the slow-down is roughly 300% compared to -fprofile-generate. I don't
 have much experience with default option selection, but these numbers
 can probably help.

 Thoughts?
>>>
>>> Look at the generated code for an instrumented simple loop and see that 
>>> for
>>> the non-atomic updates we happily apply store-motion to the counter 
>>> update
>>> and thus we only get one counter update per loop exit rather than one 
>>> per
>>> loop iteration.  Now see what happens for the atomic case (I suspect you
>>> get one per iteration).
>>>
>>> I'll bet this accounts for most of the slowdown.
>>>
>>> Back then ICC, which had atomic counter updates (but using function
>>> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (it also
>>> didn't have early inlining -- removing abstraction helps reduce the
>>> number of counters significantly).
>>>
>>> Richard.
>>
>> Hi.
>>
>> At Cauldron I discussed with Richi approaches to speed up arc
>> profile counter updates. My first attempt is to utilize TLS storage,
>> where every function accumulates arc counters. These are eventually
>> added (using atomic operations) to the global ones at the very end of
>> a function.
>> Currently I rely on target support for TLS; it's questionable whether
>> to have such a requirement for -fprofile-update=atomic, or whether to
>> add a new option value like -fprofile-update=atomic-tls.
>>
>> Running the patch on tramp3d, compared to the previous numbers, it
>> takes 88s to finish; time shrinks to ~50% of the current
>> implementation.
>>
>> Thoughts?
>
> Hmm, I thought I suggested that you can simply use automatic storage
> (which effectively
> is TLS...) for regions that are not forked or abnormally left (which
> means SESE regions
> that have no calls that eventually terminate or throw externally).
>
> So why did you end up with TLS?

 Hi.

 Using TLS does not make sense, stupid mistake ;)

 By using SESE regions, do you mean the infrastructure that is utilized
 by Graphite machinery?
>>>
>>> No, just as "single-entry single-exit region" which means placing of
>>> initializations of the internal counters to zero and the updates of the
>>> actual counters is "obvious".
>>>
>>> Note that this 

Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-17 Thread Martin Liška
On 10/13/2016 11:43 AM, Richard Biener wrote:
> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška  wrote:
>> On 10/04/2016 11:45 AM, Richard Biener wrote:
>>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška  wrote:
 On 09/07/2016 02:09 PM, Richard Biener wrote:
> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška  wrote:
>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek 
>>>  wrote:
 On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>> I'd prefer to make updates atomic in multi-threaded applications.
>> The best proxy we have for that is -pthread.
>>
>> Is it slower, most definitely, but odds are we're giving folks
>> garbage data otherwise, which in many ways is even worse.
>
> It will likely be catastrophically slower in some cases.
>
> Catastrophically as in too slow to be usable.
>
> An atomic instruction is a lot more expensive than a single
 increment. Also
> they sometimes are really slow depending on the state of the machine.

 Can't we just have thread-local copies of all the counters (perhaps
 using
 __thread pointer as base) and just atomically merge at thread
 termination?
>>>
>>> I suggested that as well but of course it'll have its own class of 
>>> issues (short lived threads, so we need to somehow re-use counters from 
>>> terminated threads, large number of threads and thus using too much 
>>> memory for the counters)
>>>
>>> Richard.
>>
>> Hello.
>>
>> I've put the approach on my TODO list; let's see whether it's doable
>> in a reasonable amount of time.
>>
>> I've just finished some measurements to illustrate the slow-down of the
>> -fprofile-update=atomic approach.
>> All numbers are: no profile, -fprofile-generate, -fprofile-generate
>> -fprofile-update=atomic
>> c-ray benchmark (utilizing 8 threads, -O3): 1.7s, 15.5s, 38.1s
>> unrar (utilizing 8 threads, -O3): 3.6s, 11.6s, 38s
>> tramp3d (1 thread, -O3): 18.0s, 46.6s, 168s
>>
>> So the slow-down is roughly 300% compared to -fprofile-generate. I don't
>> have much experience with default option selection, but these numbers
>> can probably help.
>>
>> Thoughts?
>
> Look at the generated code for an instrumented simple loop and see that 
> for
> the non-atomic updates we happily apply store-motion to the counter update
> and thus we only get one counter update per loop exit rather than one per
> loop iteration.  Now see what happens for the atomic case (I suspect you
> get one per iteration).
>
> I'll bet this accounts for most of the slowdown.
>
> Back then ICC, which had atomic counter updates (but using function
> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (it also
> didn't have early inlining -- removing abstraction helps reduce the
> number of counters significantly).
>
> Richard.

 Hi.

 At Cauldron I discussed with Richi approaches to speed up arc
 profile counter updates. My first attempt is to utilize TLS storage,
 where every function accumulates arc counters. These are eventually
 added (using atomic operations) to the global ones at the very end of
 a function.
 Currently I rely on target support for TLS; it's questionable whether
 to have such a requirement for -fprofile-update=atomic, or whether to
 add a new option value like -fprofile-update=atomic-tls.

 Running the patch on tramp3d, compared to the previous numbers, it
 takes 88s to finish; time shrinks to ~50% of the current
 implementation.

 Thoughts?
>>>
>>> Hmm, I thought I suggested that you can simply use automatic storage
>>> (which effectively
>>> is TLS...) for regions that are not forked or abnormally left (which
>>> means SESE regions
>>> that have no calls that eventually terminate or throw externally).
>>>
>>> So why did you end up with TLS?
>>
>> Hi.
>>
>> Using TLS does not make sense, stupid mistake ;)
>>
>> By using SESE regions, do you mean the infrastructure that is utilized
>> by Graphite machinery?
> 
> No, just as "single-entry single-exit region" which means placing of
> initializations of the internal counters to zero and the updates of the
> actual counters is "obvious".
> 
> Note that this "optimization" isn't one if the SESE region does not contain
> cycle(s).  Unless there is a way to do an atomic update of a bunch of
> counters faster than doing them separately.  This optimization will also
> increase register pressure (or force the internal counters to the stack).
> Thus selecting which counters to "optimize" and which ones to leave in place
> might be necessary.

Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-13 Thread Richard Biener
On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška  wrote:
> On 10/04/2016 11:45 AM, Richard Biener wrote:
>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška  wrote:
>>> On 09/07/2016 02:09 PM, Richard Biener wrote:
 On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška  wrote:
> On 08/18/2016 06:06 PM, Richard Biener wrote:
>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek 
>>  wrote:
>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
> I'd prefer to make updates atomic in multi-threaded applications.
> The best proxy we have for that is -pthread.
>
> Is it slower, most definitely, but odds are we're giving folks
> garbage data otherwise, which in many ways is even worse.

 It will likely be catastrophically slower in some cases.

 Catastrophically as in too slow to be usable.

 An atomic instruction is a lot more expensive than a single
>>> increment. Also
 they sometimes are really slow depending on the state of the machine.
>>>
>>> Can't we just have thread-local copies of all the counters (perhaps
>>> using
>>> __thread pointer as base) and just atomically merge at thread
>>> termination?
>>
>> I suggested that as well but of course it'll have its own class of 
>> issues (short lived threads, so we need to somehow re-use counters from 
>> terminated threads, large number of threads and thus using too much 
>> memory for the counters)
>>
>> Richard.
>
> Hello.
>
> I've put the approach on my TODO list; let's see whether it's doable
> in a reasonable amount of time.
>
> I've just finished some measurements to illustrate the slow-down of the
> -fprofile-update=atomic approach.
> All numbers are: no profile, -fprofile-generate, -fprofile-generate
> -fprofile-update=atomic
> c-ray benchmark (utilizing 8 threads, -O3): 1.7s, 15.5s, 38.1s
> unrar (utilizing 8 threads, -O3): 3.6s, 11.6s, 38s
> tramp3d (1 thread, -O3): 18.0s, 46.6s, 168s
>
> So the slow-down is roughly 300% compared to -fprofile-generate. I don't
> have much experience with default option selection, but these numbers
> can probably help.
>
> Thoughts?

 Look at the generated code for an instrumented simple loop and see that for
 the non-atomic updates we happily apply store-motion to the counter update
 and thus we only get one counter update per loop exit rather than one per
 loop iteration.  Now see what happens for the atomic case (I suspect you
 get one per iteration).

 I'll bet this accounts for most of the slowdown.

 Back then ICC, which had atomic counter updates (but using function
 calls - ugh!), had a > 1000% overhead with FDO for tramp3d (it also
 didn't have early inlining -- removing abstraction helps reduce the
 number of counters significantly).

 Richard.
>>>
>>> Hi.
>>>
>>> At Cauldron I discussed with Richi approaches to speed up arc
>>> profile counter updates. My first attempt is to utilize TLS storage,
>>> where every function accumulates arc counters. These are eventually
>>> added (using atomic operations) to the global ones at the very end of
>>> a function.
>>> Currently I rely on target support for TLS; it's questionable whether
>>> to have such a requirement for -fprofile-update=atomic, or whether to
>>> add a new option value like -fprofile-update=atomic-tls.
>>>
>>> Running the patch on tramp3d, compared to the previous numbers, it
>>> takes 88s to finish; time shrinks to ~50% of the current
>>> implementation.
>>>
>>> Thoughts?
>>
>> Hmm, I thought I suggested that you can simply use automatic storage
>> (which effectively
>> is TLS...) for regions that are not forked or abnormally left (which
>> means SESE regions
>> that have no calls that eventually terminate or throw externally).
>>
>> So why did you end up with TLS?
>
> Hi.
>
> Using TLS does not make sense, stupid mistake ;)
>
> By using SESE regions, do you mean the infrastructure that is utilized
> by Graphite machinery?

No, just as "single-entry single-exit region" which means placing of
initializations of the internal counters to zero and the updates of the
actual counters is "obvious".

Note that this "optimization" isn't one if the SESE region does not contain
cycle(s).  Unless there is a way to do an atomic update of a bunch of
counters faster than doing them separately.  This optimization will also
increase register pressure (or force the internal counters to the stack).
Thus selecting which counters to "optimize" and which ones to leave in place
might be necessary.

Richard.
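Richard's caveat about cycle-free regions can be illustrated directly (a hypothetical sketch, not GCC output): for a straight-line SESE region the rewrite still costs one atomic RMW per execution, so the local accumulator is pure overhead, and the transformation only pays off when the region contains a cycle.

```c
#include <stdint.h>

uint64_t arc_counter;  /* shared profile counter (name hypothetical) */

/* Direct form: one atomic RMW per execution of the region.  */
void region_direct (void)
{
  __atomic_fetch_add (&arc_counter, 1, __ATOMIC_RELAXED);
}

/* Rewritten form: still one atomic RMW per execution, plus the extra
   local accumulator (more instructions, more register pressure) --
   i.e. for a region without a cycle this is not an optimization.  */
void region_rewritten (void)
{
  uint64_t local = 0;
  local++;
  __atomic_fetch_add (&arc_counter, local, __ATOMIC_RELAXED);
}
```

This is why selecting which counters to rewrite, rather than rewriting all of them, may be necessary.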

> Thanks,
> Martin
>
>>
>> Richard.
>>
>>> Martin
>>>

> Martin
>
>>
>>>  Jakub
>>
>>
>
>>>
>


Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-12 Thread Martin Liška
On 10/04/2016 11:45 AM, Richard Biener wrote:
> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška  wrote:
>> On 09/07/2016 02:09 PM, Richard Biener wrote:
>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška  wrote:
 On 08/18/2016 06:06 PM, Richard Biener wrote:
> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek  
> wrote:
>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
 I'd prefer to make updates atomic in multi-threaded applications.
 The best proxy we have for that is -pthread.

 Is it slower, most definitely, but odds are we're giving folks
 garbage data otherwise, which in many ways is even worse.
>>>
>>> It will likely be catastrophically slower in some cases.
>>>
>>> Catastrophically as in too slow to be usable.
>>>
>>> An atomic instruction is a lot more expensive than a single
>> increment. Also
>>> they sometimes are really slow depending on the state of the machine.
>>
>> Can't we just have thread-local copies of all the counters (perhaps
>> using
>> __thread pointer as base) and just atomically merge at thread
>> termination?
>
> I suggested that as well but of course it'll have its own class of issues 
> (short lived threads, so we need to somehow re-use counters from 
> terminated threads, large number of threads and thus using too much 
> memory for the counters)
>
> Richard.

 Hello.

 I've put the approach on my TODO list; let's see whether it's doable
 in a reasonable amount of time.

 I've just finished some measurements to illustrate the slow-down of the
 -fprofile-update=atomic approach.
 All numbers are: no profile, -fprofile-generate, -fprofile-generate
 -fprofile-update=atomic
 c-ray benchmark (utilizing 8 threads, -O3): 1.7s, 15.5s, 38.1s
 unrar (utilizing 8 threads, -O3): 3.6s, 11.6s, 38s
 tramp3d (1 thread, -O3): 18.0s, 46.6s, 168s

 So the slow-down is roughly 300% compared to -fprofile-generate. I don't
 have much experience with default option selection, but these numbers
 can probably help.

 Thoughts?
>>>
>>> Look at the generated code for an instrumented simple loop and see that for
>>> the non-atomic updates we happily apply store-motion to the counter update
>>> and thus we only get one counter update per loop exit rather than one per
>>> loop iteration.  Now see what happens for the atomic case (I suspect you
>>> get one per iteration).
>>>
>>> I'll bet this accounts for most of the slowdown.
>>>
>>> Back then ICC, which had atomic counter updates (but using function
>>> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (it also
>>> didn't have early inlining -- removing abstraction helps reduce the
>>> number of counters significantly).
>>>
>>> Richard.
>>
>> Hi.
>>
>> At Cauldron I discussed with Richi approaches to speed up arc
>> profile counter updates. My first attempt is to utilize TLS storage,
>> where every function accumulates arc counters. These are eventually
>> added (using atomic operations) to the global ones at the very end of
>> a function.
>> Currently I rely on target support for TLS; it's questionable whether
>> to have such a requirement for -fprofile-update=atomic, or whether to
>> add a new option value like -fprofile-update=atomic-tls.
>>
>> Running the patch on tramp3d, compared to the previous numbers, it
>> takes 88s to finish; time shrinks to ~50% of the current
>> implementation.
>>
>> Thoughts?
> 
> Hmm, I thought I suggested that you can simply use automatic storage
> (which effectively
> is TLS...) for regions that are not forked or abnormally left (which
> means SESE regions
> that have no calls that eventually terminate or throw externally).
> 
> So why did you end up with TLS?

Hi.

Using TLS does not make sense, stupid mistake ;)

By using SESE regions, do you mean the infrastructure that is utilized
by Graphite machinery?

Thanks,
Martin

> 
> Richard.
> 
>> Martin
>>
>>>
 Martin

>
>>  Jakub
>
>

>>



Re: [RFC] Speed-up -fprofile-update=atomic

2016-10-04 Thread Richard Biener
On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška  wrote:
> On 09/07/2016 02:09 PM, Richard Biener wrote:
>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška  wrote:
>>> On 08/18/2016 06:06 PM, Richard Biener wrote:
 On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek  
 wrote:
> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>> I'd prefer to make updates atomic in multi-threaded applications.
>>> The best proxy we have for that is -pthread.
>>>
>>> Is it slower, most definitely, but odds are we're giving folks
>>> garbage data otherwise, which in many ways is even worse.
>>
>> It will likely be catastrophically slower in some cases.
>>
>> Catastrophically as in too slow to be usable.
>>
>> An atomic instruction is a lot more expensive than a single
> increment. Also
>> they sometimes are really slow depending on the state of the machine.
>
> Can't we just have thread-local copies of all the counters (perhaps
> using
> __thread pointer as base) and just atomically merge at thread
> termination?

 I suggested that as well but of course it'll have its own class of issues 
 (short lived threads, so we need to somehow re-use counters from 
 terminated threads, large number of threads and thus using too much memory 
 for the counters)

 Richard.
>>>
>>> Hello.
>>>
>>> I've put the approach on my TODO list; let's see whether it's doable
>>> in a reasonable amount of time.
>>>
>>> I've just finished some measurements to illustrate the slow-down of the
>>> -fprofile-update=atomic approach.
>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate
>>> -fprofile-update=atomic
>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7s, 15.5s, 38.1s
>>> unrar (utilizing 8 threads, -O3): 3.6s, 11.6s, 38s
>>> tramp3d (1 thread, -O3): 18.0s, 46.6s, 168s
>>>
>>> So the slow-down is roughly 300% compared to -fprofile-generate. I don't
>>> have much experience with default option selection, but these numbers
>>> can probably help.
>>>
>>> Thoughts?
>>
>> Look at the generated code for an instrumented simple loop and see that for
>> the non-atomic updates we happily apply store-motion to the counter update
>> and thus we only get one counter update per loop exit rather than one per
>> loop iteration.  Now see what happens for the atomic case (I suspect you
>> get one per iteration).
>>
>> I'll bet this accounts for most of the slowdown.
>>
>> Back then ICC, which had atomic counter updates (but using function
>> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (it also
>> didn't have early inlining -- removing abstraction helps reduce the
>> number of counters significantly).
>>
>> Richard.
>
> Hi.
>
> At Cauldron I discussed with Richi approaches to speed up arc
> profile counter updates. My first attempt is to utilize TLS storage,
> where every function accumulates arc counters. These are eventually
> added (using atomic operations) to the global ones at the very end of
> a function.
> Currently I rely on target support for TLS; it's questionable whether
> to have such a requirement for -fprofile-update=atomic, or whether to
> add a new option value like -fprofile-update=atomic-tls.
>
> Running the patch on tramp3d, compared to the previous numbers, it
> takes 88s to finish; time shrinks to ~50% of the current
> implementation.
>
> Thoughts?

Hmm, I thought I suggested that you can simply use automatic storage
(which effectively
is TLS...) for regions that are not forked or abnormally left (which
means SESE regions
that have no calls that eventually terminate or throw externally).

So why did you end up with TLS?

Richard.
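The automatic-storage variant Richard suggests can be sketched as follows (a hypothetical hand-written analogue; GCC would emit this at GIMPLE level): for a single-entry single-exit region that cannot be forked or left abnormally, a stack variable plays the role of the thread-private accumulator, with one atomic flush at the region's exit.

```c
#include <stdint.h>

uint64_t arc_counter;  /* shared profile counter (name hypothetical) */

/* For a SESE region with no calls that may throw or terminate, an
   automatic variable is effectively thread-local: it is zeroed on
   region entry and flushed atomically once at the region exit.  */
void sese_region (int n)
{
  uint64_t local = 0;             /* automatic storage on the stack */
  for (int i = 0; i < n; i++)
    local++;                      /* cheap non-atomic update */
  __atomic_fetch_add (&arc_counter, local, __ATOMIC_RELAXED);
}
```

Unlike TLS, this needs no target support and no per-thread bookkeeping; the cost is extra register pressure for the local accumulators.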

> Martin
>
>>
>>> Martin
>>>

>  Jakub


>>>
>


[RFC] Speed-up -fprofile-update=atomic

2016-09-15 Thread Martin Liška
On 09/07/2016 02:09 PM, Richard Biener wrote:
> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška  wrote:
>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek  
>>> wrote:
 On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>> I'd prefer to make updates atomic in multi-threaded applications.
>> The best proxy we have for that is -pthread.
>>
>> Is it slower, most definitely, but odds are we're giving folks
>> garbage data otherwise, which in many ways is even worse.
>
> It will likely be catastrophically slower in some cases.
>
> Catastrophically as in too slow to be usable.
>
> An atomic instruction is a lot more expensive than a single
 increment. Also
> they sometimes are really slow depending on the state of the machine.

 Can't we just have thread-local copies of all the counters (perhaps
 using
 __thread pointer as base) and just atomically merge at thread
 termination?
>>>
>>> I suggested that as well but of course it'll have its own class of issues 
>>> (short lived threads, so we need to somehow re-use counters from terminated 
>>> threads, large number of threads and thus using too much memory for the 
>>> counters)
>>>
>>> Richard.
>>
>> Hello.
>>
>> I've put the approach on my TODO list; let's see whether it's doable
>> in a reasonable amount of time.
>>
>> I've just finished some measurements to illustrate the slow-down of the
>> -fprofile-update=atomic approach.
>> All numbers are: no profile, -fprofile-generate, -fprofile-generate
>> -fprofile-update=atomic
>> c-ray benchmark (utilizing 8 threads, -O3): 1.7s, 15.5s, 38.1s
>> unrar (utilizing 8 threads, -O3): 3.6s, 11.6s, 38s
>> tramp3d (1 thread, -O3): 18.0s, 46.6s, 168s
>>
>> So the slow-down is roughly 300% compared to -fprofile-generate. I don't
>> have much experience with default option selection, but these numbers
>> can probably help.
>>
>> Thoughts?
> 
> Look at the generated code for an instrumented simple loop and see that for
> the non-atomic updates we happily apply store-motion to the counter update
> and thus we only get one counter update per loop exit rather than one per
> loop iteration.  Now see what happens for the atomic case (I suspect you
> get one per iteration).
> 
> I'll bet this accounts for most of the slowdown.
> 
> Back then ICC, which had atomic counter updates (but using function
> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (it also
> didn't have early inlining -- removing abstraction helps reduce the
> number of counters significantly).
> 
> Richard.
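The store-motion effect Richard describes above can be reproduced by hand; this sketch shows what loop invariant motion effectively does to a non-atomic counter update (a simplified illustration, since GCC performs the transformation on GIMPLE, not source):

```c
#include <stdint.h>

uint64_t counter;  /* plain (non-atomic) profile counter */

/* Before store motion: a load/add/store of the counter on every
   iteration, which is what the instrumentation initially emits.  */
void loop_before_sm (int n)
{
  for (int i = 0; i < n; i++)
    counter = counter + 1;
}

/* After store motion: the counter lives in a register inside the loop
   and is stored once at the loop exit.  An __atomic_fetch_add in the
   body cannot legally be transformed this way, hence one atomic RMW
   per iteration and the large slowdown.  */
void loop_after_sm (int n)
{
  uint64_t tmp = counter;
  for (int i = 0; i < n; i++)
    tmp++;
  counter = tmp;
}
```

Both forms compute the same final counter value; only the number of memory operations inside the loop differs.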

Hi.

At Cauldron I discussed with Richi approaches to speed up arc
profile counter updates. My first attempt is to utilize TLS storage,
where every function accumulates arc counters. These are eventually
added (using atomic operations) to the global ones at the very end of
a function.
Currently I rely on target support for TLS; it's questionable whether
to have such a requirement for -fprofile-update=atomic, or whether to
add a new option value like -fprofile-update=atomic-tls.

Running the patch on tramp3d, compared to the previous numbers, it
takes 88s to finish; time shrinks to ~50% of the current
implementation.

Thoughts?
Martin
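A minimal sketch of the TLS scheme described above (hypothetical names; the actual patch emits one such accumulator per arc counter at GIMPLE level): each function bumps a cheap thread-local accumulator and merges it atomically into the shared counter once, at function exit.

```c
#include <stdint.h>

/* Shared arc counter updated by all threads (name hypothetical).  */
uint64_t global_arc_counter;

/* Thread-local accumulator -- this is what requires target TLS
   support, hence the -fprofile-update=atomic-tls question.  */
static __thread uint64_t local_arc_counter;

void instrumented_function (int n)
{
  for (int i = 0; i < n; i++)
    local_arc_counter++;          /* thread-private, no atomics */

  /* One atomic merge at the very end of the function.  */
  __atomic_fetch_add (&global_arc_counter, local_arc_counter,
                      __ATOMIC_RELAXED);
  local_arc_counter = 0;
}
```

The per-iteration cost drops to a plain increment; the atomic cost is paid once per function invocation instead of once per arc execution.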

> 
>> Martin
>>
>>>
  Jakub
>>>
>>>
>>

From 91b5342c422950b32d1ba7d616bda418c7993a84 Mon Sep 17 00:00:00 2001
From: marxin 
Date: Thu, 15 Sep 2016 09:49:41 +0200
Subject: [PATCH] Improve implementation of -fprofile-update=atomic

---
 gcc/coverage.c | 99 ++
 gcc/coverage.h |  6 ++--
 gcc/profile.c  |  2 ++
 gcc/tree-profile.c | 45 +
 4 files changed, 114 insertions(+), 38 deletions(-)

diff --git a/gcc/coverage.c b/gcc/coverage.c
index a6a888a..1e0052f 100644
--- a/gcc/coverage.c
+++ b/gcc/coverage.c
@@ -48,6 +48,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "intl.h"
 #include "params.h"
 #include "auto-profile.h"
+#include "varasm.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "tree-vrp.h"
+#include "tree-ssanames.h"
 
 #include "gcov-io.c"
 
@@ -93,6 +98,9 @@ static GTY(()) tree fn_v_ctrs[GCOV_COUNTERS];   /* counter variables.  */
 static unsigned fn_n_ctrs[GCOV_COUNTERS]; /* Counters allocated.  */
 static unsigned fn_b_ctrs[GCOV_COUNTERS]; /* Allocation base.  */
 
+/* Thread-local storage variable of ARCS counters.  */
+static GTY(()) tree arcs_tls_ctr;
+
 /* Coverage info VAR_DECL and function info type nodes.  */
 static GTY(()) tree gcov_info_var;
 static GTY(()) tree gcov_fn_info_type;
@@ -127,7 +135,8 @@ static const char *const ctr_names[GCOV_COUNTERS] = {
 
 /* Forward declarations.  */
 static void read_counts_file (void);
-static tree build_var (tree, tree, int);
+static tree build_var (tree fn_decl, tree type, int counter,
+		   bool is_tls = false);