Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-13 Thread Jiri Olsa
On Tue, Sep 11, 2018 at 04:42:09PM +0300, Alexey Budankov wrote:
> Hi,
> 
> On 11.09.2018 11:34, Jiri Olsa wrote:
> > On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote:
> >>
> >> Hi Ingo,
> >>
> >> On 11.09.2018 9:35, Ingo Molnar wrote:
> >>>
> >>> * Alexey Budankov  wrote:
> >>>
>  It may sound too optimistic but glibc API is expected to be backward 
>  compatible 
>  and for POSIX AIO API part too. Internal implementation also tends to 
>  evolve to 
>  better option overtime, more probably basing on modern kernel 
>  capabilities 
>  mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
> >>>
> >>> I'm not talking about compatibility, and I'm not just talking about 
> >>> glibc, perf works under 
> >>> other libcs as well - and let me phrase it in another way: basic event 
> >>> handling, threading, 
> >>> scheduling internals should be a *core competency* of a tracing/profiling 
> >>> tool.
> >>
> >> Well, the requirement of independence from some specific libc 
> >> implementation 
> >> as well as *core competency* design approach clarify a lot. Thanks!
> >>
> >>>
> >>> I.e. we might end up using the exact same per event fd thread pool design 
> >>> that glibc uses 
> >>> currently. Or not. Having that internal and open coded to perf, like Jiri 
> >>> has started 
> >>> implementing it, allows people to experiment with it.
> >>
> >> My point here is that following some standardized programming models and 
> >> APIs 
> >> (like POSIX) in the tool code, even if the tool itself provides internal 
> >> open 
> >> coded implementation for the APIs, would simplify experimenting with the 
> >> tool 
> >> as well as lower barriers for new comers. Perf project could benefit from 
> >> that.
> >>
> >>>
> >>> This isn't some GUI toolkit, this is at the essence of perf, and we are 
> >>> not very good on large 
> >>> systems right now, and I think the design should be open-coded threading, 
> >>> not relying on an 
> >>> (perf-)external AIO library to get it right.
> >>>
> >>> The glibc thread pool implementation of POSIX AIO is basically a 
> >>> fall-back 
> >>> implementation, for the case where there's no native KAIO interface to 
> >>> rely on.
> >>>
>  Well, explicit threading in the tool for AIO, in the simplest case, 
>  means 
>  incorporating some POSIX API implementation into the tool, avoiding 
>  code reuse in the first place. That tends to be error prone and costly.
> >>>
> >>> It's a core competency, we better do it right and not outsource it.
> >>
> >> Yep, makes sense.
> > 
> > on the other hand, we are already trying to tie this up under perf_mmap
> > object, which is what the threaded patchset operates on.. so I'm quite
> > confident that with little effort we could make those 2 things live next
> > to each other and let the user decide which one to take and compare
> > 
> > possibilities would be like: (not sure yet the last one makes sense, but 
> > still..)
> > 
> >   # perf record --threads=...  ...
> >   # perf record --aio ...
> >   # perf record --threads=... --aio ...
> > 
> > how about that?
> 
> That might be an option. What is the semantics of --threads?

that's my latest post on this:
  https://marc.info/?l=linux-kernel&m=151551213322861&w=2

working on repost ;-)

jirka

> 
> Be aware that when experimenting with serial trace writing on an 8-core 
> client machine running an HPC benchmark that heavily utilizes all 8 cores, 
> we noticed that the single perf tool thread contended with the benchmark 
> threads.
> 
> That manifested as libiomp.so (Intel OpenMP implementation) functions 
> appearing among the top hotspot functions, which indicated imbalance 
> induced by the tool during profiling.
> 
> That's why we decided to first go with the AIO approach, as posted, and 
> get the most benefit from it through multiple AIO requests, before turning 
> to the more resource-consuming multi-threading alternative.
> 
> > 
> > I just rebased the thread patchset, will make some tests (it's been a few 
> > months, so it needs some kicking/checking) and post it out hopefully this week
> > 
> > jirka
> > 


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-12 Thread Alexey Budankov


Hi,

On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means 
>>> incorporating some POSIX API implementation into the tool, avoiding 
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a 
>> very good 
>> starting point.
> 
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.

Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel/user buffer allocation, needs to be taken into account by an
effective solution. Luckily, data losses haven't been observed when testing 
matrix multiplication on 96-core dual-socket machines.
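
For illustration, a minimal sketch of such node-local placement, assuming
libnuma is available (link with -lnuma); alloc_node_local_buf() and its use
for the tool-side AIO buffers are hypothetical, not code from the patch set:

#include <numa.h>      /* numa_available, numa_node_of_cpu, numa_alloc_onnode */
#include <stdlib.h>

/*
 * Allocate a tool-side staging buffer on the NUMA node of the CPU whose
 * ring buffer it will serve; fall back to malloc() when NUMA is not
 * available. Free with numa_free(buf, size) or free() respectively.
 */
static void *alloc_node_local_buf(size_t size, int cpu)
{
        int node;

        if (numa_available() < 0)
                return malloc(size);

        node = numa_node_of_cpu(cpu);
        if (node < 0)
                return malloc(size);

        return numa_alloc_onnode(size, node);
}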

> 
> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.

Yep, *FENCE may be costly in hardware, especially at larger scale.
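
For reference, a rough sketch of the reader side of a perf mmap ring buffer,
showing where that fence sits: data_head is loaded with acquire semantics
before records are copied out, and data_tail is published with release
semantics afterwards. drain_ring() and the consume callback are illustrative,
and records that wrap around the buffer end are not handled:

#include <linux/perf_event.h>   /* perf_event_mmap_page, perf_event_header */
#include <stdint.h>

/*
 * Drain complete records between data_tail and data_head. 'base' points at
 * the data pages following the metadata page, 'size' is the data area size
 * (a power of two).
 */
static void drain_ring(struct perf_event_mmap_page *meta, char *base,
                       uint64_t size, void (*consume)(void *rec))
{
        /* pairs with the kernel's barrier before it advances data_head */
        uint64_t head = __atomic_load_n(&meta->data_head, __ATOMIC_ACQUIRE);
        uint64_t tail = meta->data_tail;

        while (tail < head) {
                struct perf_event_header *hdr =
                        (struct perf_event_header *)(base + (tail & (size - 1)));

                consume(hdr);
                tail += hdr->size;
        }

        /* the fence under discussion: make our reads visible before telling
         * the kernel that the space may be reused */
        __atomic_store_n(&meta->data_tail, tail, __ATOMIC_RELEASE);
}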

> 

Thanks,
Alexey


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-11 Thread Peter Zijlstra
On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
> > Well, explicit threading in the tool for AIO, in the simplest case, means 
> > incorporating some POSIX API implementation into the tool, avoiding 
> > code reuse in the first place. That tends to be error prone and costly.
> 
> It's a core competency, we better do it right and not outsource it.
> 
> Please take a look at Jiri's patches (once he re-posts them), I think it's a 
> very good 
> starting point.

There's another reason for doing custom per-cpu threads; it avoids
bouncing the buffer memory around the machine. If the task doing the
buffer reads is the exact same as the one doing the writes, there's less
memory traffic on the interconnects.

Also, I think we can avoid the MFENCE in that case, but I'm not sure
that one is hot enough to bother about on the perf reading side of
things.
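
A minimal sketch of that idea, one reader thread per CPU pinned with
pthread_setaffinity_np() so the thread draining a CPU's ring buffer runs on
the CPU that fills it; the percpu_reader structure and the read loop are
placeholders, not code from either patch set:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

struct percpu_reader {
        int cpu;        /* plus: event fd, mmap'ed ring, per-CPU output fd, ... */
};

static void *reader_thread(void *arg)
{
        struct percpu_reader *r = arg;
        cpu_set_t set;

        /* pin to the CPU whose buffer we drain, so its pages are read on
         * the node that wrote them */
        CPU_ZERO(&set);
        CPU_SET(r->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* loop: wait for data, copy it out, write it to this CPU's file */
        return NULL;
}

static void start_readers(struct percpu_reader *rd, pthread_t *tid, int ncpus)
{
        for (int cpu = 0; cpu < ncpus; cpu++) {
                rd[cpu].cpu = cpu;
                pthread_create(&tid[cpu], NULL, reader_thread, &rd[cpu]);
        }
}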


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-11 Thread Alexey Budankov
Hi,

On 11.09.2018 11:34, Jiri Olsa wrote:
> On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote:
>>
>> Hi Ingo,
>>
>> On 11.09.2018 9:35, Ingo Molnar wrote:
>>>
>>> * Alexey Budankov  wrote:
>>>
 It may sound too optimistic but glibc API is expected to be backward 
 compatible 
 and for POSIX AIO API part too. Internal implementation also tends to 
 evolve to 
 better option overtime, more probably basing on modern kernel capabilities 
 mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
>>>
>>> I'm not talking about compatibility, and I'm not just talking about glibc, 
>>> perf works under 
>>> other libcs as well - and let me phrase it in another way: basic event 
>>> handling, threading, 
>>> scheduling internals should be a *core competency* of a tracing/profiling 
>>> tool.
>>
>> Well, the requirement of independence from some specific libc implementation 
>> as well as *core competency* design approach clarify a lot. Thanks!
>>
>>>
>>> I.e. we might end up using the exact same per event fd thread pool design 
>>> that glibc uses 
>>> currently. Or not. Having that internal and open coded to perf, like Jiri 
>>> has started 
>>> implementing it, allows people to experiment with it.
>>
>> My point here is that following some standardized programming models and 
>> APIs 
>> (like POSIX) in the tool code, even if the tool itself provides internal 
>> open 
>> coded implementation for the APIs, would simplify experimenting with the 
>> tool 
>> as well as lower barriers for new comers. Perf project could benefit from 
>> that.
>>
>>>
>>> This isn't some GUI toolkit, this is at the essence of perf, and we are not 
>>> very good on large 
>>> systems right now, and I think the design should be open-coded threading, 
>>> not relying on an 
>>> (perf-)external AIO library to get it right.
>>>
>>> The glibc thread pool implementation of POSIX AIO is basically a fall-back 
>>> implementation, for the case where there's no native KAIO interface to rely 
>>> on.
>>>
 Well, explicit threading in the tool for AIO, in the simplest case, means 
 incorporating some POSIX API implementation into the tool, avoiding 
 code reuse in the first place. That tends to be error prone and costly.
>>>
>>> It's a core competency, we better do it right and not outsource it.
>>
>> Yep, makes sense.
> 
> on the other hand, we are already trying to tie this up under perf_mmap
> object, which is what the threaded patchset operates on.. so I'm quite
> confident that with little effort we could make those 2 things live next
> to each other and let the user decide which one to take and compare
> 
> possibilities would be like: (not sure yet the last one makes sense, but 
> still..)
> 
>   # perf record --threads=...  ...
>   # perf record --aio ...
>   # perf record --threads=... --aio ...
> 
> how about that?

That might be an option. What is the semantics of --threads?

Be aware that when experimenting with serial trace writing on an 8-core 
client machine running an HPC benchmark that heavily utilizes all 8 cores, 
we noticed that the single perf tool thread contended with the benchmark 
threads.

That manifested as libiomp.so (Intel OpenMP implementation) functions 
appearing among the top hotspot functions, which indicated imbalance 
induced by the tool during profiling.

That's why we decided to first go with the AIO approach, as posted, and 
get the most benefit from it through multiple AIO requests, before turning 
to the more resource-consuming multi-threading alternative.

> 
> I just rebased the thread patchset, will make some tests (it's been a few 
> months, so it needs some kicking/checking) and post it out hopefully this week
> 
> jirka
> 


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-11 Thread Jiri Olsa
On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote:
> 
> Hi Ingo,
> 
> On 11.09.2018 9:35, Ingo Molnar wrote:
> > 
> > * Alexey Budankov  wrote:
> > 
> >> It may sound too optimistic but glibc API is expected to be backward 
> >> compatible 
> >> and for POSIX AIO API part too. Internal implementation also tends to 
> >> evolve to 
> >> better option overtime, more probably basing on modern kernel capabilities 
> >> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
> > 
> > I'm not talking about compatibility, and I'm not just talking about glibc, 
> > perf works under 
> > other libcs as well - and let me phrase it in another way: basic event 
> > handling, threading, 
> > scheduling internals should be a *core competency* of a tracing/profiling 
> > tool.
> 
> Well, the requirement of independence from some specific libc implementation 
> as well as *core competency* design approach clarify a lot. Thanks!
> 
> > 
> > I.e. we might end up using the exact same per event fd thread pool design 
> > that glibc uses 
> > currently. Or not. Having that internal and open coded to perf, like Jiri 
> > has started 
> > implementing it, allows people to experiment with it.
> 
> My point here is that following some standardized programming models and APIs 
> (like POSIX) in the tool code, even if the tool itself provides internal open 
> coded implementation for the APIs, would simplify experimenting with the tool 
> as well as lower barriers for new comers. Perf project could benefit from 
> that.
> 
> > 
> > This isn't some GUI toolkit, this is at the essence of perf, and we are not 
> > very good on large 
> > systems right now, and I think the design should be open-coded threading, 
> > not relying on an 
> > (perf-)external AIO library to get it right.
> > 
> > The glibc thread pool implementation of POSIX AIO is basically a fall-back 
> > implementation, for the case where there's no native KAIO interface to rely 
> > on.
> > 
> >> Well, explicit threading in the tool for AIO, in the simplest case, means 
> >> incorporating some POSIX API implementation into the tool, avoiding 
> >> code reuse in the first place. That tends to be error prone and costly.
> > 
> > It's a core competency, we better do it right and not outsource it.
> 
> Yep, makes sense.

on the other hand, we are already trying to tie this up under perf_mmap
object, which is what the threaded patchset operates on.. so I'm quite
confident that with little effort we could make those 2 things live next
to each other and let the user decide which one to take and compare

possibilities would be like: (not sure yet the last one makes sense, but 
still..)

  # perf record --threads=...  ...
  # perf record --aio ...
  # perf record --threads=... --aio ...

how about that?

I just rebased the thread patchset, will make some tests (it's been a few months,
so it needs some kicking/checking) and post it out hopefully this week

jirka


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-11 Thread Alexey Budankov


Hi Ingo,

On 11.09.2018 9:35, Ingo Molnar wrote:
> 
> * Alexey Budankov  wrote:
> 
>> It may sound too optimistic but glibc API is expected to be backward 
>> compatible 
>> and for POSIX AIO API part too. Internal implementation also tends to evolve 
>> to 
>> better option overtime, more probably basing on modern kernel capabilities 
>> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
> 
> I'm not talking about compatibility, and I'm not just talking about glibc, 
> perf works under 
> other libcs as well - and let me phrase it in another way: basic event 
> handling, threading, 
> scheduling internals should be a *core competency* of a tracing/profiling 
> tool.

Well, the requirement of independence from a specific libc implementation, 
as well as the *core competency* design approach, clarifies a lot. Thanks!

> 
> I.e. we might end up using the exact same per event fd thread pool design 
> that glibc uses 
> currently. Or not. Having that internal and open coded to perf, like Jiri has 
> started 
> implementing it, allows people to experiment with it.

My point here is that following standardized programming models and APIs 
(like POSIX) in the tool code, even if the tool itself provides an internal 
open-coded implementation of those APIs, would simplify experimenting with the 
tool as well as lower barriers for newcomers. The perf project could benefit from that.

> 
> This isn't some GUI toolkit, this is at the essence of perf, and we are not 
> very good on large 
> systems right now, and I think the design should be open-coded threading, not 
> relying on an 
> (perf-)external AIO library to get it right.
> 
> The glibc thread pool implementation of POSIX AIO is basically a fall-back 
> implementation, for the case where there's no native KAIO interface to rely 
> on.
> 
>> Well, explicit threading in the tool for AIO, in the simplest case, means 
>> incorporating some POSIX API implementation into the tool, avoiding 
>> code reuse in the first place. That tends to be error prone and costly.
> 
> It's a core competency, we better do it right and not outsource it.

Yep, makes sense.

Thanks!
Alexey

> 
> Please take a look at Jiri's patches (once he re-posts them), I think it's a 
> very good 
> starting point.
> 
> Thanks,
> 
>   Ingo
> 


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-11 Thread Ingo Molnar


* Alexey Budankov  wrote:

> It may sound too optimistic but glibc API is expected to be backward 
> compatible 
> and for POSIX AIO API part too. Internal implementation also tends to evolve 
> to 
> better option overtime, more probably basing on modern kernel capabilities 
> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html

I'm not talking about compatibility, and I'm not just talking about glibc, perf 
works under 
other libcs as well - and let me phrase it in another way: basic event 
handling, threading, 
scheduling internals should be a *core competency* of a tracing/profiling tool.

I.e. we might end up using the exact same per event fd thread pool design that 
glibc uses 
currently. Or not. Having that internal and open coded to perf, like Jiri has 
started 
implementing it, allows people to experiment with it.

This isn't some GUI toolkit, this is at the essence of perf, and we are not 
very good on large 
systems right now, and I think the design should be open-coded threading, not 
relying on an 
(perf-)external AIO library to get it right.

The glibc thread pool implementation of POSIX AIO is basically a fall-back 
implementation, for the case where there's no native KAIO interface to rely on.

> Well, explicit threading in the tool for AIO, in the simplest case, means 
> incorporating some POSIX API implementation into the tool, avoiding 
> code reuse in the first place. That tends to be error prone and costly.

It's a core competency, we better do it right and not outsource it.

Please take a look at Jiri's patches (once he re-posts them), I think it's a 
very good 
starting point.

Thanks,

Ingo


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Alexey Budankov
Hi,

On 10.09.2018 16:58, Arnaldo Carvalho de Melo wrote:
> Em Mon, Sep 10, 2018 at 02:06:43PM +0200, Ingo Molnar escreveu:
>> * Alexey Budankov  wrote:
>>> On 10.09.2018 12:18, Ingo Molnar wrote:
 * Alexey Budankov  wrote:
> Currently in record mode the tool implements trace writing serially. 
> The algorithm loops over mapped per-cpu data buffers and stores 
> ready data chunks into a trace file using write() system call.
>
> At some circumstances the kernel may lack free space in a buffer 
> because the other buffer's half is not yet written to disk due to 
> some other buffer's data writing by the tool at the moment.
>
> Thus the serial trace writing implementation may cause the kernel 
> to lose profiling data, and that is what was observed when profiling 
> highly parallel CPU bound workloads on machines with a big number 
> of cores.

 Yay! I saw this frequently on a 120-CPU box (hw is broken now).

> Data loss metrics is the ratio lost_time/elapsed_time where 
> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> records and elapsed_time is the elapsed application run time 
> under profiling.
>
> Applying asynchronous trace streaming thru Posix AIO API
> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> lowers data loss metrics value providing 2x improvement -
> lowering 98% loss to almost 0%.

 Hm, instead of AIO why don't we use explicit threads instead? I think 
 Posix AIO will fall back 
 to threads anyway when there's no kernel AIO support (which there probably 
 isn't for perf 
 events).
>>>
>>> Explicit threading is surely an option but having more threads 
>>> in the tool that stream performance data is a considerable 
>>> design complication.
>>>
>>> Luckily, glibc AIO implementation is already based on pthreads, 
>>> but having a writing thread for every distinct fd only.
>>
>> My argument is, we don't want to rely on glibc's choices here. They might
>> use a different threading design in the future, or it might differ between
>> libc versions.
>>
>> The basic flow of tracing/profiling data is something we should control 
>> explicitly,
>> via explicit threading.
>>
>> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf 
>> record -a', not 
>> inherited workflow tracing. For system-wide profiling the ideal tracing 
>> setup is clean per-CPU 
>> separation, i.e. per CPU event fds, per CPU threads that read and then write 
>> into separate 
>> per-CPU files.
> 
> My main request here is that we think about the 'perf top' and 'perf
> trace' workflows as well when working on this, i.e. that we don't take
> for granted that we'll have the perf.data files to work with.

I made manual sanity checks of the perf top and perf trace modes using the same 
matrix multiplication workload. Both modes appear to work after applying 
the patch set.
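
For instance, assuming the matrix multiplication binary is ./matrix (the name
is illustrative), the sanity checks were along the lines of:

  # perf top
  # perf trace -- ./matrix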

Regards,
Alexey

> 
> I.e. N threads, that periodically use that FINISHED_ROUND event to order
> events and go on consuming. All of the objects already have refcounts
> and locking to allow for things like decaying of samples to take care of
> throwing away no longer needed objects (struct map, thread, dso, symbol
> tables, etc) to trim memory usage.
> 
> - Arnaldo
> 


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Alexey Budankov
Hi Ingo,

On 10.09.2018 15:06, Ingo Molnar wrote:
> 
> * Alexey Budankov  wrote:
> 
>> Hi Ingo,
>>
>> On 10.09.2018 12:18, Ingo Molnar wrote:
>>>
>>> * Alexey Budankov  wrote:
>>>

 Currently in record mode the tool implements trace writing serially. 
 The algorithm loops over mapped per-cpu data buffers and stores 
 ready data chunks into a trace file using write() system call.

 At some circumstances the kernel may lack free space in a buffer 
 because the other buffer's half is not yet written to disk due to 
 some other buffer's data writing by the tool at the moment.

 Thus the serial trace writing implementation may cause the kernel 
 to lose profiling data, and that is what was observed when profiling 
 highly parallel CPU bound workloads on machines with a big number 
 of cores.
>>>
>>> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
>>>
 Data loss metrics is the ratio lost_time/elapsed_time where 
 lost_time is the sum of time intervals containing PERF_RECORD_LOST 
 records and elapsed_time is the elapsed application run time 
 under profiling.

 Applying asynchronous trace streaming thru Posix AIO API
 (http://man7.org/linux/man-pages/man7/aio.7.html) 
 lowers data loss metrics value providing 2x improvement -
 lowering 98% loss to almost 0%.
>>>
>>> Hm, instead of AIO why don't we use explicit threads instead? I think Posix 
>>> AIO will fall back 
>>> to threads anyway when there's no kernel AIO support (which there probably 
>>> isn't for perf 
>>> events).
>>
>> Explicit threading is surely an option but having more threads 
>> in the tool that stream performance data is a considerable 
>> design complication.
>>
>> Luckily, glibc AIO implementation is already based on pthreads, 
>> but having a writing thread for every distinct fd only.
> 
> My argument is, we don't want to rely on glibc's choices here. They might
> use a different threading design in the future, or it might differ between
> libc versions.> 
> The basic flow of tracing/profiling data is something we should control 
> explicitly,
> via explicit threading.

It may sound too optimistic, but the glibc API is expected to be backward compatible, 
and that goes for the POSIX AIO API part too. The internal implementation also tends 
to evolve to better options over time, most probably based on the modern kernel 
capabilities mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html
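
For reference, a minimal sketch of the POSIX AIO calls (aio(7)) in question:
aio_write() queues a buffer chunk without blocking, and aio_error()/aio_return()
reap it later. The names trace_fd, buf, size and off are illustrative, and on
older glibc this needs -lrt:

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Queue one ready chunk of a mapped buffer for writing at the given file
 * offset; returns 0 when the request was submitted. */
static int queue_chunk(struct aiocb *cb, int trace_fd,
                       void *buf, size_t size, off_t off)
{
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = trace_fd;
        cb->aio_buf    = buf;
        cb->aio_nbytes = size;
        cb->aio_offset = off;
        return aio_write(cb);           /* returns without waiting for I/O */
}

/* Poll a queued chunk: -EINPROGRESS while in flight, the number of bytes
 * written once completed, -errno on error. */
static ssize_t reap_chunk(struct aiocb *cb)
{
        int err = aio_error(cb);

        if (err == EINPROGRESS)
                return -EINPROGRESS;
        if (err)
                return -err;
        return aio_return(cb);          /* fetch the result exactly once */
}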

Well, explicit threading in the tool for AIO, in the simplest case, means 
incorporating some POSIX API implementation into the tool, which avoids 
code reuse in the first place. That tends to be error-prone and costly.

Regards,
Alexey

> 
> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf 
> record -a', not 
> inherited workflow tracing. For system-wide profiling the ideal tracing setup 
> is clean per-CPU 
> separation, i.e. per CPU event fds, per CPU threads that read and then write 
> into separate 
> per-CPU files.
> 
> Thanks,
> 
>   Ingo
> 


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Arnaldo Carvalho de Melo
Em Mon, Sep 10, 2018 at 02:06:43PM +0200, Ingo Molnar escreveu:
> * Alexey Budankov  wrote:
> > On 10.09.2018 12:18, Ingo Molnar wrote:
> > > * Alexey Budankov  wrote:
> > >> Currently in record mode the tool implements trace writing serially. 
> > >> The algorithm loops over mapped per-cpu data buffers and stores 
> > >> ready data chunks into a trace file using write() system call.
> > >>
> > >> At some circumstances the kernel may lack free space in a buffer 
> > >> because the other buffer's half is not yet written to disk due to 
> > >> some other buffer's data writing by the tool at the moment.
> > >>
> > >> Thus the serial trace writing implementation may cause the kernel 
> > >> to lose profiling data, and that is what was observed when profiling 
> > >> highly parallel CPU bound workloads on machines with a big number 
> > >> of cores.
> > > 
> > > Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> > > 
> > >> Data loss metrics is the ratio lost_time/elapsed_time where 
> > >> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> > >> records and elapsed_time is the elapsed application run time 
> > >> under profiling.
> > >>
> > >> Applying asynchronous trace streaming thru Posix AIO API
> > >> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> > >> lowers data loss metrics value providing 2x improvement -
> > >> lowering 98% loss to almost 0%.
> > > 
> > > Hm, instead of AIO why don't we use explicit threads instead? I think 
> > > Posix AIO will fall back 
> > > to threads anyway when there's no kernel AIO support (which there 
> > > probably isn't for perf 
> > > events).
> > 
> > Explicit threading is surely an option but having more threads 
> > in the tool that stream performance data is a considerable 
> > design complication.
> > 
> > Luckily, glibc AIO implementation is already based on pthreads, 
> > but having a writing thread for every distinct fd only.
> 
> My argument is, we don't want to rely on glibc's choices here. They might
> use a different threading design in the future, or it might differ between
> libc versions.
> 
> The basic flow of tracing/profiling data is something we should control 
> explicitly,
> via explicit threading.
> 
> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf 
> record -a', not 
> inherited workflow tracing. For system-wide profiling the ideal tracing setup 
> is clean per-CPU 
> separation, i.e. per CPU event fds, per CPU threads that read and then write 
> into separate 
> per-CPU files.

My main request here is that we think about the 'perf top' and 'perf
trace' workflows as well when working on this, i.e. that we don't take
for granted that we'll have the perf.data files to work with.

I.e. N threads that periodically use that FINISHED_ROUND event to order
events and go on consuming. All of the objects already have refcounts
and locking to allow for things like decaying of samples to take care of
throwing away no longer needed objects (struct map, thread, dso, symbol
tables, etc.) to trim memory usage.
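
A purely conceptual sketch of that round-based ordering (this is not perf's
code; the event layout and the consume callback are invented for
illustration): each reader queues timestamped events, and when a round
finishes everything not newer than the previous round's high-water mark is
sorted and handed on, so live consumers like perf top/trace can see an
ordered stream without a perf.data file:

#include <stdint.h>
#include <stdlib.h>

struct ev {
        uint64_t time;
        /* payload ... */
};

static int cmp_time(const void *a, const void *b)
{
        const struct ev *x = a, *y = b;

        return (x->time > y->time) - (x->time < y->time);
}

/*
 * Called when a round finishes: consume every queued event not newer than
 * the previous round's maximum timestamp; newer events stay queued.
 * Returns how many events were carried over to the next round.
 */
static size_t flush_round(struct ev *q, size_t n, uint64_t prev_round_max,
                          void (*consume)(struct ev *))
{
        size_t kept = 0;

        qsort(q, n, sizeof(*q), cmp_time);
        for (size_t i = 0; i < n; i++) {
                if (q[i].time <= prev_round_max)
                        consume(&q[i]);
                else
                        q[kept++] = q[i];
        }
        return kept;
}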

- Arnaldo


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Ingo Molnar


* Alexey Budankov  wrote:

> Hi Ingo,
> 
> On 10.09.2018 12:18, Ingo Molnar wrote:
> > 
> > * Alexey Budankov  wrote:
> > 
> >>
> >> Currently in record mode the tool implements trace writing serially. 
> >> The algorithm loops over mapped per-cpu data buffers and stores 
> >> ready data chunks into a trace file using write() system call.
> >>
> >> At some circumstances the kernel may lack free space in a buffer 
> >> because the other buffer's half is not yet written to disk due to 
> >> some other buffer's data writing by the tool at the moment.
> >>
> >> Thus the serial trace writing implementation may cause the kernel 
> >> to lose profiling data, and that is what was observed when profiling 
> >> highly parallel CPU bound workloads on machines with a big number 
> >> of cores.
> > 
> > Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> > 
> >> Data loss metrics is the ratio lost_time/elapsed_time where 
> >> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> >> records and elapsed_time is the elapsed application run time 
> >> under profiling.
> >>
> >> Applying asynchronous trace streaming thru Posix AIO API
> >> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> >> lowers data loss metrics value providing 2x improvement -
> >> lowering 98% loss to almost 0%.
> > 
> > Hm, instead of AIO why don't we use explicit threads instead? I think Posix 
> > AIO will fall back 
> > to threads anyway when there's no kernel AIO support (which there probably 
> > isn't for perf 
> > events).
> 
> Explicit threading is surely an option but having more threads 
> in the tool that stream performance data is a considerable 
> design complication.
> 
> Luckily, glibc AIO implementation is already based on pthreads, 
> but having a writing thread for every distinct fd only.

My argument is, we don't want to rely on glibc's choices here. They might
use a different threading design in the future, or it might differ between
libc versions.

The basic flow of tracing/profiling data is something we should control 
explicitly, via explicit threading.

BTW., the usecase I was primarily concentrating on was a simpler one: 
'perf record -a', not inherited workflow tracing. For system-wide profiling 
the ideal tracing setup is clean per-CPU separation, i.e. per CPU event fds, 
per CPU threads that read and then write into separate per-CPU files.

Thanks,

Ingo
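
A minimal sketch of that per-CPU split, assuming one writer thread per CPU;
read_cpu_data() is a hypothetical stand-in for draining that CPU's mmap ring,
and the perf.data.cpuN naming is only illustrative:

/*
 * Sketch of the per-CPU split: one writer thread per CPU, each
 * streaming into its own file.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>

#define NR_CPUS 4

static ssize_t read_cpu_data(int cpu, char *buf, size_t len)
{
        /* stand-in: the real tool would read the per-CPU ring buffer here */
        return snprintf(buf, len, "synthetic data from cpu %d\n", cpu);
}

static void *writer(void *arg)
{
        int cpu = (int)(long)arg;
        char path[64], buf[256];
        ssize_t n;
        FILE *f;

        snprintf(path, sizeof(path), "perf.data.cpu%d", cpu);
        f = fopen(path, "w");
        if (!f)
                return NULL;

        /* one flush per "round"; a real loop would poll the ring until done */
        n = read_cpu_data(cpu, buf, sizeof(buf));
        if (n > 0)
                fwrite(buf, 1, (size_t)n, f);

        fclose(f);
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_CPUS];
        long cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                pthread_create(&tid[cpu], NULL, writer, (void *)cpu);
        for (cpu = 0; cpu < NR_CPUS; cpu++)
                pthread_join(tid[cpu], NULL);
        return 0;
}

Each writer could additionally be pinned to its CPU so reads stay cache- and
NUMA-local; build with -pthread.
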


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Alexey Budankov
Hi,

On 10.09.2018 13:23, Jiri Olsa wrote:
> On Mon, Sep 10, 2018 at 12:13:25PM +0200, Ingo Molnar wrote:
>>
>> * Jiri Olsa  wrote:
>>
>>> On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:

 * Jiri Olsa  wrote:

>> Per-CPU threading the record session would have so many other advantages 
>> as well (scalability, 
>> etc.).
>>
>> Jiri did per-CPU recording patches a couple of months ago, not sure how 
>> usable they are at the 
>> moment?
>
> it's still usable, I can rebase it and post a branch pointer,
> the problem is I haven't been able to find a case with a real
> performance benefit yet.. ;-)
>
> perhaps because I haven't tried on server with really big cpu
> numbers

 Maybe Alexey could pick up from there? Your concept looked fairly mature 
 to me
 and I tried it on a big-CPU box back then and there were real improvements.
>>>
>>> too bad u did not share your results, it could have been already in ;-)
>>
>> Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have 
>> tested it on
>> broke ...
>>
>>> let me rebase/repost once more and let's see
>>
>> Thanks!
>>
>>> I think we could benefit from both multiple threads event reading
>>> and AIO writing for perf.data.. it could be merged together
>>
>> So instead of AIO writing perf.data, why not just turn perf.data into a 
>> directory structure 
>> with per CPU files? That would allow all sorts of neat future performance 
>> features such as 
> 
> that's basically what the multiple-thread record patchset does

Re-posting part of my answer here...

Please note that tool threads may contend, and actually do, with 
application threads under heavy load when all CPU cores are utilized,
and this may alter the performance profile.

So the choice of tool design is also a matter of proper system balancing
when profiling, so that the gathered performance data stays accurate.

Thanks,
Alexey

> 
> jirka
> 
>> mmap() or splice() based zero-copy.
>>
>> User-space post-processing can then read the files and put them into global 
>> order - or use the 
>> per CPU nature of them, which would be pretty useful too.
>>
>> Also note how well this works on NUMA as well, as the backing pages would be 
>> allocated in a 
>> NUMA-local fashion.
>>
>> I.e. the whole per-CPU threading would enable such a separation of the 
>> tracing/event streams 
>> and would allow true scalability.
>>
>> Thanks,
>>
>>  Ingo
> 
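
One possible mitigation for that contention - not something the posted
patches do - is to confine and deprioritize the tool's writer threads so they
yield to the workload; a hedged sketch using sched_setaffinity() and
setpriority(), with CPU 0 and nice 19 as arbitrary choices:

/*
 * One possible mitigation, not what the posted AIO patches do:
 * confine the tool's writer thread to a single CPU and lower its
 * priority so it yields to the workload when all cores are busy.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

static int deprioritize_writer_thread(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);                       /* keep the writer on CPU 0 */
        if (sched_setaffinity(0, sizeof(set), &set))
                return -1;

        /* nice +19: give way to application threads under saturation */
        if (setpriority(PRIO_PROCESS, 0, 19))
                return -1;
        return 0;
}

int main(void)
{
        if (deprioritize_writer_thread()) {
                perror("deprioritize_writer_thread");
                return 1;
        }
        printf("writer confined to CPU 0, nice 19\n");
        return 0;
}
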


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Alexey Budankov
Hi Ingo,

On 10.09.2018 12:18, Ingo Molnar wrote:
> 
> * Alexey Budankov  wrote:
> 
>>
>> Currently in record mode the tool implements trace writing serially. 
>> The algorithm loops over mapped per-cpu data buffers and stores 
>> ready data chunks into a trace file using write() system call.
>>
>> Under some circumstances the kernel may lack free space in a buffer 
>> because the buffer's other half is not yet written to disk while the 
>> tool is busy writing some other buffer's data at the moment.
>>
>> Thus the serial trace writing implementation may cause the kernel 
>> to lose profiling data, and that is what is observed when profiling 
>> highly parallel CPU bound workloads on machines with a big number 
>> of cores.
> 
> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> 
>> The data loss metric is the ratio lost_time/elapsed_time, where 
>> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
>> records and elapsed_time is the elapsed application run time 
>> under profiling.
>>
>> Applying asynchronous trace streaming through the POSIX AIO API
>> (http://man7.org/linux/man-pages/man7/aio.7.html) 
>> lowers the data loss metric value, providing a 2x improvement -
>> lowering a 98% loss to almost 0%.
> 
> Hm, instead of AIO why don't we use explicit threads instead? I think Posix 
> AIO will fall back 
> to threads anyway when there's no kernel AIO support (which there probably 
> isn't for perf 
> events).

Explicit threading is surely an option but having more threads 
in the tool that stream performance data is a considerable 
design complication.

Luckily, the glibc AIO implementation is already based on pthreads, 
though with only a single writing thread per distinct fd.

> 
> Per-CPU threading the record session would have so many other advantages as 
> well (scalability, 
> etc.).
> 
> Jiri did per-CPU recording patches a couple of months ago, not sure how 
> usable they are at the 
> moment?

Tool threads may contend, and actually do, with application 
threads under heavy load when all CPU cores are utilized,
and this may alter the performance profile.

Thanks,
Alexey

> 
> Thanks,
> 
>   Ingo
> 


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Jiri Olsa
On Mon, Sep 10, 2018 at 12:13:25PM +0200, Ingo Molnar wrote:
> 
> * Jiri Olsa  wrote:
> 
> > On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:
> > > 
> > > * Jiri Olsa  wrote:
> > > 
> > > > > Per-CPU threading the record session would have so many other 
> > > > > advantages as well (scalability, 
> > > > > etc.).
> > > > > 
> > > > > Jiri did per-CPU recording patches a couple of months ago, not sure 
> > > > > how usable they are at the 
> > > > > moment?
> > > > 
> > > > it's still usable, I can rebase it and post a branch pointer,
> > > > the problem is I haven't been able to find a case with a real
> > > > performance benefit yet.. ;-)
> > > > 
> > > > perhaps because I haven't tried on server with really big cpu
> > > > numbers
> > > 
> > > Maybe Alexey could pick up from there? Your concept looked fairly mature 
> > > to me
> > > and I tried it on a big-CPU box back then and there were real 
> > > improvements.
> > 
> > too bad u did not share your results, it could have been already in ;-)
> 
> Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have 
> tested it on
> broke ...
> 
> > let me rebase/repost once more and let's see
> 
> Thanks!
> 
> > I think we could benefit from both multiple threads event reading
> > and AIO writing for perf.data.. it could be merged together
> 
> So instead of AIO writing perf.data, why not just turn perf.data into a 
> directory structure 
> with per CPU files? That would allow all sorts of neat future performance 
> features such as 

that's basically what the multiple-thread record patchset does

jirka

> mmap() or splice() based zero-copy.
> 
> User-space post-processing can then read the files and put them into global 
> order - or use the 
> per CPU nature of them, which would be pretty useful too.
> 
> Also note how well this works on NUMA as well, as the backing pages would be 
> allocated in a 
> NUMA-local fashion.
> 
> I.e. the whole per-CPU threading would enable such a separation of the 
> tracing/event streams 
> and would allow true scalability.
> 
> Thanks,
> 
>   Ingo


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Ingo Molnar


* Jiri Olsa  wrote:

> On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:
> > 
> > * Jiri Olsa  wrote:
> > 
> > > > Per-CPU threading the record session would have so many other 
> > > > advantages as well (scalability, 
> > > > etc.).
> > > > 
> > > > Jiri did per-CPU recording patches a couple of months ago, not sure how 
> > > > usable they are at the 
> > > > moment?
> > > 
> > > it's still usable, I can rebase it and post a branch pointer,
> > > the problem is I haven't been able to find a case with a real
> > > performance benefit yet.. ;-)
> > > 
> > > perhaps because I haven't tried on server with really big cpu
> > > numbers
> > 
> > Maybe Alexey could pick up from there? Your concept looked fairly mature to 
> > me
> > and I tried it on a big-CPU box back then and there were real improvements.
> 
> too bad u did not share your results, it could have been already in ;-)

Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have 
tested it on broke ...

> let me rebase/repost once more and let's see

Thanks!

> I think we could benefit from both multiple threads event reading
> and AIO writing for perf.data.. it could be merged together

So instead of AIO writing perf.data, why not just turn perf.data into a 
directory structure with per CPU files? That would allow all sorts of neat 
future performance features such as mmap() or splice() based zero-copy.

User-space post-processing can then read the files and put them into global 
order - or use the per CPU nature of them, which would be pretty useful too.

Also note how well this works on NUMA as well, as the backing pages would be 
allocated in a NUMA-local fashion.

I.e. the whole per-CPU threading would enable such a separation of the 
tracing/event streams and would allow true scalability.

Thanks,

Ingo
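
The NUMA point can be sketched with mbind(): a per-CPU staging buffer whose
backing pages are bound to the node of the CPU that fills it. The node number
below is illustrative, and the example links against libnuma (-lnuma) for the
mbind() wrapper:

/*
 * Sketch of NUMA-local backing for a per-CPU staging buffer: bind
 * the buffer's pages to one node before touching them.
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (4 * 1024 * 1024)

int main(void)
{
        unsigned long nodemask = 1UL << 0;      /* node 0 only */
        void *buf;

        buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* bind the not-yet-faulted pages to node 0 */
        if (mbind(buf, BUF_SIZE, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0)) {
                perror("mbind");
                return 1;
        }

        memset(buf, 0, BUF_SIZE);               /* fault in, node-local */
        munmap(buf, BUF_SIZE);
        return 0;
}
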


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Jiri Olsa
On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote:
> 
> * Jiri Olsa  wrote:
> 
> > > Per-CPU threading the record session would have so many other advantages 
> > > as well (scalability, 
> > > etc.).
> > > 
> > > Jiri did per-CPU recording patches a couple of months ago, not sure how 
> > > usable they are at the 
> > > moment?
> > 
> > it's still usable, I can rebase it and post a branch pointer,
> > the problem is I haven't been able to find a case with a real
> > performance benefit yet.. ;-)
> > 
> > perhaps because I haven't tried on server with really big cpu
> > numbers
> 
> Maybe Alexey could pick up from there? Your concept looked fairly mature to me
> and I tried it on a big-CPU box back then and there were real improvements.

too bad u did not share your results, it could have been already in ;-)

let me rebase/repost once more and let's see

I think we could benefit from both multiple threads event reading
and AIO writing for perf.data.. it could be merged together

jirka


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Ingo Molnar


* Jiri Olsa  wrote:

> > Per-CPU threading the record session would have so many other advantages as 
> > well (scalability, 
> > etc.).
> > 
> > Jiri did per-CPU recording patches a couple of months ago, not sure how 
> > usable they are at the 
> > moment?
> 
> it's still usable, I can rebase it and post a branch pointer,
> the problem is I haven't been able to find a case with a real
> performance benefit yet.. ;-)
> 
> perhaps because I haven't tried on server with really big cpu
> numbers

Maybe Alexey could pick up from there? Your concept looked fairly mature to me
and I tried it on a big-CPU box back then and there were real improvements.

Thanks,

Ingo


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Jiri Olsa
On Mon, Sep 10, 2018 at 11:18:41AM +0200, Ingo Molnar wrote:
> 
> * Alexey Budankov  wrote:
> 
> > 
> > Currently in record mode the tool implements trace writing serially. 
> > The algorithm loops over mapped per-cpu data buffers and stores 
> > ready data chunks into a trace file using write() system call.
> > 
> > Under some circumstances the kernel may lack free space in a buffer 
> > because the buffer's other half is not yet written to disk while the 
> > tool is busy writing some other buffer's data at the moment.
> > 
> > Thus the serial trace writing implementation may cause the kernel 
> > to lose profiling data, and that is what is observed when profiling 
> > highly parallel CPU bound workloads on machines with a big number 
> > of cores.
> 
> Yay! I saw this frequently on a 120-CPU box (hw is broken now).
> 
> > The data loss metric is the ratio lost_time/elapsed_time, where 
> > lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> > records and elapsed_time is the elapsed application run time 
> > under profiling.
> > 
> > Applying asynchronous trace streaming through the POSIX AIO API
> > (http://man7.org/linux/man-pages/man7/aio.7.html) 
> > lowers the data loss metric value, providing a 2x improvement -
> > lowering a 98% loss to almost 0%.
> 
> Hm, instead of AIO why don't we use explicit threads instead? I think Posix 
> AIO will fall back 
> to threads anyway when there's no kernel AIO support (which there probably 
> isn't for perf 
> events).

this patch adds the aio for writing to the perf.data
file, reading of events is unchanged

> 
> Per-CPU threading the record session would have so many other advantages as 
> well (scalability, 
> etc.).
> 
> Jiri did per-CPU recording patches a couple of months ago, not sure how 
> usable they are at the 
> moment?

it's still usable, I can rebase it and post a branch pointer,
the problem is I haven't been able to find a case with a real
performance benefit yet.. ;-)

perhaps because I haven't tried on server with really big cpu
numbers

jirka


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-10 Thread Ingo Molnar


* Alexey Budankov  wrote:

> 
> Currently in record mode the tool implements trace writing serially. 
> The algorithm loops over mapped per-cpu data buffers and stores 
> ready data chunks into a trace file using write() system call.
> 
> Under some circumstances the kernel may lack free space in a buffer 
> because the buffer's other half is not yet written to disk while the 
> tool is busy writing some other buffer's data at the moment.
> 
> Thus the serial trace writing implementation may cause the kernel 
> to lose profiling data, and that is what is observed when profiling 
> highly parallel CPU bound workloads on machines with a big number 
> of cores.

Yay! I saw this frequently on a 120-CPU box (hw is broken now).

> The data loss metric is the ratio lost_time/elapsed_time, where 
> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> records and elapsed_time is the elapsed application run time 
> under profiling.
> 
> Applying asynchronous trace streaming through the POSIX AIO API
> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> lowers the data loss metric value, providing a 2x improvement -
> lowering a 98% loss to almost 0%.

Hm, instead of AIO why don't we use explicit threads instead? I think Posix 
AIO will fall back to threads anyway when there's no kernel AIO support 
(which there probably isn't for perf events).

Per-CPU threading the record session would have so many other advantages as 
well (scalability, etc.).

Jiri did per-CPU recording patches a couple of months ago, not sure how 
usable they are at the moment?

Thanks,

Ingo
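
For reference, the lost_time/elapsed_time metric quoted above is a plain
ratio; a tiny sketch with made-up intervals shows how it is computed:

/* The data-loss metric from the cover letter, with made-up intervals. */
#include <stdio.h>

struct interval { double start, end; };         /* seconds */

int main(void)
{
        /* intervals containing PERF_RECORD_LOST records (synthetic) */
        struct interval lost[] = { { 0.5, 20.0 }, { 21.0, 50.0 } };
        double elapsed = 50.0, lost_time = 0.0;
        unsigned int i;

        for (i = 0; i < sizeof(lost) / sizeof(lost[0]); i++)
                lost_time += lost[i].end - lost[i].start;

        printf("data loss: %.1f%%\n", 100.0 * lost_time / elapsed);
        return 0;
}
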


Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-07 Thread Alexey Budankov



On 07.09.2018 10:07, Alexey Budankov wrote:
> 
> Currently in record mode the tool implements trace writing serially. 
> The algorithm loops over mapped per-cpu data buffers and stores 
> ready data chunks into a trace file using write() system call.
> 
> Under some circumstances the kernel may lack free space in a buffer 
> because the buffer's other half is not yet written to disk while the 
> tool is busy writing some other buffer's data at the moment.
> 
> Thus the serial trace writing implementation may cause the kernel 
> to lose profiling data, and that is what is observed when profiling 
> highly parallel CPU bound workloads on machines with a big number 
> of cores.
> 
> An experiment profiling matrix multiplication code executing 128 
> threads on Intel Xeon Phi (KNM) with 272 cores, like the one below,
> demonstrates a data loss metric value of 98%:
> 
> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
> --call-graph dwarf,1024 --user-regs=IP,SP,BP \
> --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
> matrix.gcc
> 
> The data loss metric is the ratio lost_time/elapsed_time, where 
> lost_time is the sum of time intervals containing PERF_RECORD_LOST 
> records and elapsed_time is the elapsed application run time 
> under profiling.
> 
> Applying asynchronous trace streaming through the POSIX AIO API
> (http://man7.org/linux/man-pages/man7/aio.7.html) 
> lowers the data loss metric value, providing a 2x improvement -
> lowering a 98% loss to almost 0%.
> 
> ---
>  Alexey Budankov (3):
> perf util: map data buffer for preserving collected data
> perf record: enable asynchronous trace writing
> perf record: extend trace writing to multi AIO
>  
>  tools/perf/builtin-record.c | 166 ++--
>  tools/perf/perf.h   |   1 +
>  tools/perf/util/evlist.c|   7 +-
>  tools/perf/util/evlist.h|   3 +-
>  tools/perf/util/mmap.c  | 114 ++
>  tools/perf/util/mmap.h  |  11 ++-
>  6 files changed, 277 insertions(+), 25 deletions(-)

The whole thing for 

git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux perf/core 

repository follows:

 tools/perf/builtin-record.c | 165 ++--
 tools/perf/perf.h   |   1 +
 tools/perf/util/evlist.c|   7 +-
 tools/perf/util/evlist.h|   3 +-
 tools/perf/util/mmap.c  | 114 ++
 tools/perf/util/mmap.h  |  11 ++-
 6 files changed, 276 insertions(+), 25 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9853552bcf16..7bb7947072e5 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -121,6 +121,112 @@ static int record__write(struct record *rec, void *bf, size_t size)
return 0;
 }
 
+static int record__aio_write(struct aiocb *cblock, int trace_fd,
+   void *buf, size_t size, off_t off)
+{
+   int rc;
+
+   cblock->aio_fildes = trace_fd;
+   cblock->aio_buf = buf;
+   cblock->aio_nbytes = size;
+   cblock->aio_offset = off;
+   cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+   do {
+   rc = aio_write(cblock);
+   if (rc == 0) {
+   break;
+   } else if (errno != EAGAIN) {
+   cblock->aio_fildes = -1;
+   pr_err("failed to queue perf data, error: %m\n");
+   break;
+   }
+   } while (1);
+
+   return rc;
+}
+
+static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
+{
+   void *rem_buf;
+   off_t rem_off;
+   size_t rem_size;
+   int rc, aio_errno;
+   ssize_t aio_ret, written;
+
+   aio_errno = aio_error(cblock);
+   if (aio_errno == EINPROGRESS)
+   return 0;
+
+   written = aio_ret = aio_return(cblock);
+   if (aio_ret < 0) {
+   if (!(aio_errno == EINTR))
+   pr_err("failed to write perf data, error: %m\n");
+   written = 0;
+   }
+
+   rem_size = cblock->aio_nbytes - written;
+
+   if (rem_size == 0) {
+   cblock->aio_fildes = -1;
+   /*
+* md->refcount is incremented in perf_mmap__push() for
+* every enqueued aio write request so decrement it because
+* the request is now complete.
+*/
+   perf_mmap__put(md);
+   rc = 1;
+   } else {
+   /*
+* aio write request may require restart with the
+* remainder if the kernel didn't write the whole
+* chunk at once.
+*/
+   rem_off = cblock->aio_offset + written;
+   rem_buf = (void *)(cblock->aio_buf + written);
+   record__aio_write(cblock, cblock->aio_fildes,
+   
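
The quoted patch is truncated above, but it follows the standard POSIX AIO
pattern: fill in an aiocb, queue it with aio_write(), then reap it with
aio_error()/aio_return(). A self-contained sketch of that pattern against a
scratch file (the path and the busy-poll are arbitrary; the real code checks
completions from its main loop rather than busy-waiting):

/*
 * Minimal POSIX AIO write, mirroring the record__aio_write() /
 * record__aio_complete() pattern above: queue with aio_write(),
 * poll aio_error(), reap aio_return().  Link with -lrt on older glibc.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        static const char data[] = "queued asynchronously\n";
        struct aiocb cb;
        int fd, err;

        fd = open("/tmp/aio-demo.out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = (void *)data;
        cb.aio_nbytes = sizeof(data) - 1;
        cb.aio_offset = 0;
        cb.aio_sigevent.sigev_notify = SIGEV_NONE;      /* poll, no signal */

        if (aio_write(&cb)) {
                perror("aio_write");
                return 1;
        }

        /* busy-poll for completion; only acceptable in a demo */
        while ((err = aio_error(&cb)) == EINPROGRESS)
                usleep(1000);

        if (err) {
                fprintf(stderr, "aio failed: %s\n", strerror(err));
                return 1;
        }
        printf("wrote %zd bytes\n", aio_return(&cb));

        close(fd);
        return 0;
}
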
