Re: ftrace global trace_pipe_raw

2019-01-16 Thread Steven Rostedt
On Wed, 16 Jan 2019 09:00:00 +0100
Claudio  wrote:

> Indeed the perf event interface would be awesome, if only it would support 
> tracing all processes.
> 
> Unfortunately for my use case, it can only trace one process on any CPU, or
> all processes on one CPU.
> 
> I guess this is for some kind of security concern...

Not security, but performance.

Serialized writes are best done on serialized instances (per CPU or a
single task). Having all tasks on all CPUs write to a single location
is a huge detriment to performance, which is why the buffer you are
asking for doesn't exist.

-- Steve


> 
> I'll take a look at how much work it would be to extend the interface for the
> any-process/any-CPU use case.
> 
> Ciao and thank you,
> 
> Claudio
> 
> 



Re: ftrace global trace_pipe_raw

2019-01-16 Thread Claudio
Hi Steven, happy new year,

On 12/19/2018 05:37 PM, Steven Rostedt wrote:
> On Wed, 19 Dec 2018 12:32:41 +0100
> Claudio  wrote:
> 

>>>> I would imagine the core functionality is already available, since trace_pipe
>>>> in the tracing directory already shows all events regardless of CPU, and so
>>>> it would be a matter of doing the same for trace_pipe_raw.
>>>
>>> The difference between trace_pipe and trace_pipe_raw is that trace_pipe
>>> is post processed, and reads the per CPU buffers and interleaves them
>>> one event at a time. The trace_pipe_raw just sends you the raw
>>> unprocessed data directly from the buffers, which are grouped per CPU.  
>>
>> I think that what I am looking for, to improve the performance of our system,
>> is a post-processed stream of binary entry data, already merged from all CPUs
>> and sorted by timestamp, in the same way that it is done for the textual output
>> in __find_next_entry:
>>
>>    for_each_tracing_cpu(cpu) {
>>
>>            if (ring_buffer_empty_cpu(buffer, cpu))
>>                    continue;
>>
>>            ent = peek_next_entry(iter, cpu, &ts, &lost_events);
>>
>>            /*
>>             * Pick the entry with the smallest timestamp:
>>             */
>>            if (ent && (!next || ts < next_ts)) {
>>                    next = ent;
>>                    next_cpu = cpu;
>>                    next_ts = ts;
>>                    next_lost = lost_events;
>>                    next_size = iter->ent_size;
>>            }
>>    }
>>
>> We first tried to use the textual output directly, but this led to
>> unacceptable overhead in parsing the text.
>>
>> Please correct me if I misunderstand; however, it seems to me that it
>> would be possible to do the same kind of post processing, generating
>> a sorted stream of entries while skipping the text formatting and
>> outputting the binary data of each entry directly, which would be far
>> more efficient for user-space correlators to consume.
>>
>> But maybe this is not a general enough requirement to be acceptable for
>> implementation directly in the kernel?
>>
>> We have the requirement of using the OS tracing events, including
>> scheduling events, to react in software immediately
>> (vs. doing after-the-fact analysis).
> 
> Have you looked at using the perf event interface? I believe it uses a
> single buffer for all events, at least when tracing a single process.
> 
> -- Steve
> 

Indeed the perf event interface would be awesome, if only it supported
tracing all processes.

Unfortunately for my use case, it can only trace one process on any CPU, or
all processes on one CPU.

I guess this is for some kind of security concern...

I'll take a look at how much work it would be to extend the interface for the
any-process/any-CPU use case.
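For reference, the closest I can get today seems to be one buffer per CPU
with pid == -1, which still means one mmap per CPU and a merge in user
space. A rough, untested sketch (the tracepoint id is assumed to be read
from /sys/kernel/debug/tracing/events/sched/sched_switch/id):

    /* Rough sketch: one perf ring buffer per CPU, all processes (pid == -1).
     * Untested; error handling mostly omitted. */
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <linux/perf_event.h>

    static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                   int cpu, int group_fd, unsigned long flags)
    {
            /* There is no glibc wrapper; the raw syscall is the documented way. */
            return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    static int open_cpu_buffer(int cpu, long long tp_id, void **ring)
    {
            struct perf_event_attr attr;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_TRACEPOINT;
            attr.config = tp_id;                    /* e.g. sched:sched_switch */
            attr.sample_period = 1;                 /* record every event */
            attr.sample_type = PERF_SAMPLE_TIME | PERF_SAMPLE_RAW;

            /* pid == -1, cpu >= 0: all tasks on this CPU (needs privileges) */
            fd = sys_perf_event_open(&attr, -1, cpu, -1, 0);
            if (fd < 0)
                    return -1;

            /* 1 metadata page + 2^n data pages, per perf_event_open(2) */
            *ring = mmap(NULL, (1 + 8) * getpagesize(),
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            return fd;
    }

That still leaves merging the per-CPU streams by timestamp to user space,
which is exactly the step I was hoping could be done for me.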

Ciao and thank you,

Claudio





Re: ftrace global trace_pipe_raw

2018-12-19 Thread Steven Rostedt
On Wed, 19 Dec 2018 12:32:41 +0100
Claudio  wrote:

> >>
> >> I would imagine the core functionality is already available, since trace_pipe
> >> in the tracing directory already shows all events regardless of CPU, and so
> >> it would be a matter of doing the same for trace_pipe_raw.
> > 
> > The difference between trace_pipe and trace_pipe_raw is that trace_pipe
> > is post processed, and reads the per CPU buffers and interleaves them
> > one event at a time. The trace_pipe_raw just sends you the raw
> > unprocessed data directly from the buffers, which are grouped per CPU.  
> 
> I think that what I am looking for, to improve the performance of our system,
> is a post-processed stream of binary entry data, already merged from all CPUs
> and sorted by timestamp, in the same way that it is done for the textual output
> in __find_next_entry:
> 
>    for_each_tracing_cpu(cpu) {
>
>            if (ring_buffer_empty_cpu(buffer, cpu))
>                    continue;
>
>            ent = peek_next_entry(iter, cpu, &ts, &lost_events);
>
>            /*
>             * Pick the entry with the smallest timestamp:
>             */
>            if (ent && (!next || ts < next_ts)) {
>                    next = ent;
>                    next_cpu = cpu;
>                    next_ts = ts;
>                    next_lost = lost_events;
>                    next_size = iter->ent_size;
>            }
>    }
> 
> We first tried to use the textual output directly, but this led to
> unacceptable overhead in parsing the text.
> 
> Please correct me if I misunderstand; however, it seems to me that it
> would be possible to do the same kind of post processing, generating
> a sorted stream of entries while skipping the text formatting and
> outputting the binary data of each entry directly, which would be far
> more efficient for user-space correlators to consume.
> 
> But maybe this is not a general enough requirement to be acceptable for
> implementation directly in the kernel?
> 
> We have the requirement of using the OS tracing events, including
> scheduling events, to react in software immediately
> (vs. doing after-the-fact analysis).

Have you looked at using the perf event interface? I believe it uses a
single buffer for all events, at least when tracing a single process.

-- Steve


Re: ftrace global trace_pipe_raw

2018-12-19 Thread Claudio
Hi Steven,

going back to this old topic to clarify a bit what I was trying to achieve:

On 07/24/2018 04:23 PM, Steven Rostedt wrote:
> On Tue, 24 Jul 2018 11:58:18 +0200
> Claudio  wrote:
> 
>> Hello Steven,
>>
>> I am doing correlation of Linux sched events, following all tasks across
>> CPUs, and one thing that would be really convenient would be to have a global
>> trace_pipe_raw, in addition to the per-CPU ones, with already sorted events.

I think that I asked for the wrong thing, since I did not understand how the
implementation worked, which led to your response. Thank you for the
clarification.

>>
>> I would imagine the core functionality is already available, since trace_pipe
>> in the tracing directory already shows all events regardless of CPU, and so
>> it would be a matter of doing the same for trace_pipe_raw.
> 
> The difference between trace_pipe and trace_pipe_raw is that trace_pipe
> is post processed, and reads the per CPU buffers and interleaves them
> one event at a time. The trace_pipe_raw just sends you the raw
> unprocessed data directly from the buffers, which are grouped per CPU.

I think that what I am looking for, to improve the performance of our system,
is a post-processed stream of binary entry data, already merged from all CPUs
and sorted by timestamp, in the same way that it is done for the textual output
in __find_next_entry:

    for_each_tracing_cpu(cpu) {

            if (ring_buffer_empty_cpu(buffer, cpu))
                    continue;

            ent = peek_next_entry(iter, cpu, &ts, &lost_events);

            /*
             * Pick the entry with the smallest timestamp:
             */
            if (ent && (!next || ts < next_ts)) {
                    next = ent;
                    next_cpu = cpu;
                    next_ts = ts;
                    next_lost = lost_events;
                    next_size = iter->ent_size;
            }
    }

We first tried to use the textual output directly, but this led to
unacceptable overhead in parsing the text.

Please correct me if I misunderstand; however, it seems to me that it
would be possible to do the same kind of post processing, generating
a sorted stream of entries while skipping the text formatting and
outputting the binary data of each entry directly, which would be far
more efficient for user-space correlators to consume.

But maybe this is not a general enough requirement to be acceptable for
implementation directly in the kernel?

We have the requirement of using the OS tracing events, including
scheduling events, to react in software immediately
(vs. doing after-the-fact analysis).
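
To illustrate, the merge step itself is cheap to do in user space; here is
a sketch of the analogue of __find_next_entry, assuming the per-CPU raw
events have already been decoded into timestamped records (the structures
are hypothetical, not the on-disk format):

    #include <stdint.h>
    #include <stddef.h>

    struct event {
            uint64_t ts;            /* event timestamp */
            const void *data;       /* raw binary payload */
    };

    struct cpu_stream {
            struct event *events;   /* decoded events of one CPU */
            size_t len, pos;        /* peek position, as in the iterator */
    };

    /* Return the index of the stream holding the smallest timestamp,
     * or -1 once every per-CPU stream is drained. */
    static int next_event(struct cpu_stream *streams, int ncpus)
    {
            int next_cpu = -1;
            uint64_t next_ts = 0;

            for (int cpu = 0; cpu < ncpus; cpu++) {
                    struct cpu_stream *s = &streams[cpu];

                    if (s->pos >= s->len)   /* this CPU's buffer is empty */
                            continue;

                    /* Pick the entry with the smallest timestamp. */
                    if (next_cpu < 0 || s->events[s->pos].ts < next_ts) {
                            next_cpu = cpu;
                            next_ts = s->events[s->pos].ts;
                    }
            }
            return next_cpu;
    }

The expensive part is not this loop but decoding the raw pages to obtain
the timestamps in the first place, which is why doing it once, centrally,
looked attractive to us.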

Thank you for your comments on this, and I wish you nice holidays.

Ciao,

Claudio


Re: ftrace global trace_pipe_raw

2018-07-24 Thread Claudio
Hello Steve,

thank you for your answer,

On 07/24/2018 04:25 PM, Steven Rostedt wrote:
> On Tue, 24 Jul 2018 10:23:16 -0400
> Steven Rostedt  wrote:
> 
>>>
>>> Would work in the direction of adding a global trace_pipe_raw be considered
>>> for inclusion?  
>>
>> The design of the lockless ring buffer requires that the writer not be
>> preempted, and that the data cannot be written to from more than one
>> location. To do so, we make a per-CPU buffer, and disable preemption when
>> writing. This means that we have only one writer at a time. It can handle
>> interrupts and NMIs, because they will finish before they return, and
>> this doesn't break the algorithm. But having writers from multiple CPUs
>> would require locking or other heavy synchronization operations that
>> would greatly reduce the speed of writing to the buffers (not to mention
>> the cache thrashing).
> 

I understand, it is not a simple matter then.

> And why would you need a single buffer? 

I am interested in all events that have to do with a specific task,
regardless of the CPU they appear on.
Having an already post-processed stream of binary data would be awesome, I think.

> Note, we are working on a libtracecmd.so library that will allow
> applications to read the buffers, and the library will take care of the
> interleaving of the raw data. This should hopefully be ready in about
> three months or so.
> 
> -- Steve

That would be great! So the library could handle this kind of preprocessing
and create a single stream of timestamp-sorted events with the binary data?

Thanks a lot,

Claudio


Re: ftrace global trace_pipe_raw

2018-07-24 Thread Steven Rostedt
On Tue, 24 Jul 2018 10:23:16 -0400
Steven Rostedt  wrote:

> > 
> > Would work in the direction of adding a global trace_pipe_raw be considered
> > for inclusion?  
> 
> The design of the lockless ring buffer requires that the writer not be
> preempted, and that the data cannot be written to from more than one
> location. To do so, we make a per-CPU buffer, and disable preemption when
> writing. This means that we have only one writer at a time. It can handle
> interrupts and NMIs, because they will finish before they return, and
> this doesn't break the algorithm. But having writers from multiple CPUs
> would require locking or other heavy synchronization operations that
> would greatly reduce the speed of writing to the buffers (not to mention
> the cache thrashing).

And why would you need a single buffer? Note, we are working on a
libtracecmd.so library that will allow applications to read the buffers,
and the library will take care of the interleaving of the raw data. This
should hopefully be ready in about three months or so.

-- Steve


Re: ftrace global trace_pipe_raw

2018-07-24 Thread Steven Rostedt
On Tue, 24 Jul 2018 11:58:18 +0200
Claudio  wrote:

> Hello Steven,
> 
> I am doing correlation of Linux sched events, following all tasks across
> CPUs, and one thing that would be really convenient would be to have a global
> trace_pipe_raw, in addition to the per-CPU ones, with already sorted events.
> 
> I would imagine the core functionality is already available, since trace_pipe
> in the tracing directory already shows all events regardless of CPU, and so
> it would be a matter of doing the same for trace_pipe_raw.

The difference between trace_pipe and trace_pipe_raw is that trace_pipe
is post processed, and reads the per CPU buffers and interleaves them
one event at a time. The trace_pipe_raw just sends you the raw
unprocessed data directly from the buffers, which are grouped per CPU.

> 
> But is there a good reason why trace_pipe_raw is available only per-cpu?

Yes, because it maps the ring buffers themselves without any post
processing.
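
Reading them from user space is then just a matter of opening each per-CPU
file and pulling raw ring-buffer pages out of it; decoding the pages (via
the events/*/format files or a library) is up to the reader. A rough sketch,
assuming the usual debugfs mount point:

    /* Rough sketch: open one raw pipe per CPU and read ring-buffer
     * pages from it; the pages still need to be decoded afterwards. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int open_raw_pipe(int cpu)
    {
            char path[128];

            snprintf(path, sizeof(path),
                     "/sys/kernel/debug/tracing/per_cpu/cpu%d/trace_pipe_raw",
                     cpu);
            return open(path, O_RDONLY);    /* reads block until data arrives */
    }

    /* Consumers then read one sub-buffer page at a time:
     *
     *     char page[4096];
     *     ssize_t n = read(fd, page, sizeof(page));
     */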

> 
> Would work in the direction of adding a global trace_pipe_raw be considered
> for inclusion?

The design of the lockless ring buffer requires that the writer not be
preempted, and that the data cannot be written to from more than one
location. To do so, we make a per-CPU buffer, and disable preemption when
writing. This means that we have only one writer at a time. It can handle
interrupts and NMIs, because they will finish before they return, and
this doesn't break the algorithm. But having writers from multiple CPUs
would require locking or other heavy synchronization operations that
would greatly reduce the speed of writing to the buffers (not to mention
the cache thrashing).
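
To give an idea of the write side, a commit looks roughly like this with
the public ring_buffer API (simplified sketch; timestamping and most error
handling elided):

    /* Simplified sketch of the single-writer commit path. Preemption is
     * disabled between reserve and commit, so only one writer at a time
     * touches a given CPU's buffer. */
    #include <linux/ring_buffer.h>
    #include <linux/string.h>
    #include <linux/errno.h>

    static int write_sample(struct ring_buffer *buffer,
                            const void *payload, unsigned long len)
    {
            struct ring_buffer_event *event;

            /* Reserve space on the current CPU's buffer. */
            event = ring_buffer_lock_reserve(buffer, len);
            if (!event)
                    return -EBUSY;          /* buffer full or disabled */

            memcpy(ring_buffer_event_data(event), payload, len);

            /* Publish the event. */
            return ring_buffer_unlock_commit(buffer, event);
    }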

-- Steve

