Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-23 Thread Alexei Starovoitov
On Mon, Feb 16, 2015 at 6:26 AM, He Kuang  wrote:
> Hi, Alexei
>
> Another suggestion on bpf syscall interface. Currently, BPF +
> syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on
> commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this
> case, bpf bytecode cannot be loaded to the kernel.

I'm seeing a flurry of use cases for bpf in ovs, tc, tracing, etc
When it's all ready, we can turn that config on by default.

> If we turn the functionality of BPF_SYSCALL into a loadable module, then
> we can use it without any dependencies on the kernel. What about change
> bpf syscall to a /dev node or /sys file which can be exported by a
> kernel module?

I don't think we will allow extending bpf by modules.
'bpf in modules' is an interface that is too easy to abuse.
So all of bpf core, helper functions and program types will be builtin.

As far as bpf+tracing the plan is to do bpf+kprobe and bpf+syscalls
first. Then add right set of helpers to make sure that use cases
like 'tcp stack instrumentation' are fully addressed.
Then there were few great ideas of accelerating kprobes
with trace markers and debug tracepoints that we can do later.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-23 Thread Alexei Starovoitov
On Mon, Feb 16, 2015 at 6:26 AM, He Kuang heku...@huawei.com wrote:
 Hi, Alexei

 Another suggestion on bpf syscall interface. Currently, BPF +
 syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on
 commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this
 case, bpf bytecode cannot be loaded to the kernel.

I'm seeing a flurry of use cases for bpf in ovs, tc, tracing, etc
When it's all ready, we can turn that config on by default.

 If we turn the functionality of BPF_SYSCALL into a loadable module, then
 we can use it without any dependencies on the kernel. What about change
 bpf syscall to a /dev node or /sys file which can be exported by a
 kernel module?

I don't think we will allow extending bpf by modules.
'bpf in modules' is an interface that is too easy to abuse.
So all of bpf core, helper functions and program types will be builtin.

As far as bpf+tracing the plan is to do bpf+kprobe and bpf+syscalls
first. Then add right set of helpers to make sure that use cases
like 'tcp stack instrumentation' are fully addressed.
Then there were few great ideas of accelerating kprobes
with trace markers and debug tracepoints that we can do later.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-16 Thread He Kuang

Hi, Alexei

Another suggestion on bpf syscall interface. Currently, BPF +
syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on
commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this
case, bpf bytecode cannot be loaded to the kernel.

If we turn the functionality of BPF_SYSCALL into a loadable module, then
we can use it without any dependencies on the kernel. What about change
bpf syscall to a /dev node or /sys file which can be exported by a
kernel module?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-16 Thread He Kuang

Hi, Alexei

Another suggestion on bpf syscall interface. Currently, BPF +
syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on
commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this
case, bpf bytecode cannot be loaded to the kernel.

If we turn the functionality of BPF_SYSCALL into a loadable module, then
we can use it without any dependencies on the kernel. What about change
bpf syscall to a /dev node or /sys file which can be exported by a
kernel module?

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-14 Thread Alexei Starovoitov
On Wed, Feb 11, 2015 at 11:58 PM, Hekuang  wrote:
>
>>> eBPF is very flexible, which means it is bound to have someone use it
>>> in a way you never dreamed of, and that will be what bites you in the
>>> end (pun intended).
>>
>> understood :)
>> let's start slow then with bpf+syscall and bpf+kprobe only.
>
>
> I think BPF + system calls/kprobes can meet our use case
> (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be
> improved.
>
> I suggest that you can improve bpf+kprobes when attached to function
> headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx->arg1,
> arg2.., then top models and architectures can be separated by bpf.
>
> BPF bytecode is cross-platform, but what we can get by using bpf+kprobes
> is a 'regs->rdx' kind of information, such information is both
> architecture and kernel version related.

for kprobes in the middle of the function, kernel cannot
convert pt_regs into argN. Placement was decided by compiler
and can only be found in debug info.
I think bpf+kprobe will be using it when it is available.
When there is no debug info, kprobes will be limited
to function entry and mapping of regs/stack into
argN can be done by user space depending on architecture.
So user tracing scripts in some higher level language
can be kernel/arch independent when 'perf probe+bpf'
is loading them on the fly on the given machine.

> We hope to establish some models for describing kernel procedures such
> as IO and network, which requires that it does not rely on architecture
> and does not rely to a specific kernel version as much as possible.

That's obviously a goal, but it requires a new approach to tracepoints.
I think a lot of great ideas were discussed in this thread, so I'm
hopeful that we'll come up with solution that will satisfy even
strictest Peter's requirements :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-14 Thread Alexei Starovoitov
On Wed, Feb 11, 2015 at 7:51 AM, Steven Rostedt  wrote:
> On Tue, 10 Feb 2015 22:33:05 -0800
> Alexei Starovoitov  wrote:
>
>
>> fair enough.
>> Something like TRACE_MARKER(arg1, arg2) that prints
>> it was hit without accessing the args would be enough.
>> Without any args it is indeed a 'fast kprobe' only.
>> Debug info would still be needed to access
>> function arguments.
>> On x64 function entry point and x64 abi make it easy
>> to access args, but i386 or kprobe in the middle
>> lose visibility when debug info is not available.
>> TRACE_MARKER (with few key args that function
>> is operating on) is enough to achieve roughly the same
>> as kprobe without debug info.
>
> Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
> possibly a full trace event layout.
>
> The difference would be that the trace events do not show up unless you
> have "trace_debug" on the command line. This should prevent
> applications from depending on them.
>
> I could even do the nasty dmesg output like I do with trace_printk()s,
> that would definitely keep a production kernel from adding it by
> default.
>
> When trace_debug is not there, the trace points could still be accessed
> but perhaps only via bpf, or act like a simple trace marker.

I think that is a great idea!
Makes it clear that all prints are for debugging and
no abi guarantees.

> Note, if you need ids, I rather have them in another directory than
> tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs
> simple.

indeed. makes sense. no reason to burn fs memory just
to get an id from name. may be perf_event api can be
extended to lookup id from name. I think perf will benefit as well.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-14 Thread Alexei Starovoitov
On Wed, Feb 11, 2015 at 5:28 AM, Peter Zijlstra  wrote:
>
> We're compiling the BPF stuff against the 'current' kernel headers
> right?

the tracex1 example is pulling kernel headers to demonstrate
how bpf_fetch*() helpers can be used to walk kernel structures
without debug info.
The other examples don't need any internal headers.

> So would enforcing module versioning not be sufficient?

I'm going to redo the ex1 to use kprobe and some form of
version check. Indeed module-like versioning should
be enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-14 Thread Alexei Starovoitov
On Wed, Feb 11, 2015 at 11:58 PM, Hekuang heku...@huawei.com wrote:

 eBPF is very flexible, which means it is bound to have someone use it
 in a way you never dreamed of, and that will be what bites you in the
 end (pun intended).

 understood :)
 let's start slow then with bpf+syscall and bpf+kprobe only.


 I think BPF + system calls/kprobes can meet our use case
 (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be
 improved.

 I suggest that you can improve bpf+kprobes when attached to function
 headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx-arg1,
 arg2.., then top models and architectures can be separated by bpf.

 BPF bytecode is cross-platform, but what we can get by using bpf+kprobes
 is a 'regs-rdx' kind of information, such information is both
 architecture and kernel version related.

for kprobes in the middle of the function, kernel cannot
convert pt_regs into argN. Placement was decided by compiler
and can only be found in debug info.
I think bpf+kprobe will be using it when it is available.
When there is no debug info, kprobes will be limited
to function entry and mapping of regs/stack into
argN can be done by user space depending on architecture.
So user tracing scripts in some higher level language
can be kernel/arch independent when 'perf probe+bpf'
is loading them on the fly on the given machine.

 We hope to establish some models for describing kernel procedures such
 as IO and network, which requires that it does not rely on architecture
 and does not rely to a specific kernel version as much as possible.

That's obviously a goal, but it requires a new approach to tracepoints.
I think a lot of great ideas were discussed in this thread, so I'm
hopeful that we'll come up with solution that will satisfy even
strictest Peter's requirements :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-14 Thread Alexei Starovoitov
On Wed, Feb 11, 2015 at 5:28 AM, Peter Zijlstra pet...@infradead.org wrote:

 We're compiling the BPF stuff against the 'current' kernel headers
 right?

the tracex1 example is pulling kernel headers to demonstrate
how bpf_fetch*() helpers can be used to walk kernel structures
without debug info.
The other examples don't need any internal headers.

 So would enforcing module versioning not be sufficient?

I'm going to redo the ex1 to use kprobe and some form of
version check. Indeed module-like versioning should
be enough.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-14 Thread Alexei Starovoitov
On Wed, Feb 11, 2015 at 7:51 AM, Steven Rostedt rost...@goodmis.org wrote:
 On Tue, 10 Feb 2015 22:33:05 -0800
 Alexei Starovoitov a...@plumgrid.com wrote:


 fair enough.
 Something like TRACE_MARKER(arg1, arg2) that prints
 it was hit without accessing the args would be enough.
 Without any args it is indeed a 'fast kprobe' only.
 Debug info would still be needed to access
 function arguments.
 On x64 function entry point and x64 abi make it easy
 to access args, but i386 or kprobe in the middle
 lose visibility when debug info is not available.
 TRACE_MARKER (with few key args that function
 is operating on) is enough to achieve roughly the same
 as kprobe without debug info.

 Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
 possibly a full trace event layout.

 The difference would be that the trace events do not show up unless you
 have trace_debug on the command line. This should prevent
 applications from depending on them.

 I could even do the nasty dmesg output like I do with trace_printk()s,
 that would definitely keep a production kernel from adding it by
 default.

 When trace_debug is not there, the trace points could still be accessed
 but perhaps only via bpf, or act like a simple trace marker.

I think that is a great idea!
Makes it clear that all prints are for debugging and
no abi guarantees.

 Note, if you need ids, I rather have them in another directory than
 tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs
 simple.

indeed. makes sense. no reason to burn fs memory just
to get an id from name. may be perf_event api can be
extended to lookup id from name. I think perf will benefit as well.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Hekuang



eBPF is very flexible, which means it is bound to have someone use it
in a way you never dreamed of, and that will be what bites you in the
end (pun intended).

understood :)
let's start slow then with bpf+syscall and bpf+kprobe only.



I think BPF + system calls/kprobes can meet our use case
(https://lkml.org/lkml/2015/2/6/44), but there're some issues to be
improved.

I suggest that you can improve bpf+kprobes when attached to function
headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx->arg1,
arg2.., then top models and architectures can be separated by bpf.

BPF bytecode is cross-platform, but what we can get by using bpf+kprobes
is a 'regs->rdx' kind of information, such information is both
architecture and kernel version related.

We hope to establish some models for describing kernel procedures such
as IO and network, which requires that it does not rely on architecture
and does not rely to a specific kernel version as much as possible.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Steven Rostedt
On Tue, 10 Feb 2015 22:33:05 -0800
Alexei Starovoitov  wrote:


> fair enough.
> Something like TRACE_MARKER(arg1, arg2) that prints
> it was hit without accessing the args would be enough.
> Without any args it is indeed a 'fast kprobe' only.
> Debug info would still be needed to access
> function arguments.
> On x64 function entry point and x64 abi make it easy
> to access args, but i386 or kprobe in the middle
> lose visibility when debug info is not available.
> TRACE_MARKER (with few key args that function
> is operating on) is enough to achieve roughly the same
> as kprobe without debug info.

Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
possibly a full trace event layout.

The difference would be that the trace events do not show up unless you
have "trace_debug" on the command line. This should prevent
applications from depending on them.

I could even do the nasty dmesg output like I do with trace_printk()s,
that would definitely keep a production kernel from adding it by
default.

When trace_debug is not there, the trace points could still be accessed
but perhaps only via bpf, or act like a simple trace marker.

Note, if you need ids, I rather have them in another directory than
tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs
simple.

This is something that needs to probably be discussed at bit.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:53:59PM -0500, Steven Rostedt wrote:
> > >> In the future we might add a tracepoint and pass a single
> > >> pointer to interesting data struct to it. bpf programs will walk
> > >> data structures 'as safe modules' via bpf_fetch*() methods
> > >> without exposing it as ABI.
> > >
> > > Will this work if that structure changes? When the field we are looking
> > > for no longer exists?
> > 
> > bpf_fetch*() is the same mechanism as perf probe.
> > If there is a mistake by user space tools, the program
> > will be reading some junk, but it won't be crashing.
> 
> But what if the userspace tool depends on that value returning
> something meaningful. If it was meaningful in the past, it will have to
> be meaningful in the future, even if the internals of the kernel make
> it otherwise.

We're compiling the BPF stuff against the 'current' kernel headers
right? So would enforcing module versioning not be sufficient?

We already break out-of-tree modules without a second thought, the
module interface is not guaranteed. They just need to cope with it.

Anything using the kernel headers to look into the kernel guts should be
bound to the same rules.

So if we think of BFP stuff as out-of-tree modules, and treat them the
same, I see no problem.

I'm sure some BFP 'scripts' will turn in the same right mess that
out-of-tree modules are, with endless #ifdef version checks, but hey,
not my problem ;-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> > It would need to do more that that. It may have to calculate the value
> > that it returns, as the internal value may be different with different
> > kernels.
> 
> back to 'prio'... the 'prio' accessible from the program
> should be the same 'prio' that we're storing inside task_struct.

Its not, task_struct::prio is an entirely different value than the one
used in sched_param::sched_priority / sched_attr::sched_priority.

And the 'problem' is, prio is only relevant to SCHED_RR/SCHED_FIFO
tasks, we have more classes.

> No extra conversions.

We're not going to add runtime/space overhead to the kernel just because
someone might maybe someday trace the kernel.

That leaves the option of either tracing the kernel internal value and
userspace will just have to deal with it, or making the tracepoint more
expensive by having it do the conversion.

Now the big question is, what do we do when we add/extend a scheduling
class and have more parameters? We cannot change the tracepoint because
userspace assumes format. And I simply refuse to add a second -- because
that will end up being a third and fourth etc.. -- tracepoint right next
to it with a different layout.

Note that we just did add a class, we grew SCHED_DEADLINE a few releases
ago, and that has 3 parameters (or 6 depending on how you look at it).
You currently cannot 'debug' that with the existing tracepoints.

In short, I loathe traceevents, they're more trouble than they're worth.

Now I do love the infrastructure, its very useful debugging, but that's
all with local hacks that will never see the light of day.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> well, ->prio and ->pid are already printed by sched tracepoints
> and their meaning depends on scheduler. So users taking that
> into account.

Right, so trace_events were/are root only, and root 'should' be in the
root pid namespace, and therefore pid magically works.

And I'm not sure, but I don't think the 'nested' root available from
containers should have access to ftrace, so that should not be an issue.

Perf tries real hard to present PERF_SAMPLE_PID data in the pid
namespace of the task that created the event.

As to prio; yes this is a prime example of suck, I would love to change
that but cannot :-(. The only solace I have is that everybody who is
relying on it is broken.

There is a very good reason I'm against adding more tracepoints to the
scheduler, its a nightmare.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
> >> not all tools use libtraceevent.
> >> gdb calls perf_event_open directly:
> >> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
> >> and parses PERF_RECORD_SAMPLE as a binary.
> >> In this case it's branch records, but I think we never said anywhere
> >> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
> >> in this particular order.
> >
> > What particular order? Note, that's a hardware event, not a software
> > one.
> 
> yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
> when attr.sample_type = IP | ADDR whereas this is an
> internal order of 'if' statements inside perf_output_sample()...

This is indeed promised in the data layout description in
include/uapi/linux/perf_event.h.

There is no other way to find where fields are; perf relies on
predetermined order of fields coupled with the requested field bitmask.

So we promise the order: id, ip, pid, tid, time, addr,.. etc.

So if you request IP and ADDR but none of the other fields, then you
know your sample will start with IP and then contain ADDR.

The traceevent thing has a debug/trace-fs format description of fields
that is supposed to be used.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:53:59PM -0500, Steven Rostedt wrote:
   In the future we might add a tracepoint and pass a single
   pointer to interesting data struct to it. bpf programs will walk
   data structures 'as safe modules' via bpf_fetch*() methods
   without exposing it as ABI.
  
   Will this work if that structure changes? When the field we are looking
   for no longer exists?
  
  bpf_fetch*() is the same mechanism as perf probe.
  If there is a mistake by user space tools, the program
  will be reading some junk, but it won't be crashing.
 
 But what if the userspace tool depends on that value returning
 something meaningful. If it was meaningful in the past, it will have to
 be meaningful in the future, even if the internals of the kernel make
 it otherwise.

We're compiling the BPF stuff against the 'current' kernel headers
right? So would enforcing module versioning not be sufficient?

We already break out-of-tree modules without a second thought, the
module interface is not guaranteed. They just need to cope with it.

Anything using the kernel headers to look into the kernel guts should be
bound to the same rules.

So if we think of BFP stuff as out-of-tree modules, and treat them the
same, I see no problem.

I'm sure some BFP 'scripts' will turn in the same right mess that
out-of-tree modules are, with endless #ifdef version checks, but hey,
not my problem ;-)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
  not all tools use libtraceevent.
  gdb calls perf_event_open directly:
  https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
  and parses PERF_RECORD_SAMPLE as a binary.
  In this case it's branch records, but I think we never said anywhere
  that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
  in this particular order.
 
  What particular order? Note, that's a hardware event, not a software
  one.
 
 yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
 when attr.sample_type = IP | ADDR whereas this is an
 internal order of 'if' statements inside perf_output_sample()...

This is indeed promised in the data layout description in
include/uapi/linux/perf_event.h.

There is no other way to find where fields are; perf relies on
predetermined order of fields coupled with the requested field bitmask.

So we promise the order: id, ip, pid, tid, time, addr,.. etc.

So if you request IP and ADDR but none of the other fields, then you
know your sample will start with IP and then contain ADDR.

The traceevent thing has a debug/trace-fs format description of fields
that is supposed to be used.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
 well, -prio and -pid are already printed by sched tracepoints
 and their meaning depends on scheduler. So users taking that
 into account.

Right, so trace_events were/are root only, and root 'should' be in the
root pid namespace, and therefore pid magically works.

And I'm not sure, but I don't think the 'nested' root available from
containers should have access to ftrace, so that should not be an issue.

Perf tries real hard to present PERF_SAMPLE_PID data in the pid
namespace of the task that created the event.

As to prio; yes this is a prime example of suck, I would love to change
that but cannot :-(. The only solace I have is that everybody who is
relying on it is broken.

There is a very good reason I'm against adding more tracepoints to the
scheduler, its a nightmare.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Peter Zijlstra
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote:
  It would need to do more that that. It may have to calculate the value
  that it returns, as the internal value may be different with different
  kernels.
 
 back to 'prio'... the 'prio' accessible from the program
 should be the same 'prio' that we're storing inside task_struct.

Its not, task_struct::prio is an entirely different value than the one
used in sched_param::sched_priority / sched_attr::sched_priority.

And the 'problem' is, prio is only relevant to SCHED_RR/SCHED_FIFO
tasks, we have more classes.

 No extra conversions.

We're not going to add runtime/space overhead to the kernel just because
someone might maybe someday trace the kernel.

That leaves the option of either tracing the kernel internal value and
userspace will just have to deal with it, or making the tracepoint more
expensive by having it do the conversion.

Now the big question is, what do we do when we add/extend a scheduling
class and have more parameters? We cannot change the tracepoint because
userspace assumes format. And I simply refuse to add a second -- because
that will end up being a third and fourth etc.. -- tracepoint right next
to it with a different layout.

Note that we just did add a class, we grew SCHED_DEADLINE a few releases
ago, and that has 3 parameters (or 6 depending on how you look at it).
You currently cannot 'debug' that with the existing tracepoints.

In short, I loathe traceevents, they're more trouble than they're worth.

Now I do love the infrastructure, its very useful debugging, but that's
all with local hacks that will never see the light of day.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Steven Rostedt
On Tue, 10 Feb 2015 22:33:05 -0800
Alexei Starovoitov a...@plumgrid.com wrote:


 fair enough.
 Something like TRACE_MARKER(arg1, arg2) that prints
 it was hit without accessing the args would be enough.
 Without any args it is indeed a 'fast kprobe' only.
 Debug info would still be needed to access
 function arguments.
 On x64 function entry point and x64 abi make it easy
 to access args, but i386 or kprobe in the middle
 lose visibility when debug info is not available.
 TRACE_MARKER (with few key args that function
 is operating on) is enough to achieve roughly the same
 as kprobe without debug info.

Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and
possibly a full trace event layout.

The difference would be that the trace events do not show up unless you
have trace_debug on the command line. This should prevent
applications from depending on them.

I could even do the nasty dmesg output like I do with trace_printk()s,
that would definitely keep a production kernel from adding it by
default.

When trace_debug is not there, the trace points could still be accessed
but perhaps only via bpf, or act like a simple trace marker.

Note, if you need ids, I rather have them in another directory than
tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs
simple.

This is something that needs to probably be discussed at bit.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-11 Thread Hekuang



eBPF is very flexible, which means it is bound to have someone use it
in a way you never dreamed of, and that will be what bites you in the
end (pun intended).

understood :)
let's start slow then with bpf+syscall and bpf+kprobe only.



I think BPF + system calls/kprobes can meet our use case
(https://lkml.org/lkml/2015/2/6/44), but there're some issues to be
improved.

I suggest that you can improve bpf+kprobes when attached to function
headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx-arg1,
arg2.., then top models and architectures can be separated by bpf.

BPF bytecode is cross-platform, but what we can get by using bpf+kprobes
is a 'regs-rdx' kind of information, such information is both
architecture and kernel version related.

We hope to establish some models for describing kernel procedures such
as IO and network, which requires that it does not rely on architecture
and does not rely to a specific kernel version as much as possible.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt  wrote:
>> > Again, this would mean they become invisible to ftrace, and even
>> > ftrace_dump_on_oops.
>>
>> yes, since these new tracepoints have no meat inside them.
>> They're placeholders sitting idle and waiting for bpf to do
>> something useful with them.
>
> Hmm, I have a patch somewhere (never posted it), that add
> TRACE_MARKER(), which basically would just print that it was hit. But
> no data was passed to it. Something like that I would be more inclined
> to take. Then the question is, what can bpf access there? Could just be
> a place holder to add a "fast kprobe". This way, it can show up in
> trace events with enable and and all that, it just wont have any data
> to print out besides the normal pid, flags, etc.
>
> But because it will inject a nop, that could be converted to a jump, it
> will give you the power of kprobes but with the speed of a tracepoint.

fair enough.
Something like TRACE_MARKER(arg1, arg2) that prints
it was hit without accessing the args would be enough.
Without any args it is indeed a 'fast kprobe' only.
Debug info would still be needed to access
function arguments.
On x64 function entry point and x64 abi make it easy
to access args, but i386 or kprobe in the middle
lose visibility when debug info is not available.
TRACE_MARKER (with few key args that function
is operating on) is enough to achieve roughly the same
as kprobe without debug info.

>> > I'm not fully understanding what is to be exported by this new ABI. If
>> > the fields available, will always be available, then why can't the
>> > appear in a TP_printk()?
>>
>> say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
>> It will have only one argument 'skb'.
>> bpf program will read and print skb fields the way it likes
>> for particular tracing scenario.
>> So instead of making
>> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
>> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
>> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
>> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
>> gso_type=%#x",...
>> the abi exposed via trace_pipe (as it is today),
>> the new tracepoint_bpf abi is presence of 'skb' pointer as one
>> and only argument to bpf program.
>> Future refactoring of netif_rx would need to guarantee
>> that trace_netif_rx_entry(skb) is called. that's it.
>> imo such tracepoints are much easier to deal with during
>> code changes.
>
> But what can you access from that skb that is guaranteed to be there?
> If you say anything, then there's no reason it can't be added to the
> printk as well.

programs can access any field via bpf_fetch*() helpers which
make them kernel layout dependent or via user-ized sk_buff
with few fields which is portable.
In both cases kernel/user abi is only 'skb' pointer.
whether it's debugging program that needs full access
via fetch* helpers or portable program that uses stable api
it's up to program author.
Just like kprobes, it's clear, that if program is using
fetch* helpers it's doing it without any abi guarantees.
'perf probe' and 'bpf with fetch* helpers are the same.
perf probe creates wrappers on top of probe_kernel_read and
bpf_fetch* helpers are wrappers on top of probe_kernel_read.
Complains that 'my kprobe with flags=%cx mode=+4($stack)
stopped working in new kernel' are equivalent to complains
that program with bpf_fetch* stopped working.

Whereas if program is using user-ized structs it will work
across kernel versions, though it will be able to see
only very limited slice of in-kernel data.

>> May be some of the existing tracepoints like this one that
>> takes one argument can be marked 'bpf-ready', so that
>> programs can attach to them only.
>
> I really hate the idea of adding tracepoints that ftrace can't use. It
> basically kills the entire busy box usage scenario, as boards that have
> extremely limited userspace still make full use of ftrace via the
> existing tracepoints.

agree. I think your trace_marker with few args is a good
middle ground.

> I still don't see the argument that adding data via the bpf functions
> is any different than adding those same items to fields in an event.
> Once you add a bpf function, then you must maintain those fields.
>
> Look, you can always add more to a TP_printk(), as that is standard
> with all text file kernel parsing. Like stat in /proc. Fields can not
> be removed, but more can always be appended to the end.
>
> Any tool that parses trace_pipe is broken if it can't handle extended
> fields. The api can be extended, and for text files, that is by
> appending to them.

I agree that any text parsing script should be able to cope
with additional args without problems. I think it's a fear of
<1% breakage is causing maintainers to avoid any changes
to tracepoints even when they just add few args to the end
of TP_printk.
When tracepoints stop printing and the only thing they see
is single pointer to a 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Tue, 10 Feb 2015 19:04:55 -0800
Alexei Starovoitov  wrote:


> > You mean to be completely invisible to ftrace? And the debugfs/tracefs
> > directory?
> 
> I mean it will be seen in tracefs to get 'id', but without 
> enable/format/filter

In other words, invisible to ftrace. I'm not sure I'll be on board to
work with that.


> > Again, this would mean they become invisible to ftrace, and even
> > ftrace_dump_on_oops.
> 
> yes, since these new tracepoints have no meat inside them.
> They're placeholders sitting idle and waiting for bpf to do
> something useful with them.

Hmm, I have a patch somewhere (never posted it), that add
TRACE_MARKER(), which basically would just print that it was hit. But
no data was passed to it. Something like that I would be more inclined
to take. Then the question is, what can bpf access there? Could just be
a place holder to add a "fast kprobe". This way, it can show up in
trace events with enable and and all that, it just wont have any data
to print out besides the normal pid, flags, etc.

But because it will inject a nop, that could be converted to a jump, it
will give you the power of kprobes but with the speed of a tracepoint.

> 
> > I'm not fully understanding what is to be exported by this new ABI. If
> > the fields available, will always be available, then why can't the
> > appear in a TP_printk()?
> 
> say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
> It will have only one argument 'skb'.
> bpf program will read and print skb fields the way it likes
> for particular tracing scenario.
> So instead of making
> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
> gso_type=%#x",...
> the abi exposed via trace_pipe (as it is today),
> the new tracepoint_bpf abi is presence of 'skb' pointer as one
> and only argument to bpf program.
> Future refactoring of netif_rx would need to guarantee
> that trace_netif_rx_entry(skb) is called. that's it.
> imo such tracepoints are much easier to deal with during
> code changes.

But what can you access from that skb that is guaranteed to be there?
If you say anything, then there's no reason it can't be added to the
printk as well.

> 
> May be some of the existing tracepoints like this one that
> takes one argument can be marked 'bpf-ready', so that
> programs can attach to them only.

I really hate the idea of adding tracepoints that ftrace can't use. It
basically kills the entire busy box usage scenario, as boards that have
extremely limited userspace still make full use of ftrace via the
existing tracepoints.

I still don't see the argument that adding data via the bpf functions
is any different than adding those same items to fields in an event.
Once you add a bpf function, then you must maintain those fields.

Look, you can always add more to a TP_printk(), as that is standard
with all text file kernel parsing. Like stat in /proc. Fields can not
be removed, but more can always be appended to the end.

Any tool that parses trace_pipe is broken if it can't handle extended
fields. The api can be extended, and for text files, that is by
appending to them.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 4:50 PM, Steven Rostedt  wrote:
>
>> >> But some maintainers think of them as ABI, whereas others
>> >> are using them freely. imo it's time to remove ambiguity.
>> >
>> > I would love to, and have brought this up at Kernel Summit more than
>> > once with no solution out of it.
>>
>> let's try it again at plumbers in august?
>
> Well, we need a statement from Linus. And it would be nice if we could
> also get Ingo involved in the discussion, but he seldom comes to
> anything but Kernel Summit.

+1

> BTW, I wonder if I could make a simple compiler in the kernel that
> would translate the current ftrace filters into a BPF program, where it
> could use the program and not use the current filter logic.

yep. I've sent that patch last year.
It converted pred_tree into bpf program.
I can try to dig it up. It doesn't provide extra programmability
though, just makes filtering logic much faster.

>> imo the solution is DEFINE_EVENT_BPF that doesn't
>> print anything and a bpf program to process it.
>
> You mean to be completely invisible to ftrace? And the debugfs/tracefs
> directory?

I mean it will be seen in tracefs to get 'id', but without enable/format/filter

>> I'm not suggesting to preserve the meaning of 'pid' semantically
>> in all cases. That's not what users would want anyway.
>> I want to allow programs to access important fields and print
>> them in more generic way than current TP_printk does.
>> Then exposed ABI of such tracepoint_bpf is smaller than
>> with current tracepoints.
>
> Again, this would mean they become invisible to ftrace, and even
> ftrace_dump_on_oops.

yes, since these new tracepoints have no meat inside them.
They're placeholders sitting idle and waiting for bpf to do
something useful with them.

> I'm not fully understanding what is to be exported by this new ABI. If
> the fields available, will always be available, then why can't the
> appear in a TP_printk()?

say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
It will have only one argument 'skb'.
bpf program will read and print skb fields the way it likes
for particular tracing scenario.
So instead of making
TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
gso_type=%#x",...
the abi exposed via trace_pipe (as it is today),
the new tracepoint_bpf abi is presence of 'skb' pointer as one
and only argument to bpf program.
Future refactoring of netif_rx would need to guarantee
that trace_netif_rx_entry(skb) is called. that's it.
imo such tracepoints are much easier to deal with during
code changes.

May be some of the existing tracepoints like this one that
takes one argument can be marked 'bpf-ready', so that
programs can attach to them only.

>> let's start slow then with bpf+syscall and bpf+kprobe only.
>
> I'm fine with that.

thanks. will wait for merge window to close and will repost.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Tue, 10 Feb 2015 16:22:50 -0800
Alexei Starovoitov  wrote:

> > Yep, and if this becomes a standard, then any change that makes
> > trace_pipe different will be reverted.
> 
> I think reading of trace_pipe is widespread.

I've heard of a few, but nothing that has broken when something changed.

Is it scripts or actual C code?

Again, it matters if people complain about the change.


> >> But some maintainers think of them as ABI, whereas others
> >> are using them freely. imo it's time to remove ambiguity.
> >
> > I would love to, and have brought this up at Kernel Summit more than
> > once with no solution out of it.
> 
> let's try it again at plumbers in august?

Well, we need a statement from Linus. And it would be nice if we could
also get Ingo involved in the discussion, but he seldom comes to
anything but Kernel Summit.

> 
> For now I'm going to drop bpf+tracepoints, since it's so controversial
> and go with bpf+syscall and bpf+kprobe only.

Probably the safest bet.

> 
> Hopefully by august it will be clear what bpf+kprobes can do
> and why I'm excited about bpf+tracepoints in the future.

BTW, I wonder if I could make a simple compiler in the kernel that
would translate the current ftrace filters into a BPF program, where it
could use the program and not use the current filter logic.


> >> These tracepoint will receive one or two pointers to important
> >> structs only. They will not have TP_printk, assign and fields.
> >> The placement and arguments to these new tracepoints
> >> will be an ABI.
> >> All existing tracepoints are not.
> >
> > TP_printk() is not really an issue.
> 
> I think it is. The way things are printed is the most
> visible part of tracepoints and I suspect maintainers don't
> want to add new ones, because internal fields are printed
> and users do parse trace_pipe.
> Recent discussion about tcp instrumentation was
> about adding new tracepoints and a module to print them.
> As soon as something like this is in, the next question
> 'what we're going to do when more arguments need
> to be printed'...

I should rephrase that. It's not that it's not an issue, it's just that
it hasn't been an issue. the trace_pipe code is slow. The
raw_trace_pipe is much faster. Any tool would benefit from using it.

I really need to get a library out to help users do such a thing.


> 
> imo the solution is DEFINE_EVENT_BPF that doesn't
> print anything and a bpf program to process it.

You mean to be completely invisible to ftrace? And the debugfs/tracefs
directory?


> >
> >> it is portable and will run on any kernel.
> >> In uapi header we can define:
> >> struct task_struct_user {
> >>   int pid;
> >>   int prio;
> >
> > Here's a perfect example of something that looks stable to show to
> > user space, but is really a pimple that is hiding cancer.
> >
> > Lets start with pid. We have name spaces. What pid will be put there?
> > We have to show the pid of the name space it is under.
> >
> > Then we have prio. What is prio in the DEADLINE scheduler. It is rather
> > meaningless. Also, it is meaningless in SCHED_OTHER.
> >
> > Also note that even for SCHED_FIFO, the prio is used differently in the
> > kernel than it is in userspace. For the kernel, lower is higher.
> 
> well, ->prio and ->pid are already printed by sched tracepoints
> and their meaning depends on scheduler. So users taking that
> into account.

I know, and Peter hates this.

> I'm not suggesting to preserve the meaning of 'pid' semantically
> in all cases. That's not what users would want anyway.
> I want to allow programs to access important fields and print
> them in more generic way than current TP_printk does.
> Then exposed ABI of such tracepoint_bpf is smaller than
> with current tracepoints.

Again, this would mean they become invisible to ftrace, and even
ftrace_dump_on_oops.

I'm not fully understanding what is to be exported by this new ABI. If
the fields available, will always be available, then why can't the
appear in a TP_printk()?


> > eBPF is very flexible, which means it is bound to have someone use it
> > in a way you never dreamed of, and that will be what bites you in the
> > end (pun intended).
> 
> understood :)
> let's start slow then with bpf+syscall and bpf+kprobe only.

I'm fine with that.

> 
> >> also not all bpf programs are equal.
> >> bpf+existing tracepoint is not ABI
> >
> > Why not?
> 
> well, because we want to see more tracepoints in the kernel.
> We're already struggling to add more.

Still, the question is, even with a new "tracepoint" that limits what
it shows, there still needs to be something that is guaranteed to
always be there. I still don't see how this is different than the
current tracepoints.

> 
> >> bpf+new tracepoint is ABI if programs are not using bpf_fetch
> >
> > How is this different?
> 
> the new ones will be explicit by definition.

Who monitors this?



> > To give you an example, we thought about scrambling the trace event
> > field locations from boot to boot to 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 1:53 PM, Steven Rostedt  wrote:
> On Tue, 10 Feb 2015 11:53:22 -0800
> Alexei Starovoitov  wrote:
>
>> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt  wrote:
>> > On Mon, 9 Feb 2015 22:10:45 -0800
>> > Alexei Starovoitov  wrote:
>> >
>> >> One can argue that current TP_printk format is already an ABI,
>> >> because somebody might be parsing the text output.
>> >
>> > If somebody does, then it is an ABI. Luckily, it's not that useful to
>> > parse, thus it hasn't been an issue. As Linus has stated in the past,
>> > it's not that we can't change ABI interfaces, its just that we can not
>> > change them if there's a user space application that depends on it.
>>
>> there are already tools that parse trace_pipe:
>> https://github.com/brendangregg/perf-tools
>
> Yep, and if this becomes a standard, then any change that makes
> trace_pipe different will be reverted.

I think reading of trace_pipe is widespread.

>> > and expect some events to have specific fields. Now we can add new
>> > fields, or even remove fields that no user space tool is using. This is
>> > because today, tools use libtraceevent to parse the event data.
>>
>> not all tools use libtraceevent.
>> gdb calls perf_event_open directly:
>> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
>> and parses PERF_RECORD_SAMPLE as a binary.
>> In this case it's branch records, but I think we never said anywhere
>> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
>> in this particular order.
>
> What particular order? Note, that's a hardware event, not a software
> one.

yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
when attr.sample_type = IP | ADDR whereas this is an
internal order of 'if' statements inside perf_output_sample()...

>> But some maintainers think of them as ABI, whereas others
>> are using them freely. imo it's time to remove ambiguity.
>
> I would love to, and have brought this up at Kernel Summit more than
> once with no solution out of it.

let's try it again at plumbers in august?

For now I'm going to drop bpf+tracepoints, since it's so controversial
and go with bpf+syscall and bpf+kprobe only.

Hopefully by august it will be clear what bpf+kprobes can do
and why I'm excited about bpf+tracepoints in the future.

>> The idea for new style of tracepoints is the following:
>> introduce new macro: DEFINE_EVENT_USER
>> and let programs attach to them.
>
> We actually have that today. But it's TRACE_EVENT_FLAGS(), although
> that should be cleaned up a bit. Frederic added it to label events that
> are safe for perf non root. It seems to be used currently only for
> syscalls.

I didn't mean to let unprivileged apps to use.
Better name probably would be: DEFINE_EVENT_BPF
and only let bpf programs use it.

>> These tracepoint will receive one or two pointers to important
>> structs only. They will not have TP_printk, assign and fields.
>> The placement and arguments to these new tracepoints
>> will be an ABI.
>> All existing tracepoints are not.
>
> TP_printk() is not really an issue.

I think it is. The way things are printed is the most
visible part of tracepoints and I suspect maintainers don't
want to add new ones, because internal fields are printed
and users do parse trace_pipe.
Recent discussion about tcp instrumentation was
about adding new tracepoints and a module to print them.
As soon as something like this is in, the next question
'what we're going to do when more arguments need
to be printed'...

imo the solution is DEFINE_EVENT_BPF that doesn't
print anything and a bpf program to process it.
Then programs decide what they like to pass to
user space and in what format.
The kernel/user ABI is the position of this new tracepoint
and arguments that are passed in.
bpf program takes care of walking pointers, extracting
interesting fields, aggregating and passing them to
user in the format specific to this particular program.
Then when user wants to collect more fields it just
changes the program and corresponding userland side.

>> The position of new tracepoints and their arguments
>> will be an ABI and the programs can be both.
>
> You means "special tracepoints" one that does export the arguments?
>
> Question is, how many maintainers will add these, knowing that they
> will have to be forever maintained as is.

Obviously we would need to prove usefulness of
tracepoint_bpf before they're accepted.
Hopefully bpf+kprobe will make an impression on its own
and users would want similar but without debug info.

>> If program is using bpf_fetch*() helpers it obviously
>> wants to access internal data structures, so
>> it's really nothing more, but 'safe kernel module'
>> and kernel layout specific.
>> Both old and new tracepoints + programs will be used
>> for live kernel debugging.
>>
>> If program is accessing user-ized data structures then
>
> Technically, the TP_struct__entry is a user-ized structure.
>
>> it is portable and will run on any kernel.
>> In uapi 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Tue, 10 Feb 2015 11:53:22 -0800
Alexei Starovoitov  wrote:

> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt  wrote:
> > On Mon, 9 Feb 2015 22:10:45 -0800
> > Alexei Starovoitov  wrote:
> >
> >> One can argue that current TP_printk format is already an ABI,
> >> because somebody might be parsing the text output.
> >
> > If somebody does, then it is an ABI. Luckily, it's not that useful to
> > parse, thus it hasn't been an issue. As Linus has stated in the past,
> > it's not that we can't change ABI interfaces, its just that we can not
> > change them if there's a user space application that depends on it.
> 
> there are already tools that parse trace_pipe:
> https://github.com/brendangregg/perf-tools

Yep, and if this becomes a standard, then any change that makes
trace_pipe different will be reverted.

> 
> > and expect some events to have specific fields. Now we can add new
> > fields, or even remove fields that no user space tool is using. This is
> > because today, tools use libtraceevent to parse the event data.
> 
> not all tools use libtraceevent.
> gdb calls perf_event_open directly:
> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
> and parses PERF_RECORD_SAMPLE as a binary.
> In this case it's branch records, but I think we never said anywhere
> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
> in this particular order.

What particular order? Note, that's a hardware event, not a software
one.

> 
> > This is why I'm nervous about exporting the parameters of the trace
> > event call. Right now, those parameters can always change, because the
> > only way to know they exist is by looking at the code. And currently,
> > there's no way to interact with those parameters. Once we have eBPF in
> > mainline, we now have a way to interact with the parameters and if
> > those parameters change, then the eBPF program will break, and if eBPF
> > can be part of a user space tool, that will break that tool and
> > whatever change in the trace point that caused this breakage would have
> > to be reverted. IOW, this can limit development in the kernel.
> 
> it can limit development unless we say that bpf programs
> that attach to tracepoints are not part of ABI.
> Easy enough to add large comment similar to perf_event.h

Again, it doesn't matter what we say. Linus made that very clear. He
stated if you provide an interface, and someone uses that interface for
a user space application, and they complain if it breaks, that is just
reason to revert whatever broke it.

> 
> > Al Viro currently does not let any tracepoint in VFS because he doesn't
> > want the internals of that code locked to an ABI. He's right to be
> > worried.
> 
> Same with networking bits. We don't want tracepoints to limit
> kernel development, but we want debuggability and kernel
> analytics.
> All existing tracepoints defined via DEFINE_EVENT should
> not be an ABI.

I agree, but that doesn't make it so :-/

> But some maintainers think of them as ABI, whereas others
> are using them freely. imo it's time to remove ambiguity.

I would love to, and have brought this up at Kernel Summit more than
once with no solution out of it.

> 
> The idea for new style of tracepoints is the following:
> introduce new macro: DEFINE_EVENT_USER
> and let programs attach to them.

We actually have that today. But it's TRACE_EVENT_FLAGS(), although
that should be cleaned up a bit. Frederic added it to label events that
are safe for perf non root. It seems to be used currently only for
syscalls.

> These tracepoint will receive one or two pointers to important
> structs only. They will not have TP_printk, assign and fields.
> The placement and arguments to these new tracepoints
> will be an ABI.
> All existing tracepoints are not.

TP_printk() is not really an issue.

> 
> The main reason to attach to tracepoint is that they are
> accessible without debug info (unlike kprobe)

That is, if you have a special bpf call to access variables, right? How
else do you access part of a data structure.

> Another reason is speed. tracepoints are much faster than
> optimized kprobes and for real-time analytics the speed
> is critical.
> 
> The position of new tracepoints and their arguments
> will be an ABI and the programs can be both.

You means "special tracepoints" one that does export the arguments?

Question is, how many maintainers will add these, knowing that they
will have to be forever maintained as is.

> If program is using bpf_fetch*() helpers it obviously
> wants to access internal data structures, so
> it's really nothing more, but 'safe kernel module'
> and kernel layout specific.
> Both old and new tracepoints + programs will be used
> for live kernel debugging.
> 
> If program is accessing user-ized data structures then

Technically, the TP_struct__entry is a user-ized structure.

> it is portable and will run on any kernel.
> In uapi header we can define:
> struct task_struct_user {
>   int pid;
>   int 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt  wrote:
> On Mon, 9 Feb 2015 22:10:45 -0800
> Alexei Starovoitov  wrote:
>
>> One can argue that current TP_printk format is already an ABI,
>> because somebody might be parsing the text output.
>
> If somebody does, then it is an ABI. Luckily, it's not that useful to
> parse, thus it hasn't been an issue. As Linus has stated in the past,
> it's not that we can't change ABI interfaces, its just that we can not
> change them if there's a user space application that depends on it.

there are already tools that parse trace_pipe:
https://github.com/brendangregg/perf-tools

> and expect some events to have specific fields. Now we can add new
> fields, or even remove fields that no user space tool is using. This is
> because today, tools use libtraceevent to parse the event data.

not all tools use libtraceevent.
gdb calls perf_event_open directly:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
and parses PERF_RECORD_SAMPLE as a binary.
In this case it's branch records, but I think we never said anywhere
that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
in this particular order.

> This is why I'm nervous about exporting the parameters of the trace
> event call. Right now, those parameters can always change, because the
> only way to know they exist is by looking at the code. And currently,
> there's no way to interact with those parameters. Once we have eBPF in
> mainline, we now have a way to interact with the parameters and if
> those parameters change, then the eBPF program will break, and if eBPF
> can be part of a user space tool, that will break that tool and
> whatever change in the trace point that caused this breakage would have
> to be reverted. IOW, this can limit development in the kernel.

it can limit development unless we say that bpf programs
that attach to tracepoints are not part of ABI.
Easy enough to add large comment similar to perf_event.h

> Al Viro currently does not let any tracepoint in VFS because he doesn't
> want the internals of that code locked to an ABI. He's right to be
> worried.

Same with networking bits. We don't want tracepoints to limit
kernel development, but we want debuggability and kernel
analytics.
All existing tracepoints defined via DEFINE_EVENT should
not be an ABI.
But some maintainers think of them as ABI, whereas others
are using them freely. imo it's time to remove ambiguity.

The idea for new style of tracepoints is the following:
introduce new macro: DEFINE_EVENT_USER
and let programs attach to them.
These tracepoint will receive one or two pointers to important
structs only. They will not have TP_printk, assign and fields.
The placement and arguments to these new tracepoints
will be an ABI.
All existing tracepoints are not.

The main reason to attach to tracepoint is that they are
accessible without debug info (unlike kprobe)
Another reason is speed. tracepoints are much faster than
optimized kprobes and for real-time analytics the speed
is critical.

The position of new tracepoints and their arguments
will be an ABI and the programs can be both.
If program is using bpf_fetch*() helpers it obviously
wants to access internal data structures, so
it's really nothing more, but 'safe kernel module'
and kernel layout specific.
Both old and new tracepoints + programs will be used
for live kernel debugging.

If program is accessing user-ized data structures then
it is portable and will run on any kernel.
In uapi header we can define:
struct task_struct_user {
  int pid;
  int prio;
};
and let bpf programs access it via real 'struct task_struct *'
pointer passed into tracepoint.
bpf loader will translate offsets and sizes used inside
the program into real task_struct's offsets and loads.
(all structs are read-only of course)
programs will be fast and kernel independent.
They will be used for analytics (latency, etc)

>> so in some cases we cannot change tracepoints without
>> somebody complaining that his tool broke.
>> In other cases tracepoints are used for debugging only
>> and no one will notice when they change...
>> It was and still a grey area.
>
> Not really. If a tool uses the tracepoint, it can lock that tracepoint
> down. This is exactly what latencytop did. It happened, it's not a
> hypothetical situation.

correct.

>> bpf doesn't change any of that.
>> It actually makes addition of new tracepoints easier.
>
> I totally disagree. It adds more ways to see inside the kernel, and if
> user space depends on this, it adds more ways the kernel can not change.
>
> It comes down to how robust eBPF is with the internals of the kernel
> changing. If we limit eBPF to system call tracepoints only, that's
> fine because those have the same ABI as the system call itself. I'm
> worried about the internal tracepoints for scheduling, irqs, file
> systems, etc.

agree. we need to make it clear that existing tracepoints
+ programs is not ABI.

>> In the future we might add a 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Mon, 9 Feb 2015 22:10:45 -0800
Alexei Starovoitov  wrote:

> One can argue that current TP_printk format is already an ABI,
> because somebody might be parsing the text output.

If somebody does, then it is an ABI. Luckily, it's not that useful to
parse, thus it hasn't been an issue. As Linus has stated in the past,
it's not that we can't change ABI interfaces, its just that we can not
change them if there's a user space application that depends on it.

The harder we make it for interface changes to break user space, the
better. The field layouts is a user interface. In fact, some of those
fields must always be there. This is because tools do parse the layout
and expect some events to have specific fields. Now we can add new
fields, or even remove fields that no user space tool is using. This is
because today, tools use libtraceevent to parse the event data.

This is why I'm nervous about exporting the parameters of the trace
event call. Right now, those parameters can always change, because the
only way to know they exist is by looking at the code. And currently,
there's no way to interact with those parameters. Once we have eBPF in
mainline, we now have a way to interact with the parameters and if
those parameters change, then the eBPF program will break, and if eBPF
can be part of a user space tool, that will break that tool and
whatever change in the trace point that caused this breakage would have
to be reverted. IOW, this can limit development in the kernel.

Al Viro currently does not let any tracepoint in VFS because he doesn't
want the internals of that code locked to an ABI. He's right to be
worried.

> so in some cases we cannot change tracepoints without
> somebody complaining that his tool broke.
> In other cases tracepoints are used for debugging only
> and no one will notice when they change...
> It was and still a grey area.

Not really. If a tool uses the tracepoint, it can lock that tracepoint
down. This is exactly what latencytop did. It happened, it's not a
hypothetical situation.

> bpf doesn't change any of that.
> It actually makes addition of new tracepoints easier.

I totally disagree. It adds more ways to see inside the kernel, and if
user space depends on this, it adds more ways the kernel can not change.

It comes down to how robust eBPF is with the internals of the kernel
changing. If we limit eBPF to system call tracepoints only, that's
fine because those have the same ABI as the system call itself. I'm
worried about the internal tracepoints for scheduling, irqs, file
systems, etc.


> In the future we might add a tracepoint and pass a single
> pointer to interesting data struct to it. bpf programs will walk
> data structures 'as safe modules' via bpf_fetch*() methods
> without exposing it as ABI.

Will this work if that structure changes? When the field we are looking
for no longer exists?

> whereas today we pass a lot of fields to tracepoints and
> make all of these fields immutable.

The parameters passed to the tracepoint are not shown to userspace and
can change at will. Now, we present the final parsing of the parameters
that convert to fields. As all currently known tools uses
libtraceevent.a, and parse the format files, those fields can move
around and even change in size. The structures are not immutable. The
fields are locked down if user space relies on them. But they can move
about within the tracepoint, because the parsing allows for it.

Remember, these are processed fields. The result of TP_fast_assign()
and what gets put into the ring buffer. Now what is passed to the
actual tracepoint is not visible by userspace, and in lots of cases, it
is just a pointer to some structure. What eBPF brings to the table is a
way to access this structure from user space. What keeps a structured
passed to a tracepoint from becoming immutable if there's a eBPF
program that expects it to have a specific field?


> 
> To me tracepoints are like gdb breakpoints.

Unfortunately, it doesn't matter what you think they are. It doesn't
matter what I think they are. What matters is what Linus thinks they
are and if user space depends on it and Linus decides to revert what
ever change broke that user space program, no matter what we think, we
just screwed ourselves.

I'm being stubborn on this because this is exactly what happened in the
past, which caused every trace point to hold 4 bytes of padding. 4
bytes may not sound like a lot, but when you have 1 million
tracepoints, that's 4 megs of wasted space.

> and bpf programs like live debugger that examine things.

If bpf programs only dealt with kprobes, I may agree. But tracepoints
have already been proven to be a type of ABI. If we open another window
into the kernel, this can screw us later. It's better to solve this now
than when we are fighting with Linus over user space breakage.

> 
> the next step is to be able to write bpf scripts on the fly
> without leaving debugger. Something like perf probe +
> editor + live execution. 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Mon, 9 Feb 2015 21:51:05 -0800
Alexei Starovoitov  wrote:

> On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt  wrote:
> >
> > Looks like this is entirely perf based and does not interact with
> > ftrace at all. In other words, it's perf not tracing.
> >
> > It makes more sense to go through tip than the tracing tree.
> 
> well, all of earlier series were based on ftrace only,
> but I was given convincing enough arguments that
> perf_even_open+ioctl is a better interface :)
> Ok. will rebase on tip in the next version.

That's fine. If this proves useful, I'll probably add the interface
back to ftrace as well.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 4:50 PM, Steven Rostedt rost...@goodmis.org wrote:

  But some maintainers think of them as ABI, whereas others
  are using them freely. imo it's time to remove ambiguity.
 
  I would love to, and have brought this up at Kernel Summit more than
  once with no solution out of it.

 let's try it again at plumbers in august?

 Well, we need a statement from Linus. And it would be nice if we could
 also get Ingo involved in the discussion, but he seldom comes to
 anything but Kernel Summit.

+1

 BTW, I wonder if I could make a simple compiler in the kernel that
 would translate the current ftrace filters into a BPF program, where it
 could use the program and not use the current filter logic.

yep. I've sent that patch last year.
It converted pred_tree into bpf program.
I can try to dig it up. It doesn't provide extra programmability
though, just makes filtering logic much faster.

 imo the solution is DEFINE_EVENT_BPF that doesn't
 print anything and a bpf program to process it.

 You mean to be completely invisible to ftrace? And the debugfs/tracefs
 directory?

I mean it will be seen in tracefs to get 'id', but without enable/format/filter

 I'm not suggesting to preserve the meaning of 'pid' semantically
 in all cases. That's not what users would want anyway.
 I want to allow programs to access important fields and print
 them in more generic way than current TP_printk does.
 Then exposed ABI of such tracepoint_bpf is smaller than
 with current tracepoints.

 Again, this would mean they become invisible to ftrace, and even
 ftrace_dump_on_oops.

yes, since these new tracepoints have no meat inside them.
They're placeholders sitting idle and waiting for bpf to do
something useful with them.

 I'm not fully understanding what is to be exported by this new ABI. If
 the fields available, will always be available, then why can't the
 appear in a TP_printk()?

say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
It will have only one argument 'skb'.
bpf program will read and print skb fields the way it likes
for particular tracing scenario.
So instead of making
TP_printk(dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
gso_type=%#x,...
the abi exposed via trace_pipe (as it is today),
the new tracepoint_bpf abi is presence of 'skb' pointer as one
and only argument to bpf program.
Future refactoring of netif_rx would need to guarantee
that trace_netif_rx_entry(skb) is called. that's it.
imo such tracepoints are much easier to deal with during
code changes.

May be some of the existing tracepoints like this one that
takes one argument can be marked 'bpf-ready', so that
programs can attach to them only.

 let's start slow then with bpf+syscall and bpf+kprobe only.

 I'm fine with that.

thanks. will wait for merge window to close and will repost.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Tue, 10 Feb 2015 19:04:55 -0800
Alexei Starovoitov a...@plumgrid.com wrote:


  You mean to be completely invisible to ftrace? And the debugfs/tracefs
  directory?
 
 I mean it will be seen in tracefs to get 'id', but without 
 enable/format/filter

In other words, invisible to ftrace. I'm not sure I'll be on board to
work with that.


  Again, this would mean they become invisible to ftrace, and even
  ftrace_dump_on_oops.
 
 yes, since these new tracepoints have no meat inside them.
 They're placeholders sitting idle and waiting for bpf to do
 something useful with them.

Hmm, I have a patch somewhere (never posted it), that add
TRACE_MARKER(), which basically would just print that it was hit. But
no data was passed to it. Something like that I would be more inclined
to take. Then the question is, what can bpf access there? Could just be
a place holder to add a fast kprobe. This way, it can show up in
trace events with enable and and all that, it just wont have any data
to print out besides the normal pid, flags, etc.

But because it will inject a nop, that could be converted to a jump, it
will give you the power of kprobes but with the speed of a tracepoint.

 
  I'm not fully understanding what is to be exported by this new ABI. If
  the fields available, will always be available, then why can't the
  appear in a TP_printk()?
 
 say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
 It will have only one argument 'skb'.
 bpf program will read and print skb fields the way it likes
 for particular tracing scenario.
 So instead of making
 TP_printk(dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
 vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
 ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
 mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
 gso_type=%#x,...
 the abi exposed via trace_pipe (as it is today),
 the new tracepoint_bpf abi is presence of 'skb' pointer as one
 and only argument to bpf program.
 Future refactoring of netif_rx would need to guarantee
 that trace_netif_rx_entry(skb) is called. that's it.
 imo such tracepoints are much easier to deal with during
 code changes.

But what can you access from that skb that is guaranteed to be there?
If you say anything, then there's no reason it can't be added to the
printk as well.

 
 May be some of the existing tracepoints like this one that
 takes one argument can be marked 'bpf-ready', so that
 programs can attach to them only.

I really hate the idea of adding tracepoints that ftrace can't use. It
basically kills the entire busy box usage scenario, as boards that have
extremely limited userspace still make full use of ftrace via the
existing tracepoints.

I still don't see the argument that adding data via the bpf functions
is any different than adding those same items to fields in an event.
Once you add a bpf function, then you must maintain those fields.

Look, you can always add more to a TP_printk(), as that is standard
with all text file kernel parsing. Like stat in /proc. Fields can not
be removed, but more can always be appended to the end.

Any tool that parses trace_pipe is broken if it can't handle extended
fields. The api can be extended, and for text files, that is by
appending to them.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Tue, 10 Feb 2015 11:53:22 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt rost...@goodmis.org wrote:
  On Mon, 9 Feb 2015 22:10:45 -0800
  Alexei Starovoitov a...@plumgrid.com wrote:
 
  One can argue that current TP_printk format is already an ABI,
  because somebody might be parsing the text output.
 
  If somebody does, then it is an ABI. Luckily, it's not that useful to
  parse, thus it hasn't been an issue. As Linus has stated in the past,
  it's not that we can't change ABI interfaces, its just that we can not
  change them if there's a user space application that depends on it.
 
 there are already tools that parse trace_pipe:
 https://github.com/brendangregg/perf-tools

Yep, and if this becomes a standard, then any change that makes
trace_pipe different will be reverted.

 
  and expect some events to have specific fields. Now we can add new
  fields, or even remove fields that no user space tool is using. This is
  because today, tools use libtraceevent to parse the event data.
 
 not all tools use libtraceevent.
 gdb calls perf_event_open directly:
 https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
 and parses PERF_RECORD_SAMPLE as a binary.
 In this case it's branch records, but I think we never said anywhere
 that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
 in this particular order.

What particular order? Note, that's a hardware event, not a software
one.

 
  This is why I'm nervous about exporting the parameters of the trace
  event call. Right now, those parameters can always change, because the
  only way to know they exist is by looking at the code. And currently,
  there's no way to interact with those parameters. Once we have eBPF in
  mainline, we now have a way to interact with the parameters and if
  those parameters change, then the eBPF program will break, and if eBPF
  can be part of a user space tool, that will break that tool and
  whatever change in the trace point that caused this breakage would have
  to be reverted. IOW, this can limit development in the kernel.
 
 it can limit development unless we say that bpf programs
 that attach to tracepoints are not part of ABI.
 Easy enough to add large comment similar to perf_event.h

Again, it doesn't matter what we say. Linus made that very clear. He
stated if you provide an interface, and someone uses that interface for
a user space application, and they complain if it breaks, that is just
reason to revert whatever broke it.

 
  Al Viro currently does not let any tracepoint in VFS because he doesn't
  want the internals of that code locked to an ABI. He's right to be
  worried.
 
 Same with networking bits. We don't want tracepoints to limit
 kernel development, but we want debuggability and kernel
 analytics.
 All existing tracepoints defined via DEFINE_EVENT should
 not be an ABI.

I agree, but that doesn't make it so :-/

 But some maintainers think of them as ABI, whereas others
 are using them freely. imo it's time to remove ambiguity.

I would love to, and have brought this up at Kernel Summit more than
once with no solution out of it.

 
 The idea for new style of tracepoints is the following:
 introduce new macro: DEFINE_EVENT_USER
 and let programs attach to them.

We actually have that today. But it's TRACE_EVENT_FLAGS(), although
that should be cleaned up a bit. Frederic added it to label events that
are safe for perf non root. It seems to be used currently only for
syscalls.

 These tracepoint will receive one or two pointers to important
 structs only. They will not have TP_printk, assign and fields.
 The placement and arguments to these new tracepoints
 will be an ABI.
 All existing tracepoints are not.

TP_printk() is not really an issue.

 
 The main reason to attach to tracepoint is that they are
 accessible without debug info (unlike kprobe)

That is, if you have a special bpf call to access variables, right? How
else do you access part of a data structure.

 Another reason is speed. tracepoints are much faster than
 optimized kprobes and for real-time analytics the speed
 is critical.
 
 The position of new tracepoints and their arguments
 will be an ABI and the programs can be both.

You means special tracepoints one that does export the arguments?

Question is, how many maintainers will add these, knowing that they
will have to be forever maintained as is.

 If program is using bpf_fetch*() helpers it obviously
 wants to access internal data structures, so
 it's really nothing more, but 'safe kernel module'
 and kernel layout specific.
 Both old and new tracepoints + programs will be used
 for live kernel debugging.
 
 If program is accessing user-ized data structures then

Technically, the TP_struct__entry is a user-ized structure.

 it is portable and will run on any kernel.
 In uapi header we can define:
 struct task_struct_user {
   int pid;
   int prio;

Here's a perfect example of something that looks 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt rost...@goodmis.org wrote:
 On Mon, 9 Feb 2015 22:10:45 -0800
 Alexei Starovoitov a...@plumgrid.com wrote:

 One can argue that current TP_printk format is already an ABI,
 because somebody might be parsing the text output.

 If somebody does, then it is an ABI. Luckily, it's not that useful to
 parse, thus it hasn't been an issue. As Linus has stated in the past,
 it's not that we can't change ABI interfaces, its just that we can not
 change them if there's a user space application that depends on it.

there are already tools that parse trace_pipe:
https://github.com/brendangregg/perf-tools

 and expect some events to have specific fields. Now we can add new
 fields, or even remove fields that no user space tool is using. This is
 because today, tools use libtraceevent to parse the event data.

not all tools use libtraceevent.
gdb calls perf_event_open directly:
https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
and parses PERF_RECORD_SAMPLE as a binary.
In this case it's branch records, but I think we never said anywhere
that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
in this particular order.

 This is why I'm nervous about exporting the parameters of the trace
 event call. Right now, those parameters can always change, because the
 only way to know they exist is by looking at the code. And currently,
 there's no way to interact with those parameters. Once we have eBPF in
 mainline, we now have a way to interact with the parameters and if
 those parameters change, then the eBPF program will break, and if eBPF
 can be part of a user space tool, that will break that tool and
 whatever change in the trace point that caused this breakage would have
 to be reverted. IOW, this can limit development in the kernel.

it can limit development unless we say that bpf programs
that attach to tracepoints are not part of ABI.
Easy enough to add large comment similar to perf_event.h

 Al Viro currently does not let any tracepoint in VFS because he doesn't
 want the internals of that code locked to an ABI. He's right to be
 worried.

Same with networking bits. We don't want tracepoints to limit
kernel development, but we want debuggability and kernel
analytics.
All existing tracepoints defined via DEFINE_EVENT should
not be an ABI.
But some maintainers think of them as ABI, whereas others
are using them freely. imo it's time to remove ambiguity.

The idea for new style of tracepoints is the following:
introduce new macro: DEFINE_EVENT_USER
and let programs attach to them.
These tracepoint will receive one or two pointers to important
structs only. They will not have TP_printk, assign and fields.
The placement and arguments to these new tracepoints
will be an ABI.
All existing tracepoints are not.

The main reason to attach to tracepoint is that they are
accessible without debug info (unlike kprobe)
Another reason is speed. tracepoints are much faster than
optimized kprobes and for real-time analytics the speed
is critical.

The position of new tracepoints and their arguments
will be an ABI and the programs can be both.
If program is using bpf_fetch*() helpers it obviously
wants to access internal data structures, so
it's really nothing more, but 'safe kernel module'
and kernel layout specific.
Both old and new tracepoints + programs will be used
for live kernel debugging.

If program is accessing user-ized data structures then
it is portable and will run on any kernel.
In uapi header we can define:
struct task_struct_user {
  int pid;
  int prio;
};
and let bpf programs access it via real 'struct task_struct *'
pointer passed into tracepoint.
bpf loader will translate offsets and sizes used inside
the program into real task_struct's offsets and loads.
(all structs are read-only of course)
programs will be fast and kernel independent.
They will be used for analytics (latency, etc)

 so in some cases we cannot change tracepoints without
 somebody complaining that his tool broke.
 In other cases tracepoints are used for debugging only
 and no one will notice when they change...
 It was and still a grey area.

 Not really. If a tool uses the tracepoint, it can lock that tracepoint
 down. This is exactly what latencytop did. It happened, it's not a
 hypothetical situation.

correct.

 bpf doesn't change any of that.
 It actually makes addition of new tracepoints easier.

 I totally disagree. It adds more ways to see inside the kernel, and if
 user space depends on this, it adds more ways the kernel can not change.

 It comes down to how robust eBPF is with the internals of the kernel
 changing. If we limit eBPF to system call tracepoints only, that's
 fine because those have the same ABI as the system call itself. I'm
 worried about the internal tracepoints for scheduling, irqs, file
 systems, etc.

agree. we need to make it clear that existing tracepoints
+ programs is not ABI.

 In the future we might add a tracepoint and pass a 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt rost...@goodmis.org wrote:
  Again, this would mean they become invisible to ftrace, and even
  ftrace_dump_on_oops.

 yes, since these new tracepoints have no meat inside them.
 They're placeholders sitting idle and waiting for bpf to do
 something useful with them.

 Hmm, I have a patch somewhere (never posted it), that add
 TRACE_MARKER(), which basically would just print that it was hit. But
 no data was passed to it. Something like that I would be more inclined
 to take. Then the question is, what can bpf access there? Could just be
 a place holder to add a fast kprobe. This way, it can show up in
 trace events with enable and and all that, it just wont have any data
 to print out besides the normal pid, flags, etc.

 But because it will inject a nop, that could be converted to a jump, it
 will give you the power of kprobes but with the speed of a tracepoint.

fair enough.
Something like TRACE_MARKER(arg1, arg2) that prints
it was hit without accessing the args would be enough.
Without any args it is indeed a 'fast kprobe' only.
Debug info would still be needed to access
function arguments.
On x64 function entry point and x64 abi make it easy
to access args, but i386 or kprobe in the middle
lose visibility when debug info is not available.
TRACE_MARKER (with few key args that function
is operating on) is enough to achieve roughly the same
as kprobe without debug info.

  I'm not fully understanding what is to be exported by this new ABI. If
  the fields available, will always be available, then why can't the
  appear in a TP_printk()?

 say, we define trace_netif_rx_entry() as this new tracepoint_bpf.
 It will have only one argument 'skb'.
 bpf program will read and print skb fields the way it likes
 for particular tracing scenario.
 So instead of making
 TP_printk(dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p
 vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x
 ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u
 mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d
 gso_type=%#x,...
 the abi exposed via trace_pipe (as it is today),
 the new tracepoint_bpf abi is presence of 'skb' pointer as one
 and only argument to bpf program.
 Future refactoring of netif_rx would need to guarantee
 that trace_netif_rx_entry(skb) is called. that's it.
 imo such tracepoints are much easier to deal with during
 code changes.

 But what can you access from that skb that is guaranteed to be there?
 If you say anything, then there's no reason it can't be added to the
 printk as well.

programs can access any field via bpf_fetch*() helpers which
make them kernel layout dependent or via user-ized sk_buff
with few fields which is portable.
In both cases kernel/user abi is only 'skb' pointer.
whether it's debugging program that needs full access
via fetch* helpers or portable program that uses stable api
it's up to program author.
Just like kprobes, it's clear, that if program is using
fetch* helpers it's doing it without any abi guarantees.
'perf probe' and 'bpf with fetch* helpers are the same.
perf probe creates wrappers on top of probe_kernel_read and
bpf_fetch* helpers are wrappers on top of probe_kernel_read.
Complains that 'my kprobe with flags=%cx mode=+4($stack)
stopped working in new kernel' are equivalent to complains
that program with bpf_fetch* stopped working.

Whereas if program is using user-ized structs it will work
across kernel versions, though it will be able to see
only very limited slice of in-kernel data.

 May be some of the existing tracepoints like this one that
 takes one argument can be marked 'bpf-ready', so that
 programs can attach to them only.

 I really hate the idea of adding tracepoints that ftrace can't use. It
 basically kills the entire busy box usage scenario, as boards that have
 extremely limited userspace still make full use of ftrace via the
 existing tracepoints.

agree. I think your trace_marker with few args is a good
middle ground.

 I still don't see the argument that adding data via the bpf functions
 is any different than adding those same items to fields in an event.
 Once you add a bpf function, then you must maintain those fields.

 Look, you can always add more to a TP_printk(), as that is standard
 with all text file kernel parsing. Like stat in /proc. Fields can not
 be removed, but more can always be appended to the end.

 Any tool that parses trace_pipe is broken if it can't handle extended
 fields. The api can be extended, and for text files, that is by
 appending to them.

I agree that any text parsing script should be able to cope
with additional args without problems. I think it's a fear of
1% breakage is causing maintainers to avoid any changes
to tracepoints even when they just add few args to the end
of TP_printk.
When tracepoints stop printing and the only thing they see
is single pointer to a well known struct like sk_buff,
this fear of tracepoints should fade.
programs are 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Alexei Starovoitov
On Tue, Feb 10, 2015 at 1:53 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Tue, 10 Feb 2015 11:53:22 -0800
 Alexei Starovoitov a...@plumgrid.com wrote:

 On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt rost...@goodmis.org wrote:
  On Mon, 9 Feb 2015 22:10:45 -0800
  Alexei Starovoitov a...@plumgrid.com wrote:
 
  One can argue that current TP_printk format is already an ABI,
  because somebody might be parsing the text output.
 
  If somebody does, then it is an ABI. Luckily, it's not that useful to
  parse, thus it hasn't been an issue. As Linus has stated in the past,
  it's not that we can't change ABI interfaces, its just that we can not
  change them if there's a user space application that depends on it.

 there are already tools that parse trace_pipe:
 https://github.com/brendangregg/perf-tools

 Yep, and if this becomes a standard, then any change that makes
 trace_pipe different will be reverted.

I think reading of trace_pipe is widespread.

  and expect some events to have specific fields. Now we can add new
  fields, or even remove fields that no user space tool is using. This is
  because today, tools use libtraceevent to parse the event data.

 not all tools use libtraceevent.
 gdb calls perf_event_open directly:
 https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c
 and parses PERF_RECORD_SAMPLE as a binary.
 In this case it's branch records, but I think we never said anywhere
 that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come
 in this particular order.

 What particular order? Note, that's a hardware event, not a software
 one.

yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr'
when attr.sample_type = IP | ADDR whereas this is an
internal order of 'if' statements inside perf_output_sample()...

 But some maintainers think of them as ABI, whereas others
 are using them freely. imo it's time to remove ambiguity.

 I would love to, and have brought this up at Kernel Summit more than
 once with no solution out of it.

let's try it again at plumbers in august?

For now I'm going to drop bpf+tracepoints, since it's so controversial
and go with bpf+syscall and bpf+kprobe only.

Hopefully by august it will be clear what bpf+kprobes can do
and why I'm excited about bpf+tracepoints in the future.

 The idea for new style of tracepoints is the following:
 introduce new macro: DEFINE_EVENT_USER
 and let programs attach to them.

 We actually have that today. But it's TRACE_EVENT_FLAGS(), although
 that should be cleaned up a bit. Frederic added it to label events that
 are safe for perf non root. It seems to be used currently only for
 syscalls.

I didn't mean to let unprivileged apps to use.
Better name probably would be: DEFINE_EVENT_BPF
and only let bpf programs use it.

 These tracepoint will receive one or two pointers to important
 structs only. They will not have TP_printk, assign and fields.
 The placement and arguments to these new tracepoints
 will be an ABI.
 All existing tracepoints are not.

 TP_printk() is not really an issue.

I think it is. The way things are printed is the most
visible part of tracepoints and I suspect maintainers don't
want to add new ones, because internal fields are printed
and users do parse trace_pipe.
Recent discussion about tcp instrumentation was
about adding new tracepoints and a module to print them.
As soon as something like this is in, the next question
'what we're going to do when more arguments need
to be printed'...

imo the solution is DEFINE_EVENT_BPF that doesn't
print anything and a bpf program to process it.
Then programs decide what they like to pass to
user space and in what format.
The kernel/user ABI is the position of this new tracepoint
and arguments that are passed in.
bpf program takes care of walking pointers, extracting
interesting fields, aggregating and passing them to
user in the format specific to this particular program.
Then when user wants to collect more fields it just
changes the program and corresponding userland side.

 The position of new tracepoints and their arguments
 will be an ABI and the programs can be both.

 You means special tracepoints one that does export the arguments?

 Question is, how many maintainers will add these, knowing that they
 will have to be forever maintained as is.

Obviously we would need to prove usefulness of
tracepoint_bpf before they're accepted.
Hopefully bpf+kprobe will make an impression on its own
and users would want similar but without debug info.

 If program is using bpf_fetch*() helpers it obviously
 wants to access internal data structures, so
 it's really nothing more, but 'safe kernel module'
 and kernel layout specific.
 Both old and new tracepoints + programs will be used
 for live kernel debugging.

 If program is accessing user-ized data structures then

 Technically, the TP_struct__entry is a user-ized structure.

 it is portable and will run on any kernel.
 In uapi header we can define:
 struct task_struct_user {
   int pid;
   

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Mon, 9 Feb 2015 22:10:45 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 One can argue that current TP_printk format is already an ABI,
 because somebody might be parsing the text output.

If somebody does, then it is an ABI. Luckily, it's not that useful to
parse, thus it hasn't been an issue. As Linus has stated in the past,
it's not that we can't change ABI interfaces, its just that we can not
change them if there's a user space application that depends on it.

The harder we make it for interface changes to break user space, the
better. The field layouts is a user interface. In fact, some of those
fields must always be there. This is because tools do parse the layout
and expect some events to have specific fields. Now we can add new
fields, or even remove fields that no user space tool is using. This is
because today, tools use libtraceevent to parse the event data.

This is why I'm nervous about exporting the parameters of the trace
event call. Right now, those parameters can always change, because the
only way to know they exist is by looking at the code. And currently,
there's no way to interact with those parameters. Once we have eBPF in
mainline, we now have a way to interact with the parameters and if
those parameters change, then the eBPF program will break, and if eBPF
can be part of a user space tool, that will break that tool and
whatever change in the trace point that caused this breakage would have
to be reverted. IOW, this can limit development in the kernel.

Al Viro currently does not let any tracepoint in VFS because he doesn't
want the internals of that code locked to an ABI. He's right to be
worried.

 so in some cases we cannot change tracepoints without
 somebody complaining that his tool broke.
 In other cases tracepoints are used for debugging only
 and no one will notice when they change...
 It was and still a grey area.

Not really. If a tool uses the tracepoint, it can lock that tracepoint
down. This is exactly what latencytop did. It happened, it's not a
hypothetical situation.

 bpf doesn't change any of that.
 It actually makes addition of new tracepoints easier.

I totally disagree. It adds more ways to see inside the kernel, and if
user space depends on this, it adds more ways the kernel can not change.

It comes down to how robust eBPF is with the internals of the kernel
changing. If we limit eBPF to system call tracepoints only, that's
fine because those have the same ABI as the system call itself. I'm
worried about the internal tracepoints for scheduling, irqs, file
systems, etc.


 In the future we might add a tracepoint and pass a single
 pointer to interesting data struct to it. bpf programs will walk
 data structures 'as safe modules' via bpf_fetch*() methods
 without exposing it as ABI.

Will this work if that structure changes? When the field we are looking
for no longer exists?

 whereas today we pass a lot of fields to tracepoints and
 make all of these fields immutable.

The parameters passed to the tracepoint are not shown to userspace and
can change at will. Now, we present the final parsing of the parameters
that convert to fields. As all currently known tools uses
libtraceevent.a, and parse the format files, those fields can move
around and even change in size. The structures are not immutable. The
fields are locked down if user space relies on them. But they can move
about within the tracepoint, because the parsing allows for it.

Remember, these are processed fields. The result of TP_fast_assign()
and what gets put into the ring buffer. Now what is passed to the
actual tracepoint is not visible by userspace, and in lots of cases, it
is just a pointer to some structure. What eBPF brings to the table is a
way to access this structure from user space. What keeps a structured
passed to a tracepoint from becoming immutable if there's a eBPF
program that expects it to have a specific field?


 
 To me tracepoints are like gdb breakpoints.

Unfortunately, it doesn't matter what you think they are. It doesn't
matter what I think they are. What matters is what Linus thinks they
are and if user space depends on it and Linus decides to revert what
ever change broke that user space program, no matter what we think, we
just screwed ourselves.

I'm being stubborn on this because this is exactly what happened in the
past, which caused every trace point to hold 4 bytes of padding. 4
bytes may not sound like a lot, but when you have 1 million
tracepoints, that's 4 megs of wasted space.

 and bpf programs like live debugger that examine things.

If bpf programs only dealt with kprobes, I may agree. But tracepoints
have already been proven to be a type of ABI. If we open another window
into the kernel, this can screw us later. It's better to solve this now
than when we are fighting with Linus over user space breakage.

 
 the next step is to be able to write bpf scripts on the fly
 without leaving debugger. Something like perf probe +
 editor + live execution. Truly 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Mon, 9 Feb 2015 21:51:05 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt rost...@goodmis.org wrote:
 
  Looks like this is entirely perf based and does not interact with
  ftrace at all. In other words, it's perf not tracing.
 
  It makes more sense to go through tip than the tracing tree.
 
 well, all of earlier series were based on ftrace only,
 but I was given convincing enough arguments that
 perf_even_open+ioctl is a better interface :)
 Ok. will rebase on tip in the next version.

That's fine. If this proves useful, I'll probably add the interface
back to ftrace as well.

-- Steve
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-10 Thread Steven Rostedt
On Tue, 10 Feb 2015 16:22:50 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

  Yep, and if this becomes a standard, then any change that makes
  trace_pipe different will be reverted.
 
 I think reading of trace_pipe is widespread.

I've heard of a few, but nothing that has broken when something changed.

Is it scripts or actual C code?

Again, it matters if people complain about the change.


  But some maintainers think of them as ABI, whereas others
  are using them freely. imo it's time to remove ambiguity.
 
  I would love to, and have brought this up at Kernel Summit more than
  once with no solution out of it.
 
 let's try it again at plumbers in august?

Well, we need a statement from Linus. And it would be nice if we could
also get Ingo involved in the discussion, but he seldom comes to
anything but Kernel Summit.

 
 For now I'm going to drop bpf+tracepoints, since it's so controversial
 and go with bpf+syscall and bpf+kprobe only.

Probably the safest bet.

 
 Hopefully by august it will be clear what bpf+kprobes can do
 and why I'm excited about bpf+tracepoints in the future.

BTW, I wonder if I could make a simple compiler in the kernel that
would translate the current ftrace filters into a BPF program, where it
could use the program and not use the current filter logic.


  These tracepoint will receive one or two pointers to important
  structs only. They will not have TP_printk, assign and fields.
  The placement and arguments to these new tracepoints
  will be an ABI.
  All existing tracepoints are not.
 
  TP_printk() is not really an issue.
 
 I think it is. The way things are printed is the most
 visible part of tracepoints and I suspect maintainers don't
 want to add new ones, because internal fields are printed
 and users do parse trace_pipe.
 Recent discussion about tcp instrumentation was
 about adding new tracepoints and a module to print them.
 As soon as something like this is in, the next question
 'what we're going to do when more arguments need
 to be printed'...

I should rephrase that. It's not that it's not an issue, it's just that
it hasn't been an issue. the trace_pipe code is slow. The
raw_trace_pipe is much faster. Any tool would benefit from using it.

I really need to get a library out to help users do such a thing.


 
 imo the solution is DEFINE_EVENT_BPF that doesn't
 print anything and a bpf program to process it.

You mean to be completely invisible to ftrace? And the debugfs/tracefs
directory?


 
  it is portable and will run on any kernel.
  In uapi header we can define:
  struct task_struct_user {
int pid;
int prio;
 
  Here's a perfect example of something that looks stable to show to
  user space, but is really a pimple that is hiding cancer.
 
  Lets start with pid. We have name spaces. What pid will be put there?
  We have to show the pid of the name space it is under.
 
  Then we have prio. What is prio in the DEADLINE scheduler. It is rather
  meaningless. Also, it is meaningless in SCHED_OTHER.
 
  Also note that even for SCHED_FIFO, the prio is used differently in the
  kernel than it is in userspace. For the kernel, lower is higher.
 
 well, -prio and -pid are already printed by sched tracepoints
 and their meaning depends on scheduler. So users taking that
 into account.

I know, and Peter hates this.

 I'm not suggesting to preserve the meaning of 'pid' semantically
 in all cases. That's not what users would want anyway.
 I want to allow programs to access important fields and print
 them in more generic way than current TP_printk does.
 Then exposed ABI of such tracepoint_bpf is smaller than
 with current tracepoints.

Again, this would mean they become invisible to ftrace, and even
ftrace_dump_on_oops.

I'm not fully understanding what is to be exported by this new ABI. If
the fields available, will always be available, then why can't the
appear in a TP_printk()?


  eBPF is very flexible, which means it is bound to have someone use it
  in a way you never dreamed of, and that will be what bites you in the
  end (pun intended).
 
 understood :)
 let's start slow then with bpf+syscall and bpf+kprobe only.

I'm fine with that.

 
  also not all bpf programs are equal.
  bpf+existing tracepoint is not ABI
 
  Why not?
 
 well, because we want to see more tracepoints in the kernel.
 We're already struggling to add more.

Still, the question is, even with a new tracepoint that limits what
it shows, there still needs to be something that is guaranteed to
always be there. I still don't see how this is different than the
current tracepoints.

 
  bpf+new tracepoint is ABI if programs are not using bpf_fetch
 
  How is this different?
 
 the new ones will be explicit by definition.

Who monitors this?



  To give you an example, we thought about scrambling the trace event
  field locations from boot to boot to keep tools from hard coding the
  event layout. This may sound crazy, but developers out there are crazy.
  And if you want to keep 

Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 9:13 PM, Steven Rostedt  wrote:
>>   \
>> + if (prog) { \
>> + __maybe_unused const u64 z = 0; \
>> + struct bpf_context __ctx = ((struct bpf_context) {  \
>> + __BPF_CAST6(args, z, z, z, z, z)\
>
> Note, there is no guarantee that args is at most 6. For example, in
> drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the
> trace_event brcms_txstatus has 8 args.
>
> But I guess that's OK if you do not need those last args, right?

yeah, some tracepoints pass a lot of things.
That's rare and in most of the cases they can be fetched
from parent structure.

> I'm nervous about showing args of tracepoints too, because we don't want
> that to become a strict ABI either.

One can argue that current TP_printk format is already an ABI,
because somebody might be parsing the text output.
so in some cases we cannot change tracepoints without
somebody complaining that his tool broke.
In other cases tracepoints are used for debugging only
and no one will notice when they change...
It was and still a grey area.
bpf doesn't change any of that.
It actually makes addition of new tracepoints easier.
In the future we might add a tracepoint and pass a single
pointer to interesting data struct to it. bpf programs will walk
data structures 'as safe modules' via bpf_fetch*() methods
without exposing it as ABI.
whereas today we pass a lot of fields to tracepoints and
make all of these fields immutable.

To me tracepoints are like gdb breakpoints.
and bpf programs like live debugger that examine things.

the next step is to be able to write bpf scripts on the fly
without leaving debugger. Something like perf probe +
editor + live execution. Truly like gdb for kernel.
while kernel is running.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt  wrote:
>
> Looks like this is entirely perf based and does not interact with
> ftrace at all. In other words, it's perf not tracing.
>
> It makes more sense to go through tip than the tracing tree.

well, all of earlier series were based on ftrace only,
but I was given convincing enough arguments that
perf_even_open+ioctl is a better interface :)
Ok. will rebase on tip in the next version.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:54 -0800
Alexei Starovoitov  wrote:

> +/* For tracepoint filters argN fields match one to one to arguments
> + * passed to tracepoint events
> + *
> + * For syscall entry filters argN fields match syscall arguments
> + * For syscall exit filters arg1 is a return value
> + */
> +struct bpf_context {
> + u64 arg1;
> + u64 arg2;
> + u64 arg3;
> + u64 arg4;
> + u64 arg5;
> + u64 arg6;
> +};
> +
> +#endif /* _LINUX_KERNEL_BPF_TRACE_H */
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 139b5067345b..4c275ce2dcf0 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -17,6 +17,7 @@
>   */
>  
>  #include 
> +#include 
>  
>  /*
>   * DECLARE_EVENT_CLASS can be used to add a generic function
> @@ -755,12 +756,32 @@ __attribute__((section("_ftrace_events"))) 
> *__event_##call = _##call
>  #undef __perf_task
>  #define __perf_task(t)   (__task = (t))
>  
> +/* zero extend integer, pointer or aggregate type to u64 without warnings */
> +#define __CAST_TO_U64(EXPR) ({ \
> + u64 ret = 0; \
> + typeof(EXPR) expr = EXPR; \
> + switch (sizeof(expr)) { \
> + case 8: ret = *(u64 *)  break; \
> + case 4: ret = *(u32 *)  break; \
> + case 2: ret = *(u16 *)  break; \
> + case 1: ret = *(u8 *)  break; \
> + } \
> + ret; })
> +
> +#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
> +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
> +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
> +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
> +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
> +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)   
> \
>  static notrace void  \
>  perf_trace_##call(void *__data, proto)   
> \
>  {\
>   struct ftrace_event_call *event_call = __data;  \
> + struct bpf_prog *prog = event_call->prog;   \
>   struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
>   struct ftrace_raw_##call *entry;\
>   struct pt_regs __regs;  \
> @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto)   
> \
>   int __data_size;\
>   int rctx;   \
>   \
> + if (prog) { \
> + __maybe_unused const u64 z = 0; \
> + struct bpf_context __ctx = ((struct bpf_context) {  \
> + __BPF_CAST6(args, z, z, z, z, z)\

Note, there is no guarantee that args is at most 6. For example, in
drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the
trace_event brcms_txstatus has 8 args.

But I guess that's OK if you do not need those last args, right?

Also, there's no interface the lets us know what the args are. I may be
able to come up with something. That's the reason I never filtered
before tracing. Because we had no way of knowing what to filter on,
because the args were never visible.

I'm nervous about showing args of tracepoints too, because we don't want
that to become a strict ABI either.

-- Steve



> + }); \
> + \
> + if (!trace_call_bpf(prog, &__ctx))  \
> + return; \
> + }   \
> + \
>   __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:54 -0800
Alexei Starovoitov  wrote:

> +#endif /* _LINUX_KERNEL_BPF_TRACE_H */
> diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
> index 139b5067345b..4c275ce2dcf0 100644
> --- a/include/trace/ftrace.h
> +++ b/include/trace/ftrace.h
> @@ -17,6 +17,7 @@
>   */
>  
>  #include 
> +#include 
>  
>  /*
>   * DECLARE_EVENT_CLASS can be used to add a generic function
> @@ -755,12 +756,32 @@ __attribute__((section("_ftrace_events"))) 
> *__event_##call = _##call
>  #undef __perf_task
>  #define __perf_task(t)   (__task = (t))
>  
> +/* zero extend integer, pointer or aggregate type to u64 without warnings */
> +#define __CAST_TO_U64(EXPR) ({ \
> + u64 ret = 0; \
> + typeof(EXPR) expr = EXPR; \
> + switch (sizeof(expr)) { \
> + case 8: ret = *(u64 *)  break; \
> + case 4: ret = *(u32 *)  break; \
> + case 2: ret = *(u16 *)  break; \
> + case 1: ret = *(u8 *)  break; \
> + } \
> + ret; })
> +
> +#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
> +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
> +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
> +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
> +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
> +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
> +
>  #undef DECLARE_EVENT_CLASS
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)   
> \
>  static notrace void  \
>  perf_trace_##call(void *__data, proto)   
> \
>  {\
>   struct ftrace_event_call *event_call = __data;  \
> + struct bpf_prog *prog = event_call->prog;   \


Looks like this is entirely perf based and does not interact with
ftrace at all. In other words, it's perf not tracing.

It makes more sense to go through tip than the tracing tree.

But I still do not want any hard coded event structures. All access to
data from the binary code must be parsed by looking at the event/format
files. Otherwise you will lock internals of the kernel as userspace
ABI, because eBPF programs will break if those internals change, and
that could severely limit progress in the future.

-- Steve

>   struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
>   struct ftrace_raw_##call *entry;\
>   struct pt_regs __regs;  \
> @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto)   
> \
>   int __data_size;\
>   int rctx;   \
>   \
> + if (prog) { \
> + __maybe_unused const u64 z = 0; \
> + struct bpf_context __ctx = ((struct bpf_context) {  \
> + __BPF_CAST6(args, z, z, z, z, z)\
> + }); \
> + \
> + if (!trace_call_bpf(prog, &__ctx))  \
> + return; \
> + }   \
> + \
>   __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \
>   \
>   head = this_cpu_ptr(event_call->perf_events);   \
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt rost...@goodmis.org wrote:

 Looks like this is entirely perf based and does not interact with
 ftrace at all. In other words, it's perf not tracing.

 It makes more sense to go through tip than the tracing tree.

well, all of earlier series were based on ftrace only,
but I was given convincing enough arguments that
perf_even_open+ioctl is a better interface :)
Ok. will rebase on tip in the next version.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Alexei Starovoitov
On Mon, Feb 9, 2015 at 9:13 PM, Steven Rostedt rost...@goodmis.org wrote:
   \
 + if (prog) { \
 + __maybe_unused const u64 z = 0; \
 + struct bpf_context __ctx = ((struct bpf_context) {  \
 + __BPF_CAST6(args, z, z, z, z, z)\

 Note, there is no guarantee that args is at most 6. For example, in
 drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the
 trace_event brcms_txstatus has 8 args.

 But I guess that's OK if you do not need those last args, right?

yeah, some tracepoints pass a lot of things.
That's rare and in most of the cases they can be fetched
from parent structure.

 I'm nervous about showing args of tracepoints too, because we don't want
 that to become a strict ABI either.

One can argue that current TP_printk format is already an ABI,
because somebody might be parsing the text output.
so in some cases we cannot change tracepoints without
somebody complaining that his tool broke.
In other cases tracepoints are used for debugging only
and no one will notice when they change...
It was and still a grey area.
bpf doesn't change any of that.
It actually makes addition of new tracepoints easier.
In the future we might add a tracepoint and pass a single
pointer to interesting data struct to it. bpf programs will walk
data structures 'as safe modules' via bpf_fetch*() methods
without exposing it as ABI.
whereas today we pass a lot of fields to tracepoints and
make all of these fields immutable.

To me tracepoints are like gdb breakpoints.
and bpf programs like live debugger that examine things.

the next step is to be able to write bpf scripts on the fly
without leaving debugger. Something like perf probe +
editor + live execution. Truly like gdb for kernel.
while kernel is running.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:54 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 +#endif /* _LINUX_KERNEL_BPF_TRACE_H */
 diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
 index 139b5067345b..4c275ce2dcf0 100644
 --- a/include/trace/ftrace.h
 +++ b/include/trace/ftrace.h
 @@ -17,6 +17,7 @@
   */
  
  #include linux/ftrace_event.h
 +#include trace/bpf_trace.h
  
  /*
   * DECLARE_EVENT_CLASS can be used to add a generic function
 @@ -755,12 +756,32 @@ __attribute__((section(_ftrace_events))) 
 *__event_##call = event_##call
  #undef __perf_task
  #define __perf_task(t)   (__task = (t))
  
 +/* zero extend integer, pointer or aggregate type to u64 without warnings */
 +#define __CAST_TO_U64(EXPR) ({ \
 + u64 ret = 0; \
 + typeof(EXPR) expr = EXPR; \
 + switch (sizeof(expr)) { \
 + case 8: ret = *(u64 *) expr; break; \
 + case 4: ret = *(u32 *) expr; break; \
 + case 2: ret = *(u16 *) expr; break; \
 + case 1: ret = *(u8 *) expr; break; \
 + } \
 + ret; })
 +
 +#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
 +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
 +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
 +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
 +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
 +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
 +
  #undef DECLARE_EVENT_CLASS
  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)   
 \
  static notrace void  \
  perf_trace_##call(void *__data, proto)   
 \
  {\
   struct ftrace_event_call *event_call = __data;  \
 + struct bpf_prog *prog = event_call-prog;   \


Looks like this is entirely perf based and does not interact with
ftrace at all. In other words, it's perf not tracing.

It makes more sense to go through tip than the tracing tree.

But I still do not want any hard coded event structures. All access to
data from the binary code must be parsed by looking at the event/format
files. Otherwise you will lock internals of the kernel as userspace
ABI, because eBPF programs will break if those internals change, and
that could severely limit progress in the future.

-- Steve

   struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
   struct ftrace_raw_##call *entry;\
   struct pt_regs __regs;  \
 @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto)   
 \
   int __data_size;\
   int rctx;   \
   \
 + if (prog) { \
 + __maybe_unused const u64 z = 0; \
 + struct bpf_context __ctx = ((struct bpf_context) {  \
 + __BPF_CAST6(args, z, z, z, z, z)\
 + }); \
 + \
 + if (!trace_call_bpf(prog, __ctx))  \
 + return; \
 + }   \
 + \
   __data_size = ftrace_get_offsets_##call(__data_offsets, args); \
   \
   head = this_cpu_ptr(event_call-perf_events);   \
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

2015-02-09 Thread Steven Rostedt
On Mon,  9 Feb 2015 19:45:54 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

 +/* For tracepoint filters argN fields match one to one to arguments
 + * passed to tracepoint events
 + *
 + * For syscall entry filters argN fields match syscall arguments
 + * For syscall exit filters arg1 is a return value
 + */
 +struct bpf_context {
 + u64 arg1;
 + u64 arg2;
 + u64 arg3;
 + u64 arg4;
 + u64 arg5;
 + u64 arg6;
 +};
 +
 +#endif /* _LINUX_KERNEL_BPF_TRACE_H */
 diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
 index 139b5067345b..4c275ce2dcf0 100644
 --- a/include/trace/ftrace.h
 +++ b/include/trace/ftrace.h
 @@ -17,6 +17,7 @@
   */
  
  #include linux/ftrace_event.h
 +#include trace/bpf_trace.h
  
  /*
   * DECLARE_EVENT_CLASS can be used to add a generic function
 @@ -755,12 +756,32 @@ __attribute__((section(_ftrace_events))) 
 *__event_##call = event_##call
  #undef __perf_task
  #define __perf_task(t)   (__task = (t))
  
 +/* zero extend integer, pointer or aggregate type to u64 without warnings */
 +#define __CAST_TO_U64(EXPR) ({ \
 + u64 ret = 0; \
 + typeof(EXPR) expr = EXPR; \
 + switch (sizeof(expr)) { \
 + case 8: ret = *(u64 *) expr; break; \
 + case 4: ret = *(u32 *) expr; break; \
 + case 2: ret = *(u16 *) expr; break; \
 + case 1: ret = *(u8 *) expr; break; \
 + } \
 + ret; })
 +
 +#define __BPF_CAST1(a,...) __CAST_TO_U64(a)
 +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__)
 +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__)
 +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__)
 +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__)
 +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__)
 +
  #undef DECLARE_EVENT_CLASS
  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)   
 \
  static notrace void  \
  perf_trace_##call(void *__data, proto)   
 \
  {\
   struct ftrace_event_call *event_call = __data;  \
 + struct bpf_prog *prog = event_call-prog;   \
   struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
   struct ftrace_raw_##call *entry;\
   struct pt_regs __regs;  \
 @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto)   
 \
   int __data_size;\
   int rctx;   \
   \
 + if (prog) { \
 + __maybe_unused const u64 z = 0; \
 + struct bpf_context __ctx = ((struct bpf_context) {  \
 + __BPF_CAST6(args, z, z, z, z, z)\

Note, there is no guarantee that args is at most 6. For example, in
drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the
trace_event brcms_txstatus has 8 args.

But I guess that's OK if you do not need those last args, right?

Also, there's no interface the lets us know what the args are. I may be
able to come up with something. That's the reason I never filtered
before tracing. Because we had no way of knowing what to filter on,
because the args were never visible.

I'm nervous about showing args of tracepoints too, because we don't want
that to become a strict ABI either.

-- Steve



 + }); \
 + \
 + if (!trace_call_bpf(prog, __ctx))  \
 + return; \
 + }   \
 + \
   __data_size = ftrace_get_offsets_##call(__data_offsets, args); \

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/