Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, Feb 16, 2015 at 6:26 AM, He Kuang wrote: > Hi, Alexei > > Another suggestion on bpf syscall interface. Currently, BPF + > syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on > commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this > case, bpf bytecode cannot be loaded to the kernel. I'm seeing a flurry of use cases for bpf in ovs, tc, tracing, etc When it's all ready, we can turn that config on by default. > If we turn the functionality of BPF_SYSCALL into a loadable module, then > we can use it without any dependencies on the kernel. What about change > bpf syscall to a /dev node or /sys file which can be exported by a > kernel module? I don't think we will allow extending bpf by modules. 'bpf in modules' is an interface that is too easy to abuse. So all of bpf core, helper functions and program types will be builtin. As far as bpf+tracing the plan is to do bpf+kprobe and bpf+syscalls first. Then add right set of helpers to make sure that use cases like 'tcp stack instrumentation' are fully addressed. Then there were few great ideas of accelerating kprobes with trace markers and debug tracepoints that we can do later. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, Feb 16, 2015 at 6:26 AM, He Kuang heku...@huawei.com wrote: Hi, Alexei Another suggestion on bpf syscall interface. Currently, BPF + syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this case, bpf bytecode cannot be loaded to the kernel. I'm seeing a flurry of use cases for bpf in ovs, tc, tracing, etc When it's all ready, we can turn that config on by default. If we turn the functionality of BPF_SYSCALL into a loadable module, then we can use it without any dependencies on the kernel. What about change bpf syscall to a /dev node or /sys file which can be exported by a kernel module? I don't think we will allow extending bpf by modules. 'bpf in modules' is an interface that is too easy to abuse. So all of bpf core, helper functions and program types will be builtin. As far as bpf+tracing the plan is to do bpf+kprobe and bpf+syscalls first. Then add right set of helpers to make sure that use cases like 'tcp stack instrumentation' are fully addressed. Then there were few great ideas of accelerating kprobes with trace markers and debug tracepoints that we can do later. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
Hi, Alexei Another suggestion on bpf syscall interface. Currently, BPF + syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this case, bpf bytecode cannot be loaded to the kernel. If we turn the functionality of BPF_SYSCALL into a loadable module, then we can use it without any dependencies on the kernel. What about change bpf syscall to a /dev node or /sys file which can be exported by a kernel module? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
Hi, Alexei Another suggestion on bpf syscall interface. Currently, BPF + syscalls/kprobes depends on CONFIG_BPF_SYSCALL. In kernel used on commercial products, CONFIG_BPF_SYSCALL is probably disabled, in this case, bpf bytecode cannot be loaded to the kernel. If we turn the functionality of BPF_SYSCALL into a loadable module, then we can use it without any dependencies on the kernel. What about change bpf syscall to a /dev node or /sys file which can be exported by a kernel module? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Wed, Feb 11, 2015 at 11:58 PM, Hekuang wrote: > >>> eBPF is very flexible, which means it is bound to have someone use it >>> in a way you never dreamed of, and that will be what bites you in the >>> end (pun intended). >> >> understood :) >> let's start slow then with bpf+syscall and bpf+kprobe only. > > > I think BPF + system calls/kprobes can meet our use case > (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be > improved. > > I suggest that you can improve bpf+kprobes when attached to function > headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx->arg1, > arg2.., then top models and architectures can be separated by bpf. > > BPF bytecode is cross-platform, but what we can get by using bpf+kprobes > is a 'regs->rdx' kind of information, such information is both > architecture and kernel version related. for kprobes in the middle of the function, kernel cannot convert pt_regs into argN. Placement was decided by compiler and can only be found in debug info. I think bpf+kprobe will be using it when it is available. When there is no debug info, kprobes will be limited to function entry and mapping of regs/stack into argN can be done by user space depending on architecture. So user tracing scripts in some higher level language can be kernel/arch independent when 'perf probe+bpf' is loading them on the fly on the given machine. > We hope to establish some models for describing kernel procedures such > as IO and network, which requires that it does not rely on architecture > and does not rely to a specific kernel version as much as possible. That's obviously a goal, but it requires a new approach to tracepoints. I think a lot of great ideas were discussed in this thread, so I'm hopeful that we'll come up with solution that will satisfy even strictest Peter's requirements :) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Wed, Feb 11, 2015 at 7:51 AM, Steven Rostedt wrote: > On Tue, 10 Feb 2015 22:33:05 -0800 > Alexei Starovoitov wrote: > > >> fair enough. >> Something like TRACE_MARKER(arg1, arg2) that prints >> it was hit without accessing the args would be enough. >> Without any args it is indeed a 'fast kprobe' only. >> Debug info would still be needed to access >> function arguments. >> On x64 function entry point and x64 abi make it easy >> to access args, but i386 or kprobe in the middle >> lose visibility when debug info is not available. >> TRACE_MARKER (with few key args that function >> is operating on) is enough to achieve roughly the same >> as kprobe without debug info. > > Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and > possibly a full trace event layout. > > The difference would be that the trace events do not show up unless you > have "trace_debug" on the command line. This should prevent > applications from depending on them. > > I could even do the nasty dmesg output like I do with trace_printk()s, > that would definitely keep a production kernel from adding it by > default. > > When trace_debug is not there, the trace points could still be accessed > but perhaps only via bpf, or act like a simple trace marker. I think that is a great idea! Makes it clear that all prints are for debugging and no abi guarantees. > Note, if you need ids, I rather have them in another directory than > tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs > simple. indeed. makes sense. no reason to burn fs memory just to get an id from name. may be perf_event api can be extended to lookup id from name. I think perf will benefit as well. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Wed, Feb 11, 2015 at 5:28 AM, Peter Zijlstra wrote: > > We're compiling the BPF stuff against the 'current' kernel headers > right? the tracex1 example is pulling kernel headers to demonstrate how bpf_fetch*() helpers can be used to walk kernel structures without debug info. The other examples don't need any internal headers. > So would enforcing module versioning not be sufficient? I'm going to redo the ex1 to use kprobe and some form of version check. Indeed module-like versioning should be enough. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Wed, Feb 11, 2015 at 11:58 PM, Hekuang heku...@huawei.com wrote: eBPF is very flexible, which means it is bound to have someone use it in a way you never dreamed of, and that will be what bites you in the end (pun intended). understood :) let's start slow then with bpf+syscall and bpf+kprobe only. I think BPF + system calls/kprobes can meet our use case (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be improved. I suggest that you can improve bpf+kprobes when attached to function headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx-arg1, arg2.., then top models and architectures can be separated by bpf. BPF bytecode is cross-platform, but what we can get by using bpf+kprobes is a 'regs-rdx' kind of information, such information is both architecture and kernel version related. for kprobes in the middle of the function, kernel cannot convert pt_regs into argN. Placement was decided by compiler and can only be found in debug info. I think bpf+kprobe will be using it when it is available. When there is no debug info, kprobes will be limited to function entry and mapping of regs/stack into argN can be done by user space depending on architecture. So user tracing scripts in some higher level language can be kernel/arch independent when 'perf probe+bpf' is loading them on the fly on the given machine. We hope to establish some models for describing kernel procedures such as IO and network, which requires that it does not rely on architecture and does not rely to a specific kernel version as much as possible. That's obviously a goal, but it requires a new approach to tracepoints. I think a lot of great ideas were discussed in this thread, so I'm hopeful that we'll come up with solution that will satisfy even strictest Peter's requirements :) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Wed, Feb 11, 2015 at 5:28 AM, Peter Zijlstra pet...@infradead.org wrote: We're compiling the BPF stuff against the 'current' kernel headers right? the tracex1 example is pulling kernel headers to demonstrate how bpf_fetch*() helpers can be used to walk kernel structures without debug info. The other examples don't need any internal headers. So would enforcing module versioning not be sufficient? I'm going to redo the ex1 to use kprobe and some form of version check. Indeed module-like versioning should be enough. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Wed, Feb 11, 2015 at 7:51 AM, Steven Rostedt rost...@goodmis.org wrote: On Tue, 10 Feb 2015 22:33:05 -0800 Alexei Starovoitov a...@plumgrid.com wrote: fair enough. Something like TRACE_MARKER(arg1, arg2) that prints it was hit without accessing the args would be enough. Without any args it is indeed a 'fast kprobe' only. Debug info would still be needed to access function arguments. On x64 function entry point and x64 abi make it easy to access args, but i386 or kprobe in the middle lose visibility when debug info is not available. TRACE_MARKER (with few key args that function is operating on) is enough to achieve roughly the same as kprobe without debug info. Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and possibly a full trace event layout. The difference would be that the trace events do not show up unless you have trace_debug on the command line. This should prevent applications from depending on them. I could even do the nasty dmesg output like I do with trace_printk()s, that would definitely keep a production kernel from adding it by default. When trace_debug is not there, the trace points could still be accessed but perhaps only via bpf, or act like a simple trace marker. I think that is a great idea! Makes it clear that all prints are for debugging and no abi guarantees. Note, if you need ids, I rather have them in another directory than tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs simple. indeed. makes sense. no reason to burn fs memory just to get an id from name. may be perf_event api can be extended to lookup id from name. I think perf will benefit as well. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
eBPF is very flexible, which means it is bound to have someone use it in a way you never dreamed of, and that will be what bites you in the end (pun intended). understood :) let's start slow then with bpf+syscall and bpf+kprobe only. I think BPF + system calls/kprobes can meet our use case (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be improved. I suggest that you can improve bpf+kprobes when attached to function headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx->arg1, arg2.., then top models and architectures can be separated by bpf. BPF bytecode is cross-platform, but what we can get by using bpf+kprobes is a 'regs->rdx' kind of information, such information is both architecture and kernel version related. We hope to establish some models for describing kernel procedures such as IO and network, which requires that it does not rely on architecture and does not rely to a specific kernel version as much as possible. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 22:33:05 -0800 Alexei Starovoitov wrote: > fair enough. > Something like TRACE_MARKER(arg1, arg2) that prints > it was hit without accessing the args would be enough. > Without any args it is indeed a 'fast kprobe' only. > Debug info would still be needed to access > function arguments. > On x64 function entry point and x64 abi make it easy > to access args, but i386 or kprobe in the middle > lose visibility when debug info is not available. > TRACE_MARKER (with few key args that function > is operating on) is enough to achieve roughly the same > as kprobe without debug info. Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and possibly a full trace event layout. The difference would be that the trace events do not show up unless you have "trace_debug" on the command line. This should prevent applications from depending on them. I could even do the nasty dmesg output like I do with trace_printk()s, that would definitely keep a production kernel from adding it by default. When trace_debug is not there, the trace points could still be accessed but perhaps only via bpf, or act like a simple trace marker. Note, if you need ids, I rather have them in another directory than tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs simple. This is something that needs to probably be discussed at bit. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:53:59PM -0500, Steven Rostedt wrote: > > >> In the future we might add a tracepoint and pass a single > > >> pointer to interesting data struct to it. bpf programs will walk > > >> data structures 'as safe modules' via bpf_fetch*() methods > > >> without exposing it as ABI. > > > > > > Will this work if that structure changes? When the field we are looking > > > for no longer exists? > > > > bpf_fetch*() is the same mechanism as perf probe. > > If there is a mistake by user space tools, the program > > will be reading some junk, but it won't be crashing. > > But what if the userspace tool depends on that value returning > something meaningful. If it was meaningful in the past, it will have to > be meaningful in the future, even if the internals of the kernel make > it otherwise. We're compiling the BPF stuff against the 'current' kernel headers right? So would enforcing module versioning not be sufficient? We already break out-of-tree modules without a second thought, the module interface is not guaranteed. They just need to cope with it. Anything using the kernel headers to look into the kernel guts should be bound to the same rules. So if we think of BFP stuff as out-of-tree modules, and treat them the same, I see no problem. I'm sure some BFP 'scripts' will turn in the same right mess that out-of-tree modules are, with endless #ifdef version checks, but hey, not my problem ;-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote: > > It would need to do more that that. It may have to calculate the value > > that it returns, as the internal value may be different with different > > kernels. > > back to 'prio'... the 'prio' accessible from the program > should be the same 'prio' that we're storing inside task_struct. Its not, task_struct::prio is an entirely different value than the one used in sched_param::sched_priority / sched_attr::sched_priority. And the 'problem' is, prio is only relevant to SCHED_RR/SCHED_FIFO tasks, we have more classes. > No extra conversions. We're not going to add runtime/space overhead to the kernel just because someone might maybe someday trace the kernel. That leaves the option of either tracing the kernel internal value and userspace will just have to deal with it, or making the tracepoint more expensive by having it do the conversion. Now the big question is, what do we do when we add/extend a scheduling class and have more parameters? We cannot change the tracepoint because userspace assumes format. And I simply refuse to add a second -- because that will end up being a third and fourth etc.. -- tracepoint right next to it with a different layout. Note that we just did add a class, we grew SCHED_DEADLINE a few releases ago, and that has 3 parameters (or 6 depending on how you look at it). You currently cannot 'debug' that with the existing tracepoints. In short, I loathe traceevents, they're more trouble than they're worth. Now I do love the infrastructure, its very useful debugging, but that's all with local hacks that will never see the light of day. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote: > well, ->prio and ->pid are already printed by sched tracepoints > and their meaning depends on scheduler. So users taking that > into account. Right, so trace_events were/are root only, and root 'should' be in the root pid namespace, and therefore pid magically works. And I'm not sure, but I don't think the 'nested' root available from containers should have access to ftrace, so that should not be an issue. Perf tries real hard to present PERF_SAMPLE_PID data in the pid namespace of the task that created the event. As to prio; yes this is a prime example of suck, I would love to change that but cannot :-(. The only solace I have is that everybody who is relying on it is broken. There is a very good reason I'm against adding more tracepoints to the scheduler, its a nightmare. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote: > >> not all tools use libtraceevent. > >> gdb calls perf_event_open directly: > >> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c > >> and parses PERF_RECORD_SAMPLE as a binary. > >> In this case it's branch records, but I think we never said anywhere > >> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come > >> in this particular order. > > > > What particular order? Note, that's a hardware event, not a software > > one. > > yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr' > when attr.sample_type = IP | ADDR whereas this is an > internal order of 'if' statements inside perf_output_sample()... This is indeed promised in the data layout description in include/uapi/linux/perf_event.h. There is no other way to find where fields are; perf relies on predetermined order of fields coupled with the requested field bitmask. So we promise the order: id, ip, pid, tid, time, addr,.. etc. So if you request IP and ADDR but none of the other fields, then you know your sample will start with IP and then contain ADDR. The traceevent thing has a debug/trace-fs format description of fields that is supposed to be used. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:53:59PM -0500, Steven Rostedt wrote: In the future we might add a tracepoint and pass a single pointer to interesting data struct to it. bpf programs will walk data structures 'as safe modules' via bpf_fetch*() methods without exposing it as ABI. Will this work if that structure changes? When the field we are looking for no longer exists? bpf_fetch*() is the same mechanism as perf probe. If there is a mistake by user space tools, the program will be reading some junk, but it won't be crashing. But what if the userspace tool depends on that value returning something meaningful. If it was meaningful in the past, it will have to be meaningful in the future, even if the internals of the kernel make it otherwise. We're compiling the BPF stuff against the 'current' kernel headers right? So would enforcing module versioning not be sufficient? We already break out-of-tree modules without a second thought, the module interface is not guaranteed. They just need to cope with it. Anything using the kernel headers to look into the kernel guts should be bound to the same rules. So if we think of BFP stuff as out-of-tree modules, and treat them the same, I see no problem. I'm sure some BFP 'scripts' will turn in the same right mess that out-of-tree modules are, with endless #ifdef version checks, but hey, not my problem ;-) -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote: not all tools use libtraceevent. gdb calls perf_event_open directly: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c and parses PERF_RECORD_SAMPLE as a binary. In this case it's branch records, but I think we never said anywhere that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come in this particular order. What particular order? Note, that's a hardware event, not a software one. yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr' when attr.sample_type = IP | ADDR whereas this is an internal order of 'if' statements inside perf_output_sample()... This is indeed promised in the data layout description in include/uapi/linux/perf_event.h. There is no other way to find where fields are; perf relies on predetermined order of fields coupled with the requested field bitmask. So we promise the order: id, ip, pid, tid, time, addr,.. etc. So if you request IP and ADDR but none of the other fields, then you know your sample will start with IP and then contain ADDR. The traceevent thing has a debug/trace-fs format description of fields that is supposed to be used. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote: well, -prio and -pid are already printed by sched tracepoints and their meaning depends on scheduler. So users taking that into account. Right, so trace_events were/are root only, and root 'should' be in the root pid namespace, and therefore pid magically works. And I'm not sure, but I don't think the 'nested' root available from containers should have access to ftrace, so that should not be an issue. Perf tries real hard to present PERF_SAMPLE_PID data in the pid namespace of the task that created the event. As to prio; yes this is a prime example of suck, I would love to change that but cannot :-(. The only solace I have is that everybody who is relying on it is broken. There is a very good reason I'm against adding more tracepoints to the scheduler, its a nightmare. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 04:22:50PM -0800, Alexei Starovoitov wrote: It would need to do more that that. It may have to calculate the value that it returns, as the internal value may be different with different kernels. back to 'prio'... the 'prio' accessible from the program should be the same 'prio' that we're storing inside task_struct. Its not, task_struct::prio is an entirely different value than the one used in sched_param::sched_priority / sched_attr::sched_priority. And the 'problem' is, prio is only relevant to SCHED_RR/SCHED_FIFO tasks, we have more classes. No extra conversions. We're not going to add runtime/space overhead to the kernel just because someone might maybe someday trace the kernel. That leaves the option of either tracing the kernel internal value and userspace will just have to deal with it, or making the tracepoint more expensive by having it do the conversion. Now the big question is, what do we do when we add/extend a scheduling class and have more parameters? We cannot change the tracepoint because userspace assumes format. And I simply refuse to add a second -- because that will end up being a third and fourth etc.. -- tracepoint right next to it with a different layout. Note that we just did add a class, we grew SCHED_DEADLINE a few releases ago, and that has 3 parameters (or 6 depending on how you look at it). You currently cannot 'debug' that with the existing tracepoints. In short, I loathe traceevents, they're more trouble than they're worth. Now I do love the infrastructure, its very useful debugging, but that's all with local hacks that will never see the light of day. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 22:33:05 -0800 Alexei Starovoitov a...@plumgrid.com wrote: fair enough. Something like TRACE_MARKER(arg1, arg2) that prints it was hit without accessing the args would be enough. Without any args it is indeed a 'fast kprobe' only. Debug info would still be needed to access function arguments. On x64 function entry point and x64 abi make it easy to access args, but i386 or kprobe in the middle lose visibility when debug info is not available. TRACE_MARKER (with few key args that function is operating on) is enough to achieve roughly the same as kprobe without debug info. Actually, what about a TRACE_EVENT_DEBUG(), that has a few args and possibly a full trace event layout. The difference would be that the trace events do not show up unless you have trace_debug on the command line. This should prevent applications from depending on them. I could even do the nasty dmesg output like I do with trace_printk()s, that would definitely keep a production kernel from adding it by default. When trace_debug is not there, the trace points could still be accessed but perhaps only via bpf, or act like a simple trace marker. Note, if you need ids, I rather have them in another directory than tracefs. Make a eventfs perhaps that holds these. I rather keep tracefs simple. This is something that needs to probably be discussed at bit. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
eBPF is very flexible, which means it is bound to have someone use it in a way you never dreamed of, and that will be what bites you in the end (pun intended). understood :) let's start slow then with bpf+syscall and bpf+kprobe only. I think BPF + system calls/kprobes can meet our use case (https://lkml.org/lkml/2015/2/6/44), but there're some issues to be improved. I suggest that you can improve bpf+kprobes when attached to function headers(or TRACE_MARKERS), make it converts pt-regs to bpf_ctx-arg1, arg2.., then top models and architectures can be separated by bpf. BPF bytecode is cross-platform, but what we can get by using bpf+kprobes is a 'regs-rdx' kind of information, such information is both architecture and kernel version related. We hope to establish some models for describing kernel procedures such as IO and network, which requires that it does not rely on architecture and does not rely to a specific kernel version as much as possible. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt wrote: >> > Again, this would mean they become invisible to ftrace, and even >> > ftrace_dump_on_oops. >> >> yes, since these new tracepoints have no meat inside them. >> They're placeholders sitting idle and waiting for bpf to do >> something useful with them. > > Hmm, I have a patch somewhere (never posted it), that add > TRACE_MARKER(), which basically would just print that it was hit. But > no data was passed to it. Something like that I would be more inclined > to take. Then the question is, what can bpf access there? Could just be > a place holder to add a "fast kprobe". This way, it can show up in > trace events with enable and and all that, it just wont have any data > to print out besides the normal pid, flags, etc. > > But because it will inject a nop, that could be converted to a jump, it > will give you the power of kprobes but with the speed of a tracepoint. fair enough. Something like TRACE_MARKER(arg1, arg2) that prints it was hit without accessing the args would be enough. Without any args it is indeed a 'fast kprobe' only. Debug info would still be needed to access function arguments. On x64 function entry point and x64 abi make it easy to access args, but i386 or kprobe in the middle lose visibility when debug info is not available. TRACE_MARKER (with few key args that function is operating on) is enough to achieve roughly the same as kprobe without debug info. >> > I'm not fully understanding what is to be exported by this new ABI. If >> > the fields available, will always be available, then why can't the >> > appear in a TP_printk()? >> >> say, we define trace_netif_rx_entry() as this new tracepoint_bpf. >> It will have only one argument 'skb'. >> bpf program will read and print skb fields the way it likes >> for particular tracing scenario. >> So instead of making >> TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p >> vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x >> ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u >> mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d >> gso_type=%#x",... >> the abi exposed via trace_pipe (as it is today), >> the new tracepoint_bpf abi is presence of 'skb' pointer as one >> and only argument to bpf program. >> Future refactoring of netif_rx would need to guarantee >> that trace_netif_rx_entry(skb) is called. that's it. >> imo such tracepoints are much easier to deal with during >> code changes. > > But what can you access from that skb that is guaranteed to be there? > If you say anything, then there's no reason it can't be added to the > printk as well. programs can access any field via bpf_fetch*() helpers which make them kernel layout dependent or via user-ized sk_buff with few fields which is portable. In both cases kernel/user abi is only 'skb' pointer. whether it's debugging program that needs full access via fetch* helpers or portable program that uses stable api it's up to program author. Just like kprobes, it's clear, that if program is using fetch* helpers it's doing it without any abi guarantees. 'perf probe' and 'bpf with fetch* helpers are the same. perf probe creates wrappers on top of probe_kernel_read and bpf_fetch* helpers are wrappers on top of probe_kernel_read. Complains that 'my kprobe with flags=%cx mode=+4($stack) stopped working in new kernel' are equivalent to complains that program with bpf_fetch* stopped working. Whereas if program is using user-ized structs it will work across kernel versions, though it will be able to see only very limited slice of in-kernel data. >> May be some of the existing tracepoints like this one that >> takes one argument can be marked 'bpf-ready', so that >> programs can attach to them only. > > I really hate the idea of adding tracepoints that ftrace can't use. It > basically kills the entire busy box usage scenario, as boards that have > extremely limited userspace still make full use of ftrace via the > existing tracepoints. agree. I think your trace_marker with few args is a good middle ground. > I still don't see the argument that adding data via the bpf functions > is any different than adding those same items to fields in an event. > Once you add a bpf function, then you must maintain those fields. > > Look, you can always add more to a TP_printk(), as that is standard > with all text file kernel parsing. Like stat in /proc. Fields can not > be removed, but more can always be appended to the end. > > Any tool that parses trace_pipe is broken if it can't handle extended > fields. The api can be extended, and for text files, that is by > appending to them. I agree that any text parsing script should be able to cope with additional args without problems. I think it's a fear of <1% breakage is causing maintainers to avoid any changes to tracepoints even when they just add few args to the end of TP_printk. When tracepoints stop printing and the only thing they see is single pointer to a
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 19:04:55 -0800 Alexei Starovoitov wrote: > > You mean to be completely invisible to ftrace? And the debugfs/tracefs > > directory? > > I mean it will be seen in tracefs to get 'id', but without > enable/format/filter In other words, invisible to ftrace. I'm not sure I'll be on board to work with that. > > Again, this would mean they become invisible to ftrace, and even > > ftrace_dump_on_oops. > > yes, since these new tracepoints have no meat inside them. > They're placeholders sitting idle and waiting for bpf to do > something useful with them. Hmm, I have a patch somewhere (never posted it), that add TRACE_MARKER(), which basically would just print that it was hit. But no data was passed to it. Something like that I would be more inclined to take. Then the question is, what can bpf access there? Could just be a place holder to add a "fast kprobe". This way, it can show up in trace events with enable and and all that, it just wont have any data to print out besides the normal pid, flags, etc. But because it will inject a nop, that could be converted to a jump, it will give you the power of kprobes but with the speed of a tracepoint. > > > I'm not fully understanding what is to be exported by this new ABI. If > > the fields available, will always be available, then why can't the > > appear in a TP_printk()? > > say, we define trace_netif_rx_entry() as this new tracepoint_bpf. > It will have only one argument 'skb'. > bpf program will read and print skb fields the way it likes > for particular tracing scenario. > So instead of making > TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p > vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x > ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u > mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d > gso_type=%#x",... > the abi exposed via trace_pipe (as it is today), > the new tracepoint_bpf abi is presence of 'skb' pointer as one > and only argument to bpf program. > Future refactoring of netif_rx would need to guarantee > that trace_netif_rx_entry(skb) is called. that's it. > imo such tracepoints are much easier to deal with during > code changes. But what can you access from that skb that is guaranteed to be there? If you say anything, then there's no reason it can't be added to the printk as well. > > May be some of the existing tracepoints like this one that > takes one argument can be marked 'bpf-ready', so that > programs can attach to them only. I really hate the idea of adding tracepoints that ftrace can't use. It basically kills the entire busy box usage scenario, as boards that have extremely limited userspace still make full use of ftrace via the existing tracepoints. I still don't see the argument that adding data via the bpf functions is any different than adding those same items to fields in an event. Once you add a bpf function, then you must maintain those fields. Look, you can always add more to a TP_printk(), as that is standard with all text file kernel parsing. Like stat in /proc. Fields can not be removed, but more can always be appended to the end. Any tool that parses trace_pipe is broken if it can't handle extended fields. The api can be extended, and for text files, that is by appending to them. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 4:50 PM, Steven Rostedt wrote: > >> >> But some maintainers think of them as ABI, whereas others >> >> are using them freely. imo it's time to remove ambiguity. >> > >> > I would love to, and have brought this up at Kernel Summit more than >> > once with no solution out of it. >> >> let's try it again at plumbers in august? > > Well, we need a statement from Linus. And it would be nice if we could > also get Ingo involved in the discussion, but he seldom comes to > anything but Kernel Summit. +1 > BTW, I wonder if I could make a simple compiler in the kernel that > would translate the current ftrace filters into a BPF program, where it > could use the program and not use the current filter logic. yep. I've sent that patch last year. It converted pred_tree into bpf program. I can try to dig it up. It doesn't provide extra programmability though, just makes filtering logic much faster. >> imo the solution is DEFINE_EVENT_BPF that doesn't >> print anything and a bpf program to process it. > > You mean to be completely invisible to ftrace? And the debugfs/tracefs > directory? I mean it will be seen in tracefs to get 'id', but without enable/format/filter >> I'm not suggesting to preserve the meaning of 'pid' semantically >> in all cases. That's not what users would want anyway. >> I want to allow programs to access important fields and print >> them in more generic way than current TP_printk does. >> Then exposed ABI of such tracepoint_bpf is smaller than >> with current tracepoints. > > Again, this would mean they become invisible to ftrace, and even > ftrace_dump_on_oops. yes, since these new tracepoints have no meat inside them. They're placeholders sitting idle and waiting for bpf to do something useful with them. > I'm not fully understanding what is to be exported by this new ABI. If > the fields available, will always be available, then why can't the > appear in a TP_printk()? say, we define trace_netif_rx_entry() as this new tracepoint_bpf. It will have only one argument 'skb'. bpf program will read and print skb fields the way it likes for particular tracing scenario. So instead of making TP_printk("dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d gso_type=%#x",... the abi exposed via trace_pipe (as it is today), the new tracepoint_bpf abi is presence of 'skb' pointer as one and only argument to bpf program. Future refactoring of netif_rx would need to guarantee that trace_netif_rx_entry(skb) is called. that's it. imo such tracepoints are much easier to deal with during code changes. May be some of the existing tracepoints like this one that takes one argument can be marked 'bpf-ready', so that programs can attach to them only. >> let's start slow then with bpf+syscall and bpf+kprobe only. > > I'm fine with that. thanks. will wait for merge window to close and will repost. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 16:22:50 -0800 Alexei Starovoitov wrote: > > Yep, and if this becomes a standard, then any change that makes > > trace_pipe different will be reverted. > > I think reading of trace_pipe is widespread. I've heard of a few, but nothing that has broken when something changed. Is it scripts or actual C code? Again, it matters if people complain about the change. > >> But some maintainers think of them as ABI, whereas others > >> are using them freely. imo it's time to remove ambiguity. > > > > I would love to, and have brought this up at Kernel Summit more than > > once with no solution out of it. > > let's try it again at plumbers in august? Well, we need a statement from Linus. And it would be nice if we could also get Ingo involved in the discussion, but he seldom comes to anything but Kernel Summit. > > For now I'm going to drop bpf+tracepoints, since it's so controversial > and go with bpf+syscall and bpf+kprobe only. Probably the safest bet. > > Hopefully by august it will be clear what bpf+kprobes can do > and why I'm excited about bpf+tracepoints in the future. BTW, I wonder if I could make a simple compiler in the kernel that would translate the current ftrace filters into a BPF program, where it could use the program and not use the current filter logic. > >> These tracepoint will receive one or two pointers to important > >> structs only. They will not have TP_printk, assign and fields. > >> The placement and arguments to these new tracepoints > >> will be an ABI. > >> All existing tracepoints are not. > > > > TP_printk() is not really an issue. > > I think it is. The way things are printed is the most > visible part of tracepoints and I suspect maintainers don't > want to add new ones, because internal fields are printed > and users do parse trace_pipe. > Recent discussion about tcp instrumentation was > about adding new tracepoints and a module to print them. > As soon as something like this is in, the next question > 'what we're going to do when more arguments need > to be printed'... I should rephrase that. It's not that it's not an issue, it's just that it hasn't been an issue. the trace_pipe code is slow. The raw_trace_pipe is much faster. Any tool would benefit from using it. I really need to get a library out to help users do such a thing. > > imo the solution is DEFINE_EVENT_BPF that doesn't > print anything and a bpf program to process it. You mean to be completely invisible to ftrace? And the debugfs/tracefs directory? > > > >> it is portable and will run on any kernel. > >> In uapi header we can define: > >> struct task_struct_user { > >> int pid; > >> int prio; > > > > Here's a perfect example of something that looks stable to show to > > user space, but is really a pimple that is hiding cancer. > > > > Lets start with pid. We have name spaces. What pid will be put there? > > We have to show the pid of the name space it is under. > > > > Then we have prio. What is prio in the DEADLINE scheduler. It is rather > > meaningless. Also, it is meaningless in SCHED_OTHER. > > > > Also note that even for SCHED_FIFO, the prio is used differently in the > > kernel than it is in userspace. For the kernel, lower is higher. > > well, ->prio and ->pid are already printed by sched tracepoints > and their meaning depends on scheduler. So users taking that > into account. I know, and Peter hates this. > I'm not suggesting to preserve the meaning of 'pid' semantically > in all cases. That's not what users would want anyway. > I want to allow programs to access important fields and print > them in more generic way than current TP_printk does. > Then exposed ABI of such tracepoint_bpf is smaller than > with current tracepoints. Again, this would mean they become invisible to ftrace, and even ftrace_dump_on_oops. I'm not fully understanding what is to be exported by this new ABI. If the fields available, will always be available, then why can't the appear in a TP_printk()? > > eBPF is very flexible, which means it is bound to have someone use it > > in a way you never dreamed of, and that will be what bites you in the > > end (pun intended). > > understood :) > let's start slow then with bpf+syscall and bpf+kprobe only. I'm fine with that. > > >> also not all bpf programs are equal. > >> bpf+existing tracepoint is not ABI > > > > Why not? > > well, because we want to see more tracepoints in the kernel. > We're already struggling to add more. Still, the question is, even with a new "tracepoint" that limits what it shows, there still needs to be something that is guaranteed to always be there. I still don't see how this is different than the current tracepoints. > > >> bpf+new tracepoint is ABI if programs are not using bpf_fetch > > > > How is this different? > > the new ones will be explicit by definition. Who monitors this? > > To give you an example, we thought about scrambling the trace event > > field locations from boot to boot to
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 1:53 PM, Steven Rostedt wrote: > On Tue, 10 Feb 2015 11:53:22 -0800 > Alexei Starovoitov wrote: > >> On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt wrote: >> > On Mon, 9 Feb 2015 22:10:45 -0800 >> > Alexei Starovoitov wrote: >> > >> >> One can argue that current TP_printk format is already an ABI, >> >> because somebody might be parsing the text output. >> > >> > If somebody does, then it is an ABI. Luckily, it's not that useful to >> > parse, thus it hasn't been an issue. As Linus has stated in the past, >> > it's not that we can't change ABI interfaces, its just that we can not >> > change them if there's a user space application that depends on it. >> >> there are already tools that parse trace_pipe: >> https://github.com/brendangregg/perf-tools > > Yep, and if this becomes a standard, then any change that makes > trace_pipe different will be reverted. I think reading of trace_pipe is widespread. >> > and expect some events to have specific fields. Now we can add new >> > fields, or even remove fields that no user space tool is using. This is >> > because today, tools use libtraceevent to parse the event data. >> >> not all tools use libtraceevent. >> gdb calls perf_event_open directly: >> https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c >> and parses PERF_RECORD_SAMPLE as a binary. >> In this case it's branch records, but I think we never said anywhere >> that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come >> in this particular order. > > What particular order? Note, that's a hardware event, not a software > one. yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr' when attr.sample_type = IP | ADDR whereas this is an internal order of 'if' statements inside perf_output_sample()... >> But some maintainers think of them as ABI, whereas others >> are using them freely. imo it's time to remove ambiguity. > > I would love to, and have brought this up at Kernel Summit more than > once with no solution out of it. let's try it again at plumbers in august? For now I'm going to drop bpf+tracepoints, since it's so controversial and go with bpf+syscall and bpf+kprobe only. Hopefully by august it will be clear what bpf+kprobes can do and why I'm excited about bpf+tracepoints in the future. >> The idea for new style of tracepoints is the following: >> introduce new macro: DEFINE_EVENT_USER >> and let programs attach to them. > > We actually have that today. But it's TRACE_EVENT_FLAGS(), although > that should be cleaned up a bit. Frederic added it to label events that > are safe for perf non root. It seems to be used currently only for > syscalls. I didn't mean to let unprivileged apps to use. Better name probably would be: DEFINE_EVENT_BPF and only let bpf programs use it. >> These tracepoint will receive one or two pointers to important >> structs only. They will not have TP_printk, assign and fields. >> The placement and arguments to these new tracepoints >> will be an ABI. >> All existing tracepoints are not. > > TP_printk() is not really an issue. I think it is. The way things are printed is the most visible part of tracepoints and I suspect maintainers don't want to add new ones, because internal fields are printed and users do parse trace_pipe. Recent discussion about tcp instrumentation was about adding new tracepoints and a module to print them. As soon as something like this is in, the next question 'what we're going to do when more arguments need to be printed'... imo the solution is DEFINE_EVENT_BPF that doesn't print anything and a bpf program to process it. Then programs decide what they like to pass to user space and in what format. The kernel/user ABI is the position of this new tracepoint and arguments that are passed in. bpf program takes care of walking pointers, extracting interesting fields, aggregating and passing them to user in the format specific to this particular program. Then when user wants to collect more fields it just changes the program and corresponding userland side. >> The position of new tracepoints and their arguments >> will be an ABI and the programs can be both. > > You means "special tracepoints" one that does export the arguments? > > Question is, how many maintainers will add these, knowing that they > will have to be forever maintained as is. Obviously we would need to prove usefulness of tracepoint_bpf before they're accepted. Hopefully bpf+kprobe will make an impression on its own and users would want similar but without debug info. >> If program is using bpf_fetch*() helpers it obviously >> wants to access internal data structures, so >> it's really nothing more, but 'safe kernel module' >> and kernel layout specific. >> Both old and new tracepoints + programs will be used >> for live kernel debugging. >> >> If program is accessing user-ized data structures then > > Technically, the TP_struct__entry is a user-ized structure. > >> it is portable and will run on any kernel. >> In uapi
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 11:53:22 -0800 Alexei Starovoitov wrote: > On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt wrote: > > On Mon, 9 Feb 2015 22:10:45 -0800 > > Alexei Starovoitov wrote: > > > >> One can argue that current TP_printk format is already an ABI, > >> because somebody might be parsing the text output. > > > > If somebody does, then it is an ABI. Luckily, it's not that useful to > > parse, thus it hasn't been an issue. As Linus has stated in the past, > > it's not that we can't change ABI interfaces, its just that we can not > > change them if there's a user space application that depends on it. > > there are already tools that parse trace_pipe: > https://github.com/brendangregg/perf-tools Yep, and if this becomes a standard, then any change that makes trace_pipe different will be reverted. > > > and expect some events to have specific fields. Now we can add new > > fields, or even remove fields that no user space tool is using. This is > > because today, tools use libtraceevent to parse the event data. > > not all tools use libtraceevent. > gdb calls perf_event_open directly: > https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c > and parses PERF_RECORD_SAMPLE as a binary. > In this case it's branch records, but I think we never said anywhere > that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come > in this particular order. What particular order? Note, that's a hardware event, not a software one. > > > This is why I'm nervous about exporting the parameters of the trace > > event call. Right now, those parameters can always change, because the > > only way to know they exist is by looking at the code. And currently, > > there's no way to interact with those parameters. Once we have eBPF in > > mainline, we now have a way to interact with the parameters and if > > those parameters change, then the eBPF program will break, and if eBPF > > can be part of a user space tool, that will break that tool and > > whatever change in the trace point that caused this breakage would have > > to be reverted. IOW, this can limit development in the kernel. > > it can limit development unless we say that bpf programs > that attach to tracepoints are not part of ABI. > Easy enough to add large comment similar to perf_event.h Again, it doesn't matter what we say. Linus made that very clear. He stated if you provide an interface, and someone uses that interface for a user space application, and they complain if it breaks, that is just reason to revert whatever broke it. > > > Al Viro currently does not let any tracepoint in VFS because he doesn't > > want the internals of that code locked to an ABI. He's right to be > > worried. > > Same with networking bits. We don't want tracepoints to limit > kernel development, but we want debuggability and kernel > analytics. > All existing tracepoints defined via DEFINE_EVENT should > not be an ABI. I agree, but that doesn't make it so :-/ > But some maintainers think of them as ABI, whereas others > are using them freely. imo it's time to remove ambiguity. I would love to, and have brought this up at Kernel Summit more than once with no solution out of it. > > The idea for new style of tracepoints is the following: > introduce new macro: DEFINE_EVENT_USER > and let programs attach to them. We actually have that today. But it's TRACE_EVENT_FLAGS(), although that should be cleaned up a bit. Frederic added it to label events that are safe for perf non root. It seems to be used currently only for syscalls. > These tracepoint will receive one or two pointers to important > structs only. They will not have TP_printk, assign and fields. > The placement and arguments to these new tracepoints > will be an ABI. > All existing tracepoints are not. TP_printk() is not really an issue. > > The main reason to attach to tracepoint is that they are > accessible without debug info (unlike kprobe) That is, if you have a special bpf call to access variables, right? How else do you access part of a data structure. > Another reason is speed. tracepoints are much faster than > optimized kprobes and for real-time analytics the speed > is critical. > > The position of new tracepoints and their arguments > will be an ABI and the programs can be both. You means "special tracepoints" one that does export the arguments? Question is, how many maintainers will add these, knowing that they will have to be forever maintained as is. > If program is using bpf_fetch*() helpers it obviously > wants to access internal data structures, so > it's really nothing more, but 'safe kernel module' > and kernel layout specific. > Both old and new tracepoints + programs will be used > for live kernel debugging. > > If program is accessing user-ized data structures then Technically, the TP_struct__entry is a user-ized structure. > it is portable and will run on any kernel. > In uapi header we can define: > struct task_struct_user { > int pid; > int
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt wrote: > On Mon, 9 Feb 2015 22:10:45 -0800 > Alexei Starovoitov wrote: > >> One can argue that current TP_printk format is already an ABI, >> because somebody might be parsing the text output. > > If somebody does, then it is an ABI. Luckily, it's not that useful to > parse, thus it hasn't been an issue. As Linus has stated in the past, > it's not that we can't change ABI interfaces, its just that we can not > change them if there's a user space application that depends on it. there are already tools that parse trace_pipe: https://github.com/brendangregg/perf-tools > and expect some events to have specific fields. Now we can add new > fields, or even remove fields that no user space tool is using. This is > because today, tools use libtraceevent to parse the event data. not all tools use libtraceevent. gdb calls perf_event_open directly: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c and parses PERF_RECORD_SAMPLE as a binary. In this case it's branch records, but I think we never said anywhere that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come in this particular order. > This is why I'm nervous about exporting the parameters of the trace > event call. Right now, those parameters can always change, because the > only way to know they exist is by looking at the code. And currently, > there's no way to interact with those parameters. Once we have eBPF in > mainline, we now have a way to interact with the parameters and if > those parameters change, then the eBPF program will break, and if eBPF > can be part of a user space tool, that will break that tool and > whatever change in the trace point that caused this breakage would have > to be reverted. IOW, this can limit development in the kernel. it can limit development unless we say that bpf programs that attach to tracepoints are not part of ABI. Easy enough to add large comment similar to perf_event.h > Al Viro currently does not let any tracepoint in VFS because he doesn't > want the internals of that code locked to an ABI. He's right to be > worried. Same with networking bits. We don't want tracepoints to limit kernel development, but we want debuggability and kernel analytics. All existing tracepoints defined via DEFINE_EVENT should not be an ABI. But some maintainers think of them as ABI, whereas others are using them freely. imo it's time to remove ambiguity. The idea for new style of tracepoints is the following: introduce new macro: DEFINE_EVENT_USER and let programs attach to them. These tracepoint will receive one or two pointers to important structs only. They will not have TP_printk, assign and fields. The placement and arguments to these new tracepoints will be an ABI. All existing tracepoints are not. The main reason to attach to tracepoint is that they are accessible without debug info (unlike kprobe) Another reason is speed. tracepoints are much faster than optimized kprobes and for real-time analytics the speed is critical. The position of new tracepoints and their arguments will be an ABI and the programs can be both. If program is using bpf_fetch*() helpers it obviously wants to access internal data structures, so it's really nothing more, but 'safe kernel module' and kernel layout specific. Both old and new tracepoints + programs will be used for live kernel debugging. If program is accessing user-ized data structures then it is portable and will run on any kernel. In uapi header we can define: struct task_struct_user { int pid; int prio; }; and let bpf programs access it via real 'struct task_struct *' pointer passed into tracepoint. bpf loader will translate offsets and sizes used inside the program into real task_struct's offsets and loads. (all structs are read-only of course) programs will be fast and kernel independent. They will be used for analytics (latency, etc) >> so in some cases we cannot change tracepoints without >> somebody complaining that his tool broke. >> In other cases tracepoints are used for debugging only >> and no one will notice when they change... >> It was and still a grey area. > > Not really. If a tool uses the tracepoint, it can lock that tracepoint > down. This is exactly what latencytop did. It happened, it's not a > hypothetical situation. correct. >> bpf doesn't change any of that. >> It actually makes addition of new tracepoints easier. > > I totally disagree. It adds more ways to see inside the kernel, and if > user space depends on this, it adds more ways the kernel can not change. > > It comes down to how robust eBPF is with the internals of the kernel > changing. If we limit eBPF to system call tracepoints only, that's > fine because those have the same ABI as the system call itself. I'm > worried about the internal tracepoints for scheduling, irqs, file > systems, etc. agree. we need to make it clear that existing tracepoints + programs is not ABI. >> In the future we might add a
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 22:10:45 -0800 Alexei Starovoitov wrote: > One can argue that current TP_printk format is already an ABI, > because somebody might be parsing the text output. If somebody does, then it is an ABI. Luckily, it's not that useful to parse, thus it hasn't been an issue. As Linus has stated in the past, it's not that we can't change ABI interfaces, its just that we can not change them if there's a user space application that depends on it. The harder we make it for interface changes to break user space, the better. The field layouts is a user interface. In fact, some of those fields must always be there. This is because tools do parse the layout and expect some events to have specific fields. Now we can add new fields, or even remove fields that no user space tool is using. This is because today, tools use libtraceevent to parse the event data. This is why I'm nervous about exporting the parameters of the trace event call. Right now, those parameters can always change, because the only way to know they exist is by looking at the code. And currently, there's no way to interact with those parameters. Once we have eBPF in mainline, we now have a way to interact with the parameters and if those parameters change, then the eBPF program will break, and if eBPF can be part of a user space tool, that will break that tool and whatever change in the trace point that caused this breakage would have to be reverted. IOW, this can limit development in the kernel. Al Viro currently does not let any tracepoint in VFS because he doesn't want the internals of that code locked to an ABI. He's right to be worried. > so in some cases we cannot change tracepoints without > somebody complaining that his tool broke. > In other cases tracepoints are used for debugging only > and no one will notice when they change... > It was and still a grey area. Not really. If a tool uses the tracepoint, it can lock that tracepoint down. This is exactly what latencytop did. It happened, it's not a hypothetical situation. > bpf doesn't change any of that. > It actually makes addition of new tracepoints easier. I totally disagree. It adds more ways to see inside the kernel, and if user space depends on this, it adds more ways the kernel can not change. It comes down to how robust eBPF is with the internals of the kernel changing. If we limit eBPF to system call tracepoints only, that's fine because those have the same ABI as the system call itself. I'm worried about the internal tracepoints for scheduling, irqs, file systems, etc. > In the future we might add a tracepoint and pass a single > pointer to interesting data struct to it. bpf programs will walk > data structures 'as safe modules' via bpf_fetch*() methods > without exposing it as ABI. Will this work if that structure changes? When the field we are looking for no longer exists? > whereas today we pass a lot of fields to tracepoints and > make all of these fields immutable. The parameters passed to the tracepoint are not shown to userspace and can change at will. Now, we present the final parsing of the parameters that convert to fields. As all currently known tools uses libtraceevent.a, and parse the format files, those fields can move around and even change in size. The structures are not immutable. The fields are locked down if user space relies on them. But they can move about within the tracepoint, because the parsing allows for it. Remember, these are processed fields. The result of TP_fast_assign() and what gets put into the ring buffer. Now what is passed to the actual tracepoint is not visible by userspace, and in lots of cases, it is just a pointer to some structure. What eBPF brings to the table is a way to access this structure from user space. What keeps a structured passed to a tracepoint from becoming immutable if there's a eBPF program that expects it to have a specific field? > > To me tracepoints are like gdb breakpoints. Unfortunately, it doesn't matter what you think they are. It doesn't matter what I think they are. What matters is what Linus thinks they are and if user space depends on it and Linus decides to revert what ever change broke that user space program, no matter what we think, we just screwed ourselves. I'm being stubborn on this because this is exactly what happened in the past, which caused every trace point to hold 4 bytes of padding. 4 bytes may not sound like a lot, but when you have 1 million tracepoints, that's 4 megs of wasted space. > and bpf programs like live debugger that examine things. If bpf programs only dealt with kprobes, I may agree. But tracepoints have already been proven to be a type of ABI. If we open another window into the kernel, this can screw us later. It's better to solve this now than when we are fighting with Linus over user space breakage. > > the next step is to be able to write bpf scripts on the fly > without leaving debugger. Something like perf probe + > editor + live execution.
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 21:51:05 -0800 Alexei Starovoitov wrote: > On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt wrote: > > > > Looks like this is entirely perf based and does not interact with > > ftrace at all. In other words, it's perf not tracing. > > > > It makes more sense to go through tip than the tracing tree. > > well, all of earlier series were based on ftrace only, > but I was given convincing enough arguments that > perf_even_open+ioctl is a better interface :) > Ok. will rebase on tip in the next version. That's fine. If this proves useful, I'll probably add the interface back to ftrace as well. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 4:50 PM, Steven Rostedt rost...@goodmis.org wrote: But some maintainers think of them as ABI, whereas others are using them freely. imo it's time to remove ambiguity. I would love to, and have brought this up at Kernel Summit more than once with no solution out of it. let's try it again at plumbers in august? Well, we need a statement from Linus. And it would be nice if we could also get Ingo involved in the discussion, but he seldom comes to anything but Kernel Summit. +1 BTW, I wonder if I could make a simple compiler in the kernel that would translate the current ftrace filters into a BPF program, where it could use the program and not use the current filter logic. yep. I've sent that patch last year. It converted pred_tree into bpf program. I can try to dig it up. It doesn't provide extra programmability though, just makes filtering logic much faster. imo the solution is DEFINE_EVENT_BPF that doesn't print anything and a bpf program to process it. You mean to be completely invisible to ftrace? And the debugfs/tracefs directory? I mean it will be seen in tracefs to get 'id', but without enable/format/filter I'm not suggesting to preserve the meaning of 'pid' semantically in all cases. That's not what users would want anyway. I want to allow programs to access important fields and print them in more generic way than current TP_printk does. Then exposed ABI of such tracepoint_bpf is smaller than with current tracepoints. Again, this would mean they become invisible to ftrace, and even ftrace_dump_on_oops. yes, since these new tracepoints have no meat inside them. They're placeholders sitting idle and waiting for bpf to do something useful with them. I'm not fully understanding what is to be exported by this new ABI. If the fields available, will always be available, then why can't the appear in a TP_printk()? say, we define trace_netif_rx_entry() as this new tracepoint_bpf. It will have only one argument 'skb'. bpf program will read and print skb fields the way it likes for particular tracing scenario. So instead of making TP_printk(dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d gso_type=%#x,... the abi exposed via trace_pipe (as it is today), the new tracepoint_bpf abi is presence of 'skb' pointer as one and only argument to bpf program. Future refactoring of netif_rx would need to guarantee that trace_netif_rx_entry(skb) is called. that's it. imo such tracepoints are much easier to deal with during code changes. May be some of the existing tracepoints like this one that takes one argument can be marked 'bpf-ready', so that programs can attach to them only. let's start slow then with bpf+syscall and bpf+kprobe only. I'm fine with that. thanks. will wait for merge window to close and will repost. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 19:04:55 -0800 Alexei Starovoitov a...@plumgrid.com wrote: You mean to be completely invisible to ftrace? And the debugfs/tracefs directory? I mean it will be seen in tracefs to get 'id', but without enable/format/filter In other words, invisible to ftrace. I'm not sure I'll be on board to work with that. Again, this would mean they become invisible to ftrace, and even ftrace_dump_on_oops. yes, since these new tracepoints have no meat inside them. They're placeholders sitting idle and waiting for bpf to do something useful with them. Hmm, I have a patch somewhere (never posted it), that add TRACE_MARKER(), which basically would just print that it was hit. But no data was passed to it. Something like that I would be more inclined to take. Then the question is, what can bpf access there? Could just be a place holder to add a fast kprobe. This way, it can show up in trace events with enable and and all that, it just wont have any data to print out besides the normal pid, flags, etc. But because it will inject a nop, that could be converted to a jump, it will give you the power of kprobes but with the speed of a tracepoint. I'm not fully understanding what is to be exported by this new ABI. If the fields available, will always be available, then why can't the appear in a TP_printk()? say, we define trace_netif_rx_entry() as this new tracepoint_bpf. It will have only one argument 'skb'. bpf program will read and print skb fields the way it likes for particular tracing scenario. So instead of making TP_printk(dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d gso_type=%#x,... the abi exposed via trace_pipe (as it is today), the new tracepoint_bpf abi is presence of 'skb' pointer as one and only argument to bpf program. Future refactoring of netif_rx would need to guarantee that trace_netif_rx_entry(skb) is called. that's it. imo such tracepoints are much easier to deal with during code changes. But what can you access from that skb that is guaranteed to be there? If you say anything, then there's no reason it can't be added to the printk as well. May be some of the existing tracepoints like this one that takes one argument can be marked 'bpf-ready', so that programs can attach to them only. I really hate the idea of adding tracepoints that ftrace can't use. It basically kills the entire busy box usage scenario, as boards that have extremely limited userspace still make full use of ftrace via the existing tracepoints. I still don't see the argument that adding data via the bpf functions is any different than adding those same items to fields in an event. Once you add a bpf function, then you must maintain those fields. Look, you can always add more to a TP_printk(), as that is standard with all text file kernel parsing. Like stat in /proc. Fields can not be removed, but more can always be appended to the end. Any tool that parses trace_pipe is broken if it can't handle extended fields. The api can be extended, and for text files, that is by appending to them. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 11:53:22 -0800 Alexei Starovoitov a...@plumgrid.com wrote: On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt rost...@goodmis.org wrote: On Mon, 9 Feb 2015 22:10:45 -0800 Alexei Starovoitov a...@plumgrid.com wrote: One can argue that current TP_printk format is already an ABI, because somebody might be parsing the text output. If somebody does, then it is an ABI. Luckily, it's not that useful to parse, thus it hasn't been an issue. As Linus has stated in the past, it's not that we can't change ABI interfaces, its just that we can not change them if there's a user space application that depends on it. there are already tools that parse trace_pipe: https://github.com/brendangregg/perf-tools Yep, and if this becomes a standard, then any change that makes trace_pipe different will be reverted. and expect some events to have specific fields. Now we can add new fields, or even remove fields that no user space tool is using. This is because today, tools use libtraceevent to parse the event data. not all tools use libtraceevent. gdb calls perf_event_open directly: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c and parses PERF_RECORD_SAMPLE as a binary. In this case it's branch records, but I think we never said anywhere that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come in this particular order. What particular order? Note, that's a hardware event, not a software one. This is why I'm nervous about exporting the parameters of the trace event call. Right now, those parameters can always change, because the only way to know they exist is by looking at the code. And currently, there's no way to interact with those parameters. Once we have eBPF in mainline, we now have a way to interact with the parameters and if those parameters change, then the eBPF program will break, and if eBPF can be part of a user space tool, that will break that tool and whatever change in the trace point that caused this breakage would have to be reverted. IOW, this can limit development in the kernel. it can limit development unless we say that bpf programs that attach to tracepoints are not part of ABI. Easy enough to add large comment similar to perf_event.h Again, it doesn't matter what we say. Linus made that very clear. He stated if you provide an interface, and someone uses that interface for a user space application, and they complain if it breaks, that is just reason to revert whatever broke it. Al Viro currently does not let any tracepoint in VFS because he doesn't want the internals of that code locked to an ABI. He's right to be worried. Same with networking bits. We don't want tracepoints to limit kernel development, but we want debuggability and kernel analytics. All existing tracepoints defined via DEFINE_EVENT should not be an ABI. I agree, but that doesn't make it so :-/ But some maintainers think of them as ABI, whereas others are using them freely. imo it's time to remove ambiguity. I would love to, and have brought this up at Kernel Summit more than once with no solution out of it. The idea for new style of tracepoints is the following: introduce new macro: DEFINE_EVENT_USER and let programs attach to them. We actually have that today. But it's TRACE_EVENT_FLAGS(), although that should be cleaned up a bit. Frederic added it to label events that are safe for perf non root. It seems to be used currently only for syscalls. These tracepoint will receive one or two pointers to important structs only. They will not have TP_printk, assign and fields. The placement and arguments to these new tracepoints will be an ABI. All existing tracepoints are not. TP_printk() is not really an issue. The main reason to attach to tracepoint is that they are accessible without debug info (unlike kprobe) That is, if you have a special bpf call to access variables, right? How else do you access part of a data structure. Another reason is speed. tracepoints are much faster than optimized kprobes and for real-time analytics the speed is critical. The position of new tracepoints and their arguments will be an ABI and the programs can be both. You means special tracepoints one that does export the arguments? Question is, how many maintainers will add these, knowing that they will have to be forever maintained as is. If program is using bpf_fetch*() helpers it obviously wants to access internal data structures, so it's really nothing more, but 'safe kernel module' and kernel layout specific. Both old and new tracepoints + programs will be used for live kernel debugging. If program is accessing user-ized data structures then Technically, the TP_struct__entry is a user-ized structure. it is portable and will run on any kernel. In uapi header we can define: struct task_struct_user { int pid; int prio; Here's a perfect example of something that looks
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt rost...@goodmis.org wrote: On Mon, 9 Feb 2015 22:10:45 -0800 Alexei Starovoitov a...@plumgrid.com wrote: One can argue that current TP_printk format is already an ABI, because somebody might be parsing the text output. If somebody does, then it is an ABI. Luckily, it's not that useful to parse, thus it hasn't been an issue. As Linus has stated in the past, it's not that we can't change ABI interfaces, its just that we can not change them if there's a user space application that depends on it. there are already tools that parse trace_pipe: https://github.com/brendangregg/perf-tools and expect some events to have specific fields. Now we can add new fields, or even remove fields that no user space tool is using. This is because today, tools use libtraceevent to parse the event data. not all tools use libtraceevent. gdb calls perf_event_open directly: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c and parses PERF_RECORD_SAMPLE as a binary. In this case it's branch records, but I think we never said anywhere that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come in this particular order. This is why I'm nervous about exporting the parameters of the trace event call. Right now, those parameters can always change, because the only way to know they exist is by looking at the code. And currently, there's no way to interact with those parameters. Once we have eBPF in mainline, we now have a way to interact with the parameters and if those parameters change, then the eBPF program will break, and if eBPF can be part of a user space tool, that will break that tool and whatever change in the trace point that caused this breakage would have to be reverted. IOW, this can limit development in the kernel. it can limit development unless we say that bpf programs that attach to tracepoints are not part of ABI. Easy enough to add large comment similar to perf_event.h Al Viro currently does not let any tracepoint in VFS because he doesn't want the internals of that code locked to an ABI. He's right to be worried. Same with networking bits. We don't want tracepoints to limit kernel development, but we want debuggability and kernel analytics. All existing tracepoints defined via DEFINE_EVENT should not be an ABI. But some maintainers think of them as ABI, whereas others are using them freely. imo it's time to remove ambiguity. The idea for new style of tracepoints is the following: introduce new macro: DEFINE_EVENT_USER and let programs attach to them. These tracepoint will receive one or two pointers to important structs only. They will not have TP_printk, assign and fields. The placement and arguments to these new tracepoints will be an ABI. All existing tracepoints are not. The main reason to attach to tracepoint is that they are accessible without debug info (unlike kprobe) Another reason is speed. tracepoints are much faster than optimized kprobes and for real-time analytics the speed is critical. The position of new tracepoints and their arguments will be an ABI and the programs can be both. If program is using bpf_fetch*() helpers it obviously wants to access internal data structures, so it's really nothing more, but 'safe kernel module' and kernel layout specific. Both old and new tracepoints + programs will be used for live kernel debugging. If program is accessing user-ized data structures then it is portable and will run on any kernel. In uapi header we can define: struct task_struct_user { int pid; int prio; }; and let bpf programs access it via real 'struct task_struct *' pointer passed into tracepoint. bpf loader will translate offsets and sizes used inside the program into real task_struct's offsets and loads. (all structs are read-only of course) programs will be fast and kernel independent. They will be used for analytics (latency, etc) so in some cases we cannot change tracepoints without somebody complaining that his tool broke. In other cases tracepoints are used for debugging only and no one will notice when they change... It was and still a grey area. Not really. If a tool uses the tracepoint, it can lock that tracepoint down. This is exactly what latencytop did. It happened, it's not a hypothetical situation. correct. bpf doesn't change any of that. It actually makes addition of new tracepoints easier. I totally disagree. It adds more ways to see inside the kernel, and if user space depends on this, it adds more ways the kernel can not change. It comes down to how robust eBPF is with the internals of the kernel changing. If we limit eBPF to system call tracepoints only, that's fine because those have the same ABI as the system call itself. I'm worried about the internal tracepoints for scheduling, irqs, file systems, etc. agree. we need to make it clear that existing tracepoints + programs is not ABI. In the future we might add a tracepoint and pass a
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 8:31 PM, Steven Rostedt rost...@goodmis.org wrote: Again, this would mean they become invisible to ftrace, and even ftrace_dump_on_oops. yes, since these new tracepoints have no meat inside them. They're placeholders sitting idle and waiting for bpf to do something useful with them. Hmm, I have a patch somewhere (never posted it), that add TRACE_MARKER(), which basically would just print that it was hit. But no data was passed to it. Something like that I would be more inclined to take. Then the question is, what can bpf access there? Could just be a place holder to add a fast kprobe. This way, it can show up in trace events with enable and and all that, it just wont have any data to print out besides the normal pid, flags, etc. But because it will inject a nop, that could be converted to a jump, it will give you the power of kprobes but with the speed of a tracepoint. fair enough. Something like TRACE_MARKER(arg1, arg2) that prints it was hit without accessing the args would be enough. Without any args it is indeed a 'fast kprobe' only. Debug info would still be needed to access function arguments. On x64 function entry point and x64 abi make it easy to access args, but i386 or kprobe in the middle lose visibility when debug info is not available. TRACE_MARKER (with few key args that function is operating on) is enough to achieve roughly the same as kprobe without debug info. I'm not fully understanding what is to be exported by this new ABI. If the fields available, will always be available, then why can't the appear in a TP_printk()? say, we define trace_netif_rx_entry() as this new tracepoint_bpf. It will have only one argument 'skb'. bpf program will read and print skb fields the way it likes for particular tracing scenario. So instead of making TP_printk(dev=%s napi_id=%#x queue_mapping=%u skbaddr=%p vlan_tagged=%d vlan_proto=0x%04x vlan_tci=0x%04x protocol=0x%04x ip_summed=%d hash=0x%08x l4_hash=%d len=%u data_len=%u truesize=%u mac_header_valid=%d mac_header=%d nr_frags=%d gso_size=%d gso_type=%#x,... the abi exposed via trace_pipe (as it is today), the new tracepoint_bpf abi is presence of 'skb' pointer as one and only argument to bpf program. Future refactoring of netif_rx would need to guarantee that trace_netif_rx_entry(skb) is called. that's it. imo such tracepoints are much easier to deal with during code changes. But what can you access from that skb that is guaranteed to be there? If you say anything, then there's no reason it can't be added to the printk as well. programs can access any field via bpf_fetch*() helpers which make them kernel layout dependent or via user-ized sk_buff with few fields which is portable. In both cases kernel/user abi is only 'skb' pointer. whether it's debugging program that needs full access via fetch* helpers or portable program that uses stable api it's up to program author. Just like kprobes, it's clear, that if program is using fetch* helpers it's doing it without any abi guarantees. 'perf probe' and 'bpf with fetch* helpers are the same. perf probe creates wrappers on top of probe_kernel_read and bpf_fetch* helpers are wrappers on top of probe_kernel_read. Complains that 'my kprobe with flags=%cx mode=+4($stack) stopped working in new kernel' are equivalent to complains that program with bpf_fetch* stopped working. Whereas if program is using user-ized structs it will work across kernel versions, though it will be able to see only very limited slice of in-kernel data. May be some of the existing tracepoints like this one that takes one argument can be marked 'bpf-ready', so that programs can attach to them only. I really hate the idea of adding tracepoints that ftrace can't use. It basically kills the entire busy box usage scenario, as boards that have extremely limited userspace still make full use of ftrace via the existing tracepoints. agree. I think your trace_marker with few args is a good middle ground. I still don't see the argument that adding data via the bpf functions is any different than adding those same items to fields in an event. Once you add a bpf function, then you must maintain those fields. Look, you can always add more to a TP_printk(), as that is standard with all text file kernel parsing. Like stat in /proc. Fields can not be removed, but more can always be appended to the end. Any tool that parses trace_pipe is broken if it can't handle extended fields. The api can be extended, and for text files, that is by appending to them. I agree that any text parsing script should be able to cope with additional args without problems. I think it's a fear of 1% breakage is causing maintainers to avoid any changes to tracepoints even when they just add few args to the end of TP_printk. When tracepoints stop printing and the only thing they see is single pointer to a well known struct like sk_buff, this fear of tracepoints should fade. programs are
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, Feb 10, 2015 at 1:53 PM, Steven Rostedt rost...@goodmis.org wrote: On Tue, 10 Feb 2015 11:53:22 -0800 Alexei Starovoitov a...@plumgrid.com wrote: On Tue, Feb 10, 2015 at 5:05 AM, Steven Rostedt rost...@goodmis.org wrote: On Mon, 9 Feb 2015 22:10:45 -0800 Alexei Starovoitov a...@plumgrid.com wrote: One can argue that current TP_printk format is already an ABI, because somebody might be parsing the text output. If somebody does, then it is an ABI. Luckily, it's not that useful to parse, thus it hasn't been an issue. As Linus has stated in the past, it's not that we can't change ABI interfaces, its just that we can not change them if there's a user space application that depends on it. there are already tools that parse trace_pipe: https://github.com/brendangregg/perf-tools Yep, and if this becomes a standard, then any change that makes trace_pipe different will be reverted. I think reading of trace_pipe is widespread. and expect some events to have specific fields. Now we can add new fields, or even remove fields that no user space tool is using. This is because today, tools use libtraceevent to parse the event data. not all tools use libtraceevent. gdb calls perf_event_open directly: https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-btrace.c and parses PERF_RECORD_SAMPLE as a binary. In this case it's branch records, but I think we never said anywhere that PERF_SAMPLE_IP | PERF_SAMPLE_ADDR should come in this particular order. What particular order? Note, that's a hardware event, not a software one. yes, but gdb assumes that 'u64 ip' precedes, 'u64 addr' when attr.sample_type = IP | ADDR whereas this is an internal order of 'if' statements inside perf_output_sample()... But some maintainers think of them as ABI, whereas others are using them freely. imo it's time to remove ambiguity. I would love to, and have brought this up at Kernel Summit more than once with no solution out of it. let's try it again at plumbers in august? For now I'm going to drop bpf+tracepoints, since it's so controversial and go with bpf+syscall and bpf+kprobe only. Hopefully by august it will be clear what bpf+kprobes can do and why I'm excited about bpf+tracepoints in the future. The idea for new style of tracepoints is the following: introduce new macro: DEFINE_EVENT_USER and let programs attach to them. We actually have that today. But it's TRACE_EVENT_FLAGS(), although that should be cleaned up a bit. Frederic added it to label events that are safe for perf non root. It seems to be used currently only for syscalls. I didn't mean to let unprivileged apps to use. Better name probably would be: DEFINE_EVENT_BPF and only let bpf programs use it. These tracepoint will receive one or two pointers to important structs only. They will not have TP_printk, assign and fields. The placement and arguments to these new tracepoints will be an ABI. All existing tracepoints are not. TP_printk() is not really an issue. I think it is. The way things are printed is the most visible part of tracepoints and I suspect maintainers don't want to add new ones, because internal fields are printed and users do parse trace_pipe. Recent discussion about tcp instrumentation was about adding new tracepoints and a module to print them. As soon as something like this is in, the next question 'what we're going to do when more arguments need to be printed'... imo the solution is DEFINE_EVENT_BPF that doesn't print anything and a bpf program to process it. Then programs decide what they like to pass to user space and in what format. The kernel/user ABI is the position of this new tracepoint and arguments that are passed in. bpf program takes care of walking pointers, extracting interesting fields, aggregating and passing them to user in the format specific to this particular program. Then when user wants to collect more fields it just changes the program and corresponding userland side. The position of new tracepoints and their arguments will be an ABI and the programs can be both. You means special tracepoints one that does export the arguments? Question is, how many maintainers will add these, knowing that they will have to be forever maintained as is. Obviously we would need to prove usefulness of tracepoint_bpf before they're accepted. Hopefully bpf+kprobe will make an impression on its own and users would want similar but without debug info. If program is using bpf_fetch*() helpers it obviously wants to access internal data structures, so it's really nothing more, but 'safe kernel module' and kernel layout specific. Both old and new tracepoints + programs will be used for live kernel debugging. If program is accessing user-ized data structures then Technically, the TP_struct__entry is a user-ized structure. it is portable and will run on any kernel. In uapi header we can define: struct task_struct_user { int pid;
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 22:10:45 -0800 Alexei Starovoitov a...@plumgrid.com wrote: One can argue that current TP_printk format is already an ABI, because somebody might be parsing the text output. If somebody does, then it is an ABI. Luckily, it's not that useful to parse, thus it hasn't been an issue. As Linus has stated in the past, it's not that we can't change ABI interfaces, its just that we can not change them if there's a user space application that depends on it. The harder we make it for interface changes to break user space, the better. The field layouts is a user interface. In fact, some of those fields must always be there. This is because tools do parse the layout and expect some events to have specific fields. Now we can add new fields, or even remove fields that no user space tool is using. This is because today, tools use libtraceevent to parse the event data. This is why I'm nervous about exporting the parameters of the trace event call. Right now, those parameters can always change, because the only way to know they exist is by looking at the code. And currently, there's no way to interact with those parameters. Once we have eBPF in mainline, we now have a way to interact with the parameters and if those parameters change, then the eBPF program will break, and if eBPF can be part of a user space tool, that will break that tool and whatever change in the trace point that caused this breakage would have to be reverted. IOW, this can limit development in the kernel. Al Viro currently does not let any tracepoint in VFS because he doesn't want the internals of that code locked to an ABI. He's right to be worried. so in some cases we cannot change tracepoints without somebody complaining that his tool broke. In other cases tracepoints are used for debugging only and no one will notice when they change... It was and still a grey area. Not really. If a tool uses the tracepoint, it can lock that tracepoint down. This is exactly what latencytop did. It happened, it's not a hypothetical situation. bpf doesn't change any of that. It actually makes addition of new tracepoints easier. I totally disagree. It adds more ways to see inside the kernel, and if user space depends on this, it adds more ways the kernel can not change. It comes down to how robust eBPF is with the internals of the kernel changing. If we limit eBPF to system call tracepoints only, that's fine because those have the same ABI as the system call itself. I'm worried about the internal tracepoints for scheduling, irqs, file systems, etc. In the future we might add a tracepoint and pass a single pointer to interesting data struct to it. bpf programs will walk data structures 'as safe modules' via bpf_fetch*() methods without exposing it as ABI. Will this work if that structure changes? When the field we are looking for no longer exists? whereas today we pass a lot of fields to tracepoints and make all of these fields immutable. The parameters passed to the tracepoint are not shown to userspace and can change at will. Now, we present the final parsing of the parameters that convert to fields. As all currently known tools uses libtraceevent.a, and parse the format files, those fields can move around and even change in size. The structures are not immutable. The fields are locked down if user space relies on them. But they can move about within the tracepoint, because the parsing allows for it. Remember, these are processed fields. The result of TP_fast_assign() and what gets put into the ring buffer. Now what is passed to the actual tracepoint is not visible by userspace, and in lots of cases, it is just a pointer to some structure. What eBPF brings to the table is a way to access this structure from user space. What keeps a structured passed to a tracepoint from becoming immutable if there's a eBPF program that expects it to have a specific field? To me tracepoints are like gdb breakpoints. Unfortunately, it doesn't matter what you think they are. It doesn't matter what I think they are. What matters is what Linus thinks they are and if user space depends on it and Linus decides to revert what ever change broke that user space program, no matter what we think, we just screwed ourselves. I'm being stubborn on this because this is exactly what happened in the past, which caused every trace point to hold 4 bytes of padding. 4 bytes may not sound like a lot, but when you have 1 million tracepoints, that's 4 megs of wasted space. and bpf programs like live debugger that examine things. If bpf programs only dealt with kprobes, I may agree. But tracepoints have already been proven to be a type of ABI. If we open another window into the kernel, this can screw us later. It's better to solve this now than when we are fighting with Linus over user space breakage. the next step is to be able to write bpf scripts on the fly without leaving debugger. Something like perf probe + editor + live execution. Truly
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 21:51:05 -0800 Alexei Starovoitov a...@plumgrid.com wrote: On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt rost...@goodmis.org wrote: Looks like this is entirely perf based and does not interact with ftrace at all. In other words, it's perf not tracing. It makes more sense to go through tip than the tracing tree. well, all of earlier series were based on ftrace only, but I was given convincing enough arguments that perf_even_open+ioctl is a better interface :) Ok. will rebase on tip in the next version. That's fine. If this proves useful, I'll probably add the interface back to ftrace as well. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Tue, 10 Feb 2015 16:22:50 -0800 Alexei Starovoitov a...@plumgrid.com wrote: Yep, and if this becomes a standard, then any change that makes trace_pipe different will be reverted. I think reading of trace_pipe is widespread. I've heard of a few, but nothing that has broken when something changed. Is it scripts or actual C code? Again, it matters if people complain about the change. But some maintainers think of them as ABI, whereas others are using them freely. imo it's time to remove ambiguity. I would love to, and have brought this up at Kernel Summit more than once with no solution out of it. let's try it again at plumbers in august? Well, we need a statement from Linus. And it would be nice if we could also get Ingo involved in the discussion, but he seldom comes to anything but Kernel Summit. For now I'm going to drop bpf+tracepoints, since it's so controversial and go with bpf+syscall and bpf+kprobe only. Probably the safest bet. Hopefully by august it will be clear what bpf+kprobes can do and why I'm excited about bpf+tracepoints in the future. BTW, I wonder if I could make a simple compiler in the kernel that would translate the current ftrace filters into a BPF program, where it could use the program and not use the current filter logic. These tracepoint will receive one or two pointers to important structs only. They will not have TP_printk, assign and fields. The placement and arguments to these new tracepoints will be an ABI. All existing tracepoints are not. TP_printk() is not really an issue. I think it is. The way things are printed is the most visible part of tracepoints and I suspect maintainers don't want to add new ones, because internal fields are printed and users do parse trace_pipe. Recent discussion about tcp instrumentation was about adding new tracepoints and a module to print them. As soon as something like this is in, the next question 'what we're going to do when more arguments need to be printed'... I should rephrase that. It's not that it's not an issue, it's just that it hasn't been an issue. the trace_pipe code is slow. The raw_trace_pipe is much faster. Any tool would benefit from using it. I really need to get a library out to help users do such a thing. imo the solution is DEFINE_EVENT_BPF that doesn't print anything and a bpf program to process it. You mean to be completely invisible to ftrace? And the debugfs/tracefs directory? it is portable and will run on any kernel. In uapi header we can define: struct task_struct_user { int pid; int prio; Here's a perfect example of something that looks stable to show to user space, but is really a pimple that is hiding cancer. Lets start with pid. We have name spaces. What pid will be put there? We have to show the pid of the name space it is under. Then we have prio. What is prio in the DEADLINE scheduler. It is rather meaningless. Also, it is meaningless in SCHED_OTHER. Also note that even for SCHED_FIFO, the prio is used differently in the kernel than it is in userspace. For the kernel, lower is higher. well, -prio and -pid are already printed by sched tracepoints and their meaning depends on scheduler. So users taking that into account. I know, and Peter hates this. I'm not suggesting to preserve the meaning of 'pid' semantically in all cases. That's not what users would want anyway. I want to allow programs to access important fields and print them in more generic way than current TP_printk does. Then exposed ABI of such tracepoint_bpf is smaller than with current tracepoints. Again, this would mean they become invisible to ftrace, and even ftrace_dump_on_oops. I'm not fully understanding what is to be exported by this new ABI. If the fields available, will always be available, then why can't the appear in a TP_printk()? eBPF is very flexible, which means it is bound to have someone use it in a way you never dreamed of, and that will be what bites you in the end (pun intended). understood :) let's start slow then with bpf+syscall and bpf+kprobe only. I'm fine with that. also not all bpf programs are equal. bpf+existing tracepoint is not ABI Why not? well, because we want to see more tracepoints in the kernel. We're already struggling to add more. Still, the question is, even with a new tracepoint that limits what it shows, there still needs to be something that is guaranteed to always be there. I still don't see how this is different than the current tracepoints. bpf+new tracepoint is ABI if programs are not using bpf_fetch How is this different? the new ones will be explicit by definition. Who monitors this? To give you an example, we thought about scrambling the trace event field locations from boot to boot to keep tools from hard coding the event layout. This may sound crazy, but developers out there are crazy. And if you want to keep
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, Feb 9, 2015 at 9:13 PM, Steven Rostedt wrote: >> \ >> + if (prog) { \ >> + __maybe_unused const u64 z = 0; \ >> + struct bpf_context __ctx = ((struct bpf_context) { \ >> + __BPF_CAST6(args, z, z, z, z, z)\ > > Note, there is no guarantee that args is at most 6. For example, in > drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the > trace_event brcms_txstatus has 8 args. > > But I guess that's OK if you do not need those last args, right? yeah, some tracepoints pass a lot of things. That's rare and in most of the cases they can be fetched from parent structure. > I'm nervous about showing args of tracepoints too, because we don't want > that to become a strict ABI either. One can argue that current TP_printk format is already an ABI, because somebody might be parsing the text output. so in some cases we cannot change tracepoints without somebody complaining that his tool broke. In other cases tracepoints are used for debugging only and no one will notice when they change... It was and still a grey area. bpf doesn't change any of that. It actually makes addition of new tracepoints easier. In the future we might add a tracepoint and pass a single pointer to interesting data struct to it. bpf programs will walk data structures 'as safe modules' via bpf_fetch*() methods without exposing it as ABI. whereas today we pass a lot of fields to tracepoints and make all of these fields immutable. To me tracepoints are like gdb breakpoints. and bpf programs like live debugger that examine things. the next step is to be able to write bpf scripts on the fly without leaving debugger. Something like perf probe + editor + live execution. Truly like gdb for kernel. while kernel is running. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt wrote: > > Looks like this is entirely perf based and does not interact with > ftrace at all. In other words, it's perf not tracing. > > It makes more sense to go through tip than the tracing tree. well, all of earlier series were based on ftrace only, but I was given convincing enough arguments that perf_even_open+ioctl is a better interface :) Ok. will rebase on tip in the next version. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 19:45:54 -0800 Alexei Starovoitov wrote: > +/* For tracepoint filters argN fields match one to one to arguments > + * passed to tracepoint events > + * > + * For syscall entry filters argN fields match syscall arguments > + * For syscall exit filters arg1 is a return value > + */ > +struct bpf_context { > + u64 arg1; > + u64 arg2; > + u64 arg3; > + u64 arg4; > + u64 arg5; > + u64 arg6; > +}; > + > +#endif /* _LINUX_KERNEL_BPF_TRACE_H */ > diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h > index 139b5067345b..4c275ce2dcf0 100644 > --- a/include/trace/ftrace.h > +++ b/include/trace/ftrace.h > @@ -17,6 +17,7 @@ > */ > > #include > +#include > > /* > * DECLARE_EVENT_CLASS can be used to add a generic function > @@ -755,12 +756,32 @@ __attribute__((section("_ftrace_events"))) > *__event_##call = _##call > #undef __perf_task > #define __perf_task(t) (__task = (t)) > > +/* zero extend integer, pointer or aggregate type to u64 without warnings */ > +#define __CAST_TO_U64(EXPR) ({ \ > + u64 ret = 0; \ > + typeof(EXPR) expr = EXPR; \ > + switch (sizeof(expr)) { \ > + case 8: ret = *(u64 *) break; \ > + case 4: ret = *(u32 *) break; \ > + case 2: ret = *(u16 *) break; \ > + case 1: ret = *(u8 *) break; \ > + } \ > + ret; }) > + > +#define __BPF_CAST1(a,...) __CAST_TO_U64(a) > +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__) > +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__) > +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__) > +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__) > +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__) > + > #undef DECLARE_EVENT_CLASS > #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) > \ > static notrace void \ > perf_trace_##call(void *__data, proto) > \ > {\ > struct ftrace_event_call *event_call = __data; \ > + struct bpf_prog *prog = event_call->prog; \ > struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\ > struct ftrace_raw_##call *entry;\ > struct pt_regs __regs; \ > @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto) > \ > int __data_size;\ > int rctx; \ > \ > + if (prog) { \ > + __maybe_unused const u64 z = 0; \ > + struct bpf_context __ctx = ((struct bpf_context) { \ > + __BPF_CAST6(args, z, z, z, z, z)\ Note, there is no guarantee that args is at most 6. For example, in drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the trace_event brcms_txstatus has 8 args. But I guess that's OK if you do not need those last args, right? Also, there's no interface the lets us know what the args are. I may be able to come up with something. That's the reason I never filtered before tracing. Because we had no way of knowing what to filter on, because the args were never visible. I'm nervous about showing args of tracepoints too, because we don't want that to become a strict ABI either. -- Steve > + }); \ > + \ > + if (!trace_call_bpf(prog, &__ctx)) \ > + return; \ > + } \ > + \ > __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 19:45:54 -0800 Alexei Starovoitov wrote: > +#endif /* _LINUX_KERNEL_BPF_TRACE_H */ > diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h > index 139b5067345b..4c275ce2dcf0 100644 > --- a/include/trace/ftrace.h > +++ b/include/trace/ftrace.h > @@ -17,6 +17,7 @@ > */ > > #include > +#include > > /* > * DECLARE_EVENT_CLASS can be used to add a generic function > @@ -755,12 +756,32 @@ __attribute__((section("_ftrace_events"))) > *__event_##call = _##call > #undef __perf_task > #define __perf_task(t) (__task = (t)) > > +/* zero extend integer, pointer or aggregate type to u64 without warnings */ > +#define __CAST_TO_U64(EXPR) ({ \ > + u64 ret = 0; \ > + typeof(EXPR) expr = EXPR; \ > + switch (sizeof(expr)) { \ > + case 8: ret = *(u64 *) break; \ > + case 4: ret = *(u32 *) break; \ > + case 2: ret = *(u16 *) break; \ > + case 1: ret = *(u8 *) break; \ > + } \ > + ret; }) > + > +#define __BPF_CAST1(a,...) __CAST_TO_U64(a) > +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__) > +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__) > +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__) > +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__) > +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__) > + > #undef DECLARE_EVENT_CLASS > #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) > \ > static notrace void \ > perf_trace_##call(void *__data, proto) > \ > {\ > struct ftrace_event_call *event_call = __data; \ > + struct bpf_prog *prog = event_call->prog; \ Looks like this is entirely perf based and does not interact with ftrace at all. In other words, it's perf not tracing. It makes more sense to go through tip than the tracing tree. But I still do not want any hard coded event structures. All access to data from the binary code must be parsed by looking at the event/format files. Otherwise you will lock internals of the kernel as userspace ABI, because eBPF programs will break if those internals change, and that could severely limit progress in the future. -- Steve > struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\ > struct ftrace_raw_##call *entry;\ > struct pt_regs __regs; \ > @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto) > \ > int __data_size;\ > int rctx; \ > \ > + if (prog) { \ > + __maybe_unused const u64 z = 0; \ > + struct bpf_context __ctx = ((struct bpf_context) { \ > + __BPF_CAST6(args, z, z, z, z, z)\ > + }); \ > + \ > + if (!trace_call_bpf(prog, &__ctx)) \ > + return; \ > + } \ > + \ > __data_size = ftrace_get_offsets_##call(&__data_offsets, args); \ > \ > head = this_cpu_ptr(event_call->perf_events); \ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, Feb 9, 2015 at 8:46 PM, Steven Rostedt rost...@goodmis.org wrote: Looks like this is entirely perf based and does not interact with ftrace at all. In other words, it's perf not tracing. It makes more sense to go through tip than the tracing tree. well, all of earlier series were based on ftrace only, but I was given convincing enough arguments that perf_even_open+ioctl is a better interface :) Ok. will rebase on tip in the next version. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, Feb 9, 2015 at 9:13 PM, Steven Rostedt rost...@goodmis.org wrote: \ + if (prog) { \ + __maybe_unused const u64 z = 0; \ + struct bpf_context __ctx = ((struct bpf_context) { \ + __BPF_CAST6(args, z, z, z, z, z)\ Note, there is no guarantee that args is at most 6. For example, in drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the trace_event brcms_txstatus has 8 args. But I guess that's OK if you do not need those last args, right? yeah, some tracepoints pass a lot of things. That's rare and in most of the cases they can be fetched from parent structure. I'm nervous about showing args of tracepoints too, because we don't want that to become a strict ABI either. One can argue that current TP_printk format is already an ABI, because somebody might be parsing the text output. so in some cases we cannot change tracepoints without somebody complaining that his tool broke. In other cases tracepoints are used for debugging only and no one will notice when they change... It was and still a grey area. bpf doesn't change any of that. It actually makes addition of new tracepoints easier. In the future we might add a tracepoint and pass a single pointer to interesting data struct to it. bpf programs will walk data structures 'as safe modules' via bpf_fetch*() methods without exposing it as ABI. whereas today we pass a lot of fields to tracepoints and make all of these fields immutable. To me tracepoints are like gdb breakpoints. and bpf programs like live debugger that examine things. the next step is to be able to write bpf scripts on the fly without leaving debugger. Something like perf probe + editor + live execution. Truly like gdb for kernel. while kernel is running. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 19:45:54 -0800 Alexei Starovoitov a...@plumgrid.com wrote: +#endif /* _LINUX_KERNEL_BPF_TRACE_H */ diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h index 139b5067345b..4c275ce2dcf0 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -17,6 +17,7 @@ */ #include linux/ftrace_event.h +#include trace/bpf_trace.h /* * DECLARE_EVENT_CLASS can be used to add a generic function @@ -755,12 +756,32 @@ __attribute__((section(_ftrace_events))) *__event_##call = event_##call #undef __perf_task #define __perf_task(t) (__task = (t)) +/* zero extend integer, pointer or aggregate type to u64 without warnings */ +#define __CAST_TO_U64(EXPR) ({ \ + u64 ret = 0; \ + typeof(EXPR) expr = EXPR; \ + switch (sizeof(expr)) { \ + case 8: ret = *(u64 *) expr; break; \ + case 4: ret = *(u32 *) expr; break; \ + case 2: ret = *(u16 *) expr; break; \ + case 1: ret = *(u8 *) expr; break; \ + } \ + ret; }) + +#define __BPF_CAST1(a,...) __CAST_TO_U64(a) +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__) +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__) +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__) +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__) +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__) + #undef DECLARE_EVENT_CLASS #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ static notrace void \ perf_trace_##call(void *__data, proto) \ {\ struct ftrace_event_call *event_call = __data; \ + struct bpf_prog *prog = event_call-prog; \ Looks like this is entirely perf based and does not interact with ftrace at all. In other words, it's perf not tracing. It makes more sense to go through tip than the tracing tree. But I still do not want any hard coded event structures. All access to data from the binary code must be parsed by looking at the event/format files. Otherwise you will lock internals of the kernel as userspace ABI, because eBPF programs will break if those internals change, and that could severely limit progress in the future. -- Steve struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\ struct ftrace_raw_##call *entry;\ struct pt_regs __regs; \ @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto) \ int __data_size;\ int rctx; \ \ + if (prog) { \ + __maybe_unused const u64 z = 0; \ + struct bpf_context __ctx = ((struct bpf_context) { \ + __BPF_CAST6(args, z, z, z, z, z)\ + }); \ + \ + if (!trace_call_bpf(prog, __ctx)) \ + return; \ + } \ + \ __data_size = ftrace_get_offsets_##call(__data_offsets, args); \ \ head = this_cpu_ptr(event_call-perf_events); \ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls
On Mon, 9 Feb 2015 19:45:54 -0800 Alexei Starovoitov a...@plumgrid.com wrote: +/* For tracepoint filters argN fields match one to one to arguments + * passed to tracepoint events + * + * For syscall entry filters argN fields match syscall arguments + * For syscall exit filters arg1 is a return value + */ +struct bpf_context { + u64 arg1; + u64 arg2; + u64 arg3; + u64 arg4; + u64 arg5; + u64 arg6; +}; + +#endif /* _LINUX_KERNEL_BPF_TRACE_H */ diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h index 139b5067345b..4c275ce2dcf0 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -17,6 +17,7 @@ */ #include linux/ftrace_event.h +#include trace/bpf_trace.h /* * DECLARE_EVENT_CLASS can be used to add a generic function @@ -755,12 +756,32 @@ __attribute__((section(_ftrace_events))) *__event_##call = event_##call #undef __perf_task #define __perf_task(t) (__task = (t)) +/* zero extend integer, pointer or aggregate type to u64 without warnings */ +#define __CAST_TO_U64(EXPR) ({ \ + u64 ret = 0; \ + typeof(EXPR) expr = EXPR; \ + switch (sizeof(expr)) { \ + case 8: ret = *(u64 *) expr; break; \ + case 4: ret = *(u32 *) expr; break; \ + case 2: ret = *(u16 *) expr; break; \ + case 1: ret = *(u8 *) expr; break; \ + } \ + ret; }) + +#define __BPF_CAST1(a,...) __CAST_TO_U64(a) +#define __BPF_CAST2(a,...) __CAST_TO_U64(a), __BPF_CAST1(__VA_ARGS__) +#define __BPF_CAST3(a,...) __CAST_TO_U64(a), __BPF_CAST2(__VA_ARGS__) +#define __BPF_CAST4(a,...) __CAST_TO_U64(a), __BPF_CAST3(__VA_ARGS__) +#define __BPF_CAST5(a,...) __CAST_TO_U64(a), __BPF_CAST4(__VA_ARGS__) +#define __BPF_CAST6(a,...) __CAST_TO_U64(a), __BPF_CAST5(__VA_ARGS__) + #undef DECLARE_EVENT_CLASS #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ static notrace void \ perf_trace_##call(void *__data, proto) \ {\ struct ftrace_event_call *event_call = __data; \ + struct bpf_prog *prog = event_call-prog; \ struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\ struct ftrace_raw_##call *entry;\ struct pt_regs __regs; \ @@ -771,6 +792,16 @@ perf_trace_##call(void *__data, proto) \ int __data_size;\ int rctx; \ \ + if (prog) { \ + __maybe_unused const u64 z = 0; \ + struct bpf_context __ctx = ((struct bpf_context) { \ + __BPF_CAST6(args, z, z, z, z, z)\ Note, there is no guarantee that args is at most 6. For example, in drivers/net/wireless/brcm80211/brcmsmac/brcms_trace_events.h, the trace_event brcms_txstatus has 8 args. But I guess that's OK if you do not need those last args, right? Also, there's no interface the lets us know what the args are. I may be able to come up with something. That's the reason I never filtered before tracing. Because we had no way of knowing what to filter on, because the args were never visible. I'm nervous about showing args of tracepoints too, because we don't want that to become a strict ABI either. -- Steve + }); \ + \ + if (!trace_call_bpf(prog, __ctx)) \ + return; \ + } \ + \ __data_size = ftrace_get_offsets_##call(__data_offsets, args); \ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/