Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Fri, Aug 05, 2016 at 12:45:07PM -0700, Alexei Starovoitov wrote:

> Also thinking about concurrency and the need to remember the original
> handler somewhere, would it be cleaner api to add a bit to perf_event_attr
> and use attr.config1 as bpf_fd ?

attr.config[12] are in use already; typically uncore events need them. We
cannot rely on those bits being unused.

> Then perf_event_open at event creation time will use bpf prog as
> overflow_handler. That solves concurrency concerns and potential semantic
> issues if we go with the ioctl() approach.
> Like if we perf_event_open() an event for a task, then bpf attach to it,
> what are children tasks and corresponding inherited events supposed to do?
> Inherit overflow_handler, right? But then detach of bpf in the parent is
> supposed to clear it in inherited events as well. A bit complicated.
> I guess we can define it that way.
> Just seems easier to do bpf attach at perf_event_open time only.

Which is why I would have liked BPF to create its own events, instead of
userspace passing them in through this array. Because now you have a
chicken-and-egg issue: you want BPF to use the overflow handler, but you
need BPF running before you have an actual handler to link to.

> > Urgh, does it have to be stable API? Can't we simply rely on the kernel
> > headers to provide the right structure definition?
>
> yes we can. The concern is about assumptions people will make about
> perf_sample_data and the speed of access to it. From the bpf program's
> point of view the pointer to perf_sample_data will be an opaque unsafe
> pointer, so any access to fields would have to be done via bpf_probe_read,
> which has non-trivial overhead.
> If we go with the uapi mirror of perf_sample_data approach, it will be
> fast, since the mirror is not an actual struct. Like 'struct __sk_buff'
> in uapi/linux/bpf.h, it is a meta structure. It's not allocated anywhere
> and no fields are copied. When a bpf program does 'skb->vlan_present',
> the verifier rewrites it at load time into the corresponding access to
> kernel-internal 'struct sk_buff' fields, with bitmasks, shifts and such.
> For this case we can define something like
>
> struct bpf_perf_sample_data {
>     __u64 period;
> };
>
> then a bpf prog will only be able to access that single field, which the
> verifier will translate into 'data->period', where data is a
> 'struct perf_sample_data *'. Later we can add other fields if necessary.
> The kernel is free to mess around with perf_sample_data whichever way
> without impacting bpf progs.

Hmm, I was not aware of that. Should be doable indeed.
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Fri, Aug 05, 2016 at 10:22:08AM -0700, Brendan Gregg wrote:

> (Normally I'd use I$ miss overflow, but none of our Linux systems have
> PMCs: cloud.)

I think I had better not comment on that ;-)

> >> > The perf:perf_hrtimer probe point is also reading state mid-way
> >> > through a function, so it's not quite as simple as wrapping the
> >> > function pointer. I do like that idea, though, but for things like
> >> > struct file_operations.
> >
> > So what additional state do you need?
>
> I was pulling in regs after get_irq_regs(), and struct perf_event *event
> after it's populated. Not that hard to duplicate; just noting it
> didn't map directly to the function entry.

Right, both of which are available to the overflow handler.

> I wanted perf_event just for event->ctx->task->pid, so that a BPF
> program can differentiate between its samples and other concurrent
> sessions.
>
> (I was thinking of changing my patch to expose pid_t instead of
> perf_event, since I was noticing it didn't add many instructions.)

Slightly confused; event->ctx->task == current, no? We flip that pointer
when we flip the contexts. At which point it should be the same as
SAMPLE_TID, !?

> [...]
> >> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> >> overflow_handler for that particular event with some form of bpf wrapper
> >> (probably a new bpf program type). Then not only periodic events
> >> will be triggering the bpf prog, but pmu events as well.
> >
> > Exactly.
>
> Although the timer use case is a bit different, and is via
> hwc->hrtimer.function = perf_swevent_hrtimer.

Still not entirely sure why you could not hook into event->overflow_handler.
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Fri, Aug 05, 2016 at 12:52:09PM +0200, Peter Zijlstra wrote:

> > > > Currently overflow_handler is set at event alloc time. If we start
> > > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > > break, since each overflow_handler is run to completion and doesn't
> > > > change global state, right?
>
> Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
> sure to use a READ_ONCE() to load the pointer.
>
> As long as we're sure to limit this poking to a single user it's fairly
> simple to get right. The moment there can be concurrency a lot of fail
> can happen.

agreed.

> > So instead of the normal __perf_event_output() writing into the
> > ringbuffer, a bpf prog will be called that can optionally write into a
> > different rb via bpf_perf_event_output.
>
> It could even chain and call into the original function once it's done
> and have both outputs.

interesting idea. makes sense.

Also thinking about concurrency and the need to remember the original
handler somewhere, would it be a cleaner api to add a bit to perf_event_attr
and use attr.config1 as bpf_fd? Then perf_event_open at event creation time
will use the bpf prog as overflow_handler. That solves the concurrency
concerns and potential semantic issues if we go with the ioctl() approach.
Like if we perf_event_open() an event for a task, then bpf attach to it,
what are children tasks and corresponding inherited events supposed to do?
Inherit overflow_handler, right? But then detach of bpf in the parent is
supposed to clear it in inherited events as well. A bit complicated.
I guess we can define it that way. Just seems easier to do bpf attach at
perf_event_open time only.

> > The question is what to pass into the
> > program to make the most use out of it. 'struct pt_regs' is a done deal,
> > but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

yes we can. The concern is about assumptions people will make about
perf_sample_data and the speed of access to it. From the bpf program's
point of view the pointer to perf_sample_data will be an opaque unsafe
pointer, so any access to fields would have to be done via bpf_probe_read,
which has non-trivial overhead.
If we go with the uapi mirror of perf_sample_data approach, it will be
fast, since the mirror is not an actual struct. Like 'struct __sk_buff' in
uapi/linux/bpf.h, it is a meta structure. It's not allocated anywhere and
no fields are copied. When a bpf program does 'skb->vlan_present', the
verifier rewrites it at load time into the corresponding access to
kernel-internal 'struct sk_buff' fields, with bitmasks, shifts and such.
For this case we can define something like

struct bpf_perf_sample_data {
	__u64 period;
};

then a bpf prog will only be able to access that single field, which the
verifier will translate into 'data->period', where data is a
'struct perf_sample_data *'. Later we can add other fields if necessary.
The kernel is free to mess around with perf_sample_data whichever way
without impacting bpf progs.
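For reference, an ioctl() attach path already exists for tracepoint/kprobe events as PERF_EVENT_IOC_SET_BPF; extending it to install an overflow handler, as debated above, is hypothetical. A userspace sketch of what the attach sequence might look like (error handling collapsed; the sampling parameters are arbitrary):

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a frequency-sampled software event on the calling thread and
 * point it at an already-loaded BPF program fd. Returns the event fd,
 * or -1 on failure. */
static int attach_bpf_to_event(int bpf_prog_fd)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_CPU_CLOCK;
	attr.freq = 1;
	attr.sample_freq = 99;		/* 99 Hz, to avoid timer lockstep */

	int fd = syscall(__NR_perf_event_open, &attr, 0 /* this task */,
			 -1 /* any cpu */, -1 /* no group */, 0);
	if (fd < 0)
		return -1;

	/* The existing ioctl; whether it should accept overflow-handler
	 * programs is exactly what the thread is discussing. */
	if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The alternative Alexei floats, passing bpf_fd in attr.config1 so the program is bound at perf_event_open() time, would fold the two calls above into one and sidestep the inherited-event detach question entirely, at the cost Peter notes: config1 is already claimed by uncore events.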
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Fri, Aug 5, 2016 at 3:52 AM, Peter Zijlstra wrote:
> On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
>> tracepoints are actually zero overhead already via the static-key
>> mechanism. I don't think Peter's objection to the tracepoint was due
>> to overhead.
>
> Almost 0, they still have some I$ footprint, but yes. My main worry is
> that we can feed tracepoints into perf, so having tracepoints in perf is
> tricky.

Coincidentally, I$ footprint was my most recent use case for needing this:
I have an I$-busting workload, and want to profile instructions at a very
high rate to get a breakdown of I$ population. (Normally I'd use I$ miss
overflow, but none of our Linux systems have PMCs: cloud.)

> I also don't much like this tracepoint being specific to the hrtimer
> bits, I can well imagine people wanting to do the same thing for
> hardware based samples or whatnot.

Sure, which is why I thought we'd have two in a perf category. I'm all
for PMC events, even though we can't currently use them!

>
>> > The perf:perf_hrtimer probe point is also reading state mid-way
>> > through a function, so it's not quite as simple as wrapping the
>> > function pointer. I do like that idea, though, but for things like
>> > struct file_operations.
>
> So what additional state do you need?

I was pulling in regs after get_irq_regs(), and struct perf_event *event
after it's populated. Not that hard to duplicate; just noting it didn't
map directly to the function entry.

I wanted perf_event just for event->ctx->task->pid, so that a BPF
program can differentiate between its samples and other concurrent
sessions.

(I was thinking of changing my patch to expose pid_t instead of
perf_event, since I was noticing it didn't add many instructions.)

[...]
>> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
>> overflow_handler for that particular event with some form of bpf wrapper
>> (probably a new bpf program type). Then not only periodic events
>> will be triggering the bpf prog, but pmu events as well.
>
> Exactly.

Although the timer use case is a bit different, and is via
hwc->hrtimer.function = perf_swevent_hrtimer.

[...]
>> The question is what to pass into the
>> program to make the most use out of it. 'struct pt_regs' is a done deal,
>> but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

For the timer it can be: struct pt_regs, pid_t. That would restrict your
BPF program to one timer, since if you had two (from one pid) you couldn't
tell them apart. But I'm not sure of a use case for two in-kernel timers.
If there were, we could also add struct perf_event_attr, which has enough
info to tell things apart, and is already exposed to user space.

I haven't looked into the PMU arguments, but perhaps that could be:
struct pt_regs, pid_t, struct perf_event_attr.

Thanks,

Brendan
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:

> tracepoints are actually zero overhead already via the static-key
> mechanism. I don't think Peter's objection to the tracepoint was due
> to overhead.

Almost 0, they still have some I$ footprint, but yes. My main worry is
that we can feed tracepoints into perf, so having tracepoints in perf is
tricky.

I also don't much like this tracepoint being specific to the hrtimer
bits; I can well imagine people wanting to do the same thing for
hardware-based samples or whatnot.

> > The perf:perf_hrtimer probe point is also reading state mid-way
> > through a function, so it's not quite as simple as wrapping the
> > function pointer. I do like that idea, though, but for things like
> > struct file_operations.

So what additional state do you need?

> > > Currently overflow_handler is set at event alloc time. If we start
> > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > break, since each overflow_handler is run to completion and doesn't
> > > change global state, right?

Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
sure to use a READ_ONCE() to load the pointer.

As long as we're sure to limit this poking to a single user it's fairly
simple to get right. The moment there can be concurrency a lot of fail
can happen.

> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> overflow_handler for that particular event with some form of bpf wrapper
> (probably a new bpf program type). Then not only periodic events
> will be triggering the bpf prog, but pmu events as well.

Exactly.

> So instead of the normal __perf_event_output() writing into the
> ringbuffer, a bpf prog will be called that can optionally write into a
> different rb via bpf_perf_event_output.

It could even chain and call into the original function once it's done
and have both outputs.

> The question is what to pass into the
> program to make the most use out of it. 'struct pt_regs' is a done deal,
> but perf_sample_data we cannot pass as-is, since it's kernel internal.

Urgh, does it have to be stable API? Can't we simply rely on the kernel
headers to provide the right structure definition?
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Thu, Aug 04, 2016 at 09:13:16PM -0700, Brendan Gregg wrote:
> On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov wrote:
> > On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> >> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> >>
> >> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> >> > planning to), I'd put a tracepoint in perf_event_overflow() called
> >> > "perf:perf_overflow", with the same arguments. That could then be used
> >> > for all PMU overflow events, without needing to add specific
> >> > tracepoints.
> >>
> >> Could we not teach BPF to replace event->overflow_handler and inject
> >> itself there?
> >>
> >> We don't currently have nice interfaces for doing that, but it should be
> >> possible to do I think. We already have the indirect function call, so
> >> injecting ourselves there has 0 overhead.
>
> Sounds like a good idea, especially for things like struct
> file_operations, so that we can statically instrument file system
> reads/writes with zero non-enabled overhead, and not worry about
> high-frequency workloads (>10M events/sec).
>
> These perf probes aren't high frequency, though, and the code is not
> normally in use, so overhead should be much less of a concern.
> Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
> tracepoint code is still adding a mem read, conditional, and branch,
> then that's not many instructions, especially considering the normal
> use case of these perf functions: creating records and writing to a
> perf ring buffer, then picking that up in user space by perf, then
> either processing it live or writing to perf.data, back to the file
> system, etc. It would be hard to benchmark the effect of adding a few
> instructions to that path (and any results may be more sensitive to
> cache line placement than to the instructions).

tracepoints are actually zero overhead already via the static-key
mechanism. I don't think Peter's objection to the tracepoint was due to
overhead.

> The perf:perf_hrtimer probe point is also reading state mid-way
> through a function, so it's not quite as simple as wrapping the
> function pointer. I do like that idea, though, but for things like
> struct file_operations.
>
> > you're right. All makes sense. I guess I was too lazy to look into
> > how to do it properly. Adding a tracepoint looked like a quick and
> > easy way to achieve the same.
> > As far as api goes, probably the existing IOC_SET_BPF ioctl will do too.
> > Currently overflow_handler is set at event alloc time. If we start
> > changing it on the fly with atomic xchg(), afaik things shouldn't
> > break, since each overflow_handler is run to completion and doesn't
> > change global state, right?
>
> How would it be implemented? I was thinking of adding explicit wrappers, eg:

instead of adding a tracepoint to perf_swevent_hrtimer we can replace
overflow_handler for that particular event with some form of bpf wrapper
(probably a new bpf program type). Then not only periodic events will be
triggering the bpf prog, but pmu events as well.
So instead of the normal __perf_event_output() writing into the ringbuffer,
a bpf prog will be called that can optionally write into a different rb via
bpf_perf_event_output. The question is what to pass into the program to
make the most use out of it. 'struct pt_regs' is a done deal, but
perf_sample_data we cannot pass as-is, since it's kernel internal.
Probably something similar to the __sk_buff mirror would be needed.
Another nice benefit of doing it via overflow_handler instead of a
tracepoint is that the exclude_idle, exclude_user, exclude_kernel flags of
the perf event will all magically work, and the program will be event
specific. So two parallel 'perf record'-like sampling sessions won't
conflict.
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov wrote:
> On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
>> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
>>
>> > As for pmu tracepoints: if I were to instrument it (although I wasn't planning to), I'd put a tracepoint in perf_event_overflow() called "perf:perf_overflow", with the same arguments. That could then be used for all PMU overflow events, without needing to add specific tracepoints.
>>
>> Could we not teach BPF to replace event->overflow_handler and inject itself there?
>>
>> We don't currently have nice interfaces for doing that, but it should be possible to do I think. We already have the indirect function call, so injecting ourself there has 0 overhead.

Sounds like a good idea, especially for things like struct file_operations so that we can statically instrument file system read/writes with zero non-enabled overhead, and not worry about high frequency workloads (>10M events/sec).

These perf probes aren't high frequency, though, and the code is not normally in use, so overhead should be much less of a concern. Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the tracepoint code is still adding a mem read, conditional, and branch, then that's not many instructions, especially considering the normal use case of these perf functions: creating records and writing to a perf ring buffer, then picking that up in user space by perf, then either processing it live or writing to perf.data, back to the file system, etc. It would be hard to benchmark the effect of adding a few instructions to that path (and any results may be more sensitive to cache line placement than the instructions).

The perf:perf_hrtimer probe point is also reading state mid-way through a function, so it's not quite as simple as wrapping the function pointer. I do like that idea, though, but for things like struct file_operations.

> you're right. All makes sense. I guess I was too lazy to look into how to do it properly. Adding a tracepoint looked like a quick and easy way to achieve the same. As far as api goes probably existing IOC_SET_BPF ioctl will do too. Currently overflow_handler is set at event alloc time. If we start changing it on the fly with atomic xchg(), afaik things shouldn't break, since each overflow_handler is run to completion and doesn't change global state, right?

How would it be implemented? I was thinking of adding explicit wrappers, eg:

static ssize_t
__ext4_file_write_iter_tracepoint(struct kiocb *iocb, struct iov_iter *from)
{
	trace_ext4_write_iter(iocb, from);
	return ext4_file_write_iter(iocb, from);
}

Then switching in __ext4_file_write_iter_tracepoint() as needed.

I was wondering about doing this dynamically, but wouldn't that then create another problem of maintenance -- we'd need to decorate such code with a comment, at least, to say "this function is exposed as a tracepoint".

If a dynamic approach is still desired, and the goal is zero non-enabled overhead, then I'd also wonder if we could leverage kprobes to do this. Imagine walking a file_operations to find the addresses, and then kprobe-ing the addresses. Not the same (or probably as efficient) as setting the function pointer, but, using existing kprobes.

Brendan
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
>
> > As for pmu tracepoints: if I were to instrument it (although I wasn't planning to), I'd put a tracepoint in perf_event_overflow() called "perf:perf_overflow", with the same arguments. That could then be used for all PMU overflow events, without needing to add specific tracepoints.
>
> Could we not teach BPF to replace event->overflow_handler and inject itself there?
>
> We don't currently have nice interfaces for doing that, but it should be possible to do I think. We already have the indirect function call, so injecting ourself there has 0 overhead.

you're right. All makes sense. I guess I was too lazy to look into how to do it properly. Adding a tracepoint looked like a quick and easy way to achieve the same.

As far as api goes, probably the existing IOC_SET_BPF ioctl will do too. Currently overflow_handler is set at event alloc time. If we start changing it on the fly with atomic xchg(), afaik things shouldn't break, since each overflow_handler is run to completion and doesn't change global state, right?
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> As for pmu tracepoints: if I were to instrument it (although I wasn't planning to), I'd put a tracepoint in perf_event_overflow() called "perf:perf_overflow", with the same arguments. That could then be used for all PMU overflow events, without needing to add specific tracepoints.

Could we not teach BPF to replace event->overflow_handler and inject itself there?

We don't currently have nice interfaces for doing that, but it should be possible to do I think. We already have the indirect function call, so injecting ourself there has 0 overhead.
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Wed, Aug 3, 2016 at 2:48 AM, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 02:47:47AM +0000, Brendan Gregg wrote:
> > When perf is performing hrtimer-based sampling, this tracepoint can be used by BPF to run additional logic on each sample. For example, BPF can fetch stack traces and frequency count them in kernel context, for an efficient profiler.
>
> Urgh, no like.
>
> And then next week we'll add a tracepoint to some other random pmu or whatnot.

I wouldn't want random pmu tracepoints either, but I can see how it looks with a new tracepoint category of "perf:", and maybe I shouldn't have called it that. Would it be better if the tracepoint was called "timer:perf_hrtimer"?

As for pmu tracepoints: if I were to instrument it (although I wasn't planning to), I'd put a tracepoint in perf_event_overflow() called "perf:perf_overflow", with the same arguments. That could then be used for all PMU overflow events, without needing to add specific tracepoints. The "perf:" category would then have two tracepoints, and would not need to grow. Does that option sound better than "timer:perf_hrtimer"? (As in, keeping the category "perf" so that we have a later option of adding a perf:perf_overflow tracepoint if need be?)

It's really perf_hrtimer that we'll use daily, for CPU profiling with in-kernel aggregations of instruction pointers and stack traces, and other sampling needs. It's one tracepoint that will add much value, and will mean that BPF programs can then be attached to kprobes, uprobes, tracepoints, and timed samples.

Thanks for your consideration,

Brendan
Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Wed, Aug 03, 2016 at 02:47:47AM +0000, Brendan Gregg wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used by BPF to run additional logic on each sample. For example, BPF can fetch stack traces and frequency count them in kernel context, for an efficient profiler.

Urgh, no like.

And then next week we'll add a tracepoint to some other random pmu or whatnot.