Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-08 Thread Peter Zijlstra
On Fri, Aug 05, 2016 at 12:45:07PM -0700, Alexei Starovoitov wrote:

> Also thinking about concurrency and the need to remember the original
> handler somewhere, would it be a cleaner API to add a bit to perf_event_attr
> and use attr.config1 as the bpf_fd?

attr.config[12] are in use already, typically uncore events need them.
We cannot rely on those bits being unused.

> Then perf_event_open at event creation time will use the bpf prog as the
> overflow_handler. That solves the concurrency concerns and the potential
> semantic issues if we go with the ioctl() approach.

> Like if we perf_event_open() an event for a task, then bpf attach to it,
> what are child tasks and the corresponding inherited events supposed to do?
> Inherit the overflow_handler, right? But then detach of bpf in the parent
> is supposed to clear it in the inherited events as well. A bit complicated.
> I guess we can define it that way.
> Just seems easier to do the bpf attach at perf_event_open time only.

Which is why I would've liked BPF to create its own events, instead of
userspace passing them in through this array.

Because now you have a chicken-and-egg issue: you want BPF to use the
overflow handler, but you need BPF running before you have an actual
handler to link to.

> > Urgh, does it have to be stable API? Can't we simply rely on the kernel
> > headers to provide the right structure definition?
> 
> yes we can. The concern is about the assumptions people will make about
> perf_sample_data and the speed of access to it. From the bpf program's point
> of view the pointer to perf_sample_data will be an opaque unsafe pointer,
> so any access to its fields would have to be done via bpf_probe_read, which
> has non-trivial overhead.
> If we go with the uapi mirror of the perf_sample_data approach, it will be
> fast, since the mirror is not an actual struct. Like the 'struct __sk_buff' we
> have in uapi/linux/bpf.h, it is a meta structure. It's not allocated anywhere
> and no fields are copied. When a bpf program does 'skb->vlan_present',
> the verifier rewrites it at load time into the corresponding access to
> the kernel-internal 'struct sk_buff' fields, with bitmasks, shifts and such.
> For this case we can define something like
> struct bpf_perf_sample_data {
>   __u64 period;
> };
> then the bpf prog will only be able to access that single field, which the
> verifier will translate into 'data->period', where data is a
> 'struct perf_sample_data *'.
> Later we can add other fields if necessary. The kernel is free to mess
> around with perf_sample_data whichever way without impacting bpf progs.

Hmm, I was not aware of that. Should be doable indeed.


Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-08 Thread Peter Zijlstra
On Fri, Aug 05, 2016 at 10:22:08AM -0700, Brendan Gregg wrote:

> (Normally I'd use I$ miss overflow, but none of our Linux systems have
> PMCs: cloud.)

I think I had better not comment on that ;-)

> >> > The perf:perf_hrtimer probe point is also reading state mid-way
> >> > through a function, so it's not quite as simple as wrapping the
> >> > function pointer. I do like that idea, though, but for things like
> >> > struct file_operations.
> >
> > So what additional state do you need?
> 
> I was pulling in regs after get_irq_regs(), struct perf_event *event
> after it's populated. Not that hard to duplicate. Just noting it
> didn't map directly to the function entry.

Right, both of which are available to the overflow handler.

> I wanted perf_event just for event->ctx->task->pid, so that a BPF
> program can differentiate between its samples and other concurrent
> sessions.
> 
> (I was thinking of changing my patch to expose pid_t instead of
> perf_event, since I noticed it didn't add many instructions.)

Slightly confused, event->ctx->task == current, no? We flip that pointer
when we flip the contexts.

At which point, it should be the same as SAMPLE_TID.

!?

> [...]
> >> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> >> overflow_handler for that particular event with some form of bpf wrapper.
> >> (probably new bpf program type). Then not only periodic events
> >> will be triggering bpf prog, but pmu events as well.
> >
> > Exactly.
> 
> Although the timer use case is a bit different, and is via
> hwc->hrtimer.function = perf_swevent_hrtimer.

Still not entirely sure why you could not hook into
event->overflow_handler.



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-05 Thread Alexei Starovoitov
On Fri, Aug 05, 2016 at 12:52:09PM +0200, Peter Zijlstra wrote:
> > > > Currently overflow_handler is set at event alloc time. If we start
> > > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > > break, since each overflow_handler is run to completion and doesn't
> > > > change global state, right?
> 
> Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
> sure to use a READ_ONCE() to load the pointer.
> 
> As long as we're sure to limit this poking to a single user, it's fairly
> simple to get right. The moment there can be concurrency, a lot of fail
> can happen.

agreed.

> > So instead of normal __perf_event_output() writing into ringbuffer,
> > a bpf prog will be called that can optionally write into different
> > rb via bpf_perf_event_output. 
> 
> It could even chain and call into the original function once it's done
> and have both outputs.

interesting idea. makes sense.
Also thinking about concurrency and the need to remember the original
handler somewhere, would it be a cleaner API to add a bit to perf_event_attr
and use attr.config1 as the bpf_fd?
Then perf_event_open at event creation time will use the bpf prog as the
overflow_handler. That solves the concurrency concerns and the potential
semantic issues if we go with the ioctl() approach.
Like if we perf_event_open() an event for a task, then bpf attach to it,
what are child tasks and the corresponding inherited events supposed to do?
Inherit the overflow_handler, right? But then detach of bpf in the parent
is supposed to clear it in the inherited events as well. A bit complicated.
I guess we can define it that way.
Just seems easier to do the bpf attach at perf_event_open time only.

> > The question is what to pass into the
> > program to make the most use out of it. 'struct pt_regs' is done deal.
> > but perf_sample_data we cannot pass as-is, since it's kernel internal.
> 
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

yes we can. The concern is about the assumptions people will make about
perf_sample_data and the speed of access to it. From the bpf program's point
of view the pointer to perf_sample_data will be an opaque unsafe pointer,
so any access to its fields would have to be done via bpf_probe_read, which
has non-trivial overhead.
If we go with the uapi mirror of the perf_sample_data approach, it will be
fast, since the mirror is not an actual struct. Like the 'struct __sk_buff' we
have in uapi/linux/bpf.h, it is a meta structure. It's not allocated anywhere
and no fields are copied. When a bpf program does 'skb->vlan_present',
the verifier rewrites it at load time into the corresponding access to
the kernel-internal 'struct sk_buff' fields, with bitmasks, shifts and such.
For this case we can define something like
struct bpf_perf_sample_data {
  __u64 period;
};
then the bpf prog will only be able to access that single field, which the
verifier will translate into 'data->period', where data is a
'struct perf_sample_data *'.
Later we can add other fields if necessary. The kernel is free to mess
around with perf_sample_data whichever way without impacting bpf progs.



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-05 Thread Brendan Gregg
On Fri, Aug 5, 2016 at 3:52 AM, Peter Zijlstra  wrote:
> On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
>> tracepoints are actually zero overhead already via static-key mechanism.
>> I don't think Peter's objection for the tracepoint was due to overhead.
>
> Almost 0, they still have some I$ footprint, but yes. My main worry is
> that we can feed tracepoints into perf, so having tracepoints in perf is
> tricky.

Coincidentally, I$ footprint was my most recent use case for needing
this: I have an I$-busting workload, and want to profile
instructions at a very high rate to get a breakdown of I$ population.
(Normally I'd use I$ miss overflow, but none of our Linux systems have
PMCs: cloud.)

> I also don't much like this tracepoint being specific to the hrtimer
> bits, I can well imagine people wanting to do the same thing for
> hardware based samples or whatnot.

Sure, which is why I thought we'd have two in a perf category. I'm all
for PMC events, even though we can't currently use them!

>
>> > The perf:perf_hrtimer probe point is also reading state mid-way
>> > through a function, so it's not quite as simple as wrapping the
>> > function pointer. I do like that idea, though, but for things like
>> > struct file_operations.
>
> So what additional state do you need?

I was pulling in regs after get_irq_regs(), struct perf_event *event
after it's populated. Not that hard to duplicate. Just noting it
didn't map directly to the function entry.

I wanted perf_event just for event->ctx->task->pid, so that a BPF
program can differentiate between its samples and other concurrent
sessions.

(I was thinking of changing my patch to expose pid_t instead of
perf_event, since I noticed it didn't add many instructions.)

[...]
>> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
>> overflow_handler for that particular event with some form of bpf wrapper.
>> (probably new bpf program type). Then not only periodic events
>> will be triggering bpf prog, but pmu events as well.
>
> Exactly.

Although the timer use case is a bit different, and is via
hwc->hrtimer.function = perf_swevent_hrtimer.

[...]
>> The question is what to pass into the
>> program to make the most use out of it. 'struct pt_regs' is a done deal,
>> but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

For timer it can be: struct pt_regs, pid_t.

So that would restrict your BPF program to one timer, since if you had
two (from one pid) you couldn't tell them apart. But I'm not sure of a
use case for two in-kernel timers. If there were, we could also add
struct perf_event_attr, which has enough info to tell things apart,
and is already exposed to user space.

I haven't looked into the PMU arguments, but perhaps that could be:
struct pt_regs, pid_t, struct perf_event_attr.

Thanks,

Brendan


Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-05 Thread Peter Zijlstra
On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
> tracepoints are actually zero overhead already via static-key mechanism.
> I don't think Peter's objection for the tracepoint was due to overhead.

Almost 0, they still have some I$ footprint, but yes. My main worry is
that we can feed tracepoints into perf, so having tracepoints in perf is
tricky.

I also don't much like this tracepoint being specific to the hrtimer
bits, I can well imagine people wanting to do the same thing for
hardware based samples or whatnot.

> > The perf:perf_hrtimer probe point is also reading state mid-way
> > through a function, so it's not quite as simple as wrapping the
> > function pointer. I do like that idea, though, but for things like
> > struct file_operations.

So what additional state do you need?

> > > Currently overflow_handler is set at event alloc time. If we start
> > > changing it on the fly with atomic xchg(), afaik things shouldn't
> > > break, since each overflow_handler is run to completion and doesn't
> > > change global state, right?

Yes, or even a simple WRITE_ONCE() to replace it, as long as we make
sure to use a READ_ONCE() to load the pointer.

As long as we're sure to limit this poking to a single user, it's fairly
simple to get right. The moment there can be concurrency, a lot of fail
can happen.

> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
> overflow_handler for that particular event with some form of bpf wrapper.
> (probably new bpf program type). Then not only periodic events
> will be triggering bpf prog, but pmu events as well.

Exactly.

> So instead of normal __perf_event_output() writing into ringbuffer,
> a bpf prog will be called that can optionally write into different
> rb via bpf_perf_event_output. 

It could even chain and call into the original function once it's done
and have both outputs.

> The question is what to pass into the
> program to make the most use out of it. 'struct pt_regs' is a done deal,
> but perf_sample_data we cannot pass as-is, since it's kernel internal.

Urgh, does it have to be stable API? Can't we simply rely on the kernel
headers to provide the right structure definition?


Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Alexei Starovoitov
On Thu, Aug 04, 2016 at 09:13:16PM -0700, Brendan Gregg wrote:
> On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
>  wrote:
> > On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> >> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> >>
> >> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> >> > planning to), I'd put a tracepoint in perf_event_overflow() called
> >> > "perf:perf_overflow", with the same arguments. That could then be used
> >> > for all PMU overflow events, without needing to add specific
> >> > tracepoints.
> >>
> >> Could we not teach BPF to replace event->overflow_handler and inject
> >> itself there?
> >>
> >> We don't currently have nice interfaces for doing that, but it should be
> >> possible to do I think. We already have the indirect function call, so
> >> injecting ourself there has 0 overhead.
> 
> Sounds like a good idea, especially for things like struct
> file_operations so that we can statically instrument file system
> read/writes with zero non-enabled overhead, and not worry about high
> frequency workloads (>10M events/sec).
> 
> These perf probes aren't high frequency, though, and the code is not
> normally in use, so overhead should be much less of a concern.
> Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
> tracepoint code is still adding a mem read, conditional, and branch,
> then that's not many instructions, especially considering the normal
> use case of these perf functions: creating records and writing to a
> perf ring buffer, then picking that up in user space by perf, then
> either processing it live or writing to perf.data, back to the file
> system, etc. It would be hard to benchmark the effect of adding a few
> instructions to that path (and any results may be more sensitive to
> cache line placement than the instructions).

tracepoints are actually zero overhead already via static-key mechanism.
I don't think Peter's objection for the tracepoint was due to overhead.

> The perf:perf_hrtimer probe point is also reading state mid-way
> through a function, so it's not quite as simple as wrapping the
> function pointer. I do like that idea, though, but for things like
> struct file_operations.
> 
> >
> > you're right. All makes sense. I guess I was too lazy to look into
> > how to do it properly. Adding a tracepoint looked like a quick and
> > easy way to achieve the same.
> > As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> > Currently overflow_handler is set at event alloc time. If we start
> > changing it on the fly with atomic xchg(), afaik things shouldn't
> > break, since each overflow_handler is run to completion and doesn't
> > change global state, right?
> >
> 
> How would it be implemented? I was thinking of adding explicit wrappers, eg:

instead of adding a tracepoint to perf_swevent_hrtimer we can replace
overflow_handler for that particular event with some form of bpf wrapper.
(probably new bpf program type). Then not only periodic events
will be triggering bpf prog, but pmu events as well.
So instead of normal __perf_event_output() writing into ringbuffer,
a bpf prog will be called that can optionally write into different
rb via bpf_perf_event_output. The question is what to pass into the
program to make the most use out of it. 'struct pt_regs' is a done deal,
but perf_sample_data we cannot pass as-is, since it's kernel internal.
Probably something similar to the __sk_buff mirror would be needed.
Another nice benefit of doing via overflow_handler instead of tracepoint
is that exclude_idle, exclude_user, exclude_kernel flags of the perf event
will all magically work and program will be event specific.
So two parallel 'perf record'-like sampling won't conflict.



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Alexei Starovoitov
On Thu, Aug 04, 2016 at 09:13:16PM -0700, Brendan Gregg wrote:
> On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
>  wrote:
> > On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> >> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> >>
> >> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> >> > planning to), I'd put a tracepoint in perf_event_overflow() called
> >> > "perf:perf_overflow", with the same arguments. That could then be used
> >> > for all PMU overflow events, without needing to add specific
> >> > tracepoints.
> >>
> >> Could we not teach BPF to replace event->overflow_handler and inject
> >> itself there?
> >>
> >> We don't currently have nice interfaces for doing that, but it should be
> >> possible to do I think. We already have the indirect function call, so
> >> injecting ourself there has 0 overhead.
> 
> Sounds like a good idea, especially for things like struct
> file_operations so that we can statically instrument file system
> read/writes with zero non-enabled overhead, and not worry about high
> frequency workloads (>10M events/sec).
> 
> These perf probes aren't high frequency, though, and the code is not
> normally in use, so overhead should be much less of a concern.
> Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
> tracepoint code is still adding a mem read, conditional, and branch,
> then that's not many instructions, especially considering the normal
> use case of these perf functions: creating records and writing to a
> perf ring buffer, then picking that up in user space by perf, then
> either processing it live or writing to perf.data, back to the file
> system, etc. It would be hard to benchmark the effect of adding a few
> instructions to that path (and any results may be more sensitive to
> cache line placement than the instructions).

Tracepoints are actually zero overhead already via the static-key mechanism.
I don't think Peter's objection to the tracepoint was due to overhead.

> The perf:perf_hrtimer probe point is also reading state mid-way
> through a function, so it's not quite as simple as wrapping the
> function pointer. I do like that idea, though, but for things like
> struct file_operations.
> 
> >
> > you're right. All makes sense. I guess I was too lazy to look into
> > how to do it properly. Adding a tracepoint looked like quick and
> > easy way to achieve the same.
> > As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> > Currently overflow_handler is set at event alloc time. If we start
> > changing it on the fly with atomic xchg(), afaik things shouldn't
> > break, since each overflow_handler is run to completion and doesn't
> > change global state, right?
> >
> 
> How would it be implemented? I was thinking of adding explicit wrappers, eg:

Instead of adding a tracepoint to perf_swevent_hrtimer we can replace
the overflow_handler for that particular event with some form of bpf
wrapper (probably a new bpf program type). Then not only periodic events
will trigger the bpf prog, but pmu events as well.
So instead of the normal __perf_event_output() writing into the ringbuffer,
a bpf prog will be called that can optionally write into a different
rb via bpf_perf_event_output. The question is what to pass into the
program to make the most use out of it. 'struct pt_regs' is a done deal,
but perf_sample_data we cannot pass as-is, since it's kernel internal.
Probably something similar to the __sk_buff mirror would be needed.
Another nice benefit of doing it via the overflow_handler instead of a
tracepoint is that the exclude_idle, exclude_user, exclude_kernel flags
of the perf event will all magically work and the program will be
event-specific. So two parallel 'perf record'-like sampling sessions won't conflict.
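The handler replacement described above can be modeled outside the kernel. The sketch below is a userspace toy, not kernel code: struct event, default_output, and bpf_overflow_handler are invented names standing in for the real event->overflow_handler, __perf_event_output(), and an attached bpf program.

```c
#include <assert.h>

/* Userspace model: an event carries a replaceable overflow handler,
 * as event->overflow_handler does in the kernel. All names here are
 * invented for illustration. */
struct sample { long ip; };
struct event {
	void (*overflow_handler)(struct event *, struct sample *);
	int default_hits;	/* stands in for the perf ring buffer */
	int bpf_hits;		/* stands in for a bpf-owned buffer */
};

/* The normal path: a stand-in for __perf_event_output(). */
static void default_output(struct event *e, struct sample *s)
{
	(void)s;
	e->default_hits++;
}

/* A bpf-program-like wrapper: run logic per sample in "kernel" context,
 * optionally writing to its own buffer instead of the ring buffer. */
static void bpf_overflow_handler(struct event *e, struct sample *s)
{
	if (s->ip != 0)
		e->bpf_hits++;		/* program consumed the sample */
	else
		default_output(e, s);	/* fall back to the normal path */
}

static int run_demo(void)
{
	struct event e = { default_output, 0, 0 };
	struct sample s1 = { 0xfff }, s0 = { 0 };

	e.overflow_handler(&e, &s1);	/* goes to the "ring buffer" */
	e.overflow_handler = bpf_overflow_handler;	/* "attach" */
	e.overflow_handler(&e, &s1);	/* consumed by the program */
	e.overflow_handler(&e, &s0);	/* falls through to default */
	return e.default_hits * 10 + e.bpf_hits;	/* 2 defaults, 1 bpf */
}
```

Since each handler runs to completion and only touches its own event, swapping the pointer is the entire attach step in this model.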



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Brendan Gregg
On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
 wrote:
> On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
>> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
>>
>> > As for pmu tracepoints: if I were to instrument it (although I wasn't
>> > planning to), I'd put a tracepoint in perf_event_overflow() called
>> > "perf:perf_overflow", with the same arguments. That could then be used
>> > for all PMU overflow events, without needing to add specific
>> > tracepoints.
>>
>> Could we not teach BPF to replace event->overflow_handler and inject
>> itself there?
>>
>> We don't currently have nice interfaces for doing that, but it should be
>> possible to do I think. We already have the indirect function call, so
>> injecting ourself there has 0 overhead.

Sounds like a good idea, especially for things like struct
file_operations so that we can statically instrument file system
read/writes with zero non-enabled overhead, and not worry about high
frequency workloads (>10M events/sec).

These perf probes aren't high frequency, though, and the code is not
normally in use, so overhead should be much less of a concern.
Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
tracepoint code is still adding a mem read, conditional, and branch,
then that's not many instructions, especially considering the normal
use case of these perf functions: creating records and writing to a
perf ring buffer, then picking that up in user space by perf, then
either processing it live or writing to perf.data, back to the file
system, etc. It would be hard to benchmark the effect of adding a few
instructions to that path (and any results may be more sensitive to
cache line placement than the instructions).

The perf:perf_hrtimer probe point is also reading state mid-way
through a function, so it's not quite as simple as wrapping the
function pointer. I do like that idea, though, for things like
struct file_operations.

>
> you're right. All makes sense. I guess I was too lazy to look into
> how to do it properly. Adding a tracepoint looked like quick and
> easy way to achieve the same.
> As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> Currently overflow_handler is set at event alloc time. If we start
> changing it on the fly with atomic xchg(), afaik things shouldn't
> break, since each overflow_handler is run to completion and doesn't
> change global state, right?
>

How would it be implemented? I was thinking of adding explicit wrappers, eg:

static ssize_t
__ext4_file_write_iter_tracepoint(struct kiocb *iocb, struct iov_iter *from)
{
	trace_ext4_write_iter(iocb, from);
	return ext4_file_write_iter(iocb, from);
}

Then switching in __ext4_file_write_iter_tracepoint() as needed.
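That switch can be sketched in plain C as an atomic update of an ops-table slot. This is a userspace toy under invented names (ops, plain_write, traced_write), not the real ext4 or file_operations symbols:

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace sketch: a file_operations-like table whose write slot is
 * swapped for a tracing wrapper and later restored. Names invented. */
typedef long (*write_fn)(long);

static long plain_write(long n) { return n; }

static int trace_count;
static long traced_write(long n)
{
	trace_count++;		/* the "tracepoint" side effect */
	return plain_write(n);	/* then the original behaviour */
}

struct ops { _Atomic write_fn write_iter; };

static long do_write(struct ops *o, long n)
{
	/* callers always go through the slot, so a swap takes effect
	 * on the next call without any further coordination */
	return atomic_load(&o->write_iter)(n);
}

static int run_switch_demo(void)
{
	struct ops o = { plain_write };

	do_write(&o, 1);				/* untraced */
	atomic_store(&o.write_iter, traced_write);	/* switch wrapper in */
	do_write(&o, 2);				/* traced */
	atomic_store(&o.write_iter, plain_write);	/* switch back out */
	do_write(&o, 3);				/* untraced again */
	return trace_count;				/* exactly one traced call */
}
```

The non-enabled cost is then just the indirect call the ops table already pays.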

I was wondering about doing this dynamically, but wouldn't that then
create another problem of maintenance -- we'd need to decorate such
code with a comment, at least, to say "this function is exposed as a
tracepoint".

If a dynamic approach is still desired, and the goal is zero
non-enabled overhead, then I'd also wonder if we could leverage
kprobes to do this. Imagine walking a file_operations to find the
addresses, and then kprobe-ing the addresses. Not the same (or
probably as efficient) as setting the function pointer, but, using
existing kprobes.

Brendan



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Alexei Starovoitov
On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> 
> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> > planning to), I'd put a tracepoint in perf_event_overflow() called
> > "perf:perf_overflow", with the same arguments. That could then be used
> > for all PMU overflow events, without needing to add specific
> > tracepoints. 
> 
> Could we not teach BPF to replace event->overflow_handler and inject
> itself there?
> 
> We don't currently have nice interfaces for doing that, but it should be
> possible to do I think. We already have the indirect function call, so
> injecting ourself there has 0 overhead.

you're right. All makes sense. I guess I was too lazy to look into
how to do it properly. Adding a tracepoint looked like quick and
easy way to achieve the same.
As far as api goes probably existing IOC_SET_BPF ioctl will do too.
Currently overflow_handler is set at event alloc time. If we start
changing it on the fly with atomic xchg(), afaik things shouldn't
break, since each overflow_handler is run to completion and doesn't
change global state, right?
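The xchg() idea can be modeled with C11 atomics: the attacher swaps its handler in and keeps the old pointer so a later detach can restore it, while callers always go through an atomic load. A userspace toy with invented names, not kernel code:

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace model of IOC_SET_BPF-style attach/detach: exchange the
 * handler pointer in and out, remembering the original so detach can
 * restore it. All names are invented for illustration. */
typedef int (*handler_fn)(int);

static int orig_handler(int v) { return v; }
static int bpf_handler(int v)  { return v * 2; }

static _Atomic handler_fn handler = orig_handler;

/* "attach"/"detach": swap the new handler in, return the old one,
 * mirroring what an atomic xchg() on overflow_handler would do. */
static handler_fn set_handler(handler_fn fn)
{
	return atomic_exchange(&handler, fn);
}

static int run_xchg_demo(void)
{
	int sum = 0;
	handler_fn saved;

	sum += atomic_load(&handler)(1);	/* original: 1 */
	saved = set_handler(bpf_handler);	/* attach, keep original */
	sum += atomic_load(&handler)(1);	/* bpf: 2 */
	set_handler(saved);			/* detach restores it */
	sum += atomic_load(&handler)(1);	/* original again: 1 */
	return sum;
}
```

As long as each handler runs to completion and touches no global state, an in-flight call finishing through the old pointer is harmless, which is the property the question above hinges on.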




Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Peter Zijlstra
On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:

> As for pmu tracepoints: if I were to instrument it (although I wasn't
> planning to), I'd put a tracepoint in perf_event_overflow() called
> "perf:perf_overflow", with the same arguments. That could then be used
> for all PMU overflow events, without needing to add specific
> tracepoints. 

Could we not teach BPF to replace event->overflow_handler and inject
itself there?

We don't currently have nice interfaces for doing that, but it should be
possible to do I think. We already have the indirect function call, so
injecting ourself there has 0 overhead.





Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-03 Thread Brendan Gregg
On Wed, Aug 3, 2016 at 2:48 AM, Peter Zijlstra  wrote:
>
> On Wed, Aug 03, 2016 at 02:47:47AM +, Brendan Gregg wrote:
> > When perf is performing hrtimer-based sampling, this tracepoint can be used
> > by BPF to run additional logic on each sample. For example, BPF can fetch
> > stack traces and frequency count them in kernel context, for an efficient
> > profiler.
>
> Urgh, no like.
>
> And then next week we'll add a tracepoint to some other random pmu or
> whatnot.

I wouldn't want random pmu tracepoints either, but I can see how it
looks with a new tracepoint category of "perf:", and maybe I shouldn't
have called it that. Would it be better if the tracepoint was called
"timer:perf_hrtimer"?

As for pmu tracepoints: if I were to instrument it (although I wasn't
planning to), I'd put a tracepoint in perf_event_overflow() called
"perf:perf_overflow", with the same arguments. That could then be used
for all PMU overflow events, without needing to add specific
tracepoints. The "perf:" category would then have two tracepoints, and
would not need to grow. Does that option sound better than
"timer:perf_hrtimer"? (As in, keeping the category "perf" so that we
have a later option of adding a perf:perf_overflow tracepoint if need
be?)

It's really perf_hrtimer that we'll use daily, for CPU profiling with
in-kernel aggregations of instruction pointers and stack traces, and
other sampling needs. It's one tracepoint that will add much value,
and will mean that BPF programs can then be attached to kprobes,
uprobes, tracepoints, and timed samples.

Thanks for your consideration,

Brendan



Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-03 Thread Peter Zijlstra
On Wed, Aug 03, 2016 at 02:47:47AM +, Brendan Gregg wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used
> by BPF to run additional logic on each sample. For example, BPF can fetch
> stack traces and frequency count them in kernel context, for an efficient
> profiler.

Urgh, no like.

And then next week we'll add a tracepoint to some other random pmu or
whatnot.




Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-03 Thread Peter Zijlstra
On Wed, Aug 03, 2016 at 02:47:47AM +, Brendan Gregg wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used
> by BPF to run additional logic on each sample. For example, BPF can fetch
> stack traces and frequency count them in kernel context, for an efficient
> profiler.

Urgh, no like.

And then next week we'll add a tracepoint to some other random pmu or
whatnot.