Re: [RFC][PATCH 00/10] Add trace event support to eBPF

2016-02-18 Thread Tom Zanussi
On Tue, 2016-02-16 at 20:51 -0800, Alexei Starovoitov wrote:
> On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote:
> > On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> > > On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> 
> > > this hist triggers belong in the kernel. BPF already can do
> > > way more complex aggregation and histograms.
> > 
> > Way more?  I still can't accomplish with eBPF some of the most basic and
> > helpful use cases that I can with hist triggers, such as using
> > stacktraces as hash keys.  And the locking in the eBPF hashmap
> > implementation prevents anything like the function_hist [1] tracer from
> > being implemented on top of it:
> 
> Both statements are not true.

Erm, not exactly...

> In the link from previous email take a look at funccount_example.txt:
> # ./funccount 'vfs_*'
> Tracing... Ctrl-C to end.
> ^C
> ADDR             FUNC                          COUNT
> 811efe81         vfs_create                        1
> 811f24a1         vfs_rename                        1
> 81215191         vfs_fsync_range                   2
> 81231df1         vfs_lock_file                    30
> 811e8dd1         vfs_fstatat                     152
> 811e8d71         vfs_fstat                       154
> 811e4381         vfs_write                       166
> 811e8c71         vfs_getattr_nosec               262
> 811e8d41         vfs_getattr                     262
> 811e3221         vfs_open                        264
> 811e4251         vfs_read                        470
> Detaching...
> 

When I tried to use it to trace all functions:

  # ./funccount.py '*'

I got about 5 minutes' worth of these kinds of error messages, which I
couldn't break out of:

write of "p:kprobes/p_rootfs_mount rootfs_mount" into kprobe_events failed: 
Device or resource busy
open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_int_noop/id): Too 
many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_irq_pending_noop/id):
 Too many open files
open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_probe/id): Too many 
open files
open(/sys/kernel/debug/tracing/events/kprobes/p_unmask_8259A_irq/id): Too many 
open files
open(/sys/kernel/debug/tracing/events/kprobes/p_enable_8259A_irq/id): Too many 
open files
open(/sys/kernel/debug/tracing/events/kprobes/p_mask_8259A_irq/id): Too many 
open files
open(/sys/kernel/debug/tracing/events/kprobes/p_disable_8259A_irq/id): Too many 
open files
open(/sys/kernel/debug/tracing/events/kprobes/p_i8259A_irq_pending/id): Too 
many open files
...

Which is probably a good thing, since that's a relatively common thing
for someone to try, whereas this subset probably isn't:

  # ./funccount.py '*spin*'

Which on my machine resulted in a hard lockup on all CPUs.  I'm not set
up to grab serial output on that machine, but here's a screenshot of
most of the stacktrace, all I could get:

  http://picpaste.com/bcc-spinstar-hardlock-fobQbcuG.JPG

There's nothing special about that machine and it's running a stock
4.4.0 kernel, so it should be pretty easy to reproduce on anything with
just a BPF-enabled config.  If not, let me know and I'll send more
specific info.

> And this is done without adding new code to the kernel.
> 
> Another example is offwaketime that uses two stack traces as
> part of single key.
> 

Which has the following code in the script itself, implementing a stack
walker:

static u64 get_frame(u64 *bp) {
    if (*bp) {
        // The following stack walker is x86_64 specific
        u64 ret = 0;
        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
            return 0;
        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
            *bp = 0;
        if (ret < __START_KERNEL_map)
            return 0;
        return ret;
    }
    return 0;
}

int waker(struct pt_regs *ctx, struct task_struct *p) {
    u32 pid = p->pid;

    if (!(FILTER))
        return 0;

    u64 bp = 0;
    struct wokeby_t woke = {};
    int depth = 0;
    bpf_get_current_comm(&woke.name, sizeof(woke.name));
    bp = ctx->bp;

    // unrolled loop (MAXWDEPTH):
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    woke.ret[depth] = get_frame(&bp);

I think one call to this pretty much cancels out any performance gains
you might have made by fastidiously avoiding trace events.

So technically users can use stacktraces as hash keys, but they're
ex

Re: [RFC][PATCH 00/10] Add trace event support to eBPF

2016-02-16 Thread Tom Zanussi
On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> > Hi,
> > 
> > As promised in previous threads, this patchset shares some common
> > functionality with the hist triggers code and enables trace events to
> > be accessed from eBPF programs.
> 
> great that you've started working on BPF!
> 
> > It needs to be applied on top of current tracing/for-next with the
> > hist triggers v13 patches applied:
> > 
> >   https://lkml.org/lkml/2015/12/10/640
> > 
> > Any input welcome, especially things that could be done better wrt
> > eBPF, since I'm not as familiar with that code...
> 
> this dependency looks odd. As I was saying a year ago I don't think

It's not actually a dependency, just a consequence of trying to share
code for this RFC that happened to already be in hist triggers.  I could
just as well have removed the hunk that deletes that code from hist
triggers, but one of the main points was to show how they could be
shared.

> this hist triggers belong in the kernel. BPF already can do
> way more complex aggregation and histograms.

Way more?  I still can't accomplish with eBPF some of the most basic and
helpful use cases that I can with hist triggers, such as using
stacktraces as hash keys.  And the locking in the eBPF hashmap
implementation prevents anything like the function_hist [1] tracer from
being implemented on top of it:

.
.
.
ip: [a0166350] gen7_render_get_cmd_length_mask  hitcount: 18551173520
ip: [817589b0] _cond_resched                    hitcount: 19242157574
ip: [8175bd60] _raw_spin_lock                   hitcount: 19275222963
ip: [81232c60] __fdget                          hitcount: 22865629762
ip: [81232c00] __fget_light                     hitcount: 24594932641
ip: [8175bed0] _raw_spin_lock_irqsave           hitcount: 26510496751
ip: [8175bb30] _raw_spin_unlock_irqrestore      hitcount: 26525916042
ip: [a018d6c0] gen6_ring_get_seqno              hitcount: 95888137753

Totals:
Hits: 1020596753338
Entries: 7102
Dropped: 0

[1]
http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/commit/?h=tzanussi/func-hist-v0&id=fa123181a068574bd9e300416cc947aa7424d808
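
For comparison, here's roughly what that kind of per-function counting
looks like when built on top of a BPF hash map instead, written as a
bcc-style program (just a sketch for illustration: the program name
count_ip and map name counts are made up, and the bcc map helpers may
differ between versions).  The point is the map update at the bottom;
the in-kernel side of that update takes the hash bucket's raw spinlock,
which is the locking I'm referring to above:

#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u64, u64);            // ip -> hitcount

int count_ip(struct pt_regs *ctx)
{
    u64 ip = ctx->ip;                  // instruction pointer at the probe (x86_64)
    u64 zero = 0, *val;

    // This update/increment is where the eBPF hashmap locking comes in:
    // the kernel's update path grabs the bucket's raw spinlock, so
    // attaching something like this to every function (including the
    // spinlock functions themselves) is exactly the case that hurts.
    val = counts.lookup_or_init(&ip, &zero);
    if (val)
        (*val)++;
    return 0;
}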

> Take a look at all the tools written on top of it:
> https://github.com/iovisor/bcc/tree/master/tools

That's great, but it's all out-of-tree.  Supporting out-of-tree users
has never been justification for merging in-kernel code (or for blocking
it from being merged). 

> I don't buy into reasoning that hist triggers are needed to do
> performance analysis in the busybox. If not in busybox,
> just use perf+bpf or bcc.
> 

I have systems with tiny amounts of memory and storage that have zero
chance of ever having a compiler or Python running on them.  On those
systems it's indispensable to have direct access to the ftrace-based
tools and trace event files, and nothing more.  I even have
systems that don't have a shell at all, but can nonetheless access all
the ftrace and trace event functionality through a different channel.
hist triggers is just another one of those tools, a logical progression
towards allowing users in such a situation (among others) to make more
efficient and concise use of the trace event data.  I don't see what
your problem with that is.

> that aside we do need to be able to attach bpf to tracepoints.
> The key goal of bpf+tracepoints is to be faster than bpf+kprobe.
> kprobes were good enough for almost all use cases so far except
> when we started processing millions of events per second via
> kprobe+bpf. At that point even optimized kprobe adds too much
> overhead.
> 

It sounds from this like you're mainly interested in hooking into the
trace events but not the trace event data, which is fine, but other
tools might find some value in the trace event data, even if it may not
be the most optimal path.

I haven't measured the cost of accessing data from the
trace buffers, but I do know that the hist triggers have no problem
logging millions of events per second.

In the past I have measured the basic hist triggers mechanism and found
it to be somewhat faster than the ftrace function tracer itself:

over a kernel compile:

no tracing:

  real    110m47.817s
  user    99m34.144s
  sys     7m19.928s

function tracer enabled:

  real    138m6.854s
  user    100m57.692s
  sys     36m27.724s

function_graph tracer enabled:

  real    194m38.052s
  user    102m32.192s
  sys     97m47.056s

function_hist tracer enabled:

  real    128m44.682s
  user    100m29.080s
  sys     26m52.880s
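
To put those in perspective, here's the overhead each run works out to
relative to the no-tracing baseline (a quick throwaway userspace
calculation from the "real" times above, nothing to do with the
patches):

#include <stdio.h>

int main(void)
{
	/* "real" times above, converted to seconds */
	double baseline = 110 * 60 + 47.817;        /* no tracing */
	struct { const char *name; double real; } runs[] = {
		{ "function",       138 * 60 +  6.854 },
		{ "function_graph", 194 * 60 + 38.052 },
		{ "function_hist",  128 * 60 + 44.682 },
	};

	/* prints roughly: function +24.7%, function_graph +75.7%,
	 * function_hist +16.2% */
	for (int i = 0; i < 3; i++)
		printf("%-15s +%.1f%%\n", runs[i].name,
		       (runs[i].real - baseline) / baseline * 100.0);
	return 0;
}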

Given the fact that I haven't heard people complaining about the
overhead of even the function_graph tracer, I don't think th