Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Masami Hiramatsu
(Note that I also agree with Linus's opinion that this is
 like a debugger, since I already did it in perf-probe :))

On Sun, 4 Feb 2018 11:39:39 -0800
Linus Torvalds  wrote:

> (Obviously it's not entirely black-and-white, but I do think there is
> a pretty big difference between the two groups. And the first group
> will obviously use the explicit trace points _too_, generally to
> narrow down where they want to go with the function-based one).
> 
> We'll see. Maybe I'm entirely wrong. But I'm hoping that the
> function-based one will end up being helpful.

BTW, if function-based tracing is helpful for both of them, they
could have started using it back in 2010, because the kprobe-based
tracer already supported it.

It was less announced; I must admit I was lazy on that point.
Also, since I moved the usability effort to perf-probe, the kprobe
event syntax is not so fancy.

"SyS_openat(int dfd, string path, x32 flags, x16 mode)"

is equal to

"p SyS_openat dfd=%di:x64 path=%si:string flags=%dx:x32 mode=%cx:x16"

in the kprobe event definition syntax, but with perf-probe and CONFIG_DEBUG_INFO,

perf probe -a 'sys_openat $params'

will set up the event correctly.

So, we need to clarify what will attract more "2nd group" people to
function-based events, e.g. having the events for EXPORT_SYMBOL_GPL()
symbols already defined and easy to switch on/off.

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Steven Rostedt
On Mon, 5 Feb 2018 11:49:31 +0100
Daniel Bristot de Oliveira  wrote:

> Isn't this the case of calling tracing functions before calling:
> 
> rcu_irq_enter();

Yes, I'm actually aware of this. I just forgot to deal with it during
development.

Thanks for reporting.

-- Steve


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Steven Rostedt
On Mon, 5 Feb 2018 11:09:31 +0100
Peter Zijlstra  wrote:

> As a run-around, Steve then suggested to decouple the trace-hook from
> the actual trace-event. Let the scheduler only provide the hook, nothing
> else. And then allow users to create their own events with the specific
> data they need for their specific use-case.

Actually, we can make the raw tracepoint callback no longer 'notrace',
and then we can extend all tracepoints with this as well.

One issue I hit was the structure randomization for security. If the
structures are randomized, it becomes much more difficult to find
specific aspects of task_struct.

-- Steve


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Steven Rostedt
On Mon, 5 Feb 2018 14:53:55 +0100
Juri Lelli  wrote:

> First impression is that this is going to be definitely useful if
> 
>  - it's possible to hook at function end (e.g., replenish_dl_entity above
>carries more useful information _after_ it did its job)

The one issue is that you will only have access to one argument at the
end. And that will be the return value. How useful would that be?

Hmm, actually, if we incorporate Tom Zanussi's histogram patches (which
I'll start reviewing this week for inclusion), we could add the pseudo
events to carry necessary data.

>  - inside functions? not really sure it's actually going to be needed, but I
>    was wondering if it's possible at all :); with tracepoints it's for example
>    easy to collect detailed information about which branches have been taken etc.

This is not meant to handle anything other than function calls. You
have three options for dealing with the inside of a function:

 - add another function that can be traced with this, inside the
   function

 - use kprobes

 - add a tracepoint

> 
> I'm going to play with this more. Just wanted to give back a quick positive
> feedback.

Thanks!

> 
> I'm also adding Arm folks to the discussion, as they (and I :) have been
> asking to add tracepoints to scheduler code in the past [1].

You will need to implement the arch_get_func_args() for ARM too.

-- Steve


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Steven Rostedt
On Sun, 4 Feb 2018 11:25:20 +0900
Namhyung Kim  wrote:

> I'm interested in this.  From my understanding, it's basically
> function tracing + filter + custom argument info, right?
> 
> Supporting arguments with complex type could be error-prone.
> We need to prevent malfunctions by invalid inputs.

All reads are done with probe_kernel_read(). If it faults (at any
stage), it simply returns "0".

-- Steve



Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Masami Hiramatsu
On Sun, 4 Feb 2018 09:21:30 -0800
Alexei Starovoitov  wrote:

> On Sun, Feb 04, 2018 at 12:57:47PM +0900, Masami Hiramatsu wrote:
> > 
> > > I based some of the code from kprobes too. But I wanted this to be
> > > simpler, and as such, not as powerful as kprobes. More of a "poor man's"
> > > kprobe ;-) Where you are limited to functions and their arguments. If
> > > you need more power, switch to kprobes. In other words, it's just an
> > > added stepping stone.
> > > 
> > > Also, this should work without kprobe support, only ftrace, and function
> > > args from the arch.
> > 
> > Hmm, but the implementation seems very far from the current probe events;
> > we need to consider how to unify them. Anyway, it is a very good time to do
> > so, because I found the current probe-event fetch method is not good with
> > retpoline/IBRS; it is full of indirect calls.
> > 
> > I would like to convert it to eBPF if possible. It will be good for the
> > performance with JIT, and we can collaborate on the same code with BPF
> > people.
> 
> The current probe fetch method is indeed going to slow down due to
> retpoline, but this issue is going to affect not only this piece
> of code, but the rest of the kernel where indirect call performance
> matters a lot. Like networking stack where we have at least 4 indirect
> calls per packet.
> So I'd suggest focusing on finding a general method instead of coming up
> with a specific solution for this kprobe fetching problem.

OK.

> Devirtualization approach works well and applicable in many cases.
> For networking stack deliver_skb() and __netif_receive_skb_core()
> can check if (pt_prev->func == ip_rcv || ipv6_rcv)
> and call them directly.

Yeah, if the options are limited, that works. (like replacing with
switch-case)

> The other approach I was thinking to explore is static_key-like
> for indirect calls. In many cases the target is rarely changed,
> so we can do arch specific rewrite of destination offset inside
> normal direct call instruction. That should be faster than retpoline.

I doubt it. Most of the indirect call uses are "ops->method", and the
target depends on "ops".

> As far as emitting raw bpf insns instead of kprobe fetch methods,
> there is a big problem with such an approach. The interpreter and all
> JITs take a 'struct bpf_prog' that has passed the verifier, not just a
> random set of bpf instructions. BPF is not a generic assembler.

If you mean kernel/bpf/verifier.c, I'm happy with passing raw
bpf insns generated by kprobe-fetch-method to it :)

> BPF is an instruction set _with_ C calling convention.
> The registers and instructions must be used in certain way or
> things will horribly break.
> See Documentation/bpf/bpf_design_QA.txt for details.
> Long ago I wrote a patch that converted pred tree walk into
> raw bpf insns. If that patch made it into mainline back then
> it would have been a huge headache for us now.
> So if you plan on generating bpf programs they _must_ pass the verifier.

Yes, of course.
Anyway, it is just an idea for retpoline/Spectre V2. (Yeah, it is an
actual big issue: it makes the formerly faster pointer-call method
slower. Now we see that a switch-case may be faster in some cases.)

I'm also considering simplifying it (or doing it with a branch and a
static function call) as Steve did in this series.

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Juri Lelli
Hi Steve,

On 03/02/18 16:17, Steven Rostedt wrote:
> On Sat, 3 Feb 2018 12:52:08 -0800
> Alexei Starovoitov  wrote:
> 
> > It's a user space job.
> 
> BTW, I asked around at DevConf.cz, and nobody I talked with (besides
> Arnaldo) has used eBPF. The "path to hello world" is quite long. This
> interface is extremely simple to use, and one doesn't need to install
> LLVM or other tools to interface with it.

Yep. Managed to get this working in less than an hour. :)

With something like

# echo 'replenish_dl_entity(u64 dl_runtime[3] | u64 dl_deadline[4] | u64 dl_period[5] | s64 runtime[8] | u64 deadline[9])' > function_events
# echo 'sched:*' > set_event
# echo replenish_dl_entity >> set_event

you can get something like

--->8---
 [...]
 cpuhog-3556  [002] d..3   727.101815: sched_switch: prev_comm=cpuhog prev_pid=3556 prev_prio=-1 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
 <idle>-0 [002] d.s4   727.128139: sched_waking: comm=kworker/2:1 pid=53 prio=120 target_cpu=002
 <idle>-0 [002] dNs5   727.128150: sched_wakeup: comm=kworker/2:1 pid=53 prio=120 target_cpu=002
 <idle>-0 [002] d..3   727.128184: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=kworker/2:1 next_pid=53 next_prio=120
 kworker/2:1-53 [002] d..3   727.128280: sched_stat_runtime: comm=kworker/2:1 pid=53 runtime=123827 [ns] vruntime=12389788162 [ns]
 kworker/2:1-53 [002] d..3   727.128288: sched_switch: prev_comm=kworker/2:1 prev_pid=53 prev_prio=120 prev_state=R+ ==> next_comm=swapper/2 next_pid=0 next_prio=120
 <idle>-0 [002] d.h5   727.191609: enqueue_task_dl->replenish_dl_entity(dl_runtime=1000, dl_deadline=1, dl_period=1, runtime=-218339, deadline=726823680456)
 <idle>-0 [002] d..3   727.191676: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=S ==> next_comm=cpuhog next_pid=3556 next_prio=-1
 [...]
--->8---

Which is quite nice already IMHO.

> I used the analogy, that eBPF is like C, and this is like Bash. One is
> much easier to get "Hello World!" out than the other.
> 
> So personally, this is something I know I would use (note, I have
> never used eBPF either). But if I'm the only one to use this
> interface then I'll stop here (and not bother with the function graph
> return interface). If others think this would be helpful, I would ask
> them to speak up now.

First impression is that this is going to be definitely useful if

 - it's possible to hook at function end (e.g., replenish_dl_entity above
   carries more useful information _after_ it did its job)
 - inside functions? not really sure it's actually going to be needed, but I
   was wondering if it's possible at all :); with tracepoints it's for example
   easy to collect detailed information about which branches have been taken etc.

I'm going to play with this more. Just wanted to give back a quick positive
feedback.

I'm also adding Arm folks to the discussion, as they (and I :) have been
asking to add tracepoints to scheduler code in the past [1].

Best,

- Juri

[1] https://marc.info/?l=linux-kernel&m=149068303518607


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Daniel Bristot de Oliveira
On 02/05/2018 11:23 AM, Juri Lelli wrote:
> Hi Steve,
> 
> On 02/02/18 18:04, Steven Rostedt wrote:
>>
>> At Kernel Summit back in October, we tried to bring up trace markers, which
>> would be nops within the kernel proper, that would allow modules to hook
>> arbitrary trace events to them. The reaction to this proposal was less than
>> favorable. We were told that we were trying to make a work around for a
>> problem, and not solving it. The problem in our minds is the notion of a
>> "stable trace event".
>>
>> There are maintainers that do not want trace events, or more trace events in
>> their subsystems. This is due to the fact that trace events expose an
>> interface to user space, and this interface could become required by some
>> tool. This may cause the trace event to become stable where it must not
>> break the tool, and thus prevent the code from changing.
>>
>> Or, the trace event may just have to add padding for fields that tools
>> may require. The "success" field of the sched_wakeup trace event is one such
>> instance. There is no more "success" variable, but tools may fail if it were
>> to go away, so a "1" is simply added to the trace event wasting ring buffer
>> real estate.
>>
>> I talked with Linus about this, and he told me that we already have these
>> markers in the kernel. They are from the mcount/__fentry__ used by function
>> tracing. Have the trace events be created by these, and see if this will
>> satisfy most areas that want trace events.
>>
>> I decided to implement this idea, and here's the patch set.
>>
>> Introducing "function based events". These are created dynamically by a
>> tracefs file called "function_events". By writing a pseudo prototype into
>> this file, you create an event.
>>
>>  # mount -t tracefs nodev /sys/kernel/tracing
>>  # cd /sys/kernel/tracing
>>  # echo 'do_IRQ(symbol ip[16] | x64[6] irq_stack[16])' > function_events
>>  # cat events/functions/do_IRQ/format
>> name: do_IRQ
>> ID: 1399
>> format:
>>  field:unsigned short common_type;   offset:0;   size:2; signed:0;
>>  field:unsigned char common_flags;   offset:2;   size:1; signed:0;
>>  field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
>>  field:int common_pid;   offset:4;   size:4; signed:1;
>>
>>  field:unsigned long __parent_ip;offset:8;   size:8; signed:0;
>>  field:unsigned long __ip;   offset:16;  size:8; signed:0;
>>  field:symbol ip;offset:24;  size:8; signed:0;
>>  field:x64 irq_stack[6]; offset:32;  size:48;signed:0;
>>
>> print fmt: "%pS->%pS(ip=%pS, irq_stack=%llx:%llx:%llx:%llx:%llx:%llx)", REC->__ip, REC->__parent_ip, REC->ip, REC->irq_stack[0], REC->irq_stack[1], REC->irq_stack[2], REC->irq_stack[3], REC->irq_stack[4], REC->irq_stack[5]
>>
>>  # echo 1 > events/functions/do_IRQ/enable
> 
> Got the following:
> 
> [  110.433602] =
> [  110.435671] WARNING: suspicious RCU usage
> [  110.437173] 4.15.0-rc9+ #42 Not tainted
> [  110.438698] -
> [  110.440343] /home/juri/Work/kernel/linux/include/linux/rcupdate.h:749 rcu_read_lock_sched() used illegally while idle!
> [  110.80]
> [  110.80] other info that might help us debug this:
> [  110.80]
> [  110.447616]
> [  110.447616] RCU used illegally from idle CPU!
> [  110.447616] rcu_scheduler_active = 2, debug_locks = 1
> [  110.452047] RCU used illegally from extended quiescent state!
> [  110.454072] 1 lock held by swapper/0/0:
> [  110.455447]  #0:  (rcu_read_lock_sched){}, at: [] func_event_call+0x0/0x3c0
> [  110.458532]
> [  110.458532] stack backtrace:
> [  110.460066] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc9+ #42
> [  110.462300] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> [  110.464477] Call Trace:
> [  110.465095]  <IRQ>
> [  110.465600]  dump_stack+0x85/0xc5
> [  110.466417]  func_event_call+0x378/0x3c0
> [  110.467373]  ? find_held_lock+0x34/0xa0
> [  110.468216]  ? common_interrupt+0xa2/0xa2
> [  110.468978]  ? irq_work_interrupt+0xb0/0xb0
> [  110.470021]  ? hrtimer_start_range_ns+0x1bf/0x3e0
> [  110.471031]  ftrace_ops_assist_func+0x64/0xf0
> [  110.471941]  ? _raw_spin_unlock_irqrestore+0x55/0x60
> [  110.472926]  0xc02e30bf
> [  110.473491]  ? do_IRQ+0x5/0x100
> [  110.473977]  do_IRQ+0x5/0x100
> [  110.474430]  common_interrupt+0xa2/0xa2
> [  110.475014]  </IRQ>
> [  110.475341] RIP: 0010:native_safe_halt+0x2/0x10
> [  110.476020] RSP: 0018:96a03ec8 EFLAGS: 0206 ORIG_RAX: ffdd
> [  110.477137] RAX: 96a2a500 RBX:  RCX: 
> [  110.478110] RDX: 96a2a500 RSI: 0001 RDI: 96a2a500
> [  110.478997] RBP:  R08: 0001 R09: 
> [  110.479880] R10:  R11:  R12:

Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Juri Lelli
Hi Steve,

On 02/02/18 18:04, Steven Rostedt wrote:
> 
> At Kernel Summit back in October, we tried to bring up trace markers, which
> would be nops within the kernel proper, that would allow modules to hook
> arbitrary trace events to them. The reaction to this proposal was less than
> favorable. We were told that we were trying to make a work around for a
> problem, and not solving it. The problem in our minds is the notion of a
> "stable trace event".
> 
> There are maintainers that do not want trace events, or more trace events in
> their subsystems. This is due to the fact that trace events expose an
> interface to user space, and this interface could become required by some
> tool. This may cause the trace event to become stable where it must not
> break the tool, and thus prevent the code from changing.
> 
> Or, the trace event may just have to add padding for fields that tools
> may require. The "success" field of the sched_wakeup trace event is one such
> instance. There is no more "success" variable, but tools may fail if it were
> to go away, so a "1" is simply added to the trace event wasting ring buffer
> real estate.
> 
> I talked with Linus about this, and he told me that we already have these
> markers in the kernel. They are from the mcount/__fentry__ used by function
> tracing. Have the trace events be created by these, and see if this will
> satisfy most areas that want trace events.
> 
> I decided to implement this idea, and here's the patch set.
> 
> Introducing "function based events". These are created dynamically by a
> tracefs file called "function_events". By writing a pseudo prototype into
> this file, you create an event.
> 
>  # mount -t tracefs nodev /sys/kernel/tracing
>  # cd /sys/kernel/tracing
>  # echo 'do_IRQ(symbol ip[16] | x64[6] irq_stack[16])' > function_events
>  # cat events/functions/do_IRQ/format
> name: do_IRQ
> ID: 1399
> format:
>   field:unsigned short common_type;   offset:0;   size:2; signed:0;
>   field:unsigned char common_flags;   offset:2;   size:1; signed:0;
>   field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
>   field:int common_pid;   offset:4;   size:4; signed:1;
> 
>   field:unsigned long __parent_ip;offset:8;   size:8; signed:0;
>   field:unsigned long __ip;   offset:16;  size:8; signed:0;
>   field:symbol ip;offset:24;  size:8; signed:0;
>   field:x64 irq_stack[6]; offset:32;  size:48;signed:0;
> 
> print fmt: "%pS->%pS(ip=%pS, irq_stack=%llx:%llx:%llx:%llx:%llx:%llx)", REC->__ip, REC->__parent_ip, REC->ip, REC->irq_stack[0], REC->irq_stack[1], REC->irq_stack[2], REC->irq_stack[3], REC->irq_stack[4], REC->irq_stack[5]
> 
>  # echo 1 > events/functions/do_IRQ/enable

Got the following:

[  110.433602] =
[  110.435671] WARNING: suspicious RCU usage
[  110.437173] 4.15.0-rc9+ #42 Not tainted
[  110.438698] -
[  110.440343] /home/juri/Work/kernel/linux/include/linux/rcupdate.h:749 rcu_read_lock_sched() used illegally while idle!
[  110.80]
[  110.80] other info that might help us debug this:
[  110.80]
[  110.447616]
[  110.447616] RCU used illegally from idle CPU!
[  110.447616] rcu_scheduler_active = 2, debug_locks = 1
[  110.452047] RCU used illegally from extended quiescent state!
[  110.454072] 1 lock held by swapper/0/0:
[  110.455447]  #0:  (rcu_read_lock_sched){}, at: [] func_event_call+0x0/0x3c0
[  110.458532]
[  110.458532] stack backtrace:
[  110.460066] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc9+ #42
[  110.462300] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
[  110.464477] Call Trace:
[  110.465095]  <IRQ>
[  110.465600]  dump_stack+0x85/0xc5
[  110.466417]  func_event_call+0x378/0x3c0
[  110.467373]  ? find_held_lock+0x34/0xa0
[  110.468216]  ? common_interrupt+0xa2/0xa2
[  110.468978]  ? irq_work_interrupt+0xb0/0xb0
[  110.470021]  ? hrtimer_start_range_ns+0x1bf/0x3e0
[  110.471031]  ftrace_ops_assist_func+0x64/0xf0
[  110.471941]  ? _raw_spin_unlock_irqrestore+0x55/0x60
[  110.472926]  0xc02e30bf
[  110.473491]  ? do_IRQ+0x5/0x100
[  110.473977]  do_IRQ+0x5/0x100
[  110.474430]  common_interrupt+0xa2/0xa2
[  110.475014]  </IRQ>
[  110.475341] RIP: 0010:native_safe_halt+0x2/0x10
[  110.476020] RSP: 0018:96a03ec8 EFLAGS: 0206 ORIG_RAX: ffdd
[  110.477137] RAX: 96a2a500 RBX:  RCX: 
[  110.478110] RDX: 96a2a500 RSI: 0001 RDI: 96a2a500
[  110.478997] RBP:  R08: 0001 R09: 
[  110.479880] R10:  R11:  R12: 
[  110.480764] R13:  R14:  R15: 
[  110.481661]  default_idle+0x1f/0x1a0
[  110.482118]  do_idle+0x166/0x1e0
[  110.482530] 

Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-05 Thread Peter Zijlstra
On Sun, Feb 04, 2018 at 11:39:39AM -0800, Linus Torvalds wrote:
> Then there's the "I just want an overview" MIS people, who care about
> things like "I want a histogram of packets sent according to criteria
> XYZ", who want some highlevel block IO performance, or who just want
> random system-wide statistics.

> The second group might want explicit trace points exactly because that
> group doesn't even care *how* a packet is sent or received, or what
> the path through the block layer is. It just wants to know "packet
> sent" or "latency between IO request and completion" or things like
> that.

So a large sticking point here has been the scheduler tracepoints, which
I'm rather unhappy with.

As a result of adding SCHED_DEADLINE the existing tracepoints no longer
are sufficient (they don't provide any deadline specific information).

So the MIS people that are interested in deadline tasks are unhappy.

My own preference is to just add the deadline information to the
existing tracepoints, but then people complain these become too big
(which is slow etc..), saying sched_switch is a high rate tracepoint
(true of course) (not to mention that changing the tracepoint will
probably break something, but they'll just have to cope).

So they've proposed all kinds of horrible alternatives that are all
variations of multiple tracepoints in the same location that fragment
the information, each of which I hate.

I'm ok with having the _one_ tracepoint, but I don't want 3+
sched_switch tracepoints, each having a different set of information
depending on what people want, that way lies madness.

As a run-around, Steve then suggested to decouple the trace-hook from
the actual trace-event. Let the scheduler only provide the hook, nothing
else. And then allow users to create their own events with the specific
data they need for their specific use-case.

Various options have been floated, ebpf, modules, whatever.

At this point I'm well tired of all this and would just as soon rip out
all tracepoints entirely; but of course, the scheduler has some very
useful information for MIS people, so we can't really do that either.

/sadface


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-04 Thread Linus Torvalds
On Sun, Feb 4, 2018 at 7:30 AM, Mathieu Desnoyers wrote:
>
> I agree with your arguments. A consequence of those arguments is that
> function-based tracing should be expected to be used by kernel engineers
> and experts who can adapt their scripts to follow code changes, and tune
> the script based on their specific kernel version and configuration.

Honestly, I think that's largely the case already.

The main source of tracing is done by experts at big cloud companies,
I bet. People who do it for performance reasons, or to find some
anomaly. They're pretty intimate with the kernel.

There _are_ "generic MIS" uses for tracing, and I think those are
places where we may want architectural trace points. Things like
gathering IO statistics etc.

I personally think that one of the pain points with tracing has been
exactly the fact that there are two *completely* different uses, and
they have *completely* different requirements. There's the expert
user, who basically wants tracepoints almost everywhere, and who is
doing some really deep analysis of some random area.

Then there's the "I just want an overview" MIS people, who care about
things like "I want a histogram of packets sent according to criteria
XYZ", who want some highlevel block IO performance, or who just want
random system-wide statistics.

One group really needs to tie in to _anything_, and by definition is
going to delve deep into some very specific corner of the kernel,
because they might be chasing a subtle bug and want to have traces to
just _find_ it.

The other group is looking for a much higher-level thing, and isn't
necessarily a kernel hacker, and just wants to know IO latencies or
something for statistics.

I think the function-based events is for that first group. We do not
want to have actual explicit trace events for that group, because that
group might want them _everywhere_.  That first group might want to
know the latency of a packet or block command through one particular
chain.

The second group might want explicit trace points exactly because that
group doesn't even care *how* a packet is sent or received, or what
the path through the block layer is. It just wants to know "packet
sent" or "latency between IO request and completion" or things like
that.

The first group cares about a particular kernel implementation and has
the expertise to line things up for the particular kernel that is
being deployed on a hundred thousand machines.

The second group doesn't want to care about a particular kernel, just
wants tools that work across them.

This is why I pushed Steven towards this function-based events things.
Because I'm *hoping* that this can actually resolve that conflict
between the two groups. Function-based events are for the first group,
while actual explicit trace points are for the second.

(Obviously it's not entirely black-and-white, but I do think there is
a pretty big difference between the two groups. And the first group
will obviously use the explicit trace points _too_, generally to
narrow down where they want to go with the function-based one).

We'll see. Maybe I'm entirely wrong. But I'm hoping that the
function-based one will end up being helpful.

   Linus


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-04 Thread Alexei Starovoitov
On Sun, Feb 04, 2018 at 12:57:47PM +0900, Masami Hiramatsu wrote:
> 
> > I based some of the code from kprobes too. But I wanted this to be
> > simpler, and as such, not as powerful as kprobes. More of a "poor man's"
> > kprobe ;-) Where you are limited to functions and their arguments. If
> > you need more power, switch to kprobes. In other words, it's just an
> > added stepping stone.
> > 
> > Also, this should work without kprobe support, only ftrace, and function
> > args from the arch.
> 
> Hmm, but the implementation seems very far from the current probe events;
> we need to consider how to unify them. Anyway, it is a very good time to do
> so, because I found the current probe-event fetch method is not good with
> retpoline/IBRS; it is full of indirect calls.
> 
> I would like to convert it to eBPF if possible. It will be good for the
> performance with JIT, and we can collaborate on the same code with BPF
> people.

The current probe fetch method is indeed going to slow down due to
retpoline, but this issue is going to affect not only this piece
of code, but the rest of the kernel where indirect call performance
matters a lot. Like networking stack where we have at least 4 indirect
calls per packet.
So I'd suggest focusing on finding a general method instead of coming up
with a specific solution for this kprobe fetching problem.
Devirtualization approach works well and applicable in many cases.
For networking stack deliver_skb() and __netif_receive_skb_core()
can check if (pt_prev->func == ip_rcv || ipv6_rcv)
and call them directly.
The other approach I was thinking to explore is static_key-like
for indirect calls. In many cases the target is rarely changed,
so we can do arch specific rewrite of destination offset inside
normal direct call instruction. That should be faster than retpoline.

As far as emitting raw bpf insns instead of kprobe fetch methods,
there is a big problem with such an approach. The interpreter and all
JITs take a 'struct bpf_prog' that has passed the verifier, not just a
random set of bpf instructions. BPF is not a generic assembler.
BPF is an instruction set _with_ C calling convention.
The registers and instructions must be used in certain way or
things will horribly break.
See Documentation/bpf/bpf_design_QA.txt for details.
Long ago I wrote a patch that converted pred tree walk into
raw bpf insns. If that patch made it into mainline back then
it would have been a huge headache for us now.
So if you plan on generating bpf programs they _must_ pass the verifier.



Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-04 Thread Mathieu Desnoyers
- On Feb 3, 2018, at 4:08 PM, rostedt rost...@goodmis.org wrote:

> On Sat, 3 Feb 2018 12:52:08 -0800
> Alexei Starovoitov  wrote:
> 
>> On Sat, Feb 03, 2018 at 02:02:17PM -0500, Steven Rostedt wrote:
>> > 
>> > From those that were asking about having "trace markers" (ie.
>> > Facebook), they told us they can cope with kernel changes.
>> 
>> There is some misunderstanding here.
>> We never asked for this interface.
> 
> But you wanted trace markers? Just to confirm.
> 
>> We're perfectly fine with existing kprobe/tracepoint+bpf.
> 
> OK, so no new development in this was wanted? So the entire talk about
> getting tracepoints into vfs and scheduling wasn't needed?

I did presentations in the past months about the need to add some
tracepoints into scheduling and IPI code on x86.

Instrumentation of IPI is needed not only by kernel developers, but also
by tools targeting sysadmins/app developers to help them figure out where
the time is spent when they are hitting unexpected long latencies in their
system. We can indeed start by using function instrumentation to show the
usefulness of this instrumentation, but I expect that we'll end up adding
a tracepoint there eventually.

Tracepoints in scheduling also fall into the category of letting sysadmins
and app developers understand where time is spent on their system. When
they hit an unexpected long latency, they want to understand what is
wrong in their task priority and scheduling policy that led to this delay.
The data extracted from the scheduler today is not sufficient to achieve
this. So this is another case where we might see kernel developers using
function instrumentation initially, but we'll probably end up adding and/or
changing tracepoints to help users out there who need tools analyzing this
scheduling data.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-04 Thread Steven Rostedt
On Sun, 4 Feb 2018 15:30:04 + (UTC)
Mathieu Desnoyers  wrote:

> This should therefore leave a door open to adding new tracepoints: cases
> where the data gathered is shown to be useful enough for tools targeting
> an audience wider than just kernel developers. To improve over the current
> situation, we should think about documenting some rules about how those
> tools should cope with tracepoints changing over time (event version,
> tools backward compatibility, and so on), and make sure the ABI exposes
> the information required to help tools cope with change.

As I mentioned earlier, if a function-based event proves to be useful
enough to pull out information that sysadmins et al. find beneficial,
then that could be used as an argument to create a normal static
tracepoint for that information.

-- Steve


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-04 Thread Mathieu Desnoyers
- On Feb 3, 2018, at 4:43 PM, Linus Torvalds torva...@linux-foundation.org wrote:

> On Sat, Feb 3, 2018 at 9:04 AM, Mathieu Desnoyers
>  wrote:
>>
>> The approach proposed here will introduce an expectation that internal
> > function signatures never change in the kernel, else it would break user-space
> > tools hooking on those functions.
> 
> No, I really don't think so.
> 
> There's two reasons for that:
> 
> The first is purely about kernel development. I, and every sane kernel
> engineer, will simply laugh in the face of somebody who comes to us
> and says "hey, I had this script that did low-level function tracing
> on your kernel, and then you changed something, and now the random
> function I was tracing has a new name and different arguments".
> 
> We'll just go "yeah, tough, change your script". Or more likely, not
> even bother to reply at all.
> 
> But the bigger issue is actually simply just psychology. Exactly
> *because* this is all implicit, and there are no explicit trace
> points, it's _obvious_ to any user that there isn't something
> long-term dependable that they hang their hat on.
> 
> Everybody *understands* that this is like a debugger: if you have a
> gdb script that shows some information, and then you go around and
> change the source code, then *obviously* you'll have to change your
> debugger script too. You don't keep the source code static just to
> make your gdb script happy. That would be silly.
> 
> In contrast, the explicit tracepoints really made people believe that
> they have some long-term meaning.
> 
> So yes, we'll  make it obvious that hell no, random kernel functions
> are not a long-term ABI. But honestly, I don't think we even need to
> have a lot of "education" on this, simply because it's so obvious that
> anybody who thinks it's some ABI is not going to be somebody we'll
> have to worry about.
> 
> Because the kind of person thinking "Ooh, this is a stable ABI" won't
> be doing interesting work anyway. That kind of person will be sitting
> in a corner eating paste, not doing interesting kernel tracing.

I agree with your arguments. A consequence of those arguments is that
function-based tracing should be expected to be used by kernel engineers
and experts who can adapt their scripts to follow code changes, and tune
the script based on their specific kernel version and configuration.

On the other hand, system administrators and application developers who
wish to tune and understand their workload will more likely fetch their
analysis scripts/tools from a third-party. If those scripts hook on
functions/arguments, it becomes really hard to ensure those will work
with a myriad of kernel versions, configurations, and toolchain versions
out there.

The good news is that Steven's approach should allow us to deal with
use-cases that are specific to kernel developers by simply using
function tracing. That should take care of most "one off" and very
specialized use-cases.

My genuine concern here is that tools targeting sysadmins and application
developers start hooking on kernel functions, and all goes well for a
few months or years until someone changes that part of the code. Then
all those wonderfully useful scripts that users depend on will end up
being broken: in the best scenario those tools will detect the change
and require an update, but in the worst case the tools will simply
print bogus results that people will take for granted.

This should therefore leave a door open to adding new tracepoints: cases
where the data gathered is shown to be useful enough for tools targeting
an audience wider than just kernel developers. To improve over the current
situation, we should think about documenting some rules about how those
tools should cope with tracepoints changing over time (event version,
tools backward compatibility, and so on), and make sure the ABI exposes
the information required to help tools cope with change.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Masami Hiramatsu
On Sat, 3 Feb 2018 10:27:59 -0500
Steven Rostedt  wrote:

> On Sat, 3 Feb 2018 22:38:17 +0900
> Masami Hiramatsu  wrote:
> 
> > This seems to be a very similar feature to what the kprobe-based events do.
> 
> It is similar, but not the same as kprobes. It only focuses on
> functions and their arguments, and should not require any disassembling
> (no knowledge of regs required). Any need to see anything within the
> function will still require kprobe based help.

Yes, I see that point.

> > Earlier version of kprobe-based event supported Nth argument, but I decided
> > to drop it because we can not ensure the "function signature" from kernel
> > binary. It has been passed to "perf probe", so that we can define line-level
> > granularity. 
> 
> Sure, and if we get a wrong function, which can happen, the code is set
> up not to break anything. You only get garbage output.
> 
> > 
> > Of course if it is easy to support "argN" again as it does if the arch's
> > calling convention is clearly stated. (and now kprobe is based on ftrace
> > if it is on the entry of functions)
> > 
> > So my question is, what is the difference between those from the user
> > perspective? Is the event syntax the only problem?
> > I'm very confused...
> 
> Again, this is not a kprobe replacement. It is somewhat of a syntax
> issue. I want this to be very simple and not need a disassembler. For
> indexing of structures one may use gdb, but that's about it. You could
> get the same info from counting what's in a structure itself.

OK, so it is a simpler version of function event...

> I based some of the code from kprobes too. But I wanted this to be
> simpler, and as such, not as powerful as kprobes. More of a "poor man's"
> kprobe ;-) Where you are limited to functions and their arguments. If
> you need more power, switch to kprobes. In other words, its just an
> added stepping stone.
> 
> Also, this should work without kprobe support, only ftrace, and function
> args from the arch.

Hmm, but the implementation seems very far from the current probe events; we
need to consider how to unify them. Anyway, it is a very good time to do so,
because I found the current probe-event fetch method is not good with
retpoline/IBRS, since it is full of indirect calls.

I would like to convert it to eBPF if possible. It will be good for the
performance with JIT, and we can collaborate on the same code with BPF
people.

Thank you,

-- 
Masami Hiramatsu 


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Namhyung Kim
On Sun, Feb 4, 2018 at 6:30 AM, Alexei Starovoitov
 wrote:
> On Sat, Feb 03, 2018 at 04:08:24PM -0500, Steven Rostedt wrote:
>> OK, so no new development in this was wanted? So the entire talk about
>> getting tracepoints into vfs and scheduling wasn't needed?
>
> I don't know who wants tracepoints in vfs.

AFAIK some people wanted to get some info (e.g. filename) from vfs.


>> Not if you are working in the embedded space and only have busybox as
>> your interface.
>
> did you notice bpfd project that does remote kprobe+bpf into an android phone?
> or phone is not an embedded space?

I have never tried bpfd yet, but it looks promising.
It'd be nice to have such a setup working on a typical yocto environment.
Anyway, I'd say that an android phone is different from typical embedded
systems. :)

Thanks,
Namhyung


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Namhyung Kim
Hi Steve and Alexei,

On Sun, Feb 4, 2018 at 6:17 AM, Steven Rostedt  wrote:
> On Sat, 3 Feb 2018 12:52:08 -0800
> Alexei Starovoitov  wrote:
>
>> It's a user space job.
>
> BTW, I asked around at DevConf.cz, and nobody I talked with (besides
> Arnaldo) has used eBPF. The "path to hello world" is quite high. This
> interface is extremely simple to use, and one doesn't need to install
> LLVM or other tools to interface with it.
>
> I used the analogy, that eBPF is like C, and this is like Bash. One is
> much easier to get "Hello World!" out than the other.
>
> So personally, this is something I know I would use (note, I have
> never used eBPF either). But if I'm the only one to use this
> interface then I'll stop here (and not bother with the function graph
> return interface). If others think this would be helpful, I would ask
> them to speak up now.

I'm interested in this.  From my understanding, it's basically
function tracing + filter + custom argument info, right?

Supporting arguments with complex types could be error-prone.
We need to prevent malfunctions caused by invalid inputs.

Thanks,
Namhyung


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Linus Torvalds
On Sat, Feb 3, 2018 at 9:04 AM, Mathieu Desnoyers
 wrote:
>
> The approach proposed here will introduce an expectation that internal
> function signatures never change in the kernel, else it would break user-space
> tools hooking on those functions.

No, I really don't think so.

There's two reasons for that:

The first is purely about kernel development. I, and every sane kernel
engineer, will simply laugh in the face of somebody who comes to us
and says "hey, I had this script that did low-level function tracing
on your kernel, and then you changed something, and now the random
function I was tracing has a new name and different arguments".

We'll just go "yeah, tough, change your script". Or more likely, not
even bother to reply at all.

But the bigger issue is actually simply just psychology. Exactly
*because* this is all implicit, and there are no explicit trace
points, it's _obvious_ to any user that there isn't something
long-term dependable that they hang their hat on.

Everybody *understands* that this is like a debugger: if you have a
gdb script that shows some information, and then you go around and
change the source code, then *obviously* you'll have to change your
debugger script too. You don't keep the source code static just to
make your gdb script happy. That would be silly.

In contrast, the explicit tracepoints really made people believe that
they have some long-term meaning.

So yes, we'll  make it obvious that hell no, random kernel functions
are not a long-term ABI. But honestly, I don't think we even need to
have a lot of "education" on this, simply because it's so obvious that
anybody who thinks it's some ABI is not going to be somebody we'll
have to worry about.

Because the kind of person thinking "Ooh, this is a stable ABI" won't
be doing interesting work anyway. That kind of person will be sitting
in a corner eating paste, not doing interesting kernel tracing.

 Linus


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Alexei Starovoitov
On Sat, Feb 03, 2018 at 04:17:32PM -0500, Steven Rostedt wrote:
> On Sat, 3 Feb 2018 12:52:08 -0800
> Alexei Starovoitov  wrote:
> 
> > It's a user space job.
> 
> BTW, I asked around at DevConf.cz, and nobody I talked with (besides
> Arnaldo) has used eBPF. The "path to hello world" is quite high. This
> interface is extremely simple to use, and one doesn't need to install
> LLVM or other tools to interface with it.
> 
> I used the analogy, that eBPF is like C, and this is like Bash. One is
> much easier to get "Hello World!" out than the other.
> 
> So personally, this is something I know I would use (note, I have
> never used eBPF either). But if I'm the only one to use this
> interface then I'll stop here (and not bother with the function graph
> return interface). If others think this would be helpful, I would ask
> them to speak up now.

I'm not arguing against the patches. I know that there are folks
out there who like to use cat/echo interfaces.
I'm only happy that the whole thing has its own kconfig
that we can keep off in fb kernels to reduce maintenance/support burden.
Just like we do for all sorts of other kernel features.



Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Alexei Starovoitov
On Sat, Feb 03, 2018 at 04:08:24PM -0500, Steven Rostedt wrote:
> On Sat, 3 Feb 2018 12:52:08 -0800
> Alexei Starovoitov  wrote:
> 
> > On Sat, Feb 03, 2018 at 02:02:17PM -0500, Steven Rostedt wrote:
> > > 
> > > From those that were asking about having "trace markers" (ie.
> > > Facebook), they told us they can cope with kernel changes.  
> > 
> > There is some misunderstanding here.
> > We never asked for this interface.
> 
> But you wanted trace markers? Just to confirm.

what is the definition of a 'trace marker', and how is it better than a tracepoint?

> > We're perfectly fine with existing kprobe/tracepoint+bpf.
> 
> OK, so no new development in this was wanted? So the entire talk about
> getting tracepoints into vfs and scheduling wasn't needed?

I don't know who wants tracepoints in vfs.
Improving scheduler tracepoints? yes. definitely,
but it's not a technical problem and cannot be solved by technical means.

> > I don't see how they are any better than kprobes and suffer from the same issues.
> 
> One only needs to look at source code, to add these. You don't need to
> know the specifics of a registers and such. That's a big +. Sure, we
> could add this to kprobes as well. But this also doesn't need the
> kprobe infrastructure.

same goes for kprobes.
with bcc it's even easier. we write tools like:
int kprobe__inet_listen(struct pt_regs *ctx, struct socket *sock, int backlog)
{ 
  struct sock *sk = sock->sk;
  ...
and bcc knows that it needs to add kprobe to inet_listen()
and this function has two arguments of the given types.
Then sock->sk access is automatically replaced with bpf_probe_read, etc, etc.

> Not if you are working in the embedded space and only have busybox as
> your interface.

did you notice bpfd project that does remote kprobe+bpf into an android phone?
or phone is not an embedded space?

> I've already cleaned up several tracepoints that have no path to them.

Great. So you're making the same point as I do that tracepoints can and do change.



Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Steven Rostedt
On Sat, 3 Feb 2018 12:52:08 -0800
Alexei Starovoitov  wrote:

> It's a user space job.

BTW, I asked around at DevConf.cz, and nobody I talked with (besides
Arnaldo) has used eBPF. The "path to hello world" is quite high. This
interface is extremely simple to use, and one doesn't need to install
LLVM or other tools to interface with it.

I used the analogy, that eBPF is like C, and this is like Bash. One is
much easier to get "Hello World!" out than the other.

So personally, this is something I know I would use (note, I have
never used eBPF either). But if I'm the only one to use this
interface then I'll stop here (and not bother with the function graph
return interface). If others think this would be helpful, I would ask
them to speak up now.

Thanks,

-- Steve


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Steven Rostedt
On Sat, 3 Feb 2018 12:52:08 -0800
Alexei Starovoitov  wrote:

> On Sat, Feb 03, 2018 at 02:02:17PM -0500, Steven Rostedt wrote:
> > 
> > From those that were asking about having "trace markers" (ie.
> > Facebook), they told us they can cope with kernel changes.  
> 
> There is some misunderstanding here.
> We never asked for this interface.

But you wanted trace markers? Just to confirm.

> We're perfectly fine with existing kprobe/tracepoint+bpf.

OK, so no new development in this was wanted? So the entire talk about
getting tracepoints into vfs and scheduling wasn't needed?

> There are plenty of things to improve there, but this 'function based events'
> is not something we're interested in.

OK, but when I was showing this interface in DevConf.cz, there appeared
to be a lot of interest for it.

> I don't see how they are any better than kprobes and suffer from the same issues.

One only needs to look at source code, to add these. You don't need to
know the specifics of a registers and such. That's a big +. Sure, we
could add this to kprobes as well. But this also doesn't need the
kprobe infrastructure.

> We really dislike text based interfaces since they are good only

Who exactly is "we"?

Peter Zijlstra told me it's basically the only interface he uses. He
doesn't care for tools on top.

> for human access and very awkward to use from tools.

trace-cmd builds its entire connection without issue via these
interfaces. What is awkward about it?

> We also dislike when kernel takes on challenge to do text language parsing.
> It's a user space job.

Not if you are working in the embedded space and only have busybox as
your interface.

> 
> > The issue is that people are afraid to add tracepoints into their
> > subsystem because they are afraid that they will become stable and
> > limit their own development.  
> 
> This is not true. Tracepoints are being added and they're being changed.

vfs doesn't have any tracepoints. And Peter is reluctant to add any
more to the scheduler.

> We have a bunch of tools that use both kprobe and tracepoint hooks
> together with bpf programs to extract information from the kernel.
> They do break from time to time when we upgrade kernels (and we upgrade often),
> but keeping 'if kernel X do this, if kernel Y do that' inside the tool
> is perfectly fine.
> More often the tools have 'if kernel X ...' due to bpf verifier differences.
> It's constantly evolving and older kernels cannot load complex bpf
> programs written for the latest. So tools have to do some ugly workarounds.
> 
> > One problem we are having today is that too many trace events are being
> > created, where there are a lot of them that have been used once and
> > never used again. And people don't care about them.  
> 
> I don't think such an issue exists. Please point to an example of
> a tracepoint that was added and no one cares about it.

I've already cleaned up several tracepoints that have no path to them.
I'd say those are ones people do not care about. I've also removed
several trace events that are not even connected anywhere. These take
up around 5k each of memory. And these are just the trace events that
don't have paths to them. If we have tracepoints that no longer have
paths to them (which I can detect), how many more have paths but people
don't care about?

-- Steve

> 
> As far as Mathieu's point about detecting the difference between kernels,
> yes, it's a real problem to solve. The tracepoint changes are
> easy to detect, but changes to function arguments are much harder.
> Hence we're using kprobes on functions that are unlikely to change
> and that works well.
> 
> Also please note that kernel tracepoints are not different from tracing tool
> point of view than USDT tracepoints in user space apps.
> The tools attach to both of them and expect both to be more or less
> stable. Yet kernel tracepoints and USDT in apps _do_ change
> and tools have to deal with changes. It's actually harder with USDT.



Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Alexei Starovoitov
On Sat, Feb 03, 2018 at 02:02:17PM -0500, Steven Rostedt wrote:
> 
> From those that were asking about having "trace markers" (ie.
> Facebook), they told us they can cope with kernel changes.

There is some misunderstanding here.
We never asked for this interface.
We're perfectly fine with existing kprobe/tracepoint+bpf.
There are plenty of things to improve there, but this 'function based events'
is not something we're interested in.
I don't see how they are any better than kprobes and suffer from the same issues.
We really dislike text based interfaces since they are good only
for human access and very awkward to use from tools.
We also dislike when kernel takes on challenge to do text language parsing.
It's a user space job.

> The issue is that people are afraid to add tracepoints into their
> subsystem because they are afraid that they will become stable and
> limit their own development.

This is not true. Tracepoints are being added and they're being changed.
We have a bunch of tools that use both kprobe and tracepoint hooks
together with bpf programs to extract information from the kernel.
They do break from time to time when we upgrade kernels (and we upgrade often),
but keeping 'if kernel X do this, if kernel Y do that' inside the tool
is perfectly fine.
More often the tools have 'if kernel X ...' due to bpf verifier differences.
It's constantly evolving and older kernels cannot load complex bpf
programs written for the latest. So tools have to do some ugly workarounds.

> One problem we are having today is that too many trace events are being
> created, where there are a lot of them that have been used once and
> never used again. And people don't care about them.

I don't think such an issue exists. Please point to an example of
a tracepoint that was added and no one cares about it.

As far as Mathieu's point about detecting the difference between kernels,
yes, it's a real problem to solve. The tracepoint changes are
easy to detect, but changes to function arguments are much harder.
Hence we're using kprobes on functions that are unlikely to change
and that works well.

Also please note that kernel tracepoints are not different from tracing tool
point of view than USDT tracepoints in user space apps.
The tools attach to both of them and expect both to be more or less
stable. Yet kernel tracepoints and USDT in apps _do_ change
and tools have to deal with changes. It's actually harder with USDT.



Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Steven Rostedt
On Sat, 3 Feb 2018 17:04:14 + (UTC)
Mathieu Desnoyers  wrote:


> The approach proposed here will introduce an expectation that internal
> function signatures never change in the kernel, else it would break user-space
> tools hooking on those functions.

I had this exact discussion with Linus. Linus, please correct me if I'm
wrong.

This is a case where he said if someone expected a function to be
there, than too bad. Functions can come and go depending on if gcc
inlines it or not. We already have this interface today. It's the
function tracer. One could argue a tool requires a function to exist
because it depends on a function being accessible to the function
tracer.

> 
> The instrumentation infrastructure provided by this patchset might be useful
> for "one off" scripts, but it does not address the "stable instrumentation"
> expectations issue.

Actually, it could work for adding a tracepoint.

> 
> The problem today is caused by widely used trace analysis tools that cannot
> cope with changes to the kernel instrumentation, do not report the
> instrumentation mismatch compared to their expectations, and we generally
> don't expect users to ever update those tools to deal with newer kernels. Having
> those tools hook on function names/arguments will not make this magically go
> away. As soon as kernel code changes, widely used trace analysis tools will
> start breaking left and right, and we will be back to square one. Only this
> time, it's the internal function signature which will have become an ABI.

From those that were asking about having "trace markers" (i.e.
Facebook), they told us they can cope with kernel changes.

If a user can't cope with the changes, then they need to have their own
custom kernels.

> 
> A possible solution to this problem appears if we start considering trace
> analysis tools as just that: "tooling", with the following properties:
> 
> 1) Tools need to validate that the instrumentation provided matches their
>expectations. This can be done by checking event/field names and/or version.
>Tools that fail to do that should be fixed.
> 
> 2) Tools need to report to the user when the instrumentation does not match
>their expectations, and hint users to upgrade in order to deal with change.
> 
> 3) Tools need to be backward compatible with respect to instrumentation: a
>user switching between older and newer kernels should not have to keep
>various copies of their tooling stack (graphical UI, analysis scripts,
>and so on).
> 
> If our goal is really to address this "stable instrumentation" issue, I don't
> think hooking on functions helps in any way. I hope we can work on defining
> instrumentation interface rules in order to deal with the fundamental problem
> of requiring tooling to adapt to kernel changes.

I think you may have mistaken my goal. It was not to establish stable
instrumentation. In fact, it was the exact opposite. It was a way to
avoid stable instrumentation but still be able to add trace events.

The issue is that people are afraid to add tracepoints into their
subsystem because they are afraid that they will become stable and
limit their own development. The problem is that it hurts those that
want to trace these subsystems who are perfectly fine with the
tracepoints going away, and then they would need to change their tools.
This change set was to help those that can customize their tools with
new kernels. It was not to help those that just want their tools to
work with all kernels.

With that said, this actually can help those who want stable
infrastructure as well. If there happens to be a function that is
constantly used to create a dynamic function based event, it can then
be used as evidence when asking the subsystem maintainer to add a static
tracepoint there, as they can show that it is very useful to have.

One problem we are having today is that too many trace events are being
created, where there are a lot of them that have been used once and
never used again. And people don't care about them. I want to slow down
the addition of trace events if these function events can be used
instead. And when they are not good enough, or we see that one is
constantly being added, then we will know that we can add a trace event
that would be useful in the future.

-- Steve


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Steven Rostedt

I need to add a "Quick guide" and "Tips and tricks" section to the
document, to explain the arguments better...

Simple args are:

 "int val", "unsigned int val", "char x", "unsigned long addr",

Also the types:

 "s32 val", "u32 val", "s8 x", "u64 addr"

The above are all printed in decimal "%d" or "%u", if you want hex...

 "x32 val", "x8 x", "x64 addr"

If you want to have it use "%pS" to print (symbols)

 "symbol addr" is like: "%pS", (void *)addr

Arrays can be expressed after the type:

 "x8[6] mac" is like: "%x,%x,%x,%x,%x,%x", mac[0], mac[1], mac[2],
   mac[3], mac[4], mac[5]

Where mac would be: unsigned char mac[6] type.

Note, arrays of type "char" and "unsigned char" turn into a static
string.

 "char[10] str" is like: "%s", str

Where str is defined as char str[10];

If the argument is a pointer to a structure, you can index into the
structure:

 "x64 ip[16]" is like: "%llx", ((u64 *)ip)[16]

Finally, if an argument is a pointer to a structure, and you want to
get to another structure that it points to, for example

 struct sk_buff *skb;

and you want to get to:

  skb->dev->perm_addr

when the parameter is a pointer to skb.

  (gdb) printf "%d\n", &((struct sk_buff *)0)->dev
16
  (gdb) printf "%d\n", &((struct net_device *)0)->perm_addr
558

The net_device *dev is 16 bytes into sk_buff, and the char array
perm_addr is 558 bytes into the net_device structure.

Where perm_addr is an array of the mac address.

 "x8[6] perm_addr+16[0]+558"

Which would turn into:

 char *dev = ((char **)((void *)skb + 16))[0];
 char *perm_addr = (char *)(dev + 558);

 "%x,%x,%x,%x,%x,%x", perm_addr[0], perm_addr[1], perm_addr[2],
  perm_addr[3], perm_addr[4], perm_addr[5]

OK, the above is a bit complex ;-) But works nicely.

-- Steve

On Fri, 02 Feb 2018 18:04:58 -0500
Steven Rostedt  wrote:

> At Kernel Summit back in October, we tried to bring up trace markers, which
> would be nops within the kernel proper, that would allow modules to hook
> arbitrary trace events to them. The reaction to this proposal was less than
> favorable. We were told that we were trying to make a work around for a
> problem, and not solving it. The problem in our minds is the notion of a
> "stable trace event".
> 
> There are maintainers that do not want trace events, or more trace events in
> their subsystems. This is due to the fact that trace events post an
> interface to user space, and this interface could become required by some
> tool. This may cause the trace event to become stable where it must not
> break the tool, and thus prevent the code from changing.
> 
> Or, the trace event may just have to add padding for fields that tools
> may require. The "success" field of the sched_wakeup trace event is one such
> instance. There is no more "success" variable, but tools may fail if it were
> to go away, so a "1" is simply added to the trace event wasting ring buffer
> real estate.
> 
> I talked with Linus about this, and he told me that we already have these
> markers in the kernel. They are from the mcount/__fentry__ used by function
> tracing. Have the trace events be created by these, and see if this will
> satisfy most areas that want trace events.
> 
> I decided to implement this idea, and here's the patch set.
> 
> Introducing "function based events". These are created dynamically by a
> tracefs file called "function_events". By writing a pseudo prototype into
> this file, you create an event.
> 
>  # mount -t tracefs nodev /sys/kernel/tracing
>  # cd /sys/kernel/tracing
>  # echo 'do_IRQ(symbol ip[16] | x64[6] irq_stack[16])' > function_events
>  # cat events/functions/do_IRQ/format
> name: do_IRQ
> ID: 1399
> format:
>   field:unsigned short common_type;   offset:0;   size:2; signed:0;
>   field:unsigned char common_flags;   offset:2;   size:1; signed:0;
>   field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
>   field:int common_pid;   offset:4;   size:4; signed:1;
> 
>   field:unsigned long __parent_ip;offset:8;   size:8; signed:0;
>   field:unsigned long __ip;   offset:16;  size:8; signed:0;
>   field:symbol ip;offset:24;  size:8; signed:0;
>   field:x64 irq_stack[6]; offset:32;  size:48;signed:0;
> 
> print fmt: "%pS->%pS(ip=%pS, irq_stack=%llx:%llx:%llx:%llx:%llx:%llx)", REC->__ip, REC->__parent_ip, REC->ip, REC->irq_stack[0], REC->irq_stack[1], REC->irq_stack[2], REC->irq_stack[3], REC->irq_stack[4], REC->irq_stack[5]
> 
>  # echo 1 > events/functions/do_IRQ/enable
>  # cat trace
>   -0 [003] d..3  3647.049344: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.049433: ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, irq_stack=81665db1,10,246,c96c3e80,18,

Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Mathieu Desnoyers
- On Feb 2, 2018, at 6:04 PM, rostedt rost...@goodmis.org wrote:

> At Kernel Summit back in October, we tried to bring up trace markers, which
> would be nops within the kernel proper, that would allow modules to hook
> arbitrary trace events to them. The reaction to this proposal was less than
> favorable. We were told that we were trying to make a work around for a
> problem, and not solving it. The problem in our minds is the notion of a
> "stable trace event".
> 
> There are maintainers that do not want trace events, or more trace events in
> their subsystems. This is due to the fact that trace events post an
> interface to user space, and this interface could become required by some
> tool. This may cause the trace event to become stable where it must not
> break the tool, and thus prevent the code from changing.
> 
> Or, the trace event may just have to add padding for fields that tools
> may require. The "success" field of the sched_wakeup trace event is one such
> instance. There is no more "success" variable, but tools may fail if it were
> to go away, so a "1" is simply added to the trace event wasting ring buffer
> real estate.
> 
> I talked with Linus about this, and he told me that we already have these
> markers in the kernel. They are from the mcount/__fentry__ used by function
> tracing. Have the trace events be created by these, and see if this will
> satisfy most areas that want trace events.

The approach proposed here will introduce an expectation that internal
function signatures never change in the kernel, else it would break user-space
tools hooking on those functions.

The instrumentation infrastructure provided by this patchset might be useful
for "one off" scripts, but it does not address the "stable instrumentation"
expectations issue.

The problem today is caused by widely used trace analysis tools that cannot
cope with changes to the kernel instrumentation, do not report the
instrumentation mismatch compared to their expectations, and we generally
don't expect users to ever update those tools to deal with newer kernels. Having
those tools hook on function names/arguments will not make this magically go
away. As soon as kernel code changes, widely used trace analysis tools will
start breaking left and right, and we will be back to square one. Only this
time, it's the internal function signature which will have become an ABI.

A possible solution to this problem appears if we start considering trace
analysis tools as just that: "tooling", with the following properties:

1) Tools need to validate that the instrumentation provided matches their
   expectations. This can be done by checking event/field names and/or version.
   Tools that fail to do that should be fixed.

2) Tools need to report to the user when the instrumentation does not match
   their expectations, and hint users to upgrade in order to deal with change.

3) Tools need to be backward compatible with respect to instrumentation: a
   user switching between older and newer kernels should not have to keep
   various copies of their tooling stack (graphical UI, analysis scripts,
   and so on).
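A minimal sketch of property 1 above, assuming a hypothetical tool (this is not part of the patch set): before hooking an event, the tool parses the tracefs `format` file and checks that the fields it depends on actually exist with the expected sizes. The field names are taken from the do_IRQ example elsewhere in this thread.

```python
# Hypothetical tool-side validation: parse an event's tracefs "format"
# file and compare it against the tool's own expectations, instead of
# silently assuming the kernel instrumentation never changes.

def parse_format(text):
    """Parse 'field:<type> <name>; offset:N; size:N; ...' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("field:"):
            continue
        decl, offset, size = line.split(";")[:3]
        name = decl.split()[-1]        # last token of the declaration
        fields[name] = {
            "offset": int(offset.split(":")[1]),
            "size": int(size.split(":")[1]),
        }
    return fields

def validate(fields, expectations):
    """Return the expected fields the kernel does not provide as declared."""
    return [name for name, size in expectations.items()
            if name not in fields or fields[name]["size"] != size]

# Sample format-file content, as quoted in this thread:
FORMAT = """\
field:unsigned long __ip; offset:16; size:8; signed:0;
field:symbol ip; offset:24; size:8; signed:0;
field:x64 irq_stack[6]; offset:32; size:48; signed:0;
"""

missing = validate(parse_format(FORMAT), {"ip": 8, "irq_stack[6]": 48})
if missing:
    raise SystemExit(f"instrumentation mismatch, please upgrade: {missing}")
```

A tool structured this way can report the mismatch (property 2) instead of producing wrong analyses against a newer kernel.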

If our goal is really to address this "stable instrumentation" issue, I don't
think hooking on functions helps in any way. I hope we can work on defining
instrumentation interface rules in order to deal with the fundamental problem
of requiring tooling to adapt to kernel changes.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Steven Rostedt
On Sat, 3 Feb 2018 22:38:17 +0900
Masami Hiramatsu  wrote:

> This seems to be a very similar feature to what kprobe-based events do.

It is similar, but not the same as kprobes. It only focuses on
functions and their arguments, and should not require any disassembling
(no knowledge of regs required). Any need to see anything within the
function will still require kprobe based help.

> 
> An earlier version of kprobe-based events supported Nth-argument access, but I
> decided to drop it because we cannot ensure the "function signature" from the
> kernel binary. That role was passed on to "perf probe", so that we can define
> line-level granularity.

Sure, and if we get a wrong function, which can happen, the code is set
up not to break anything. You only get garbage output.

> 
> Of course, it would be easy to support "argN" again if the arch's calling
> convention is clearly stated. (And kprobes are now based on ftrace when
> placed at the entry of a function.)
> 
> So my question is: what is the difference between the two from the user's
> perspective? Is only the event syntax a problem?
> I'm very confused...

Again, this is not a kprobe replacement. It is somewhat of a syntax
issue. I want this to be very simple and not need a disassembler. For
indexing of structures one may use gdb, but that's about it. You could
get the same info from counting what's in a structure itself.
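One way to do that counting, sketched below with Python's ctypes. The structures here are simplified, hypothetical stand-ins, not the real `struct file` / `struct file_operations` (whose layout depends on kernel version and config); the point is only that a byte offset falls out of listing the members in order, the same numbers gdb would report.

```python
# Illustrative only: simplified stand-ins for kernel structures, showing
# how a byte offset is derived by "counting what's in a structure".
# Real struct file layout varies by kernel version and configuration.
import ctypes

class file_operations(ctypes.Structure):      # hypothetical stand-in
    _fields_ = [
        ("owner", ctypes.c_uint64),    # offset 0  (pointer-sized on 64-bit)
        ("llseek", ctypes.c_uint64),   # offset 8
        ("read", ctypes.c_uint64),     # offset 16 <- target member
    ]

class file(ctypes.Structure):                 # hypothetical stand-in
    _fields_ = [
        ("f_u", ctypes.c_uint64 * 2),     # 16 bytes
        ("f_path", ctypes.c_uint64 * 2),  # 16 bytes
        ("f_inode", ctypes.c_uint64),     # 8 bytes
        ("f_op", ctypes.c_uint64),        # offset 40 <- pointer to the ops
    ]

print(file.f_op.offset)              # 40
print(file_operations.read.offset)   # 16
```

With gdb and debuginfo, `ptype /o struct file` reports the same offsets directly.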

I based some of the code from kprobes too. But I wanted this to be
simpler, and as such, not as powerful as kprobes. More of a "poor man's"
kprobe ;-) where you are limited to functions and their arguments. If
you need more power, switch to kprobes. In other words, it's just an
added stepping stone.

Also, this should work without kprobe support, only ftrace, and function
args from the arch.
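To illustrate what "function args from the arch" means (our sketch, assuming x86-64; the function name here is ours): at function entry the SysV ABI passes the first six integer/pointer arguments in fixed registers, which is what lets a function-entry probe fetch "argN" with no disassembler.

```python
# Sketch of the arch-specific piece: on x86-64 (SysV ABI) the first six
# integer/pointer arguments arrive in fixed registers at function entry.
# Register order per the SysV psABI; arg_to_reg() is our own helper.

X86_64_ARG_REGS = ["di", "si", "dx", "cx", "r8", "r9"]

def arg_to_reg(n):
    """Map a 1-based argument number to its entry register, or None if
    the argument is passed on the stack instead."""
    if 1 <= n <= len(X86_64_ARG_REGS):
        return X86_64_ARG_REGS[n - 1]
    return None

# Reproduces the register choices in the kprobe-event definition quoted
# earlier in the thread:
#   p SyS_openat dfd=%di:x64 path=%si:string flags=%dx:x32 mode=%cx:x16
for name, n in [("dfd", 1), ("path", 2), ("flags", 3), ("mode", 4)]:
    print(f"{name}=%{arg_to_reg(n)}")
```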

-- Steve



Re: [PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-03 Thread Masami Hiramatsu
On Fri, 02 Feb 2018 18:04:58 -0500
Steven Rostedt  wrote:

> 
> At Kernel Summit back in October, we tried to bring up trace markers, which
> would be nops within the kernel proper, that would allow modules to hook
> arbitrary trace events to them. The reaction to this proposal was less than
> favorable. We were told that we were trying to make a workaround for a
> problem, and not solving it. The problem in our minds is the notion of a
> "stable trace event".
> 
> There are maintainers that do not want trace events, or more trace events in
> their subsystems. This is due to the fact that trace events post an
> interface to user space, and this interface could become required by some
> tool. This may cause the trace event to become stable where it must not
> break the tool, and thus prevent the code from changing.
> 
> Or, the trace event may just have to add padding for fields that tools
> may require. The "success" field of the sched_wakeup trace event is one such
> instance. There is no more "success" variable, but tools may fail if it were
> to go away, so a "1" is simply added to the trace event wasting ring buffer
> real estate.
> 
> I talked with Linus about this, and he told me that we already have these
> markers in the kernel. They are from the mcount/__fentry__ used by function
> tracing. Have the trace events be created by these, and see if this will
> satisfy most areas that want trace events.
> 
> I decided to implement this idea, and here's the patch set.
> 
> Introducing "function based events". These are created dynamically by a
> tracefs file called "function_events". By writing a pseudo prototype into
> this file, you create an event.

This seems to be a very similar feature to what kprobe-based events do.

An earlier version of kprobe-based events supported Nth-argument access, but I
decided to drop it because we cannot ensure the "function signature" from the
kernel binary. That role was passed on to "perf probe", so that we can define
line-level granularity.

Of course, it would be easy to support "argN" again if the arch's calling
convention is clearly stated. (And kprobes are now based on ftrace when
placed at the entry of a function.)

So my question is: what is the difference between the two from the user's
perspective? Is only the event syntax a problem?
I'm very confused...

Thank you,

> 
>  # mount -t tracefs nodev /sys/kernel/tracing
>  # cd /sys/kernel/tracing
>  # echo 'do_IRQ(symbol ip[16] | x64[6] irq_stack[16])' > function_events
>  # cat events/functions/do_IRQ/format
> name: do_IRQ
> ID: 1399
> format:
>   field:unsigned short common_type;   offset:0;   size:2; 
> signed:0;
>   field:unsigned char common_flags;   offset:2;   size:1; 
> signed:0;
>   field:unsigned char common_preempt_count;   offset:3;   size:1; 
> signed:0;
>   field:int common_pid;   offset:4;   size:4; signed:1;
> 
>   field:unsigned long __parent_ip;offset:8;   size:8; 
> signed:0;
>   field:unsigned long __ip;   offset:16;  size:8; signed:0;
>   field:symbol ip;offset:24;  size:8; signed:0;
>   field:x64 irq_stack[6]; offset:32;  size:48;signed:0;
> 
> print fmt: "%pS->%pS(ip=%pS, irq_stack=%llx:%llx:%llx:%llx:%llx:%llx)", 
> REC->__ip, REC->__parent_ip,
> REC->ip, REC->irq_stack[0], REC->irq_stack[1], REC->irq_stack[2], 
> REC->irq_stack[3], REC->irq_stack[4],
> REC->irq_stack[5]
> 
>  # echo 1 > events/functions/do_IRQ/enable
>  # cat trace
>   -0 [003] d..3  3647.049344: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.049433: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.049672: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.325709: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.325929: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.325993: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.387571: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.387791: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
> irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
>   -0 [003] d..3  3647.387874: 
> ret_from_intr->do_IRQ(ip=cpuidle_enter

[PATCH 00/18] [ANNOUNCE] Dynamically created function based events

2018-02-02 Thread Steven Rostedt

At Kernel Summit back in October, we tried to bring up trace markers, which
would be nops within the kernel proper, that would allow modules to hook
arbitrary trace events to them. The reaction to this proposal was less than
favorable. We were told that we were trying to make a workaround for a
problem, and not solving it. The problem in our minds is the notion of a
"stable trace event".

There are maintainers that do not want trace events, or more trace events in
their subsystems. This is due to the fact that trace events post an
interface to user space, and this interface could become required by some
tool. This may cause the trace event to become stable where it must not
break the tool, and thus prevent the code from changing.

Or, the trace event may just have to add padding for fields that tools
may require. The "success" field of the sched_wakeup trace event is one such
instance. There is no more "success" variable, but tools may fail if it were
to go away, so a "1" is simply added to the trace event wasting ring buffer
real estate.

I talked with Linus about this, and he told me that we already have these
markers in the kernel. They are from the mcount/__fentry__ used by function
tracing. Have the trace events be created by these, and see if this will
satisfy most areas that want trace events.

I decided to implement this idea, and here's the patch set.

Introducing "function based events". These are created dynamically by a
tracefs file called "function_events". By writing a pseudo prototype into
this file, you create an event.

 # mount -t tracefs nodev /sys/kernel/tracing
 # cd /sys/kernel/tracing
 # echo 'do_IRQ(symbol ip[16] | x64[6] irq_stack[16])' > function_events
 # cat events/functions/do_IRQ/format
name: do_IRQ
ID: 1399
format:
field:unsigned short common_type;   offset:0;   size:2; 
signed:0;
field:unsigned char common_flags;   offset:2;   size:1; 
signed:0;
field:unsigned char common_preempt_count;   offset:3;   size:1; 
signed:0;
field:int common_pid;   offset:4;   size:4; signed:1;

field:unsigned long __parent_ip;offset:8;   size:8; 
signed:0;
field:unsigned long __ip;   offset:16;  size:8; signed:0;
field:symbol ip;offset:24;  size:8; signed:0;
field:x64 irq_stack[6]; offset:32;  size:48;signed:0;

print fmt: "%pS->%pS(ip=%pS, irq_stack=%llx:%llx:%llx:%llx:%llx:%llx)", 
REC->__ip, REC->__parent_ip,
REC->ip, REC->irq_stack[0], REC->irq_stack[1], REC->irq_stack[2], 
REC->irq_stack[3], REC->irq_stack[4],
REC->irq_stack[5]

 # echo 1 > events/functions/do_IRQ/enable
 # cat trace
  -0 [003] d..3  3647.049344: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.049433: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.049672: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.325709: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.325929: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.325993: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.387571: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.387791: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
  -0 [003] d..3  3647.387874: 
ret_from_intr->do_IRQ(ip=cpuidle_enter_state+0xb1/0x330, 
irq_stack=81665db1,10,246,c96c3e80,18,88011eae9b40)
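As a sanity sketch of our own (not part of the patch set), the format file above fully determines the binary record layout; mirroring it as a ctypes structure, with the usual 64-bit type sizes assumed, reproduces the offsets exactly:

```python
# Sketch: the record layout described by the do_IRQ format file above,
# reproduced as a ctypes structure.  Every member should land exactly at
# the offset the format file declares (64-bit type sizes assumed).
import ctypes

class do_irq_record(ctypes.Structure):
    _fields_ = [
        ("common_type", ctypes.c_uint16),           # offset:0  size:2
        ("common_flags", ctypes.c_uint8),           # offset:2  size:1
        ("common_preempt_count", ctypes.c_uint8),   # offset:3  size:1
        ("common_pid", ctypes.c_int32),             # offset:4  size:4
        ("__parent_ip", ctypes.c_uint64),           # offset:8  size:8
        ("__ip", ctypes.c_uint64),                  # offset:16 size:8
        ("ip", ctypes.c_uint64),                    # offset:24 size:8 (symbol)
        ("irq_stack", ctypes.c_uint64 * 6),         # offset:32 size:48
    ]

for name, _ in do_irq_record._fields_:
    print(name, getattr(do_irq_record, name).offset)
```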

And this is much more powerful than just that. We can show strings, and
index off of structures into other structures.

  # echo '__vfs_read(symbol read+40[0]+16)' > function_events

  # echo 1 > events/functions/__vfs_read/enable
  # cat trace
 sshd-1343  [005] ...2   199.734752: 
vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
 bash-1344  [003] ...2   199.734822: 
vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
 sshd-1343  [005] ...2   199.734835: 
vfs_read->__vfs_read(read=tty_read+0x0/0xf0)
 avahi-daemon-910   [003] ...2   200.136740: vfs_read->__vfs_read(read= 
 (null))
 avahi-daemon-910   [003] ...2   200.136750: vfs_read->__vfs_read(read= 
 (null))
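Our reading of the `read+40[0]+16` expression above (the concrete offsets come from the example and are not guaranteed for any given kernel build): take the argument, add 40 bytes, dereference (`[0]`), add 16, and read the resulting slot. A sketch emulating that chain on a made-up little-endian memory image:

```python
# Sketch of how a "read+40[0]+16" style expression resolves: add 40 to
# the argument, dereference, add 16, read the final slot.  The memory
# image and the target value are fabricated for the demo.
import struct

def peek_u64(mem, addr):
    """Read a little-endian u64 from the fake memory image."""
    return struct.unpack_from("<Q", mem, addr)[0]

def resolve(mem, arg):
    ptr = peek_u64(mem, arg + 40)   # "struct file" + 40 -> ops pointer
    return peek_u64(mem, ptr + 16)  # ops struct + 16   -> read pointer

# 256-byte fake image: a "struct file" at 0 whose pointer member at
# offset 40 points at an ops struct at 128, whose member at offset 16
# holds the function address the trace output would symbolize.
mem = bytearray(256)
struct.pack_into("<Q", mem, 0 + 40, 128)           # file->f_op = 128
struct.pack_into("<Q", mem, 128 + 16, 0xdeadbeef)  # f_op->read

print(hex(resolve(mem, 0)))   # 0xdeadbeef
```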

And even read user space:

  # echo 'SyS_openat(int dfd, str