Re: [PATCH 2/3] x86/xen: move paravirt lazy code

2023-09-13 Thread Steven Rostedt
On Wed, 13 Sep 2023 13:38:27 +0200
Juergen Gross  wrote:

> diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
> index 44a3f565264d..0577f0cdd231 100644
> --- a/include/trace/events/xen.h
> +++ b/include/trace/events/xen.h
> @@ -6,26 +6,26 @@
>  #define _TRACE_XEN_H
>  
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  struct multicall_entry;
>  
>  /* Multicalls */
>  DECLARE_EVENT_CLASS(xen_mc__batch,
> - TP_PROTO(enum paravirt_lazy_mode mode),
> + TP_PROTO(enum xen_lazy_mode mode),
>   TP_ARGS(mode),
>   TP_STRUCT__entry(
> - __field(enum paravirt_lazy_mode, mode)
> + __field(enum xen_lazy_mode, mode)
>   ),
>   TP_fast_assign(__entry->mode = mode),
>   TP_printk("start batch LAZY_%s",
> -   (__entry->mode == PARAVIRT_LAZY_MMU) ? "MMU" :
> -   (__entry->mode == PARAVIRT_LAZY_CPU) ? "CPU" : "NONE")
> +   (__entry->mode == XEN_LAZY_MMU) ? "MMU" :
> +   (__entry->mode == XEN_LAZY_CPU) ? "CPU" : "NONE")

There are helper functions that make the above easier to implement, and they
also export the enum symbols so that user space can parse this better:

TRACE_DEFINE_ENUM(XEN_LAZY_NONE);
TRACE_DEFINE_ENUM(XEN_LAZY_MMU);
TRACE_DEFINE_ENUM(XEN_LAZY_CPU);

[..]

TP_printk("start batch LAZY_%s",
  __print_symbolic(mode,
   { XEN_LAZY_NONE, "NONE" },
   { XEN_LAZY_MMU,  "MMU"  },
   { XEN_LAZY_CPU,  "CPU"  }))

Then user space parsers that read the raw data can convert these events
into something humans can read.

-- Steve

>   );
>  #define DEFINE_XEN_MC_BATCH(name)\
>   DEFINE_EVENT(xen_mc__batch, name,   \
> - TP_PROTO(enum paravirt_lazy_mode mode), \
> + TP_PROTO(enum xen_lazy_mode mode),  \
>TP_ARGS(mode))
>  
>  DEFINE_XEN_MC_BATCH(xen_mc_batch);



Re: [PATCH 01/20] x86: move prepare_ftrace_return prototype to header

2023-05-16 Thread Steven Rostedt
On Tue, 16 May 2023 21:35:30 +0200
Arnd Bergmann  wrote:

> From: Arnd Bergmann 
> 
> On 32-bit builds, the prepare_ftrace_return() function only has a global
> definition, but no prototype before it, which causes a warning:
> 
> arch/x86/kernel/ftrace.c:625:6: warning: no previous prototype for 
> ‘prepare_ftrace_return’ [-Wmissing-prototypes]
>   625 | void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
> 
> Move the prototype that is already needed for some configurations into
> a header file where it can be seen unconditionally.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/x86/include/asm/ftrace.h | 3 +++
>  arch/x86/kernel/ftrace.c  | 3 ---
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index 5061ac98ffa1..b8d4a07f9595 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h

Acked-by: Steven Rostedt (Google) 

-- Steve



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-07 Thread Steven Rostedt
On Wed, 7 Sep 2022 09:04:28 -0400
Kent Overstreet  wrote:

> On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > Hmm, it seems that further discussion doesn't really make much sense
> > here. I know how to use my time better.  
> 
> Just a thought, but I generally find it more productive to propose ideas than 
> to
> just be disparaging.
> 

But it's not Michal's job to do so. He's just telling you that the given
feature is not worth the burden, and what issues he has with the patch set.
It's the submitter's job to address those concerns, not the maintainer's job
to tell the submitter how to make it better.

When Linus tells us that a submission is crap, we don't ask him how to make
it less crap, we listen to why he called it crap, and then rewrite it to be
not so crappy. If we cannot figure it out, it doesn't get in.

-- Steve



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-05 Thread Steven Rostedt
On Mon, 5 Sep 2022 16:42:29 -0400
Kent Overstreet  wrote:

> > Haven't tried that yet but will do. Thanks for the reference code!  
> 
> Is it really worth the effort of benchmarking tracing API overhead here?
> 
> The main cost of a tracing based approach is going to be the data structure
> for remembering outstanding allocations so that free events can be matched to
> the appropriate callsite. Regardless of whether it's done with BPF or by
> attaching to the tracepoints directly, that's going to be the main overhead.

The point I was making here is that you do not need your own hooking
mechanism. You can get the information directly by attaching to the
tracepoint.

> > static void my_callback(void *data, unsigned long call_site,
> > const void *ptr, struct kmem_cache *s,
> > size_t bytes_req, size_t bytes_alloc,
> > gfp_t gfp_flags)
> > {
> > struct my_data_struct *my_data = data;
> >
> > { do whatever }
> > }

The "do whatever" is anything you want to do.

Or would the data structure you create with this approach add too much
overhead? How expensive would a hash or binary-search lookup be?
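Concretely, the bookkeeping Kent describes is essentially a pointer-keyed hash
table updated from the alloc/free callbacks. A minimal userspace sketch — the
table size, hash constant, and function names are all illustrative, not from
any kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024  /* illustrative; a real tracer would size this to the workload */

struct alloc_entry {
    const void    *ptr;        /* allocated pointer, used as the hash key */
    unsigned long  call_site;  /* callsite that allocated it */
    size_t         bytes;
};

static struct alloc_entry table[TABLE_SIZE];

static size_t hash_ptr(const void *ptr)
{
    /* Cheap multiplicative hash on the pointer bits. */
    return (size_t)(((uintptr_t)ptr * 0x9E3779B97F4A7C15ULL) % TABLE_SIZE);
}

/* Would be called from the kmem alloc tracepoint: remember the allocation. */
static void on_alloc(const void *ptr, unsigned long call_site, size_t bytes)
{
    size_t i = hash_ptr(ptr);

    while (table[i].ptr)               /* linear probing for a free slot */
        i = (i + 1) % TABLE_SIZE;
    table[i].ptr       = ptr;
    table[i].call_site = call_site;
    table[i].bytes     = bytes;
}

/* Would be called from the kfree tracepoint: match the free to its callsite,
 * or return 0 if no alloc event was seen for this pointer. */
static unsigned long on_free(const void *ptr)
{
    size_t i = hash_ptr(ptr);

    while (table[i].ptr) {
        if (table[i].ptr == ptr) {
            unsigned long site = table[i].call_site;

            table[i].ptr = 0;  /* NB: clearing breaks probe chains; fine for a sketch */
            return site;
        }
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;
}
```

Whether maintaining such a table per event dominates the cost is exactly the
question above; a probe like this stays O(1) until the table gets crowded.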


-- Steve



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-05 Thread Steven Rostedt
On Mon, 5 Sep 2022 11:44:55 -0700
Nadav Amit  wrote:

> I would note that I have a solution in the making (which pretty much works)
> for this matter, and does not require any kernel changes. It produces a
> call stack that leads to the code that led to the syscall failure.
> 
> The way it works is by using seccomp to trap syscall failures, and then
> setting ftrace function filters and kprobes on conditional branches,
> indirect branch targets and function returns.

Ooh nifty!

> 
> Using symbolic execution, backtracking is performed and the condition that
> led to the failure is then pinpointed.
> 
> I hope to share the code soon.

Looking forward to it.

-- Steve



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-05 Thread Steven Rostedt
On Sun, 4 Sep 2022 18:32:58 -0700
Suren Baghdasaryan  wrote:

> Page allocations (overheads are compared to get_free_pages() duration):
> 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> 8.8% lookup_page_ext
> 1237% call stack capture
> 139% tracepoint with attached empty BPF program

Have you tried tracepoint with custom callback?

static void my_callback(void *data, unsigned long call_site,
const void *ptr, struct kmem_cache *s,
size_t bytes_req, size_t bytes_alloc,
gfp_t gfp_flags)
{
struct my_data_struct *my_data = data;

{ do whatever }
}

[..]
register_trace_kmem_alloc(my_callback, my_data);

Now the my_callback function will be called directly every time the
kmem_alloc tracepoint is hit.

This avoids that perf and BPF overhead.

-- Steve



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 21:35:32 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 08:23:11PM -0400, Steven Rostedt wrote:
> > If ftrace, perf, bpf can't do what you want, take a harder look to see if
> > you can modify them to do so.  
> 
> Maybe we can use this exchange to make both of our tools better. I like your
> histograms - the quantiles algorithm I've had for years is janky, I've been
> meaning to rip that out, I'd love to take a look at your code for that. And
> having an on/off switch is a good idea, I'll try to add that at some point.
> Maybe you got some ideas from my stuff too.
> 
> I'd love to get better tracepoints for measuring latency - what I added to
> init_wait() and finish_wait() was really only a starting point. Figuring out
> the right places to measure is where I'd like to be investing my time in this
> area, and there's no reason we couldn't both be making use of that.

Yes, this is exactly what I'm talking about. I'm not against your work, I
just want you to work more with everyone to come up with ideas that can
help everyone as a whole. That's how "open source communities" are supposed
to work ;-)

The histogram and synthetic events could use some more cleanups. There are a
lot of places that can be improved in that code. But I feel the ideas behind
that code are sound. It's just a matter of making the implementation a bit
more efficient.

> 
> e.g. with kernel waitqueues, I looked at hooking prepare_to_wait() first but 
> not
> all code uses that, init_wait() got me better coverage. But I've already seen
> that that misses things, too, there's more work to be done.

I picked prepare_to_wait() just because I was hacking up something quick
and thought that was "close enough" ;-)

-- Steve




Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 18:55:15 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 06:34:30PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Sep 2022 17:54:38 -0400
> > Kent Overstreet  wrote:  
> > > 
> > > So this looks like it's gotten better since I last looked, but it's still 
> > > not
> > > there yet.
> > > 
> > > Part of the problem is that the tracepoints themselves are in the wrong 
> > > place:
> > > your end event is when a task is woken up, but that means spurious 
> > > wakeups will  
> > 
> > The end event is when a task is scheduled onto the CPU. The start event is
> > the first time it is woken up.  
> 
> Yeah, that's not what I want. You're just tracing latency due to having more
> processes runnable than CPUs.
> 
> I don't care about that for debugging, though! I specifically want latency at
> the wait_event() level, and related - every time a process blocked _on some
> condition_, until that condition became true. Not until some random, 
> potentially
> spurious wakeup.

Ideally we could pass the stack trace from one event to the next; that
wouldn't be too hard to implement, it just needs to be done.

But anyway:

 # echo 'p:wait prepare_to_wait_event' > /sys/kernel/tracing/kprobe_events

// created an event on prepare_to_wait_event as that's usually called just
// before wait event.

 # sqlhist -e -n wait_sched 'select start.common_pid as pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from wait as start join sched_switch as end on start.common_pid = end.prev_pid where end.prev_state & 3'

// Create a "wait_sched" event that traces the time between the
// prepare_to_wait_event call and the scheduler. Only trigger it if the
// schedule happens in the interruptible or uninterruptible states.

 # sqlhist -e -n wake_sched 'select start.pid,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta2 from wait_sched as start join sched_switch as end on start.pid = end.next_pid where start.delta < 50'

// Now attach the wait_event to the sched_switch where the task gets
// scheduled back in. But we are only going to care if the delta between
// the prepare_to_wait_event and the schedule is less than 50us. This is a
// hack to just care about where a prepare_to_wait_event was done just before
// scheduling out.

 # echo 'hist:keys=pid,delta2.buckets=10:sort=delta2' > /sys/kernel/tracing/events/synthetic/wake_sched/trigger

// Now we are going to look at the deltas that the task was sleeping for an
// event. But this just gives pids and deltas.

 # echo 'hist:keys=pid,stacktrace if delta < 50' >> /sys/kernel/tracing/events/synthetic/wait_sched/trigger

// And this is to get the backtraces of where the task was. This is because
// the stack trace is not available at the schedule in, because the
// sched_switch can only give the stack trace of when a task schedules out.
// Again, this is somewhat a hack.

 # cat /sys/kernel/tracing/events/synthetic/wake_sched/hist
# event histogram
#
# trigger info: hist:keys=pid,delta2.buckets=10:vals=hitcount:sort=delta2.buckets=10:size=2048 [active]
#

{ pid:   2114, delta2: ~ 10-19 } hitcount:  1
{ pid:   1389, delta2: ~ 160-169 } hitcount:  1
{ pid:   1389, delta2: ~ 660-669 } hitcount:  1
{ pid:   1389, delta2: ~ 1020-1029 } hitcount:  1
{ pid:   1189, delta2: ~ 500020-500029 } hitcount:  1
{ pid:   1189, delta2: ~ 500030-500039 } hitcount:  1
{ pid:   1195, delta2: ~ 500030-500039 } hitcount:  2
{ pid:   1189, delta2: ~ 500040-500049 } hitcount: 10
{ pid:   1193, delta2: ~ 500040-500049 } hitcount:  3
{ pid:   1197, delta2: ~ 500040-500049 } hitcount:  3
{ pid:   1195, delta2: ~ 500040-500049 } hitcount:  9
{ pid:   1190, delta2: ~ 500050-500059 } hitcount: 55
{ pid:   1197, delta2: ~ 500050-500059 } hitcount: 51
{ pid:   1191, delta2: ~ 500050-500059 } hitcount: 61
{ pid:   1198, delta2: ~ 500050-500059 } hitcount: 56
{ pid:   1195, delta2: ~ 500050-500059 } hitcount: 48
{ pid:   1192, delta2: ~ 500050-500059 } hitcount: 54
{ pid:   1194, delta2: ~ 500050-500059 } hitcount: 50
{ pid:   1196, delta2: ~ 500050-500059 } hitcount: 57
{ pid:   1189, delta2: ~ 500050-500059 } hitcount: 48
{ pid:   1193, delta2: ~ 500050-500059 } hitcount: 52
{ pid:   1194, delta2: ~ 500060-500069 } hitcount: 12
{ pid:   1191, delta2: ~ 500060-500069 } hitcount:  2
{ pid:   1190, delta2: ~ 500060-500069 } hitcount:  7
{ pid:   1198, delta2: ~ 500060-500069 } hitcount:  9
{ pid:   1193, delta2: ~ 500060-500069 } hitcount:  6
{ pid:   1196, delta2: ~ 5

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 17:54:38 -0400
Kent Overstreet  wrote:
> 
> So this looks like it's gotten better since I last looked, but it's still not
> there yet.
> 
> Part of the problem is that the tracepoints themselves are in the wrong place:
> your end event is when a task is woken up, but that means spurious wakeups 
> will

The end event is when a task is scheduled onto the CPU. The start event is
the first time it is woken up.

> cause one wait_event() call to be reported as multiple smaller waits, not one
> long wait - oops, now I can't actually find the thing that's causing my
> multi-second delay.
> 
> Also, in your example you don't have it broken out by callsite. That would be
> the first thing I'd need for any real world debugging.

OK, how about this (currently we can only have 3 keys, but you can create
multiple histograms on the same event).

 # echo 'hist:keys=comm,stacktrace,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

(notice the "stacktrace" in the keys)

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,stacktrace,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/2   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/5   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/1   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/7   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/0   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 start_kernel+0x595/0x5be
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: migration/4   , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 schedule_idle+0x26/0x40
 do_idle+0xb4/0xd0
 cpu_startup_entry+0x19/0x20
 secondary_startup_64_no_verify+0xc2/0xcb
, delta: ~ 10-19} hitcount:  7
{ comm: rtkit-daemon  , stacktrace:
 event_hist_trigger+0x290/0x2b0
 event_triggers_call+0x52/0xe0
 trace_event_buffer_commit+0x193/0x240
 trace_event_raw_event_sched_switch+0x120/0x180
 __traceiter_sched_switch+0x39/0x50
 __schedule+0x310/0x700
 preempt_schedule_common+0x2d/0x70
 preempt_schedule_thunk+0x16/0x18
 _raw_spin_unlock_irq+0x2e/0x40
 eventfd_write+0xc8/0x290
 vfs_write+0xc0/0x2a0
 ksys_write+0x5f/0xe0
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
, delta: ~ 

Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 17:38:44 -0400
Steven Rostedt  wrote:

>  # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above could almost be done with sqlhist (I haven't implemented
"buckets=10" yet because that's a new feature, so for now let's do log2):

 # sqlhist -e 'select comm,prio,cast(delta as log2) from wakeup_lat'

("-e" is to execute the command, as it normally only displays what commands
need to be run to create the synthetic events and histograms)
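The `log2` cast keeps only the position of a value's highest set bit, so each
histogram row covers a power-of-two range; the bucketing amounts to something
like this (a sketch, not the kernel's implementation):

```c
#include <assert.h>

/* log2 bucketing: 16..31 all land in the "~ 2^4" row, 32..63 in "~ 2^5". */
static unsigned int log2_bucket(unsigned long long val)
{
    unsigned int bucket = 0;

    while (val >>= 1)
        bucket++;
    return bucket;
}
```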

# cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.log2:vals=hitcount:sort=hitcount:size=2048 [active]
#

{ comm: migration/4   , prio: 0,  delta: ~ 2^5 } hitcount:  1
{ comm: migration/0   , prio: 0,  delta: ~ 2^4 } hitcount:  2
{ comm: rtkit-daemon  , prio: 0,  delta: ~ 2^7 } hitcount:  2
{ comm: rtkit-daemon  , prio: 0,  delta: ~ 2^6 } hitcount:  4
{ comm: migration/0   , prio: 0,  delta: ~ 2^5 } hitcount:  8
{ comm: migration/4   , prio: 0,  delta: ~ 2^4 } hitcount:  9
{ comm: migration/2   , prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/5   , prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/7   , prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/1   , prio: 0,  delta: ~ 2^4 } hitcount: 10
{ comm: migration/6   , prio: 0,  delta: ~ 2^4 } hitcount: 10

Totals:
Hits: 76
Entries: 11
Dropped: 0


-- Steve



Re: [RFC PATCH 27/30] Code tagging based latency tracking

2022-09-01 Thread Steven Rostedt
On Tue, 30 Aug 2022 14:49:16 -0700
Suren Baghdasaryan  wrote:

> From: Kent Overstreet 
> 
> This adds the ability to easily instrument code for measuring latency.
> To use, add the following to calls to your code, at the start and end of
> the event you wish to measure:
> 
>   code_tag_time_stats_start(start_time);
>   code_tag_time_stats_finish(start_time);

So you need to modify the code to see what you want?

> 
> Statistics will then show up in debugfs under
> /sys/kernel/debug/time_stats, listed by file and line number.
> 
> Statistics measured include weighted averages of frequency, duration, max
> duration, as well as quantiles.
> 
> This patch also instruments all calls to init_wait and finish_wait,
> which includes all calls to wait_event. Example debugfs output:
> 
> fs/xfs/xfs_trans_ail.c:746 module:xfs func:xfs_ail_push_all_sync
> count:  17
> rate:   0/sec
> frequency:  2 sec
> avg duration:   10 us
> max duration:   232 us
> quantiles (ns): 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
> 
> lib/sbitmap.c:813 module:sbitmap func:sbitmap_finish_wait
> count:  3
> rate:   0/sec
> frequency:  4 sec
> avg duration:   4 sec
> max duration:   4 sec
> quantiles (ns): 0 4288669120 4288669120 5360836048 5360836048 5360836048 
> 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 5360836048 
> 5360836048 5360836048
> 
> net/core/datagram.c:122 module:datagram func:__skb_wait_for_more_packets
> count:  10
> rate:   1/sec
> frequency:  859 ms
> avg duration:   472 ms
> max duration:   30 sec
> quantiles (ns): 0 12279 12279 15669 15669 15669 15669 17217 17217 17217 17217 
> 17217 17217 17217 17217

For function length you could just do something like this:

 # cd /sys/kernel/tracing
 # echo __skb_wait_for_more_packets > set_ftrace_filter
 # echo 1 > function_profile_enabled
 # cat trace_stat/function*
  Function                     Hit    Time         Avg          s^2
  --------                     ---    ----         ---          ---
  __skb_wait_for_more_packets    1    0.000 us     0.000 us     0.000 us
  Function                     Hit    Time         Avg          s^2
  --------                     ---    ----         ---          ---
  __skb_wait_for_more_packets    1    74.813 us    74.813 us    0.000 us
  Function                     Hit    Time         Avg          s^2
  --------                     ---    ----         ---          ---
  Function                     Hit    Time         Avg          s^2
  --------                     ---    ----         ---          ---

The above is for a 4 CPU machine. The s^2 is the square of the standard
deviation (keeping the square avoids having to do divisions while it runs).
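In code, that amounts to keeping running sums per function and only dividing
when the stats file is read; a userspace sketch with made-up names:

```c
#include <assert.h>

static unsigned long long hits, time_sum, time_sq_sum;

/* Per-hit update: additions and one multiply, no division. */
static void profile_hit(unsigned long long duration)
{
    hits++;
    time_sum    += duration;
    time_sq_sum += duration * duration;
}

/* s^2 = E[t^2] - E[t]^2, evaluated only when trace_stat is read. */
static unsigned long long profile_s2(void)
{
    unsigned long long avg;

    if (!hits)
        return 0;
    avg = time_sum / hits;
    return time_sq_sum / hits - avg * avg;
}
```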

But if you are looking for latency between two events (which can be kprobes
too, where you do not need to rebuild your kernel):

From: https://man.archlinux.org/man/sqlhist.1.en
which comes in: https://git.kernel.org/pub/scm/libs/libtrace/libtracefs.git/
  if not already installed on your distro.

 # sqlhist -e -n wakeup_lat 'select end.next_comm as comm,start.pid,start.prio,(end.TIMESTAMP_USECS - start.TIMESTAMP_USECS) as delta from sched_waking as start join sched_switch as end on start.pid = end.next_pid where start.prio < 100'

The above creates a synthetic event called "wakeup_lat" that joins two
events (sched_waking and sched_switch) when the pid field of sched_waking
matches the next_pid field of sched_switch. When there is a match, it will
trigger the wakeup_lat event only if the prio of the sched_waking event is
less than 100 (which in the kernel means any real-time task). The
wakeup_lat event will record the next_comm (as the comm field), the pid of
the woken task, and the time delta in microseconds between the two events.
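In other words, the synthetic event is a keyed join with a filter. The
matching it performs is conceptually something like this (simplified to one
pending wakeup per pid; the sizes and names are made up for the sketch):

```c
#include <assert.h>

#define MAX_PID 32768  /* illustrative */

/* Timestamp of the last sched_waking per pid; 0 means no pending wakeup. */
static unsigned long long waking_ts[MAX_PID];
static int waking_prio[MAX_PID];

static void on_sched_waking(int pid, int prio, unsigned long long ts_us)
{
    waking_ts[pid] = ts_us;
    waking_prio[pid] = prio;
}

/* On sched_switch, fire the synthetic event if the incoming task had a
 * pending wakeup and passes the "prio < 100" (real-time) filter.
 * Returns the delta in microseconds, or -1 if no event fires. */
static long long on_sched_switch(int next_pid, unsigned long long ts_us)
{
    long long delta;

    if (!waking_ts[next_pid])
        return -1;
    delta = (long long)(ts_us - waking_ts[next_pid]);
    waking_ts[next_pid] = 0;
    if (waking_prio[next_pid] >= 100)
        return -1;                 /* filtered out: not a real-time task */
    return delta;
}
```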

 # echo 'hist:keys=comm,prio,delta.buckets=10:sort=delta' > /sys/kernel/tracing/events/synthetic/wakeup_lat/trigger

The above starts a histogram tracing the name of the woken task, the
priority and the delta (but placing the delta in buckets of size 10, as we
do not need to see every latency number).
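The `.buckets=10` modifier just truncates each delta to the low edge of its
bucket so the histogram keys collapse; the arithmetic amounts to:

```c
#include <assert.h>

/* Bucket a value by size: 0-9 -> 0, 10-19 -> 10, and so on, which is why
 * the histogram rows read "delta: ~ 10-19", "~ 500050-500059", etc. */
static unsigned long long bucket_low(unsigned long long val,
                                     unsigned long long size)
{
    return val - (val % size);
}
```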

 # chrt -f 20 sleep 1

Force something to be woken up that is interesting.

 # cat /sys/kernel/tracing/events/synthetic/wakeup_lat/hist
# event histogram
#
# trigger info: hist:keys=comm,prio,delta.buckets=10:vals=hitcount:sort=delta.buckets=10:size=2048 [active]
#

{ comm: migration/5   , prio: 0,  delta: ~ 10-19 } hitcount:  1
{ comm: migration/2   , prio: 0,  delta: ~ 10-19 } hitcount:  1
{ comm: sleep         , prio: 79, delta: ~ 10-19 } hitcount:  1
{ comm: migration/7   , prio: 0,  delta: ~ 10-19 } hitcount:  

Re: [RFC PATCH 03/30] Lazy percpu counters

2022-09-01 Thread Steven Rostedt
On Thu, 1 Sep 2022 10:32:19 -0400
Kent Overstreet  wrote:

> On Thu, Sep 01, 2022 at 08:51:31AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:52PM -0700, Suren Baghdasaryan wrote:  
> > > +static void lazy_percpu_counter_switch_to_pcpu(struct 
> > > raw_lazy_percpu_counter *c)
> > > +{
> > > + u64 __percpu *pcpu_v = alloc_percpu_gfp(u64, GFP_ATOMIC|__GFP_NOWARN);  
> > 
> > Realize that this is incorrect when used under a raw_spinlock_t.  
> 
> Can you elaborate?

All allocations (including GFP_ATOMIC) grab normal spin_locks. When
PREEMPT_RT is configured, normal spin_locks turn into mutexes, whereas
raw_spinlocks do not.

Thus, if this is done within a raw_spinlock with PREEMPT_RT configured, it
can cause a schedule while holding a spinlock.

-- Steve



Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-14 Thread Steven Rostedt
On Thu, 9 Jun 2022 15:02:20 +0200
Petr Mladek  wrote:

> > I'm somewhat curious whether we can actually remove that trace event.  
> 
> Good question.
> 
> Well, I think that it might be useful. It allows to see trace and
> printk messages together.

Yes, people still use it. I was just asked about it at Kernel Recipes. That
is, someone wanted printk mixed in with the tracing, and I told them about
this event (which they didn't know about, but were happy to hear that it
existed).

-- Steve



Re: [PATCH 23/30] printk: kmsg_dump: Introduce helper to inform number of dumpers

2022-05-10 Thread Steven Rostedt
On Wed, 27 Apr 2022 19:49:17 -0300
"Guilherme G. Piccoli"  wrote:

> Currently we don't have a way to check if there are dumpers set,
> except counting the list members maybe. This patch introduces a very
> simple helper to provide this information, by just keeping track of
> registered/unregistered kmsg dumpers. It's going to be used on the
> panic path in the subsequent patch.

FYI, it is considered "bad form" to reference in the change log "this
patch". We know this is a patch. The change log should just talk about what
is being done. So can you reword your change logs (you do this in almost
every patch). Here's what I would reword the above to be:

 Currently we don't have a way to check if there are dumpers set, except
 perhaps by counting the list members. Introduce a very simple helper to
 provide this information, by just keeping track of registered/unregistered
 kmsg dumpers. This will simplify the refactoring of the panic path.


-- Steve



Re: [PATCH 18/30] notifier: Show function names on notifier routines if DEBUG_NOTIFIERS is set

2022-05-10 Thread Steven Rostedt
On Thu, 28 Apr 2022 09:01:13 +0800
Xiaoming Ni  wrote:

> > +#ifdef CONFIG_DEBUG_NOTIFIERS
> > +   {
> > +   char sym_name[KSYM_NAME_LEN];
> > +
> > +   pr_info("notifiers: registered %s()\n",
> > +   notifier_name(n, sym_name));
> > +   }  
> 
> Duplicate Code.
> 
> Is it better to use __func__ and %pS?
> 
> pr_info("%s: %pS\n", __func__, n->notifier_call);
> 
> 
> > +#endif

Also, don't sprinkle #ifdef in C code. Instead:

if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS))
	pr_info("notifiers: registered %ps()\n",
		n->notifier_call);


Or define a print macro at the start of the C file that is a nop if it is
not defined, and use the macro.
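A userspace analogue of both suggestions, with the config switch and counter
names made up for the demo. The point of the IS_ENABLED()-style `if` over an
#ifdef is that the compiler still sees and type-checks the call when the
switch is 0, then dead-code-eliminates it:

```c
#include <assert.h>
#include <stdio.h>

#ifndef DEBUG_NOTIFIERS
#define DEBUG_NOTIFIERS 0      /* flip to 1 to enable the debug prints */
#endif

static int debug_prints;       /* observable side effect for the demo */

/* Stand-in for a pr_info()-based macro at the top of the C file; when the
 * switch is off this compiles to nothing, but the arguments are still
 * checked. */
#define notifier_dbg(fmt, ...)                          \
    do {                                                \
        if (DEBUG_NOTIFIERS) {                          \
            printf(fmt, ##__VA_ARGS__);                 \
            debug_prints++;                             \
        }                                               \
    } while (0)
```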

-- Steve



Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path

2022-05-10 Thread Steven Rostedt
On Tue, 10 May 2022 13:38:39 +0200
Petr Mladek  wrote:

> As already mentioned in the other reply, panic() sometimes stops
> the other CPUs using NMI, for example, see kdump_nmi_shootdown_cpus().
> 
> Another situation is when the CPU using the lock ends in some
> infinite loop because something went wrong. The system is in
> an unpredictable state during panic().
> 
> I am not sure if this is possible with the code under gsmi_dev.lock
> but such things really happen during panic() in other subsystems.
> Using trylock in the panic() code path is a good practice.

I believe that Peter Zijlstra had a special spin lock for NMIs or early
printk, where it would not block if the lock was held on the same CPU. That
is, if an NMI happened and panicked while this lock was held on the same
CPU, it would not deadlock. But it would block if the lock was held on
another CPU.

-- Steve



Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 10:46:35 -0300
"Guilherme G. Piccoli"  wrote:

> Thanks Sergei and Steven, good idea! I thought about the switch change
> you propose, but I confess I got a bit confused by the "fallthrough"
> keyword - do I need to use it?

No. The fallthrough keyword is only needed when there's code between case
labels, as it is very common to list multiple cases for the same code path.
That is:

case DIE_OOPS:
case PANIC_NOTIFIER:
do_dump = 1;
break;

Does not need a fallthrough annotation, as there's no code between the
DIE_OOPS and the PANIC_NOTIFIER cases. But if you had:

case DIE_OOPS:
x = true;
case PANIC_NOTIFIER:
do_dump = 1;
break;

Then you do.
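A compilable illustration of the second shape (the event values and flag are
stand-ins for DIE_OOPS/PANIC_NOTIFIER, and the comment marks where the
kernel's `fallthrough;` macro would go):

```c
#include <assert.h>
#include <stdbool.h>

enum { EV_OOPS = 1, EV_PANIC = 2 };   /* stand-ins for DIE_OOPS / PANIC_NOTIFIER */

static bool handle(int ev, int *do_dump)
{
    bool oops = false;

    switch (ev) {
    case EV_OOPS:
        oops = true;
        /* fallthrough */    /* in kernel code: the fallthrough; macro */
    case EV_PANIC:
        *do_dump = 1;        /* runs for both EV_OOPS and EV_PANIC */
        break;
    default:
        break;
    }
    return oops;
}
```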

-- Steve



Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 12:22:44 +0300
Sergei Shtylyov  wrote:

> > +   switch (ev) {
> > +   case DIE_OOPS:
> > +   do_dump = 1;
> > +   break;
> > +   case PANIC_NOTIFIER:
> > +   do_dump = 1;
> > +   break;  
> 
>Why not:
> 
>   case DIE_OOPS:
>   case PANIC_NOTIFIER:
>   do_dump = 1;
>       break;

Agreed.

Other than that.

Acked-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v0 00/42] notifiers: Return an error when callback is already registered

2021-11-08 Thread Steven Rostedt
On Mon, 8 Nov 2021 15:35:50 +0100
Borislav Petkov  wrote:

> On Mon, Nov 08, 2021 at 03:24:39PM +0100, Borislav Petkov wrote:
> > I guess I can add another indirection to notifier_chain_register() and
> > avoid touching all the call sites.  
> 
> IOW, something like this below.
> 
> This way I won't have to touch all the callsites and the registration
> routines would still return a proper value instead of returning 0
> unconditionally.

I prefer this method.

Question, how often does this warning trigger? Is it common to see in
development?

-- Steve



Re: [PATCH 1/4] x86/xen/entry: Rename xenpv_exc_nmi to noist_exc_nmi

2021-04-30 Thread Steven Rostedt
On Fri, 30 Apr 2021 09:15:51 +0200
Paolo Bonzini  wrote:

> > Nit, but in change logs, please avoid stating "next patch" as searching git
> > history (via git blame or whatever) there is no such thing as "next patch". 
> >  
> 
> Interesting, I use next patch(es) relatively often, though you're right 
> that something like "in preparation for" works just as well.  Yes, it's 
> the previous in "git log", but you get what it's meant in practice. :)

It's not always the previous in a git log. Git log sorts by time, and
if an unrelated commit was created in between those two patches, it
will be in between them.

-- Steve



Re: [PATCH 1/4] x86/xen/entry: Rename xenpv_exc_nmi to noist_exc_nmi

2021-04-28 Thread Steven Rostedt
On Tue, 27 Apr 2021 07:09:46 +0800
Lai Jiangshan  wrote:

> From: Lai Jiangshan 
> 
> There is no any functionality change intended.  Just rename it and
> move it to arch/x86/kernel/nmi.c so that we can resue it later in
> next patch for early NMI and kvm.

Nit, but in change logs, please avoid stating "next patch" as searching git
history (via git blame or whatever) there is no such thing as "next patch".

Just state: "so that we can reuse it for early NMI and KVM."

I also just noticed the typo in "resue". Or maybe both NMI and KVM should
be sued again ;-)

-- Steve



Re: A KernelShark plugin for Xen traces analysis

2021-04-15 Thread Steven Rostedt
On Thu, 15 Apr 2021 02:50:53 +0200
Dario Faggioli  wrote:

> On Wed, 2021-04-14 at 15:07 -0400, Steven Rostedt wrote:
> > On Wed, 14 Apr 2021 19:11:19 +0100
> > Andrew Cooper  wrote:
> >   
> > > Where the plugin (ought to) live depends heavily on whether we
> > > consider
> > > the trace format a stable ABI or not.  
> > 
> > Agreed. Like the VMware plugin to handle ESX traces. It's internal
> > and not
> > published as the API is not stable.
> >   
> Mmm... Does this imply that Linux's tracepoints should be considered a
> stable ABI then? :-D :-D :-D

Some already are. Like the sched_switch tracepoint. That's one of the
reasons Peter Zijlstra now hates exported tracepoints.

> 
> > But if it ever becomes stable, and you would like it to live with
> > KernelShark, we are looking to have a place to store third party
> > plugins.
> >   
> Sure. TBH, either Xen or KernelShark main or plugin repositories would
> be fine for me.
> 
> Which doesn't mean we should choose randomly, as clearly each solution
> has pros and cons that needs to be evaluated.
> 
> I'm just saying that we would prefer the plugin to end up in one of
> those places, rather than remaining its own project. And of course
> we're up for maintaining it, wherever it lands. :-)

Like I said, we can have a third party repository within the KernelShark
repo (or along side of it). As a claws-mail user, I like their method. They
have a bunch of plugins you can add that they have in their repo, but those
plugins are maintained by different people.

-- Steve


> 
> > We are working to make sure that the API for KernelShark plugins
> > remains
> > stable, so your plugins should always work too.
> >   
> Great!
> 
> Regards




Re: A KernelShark plugin for Xen traces analysis

2021-04-14 Thread Steven Rostedt
On Thu, 15 Apr 2021 00:11:32 +0200
Dario Faggioli  wrote:

 
> Yes, basically, we can say that a Xen system has "its own trace-cmd".
> It's called `xentrace`, you run it from Dom0 and you get a (binary)
> file which contains a bunch of events.
> 
> Not that differently from a trace-cmd's "trace.dat" file, but the
> events in there comes from tracepoints within the hypervisor (which, of
> course, use a different tracing mechanism than ftrace).

Right, that's exactly what the ESX trace did as well.

> > Perhaps we can update trace-cmd agent to work with
> > Xen as well. Does xen implement vsock or some other way to
> > communicate
> > between the guests and the Dom0 kernel? 
> >  
> Not vsock, AFAIK. But we probably can use something else/come up with
> something new.
> 

Yeah, we would like to have trace-cmd agent work with more than just vsock.
Heck, we could just use networking as well. It's just a bit more overhead.


> > And then on KernelShark,
> > we have a KVM plugin in development that does this. But you can do
> > the same
> > with Xen.
> >   
> I think that one of the trickiest aspects would be synchronizing the
> timestamps in the 3 traces.
> 
> *I guess* that the dom0 trace and the guest traces could at least use
> the PTP algorithm that is currently implemented in the trace-cmd
> patches (but not KVM specific one). For synch'ing the Xen trace with
> them, well, I don't really know... We'd have to think about it. :-P

Really, TSC is the way to go. All you would need to do is to have a way to
map all the TSCs together. Assuming the xen trace has an unmodified TSC, and
you can retrieve all the multipliers and shifts used for each guest, you
will then have a synchronized TSC. Then only one guest or the xen HV needs
to calculate the TSC to nanoseconds, and then have all use that. Probably
would need to be the xen HV as it would be the one without a modified TSC.
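The TSC mapping sketched above can be written down in a few lines of C. This is a minimal sketch, assuming a KVM/Xen-style fixed-point scale; the helper names and parameter values are hypothetical, and the real mult/shift/offset values would come from the hypervisor's per-vCPU time data.

```c
#include <stdint.h>

/*
 * Map a guest-visible TSC back to a common timebase:
 * (tsc * mult) >> shift, plus a per-guest offset.
 */
static inline uint64_t tsc_rescale(uint64_t tsc, uint32_t mult,
				   unsigned int shift, int64_t offset)
{
	return (uint64_t)(((unsigned __int128)tsc * mult) >> shift) +
	       (uint64_t)offset;
}

/*
 * Convert synchronized TSC cycles to nanoseconds, clocksource-style:
 * mult is chosen so that mult = (NSEC_PER_SEC << shift) / tsc_freq_hz.
 */
static inline uint64_t tsc_cycles_to_ns(uint64_t cycles, uint32_t mult,
					unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}
```

With a 1 GHz clock and shift = 10, mult works out to 1024 and the conversion is the identity; the same formula handles any frequency once mult/shift are known.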

-- Steve



Re: A KernelShark plugin for Xen traces analysis

2021-04-14 Thread Steven Rostedt
On Wed, 14 Apr 2021 19:11:19 +0100
Andrew Cooper  wrote:

> Where the plugin (ought to) live depends heavily on whether we consider
> the trace format a stable ABI or not.

Agreed. Like the VMware plugin to handle ESX traces. It's internal and not
published as the API is not stable.

But if it ever becomes stable, and you would like it to live with
KernelShark, we are looking to have a place to store third party plugins.

We are working to make sure that the API for KernelShark plugins remains
stable, so your plugins should always work too.

-- Steve



Re: A KernelShark plugin for Xen traces analysis

2021-04-14 Thread Steven Rostedt
On Wed, 14 Apr 2021 11:07:33 +0100
Andrew Cooper  wrote:

> On 13/04/2021 16:46, Steven Rostedt wrote:
> > Hi Giuseppe,
> >
> > On Tue, 13 Apr 2021 16:28:36 +0200
> > Giuseppe Eletto  wrote:
> >  
> >> Hello,
> >> I want to share with you a new plugin developed by me, under the
> >> supervision of Dario Faggioli, which allows the new version of KernelShark
> >> (the v2-beta) to open and view the Xen traces created using the "xentrace" 
> >> tool.
> >>
> >> In fact, KernelShark is a well-known tool for graphical visualization
> >> of Linux kernel traces, obtained via "ftrace" and "trace-cmd". Anyway,
> >> thanks to its modular architecture, it is now possible to implement
> >> plugins which open and display traces with an arbitrary format, for
> >> example, as in this case, traces of the Xen hypervisor.
> > I'm guessing you have trace events coming from Xen itself?
> >
> >  
> >> For more information on how to build the plugin and/or
> >> to view the source code I leave the repository below:
> >> https://github.com/giuseppe998e/kernelshark-xentrace-plugin
> >>
> >>
> >> In short:
> >>
> >> $ sudo apt install git build-essential libjson-c-dev
> >> $ git clone --recurse-submodules
> >> https://github.com/giuseppe998e/kernelshark-xentrace-plugin.git
> >> $ cd kernelshark-xentrace-plugin/
> >> $ make
> >>
> >> $ export XEN_CPUHZ=3G # Sets the CPU frequency ((G)hz/(M)hz/(K)hz/hz)
> >> $ kernelshark -p out/ks-xentrace.so trace.xen
> >>
> >>
> >> You will need the development version of KernelShark, available here:
> >> https://git.kernel.org/pub/scm/utils/trace-cmd/kernel-shark.git  
> > This will soon be the main repository, as we are going to deprecate the
> > version in the trace-cmd.git repo soon. And because libtracecmd 1.0 has
> > already been released.
> >
> >  
> >> A screenshot of the plugin in action is available here:
> >> https://github.com/giuseppe998e/kernelshark-xentrace-plugin/raw/master/.github/img/ks-xentrace.png
> >>
> >> I'm happy to receive whatever feedback you may have about it,
> >> and to answer any question.
> >>  
> > Thanks for doing this. What would be nice is to have the xen traces along
> > side the linux tracing. Perhaps we can update trace-cmd agent to work with
> > Xen as well. Does xen implement vsock or some other way to communicate
> > between the guests and the Dom0 kernel? If not, we should add one. Then you
> > could do the following:
> >
> >  1. On each guest, run as root: trace-cmd agent --xen
> >  2. On Dom0 run: trace-cmd record -e (events on Dom0) \
> >  --xen (commands to do tracing in Xen HV) \
> >  -A  -e (events on guest)
> >
> > And then you would get a trace.dat file for Dom0 and the guest, and also
> > have a trace file for Xen (however that is done). And then on KernelShark,
> > we have a KVM plugin in development that does this. But you can do the same
> > with Xen.
> >
> > For KVM, we have:
> >
> >  1. On each guest: trace-cmd agent
> >  2. On the host: trace-cmd record -e kvm -e sched -e irq \
> >   -A guest-name -e all
> > The above produces trace.dat for the host trace, and 
> >  trace-.dat for the guest.
> >  3. kernelshark trace.dat -a trace-Fedora21.dat
> >
> > (I have a guest called Fedora21).
> >
> >   http://rostedt.org/private/kernelshark-kvm.png
> >
> > Where you can see the kvm hypervisor task KVM-2356 is the host task running
> > the guest VCPU 0, and you see the switch between the two.
> >
> > Perhaps we can do something like that with Xen as well. The plugin is still
> > in the works, but should be published soon. And when it is, you could use
> > that as a template for Xen.  
> 
> A possibly tangential question.  Where does KernelShark's idea of CPUs
> (i.e. real logical threads) come from?

It comes from the plugin that parses the file.

KernelShark can take multiple inputs. We have a plugin internally at VMware
that can read trace data from ESX and produces CPUs and Tasks on top of
them. In ESX tasks are actually called "Worlds".

> 
> In a Xen system, dom0 is just a VM, and particularly on larger servers,
> may not be as many vcpus as the system has logical threads.

Yes, I have some idea of how Xen works. When I was hired at Red Hat in
2006, my first job was to work on Xen. I even ported "logdev" (the
predecessor to ftrace) to the Xen hypervisor.


Re: A KernelShark plugin for Xen traces analysis

2021-04-13 Thread Steven Rostedt


Hi Giuseppe,

On Tue, 13 Apr 2021 16:28:36 +0200
Giuseppe Eletto  wrote:

> Hello,
> I want to share with you a new plugin developed by me, under the
> supervision of Dario Faggioli, which allows the new version of KernelShark
> (the v2-beta) to open and view the Xen traces created using the "xentrace" 
> tool.
> 
> In fact, KernelShark is a well-known tool for graphical visualization
> of Linux kernel traces, obtained via "ftrace" and "trace-cmd". Anyway,
> thanks to its modular architecture, it is now possible to implement
> plugins which open and display traces with an arbitrary format, for
> example, as in this case, traces of the Xen hypervisor.

I'm guessing you have trace events coming from Xen itself?


> 
> For more information on how to build the plugin and/or
> to view the source code I leave the repository below:
> https://github.com/giuseppe998e/kernelshark-xentrace-plugin
> 
> 
> In short:
> 
> $ sudo apt install git build-essential libjson-c-dev
> $ git clone --recurse-submodules
> https://github.com/giuseppe998e/kernelshark-xentrace-plugin.git
> $ cd kernelshark-xentrace-plugin/
> $ make
> 
> $ export XEN_CPUHZ=3G # Sets the CPU frequency ((G)hz/(M)hz/(K)hz/hz)
> $ kernelshark -p out/ks-xentrace.so trace.xen
> 
> 
> You will need the development version of KernelShark, available here:
> https://git.kernel.org/pub/scm/utils/trace-cmd/kernel-shark.git

This will soon be the main repository, as we are going to deprecate the
version in the trace-cmd.git repo soon. And because libtracecmd 1.0 has
already been released.


> 
> A screenshot of the plugin in action is available here:
> https://github.com/giuseppe998e/kernelshark-xentrace-plugin/raw/master/.github/img/ks-xentrace.png
> 
> I'm happy to receive whatever feedback you may have about it,
> and to answer any question.
> 

Thanks for doing this. What would be nice is to have the xen traces along
side the linux tracing. Perhaps we can update trace-cmd agent to work with
Xen as well. Does xen implement vsock or some other way to communicate
between the guests and the Dom0 kernel? If not, we should add one. Then you
could do the following:

 1. On each guest, run as root: trace-cmd agent --xen
 2. On Dom0 run: trace-cmd record -e (events on Dom0) \
 --xen (commands to do tracing in Xen HV) \
 -A  -e (events on guest)

And then you would get a trace.dat file for Dom0 and the guest, and also
have a trace file for Xen (however that is done). And then on KernelShark,
we have a KVM plugin in development that does this. But you can do the same
with Xen.

For KVM, we have:

 1. On each guest: trace-cmd agent
 2. On the host: trace-cmd record -e kvm -e sched -e irq \
  -A guest-name -e all
The above produces trace.dat for the host trace, and 
 trace-.dat for the guest.
 3. kernelshark trace.dat -a trace-Fedora21.dat

(I have a guest called Fedora21).

  http://rostedt.org/private/kernelshark-kvm.png

Where you can see the kvm hypervisor task KVM-2356 is the host task running
the guest VCPU 0, and you see the switch between the two.

Perhaps we can do something like that with Xen as well. The plugin is still
in the works, but should be published soon. And when it is, you could use
that as a template for Xen.

-- Steve



Re: [Xen-devel] [PATCH v3 21/27] x86/ftrace: Adapt function tracing for PIE support

2018-05-24 Thread Steven Rostedt
On Thu, 24 May 2018 13:40:24 +0200
Petr Mladek  wrote:

> On Wed 2018-05-23 12:54:15, Thomas Garnier wrote:
> > When using -fPIE/PIC with function tracing, the compiler generates a
> > call through the GOT (call *__fentry__@GOTPCREL). This instruction
> > takes 6 bytes instead of 5 on the usual relative call.
> > 
> > If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop
> > so ftrace can handle the previous 5-bytes as before.
> > 
> > Position Independent Executable (PIE) support will allow to extended the
> > KASLR randomization range below the -2G memory limit.
> > 
> > Signed-off-by: Thomas Garnier 
> > ---
> >  arch/x86/include/asm/ftrace.h   |  6 +++--
> >  arch/x86/include/asm/sections.h |  4 
> >  arch/x86/kernel/ftrace.c| 42 +++--
> >  3 files changed, 48 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > index c18ed65287d5..8f2decce38d8 100644
> > --- a/arch/x86/include/asm/ftrace.h
> > +++ b/arch/x86/include/asm/ftrace.h
> > @@ -25,9 +25,11 @@ extern void __fentry__(void);
> >  static inline unsigned long ftrace_call_adjust(unsigned long addr)
> >  {
> > /*
> > -* addr is the address of the mcount call instruction.
> > -* recordmcount does the necessary offset calculation.
> > +* addr is the address of the mcount call instruction. PIE has always a
> > +* byte added to the start of the function.
> >  */
> > +   if (IS_ENABLED(CONFIG_X86_PIE))
> > +   addr -= 1;  
> 
> This seems to modify the address even for modules that are _not_ compiled with
> PIE, see below.

Can one load a module not compiled for PIE in a kernel with PIE?

> 
> > return addr;
> >  }
> >  
> > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > index 01ebcb6f263e..73b3c30cb7a3 100644
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -135,6 +135,44 @@ ftrace_modify_code_direct(unsigned long ip, unsigned 
> > const char *old_code,
> > return 0;
> >  }
> >  
> > +/* Bytes before call GOT offset */
> > +const unsigned char got_call_preinsn[] = { 0xff, 0x15 };
> > +
> > +static int
> > +ftrace_modify_initial_code(unsigned long ip, unsigned const char *old_code,
> > +  unsigned const char *new_code)
> > +{
> > +   unsigned char replaced[MCOUNT_INSN_SIZE + 1];
> > +
> > +   ftrace_expected = old_code;
> > +
> > +   /*
> > +* If PIE is not enabled or no GOT call was found, default to the
> > +* original approach to code modification.
> > +*/
> > +   if (!IS_ENABLED(CONFIG_X86_PIE) ||
> > +   probe_kernel_read(replaced, (void *)ip, sizeof(replaced)) ||
> > +   memcmp(replaced, got_call_preinsn, sizeof(got_call_preinsn)))
> > +   return ftrace_modify_code_direct(ip, old_code, new_code);  
> 
> And this looks like an attempt to handle modules compiled without
> PIE. Does it works with the right ip in that case?

I'm guessing the || is for the "or no GOT call was found", but it
doesn't explain why no GOT would be found.

> 
> I wonder if a better solution would be to update
> scripts/recordmcount.c to store the incremented location into the module.

If recordmcount.c can handle this, then I think that's the preferred
approach. Thanks!

-- Steve

> 
> IMPORTANT: I have only vague picture about how this all works. It is
> possible that I am completely wrong. The code might be correct,
> especially if you tested this situation.
> 
> Best Regards,
> Petr


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [GIT PULL] tracing/x86/xen: Remove zero data size trace events trace_xen_mmu_flush_tlb{_all}

2018-05-15 Thread Steven Rostedt

Linus,

Some of the ftrace internal events use a zero data size for an event
field. This is increasingly important for the histogram trigger
work that is being extended.

While auditing trace events, I found that a couple of the xen events
were used as just marking that a function was called, by creating
a static array of size zero. This can play havoc with the tracing
features if these events are used, because a zero size of a static
array is denoted as a special nul terminated dynamic array (this is
what the trace_marker code uses). But since the xen events have no
size, they are not nul terminated, and unexpected results may occur.

As trace events were never intended on being a marker to denote
that a function was hit or not, especially since function tracing
and kprobes can trivially do the same, the best course of action is
to simply remove these events.


Please pull the latest trace-v4.17-rc4-2 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
trace-v4.17-rc4-2

Tag SHA1: 580faa7b1b80b1332683dc869732a4db8a506b9c
Head SHA1: 45dd9b0666a162f8e4be76096716670cf1741f0e


Steven Rostedt (VMware) (1):
  tracing/x86/xen: Remove zero data size trace events 
trace_xen_mmu_flush_tlb{_all}


 arch/x86/xen/mmu.c |  4 +---
 arch/x86/xen/mmu_pv.c  |  4 +---
 include/trace/events/xen.h | 16 
 3 files changed, 2 insertions(+), 22 deletions(-)
---
commit 45dd9b0666a162f8e4be76096716670cf1741f0e
Author: Steven Rostedt (VMware) <rost...@goodmis.org>
Date:   Wed May 9 14:36:09 2018 -0400

tracing/x86/xen: Remove zero data size trace events 
trace_xen_mmu_flush_tlb{_all}

Doing an audit of trace events, I discovered two trace events in the xen
subsystem that use a hack to create zero data size trace events. This is not
what trace events are for. Trace events add memory footprint overhead, and
if all you need to do is see if a function is hit or not, simply make that
function noinline and use function tracer filtering.

Worse yet, the hack used was:

 __array(char, x, 0)

Which creates a static string of zero length. There are assumptions in
ftrace that such constructs are dynamic, nul-terminated strings. This is
not the case with these tracepoints and can cause
problems in various parts of ftrace.
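The hack boils down to a struct member like the one in the sketch below. The struct and field names here are made up for illustration, but the layout shows the problem: a zero-length trailing array (a GCC extension) contributes no bytes to the record, so there is no terminating nul for ftrace to find.

```c
#include <stddef.h>

/*
 * Roughly what __array(char, x, 0) expands to inside the event record.
 * The zero-length array occupies no storage at all: the struct is just
 * the header, and nothing guarantees a nul byte at offsetof(..., x).
 */
struct fake_event_entry {
	int	common;		/* stand-in for the common event header */
	char	x[0];		/* the zero-length "string" */
};
```

sizeof(struct fake_event_entry) is just sizeof(int), and x begins exactly where the record ends.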

Nuke the trace events!

Link: http://lkml.kernel.org/r/20180509144605.5a220...@gandalf.local.home

Cc: sta...@vger.kernel.org
Fixes: 95a7d76897c1e ("xen/mmu: Use Xen specific TLB flush instead of the 
generic one.")
Reviewed-by: Juergen Gross <jgr...@suse.com>
    Signed-off-by: Steven Rostedt (VMware) <rost...@goodmis.org>

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index d33e7dbe3129..2d76106788a3 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -42,13 +42,11 @@ xmaddr_t arbitrary_virt_to_machine(void *vaddr)
 }
 EXPORT_SYMBOL_GPL(arbitrary_virt_to_machine);
 
-static void xen_flush_tlb_all(void)
+static noinline void xen_flush_tlb_all(void)
 {
struct mmuext_op *op;
struct multicall_space mcs;
 
-   trace_xen_mmu_flush_tlb_all(0);
-
preempt_disable();
 
mcs = xen_mc_entry(sizeof(*op));
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 486c0a34d00b..2c30cabfda90 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1310,13 +1310,11 @@ unsigned long xen_read_cr2_direct(void)
return this_cpu_read(xen_vcpu_info.arch.cr2);
 }
 
-static void xen_flush_tlb(void)
+static noinline void xen_flush_tlb(void)
 {
struct mmuext_op *op;
struct multicall_space mcs;
 
-   trace_xen_mmu_flush_tlb(0);
-
preempt_disable();
 
mcs = xen_mc_entry(sizeof(*op));
diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 7dd8f34c37df..fdcf88bcf0ea 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -352,22 +352,6 @@ DECLARE_EVENT_CLASS(xen_mmu_pgd,
 DEFINE_XEN_MMU_PGD_EVENT(xen_mmu_pgd_pin);
 DEFINE_XEN_MMU_PGD_EVENT(xen_mmu_pgd_unpin);
 
-TRACE_EVENT(xen_mmu_flush_tlb_all,
-   TP_PROTO(int x),
-   TP_ARGS(x),
-   TP_STRUCT__entry(__array(char, x, 0)),
-   TP_fast_assign((void)x),
-   TP_printk("%s", "")
-   );
-
-TRACE_EVENT(xen_mmu_flush_tlb,
-   TP_PROTO(int x),
-   TP_ARGS(x),
-   TP_STRUCT__entry(__array(char, x, 0)),
-   TP_fast_assign((void)x),
-   TP_printk("%s", "")
-   );
-
 TRACE_EVENT(xen_mmu_flush_tlb_one_user,
TP_PROTO(unsigned long addr),
TP_ARGS(addr),
