Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-03-14 Thread Steven Rostedt
On Thu, 14 Mar 2024 09:57:57 -0700
Alison Schofield  wrote:

> On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > [
> >This is a treewide change. I will likely re-create this patch again in
> >the second week of the merge window of v6.9 and submit it then. Hoping
> >to keep the conflicts that it will cause to a minimum.
> > ]

Note, change of plans. I plan on sending this in the next merge window, as
this merge window I have this patch:

  
https://lore.kernel.org/linux-trace-kernel/20240312113002.00031...@gandalf.local.home/

That will warn if the source string of __string() is different than the
source string of __assign_str(). I want to make sure they are identical
before just dropping one of them.


> 
> > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> > index bdf117a33744..07ba4e033347 100644
> > --- a/drivers/cxl/core/trace.h
> > +++ b/drivers/cxl/core/trace.h  
> 
> snip to poison
> 
> > @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
> > ),
> >  
> > TP_fast_assign(
> > -   __assign_str(memdev, dev_name(&cxlmd->dev));
> > -   __assign_str(host, dev_name(cxlmd->dev.parent));
> > +   __assign_str(memdev);
> > +   __assign_str(host);  
> 
> I think I get that the above changes work because the TP_STRUCT__entry for
> these did:
> >   __string(memdev, dev_name(&cxlmd->dev))
>   __string(host, dev_name(cxlmd->dev.parent))

That's the point. They have to be identical, or you likely have a bug.

The __string(name, src) is used to find the string length of src, which is
used to allocate the necessary space on the ring buffer. The __assign_str(name, src)
will copy src into the ring buffer.

Similar to:

len = strlen(src);
buf = malloc(len);
strcpy(buf, src);

Where __string() is strlen() and __assign_str() is strcpy(). It doesn't
make sense to use two different strings, and if you did, it would likely be
a bug.

But the magic behind __string() does much more than just get the length of
the string, and it could easily save the pointer to the string (along with
its length) and have it copy that in the __assign_str() call, making the
src parameter of __assign_str() useless.
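
In other words, a rough userspace sketch of the idea (extending the analogy
above; none of these names are the real macros or the real ring buffer API):

#include <stdlib.h>
#include <string.h>

struct str_helper {
	const char *src;	/* saved at the __string() step */
	size_t len;		/* strlen(src) + 1, used for the allocation */
};

static char *reserve_str(struct str_helper *h, const char *src)
{
	h->src = src ? src : "";
	h->len = strlen(h->src) + 1;
	return malloc(h->len);	/* stands in for the ring buffer reservation */
}

static void assign_str(char *dst, const struct str_helper *h)
{
	memcpy(dst, h->src, h->len);	/* no second 'src' argument needed */
}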

> 
> > __entry->serial = cxlmd->cxlds->serial;
> > __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
> > __entry->dpa = cxl_poison_record_dpa(record);
> > @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
> > __entry->trace_type = trace_type;
> > __entry->flags = flags;
> > if (region) {
> > -   __assign_str(region, dev_name(&region->dev));
> > +   __assign_str(region);
> > memcpy(__entry->uuid, &region->params.uuid, 16);
> > __entry->hpa = cxl_trace_hpa(region, cxlmd,
> >  __entry->dpa);
> > } else {
> > -   __assign_str(region, "");
> > +   __assign_str(region);
> > memset(__entry->uuid, 0, 16);
> > __entry->hpa = ULLONG_MAX;  
> 
> For the above 2, there was no helper in TP_STRUCT__entry. A recently
> posted patch is fixing that up to be __string(region, NULL) See [1],
> with the actual assignment still happening in TP_fast_assign.

__string(region, NULL) doesn't make sense. It's like:

len = strlen(NULL);
buf = malloc(len);
strcpy(buf, NULL);

??

I'll reply to that email.

-- Steve

> 
> Does that assign logic need to move to the TP_STRUCT__entry definition
> when you merge these changes? I'm not clear how much logic is able to be
> included, ie like 'C' style code in the TP_STRUCT__entry.
> 
> [1]
> https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofi...@intel.com/


Re: Kernel WARNING at lib/vsprintf.c:2721 while running ftrace kernel selftests

2024-02-27 Thread Steven Rostedt
On Tue, 27 Feb 2024 11:56:14 -0500
Steven Rostedt  wrote:

> On Tue, 27 Feb 2024 22:08:18 +0530
> Sachin Sant  wrote:
> 
> > > Can you apply this, and see if it triggers and if it does, print the line
> > > that has the max size?
> > > 
> > 
> > With this I see following trace
> > 
> > [   61.327138] [ cut here ]
> > [   61.327159] MAX OUT OF RANGE 63492  
> 
> Well I guess there you have it ;-)
> 
> vsprintf() doesn't like a precision of 63492!
> 
> I'll look to see what the best way to deal with this is.

Does this fix it?

-- Steve

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ff0b0a999171..e0840b94f1a2 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -6882,7 +6882,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 /* Used in tracing_mark_raw_write() as well */
 #define FAULTED_STR "<faulted>"
 #define FAULTED_SIZE (sizeof(FAULTED_STR) - 1) /* '\0' is already accounted for */
-
+#ifndef SHORT_MAX
+#define SHORT_MAX  ((1<<15) - 1)
+#endif
if (tracing_disabled)
return -EINVAL;
 
@@ -6900,6 +6902,16 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
if (cnt < FAULTED_SIZE)
size += FAULTED_SIZE - cnt;
 
+   /*
+    * trace_print_print() limits the output via the "%.*s" precision
+    * format, and vsprintf() does not accept a precision greater than
+    * a signed short.
+    */
+   if (size > SHORT_MAX) {
+   cnt -= size - SHORT_MAX;
+   goto again;
+   }
+
if (size > TRACE_SEQ_BUFFER_SIZE) {
cnt -= size - TRACE_SEQ_BUFFER_SIZE;
goto again;



Re: Kernel WARNING at lib/vsprintf.c:2721 while running ftrace kernel selftests

2024-02-27 Thread Steven Rostedt
On Tue, 27 Feb 2024 22:08:18 +0530
Sachin Sant  wrote:

> > Can you apply this, and see if it triggers and if it does, print the line
> > that has the max size?
> >   
> 
> With this I see following trace
> 
> [   61.327138] [ cut here ]
> [   61.327159] MAX OUT OF RANGE 63492

Well I guess there you have it ;-)

vsprintf() doesn't like a precision of 63492!

I'll look to see what the best way to deal with this is.
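
For reference, the failing path boils down to something like this
(paraphrased from memory of trace_print_print(), not a verbatim quote):

	int max = iter->ent_size - offsetof(struct print_entry, buf);

	/* With 64K pages, max can be ~63k, and set_precision() in
	 * lib/vsprintf.c warns once the precision exceeds a signed short. */
	trace_seq_printf(s, ": %.*s", max, field->buf);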

-- Steve


Re: Kernel WARNING at lib/vsprintf.c:2721 while running ftrace kernel selftests

2024-02-27 Thread Steven Rostedt
On Tue, 27 Feb 2024 21:38:57 +0530
Sachin Sant  wrote:

> This warning was not triggered.

Interesting.

> 
> I have attached .config

This is what I was looking for:

> # CONFIG_PPC_4K_PAGES is not set
> CONFIG_PPC_64K_PAGES=y
> CONFIG_PAGE_SIZE_64KB=y
> CONFIG_PPC_PAGE_SHIFT=16

So the pages are 64K in size. I wonder if this is causing an issue.

Can you apply this, and see if it triggers and if it does, print the line
that has the max size?

-- Steve

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3e7fa44dc2b2..43e270bf8d78 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1589,6 +1589,9 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
struct trace_seq *s = &iter->seq;
int max = iter->ent_size - offsetof(struct print_entry, buf);
 
+   if (WARN_ONCE(max < 0 || max > 2000, "MAX OUT OF RANGE %d", max))
+   return TRACE_TYPE_UNHANDLED;
+
trace_assign_type(field, iter->ent);
 
seq_print_ip_sym(s, field->ip, flags);


Re: Kernel WARNING at lib/vsprintf.c:2721 while running ftrace kernel selftests

2024-02-27 Thread Steven Rostedt
On Tue, 27 Feb 2024 15:06:18 +0530
Sachin Sant  wrote:

> I used this setup to again run bisect between 6.7.0 and 6.8-rc1.
> Bisect points to following patch 
> 
> commit 8ec90be7f15fac42992ea821be929d3b06cd0fd9
> tracing: Allow for max buffer data size trace_marker writes

Thanks, that was what I was looking for. Hmm, can you send me your config.
One thing that the powerpc architecture does differently is that the page size
can be much larger, and the calculations are based off of that.
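
Back-of-the-envelope, with approximate numbers just for scale:

	64K pages: max trace_marker event ~ 65536 - headers ~ 63-64 KB (the report shows 63492)
	 4K pages: max trace_marker event ~  4096 - headers ~ a few KB

The signed-short limit is 32767, so only the 64K page case can trip the
vsnprintf precision check.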

Can you try this patch and see if it triggers?

Thanks,

-- Steve

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 3e7fa44dc2b2..c3dba537f342 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -1589,6 +1589,9 @@ static enum print_line_t trace_print_print(struct trace_iterator *iter,
struct trace_seq *s = &iter->seq;
int max = iter->ent_size - offsetof(struct print_entry, buf);
 
+   if (WARN_ON_ONCE(max < 0))
+   return TRACE_TYPE_UNHANDLED;
+
trace_assign_type(field, iter->ent);
 
seq_print_ip_sym(s, field->ip, flags);


Re: Kernel WARNING at lib/vsprintf.c:2721 while running ftrace kernel selftests

2024-02-26 Thread Steven Rostedt
On Sun, 25 Feb 2024 22:01:50 +0530
Sachin Sant  wrote:

> While running ftrace specific kernel selftests following warning
> is seen on a Power10 logical partition (LPAR) booted with
> latest mainline kernel.
> 
> [ cut here ]
> precision 63492 too large
> WARNING: CPU: 15 PID: 2538829 at lib/vsprintf.c:2721 set_precision+0x68/0xa4

Interesting. I'm guessing something went negative.

> Modules linked in: nvram rpadlpar_io rpaphp uinput torture vmac 
> poly1305_generic chacha_generic chacha20poly1305 n_gsm pps_ldisc ppp_synctty 
> ppp_async ppp_generic serport slcan can_dev slip slhc snd_hrtimer snd_seq 
> snd_seq_device snd_timer snd soundcore pcrypt crypto_user n_hdlc dummy veth 
> tun nfsv3 nfs netfs brd overlay exfat vfat fat btrfs blake2b_generic xor 
> raid6_pq zstd_compress xfs loop sctp ip6_udp_tunnel udp_tunnel nfsd 
> auth_rpcgss nfs_acl lockd grace configs dm_mod nft_fib_inet nft_fib_ipv4 
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject 
> nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
> bonding rfkill tls ip_set nf_tables libcrc32c nfnetlink sunrpc pseries_rng 
> vmx_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi 
> ibmveth scsi_transport_srp fuse [last unloaded: test_cpuidle_latency(O)]
> CPU: 15 PID: 2538829 Comm: awk Tainted: G M O K 6.8.0-rc5-gfca7526b7d89 #1
> Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 
> (NH1060_018) hv:phyp pSeries
> NIP: c0f57c34 LR: c0f57c30 CTR: c0f5cdf0
> REGS: c00a58e4f5f0 TRAP: 0700 Tainted: G M O K (6.8.0-rc5-gfca7526b7d89)
> MSR: 82029033  CR: 48000824 XER: 0005
> CFAR: c016154c IRQMASK: 0 
> GPR00: c0f57c30 c00a58e4f890 c1482800 0019 
> GPR04: 000100011559 c00a58e4f660 c00a58e4f658 0027 
> GPR08: c00e84e37c10 0001 0027 c2a47e50 
> GPR12:  c00e87bf7300   
> GPR16:     
> GPR20: c004a43ec590 00400cc0 0003 c12c3e65 
> GPR24: c00a58e4fa18 0025 0020 0001ff97 
> GPR28: c001168a00dd c001168c0074 c00a58e4f920 f804 
> NIP [c0f57c34] set_precision+0x68/0xa4
> LR [c0f57c30] set_precision+0x64/0xa4
> Call Trace:
> [c00a58e4f890] [c0f57c30] set_precision+0x64/0xa4 (unreliable)
> [c00a58e4f900] [c0f5ccc4] vsnprintf+0x198/0x4c8
> [c00a58e4f980] [c0f53228] seq_buf_vprintf+0x50/0xa0
> [c00a58e4f9b0] [c031cec0] trace_seq_printf+0x60/0xe0
> [c00a58e4f9e0] [c031b5f0] trace_print_print+0x78/0xa4
> [c00a58e4fa60] [c03133a4] print_trace_line+0x2ac/0x6d8
> [c00a58e4fb20] [c03145c0] s_show+0x58/0x2c0
> [c00a58e4fba0] [c05dfb2c] seq_read_iter+0x448/0x618
> [c00a58e4fc70] [c05dfe08] seq_read+0x10c/0x174
> [c00a58e4fd10] [c059a7e0] vfs_read+0xe0/0x39c
> [c00a58e4fdc0] [c059b59c] ksys_read+0x7c/0x140
> [c00a58e4fe10] [c0035d74] system_call_exception+0x134/0x330
> [c00a58e4fe50] [c000d6a0] system_call_common+0x160/0x2e4
> --- interrupt: c00 at 0x7fff84521684
> NIP: 7fff84521684 LR: 0001252f0dc4 CTR: 
> REGS: c00a58e4fe80 TRAP: 0c00 Tainted: G M O K (6.8.0-rc5-gfca7526b7d89)
> MSR: 8000f033  CR: 22000202 XER: 
> IRQMASK: 0 
> GPR00: 0003 7fffcd076c30 7fff84607300 0005 
> GPR04: 00012a8a1a20 0001 fff5 00012a9bcce6 
> GPR08: 0001    
> GPR12:  7fff84a5d2a0 0001 00012a8851d8 
> GPR16: 00012a884268 7fff84601888 000125377f50 7fff84601880 
> GPR20: 0044 000f   
> GPR24: 7fffcd076e48 7fffcd076d98   
> GPR28: 7fffcd076e48  7fffcd076da0 00012a8a1900 
> NIP [7fff84521684] 0x7fff84521684
> LR [0001252f0dc4] 0x1252f0dc4
> --- interrupt: c00
> Code: f821ff91 2f89 409e0034 7c0802a6 3c62fff0 3921 3d420177 3863e310 
> 992ad6db f8010080 4b209899 6000 <0fe0> e8010080 7c0803a6 2f9f 
> ---[ end trace  ]---
> 
> This warning is seen while running test that was added by
> following commit:
> 
> commit 3bf7009251f0f41cdd0188ab7b3879df81810703
>  tracing/selftests: Add test to test the trace_marker

This adds the user space selftest that triggered this warning, but it is
not the cause of it. Could you run this test against kernel builds before
this commit. Does this test cause this to trigger on older versions of the
kernel?

-- Steve


Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-02-23 Thread Steven Rostedt
On Fri, 23 Feb 2024 13:46:53 -0500
Steven Rostedt  wrote:

> Now one thing I could do is to not remove the parameter, but just add:
> 
>   WARN_ON_ONCE((src) != __data_offsets->item##_ptr_);
> 
> in the __assign_str() macro to make sure that it's still the same that is
> assigned. But I'm not sure how useful that is, and still causes burden to
> have it. I never really liked the passing of the string in two places to
> begin with.

Hmm, maybe I'll just add this patch for 6.9 and then in 6.10 do the
parameter removal.

-- Steve

diff --git a/include/trace/stages/stage6_event_callback.h b/include/trace/stages/stage6_event_callback.h
index 0c0f50bcdc56..7372e2c2a0c4 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -35,6 +35,7 @@ #define __assign_str(dst, src)
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
+   WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_);   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\


Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-02-23 Thread Steven Rostedt
On Fri, 23 Feb 2024 14:50:49 -0500
Kent Overstreet  wrote:

> Tangentially related though, what would make me really happy is if we
> could create the string with in the TP__fast_assign() section. I have to
> have a bunch of annoying wrappers right now because the string length
> has to be known when we invoke the tracepoint.

You can use __string_len() to determine the string length in the tracepoint
(which is executed in the TP_fast_assign() section).

My clean up patches will make __assign_str_len() obsolete too (I'm working
on them now), and you can just use __assign_str().

I noticed that I don't have a string_len example in the sample code and I'm
actually writing it now.

// cutting out everything else:

TRACE_EVENT(foo_bar,

TP_PROTO(const char *foo, int bar),

TP_ARGS(foo, bar),

TP_STRUCT__entry(
__string_len(   lstr,   foo,    bar < strlen(foo) ? bar : strlen(foo) )
),

TP_fast_assign(
__assign_str(lstr, foo);

// Note, the above is with my updates, without them, you need to duplicate the logic

//  __assign_str_len(lstr, foo, bar < strlen(foo) ? bar : strlen(foo));
),

TP_printk("%s", __get_str(lstr))
);


The above will allocate "bar < strlen(foo) ? bar : strlen(foo)" size on the
ring buffer. As the size is already stored, my clean up code uses that
instead of requiring the logic to be duplicated again.

-- Steve


Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-02-23 Thread Steven Rostedt
On Fri, 23 Feb 2024 10:30:45 -0800
Jeff Johnson  wrote:

> On 2/23/2024 9:56 AM, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > [
> >This is a treewide change. I will likely re-create this patch again in
> >the second week of the merge window of v6.9 and submit it then. Hoping
> >to keep the conflicts that it will cause to a minimum.
> > ]
> > 
> > With the rework of how the __string() handles dynamic strings where it
> > saves off the source string in field in the helper structure[1], the
> > assignment of that value to the trace event field is stored in the helper
> > value and does not need to be passed in again.  
> 
> Just curious if this could be done piecemeal by first changing the
> macros to be variadic macros which allows you to ignore the extra
> argument. The callers could then be modified in their separate trees.
> And then once all the callers have been merged, the macros could be
> changed to no longer be variadic.

I weighed doing that, but I think ripping off the band-aid is a better
approach. One thing I found is that leaving unused parameters in the macros
can itself cause bugs. I found one case doing my clean up where an unused
parameter in one of the macros was bogus, and when I made it a used
parameter, it broke the build.

I think for tree-wide changes, the preferred approach is to do one big
patch at once. And since this only affects TRACE_EVENT() macros, it
hopefully would not be too much of a burden (although out of tree users may
suffer from this, but do we care?)

Now one thing I could do is to not remove the parameter, but just add:

WARN_ON_ONCE((src) != __data_offsets->item##_ptr_);

in the __assign_str() macro to make sure that it's still the same that is
assigned. But I'm not sure how useful that is, and still causes burden to
have it. I never really liked the passing of the string in two places to
begin with.

-- Steve


Re: [FYI][PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-02-23 Thread Steven Rostedt
On Fri, 23 Feb 2024 12:56:34 -0500
Steven Rostedt  wrote:

> Note, the same updates will need to be done for:
> 
>   __assign_str_len()
>   __assign_rel_str()
>   __assign_rel_str_len()

Correction: The below macros do not pass in their source to the entry
macros, so they will not need to be updated.

-- Steve

>   __assign_bitmask()
>   __assign_rel_bitmask()
>   __assign_cpumask()
>   __assign_rel_cpumask()



Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line

2023-12-21 Thread Steven Rostedt
On Thu, 21 Dec 2023 10:46:08 +
Christophe Leroy  wrote:

> > To enable ftrace, the nop at function entry is changed to an
> > unconditional branch to 'tramp'. The call to ftrace_caller() may be
> > updated to ftrace_regs_caller() depending on the registered ftrace ops.
> > On 64-bit powerpc, we additionally change the instruction at 'tramp' to
> > 'mflr r0' from an unconditional branch back to func+4. This is so that
> > functions entered through the GEP can skip the function profile sequence
> > unless ftrace is enabled.
> > 
> > With the context_switch microbenchmark on a P9 machine, there is a
> > performance improvement of ~6% with this patch applied, going from 650k
> > context switches to 690k context switches without ftrace enabled. With
> > ftrace enabled, the performance was similar at 86k context switches.  
> 
> Wondering how significant that context_switch microbenchmark is.
> 
> I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the 
> results:
> # ./context_switch --no-fp
> Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no 
> vdso:no
> 
> On 885, I get the following results before and after your patch.
> 
> CONFIG_FTRACE not selected : 44,9k
> CONFIG_FTRACE selected, before : 32,8k
> CONFIG_FTRACE selected, after : 33,6k
> 
> All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But 
> when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected 
> result is only 34,4.
> 
> On 8321:
> 
> CONFIG_FTRACE not selected : 100,3k
> CONFIG_FTRACE selected, before : 72,5k
> CONFIG_FTRACE selected, after : 116k
> 
> So the results look odd to me.


BTW, CONFIG_FTRACE just enables the tracing system (I would like to change
that to CONFIG_TRACING, but not sure if I can without breaking .configs all
over the place).

The nops for ftrace is enabled with CONFIG_FUNCTION_TRACER.

-- Steve


Re: [linux-next:master] BUILD REGRESSION 2dac75696c6da3c848daa118a729827541c89d33

2023-10-18 Thread Steven Rostedt
On Thu, 19 Oct 2023 04:07:35 +0800
kernel test robot  wrote:

> Documentation/devicetree/bindings/mfd/qcom,tcsr.yaml:
> Documentation/devicetree/bindings/mfd/qcom-pm8xxx.yaml:
> fs/tracefs/event_inode.c:782:11-21: ERROR: ei is NULL but dereferenced.

This was already reported and I'm currently testing a patch to fix it.

-- Steve


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2023-07-21 Thread Steven Rostedt
On Fri, 21 Jul 2023 18:06:07 +0200
Alexander Lobakin  wrote:

> > Just because you disable interrupts does not mean you are in interrupt
> > context.  
> 
> Ah okay, thanks! IOW, if we want to check in some code that we
> certainly have interrupts enabled and are not in the interrupt context,
> we must always do
> 
> if (!(in_hardirq() || irqs_disabled()))
> 

Yeah, probably.

> , nothing more elegant / already existing / ...?

It's not a common check. What would you call that?

irq_safe() ?
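
Something along these lines, presumably (hypothetical helper, name included):

	/* true when in plain task context with interrupts enabled */
	static inline bool irq_safe(void)
	{
		return !in_hardirq() && !irqs_disabled();
	}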

-- Steve


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2023-07-21 Thread Steven Rostedt
On Fri, 21 Jul 2023 17:34:41 +0200
Alexander Lobakin  wrote:

> From: Steven Rostedt 
> Date: Fri, 15 Oct 2021 14:25:41 -0400
> 
> Sorry for such a necroposting :z
> Just wanted to know if this is a bug, so that I could send a fix, or
> intended behaviour.
> 
> > On Fri, 15 Oct 2021 14:20:33 -0400
> > Steven Rostedt  wrote:
> >   
> >>> I think having one copy of that in a header is better than having 3
> >>> copies. But yes, something along them lines.
> >>
> >> I was just about to ask you about this patch ;-)  
> > 
> > Except it doesn't build :-p (need to move the inlined function down a bit)
> > 
> > diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> > index 4d244e295e85..b32e3dabe28b 100644
> > --- a/include/linux/preempt.h
> > +++ b/include/linux/preempt.h
> > @@ -77,6 +77,27 @@
> >  /* preempt_count() and related functions, depends on PREEMPT_NEED_RESCHED 
> > */
> >  #include 
> >  
> > +/**
> > + * interrupt_context_level - return interrupt context level
> > + *
> > + * Returns the current interrupt context level.
> > + *  0 - normal context
> > + *  1 - softirq context
> > + *  2 - hardirq context
> > + *  3 - NMI context
> > + */
> > +static __always_inline unsigned char interrupt_context_level(void)
> > +{
> > +   unsigned long pc = preempt_count();
> > +   unsigned char level = 0;
> > +
> > +   level += !!(pc & (NMI_MASK));
> > +   level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
> > +   level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));  
> 
> This doesn't take into account that we can switch the context manually
> via local_bh_disable() / local_irq_save() etc. During the testing of the

You cannot manually switch interrupt context.

> separate issue[0], I've found that the function returns 1 in both just
> softirq and softirq under local_irq_save().
> Is this intended? Shouldn't that be

That is intended behavior.

local_bh_disable() and local_irq_save() is not a context switch. It is just
preventing that context from happening. The interrupt_context_level() is to
tell us what context we are running in, not what context is disabled.

> 
>   level += !!(pc & (NMI_MASK));
>   level += !!(pc & (NMI_MASK | HARDIRQ_MASK)) || irqs_disabled();
>   level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)) ||
>        in_atomic();
> 
> ?
> Otherwise, the result it returns is not really "context level".

local_bh_disable() used to (and perhaps still does in some configurations)
confuse things. But read the comment in kernel/softirq.c

/*
 * SOFTIRQ_OFFSET usage:
 *
 * On !RT kernels 'count' is the preempt counter, on RT kernels this applies
 * to a per CPU counter and to task::softirqs_disabled_cnt.
 *
 * - count is changed by SOFTIRQ_OFFSET on entering or leaving softirq
 *   processing.
 *
 * - count is changed by SOFTIRQ_DISABLE_OFFSET (= 2 * SOFTIRQ_OFFSET)
 *   on local_bh_disable or local_bh_enable.
 *
 * This lets us distinguish between whether we are currently processing
 * softirq and whether we just have bh disabled.
 */

Just because you disable interrupts does not mean you are in interrupt
context.
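
Concretely, a small sketch assuming the interrupt_context_level() shown above:

	unsigned long flags;

	local_bh_disable();
	WARN_ON(interrupt_context_level() != 0);	/* bh disabled != in softirq */
	local_bh_enable();

	local_irq_save(flags);
	WARN_ON(interrupt_context_level() != 0);	/* irqs off != in hardirq */
	local_irq_restore(flags);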

-- Steve


> 
> > +
> > +   return level;
> > +}
> > +  
> [0]
> https://lore.kernel.org/netdev/b3884ff9-d903-948d-797a-1830a39b1...@intel.com
> 
> Thanks,
> Olek



Re: [PATCH 00/79] fs: new accessors for inode->i_ctime

2023-06-21 Thread Steven Rostedt
On Wed, 21 Jun 2023 10:45:05 -0400
Jeff Layton  wrote:

> Most of this conversion was done via coccinelle, with a few of the more
> non-standard accesses done by hand. There should be no behavioral
> changes with this set. That will come later, as we convert individual
> filesystems to use multigrain timestamps.

BTW, Linus has suggested to me that whenever a Coccinelle script is used,
it should be included in the change log.

-- Steve


Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-20 Thread Steven Rostedt
On Mon, 19 Jun 2023 02:43:58 +0200
Thomas Gleixner  wrote:

> Now you might argue that it _is_ a "hotpath" due to the BPF usage, but
> then even more so as any intermediate wrapper which converts from one
> data representation to another data representation is not going to
> increase performance, right?

Just as a side note. BPF can not attach its return calling code to
functions that have more than 6 parameters (3 on 32 bit x86), because of
the way BPF return path trampoline works. It is a requirement that all
parameters live in registers, and none on the stack.
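
For illustration, with made-up function names and the x86-64 calling convention:

	/* six integer args fit in registers (rdi, rsi, rdx, rcx, r8, r9) -- OK */
	int traceable(long a, long b, long c, long d, long e, long f);

	/* a seventh argument spills to the stack, so the BPF return
	 * trampoline cannot attach here */
	int not_traceable(long a, long b, long c, long d, long e, long f, long g);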

-- Steve


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-04 Thread Steven Rostedt
On Thu, 1 Jun 2023 16:54:36 -0700
Nadav Amit  wrote:

> > The way text_poke() is used here, it is creating a new writable alias
> > and flushing it for *each* write to the module (like for each write of
> > an individual relocation, etc). I was just thinking it might warrant
> > some batching or something.  

Batching does exist, which is what the text_poke_queue() thing does.
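
Roughly (from memory -- see arch/x86/kernel/alternative.c for the
authoritative signatures):

	/* queue up several writes, then sync once */
	for (i = 0; i < nr_sites; i++)
		text_poke_queue(site[i].addr, site[i].insn, site[i].len, NULL);
	text_poke_finish();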

-- Steve



Re: [RFC PATCH] powerpc/ftrace: Refactoring and support for -fpatchable-function-entry

2023-05-23 Thread Steven Rostedt
On Sat, 20 May 2023 23:58:01 +0530
Naveen N Rao  wrote:

> > 
> > At the begining of ftrace_replace_code(), don't you miss:
> > 
> > if (unlikely(ftrace_disabled))
> > return;  
> 
> ftrace_disabled is a static, so it can't be tested here. FWIW, x86 also 
> implements ftrace_replace_code(), and doesn't test ftrace_disabled.
> 
> Looking at the call path, it looks like it gets tested much before, so 
> may not be necessary. It sure would be good to get Steven's view on 
> this.

Yes, ftrace_disabled should not be used in arch code. It should be
caught before then.

Thanks,

-- Steve


Re: [PATCH v2 19/24] xtensa/cpu: Make sure cpu_die() doesn't return

2023-02-14 Thread Steven Rostedt
On Tue, 14 Feb 2023 10:23:22 -0800
Josh Poimboeuf  wrote:


> void __ref cpu_die(void)
> {
>   idle_task_exit();
>   local_irq_disable();
>   __asm__ __volatile__(
>   "   movi    a2, cpu_restart\n"
>   "   jx  a2\n");
> 
>   BUG();
> }
> 
> Hm, not only is the comment wrong, but it seems to be branching to
> cpu_restart?  That doesn't seem right at all.

Agreed, that does not look right at all.

-- Steve

> 
> Max/Chris?
> 



Re: [RFC PATCH v2 0/8] Generic IPI sending tracepoint

2022-11-14 Thread Steven Rostedt
On Wed,  2 Nov 2022 18:29:41 +
Valentin Schneider  wrote:

> This is incomplete, just looking at arm64 there's more IPI types that aren't
> covered: 
> 
>   IPI_CPU_STOP,
>   IPI_CPU_CRASH_STOP,
>   IPI_TIMER,
>   IPI_WAKEUP,
> 
> ... But it feels like a good starting point.

For the tracing portions:

Reviewed-by: Steven Rostedt (Google) 

-- Steve


Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

2022-10-11 Thread Steven Rostedt
On Tue, 11 Oct 2022 17:40:26 +0100
Valentin Schneider  wrote:

> > You could keep the tracepoint as a mask, and then make it pretty, like 
> > cpus=3-5,8
> > in user-space. For example with a trace-cmd/perf loadable plugin, 
> > libtracefs helper.
> >  
> 
> That's a nice idea, the one downside I see is that means registering an
> event handler for all events with cpumasks rather than directly targeting
> cpumask fields, but that doesn't look too horrible. I'll dig a bit in that
> direction.

We could just make all dynamic arrays of unsigned long use that
format? I don't know of any other event that has dynamic arrays of
unsigned longs. And doing a search doesn't come up with any.

-- Steve


Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

2022-10-11 Thread Steven Rostedt
On Tue, 11 Oct 2022 17:17:04 +0100
Valentin Schneider  wrote:

> tep_get_field_val() just yields an unsigned long long of value 0x200018,
> which AFAICT is just the [length, offset] thing associated with dynamic
> arrays. Not really usable, and I don't see anything exported in the lib to
> extract and use those values.

Correct.

> 
> tep_get_field_raw() is better, it handles the dynamic array for us and
> yields a pointer to the cpumask array at the tail of the record. With that
> it's easy to get an output such as: cpumask[size=32]=[4,0,0,0,]. Still,
> this isn't a native type for many programming languages.

Yeah, this is the interface that I would have used. And it would likely
require some kind of wrapper to make it into what you prefer.

Note, I've been complaining for some time now how much I hate the
libtraceevent interface, and want to rewrite it. (I just spoke to
someone that wants to implement it in Rust). But the question comes
down to how to make it backward compatible. Perhaps we don't and just
up the major version number (libtraceevent 2.0).

What would you recommend as an API to process cpumasks better?

-- Steve


Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

2022-10-08 Thread Steven Rostedt
On Fri, 7 Oct 2022 17:01:33 -0300
Marcelo Tosatti  wrote:

> > As for the targeted CPUs, the existing tracepoint does export them, albeit in
> > cpumask form, which is quite inconvenient from a tooling perspective. For
> > instance, as far as I'm aware, it's not possible to do event filtering on a
> > cpumask via trace-cmd.  
> 
> https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html
> 
>-f filter
>Specify a filter for the previous event. This must come after
>a -e. This will filter what events get recorded based on the
>content of the event. Filtering is passed to the kernel
>directly so what filtering is allowed may depend on what
>version of the kernel you have. Basically, it will let you
>use C notation to check if an event should be processed or
>not.
> 
>==, >=, <=, >, <, &, |, && and ||
> 
>The above are usually safe to use to compare fields.

We could always add an "isset(x)" filter ;-)

-- Steve


Re: [RFC PATCH 4/5] irq_work: Trace calls to arch_irq_work_raise()

2022-10-08 Thread Steven Rostedt
On Fri,  7 Oct 2022 16:45:32 +0100
Valentin Schneider  wrote:
>  }
>  
> +static inline void irq_work_raise(void)
> +{
> + if (arch_irq_work_has_interrupt())
> + trace_ipi_send_cpu(_RET_IP_, smp_processor_id());

To save on the branch, let's make the above:

if (trace_ipi_send_cpu_enabled() && arch_irq_work_has_interrupt())

As the "trace_*_enabled()" is a static branch that will make it a nop
when the tracepoint is not enabled.
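
That is, something like this (sketch of the combined form):

	static inline void irq_work_raise(void)
	{
		/* static branch check first, so this compiles down to a nop
		 * when the tracepoint is disabled */
		if (trace_ipi_send_cpu_enabled() && arch_irq_work_has_interrupt())
			trace_ipi_send_cpu(_RET_IP_, smp_processor_id());

		arch_irq_work_raise();
	}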

-- Steve


> +
> + arch_irq_work_raise();
> +}
> +
>  /* Enqueue on current CPU, work must already be claimed and preempt disabled */


Re: [PATCH v2 33/44] ftrace: WARN on rcuidle

2022-09-27 Thread Steven Rostedt


Nit, the subject should have "tracing:" an not "ftrace:" as the former
encompasses the tracing infrastructure and the latter is for the function
hook part of that.

On Mon, 19 Sep 2022 12:00:12 +0200
Peter Zijlstra  wrote:

> CONFIG_GENERIC_ENTRY disallows any and all tracing when RCU isn't
> enabled.
> 
> XXX if s390 (the only other GENERIC_ENTRY user as of this writing)
> isn't comfortable with this, we could switch to
> HAVE_NOINSTR_VALIDATION which is x86_64 only atm.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/tracepoint.h |   13 -
>  kernel/trace/trace.c   |3 +++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -178,6 +178,16 @@ static inline struct tracepoint *tracepo
>  #endif /* CONFIG_HAVE_STATIC_CALL */
>  
>  /*
> + * CONFIG_GENERIC_ENTRY archs are expected to have sanitized entry and idle
> + * code that disallow any/all tracing/instrumentation when RCU isn't watching.
> + */
> +#ifdef CONFIG_GENERIC_ENTRY
> +#define RCUIDLE_COND(rcuidle)    (rcuidle)
> +#else

Should probably move the below comment to here:

 /* srcu can't be used from NMI */

> +#define RCUIDLE_COND(rcuidle)    (rcuidle && in_nmi())
> +#endif
> +
> +/*
>   * it_func[0] is never NULL because there is at least one element in the array
>   * when the array itself is non NULL.
>   */
> @@ -189,7 +199,8 @@ static inline struct tracepoint *tracepo
>   return; \
>   \
>   /* srcu can't be used from NMI */   \

And remove the above.

-- Steve

> - WARN_ON_ONCE(rcuidle && in_nmi());  \
> + if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\
> + return; \
>   \
>   /* keep srcu and sched-rcu usage consistent */  \
>   preempt_disable_notrace();  \
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3104,6 +3104,9 @@ void __trace_stack(struct trace_array *t
>   return;
>   }
>  
> + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_GENERIC_ENTRY)))
> + return;
> +
>   /*
>* When an NMI triggers, RCU is enabled via ct_nmi_enter(),
>* but if the above rcu_is_watching() failed, then the NMI
> 



Re: [PATCH] powerpc/ftrace: Ignore weak functions

2022-08-16 Thread Steven Rostedt
On Tue,  9 Aug 2022 16:24:25 +0530
"Naveen N. Rao"  wrote:

> Extend commit b39181f7c6907d ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to
> avoid adding weak function") to ppc32 and ppc64 -mprofile-kernel by
> defining FTRACE_MCOUNT_MAX_OFFSET.
> 
> For ppc64 -mprofile-kernel ABI, we can have two instructions at function
> entry for TOC setup followed by 'mflr r0' and 'bl _mcount'. So, the
> mcount location is at most the 4th instruction in a function. For ppc32,
> mcount location is always the 3rd instruction in a function, preceded by
> 'mflr r0' and 'stw r0,4(r1)'.
> 
> With this patch, and with ppc64le_guest_defconfig and some ftrace/bpf
> config items enabled:
>   # grep __ftrace_invalid_address available_filter_functions | wc -l
>   79

I wonder if this patch answers the question to my last email. ;-)

Acked-by: Steven Rostedt (Google) 

-- Steve

> 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/include/asm/ftrace.h | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
> index 3cee7115441b41..ade406dc6504e3 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -10,6 +10,13 @@
>  
>  #define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
>  
> +/* Ignore unused weak functions which will have larger offsets */
> +#ifdef CONFIG_MPROFILE_KERNEL
> +#define FTRACE_MCOUNT_MAX_OFFSET 12
> +#elif defined(CONFIG_PPC32)
> +#define FTRACE_MCOUNT_MAX_OFFSET 8
> +#endif
> +
>  #ifndef __ASSEMBLY__
>  extern void _mcount(void);
>  
> 
> base-commit: ff1ed171e05c971652a0ede3d716997de8ee41c9



Re: [PATCH 0/2] ftrace/recordmcount: Handle object files without section symbols

2022-08-16 Thread Steven Rostedt
On Wed, 27 Apr 2022 15:01:20 +0530
"Naveen N. Rao"  wrote:

> This solves a build issue on powerpc with binutils v2.36 and newer [1].
> Since commit d1bcae833b32f1 ("ELF: Don't generate unused section
> symbols") [2], binutils started dropping section symbols that it thought
> were unused. Due to this, in certain scenarios, recordmcount is unable 
> to find a non-weak symbol to generate a relocation record against.
> 
> Clang integrated assembler is also aggressive in dropping section 
> symbols [3].
> 
> In the past, there have been various workarounds to address this. See 
> commits 55d5b7dd6451b5 ("initramfs: fix clang build failure") and 
> 6e7b64b9dd6d96 ("elfcore: fix building with clang") and a recent patch:
> https://lore.kernel.org/linuxppc-dev/20220425174128.11455-1-naveen.n@linux.vnet.ibm.com/T/#u
> 
> Fix this issue by using the weak symbol in the relocation record. This 
> can result in duplicate locations in the mcount table if those weak 
> functions are overridden, so have ftrace skip duplicate entries.
> 
> Objtool already follows this approach, so patch 2 updates recordmcount 
> to do the same. Patch 1 updates ftrace to skip duplicate entries.
> 
> - Naveen
> 
> 
> [1] https://github.com/linuxppc/issues/issues/388
> [2] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1
> [3] https://github.com/ClangBuiltLinux/linux/issues/981
> 
>

There's been work to handle weak functions, but I'm not sure that work
handled the issues here. Are these patches still needed, or was there
another workaround to handle the problems this addressed?

-- Steve


Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-14 Thread Steven Rostedt


On Thu, 9 Jun 2022 15:02:20 +0200
Petr Mladek  wrote:

> > I'm somewhat curious whether we can actually remove that trace event.  
> 
> Good question.
> 
> Well, I think that it might be useful. It allows to see trace and
> printk messages together.

Yes people still use it. I was just asked about it at Kernel Recipes. That
is, someone wanted printk mixed in with the tracing, and I told them about
this event (which they didn't know about but was happy to hear that it
existed).

-- Steve


Re: [PATCH 18/30] notifier: Show function names on notifier routines if DEBUG_NOTIFIERS is set

2022-05-10 Thread Steven Rostedt
On Thu, 28 Apr 2022 09:01:13 +0800
Xiaoming Ni  wrote:

> > +#ifdef CONFIG_DEBUG_NOTIFIERS
> > +   {
> > +   char sym_name[KSYM_NAME_LEN];
> > +
> > +   pr_info("notifiers: registered %s()\n",
> > +   notifier_name(n, sym_name));
> > +   }  
> 
> Duplicate Code.
> 
> Is it better to use __func__ and %pS?
> 
> pr_info("%s: %pS\n", __func__, n->notifier_call);
> 
> 
> > +#endif

Also, don't sprinkle #ifdef in C code. Instead:

	if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS))
		pr_info("notifiers: registered %ps()\n",
			n->notifier_call);


Or define a print macro at the start of the C file that is a nop if it is
not defined, and use the macro.

-- Steve


Re: [PATCH 04/30] firmware: google: Convert regular spinlock into trylock on panic path

2022-05-10 Thread Steven Rostedt
On Tue, 10 May 2022 13:38:39 +0200
Petr Mladek  wrote:

> As already mentioned in the other reply, panic() sometimes stops
> the other CPUs using NMI, for example, see kdump_nmi_shootdown_cpus().
> 
> Another situation is when the CPU using the lock ends in some
> infinite loop because something went wrong. The system is in
> an unpredictable state during panic().
> 
> I am not sure if this is possible with the code under gsmi_dev.lock
> but such things really happen during panic() in other subsystems.
> Using trylock in the panic() code path is a good practice.

I believe that Peter Zijlstra had a special spin lock for NMIs or early
printk, where it would not block if the lock was held on the same CPU. That
is, if an NMI happened and paniced while this lock was held on the same
CPU, it would not deadlock. But it would block if the lock was held on
another CPU.

-- Steve


Re: [PATCH 23/30] printk: kmsg_dump: Introduce helper to inform number of dumpers

2022-05-10 Thread Steven Rostedt
On Wed, 27 Apr 2022 19:49:17 -0300
"Guilherme G. Piccoli"  wrote:

> Currently we don't have a way to check if there are dumpers set,
> except counting the list members maybe. This patch introduces a very
> simple helper to provide this information, by just keeping track of
> registered/unregistered kmsg dumpers. It's going to be used on the
> panic path in the subsequent patch.

FYI, it is considered "bad form" to reference "this patch" in the change
log. We know this is a patch. The change log should just talk about what
is being done. So can you reword your change logs (you do this in almost
every patch)? Here's what I would reword the above to be:

 Currently we don't have a way to check if there are dumpers set, except
 perhaps by counting the list members. Introduce a very simple helper to
 provide this information, by just keeping track of registered/unregistered
 kmsg dumpers. This will simplify the refactoring of the panic path.


-- Steve


Re: [PATCH 2/2] recordmcount: Handle sections with no non-weak symbols

2022-05-04 Thread Steven Rostedt
On Wed, 4 May 2022 16:50:58 +
Christophe Leroy  wrote:

> In vmlinux, relocations are resolved, trampolines are installed for 
> unreachable destinations and you don't anymore have a section with all 
> the relocations to mcount. It means 'recordmcount' or whatever tool we 
> use will have to read the code to find all direct calls to mcount, then 
> find all trampolines to mcount then find all calls to those trampolines.

OK, so what you are saying is that in the object file, we can see the
site that calls mcount, but when it is linked, it may not call mcount,
but instead it will call a trampoline that will call mcount, thus the
tool will need to find these calls to the trampolines that call mcount
as well as the locations that call mcount directly.

Did I get that right?

-- Steve


Re: [PATCH 2/2] recordmcount: Handle sections with no non-weak symbols

2022-05-03 Thread Steven Rostedt
On Tue, 3 May 2022 11:20:22 +
Christophe Leroy  wrote:

> Maybe I misunderstood. When you say 'after linking', do you mean vmlinux 
> or vmlinux.o ?

Whichever ;-)

> 
> In vmlinux, the addresses to be saved in __mcount_loc table might not 
> contain anymore a call to _mcount but a call to a trampoline that jumps 
> to _mcount, in case _mcount is too far from the said location at link 
> time. That's what I meant.

But how is that different than what is done today? And at linking,
everything still calls mcount. It's not until runtime things change.

The point I'm talking about is that after linking, if the linker
removed unused code (which would include unused weak functions,
right?), then the calls to mcount that were in the weak functions would
be gone too, and they would not be added by recordmcount, and for those
that are still around, then using their symbols as the reference point
would also not be an issue.

The problem we have right now is that the only symbol we have is a weak
function to reference the mcount call location in the __mcount_loc
section. But if all the global entries are not used and the linker
removes them, then the references using those symbols in the
__mcount_loc section will be "undefined". After linking, everything in
the vmlinux(.o) is set, and we are free to use that as a reference
point for the call sites.

-- Steve


Re: [PATCH 2/2] recordmcount: Handle sections with no non-weak symbols

2022-05-02 Thread Steven Rostedt
On Mon, 2 May 2022 14:44:56 +
Christophe Leroy  wrote:

> If we do that after the linking, won't it be a nightmare with the 
> trampolines installed by the linker when the destination is over the 24 
> bits limit ?

Not sure what you mean. The locations I'm talking about is the full
address saved in the __mcount_loc table (data section).

-- Steve


Re: [PATCH v2 2/2] ftrace: recordmcount: Handle sections with no non-weak symbols

2022-04-29 Thread Steven Rostedt
On Sat, 30 Apr 2022 01:03:01 +0530
"Naveen N. Rao"  wrote:

> > The point of this section is to know which functions in __mcount_loc may
> > have been overridden, as they would be found in the __mcount_loc_weak
> > section. And then we can do something "special" to them.  
> 
> I'm not sure I follow that. How are you intending to figure out which 
> functions were overridden by looking at entries in the __mcount_loc_weak 
> section?

If there's duplicates (or something strange with the offset) then we could
delete the one that has the match in the weak section.

> 
> In the final vmlinux image, we only get offsets into .text for all 
> mcount locations, but no symbol information. The only hint is the fact 
> that a single kallsym symbol has multiple mcount locations within it. 
> Even then, the symbol with duplicate mcount entries won't be the 
> function that was overridden.
> 
> We could do a kallsyms_lookup() on each entry and consult the 
> __mcount_loc_weak section to identify duplicates, but that looks to be 
> very expensive.
> 
> Did you have a different approach in mind?

We only need to look at the ones in the weak section. How many that is may
determine if that's feasible or not.

> 
> >   
> >> 
> >> As an example, in the issue described in this patch set, if powerpc 
> >> starts over-riding kexec_arch_apply_relocations(), then the current weak 
> >> implementation in kexec_file.o gets carried over to the final vmlinux, 
> >> but the instructions will instead appear under the previous function in 
> >> kexec_file.o: crash_prepare_elf64_headers(). This function may or may 
> >> not be traced to begin with, so we won't be able to figure out if this 
> >> is valid or not.  
> > 
> > If it was overridden, then there would be two entries for function that
> > overrides the weak function in the __mcount_loc section, right? One for the
> > new function, and one that was overridden.  
> 
> In the final vmlinux, we will have two entries: one pointing at the 
> correct function, while the other will point to some other function 
> name. So, at least from kallsym perspective, duplicate mcount entries 
> won't be for the function that was overridden, but some arbitrary 
> function that came before the weak function in the object file.

Right, and we should be able to find them.

> 
> > Of course this could be more
> > complex if the new function had been marked notrace.
> > 
> > I was thinking of doing this just so that we know what functions are weak
> > and perhaps need extra processing.
> > 
> > Another issue with weak functions and ftrace just came up here:
> > 
> >   https://lore.kernel.org/all/20220428095803.66c17...@gandalf.local.home/  
> 
> I noticed this just yesterday:
> 
>   # cat available_filter_functions | sort | uniq -d | wc -l
>   430
> 
> I'm fairly certain that some of those are due to weak functions -- I 
> just wasn't sure if all of those were.

Probably :-/

> 
> I suppose this will now also be a problem with ftrace_location(), given 
> that it was recently changed to look at an entire function for mcount 
> locations?
> 

Maybe. I have to focus on other things at the moment, but I'll try to find
some time to look at this deeper.

-- Steve


Re: [PATCH v2 2/2] ftrace: recordmcount: Handle sections with no non-weak symbols

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 23:09:19 +0530
"Naveen N. Rao"  wrote:

> If I'm understanding your suggestion right:
> - we now create a new section in each object file: __mcount_loc_weak, 
>   and capture such relocations using weak symbols there.

Yes, but it would be putting the same information it puts into __mcount_loc
but also add it here too. That is, it will duplicate the data.

> - we then ask the linker to put these separately between, say, 
>   __start_mcount_loc_weak and __stop_mcount_loc_weak

Yes, but it will also go in the location between __start_mcount_loc and
__stop_mcount_loc.

> - on ftrace init, we go through entries in this range, but discard those 
>   that belong to functions that also have an entry between 
>   __start_mcount_loc and __stop_mcount loc.

But we should be able to know if it was overridden or not, by seeing if
there's another function that was called. Or at least, we can validate them
to make sure that they are correct.

> 
> The primary issue I see here is that the mcount locations within the new 
> weak section will end up being offsets from a different function in 
> vmlinux, since the linker does not create a symbol for the weak 
> functions that were over-ridden.

The point of this section is to know which functions in __mcount_loc may
have been overridden, as they would be found in the __mcount_loc_weak
section. And then we can do something "special" to them.

> 
> As an example, in the issue described in this patch set, if powerpc 
> starts over-riding kexec_arch_apply_relocations(), then the current weak 
> implementation in kexec_file.o gets carried over to the final vmlinux, 
> but the instructions will instead appear under the previous function in 
> kexec_file.o: crash_prepare_elf64_headers(). This function may or may 
> not be traced to begin with, so we won't be able to figure out if this 
> is valid or not.

If it was overridden, then there would be two entries for function that
overrides the weak function in the __mcount_loc section, right? One for the
new function, and one that was overridden. Of course this could be more
complex if the new function had been marked notrace.

I was thinking of doing this just so that we know what functions are weak
and perhaps need extra processing.

Another issue with weak functions and ftrace just came up here:

  https://lore.kernel.org/all/20220428095803.66c17...@gandalf.local.home/


-- Steve


Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 10:46:35 -0300
"Guilherme G. Piccoli"  wrote:

> Thanks Sergei and Steven, good idea! I thought about the switch change
> you propose, but I confess I got a bit confused by the "fallthrough"
> keyword - do I need to use it?

No. The fallthrough keyword is only needed when there's code between case
labels. As it is very common to list multiple cases for the same code path.
That is:

	case DIE_OOPS:
	case PANIC_NOTIFIER:
		do_dump = 1;
		break;

Does not need a fall through label, as there's no code between the DIE_OOPS
and the PANIC_NOTIFIER. But if you had:

	case DIE_OOPS:
		x = true;
	case PANIC_NOTIFIER:
		do_dump = 1;
		break;

Then you do.

-- Steve


Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 12:22:44 +0300
Sergei Shtylyov  wrote:

> > +   switch (ev) {
> > +   case DIE_OOPS:
> > +   do_dump = 1;
> > +   break;
> > +   case PANIC_NOTIFIER:
> > +   do_dump = 1;
> > +   break;  
> 
>Why not:
> 
>   case DIE_OOPS:
>   case PANIC_NOTIFIER:
>   do_dump = 1;
>       break;

Agreed.

Other than that.

Acked-by: Steven Rostedt (Google) 

-- Steve


Re: [PATCH v2 2/2] ftrace: recordmcount: Handle sections with no non-weak symbols

2022-04-28 Thread Steven Rostedt
On Thu, 28 Apr 2022 22:49:52 +0530
"Naveen N. Rao"  wrote:

> But, with ppc64 elf abi v1 which only supports the old -pg flag, mcount
> location can differ between the weak and non-weak variants of a
> function. In such scenarios, one of the two mcount entries will be
> invalid. Such architectures need to validate mcount locations by
> ensuring that the instruction(s) at those locations are as expected. On
> powerpc, this can be a simple check to ensure that the instruction is a
> 'bl'. This check can be further tightened as necessary.

I was thinking about this more, and I was thinking that we could create
another section; Perhaps __mcount_loc_weak. And place these in that
section. That way, we could check if these symbols to see if there's
already a symbol for it, and if there is, then drop it.

-- Steve


Re: [PATCH 2/2] recordmcount: Handle sections with no non-weak symbols

2022-04-28 Thread Steven Rostedt
On Thu, 28 Apr 2022 13:15:22 +0530
"Naveen N. Rao"  wrote:

> Indeed, plain old -pg will be a problem. I'm not sure there is a generic 
> way to address this. I suppose architectures will have to validate the 
> mcount locations, something like this?

Perhaps another solution is to make the mcount locations after the linking
is done. The main downside to that is that it takes time to go over the
entire vmlinux, and will slow down a compile that only modified a couple of
files.

-- Steve


Re: [PATCH 2/2] recordmcount: Handle sections with no non-weak symbols

2022-04-27 Thread Steven Rostedt
On Wed, 27 Apr 2022 15:01:22 +0530
"Naveen N. Rao"  wrote:

> If one or both of these weak functions are overridden in future, in the
> final vmlinux mcount table, references to these will change over to the
> non-weak variant which has its own mcount location entry. As such, there
> will now be two entries for these functions, both pointing to the same
> non-weak location.

But is that really true in all cases? x86 uses fentry these days, and other
archs do things differently too. But the original mcount (-pg) call
happened *after* the frame setup. That means the offset of the mcount call
would be at different offsets wrt the start of the function. If you have
one of these architectures that still use mcount, and the weak function
doesn't have the same size frame setup as the overriding function, then the
addresses will not be the same.

-- Steve


> We don't need the non-weak locations since they will
> never be executed. Ftrace skips the duplicate entries due to a previous
> commit.



Re: [PATCH 1/2] ftrace: Drop duplicate mcount locations

2022-04-27 Thread Steven Rostedt
On Wed, 27 Apr 2022 15:01:21 +0530
"Naveen N. Rao"  wrote:

> In the absence of section symbols [1], objtool (today) and recordmcount
> (with a subsequent patch) generate __mcount_loc relocation records with
> weak symbols as the base. This works fine as long as those weak symbols
> are not overridden, but if they are, these can result in duplicate
> entries in the final vmlinux mcount location table. This will cause
> ftrace to fail when trying to patch the same location twice. Fix this by
> dropping duplicate locations during ftrace init.
> 
> [1] https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=d1bcae833b32f1
> 
> Signed-off-by: Naveen N. Rao 
> ---
>  kernel/trace/ftrace.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 4f1d2f5e726341..8bc4f282bb3ff4 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -6496,7 +6496,7 @@ static int ftrace_process_locs(struct module *mod,
>   struct dyn_ftrace *rec;
>   unsigned long count;
>   unsigned long *p;
> - unsigned long addr;
> + unsigned long addr, prev_addr = 0;
>   unsigned long flags = 0; /* Shut up gcc */
>   int ret = -ENOMEM;
>  
> @@ -6550,6 +6550,16 @@ static int ftrace_process_locs(struct module *mod,
>   while (p < end) {
>   unsigned long end_offset;
>   addr = ftrace_call_adjust(*p++);
> +
> + /*
> +  * Drop duplicate entries, which can happen when weak
> +  * functions are overridden, and __mcount_loc relocation
> +  * records were generated against function names due to
> +  * absence of non-weak section symbols
> +  */
> + if (addr == prev_addr)
> + addr = 0;

Please don't use the side effect of addr == 0 causing the loop to continue
for this logic. The two are not related. Simply call continue.

if (addr == prev_addr)
continue;


-- Steve


> +
>   /*
>* Some architecture linkers will pad between
>* the different mcount_loc sections of different
> @@ -6569,6 +6579,7 @@ static int ftrace_process_locs(struct module *mod,
>  
> >   rec = &pg->records[pg->index++];
>   rec->ip = addr;
> + prev_addr = addr;
>   }
>  
>   /* We should have used all pages */



Re: [PATCH v1 07/22] powerpc/ftrace: Use patch_instruction() return directly

2022-04-18 Thread Steven Rostedt
On Mon, 18 Apr 2022 11:51:16 +0530
"Naveen N. Rao"  wrote:

> > --- a/arch/powerpc/kernel/trace/ftrace.c
> > +++ b/arch/powerpc/kernel/trace/ftrace.c
> > @@ -78,10 +78,7 @@ ftrace_modify_code(unsigned long ip, ppc_inst_t old, ppc_inst_t new)
> > }
> > 
> > /* replace the text with the new text */
> > -   if (patch_instruction((u32 *)ip, new))
> > -   return -EPERM;
> > -
> > -   return 0;
> > +   return patch_instruction((u32 *)ip, new);  
> 
> I think the reason we were returning -EPERM is so that ftrace_bug() can 

That is correct.

> throw the right error message. That will change due to this patch, 
> though I'm not sure how much it matters. -EFAULT and -EPERM seem to 
> print almost the same error message.

In these cases it helps to know the type of failure, as the way to debug it
is different.

-EFAULT: It failed to read the location. This means that the memory is
likely not even mapped in, or the pointer is way off.

-EINVAL: Means that what was read did not match what was expected (the code
was already updated, pointing to the wrong location, or simply the
calculation of what to expect is incorrect).

-EPERM: Means the write failed. What was read was expected, but the
permissions to write have not been updated properly.

Differentiating the three is crucial to looking at where the issue lies
when an ftrace_bug() triggers.
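For reference, here is roughly what the full ftrace_modify_code() looks like
on powerpc with the three cases kept distinct (reconstructed from memory and
untested, so treat the helper names as assumptions):

static int
ftrace_modify_code(unsigned long ip, ppc_inst_t old, ppc_inst_t new)
{
	ppc_inst_t replaced;

	/* -EFAULT: could not even read the location */
	if (copy_inst_from_kernel_nofault(&replaced, (u32 *)ip))
		return -EFAULT;

	/* -EINVAL: what was read is not what we expected to replace */
	if (!ppc_inst_equal(replaced, old))
		return -EINVAL;

	/* -EPERM: the read matched, but the write itself failed */
	if (patch_instruction((u32 *)ip, new))
		return -EPERM;

	return 0;
}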

-- Steve


Re: [PATCH v2] ftrace: Make ftrace_graph_is_dead() a static branch

2022-03-30 Thread Steven Rostedt
On Wed, 30 Mar 2022 06:55:26 +
Christophe Leroy  wrote:

> > Small nit. Please order the includes in "upside-down x-mas tree" fashion:
> > 
> > #include 
> > #include 
> > #include 
> > #include 
> >   
> 
> That's the first time I get such a request. Usually people request 
> #includes to be in alphabetical order so when I see a file that has 
> headers in alphabetical order I try to not break it, but here that was 
> not the case so I put it at the end of the list.

This is something that Ingo Molnar started back in 2009 or so. And I do
find it easier on the eyes ;-)  I may be the only one today trying to keep
it (albeit poorly).

It's not a hard requirement, but I find it makes the code look more like
art, which it is :-D

> 
> I'll send v3

Thanks,

-- Steve


Re: [PATCH v2] ftrace: Make ftrace_graph_is_dead() a static branch

2022-03-29 Thread Steven Rostedt
On Fri, 25 Mar 2022 09:03:08 +0100
Christophe Leroy  wrote:

> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  

Small nit. Please order the includes in "upside-down x-mas tree" fashion:

#include 
#include 
#include 
#include 

Thanks,

-- Steve


Re: [PATCH v2] MAINTAINERS: Enlarge coverage of TRACING inside architectures

2022-03-29 Thread Steven Rostedt
On Fri, 25 Mar 2022 07:32:21 +0100
Christophe Leroy  wrote:

> Most architectures have ftrace related stuff in arch/*/kernel/ftrace.c
> but powerpc has it spread in multiple files located in
> arch/powerpc/kernel/trace/
> In several architectures, there are also additional files containing
> 'ftrace' as part of the name but with some prefix or suffix.

Acked-by: Steven Rostedt (Google) 

-- Steve


Re: [PATCH v1] ftrace: Make ftrace_graph_is_dead() static inline

2022-03-24 Thread Steven Rostedt
On Thu, 24 Mar 2022 11:01:45 +0100
Christophe Leroy  wrote:

> @@ -1006,7 +1006,20 @@ unsigned long ftrace_graph_ret_addr(struct task_struct 
> *task, int *idx,
>  extern int register_ftrace_graph(struct fgraph_ops *ops);
>  extern void unregister_ftrace_graph(struct fgraph_ops *ops);
>  
> -extern bool ftrace_graph_is_dead(void);
> +/**
> + * ftrace_graph_is_dead - returns true if ftrace_graph_stop() was called
> + *
> + * ftrace_graph_stop() is called when a severe error is detected in
> + * the function graph tracing. This function is called by the critical
> + * paths of function graph to keep those paths from doing any more harm.
> + */
> +extern bool kill_ftrace_graph;
> +
> +static inline bool ftrace_graph_is_dead(void)
> +{
> + return kill_ftrace_graph;
> +}
> +
>  extern void ftrace_graph_stop(void);

The reason I did not expose that variable, is because I didn't want it to
be touched outside of the kernel/trace directory. Or the ftrace.c file for
that matter (although, I could put it in fgraph.c :-/)

What would be better, is to make it a static branch.

extern struct static_key fgraph_dead;

static inline bool ftrace_graph_is_dead(void)
{
	if (static_key_false(&fgraph_dead))
		return true;
	return false;
}

That way we even get rid of the conditional branch.

Yeah, the fgraph_dead is still exposed for anyone to touch, but it still
requires a function to modify it, so I'm not as worried it will be touched
as easily.
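Roughly, the fgraph.c side of that suggestion would be something like this
(untested sketch; the fgraph_dead name just follows the above):

/* kernel/trace/fgraph.c */
struct static_key fgraph_dead = STATIC_KEY_INIT_FALSE;

void ftrace_graph_stop(void)
{
	/*
	 * Flip the branch once on a fatal error; after this every
	 * ftrace_graph_is_dead() call takes the "dead" path without
	 * a conditional in the hot path.
	 */
	static_key_slow_inc(&fgraph_dead);
}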

-- Steve


Re: [PATCH] MAINTAINERS: Add arch/powerpc/kernel/trace to TRACING

2022-03-24 Thread Steven Rostedt
On Thu, 24 Mar 2022 15:38:28 +0100
Christophe Leroy  wrote:

> diff --git a/MAINTAINERS b/MAINTAINERS
> index e127c2fb08a7..3b5943f34568 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -19592,6 +19592,7 @@ T:git 
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
>  F:   Documentation/trace/ftrace.rst
>  F:   arch/*/*/*/ftrace.h
>  F:   arch/*/kernel/ftrace.c
> +F:   arch/powerpc/kernel/trace/

Perhaps replace the above two with:
   F:   arch/*/*/ftrace*
   F:   arch/*/*/*/ftrace*

As we also have things like ftrace_32.S, ftrace_64.S

Maybe even add:

   F: arch/*/*/*_ftrace.*
   F: arch/*/*/*/*_ftrace.*

to catch the "entry_ftrace.S" and friends (needing the leading "_" to not
get confused by any "bpftrace" files).

?

But in reality, I'm only a reviewer for ftrace in the archs. It really is
maintained by the arch maintainers. But it is still good that I get Cc'd on
any ftrace arch related changes.

-- Steve


>  F:   fs/tracefs/
>  F:   include/*/ftrace.h
>  F:   include/linux/trace*.h



Re: [PATCH v2 02/13] tracing: Fix selftest config check for function graph start up test

2022-02-24 Thread Steven Rostedt
On Thu, 24 Feb 2022 15:13:12 +
Christophe Leroy  wrote:

> > But I'm working on a series to send to Linus. I can pick this patch up, as
> > it touches just my code.
> >   
> 
> That would be great, thanks.

It's in my queue and running through my tests, which take 7 to 13 hours to
complete (depending on the changes).

-- Steve


Re: [PATCH v2 02/13] tracing: Fix selftest config check for function graph start up test

2022-02-24 Thread Steven Rostedt
On Thu, 24 Feb 2022 13:43:02 +
Christophe Leroy  wrote:

> Hi Michael,
> 
> Le 20/12/2021 à 17:38, Christophe Leroy a écrit :
> > CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is required to test
> > direct tramp.
> > 
> > Signed-off-by: Christophe Leroy   
> 
> You didn't apply this patch when you merged the series. Without it I get 
> the following :

Maybe they wanted my acked-by.

But I'm working on a series to send to Linus. I can pick this patch up, as
it touches just my code.

-- Steve


Re: [PATCH v2 09/13] powerpc/ftrace: Implement CONFIG_DYNAMIC_FTRACE_WITH_ARGS

2022-02-15 Thread Steven Rostedt
On Tue, 15 Feb 2022 19:06:48 +0530
"Naveen N. Rao"  wrote:

> As I understand it, the reason ftrace_get_regs() was introduced was to 
> be able to only return the pt_regs, if _all_ registers were saved into 
> it, which we don't do when coming in through ftrace_caller(). See the 
> x86 implementation (commit 02a474ca266a47 ("ftrace/x86: Allow for 
> arguments to be passed in to ftrace_regs by default"), which returns 
> pt_regs conditionally.

I can give you the history of ftrace_caller and ftrace_regs_caller.

ftrace_caller saved just enough, as was required for gcc's mcount trampolines.
The new fentry, which happens at the start of the function (whereas mcount
happens after the stack frame is set up), may change the rules on some
architectures.

As for ftrace_regs_caller, that was created for kprobes. As the majority of
kprobes were added at the start of the function, it made sense to hook into
ftrace as the ftrace trampoline call is much faster than taking a
breakpoint interrupt. But to keep compatibility with breakpoint
interrupts, we needed to fill in all the registers, and make it act just
like a breakpoint interrupt.

I've been wanting to record function parameters, and because the ftrace
trampoline must at a minimum save the function parameters before calling
the ftrace callbacks, all the information for those parameters was being
saved but never exposed to the ftrace callbacks. I created the
DYNAMIC_FTRACE_WITH_ARGS to expose them. I first just used pt_regs with
just the parameters filled in, but that was criticized: a partially filled
pt_regs could be used where full registers are expected and mistaken for
legitimate ones. So I created ftrace_regs that would give you just the
function arguments (if DYNAMIC_FTRACE_WITH_ARGS is defined), or it will
give you a full pt_regs, if the caller came from the ftrace_regs_caller. If
not, it will give you a NULL pointer.

The first user to use the args was live kernel patching, as they only need
that and the return pointer.
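To make that concrete, a callback that copes with both cases looks roughly
like this (the callback name here is just an example):

static void my_callback(unsigned long ip, unsigned long parent_ip,
			struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	/*
	 * Only non-NULL if full registers were saved, i.e. the ops was
	 * registered with FTRACE_OPS_FL_SAVE_REGS and came in through
	 * ftrace_regs_caller.
	 */
	struct pt_regs *regs = ftrace_get_regs(fregs);

	if (regs) {
		/* full register state, kprobe/breakpoint style */
	} else {
		/* only the function arguments (and return address) are valid */
	}
}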

-- Steve


Re: [PATCH v2 11/13] powerpc/ftrace: directly call of function graph tracer by ftrace caller

2022-02-14 Thread Steven Rostedt
On Mon, 14 Feb 2022 22:54:23 +0530
"Naveen N. Rao"  wrote:

> For x86, commit 0c0593b45c9b4e ("x86/ftrace: Make function graph use 
> ftrace directly") also adds recursion check before the call to 
> function_graph_enter() in prepare_ftrace_return(). Do we need that on
> powerpc as well?

Yes. The function_graph_enter() does not provide any recursion protection,
so if it were to call something that gets function graph traced, it will
crash the machine.
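For reference, the guard on x86 looks roughly like this (a sketch from memory
of that commit; the powerpc version will differ in how it grabs the parent,
and return_to_handler is the arch's asm stub):

void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
			   unsigned long frame_pointer)
{
	unsigned long return_hooker = (unsigned long)&return_to_handler;
	int bit;

	if (unlikely(ftrace_graph_is_dead()))
		return;

	if (unlikely(atomic_read(&current->tracing_graph_pause)))
		return;

	/* Bail out instead of recursing back into the graph tracer */
	bit = ftrace_test_recursion_trylock(ip, *parent);
	if (bit < 0)
		return;

	if (!function_graph_enter(*parent, ip, frame_pointer, parent))
		*parent = return_hooker;

	ftrace_test_recursion_unlock(bit);
}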

-- Steve


Re: [RFC PATCH 2/3] powerpc/ftrace: Override ftrace_location_lookup() for MPROFILE_KERNEL

2022-02-10 Thread Steven Rostedt
On Thu, 10 Feb 2022 16:40:28 +
"Naveen N. Rao"  wrote:

> The other option is to mark ftrace_cmp_recs() as a __weak function, but 
> I have a vague recollection of you suggesting #ifdef rather than a 
> __weak function in the past. I might be mis-remembering, so if you think 
> making this a __weak function is better, I can do that.

No. If I wanted that I would have suggested it. I think this is the
prettiest of the ugly solutions out there ;-)

As I said, I can't think of a better solution, and we can go with this
until something else comes along.

-- Steve


Re: [RFC PATCH 2/3] powerpc/ftrace: Override ftrace_location_lookup() for MPROFILE_KERNEL

2022-02-10 Thread Steven Rostedt
On Thu, 10 Feb 2022 13:58:29 +
"Naveen N. Rao"  wrote:

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index f9feb197b2daaf..68f20cf34b0c47 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1510,6 +1510,7 @@ ftrace_ops_test(struct ftrace_ops *ops, unsigned long 
> ip, void *regs)
>   }
>  
>  
> +#ifndef ftrace_cmp_recs
>  static int ftrace_cmp_recs(const void *a, const void *b)
>  {
>   const struct dyn_ftrace *key = a;
> @@ -1521,6 +1522,7 @@ static int ftrace_cmp_recs(const void *a, const void *b)
>   return 1;
>   return 0;
>  }
> +#endif
>  

I don't really care for this part, as it seems somewhat ugly. But this
patch does appear to solve your issue, and I can't think of a prettier way
to do this.

So, I will reluctantly ack it.

If anything, please add a comment above saying that architectures may need
to override this function, and when doing so, they will define their own
ftrace_cmp_recs.

-- Steve


Re: [RFC PATCH 2/3] powerpc/ftrace: Override ftrace_location_lookup() for MPROFILE_KERNEL

2022-02-09 Thread Steven Rostedt
On Wed, 09 Feb 2022 17:50:09 +
"Naveen N. Rao"  wrote:

> However, I think we will not be able to use a fixed range.  I would like 
> to reserve instructions from function entry till the branch to 
> _mcount(), and it can be two or four instructions depending on whether a 
> function has a global entry point. For this, I am considering adding a 
> field in 'struct dyn_arch_ftrace', and a hook in ftrace_process_locs() 
> to initialize the same. I may need to override ftrace_cmp_recs().

Be careful about adding anything to dyn_arch_ftrace. powerpc already adds
the pointer to the module. Anything you add to that gets multiplied by
thousands of times (which takes up memory).
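To make the cost concrete, every traced call site gets one of these records
(layout roughly as in the kernels of that era, shown only for illustration):

/* arch/powerpc/include/asm/ftrace.h */
struct dyn_arch_ftrace {
	struct module *mod;	/* powerpc already pays this per record */
};

/* include/linux/ftrace.h */
struct dyn_ftrace {
	unsigned long		ip;	/* address of the mcount/nop site */
	unsigned long		flags;
	struct dyn_arch_ftrace	arch;	/* anything added here is multiplied
					 * by every one of those entries */
};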

At boot up you may see something like:

  ftrace: allocating 45363 entries in 178 pages

That's 45,363 dyn_arch_ftrace structures. And each module loads their own
as well. To see how many total you have after boot up:


  # cat /sys/kernel/tracing/dyn_ftrace_total_info 
55974 pages:295 groups: 89

That's from the same kernel. Another 10,000 entries were created by modules.
(This was for x86_64)

What you may be able to do, is to add a way to look at the already saved
kallsyms, which keeps track of the function entry and exit to know how to
map an address back to the function.

	kallsyms_lookup(addr, NULL, &offset, NULL, NULL);

Should give you the offset of addr from the start of the function.
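So instead of growing dyn_arch_ftrace, something along these lines could work
(untested sketch; the helper and the "range" policy are made up for
illustration):

static bool ip_within_function_prologue(unsigned long addr, unsigned long range)
{
	char name[KSYM_NAME_LEN];
	unsigned long size, offset;

	if (!kallsyms_lookup(addr, &size, &offset, NULL, name))
		return false;

	/* offset is addr's distance from the function entry */
	return offset < range;
}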

-- Steve


Re: [RFC PATCH 2/3] powerpc/ftrace: Override ftrace_location_lookup() for MPROFILE_KERNEL

2022-02-07 Thread Steven Rostedt
On Mon,  7 Feb 2022 12:37:21 +0530
"Naveen N. Rao"  wrote:

> --- a/arch/powerpc/kernel/trace/ftrace.c
> +++ b/arch/powerpc/kernel/trace/ftrace.c
> @@ -1137,3 +1137,14 @@ char *arch_ftrace_match_adjust(char *str, const char 
> *search)
>   return str;
>  }
>  #endif /* PPC64_ELF_ABI_v1 */
> +
> +#ifdef CONFIG_MPROFILE_KERNEL
> +unsigned long ftrace_location_lookup(unsigned long ip)
> +{
> + /*
> +  * Per livepatch.h, ftrace location is always within the first
> +  * 16 bytes of a function on powerpc with -mprofile-kernel.
> +  */
> + return ftrace_location_range(ip, ip + 16);

I think this is the wrong approach for the implementation and error prone.

> +}
> +#endif
> -- 

What I believe is a safer approach is to use the record address and add the
range to it.

That is, you know that the ftrace modification site is a range (multiple
instructions), where in the ftrace infrastructure, only one ip represents
that range. What you want is if you pass in an ip, and that ip is within
that range, you return the ip that represents that range to ftrace.

It looks like we need to change the compare function in the bsearch.

Perhaps add a new macro to define the size of the range to be searched,
instead of just using MCOUNT_INSN_SIZE? Then we may not even need this new
lookup function?

static int ftrace_cmp_recs(const void *a, const void *b)
{
	const struct dyn_ftrace *key = a;
	const struct dyn_ftrace *rec = b;

	if (key->flags < rec->ip)
		return -1;
	if (key->ip >= rec->ip + ARCH_IP_SIZE)
		return 1;
	return 0;
}

Where ARCH_IP_SIZE is defined to MCOUNT_INSN_SIZE by default, but an arch
could define it to something else, like 16.

Would that work for you, or am I missing something?
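The arch opt-in could then be as small as this (ARCH_IP_SIZE and the 16 byte
value are only illustrative, not an agreed-on interface):

/* arch/powerpc/include/asm/ftrace.h (with -mprofile-kernel) */
#define ARCH_IP_SIZE	16	/* ftrace site is within 16 bytes of entry */

/* kernel/trace/ftrace.c */
#ifndef ARCH_IP_SIZE
#define ARCH_IP_SIZE	MCOUNT_INSN_SIZE	/* default: a single site */
#endif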

-- Steve


Re: [PATCH] ftrace: Have architectures opt-in for mcount build time sorting

2022-01-28 Thread Steven Rostedt
On Fri, 28 Jan 2022 16:11:39 -0500
Joe Lawrence  wrote:

> The bisect finally landed on:
> 
>   72b3942a173c387b27860ba1069636726e208777 is the first bad commit
>   commit 72b3942a173c387b27860ba1069636726e208777
>   Author: Yinan Liu 
>   Date:   Sun Dec 12 19:33:58 2021 +0800
> 
>   scripts: ftrace - move the sort-processing in ftrace_init
> 
> and I can confirm that your updates today in "[for-linus][PATCH 00/10]
> tracing: Fixes for 5.17-rc1" fix or avoid the issue.  I just wanted to
> add my report in case this adds any future complications for mcount
> build time sorting.  Let me know if any additional tests would be
> helpful.

Thanks for letting me know. That patch set has already landed in Linus's
tree.


-- Steve


[PATCH v2] ftrace: Have architectures opt-in for mcount build time sorting

2022-01-27 Thread Steven Rostedt
From f7d4ef4e77464e08af38789ea5d3a9cfb80a3d78 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" 
Date: Tue, 25 Jan 2022 09:19:10 -0500
Subject: [PATCH] ftrace: Have architectures opt-in for mcount build time
 sorting

First S390 complained that the sorting of the mcount sections at build
time caused the kernel to crash on their architecture. Now PowerPC is
complaining about it too. And also ARM64 appears to be having issues.

It may be necessary to also update the relocation table for the values
in the mcount table. Not only do we have to sort the table, but also
update the relocations that may be applied to the items in the table.

If the system is not relocatable, then it is fine to sort, but if it is,
some architectures may have issues (although x86 does not as it shifts all
addresses the same).

Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
safe to do the sorting at build time.

Also update the config to compile in build time sorting in the sorttable
code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.

Link: 
https://lore.kernel.org/all/944d10da-8200-4ba9-8d0a-3bed9aa99...@linux.ibm.com/

Cc: Yinan Liu 
Cc: Ard Biesheuvel 
Cc: Kees Cook 
Reported-by: Sachin Sant 
Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland  [arm64]
Tested-by: Sachin Sant 
Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in 
ftrace_init")
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/all/20220127114249.03b1b...@gandalf.local.home/
 - Have CONFIG_BUILDTIME_MCOUNT_SORT depend on DYNAMIC_FTRACE
   otherwise it fails to build when DYNAMIC_FTRACE is not set
   because it can not find the mcount section in the sorttable code.

 arch/arm/Kconfig | 1 +
 arch/x86/Kconfig | 1 +
 kernel/trace/Kconfig | 8 +++-
 scripts/Makefile | 2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index c2724d986fa0..5256ebe57451 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -82,6 +82,7 @@ config ARM
select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
select HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
+   select HAVE_BUILDTIME_MCOUNT_SORT
select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
select HAVE_DMA_CONTIGUOUS if MMU
select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7399327d1eff..46080dea5dba 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -186,6 +186,7 @@ config X86
select HAVE_CONTEXT_TRACKING_OFFSTACK   if HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
select HAVE_OBJTOOL_MCOUNT  if STACK_VALIDATION
+   select HAVE_BUILDTIME_MCOUNT_SORT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 752ed89a293b..a5eb5e7fd624 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
help
  C version of recordmcount available?
 
+config HAVE_BUILDTIME_MCOUNT_SORT
+   bool
+   help
+ An architecture selects this if it sorts the mcount_loc section
+at build time.
+
 config BUILDTIME_MCOUNT_SORT
bool
default y
-   depends on BUILDTIME_TABLE_SORT && !S390
+   depends on HAVE_BUILDTIME_MCOUNT_SORT && DYNAMIC_FTRACE
help
  Sort the mcount_loc section at build time.
 
diff --git a/scripts/Makefile b/scripts/Makefile
index b082d2f93357..cedc1f0e21d8 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -32,7 +32,7 @@ HOSTCFLAGS_sorttable.o += -I$(srctree)/tools/arch/x86/include
 HOSTCFLAGS_sorttable.o += -DUNWINDER_ORC_ENABLED
 endif
 
-ifdef CONFIG_DYNAMIC_FTRACE
+ifdef CONFIG_BUILDTIME_MCOUNT_SORT
 HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
 endif
 
-- 
2.33.0



[PATCH] ftrace: Have architectures opt-in for mcount build time sorting

2022-01-27 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

First S390 complained that the sorting of the mcount sections at build
time caused the kernel to crash on their architecture. Now PowerPC is
complaining about it too. And also ARM64 appears to be having issues.

It may be necessary to also update the relocation table for the values
in the mcount table. Not only do we have to sort the table, but also
update the relocations that may be applied to the items in the table.

If the system is not relocatable, then it is fine to sort, but if it is,
some architectures may have issues (although x86 does not as it shifts all
addresses the same).

Add a HAVE_BUILDTIME_MCOUNT_SORT that an architecture can set to say it is
safe to do the sorting at build time.

Also update the config to compile in build time sorting in the sorttable
code in scripts/ to depend on CONFIG_BUILDTIME_MCOUNT_SORT.

Link: 
https://lore.kernel.org/all/944d10da-8200-4ba9-8d0a-3bed9aa99...@linux.ibm.com/

Cc: Mark Rutland 
Cc: Yinan Liu 
Cc: Ard Biesheuvel 
Cc: Kees Cook 
Cc: linuxppc-dev@lists.ozlabs.org
Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Fixes: 72b3942a173c ("scripts: ftrace - move the sort-processing in 
ftrace_init")
Signed-off-by: Steven Rostedt (Google) 
---
 arch/arm/Kconfig | 1 +
 arch/x86/Kconfig | 1 +
 kernel/trace/Kconfig | 8 +++-
 scripts/Makefile | 2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index c2724d986fa0..5256ebe57451 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -82,6 +82,7 @@ config ARM
select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
select HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
+   select HAVE_BUILDTIME_MCOUNT_SORT
select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
select HAVE_DMA_CONTIGUOUS if MMU
select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7399327d1eff..46080dea5dba 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -186,6 +186,7 @@ config X86
select HAVE_CONTEXT_TRACKING_OFFSTACK   if HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
select HAVE_OBJTOOL_MCOUNT  if STACK_VALIDATION
+   select HAVE_BUILDTIME_MCOUNT_SORT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 752ed89a293b..7e5b92090faa 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
help
  C version of recordmcount available?
 
+config HAVE_BUILDTIME_MCOUNT_SORT
+   bool
+   help
+ An architecture selects this if it sorts the mcount_loc section
+at build time.
+
 config BUILDTIME_MCOUNT_SORT
bool
default y
-   depends on BUILDTIME_TABLE_SORT && !S390
+   depends on HAVE_BUILDTIME_MCOUNT_SORT
help
  Sort the mcount_loc section at build time.
 
diff --git a/scripts/Makefile b/scripts/Makefile
index b082d2f93357..cedc1f0e21d8 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -32,7 +32,7 @@ HOSTCFLAGS_sorttable.o += -I$(srctree)/tools/arch/x86/include
 HOSTCFLAGS_sorttable.o += -DUNWINDER_ORC_ENABLED
 endif
 
-ifdef CONFIG_DYNAMIC_FTRACE
+ifdef CONFIG_BUILDTIME_MCOUNT_SORT
 HOSTCFLAGS_sorttable.o += -DMCOUNT_SORT_ENABLED
 endif
 
-- 
2.33.0



Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Steven Rostedt
On Thu, 27 Jan 2022 13:33:02 +
Mark Rutland  wrote:

> I want to get the regression fixed ASAP, so can we take a simple patch for 
> -rc2
> which disables the build-time sort where it's currently broken (by limiting 
> the
> opt-in to arm and x86), then follow-up per-architecture to re-enable it if
> desired/safe?

I'm going to retest my patch that makes it an opt in for just x86 and arm
(32bit). I'll be pushing that hopefully later today. I have some other
patches to test as well.

-- Steve


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-27 Thread Steven Rostedt
On Thu, 27 Jan 2022 12:27:04 +
Mark Rutland  wrote:

> Ah, so those non-ELF relocations for the mcount_loc table just mean "apply the
> KASLR offset here", which is equivalent for all entries.
> 
> That makes sense, thanks!

And this is why we were having such a hard time understanding each other ;-)

I started a new project called "shelf", which is a shell interface to
read ELF files (Shelf on an ELF!).

It uses my ccli library:

   https://github.com/rostedt/libccli

and can be found here:

   https://github.com/rostedt/shelf

Build and install the latest libccli and then build this with just
"make".

  $ shelf vmlinux

and then you can see what is stored in the mcount location:

  shelf> dump symbol __start_mcount_loc - __stop_mcount_loc

I plan on adding more to include the REL and RELA sections and show how
they affect symbols and such.

Feel free to contribute too ;-)

-- Steve


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-25 Thread Steven Rostedt
On Tue, 25 Jan 2022 09:30:51 +0530
Sachin Sant  wrote:

> Tested-by: Sachin Sant 

Thanks, I'll start running it through my tests and send it to Linus later
today or tomorrow.

-- Steve


Re: [powerpc] ftrace warning kernel/trace/ftrace.c:2068 with code-patching selftests

2022-01-24 Thread Steven Rostedt
On Mon, 24 Jan 2022 20:15:06 +0800
Yinan Liu  wrote:

> Hi, Steven and Sachin
> 
> I don't have a powerpc machine for testing, I guess the ppc has a 
> similar problem with the s390. It's not clear to me why the compiler 
> does this. Maybe we can handle ppc like you did with the s390 before, 
> but I'm not sure if other architectures have similar issues. Or limit 
> BUILDTIME_MCOUNT_SORT to a smaller scope and make it only available for 
> x86 and arm?
> 
> steven, what's your opinion?

Yeah, I think it's time to opt in, instead of opting out.

Something like this:

-- Steve

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index c2724d986fa0..5256ebe57451 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -82,6 +82,7 @@ config ARM
select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
select HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
+   select HAVE_BUILDTIME_MCOUNT_SORT
select HAVE_DEBUG_KMEMLEAK if !XIP_KERNEL
select HAVE_DMA_CONTIGUOUS if MMU
select HAVE_DYNAMIC_FTRACE if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4207cf9bb17..7996548b2b27 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -166,6 +166,7 @@ config ARM64
select HAVE_ASM_MODVERSIONS
select HAVE_EBPF_JIT
select HAVE_C_RECORDMCOUNT
+   select HAVE_BUILDTIME_MCOUNT_SORT
select HAVE_CMPXCHG_DOUBLE
select HAVE_CMPXCHG_LOCAL
select HAVE_CONTEXT_TRACKING
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7399327d1eff..46080dea5dba 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -186,6 +186,7 @@ config X86
select HAVE_CONTEXT_TRACKING_OFFSTACK   if HAVE_CONTEXT_TRACKING
select HAVE_C_RECORDMCOUNT
select HAVE_OBJTOOL_MCOUNT  if STACK_VALIDATION
+   select HAVE_BUILDTIME_MCOUNT_SORT
select HAVE_DEBUG_KMEMLEAK
select HAVE_DMA_CONTIGUOUS
select HAVE_DYNAMIC_FTRACE
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 752ed89a293b..7e5b92090faa 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -70,10 +70,16 @@ config HAVE_C_RECORDMCOUNT
help
  C version of recordmcount available?
 
+config HAVE_BUILDTIME_MCOUNT_SORT
+   bool
+   help
+ An architecture selects this if it sorts the mcount_loc section
+at build time.
+
 config BUILDTIME_MCOUNT_SORT
bool
default y
-   depends on BUILDTIME_TABLE_SORT && !S390
+   depends on HAVE_BUILDTIME_MCOUNT_SORT
help
  Sort the mcount_loc section at build time.
 


Re: [PATCH v1 0/5] Implement livepatch on PPC32

2021-12-14 Thread Steven Rostedt
On Tue, 14 Dec 2021 08:35:14 +0100
Christophe Leroy  wrote:

> > Will continue investigating.
> >   
> 
> trace_selftest_startup_function_graph() calls register_ftrace_direct() 
> which returns -ENOSUPP because powerpc doesn't select 
> CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS.
> 
> Should TEST_DIRECT_TRAMP depend on CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS ?

Yes, that should be:

#if defined(CONFIG_DYNAMIC_FTRACE) && \
defined(CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS)
#define TEST_DIRECT_TRAMP
noinline __noclone static void trace_direct_tramp(void) { }
#endif


And make it test it with or without the args.

Thanks for finding this.

-- Steve


Re: [PATCH v1 0/5] Implement livepatch on PPC32

2021-12-13 Thread Steven Rostedt
On Mon, 13 Dec 2021 19:33:47 +
Christophe Leroy  wrote:

> STill the same Oops, below

Unfortunately, I don't have a PPC machine (32 nor 64 bit) to help debug
this.


> I will look more closely tomorrow.

OK, thanks.

-- Steve


Re: [PATCH v1 0/5] Implement livepatch on PPC32

2021-12-13 Thread Steven Rostedt
On Mon, 13 Dec 2021 17:50:52 +
Christophe Leroy  wrote:

> @@ -958,6 +942,12 @@ unsigned long prepare_ftrace_return(unsigned long 
> parent, unsigned long ip,
>   out:
>   return parent;
>   }
> +
> +void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
> +struct ftrace_ops *op, struct ftrace_regs *fregs)
> +{
> +	prepare_ftrace_return(ip, kernel_stack_pointer(&fregs->regs), 0);
> +}

I have for powerpc prepare_ftrace_return as:


unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
				    unsigned long sp)
{
	unsigned long return_hooker;

	if (unlikely(ftrace_graph_is_dead()))
		goto out;

	if (unlikely(atomic_read(&current->tracing_graph_pause)))
		goto out;

	return_hooker = ppc_function_entry(return_to_handler);

	if (!function_graph_enter(parent, ip, 0, (unsigned long *)sp))
		parent = return_hooker;
out:
	return parent;
}

Which means you'll need different parameters to it than what x86 has, which
has the prototype of:

void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
   unsigned long frame_pointer)

and it does not use the frame_pointer for this case, which is why it is
zero.

For powerpc though, it uses the stack pointer, so your parameters are
incorrect. Looks like it should be:

	prepare_ftrace_return(parent_ip, ip, kernel_stack_pointer(&fregs->regs));

And that will likely not be enough. I'll need to update the ctr register,
as that is where the return address is saved. So you'll probably need it to be:

void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
		       struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	unsigned long parent;

	parent = prepare_ftrace_return(parent_ip, ip,
				       kernel_stack_pointer(&fregs->regs));
	fregs->regs.ctr = parent;
}



-- Steve


Re: [PATCH v1 0/5] Implement livepatch on PPC32

2021-12-13 Thread Steven Rostedt
On Mon, 13 Dec 2021 17:30:48 +
Christophe Leroy  wrote:

> Thanks, I will try that.
> 
> I can't find ftrace_graph_func() in s390. Does it mean that s390 doesn't 
> have a working function tracer anymore ?
> 
> I see your commit 0c0593b45c9b4 ("x86/ftrace: Make function graph use 
> ftrace directly") is dated 8 Oct 2021 while 5740a7c71ab6 ("s390/ftrace: 
> add HAVE_DYNAMIC_FTRACE_WITH_ARGS support") is 4 Oct 2021.

Hmm, maybe not. I can't test it.

This needs to be fixed if that's the case.

Thanks for bringing it up!

-- Steve


Re: [PATCH v1 0/5] Implement livepatch on PPC32

2021-12-13 Thread Steven Rostedt
On Mon, 13 Dec 2021 14:39:15 +
Christophe Leroy  wrote:

> > Note, you can implement this first, (I looked over the patches and they
> > seem fine) and then update both ppc64 and ppc32 to implement
> > DYNAMIC_FTRACE_WITH_ARGS.
> >   
> 
> I tried to activate DYNAMIC_FTRACE_WITH_ARGS on PPC32.
> 
> I copied into powerpc the changes from 5740a7c71ab6 ("s390/ftrace: add 
> HAVE_DYNAMIC_FTRACE_WITH_ARGS support")
> 
> Ftrace selftests tell "Testing tracer function_graph: FAILED!".
> 
> Is there anything else to do ?

Yes. Because BPF is now hooking into the function callbacks, it causes
issues with the function graph tracer. So what we did was have function
graph tracing use the function tracer callback as well (this allows
the BPF direct trampolines to work alongside the function graph tracer).

As it requires DYNAMIC_FTRACE_WITH_ARGS, and x86 was the only one to
support that for now, I decided to make all the archs change function graph
tracing when they implement DYNAMIC_FTRACE_WITH_ARGS too. (It is becoming a
pain to have too many variants of function tracing between the archs).

The change that did this for x86 was:

0c0593b45c9b4 ("x86/ftrace: Make function graph use ftrace directly")

This actually simplifies the function graph tracer, as you no longer need
its own entry trampoline (still need the trampoline for the return of the
function).

What you need to do is:

In your arch/*/include/asm/ftrace.h add:

struct ftrace_ops;

#define ftrace_graph_func ftrace_graph_func
void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
  struct ftrace_ops *op, struct ftrace_regs *fregs);


Where ftrace_graph_func() is now what is called for the function graph
tracer, directly from the ftrace callbacks (no longer a secondary
trampoline).

Define the ftrace_graph_func() to be something like:

void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
		       struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	struct pt_regs *regs = &fregs->regs;
	unsigned long *stack = (unsigned long *)kernel_stack_pointer(regs);

	prepare_ftrace_return(ip, (unsigned long *)stack, 0);
}

This is called by the function tracer code. But because with
DYNAMIC_FTRACE_WITH_ARGS, we have access to the argument register, we should
also have access to the link register and the stack. Then you can use that
to modify the stack and/or link register to jump to the return
trampoline.

This should all work with powerpc (both 64 and 32) but if it does not, let
me know. I'm happy to help out.

-- Steve


Re: [PATCH] recordmcount: Support empty section from recent binutils

2021-11-29 Thread Steven Rostedt
On Fri, 26 Nov 2021 08:43:23 +
LEROY Christophe  wrote:

> Le 24/11/2021 à 15:43, Christophe Leroy a écrit :
> > Looks like recent binutils (2.36 and over ?) may empty some section,
> > leading to failure like:
> > 
> > Cannot find symbol for section 11: .text.unlikely.
> > kernel/kexec_file.o: failed
> > make[1]: *** [scripts/Makefile.build:287: kernel/kexec_file.o] Error 1
> > 
> > In order to avoid that, ensure that the section has a content before
> > returning it's name in has_rel_mcount().  
> 
> This patch doesn't work, on PPC32 I get the following message with this 
> patch applied:
> 
> [0.00] ftrace: No functions to be traced?
> 
> Without the patch I get:
> 
> [0.00] ftrace: allocating 22381 entries in 66 pages
> [0.00] ftrace: allocated 66 pages with 2 groups

Because of this report, I have not applied this patch (even though I was
about to push it to Linus).

I'm pulling it from my queue until this gets resolved.

Thanks,

-- Steve


Re: [PATCH v2 5/5] ftrace: Use preemption model accessors for trace header printout

2021-11-10 Thread Steven Rostedt
On Wed, 10 Nov 2021 20:24:48 +
Valentin Schneider  wrote:

> Per PREEMPT_DYNAMIC, checking CONFIG_PREEMPT doesn't tell you the actual
> preemption model of the live kernel. Use the newly-introduced accessors
> instead.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/trace/trace.c | 14 --
>  1 file changed, 4 insertions(+), 10 deletions(-)

Reviewed-by: Steven Rostedt (VMware) 

-- Steve


Re: [PATCH v0 00/42] notifiers: Return an error when callback is already registered

2021-11-08 Thread Steven Rostedt
On Mon, 8 Nov 2021 15:35:50 +0100
Borislav Petkov  wrote:

> On Mon, Nov 08, 2021 at 03:24:39PM +0100, Borislav Petkov wrote:
> > I guess I can add another indirection to notifier_chain_register() and
> > avoid touching all the call sites.  
> 
> IOW, something like this below.
> 
> This way I won't have to touch all the callsites and the registration
> routines would still return a proper value instead of returning 0
> unconditionally.

I prefer this method.

Question, how often does this warning trigger? Is it common to see in
development?

-- Steve


Re: [PATCH v1 0/5] Implement livepatch on PPC32

2021-10-28 Thread Steven Rostedt
On Thu, 28 Oct 2021 14:24:00 +0200
Christophe Leroy  wrote:

> This series implements livepatch on PPC32.
> 
> This is largely copied from what's done on PPC64.
> 
> Christophe Leroy (5):
>   livepatch: Fix build failure on 32 bits processors
>   powerpc/ftrace: No need to read LR from stack in _mcount()
>   powerpc/ftrace: Add module_trampoline_target() for PPC32
>   powerpc/ftrace: Activate HAVE_DYNAMIC_FTRACE_WITH_REGS on PPC32
>   powerpc/ftrace: Add support for livepatch to PPC32
> 
>  arch/powerpc/Kconfig  |   2 +-
>  arch/powerpc/include/asm/livepatch.h  |   4 +-
>  arch/powerpc/kernel/module_32.c   |  33 +
>  arch/powerpc/kernel/trace/ftrace.c|  53 +++-
>  arch/powerpc/kernel/trace/ftrace_32.S | 187 --
>  kernel/livepatch/core.c   |   4 +-
>  6 files changed, 230 insertions(+), 53 deletions(-)
> 

This is great that you are doing this, but I wonder if it would even be
easier, and more efficient, if you could implement
HAVE_DYNAMIC_FTRACE_WITH_ARGS?

Then you don't need to save all regs for live kernel patching. And I am
also working on function tracing with arguments with this too.

That is, to call a generic ftrace callback, you need to save all the args
that are stored in registers to prevent the callback from clobbering them.
As live kernel patching only needs to have the arguments of the functions,
you save time from having to save the other regs as well.

The callbacks now have "struct ftrace_regs" instead of pt_regs, because it
will allow non ftrace_regs_caller functions to access the arguments if it
is supported.

Look at how x86_64 implements this. It should be possible to do this for
all other archs as well.

Also note, by doing this, we can then get rid of the ftrace_graph_caller,
and have function graph tracer be a function tracing callback, as it will
allow ftrace_graph_caller to have access to the stack and the return as
well.

If you need any more help or information to do this, I'd be happy to assist
you.

Note, you can implement this first, (I looked over the patches and they
seem fine) and then update both ppc64 and ppc32 to implement
DYNAMIC_FTRACE_WITH_ARGS.

Cheers,

-- Steve


Re: [PATCH v6 1/2] ftrace: disable preemption when recursion locked

2021-10-26 Thread Steven Rostedt
On Wed, 27 Oct 2021 10:34:13 +0800
王贇  wrote:

> +/*
> + * Preemption will be enabled (if it was previously enabled).
> + */
>  static __always_inline void trace_clear_recursion(int bit)
>  {
> + WARN_ON_ONCE(bit < 0);

Can you send a v7 without the WARN_ON.

This is an extremely hot path, and this will cause noticeable overhead.

If something were to call this with bit < 0, then it would crash and
burn rather quickly.

-- Steve


> +
> + preempt_enable_notrace();
>   barrier();
>   trace_recursion_clear(bit);
>  }


Re: [PATCH v5 1/2] ftrace: disable preemption when recursion locked

2021-10-26 Thread Steven Rostedt
On Wed, 27 Oct 2021 09:54:13 +0800
王贇  wrote:

> My apologize for the stupid comments... I'll send a v6 for this patch
> only to fix that, please let me know if this is not a good way to fix
> few lines of comments.

Actually, please resend both patches, as a new patch set, on its own thread.

Just replying here won't trigger my patchwork scripts.

And also, if you don't include the other patch, the scripts will drop
it.

Thanks,

-- Steve


Re: [PATCH v5 1/2] ftrace: disable preemption when recursion locked

2021-10-26 Thread Steven Rostedt
On Tue, 26 Oct 2021 17:48:10 +0800
王贇  wrote:

> > The two comments should be updated too since Steven removed the "bit == 0" 
> > trick.  
> 
> Could you please give more hint on how will it be correct?
> 
> I get the point that bit will no longer be 0, there are only -1 or > 0 now
> so trace_test_and_set_recursion() will disable preemption on bit > 0 and
> trace_clear_recursion() will enabled it since it should only be called when
> bit > 0 (I remember we could use a WARN_ON here now :-P).
> 
> >   
> >> @@ -178,7 +187,7 @@ static __always_inline void trace_clear_recursion(int 
> >> bit)
> >>   * tracing recursed in the same context (normal vs interrupt),
> >>   *
> >>   * Returns: -1 if a recursion happened.
> >> - *   >= 0 if no recursion
> >> + *   > 0 if no recursion.
> >>   */
> >>  static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
> >> unsigned long 
> >> parent_ip)  
> > 
> > And this change would not be correct now.  
> 
> I thought it will no longer return 0 so I change it to > 0, isn't that 
> correct?

No it is not. I removed the bit + 1 return value, which means it returns the
actual bit now. Which is 0 or more.

-- Steve


Re: [PATCH v4 0/2] fix & prevent the missing preemption disabling

2021-10-25 Thread Steven Rostedt
On Tue, 26 Oct 2021 10:09:12 +0800
王贇  wrote:

> Just a ping, to see if there are any more comments :-P

I guess you missed what went into mainline (and your name found a bug
in my perl script for importing patches ;-)

  https://lore.kernel.org/all/20211019091344.65629...@gandalf.local.home/

Which means patch 1 needs to change:

> + /*
> +  * Disable preemption to fulfill the promise.
> +  *
> +  * Don't worry about the bit 0 cases, they indicate
> +  * the disabling behaviour has already been done by
> +  * internal call previously.
> +  */
> + preempt_disable_notrace();
> +
>   return bit + 1;
>  }
> 
> +/*
> + * Preemption will be enabled (if it was previously enabled).
> + */
>  static __always_inline void trace_clear_recursion(int bit)
>  {
>   if (!bit)
>   return;
> 
> + if (bit > 0)
> + preempt_enable_notrace();
> +

Where this wont work anymore.

Need to preempt disable and enable always.

-- Steve


>   barrier();
>   bit--;
>   trace_recursion_clear(bit);
> @@ -209,7 +227,7 @@ static __always_inline void trace_clear_recursion(int bit)
>   * tracing recursed in the same context (normal vs interrupt),
>   *


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-19 Thread Steven Rostedt
On Tue, 19 Oct 2021 08:41:23 +0200
Petr Mladek  wrote:

> Feel free to postpone this change. I do not want to complicate
> upstreaming the fix for stable. I am sorry if I already
> complicated it.
> 

No problem. It's not that complicated of a merge fix. I'm sure Linus can
handle it ;-)

-- Steve


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-18 Thread Steven Rostedt
On Mon, 18 Oct 2021 12:19:20 +0200
Petr Mladek  wrote:

> > -
> > bit = trace_get_context_bit() + start;
> > if (unlikely(val & (1 << bit))) {
> > /*
> >  * It could be that preempt_count has not been updated during
> >  * a switch between contexts. Allow for a single recursion.
> >  */
> > -   bit = TRACE_TRANSITION_BIT;
> > +   bit = TRACE_CTX_TRANSITION + start;  
>

[..]

> Could we please update the comment? I mean to say if it is a race
> or if we trace a function that should not get traced.

What do you think of this change?

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index 1d8cce02c3fb..24f284eb55a7 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -168,8 +168,12 @@ static __always_inline int 
trace_test_and_set_recursion(unsigned long ip, unsign
bit = trace_get_context_bit() + start;
if (unlikely(val & (1 << bit))) {
/*
-* It could be that preempt_count has not been updated during
-* a switch between contexts. Allow for a single recursion.
+* If an interrupt occurs during a trace, and another trace
+* happens in that interrupt but before the preempt_count is
+* updated to reflect the new interrupt context, then this
+* will think a recursion occurred, and the event will be 
dropped.
+* Let a single instance happen via the TRANSITION_BIT to
+* not drop those events.
 */
bit = TRACE_TRANSITION_BIT;
if (val & (1 << bit)) {


-- Steve


Re: [PATCH v2] tracing: Have all levels of checks prevent recursion

2021-10-18 Thread Steven Rostedt
On Mon, 18 Oct 2021 15:44:12 -0400
Steven Rostedt  (by way of Steven Rostedt
) wrote:

> [
>Linus,
>  I have patches that clean this up that are not marked for stable, but
>  will depend on this patch. As I already have commits in my next queue,
>  I can do one of the following:
> 
> 1. Cherry pick this from my urgent tree, and build everything on top.
> 2. Add this on top of the mainline tag my next branch is based on and
>merge it.
> 3. Add this to my next branch, and have it go in at the next merge
>window.

Hmm, I take this back. Although the clean up affects the same code block,
the updates don't actually conflict. (Although, if I do update the comment
that Petr asked about, that will conflict. But nothing you can't handle ;-)

I'll start running this change through my tests and post it separately.

-- Steve



> ]



[PATCH v2] tracing: Have all levels of checks prevent recursion

2021-10-18 Thread Steven Rostedt
[
   Linus,
 I have patches that clean this up that are not marked for stable, but
 will depend on this patch. As I already have commits in my next queue,
 I can do one of the following:

1. Cherry pick this from my urgent tree, and build everything on top.
2. Add this on top of the mainline tag my next branch is based on and
   merge it.
3. Add this to my next branch, and have it go in at the next merge
   window.
]

From: "Steven Rostedt (VMware)" 

While writing an email explaining the "bit = 0" logic for a discussion on
making ftrace_test_recursion_trylock() disable preemption, I discovered a
path that makes the "not do the logic if bit is zero" unsafe.

The recursion logic is done in hot paths like the function tracer. Thus,
any code executed causes noticeable overhead. Thus, tricks are done to try
to limit the amount of code executed. This included the recursion testing
logic.

Having recursion testing is important, as there are many paths that can
end up in an infinite recursion cycle when tracing every function in the
kernel. Thus protection is needed to prevent that from happening.

Because it is OK to recurse due to different running context levels (e.g.
an interrupt preempts a trace, and then a trace occurs in the interrupt
handler), a set of bits are used to know which context one is in (normal,
softirq, irq and NMI). If a recursion occurs in the same level, it is
prevented*.

Then there are infrastructure levels of recursion as well. When more than
one callback is attached to the same function to trace, it calls a loop
function to iterate over all the callbacks. Both the callbacks and the
loop function have recursion protection. The callbacks use the
"ftrace_test_recursion_trylock()" which has a "function" set of context
bits to test, and the loop function calls the internal
trace_test_and_set_recursion() directly, with an "internal" set of bits.

If an architecture does not implement all the features supported by ftrace
then the callbacks are never called directly, and the loop function is
called instead, which will implement the features of ftrace.

Since both the loop function and the callbacks do recursion protection, it
seemed unnecessary to do it in both locations. Thus, a trick was made
to have the internal set of recursion bits at a more significant bit
location than the function bits. Then, if any of the higher bits were set,
the logic of the function bits could be skipped, as any new recursion
would first have to go through the loop function.

This is true for architectures that do not support all the ftrace
features, because all functions being traced must first go through the
loop function before going to the callbacks. But this is not true for
architectures that support all the ftrace features. That's because the
loop function could be called due to two callbacks attached to the same
function, but then a recursion function inside the callback could be
called that does not share any other callback, and it will be called
directly.

i.e.

 traced_function_1: [ more than one callback tracing it ]
   call loop_func

 loop_func:
   trace_recursion set internal bit
   call callback

 callback:
   trace_recursion [ skipped because internal bit is set, return 0 ]
   call traced_function_2

 traced_function_2: [ only traced by above callback ]
   call callback

 callback:
   trace_recursion [ skipped because internal bit is set, return 0 ]
   call traced_function_2

 [ wash, rinse, repeat, BOOM! out of shampoo! ]

Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
called for all functions.

Since we want to encourage architectures to implement all ftrace features,
having them slow down due to this extra logic may encourage the
maintainers to update to the latest ftrace features. And because this
logic is only safe for them, remove it completely.

 [*] There is one layer of recursion that is allowed, and that is to allow
 for the transition between interrupt context (normal -> softirq ->
 irq -> NMI), because a trace may occur before the context update is
 visible to the trace recursion logic.

Link: 
https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d...@linux.alibaba.com/

Cc: sta...@vger.kernel.org
Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) 
---
Changes since v1:
   - Removed "max" parameter (Petr Mladek)
 Petr, I did not update the comment, because that can/should be a
 separate patch, as the comment was there before this patch.

 include/linux/trace_recursion.h | 49 ++---
 kernel/trace/ftrace.c   |  4 +--
 2 files changed, 11 insertions(+), 42 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index a9f9c5714e65..fe95f0922526 100644
--- a/include/linux/tr

Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-18 Thread Steven Rostedt
On Mon, 18 Oct 2021 12:19:20 +0200
Petr Mladek  wrote:

> On Fri 2021-10-15 11:00:35, Steven Rostedt wrote:
> > From: "Steven Rostedt (VMware)" 
> > 
> > While writing an email explaining the "bit = 0" logic for a discussion on
> > making ftrace_test_recursion_trylock() disable preemption, I discovered a
> > path that makes the "not do the logic if bit is zero" unsafe.
> > 
> > Since we want to encourage architectures to implement all ftrace features,
> > having them slow down due to this extra logic may encourage the
> > maintainers to update to the latest ftrace features. And because this
> > logic is only safe for them, remove it completely.
> > 
> >  [*] There is on layer of recursion that is allowed, and that is to allow
> >  for the transition between interrupt context (normal -> softirq ->
> >  irq -> NMI), because a trace may occur before the context update is
> >  visible to the trace recursion logic.
> > 
> > diff --git a/include/linux/trace_recursion.h 
> > b/include/linux/trace_recursion.h
> > index a9f9c5714e65..168fdf07419a 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -165,40 +147,29 @@ static __always_inline int 
> > trace_test_and_set_recursion(unsigned long ip, unsign
> > unsigned int val = READ_ONCE(current->trace_recursion);
> > int bit;
> >  
> > -   /* A previous recursion check was made */
> > -   if ((val & TRACE_CONTEXT_MASK) > max)
> > -   return 0;  
> 
> @max parameter is no longer used.

Thanks for noticing!

> 
> > -
> > bit = trace_get_context_bit() + start;
> > if (unlikely(val & (1 << bit))) {
> > /*
> >  * It could be that preempt_count has not been updated during
> >  * a switch between contexts. Allow for a single recursion.
> >  */
> > -   bit = TRACE_TRANSITION_BIT;
> > +   bit = TRACE_CTX_TRANSITION + start;  
> 
> I just want to be sure that I understand it correctly.
> 
> The transition bit is the same for all contexts. It will allow one
> recursion only in one context.

Right.

> 
> IMHO, the same problem (not-yet-updated preempt_count) might happen
> in any transition between contexts: normal -> soft IRQ -> IRQ -> NMI.

Yes, and then we will drop the event if it happens twice, otherwise, we
will need to have a 4 layer transition bit mask, and allow 4 recursions,
which is more than I want to allow.


> 
> Well, I am not sure what exacly it means "preempt_count has not been
> updated during a switch between contexts."
> 
>Is it that a function in the interrupt entry code is traced before
>preempt_count is updated?
> 
>Or that an interrupt entry is interrupted by a higher level
>interrupt, e.g. IRQ entry code interrupted by NMI?

Both actually ;-)

There are places that can trigger a trace between the time the interrupt is
triggered and the time preempt_count is updated to reflect the new interrupt
context. Thus the tracer will still think it is in the previous context. But
that is OK, unless that interrupt happened while the previous context was
in the middle of tracing:

trace() {
  context = get_context(preempt_count);
  test_and_set_bit(context)
  <<--- interrupt --->>>
  trace() {
  context = get_context(preempt_count);
  test_and_set_bit(context); <-- detects recursion!
  }
  update_preempt_count(irq_context);

By allowing a single recursion, it still does the above trace.

Yes, if an NMI happens during that first interrupt trace, and it too traces
before the preempt_count is updated, it will detect it as a recursion.

But this can only happen for that one trace. After the trace returns, the
transition bit is cleared, and the tracing that happens in the rest of the
interrupt is using the proper context. Thus, for an event to be dropped
because two transition bits would be needed, an interrupt would have to
trigger during a trace, itself trace something before its preempt_count
update, and then be interrupted by something (an NMI) that also traces
before its own preempt_count update.

Although, I think we were able to remove all the NMI tracing before the
update, there's a game of whack-a-mole to handle the interrupt cases.

> 
> 
> I guess that it is the first case. It would mean that the above
> function allows one mistake (one traced funtion in an interrupt entry
> code before the entry code updates preempt_count).
> 
> Do I get it correctly?
> Is this what we want?

Pretty much, which my above explanation to hopefully fill in the details.

> 
> 
&

Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Steven Rostedt
On Fri, 15 Oct 2021 20:24:59 +0200
Peter Zijlstra  wrote:

> > @@ -206,11 +206,7 @@ DEFINE_OUTPUT_COPY(__output_copy_user, 
> > arch_perf_out_copy_user)
> >  static inline int get_recursion_context(int *recursion)
> >  {
> > unsigned int pc = preempt_count();  
> 
> Although I think we can do without that ^ line as well :-)

Ah, I could have sworn I deleted it. Oh well, will make a proper patch set.

Thanks!

-- Steve


> 
> > -   unsigned char rctx = 0;


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Steven Rostedt
On Fri, 15 Oct 2021 14:20:33 -0400
Steven Rostedt  wrote:

> > I think having one copy of that in a header is better than having 3
> > copies. But yes, something along them lines.  
> 
> I was just about to ask you about this patch ;-)

Except it doesn't build :-p (need to move the inlined function down a bit)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 4d244e295e85..b32e3dabe28b 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -77,6 +77,27 @@
 /* preempt_count() and related functions, depends on PREEMPT_NEED_RESCHED */
 #include 
 
+/**
+ * interrupt_context_level - return interrupt context level
+ *
+ * Returns the current interrupt context level.
+ *  0 - normal context
+ *  1 - softirq context
+ *  2 - hardirq context
+ *  3 - NMI context
+ */
+static __always_inline unsigned char interrupt_context_level(void)
+{
+   unsigned long pc = preempt_count();
+   unsigned char level = 0;
+
+   level += !!(pc & (NMI_MASK));
+   level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
+   level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+
+   return level;
+}
+
 #define nmi_count()(preempt_count() & NMI_MASK)
 #define hardirq_count()(preempt_count() & HARDIRQ_MASK)
 #ifdef CONFIG_PREEMPT_RT
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index 41f5bfd9e93f..018a04381556 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -118,12 +118,7 @@ enum {
 
 static __always_inline int trace_get_context_bit(void)
 {
-   unsigned long pc = preempt_count();
-   unsigned char bit = 0;
-
-   bit += !!(pc & (NMI_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+   unsigned char bit = interrupt_context_level();
 
return TRACE_CTX_NORMAL - bit;
 }
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 228801e20788..c91711f20cf8 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -206,11 +206,7 @@ DEFINE_OUTPUT_COPY(__output_copy_user, 
arch_perf_out_copy_user)
 static inline int get_recursion_context(int *recursion)
 {
unsigned int pc = preempt_count();
-   unsigned char rctx = 0;
-
-   rctx += !!(pc & (NMI_MASK));
-   rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
-   rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+   unsigned char rctx = interrupt_context_level();
 
if (recursion[rctx])
return -1;
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 15d4380006e3..f6520d0a4c8c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3167,12 +3167,7 @@ static __always_inline int
 trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
 {
unsigned int val = cpu_buffer->current_context;
-   unsigned long pc = preempt_count();
-   int bit = 0;
-
-   bit += !!(pc & (NMI_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+   int bit = interrupt_context_level();
 
bit = RB_CTX_NORMAL - bit;
 


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Steven Rostedt
On Fri, 15 Oct 2021 20:04:29 +0200
Peter Zijlstra  wrote:

> On Fri, Oct 15, 2021 at 01:58:06PM -0400, Steven Rostedt wrote:
> > Something like this:  
> 
> I think having one copy of that in a header is better than having 3
> copies. But yes, something along them lines.

I was just about to ask you about this patch ;-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 4d244e295e85..a6ae329aa654 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -74,6 +74,27 @@
  */
 #define FORK_PREEMPT_COUNT (2*PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED)
 
+/**
+ * interrupt_context_level - return interrupt context level
+ *
+ * Returns the current interrupt context level.
+ *  0 - normal context
+ *  1 - softirq context
+ *  2 - hardirq context
+ *  3 - NMI context
+ */
+static __always_inline unsigned char interrupt_context_level(void)
+{
+   unsigned long pc = preempt_count();
+   unsigned char level = 0;
+
+   level += !!(pc & (NMI_MASK));
+   level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
+   level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+
+   return level;
+}
+
 /* preempt_count() and related functions, depends on PREEMPT_NEED_RESCHED */
 #include <asm/preempt.h>
 
diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index 41f5bfd9e93f..018a04381556 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -118,12 +118,7 @@ enum {
 
 static __always_inline int trace_get_context_bit(void)
 {
-   unsigned long pc = preempt_count();
-   unsigned char bit = 0;
-
-   bit += !!(pc & (NMI_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+   unsigned char bit = interrupt_context_level();
 
return TRACE_CTX_NORMAL - bit;
 }
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 228801e20788..c91711f20cf8 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -206,11 +206,7 @@ DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)
 static inline int get_recursion_context(int *recursion)
 {
unsigned int pc = preempt_count();
-   unsigned char rctx = 0;
-
-   rctx += !!(pc & (NMI_MASK));
-   rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK));
-   rctx += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+   unsigned char rctx = interrupt_context_level();
 
if (recursion[rctx])
return -1;
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 15d4380006e3..f6520d0a4c8c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3167,12 +3167,7 @@ static __always_inline int
 trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
 {
unsigned int val = cpu_buffer->current_context;
-   unsigned long pc = preempt_count();
-   int bit = 0;
-
-   bit += !!(pc & (NMI_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK));
-   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+   int bit = interrupt_context_level();
 
bit = RB_CTX_NORMAL - bit;
 


Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Steven Rostedt
On Fri, 15 Oct 2021 13:35:04 -0400
Steven Rostedt  wrote:

> On Fri, 15 Oct 2021 18:17:02 +0200
> Peter Zijlstra  wrote:
> 
> > On Fri, Oct 15, 2021 at 11:00:35AM -0400, Steven Rostedt wrote:  
> > > From: "Steven Rostedt (VMware)" 
> > > 
> > > While writing an email explaining the "bit = 0" logic for a discussion on 
> > >
> >   
> > >   bit = trace_get_context_bit() + start;
> > 
> > While there, you were also going to update that function to match/use
> > get_recursion_context(). Because your version is still branch hell.  
> 
> That would probably be a separate patch. This is just a fix for a design
> flaw; changing the context tests would be a performance enhancement.  
> 
> Thanks for the reminder (as it was on my todo list to update that code).

Something like this:

From: "Steven Rostedt (VMware)" 
Subject: [PATCH] tracing: Reuse logic from perf's get_recursion_context()

Instead of having branches that add noise to the branch prediction, use
the addition logic to set the bit for the level of interrupt context that
the state is currently in. This copies the logic from perf's
get_recursion_context() function.

Link: https://lore.kernel.org/all/20211015161702.gf174...@worktop.programming.kicks-ass.net/

Suggested-by: Peter Zijlstra 
Signed-off-by: Steven Rostedt (VMware) 
---
 include/linux/trace_recursion.h | 11 ++-
 kernel/trace/ring_buffer.c  | 12 ++--
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index 168fdf07419a..41f5bfd9e93f 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -119,12 +119,13 @@ enum {
 static __always_inline int trace_get_context_bit(void)
 {
unsigned long pc = preempt_count();
+   unsigned char bit = 0;
 
-   if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
-   return TRACE_CTX_NORMAL;
-   else
-   return pc & NMI_MASK ? TRACE_CTX_NMI :
-   pc & HARDIRQ_MASK ? TRACE_CTX_IRQ : TRACE_CTX_SOFTIRQ;
+   bit += !!(pc & (NMI_MASK));
+   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK));
+   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+
+   return TRACE_CTX_NORMAL - bit;
 }
 
 #ifdef CONFIG_FTRACE_RECORD_RECURSION
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c5a3fbf19617..15d4380006e3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3168,13 +3168,13 @@ trace_recursive_lock(struct ring_buffer_per_cpu *cpu_buffer)
 {
unsigned int val = cpu_buffer->current_context;
unsigned long pc = preempt_count();
-   int bit;
+   int bit = 0;
 
-   if (!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))
-   bit = RB_CTX_NORMAL;
-   else
-   bit = pc & NMI_MASK ? RB_CTX_NMI :
-   pc & HARDIRQ_MASK ? RB_CTX_IRQ : RB_CTX_SOFTIRQ;
+   bit += !!(pc & (NMI_MASK));
+   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK));
+   bit += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));
+
+   bit = RB_CTX_NORMAL - bit;
 
	if (unlikely(val & (1 << (bit + cpu_buffer->nest)))) {
/*
-- 
2.31.1
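
To make the addition logic concrete, here is a small stand-alone C sketch (user space; the mask values are hard-coded to the usual layout from include/linux/preempt.h purely for illustration) showing how the three additions classify each context:

	#include <stdio.h>

	/* Mask layout copied here only for illustration; see include/linux/preempt.h. */
	#define SOFTIRQ_OFFSET	0x00000100UL
	#define HARDIRQ_MASK	0x000f0000UL
	#define NMI_MASK	0x00f00000UL

	static unsigned char context_level(unsigned long pc)
	{
		unsigned char level = 0;

		/* NMI sets all three terms, hardirq the last two, softirq only the last. */
		level += !!(pc & (NMI_MASK));
		level += !!(pc & (NMI_MASK | HARDIRQ_MASK));
		level += !!(pc & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET));

		return level;
	}

	int main(void)
	{
		printf("normal:  %d\n", context_level(0));		/* 0 */
		printf("softirq: %d\n", context_level(SOFTIRQ_OFFSET));	/* 1 */
		printf("hardirq: %d\n", context_level(0x00010000UL));	/* 2 */
		printf("nmi:     %d\n", context_level(0x00100000UL));	/* 3 */
		return 0;
	}

There is not a single conditional branch in the classification, which is what the "branch hell" remark was about.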



Re: [PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Steven Rostedt
On Fri, 15 Oct 2021 18:17:02 +0200
Peter Zijlstra  wrote:

> On Fri, Oct 15, 2021 at 11:00:35AM -0400, Steven Rostedt wrote:
> > From: "Steven Rostedt (VMware)" 
> > 
> > While writing an email explaining the "bit = 0" logic for a discussion on  
> 
> > bit = trace_get_context_bit() + start;  
> 
> While there, you were also going to update that function to match/use
> get_recursion_context(). Because your version is still branch hell.

That would probably be a separate patch. This is just a fix for a design
flaw; changing the context tests would be a performance enhancement.

Thanks for the reminder (as it was on my todo list to update that code).

-- Steve


[PATCH] tracing: Have all levels of checks prevent recursion

2021-10-15 Thread Steven Rostedt
From: "Steven Rostedt (VMware)" 

While writing an email explaining the "bit = 0" logic for a discussion on
making ftrace_test_recursion_trylock() disable preemption, I discovered a
path that makes the "not do the logic if bit is zero" trick unsafe.

The recursion logic is done in hot paths like the function tracer. Thus,
any code executed causes noticeable overhead. Thus, tricks are done to try
to limit the amount of code executed. This included the recursion testing
logic.

Having recursion testing is important, as there are many paths that can
end up in an infinite recursion cycle when tracing every function in the
kernel. Thus protection is needed to prevent that from happening.

Because it is OK to recurse due to different running context levels (e.g.
an interrupt preempts a trace, and then a trace occurs in the interrupt
handler), a set of bits are used to know which context one is in (normal,
softirq, irq and NMI). If a recursion occurs in the same level, it is
prevented*.

Then there are infrastructure levels of recursion as well. When more than
one callback is attached to the same function to trace, it calls a loop
function to iterate over all the callbacks. Both the callbacks and the
loop function have recursion protection. The callbacks use the
"ftrace_test_recursion_trylock()" which has a "function" set of context
bits to test, and the loop function calls the internal
trace_test_and_set_recursion() directly, with an "internal" set of bits.

If an architecture does not implement all the features supported by ftrace
then the callbacks are never called directly, and the loop function is
called instead, which will implement the features of ftrace.

Since both the loop function and the callbacks do recursion protection, it
seemed unnecessary to do it in both locations. Thus, a trick was made
to have the internal set of recursion bits at a more significant bit
location than the function bits. Then, if any of the higher bits were set,
the logic of the function bits could be skipped, as any new recursion
would first have to go through the loop function.

This is true for architectures that do not support all the ftrace
features, because all functions being traced must first go through the
loop function before going to the callbacks. But this is not true for
architectures that support all the ftrace features. That's because the
loop function could be called due to two callbacks attached to the same
function, but then the callback could call a function that is traced only
by that one callback, and that function will call the callback directly,
bypassing the loop function.

i.e.

 traced_function_1: [ more than one callback tracing it ]
   call loop_func

 loop_func:
   trace_recursion set internal bit
   call callback

 callback:
   trace_recursion [ skipped because internal bit is set, return 0 ]
   call traced_function_2

 traced_function_2: [ only traced by above callback ]
   call callback

 callback:
   trace_recursion [ skipped because internal bit is set, return 0 ]
   call traced_function_2

 [ wash, rinse, repeat, BOOM! out of shampoo! ]

Thus, the "bit == 0 skip" trick is not safe, unless the loop function is
called for all functions.

Since we want to encourage architectures to implement all ftrace features,
having them slow down due to this extra logic may encourage the
maintainers to update to the latest ftrace features. And because this
logic is only safe for them, remove it completely.

 [*] There is one layer of recursion that is allowed, and that is to allow
 for the transition between interrupt context (normal -> softirq ->
 irq -> NMI), because a trace may occur before the context update is
 visible to the trace recursion logic.

Link: https://lore.kernel.org/all/609b565a-ed6e-a1da-f025-166691b5d...@linux.alibaba.com/

Cc: sta...@vger.kernel.org
Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) 
---
 include/linux/trace_recursion.h | 41 +
 1 file changed, 6 insertions(+), 35 deletions(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index a9f9c5714e65..168fdf07419a 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -16,23 +16,8 @@
  *  When function tracing occurs, the following steps are made:
  *   If arch does not support a ftrace feature:
  *call internal function (uses INTERNAL bits) which calls...
- *   If callback is registered to the "global" list, the list
- *function is called and recursion checks the GLOBAL bits.
- *then this function calls...
  *   The function callback, which can use the FTRACE bits to
  *check for recursion.
- *
- * Now if the arch does not support a feature, and it calls
- * the global list function which calls the ftrace callback
- * all three of these steps will do a recursion protection.
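
Since the diff is truncated above, here is a rough sketch of the kind of early exit the patch removes from trace_test_and_set_recursion() (names and exact form are illustrative, not the literal hunk):

	/*
	 * Illustrative sketch of the removed early exit: if any recursion
	 * bit above the callback range is already set, assume an outer
	 * loop-function check already ran and skip this one (return 0),
	 * which is exactly the assumption shown to be unsafe above.
	 */
	unsigned long val = current->trace_recursion;

	if (val >> TRACE_LIST_START)	/* any "internal" bit set? */
		return 0;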

Re: [PATCH v3 1/2] ftrace: disable preemption between ftrace_test_recursion_trylock/unlock()

2021-10-15 Thread Steven Rostedt
On Fri, 15 Oct 2021 17:12:26 +0800
王贇  wrote:

> Maybe taking some examples would make this easier to understand...
> 
> Currently there are two ways of using ftrace_test_recursion_trylock():
> one with TRACE_FTRACE_XXX, which we mark as A, and one with TRACE_LIST_XXX,
> which we mark as B. Then:
> 
> A followed by B on same context got bit > 0
> B followed by A on any context got bit 0
> A followed by A on same context got bit > 0
> A followed by A followed by A on same context got bit -1
> B followed by B on same context got bit > 0
> B followed by B followed by B on same context got bit -1
> 
> If we get rid of the TRACE_TRANSITION_BIT, which allowed recursion one
> time, then it would be:

We can't get rid of the transition bit. It's there to fix a bug. Or at
least not until we finish the "noinst" work, which may be done by now? I
have to go revisit it.

The reason for the transition bit is that we were dropping function
traces that people relied on being there. The problem is that the
recursion protection allows for nested contexts. That is, it will not detect
recursion if an interrupt triggers during a trace (while the recursion
lock is held) and then that interrupt does a trace. The interrupt is allowed
to call ftrace_test_recursion_trylock() again.

But, what happens if the trace occurs after the interrupt triggers, but
before the preempt_count states that we are now in interrupt context? As
preempt_count is used by this code to determine what context we are in, if
a trace happens in this "transition" state, without the transition bit, a
recursion is detected (false positive) and the event is dropped. People are
then confused about how an interrupt happened when the interrupt entry never
shows up in the trace.

The transition bit allows one level of recursion to handle this case.
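
To illustrate the idea only (a minimal stand-alone model, not the kernel's trace_recursion.h), the per-context bits plus the one-shot transition bit behave roughly like this:

	/* One recursion bit per context, plus one transition bit. */
	enum { CTX_NMI, CTX_IRQ, CTX_SOFTIRQ, CTX_NORMAL, CTX_TRANSITION };

	static unsigned long recursion;		/* lives in task_struct in the kernel */

	static int recursion_trylock(int ctx)
	{
		if (!(recursion & (1UL << ctx))) {
			recursion |= 1UL << ctx;
			return ctx;
		}
		/*
		 * The same context was seen twice. This may be the window where
		 * an interrupt has triggered but preempt_count() has not been
		 * updated yet, so allow exactly one extra level instead of
		 * dropping the event.
		 */
		if (!(recursion & (1UL << CTX_TRANSITION))) {
			recursion |= 1UL << CTX_TRANSITION;
			return CTX_TRANSITION;
		}
		return -1;			/* real recursion */
	}

	static void recursion_unlock(int bit)
	{
		if (bit >= 0)
			recursion &= ~(1UL << bit);
	}

	int main(void)
	{
		int a = recursion_trylock(CTX_NORMAL);	/* outer trace: CTX_NORMAL */
		int b = recursion_trylock(CTX_NORMAL);	/* transition window: CTX_TRANSITION */
		int c = recursion_trylock(CTX_NORMAL);	/* genuine recursion: -1, drop it */

		recursion_unlock(c);
		recursion_unlock(b);
		recursion_unlock(a);
		return 0;
	}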

The "noinst" work may also remove all locations that can be traced before
the context is updated. I need to see if that is now the case, and if it
is, we can remove the transition bit.

This is not the design issue I mentioned earlier. I'm still working on that
one.

-- Steve


> 
> A followed by B on same context got bit > 0
> B followed by A on any context got bit 0
> A followed by A on same context got bit -1
> B followed by B on same context got bit -1
> 
> So as long as there is no continuous AAA, it's safe?



Re: [PATCH v3 1/2] ftrace: disable preemption between ftrace_test_recursion_trylock/unlock()

2021-10-14 Thread Steven Rostedt
On Thu, 14 Oct 2021 17:14:07 +0200
Petr Mladek  wrote:

>   /**
>* ftrace_test_recursion_trylock - tests for recursion in same context
>*
>* Use this for ftrace callbacks. This will detect if the function
>* tracing recursed in the same context (normal vs interrupt),
>*
>* Returns: -1 if a recursion happened.
> -  *   >= 0 if no recursion
> +  *   >= 0 if no recursion (success)
> +  *
> +  * Disables preemption on success. It is just for convenience.
> +  * Current users needed to disable preemption for some reason.
>*/

I started replying to this explaining the difference between bit not
zero and a bit zero, and I think I found a design flaw that can allow
unwanted recursion.

It's late and I'm about to go to bed, but I may have a new patch to fix
this before this gets added, as the fix will conflict with this patch,
and the fix will likely need to go to stable.

Stay tuned.

-- Steve


Re: [PATCH v3 1/2] ftrace: disable preemption between ftrace_test_recursion_trylock/unlock()

2021-10-14 Thread Steven Rostedt
On Thu, 14 Oct 2021 11:13:13 +0200 (CEST)
Miroslav Benes  wrote:

> for the livepatch part of the patch.
> 
> I would also ask you not to submit new versions so often, so that the 
> other reviewers have time to actually review the patch set.
> 
> Quoting from Documentation/process/submitting-patches.rst:
> 
> "Wait for a minimum of one week before resubmitting or pinging reviewers - 
> possibly longer during busy times like merge windows."

Although, for updates on a small patch set, I would say just a couple of
days, instead of a week. It's annoying when you have a 15 patch set series,
that gets updated on a daily basis. But for something with only 2 patches,
wait just two days. At least, that's how I feel.

-- Steve


Re: [RESEND PATCH v2 1/2] ftrace: disable preemption between ftrace_test_recursion_trylock/unlock()

2021-10-13 Thread Steven Rostedt
On Wed, 13 Oct 2021 16:34:53 +0800
王贇  wrote:

> On 2021/10/13 下午4:25, Miroslav Benes wrote:
> >>> Side note... the comment will eventually conflict with peterz's 
> >>> https://lore.kernel.org/all/20210929152429.125997...@infradead.org/.  
> >>
> >> Steven, would you like to share your opinion on this patch?
> >>
> >> If klp_synchronize_transition() will be removed anyway, the comments
> >> will be meaningless and we can just drop it :-P  
> > 
> > The comment will still be needed in some form. We will handle it depending 
> > on what gets merged first. peterz's patches are not ready yet.  
> 
> Ok, then I'll move it before trylock() inside klp_ftrace_handler() anyway.

+1

-- Steve


Re: [PATCH v2 0/2] fix & prevent the missing preemption disabling

2021-10-12 Thread Steven Rostedt
Please start a new thread when sending new versions. v2 should not be a
reply to v1. If you want to reference v1, just add it to the cover
letter with a link tag:

Link: https://lore.kernel.org/all/8c7de46d-9869-aa5e-2bb9-5dbc2eda3...@linux.alibaba.com/

-- Steve


On Wed, 13 Oct 2021 11:16:56 +0800
王贇  wrote:

> Testing shows that perf_ftrace_function_call() is using smp_processor_id()
> with preemption enabled, so any per-CPU checking could be wrong after a
> preemption.
> 
> As Peter pointed out, the section between the
> ftrace_test_recursion_trylock/unlock() pair requires preemption to be
> disabled, as 'Documentation/trace/ftrace-uses.rst' explains, but currently
> that work is done outside of the helpers.
> 
> Patch 1/2 will make sure preemption is disabled after trylock() succeeds,
> and patch 2/2 will do the smp_processor_id() checking after trylock to
> address the issue.
> 
> Michael Wang (2):
>   ftrace: disable preemption between ftrace_test_recursion_trylock/unlock()
>   ftrace: do CPU checking after preemption disabled
> 
>  arch/csky/kernel/probes/ftrace.c |  2 --
>  arch/parisc/kernel/ftrace.c  |  2 --
>  arch/powerpc/kernel/kprobes-ftrace.c |  2 --
>  arch/riscv/kernel/probes/ftrace.c|  2 --
>  arch/x86/kernel/kprobes/ftrace.c |  2 --
>  include/linux/trace_recursion.h  | 22 +-
>  kernel/livepatch/patch.c |  6 --
>  kernel/trace/trace_event_perf.c  |  6 +++---
>  kernel/trace/trace_functions.c   |  5 -
>  9 files changed, 24 insertions(+), 25 deletions(-)
> 



Re: [PATCH 1/2] ftrace: disable preemption on the testing of recursion

2021-10-12 Thread Steven Rostedt
On Wed, 13 Oct 2021 10:04:52 +0800
王贇  wrote:

> I see, but the user can still check smp_processor_id() after trylock
> returns bit 0...

But preemption would have already been disabled. That's because bit 0
means that a recursion check has already been made by a previous
caller and this one is nested, so preemption is already disabled.
If the bit is 0, then preemption had better be disabled as well.

-- Steve


Re: [PATCH 1/2] ftrace: disable preemption on the testing of recursion

2021-10-12 Thread Steven Rostedt
On Wed, 13 Oct 2021 09:50:17 +0800
王贇  wrote:

> >> -  preempt_enable_notrace();
> >>ftrace_test_recursion_unlock(bit);
> >>  }  
> > 
> > I don't like this change much. We have preempt_disable there not because 
> > of ftrace_test_recursion, but because of RCU. ftrace_test_recursion was 
> > added later. Yes, it would work with the change, but it would also hide 
> > things which should not be hidden in my opinion.  
> 
> I'm not very sure about the background story, but I just found this in
> 'Documentation/trace/ftrace-uses.rst':
> 
>   Note, on success,
>   ftrace_test_recursion_trylock() will disable preemption, and the
>   ftrace_test_recursion_unlock() will enable it again (if it was previously
>   enabled).

Right that part is to be fixed by what you are adding here.

The point that Miroslav is complaining about is that the preemption
disabling is special in this case, and not just from the recursion
point of view, which is why the comment is still required.

-- Steve


> 
> Seems like this lock pair was supposed to take care of the preemption itself?


Re: [PATCH 1/2] ftrace: disable preemption on the testing of recursion

2021-10-12 Thread Steven Rostedt
On Tue, 12 Oct 2021 13:40:08 +0800
王贇  wrote:

> --- a/include/linux/trace_recursion.h
> +++ b/include/linux/trace_recursion.h
> @@ -214,7 +214,14 @@ static __always_inline void trace_clear_recursion(int bit)
>  static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
>                                                           unsigned long parent_ip)
>  {
> - return trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START, TRACE_FTRACE_MAX);
> + int bit;
> +
> + preempt_disable_notrace();

The recursion test does not require preemption disabled, it uses the task
struct, not per_cpu variables, so you should not disable it before the test.

	bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START, TRACE_FTRACE_MAX);
if (bit >= 0)
preempt_disable_notrace();

And if the bit is zero, it means a recursion check was already done by
another caller (ftrace handler does the check, followed by calling perf),
and you really don't even need to disable preemption in that case.

if (bit > 0)
preempt_disable_notrace();

And on the unlock, have:

 static __always_inline void ftrace_test_recursion_unlock(int bit)
 {
if (bit)
preempt_enable_notrace();
trace_clear_recursion(bit);
 }

But maybe that's over optimizing ;-)

-- Steve


> + bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START, TRACE_FTRACE_MAX);
> + if (bit < 0)
> + preempt_enable_notrace();
> +
> + return bit;
>  }
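
Putting the two suggestions above together, a sketch of how the trylock/unlock pair would end up looking (assuming the existing trace_test_and_set_recursion() and trace_clear_recursion() helpers; this is not a final patch):

	static __always_inline int ftrace_test_recursion_trylock(unsigned long ip,
								  unsigned long parent_ip)
	{
		int bit;

		bit = trace_test_and_set_recursion(ip, parent_ip, TRACE_FTRACE_START,
						   TRACE_FTRACE_MAX);
		/*
		 * bit == 0 means an outer caller already passed the recursion
		 * check with preemption disabled; only disable it for bit > 0.
		 */
		if (bit > 0)
			preempt_disable_notrace();

		return bit;
	}

	static __always_inline void ftrace_test_recursion_unlock(int bit)
	{
		if (bit)
			preempt_enable_notrace();
		trace_clear_recursion(bit);
	}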



Re: [PATCH 1/2] ftrace: disable preemption on the testing of recursion

2021-10-12 Thread Steven Rostedt
On Tue, 12 Oct 2021 14:24:43 +0200 (CEST)
Miroslav Benes  wrote:

> > +++ b/kernel/livepatch/patch.c
> > @@ -52,11 +52,6 @@ static void notrace klp_ftrace_handler(unsigned long ip,
> > bit = ftrace_test_recursion_trylock(ip, parent_ip);
> > if (WARN_ON_ONCE(bit < 0))
> > return;
> > -   /*
> > -* A variant of synchronize_rcu() is used to allow patching functions
> > -* where RCU is not watching, see klp_synchronize_transition().
> > -*/
> > -   preempt_disable_notrace();
> > 
> > func = list_first_or_null_rcu(>func_stack, struct klp_func,
> >   stack_node);
> > @@ -120,7 +115,6 @@ static void notrace klp_ftrace_handler(unsigned long ip,
> > klp_arch_set_pc(fregs, (unsigned long)func->new_func);
> > 
> >  unlock:
> > -   preempt_enable_notrace();
> > ftrace_test_recursion_unlock(bit);
> >  }  
> 
> I don't like this change much. We have preempt_disable there not because 
> of ftrace_test_recursion, but because of RCU. ftrace_test_recursion was 
> added later. Yes, it would work with the change, but it would also hide 
> things which should not be hidden in my opinion.

Agreed, but I believe the change is fine, but requires a nice comment to
explain what you said above.

Thus, before the "ftrace_test_recursion_trylock()" we need:

/*
 * The ftrace_test_recursion_trylock() will disable preemption,
 * which is required for the variant of synchronize_rcu() that is
 * used to allow patching functions where RCU is not watching.
 * See klp_synchronize_transition() for more details.
 */

-- Steve


Re: [PATCH 1/2] ftrace: disable preemption on the testing of recursion

2021-10-12 Thread Steven Rostedt
On Tue, 12 Oct 2021 13:40:08 +0800
王贇  wrote:

> @@ -52,11 +52,6 @@ static void notrace klp_ftrace_handler(unsigned long ip,
>   bit = ftrace_test_recursion_trylock(ip, parent_ip);
>   if (WARN_ON_ONCE(bit < 0))
>   return;
> - /*
> -  * A variant of synchronize_rcu() is used to allow patching functions
> -  * where RCU is not watching, see klp_synchronize_transition().
> -  */

I have to take a deeper look at this patch set, but do not remove this
comment, as it explains the protection here, that is not obvious with the
changes you made.

-- Steve


> - preempt_disable_notrace();
> 
>   func = list_first_or_null_rcu(>func_stack, struct klp_func,
> stack_node);


Re: [PATCH v4] ftrace: Cleanup ftrace_dyn_arch_init()

2021-09-09 Thread Steven Rostedt
On Thu,  9 Sep 2021 17:02:16 +0800
Weizhao Ouyang  wrote:

> Most architectures use an empty ftrace_dyn_arch_init(), so introduce a weak
> common ftrace_dyn_arch_init() to clean them up.

FYI,

I'm not ignoring this patch. I just won't be able to look at it until the
merge window is over.

-- Steve


> 
> Signed-off-by: Weizhao Ouyang 
> Acked-by: Heiko Carstens  (s390)
> Acked-by: Helge Deller  (parisc)
> 
> ---
> Changes in v4:
> -- revert the generic declaration
> 
> Changes in v3:
> -- fix unrecognized opcode on PowerPC
> 
> Changes in v2:
> -- correct CONFIG_DYNAMIC_FTRACE on PowerPC
> -- add Acked-by tag
> 
> ---
>  arch/arm/kernel/ftrace.c| 5 -
>  arch/arm64/kernel/ftrace.c  | 5 -
>  arch/csky/kernel/ftrace.c   | 5 -
>  arch/ia64/kernel/ftrace.c   | 6 --
>  arch/microblaze/kernel/ftrace.c | 5 -
>  arch/nds32/kernel/ftrace.c  | 5 -
>  arch/parisc/kernel/ftrace.c | 5 -
>  arch/riscv/kernel/ftrace.c  | 5 -
>  arch/s390/kernel/ftrace.c   | 5 -
>  arch/sh/kernel/ftrace.c | 5 -
>  arch/sparc/kernel/ftrace.c  | 5 -
>  arch/x86/kernel/ftrace.c| 5 -
>  kernel/trace/ftrace.c   | 5 +
>  13 files changed, 5 insertions(+), 61 deletions(-)
> 
>


Re: [PATCH v2 5/7] kallsyms: Rename is_kernel() and is_kernel_text()

2021-07-28 Thread Steven Rostedt
On Thu, 29 Jul 2021 10:00:51 +0800
Kefeng Wang  wrote:

> On 2021/7/28 23:28, Steven Rostedt wrote:
> > On Wed, 28 Jul 2021 16:13:18 +0800
> > Kefeng Wang  wrote:
> >  
> >> The is_kernel[_text]() functions check whether or not the address is
> >> in the kernel[_text] ranges; they also check whether or not the address
> >> is in the gate area, so use a better name.  
> > Do you know what a gate area is?
> >
> > Because I believe gate area is kernel text, so the rename just makes it
> > redundant and more confusing.  
> 
> Yes, the gate area (e.g. the vectors part on ARM32, and similar on
> x86/ia64) is kernel text.
> 
> I want to keep the 'basic' section boundaries check, which only checks the
> start/end of sections, all in section.h. Could we use 'generic', 'basic',
> or 'core' in the naming?
> 
>   * is_kernel_generic_data()  --- comes from core_kernel_data() in kernel.h
>   * is_kernel_generic_text()
> 
> The old helpers could remain unchanged. Any suggestions? Thanks.

Because it looks like the check of just being in the range of "_stext"
to "_end" is an internal helper, why not do what we do all over
the kernel and prefix the function with a couple of underscores
that denote that it's internal?

  __is_kernel_text()

Then you have:

 static inline int is_kernel_text(unsigned long addr)
 {
if (__is_kernel_text(addr))
return 1;
return in_gate_area_no_mm(addr);
 }
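
And the double-underscore helper itself would then be just the plain range check, for example (a sketch using the standard _stext/_etext section symbols, not code from the series):

	static inline int __is_kernel_text(unsigned long addr)
	{
		return addr >= (unsigned long)_stext &&
		       addr < (unsigned long)_etext;
	}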

-- Steve


Re: [PATCH v2 6/7] sections: Add new is_kernel() and is_kernel_text()

2021-07-28 Thread Steven Rostedt
On Wed, 28 Jul 2021 16:13:19 +0800
Kefeng Wang  wrote:

> @@ -64,8 +64,7 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
>  
>  int notrace core_kernel_text(unsigned long addr)
>  {
> - if (addr >= (unsigned long)_stext &&
> - addr < (unsigned long)_etext)
> + if (is_kernel_text(addr))

Perhaps this was a bug, and these functions should be checking the gate
area as well, as that is part of kernel text.

-- Steve


>   return 1;
>  
>   if (system_state < SYSTEM_RUNNING &&
> diff --git a/mm/kasan/report.c b/mm/kasan/report.c
> index 884a950c7026..88f5b0c058b7 100644
> --- a/mm/kasan/report.c
> +++ b/mm/kasan/report.c
> @@ -235,7 +235,7 @@ static void describe_object(struct kmem_cache *cache, void *object,
>  
>  static inline bool kernel_or_module_addr(const void *addr)
>  {
> - if (addr >= (void *)_stext && addr < (void *)_end)
> + if (is_kernel((unsigned long)addr))
>   return true;
>   if (is_module_address((unsigned long)addr))
>   return true;
> -- 

