Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Steven Rostedt
On Mon, 1 Apr 2024 19:29:46 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
> >
> > On Mon, 1 Apr 2024 12:09:18 -0400
> > Steven Rostedt  wrote:
> >  
> > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > Masami Hiramatsu (Google)  wrote:
> > >  
> > > > > Masami,
> > > > >
> > > > > Are you OK with just keeping it set to N.  
> > > >
> > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > >  
> > > > >
> > > > > We could have other options like PROVE_LOCKING enable it.  
> > > >
> > > > Agreed (but it should say this is a debug option)  
> > >
> > > It does say "Validate" which to me is a debug option. What would you
> > > suggest?  
> >
> > I think the help message should have "This is for debugging ftrace."
> >  
> 
> Sent v2 with adjusted wording, thanks!

You may want to wait till Masami and I agree ;-)

Masami,

But it isn't really for "debugging", it's for validating. That is, it
doesn't give us any information to debug ftrace. It only validates if it is
executed properly. In other words, I never want to be asked "How can I use
this option to debug ftrace?"

For example, we also have:

  "Verify ring buffer time stamp deltas"

that makes sure the time stamps of the ring buffer are not buggy.

And there's:

  "Verify compile time sorting of ftrace functions"

Which is also used to make sure things are working properly.

Neither of the above says they are for "debugging", even though they are
more useful for debugging than this option.

I'm not sure saying this is "debugging ftrace" is accurate. It may help
debug ftrace if it is called outside of an RCU location, which has a
1 in 100,000,000,000 chance of causing an actual bug, as the race window is
extremely small. 

Now if it is also called outside of instrumentation, that will likely trigger
other warnings even without this code, and this will not be needed to debug
that.

ftrace has all sorts of "verifiers" that is used to make sure things are
working properly. And yes, you can consider it as "debugging". But I would
not consider this an option to enable if ftrace was broken, and you are
looking into why it is broken.

To me, this option is only to verify that ftrace (and other users of
ftrace_test_recursion_trylock()) are not called outside of RCU, as if they
are, it can cause a race. But it also has a noticeable overhead when enabled.

-- Steve


-- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
>
> On Mon, 1 Apr 2024 12:09:18 -0400
> Steven Rostedt  wrote:
>
> > On Mon, 1 Apr 2024 20:25:52 +0900
> > Masami Hiramatsu (Google)  wrote:
> >
> > > > Masami,
> > > >
> > > > Are you OK with just keeping it set to N.
> > >
> > > OK, if it is only for the debugging, I'm OK to set N this.
> > >
> > > >
> > > > We could have other options like PROVE_LOCKING enable it.
> > >
> > > Agreed (but it should say this is a debug option)
> >
> > It does say "Validate" which to me is a debug option. What would you
> > suggest?
>
> I think the help message should have "This is for debugging ftrace."
>

Sent v2 with adjusted wording, thanks!

> Thank you,
>
> >
> > -- Steve
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level debugging of ftrace subsystem. For most users it should
probably be kept disabled to eliminate unnecessary runtime overhead.

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..fcf45d5c60cb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace recursion check"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort
+ of protection against recursion. This option performs additional
+ checks to make sure RCU is on when ftrace callbacks recurse.
+
+ This is a feature useful for debugging ftrace. This will add more
+ overhead to all ftrace-based invocations.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Google
On Mon, 1 Apr 2024 12:09:18 -0400
Steven Rostedt  wrote:

> On Mon, 1 Apr 2024 20:25:52 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > > Masami,
> > > 
> > > Are you OK with just keeping it set to N.  
> > 
> > OK, if it is only for the debugging, I'm OK to set N this.
> > 
> > > 
> > > We could have other options like PROVE_LOCKING enable it.  
> > 
> > Agreed (but it should say this is a debug option)
> 
> It does say "Validate" which to me is a debug option. What would you
> suggest?

I think the help message should have "This is for debugging ftrace."

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



[PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-01 Thread Kui-Feng Lee
rethook_find_ret_addr() prints a warning message and returns 0 when the
target task is running and not the "current" task to prevent returning an
incorrect return address. However, this check is incomplete as the target
task can still transition to the running state when finding the return
address, although it is safe with RCU.

The issue we encounter is that the kernel frequently prints warning
messages when BPF profiling programs call to bpf_get_task_stack() on
running tasks.

The callers should be aware and willing to take the risk of receiving an
incorrect return address from a task that is currently running other than
the "current" one. A warning is not needed here as the callers are intent
on it.

Signed-off-by: Kui-Feng Lee 
---
 kernel/trace/rethook.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..4297a132a7ae 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
*tsk, unsigned long frame
if (WARN_ON_ONCE(!cur))
return 0;
 
-   if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
+   if (tsk != current && task_is_running(tsk))
return 0;
 
do {
-- 
2.34.1




Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Steven Rostedt
On Mon, 1 Apr 2024 20:25:52 +0900
Masami Hiramatsu (Google)  wrote:

> > Masami,
> > 
> > Are you OK with just keeping it set to N.  
> 
> OK, if it is only for the debugging, I'm OK to set N this.
> 
> > 
> > We could have other options like PROVE_LOCKING enable it.  
> 
> Agreed (but it should say this is a debug option)

It does say "Validate" which to me is a debug option. What would you
suggest?

-- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Google
On Tue, 26 Mar 2024 15:01:21 -0400
Steven Rostedt  wrote:

> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko  wrote:
> 
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >  
> > 
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
> 
> Masami,
> 
> Are you OK with just keeping it set to N.

OK, if it is only for the debugging, I'm OK to set N this.

> 
> We could have other options like PROVE_LOCKING enable it.

Agreed (but it should say this is a debug option)

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



[PATCH net-next v4 2/2] trace: tcp: fully support trace_tcp_send_reset

2024-04-01 Thread Jason Xing
From: Jason Xing 

Prior to this patch, what we can see by enabling trace_tcp_send is
only happening under two circumstances:
1) active rst mode
2) non-active rst mode and based on the full socket

That means the inconsistency occurs if we use tcpdump and trace
simultaneously to see how rst happens.

It's necessary that we should take into other cases into considerations,
say:
1) time-wait socket
2) no socket
...

By parsing the incoming skb and reversing its 4-tuple can
we know the exact 'flow' which might not exist.

Samples after applied this patch:
1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
state=TCP_ESTABLISHED
2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
state=UNKNOWN
Note:
1) UNKNOWN means we cannot extract the right information from skb.
2) skbaddr/skaddr could be 0

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 40 --
 net/ipv4/tcp_ipv4.c|  7 +++
 net/ipv6/tcp_ipv6.c|  3 ++-
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index cf14b6fcbeed..5c04a61a11c2 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -78,11 +78,47 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
-DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
+TRACE_EVENT(tcp_send_reset,
 
TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
 
-   TP_ARGS(sk, skb)
+   TP_ARGS(sk, skb),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(const void *, skaddr)
+   __field(int, state)
+   __array(__u8, saddr, sizeof(struct sockaddr_in6))
+   __array(__u8, daddr, sizeof(struct sockaddr_in6))
+   ),
+
+   TP_fast_assign(
+   __entry->skbaddr = skb;
+   __entry->skaddr = sk;
+   /* Zero means unknown state. */
+   __entry->state = sk ? sk->sk_state : 0;
+
+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+   if (sk && sk_fullsock(sk)) {
+   const struct inet_sock *inet = inet_sk(sk);
+
+   TP_STORE_ADDR_PORTS(__entry, inet, sk);
+   } else if (skb) {
+   const struct tcphdr *th = (const struct tcphdr 
*)skb->data;
+   /*
+* We should reverse the 4-tuple of skb, so later
+* it can print the right flow direction of rst.
+*/
+   TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, 
entry->saddr);
+   }
+   ),
+
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+ __entry->skbaddr, __entry->skaddr,
+ __entry->saddr, __entry->daddr,
+ __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN")
 );
 
 /*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a22ee5838751..0d47b48f8cfd 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -866,11 +866,10 @@ static void tcp_v4_send_reset(const struct sock *sk, 
struct sk_buff *skb)
 * routing might fail in this case. No choice here, if we choose to 
force
 * input interface, we will misroute in case of asymmetric route.
 */
-   if (sk) {
+   if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
-   if (sk_fullsock(sk))
-   trace_tcp_send_reset(sk, skb);
-   }
+
+   trace_tcp_send_reset(sk, skb);
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3f4cba49e9ee..8e9c59b6c00c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1113,7 +1113,6 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb)
if (sk) {
oif = sk->sk_bound_dev_if;
if (sk_fullsock(sk)) {
-   trace_tcp_send_reset(sk, skb);
if (inet6_test_bit(REPFLOW, sk))
label = ip6_flowlabel(ipv6h);
priority = READ_ONCE(sk->sk_priority);
@@ -1129,6 +1128,8 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb)
label = ip6_flowlabel(ipv6h);
}
 
+   trace_tcp_send_reset(sk, skb);
+
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
 );
-- 
2.37.3




[PATCH net-next v4 1/2] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters

2024-04-01 Thread Jason Xing
From: Jason Xing 

Introducing entry_saddr and entry_daddr parameters in this macro
for later use can help us record the reverse 4-tuple by analyzing
the 4-tuple of the incoming skb when receiving.

Signed-off-by: Jason Xing 
Reviewed-by: Eric Dumazet 
---
 include/trace/events/net_probe_common.h | 20 +++-
 include/trace/events/tcp.h  |  2 +-
 include/trace/events/udp.h  |  2 +-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/net_probe_common.h 
b/include/trace/events/net_probe_common.h
index 5e33f91bdea3..976a58364bff 100644
--- a/include/trace/events/net_probe_common.h
+++ b/include/trace/events/net_probe_common.h
@@ -70,14 +70,14 @@
TP_STORE_V4MAPPED(__entry, saddr, daddr)
 #endif
 
-#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)   \
+#define TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr) \
do {\
-   struct sockaddr_in *v4 = (void *)__entry->saddr;\
+   struct sockaddr_in *v4 = (void *)entry_saddr;   \
\
v4->sin_family = AF_INET;   \
v4->sin_port = protoh->source;  \
v4->sin_addr.s_addr = ip_hdr(skb)->saddr;   \
-   v4 = (void *)__entry->daddr;\
+   v4 = (void *)entry_daddr;   \
v4->sin_family = AF_INET;   \
v4->sin_port = protoh->dest;\
v4->sin_addr.s_addr = ip_hdr(skb)->daddr;   \
@@ -85,28 +85,30 @@
 
 #if IS_ENABLED(CONFIG_IPV6)
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr) \
do {\
const struct iphdr *iph = ip_hdr(skb);  \
\
if (iph->version == 6) {\
-   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+   struct sockaddr_in6 *v6 = (void *)entry_saddr;  \
\
v6->sin6_family = AF_INET6; \
v6->sin6_port = protoh->source; \
v6->sin6_addr = ipv6_hdr(skb)->saddr;   \
-   v6 = (void *)__entry->daddr;\
+   v6 = (void *)entry_daddr;   \
v6->sin6_family = AF_INET6; \
v6->sin6_port = protoh->dest;   \
v6->sin6_addr = ipv6_hdr(skb)->daddr;   \
} else  \
-   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh); \
+   TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, \
+  entry_saddr, \
+  entry_daddr);\
} while (0)
 
 #else
 
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
-   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr) \
+   TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr)
 
 #endif
 
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 1db95175c1e5..cf14b6fcbeed 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -295,7 +295,7 @@ DECLARE_EVENT_CLASS(tcp_event_skb,
memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
 
-   TP_STORE_ADDR_PORTS_SKB(__entry, skb, th);
+   TP_STORE_ADDR_PORTS_SKB(skb, th, __entry->saddr, 
__entry->daddr);
),
 
TP_printk("skbaddr=%p src=%pISpc dest=%pISpc",
diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h
index 62bebe2a6ece..6142be4068e2 100644
--- a/include/trace/events/udp.h
+++ b/include/trace/events/udp.h
@@ -38,7 +38,7 @@ TRACE_EVENT(udp_fail_queue_rcv_skb,
memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
 
-   TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh);
+   TP_STORE_ADDR_PORTS_SKB(skb, uh, __entry->saddr, 
__entry->daddr);
),
 
TP_printk("rc=%d family=%s src=%pISpc dest=%pISpc", 

[PATCH net-next v4 0/2] tcp: make trace of reset logic complete

2024-04-01 Thread Jason Xing
From: Jason Xing 

Before this, we miss some cases where the TCP layer could send RST but
we cannot trace it. So I decided to complete it :)

v4
Link: 
https://lore.kernel.org/all/20240329034243.7929-1-kerneljasonx...@gmail.com/
1. rebased against latest net-next
2. remove {} and add skb test statement (Eric)
3. drop v3 patch [3/3] temporarily because 1) location is not that useful
since we can use perf or something else to trace, 2) Eric said we could
use drop_reason to show why we have to RST, which is good, but this seems
not work well for those ->send_reset() logic. I need more time to
investigate this part.

v3
1. fix a format problem in patch [3/3]

v2
1. fix spelling mistakes

Jason Xing (2):
  trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
  trace: tcp: fully support trace_tcp_send_reset

 include/trace/events/net_probe_common.h | 20 ++--
 include/trace/events/tcp.h  | 42 +++--
 include/trace/events/udp.h  |  2 +-
 net/ipv4/tcp_ipv4.c |  7 ++---
 net/ipv6/tcp_ipv6.c |  3 +-
 5 files changed, 56 insertions(+), 18 deletions(-)

-- 
2.37.3