Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 19:29:46 -0700
Andrii Nakryiko wrote:

> On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote:
> >
> > On Mon, 1 Apr 2024 12:09:18 -0400
> > Steven Rostedt wrote:
> >
> > > On Mon, 1 Apr 2024 20:25:52 +0900
> > > Masami Hiramatsu (Google) wrote:
> > >
> > > > > Masami,
> > > > >
> > > > > Are you OK with just keeping it set to N.
> > > >
> > > > OK, if it is only for the debugging, I'm OK to set N this.
> > > >
> > > > > >
> > > > > > We could have other options like PROVE_LOCKING enable it.
> > > >
> > > > Agreed (but it should say this is a debug option)
> > >
> > > It does say "Validate" which to me is a debug option. What would you
> > > suggest?
> >
> > I think the help message should have "This is for debugging ftrace."
>
> Sent v2 with adjusted wording, thanks!

You may want to wait till Masami and I agree ;-)

Masami,

But it isn't really for "debugging", it's for validating. That is, it
doesn't give us any information to debug ftrace. It only validates that it
is executed properly. In other words, I never want to be asked "How can I
use this option to debug ftrace?"

For example, we also have:

  "Verify ring buffer time stamp deltas"

which makes sure the time stamps of the ring buffer are not buggy.

And there's:

  "Verify compile time sorting of ftrace functions"

which is also used to make sure things are working properly.

Neither of the above says they are for "debugging", even though they are
more useful for debugging than this option.

I'm not sure saying this is for "debugging ftrace" is accurate. It may help
debug ftrace if it is called outside of an RCU location, which has a
1 in 100,000,000,000 chance of causing an actual bug, as the race window is
extremely small. Now if it is also called outside of instrumentation, that
will likely trigger other warnings even without this code, and this option
will not be needed to debug that. ftrace has all sorts of "verifiers" that
are used to make sure things are working properly.

And yes, you can consider it as "debugging". But I would not consider this
an option to enable if ftrace was broken and you are looking into why it is
broken.

To me, this option is only to verify that ftrace (and other users of
ftrace_test_recursion_trylock()) are not called outside of RCU, as if they
are, it can cause a race. But it also has a noticeable overhead when
enabled.

-- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote:
>
> On Mon, 1 Apr 2024 12:09:18 -0400
> Steven Rostedt wrote:
>
> > On Mon, 1 Apr 2024 20:25:52 +0900
> > Masami Hiramatsu (Google) wrote:
> >
> > > > Masami,
> > > >
> > > > Are you OK with just keeping it set to N.
> > >
> > > OK, if it is only for the debugging, I'm OK to set N this.
> > >
> > > > >
> > > > > We could have other options like PROVE_LOCKING enable it.
> > >
> > > Agreed (but it should say this is a debug option)
> >
> > It does say "Validate" which to me is a debug option. What would you
> > suggest?
>
> I think the help message should have "This is for debugging ftrace."

Sent v2 with adjusted wording, thanks!

> Thank you,
>
> > -- Steve
>
> --
> Masami Hiramatsu (Google)
[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control
whether ftrace low-level code performs additional rcu_is_watching()-based
validation logic in an attempt to catch noinstr violations.

This check is expected to never be true and is mostly useful for low-level
debugging of ftrace subsystem. For most users it should probably be kept
disabled to eliminate unnecessary runtime overhead.

Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Paul E. McKenney
Signed-off-by: Andrii Nakryiko
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig            | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)	do { } while (0)
 #endif

-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)					\
 	({								\
 		bool __ret = !rcu_is_watching();			\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..fcf45d5c60cb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE
 	  This file can be reset, but the limit can not change in
 	  size at runtime.

+config FTRACE_VALIDATE_RCU_IS_WATCHING
+	bool "Validate RCU is on during ftrace recursion check"
+	depends on FUNCTION_TRACER
+	depends on ARCH_WANTS_NO_INSTR
+	help
+	  All callbacks that attach to the function tracing have some sort
+	  of protection against recursion. This option performs additional
+	  checks to make sure RCU is on when ftrace callbacks recurse.
+
+	  This is a feature useful for debugging ftrace. This will add more
+	  overhead to all ftrace-based invocations.
+
+	  If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
 	bool "Record functions that recurse in the ring buffer"
 	depends on FTRACE_RECORD_RECURSION
--
2.43.0
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 12:09:18 -0400
Steven Rostedt wrote:

> On Mon, 1 Apr 2024 20:25:52 +0900
> Masami Hiramatsu (Google) wrote:
>
> > > Masami,
> > >
> > > Are you OK with just keeping it set to N.
> >
> > OK, if it is only for the debugging, I'm OK to set N this.
> >
> > > >
> > > > We could have other options like PROVE_LOCKING enable it.
> >
> > Agreed (but it should say this is a debug option)
>
> It does say "Validate" which to me is a debug option. What would you
> suggest?

I think the help message should have "This is for debugging ftrace."

Thank you,

>
> -- Steve

--
Masami Hiramatsu (Google)
[PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.
rethook_find_ret_addr() prints a warning message and returns 0 when the
target task is running and is not the "current" task, to prevent returning
an incorrect return address. However, this check is incomplete as the
target task can still transition to the running state while the return
address is being found, although it is safe with RCU.

The issue we encounter is that the kernel frequently prints warning
messages when BPF profiling programs call bpf_get_task_stack() on
running tasks.

The callers should be aware of, and willing to take, the risk of receiving
an incorrect return address from a task that is currently running and is
not the "current" one. A warning is not needed here as the callers intend
this behavior.

Signed-off-by: Kui-Feng Lee
---
 kernel/trace/rethook.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..4297a132a7ae 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
 	if (WARN_ON_ONCE(!cur))
 		return 0;

-	if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
+	if (tsk != current && task_is_running(tsk))
 		return 0;

 	do {
--
2.34.1
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 20:25:52 +0900
Masami Hiramatsu (Google) wrote:

> > Masami,
> >
> > Are you OK with just keeping it set to N.
>
> OK, if it is only for the debugging, I'm OK to set N this.
>
> > >
> > > We could have other options like PROVE_LOCKING enable it.
>
> Agreed (but it should say this is a debug option)

It does say "Validate" which to me is a debug option. What would you
suggest?

-- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 26 Mar 2024 15:01:21 -0400
Steven Rostedt wrote:

> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko wrote:
>
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >
> >
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
>
> Masami,
>
> Are you OK with just keeping it set to N.

OK, if it is only for the debugging, I'm OK to set N this.

>
> We could have other options like PROVE_LOCKING enable it.

Agreed (but it should say this is a debug option)

Thank you,

>
> -- Steve

--
Masami Hiramatsu (Google)
[PATCH net-next v4 2/2] trace: tcp: fully support trace_tcp_send_reset
From: Jason Xing

Prior to this patch, what we can see by enabling trace_tcp_send is only
what happens under two circumstances:
1) active rst mode
2) non-active rst mode and based on the full socket

That means an inconsistency occurs if we use tcpdump and trace
simultaneously to see how rst happens.

It's necessary that we take other cases into consideration, say:
1) time-wait socket
2) no socket
...

By parsing the incoming skb and reversing its 4-tuple we can know the
exact 'flow', which might not exist.

Samples after applying this patch:
1. tcp_send_reset: skbaddr=XXX skaddr=XXX src=ip:port dest=ip:port
state=TCP_ESTABLISHED
2. tcp_send_reset: skbaddr=000...000 skaddr=XXX src=ip:port dest=ip:port
state=UNKNOWN

Note:
1) UNKNOWN means we cannot extract the right information from skb.
2) skbaddr/skaddr could be 0

Signed-off-by: Jason Xing
---
 include/trace/events/tcp.h | 40 ++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp_ipv4.c        |  7 +++----
 net/ipv6/tcp_ipv6.c        |  3 ++-
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index cf14b6fcbeed..5c04a61a11c2 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -78,11 +78,47 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
-DEFINE_EVENT(tcp_event_sk_skb, tcp_send_reset,
+TRACE_EVENT(tcp_send_reset,

 	TP_PROTO(const struct sock *sk, const struct sk_buff *skb),

-	TP_ARGS(sk, skb)
+	TP_ARGS(sk, skb),
+
+	TP_STRUCT__entry(
+		__field(const void *, skbaddr)
+		__field(const void *, skaddr)
+		__field(int, state)
+		__array(__u8, saddr, sizeof(struct sockaddr_in6))
+		__array(__u8, daddr, sizeof(struct sockaddr_in6))
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+		__entry->skaddr = sk;
+		/* Zero means unknown state. */
+		__entry->state = sk ? sk->sk_state : 0;
+
+		memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+		memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+		if (sk && sk_fullsock(sk)) {
+			const struct inet_sock *inet = inet_sk(sk);
+
+			TP_STORE_ADDR_PORTS(__entry, inet, sk);
+		} else if (skb) {
+			const struct tcphdr *th = (const struct tcphdr *)skb->data;
+			/*
+			 * We should reverse the 4-tuple of skb, so later
+			 * it can print the right flow direction of rst.
+			 */
+			TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, entry->saddr);
+		}
+	),
+
+	TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+		  __entry->skbaddr, __entry->skaddr,
+		  __entry->saddr, __entry->daddr,
+		  __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN")
 );

 /*
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a22ee5838751..0d47b48f8cfd 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -866,11 +866,10 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 	 * routing might fail in this case. No choice here, if we choose to force
 	 * input interface, we will misroute in case of asymmetric route.
 	 */
-	if (sk) {
+	if (sk)
 		arg.bound_dev_if = sk->sk_bound_dev_if;

-		if (sk_fullsock(sk))
-			trace_tcp_send_reset(sk, skb);
-	}
+
+	trace_tcp_send_reset(sk, skb);

 	BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 		     offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 3f4cba49e9ee..8e9c59b6c00c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1113,7 +1113,6 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 	if (sk) {
 		oif = sk->sk_bound_dev_if;
 		if (sk_fullsock(sk)) {
-			trace_tcp_send_reset(sk, skb);
 			if (inet6_test_bit(REPFLOW, sk))
 				label = ip6_flowlabel(ipv6h);
 			priority = READ_ONCE(sk->sk_priority);
@@ -1129,6 +1128,8 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb)
 			label = ip6_flowlabel(ipv6h);
 	}

+	trace_tcp_send_reset(sk, skb);
+
 	tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 			     ipv6_get_dsfield(ipv6h), label, priority, txhash,
 			     &key);
--
2.37.3
[PATCH net-next v4 1/2] trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
From: Jason Xing

Introducing entry_saddr and entry_daddr parameters in this macro for
later use can help us record the reverse 4-tuple by analyzing the
4-tuple of the incoming skb when receiving.

Signed-off-by: Jason Xing
Reviewed-by: Eric Dumazet
---
 include/trace/events/net_probe_common.h | 20 +++++++++++---------
 include/trace/events/tcp.h              |  2 +-
 include/trace/events/udp.h              |  2 +-
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/trace/events/net_probe_common.h b/include/trace/events/net_probe_common.h
index 5e33f91bdea3..976a58364bff 100644
--- a/include/trace/events/net_probe_common.h
+++ b/include/trace/events/net_probe_common.h
@@ -70,14 +70,14 @@
 		TP_STORE_V4MAPPED(__entry, saddr, daddr)
 #endif

-#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)	\
+#define TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr) \
 	do {							\
-		struct sockaddr_in *v4 = (void *)__entry->saddr; \
+		struct sockaddr_in *v4 = (void *)entry_saddr;	\
 								\
 		v4->sin_family = AF_INET;			\
 		v4->sin_port = protoh->source;			\
 		v4->sin_addr.s_addr = ip_hdr(skb)->saddr;	\
-		v4 = (void *)__entry->daddr;			\
+		v4 = (void *)entry_daddr;			\
 		v4->sin_family = AF_INET;			\
 		v4->sin_port = protoh->dest;			\
 		v4->sin_addr.s_addr = ip_hdr(skb)->daddr;	\
@@ -85,28 +85,30 @@

 #if IS_ENABLED(CONFIG_IPV6)

-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)		\
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr) \
 	do {							\
 		const struct iphdr *iph = ip_hdr(skb);		\
 								\
 		if (iph->version == 6) {			\
-			struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+			struct sockaddr_in6 *v6 = (void *)entry_saddr; \
 								\
 			v6->sin6_family = AF_INET6;		\
 			v6->sin6_port = protoh->source;		\
 			v6->sin6_addr = ipv6_hdr(skb)->saddr;	\
-			v6 = (void *)__entry->daddr;		\
+			v6 = (void *)entry_daddr;		\
 			v6->sin6_family = AF_INET6;		\
 			v6->sin6_port = protoh->dest;		\
 			v6->sin6_addr = ipv6_hdr(skb)->daddr;	\
 		} else						\
-			TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh); \
+			TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh,	\
+						   entry_saddr,	\
+						   entry_daddr); \
 	} while (0)

 #else

-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)		\
-	TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)
+#define TP_STORE_ADDR_PORTS_SKB(skb, protoh, entry_saddr, entry_daddr) \
+	TP_STORE_ADDR_PORTS_SKB_V4(skb, protoh, entry_saddr, entry_daddr)

 #endif

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 1db95175c1e5..cf14b6fcbeed 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -295,7 +295,7 @@ DECLARE_EVENT_CLASS(tcp_event_skb,
 		memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
 		memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));

-		TP_STORE_ADDR_PORTS_SKB(__entry, skb, th);
+		TP_STORE_ADDR_PORTS_SKB(skb, th, __entry->saddr, __entry->daddr);
 	),

 	TP_printk("skbaddr=%p src=%pISpc dest=%pISpc",
diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h
index 62bebe2a6ece..6142be4068e2 100644
--- a/include/trace/events/udp.h
+++ b/include/trace/events/udp.h
@@ -38,7 +38,7 @@ TRACE_EVENT(udp_fail_queue_rcv_skb,
 		memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
 		memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));

-		TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh);
+		TP_STORE_ADDR_PORTS_SKB(skb, uh, __entry->saddr, __entry->daddr);
 	),

 	TP_printk("rc=%d family=%s src=%pISpc dest=%pISpc",
[PATCH net-next v4 0/2] tcp: make trace of reset logic complete
From: Jason Xing

Before this, we miss some cases where the TCP layer could send an RST but
we cannot trace it. So I decided to complete it :)

v4
Link: https://lore.kernel.org/all/20240329034243.7929-1-kerneljasonx...@gmail.com/
1. rebased against latest net-next
2. remove {} and add skb test statement (Eric)
3. drop v3 patch [3/3] temporarily because 1) location is not that useful
   since we can use perf or something else to trace, 2) Eric said we could
   use drop_reason to show why we have to RST, which is good, but this
   seems not to work well for the ->send_reset() logic. I need more time
   to investigate this part.

v3
1. fix a format problem in patch [3/3]

v2
1. fix spelling mistakes

Jason Xing (2):
  trace: adjust TP_STORE_ADDR_PORTS_SKB() parameters
  trace: tcp: fully support trace_tcp_send_reset

 include/trace/events/net_probe_common.h | 20 ++++++++-------
 include/trace/events/tcp.h              | 42 +++++++++++++++++++++++---
 include/trace/events/udp.h              |  2 +-
 net/ipv4/tcp_ipv4.c                     |  7 ++---
 net/ipv6/tcp_ipv6.c                     |  3 +-
 5 files changed, 56 insertions(+), 18 deletions(-)

--
2.37.3