Re: [PATCH v6 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-01-27 Thread Google
On Sat, 27 Jan 2024 19:44:06 +0100
Jiri Olsa  wrote:

> On Sat, Jan 27, 2024 at 12:14:05AM +0900, Masami Hiramatsu wrote:
> > On Thu, 25 Jan 2024 15:54:53 +0100
> > Jiri Olsa  wrote:
> > 
> > > On Fri, Jan 12, 2024 at 07:10:50PM +0900, Masami Hiramatsu (Google) wrote:
> > > > Hi,
> > > > 
> > > > Here is the 6th version of the series to re-implement the fprobe on
> > > > function-graph tracer. The previous version is;
> > > > 
> > > > https://lore.kernel.org/all/170290509018.220107.1347127510564358608.stgit@devnote2/
> > > > 
> > > > This version fixes use-after-unregister bug and arm64 stack unwinding
> > > > bug [13/36], add an improvement for multiple interrupts during push
> > > > operation[20/36], keep SAVE_REGS until BPF and fprobe_event using
> > > > ftrace_regs[26/36], also reorder the patches[30/36][31/36] so that new
> > > > fprobe can switch to SAVE_ARGS[32/36] safely.
> > > > This series also temporarily adds a DIRECT_CALLS bugfix[1/36], which
> > > > should be pushed separatedly as a stable bugfix.
> > > > 
> > > > There are some TODOs:
> > > >  - Add s390x and loongarch support to fprobe (multiple fgraph).
> > > >  - Fix to get the symbol address from ftrace entry address on arm64.
> > > >(This should be done in BPF trace event)
> > > >  - Cleanup code, rename some terms(offset/index) and FGRAPH_TYPE_BITMAP
> > > >part should be merged to FGRAPH_TYPE_ARRAY patch.
> > > 
> > > hi,
> > > I'm getting kasan bugs below when running bpf selftests on top of this
> > > patchset.. I think it's probably the reason I see failures in some bpf
> > > kprobe_multi/fprobe tests
> > > 
> > > so far I couldn't find the reason.. still checking ;-)
> > 
> > Thanks for reporting! Have you built the kernel with debuginfo? In that
> > case, can you decode the line from the address?
> > 
> > $ eu-addr2line -fi -e vmlinux ftrace_push_return_trace.isra.0+0x346
> > 
> > This helps me a lot.
> 
> I had to recompile/regenerate the fault, it points in here:
> 
> 8149b390 :
> ...
> 
> current->ret_stack[rindex - 1] = val;  
> 8149b6b1:   48 8d bd 78 28 00 00lea
> 0x2878(%rbp),%rdi
> 8149b6b8:   e8 63 e4 28 00  call   
> 81729b20 <__asan_load8>
> 8149b6bd:   48 8b 95 78 28 00 00mov
> 0x2878(%rbp),%rdx
> 8149b6c4:   41 8d 47 ff lea-0x1(%r15),%eax
> 8149b6c8:   48 98   cltq
> 8149b6ca:   4c 8d 24 c2 lea
> (%rdx,%rax,8),%r12
> 8149b6ce:   4c 89 e7mov%r12,%rdi
> 8149b6d1:   e8 ea e4 28 00  call   
> 81729bc0 <__asan_store8>
> --->8149b6d6:   49 89 1c 24 mov%rbx,(%r12)
> current->curr_ret_stack = index = rindex;
> 8149b6da:   48 8d bd 6c 28 00 00lea
> 0x286c(%rbp),%rdi
> 8149b6e1:   e8 9a e3 28 00  call   
> 81729a80 <__asan_store4>
> 8149b6e6:   44 89 bd 6c 28 00 00mov
> %r15d,0x286c(%rbp)
> 8149b6ed:   e9 8d fd ff ff  jmp
> 8149b47f 
> if (WARN_ON_ONCE(idx <= 0))  
> 

Thanks! So this shows that this bug is failed to check the boundary of
shadow stack while pushing the return trace.

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 0f11f80bdd6c..8e1fcc3f4bda 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -550,7 +550,7 @@ ftrace_push_return_trace(unsigned long ret, unsigned long 
func,
smp_rmb();
 
/* The return trace stack is full */
-   if (current->curr_ret_stack + FGRAPH_RET_INDEX >= 
SHADOW_STACK_MAX_INDEX) {
+   if (current->curr_ret_stack + FGRAPH_RET_INDEX + 1 >= 
SHADOW_STACK_MAX_INDEX) {
atomic_inc(>trace_overrun);
return -EBUSY;
} 

Sorry, I forgot to increment the space for reserved entry...

Thanks,

> 
> the dump is attached below (same address as in previous email)
> 
> jirka
> 
> 
> ---
> [  360.152200][C3] BUG: KASAN: slab-out-of-bounds in 
> ftrace_push_return_trace.isra.0+0x346/0x370
> [  360.153195][C3] Write of size 8 at addr 8881a0e10ff8 by task 
> kworker/3:4/728
> [  360.154101][C3] 
> [  360.154414][C3] CPU: 3 PID: 728 Comm: kworker/3:4 Tainted: G   
> OE  6.7.0+ #316 c9b0d53b3491b547d06b6b50629b74711600ddc9
> [  360.155679][C3] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
> BIOS 1.16.2-1.fc38 04/01/2014
> [  360.156611][C3] Workqueue: events free_obj_work
> [  360.157175][C3] Call Trace:
> [  360.157561][C3]  
> [  360.157904][C3]  dump_stack_lvl+0xf6/0x180
> [  360.158404][C3]  print_report+0xc4/0x610
> [  360.158853][C3]  ? lock_release+0xba/0x760
> [  360.159375][C3]  ? __phys_addr+0x5/0x80
> [  

Re: [RFC PATCH 2/2] x86/kprobes: boost more instructions from grp2/3/4/5

2024-01-27 Thread Google
On Fri, 26 Jan 2024 22:41:24 -0600
Jinghao Jia  wrote:

> With the instruction decoder, we are now able to decode and recognize
> instructions with opcode extensions. There are more instructions in
> these groups that can be boosted:
> 
> Group 2: ROL, ROR, RCL, RCR, SHL/SAL, SHR, SAR
> Group 3: TEST, NOT, NEG, MUL, IMUL, DIV, IDIV
> Group 4: INC, DEC (byte operation)
> Group 5: INC, DEC (word/doubleword/quadword operation)
> 
> These instructions are not boosted previously because there are reserved
> opcodes within the groups, e.g., group 2 with ModR/M.nnn == 110 is
> unmapped. As a result, kprobes attached to them requires two int3 traps
> as being non-boostable also prevents jump-optimization.
> 
> Some simple tests on QEMU show that after boosting and jump-optimization
> a single kprobe on these instructions with an empty pre-handler runs 10x
> faster (~1000 cycles vs. ~100 cycles).
> 
> Since these instructions are mostly ALU operations and do not touch
> special registers like RIP, let's boost them so that we get the
> performance benefit.
> 

As far as we check the ModR/M byte, I think we can safely run these
instructions on trampoline buffer without adjusting results (this
means it can be "boosted").
I just have a minor comment, but basically this looks good to me.

Reviewed-by: Masami Hiramatsu (Google) 

> Signed-off-by: Jinghao Jia 
> ---
>  arch/x86/kernel/kprobes/core.c | 21 +++--
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index 792b38d22126..f847bd9cc91b 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -169,22 +169,31 @@ int can_boost(struct insn *insn, void *addr)
>   case 0x62:  /* bound */
>   case 0x70 ... 0x7f: /* Conditional jumps */
>   case 0x9a:  /* Call far */
> - case 0xc0 ... 0xc1: /* Grp2 */
>   case 0xcc ... 0xce: /* software exceptions */
> - case 0xd0 ... 0xd3: /* Grp2 */
>   case 0xd6:  /* (UD) */
>   case 0xd8 ... 0xdf: /* ESC */
>   case 0xe0 ... 0xe3: /* LOOP*, JCXZ */
>   case 0xe8 ... 0xe9: /* near Call, JMP */
>   case 0xeb:  /* Short JMP */
>   case 0xf0 ... 0xf4: /* LOCK/REP, HLT */
> - case 0xf6 ... 0xf7: /* Grp3 */
> - case 0xfe:  /* Grp4 */
>   /* ... are not boostable */
>   return 0;
> + case 0xc0 ... 0xc1: /* Grp2 */
> + case 0xd0 ... 0xd3: /* Grp2 */
> + /* ModR/M nnn == 110 is reserved */
> + return X86_MODRM_REG(insn->modrm.bytes[0]) != 6;
> + case 0xf6 ... 0xf7: /* Grp3 */
> + /* ModR/M nnn == 001 is reserved */

/* AMD uses nnn == 001 as TEST, but Intel makes it reserved. */

> + return X86_MODRM_REG(insn->modrm.bytes[0]) != 1;
> + case 0xfe:  /* Grp4 */
> + /* Only inc and dec are boostable */
> + return X86_MODRM_REG(insn->modrm.bytes[0]) == 0 ||
> +X86_MODRM_REG(insn->modrm.bytes[0]) == 1;
>   case 0xff:  /* Grp5 */
> - /* Only indirect jmp is boostable */
> - return X86_MODRM_REG(insn->modrm.bytes[0]) == 4;
> + /* Only inc, dec, and indirect jmp are boostable */
> + return X86_MODRM_REG(insn->modrm.bytes[0]) == 0 ||
> +X86_MODRM_REG(insn->modrm.bytes[0]) == 1 ||
> +X86_MODRM_REG(insn->modrm.bytes[0]) == 4;
>   default:
>   return 1;
>   }
> -- 
> 2.43.0
> 

Thamnk you,

-- 
Masami Hiramatsu (Google) 



Re: [RFC PATCH 1/2] x86/kprobes: Prohibit kprobing on INT and UD

2024-01-27 Thread Google
On Fri, 26 Jan 2024 22:41:23 -0600
Jinghao Jia  wrote:

> Both INTs (INT n, INT1, INT3, INTO) and UDs (UD0, UD1, UD2) serve
> special purposes in the kernel, e.g., INT3 is used by KGDB and UD2 is
> involved in LLVM-KCFI instrumentation. At the same time, attaching
> kprobes on these instructions (particularly UDs) will pollute the stack
> trace dumped in the kernel ring buffer, since the exception is triggered
> in the copy buffer rather than the original location.
> 
> Check for INTs and UDs in can_probe and reject any kprobes trying to
> attach to these instructions.
> 

Thanks for implement this check!


> Suggested-by: Masami Hiramatsu (Google) 
> Signed-off-by: Jinghao Jia 
> ---
>  arch/x86/kernel/kprobes/core.c | 33 ++---
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index e8babebad7b8..792b38d22126 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -252,6 +252,22 @@ unsigned long recover_probed_instruction(kprobe_opcode_t 
> *buf, unsigned long add
>   return __recover_probed_insn(buf, addr);
>  }
>  
> +static inline int is_exception_insn(struct insn *insn)
> +{
> + if (insn->opcode.bytes[0] == 0x0f) {
> + /* UD0 / UD1 / UD2 */
> + return insn->opcode.bytes[1] == 0xff ||
> +insn->opcode.bytes[1] == 0xb9 ||
> +insn->opcode.bytes[1] == 0x0b;
> + } else {

If "else" block just return, you don't need this "else".

bool func()
{
if (cond)
return ...

return ...
}

Is preferrable because this puts "return val" always at the end of non-void
function.

> + /* INT3 / INT n / INTO / INT1 */
> + return insn->opcode.bytes[0] == 0xcc ||
> +insn->opcode.bytes[0] == 0xcd ||
> +insn->opcode.bytes[0] == 0xce ||
> +insn->opcode.bytes[0] == 0xf1;
> + }
> +}
> +
>  /* Check if paddr is at an instruction boundary */
>  static int can_probe(unsigned long paddr)
>  {
> @@ -294,6 +310,16 @@ static int can_probe(unsigned long paddr)
>  #endif
>   addr += insn.length;
>   }
> + __addr = recover_probed_instruction(buf, addr);
> + if (!__addr)
> + return 0;
> +
> + if (insn_decode_kernel(, (void *)__addr) < 0)
> + return 0;
> +
> + if (is_exception_insn())
> + return 0;
> +

Please don't put this outside of decoding loop. You should put these in
the loop which decodes the instruction from the beginning of the function.
Since the x86 instrcution is variable length, can_probe() needs to check
whether that the address is instruction boundary and decodable.

Thank you,

>   if (IS_ENABLED(CONFIG_CFI_CLANG)) {
>   /*
>* The compiler generates the following instruction sequence
> @@ -308,13 +334,6 @@ static int can_probe(unsigned long paddr)
>* Also, these movl and addl are used for showing expected
>* type. So those must not be touched.
>*/
> - __addr = recover_probed_instruction(buf, addr);
> - if (!__addr)
> - return 0;
> -
> - if (insn_decode_kernel(, (void *)__addr) < 0)
> - return 0;
> -
>   if (insn.opcode.value == 0xBA)
>   offset = 12;
>   else if (insn.opcode.value == 0x3)
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v8 3/9] pinctrl: single: add marvell,pxa1908-padconf compatible

2024-01-27 Thread Linus Walleij
On Wed, Jan 10, 2024 at 8:04 PM Duje Mihanović  wrote:

> Add the "marvell,pxa1908-padconf" compatible to allow migrating to a
> separate pinctrl driver later.
>
> Signed-off-by: Duje Mihanović 

Acked-by: Linus Walleij 

I guess you will merge all of this through the SoC tree?

Yours,
Linus Walleij



Re: [PATCH v8 2/9] dt-bindings: pinctrl: pinctrl-single: add marvell,pxa1908-padconf compatible

2024-01-27 Thread Linus Walleij
On Wed, Jan 10, 2024 at 8:04 PM Duje Mihanović  wrote:

> Add the "marvell,pxa1908-padconf" compatible to allow migrating to a
> separate pinctrl driver later.
>
> Reviewed-by: Rob Herring 
> Signed-off-by: Duje Mihanović 

Acked-by: Linus Walleij 

Yours,
Linus Walleij



Re: [PATCH] ARM: dts: qcom: apq8026-samsung-matissewifi: Configure touch keys

2024-01-27 Thread Bjorn Andersson


On Mon, 04 Dec 2023 11:46:49 +0200, Matti Lehtimäki wrote:
> Add touch keys which are handled in touchscreen driver.
> Use KEY_APPSELECT for the left button because other devices use that
> even though downstream kernel uses KEY_RECENT.
> 
> 

Applied, thanks!

[1/1] ARM: dts: qcom: apq8026-samsung-matissewifi: Configure touch keys
  commit: ffb05e91b68bc58484b94b5d3d1aa8e559278fd6

Best regards,
-- 
Bjorn Andersson 



Re: (subset) [PATCH v3 0/3] Enable venus on Fairphone 5 / non-ChromeOS sc7280 venus support

2024-01-27 Thread Bjorn Andersson


On Fri, 01 Dec 2023 10:33:17 +0100, Luca Weiss wrote:
> Devices with Qualcomm firmware (compared to ChromeOS firmware) need some
> changes in the venus driver and dts layout so that venus can initialize.
> 
> Do these changes, similar to sc7180.
> 
> 

Applied, thanks!

[2/3] arm64: dts: qcom: sc7280: Move video-firmware to chrome-common
  commit: 82066cdb17608abc95192632fd5c702be8e57ab5
[3/3] arm64: dts: qcom: qcm6490-fairphone-fp5: Enable venus node
  commit: 2aa72de2fc9230c54048e3555d04e945ec4f8891

Best regards,
-- 
Bjorn Andersson 



Re: [PATCH v3 2/3] arm64: dts: qcom: sc7280: Move video-firmware to chrome-common

2024-01-27 Thread Bjorn Andersson
On Fri, Dec 01, 2023 at 10:33:19AM +0100, Luca Weiss wrote:
> If the video-firmware node is present, the venus driver assumes we're on
> a system that doesn't use TZ for starting venus, like on ChromeOS
> devices.
> 
> Move the video-firmware node to chrome-common.dtsi so we can use venus
> on a non-ChromeOS devices. We also need to move the secure SID 0x2184
> for iommu since (on some boards) we cannot touch that.
> 
> At the same time also disable the venus node by default in the dtsi,
> like it's done on other SoCs.
> 
> Reviewed-by: Bryan O'Donoghue 
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi | 11 +++
>  arch/arm64/boot/dts/qcom/sc7280.dtsi   |  9 +++--
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi 
> b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> index 5d462ae14ba1..459ff877df54 100644
> --- a/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sc7280-chrome-common.dtsi
> @@ -104,6 +104,17 @@  {
>   dma-coherent;
>  };
>  
> + {
> + iommus = <_smmu 0x2180 0x20>,
> +  <_smmu 0x2184 0x20>;
> +
> + status = "okay";
> +
> + video-firmware {
> + iommus = <_smmu 0x21a2 0x0>;
> + };
> +};
> +
>   {
>   status = "okay";
>  };
> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi 
> b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> index 326897af117a..0ff9a2484096 100644
> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> @@ -3836,10 +3836,11 @@ venus: video-codec@aa0 {
>   <_noc MASTER_VIDEO_P0 0 _virt 
> SLAVE_EBI1 0>;
>   interconnect-names = "cpu-cfg", "video-mem";
>  
> - iommus = <_smmu 0x2180 0x20>,
> -  <_smmu 0x2184 0x20>;
> + iommus = <_smmu 0x2180 0x20>;

qcom,sc7280-venus expects 2 items here. Please follow up with a patch to
the binding, if you haven't send one already.

Thanks,
Bjorn

>   memory-region = <_mem>;
>  
> + status = "disabled";
> +
>   video-decoder {
>   compatible = "venus-decoder";
>   };
> @@ -3848,10 +3849,6 @@ video-encoder {
>   compatible = "venus-encoder";
>   };
>  
> - video-firmware {
> - iommus = <_smmu 0x21a2 0x0>;
> - };
> -
>   venus_opp_table: opp-table {
>   compatible = "operating-points-v2";
>  
> 
> -- 
> 2.43.0
> 



Re: [RFC PATCH 1/2] x86/kprobes: Prohibit kprobing on INT and UD

2024-01-27 Thread Xin Li

On 1/26/2024 8:41 PM, Jinghao Jia wrote:

Both INTs (INT n, INT1, INT3, INTO) and UDs (UD0, UD1, UD2) serve
special purposes in the kernel, e.g., INT3 is used by KGDB and UD2 is
involved in LLVM-KCFI instrumentation. At the same time, attaching
kprobes on these instructions (particularly UDs) will pollute the stack
trace dumped in the kernel ring buffer, since the exception is triggered
in the copy buffer rather than the original location.

Check for INTs and UDs in can_probe and reject any kprobes trying to
attach to these instructions.

Suggested-by: Masami Hiramatsu (Google) 
Signed-off-by: Jinghao Jia 
---
  arch/x86/kernel/kprobes/core.c | 33 ++---
  1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index e8babebad7b8..792b38d22126 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -252,6 +252,22 @@ unsigned long recover_probed_instruction(kprobe_opcode_t 
*buf, unsigned long add
return __recover_probed_insn(buf, addr);
  }
  
+static inline int is_exception_insn(struct insn *insn)


s/int/bool


+{
+   if (insn->opcode.bytes[0] == 0x0f) {
+   /* UD0 / UD1 / UD2 */
+   return insn->opcode.bytes[1] == 0xff ||
+  insn->opcode.bytes[1] == 0xb9 ||
+  insn->opcode.bytes[1] == 0x0b;
+   } else {
+   /* INT3 / INT n / INTO / INT1 */
+   return insn->opcode.bytes[0] == 0xcc ||
+  insn->opcode.bytes[0] == 0xcd ||
+  insn->opcode.bytes[0] == 0xce ||
+  insn->opcode.bytes[0] == 0xf1;
+   }
+}
+
  /* Check if paddr is at an instruction boundary */
  static int can_probe(unsigned long paddr)
  {
@@ -294,6 +310,16 @@ static int can_probe(unsigned long paddr)
  #endif
addr += insn.length;
}
+   __addr = recover_probed_instruction(buf, addr);
+   if (!__addr)
+   return 0;
+
+   if (insn_decode_kernel(, (void *)__addr) < 0)
+   return 0;
+
+   if (is_exception_insn())
+   return 0;
+
if (IS_ENABLED(CONFIG_CFI_CLANG)) {
/*
 * The compiler generates the following instruction sequence
@@ -308,13 +334,6 @@ static int can_probe(unsigned long paddr)
 * Also, these movl and addl are used for showing expected
 * type. So those must not be touched.
 */
-   __addr = recover_probed_instruction(buf, addr);
-   if (!__addr)
-   return 0;
-
-   if (insn_decode_kernel(, (void *)__addr) < 0)
-   return 0;
-
if (insn.opcode.value == 0xBA)
offset = 12;
else if (insn.opcode.value == 0x3)


--
Thanks!
Xin




Re: [RFC PATCH] kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

2024-01-27 Thread Paul E. McKenney
On Tue, Jan 09, 2024 at 07:28:29PM +0800, Yang Jihong wrote:
> Hello,
> 
> PING.
> 
> I had a similar problem. Is this solution feasible?

Sadly, no.

It fails on CONFIG_PREEMPT=y kernels because synchronize_rcu_tasks_rude()
will not wait on tasks that have been preempted while executing in
a trampoline.

But could you please try out the patch shown at the end of this email?

Thanx, Paul

> Thanks,
> Yang
> 
> On 2024/1/2 11:40, Chen Zhongjin wrote:
> > There is a deadlock scenario in kprobe_optimizer():
> > 
> > pid A   pid B   pid C
> > kprobe_optimizer()  do_exit()   perf_kprobe_init()
> > mutex_lock(_mutex)   exit_tasks_rcu_start()  
> > mutex_lock(_mutex)
> > synchronize_rcu_tasks() zap_pid_ns_processes()  // waiting 
> > kprobe_mutex
> > // waiting tasks_rcu_exit_srcu  kernel_wait4()
> > // waiting pid C exit
> > 
> > To avoid this deadlock loop, use synchronize_rcu_tasks_rude() in 
> > kprobe_optimizer()
> > rather than synchronize_rcu_tasks(). synchronize_rcu_tasks_rude() can also 
> > promise
> > that all preempted tasks have scheduled, but it will not wait 
> > tasks_rcu_exit_srcu.
> > 
> > Signed-off-by: Chen Zhongjin 
> > ---
> >   arch/Kconfig | 2 +-
> >   kernel/kprobes.c | 2 +-
> >   2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/Kconfig b/arch/Kconfig
> > index f4b210ab0612..dc6a18854017 100644
> > --- a/arch/Kconfig
> > +++ b/arch/Kconfig
> > @@ -104,7 +104,7 @@ config STATIC_CALL_SELFTEST
> >   config OPTPROBES
> > def_bool y
> > depends on KPROBES && HAVE_OPTPROBES
> > -   select TASKS_RCU if PREEMPTION
> > +   select TASKS_RUDE_RCU
> >   config KPROBES_ON_FTRACE
> > def_bool y
> > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > index d5a0ee40bf66..09056ae50c58 100644
> > --- a/kernel/kprobes.c
> > +++ b/kernel/kprobes.c
> > @@ -623,7 +623,7 @@ static void kprobe_optimizer(struct work_struct *work)
> >  * Note that on non-preemptive kernel, this is transparently converted
> >  * to synchronoze_sched() to wait for all interrupts to have completed.
> >  */
> > -   synchronize_rcu_tasks();
> > +   synchronize_rcu_tasks_rude();
> > /* Step 3: Optimize kprobes after quiesence period */
> > do_optimize_kprobes();



commit bc31e6cb27a9334140ff2f0a209d59b08bc0bc8c
Author: Paul E. McKenney 
Date:   Sat Jan 20 07:07:08 2024 -0800

rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks

Holding a mutex across synchronize_rcu_tasks() and acquiring
that same mutex in code called from do_exit() after its call to
exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
results in deadlock.  This is by design, because tasks that are far
enough into do_exit() are no longer present on the tasks list, making
it a bit difficult for RCU Tasks to find them, let alone wait on them
to do a voluntary context switch.  However, such deadlocks are becoming
more frequent.  In addition, lockdep currently does not detect such
deadlocks and they can be difficult to reproduce.

In addition, if a task voluntarily context switches during that time
(for example, if it blocks acquiring a mutex), then this task is in an
RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
just as well take advantage of that fact.

This commit therefore eliminates these deadlock by replacing the
SRCU-based wait for do_exit() completion with per-CPU lists of tasks
currently exiting.  A given task will be on one of these per-CPU lists for
the same period of time that this task would previously have been in the
previous SRCU read-side critical section.  These lists enable RCU Tasks
to find the tasks that have already been removed from the tasks list,
but that must nevertheless be waited upon.

The RCU Tasks grace period gathers any of these do_exit() tasks that it
must wait on, and adds them to the list of holdouts.  Per-CPU locking
and get_task_struct() are used to synchronize addition to and removal
from these lists.

Link: 
https://lore.kernel.org/all/20240118021842.290665-1-chenzhong...@huawei.com/

Reported-by: Chen Zhongjin 
Signed-off-by: Paul E. McKenney 

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cdb8ea53c365..4f0e9274da2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -858,6 +858,8 @@ struct task_struct {
u8  rcu_tasks_idx;
int rcu_tasks_idle_cpu;
struct list_headrcu_tasks_holdout_list;
+   int rcu_tasks_exit_cpu;
+   struct list_headrcu_tasks_exit_list;
 #endif /* #ifdef 

Re: [PATCH] powerpc/papr_scm: Move duplicate definitions to common header files

2024-01-27 Thread Shivaprasad G Bhat

On 1/26/24 02:46, Christophe Leroy wrote:


Le 18/04/2022 à 06:38, Shivaprasad G Bhat a écrit :

papr_scm and ndtest share common PDSM payload structs like
nd_papr_pdsm_health. Presently these structs are duplicated across
papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially
arch independent and can run on platforms other than PPC64, a way
needs to be deviced to avoid redundancy and duplication of PDSM
structs in future.

So the patch proposes moving the PDSM header from arch/powerpc/include-
-/uapi/ to the generic include/uapi/linux directory. Also, there are
some #defines common between papr_scm and ndtest which are not exported
to the user space. So, move them to a header file which can be shared
across ndtest and papr_scm via newly introduced include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
Suggested-by: "Aneesh Kumar K.V" 

This patch doesn't apply, if still relevant can you please rebase and
re-submit ?


Thanks for taking a look.


I have rebased and reposted the patch here

https://lore.kernel.org/nvdimm/170638176942.112443.2937254675538057083.st...@ltcd48-lp2.aus.stglab.ibm.com/T/#u


Thanks!

Shivaprasad






[REPOST PATCH v3] powerpc/papr_scm: Move duplicate definitions to common header files

2024-01-27 Thread Shivaprasad G Bhat
papr_scm and ndtest share common PDSM payload structs like
nd_papr_pdsm_health. Presently these structs are duplicated across
papr_pdsm.h and ndtest.h header files. Since 'ndtest' is essentially
arch independent and can run on platforms other than PPC64, a way
needs to be deviced to avoid redundancy and duplication of PDSM
structs in future.

So the patch proposes moving the PDSM header from arch/powerpc/include-
-/uapi/ to the generic include/uapi/linux directory. Also, there
are some #defines common between papr_scm and ndtest which are not
exported to the user space. So, move them to a header file which
can be shared across ndtest and papr_scm via newly introduced
include/linux/papr_scm.h.

Signed-off-by: Shivaprasad G Bhat 
Signed-off-by: Vaibhav Jain 
Suggested-by: "Aneesh Kumar K.V" 
---
Changelog:
Repost of v3:
Link: 
https://lore.kernel.org/all/165763948558.3501667.16230079003621326755.st...@ltc-boston123.aus.stglabs.ibm.com/
Rebased again, no changes.

Repost of v3:
Link: 
https://lore.kernel.org/nvdimm/165025666388.2927278.9540058958498766114.st...@lep8c.aus.stglabs.ibm.com/
Rebased and no changes.

Since v2:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/163454440296.431294.2368481747380790011.st...@lep8c.aus.stglabs.ibm.com/
* Made it like v1, and rebased.
* Fixed repeating words in comments of the header file papr_scm.h

Since v1:
Link: 
https://patchwork.kernel.org/project/linux-nvdimm/patch/162505488483.72147.12741153746322191381.stgit@56e104a48989/
* Removed dependency on this patch for the other patches

 MAINTAINERS   |2
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  165 -
 arch/powerpc/platforms/pseries/papr_scm.c |   43 
 include/linux/papr_scm.h  |   49 +
 include/uapi/linux/papr_pdsm.h|  165 +
 tools/testing/nvdimm/test/ndtest.c|2
 tools/testing/nvdimm/test/ndtest.h|   31 -
 7 files changed, 220 insertions(+), 237 deletions(-)
 delete mode 100644 arch/powerpc/include/uapi/asm/papr_pdsm.h
 create mode 100644 include/linux/papr_scm.h
 create mode 100644 include/uapi/linux/papr_pdsm.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 39219b144c23..70da9aa81654 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12393,6 +12393,8 @@ F:  drivers/rtc/rtc-opal.c
 F: drivers/scsi/ibmvscsi/
 F: drivers/tty/hvc/hvc_opal.c
 F: drivers/watchdog/wdrtas.c
+F: include/linux/papr_scm.h
+F: include/uapi/linux/papr_pdsm.h
 F: tools/testing/selftests/powerpc
 N: /pmac
 N: powermac
diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
deleted file mode 100644
index 17439925045c..
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ /dev/null
@@ -1,165 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
-/*
- * PAPR nvDimm Specific Methods (PDSM) and structs for libndctl
- *
- * (C) Copyright IBM 2020
- *
- * Author: Vaibhav Jain 
- */
-
-#ifndef _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-#define _UAPI_ASM_POWERPC_PAPR_PDSM_H_
-
-#include 
-#include 
-
-/*
- * PDSM Envelope:
- *
- * The ioctl ND_CMD_CALL exchange data between user-space and kernel via
- * envelope which consists of 2 headers sections and payload sections as
- * illustrated below:
- *  +-+---+---+
- *  |   64-Bytes  |   8-Bytes |   Max 184-Bytes   |
- *  +-+---+---+
- *  | ND-HEADER   |  PDSM-HEADER  |  PDSM-PAYLOAD |
- *  +-+---+---+
- *  | nd_family   |   |   |
- *  | nd_size_out | cmd_status|   |
- *  | nd_size_in  | reserved  | nd_pdsm_payload   |
- *  | nd_command  | payload   --> |   |
- *  | nd_fw_size  |   |   |
- *  | nd_payload ---> |   |   |
- *  +---+-+---+
- *
- * ND Header:
- * This is the generic libnvdimm header described as 'struct nd_cmd_pkg'
- * which is interpreted by libnvdimm before passed on to papr_scm. Important
- * member fields used are:
- * 'nd_family' : (In) NVDIMM_FAMILY_PAPR_SCM
- * 'nd_size_in': (In) PDSM-HEADER + PDSM-IN-PAYLOAD (usually 0)
- * 'nd_size_out': (In) PDSM-HEADER + PDSM-RETURN-PAYLOAD
- * 'nd_command' : (In) One of PAPR_PDSM_XXX
- * 'nd_fw_size' : (Out) PDSM-HEADER + size of actual payload returned
- *
- * PDSM Header:
- * This is papr-scm specific header that precedes the payload. This is defined
- * as nd_cmd_pdsm_pkg.  Following fields aare available in this header:
- *
- * 'cmd_status': (Out) Errors if any encountered while 
servicing PDSM.
- * 

Re: [PATCH v6 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-01-27 Thread Jiri Olsa
On Sat, Jan 27, 2024 at 12:14:05AM +0900, Masami Hiramatsu wrote:
> On Thu, 25 Jan 2024 15:54:53 +0100
> Jiri Olsa  wrote:
> 
> > On Fri, Jan 12, 2024 at 07:10:50PM +0900, Masami Hiramatsu (Google) wrote:
> > > Hi,
> > > 
> > > Here is the 6th version of the series to re-implement the fprobe on
> > > function-graph tracer. The previous version is;
> > > 
> > > https://lore.kernel.org/all/170290509018.220107.1347127510564358608.stgit@devnote2/
> > > 
> > > This version fixes use-after-unregister bug and arm64 stack unwinding
> > > bug [13/36], add an improvement for multiple interrupts during push
> > > operation[20/36], keep SAVE_REGS until BPF and fprobe_event using
> > > ftrace_regs[26/36], also reorder the patches[30/36][31/36] so that new
> > > fprobe can switch to SAVE_ARGS[32/36] safely.
> > > This series also temporarily adds a DIRECT_CALLS bugfix[1/36], which
> > > should be pushed separatedly as a stable bugfix.
> > > 
> > > There are some TODOs:
> > >  - Add s390x and loongarch support to fprobe (multiple fgraph).
> > >  - Fix to get the symbol address from ftrace entry address on arm64.
> > >(This should be done in BPF trace event)
> > >  - Cleanup code, rename some terms(offset/index) and FGRAPH_TYPE_BITMAP
> > >part should be merged to FGRAPH_TYPE_ARRAY patch.
> > 
> > hi,
> > I'm getting kasan bugs below when running bpf selftests on top of this
> > patchset.. I think it's probably the reason I see failures in some bpf
> > kprobe_multi/fprobe tests
> > 
> > so far I couldn't find the reason.. still checking ;-)
> 
> Thanks for reporting! Have you built the kernel with debuginfo? In that
> case, can you decode the line from the address?
> 
> $ eu-addr2line -fi -e vmlinux ftrace_push_return_trace.isra.0+0x346
> 
> This helps me a lot.

I had to recompile/regenerate the fault, it points in here:

8149b390 :
...

current->ret_stack[rindex - 1] = val;  
8149b6b1:   48 8d bd 78 28 00 00lea0x2878(%rbp),%rdi
8149b6b8:   e8 63 e4 28 00  call   81729b20 
<__asan_load8>
8149b6bd:   48 8b 95 78 28 00 00mov0x2878(%rbp),%rdx
8149b6c4:   41 8d 47 ff lea-0x1(%r15),%eax
8149b6c8:   48 98   cltq
8149b6ca:   4c 8d 24 c2 lea
(%rdx,%rax,8),%r12
8149b6ce:   4c 89 e7mov%r12,%rdi
8149b6d1:   e8 ea e4 28 00  call   81729bc0 
<__asan_store8>
--->8149b6d6:   49 89 1c 24 mov%rbx,(%r12)
current->curr_ret_stack = index = rindex;
8149b6da:   48 8d bd 6c 28 00 00lea0x286c(%rbp),%rdi
8149b6e1:   e8 9a e3 28 00  call   81729a80 
<__asan_store4>
8149b6e6:   44 89 bd 6c 28 00 00mov
%r15d,0x286c(%rbp)
8149b6ed:   e9 8d fd ff ff  jmp8149b47f 

if (WARN_ON_ONCE(idx <= 0))  


the dump is attached below (same address as in previous email)

jirka


---
[  360.152200][C3] BUG: KASAN: slab-out-of-bounds in 
ftrace_push_return_trace.isra.0+0x346/0x370
[  360.153195][C3] Write of size 8 at addr 8881a0e10ff8 by task 
kworker/3:4/728
[  360.154101][C3] 
[  360.154414][C3] CPU: 3 PID: 728 Comm: kworker/3:4 Tainted: G   
OE  6.7.0+ #316 c9b0d53b3491b547d06b6b50629b74711600ddc9
[  360.155679][C3] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.2-1.fc38 04/01/2014
[  360.156611][C3] Workqueue: events free_obj_work
[  360.157175][C3] Call Trace:
[  360.157561][C3]  
[  360.157904][C3]  dump_stack_lvl+0xf6/0x180
[  360.158404][C3]  print_report+0xc4/0x610
[  360.158853][C3]  ? lock_release+0xba/0x760
[  360.159375][C3]  ? __phys_addr+0x5/0x80
[  360.159872][C3]  ? __phys_addr+0x33/0x80
[  360.161309][C3]  kasan_report+0xbe/0xf0
[  360.161940][C3]  ? ftrace_push_return_trace.isra.0+0x346/0x370
[  360.162817][C3]  ? ftrace_push_return_trace.isra.0+0x346/0x370
[  360.163518][C3]  ? __pfx_kernel_text_address+0x10/0x10
[  360.164152][C3]  ? __kernel_text_address+0xe/0x40
[  360.164715][C3]  ftrace_push_return_trace.isra.0+0x346/0x370
[  360.165324][C3]  ? __pfx_kernel_text_address+0x10/0x10
[  360.165940][C3]  function_graph_enter_ops+0xbb/0x2d0
[  360.166555][C3]  ? __kernel_text_address+0xe/0x40
[  360.167134][C3]  ? __pfx_function_graph_enter_ops+0x10/0x10
[  360.167801][C3]  ? __pfx_function_graph_enter_ops+0x10/0x10
[  360.168454][C3]  ? __pfx___kernel_text_address+0x10/0x10
[  360.169086][C3]  ? __pfx_unwind_get_return_address+0x10/0x10
[  360.169781][C3]  ftrace_graph_func+0x142/0x270
[  360.170341][C3]  ? __pfx_kernel_text_address+0x10/0x10
[  360.170960][C3]  ? 

Re: [PATCH RFC 2/2] arm64: dts: qcom: msm8953: Add GPU

2024-01-27 Thread Luca Weiss
On Freitag, 26. Jänner 2024 00:50:43 CET Konrad Dybcio wrote:
> On 1/25/24 22:56, Luca Weiss wrote:
> > From: Vladimir Lypak 
> > 
> > Add the GPU node for the Adreno 506 found on this family of SoCs. The
> > clock speeds are a bit different per SoC variant, SDM450 maxes out at
> > 600MHz while MSM8953 (= SDM625) goes up to 650MHz and SDM632 goes up to
> > 725MHz.
> > 
> > To achieve this, create a new sdm450.dtsi to hold the 600MHz OPP and
> > use the new dtsi for sdm450-motorola-ali.
> > 
> > Signed-off-by: Vladimir Lypak 
> > Co-developed-by: Luca Weiss 
> > Signed-off-by: Luca Weiss 
> > ---
> > 
> >   arch/arm64/boot/dts/qcom/msm8953.dtsi| 115
> >   +++
> >   arch/arm64/boot/dts/qcom/sdm450-motorola-ali.dts |   2 +-
> >   arch/arm64/boot/dts/qcom/sdm450.dtsi |  14 +++
> >   arch/arm64/boot/dts/qcom/sdm632.dtsi |   8 ++
> >   4 files changed, 138 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi
> > b/arch/arm64/boot/dts/qcom/msm8953.dtsi index 91d083871ab0..1fe0c0c4fd15
> > 100644
> > --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi
> > +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi
> > @@ -1046,6 +1046,94 @@ mdss_dsi1_phy: phy@1a96400 {
> > 
> > };
> > 
> > };
> > 
> > +   gpu: gpu@1c0 {
> > +   compatible = "qcom,adreno-506.0", "qcom,adreno";
> > +   reg = <0x01c0 0x4>;
> > +   reg-names = "kgsl_3d0_reg_memory";
> > +   interrupts = ;
> > +
> > +   clocks = < GCC_OXILI_GFX3D_CLK>,
> > +< GCC_OXILI_AHB_CLK>,
> > +< GCC_BIMC_GFX_CLK>,
> > +< GCC_BIMC_GPU_CLK>,
> > +< GCC_OXILI_TIMER_CLK>,
> > +< GCC_OXILI_AON_CLK>;
> > +   clock-names = "core",
> > + "iface",
> > + "mem_iface",
> > + "alt_mem_iface",
> > + "rbbmtimer",
> > + "alwayson";
> > +   power-domains = < OXILI_GX_GDSC>;
> > +
> > +   iommus = <_iommu 0>;
> > +   operating-points-v2 = <_opp_table>;
> > +
> > +   #cooling-cells = <2>;
> > +
> > +   status = "disabled";
> > +
> > +   zap-shader {
> > +   memory-region = <_shader_region>;
> > +   };
> > +
> > +   gpu_opp_table: opp-table {
> > +   compatible = "operating-points-v2";
> > +
> > +   opp-1920 {
> > +   opp-hz = /bits/ 64 <1920>;
> > +   opp-supported-hw = <0xff>;
> > +   required-opps = <_opp_min_svs>;
> > +   };
> 
> If you remove all OPPs but this one, can the GPU still spit out pixels?

Yep, phosh is starting and is rendering at a few fps.

fairphone-fp3:~$ cat 
/sys/devices/platform/soc@0/1c0.gpu/devfreq/1c0.gpu/min_freq
1920
fairphone-fp3:~$ cat 
/sys/devices/platform/soc@0/1c0.gpu/devfreq/1c0.gpu/max_freq 
1920
fairphone-fp3:~$ cat 
/sys/devices/platform/soc@0/1c0.gpu/devfreq/1c0.gpu/cur_freq 
1920

Regards
Luca

> 
> Konrad







Re: [PATCH RFC 1/2] arm64: dts: qcom: msm8953: Add GPU IOMMU

2024-01-27 Thread Luca Weiss
On Freitag, 26. Jänner 2024 00:49:55 CET Konrad Dybcio wrote:
> On 1/25/24 23:24, Dmitry Baryshkov wrote:
> > On 25/01/2024 23:56, Luca Weiss wrote:
> >> From: Vladimir Lypak 
> >> 
> >> Add the IOMMU used for the GPU on MSM8953.
> >> 
> >> Signed-off-by: Vladimir Lypak 
> >> ---
> >>   arch/arm64/boot/dts/qcom/msm8953.dtsi | 31
> >> +++ 1 file changed, 31 insertions(+)
> >> 
> >> diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi
> >> b/arch/arm64/boot/dts/qcom/msm8953.dtsi index dcb5c98b793c..91d083871ab0
> >> 100644
> >> --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi
> >> +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi
> >> @@ -1046,6 +1046,37 @@ mdss_dsi1_phy: phy@1a96400 {
> >>   };
> >>   };
> >> +gpu_iommu: iommu@1c48000 {
> > 
> > Nit: most of the platforms use the adreno_smmu label. But maybe the
> > msm-iommu vs arm-smmu makes difference here.
> Not really :)
> 
> Please keep the labels unified

Ack, renaming to adreno_smmu

> 
> > Nevertheless:
> > 
> > Reviewed-by: Dmitry Baryshkov 
> > 
> >> +compatible = "qcom,msm8953-iommu", "qcom,msm-iommu-v2";
> >> +ranges = <0 0x01c48000 0x8000>;
> >> +
> >> +clocks = < GCC_OXILI_AHB_CLK>,
> >> + < GCC_BIMC_GFX_CLK>;
> 
> And align these

They are?

Also any comment about the issues listed in the cover letter?

Regards
Luca

> 
> Konrad







Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-27 Thread Borislav Petkov
On Fri, Jan 26, 2024 at 10:01:29PM +, Luck, Tony wrote:
> PPIN: Nice to have. People that send stuff to me are terrible about
> providing surrounding details. The record already includes
> CPUID(1).EAX ... so I can at least skip the step of asking them which
> CPU family/model/stepping they were using). But PPIN can be recovered
> (so long as the submitter kept good records about which system
> generated the record).

Yes.

> MICROCODE: Must have. Microcode version can be changed at run time.
> Going back to the system to check later may not give the correct
> answer to what was active at the time of the error. Especially for an
> error reported while a microcode update is waling across the CPUs
> poking the MSR on each in turn.

Easy:

- You've got an MCE? Was it during scheduled microcode updates?
- Yes.
- Come back to me when it happens again, *outside* of the microcode
  update schedule.

Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk
to data center operators more but this sounds like microcode update
failing is such a common thing to happen so that we *absolutely* *must*
capture the microcode revision when an MCE happens.

Maybe we should make microcode updates more resilient and add a retry
mechanism which doesn't back off as easily.

Or maybe people should script around it and keep retrying, dunno.

In my world, microcode update just works in the vast majority of the
cases and if it doesn't, then those cases need a specific look.

And if I am debugging an issue and I want to see the microcode revision,
I look at /proc/cpuinfo.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-27 Thread kernel test robot
Hi Yunjian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:
https://github.com/intel-lab-lkp/linux/commits/Yunjian-Wang/xsk-Remove-non-zero-dma_page-check-in-xp_assign_dev/20240124-174011
base:   net-next/main
patch link:
https://lore.kernel.org/r/1706089075-16084-1-git-send-email-wangyunjian%40huawei.com
patch subject: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
config: um-allmodconfig 
(https://download.01.org/0day-ci/archive/20240127/202401271943.bs1gpeeu-...@intel.com/config)
compiler: clang version 18.0.0git (https://github.com/llvm/llvm-project 
a31a60074717fc40887cfe132b77eec93bedd307)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240127/202401271943.bs1gpeeu-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202401271943.bs1gpeeu-...@intel.com/

All warnings (new ones prefixed by >>):

   In file included from drivers/net/tun.c:44:
   In file included from include/linux/skbuff.h:17:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a 
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
 547 | val = __raw_readb(PCI_IOBASE + addr);
 |   ~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a 
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
 560 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + 
addr));
 | ~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from 
macro '__le16_to_cpu'
  37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
 |   ^
   In file included from drivers/net/tun.c:44:
   In file included from include/linux/skbuff.h:17:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a 
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
 573 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + 
addr));
 | ~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from 
macro '__le32_to_cpu'
  35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
 |   ^
   In file included from drivers/net/tun.c:44:
   In file included from include/linux/skbuff.h:17:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from arch/um/include/asm/hardirq.h:5:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/um/include/asm/io.h:24:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a 
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
 584 | __raw_writeb(value, PCI_IOBASE + addr);
 | ~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a 
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
 594 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + 
addr);
 |   ~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a 
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
 604 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + 
addr);
 |   ~~ ^
   include/asm-generic/io.h:692:20: warning: performing pointer arithmetic on a 
null pointer has undefined be

Re: [PATCH v2] kprobes: Use synchronize_rcu_tasks_rude in kprobe_optimizer

2024-01-27 Thread Chen Zhongjin

On 2024/1/20 23:30, Paul E. McKenney wrote:

Hi Paul,
This patch works for my reproduce test case.

Just a small question, if you dont mind, this problem exsit on LTS 
version but we had struct rcu_tasks_percpu after v5.17. We can't 
backport this patch to LTS 5.10 or 4.19, which are still under maintaince.
If you have any idea or plan to apply this patch to elder version, 
please tell me, thanks very much!


Anyway, it's ok to apply this patch to mainline.

Best,
Chen


Again, that comment reads in full as follows:

/*
 * Step 2: Wait for quiesence period to ensure all potentially
 * preempted tasks to have normally scheduled. Because optprobe
 * may modify multiple instructions, there is a chance that Nth
 * instruction is preempted. In that case, such tasks can return
 * to 2nd-Nth byte of jump instruction. This wait is for avoiding it.
 * Note that on non-preemptive kernel, this is transparently converted
 * to synchronoze_sched() to wait for all interrupts to have completed.
 */

Please note well that first sentence.

Unless that first sentence no longer holds, this patch cannot work
because synchronize_rcu_tasks_rude() will not (repeat, NOT) wait for
preempted tasks.

So how to safely break this deadlock?  Reproducing Chen Zhongjin's
diagram:

pid A   pid B   pid C
kprobe_optimizer()  do_exit()   perf_kprobe_init()
mutex_lock(_mutex)   exit_tasks_rcu_start()  mutex_lock(_mutex)
synchronize_rcu_tasks() zap_pid_ns_processes()  // waiting kprobe_mutex
// waiting tasks_rcu_exit_srcu  kernel_wait4()
// waiting pid C exit

We need to stop synchronize_rcu_tasks() from waiting on tasks like
pid B that are voluntarily blocked.  One way to do that is to replace
SRCU with a set of per-CPU lists.  Then exit_tasks_rcu_start() adds the
current task to this list and does ...

OK, this is getting a bit involved.  If you would like to follow along,
please feel free to look here:

https://docs.google.com/document/d/1MEHHs5qbbZBzhN8dGP17pt-d87WptFJ2ZQcqS221d9I/edit?usp=sharing


And please see below for a prototype patch, which passes moderate
rcutorture testing.

Chen Zhongjin, does this prevent the deadlock you have been seeing?

Thanx, Paul



commit 113fe0eeabe7a83e87d638d44b9e1d0f9691b146
Author: Paul E. McKenney 
Date:   Sat Jan 20 07:07:08 2024 -0800

 rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
 
 Holding a mutex across synchronize_rcu_tasks() and acquiring

 that same mutex in code called from do_exit() after its call to
 exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop()
 results in deadlock.  This is by design, because tasks that are far
 enough into do_exit() are no longer present on the tasks list, making
 it a bit difficult for RCU Tasks to find them, let alone wait on them
 to do a voluntary context switch.  However, such deadlocks are becoming
 more frequent.  In addition, lockdep currently does not detect such
 deadlocks and they can be difficult to reproduce.
 
 In addition, if a task voluntarily context switches during that time

 (for example, if it blocks acquiring a mutex), then this task is in an
 RCU Tasks quiescent state.  And with some adjustments, RCU Tasks could
 just as well take advantage of that fact.
 
 This commit therefore eliminates these deadlock by replacing the

 SRCU-based wait for do_exit() completion with per-CPU lists of tasks
 currently exiting.  A given task will be on one of these per-CPU lists for
 the same period of time that this task would previously have been in the
 previous SRCU read-side critical section.  These lists enable RCU Tasks
 to find the tasks that have already been removed from the tasks list,
 but that must nevertheless be waited upon.
 
 The RCU Tasks grace period gathers any of these do_exit() tasks that it

 must wait on, and adds them to the list of holdouts.  Per-CPU locking
 and get_task_struct() are used to synchronize addition to and removal
 from these lists.
 
 Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhong...@huawei.com/
 
 Reported-by: Chen Zhongjin 

 Signed-off-by: Paul E. McKenney 

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 10b1b815..3fe36befb613 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -855,6 +855,8 @@ struct task_struct {
u8  rcu_tasks_idx;
int rcu_tasks_idle_cpu;
struct list_headrcu_tasks_holdout_list;
+   int rcu_tasks_exit_cpu;
+   struct list_head

RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-27 Thread wangyunjian
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Thursday, January 25, 2024 12:49 PM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > da...@davemloft.net; magnus.karls...@intel.com;
> > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > 
> > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> >
> > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> 
> > wrote:
> > >
> > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > drivers, which can reduce CPU utilization on the xdp program.
> > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > >
> > > This patch tries to address this by:
> > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe empty.
> > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > - add tun_put_user_desc function to copy the Rx data to VM
> >
> > Code explains themselves, let's explain why you need to do this.
> >
> > 1) why you want to use peek_len
> > 2) for "vq's array", what does it mean?
> > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I
> > guess you meant TX zerocopy instead of RX (as I don't see codes for
> > RX?)
> 
> OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX zerocopy
> from the view of vhost-net.
> 
> >
> > A big question is how could you handle GSO packets from userspace/guests?
> 
> Now by disabling VM's TSO and csum feature. XDP does not support GSO
> packets.
> However, this feature can be added once XDP supports it in the future.
> 
> >
> > >
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > >  drivers/net/tun.c   | 165
> > +++-
> > >  drivers/vhost/net.c |  18 +++--
> > >  2 files changed, 176 insertions(+), 7 deletions(-)

[...]

> > >
> > >  static int peek_head_len(struct vhost_net_virtqueue *rvq, struct
> > > sock
> > > *sk)  {
> > > +   struct socket *sock = sk->sk_socket;
> > > struct sk_buff *head;
> > > int len = 0;
> > > unsigned long flags;
> > >
> > > -   if (rvq->rx_ring)
> > > -   return vhost_net_buf_peek(rvq);
> > > +   if (rvq->rx_ring) {
> > > +   len = vhost_net_buf_peek(rvq);
> > > +   if (likely(len))
> > > +   return len;
> > > +   }
> > > +
> > > +   if (sock->ops->peek_len)
> > > +   return sock->ops->peek_len(sock);
> >
> > What prevents you from reusing the ptr_ring here? Then you don't need
> > the above tricks.
> 
> Thank you for your suggestion. I will consider how to reuse the ptr_ring.

If ptr_ring is used to transfer xdp_descs, there is a problem: After some
xdp_descs are obtained through xsk_tx_peek_desc(), the descs may fail
to be added to ptr_ring. However, no API is available to implement the
rollback function.

Thanks

> 
> >
> > Thanks
> >
> >
> > >
> > > spin_lock_irqsave(>sk_receive_queue.lock, flags);
> > > head = skb_peek(>sk_receive_queue);
> > > --
> > > 2.33.0
> > >