Re: KASAN: null-ptr-deref Write in tctx_task_work_run

2024-03-17 Thread Jens Axboe
On 3/17/24 6:59 PM, Ubisectech Sirius wrote:
> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> Recently, our team has discovered a issue in Linux kernel 
> 6.8.0-ge5e038b7ae9d. Attached to the email were a POC file of the issue.
> 
> Stack dump:
> 
> ==
> BUG: KASAN: null-ptr-deref in instrument_atomic_read_write 
> include/linux/instrumented.h:96 [inline]
> BUG: KASAN: null-ptr-deref in llist_del_all include/linux/llist.h:266 [inline]
> BUG: KASAN: null-ptr-deref in tctx_task_work_run+0x7d/0x330 
> io_uring/io_uring.c:1267
> Write of size 8 at addr 01c0 by task iou-sqp-215603/215604
> 
> CPU: 0 PID: 215604 Comm: iou-sqp-215603 Not tainted 6.8.0-ge5e038b7ae9d #40
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> 04/01/2014
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:114
>  kasan_report+0xbd/0xf0 mm/kasan/report.c:601
>  check_region_inline mm/kasan/generic.c:183 [inline]
>  kasan_check_range+0xf4/0x1a0 mm/kasan/generic.c:189
>  instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
>  llist_del_all include/linux/llist.h:266 [inline]
>  tctx_task_work_run+0x7d/0x330 io_uring/io_uring.c:1267
>  io_sq_tw+0x12a/0x1d0 io_uring/sqpoll.c:245
>  io_sq_thread+0x8d7/0x18a0 io_uring/sqpoll.c:308
>  ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
>  
> ==
> Kernel panic - not syncing: KASAN: panic_on_warn set ...
> CPU: 0 PID: 215604 Comm: iou-sqp-215603 Not tainted 6.8.0-ge5e038b7ae9d #40
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> 04/01/2014

I think you snipped the fault injection that came before this. It looks
like an allocation failure, so we don't get tsk->io_uring setup for the
SQPOLL thread. Not a great way to handle this, but can you try the
below? Would be nicer if we could just prune the task rather than wake
it and have it error.

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 363052b4ea76..db7b0fdfe1cb 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -274,6 +274,10 @@ static int io_sq_thread(void *data)
char buf[TASK_COMM_LEN];
DEFINE_WAIT(wait);
 
+   /* offload context creation failed, just exit */
+   if (!current->io_uring) {
+   goto err_out;
+
snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
set_task_comm(current, buf);
 
@@ -371,7 +375,7 @@ static int io_sq_thread(void *data)
atomic_or(IORING_SQ_NEED_WAKEUP, >rings->sq_flags);
io_run_task_work();
mutex_unlock(>lock);
-
+err_out:
complete(>exited);
do_exit(0);
 }

-- 
Jens Axboe




KASAN: null-ptr-deref Write in tctx_task_work_run

2024-03-17 Thread Ubisectech Sirius
Hello.
We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
Recently, our team has discovered a issue in Linux kernel 6.8.0-ge5e038b7ae9d. 
Attached to the email were a POC file of the issue.

Stack dump:

==
BUG: KASAN: null-ptr-deref in instrument_atomic_read_write 
include/linux/instrumented.h:96 [inline]
BUG: KASAN: null-ptr-deref in llist_del_all include/linux/llist.h:266 [inline]
BUG: KASAN: null-ptr-deref in tctx_task_work_run+0x7d/0x330 
io_uring/io_uring.c:1267
Write of size 8 at addr 01c0 by task iou-sqp-215603/215604

CPU: 0 PID: 215604 Comm: iou-sqp-215603 Not tainted 6.8.0-ge5e038b7ae9d #40
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:114
 kasan_report+0xbd/0xf0 mm/kasan/report.c:601
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0xf4/0x1a0 mm/kasan/generic.c:189
 instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
 llist_del_all include/linux/llist.h:266 [inline]
 tctx_task_work_run+0x7d/0x330 io_uring/io_uring.c:1267
 io_sq_tw+0x12a/0x1d0 io_uring/sqpoll.c:245
 io_sq_thread+0x8d7/0x18a0 io_uring/sqpoll.c:308
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
 
==
Kernel panic - not syncing: KASAN: panic_on_warn set ...
CPU: 0 PID: 215604 Comm: iou-sqp-215603 Not tainted 6.8.0-ge5e038b7ae9d #40
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x3d/0x1b0 lib/dump_stack.c:114
 panic+0x6d2/0x780 kernel/panic.c:344
 check_panic_on_warn+0xb1/0xc0 kernel/panic.c:237
 end_report+0x107/0x150 mm/kasan/report.c:226
 kasan_report+0xcd/0xf0 mm/kasan/report.c:603
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0xf4/0x1a0 mm/kasan/generic.c:189
 instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
 llist_del_all include/linux/llist.h:266 [inline]
 tctx_task_work_run+0x7d/0x330 io_uring/io_uring.c:1267
 io_sq_tw+0x12a/0x1d0 io_uring/sqpoll.c:245
 io_sq_thread+0x8d7/0x18a0 io_uring/sqpoll.c:308
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
 
Kernel Offset: disabled
Rebooting in 86400 seconds..

Thank you for taking the time to read this email and we look forward to working 
with you further.




poc.c
Description: Binary data


Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-17 Thread Gavin Shan

On 3/18/24 02:50, Michael S. Tsirkin wrote:

On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:


On 3/15/24 21:05, Michael S. Tsirkin wrote:

On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:

Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper platform. I 
tried

to reproduce it with my own driver where one thread writes to the shared buffer
and another thread reads from the buffer. I don't hit the out-of-order issue so
far.


Make sure the 2 areas you are accessing are in different cache lines.



Yes, I already put those 2 areas to separate cache lines.




My driver may be not correct somewhere and I will update if I can reproduce
the issue with my driver in the future.


Then maybe your change is just making virtio slower and masks the bug
that is actually elsewhere?

You don't really need a driver. Here's a simple test: without barriers
assertion will fail. With barriers it will not.
(Warning: didn't bother testing too much, could be buggy.

---

#include 
#include 
#include 
#include 

#define FIRST values[0]
#define SECOND values[64]

volatile int values[100] = {};

void* writer_thread(void* arg) {
while (1) {
FIRST++;
// NEED smp_wmb here

 __asm__ volatile("dmb ishst" : : : "memory");

SECOND++;
}
}

void* reader_thread(void* arg) {
  while (1) {
int first = FIRST;
// NEED smp_rmb here

 __asm__ volatile("dmb ishld" : : : "memory");

int second = SECOND;
assert(first - second == 1 || first - second == 0);
  }
}

int main() {
  pthread_t writer, reader;

  pthread_create(, NULL, writer_thread, NULL);
  pthread_create(, NULL, reader_thread, NULL);

  pthread_join(writer, NULL);
  pthread_join(reader, NULL);

  return 0;
}



Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
the assert on both of them. After replacing 'dmb' with 'dsb', I can
hit assert on both of them too. I need to look at the code closely.

[root@virt-mtcollins-02 test]# ./a
a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
0' failed.
Aborted (core dumped)

[root@nvidia-grace-hopper-05 test]# ./a
a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
0' failed.
Aborted (core dumped)

Thanks,
Gavin



Actually this test is broken. No need for ordering it's a simple race.
The following works on x86 though (x86 does not need barriers
though).


#include 
#include 
#include 
#include 

#if 0
#define x86_rmb()  asm volatile("lfence":::"memory")
#define x86_mb()  asm volatile("mfence":::"memory")
#define x86_smb()  asm volatile("sfence":::"memory")
#else
#define x86_rmb()  asm volatile("":::"memory")
#define x86_mb()  asm volatile("":::"memory")
#define x86_smb()  asm volatile("":::"memory")
#endif

#define FIRST values[0]
#define SECOND values[640]
#define FLAG values[1280]

volatile unsigned values[2000] = {};

void* writer_thread(void* arg) {
while (1) {
/* Now synchronize with reader */
while(FLAG);
FIRST++;
x86_smb();
SECOND++;
x86_smb();
FLAG = 1;
}
}

void* reader_thread(void* arg) {
 while (1) {
/* Now synchronize with writer */
while(!FLAG);
x86_rmb();
unsigned first = FIRST;
x86_rmb();
unsigned second = SECOND;
assert(first - second == 1 || first - second == 0);
FLAG = 0;

if (!(first %100))
printf("%d\n", first);
}
}

int main() {
 pthread_t writer, reader;

 pthread_create(, NULL, writer_thread, NULL);
 pthread_create(, NULL, reader_thread, NULL);

 pthread_join(writer, NULL);
 pthread_join(reader, NULL);

 return 0;
}



I tried it on host and VM of NVidia's grace-hopper. Without the barriers, I
can hit assert. With the barriers, it's working fine without hitting the
assert.

I also had some code to mimic virtio vring last weekend, and it's just
working well. Back to our original issue, __smb_wmb() is issued by guest
while __smb_rmb() is executed on host. The VM and host are running at
different exception level: EL2 vs EL1. I'm not sure it's the cause. I
need to modify my code so that __smb_wmb() and __smb_rmb() can be executed
from guest and host.

[gshan@gshan code]$ cat test.h
#ifndef __TEST_H
#define __TEST_H

struct vring_desc {
uint64_taddr;
uint32_tlen;
uint16_tflags;
uint16_tnext;
} __attribute__((aligned(4)));

struct vring_avail {
uint16_tflags;
uint16_tidx;
uint16_tring[];
} __attribute__((aligned(4)));

struct vring_used_elem {
uint32_tid;
uint32_tlen;
} __attribute__((aligned(4)));

struct vring_used {
uint16_tflags;
uint16_tidx;
struct vring_used_elem  ring[];
} __attribute__((aligned(4)));

struct 

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-17 Thread SeongJae Park
On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:

> Hi Honggyu,
> 
> On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > Thanks for the confirmation.  I have a few comments on young filter so
> > please read the inline comments again.
> > 
> > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > Hi Honggyu,
[...]
> > Thanks.  I see that it works fine, but I would like to have more
> > discussion about "young" filter.  What I think about filter is that if I
> > apply "young" filter "true" for demotion, then the action applies only
> > for "young" pages, but the current implementation works opposite.
> > 
> > I understand the function name of internal implementation is
> > "damos_pa_filter_out" so the basic action is filtering out, but the
> > cgroup filter works in the opposite way for now.
> 
> Does memcg filter works in the opposite way?  I don't think so because
> __damos_pa_filter_out() sets 'matches' as 'true' only if the the given folio 
> is
> contained in the given memcg.  'young' filter also simply sets 'matches' as
> 'true' only if the given folio is young.
> 
> If it works in the opposite way, it's a bug that need to be fixed.  Please let
> me know if I'm missing something.

I just read the DAMOS filters part of the documentation for DAMON sysfs
interface again.  I believe it is explaining the meaning of 'matching' as I
intended to, as below:

You can write ``Y`` or ``N`` to ``matching`` file to filter out pages that 
does
or does not match to the type, respectively.  Then, the scheme's action will
not be applied to the pages that specified to be filtered out.

But, I found the following example for memcg filter is wrong, as below:

For example, below restricts a DAMOS action to be applied to only 
non-anonymous
pages of all memory cgroups except ``/having_care_already``.::

# echo 2 > nr_filters
# # filter out anonymous pages
echo anon > 0/type
echo Y > 0/matching
# # further filter out all cgroups except one at '/having_care_already'
echo memcg > 1/type
echo /having_care_already > 1/memcg_path
echo N > 1/matching

Specifically, the last line of the commands should write 'Y' instead of 'N' to
do what explained.  Without the fix, the action will be applied to only
non-anonymous pages of 'having_care_already' memcg.  This is definitely wrong.
I will fix this soon.  I'm unsure if this is what made you to believe memcg
DAMOS filter is working in the opposite way, though.


Thanks,
SJ

[...]



Re: [PATCH 3/3] drivers: remoteproc: add Versal and Versal-NET support

2024-03-17 Thread Krzysztof Kozlowski
On 15/03/2024 22:15, Tanmay Shah wrote:
> AMD-Xilinx Versal and Versal-NET are successor of ZynqMP platform. ZynqMP
> remoteproc driver is mostly compatible with new platforms except few
> platform specific differences.
> 
> Versal has same IP of cortex-R5 cores hence maintained compatible string
> same as ZynqMP platform. However, hardcode TCM addresses are not
> supported for new platforms and must be provided in device-tree as per
> new bindings. This makes TCM representation data-driven and easy to
> maintain. This check is provided in the driver.
> 
> For Versal-NET platform, TCM doesn't need to be configured in lockstep
> mode or split mode. Hence that call to PMC firmware is avoided in the
> driver for Versal-NET platform.
> 
> Signed-off-by: Tanmay Shah 
> ---
>  drivers/remoteproc/xlnx_r5_remoteproc.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> b/drivers/remoteproc/xlnx_r5_remoteproc.c
> index d4a22caebaad..193bc159d1b4 100644
> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> @@ -323,9 +323,12 @@ static int zynqmp_r5_set_mode(struct zynqmp_r5_core 
> *r5_core,
>   return ret;
>   }
>  
> - ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
> - if (ret < 0)
> - dev_err(r5_core->dev, "failed to configure TCM\n");
> + /* TCM configuration is not needed in versal-net */
> + if (device_is_compatible(r5_core->dev, "xlnx,zynqmp-r5f")) {
> + ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
> + if (ret < 0)
> + dev_err(r5_core->dev, "failed to configure TCM\n");
> + }
>  
>   return ret;
>  }
> @@ -933,10 +936,17 @@ static int zynqmp_r5_core_init(struct zynqmp_r5_cluster 
> *cluster,
>   int ret, i;
>  
>   r5_core = cluster->r5_cores[0];
> +
> + /*
> +  * New platforms must use device tree for TCM parsing.
> +  * Only ZynqMP uses hardcode TCM.
> +  */
>   if (of_find_property(r5_core->np, "reg", NULL))
>   ret = zynqmp_r5_get_tcm_node_from_dt(cluster);
> - else
> + else if (of_machine_is_compatible("xlnx,zynqmp"))
>   ret = zynqmp_r5_get_tcm_node(cluster);

That's poor code. Your drivers should not depend on platform. I don't
understand why you need to do this and how is even related to this patch.


Best regards,
Krzysztof




Re: [PATCH 2/3] dt-bindings: remoteproc: add Versal-NET platform

2024-03-17 Thread Krzysztof Kozlowski
On 15/03/2024 22:15, Tanmay Shah wrote:
> AMD-Xilinx Versal-NET platform is successor of Versal platform. It
> contains multiple clusters of cortex-R52 real-time processing units.
> Each cluster contains two cores of cortex-R52 processors. Each cluster
> can be configured in lockstep mode or split mode.
> 
> Each R52 core is assigned 128KB of TCM memory. ATCM memory is 64KB, BTCM
> and CTCM memoreis are 32KB each. Each TCM memory has its own dedicated
> power-domain that needs to be requested before using it.
> 
> Signed-off-by: Tanmay Shah 
> ---
>  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 220 +++---
>  1 file changed, 184 insertions(+), 36 deletions(-)
> 
> diff --git 
> a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
> b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> index 711da0272250..55654ee02eef 100644
> --- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> +++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
> @@ -18,7 +18,9 @@ description: |
>  
>  properties:
>compatible:
> -const: xlnx,zynqmp-r5fss
> +enum:
> +  - xlnx,zynqmp-r5fss
> +  - xlnx,versal-net-r52fss
>  
>"#address-cells":
>  const: 2
> @@ -64,7 +66,9 @@ patternProperties:
>  
>  properties:
>compatible:
> -const: xlnx,zynqmp-r5f
> +enum:
> +  - xlnx,zynqmp-r5f
> +  - xlnx,versal-net-r52f
>  
>reg:
>  minItems: 1
> @@ -135,9 +139,11 @@ required:
>  allOf:
>- if:
>properties:
> -xlnx,cluster-mode:
> -  enum:
> -- 1
> +compatible:
> +  contains:
> +enum:
> +  - xlnx,versal-net-r52fss

Why do you touch these lines?

> +
>  then:
>patternProperties:
>  "^r5f@[0-9a-f]+$":
> @@ -149,16 +155,14 @@ allOf:
>items:
>  - description: ATCM internal memory
>  - description: BTCM internal memory
> -- description: extra ATCM memory in lockstep mode
> -- description: extra BTCM memory in lockstep mode
> +- description: CTCM internal memory
>  
>  reg-names:
>minItems: 1
>items:
> -- const: atcm0
> -- const: btcm0
> -- const: atcm1
> -- const: btcm1
> +- const: atcm
> +- const: btcm
> +- const: ctcm
>  
>  power-domains:
>minItems: 2
> @@ -166,33 +170,70 @@ allOf:
>  - description: RPU core power domain
>  - description: ATCM power domain
>  - description: BTCM power domain
> -- description: second ATCM power domain
> -- description: second BTCM power domain
> +- description: CTCM power domain
>  
>  else:
> -  patternProperties:
> -"^r5f@[0-9a-f]+$":
> -  type: object
> -
> -  properties:
> -reg:
> -  minItems: 1
> -  items:
> -- description: ATCM internal memory
> -- description: BTCM internal memory
> -
> -reg-names:
> -  minItems: 1
> -  items:
> -- const: atcm0
> -- const: btcm0
> -
> -power-domains:
> -  minItems: 2
> -  items:
> -- description: RPU core power domain
> -- description: ATCM power domain
> -- description: BTCM power domain
> +  allOf:
> +- if:
> +properties:
> +  xlnx,cluster-mode:
> +enum:
> +  - 1

Whatever you did here, is not really readable. You have now multiple
if:then:if:then embedded.

> +  then:
> +patternProperties:
> +  "^r5f@[0-9a-f]+$":
> +type: object
> +
> +properties:
> +  reg:
> +minItems: 1
> +items:
> +  - description: ATCM internal memory
> +  - description: BTCM internal memory
> +  - description: extra ATCM memory in lockstep mode
> +  - description: extra BTCM memory in lockstep mode
> +
> +  reg-names:
> +minItems: 1
> +items:
> +  - const: atcm0
> +  - const: btcm0
> +  - const: atcm1
> +  - const: btcm1
> +
> +  power-domains:
> +minItems: 2
> +items:
> +  - description: RPU core power domain
> +  - description: ATCM power domain
> +  - description: BTCM power domain
> +  

Re: [PATCH 1/3] dt-bindings: remoteproc: add Versal platform support

2024-03-17 Thread Krzysztof Kozlowski
On 15/03/2024 22:15, Tanmay Shah wrote:
> AMD-Xilinx Versal platform is successor of ZynqMP platform. Real-time
> Processor Unit R5 cluster IP on Versal is same as of ZynqMP Platform.
> Only difference is power-domains ID needed by power management firmware.
> Hence, keeping the compatible property same as of zynqmp node.
> 
> Signed-off-by: Tanmay Shah 

There is no binding change, so NAK. Platform is not being added to
examples. You changed nothing in  Versal support...

Best regards,
Krzysztof




Re: [PATCH v2] kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address

2024-03-17 Thread Jinghao Jia


On 3/16/24 08:46, Masami Hiramatsu (Google) wrote:
> On Thu, 14 Mar 2024 18:56:35 -0500
> Jinghao Jia  wrote:
> 
>> On 3/14/24 10:17, Masami Hiramatsu (Google) wrote:
>>> From: Masami Hiramatsu (Google) 
>>>
>>> Read from an unsafe address with copy_from_kernel_nofault() in
>>> arch_adjust_kprobe_addr() because this function is used before checking
>>> the address is in text or not. Syzcaller bot found a bug and reported
>>> the case if user specifies inaccessible data area,
>>> arch_adjust_kprobe_addr() will cause a kernel panic.
>>
>> IMHO there is a check on the address in kallsyms_lookup_size_offset to see
>> if it is a kernel text address before arch_adjust_kprobe_addr is invoked.
> 
> Yeah, kallsyms does not ensure the page (especially data) exists.
> 
>>
>> The call chain is:
>>
>> register_kprobe()
>>   _kprobe_addr()
>> kallsyms_lookup_size_offset() <- check on addr is here
>> arch_adjust_kprobe_addr()
>>
>> I wonder why this check was not able to capture the problem in this bug
>> report (I cannot reproduce it locally).
> 
> I could reproduce it locally, it tried to access 'Y' data.
> (I attached my .config) And I ensured that this fixed the problem.
> 
> The reproduce test actually tried to access initdata area
> 
> 82fb5450 d __alt_reloc_selftest_addr
> 82fb5460 d int3_exception_nb.1
> 82fb5478 d tsc_early_khz
> 82fb547c d io_delay_override
> 82fb5480 d fxregs.0
> 82fb5680 d y<--- access this
> 82fb5688 d x
> 82fb56a0 d xsave_cpuid_features
> 82fb56c8 d l1d_flush_mitigation
> 
> `y` is too generic, so check `io_delay_override` which is on the
> same page.
> 
> $ git grep io_delay_override
> arch/x86/kernel/io_delay.c:static int __initdata io_delay_override;
> 
> As you can see, it is marked as `__initdata`, and the initdata has been
> freed before starting /init.
> 
> 
> [2.679161] Freeing unused kernel image (initmem) memory: 2888K
> [2.688731] Write protecting the kernel read-only data: 24576k
> [2.691802] Freeing unused kernel image (rodata/data gap) memory: 1436K
> [2.746994] x86/mm: Checked W+X mappings: passed, no W+X pages found.
> [2.748022] x86/mm: Checking user space page tables
> [2.789520] x86/mm: Checked W+X mappings: passed, no W+X pages found.
> [2.790527] Run /init as init process
> 
> 
> So this has been caused because accessing freed initdata.

Thanks a lot for the explanation! I have confirmed the bug and tested the
patch with CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y (which explicitly marks
the init pages as not-present after boot).

Tested-by: Jinghao Jia 

--Jinghao

> 
> Thank you,
> 
>>
>> Thanks,
>> --Jinghao
>>
>>>
>>> Reported-by: Qiang Zhang 
>>> Closes: 
>>> https://urldefense.com/v3/__https://lore.kernel.org/all/cakhosas2rof6vqvbw_lg_j3qnku0canzr2qmy4et7r5lo8m...@mail.gmail.com/__;!!DZ3fjg!_C9Dn6-GBlkyS2z34bDUBsEXkTkgWp45MNrd4Rl5I5slz2A3SrurXOxKupsxLMqxC2BMiySCTfB2-5fPhkLP1g$
>>>  
>>> Fixes: cc66bb914578 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
>>> Signed-off-by: Masami Hiramatsu (Google) 
>>> ---
>>>  Changes in v2:
>>>   - Fix to return NULL if failed to access the address.
>>> ---
>>>  arch/x86/kernel/kprobes/core.c |   11 ++-
>>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
>>> index a0ce46c0a2d8..95e4ebe5d514 100644
>>> --- a/arch/x86/kernel/kprobes/core.c
>>> +++ b/arch/x86/kernel/kprobes/core.c
>>> @@ -335,7 +335,16 @@ static int can_probe(unsigned long paddr)
>>>  kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long 
>>> offset,
>>>  bool *on_func_entry)
>>>  {
>>> -   if (is_endbr(*(u32 *)addr)) {
>>> +   u32 insn;
>>> +
>>> +   /*
>>> +* Since addr is not guaranteed to be safely accessed yet, use
>>> +* copy_from_kernel_nofault() to get the instruction.
>>> +*/
>>> +   if (copy_from_kernel_nofault(, (void *)addr, sizeof(u32)))
>>> +   return NULL;
>>> +
>>> +   if (is_endbr(insn)) {
>>> *on_func_entry = !offset || offset == 4;
>>> if (*on_func_entry)
>>> offset = 4;
>>>
> 
> 


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-17 Thread Michael S. Tsirkin
On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> 
> On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper 
> > > > > platform. I tried
> > > to reproduce it with my own driver where one thread writes to the shared 
> > > buffer
> > > and another thread reads from the buffer. I don't hit the out-of-order 
> > > issue so
> > > far.
> > 
> > Make sure the 2 areas you are accessing are in different cache lines.
> > 
> 
> Yes, I already put those 2 areas to separate cache lines.
> 
> > 
> > > My driver may be not correct somewhere and I will update if I can 
> > > reproduce
> > > the issue with my driver in the future.
> > 
> > Then maybe your change is just making virtio slower and masks the bug
> > that is actually elsewhere?
> > 
> > You don't really need a driver. Here's a simple test: without barriers
> > assertion will fail. With barriers it will not.
> > (Warning: didn't bother testing too much, could be buggy.
> > 
> > ---
> > 
> > #include 
> > #include 
> > #include 
> > #include 
> > 
> > #define FIRST values[0]
> > #define SECOND values[64]
> > 
> > volatile int values[100] = {};
> > 
> > void* writer_thread(void* arg) {
> > while (1) {
> > FIRST++;
> > // NEED smp_wmb here
> __asm__ volatile("dmb ishst" : : : "memory");
> > SECOND++;
> > }
> > }
> > 
> > void* reader_thread(void* arg) {
> >  while (1) {
> > int first = FIRST;
> > // NEED smp_rmb here
> __asm__ volatile("dmb ishld" : : : "memory");
> > int second = SECOND;
> > assert(first - second == 1 || first - second == 0);
> >  }
> > }
> > 
> > int main() {
> >  pthread_t writer, reader;
> > 
> >  pthread_create(, NULL, writer_thread, NULL);
> >  pthread_create(, NULL, reader_thread, NULL);
> > 
> >  pthread_join(writer, NULL);
> >  pthread_join(reader, NULL);
> > 
> >  return 0;
> > }
> > 
> 
> Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> the assert on both of them. After replacing 'dmb' with 'dsb', I can
> hit assert on both of them too. I need to look at the code closely.
> 
> [root@virt-mtcollins-02 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
> 0' failed.
> Aborted (core dumped)
> 
> [root@nvidia-grace-hopper-05 test]# ./a
> a: a.c:26: reader_thread: Assertion `first - second == 1 || first - second == 
> 0' failed.
> Aborted (core dumped)
> 
> Thanks,
> Gavin


Actually this test is broken. No need for ordering it's a simple race.
The following works on x86 though (x86 does not need barriers
though).


#include 
#include 
#include 
#include 

#if 0
#define x86_rmb()  asm volatile("lfence":::"memory")
#define x86_mb()  asm volatile("mfence":::"memory")
#define x86_smb()  asm volatile("sfence":::"memory")
#else
#define x86_rmb()  asm volatile("":::"memory")
#define x86_mb()  asm volatile("":::"memory")
#define x86_smb()  asm volatile("":::"memory")
#endif

#define FIRST values[0]
#define SECOND values[640]
#define FLAG values[1280]

volatile unsigned values[2000] = {};

void* writer_thread(void* arg) {
while (1) {
/* Now synchronize with reader */
while(FLAG);
FIRST++;
x86_smb();
SECOND++;
x86_smb();
FLAG = 1;
}
}

void* reader_thread(void* arg) {
while (1) {
/* Now synchronize with writer */
while(!FLAG);
x86_rmb();
unsigned first = FIRST;
x86_rmb();
unsigned second = SECOND;
assert(first - second == 1 || first - second == 0);
FLAG = 0;

if (!(first %100))
printf("%d\n", first);
   }
}

int main() {
pthread_t writer, reader;

pthread_create(, NULL, writer_thread, NULL);
pthread_create(, NULL, reader_thread, NULL);

pthread_join(writer, NULL);
pthread_join(reader, NULL);

return 0;
}




Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-17 Thread SeongJae Park
Hi Honggyu,

On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> Thanks for the confirmation.  I have a few comments on young filter so
> please read the inline comments again.
> 
> On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > Hi Honggyu,
> > 
> > > > -Original Message-
> > > > From: SeongJae Park 
> > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > To: Honggyu Kim 
> > > > Cc: SeongJae Park ; kernel_team 
> > > > 
> > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > management for CXL memory
> > > >
> > > > Hi Honggyu,
> > > >
> > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > >  wrote:
> > > >
> > > > > Hi SeongJae,
> > > > >
> > > > > I've tested it again and found that "young" filter has to be set
> > > > > differently as follows.
> > > > > - demote action: set "young" filter with "matching" true
> > > > > - promote action: set "young" filter with "matching" false
> > > >
> > > > DAMOS filter is basically for filtering "out" memory regions that 
> > > > matches to
> > > > the condition.  Hence in your setup, young pages are not filtered out 
> > > > from
> > > > demote action target.
> > > 
> > > I thought young filter true means "young pages ARE filtered out" for 
> > > demotion.
> > 
> > You're correct.
> 
> Ack.
> 
> > > 
> > > > That is, you're demoting pages that "not" young.
> > > 
> > > Your explanation here looks opposite to the previous statement.
> > 
> > Again, you're correct.  My intention was "non-young pages are not ..." but
> > maybe I was out of my mind and mistakenly removed "non-" part.  Sorry for 
> > the
> > confusion.
> 
> No problem.  I also think it's quite confusing.
> 
> > > 
> > > > And vice versa, so you're applying promote to non-non-young (young) 
> > > > pages.
> > > 
> > > Yes, I understand "promote non-non-young pages" means "promote young 
> > > pages".
> > > This might be understood as "young pages are NOT filtered out" for 
> > > promotion
> > > but it doesn't mean that "old pages are filtered out" instead.
> > > And we just rely hot detection only on DAMOS logics such as access 
> > > frequency
> > > and age. Am I correct?
> > 
> > You're correct.
> 
> Ack.  But if it doesn't mean that "old pages are filtered out" instead,

It does mean that.  Here, filtering is exclusive.  Hence, "filter-in a type of
pages" means "filter-out pages of other types".  At least that's the intention.
To quote the documentation
(https://docs.kernel.org/mm/damon/design.html#filters),

Each filter specifies the type of target memory, and whether it should
exclude the memory of the type (filter-out), or all except the memory of
the type (filter-in).

> then do we really need this filter for promotion?  If not, maybe should
> we create another "old" filter for promotion?  As of now, the promotion
> is mostly done inaccurately, but the accurate migration is done at
> demotion level.

Is this based on your theory?  Or, a real behavior that you're seeing from your
setup?  If this is a real behavior, I think that should be a bug that need to
be fixed.

> To avoid this issue, I feel we should promotion only "young" pages after
> filtering "old" pages out.
> 
> > > 
> > > > I understand this is somewhat complex, but what we have for now.
> > > 
> > > Thanks for the explanation. I guess you mean my filter setup is correct.
> > > Is it correct?
> > 
> > Again, you're correct.  Your filter setup is what I expected to :)
> 
> Thanks.  I see that it works fine, but I would like to have more
> discussion about "young" filter.  What I think about filter is that if I
> apply "young" filter "true" for demotion, then the action applies only
> for "young" pages, but the current implementation works opposite.
> 
> I understand the function name of internal implementation is
> "damos_pa_filter_out" so the basic action is filtering out, but the
> cgroup filter works in the opposite way for now.

Does memcg filter works in the opposite way?  I don't think so because
__damos_pa_filter_out() sets 'matches' as 'true' only if the the given folio is
contained in the given memcg.  'young' filter also simply sets 'matches' as
'true' only if the given folio is young.

If it works in the opposite way, it's a bug that need to be fixed.  Please let
me know if I'm missing something.

> 
> I would like to hear how you think about this.
> 
> > > 
> > > > > Then, I see that "hot_cold" migrates hot/cold memory correctly.
> > > >
> > > > Thank you so much for sharing this great news!  My tests also show no 
> > > > bad
> > > > signal so far.
> > > >
> > > > >
> > > > > Could you please upload the "damon_folio_mkold" patch to LKML?
> > > > > Then I will rebase our changes based on it and run the redis test 
> > > > > again.
> > > >
> > > > I will do that soon.
> > > 
> > > Thanks a lot for sharing the RFC v2 for DAMOS young filter.
> > > https://lore.kernel.org/damon/20240311204545.47097-1...@kernel.org/
> > > 
> > > I will 

Re: [PATCH 1/3] dt-bindings: remoteproc: add Versal platform support

2024-03-17 Thread Conor Dooley
On Sun, Mar 17, 2024 at 02:50:27PM +, Conor Dooley wrote:
> On Fri, Mar 15, 2024 at 02:15:31PM -0700, Tanmay Shah wrote:
> > AMD-Xilinx Versal platform is successor of ZynqMP platform. Real-time
> > Processor Unit R5 cluster IP on Versal is same as of ZynqMP Platform.
> 
> > Only difference is power-domains ID needed by power management firmware.
> > Hence, keeping the compatible property same as of zynqmp node.
> 
> No, don't be lazy. Add a compatible with a fallback please.

> This doesn't apply to linux-next either FYI.

Ordinarily'd I'd not even bother applying a patch like this and I
wouldn't notice, but the diff for the versal-net patch is difficult to
read without colour-moved enabled and since I can't apply this I'm not
going to review that patch.


signature.asc
Description: PGP signature


Re: [PATCH 1/3] dt-bindings: remoteproc: add Versal platform support

2024-03-17 Thread Conor Dooley
On Fri, Mar 15, 2024 at 02:15:31PM -0700, Tanmay Shah wrote:
> AMD-Xilinx Versal platform is successor of ZynqMP platform. Real-time
> Processor Unit R5 cluster IP on Versal is same as of ZynqMP Platform.

> Only difference is power-domains ID needed by power management firmware.
> Hence, keeping the compatible property same as of zynqmp node.

No, don't be lazy. Add a compatible with a fallback please.
This doesn't apply to linux-next either FYI.


signature.asc
Description: PGP signature


Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-17 Thread Honggyu Kim
Hi SeongJae,

Thanks for the confirmation.  I have a few comments on young filter so
please read the inline comments again.

On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> Hi Honggyu,
> 
> > > -Original Message-
> > > From: SeongJae Park 
> > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > To: Honggyu Kim 
> > > Cc: SeongJae Park ; kernel_team 
> > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management 
> > > for CXL memory
> > >
> > > Hi Honggyu,
> > >
> > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > >  wrote:
> > >
> > > > Hi SeongJae,
> > > >
> > > > I've tested it again and found that "young" filter has to be set
> > > > differently as follows.
> > > > - demote action: set "young" filter with "matching" true
> > > > - promote action: set "young" filter with "matching" false
> > >
> > > DAMOS filter is basically for filtering "out" memory regions that matches 
> > > to
> > > the condition.  Hence in your setup, young pages are not filtered out from
> > > demote action target.
> > 
> > I thought young filter true means "young pages ARE filtered out" for 
> > demotion.
> 
> You're correct.

Ack.

> > 
> > > That is, you're demoting pages that "not" young.
> > 
> > Your explanation here looks opposite to the previous statement.
> 
> Again, you're correct.  My intention was "non-young pages are not ..." but
> maybe I was out of my mind and mistakenly removed "non-" part.  Sorry for the
> confusion.

No problem.  I also think it's quite confusing.

> > 
> > > And vice versa, so you're applying promote to non-non-young (young) pages.
> > 
> > Yes, I understand "promote non-non-young pages" means "promote young pages".
> > This might be understood as "young pages are NOT filtered out" for promotion
> > but it doesn't mean that "old pages are filtered out" instead.
> > And we just rely hot detection only on DAMOS logics such as access frequency
> > and age. Am I correct?
> 
> You're correct.

Ack.  But if it doesn't mean that "old pages are filtered out" instead,
then do we really need this filter for promotion?  If not, maybe should
we create another "old" filter for promotion?  As of now, the promotion
is mostly done inaccurately, but the accurate migration is done at
demotion level.  To avoid this issue, I feel we should promotion only
"young" pages after filtering "old" pages out.

> > 
> > > I understand this is somewhat complex, but what we have for now.
> > 
> > Thanks for the explanation. I guess you mean my filter setup is correct.
> > Is it correct?
> 
> Again, you're correct.  Your filter setup is what I expected to :)

Thanks.  I see that it works fine, but I would like to have more
discussion about "young" filter.  What I think about filter is that if I
apply "young" filter "true" for demotion, then the action applies only
for "young" pages, but the current implementation works opposite.

I understand the function name of internal implementation is
"damos_pa_filter_out" so the basic action is filtering out, but the
cgroup filter works in the opposite way for now.

I would like to hear how you think about this.

> > 
> > > > Then, I see that "hot_cold" migrates hot/cold memory correctly.
> > >
> > > Thank you so much for sharing this great news!  My tests also show no bad
> > > signal so far.
> > >
> > > >
> > > > Could you please upload the "damon_folio_mkold" patch to LKML?
> > > > Then I will rebase our changes based on it and run the redis test again.
> > >
> > > I will do that soon.
> > 
> > Thanks a lot for sharing the RFC v2 for DAMOS young filter.
> > https://lore.kernel.org/damon/20240311204545.47097-1...@kernel.org/
> > 
> > I will rebase our work based on it and share the result.
> 
> Cool, looking forward to it!  Hopefully we will make it be merged into the
> mainline by v6.10!

I hope so.  Thanks for your help!

Honggyu