Re: regression: insmod module failed in VM with nvdimm on

2022-12-01 Thread chenxiang (M)

Hi Ard,


On 2022/12/1 19:07, Ard Biesheuvel wrote:

On Thu, 1 Dec 2022 at 09:07, Ard Biesheuvel  wrote:

On Thu, 1 Dec 2022 at 08:15, chenxiang (M)  wrote:

Hi Ard,


On 2022/11/30 16:18, Ard Biesheuvel wrote:

On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:

On Wed, 30 Nov 2022 02:52:35 +,
"chenxiang (M)"  wrote:

Hi,

We boot the VM using the following command (with nvdimm on) (qemu
version 6.1.50, kernel 6.0-rc4):

How relevant is the presence of the nvdimm? Do you observe the failure
without this?


qemu-system-aarch64 -machine
virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
/home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
/root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
-object memory-backend-ram,id=ram1,size=10G -device
nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
-device vfio-pci,host=7d:01.0,id=net0,bus=root_port1

Then in the VM we insmod a module, and a vmalloc error occurs as follows (kernel
5.19-rc4 is normal, and the issue is still present on kernel 6.1-rc4):

estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[8.186563] vmap allocation for size 20480 failed: use
vmalloc=<size> to increase size

Have you tried increasing the vmalloc size to check that this is
indeed the problem?

[...]


We git bisected the code and found commit c5a89f75d2a ("arm64: kaslr:
defer initialization to initcall where permitted").

I guess you mean commit fc5a89f75d2a instead, right?


Do you have any idea about the issue?

I sort of suspect that the nvdimm gets vmap-ed and consumes a large
portion of the vmalloc space, but you give very little information
that could help here...


Ouch. I suspect what's going on here: that patch defers the
randomization of the module region, so that we can decouple it from
the very early init code.

Obviously, it is happening too late now, and the randomized module
region is overlapping with a vmalloc region that is in use by the time
the randomization occurs.

Does the below fix the issue?

The issue still occurs, but the change seems to decrease the probability: before,
it occurred almost every time; after the change, I tried 2-3 times and it still
occurred.
But when I change "subsys_initcall" back to "core_initcall" and test more
than 20 times, it is still OK.


Thank you for confirming. I will send out a patch today.


...but before I do that, could you please check whether the change
below fixes your issue as well?

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 6ccc7ef600e7c1e1..c8c205b630da1951 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -20,7 +20,11 @@
  #include 
  #include 

-u64 __ro_after_init module_alloc_base;
+/*
+ * Set a reasonable default for module_alloc_base in case
+ * we end up running with module randomization disabled.
+ */
+u64 __ro_after_init module_alloc_base = (u64)_etext - MODULES_VSIZE;
  u16 __initdata memstart_offset_seed;

  struct arm64_ftr_override kaslr_feature_override __initdata;
@@ -30,12 +34,6 @@ static int __init kaslr_init(void)
 u64 module_range;
 u32 seed;

-   /*
-* Set a reasonable default for module_alloc_base in case
-* we end up running with module randomization disabled.
-*/
-   module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
 if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
 pr_info("KASLR disabled on command line\n");
 return 0;
.


We have tested this change; the issue is still there, so it doesn't fix the issue.




Re: regression: insmod module failed in VM with nvdimm on

2022-12-01 Thread chenxiang (M)




On 2022/12/1 19:07, Ard Biesheuvel wrote:

On Thu, 1 Dec 2022 at 09:07, Ard Biesheuvel  wrote:

On Thu, 1 Dec 2022 at 08:15, chenxiang (M)  wrote:

Hi Ard,


On 2022/11/30 16:18, Ard Biesheuvel wrote:

On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:

On Wed, 30 Nov 2022 02:52:35 +,
"chenxiang (M)"  wrote:

Hi,

We boot the VM using the following command (with nvdimm on) (qemu
version 6.1.50, kernel 6.0-rc4):

How relevant is the presence of the nvdimm? Do you observe the failure
without this?


qemu-system-aarch64 -machine
virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
/home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
/root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
-object memory-backend-ram,id=ram1,size=10G -device
nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
-device vfio-pci,host=7d:01.0,id=net0,bus=root_port1

Then in the VM we insmod a module, and a vmalloc error occurs as follows (kernel
5.19-rc4 is normal, and the issue is still present on kernel 6.1-rc4):

estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[8.186563] vmap allocation for size 20480 failed: use
vmalloc=<size> to increase size

Have you tried increasing the vmalloc size to check that this is
indeed the problem?

[...]


We git bisected the code and found commit c5a89f75d2a ("arm64: kaslr:
defer initialization to initcall where permitted").

I guess you mean commit fc5a89f75d2a instead, right?


Do you have any idea about the issue?

I sort of suspect that the nvdimm gets vmap-ed and consumes a large
portion of the vmalloc space, but you give very little information
that could help here...


Ouch. I suspect what's going on here: that patch defers the
randomization of the module region, so that we can decouple it from
the very early init code.

Obviously, it is happening too late now, and the randomized module
region is overlapping with a vmalloc region that is in use by the time
the randomization occurs.

Does the below fix the issue?

The issue still occurs, but the change seems to decrease the probability: before,
it occurred almost every time; after the change, I tried 2-3 times and it still
occurred.
But when I change "subsys_initcall" back to "core_initcall" and test more
than 20 times, it is still OK.


Thank you for confirming. I will send out a patch today.


...but before I do that, could you please check whether the change
below fixes your issue as well?


Yes, but I can only reply to you tomorrow, as another colleague is testing on the
only environment today.




diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 6ccc7ef600e7c1e1..c8c205b630da1951 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -20,7 +20,11 @@
  #include 
  #include 

-u64 __ro_after_init module_alloc_base;
+/*
+ * Set a reasonable default for module_alloc_base in case
+ * we end up running with module randomization disabled.
+ */
+u64 __ro_after_init module_alloc_base = (u64)_etext - MODULES_VSIZE;
  u16 __initdata memstart_offset_seed;

  struct arm64_ftr_override kaslr_feature_override __initdata;
@@ -30,12 +34,6 @@ static int __init kaslr_init(void)
 u64 module_range;
 u32 seed;

-   /*
-* Set a reasonable default for module_alloc_base in case
-* we end up running with module randomization disabled.
-*/
-   module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
 if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
 pr_info("KASLR disabled on command line\n");
 return 0;
.






Re: regression: insmod module failed in VM with nvdimm on

2022-11-30 Thread chenxiang (M)

Hi Ard,


On 2022/11/30 16:18, Ard Biesheuvel wrote:

On Wed, 30 Nov 2022 at 08:53, Marc Zyngier  wrote:

On Wed, 30 Nov 2022 02:52:35 +,
"chenxiang (M)"  wrote:

Hi,

We boot the VM using the following command (with nvdimm on) (qemu
version 6.1.50, kernel 6.0-rc4):

How relevant is the presence of the nvdimm? Do you observe the failure
without this?


qemu-system-aarch64 -machine
virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
/home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
/root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
-object memory-backend-ram,id=ram1,size=10G -device
nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
-device vfio-pci,host=7d:01.0,id=net0,bus=root_port1

Then in the VM we insmod a module, and a vmalloc error occurs as follows (kernel
5.19-rc4 is normal, and the issue is still present on kernel 6.1-rc4):

estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[8.186563] vmap allocation for size 20480 failed: use
vmalloc=<size> to increase size

Have you tried increasing the vmalloc size to check that this is
indeed the problem?

[...]


We git bisected the code and found commit c5a89f75d2a ("arm64: kaslr:
defer initialization to initcall where permitted").

I guess you mean commit fc5a89f75d2a instead, right?


Do you have any idea about the issue?

I sort of suspect that the nvdimm gets vmap-ed and consumes a large
portion of the vmalloc space, but you give very little information
that could help here...


Ouch. I suspect what's going on here: that patch defers the
randomization of the module region, so that we can decouple it from
the very early init code.

Obviously, it is happening too late now, and the randomized module
region is overlapping with a vmalloc region that is in use by the time
the randomization occurs.

Does the below fix the issue?


The issue still occurs, but the change seems to decrease the probability: before,
it occurred almost every time; after the change, I tried 2-3 times and it still
occurred.
But when I change "subsys_initcall" back to "core_initcall" and test more
than 20 times, it is still OK.




diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 37a9deed2aec..71fb18b2f304 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -90,4 +90,4 @@ static int __init kaslr_init(void)

 return 0;
  }
-subsys_initcall(kaslr_init)
+arch_initcall(kaslr_init)
.






Re: regression: insmod module failed in VM with nvdimm on

2022-11-30 Thread chenxiang (M)

Hi Marc,


On 2022/11/30 15:53, Marc Zyngier wrote:

On Wed, 30 Nov 2022 02:52:35 +,
"chenxiang (M)"  wrote:

Hi,

We boot the VM using the following command (with nvdimm on) (qemu
version 6.1.50, kernel 6.0-rc4):

How relevant is the presence of the nvdimm? Do you observe the failure
without this?


We didn't see the failure without it.


qemu-system-aarch64 -machine
virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel
/home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios
/root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m
2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0
ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1'
-object memory-backend-ram,id=ram1,size=10G -device
nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1
-device vfio-pci,host=7d:01.0,id=net0,bus=root_port1

Then in the VM we insmod a module, and a vmalloc error occurs as follows (kernel
5.19-rc4 is normal, and the issue is still present on kernel 6.1-rc4):

estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[8.186563] vmap allocation for size 20480 failed: use
vmalloc=<size> to increase size

Have you tried increasing the vmalloc size to check that this is
indeed the problem?

[...]


I didn't increase the vmalloc size, but I checked the vmalloc size and I
think it is big enough when the issue occurs:


estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[4.879899] vmap allocation for size 20480 failed: use vmalloc=<size>
to increase size
[4.880643] insmod: vmalloc error: size 16384, vm_struct allocation 
failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0

[4.881802] CPU: 1 PID: 230 Comm: insmod Not tainted 6.1.0-rc4+ #21
[4.882414] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 
02/06/2015

[4.883082] Call trace:
[4.88]  dump_backtrace.part.0+0xc4/0xd0
[4.883766]  show_stack+0x20/0x50
[4.884091]  dump_stack_lvl+0x68/0x84
[4.884450]  dump_stack+0x18/0x34
[4.884778]  warn_alloc+0x11c/0x1bc
[4.885124]  __vmalloc_node_range+0x50c/0x64c
[4.885553]  module_alloc+0xf4/0x100
[4.885902]  load_module+0x858/0x1e90
[4.886265]  __do_sys_init_module+0x1c0/0x200
[4.886699]  __arm64_sys_init_module+0x24/0x30
[4.887147]  invoke_syscall+0x50/0x120
[4.887516]  el0_svc_common.constprop.0+0x58/0x190
[4.887993]  do_el0_svc+0x34/0xc0
[4.888327]  el0_svc+0x2c/0xb4
[4.888631]  el0t_64_sync_handler+0xb8/0xbc
[4.889046]  el0t_64_sync+0x19c/0x1a0
[4.889423] Mem-Info:
[4.889639] active_anon:9679 inactive_anon:63094 isolated_anon:0
[4.889639]  active_file:0 inactive_file:0 isolated_file:0
[4.889639]  unevictable:0 dirty:0 writeback:0
[4.889639]  slab_reclaimable:3322 slab_unreclaimable:3082
[4.889639]  mapped:873 shmem:72569 pagetables:34
[4.889639]  sec_pagetables:0 bounce:0
[4.889639]  kernel_misc_reclaimable:0
[4.889639]  free:416212 free_pcp:4414 free_cma:0
[4.893362] Node 0 active_anon:38716kB inactive_anon:252376kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB mapped:3492kB dirty:0kB writeback:0kB shmem:290276kB 
shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB 
kernel_stack:1904kB pagetables:136kB sec_pagetables:0kB 
all_unreclaimable? no
[4.896343] Node 0 DMA free:1664848kB boost:0kB min:22528kB 
low:28160kB high:33792kB reserved_highatomic:0KB active_anon:38716kB 
inactive_anon:252376kB active_file:0kB inactive_file:0kB unevictable:0kB 
writepending:0kB present:2097152kB managed:2010376kB mlocked:0kB 
bounce:0kB free_pcp:17704kB local_pcp:3668kB free_cma:0kB

[4.899097] lowmem_reserve[]: 0 0 0 0 0
[4.899466] Node 0 DMA: 2*4kB (UM) 1*8kB (M) 2*16kB (UM) 1*32kB (M) 
2*64kB (ME) 1*128kB (U) 2*256kB (ME) 2*512kB (M) 6*1024kB (UME) 5*2048kB 
(UM) 402*4096kB (M) = 1664848kB
[4.900865] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=1048576kB
[4.901648] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=32768kB
[4.902526] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=2048kB
[4.903354] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=64kB

[4.904173] 72569 total pagecache pages
[4.904524] 0 pages in swap cache
[4.904831] Free swap  = 0kB
[4.905109] Total swap = 0kB
[4.905407] 524288 pages RAM
[4.905696] 0 pages HighMem/MovableOnly
[4.906085] 21694 pages reserved
[4.906388] 0 pages hwpoisoned
insmod: can't insert '/lib/modules/6.1.0-rc4+/hnae3.ko': Cannot allocate 
memory

estuary:/$ insmod /lib/modules/$(uname -r)/hns3.ko
[4.911599] vmap allocation for size 122880 failed: use 
vmalloc=<size> to increase size
insmod: can't insert '/lib/modules/6.1.0-rc4+/hns3.ko': Cannot allocate 
memory

estuary:/$ insmod /lib/modules/$(uname -r)/hclge.ko
[4.917761] vmap allocation for size 319488 failed: use 
vmalloc=<size> to increase size
insmod: ca

regression: insmod module failed in VM with nvdimm on

2022-11-29 Thread chenxiang (M)

Hi,

We boot the VM using the following command (with nvdimm on) (qemu version
6.1.50, kernel 6.0-rc4):


qemu-system-aarch64 -machine 
virt,kernel_irqchip=on,gic-version=3,nvdimm=on  -kernel 
/home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios 
/root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m 
2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0 
ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1' 
-object memory-backend-ram,id=ram1,size=10G -device 
nvdimm,id=dimm1,memdev=ram1  -device ioh3420,id=root_port1,chassis=1 
-device vfio-pci,host=7d:01.0,id=net0,bus=root_port1


Then in the VM we insmod a module, and a vmalloc error occurs as follows (kernel
5.19-rc4 is normal, and the issue is still present on kernel 6.1-rc4):


estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[8.186563] vmap allocation for size 20480 failed: use vmalloc=<size>
to increase size
[8.187288] insmod: vmalloc error: size 16384, vm_struct allocation 
failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0

[8.188402] CPU: 1 PID: 235 Comm: insmod Not tainted 6.0.0-rc4+ #1
[8.188958] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 
02/06/2015

[8.189593] Call trace:
[8.189825]  dump_backtrace.part.0+0xc4/0xd0
[8.190245]  show_stack+0x24/0x40
[8.190563]  dump_stack_lvl+0x68/0x84
[8.190913]  dump_stack+0x18/0x34
[8.191223]  warn_alloc+0x124/0x1b0
[8.191555]  __vmalloc_node_range+0xe4/0x55c
[8.191959]  module_alloc+0xf8/0x104
[8.192305]  load_module+0x854/0x1e70
[8.192655]  __do_sys_init_module+0x1e0/0x220
[8.193075]  __arm64_sys_init_module+0x28/0x34
[8.193489]  invoke_syscall+0x50/0x120
[8.193841]  el0_svc_common.constprop.0+0x58/0x1a0
[8.194296]  do_el0_svc+0x38/0xd0
[8.194613]  el0_svc+0x2c/0xc0
[8.194901]  el0t_64_sync_handler+0x1ac/0x1b0
[8.195313]  el0t_64_sync+0x19c/0x1a0
[8.195672] Mem-Info:
[8.195872] active_anon:17641 inactive_anon:118549 isolated_anon:0
[8.195872]  active_file:0 inactive_file:0 isolated_file:0
[8.195872]  unevictable:0 dirty:0 writeback:0
[8.195872]  slab_reclaimable:3439 slab_unreclaimable:3067
[8.195872]  mapped:877 shmem:135976 pagetables:39 bounce:0
[8.195872]  kernel_misc_reclaimable:0
[8.195872]  free:353735 free_pcp:3210 free_cma:0
[8.199119] Node 0 active_anon:70564kB inactive_anon:474196kB 
active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB 
isolated(file):0kB mapped:3508kB dirty:0kB writeback:0kB shmem:543904kB 
shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB 
kernel_stack:1904kB pagetables:156kB all_unreclaimable? no
[8.201683] Node 0 DMA free:1414940kB boost:0kB min:22528kB 
low:28160kB high:33792kB reserved_highatomic:0KB active_anon:70564kB 
inactive_anon:474196kB active_file:0kB inactive_file:0kB unevictable:0kB 
writepending:0kB present:2097152kB managed:2010444kB mlocked:0kB 
bounce:0kB free_pcp:12840kB local_pcp:2032kB free_cma:0kB

[8.204158] lowmem_reserve[]: 0 0 0 0
[8.204481] Node 0 DMA: 1*4kB (E) 1*8kB (U) 1*16kB (U) 2*32kB (UM) 
1*64kB (U) 1*128kB (U) 2*256kB (ME) 2*512kB (ME) 2*1024kB (M) 3*2048kB 
(UM) 343*4096kB (M) = 1414940kB
[8.205881] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=1048576kB
[8.206644] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=32768kB
[8.207381] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=2048kB
[8.208111] Node 0 hugepages_total=0 hugepages_free=0 
hugepages_surp=0 hugepages_size=64kB

[8.208826] 135976 total pagecache pages
[8.209195] 0 pages in swap cache
[8.209484] Free swap  = 0kB
[8.209733] Total swap = 0kB
[8.209989] 524288 pages RAM
[8.210239] 0 pages HighMem/MovableOnly
[8.210571] 21677 pages reserved
[8.210852] 0 pages hwpoisoned
insmod: can't insert '/lib/modules/6.0.0-rc4+/hnae3.ko': Cannot allocate 
memory


We git bisected the code and found commit c5a89f75d2a ("arm64: kaslr:
defer initialization to initcall where permitted").


Do you have any idea about the issue?


Best Regards,

Xiang Chen




Re: [PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors

2022-11-25 Thread chenxiang (M)



On 2022/11/23 20:08, Marc Zyngier wrote:

On Wed, 23 Nov 2022 01:42:36 +,
chenxiang  wrote:

From: Xiang Chen 

Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
which should be a power of 2 in qemu; in some scenarios it is not the same as
the number that the driver requires in the guest. For example, a PCI driver wants
to allocate 6 MSI vectors in the guest, but due to that limitation it will allocate
8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in the
guest only wants to allocate 6 MSI vectors.

When GICv4.1 is enabled, it iterates over all possible MSIs and enables the
forwarding, while the guest has only created some of the mappings in the virtual
ITS, so some calls fail. The exception print is as follows:
vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration
fails:66311

To avoid the issue, verify each MSI vector, and skip operations such as
request_irq() and irq_bypass_register_producer() for those invalid MSI vectors.

Signed-off-by: Xiang Chen 
---
I reported the issue at the link:
https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/

Change Log:
v1 -> v2:
Verify each MSI vector in the kernel instead of adding a system call, according to
Marc's suggestion
---
  arch/arm64/kvm/vgic/vgic-irqfd.c  | 13 +
  arch/arm64/kvm/vgic/vgic-its.c| 36 
  arch/arm64/kvm/vgic/vgic.h|  1 +
  drivers/vfio/pci/vfio_pci_intrs.c | 33 +
  include/linux/kvm_host.h  |  2 ++
  5 files changed, 85 insertions(+)

diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c
index 475059b..71f6af57 100644
--- a/arch/arm64/kvm/vgic/vgic-irqfd.c
+++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
@@ -98,6 +98,19 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
	return vgic_its_inject_msi(kvm, &msi);
  }
  
+int kvm_verify_msi(struct kvm *kvm,

+  struct kvm_kernel_irq_routing_entry *irq_entry)
+{
+   struct kvm_msi msi;
+
+   if (!vgic_has_its(kvm))
+   return -ENODEV;
+
+   kvm_populate_msi(irq_entry, &msi);
+
+   return vgic_its_verify_msi(kvm, );
+}
+
  /**
   * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
   */
diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index 94a666d..8312a4a 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -767,6 +767,42 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, 
struct kvm_msi *msi)
return 0;
  }
  
+int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi)

+{
+   struct vgic_its *its;
+   struct its_ite *ite;
+   struct kvm_vcpu *vcpu;
+   int ret = 0;
+
+   if (!irqchip_in_kernel(kvm) || (msi->flags & ~KVM_MSI_VALID_DEVID))
+   return -EINVAL;
+
+   if (!vgic_has_its(kvm))
+   return -ENODEV;
+
+   its = vgic_msi_to_its(kvm, msi);
+   if (IS_ERR(its))
+   return PTR_ERR(its);
+
+   mutex_lock(&its->its_lock);
+   if (!its->enabled) {
+   ret = -EBUSY;
+   goto unlock;
+   }
+   ite = find_ite(its, msi->devid, msi->data);
+   if (!ite || !its_is_collection_mapped(ite->collection)) {
+   ret = E_ITS_INT_UNMAPPED_INTERRUPT;
+   goto unlock;
+   }
+
+   vcpu = kvm_get_vcpu(kvm, ite->collection->target_addr);
+   if (!vcpu)
+   ret = E_ITS_INT_UNMAPPED_INTERRUPT;

I'm sorry, but what does this mean to the caller? This should never
leak outside of the ITS code.


Actually it already leaks outside of the ITS code; please see the
exception printk (E_ITS_INT_UNMAPPED_INTERRUPT is 0x10307, which is equal
to 66311):


vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
registration fails:66311





+unlock:
+   mutex_unlock(&its->its_lock);
+   return ret;
+}
+
  /*
   * Queries the KVM IO bus framework to get the ITS pointer from the given
   * doorbell address.
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 0c8da72..d452150 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -240,6 +240,7 @@ int kvm_vgic_register_its_device(void);
  void vgic_enable_lpis(struct kvm_vcpu *vcpu);
  void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu);
  int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi);
+int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi);
  int vgic_v3_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr 
*attr);
  int vgic_v3_dist_uaccess(struct kvm_vcpu *vcpu, bool is_write,
 int offset, u32 *val);
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 40c3d7c..3027805 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #include "vfio_pci_priv.h"
  
@@ -315,6 +316,28 @@ static int 

Re: [PATCH] KVM: Add system call KVM_VERIFY_MSI to verify MSI vector

2022-11-14 Thread chenxiang (M)

Hi Marc,


On 2022/11/10 18:28, Marc Zyngier wrote:

On Wed, 09 Nov 2022 06:21:18 +,
"chenxiang (M)"  wrote:

Hi Marc,


On 2022/11/8 20:47, Marc Zyngier wrote:

On Tue, 08 Nov 2022 08:08:57 +,
chenxiang  wrote:

From: Xiang Chen 

Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
which should be a power of 2, but in some scenarios it is not the same as
the number that the driver requires in the guest. For example, a PCI driver wants
to allocate 6 MSI vectors in the guest, but due to that limitation it will allocate
8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in the
guest only wants to allocate 6 MSI vectors.

When GICv4.1 is enabled, we can see an exception print as follows for the
above scenario:
vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
registration fails:66311

In order to verify whether an MSI vector is valid, add KVM_VERIFY_MSI to do
that. If there is a mapping, return 0; otherwise return a negative value.

This is the kernel part of adding system call KVM_VERIFY_MSI.

Exposing something that is an internal implementation detail to
userspace feels like the absolute wrong way to solve this issue.

Can you please characterise the issue you're having? Is it that vfio
tries to enable an interrupt for which there is no virtual ITS
mapping? Shouldn't we instead try and manage this in the kernel?

Before I reported the issue to the community, you gave a suggestion about
the issue, but I am not sure whether I misunderstood your meaning.
You can refer to the link for more details about the issue.
https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/

Right. It would have been helpful to mention this earlier. Anyway, I
would really like this to be done without involving userspace at all.

But first, can you please confirm that the VM works as expected
despite the message?

Yes, it works well except for the message.


If that's the case, we only need to handle the
case where this is a multi-MSI setup, and I think this can be done in
VFIO, without involving userspace.


It seems we can verify every kvm_msi for a multi-MSI setup in function
vfio_pci_set_msi_trigger().
If there is an invalid MSI vector, then we can decrease the number of MSI
vectors before calling vfio_msi_set_block(), along the lines of the sketch below.
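A rough sketch of that idea (illustrative only: the helper name and the exact
hook into vfio_pci_set_msi_trigger() are assumptions, not the actual vfio-pci
code), reusing the kvm_verify_msi() helper from the patch above:

/*
 * Trim the requested vector count to the number of MSIs the guest has
 * actually mapped in the vITS; everything past the first unmapped vector
 * is dropped before calling vfio_msi_set_block().
 */
static unsigned int trim_to_mapped_msis(struct kvm *kvm,
					struct kvm_kernel_irq_routing_entry *entries,
					unsigned int nvec)
{
	unsigned int i;

	for (i = 0; i < nvec; i++) {
		/* kvm_verify_msi() returns 0 only when an ITS mapping exists. */
		if (kvm_verify_msi(kvm, &entries[i]))
			return i;
	}
	return nvec;
}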




Thanks,

M.






Re: [PATCH] KVM: Add system call KVM_VERIFY_MSI to verify MSI vector

2022-11-08 Thread chenxiang (M)

Hi Marc,


On 2022/11/8 20:47, Marc Zyngier wrote:

On Tue, 08 Nov 2022 08:08:57 +,
chenxiang  wrote:

From: Xiang Chen 

Currently the number of MSI vectors comes from register PCI_MSI_FLAGS,
which should be a power of 2, but in some scenarios it is not the same as
the number that the driver requires in the guest. For example, a PCI driver wants
to allocate 6 MSI vectors in the guest, but due to that limitation it will allocate
8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in the
guest only wants to allocate 6 MSI vectors.

When GICv4.1 is enabled, we can see an exception print as follows for the
above scenario:
vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
registration fails:66311

In order to verify whether an MSI vector is valid, add KVM_VERIFY_MSI to do
that. If there is a mapping, return 0; otherwise return a negative value.

This is the kernel part of adding system call KVM_VERIFY_MSI.

Exposing something that is an internal implementation detail to
userspace feels like the absolute wrong way to solve this issue.

Can you please characterise the issue you're having? Is it that vfio
tries to enable an interrupt for which there is no virtual ITS
mapping? Shouldn't we instead try and manage this in the kernel?


Before I reported the issue to the community, you gave a suggestion about
the issue, but I am not sure whether I misunderstood your meaning.

You can refer to the link for more details about the issue.
https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/

Best regards,
Xiang



Re: [QUESTION] Exception print when enabling GICv4

2022-07-13 Thread chenxiang (M)

Hi Marc,

Thank you for your reply.

On 2022/7/12 23:25, Marc Zyngier wrote:

Hi Xiang,

On Tue, 12 Jul 2022 13:55:16 +0100,
"chenxiang (M)"  wrote:

Hi,
I encountered an issue related to GICv4 enablement on an ARM64 platform (kernel
5.19-rc4, qemu 6.2.0):
We have an acceleration module whose VF has 3 MSI interrupts, and we
pass it through to a virtual machine with the following steps:

echo :79:00.1 > /sys/bus/pci/drivers/hisi_hpre/unbind
echo vfio-pci >
/sys/devices/pci\:78/\:78\:00.0/\:79\:00.1/driver_override
echo :79:00.1 > /sys/bus/pci/drivers_probe

Then we boot the VM with "-device vfio-pci,host=79:00.1,id=net0 \".
When we insmod the driver, which registers 3 PCI MSI interrupts in the VM,
some exception prints occur as follows:

vfio-pci :3a:00.1: irq bypass producer (token 8f08224d)
registration fails: 66311

I find that bits [6:4] of register PCI_MSI_FLAGS are 2 (4 MSI interrupts)
though we only register 3 PCI MSI interrupts,

and only 3 MSI interrupts are activated in the end.
It allocates 4 vectors in function vfio_msi_enable() (qemu) as it
reads the register PCI_MSI_FLAGS.
Later it calls the VFIO_DEVICE_SET_IRQS ioctl to set up forwarding
for those interrupts
using kvm_vgic_v4_set_forwarding(), as GICv4 is enabled. For
interrupts 0~2 it succeeds in setting up forwarding as they are already
activated,
but the 4th interrupt is not activated, so no ITE is found in
vgic_its_resolve_lpi(), and the above printk occurs.

It seems that we only allocate and activate 3 MSI interrupts in the guest
while it tries to set up forwarding for 4 MSI interrupts in the host.
Do you have any idea about this issue?

I have a hunch: QEMU cannot know that the guest is only using 3 MSIs
out of the 4 that the device can use, and PCI/Multi-MSI only has a
single enable bit for all MSIs. So it probably iterates over all
possible MSIs and enable the forwarding. Since the guest has only
created 3 mappings in the virtual ITS, the last call fails. I would
expect the guest to still work properly though.


Yes, that's the reason for the exception print.
Is it possible for QEMU to get the exact number of interrupts the guest is
using? It seems not.
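(For reference, QEMU only ever sees a power-of-two count because the Multiple
Message Enable field in PCI_MSI_FLAGS, bits [6:4], is log2-encoded; a tiny
illustration, with a hypothetical helper name:)

/* MME = 2 in the MSI capability means 1 << 2 = 4 vectors, even though the
 * guest driver only activated 3 of them. */
static unsigned int msi_vectors_from_mme(unsigned int mme)
{
	return 1u << mme;
}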




Thanks,

M.






[QUESTION] Exception print when enabling GICv4

2022-07-12 Thread chenxiang (M)

Hi,
I encountered an issue related to GICv4 enablement on an ARM64 platform (kernel
5.19-rc4, qemu 6.2.0):
We have an acceleration module whose VF has 3 MSI interrupts, and we
pass it through to a virtual machine with the following steps:


echo :79:00.1 > /sys/bus/pci/drivers/hisi_hpre/unbind
echo vfio-pci > 
/sys/devices/pci\:78/\:78\:00.0/\:79\:00.1/driver_override

echo :79:00.1 > /sys/bus/pci/drivers_probe

Then we boot the VM with "-device vfio-pci,host=79:00.1,id=net0 \".
When we insmod the driver, which registers 3 PCI MSI interrupts in the VM,
some exception prints occur as follows:


vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) 
registration fails: 66311


I find that bits [6:4] of register PCI_MSI_FLAGS are 2 (4 MSI interrupts)
though we only register 3 PCI MSI interrupts,


and only 3 MSI interrupts are activated in the end.
It allocates 4 vectors in function vfio_msi_enable() (qemu) as it reads
the register PCI_MSI_FLAGS.
Later it calls the VFIO_DEVICE_SET_IRQS ioctl to set up forwarding
for those interrupts
using kvm_vgic_v4_set_forwarding(), as GICv4 is enabled. For
interrupts 0~2 it succeeds in setting up forwarding as they are already activated,
but the 4th interrupt is not activated, so no ITE is found in
vgic_its_resolve_lpi(), and the above printk occurs.


It seems that we only allocate and activate 3 MSI interrupts in the guest
while it tries to set up forwarding for 4 MSI interrupts in the host.

Do you have any idea about this issue?


Best regards,

Xiang Chen




Re: [Bug] Take more 150s to boot qemu on ARM64

2022-06-13 Thread chenxiang (M)




On 2022/6/13 21:22, Paul E. McKenney wrote:

On Mon, Jun 13, 2022 at 08:26:34PM +0800, chenxiang (M) wrote:

Hi all,

I encountered an issue with kernel 5.19-rc1 on an ARM64 board: it takes about
150s between starting to run the qemu command and beginning to boot the Linux kernel
("EFI stub: Booting Linux Kernel...").

But with kernel 5.18-rc4 it only takes about 5s. I git bisected the kernel code
and it found c2445d387850 ("srcu: Add contention check to call_srcu()
srcu_data ->lock acquisition").

The qemu (qemu version is 6.2.92) command i run is :

./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g \
--trace "kvm*" \
-cpu host \
-machine virt,accel=kvm,gic-version=3  \
-machine smp.cpus=2,smp.sockets=2 \
-no-reboot \
-nographic \
-monitor unix:/home/cx/qmp-test,server,nowait \
-bios /home/cx/boot/QEMU_EFI.fd \
-kernel /home/cx/boot/Image  \
-device 
pcie-root-port,port=0x8,chassis=1,id=net1,bus=pcie.0,multifunction=on,addr=0x1
\
-device vfio-pci,host=7d:01.3,id=net0 \
-device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4  \
-drive file=/home/cx/boot/boot_ubuntu.img,if=none,id=drive0 \
-append "rdinit=init console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw " \
-net none \
-D /home/cx/qemu_log.txt

I am not familiar with the RCU code and don't know how it causes the issue. Do
you have any idea about this issue?

Please see the discussion here:

https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03e...@linaro.org/

Though that report requires ACPI to be forced on to get the
delay, which results in more than 9,000 back-to-back calls to
synchronize_srcu_expedited().  I cannot reproduce this on my setup, even
with an artificial tight loop invoking synchronize_srcu_expedited(),
but then again I don't have ARM hardware.

My current guess is that the following patch will help, but with larger values for
SRCU_MAX_NODELAY_PHASE.  Here "larger" might well be up in the hundreds,
or perhaps even larger.

If you get a chance to experiment with this, could you please reply
to the discussion at the above URL?  (Or let me know, and I can CC
you on the next message in that thread.)


OK, thanks, I will reply at the above URL.




Thanx, Paul



diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019dea..0db7873f4e95b 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
  
  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.

  #define SRCU_MAX_INTERVAL 10  // Maximum incremental delay from slow 
readers.
-#define SRCU_MAX_NODELAY_PHASE 1   // Maximum per-GP-phase consecutive 
no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE 3   // Maximum per-GP-phase consecutive 
no-delay instances.
  #define SRCU_MAX_NODELAY  100 // Maximum consecutive no-delay 
instances.
  
  /*

@@ -522,16 +522,22 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
   */
  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
  {
+   unsigned long gpstart;
+   unsigned long j;
unsigned long jbase = SRCU_INTERVAL;
  
  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))

jbase = 0;
-   if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
-   jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
-   if (!jbase) {
-   WRITE_ONCE(ssp->srcu_n_exp_nodelay, 
READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
-   if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
-   jbase = 1;
+   if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+   j = jiffies - 1;
+   gpstart = READ_ONCE(ssp->srcu_gp_start);
+   if (time_after(j, gpstart))
+   jbase += j - gpstart;
+   if (!jbase) {
+   WRITE_ONCE(ssp->srcu_n_exp_nodelay, 
READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
+   if (READ_ONCE(ssp->srcu_n_exp_nodelay) > 
SRCU_MAX_NODELAY_PHASE)
+   jbase = 1;
+   }
}
return jbase > SRCU_MAX_INTERVAL ? SRCU_MAX_INTERVAL : jbase;
  }
.






[Bug] Take more 150s to boot qemu on ARM64

2022-06-13 Thread chenxiang (M)

Hi all,

I encountered an issue with kernel 5.19-rc1 on an ARM64 board: it takes
about 150s between starting to run the qemu command and beginning to boot
the Linux kernel ("EFI stub: Booting Linux Kernel...").


But with kernel 5.18-rc4 it only takes about 5s. I git bisected the kernel
code and it found c2445d387850 ("srcu: Add contention check to
call_srcu() srcu_data ->lock acquisition").


The qemu (qemu version is 6.2.92) command i run is :

./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g \
--trace "kvm*" \
-cpu host \
-machine virt,accel=kvm,gic-version=3  \
-machine smp.cpus=2,smp.sockets=2 \
-no-reboot \
-nographic \
-monitor unix:/home/cx/qmp-test,server,nowait \
-bios /home/cx/boot/QEMU_EFI.fd \
-kernel /home/cx/boot/Image  \
-device 
pcie-root-port,port=0x8,chassis=1,id=net1,bus=pcie.0,multifunction=on,addr=0x1 
\

-device vfio-pci,host=7d:01.3,id=net0 \
-device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4  \
-drive file=/home/cx/boot/boot_ubuntu.img,if=none,id=drive0 \
-append "rdinit=init console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw " \
-net none \
-D /home/cx/qemu_log.txt

I am not familiar with the RCU code and don't know how it causes the issue.
Do you have any idea about this issue?



Best Regard,

Xiang Chen





Re: [PATCH] hw/arm/smmuv3: Pass the real perm to returned IOMMUTLBEntry in smmuv3_translate()

2022-04-16 Thread chenxiang (M)

Hi Eric,


On 2022/4/15 0:02, Eric Auger wrote:

Hi Chenxiang,

On 4/7/22 9:57 AM, chenxiang via wrote:

From: Xiang Chen 

In function memory_region_iommu_replay(), it decides to notify() or not
according to the perm of returned IOMMUTLBEntry. But for smmuv3, the
returned perm is always IOMMU_NONE even if the translation succeeds.

I think you should state precisely in the commit message that
memory_region_iommu_replay() always calls the IOMMU MR translate()
callback with flag=IOMMU_NONE and thus, currently, translate() returns
an IOMMUTLBEntry with perm set to IOMMU_NONE if the translation
succeeds, whereas it is expected to return the actual permission set in
the table entry.


Thank you for your comments.
I will change the commit message in the next version.





Pass the real perm to returned IOMMUTLBEntry to avoid the issue.

Signed-off-by: Xiang Chen 
---
  hw/arm/smmuv3.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 674623aabe..707eb430c2 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -760,7 +760,7 @@ epilogue:
  qemu_mutex_unlock(&s->mutex);
  switch (status) {
  case SMMU_TRANS_SUCCESS:
-entry.perm = flag;
+entry.perm = cached_entry->entry.perm;

With that clarification
Reviewed-by: Eric Auger 


Ok, thanks



the translate() doc in ./include/exec/memory.h states
"
If IOMMU_NONE is passed then the IOMMU must do the
  * full page table walk and report the permissions in the returned
  * IOMMUTLBEntry. (Note that this implies that an IOMMU may not
  * return different mappings for reads and writes.)
"


Thanks

Eric

  entry.translated_addr = cached_entry->entry.translated_addr +
  (addr & cached_entry->entry.addr_mask);
  entry.addr_mask = cached_entry->entry.addr_mask;

.






Re: [PATCH] hw/vfio/common: Fix a small boundary issue of a trace

2022-04-06 Thread chenxiang (M)

Hi Damien,


On 2022/4/6 23:22, Damien Hedde wrote:



On 4/6/22 10:14, chenxiang via wrote:

From: Xiang Chen 

Right now the trace of vfio_region_sparse_mmap_entry is as follows:
vfio_region_sparse_mmap_entry sparse entry 0 [0x1000 - 0x9000]
Actually the range it wants to show is [0x1000 - 0x8fff], so fix it.

Signed-off-by: Xiang Chen 
---
  hw/vfio/common.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 080046e3f5..0b3808caf8 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1546,7 +1546,7 @@ static int 
vfio_setup_region_sparse_mmaps(VFIORegion *region,

  for (i = 0, j = 0; i < sparse->nr_areas; i++) {
  trace_vfio_region_sparse_mmap_entry(i, 
sparse->areas[i].offset,

sparse->areas[i].offset +
- sparse->areas[i].size);
+ sparse->areas[i].size - 1);
if (sparse->areas[i].size) {
  region->mmaps[j].offset = sparse->areas[i].offset;


If the size is zero, the trace will be weird, with an underflow if
the offset is zero as well.


Yes, that's an issue.


Maybe just change the trace by inverting the right bracket ?
eg: [0x1000 - 0x9000[
Or don't trace in that case ? (but I am not maintainer of this, so 
maybe that does not make sense).


But it uses [offset, offset + size - 1] in other places such as
trace_vfio_region_region_mmap()/trace_vfio_subregion_unmap()/trace_vfio_region_mmap_fault()
in the vfio code.
Maybe it is better to move this trace inside the braces of "if
(sparse->areas[i].size)", which ensures size != 0, for example as sketched below.
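A rough sketch of that restructuring of vfio_setup_region_sparse_mmaps()
(illustrative only, based on the hunk quoted above; the rest of the per-area
mmap setup is elided):

    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
        if (sparse->areas[i].size) {
            /* Only trace non-empty areas, so offset + size - 1 cannot
             * underflow. */
            trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
                                                sparse->areas[i].offset +
                                                sparse->areas[i].size - 1);
            region->mmaps[j].offset = sparse->areas[i].offset;
            /* ... rest of the per-area mmap setup ... */
            j++;
        }
    }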




--
Damien
.






Re: [PATCH] hw/arm/virt: Enable HMAT on arm virt machine

2022-01-25 Thread chenxiang (M)




On 2022/1/25 20:46, Andrew Jones wrote:

On Tue, Jan 25, 2022 at 07:46:43PM +0800, chenxiang (M) wrote:

Hi Andrew,


On 2022/1/25 18:26, Andrew Jones wrote:

On Tue, Jan 25, 2022 at 05:15:34PM +0800, chenxiang via wrote:

From: Xiang Chen 

Since the patchset ("Build ACPI Heterogeneous Memory Attribute Table (HMAT)"),
HMAT is supported, but only x86 is enabled. Enable HMAT on arm virt machine.

Hi Xiang,

What QEMU commands lines have you tested with which Linux guest kernels?

I tested it with the following command with guest kernel 5.16-rc1, and the boot
log of the guest kernel is attached:

Thanks. Please consider adding HMAT tests, see tests/qtest/numa-test.c and
tests/qtest/bios-tables-test.c, for the virt machine type to this series.
Otherwise,

Reviewed-by: Andrew Jones 


Thanks, I will add those HMAT tests in v2.





Re: [PATCH] hw/arm/virt: Enable HMAT on arm virt machine

2022-01-25 Thread chenxiang (M)

Hi Andrew,


On 2022/1/25 18:26, Andrew Jones wrote:

On Tue, Jan 25, 2022 at 05:15:34PM +0800, chenxiang via wrote:

From: Xiang Chen 

Since the patchset ("Build ACPI Heterogeneous Memory Attribute Table (HMAT)"),
HMAT is supported, but only x86 is enabled. Enable HMAT on arm virt machine.

Hi Xiang,

What QEMU commands lines have you tested with which Linux guest kernels?


I tested it with the following command with guest kernel 5.16-rc1, and the
boot log of the guest kernel is attached:


./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g \
-object memory-backend-ram,size=2G,id=m0 \
-object memory-backend-ram,size=2G,id=m1 \
-numa node,cpus=0-3,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa 
hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 
\
-numa 
hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M 
\
-numa 
hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 
\
-numa 
hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M 
\
-numa 
hmat-cache,node-id=0,size=16K,level=1,associativity=direct,policy=write-back,line=8 
\
-numa 
hmat-cache,node-id=1,size=16K,level=1,associativity=direct,policy=write-back,line=8 
\

-smp 4 \
-no-reboot \
-nographic \
-cpu host \
-machine virt,accel=kvm,gic-version=3,hmat=on \
-bios /home/cx/QEMU_EFI.fd \
-monitor unix:/home/cx/opt/qmp-test,server,nowait \
-kernel /home/cx/Image  \
-device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4  \
-drive file=/home/cx/opt/boot.img,if=none,id=drive0 \
-append "rdinit=init console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw "




Thanks,
drew


Signed-off-by: Xiang Chen 
---
  hw/arm/Kconfig   | 1 +
  hw/arm/virt-acpi-build.c | 7 +++
  2 files changed, 8 insertions(+)

diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 2e0049196d..a3c6099829 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -29,6 +29,7 @@ config ARM_VIRT
  select ACPI_APEI
  select ACPI_VIOT
  select VIRTIO_MEM_SUPPORTED
+select ACPI_HMAT
  
  config CHEETAH

  bool
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 449fab0080..f19b55e486 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -42,6 +42,7 @@
  #include "hw/acpi/memory_hotplug.h"
  #include "hw/acpi/generic_event_device.h"
  #include "hw/acpi/tpm.h"
+#include "hw/acpi/hmat.h"
  #include "hw/pci/pcie_host.h"
  #include "hw/pci/pci.h"
  #include "hw/pci/pci_bus.h"
@@ -990,6 +991,12 @@ void virt_acpi_build(VirtMachineState *vms, 
AcpiBuildTables *tables)
  build_slit(tables_blob, tables->linker, ms, vms->oem_id,
 vms->oem_table_id);
  }
+
+if (ms->numa_state->hmat_enabled) {
+acpi_add_table(table_offsets, tables_blob);
+build_hmat(tables_blob, tables->linker, ms->numa_state,
+   vms->oem_id, vms->oem_table_id);
+}
  }
  
  if (ms->nvdimms_state->is_enabled) {

--
2.33.0



.



[root@centos build]# ./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g -object 
memory-backend-ram,size=2G,id=m0 -object memory-backend-ram,size=2G,id=m1 -numa 
node,cpus=0-3,nodeid=0,memdev=m0 -numa node,nodeid=1,memdev=m1,initiator=0 
-numa 
hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5
 -numa 
hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M
 -numa 
hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10
 -numa 
hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M
 -numa 
hmat-cache,node-id=0,size=16K,level=1,associativity=direct,policy=write-back,line=8
 -numa 
hmat-cache,node-id=1,size=16K,level=1,associativity=direct,policy=write-back,line=8
 -smp 4 -no-reboot -nographic -cpu host -machine 
virt,accel=kvm,gic-version=3,hmat=on -bios /home/cx/QEMU_EFI.fd -monitor 
unix:/home/cx/opt/qmp-test,server,nowait -kernel /home/cx/Image  -device 
virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4  -drive 
file=/home/cx/opt/boot.img,if=none,id=drive0 -append "rdinit=init 
console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw "
WARNING: Image format was not specified for '/home/cx/opt/boot.img' and probing 
guessed raw.
 Automatically detecting the format is dangerous for raw images, write 
operations on block 0 will be restricted.
 Specify the 'raw' format explicitly to remove the restrictions.
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: Generating empty DTB
EFI stub: Exiting boot services...
[0.00] Booting Linux on physical CPU 0x00 [0x481fd010]
[0.00] Linux version 5.16.0-rc1-15060-g07d132dd883a (chenxiang@plinth) 
(aarch64-linux-gnu-gcc (Linaro GCC 7.3-2018.05-rc1) 7.3.1 20180425 
[linaro-7.3-2018.05-rc1 revision 38aec9a676236eaa42ca03ccb3a6c1dd0182c29f], GNU 
ld (Linaro_Binutils-2018.05-rc1) 

Re: [RFC v2 1/2] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5

2022-01-06 Thread chenxiang (M)




On 2022/1/6 19:00, Eric Auger wrote:

Hi Chenxiang,

On 12/29/21 8:13 AM, chenxiang (M) via wrote:

Hi Eric,


在 2021/10/5 16:53, Eric Auger 写道:

Add a 'preserve_config' field in struct GPEXConfig and
if set generate the DSM #5 for preserving PCI boot configurations.
The DSM presence is needed to expose RMRs.

At the moment the DSM generation is not yet enabled.

Signed-off-by: Eric Auger 
---
   include/hw/pci-host/gpex.h |  1 +
   hw/pci-host/gpex-acpi.c| 12 
   2 files changed, 13 insertions(+)

diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index fcf8b63820..3f8f8ec38d 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -64,6 +64,7 @@ struct GPEXConfig {
   MemMapEntry pio;
   int irq;
   PCIBus  *bus;
+boolpreserve_config;
   };
 int gpex_set_irq_num(GPEXHost *s, int index, int gsi);
diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index e7e162a00a..7dab259379 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -164,6 +164,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct
GPEXConfig *cfg)
   aml_append(dev, aml_name_decl("_PXM",
aml_int(numa_node)));
   }
   +if (cfg->preserve_config) {
+method = aml_method("_DSM", 5, AML_SERIALIZED);

I notice there is an ACPI BIOS Error when booting the virtual machine which
seems to be caused by this patch, as I added this patchset to my branch to
test the vSMMU function.
It seems that method _DSM requires only 4 parameters, but 5
parameters are used here.
The error log is as following:

Thank you for the heads up. Yes the problem was reported by Igor too in
https://www.mail-archive.com/qemu-devel@nongnu.org/msg842972.html.

At the moment the RMRR ACPI situation has not progressed on spec side or
kernel if I have not missed anything but sure I will take this into
account in my next respin.


Ok, thanks.



Thanks!

Eric

[2.355459] ACPI BIOS Error (bug): Failure creating named object
[\_SB.PCI0._DSM], AE_ALREADY_EXISTS (20210930/dswload2-327)
[2.355467] ACPI Error: AE_ALREADY_EXISTS, During name
lookup/catalog (20210930/psobject-221)
[2.355470] ACPI: Skipping parse of AML opcode: OpcodeName
unavailable (0x0014)
[2.355657] ACPI: 1 ACPI AML tables successfully acquired and loaded
[2.356321] ACPI: Interpreter enabled
[2.356323] ACPI: Using GIC for interrupt routing
[2.356333] ACPI: MCFG table detected, 1 entries
[2.361359] ARMH0011:00: ttyAMA0 at MMIO 0x900 (irq = 16,
base_baud = 0) is a SBSA
[2.619805] printk: console [ttyAMA0] enabled
[2.622114] ACPI: PCI Root Bridge [PCI0] (domain  [bus 00-ff])
[2.622788] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
ClockPM Segments MSI HPX-Type3]
[2.623776] acpi PNP0A08:00: _OSC: platform does not support [LTR]
[2.624600] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME
AER PCIeCapability]
[2.625721] acpi PNP0A08:00: ECAM area [mem
0x401000-0x401fff] reserved by PNP0C02:00
[2.626645] acpi PNP0A08:00: ECAM at [mem
0x401000-0x401fff] for [bus 00-ff]
[2.627450] ACPI: Remapped I/O 0x3eff to [io
0x-0x window]
[2.628229] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments
- ASL declared 5, ACPI requires 4 (20210930/nsarguments-166)
[2.629576] PCI host bridge to bus :00
[2.630008] pci_bus :00: root bus resource [mem
0x1000-0x3efe window]
[2.630747] pci_bus :00: root bus resource [io  0x-0x
window]
[2.631405] pci_bus :00: root bus resource [mem
0x80-0xff window]
[2.632177] pci_bus :00: root bus resource [bus 00-ff]
[2.632731] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments
- ASL declared 5, ACPI requires 4 (20210930/nsarguments-166)



+aml_append(method, aml_return(aml_int(0)));
+aml_append(dev, method);
+}
+
   acpi_dsdt_add_pci_route_table(dev, cfg->irq);
 /*
@@ -191,6 +197,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct
GPEXConfig *cfg)
   aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0
Device")));
   aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
   +if (cfg->preserve_config) {
+method = aml_method("_DSM", 5, AML_SERIALIZED);
+aml_append(method, aml_return(aml_int(0)));
+aml_append(dev, method);
+}
+
   acpi_dsdt_add_pci_route_table(dev, cfg->irq);
 method = aml_method("_CBA", 0, AML_NOTSERIALIZED);



.






Re: [RFC v2 1/2] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5

2021-12-28 Thread chenxiang (M)

Hi Eric,


On 2021/10/5 16:53, Eric Auger wrote:

Add a 'preserve_config' field in struct GPEXConfig and
if set generate the DSM #5 for preserving PCI boot configurations.
The DSM presence is needed to expose RMRs.

At the moment the DSM generation is not yet enabled.

Signed-off-by: Eric Auger 
---
  include/hw/pci-host/gpex.h |  1 +
  hw/pci-host/gpex-acpi.c| 12 
  2 files changed, 13 insertions(+)

diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index fcf8b63820..3f8f8ec38d 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -64,6 +64,7 @@ struct GPEXConfig {
  MemMapEntry pio;
  int irq;
  PCIBus  *bus;
+boolpreserve_config;
  };
  
  int gpex_set_irq_num(GPEXHost *s, int index, int gsi);

diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index e7e162a00a..7dab259379 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -164,6 +164,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
  aml_append(dev, aml_name_decl("_PXM", aml_int(numa_node)));
  }
  
+if (cfg->preserve_config) {

+method = aml_method("_DSM", 5, AML_SERIALIZED);


I notice there is an ACPI BIOS Error when booting the virtual machine which
seems to be caused by this patch, as I added this patchset to my branch to
test the vSMMU function.
It seems that method _DSM requires only 4 parameters, but 5 parameters are
used here (a minimal illustrative fix is sketched after the error log below).

The error log is as following:

[2.355459] ACPI BIOS Error (bug): Failure creating named object 
[\_SB.PCI0._DSM], AE_ALREADY_EXISTS (20210930/dswload2-327)
[2.355467] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog 
(20210930/psobject-221)
[2.355470] ACPI: Skipping parse of AML opcode: OpcodeName 
unavailable (0x0014)

[2.355657] ACPI: 1 ACPI AML tables successfully acquired and loaded
[2.356321] ACPI: Interpreter enabled
[2.356323] ACPI: Using GIC for interrupt routing
[2.356333] ACPI: MCFG table detected, 1 entries
[2.361359] ARMH0011:00: ttyAMA0 at MMIO 0x900 (irq = 16, 
base_baud = 0) is a SBSA

[2.619805] printk: console [ttyAMA0] enabled
[2.622114] ACPI: PCI Root Bridge [PCI0] (domain  [bus 00-ff])
[2.622788] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM 
ClockPM Segments MSI HPX-Type3]

[2.623776] acpi PNP0A08:00: _OSC: platform does not support [LTR]
[2.624600] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME 
AER PCIeCapability]
[2.625721] acpi PNP0A08:00: ECAM area [mem 
0x401000-0x401fff] reserved by PNP0C02:00
[2.626645] acpi PNP0A08:00: ECAM at [mem 0x401000-0x401fff] 
for [bus 00-ff]
[2.627450] ACPI: Remapped I/O 0x3eff to [io 
0x-0x window]
[2.628229] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments - 
ASL declared 5, ACPI requires 4 (20210930/nsarguments-166)

[2.629576] PCI host bridge to bus :00
[2.630008] pci_bus :00: root bus resource [mem 
0x1000-0x3efe window]

[2.630747] pci_bus :00: root bus resource [io  0x-0x window]
[2.631405] pci_bus :00: root bus resource [mem 
0x80-0xff window]

[2.632177] pci_bus :00: root bus resource [bus 00-ff]
[2.632731] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments - 
ASL declared 5, ACPI requires 4 (20210930/nsarguments-166)
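A minimal illustrative fix for the excess-arguments warning (just a sketch on
my side, not necessarily what the final respin should do): _DSM takes four
arguments per the ACPI spec (UUID, revision, function index, arguments
package), so the stub could be declared with 4 instead of 5:

    method = aml_method("_DSM", 4, AML_SERIALIZED);
    aml_append(method, aml_return(aml_int(0)));
    aml_append(dev, method);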




+aml_append(method, aml_return(aml_int(0)));
+aml_append(dev, method);
+}
+
  acpi_dsdt_add_pci_route_table(dev, cfg->irq);
  
  /*

@@ -191,6 +197,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
  aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0 Device")));
  aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
  
+if (cfg->preserve_config) {

+method = aml_method("_DSM", 5, AML_SERIALIZED);
+aml_append(method, aml_return(aml_int(0)));
+aml_append(dev, method);
+}
+
  acpi_dsdt_add_pci_route_table(dev, cfg->irq);
  
  method = aml_method("_CBA", 0, AML_NOTSERIALIZED);





Re: [RESEND RFC] hw/arm/smmuv3: add device properties to disable cached iotlb

2021-08-06 Thread chenxiang (M)

Hi Eric,


On 2021/8/5 16:10, Eric Auger wrote:

Hi Chenxiang,
On 8/5/21 9:48 AM, chenxiang (M) wrote:

Hi Eric,


On 2021/8/5 0:26, Eric Auger wrote:

Hi Chenxiang,

On 8/4/21 10:49 AM, chenxiang wrote:

From: Xiang Chen 

Patch 6d9cd115b ("hw/arm/smmuv3: Enforce invalidation on a power of two
range") splits invalidations into power-of-2 range invalidations.
So for scenarios where the invalidation size is not a power-of-2 range,
it costs more time to invalidate.

this power-of-2 split is not only necessary for internal TLB management but also
for IOMMU MR notifier calls (which use a mask), ie. IOTLB unmap
notifications used for both vhost and vfio integrations.
So you can disable the internal IOTLB but we can't simply remove the pow
of 2 split. See below.

Right, in the current qemu code it is not right to simply remove the
pow-of-2 split.
But I find that in my local repo there is a private patch which seems to
solve the issue, so it works in my test.

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 4a7a183..83d24e1 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -825,7 +825,8 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
  event.type = IOMMU_NOTIFIER_UNMAP;
  event.entry.target_as = &address_space_memory;
  event.entry.iova = iova;
-event.entry.addr_mask = num_pages * (1 << granule) - 1;
+event.entry.addr_mask = (1 << granule) - 1;
+   event.entry.num_pages = num_pages;

OK I see. But you change the existing semantic of addr_mask which
originally matches the mask of the  full addr range of the IOTLB
operation and you replace it by the granule mask and add another
num_pages field.

This is a change in the memory.h API and should be discussed with other
memory.h and vIOMMU maintainers if you want to go that way. This
typically breaks vhost integration which does not use num_pages and
would typically fail invalidating the full range.

So we have 2 different things: the disablement of the internal IOTLB (x-
prop), which can be done easily, but what you mostly want is to remove the
pow-of-2 splits to reduce the interactions with the physical IOMMU in
the VFIO/SMMU use case, right?


Yes, I mainly want to remove the pow-of-2 splits to reduce the number of
invalidations, which I think will affect the performance.


The pow-of-2 splits are also needed for vhost
integration at the moment. Note this use case is not upstreamed and far
from being upstreamed given the /dev/iommu redesign, so it will be
difficult to justify that kind of change at this moment.


I am not familiar with vhost and maybe need to investigate it more.
Do you have any suggestion about how to improve the issue?



Thanks

Eric

  event.entry.perm = IOMMU_NONE;
  event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
  event.entry.arch_id = asid;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a863b7d..7b026f0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -639,7 +639,7 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier
*n, IOMMUTLBEntry *iotlb)
  {
  hwaddr start = iotlb->iova + giommu->iommu_offset;
  struct iommu_inv_addr_info *addr_info;
-size_t size = iotlb->addr_mask + 1;
+size_t size = iotlb->num_pages * (iotlb->addr_mask + 1);
  int archid = -1;

  addr_info = _info;
@@ -653,8 +653,8 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier
*n, IOMMUTLBEntry *iotlb)
  }
  addr_info->archid = archid;
  addr_info->addr = start;
-addr_info->granule_size = size;
-addr_info->nb_granules = 1;
+addr_info->granule_size = iotlb->addr_mask + 1;
+   addr_info->nb_granules = iotlb->num_pages;
  trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
  1, iotlb->leaf);
  break;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0c4389c..268a395 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -110,6 +110,7 @@ struct IOMMUTLBEntry {
  hwaddr   iova;
  hwaddr   translated_addr;
  hwaddr   addr_mask;
+   uint64_t num_pages;
  IOMMUAccessFlags perm;
  IOMMUInvGranularity granularity;
  #define IOMMU_INV_FLAGS_PASID  (1 << 0)



internal TLB could be disabled through a property but I would rather set
it as an "x-" experimental property for debug purpose. Until recently
this was indeed helpful to debug bugs related to internal IOTLB
management (RIL support) ;-) I hope this period is over though ;-)

OK, maybe we can set it as an "x-" experimental property for now.


Currently smmuv3_translate is rarely used (I only see it used when
binding MSIs), so I think maybe we can disable the cached IOTLB to improve
the efficiency of invalidation. So add a device property disable_cached_iotlb
to disable the cached IOTLB, and then we can send non-power-of-2 range
invalidations directly.
Use tool dma_map_

Re: [RESEND RFC] hw/arm/smmuv3: add device properties to disable cached iotlb

2021-08-05 Thread chenxiang (M)

Hi Eric,


On 2021/8/5 0:26, Eric Auger wrote:

Hi Chenxiang,

On 8/4/21 10:49 AM, chenxiang wrote:

From: Xiang Chen 

Patch 6d9cd115b ("hw/arm/smmuv3: Enforce invalidation on a power of two range")
splits invalidations into power-of-2 range invalidations.
So for scenarios where the invalidation size is not a power-of-2 range,
it costs more time to invalidate.

this power-of-2 split is not only necessary for internal TLB management but also
for IOMMU MR notifier calls (which use a mask), ie. IOTLB unmap
notifications used for both vhost and vfio integrations.
So you can disable the internal IOTLB but we can't simply remove the pow
of 2 split. See below.
Right, in the current qemu code it is not right to simply remove the
pow-of-2 split.
But I find that in my local repo there is a private patch which seems to
solve the issue, so it works in my test.


diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 4a7a183..83d24e1 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -825,7 +825,8 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
 event.type = IOMMU_NOTIFIER_UNMAP;
 event.entry.target_as = &address_space_memory;
 event.entry.iova = iova;
-event.entry.addr_mask = num_pages * (1 << granule) - 1;
+event.entry.addr_mask = (1 << granule) - 1;
+   event.entry.num_pages = num_pages;
 event.entry.perm = IOMMU_NONE;
 event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
 event.entry.arch_id = asid;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a863b7d..7b026f0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -639,7 +639,7 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier 
*n, IOMMUTLBEntry *iotlb)

 {
 hwaddr start = iotlb->iova + giommu->iommu_offset;
 struct iommu_inv_addr_info *addr_info;
-size_t size = iotlb->addr_mask + 1;
+size_t size = iotlb->num_pages * (iotlb->addr_mask + 1);
 int archid = -1;

 addr_info = _info;
@@ -653,8 +653,8 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier 
*n, IOMMUTLBEntry *iotlb)

 }
 addr_info->archid = archid;
 addr_info->addr = start;
-addr_info->granule_size = size;
-addr_info->nb_granules = 1;
+addr_info->granule_size = iotlb->addr_mask + 1;
+   addr_info->nb_granules = iotlb->num_pages;
 trace_vfio_iommu_addr_inv_iotlb(archid, start, size,
 1, iotlb->leaf);
 break;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0c4389c..268a395 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -110,6 +110,7 @@ struct IOMMUTLBEntry {
 hwaddr   iova;
 hwaddr   translated_addr;
 hwaddr   addr_mask;
+   uint64_t num_pages;
 IOMMUAccessFlags perm;
 IOMMUInvGranularity granularity;
 #define IOMMU_INV_FLAGS_PASID  (1 << 0)




internal TLB could be disabled through a property but I would rather set
it as an "x-" experimental property for debug purpose. Until recently
this was indeed helpful to debug bugs related to internal IOTLB
management (RIL support) ;-) I hope this period is over though ;-)

OK, maybe we can set it as an "x-" experimental property for now, along the lines of the sketch below.
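A rough sketch of what such an experimental property could look like (an
assumption for illustration only; the actual patch body is truncated in this
archive, and the field name on SMMUv3State is made up here):

    static Property smmuv3_properties[] = {
        DEFINE_PROP_BOOL("x-disable-cached-iotlb", SMMUv3State,
                         disable_cached_iotlb, false),
        DEFINE_PROP_END_OF_LIST(),
    };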


Currently smmuv3_translate is rarely used (I only see it used when
binding MSIs), so I think maybe we can disable the cached IOTLB to improve
the efficiency of invalidation. So add a device property disable_cached_iotlb
to disable the cached IOTLB, and then we can send non-power-of-2 range
invalidations directly.
Using the dma_map_benchmark tool to test unmap latency,
we can see it improves unmap a lot when the invalidation size
is not a power-of-2 range (such as g = 7/15/31/511):

t = 1 (thread = 1)
                     before opt (us)   after opt (us)
g=1   (4K size)      0.2/7.6           0.2/7.5
g=4   (8K size)      0.4/7.9           0.4/7.9
g=7   (28K size)     0.6/10.2          0.6/8.2
g=8   (32K size)     0.6/8.3           0.6/8.3
g=15  (60K size)     1.1/12.1          1.1/9.1
g=16  (64K size)     1.1/9.2           1.1/9.1
g=31  (124K size)    2.0/14.8          2.0/10.7
g=32  (128K size)    2.1/14.8          2.1/10.7
g=511 (2044K size)   30.9/65.1         31.1/55.9
g=512 (2048K size)   0.3/32.1          0.3/32.1

t = 10 (thread = 10)
                     before opt (us)   after opt (us)
g=1   (4K size)      0.2/39.9          0.2/39.1
g=4   (8K size)      0.5/42.6          0.5/42.4
g=7   (28K size)     0.6/66.4          0.6/45.3
g=8   (32K size)     0.7/45.8          0.7/46.1
g=15  (60K size)     1.1/80.5          1.1/49.6
g=16  (64K size)     1.1/49.8          1.1/50.2
g=31  (124K size)    2.0/98.3          2.1/58.0
g=32  (128K size)    2.1/57.7          2.1/58.2
g=511 (2044K size)   35.2/322.2        35.3/236.7
g=512 (2048K size)   0.8/238.2         0.9/240.3

Note: I tested it with vSMMU enabled, based on the patchset
("vSMMUv3/pSMMUv3 2 stage VFIO integration").

Signed-off-by: Xiang Chen 
---
  hw/arm/smmuv3.c |