Re: regression: insmod module failed in VM with nvdimm on
Hi Ard, 在 2022/12/1 19:07, Ard Biesheuvel 写道: On Thu, 1 Dec 2022 at 09:07, Ard Biesheuvel wrote: On Thu, 1 Dec 2022 at 08:15, chenxiang (M) wrote: Hi Ard, 在 2022/11/30 16:18, Ard Biesheuvel 写道: On Wed, 30 Nov 2022 at 08:53, Marc Zyngier wrote: On Wed, 30 Nov 2022 02:52:35 +, "chenxiang (M)" wrote: Hi, We boot the VM using following commands (with nvdimm on) (qemu version 6.1.50, kernel 6.0-r4): How relevant is the presence of the nvdimm? Do you observe the failure without this? qemu-system-aarch64 -machine virt,kernel_irqchip=on,gic-version=3,nvdimm=on -kernel /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0 ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1' -object memory-backend-ram,id=ram1,size=10G -device nvdimm,id=dimm1,memdev=ram1 -device ioh3420,id=root_port1,chassis=1 -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1 Then in VM we insmod a module, vmalloc error occurs as follows (kernel 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4): estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko [8.186563] vmap allocation for size 20480 failed: use vmalloc= to increase size Have you tried increasing the vmalloc size to check that this is indeed the problem? [...] We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr: defer initialization to initcall where permitted"). I guess you mean commit fc5a89f75d2a instead, right? Do you have any idea about the issue? I sort of suspect that the nvdimm gets vmap-ed and consumes a large portion of the vmalloc space, but you give very little information that could help here... Ouch. I suspect what's going on here: that patch defers the randomization of the module region, so that we can decouple it from the very early init code. 
Obviously, it is happening too late now, and the randomized module region is overlapping with a vmalloc region that is in use by the time the randomization occurs. Does the below fix the issue?

The issue still occurs, but the change seems to decrease the probability: before, it occurred almost every time, while after the change it took two or three tries to reproduce. But if I change "subsys_initcall" back to "core_initcall", I have tested more than 20 times and it is still OK.

Thank you for confirming. I will send out a patch today.

...but before I do that, could you please check whether the change below fixes your issue as well?

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 6ccc7ef600e7c1e1..c8c205b630da1951 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -20,7 +20,11 @@
 #include
 #include

-u64 __ro_after_init module_alloc_base;
+/*
+ * Set a reasonable default for module_alloc_base in case
+ * we end up running with module randomization disabled.
+ */
+u64 __ro_after_init module_alloc_base = (u64)_etext - MODULES_VSIZE;
 u16 __initdata memstart_offset_seed;

 struct arm64_ftr_override kaslr_feature_override __initdata;
@@ -30,12 +34,6 @@ static int __init kaslr_init(void)
 	u64 module_range;
 	u32 seed;

-	/*
-	 * Set a reasonable default for module_alloc_base in case
-	 * we end up running with module randomization disabled.
-	 */
-	module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
 	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
 		pr_info("KASLR disabled on command line\n");
 		return 0;

We have tested this change, but it does not fix the issue; the failure still occurs.
Re: regression: insmod module failed in VM with nvdimm on
在 2022/12/1 19:07, Ard Biesheuvel 写道: On Thu, 1 Dec 2022 at 09:07, Ard Biesheuvel wrote: On Thu, 1 Dec 2022 at 08:15, chenxiang (M) wrote: Hi Ard, 在 2022/11/30 16:18, Ard Biesheuvel 写道: On Wed, 30 Nov 2022 at 08:53, Marc Zyngier wrote: On Wed, 30 Nov 2022 02:52:35 +, "chenxiang (M)" wrote: Hi, We boot the VM using following commands (with nvdimm on) (qemu version 6.1.50, kernel 6.0-r4): How relevant is the presence of the nvdimm? Do you observe the failure without this? qemu-system-aarch64 -machine virt,kernel_irqchip=on,gic-version=3,nvdimm=on -kernel /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0 ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1' -object memory-backend-ram,id=ram1,size=10G -device nvdimm,id=dimm1,memdev=ram1 -device ioh3420,id=root_port1,chassis=1 -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1 Then in VM we insmod a module, vmalloc error occurs as follows (kernel 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4): estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko [8.186563] vmap allocation for size 20480 failed: use vmalloc= to increase size Have you tried increasing the vmalloc size to check that this is indeed the problem? [...] We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr: defer initialization to initcall where permitted"). I guess you mean commit fc5a89f75d2a instead, right? Do you have any idea about the issue? I sort of suspect that the nvdimm gets vmap-ed and consumes a large portion of the vmalloc space, but you give very little information that could help here... Ouch. I suspect what's going on here: that patch defers the randomization of the module region, so that we can decouple it from the very early init code. 
Obviously, it is happening too late now, and the randomized module region is overlapping with a vmalloc region that is in use by the time the randomization occurs. Does the below fix the issue?

The issue still occurs, but the change seems to decrease the probability: before, it occurred almost every time, while after the change it took two or three tries to reproduce. But if I change "subsys_initcall" back to "core_initcall", I have tested more than 20 times and it is still OK.

Thank you for confirming. I will send out a patch today.

...but before I do that, could you please check whether the change below fixes your issue as well?

Yes, but I can only reply to you tomorrow, as someone else is testing on the only environment today.

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 6ccc7ef600e7c1e1..c8c205b630da1951 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -20,7 +20,11 @@
 #include
 #include

-u64 __ro_after_init module_alloc_base;
+/*
+ * Set a reasonable default for module_alloc_base in case
+ * we end up running with module randomization disabled.
+ */
+u64 __ro_after_init module_alloc_base = (u64)_etext - MODULES_VSIZE;
 u16 __initdata memstart_offset_seed;

 struct arm64_ftr_override kaslr_feature_override __initdata;
@@ -30,12 +34,6 @@ static int __init kaslr_init(void)
 	u64 module_range;
 	u32 seed;

-	/*
-	 * Set a reasonable default for module_alloc_base in case
-	 * we end up running with module randomization disabled.
-	 */
-	module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
 	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
 		pr_info("KASLR disabled on command line\n");
 		return 0;
Re: regression: insmod module failed in VM with nvdimm on
Hi Ard, 在 2022/11/30 16:18, Ard Biesheuvel 写道: On Wed, 30 Nov 2022 at 08:53, Marc Zyngier wrote: On Wed, 30 Nov 2022 02:52:35 +, "chenxiang (M)" wrote: Hi, We boot the VM using following commands (with nvdimm on) (qemu version 6.1.50, kernel 6.0-r4): How relevant is the presence of the nvdimm? Do you observe the failure without this? qemu-system-aarch64 -machine virt,kernel_irqchip=on,gic-version=3,nvdimm=on -kernel /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0 ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1' -object memory-backend-ram,id=ram1,size=10G -device nvdimm,id=dimm1,memdev=ram1 -device ioh3420,id=root_port1,chassis=1 -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1 Then in VM we insmod a module, vmalloc error occurs as follows (kernel 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4): estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko [8.186563] vmap allocation for size 20480 failed: use vmalloc= to increase size Have you tried increasing the vmalloc size to check that this is indeed the problem? [...] We git bisect the code, and find the patch c5a89f75d2a ("arm64: kaslr: defer initialization to initcall where permitted"). I guess you mean commit fc5a89f75d2a instead, right? Do you have any idea about the issue? I sort of suspect that the nvdimm gets vmap-ed and consumes a large portion of the vmalloc space, but you give very little information that could help here... Ouch. I suspect what's going on here: that patch defers the randomization of the module region, so that we can decouple it from the very early init code. Obviously, it is happening too late now, and the randomized module region is overlapping with a vmalloc region that is in use by the time the randomization occurs. Does the below fix the issue? 
The issue still occurs, but the change seems to decrease the probability: before, it occurred almost every time, while after the change it took two or three tries to reproduce. But if I change "subsys_initcall" back to "core_initcall", I have tested more than 20 times and it is still OK.

diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index 37a9deed2aec..71fb18b2f304 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -90,4 +90,4 @@ static int __init kaslr_init(void)
 	return 0;
 }

-subsys_initcall(kaslr_init)
+arch_initcall(kaslr_init)
Re: regression: insmod module failed in VM with nvdimm on
Hi Marc, 在 2022/11/30 15:53, Marc Zyngier 写道: On Wed, 30 Nov 2022 02:52:35 +, "chenxiang (M)" wrote: Hi, We boot the VM using following commands (with nvdimm on) (qemu version 6.1.50, kernel 6.0-r4): How relevant is the presence of the nvdimm? Do you observe the failure without this? We didn't see the failure without it. qemu-system-aarch64 -machine virt,kernel_irqchip=on,gic-version=3,nvdimm=on -kernel /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0 ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1' -object memory-backend-ram,id=ram1,size=10G -device nvdimm,id=dimm1,memdev=ram1 -device ioh3420,id=root_port1,chassis=1 -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1 Then in VM we insmod a module, vmalloc error occurs as follows (kernel 5.19-rc4 is normal, and the issue is still on kernel 6.1-rc4): estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko [8.186563] vmap allocation for size 20480 failed: use vmalloc= to increase size Have you tried increasing the vmalloc size to check that this is indeed the problem? [...] 
I didn't increase the vmalloc size, but I checked the vmalloc size and I think it is big enough when the issue occurs:

estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[4.879899] vmap allocation for size 20480 failed: use vmalloc= to increase size
[4.880643] insmod: vmalloc error: size 16384, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
[4.881802] CPU: 1 PID: 230 Comm: insmod Not tainted 6.1.0-rc4+ #21
[4.882414] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[4.883082] Call trace:
[4.88] dump_backtrace.part.0+0xc4/0xd0
[4.883766] show_stack+0x20/0x50
[4.884091] dump_stack_lvl+0x68/0x84
[4.884450] dump_stack+0x18/0x34
[4.884778] warn_alloc+0x11c/0x1bc
[4.885124] __vmalloc_node_range+0x50c/0x64c
[4.885553] module_alloc+0xf4/0x100
[4.885902] load_module+0x858/0x1e90
[4.886265] __do_sys_init_module+0x1c0/0x200
[4.886699] __arm64_sys_init_module+0x24/0x30
[4.887147] invoke_syscall+0x50/0x120
[4.887516] el0_svc_common.constprop.0+0x58/0x190
[4.887993] do_el0_svc+0x34/0xc0
[4.888327] el0_svc+0x2c/0xb4
[4.888631] el0t_64_sync_handler+0xb8/0xbc
[4.889046] el0t_64_sync+0x19c/0x1a0
[4.889423] Mem-Info:
[4.889639] active_anon:9679 inactive_anon:63094 isolated_anon:0
[4.889639] active_file:0 inactive_file:0 isolated_file:0
[4.889639] unevictable:0 dirty:0 writeback:0
[4.889639] slab_reclaimable:3322 slab_unreclaimable:3082
[4.889639] mapped:873 shmem:72569 pagetables:34
[4.889639] sec_pagetables:0 bounce:0
[4.889639] kernel_misc_reclaimable:0
[4.889639] free:416212 free_pcp:4414 free_cma:0
[4.893362] Node 0 active_anon:38716kB inactive_anon:252376kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3492kB dirty:0kB writeback:0kB shmem:290276kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:1904kB pagetables:136kB sec_pagetables:0kB all_unreclaimable? no
[4.896343] Node 0 DMA free:1664848kB boost:0kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:38716kB inactive_anon:252376kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2097152kB managed:2010376kB mlocked:0kB bounce:0kB free_pcp:17704kB local_pcp:3668kB free_cma:0kB
[4.899097] lowmem_reserve[]: 0 0 0 0 0
[4.899466] Node 0 DMA: 2*4kB (UM) 1*8kB (M) 2*16kB (UM) 1*32kB (M) 2*64kB (ME) 1*128kB (U) 2*256kB (ME) 2*512kB (M) 6*1024kB (UME) 5*2048kB (UM) 402*4096kB (M) = 1664848kB
[4.900865] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[4.901648] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[4.902526] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[4.903354] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[4.904173] 72569 total pagecache pages
[4.904524] 0 pages in swap cache
[4.904831] Free swap = 0kB
[4.905109] Total swap = 0kB
[4.905407] 524288 pages RAM
[4.905696] 0 pages HighMem/MovableOnly
[4.906085] 21694 pages reserved
[4.906388] 0 pages hwpoisoned
insmod: can't insert '/lib/modules/6.1.0-rc4+/hnae3.ko': Cannot allocate memory
estuary:/$ insmod /lib/modules/$(uname -r)/hns3.ko
[4.911599] vmap allocation for size 122880 failed: use vmalloc= to increase size
insmod: can't insert '/lib/modules/6.1.0-rc4+/hns3.ko': Cannot allocate memory
estuary:/$ insmod /lib/modules/$(uname -r)/hclge.ko
[4.917761] vmap allocation for size 319488 failed: use vmalloc= to increase size
insmod: ca
regression: insmod module failed in VM with nvdimm on
Hi,

We boot the VM using the following command (with nvdimm on) (qemu version 6.1.50, kernel 6.0-rc4):

qemu-system-aarch64 -machine virt,kernel_irqchip=on,gic-version=3,nvdimm=on -kernel /home/kernel/Image -initrd /home/mini-rootfs/rootfs.cpio.gz -bios /root/QEMU_EFI.FD -cpu host -enable-kvm -net none -nographic -m 2G,maxmem=64G,slots=3 -smp 4 -append 'rdinit=init console=ttyAMA0 ealycon=pl0ll,0x9000 pcie_ports=native pciehp.pciehp_debug=1' -object memory-backend-ram,id=ram1,size=10G -device nvdimm,id=dimm1,memdev=ram1 -device ioh3420,id=root_port1,chassis=1 -device vfio-pci,host=7d:01.0,id=net0,bus=root_port1

Then in the VM we insmod a module, and a vmalloc error occurs as follows (kernel 5.19-rc4 is normal, and the issue is still present on kernel 6.1-rc4):

estuary:/$ insmod /lib/modules/$(uname -r)/hnae3.ko
[8.186563] vmap allocation for size 20480 failed: use vmalloc= to increase size
[8.187288] insmod: vmalloc error: size 16384, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
[8.188402] CPU: 1 PID: 235 Comm: insmod Not tainted 6.0.0-rc4+ #1
[8.188958] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[8.189593] Call trace:
[8.189825] dump_backtrace.part.0+0xc4/0xd0
[8.190245] show_stack+0x24/0x40
[8.190563] dump_stack_lvl+0x68/0x84
[8.190913] dump_stack+0x18/0x34
[8.191223] warn_alloc+0x124/0x1b0
[8.191555] __vmalloc_node_range+0xe4/0x55c
[8.191959] module_alloc+0xf8/0x104
[8.192305] load_module+0x854/0x1e70
[8.192655] __do_sys_init_module+0x1e0/0x220
[8.193075] __arm64_sys_init_module+0x28/0x34
[8.193489] invoke_syscall+0x50/0x120
[8.193841] el0_svc_common.constprop.0+0x58/0x1a0
[8.194296] do_el0_svc+0x38/0xd0
[8.194613] el0_svc+0x2c/0xc0
[8.194901] el0t_64_sync_handler+0x1ac/0x1b0
[8.195313] el0t_64_sync+0x19c/0x1a0
[8.195672] Mem-Info:
[8.195872] active_anon:17641 inactive_anon:118549 isolated_anon:0
[8.195872] active_file:0 inactive_file:0 isolated_file:0
[8.195872] unevictable:0 dirty:0 writeback:0
[8.195872] slab_reclaimable:3439 slab_unreclaimable:3067
[8.195872] mapped:877 shmem:135976 pagetables:39 bounce:0
[8.195872] kernel_misc_reclaimable:0
[8.195872] free:353735 free_pcp:3210 free_cma:0
[8.199119] Node 0 active_anon:70564kB inactive_anon:474196kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3508kB dirty:0kB writeback:0kB shmem:543904kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:1904kB pagetables:156kB all_unreclaimable? no
[8.201683] Node 0 DMA free:1414940kB boost:0kB min:22528kB low:28160kB high:33792kB reserved_highatomic:0KB active_anon:70564kB inactive_anon:474196kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2097152kB managed:2010444kB mlocked:0kB bounce:0kB free_pcp:12840kB local_pcp:2032kB free_cma:0kB
[8.204158] lowmem_reserve[]: 0 0 0 0
[8.204481] Node 0 DMA: 1*4kB (E) 1*8kB (U) 1*16kB (U) 2*32kB (UM) 1*64kB (U) 1*128kB (U) 2*256kB (ME) 2*512kB (ME) 2*1024kB (M) 3*2048kB (UM) 343*4096kB (M) = 1414940kB
[8.205881] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[8.206644] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[8.207381] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[8.208111] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[8.208826] 135976 total pagecache pages
[8.209195] 0 pages in swap cache
[8.209484] Free swap = 0kB
[8.209733] Total swap = 0kB
[8.209989] 524288 pages RAM
[8.210239] 0 pages HighMem/MovableOnly
[8.210571] 21677 pages reserved
[8.210852] 0 pages hwpoisoned
insmod: can't insert '/lib/modules/6.0.0-rc4+/hnae3.ko': Cannot allocate memory

We git bisected the code and found the patch c5a89f75d2a ("arm64: kaslr: defer initialization to initcall where permitted").

Do you have any idea about the issue?

Best Regards,
Xiang Chen
Re: [PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors
在 2022/11/23 20:08, Marc Zyngier 写道: On Wed, 23 Nov 2022 01:42:36 +, chenxiang wrote: From: Xiang Chen Currently the number of MSI vectors comes from register PCI_MSI_FLAGS which should be power-of-2 in qemu, in some scenaries it is not the same as the number that driver requires in guest, for example, a PCI driver wants to allocate 6 MSI vecotrs in guest, but as the limitation, it will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in guest only wants to allocate 6 MSI vectors. When GICv4.1 is enabled, it iterates over all possible MSIs and enable the forwarding while the guest has only created some of mappings in the virtual ITS, so some calls fail. The exception print is as following: vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311 To avoid the issue, verify each MSI vector, skip some operations such as request_irq() and irq_bypass_register_producer() for those invalid MSI vectors. Signed-off-by: Xiang Chen --- I reported the issue at the link: https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/ Change Log: v1 -> v2: Verify each MSI vector in kernel instead of adding systemcall according to Mar's suggestion --- arch/arm64/kvm/vgic/vgic-irqfd.c | 13 + arch/arm64/kvm/vgic/vgic-its.c| 36 arch/arm64/kvm/vgic/vgic.h| 1 + drivers/vfio/pci/vfio_pci_intrs.c | 33 + include/linux/kvm_host.h | 2 ++ 5 files changed, 85 insertions(+) diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c index 475059b..71f6af57 100644 --- a/arch/arm64/kvm/vgic/vgic-irqfd.c +++ b/arch/arm64/kvm/vgic/vgic-irqfd.c @@ -98,6 +98,19 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, return vgic_its_inject_msi(kvm, ); } +int kvm_verify_msi(struct kvm *kvm, + struct kvm_kernel_irq_routing_entry *irq_entry) +{ + struct kvm_msi msi; + + if (!vgic_has_its(kvm)) + return -ENODEV; + + kvm_populate_msi(irq_entry, ); + + return vgic_its_verify_msi(kvm, ); +} + /** * kvm_arch_set_irq_inatomic: 
fast-path for irqfd injection */ diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c index 94a666d..8312a4a 100644 --- a/arch/arm64/kvm/vgic/vgic-its.c +++ b/arch/arm64/kvm/vgic/vgic-its.c @@ -767,6 +767,42 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi) return 0; } +int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi) +{ + struct vgic_its *its; + struct its_ite *ite; + struct kvm_vcpu *vcpu; + int ret = 0; + + if (!irqchip_in_kernel(kvm) || (msi->flags & ~KVM_MSI_VALID_DEVID)) + return -EINVAL; + + if (!vgic_has_its(kvm)) + return -ENODEV; + + its = vgic_msi_to_its(kvm, msi); + if (IS_ERR(its)) + return PTR_ERR(its); + + mutex_lock(>its_lock); + if (!its->enabled) { + ret = -EBUSY; + goto unlock; + } + ite = find_ite(its, msi->devid, msi->data); + if (!ite || !its_is_collection_mapped(ite->collection)) { + ret = E_ITS_INT_UNMAPPED_INTERRUPT; + goto unlock; + } + + vcpu = kvm_get_vcpu(kvm, ite->collection->target_addr); + if (!vcpu) + ret = E_ITS_INT_UNMAPPED_INTERRUPT; I'm sorry, but what does this mean to the caller? This should never leak outside of the ITS code. Actually it is already leak outside of ITS code, and please see the exception printk (E_ITS_INT_UNMAPPED_INTERRUPT is 0x10307 which is equal to 66311): vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311 +unlock: + mutex_unlock(>its_lock); + return ret; +} + /* * Queries the KVM IO bus framework to get the ITS pointer from the given * doorbell address. 
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index 0c8da72..d452150 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -240,6 +240,7 @@ int kvm_vgic_register_its_device(void); void vgic_enable_lpis(struct kvm_vcpu *vcpu); void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu); int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi); +int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi); int vgic_v3_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr); int vgic_v3_dist_uaccess(struct kvm_vcpu *vcpu, bool is_write, int offset, u32 *val); diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 40c3d7c..3027805 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -19,6 +19,7 @@ #include #include #include +#include #include "vfio_pci_pri
[PATCH v2] vfio/pci: Verify each MSI vector to avoid invalid MSI vectors
From: Xiang Chen

Currently the number of MSI vectors comes from the register PCI_MSI_FLAGS, which should be a power of 2 in qemu; in some scenarios it is not the same as the number that the driver requires in the guest. For example, a PCI driver wants to allocate 6 MSI vectors in the guest, but because of the limitation it will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in the guest only wants to allocate 6 MSI vectors. When GICv4.1 is enabled, it iterates over all possible MSIs and enables the forwarding while the guest has only created some of the mappings in the virtual ITS, so some calls fail. The exception print is as follows:

vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311

To avoid the issue, verify each MSI vector, and skip operations such as request_irq() and irq_bypass_register_producer() for those invalid MSI vectors.

Signed-off-by: Xiang Chen
---
I reported the issue at the link:
https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/

Change Log:
v1 -> v2:
Verify each MSI vector in the kernel instead of adding a system call, according to Marc's suggestion
---
 arch/arm64/kvm/vgic/vgic-irqfd.c  | 13 +
 arch/arm64/kvm/vgic/vgic-its.c    | 36
 arch/arm64/kvm/vgic/vgic.h        |  1 +
 drivers/vfio/pci/vfio_pci_intrs.c | 33 +
 include/linux/kvm_host.h          |  2 ++
 5 files changed, 85 insertions(+)

diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c
index 475059b..71f6af57 100644
--- a/arch/arm64/kvm/vgic/vgic-irqfd.c
+++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
@@ -98,6 +98,19 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 	return vgic_its_inject_msi(kvm, &msi);
 }

+int kvm_verify_msi(struct kvm *kvm,
+		   struct kvm_kernel_irq_routing_entry *irq_entry)
+{
+	struct kvm_msi msi;
+
+	if (!vgic_has_its(kvm))
+		return -ENODEV;
+
+	kvm_populate_msi(irq_entry, &msi);
+
+	return vgic_its_verify_msi(kvm, &msi);
+}
+
 /**
  * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
  */
diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index 94a666d..8312a4a 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -767,6 +767,42 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi)
 	return 0;
 }

+int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi)
+{
+	struct vgic_its *its;
+	struct its_ite *ite;
+	struct kvm_vcpu *vcpu;
+	int ret = 0;
+
+	if (!irqchip_in_kernel(kvm) || (msi->flags & ~KVM_MSI_VALID_DEVID))
+		return -EINVAL;
+
+	if (!vgic_has_its(kvm))
+		return -ENODEV;
+
+	its = vgic_msi_to_its(kvm, msi);
+	if (IS_ERR(its))
+		return PTR_ERR(its);
+
+	mutex_lock(&its->its_lock);
+	if (!its->enabled) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+	ite = find_ite(its, msi->devid, msi->data);
+	if (!ite || !its_is_collection_mapped(ite->collection)) {
+		ret = E_ITS_INT_UNMAPPED_INTERRUPT;
+		goto unlock;
+	}
+
+	vcpu = kvm_get_vcpu(kvm, ite->collection->target_addr);
+	if (!vcpu)
+		ret = E_ITS_INT_UNMAPPED_INTERRUPT;

I'm sorry, but what does this mean to the caller? This should never leak outside of the ITS code.

Actually it already leaks outside of the ITS code; please see the exception printk (E_ITS_INT_UNMAPPED_INTERRUPT is 0x10307, which is equal to 66311):

vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311

+unlock:
+	mutex_unlock(&its->its_lock);
+	return ret;
+}
+
 /*
  * Queries the KVM IO bus framework to get the ITS pointer from the given
  * doorbell address.
diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
index 0c8da72..d452150 100644
--- a/arch/arm64/kvm/vgic/vgic.h
+++ b/arch/arm64/kvm/vgic/vgic.h
@@ -240,6 +240,7 @@ int kvm_vgic_register_its_device(void);
 void vgic_enable_lpis(struct kvm_vcpu *vcpu);
 void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu);
 int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi);
+int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi);
 int vgic_v3_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr);
 int vgic_v3_dist_uaccess(struct kvm_vcpu *vcpu, bool is_write,
 			 int offset, u32 *val);
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 40c3d7c..3027805 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include

 #include "vfio_pci_priv.h"
@@ -315,6 +316,28 @@ static int vfio_msi_enable(struct vfio_pci_core_device *vdev, int nvec, bool msi
 	return 0;
 }

+static int vfio_pci_verify_msi_entry(struct vfio_pci_core_device *vdev,
+				     struct eventfd_ctx *trigger)
+{
+	struct kvm *kvm = vdev->vdev.kvm;
+	struct kvm_kernel_irqfd *tmp;
+	struct kvm_kernel_irq_routing_entry irq_entry;
+	int ret = -ENODEV;
+
+	spin_lock_irq(&kvm->irqfds.lock);
+	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
Re: [PATCH] KVM: Add system call KVM_VERIFY_MSI to verify MSI vector
Hi Marc, 在 2022/11/10 18:28, Marc Zyngier 写道: On Wed, 09 Nov 2022 06:21:18 +, "chenxiang (M)" wrote: Hi Marc, 在 2022/11/8 20:47, Marc Zyngier 写道: On Tue, 08 Nov 2022 08:08:57 +, chenxiang wrote: From: Xiang Chen Currently the numbers of MSI vectors come from register PCI_MSI_FLAGS which should be power-of-2, but in some scenaries it is not the same as the number that driver requires in guest, for example, a PCI driver wants to allocate 6 MSI vecotrs in guest, but as the limitation, it will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in guest only wants to allocate 6 MSI vectors. When GICv4.1 is enabled, we can see some exception print as following for above scenaro: vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311 In order to verify whether a MSI vector is valid, add KVM_VERIFY_MSI to do that. If there is a mapping, return 0, otherwise return negative value. This is the kernel part of adding system call KVM_VERIFY_MSI. Exposing something that is an internal implementation detail to userspace feels like the absolute wrong way to solve this issue. Can you please characterise the issue you're having? Is it that vfio tries to enable an interrupt for which there is no virtual ITS mapping? Shouldn't we instead try and manage this in the kernel? Before i reported the issue to community, you gave a suggestion about the issue, but not sure whether i misundertood your meaning. You can refer to the link for more details about the issue. https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/ Right. It would have been helpful to mention this earlier. Anyway, I would really like this to be done without involving userspace at all. But first, can you please confirm that the VM works as expected despite the message? Yes, it works well except the message. If that's the case, we only need to handle the case where this is a multi-MSI setup, and I think this can be done in VFIO, without involving userspace. 
It seems we can verify every kvm_msi for a multi-MSI setup in the function vfio_pci_set_msi_trigger(). If an MSI vector is invalid, we can then decrease the number of MSI vectors before calling vfio_msi_set_block().

Thanks,
M.
Re: [PATCH] KVM: Add system call KVM_VERIFY_MSI to verify MSI vector
Hi Marc, 在 2022/11/8 20:47, Marc Zyngier 写道: On Tue, 08 Nov 2022 08:08:57 +, chenxiang wrote: From: Xiang Chen Currently the numbers of MSI vectors come from register PCI_MSI_FLAGS which should be power-of-2, but in some scenaries it is not the same as the number that driver requires in guest, for example, a PCI driver wants to allocate 6 MSI vecotrs in guest, but as the limitation, it will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in guest only wants to allocate 6 MSI vectors. When GICv4.1 is enabled, we can see some exception print as following for above scenaro: vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311 In order to verify whether a MSI vector is valid, add KVM_VERIFY_MSI to do that. If there is a mapping, return 0, otherwise return negative value. This is the kernel part of adding system call KVM_VERIFY_MSI. Exposing something that is an internal implementation detail to userspace feels like the absolute wrong way to solve this issue. Can you please characterise the issue you're having? Is it that vfio tries to enable an interrupt for which there is no virtual ITS mapping? Shouldn't we instead try and manage this in the kernel? Before i reported the issue to community, you gave a suggestion about the issue, but not sure whether i misundertood your meaning. You can refer to the link for more details about the issue. https://lkml.kernel.org/lkml/87cze9lcut.wl-...@kernel.org/T/ Best regards, Xiang
[PATCH] vfio/pci: Add system call KVM_VERIFY_MSI to verify every MSI vector
From: Xiang Chen Currently the numbers of MSI vectors come from register PCI_MSI_FLAGS which should be power-of-2, but in some scenaries it is not the same as the number that driver requires in guest, for example, a PCI driver wants to allocate 6 MSI vecotrs in guest, but as the limitation, it will allocate 8 MSI vectors. So it requires 8 MSI vectors in qemu while the driver in guest only wants to allocate 6 MSI vectors. When GICv4.1 is enabled, we can see some exception print as following for above scenaro: vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311 To avoid the issue, add system call KVM_VERIFY_MSI to verify whether every MSI vecotor is valid and adjust the numver of MSI vectors. This is qemu part of adding system call KVM_VERIFY_MSI. Signed-off-by: Xiang Chen --- accel/kvm/kvm-all.c | 19 +++ hw/vfio/pci.c | 13 + include/sysemu/kvm.h | 2 ++ linux-headers/linux/kvm.h | 1 + 4 files changed, 35 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index f99b0be..19c8b84 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -1918,6 +1918,25 @@ int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg) return kvm_set_irq(s, route->kroute.gsi, 1); } +int kvm_irqchip_verify_msi_route(KVMState *s, int vector, PCIDevice *dev) +{ +if (pci_available && dev && kvm_msi_devid_required()) { + MSIMessage msg = {0, 0}; + struct kvm_msi msi; + + msg = pci_get_msi_message(dev, vector); + msi.address_lo = (uint32_t)msg.address; + msi.address_hi = msg.address >> 32; + msi.devid = pci_requester_id(dev); + msi.data = le32_to_cpu(msg.data); + msi.flags = KVM_MSI_VALID_DEVID; + memset(msi.pad, 0, sizeof(msi.pad)); + + return kvm_vm_ioctl(s, KVM_VERIFY_MSI, ); +} +return 0; +} + int kvm_irqchip_add_msi_route(KVMRouteChange *c, int vector, PCIDevice *dev) { struct kvm_irq_routing_entry kroute = {}; diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 939dcc3..8dae0e4 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -660,6 +660,7 @@ 
static void vfio_msix_enable(VFIOPCIDevice *vdev) static void vfio_msi_enable(VFIOPCIDevice *vdev) { int ret, i; +int msi_invalid = 0; vfio_disable_interrupts(vdev); @@ -671,6 +672,18 @@ static void vfio_msi_enable(VFIOPCIDevice *vdev) vfio_prepare_kvm_msi_virq_batch(vdev); vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev); + +/* + * Verify whether every MSI interrupt is valid, as the number of + * MSI vectors comes from PCI device registers, which may not be the + * same as the number of vectors that the driver requires. + */ +for (i = 0; i < vdev->nr_vectors; i++) { + ret = kvm_irqchip_verify_msi_route(kvm_state, i, &vdev->pdev); + if (ret < 0) + msi_invalid++; +} +vdev->nr_vectors -= msi_invalid; retry: vdev->msi_vectors = g_new0(VFIOMSIVector, vdev->nr_vectors); diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index e9a97ed..aca6e5b 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -482,6 +482,8 @@ void kvm_cpu_synchronize_state(CPUState *cpu); void kvm_init_cpu_signals(CPUState *cpu); +int kvm_irqchip_verify_msi_route(KVMState *s, int vector, PCIDevice *dev); + /** * kvm_irqchip_add_msi_route - Add MSI route for specific vector * @c: KVMRouteChange instance. diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h index ebdafa5..ac59350 100644 --- a/linux-headers/linux/kvm.h +++ b/linux-headers/linux/kvm.h @@ -1540,6 +1540,7 @@ struct kvm_s390_ucas_mapping { #define KVM_PPC_SVM_OFF _IO(KVMIO, 0xb3) #define KVM_ARM_MTE_COPY_TAGS _IOR(KVMIO, 0xb4, struct kvm_arm_copy_mte_tags) +#define KVM_VERIFY_MSI _IOW(KVMIO, 0xb5, struct kvm_msi) /* ioctl for vm fd */ #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device) -- 2.8.1
[PATCH] KVM: Add ioctl KVM_VERIFY_MSI to verify MSI vector
From: Xiang Chen Currently the number of MSI vectors comes from register PCI_MSI_FLAGS, which should be a power of 2, but in some scenarios it is not the same as the number that the driver in the guest requires. For example, a PCI driver wants to allocate 6 MSI vectors in the guest, but because of that limitation it will allocate 8 MSI vectors. So 8 MSI vectors are requested in qemu while the driver in the guest only wants to allocate 6 MSI vectors. When GICv4.1 is enabled, we can see the following exception message for the above scenario: vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails:66311 To verify whether an MSI vector is valid, add KVM_VERIFY_MSI. If there is a mapping, return 0; otherwise return a negative value. This is the kernel part of adding ioctl KVM_VERIFY_MSI. Signed-off-by: Xiang Chen --- arch/arm64/kvm/vgic/vgic-irqfd.c | 5 + arch/arm64/kvm/vgic/vgic-its.c | 36 arch/arm64/kvm/vgic/vgic.h | 1 + include/linux/kvm_host.h | 2 +- include/uapi/linux/kvm.h | 2 ++ virt/kvm/kvm_main.c | 9 + 6 files changed, 54 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c index 475059b..2312da6 100644 --- a/arch/arm64/kvm/vgic/vgic-irqfd.c +++ b/arch/arm64/kvm/vgic/vgic-irqfd.c @@ -98,6 +98,11 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e, return vgic_its_inject_msi(kvm, &msi); } +int kvm_verify_msi(struct kvm *kvm, struct kvm_msi *msi) +{ + return vgic_its_verify_msi(kvm, msi); +} + /** * kvm_arch_set_irq_inatomic: fast-path for irqfd injection */ diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c index 24d7778..cae6183 100644 --- a/arch/arm64/kvm/vgic/vgic-its.c +++ b/arch/arm64/kvm/vgic/vgic-its.c @@ -767,6 +767,42 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi) return 0; } +int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi) +{ + struct vgic_its *its; + struct its_ite *ite; + struct kvm_vcpu *vcpu; + int ret = 0; +
+ if (!irqchip_in_kernel(kvm) || (msi->flags & ~KVM_MSI_VALID_DEVID)) + return -EINVAL; + + if (!vgic_has_its(kvm)) + return -ENODEV; + + its = vgic_msi_to_its(kvm, msi); + if (IS_ERR(its)) + return PTR_ERR(its); + + mutex_lock(&its->its_lock); + if (!its->enabled) { + ret = -EBUSY; + goto unlock; + } + ite = find_ite(its, msi->devid, msi->data); + if (!ite || !its_is_collection_mapped(ite->collection)) { + ret = -E_ITS_INT_UNMAPPED_INTERRUPT; + goto unlock; + } + + vcpu = kvm_get_vcpu(kvm, ite->collection->target_addr); + if (!vcpu) + ret = -E_ITS_INT_UNMAPPED_INTERRUPT; +unlock: + mutex_unlock(&its->its_lock); + return ret; +} + /* * Queries the KVM IO bus framework to get the ITS pointer from the given * doorbell address. diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h index 0c8da72..d452150 100644 --- a/arch/arm64/kvm/vgic/vgic.h +++ b/arch/arm64/kvm/vgic/vgic.h @@ -240,6 +240,7 @@ int kvm_vgic_register_its_device(void); void vgic_enable_lpis(struct kvm_vcpu *vcpu); void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu); int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi); +int vgic_its_verify_msi(struct kvm *kvm, struct kvm_msi *msi); int vgic_v3_has_attr_regs(struct kvm_device *dev, struct kvm_device_attr *attr); int vgic_v3_dist_uaccess(struct kvm_vcpu *vcpu, bool is_write, int offset, u32 *val); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 32f259f..7923352 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1597,7 +1597,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm, int kvm_request_irq_source_id(struct kvm *kvm); void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id); bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args); - +int kvm_verify_msi(struct kvm *kvm, struct kvm_msi *msi); /* * Returns a pointer to the memslot if it contains gfn. * Otherwise returns NULL.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 0d5d441..72b28f8 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1543,6 +1543,8 @@ struct kvm_s390_ucas_mapping { #define KVM_PPC_SVM_OFF _IO(KVMIO, 0xb3) #define KVM_ARM_MTE_COPY_TAGS _IOR(KVMIO, 0xb4, struct kvm_arm_copy_mte_tags) +#define KVM_VERIFY_MSI _IOW(KVMIO, 0xb5, struct kvm_msi) + /* ioctl for vm fd */ #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e30f1b4..439bdd7 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@
Re: [QUESTION] Exception print when enabling GICv4
Hi Marc, Thank you for your reply. On 2022/7/12 23:25, Marc Zyngier wrote: Hi Xiang, On Tue, 12 Jul 2022 13:55:16 +0100, "chenxiang (M)" wrote: Hi, I encountered an issue related to GICv4 enablement on an ARM64 platform (kernel 5.19-rc4, qemu 6.2.0): We have an acceleration module whose VF has 3 MSI interrupts, and we pass it through to a virtual machine with the following steps: echo :79:00.1 > /sys/bus/pci/drivers/hisi_hpre/unbind echo vfio-pci > /sys/devices/pci\:78/\:78\:00.0/\:79\:00.1/driver_override echo :79:00.1 > /sys/bus/pci/drivers_probe Then we boot the VM with "-device vfio-pci,host=79:00.1,id=net0 \". When we insmod the driver, which registers 3 PCI MSI interrupts in the VM, the following exception message occurs: vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails: 66311 I find that bit[6:4] of register PCI_MSI_FLAGS is 2 (4 MSI interrupts) though we only register 3 PCI MSI interrupts, and only 3 MSI interrupts are activated in the end. qemu allocates 4 vectors in function vfio_msi_enable() as it reads the register PCI_MSI_FLAGS. Later it calls the VFIO_DEVICE_SET_IRQS ioctl to set forwarding for those interrupts using function kvm_vgic_v4_set_forwarding(), as GICv4 is enabled. For interrupts 0~2, forwarding is set successfully as they are already activated, but the 4th interrupt is not activated, so the ITE is not found in function vgic_its_resolve_lpi(), and the above printk occurs. It seems that we only allocate and activate 3 MSI interrupts in the guest while it tries to set forwarding for 4 MSI interrupts in the host. Do you have any idea about this issue? I have a hunch: QEMU cannot know that the guest is only using 3 MSIs out of the 4 that the device can use, and PCI/Multi-MSI only has a single enable bit for all MSIs. So it probably iterates over all possible MSIs and enables the forwarding. Since the guest has only created 3 mappings in the virtual ITS, the last call fails. I would expect the guest to still work properly though.
Yes, that's the reason for the exception message. Is it possible for QEMU to get the exact number of interrupts the guest is using? It seems not. Thanks, M.
[QUESTION] Exception print when enabling GICv4
Hi, I encountered an issue related to GICv4 enablement on an ARM64 platform (kernel 5.19-rc4, qemu 6.2.0): We have an acceleration module whose VF has 3 MSI interrupts, and we pass it through to a virtual machine with the following steps: echo :79:00.1 > /sys/bus/pci/drivers/hisi_hpre/unbind echo vfio-pci > /sys/devices/pci\:78/\:78\:00.0/\:79\:00.1/driver_override echo :79:00.1 > /sys/bus/pci/drivers_probe Then we boot the VM with "-device vfio-pci,host=79:00.1,id=net0 \". When we insmod the driver, which registers 3 PCI MSI interrupts in the VM, the following exception message occurs: vfio-pci :3a:00.1: irq bypass producer (token 8f08224d) registration fails: 66311 I find that bit[6:4] of register PCI_MSI_FLAGS is 2 (4 MSI interrupts) though we only register 3 PCI MSI interrupts, and only 3 MSI interrupts are activated in the end. qemu allocates 4 vectors in function vfio_msi_enable() as it reads the register PCI_MSI_FLAGS. Later it calls the VFIO_DEVICE_SET_IRQS ioctl to set forwarding for those interrupts using function kvm_vgic_v4_set_forwarding(), as GICv4 is enabled. For interrupts 0~2, forwarding is set successfully as they are already activated, but the 4th interrupt is not activated, so the ITE is not found in function vgic_its_resolve_lpi(), and the above printk occurs. It seems that we only allocate and activate 3 MSI interrupts in the guest while it tries to set forwarding for 4 MSI interrupts in the host. Do you have any idea about this issue? Best regards, Xiang Chen
Re: [Bug] Takes more than 150s to boot qemu on ARM64
On 2022/6/13 21:22, Paul E. McKenney wrote: On Mon, Jun 13, 2022 at 08:26:34PM +0800, chenxiang (M) wrote: Hi all, I encountered an issue with kernel 5.19-rc1 on an ARM64 board: it takes about 150s between starting the qemu command and the beginning of the Linux kernel boot ("EFI stub: Booting Linux Kernel..."). But with kernel 5.18-rc4, it only takes about 5s. I git bisected the kernel code and it found c2445d387850 ("srcu: Add contention check to call_srcu() srcu_data ->lock acquisition"). The qemu (version 6.2.92) command I run is: ./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g \ --trace "kvm*" \ -cpu host \ -machine virt,accel=kvm,gic-version=3 \ -machine smp.cpus=2,smp.sockets=2 \ -no-reboot \ -nographic \ -monitor unix:/home/cx/qmp-test,server,nowait \ -bios /home/cx/boot/QEMU_EFI.fd \ -kernel /home/cx/boot/Image \ -device pcie-root-port,port=0x8,chassis=1,id=net1,bus=pcie.0,multifunction=on,addr=0x1 \ -device vfio-pci,host=7d:01.3,id=net0 \ -device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4 \ -drive file=/home/cx/boot/boot_ubuntu.img,if=none,id=drive0 \ -append "rdinit=init console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw " \ -net none \ -D /home/cx/qemu_log.txt I am not familiar with the RCU code and don't know how it causes the issue. Do you have any idea about this issue? Please see the discussion here: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03e...@linaro.org/ Though that report requires ACPI to be forced on to get the delay, which results in more than 9,000 back-to-back calls to synchronize_srcu_expedited(). I cannot reproduce this on my setup, even with an artificial tight loop invoking synchronize_srcu_expedited(), but then again I don't have ARM hardware. My current guess is that the fix is the following patch, but with larger values for SRCU_MAX_NODELAY_PHASE. Here "larger" might well be up in the hundreds, or perhaps even larger. If you get a chance to experiment with this, could you please reply to the discussion at the above URL?
(Or let me know, and I can CC you on the next message in that thread.) Ok, thanks, I will reply at the above URL. Thanx, Paul diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index 50ba70f019dea..0db7873f4e95b 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp) #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending. #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers. -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances. +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances. #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances. /* @@ -522,16 +522,22 @@ static bool srcu_readers_active(struct srcu_struct *ssp) */ static unsigned long srcu_get_delay(struct srcu_struct *ssp) { + unsigned long gpstart; + unsigned long j; unsigned long jbase = SRCU_INTERVAL; if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp))) jbase = 0; - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start); - if (!jbase) { - WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); - if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) - jbase = 1; + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) { + j = jiffies - 1; + gpstart = READ_ONCE(ssp->srcu_gp_start); + if (time_after(j, gpstart)) + jbase += j - gpstart; + if (!jbase) { + WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1); + if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE) + jbase = 1; + } } return jbase > SRCU_MAX_INTERVAL ? SRCU_MAX_INTERVAL : jbase; }
[Bug] Takes more than 150s to boot qemu on ARM64
Hi all, I encountered an issue with kernel 5.19-rc1 on an ARM64 board: it takes about 150s between starting the qemu command and the beginning of the Linux kernel boot ("EFI stub: Booting Linux Kernel..."). But with kernel 5.18-rc4, it only takes about 5s. I git bisected the kernel code and it found c2445d387850 ("srcu: Add contention check to call_srcu() srcu_data ->lock acquisition"). The qemu (version 6.2.92) command I run is: ./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g \ --trace "kvm*" \ -cpu host \ -machine virt,accel=kvm,gic-version=3 \ -machine smp.cpus=2,smp.sockets=2 \ -no-reboot \ -nographic \ -monitor unix:/home/cx/qmp-test,server,nowait \ -bios /home/cx/boot/QEMU_EFI.fd \ -kernel /home/cx/boot/Image \ -device pcie-root-port,port=0x8,chassis=1,id=net1,bus=pcie.0,multifunction=on,addr=0x1 \ -device vfio-pci,host=7d:01.3,id=net0 \ -device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4 \ -drive file=/home/cx/boot/boot_ubuntu.img,if=none,id=drive0 \ -append "rdinit=init console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw " \ -net none \ -D /home/cx/qemu_log.txt I am not familiar with the RCU code and don't know how it causes the issue. Do you have any idea about this issue? Best regards, Xiang Chen
[PATCH v2] hw/vfio/common: Fix a small boundary issue of a trace
From: Xiang Chen Most places in the vfio trace code (such as trace_vfio_region_region_mmap()) use [offset, offset + size - 1] to indicate that the length of a range is size, except trace_vfio_region_sparse_mmap_entry(). So change trace_vfio_region_sparse_mmap_entry() to match; but if size is zero the trace would be weird with an underflow, so move the trace and emit it only if size is not zero. Signed-off-by: Xiang Chen --- hw/vfio/common.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 080046e3f5..345ea7bd8a 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1544,11 +1544,10 @@ static int vfio_setup_region_sparse_mmaps(VFIORegion *region, region->mmaps = g_new0(VFIOMmap, sparse->nr_areas); for (i = 0, j = 0; i < sparse->nr_areas; i++) { -trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, -sparse->areas[i].offset + -sparse->areas[i].size); - if (sparse->areas[i].size) { +trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, +sparse->areas[i].offset + +sparse->areas[i].size - 1); region->mmaps[j].offset = sparse->areas[i].offset; region->mmaps[j].size = sparse->areas[i].size; j++; -- 2.33.0
[PATCH] softmmu/memory: Skip by the translation size instead of fixed granularity if translate() succeeds
From: Xiang Chen Currently memory_region_iommu_replay() does a full page table walk with fixed granularity (page size) whether or not translate() succeeds. Actually, if translate() succeeds, we can skip ahead by the translation size (iotlb.addr_mask + 1) instead of the fixed granularity. Signed-off-by: Xiang Chen --- softmmu/memory.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index bfa5d5178c..ccfa19cf71 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1924,7 +1924,7 @@ void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n) { MemoryRegion *mr = MEMORY_REGION(iommu_mr); IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr); -hwaddr addr, granularity; +hwaddr addr, granularity, def_granu; IOMMUTLBEntry iotlb; /* If the IOMMU has its own replay callback, override */ @@ -1933,12 +1933,15 @@ void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n) return; } -granularity = memory_region_iommu_get_min_page_size(iommu_mr); +def_granu = memory_region_iommu_get_min_page_size(iommu_mr); for (addr = 0; addr < memory_region_size(mr); addr += granularity) { iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx); if (iotlb.perm != IOMMU_NONE) { n->notify(n, &iotlb); +granularity = iotlb.addr_mask + 1; +} else { +granularity = def_granu; } /* if (2^64 - MR size) < granularity, it's possible to get an -- 2.33.0
[PATCH v2] hw/arm/smmuv3: Pass the actual perm to returned IOMMUTLBEntry in smmuv3_translate()
From: Xiang Chen It always calls the IOMMU MR translate() callback with flag=IOMMU_NONE in memory_region_iommu_replay(). Currently, smmuv3_translate() returns an IOMMUTLBEntry with perm set to IOMMU_NONE even if the translation succeeds, whereas it is expected to return the actual permission set in the table entry. So pass the actual permission from the table entry to the returned IOMMUTLBEntry. Signed-off-by: Xiang Chen Reviewed-by: Eric Auger --- hw/arm/smmuv3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 674623aabe..707eb430c2 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -760,7 +760,7 @@ epilogue: qemu_mutex_unlock(&s->mutex); switch (status) { case SMMU_TRANS_SUCCESS: -entry.perm = flag; +entry.perm = cached_entry->entry.perm; entry.translated_addr = cached_entry->entry.translated_addr + (addr & cached_entry->entry.addr_mask); entry.addr_mask = cached_entry->entry.addr_mask; -- 2.33.0
Re: [PATCH] hw/arm/smmuv3: Pass the real perm to returned IOMMUTLBEntry in smmuv3_translate()
Hi Eric, On 2022/4/15 0:02, Eric Auger wrote: Hi Chenxiang, On 4/7/22 9:57 AM, chenxiang via wrote: From: Xiang Chen In function memory_region_iommu_replay(), it decides whether to notify() or not according to the perm of the returned IOMMUTLBEntry. But for smmuv3, the returned perm is always IOMMU_NONE even if the translation succeeds. I think you should state precisely in the commit message that memory_region_iommu_replay() always calls the IOMMU MR translate() callback with flag=IOMMU_NONE and thus, currently, translate() returns an IOMMUTLBEntry with perm set to IOMMU_NONE if the translation succeeds, whereas it is expected to return the actual permission set in the table entry. Thank you for your comments. I will change the commit message in the next version. Pass the real perm to the returned IOMMUTLBEntry to avoid the issue. Signed-off-by: Xiang Chen --- hw/arm/smmuv3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 674623aabe..707eb430c2 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -760,7 +760,7 @@ epilogue: qemu_mutex_unlock(&s->mutex); switch (status) { case SMMU_TRANS_SUCCESS: -entry.perm = flag; +entry.perm = cached_entry->entry.perm; With that clarification Reviewed-by: Eric Auger Ok, thanks. The translate() doc in ./include/exec/memory.h states " If IOMMU_NONE is passed then the IOMMU must do the * full page table walk and report the permissions in the returned * IOMMUTLBEntry. (Note that this implies that an IOMMU may not * return different mappings for reads and writes.) " Thanks Eric entry.translated_addr = cached_entry->entry.translated_addr + (addr & cached_entry->entry.addr_mask); entry.addr_mask = cached_entry->entry.addr_mask;
[PATCH] hw/arm/smmuv3: Pass the real perm to returned IOMMUTLBEntry in smmuv3_translate()
From: Xiang Chen In function memory_region_iommu_replay(), it decides whether to notify() or not according to the perm of the returned IOMMUTLBEntry. But for smmuv3, the returned perm is always IOMMU_NONE even if the translation succeeds. Pass the real perm to the returned IOMMUTLBEntry to avoid the issue. Signed-off-by: Xiang Chen --- hw/arm/smmuv3.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index 674623aabe..707eb430c2 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -760,7 +760,7 @@ epilogue: qemu_mutex_unlock(&s->mutex); switch (status) { case SMMU_TRANS_SUCCESS: -entry.perm = flag; +entry.perm = cached_entry->entry.perm; entry.translated_addr = cached_entry->entry.translated_addr + (addr & cached_entry->entry.addr_mask); entry.addr_mask = cached_entry->entry.addr_mask; -- 2.33.0
Re: [PATCH] hw/vfio/common: Fix a small boundary issue of a trace
Hi Damien, On 2022/4/6 23:22, Damien Hedde wrote: On 4/6/22 10:14, chenxiang via wrote: From: Xiang Chen Right now the trace of vfio_region_sparse_mmap_entry is as follows: vfio_region_sparse_mmap_entry sparse entry 0 [0x1000 - 0x9000] Actually the range it wants to show is [0x1000 - 0x8fff], so fix it. Signed-off-by: Xiang Chen --- hw/vfio/common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 080046e3f5..0b3808caf8 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1546,7 +1546,7 @@ static int vfio_setup_region_sparse_mmaps(VFIORegion *region, for (i = 0, j = 0; i < sparse->nr_areas; i++) { trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, sparse->areas[i].offset + - sparse->areas[i].size + sparse->areas[i].size - 1); if (sparse->areas[i].size) { region->mmaps[j].offset = sparse->areas[i].offset; If the size is zero, the trace will be weird with an underflow if offset is zero as well. Yes, that's an issue. Maybe just change the trace by inverting the right bracket? e.g.: [0x1000 - 0x9000[ Or don't trace in that case? (But I am not a maintainer of this, so maybe that does not make sense.) But it uses [offset, offset + size - 1] in other places such as trace_vfio_region_region_mmap()/trace_vfio_subregion_unmap()/trace_vfio_region_mmap_fault() in the vfio code. Maybe it is better to move this trace inside the "if (sparse->areas[i].size)" block, which ensures size != 0. -- Damien
[PATCH] hw/vfio/common: Fix a small boundary issue of a trace
From: Xiang Chen Right now the trace of vfio_region_sparse_mmap_entry is as follows: vfio_region_sparse_mmap_entry sparse entry 0 [0x1000 - 0x9000] Actually the range it wants to show is [0x1000 - 0x8fff], so fix it. Signed-off-by: Xiang Chen --- hw/vfio/common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 080046e3f5..0b3808caf8 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -1546,7 +1546,7 @@ static int vfio_setup_region_sparse_mmaps(VFIORegion *region, for (i = 0, j = 0; i < sparse->nr_areas; i++) { trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset, sparse->areas[i].offset + -sparse->areas[i].size); +sparse->areas[i].size - 1); if (sparse->areas[i].size) { region->mmaps[j].offset = sparse->areas[i].offset; -- 2.33.0
Re: [PATCH] hw/arm/virt: Enable HMAT on arm virt machine
On 2022/1/25 20:46, Andrew Jones wrote: On Tue, Jan 25, 2022 at 07:46:43PM +0800, chenxiang (M) wrote: Hi Andrew, On 2022/1/25 18:26, Andrew Jones wrote: On Tue, Jan 25, 2022 at 05:15:34PM +0800, chenxiang via wrote: From: Xiang Chen Since the patchset ("Build ACPI Heterogeneous Memory Attribute Table (HMAT)"), HMAT is supported, but only x86 is enabled. Enable HMAT on the arm virt machine. Hi Xiang, What QEMU command lines have you tested with which Linux guest kernels? I tested it with the following command with guest kernel 5.16-rc1; the boot log of the guest kernel is attached: Thanks. Please consider adding HMAT tests, see tests/qtest/numa-test.c and tests/qtest/bios-tables-test.c, for the virt machine type to this series. Otherwise, Reviewed-by: Andrew Jones Thanks, I will add those HMAT tests in v2.
Re: [PATCH] hw/arm/virt: Enable HMAT on arm virt machine
Hi Andrew, On 2022/1/25 18:26, Andrew Jones wrote: On Tue, Jan 25, 2022 at 05:15:34PM +0800, chenxiang via wrote: From: Xiang Chen Since the patchset ("Build ACPI Heterogeneous Memory Attribute Table (HMAT)"), HMAT is supported, but only x86 is enabled. Enable HMAT on the arm virt machine. Hi Xiang, What QEMU command lines have you tested with which Linux guest kernels? I tested it with the following command with guest kernel 5.16-rc1; the boot log of the guest kernel is attached: ./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g \ -object memory-backend-ram,size=2G,id=m0 \ -object memory-backend-ram,size=2G,id=m1 \ -numa node,cpus=0-3,nodeid=0,memdev=m0 \ -numa node,nodeid=1,memdev=m1,initiator=0 \ -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \ -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \ -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \ -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \ -numa hmat-cache,node-id=0,size=16K,level=1,associativity=direct,policy=write-back,line=8 \ -numa hmat-cache,node-id=1,size=16K,level=1,associativity=direct,policy=write-back,line=8 \ -smp 4 \ -no-reboot \ -nographic \ -cpu host \ -machine virt,accel=kvm,gic-version=3,hmat=on \ -bios /home/cx/QEMU_EFI.fd \ -monitor unix:/home/cx/opt/qmp-test,server,nowait \ -kernel /home/cx/Image \ -device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4 \ -drive file=/home/cx/opt/boot.img,if=none,id=drive0 \ -append "rdinit=init console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw " Thanks, drew Signed-off-by: Xiang Chen --- hw/arm/Kconfig | 1 + hw/arm/virt-acpi-build.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index 2e0049196d..a3c6099829 100644 --- a/hw/arm/Kconfig +++ b/hw/arm/Kconfig @@ -29,6 +29,7 @@ config ARM_VIRT select ACPI_APEI select ACPI_VIOT select VIRTIO_MEM_SUPPORTED +select 
ACPI_HMAT config CHEETAH bool diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index 449fab0080..f19b55e486 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -42,6 +42,7 @@ #include "hw/acpi/memory_hotplug.h" #include "hw/acpi/generic_event_device.h" #include "hw/acpi/tpm.h" +#include "hw/acpi/hmat.h" #include "hw/pci/pcie_host.h" #include "hw/pci/pci.h" #include "hw/pci/pci_bus.h" @@ -990,6 +991,12 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables) build_slit(tables_blob, tables->linker, ms, vms->oem_id, vms->oem_table_id); } + +if (ms->numa_state->hmat_enabled) { +acpi_add_table(table_offsets, tables_blob); +build_hmat(tables_blob, tables->linker, ms->numa_state, + vms->oem_id, vms->oem_table_id); +} } if (ms->nvdimms_state->is_enabled) { -- 2.33.0 . [root@centos build]# ./qemu-system-aarch64 -m 4G,slots=4,maxmem=8g -object memory-backend-ram,size=2G,id=m0 -object memory-backend-ram,size=2G,id=m1 -numa node,cpus=0-3,nodeid=0,memdev=m0 -numa node,nodeid=1,memdev=m1,initiator=0 -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M -numa hmat-cache,node-id=0,size=16K,level=1,associativity=direct,policy=write-back,line=8 -numa hmat-cache,node-id=1,size=16K,level=1,associativity=direct,policy=write-back,line=8 -smp 4 -no-reboot -nographic -cpu host -machine virt,accel=kvm,gic-version=3,hmat=on -bios /home/cx/QEMU_EFI.fd -monitor unix:/home/cx/opt/qmp-test,server,nowait -kernel /home/cx/Image -device virtio-blk-pci,drive=drive0,id=virtblk0,num-queues=4 -drive file=/home/cx/opt/boot.img,if=none,id=drive0 -append "rdinit=init console=ttyAMA0 root=/dev/vda rootfstype=ext4 rw " WARNING: Image format was not 
specified for '/home/cx/opt/boot.img' and probing guessed raw. Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted. Specify the 'raw' format explicitly to remove the restrictions. EFI stub: Booting Linux Kernel... EFI stub: EFI_RNG_PROTOCOL unavailable EFI stub: Generating empty DTB EFI stub: Exiting boot services... [0.00] Booting Linux on physical CPU 0x00 [0x481fd010] [0.00] Linux version 5.16.0-rc1-15060-g07d132dd883a (chenxiang@plinth) (aarch64-linux-gnu-gcc (Linaro GCC 7.3-2018.05-
[PATCH] hw/arm/virt: Enable HMAT on arm virt machine
From: Xiang Chen Since the patchset ("Build ACPI Heterogeneous Memory Attribute Table (HMAT)"), HMAT is supported, but only x86 is enabled. Enable HMAT on arm virt machine. Signed-off-by: Xiang Chen --- hw/arm/Kconfig | 1 + hw/arm/virt-acpi-build.c | 7 +++ 2 files changed, 8 insertions(+) diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index 2e0049196d..a3c6099829 100644 --- a/hw/arm/Kconfig +++ b/hw/arm/Kconfig @@ -29,6 +29,7 @@ config ARM_VIRT select ACPI_APEI select ACPI_VIOT select VIRTIO_MEM_SUPPORTED +select ACPI_HMAT config CHEETAH bool diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index 449fab0080..f19b55e486 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -42,6 +42,7 @@ #include "hw/acpi/memory_hotplug.h" #include "hw/acpi/generic_event_device.h" #include "hw/acpi/tpm.h" +#include "hw/acpi/hmat.h" #include "hw/pci/pcie_host.h" #include "hw/pci/pci.h" #include "hw/pci/pci_bus.h" @@ -990,6 +991,12 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables) build_slit(tables_blob, tables->linker, ms, vms->oem_id, vms->oem_table_id); } + +if (ms->numa_state->hmat_enabled) { +acpi_add_table(table_offsets, tables_blob); +build_hmat(tables_blob, tables->linker, ms->numa_state, + vms->oem_id, vms->oem_table_id); +} } if (ms->nvdimms_state->is_enabled) { -- 2.33.0
Re: [RFC v2 1/2] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
On 2022/1/6 19:00, Eric Auger wrote: Hi Chenxiang, On 12/29/21 8:13 AM, chenxiang (M) via wrote: Hi Eric, On 2021/10/5 16:53, Eric Auger wrote: Add a 'preserve_config' field in struct GPEXConfig and if set generate the DSM #5 for preserving PCI boot configurations. The DSM presence is needed to expose RMRs. At the moment the DSM generation is not yet enabled. Signed-off-by: Eric Auger --- include/hw/pci-host/gpex.h | 1 + hw/pci-host/gpex-acpi.c| 12 2 files changed, 13 insertions(+) diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h index fcf8b63820..3f8f8ec38d 100644 --- a/include/hw/pci-host/gpex.h +++ b/include/hw/pci-host/gpex.h @@ -64,6 +64,7 @@ struct GPEXConfig { MemMapEntry pio; int irq; PCIBus *bus; +bool preserve_config; }; int gpex_set_irq_num(GPEXHost *s, int index, int gsi); diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c index e7e162a00a..7dab259379 100644 --- a/hw/pci-host/gpex-acpi.c +++ b/hw/pci-host/gpex-acpi.c @@ -164,6 +164,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg) aml_append(dev, aml_name_decl("_PXM", aml_int(numa_node))); } +if (cfg->preserve_config) { +method = aml_method("_DSM", 5, AML_SERIALIZED); I noticed there is an ACPI BIOS Error when booting the virtual machine, which seems to be caused by this patch; I added this patchset to my branch to test the vsmmu function. It seems that method _DSM takes only 4 parameters, but 5 are used here. The error log is as follows: Thank you for the heads up. Yes, the problem was reported by Igor too in https://www.mail-archive.com/qemu-devel@nongnu.org/msg842972.html. At the moment the RMRR ACPI situation has not progressed on the spec side or in the kernel, if I have not missed anything, but I will surely take this into account in my next respin. Ok, thanks. Thanks!
Eric [2.355459] ACPI BIOS Error (bug): Failure creating named object [\_SB.PCI0._DSM], AE_ALREADY_EXISTS (20210930/dswload2-327) [2.355467] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20210930/psobject-221) [2.355470] ACPI: Skipping parse of AML opcode: OpcodeName unavailable (0x0014) [2.355657] ACPI: 1 ACPI AML tables successfully acquired and loaded [2.356321] ACPI: Interpreter enabled [2.356323] ACPI: Using GIC for interrupt routing [2.356333] ACPI: MCFG table detected, 1 entries [2.361359] ARMH0011:00: ttyAMA0 at MMIO 0x900 (irq = 16, base_baud = 0) is a SBSA [2.619805] printk: console [ttyAMA0] enabled [2.622114] ACPI: PCI Root Bridge [PCI0] (domain [bus 00-ff]) [2.622788] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3] [2.623776] acpi PNP0A08:00: _OSC: platform does not support [LTR] [2.624600] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability] [2.625721] acpi PNP0A08:00: ECAM area [mem 0x401000-0x401fff] reserved by PNP0C02:00 [2.626645] acpi PNP0A08:00: ECAM at [mem 0x401000-0x401fff] for [bus 00-ff] [2.627450] ACPI: Remapped I/O 0x3eff to [io 0x-0x window] [2.628229] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20210930/nsarguments-166) [2.629576] PCI host bridge to bus :00 [2.630008] pci_bus :00: root bus resource [mem 0x1000-0x3efe window] [2.630747] pci_bus :00: root bus resource [io 0x-0x window] [2.631405] pci_bus :00: root bus resource [mem 0x80-0xff window] [2.632177] pci_bus :00: root bus resource [bus 00-ff] [2.632731] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20210930/nsarguments-166) +aml_append(method, aml_return(aml_int(0))); +aml_append(dev, method); +} + acpi_dsdt_add_pci_route_table(dev, cfg->irq); /* @@ -191,6 +197,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg) aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0 Device"))); aml_append(dev, 
aml_name_decl("_CCA", aml_int(1))); +if (cfg->preserve_config) { +method = aml_method("_DSM", 5, AML_SERIALIZED); +aml_append(method, aml_return(aml_int(0))); +aml_append(dev, method); +} + acpi_dsdt_add_pci_route_table(dev, cfg->irq); method = aml_method("_CBA", 0, AML_NOTSERIALIZED); .
Re: [RFC v2 1/2] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
Hi Eric,

On 2021/10/5 16:53, Eric Auger wrote:

Add a 'preserve_config' field in struct GPEXConfig and if set generate the DSM #5 for preserving PCI boot configurations. The DSM presence is needed to expose RMRs. At the moment the DSM generation is not yet enabled.

Signed-off-by: Eric Auger
---
 include/hw/pci-host/gpex.h |  1 +
 hw/pci-host/gpex-acpi.c    | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index fcf8b63820..3f8f8ec38d 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -64,6 +64,7 @@ struct GPEXConfig {
     MemMapEntry pio;
     int         irq;
     PCIBus      *bus;
+    bool        preserve_config;
 };
 
 int gpex_set_irq_num(GPEXHost *s, int index, int gsi);
diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index e7e162a00a..7dab259379 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -164,6 +164,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
         aml_append(dev, aml_name_decl("_PXM", aml_int(numa_node)));
     }
 
+    if (cfg->preserve_config) {
+        method = aml_method("_DSM", 5, AML_SERIALIZED);

I notice there is an ACPI BIOS Error when booting the virtual machine which seems to be caused by this patch, as I added this patchset to my branch to test the vSMMU function. It seems that the _DSM method requires only 4 parameters, but 5 are used here. The error log is as follows:

[2.355459] ACPI BIOS Error (bug): Failure creating named object [\_SB.PCI0._DSM], AE_ALREADY_EXISTS (20210930/dswload2-327)
[2.355467] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20210930/psobject-221)
[2.355470] ACPI: Skipping parse of AML opcode: OpcodeName unavailable (0x0014)
[2.355657] ACPI: 1 ACPI AML tables successfully acquired and loaded
[2.356321] ACPI: Interpreter enabled
[2.356323] ACPI: Using GIC for interrupt routing
[2.356333] ACPI: MCFG table detected, 1 entries
[2.361359] ARMH0011:00: ttyAMA0 at MMIO 0x900 (irq = 16, base_baud = 0) is a SBSA
[2.619805] printk: console [ttyAMA0] enabled
[2.622114] ACPI: PCI Root Bridge [PCI0] (domain [bus 00-ff])
[2.622788] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[2.623776] acpi PNP0A08:00: _OSC: platform does not support [LTR]
[2.624600] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[2.625721] acpi PNP0A08:00: ECAM area [mem 0x401000-0x401fff] reserved by PNP0C02:00
[2.626645] acpi PNP0A08:00: ECAM at [mem 0x401000-0x401fff] for [bus 00-ff]
[2.627450] ACPI: Remapped I/O 0x3eff to [io 0x-0x window]
[2.628229] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20210930/nsarguments-166)
[2.629576] PCI host bridge to bus :00
[2.630008] pci_bus :00: root bus resource [mem 0x1000-0x3efe window]
[2.630747] pci_bus :00: root bus resource [io 0x-0x window]
[2.631405] pci_bus :00: root bus resource [mem 0x80-0xff window]
[2.632177] pci_bus :00: root bus resource [bus 00-ff]
[2.632731] ACPI BIOS Error (bug): \_SB.PCI0._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20210930/nsarguments-166)

+        aml_append(method, aml_return(aml_int(0)));
+        aml_append(dev, method);
+    }
+
     acpi_dsdt_add_pci_route_table(dev, cfg->irq);
 
     /*
@@ -191,6 +197,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
         aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0 Device")));
         aml_append(dev, aml_name_decl("_CCA", aml_int(1)));
 
+    if (cfg->preserve_config) {
+        method = aml_method("_DSM", 5, AML_SERIALIZED);
+        aml_append(method, aml_return(aml_int(0)));
+        aml_append(dev, method);
+    }
+
     acpi_dsdt_add_pci_route_table(dev, cfg->irq);
 
     method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
Re: [RESEND RFC] hw/arm/smmuv3: add device properties to disable cached iotlb
Hi Eric,

On 2021/8/5 16:10, Eric Auger wrote:

Hi Chenxiang,

On 8/5/21 9:48 AM, chenxiang (M) wrote:

Hi Eric,

On 2021/8/5 0:26, Eric Auger wrote:

Hi Chenxiang,

On 8/4/21 10:49 AM, chenxiang wrote:

From: Xiang Chen

The patch 6d9cd115b ("hw/arm/smmuv3: Enforce invalidation on a power of two range") splits invalidations into power-of-2 range invalidations. So for scenarios where the size of the invalidation is not a power-of-2 range, it costs more time to invalidate.

this power-of-2 split is not only necessary for internal TLB management but also for IOMMU MR notifier calls (which use a mask), i.e. IOTLB unmap notifications used for both vhost and vfio integrations. So you can disable the internal IOTLB but we can't simply remove the pow of 2 split. See below.

Right, in the current QEMU code it is not right to simply remove the pow of 2 split. But I find that in my local repo there is a private patch which seems to solve the issue, so it works in my test.

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 4a7a183..83d24e1 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -825,7 +825,8 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
     event.type = IOMMU_NOTIFIER_UNMAP;
     event.entry.target_as = &address_space_memory;
     event.entry.iova = iova;
-    event.entry.addr_mask = num_pages * (1 << granule) - 1;
+    event.entry.addr_mask = (1 << granule) - 1;
+    event.entry.num_pages = num_pages;

OK I see. But you change the existing semantics of addr_mask, which originally matches the mask of the full address range of the IOTLB operation, and you replace it with the granule mask and add another num_pages field. This is a change in the memory.h API and should be discussed with the other memory.h and vIOMMU maintainers if you want to go that way. This typically breaks vhost integration, which does not use num_pages and would typically fail invalidating the full range.

So we have 2 different things: the disablement of the internal IOTLB (x- prop), which can be done easily, but what you mostly want is to remove the pow of 2 splits to reduce the interactions with the physical IOMMU in the VFIO/SMMU use case, right?

Yes, I mainly want to remove the pow of 2 splits to reduce the number of invalidations, which I think affects performance.

pow of 2 splits are also needed for vhost integration at the moment. Note this use case is not upstreamed and far from being upstreamed given the /dev/iommu redesign, so it will be difficult to justify that kind of change at this moment.

I am not familiar with vhost, and maybe need more investigation on it. Do you have any suggestion about how to improve the issue?

Thanks

Eric

     event.entry.perm = IOMMU_NONE;
     event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
     event.entry.arch_id = asid;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a863b7d..7b026f0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -639,7 +639,7 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     hwaddr start = iotlb->iova + giommu->iommu_offset;
     struct iommu_inv_addr_info *addr_info;
-    size_t size = iotlb->addr_mask + 1;
+    size_t size = iotlb->num_pages * (iotlb->addr_mask + 1);
     int archid = -1;
 
     addr_info = _info;
@@ -653,8 +653,8 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
     addr_info->archid = archid;
     addr_info->addr = start;
-    addr_info->granule_size = size;
-    addr_info->nb_granules = 1;
+    addr_info->granule_size = iotlb->addr_mask + 1;
+    addr_info->nb_granules = iotlb->num_pages;
     trace_vfio_iommu_addr_inv_iotlb(archid, start, size, 1, iotlb->leaf);
     break;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0c4389c..268a395 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -110,6 +110,7 @@ struct IOMMUTLBEntry {
     hwaddr           iova;
     hwaddr           translated_addr;
     hwaddr           addr_mask;
+    uint64_t         num_pages;
     IOMMUAccessFlags perm;
     IOMMUInvGranularity granularity;
 #define IOMMU_INV_FLAGS_PASID (1 << 0)

The internal TLB could be disabled through a property, but I would rather set it as an "x-" experimental property for debug purposes. Until recently this was indeed helpful to debug bugs related to internal IOTLB management (RIL support) ;-) I hope this period is over though ;-)

Ok, maybe we set it as an "x-" experimental property currently.

Currently smmuv3_translate is rarely used (I only see it used when binding MSI), so I think maybe we can disable the cached IOTLB to improve the efficiency of invalidation. So add a device property disable_cached_iotlb to disable the cached IOTLB, and then we can send non-power-of-2 range invalidations directly. Use tool dma_map_
Re: [RESEND RFC] hw/arm/smmuv3: add device properties to disable cached iotlb
Hi Eric,

On 2021/8/5 0:26, Eric Auger wrote:

Hi Chenxiang,

On 8/4/21 10:49 AM, chenxiang wrote:

From: Xiang Chen

The patch 6d9cd115b ("hw/arm/smmuv3: Enforce invalidation on a power of two range") splits invalidations into power-of-2 range invalidations. So for scenarios where the size of the invalidation is not a power-of-2 range, it costs more time to invalidate.

this power-of-2 split is not only necessary for internal TLB management but also for IOMMU MR notifier calls (which use a mask), i.e. IOTLB unmap notifications used for both vhost and vfio integrations. So you can disable the internal IOTLB but we can't simply remove the pow of 2 split. See below.

Right, in the current QEMU code it is not right to simply remove the pow of 2 split. But I find that in my local repo there is a private patch which seems to solve the issue, so it works in my test.

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 4a7a183..83d24e1 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -825,7 +825,8 @@ static void smmuv3_notify_iova(IOMMUMemoryRegion *mr,
     event.type = IOMMU_NOTIFIER_UNMAP;
     event.entry.target_as = &address_space_memory;
     event.entry.iova = iova;
-    event.entry.addr_mask = num_pages * (1 << granule) - 1;
+    event.entry.addr_mask = (1 << granule) - 1;
+    event.entry.num_pages = num_pages;
     event.entry.perm = IOMMU_NONE;
     event.entry.flags = IOMMU_INV_FLAGS_ARCHID;
     event.entry.arch_id = asid;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a863b7d..7b026f0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -639,7 +639,7 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
 {
     hwaddr start = iotlb->iova + giommu->iommu_offset;
     struct iommu_inv_addr_info *addr_info;
-    size_t size = iotlb->addr_mask + 1;
+    size_t size = iotlb->num_pages * (iotlb->addr_mask + 1);
     int archid = -1;
 
     addr_info = _info;
@@ -653,8 +653,8 @@ static void vfio_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
     addr_info->archid = archid;
     addr_info->addr = start;
-    addr_info->granule_size = size;
-    addr_info->nb_granules = 1;
+    addr_info->granule_size = iotlb->addr_mask + 1;
+    addr_info->nb_granules = iotlb->num_pages;
     trace_vfio_iommu_addr_inv_iotlb(archid, start, size, 1, iotlb->leaf);
     break;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0c4389c..268a395 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -110,6 +110,7 @@ struct IOMMUTLBEntry {
     hwaddr           iova;
     hwaddr           translated_addr;
     hwaddr           addr_mask;
+    uint64_t         num_pages;
     IOMMUAccessFlags perm;
     IOMMUInvGranularity granularity;
 #define IOMMU_INV_FLAGS_PASID (1 << 0)

The internal TLB could be disabled through a property, but I would rather set it as an "x-" experimental property for debug purposes. Until recently this was indeed helpful to debug bugs related to internal IOTLB management (RIL support) ;-) I hope this period is over though ;-)

Ok, maybe we set it as an "x-" experimental property currently.

Currently smmuv3_translate is rarely used (I only see it used when binding MSI), so I think maybe we can disable the cached IOTLB to improve the efficiency of invalidation. So add a device property disable_cached_iotlb to disable the cached IOTLB, and then we can send non-power-of-2 range invalidations directly.

Use the tool dma_map_benchmark to test the latency of unmap; it improves unmap latency a lot when the size of the invalidation is not a power-of-2 range (such as g = 7/15/31/511):

t = 1 (thread = 1)
                    before opt (us)   after opt (us)
g=1   (4K size)     0.2/7.6           0.2/7.5
g=4   (8K size)     0.4/7.9           0.4/7.9
g=7   (28K size)    0.6/10.2          0.6/8.2
g=8   (32K size)    0.6/8.3           0.6/8.3
g=15  (60K size)    1.1/12.1          1.1/9.1
g=16  (64K size)    1.1/9.2           1.1/9.1
g=31  (124K size)   2.0/14.8          2.0/10.7
g=32  (128K size)   2.1/14.8          2.1/10.7
g=511 (2044K size)  30.9/65.1         31.1/55.9
g=512 (2048K size)  0.3/32.1          0.3/32.1

t = 10 (thread = 10)
                    before opt (us)   after opt (us)
g=1   (4K size)     0.2/39.9          0.2/39.1
g=4   (8K size)     0.5/42.6          0.5/42.4
g=7   (28K size)    0.6/66.4          0.6/45.3
g=8   (32K size)    0.7/45.8          0.7/46.1
g=15  (60K size)    1.1/80.5          1.1/49.6
g=16  (64K size)    1.1/49.8          1.1/50.2
g=31  (124K size)   2.0/98.3          2.1/58.0
g=32  (128K size)   2.1/57.7          2.1/58.2
g=511 (2044K size)  35.2/322.2        35.3/236.7
g=512 (2048K size)  0.8/238.2         0.9/240.3

Note: I test it based on vSMMU enabled with the patchset ("vSMMUv3/pSMMUv3 2 stage VFIO integration").
[RESEND RFC] hw/arm/smmuv3: add device properties to disable cached iotlb
From: Xiang Chen

The patch 6d9cd115b ("hw/arm/smmuv3: Enforce invalidation on a power of two range") splits invalidations into power-of-2 range invalidations. So for scenarios where the size of the invalidation is not a power-of-2 range, it costs more time to invalidate.

Currently smmuv3_translate is rarely used (I only see it used when binding MSI), so I think maybe we can disable the cached IOTLB to improve the efficiency of invalidation. So add a device property disable_cached_iotlb to disable the cached IOTLB, and then we can send non-power-of-2 range invalidations directly.

Use the tool dma_map_benchmark to test the latency of unmap; it improves unmap latency a lot when the size of the invalidation is not a power-of-2 range (such as g = 7/15/31/511):

t = 1 (thread = 1)
                    before opt (us)   after opt (us)
g=1   (4K size)     0.2/7.6           0.2/7.5
g=4   (8K size)     0.4/7.9           0.4/7.9
g=7   (28K size)    0.6/10.2          0.6/8.2
g=8   (32K size)    0.6/8.3           0.6/8.3
g=15  (60K size)    1.1/12.1          1.1/9.1
g=16  (64K size)    1.1/9.2           1.1/9.1
g=31  (124K size)   2.0/14.8          2.0/10.7
g=32  (128K size)   2.1/14.8          2.1/10.7
g=511 (2044K size)  30.9/65.1         31.1/55.9
g=512 (2048K size)  0.3/32.1          0.3/32.1

t = 10 (thread = 10)
                    before opt (us)   after opt (us)
g=1   (4K size)     0.2/39.9          0.2/39.1
g=4   (8K size)     0.5/42.6          0.5/42.4
g=7   (28K size)    0.6/66.4          0.6/45.3
g=8   (32K size)    0.7/45.8          0.7/46.1
g=15  (60K size)    1.1/80.5          1.1/49.6
g=16  (64K size)    1.1/49.8          1.1/50.2
g=31  (124K size)   2.0/98.3          2.1/58.0
g=32  (128K size)   2.1/57.7          2.1/58.2
g=511 (2044K size)  35.2/322.2        35.3/236.7
g=512 (2048K size)  0.8/238.2         0.9/240.3

Note: I test it based on vSMMU enabled with the patchset ("vSMMUv3/pSMMUv3 2 stage VFIO integration").
Signed-off-by: Xiang Chen
---
 hw/arm/smmuv3.c         | 77 ++++++++++++++++++++++++++-----------------
 include/hw/arm/smmuv3.h |  1 +
 2 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 01b60be..7ae668f 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -19,6 +19,7 @@
 #include "qemu/osdep.h"
 #include "qemu/bitops.h"
 #include "hw/irq.h"
+#include "hw/qdev-properties.h"
 #include "hw/sysbus.h"
 #include "migration/vmstate.h"
 #include "hw/qdev-core.h"
@@ -682,19 +683,21 @@ static IOMMUTLBEntry smmuv3_translate(IOMMUMemoryRegion *mr, hwaddr addr,
     page_mask = (1ULL << (tt->granule_sz)) - 1;
     aligned_addr = addr & ~page_mask;
 
-    cached_entry = smmu_iotlb_lookup(bs, cfg, tt, aligned_addr);
-    if (cached_entry) {
-        if ((flag & IOMMU_WO) && !(cached_entry->entry.perm & IOMMU_WO)) {
-            status = SMMU_TRANS_ERROR;
-            if (event.record_trans_faults) {
-                event.type = SMMU_EVT_F_PERMISSION;
-                event.u.f_permission.addr = addr;
-                event.u.f_permission.rnw = flag & 0x1;
+    if (!s->disable_cached_iotlb) {
+        cached_entry = smmu_iotlb_lookup(bs, cfg, tt, aligned_addr);
+        if (cached_entry) {
+            if ((flag & IOMMU_WO) && !(cached_entry->entry.perm & IOMMU_WO)) {
+                status = SMMU_TRANS_ERROR;
+                if (event.record_trans_faults) {
+                    event.type = SMMU_EVT_F_PERMISSION;
+                    event.u.f_permission.addr = addr;
+                    event.u.f_permission.rnw = flag & 0x1;
+                }
+            } else {
+                status = SMMU_TRANS_SUCCESS;
             }
-        } else {
-            status = SMMU_TRANS_SUCCESS;
+            goto epilogue;
         }
-        goto epilogue;
     }
 
     cached_entry = g_new0(SMMUTLBEntry, 1);
@@ -742,7 +745,9 @@ static IOMMUTLBEntry smmuv3_translate(IOMMUMemoryRegion *mr, hwaddr addr,
         }
         status = SMMU_TRANS_ERROR;
     } else {
-        smmu_iotlb_insert(bs, cfg, cached_entry);
+        if (!s->disable_cached_iotlb) {
+            smmu_iotlb_insert(bs, cfg, cached_entry);
+        }
         status = SMMU_TRANS_SUCCESS;
     }
 
@@ -855,8 +860,9 @@ static void smmuv3_inv_notifiers_iova(SMMUState *s, int asid, dma_addr_t iova,
     }
 }
 
-static void smmuv3_s1_range_inval(SMMUState *s, Cmd *cmd)
+static void smmuv3_s1_range_inval(SMMUv3State *s, Cmd *cmd)
 {
+    SMMUState *bs = ARM_SMMU(s);
     dma_addr_t end, addr = CMD_ADDR(cmd);
     uint8_t type = CMD_TYPE(cmd);
     uint16_t vmid = CMD_VMID(cmd);
@@ -876,7 +882,9 @@ static void smmuv3_s1_range_inval(SMMUv3State *s, Cmd *cmd)
     if (!tg) {
         trace_smmuv3_s1_range_inval(vmid, asid, addr, tg, 1, ttl, leaf);
         smmuv3_inv_notifiers_iova(s, asid, addr, tg, 1);
-        smmu_iotlb_inv_iova(s, asid, addr, tg, 1, ttl);
+        if (!s->disable_cached_iotlb) {
+            smmu_iotlb_inv_iova(bs, asid, addr, tg, 1, ttl);
+