[Bug 1749393] Re: sbrk() not working under qemu-user with a PIE-compiled binary?
Focal old $ sudo apt install --reinstall qemu-user-static=1:4.2-3ubuntu6.18 Reading package lists... Done Building dependency tree Reading state information... Done 0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 0 not upgraded. Need to get 21.3 MB of archives. After this operation, 0 B of additional disk space will be used. Get:1 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 qemu-user-static amd64 1:4.2-3ubuntu6.18 [21.3 MB] Fetched 21.3 MB in 1s (16.4 MB/s) (Reading database ... 126154 files and directories currently installed.) Preparing to unpack .../qemu-user-static_1%3a4.2-3ubuntu6.18_amd64.deb ... Unpacking qemu-user-static (1:4.2-3ubuntu6.18) over (1:4.2-3ubuntu6.18) ... Setting up qemu-user-static (1:4.2-3ubuntu6.18) ... Processing triggers for man-db (2.9.1-1) ... ubuntu@f-1928075-qemuuserstatic:~$ sudo chroot /home/ubuntu/bullseye-arm64 /bin/sh /debootstrap/debootstrap --second-stage W: Failure trying to run: /sbin/ldconfig W: See //debootstrap/debootstrap.log for details ubuntu@f-1928075-qemuuserstatic:~$ tail -n 2 bullseye-arm64/debootstrap/debootstrap.log qemu: uncaught target signal 11 (Segmentation fault) - core dumped Segmentation fault (core dumped) Upgrade ubuntu@f-1928075-qemuuserstatic:~$ apt-cache policy qemu-user-static qemu-user-static: Installed: 1:4.2-3ubuntu6.18 Candidate: 1:4.2-3ubuntu6.19 Version table: 1:4.2-3ubuntu6.19 500 500 http://archive.ubuntu.com/ubuntu focal-proposed/universe amd64 Packages *** 1:4.2-3ubuntu6.18 500 500 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages 100 /var/lib/dpkg/status 1:4.2-3ubuntu6.17 500 500 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages 1:4.2-3ubuntu6 500 500 http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages ubuntu@f-1928075-qemuuserstatic:~$ sudo apt install qemu-user-static Reading package lists... Done Building dependency tree Reading state information... Done The following packages will be upgraded: qemu-user-static 1 upgraded, 0 newly installed, 0 to remove and 65 not upgraded. Need to get 21.3 MB of archives. After this operation, 0 B of additional disk space will be used. Get:1 http://archive.ubuntu.com/ubuntu focal-proposed/universe amd64 qemu-user-static amd64 1:4.2-3ubuntu6.19 [21.3 MB] Fetched 21.3 MB in 2s (9092 kB/s) (Reading database ... 126160 files and directories currently installed.) Preparing to unpack .../qemu-user-static_1%3a4.2-3ubuntu6.19_amd64.deb ... Unpacking qemu-user-static (1:4.2-3ubuntu6.19) over (1:4.2-3ubuntu6.18) ... Setting up qemu-user-static (1:4.2-3ubuntu6.19) ... Processing triggers for man-db (2.9.1-1) ... ubuntu@f-1928075-qemuuserstatic:~$ sudo update-binfmts --test --display qemu-aarch64 qemu-aarch64 (enabled): package = qemu-user-static type = magic offset = 0 magic = \x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00 mask = \xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff interpreter = /usr/bin/qemu-aarch64-static detector = Test with new versio ubuntu@f-1928075-qemuuserstatic:~$ sudo chroot /home/ubuntu/bullseye-arm64 /bin/sh /debootstrap/debootstrap --second-stage I: Installing core packages... W: Failure trying to run: dpkg --force-depends --install /var/cache/apt/archives/base-passwd_3.5.51_arm64.deb W: See //debootstrap/debootstrap.log for details ubuntu@f-1928075-qemuuserstatic:~$ tail -n 2 bullseye-arm64/debootstrap/debootstrap.log dpkg: error: parsing file '/var/lib/dpkg/status' near line 5 package 'dpkg': duplicate value for 'Package' field That is the good case and also a full run now completes. $ sudo rm -rf bullseye-arm64; sudo qemu-debootstrap --arch=arm64 bullseye bullseye-arm64 http://ftp.debian.org/debian I: Running command: debootstrap --arch arm64 --foreign bullseye bullseye-arm64 http://ftp.debian.org/debian W: Cannot check Release signature; keyring file not available /usr/share/keyrings/debian-archive-keyring.gpg I: Retrieving InRelease I: Retrieving Packages ... I: Configuring tasksel... I: Configuring libc-bin... I: Base system installed successfully. I can't run the docker test due to networking restrictions, but it was the same fault and the same fix - so that should be ok. If anyone else can test -proposed with docker please feel free to do so. ** Tags removed: verification-needed verification-needed-focal ** Tags added: verification-done verification-done-focal -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1749393 Title: sbrk() not working under qemu-user with a PIE-compiled binary? Status in QEMU: Fix Released Status in qemu package in Ubuntu: Fix Released Status in qemu source package in Focal: Fix Committed Bug description: [Impact]
Re: [PATCH v4] target/ppc: fix Hash64 MMU update of PTE bit R
On 11/30/21 21:00, Leandro Lupori wrote: On 30/11/2021 05:44, Cédric Le Goater wrote: It would be interesting to boot directly the PowerNV machine from a FreeBSB kernel and a minimum inirtd without using the skiroot images and an iso. Are images available ? AFAIK there are no minimum initrd images available. The closest thing would be the "bootonly" ISOs that can be found in the links below: https://download.freebsd.org/ftp/releases/powerpc/powerpc64/ISO-IMAGES/13.0/ https://download.freebsd.org/ftp/releases/powerpc/powerpc64le/ISO-IMAGES/13.0/ https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64/ISO-IMAGES/14.0/ https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64le/ISO-IMAGES/14.0/ But to avoid using skiroot, you would need to manually extract the kernel from the ISO and also specify the rootfs, using something like: "-append cd9660:cd0". If you don't want to emulate a disk or cd, the ISO can be passed to -initrd and "-append cd9660:md0" may be used to tell FreeBSD where to find the root partition (it loads it as a memory disk). The ISO is too big for quick tests. Isn't there a minimum initrd ? can we build a builroot-like image for FreeBSD ? There are also qcow2 snapshots available: https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64/ https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64le/ But you'll also need to extract the kernel from the image or from kernel.txz to boot them. ok. I will take a look. As these images target pseries and lack a FAT32 partition, Petitboot won't be able to boot them. The idea would be to skip petitboot and load directly the FreeBSD kernel from skiboot with a minimum initrd or disk image. See : https://github.com/legoater/qemu-ppc-boot/blob/main/buildroot/qemu_ppc64le_powernv-2021.11-rc1-199-g927444a04e-20211120/start-qemu.sh or https://github.com/legoater/qemu-ppc-boot/blob/main/buildroot/qemu_ppc64le_pseries-2021.11-rc1-199-g927444a04e-20211120/start-qemu.sh Thanks, C. Alfredo (cc'ed) was trying to make them Petitboot compatible.
[PATCH] QIO: Add force_quit to the QIOChannel to ensure QIO exits cleanly in some corner case
We found that the QIO channel coroutine could not be awakened in some corner cases during our stress test for COLO. The patch fixes as follow: #0 0x7fad72e24bf6 in __ppoll (fds=0x5563d75861f0, nfds=1, timeout=, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44 #1 0x5563d6977c68 in qemu_poll_ns (fds=0x5563d75861f0, nfds=1, timeout=59697630) at util/qemu-timer.c:348 #2 0x5563d697ac44 in aio_poll (ctx=0x5563d755dfa0, blocking=true) at util/aio-posix.c:669 #3 0x5563d68ba24f in bdrv_do_drained_begin (bs=0x5563d75ea0c0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:432 #4 0x5563d68ba338 in bdrv_drained_begin (bs=0x5563d75ea0c0) at block/io.c:438 #5 0x5563d689c6d2 in quorum_del_child (bs=0x5563d75ea0c0, child=0x5563d7908600, errp=0x7fff3cf5b960) at block/quorum.c:1063 #6 0x5563d684328f in bdrv_del_child (parent_bs=0x5563d75ea0c0, child=0x5563d7908600, errp=0x7fff3cf5b960) at block.c:6568 #7 0x5563d658499e in qmp_x_blockdev_change (parent=0x5563d80ec4c0 "colo-disk0", has_child=true, child=0x5563d7577d30 "children.1", has_node=false, node=0x0,errp=0x7fff3cf5b960) at blockdev.c:4494 #8 0x5563d67e8b4e in qmp_marshal_x_blockdev_change (args=0x7fad6400ada0, ret=0x7fff3cf5b9f8, errp=0x7fff3cf5b9f0) at qapi/qapi-commands-block-core.c:1538 #9 0x5563d691cd9e in do_qmp_dispatch (cmds=0x5563d719a4f0 , request=0x7fad64009d80, allow_oob=true, errp=0x7fff3cf5ba98) at qapi/qmp-dispatch.c:132 #10 0x5563d691cfab in qmp_dispatch (cmds=0x5563d719a4f0 , request=0x7fad64009d80, allow_oob=true) at qapi/qmp-dispatch.c:175 #11 0x5563d67b7787 in monitor_qmp_dispatch (mon=0x5563d7605d40, req=0x7fad64009d80) at monitor/qmp.c:145 #12 0x5563d67b7b24 in monitor_qmp_bh_dispatcher (data=0x0) at monitor/qmp.c:234 #13 0x5563d69754c2 in aio_bh_call (bh=0x5563d7473230) at util/async.c:89 #14 0x5563d697555e in aio_bh_poll (ctx=0x5563d7471da0) at util/async.c:117 #15 0x5563d697a423 in aio_dispatch (ctx=0x5563d7471da0) at util/aio-posix.c:459 #16 0x5563d6975917 in aio_ctx_dispatch (source=0x5563d7471da0, callback=0x0, user_data=0x0) at util/async.c:260 #17 0x7fad730e1fbd in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0 #18 0x5563d6978cda in glib_pollfds_poll () at util/main-loop.c:219 #19 0x5563d6978d58 in os_host_main_loop_wait (timeout=977650) at util/main-loop.c:242 #20 0x5563d6978e69 in main_loop_wait (nonblocking=0) at util/main-loop.c:518 #21 0x5563d658f5ed in main_loop () at vl.c:1814 #22 0x5563d6596bb7 in main (argc=61, argv=0x7fff3cf5c0c8, envp=0x7fff3cf5c2b8) at vl.c:450 From the call trace, we can see that the QEMU main thread is waiting for the in_flight return to zero. But the in_filght never reaches 0. After analysis, the root cause is that the coroutine of NBD was not awakened. Although this bug was found in the COLO test, I think this is a universal problem in the QIO module. This issue also affects other modules depending on QIO such as NBD. We dump the following data: $2 = { in_flight = 2, state = NBD_CLIENT_QUIT, connect_status = 0, connect_err = 0x0, wait_in_flight = false, requests = {{ coroutine = 0x0, offset = 273077071872, receiving = false, }, { coroutine = 0x7f1164571df0, offset = 498792529920, receiving = false, }, { coroutine = 0x0, offset = 500663590912, receiving = false, }, { ... } }, reply = { simple = { magic = 1732535960, error = 0, handle = 0 }, structured = { magic = 1732535960, flags = 0, type = 0, handle = 0, length = 0 }, { magic = 1732535960, _skip = 0, handle = 0 } }, bs = 0x7f11640e2140, reconnect_delay = 0, saddr = 0x7f11640e1f80, export = 0x7f11640dc560 "parent0", } From the data, we can see the coroutine of NBD does not exit normally when killing the NBD server(we kill the Secondary VM in the COLO stress test). Then it will not execute in_flight--. So, the QEMU main thread will hang here. Further analysis, I found the coroutine of NBD will yield in nbd_send_request() or qio_channel_write_all() in nbd_co_send_request. By the way, the yield is due to the kernel return EAGAIN(under the stress test). However, NBD didn't know it would yield here. So, the nbd_recv_coroutines_wake won't wake it up because req->receiving is false. if the NBD server is terminated abnormally at the same time. The G_IO_OUT will be invalid,
[PATCH v2] docs: fix section numbering in pcie.txt
There is no 5.2 section. Section 5.3 should really be 5.2. Fix it. fixes: 453ac8835b0022 ("docs: add PCIe devices placement guidelines") Reviewed-by: Liu Yi L Signed-off-by: Ani Sinha --- docs/pcie.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/pcie.txt b/docs/pcie.txt index 89e3502075..90310b0c5e 100644 --- a/docs/pcie.txt +++ b/docs/pcie.txt @@ -262,7 +262,7 @@ PCI Express Root Ports (and PCI Express Downstream Ports). Port, which may come handy for hot-plugging another device. -5.3 Hot-plug example: +5.2 Hot-plug example: Using HMP: (add -monitor stdio to QEMU command line) device_add ,id=,bus= -- 2.25.1
RE: [PATCH] docs: fix section numbering in pcie.txt
> From: Qemu-devel > On Behalf Of Ani Sinha > Sent: Wednesday, December 1, 2021 2:43 PM > > There is no 5.2 section. Section 5.3 should really be 5.2. Fix it. Reviewed-by: Liu Yi L BTW. Is a fix tag needed? Regards, Yi Liu > Signed-off-by: Ani Sinha > --- > docs/pcie.txt | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/docs/pcie.txt b/docs/pcie.txt > index 89e3502075..90310b0c5e 100644 > --- a/docs/pcie.txt > +++ b/docs/pcie.txt > @@ -262,7 +262,7 @@ PCI Express Root Ports (and PCI Express Downstream > Ports). > Port, which may come handy for hot-plugging another device. > > > -5.3 Hot-plug example: > +5.2 Hot-plug example: > Using HMP: (add -monitor stdio to QEMU command line) >device_add ,id=,bus= Downstream Port Id/PCI-PCI Bridge Id/> > > -- > 2.25.1 >
[PATCH] docs: fix section numbering in pcie.txt
There is no 5.2 section. Section 5.3 should really be 5.2. Fix it. Signed-off-by: Ani Sinha --- docs/pcie.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/pcie.txt b/docs/pcie.txt index 89e3502075..90310b0c5e 100644 --- a/docs/pcie.txt +++ b/docs/pcie.txt @@ -262,7 +262,7 @@ PCI Express Root Ports (and PCI Express Downstream Ports). Port, which may come handy for hot-plugging another device. -5.3 Hot-plug example: +5.2 Hot-plug example: Using HMP: (add -monitor stdio to QEMU command line) device_add ,id=,bus= -- 2.25.1
Re: [PULL 0/1] MAINTAINERS update
On 11/30/21 9:47 PM, Eduardo Habkost wrote: * MAINTAINERS: Change my email address (Eduardo Habkost) Eduardo Habkost (1): MAINTAINERS: Change my email address MAINTAINERS | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Not a pull request. But since it's just one patch that needs no regression testing, I have applied it directly. r~
Re: [PATCH] memory: Fix incorrect calls of log_global_start/stop
On Tue, Nov 30, 2021 at 04:03:10PM +0100, David Hildenbrand wrote: > On 30.11.21 09:00, Peter Xu wrote: > > We should only call the log_global_start/stop when the global dirty track > > bitmask changes from zero<->non-zero. > > > > No real issue reported for this yet probably because no immediate user to > > enable both dirty rate measurement and migration at the same time. However > > it'll be good to be prepared for it. > > > > Fixes: 63b41db4bc ("memory: make global_dirty_tracking a bitmask") > > Cc: Hyman Huang > > Cc: Paolo Bonzini > > Cc: Dr. David Alan Gilbert > > Cc: Juan Quintela > > Cc: David Hildenbrand > > Signed-off-by: Peter Xu > > LGTM > > Reviewed-by: David Hildenbrand Thanks for the quick review, David. Just a heads-up that I think it'll be nice to have this as 6.2-rc3 material. QEMU planning page told me rc3 has just passed (Nov 30th) but still I didn't see it released, so not sure.. I won't call it a blocker though, so if it missed 6.2 we just copy stable, and it'll be needed for 6.2 stable branch only. -- Peter Xu
Re: [PATCH 1/1] hw/arm/virt: Support for virtio-mem-pci
On 11/30/21 8:37 PM, David Hildenbrand wrote: On 30.11.21 01:33, Gavin Shan wrote: This supports virtio-mem-pci device on "virt" platform, by simply following the implementation on x86. Thanks for picking this up! Thanks, David. * The patch was written by David Hildenbrand modified by Jonathan Cameron Maybe replace this section by Co-developed-by: David Hildenbrand Co-developed-by: Jonathan Cameron Yes, it will be included into v2. * This implements the hotplug handlers to support virtio-mem-pci device hot-add, while the hot-remove isn't supported as we have on x86. * The block size is 1GB on ARM64 instead of 128MB on x86. See below, isn't it actually 512 MiB nowadays? I think so. * It has been passing the tests with various combinations like 64KB and 4KB page sizes on host and guest, different memory device backends like normal, transparent huge page and HugeTLB, plus migration. Perfect. A note that hugetlbfs isn't fully supported/safe to use until we have preallocation support in QEMU (WIP). Yes, there is some warnings raised to enlarge 'request-size' on host with 64KB page size. Note that the memory backends I used in the testings always have "prealloc=on" property though. Signed-off-by: Gavin Shan --- hw/arm/Kconfig | 1 + hw/arm/virt.c | 68 +- hw/virtio/virtio-mem.c | 2 ++ 3 files changed, 70 insertions(+), 1 deletion(-) diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index 2d37d29f02..15aff8efb8 100644 --- a/hw/arm/Kconfig +++ b/hw/arm/Kconfig @@ -27,6 +27,7 @@ config ARM_VIRT select DIMM select ACPI_HW_REDUCED select ACPI_APEI +select VIRTIO_MEM_SUPPORTED config CHEETAH bool diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 369552ad45..f4599a5ef0 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -71,9 +71,11 @@ #include "hw/arm/smmuv3.h" #include "hw/acpi/acpi.h" #include "target/arm/internals.h" +#include "hw/mem/memory-device.h" #include "hw/mem/pc-dimm.h" #include "hw/mem/nvdimm.h" #include "hw/acpi/generic_event_device.h" +#include "hw/virtio/virtio-mem-pci.h" #include "hw/virtio/virtio-iommu.h" #include "hw/char/pl011.h" #include "qemu/guest-random.h" @@ -2480,6 +2482,63 @@ static void virt_memory_plug(HotplugHandler *hotplug_dev, dev, _abort); } +static void virt_virtio_md_pci_pre_plug(HotplugHandler *hotplug_dev, +DeviceState *dev, Error **errp) +{ +HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev); +Error *local_err = NULL; + +if (!hotplug_dev2 && dev->hotplugged) { +/* + * Without a bus hotplug handler, we cannot control the plug/unplug + * order. We should never reach this point when hotplugging on x86, + * however, better add a safety net. + */ +error_setg(errp, "hotplug of virtio based memory devices not supported" + " on this bus."); +return; +} +/* + * First, see if we can plug this memory device at all. If that + * succeeds, branch of to the actual hotplug handler. + */ +memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev), NULL, + _err); +if (!local_err && hotplug_dev2) { +hotplug_handler_pre_plug(hotplug_dev2, dev, _err); +} +error_propagate(errp, local_err); +} + +static void virt_virtio_md_pci_plug(HotplugHandler *hotplug_dev, +DeviceState *dev, Error **errp) +{ +HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev); +Error *local_err = NULL; + +/* + * Plug the memory device first and then branch off to the actual + * hotplug handler. If that one fails, we can easily undo the memory + * device bits. + */ +memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev)); +if (hotplug_dev2) { +hotplug_handler_plug(hotplug_dev2, dev, _err); +if (local_err) { +memory_device_unplug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev)); +} +} +error_propagate(errp, local_err); +} + +static void virt_virtio_md_pci_unplug_request(HotplugHandler *hotplug_dev, + DeviceState *dev, Error **errp) +{ +/* We don't support hot unplug of virtio based memory devices */ +error_setg(errp, "virtio based memory devices cannot be unplugged."); +} + + static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev, Error **errp) { @@ -2513,6 +2572,8 @@ static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev, qdev_prop_set_uint32(dev, "len-reserved-regions", 1); qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str); g_free(resv_prop_str); +} else if
Re: [PATCH v2 1/2] ppc/pnv.c: add a friendly warning when accel=kvm is used
On Tue, Nov 30, 2021 at 10:31:52AM -0300, Daniel Henrique Barboza wrote: > If one tries to use -machine powernv9,accel=kvm in a Power9 host, a > cryptic error will be shown: > > qemu-system-ppc64: Register sync failed... If you're using kvm-hv.ko, only > "-cpu host" is possible > qemu-system-ppc64: kvm_init_vcpu: kvm_arch_init_vcpu failed (0): Invalid > argument > > Appending '-cpu host' will throw another error: > > qemu-system-ppc64: invalid chip model 'host' for powernv9 machine > > The root cause is that in IBM PowerPC we have different specs for the > bare-metal > and the guests. The bare-metal follows OPAL, the guests follow PAPR. The > kernel > KVM modules presented in the ppc kernels implements PAPR. This means that we > can't use KVM accel when using the powernv machine, which is the emulation of > the bare-metal host. > > All that said, let's give a more informative error in this case. > > Signed-off-by: Daniel Henrique Barboza Reviewed-by: David Gibson > --- > hw/ppc/pnv.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c > index 71e45515f1..e5b87e8730 100644 > --- a/hw/ppc/pnv.c > +++ b/hw/ppc/pnv.c > @@ -742,6 +742,11 @@ static void pnv_init(MachineState *machine) > DriveInfo *pnor = drive_get(IF_MTD, 0, 0); > DeviceState *dev; > > +if (kvm_enabled()) { > +error_report("The powernv machine does not work with KVM > acceleration"); > +exit(EXIT_FAILURE); > +} > + > /* allocate RAM */ > if (machine->ram_size < mc->default_ram_size) { > char *sz = size_to_str(mc->default_ram_size); -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
Re: [PATCH v8 06/10] target/ppc: enable PMU instruction count
On Tue, Nov 30, 2021 at 07:24:04PM -0300, Daniel Henrique Barboza wrote: > > > On 11/29/21 01:36, David Gibson wrote: > > On Thu, Nov 25, 2021 at 12:08:13PM -0300, Daniel Henrique Barboza wrote: > > > The PMU is already counting cycles by calculating time elapsed in > > > nanoseconds. Counting instructions is a different matter and requires > > > another approach. > > > > > > This patch adds the capability of counting completed instructions > > > (Perf event PM_INST_CMPL) by counting the amount of instructions > > > translated in each translation block right before exiting it. > > > > > > A new pmu_count_insns() helper in translation.c was added to do that. > > > After verifying that the PMU is running (MMCR0_FC bit not set), call > > > helper_insns_inc(). This new helper from power8-pmu.c will add the > > > instructions to the relevant counters. It'll also be responsible for > > > triggering counter negative overflows as it is already being done with > > > cycles. > > > > > > Signed-off-by: Daniel Henrique Barboza > > > --- > > > target/ppc/cpu.h | 1 + > > > target/ppc/helper.h | 1 + > > > target/ppc/helper_regs.c | 4 +++ > > > target/ppc/power8-pmu-regs.c.inc | 6 + > > > target/ppc/power8-pmu.c | 38 ++ > > > target/ppc/translate.c | 46 > > > 6 files changed, 96 insertions(+) > > > > > > diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h > > > index 9b41b022e2..38cd2b5c43 100644 > > > --- a/target/ppc/cpu.h > > > +++ b/target/ppc/cpu.h > > > @@ -656,6 +656,7 @@ enum { > > > HFLAGS_PR = 14, /* MSR_PR */ > > > HFLAGS_PMCC0 = 15, /* MMCR0 PMCC bit 0 */ > > > HFLAGS_PMCC1 = 16, /* MMCR0 PMCC bit 1 */ > > > +HFLAGS_MMCR0FC = 17, /* MMCR0 FC bit */ > > > > Now that the event stuff is a bit more refined, you could narrow this > > down to specifically marking if any counters are actively counting > > instructions (not frozen by MMCR0[FC] and not frozen by > > MMCR0[FC14|FC56] *and* have the right event selected). > > > > Since I suspect the instruction counting instrumentation could be > > quite expensive (helper call on every tb), that might be worthwhile. > > That was worthwhile. The performance increase is substantial with this > change, in particular with tests that exercises only cycle events. Good to know. > > > HFLAGS_VSX = 23, /* MSR_VSX if cpu has VSX */ > > > HFLAGS_VR = 25, /* MSR_VR if cpu has VRE */ > > > diff --git a/target/ppc/helper.h b/target/ppc/helper.h > > > index 94b4690375..d8a23e054a 100644 > > > --- a/target/ppc/helper.h > > > +++ b/target/ppc/helper.h > > > @@ -24,6 +24,7 @@ DEF_HELPER_2(store_mmcr0, void, env, tl) > > > DEF_HELPER_2(store_mmcr1, void, env, tl) > > > DEF_HELPER_3(store_pmc, void, env, i32, i64) > > > DEF_HELPER_2(read_pmc, tl, env, i32) > > > +DEF_HELPER_2(insns_inc, void, env, i32) > > > #endif > > > DEF_HELPER_1(check_tlb_flush_local, void, env) > > > DEF_HELPER_1(check_tlb_flush_global, void, env) > > > diff --git a/target/ppc/helper_regs.c b/target/ppc/helper_regs.c > > > index 99562edd57..875c2fdfc6 100644 > > > --- a/target/ppc/helper_regs.c > > > +++ b/target/ppc/helper_regs.c > > > @@ -115,6 +115,10 @@ static uint32_t > > > hreg_compute_hflags_value(CPUPPCState *env) > > > if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCC1) { > > > hflags |= 1 << HFLAGS_PMCC1; > > > } > > > +if (env->spr[SPR_POWER_MMCR0] & MMCR0_FC) { > > > +hflags |= 1 << HFLAGS_MMCR0FC; > > > +} > > > + > > > #ifndef CONFIG_USER_ONLY > > > if (!env->has_hv_mode || (msr & (1ull << MSR_HV))) { > > > diff --git a/target/ppc/power8-pmu-regs.c.inc > > > b/target/ppc/power8-pmu-regs.c.inc > > > index 25b13ad564..580e4e41b2 100644 > > > --- a/target/ppc/power8-pmu-regs.c.inc > > > +++ b/target/ppc/power8-pmu-regs.c.inc > > > @@ -113,6 +113,12 @@ static void write_MMCR0_common(DisasContext *ctx, > > > TCGv val) > > >*/ > > > gen_icount_io_start(ctx); > > > gen_helper_store_mmcr0(cpu_env, val); > > > + > > > +/* > > > + * End the translation block because MMCR0 writes can change > > > + * ctx->pmu_frozen. > > > + */ > > > +ctx->base.is_jmp = DISAS_EXIT_UPDATE; > > > } > > > void spr_write_MMCR0_ureg(DisasContext *ctx, int sprn, int gprn) > > > diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c > > > index 01e0b9b8fc..59d0def79d 100644 > > > --- a/target/ppc/power8-pmu.c > > > +++ b/target/ppc/power8-pmu.c > > > @@ -112,6 +112,30 @@ static PMUEventType pmc_get_event(CPUPPCState *env, > > > int sprn) > > > return evt_type; > > > } > > > +static bool pmu_increment_insns(CPUPPCState *env, uint32_t num_insns) > > > +{ > > > +bool overflow_triggered = false; > > > +int sprn; > > > + > > > +/* PMC6 never counts instructions */ > > > +for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC5; sprn++) { > >
Re: [PATCH v4] target/ppc: fix Hash64 MMU update of PTE bit R
On Tue, Nov 30, 2021 at 09:44:58AM +0100, Cédric le Goater wrote: > On 11/30/21 01:30, David Gibson wrote: > > On Mon, Nov 29, 2021 at 03:57:51PM -0300, Leandro Lupori wrote: > > > When updating the R bit of a PTE, the Hash64 MMU was using a wrong byte > > > offset, causing the first byte of the adjacent PTE to be corrupted. > > > This caused a panic when booting FreeBSD, using the Hash MMU. > > > > > > Fixes: a2dd4e83e76b ("ppc/hash64: Rework R and C bit updates") > > > Signed-off-by: Leandro Lupori > > > > Reviewed-by: David Gibson > > Sorry, I didn't wait for your Rb because the patch was a good candidate > for -rc3. It is merged now. No worries. > > > Thanks for your patience with our nitpicking :). > > Yes, > > Here is another QEMU bug found by FreeBSD : > https://lore.kernel.org/qemu-devel/427ef2ee-6871-0d27-f485-90ad142f6...@kaod.org/ > > It would be interesting to boot directly the PowerNV machine from a > FreeBSB kernel and a minimum inirtd without using the skiroot images > and an iso. Are images available ? > > Thanks. > > C. > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
[PULL 0/1] MAINTAINERS update
* MAINTAINERS: Change my email address (Eduardo Habkost) Eduardo Habkost (1): MAINTAINERS: Change my email address MAINTAINERS | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) -- 2.32.0
Re: [PATCH] fw_cfg: Fix memory leak in fw_cfg_register_file
On Tue, Nov 16, 2021 at 04:28:34PM +0100, Philippe Mathieu-Daudé wrote: > On 11/16/21 12:42, Miaoqian Lin wrote: > > When kobject_init_and_add() fails, entry should be freed just like > > when sysfs_create_bin_file() fails. > > > > Fixes: fe3c60684377 ("firmware: Fix a reference count leak.") > Reviewed-by: Philippe Mathieu-Daudé No, no. This patch is completely bogus and would introduce a double free. > > Signed-off-by: Miaoqian Lin > > --- > > drivers/firmware/qemu_fw_cfg.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c > > index 172c751a4f6c..0f404777f016 100644 > > --- a/drivers/firmware/qemu_fw_cfg.c > > +++ b/drivers/firmware/qemu_fw_cfg.c > > @@ -608,6 +608,7 @@ static int fw_cfg_register_file(const struct > > fw_cfg_file *f) > >fw_cfg_sel_ko, "%d", entry->select); > > if (err) { > > kobject_put(>kobj); > > + kfree(entry); entry would already have been freed by kobject_put() and fw_cfg_sysfs_release_entry() here. > > return err; > > } > > > > Doesn't look like this patch has been picked up yet, so: NAK. Johan
[PULL 1/1] MAINTAINERS: Change my email address
The ehabk...@redhat.com email address will stop working on 2021-12-01, change it to my personal email address. Signed-off-by: Eduardo Habkost Message-Id: <20211129163053.2506734-1-ehabk...@redhat.com> Signed-off-by: Eduardo Habkost --- MAINTAINERS | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index d3879aa3c12..7e8a586b2ae 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -324,7 +324,7 @@ F: disas/sparc.c X86 TCG CPUs M: Paolo Bonzini M: Richard Henderson -M: Eduardo Habkost +M: Eduardo Habkost S: Maintained F: target/i386/tcg/ F: tests/tcg/i386/ @@ -1628,7 +1628,7 @@ F: include/hw/i386/microvm.h F: pc-bios/bios-microvm.bin Machine core -M: Eduardo Habkost +M: Eduardo Habkost M: Marcel Apfelbaum R: Philippe Mathieu-Daudé S: Supported @@ -2648,13 +2648,13 @@ F: backends/cryptodev*.c Python library M: John Snow M: Cleber Rosa -R: Eduardo Habkost +R: Eduardo Habkost S: Maintained F: python/ T: git https://gitlab.com/jsnow/qemu.git python Python scripts -M: Eduardo Habkost +M: Eduardo Habkost M: Cleber Rosa S: Odd Fixes F: scripts/*.py @@ -2730,7 +2730,7 @@ T: git https://github.com/mdroth/qemu.git qga QOM M: Paolo Bonzini R: Daniel P. Berrange -R: Eduardo Habkost +R: Eduardo Habkost S: Supported F: docs/qdev-device-use.txt F: hw/core/qdev* @@ -2750,7 +2750,7 @@ F: tests/unit/check-qom-proplist.c F: tests/unit/test-qdev-global-props.c QOM boilerplate conversion script -M: Eduardo Habkost +M: Eduardo Habkost S: Maintained F: scripts/codeconverter/ -- 2.32.0
Re: why is iommu_platform set to off by default?
On Tue, Nov 30, 2021 at 04:38:11PM +, Stefan Hajnoczi wrote: > On Tue, Nov 30, 2021 at 02:32:49PM +, Peter Maydell wrote: > > I've just spent a day or so trying to track down why PCI passthrough > > of a virtio-blk-pci device wasn't working. The problem turns out to be > > that by default virtio pci devices don't use the IOMMU, even when the > > machine model has created an IOMMU and arranged for the PCI bus to > > be underneath it. So when the L2 guest tries to program the virtio device, > > the virtio device treats the IPAs it writes as if they were PAs and > > of course the data structures it's looking for aren't there. > > > > Why do we default this to 'off'? It seems pretty unhelpful not to > > honour the existence of the IOMMU, and the failure mode is pretty > > opaque (L2 guest just hangs)... > > Historically VIRTIO used guest physical addresses instead of bus > addresses (IOVA). I think when IOMMU support was added, a QEMU -device > virtio-* parameter was added and it's simply off by default: > > iommu_platform= - on/off (default: false) > > Maybe this behavior is for backwards compatibility? Existing guests with > IOMMUs shouldn't change behavior, although this could be fixed with > machine type compat properties. > > Stefan Unfortunately iommu is special. Bypassing it makes the config insecure for nested virt, so it's important that we do not allow guest driver to disable the iommu through the feature bit. This means normal compat machinery (make feature on by default but disable for legacy drivers) does not work here. -- MST
Re: [PATCH v1 0/3] virtio-mem: Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE
On Tue, Nov 30, 2021 at 10:28:35AM +0100, David Hildenbrand wrote: > Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE in QEMU, which indicates to > a guest that we don't support reading unplugged memory. We indicate > the feature based on a new "unplugged-inaccessible" property available > for x86 targets only (the only ones with legacy guests). Guests that don't > support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE will fail initialization if > indicated/required by the hypervisor. > > For example, Linux guests starting with v5.16 will support > VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE. > > For future targets that don't have legacy guests (especially arm64), we'll > always indicate VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE. > > More details can be found in the description of patch #2. > > " > For existing compat machines, the property will default to "off", to > not change the behavior but eventually warn about a problematic setup. > Short-term, we'll set the property default to "auto" for new QEMU machines. > Mid-term, we'll set the property default to "on" for new QEMU machines. > Long-term, we'll deprecate the parameter and disallow legacy guests > completely. > " > > TODO: Once 6.2 was release, adjust patch #3. Replace patch #1 by a proper > Linux header sync. oh so it's not for 6.2. got it. > Cc: "Michael S. Tsirkin" > Cc: Marcel Apfelbaum > Cc: Paolo Bonzini > Cc: Richard Henderson > Cc: Eduardo Habkost > Cc: Gavin Shan > Cc: Hui Zhu > Cc: Sebastien Boeuf > Cc: Pankaj Gupta > > David Hildenbrand (3): > linux-headers: sync VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE > virtio-mem: Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE > virtio-mem: Set "unplugged-inaccessible=auto" for the 6.2 machine on > x86 > > hw/i386/pc.c| 1 + > hw/virtio/virtio-mem.c | 63 + > include/hw/virtio/virtio-mem.h | 8 +++ > include/standard-headers/linux/virtio_mem.h | 9 ++- > 4 files changed, 78 insertions(+), 3 deletions(-) > > -- > 2.31.1
Re: [PATCH v1 1/3] linux-headers: sync VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE
On Tue, Nov 30, 2021 at 10:28:36AM +0100, David Hildenbrand wrote: > TODO: replace by a proper header sync. what's the plan for this patchset then? > Signed-off-by: David Hildenbrand > --- > include/standard-headers/linux/virtio_mem.h | 9 ++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/include/standard-headers/linux/virtio_mem.h > b/include/standard-headers/linux/virtio_mem.h > index 05e5ade75d..18c74c527c 100644 > --- a/include/standard-headers/linux/virtio_mem.h > +++ b/include/standard-headers/linux/virtio_mem.h > @@ -68,9 +68,10 @@ > * explicitly triggered (VIRTIO_MEM_REQ_UNPLUG). > * > * There are no guarantees what will happen if unplugged memory is > - * read/written. Such memory should, in general, not be touched. E.g., > - * even writing might succeed, but the values will simply be discarded at > - * random points in time. > + * read/written. In general, unplugged memory should not be touched, because > + * the resulting action is undefined. There is one exception: without > + * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, unplugged memory inside the usable > + * region can be read, to simplify creation of memory dumps. > * > * It can happen that the device cannot process a request, because it is > * busy. The device driver has to retry later. > @@ -87,6 +88,8 @@ > > /* node_id is an ACPI PXM and is valid */ > #define VIRTIO_MEM_F_ACPI_PXM0 > +/* unplugged memory must not be accessed */ > +#define VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE 1 > > > /* --- virtio-mem: guest -> host requests --- */ > -- > 2.31.1
Re: why is iommu_platform set to off by default?
On Tue, Nov 30, 2021 at 02:32:49PM +, Peter Maydell wrote: > I've just spent a day or so trying to track down why PCI passthrough > of a virtio-blk-pci device wasn't working. The problem turns out to be > that by default virtio pci devices don't use the IOMMU, even when the > machine model has created an IOMMU and arranged for the PCI bus to > be underneath it. So when the L2 guest tries to program the virtio device, > the virtio device treats the IPAs it writes as if they were PAs and > of course the data structures it's looking for aren't there. Because this is what legacy guests expect, and legacy configs are much more common than nested. > Why do we default this to 'off'? It seems pretty unhelpful not to > honour the existence of the IOMMU, and the failure mode is pretty > opaque (L2 guest just hangs)... > > thanks > -- PMM This should be handled by VFIO in L1 really, it can check for a device quirk and refuse binding if the feature bit is disabled. -- MST
Re: [PATCH] virtio: signal after wrapping packed used_idx
On Tue, Nov 30, 2021 at 01:45:10PM +, Stefan Hajnoczi wrote: > Packed Virtqueues wrap used_idx instead of letting it run freely like > Split Virtqueues do. If the used ring wraps more than once there is no > way to compare vq->signalled_used and vq->used_idx in > virtio_packed_should_notify() since they are modulo vq->vring.num. > > This causes the device to stop sending used buffer notifications when > when virtio_packed_should_notify() is called less than once each time > around the used ring. > > It is possible to trigger this with virtio-blk's dataplane > notify_guest_bh() irq coalescing optimization. The call to > virtio_notify_irqfd() (and virtio_packed_should_notify()) is deferred to > a BH. If the guest driver is polling it can complete and submit more > requests before the BH executes, causing the used ring to wrap more than > once. The result is that the virtio-blk device ceases to raise > interrupts and I/O hangs. > > Cc: Tiwei Bie > Cc: Jason Wang > Cc: Michael S. Tsirkin > Signed-off-by: Stefan Hajnoczi Makes sense. Fixes tag? > --- > Smarter solutions welcome, but I think notifying once per vq->vring.num > is acceptable. > --- > hw/virtio/virtio.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c > index ea7c079fb0..f7851c2750 100644 > --- a/hw/virtio/virtio.c > +++ b/hw/virtio/virtio.c > @@ -885,6 +885,7 @@ static void virtqueue_packed_flush(VirtQueue *vq, > unsigned int count) > if (vq->used_idx >= vq->vring.num) { > vq->used_idx -= vq->vring.num; > vq->used_wrap_counter ^= 1; > +vq->signalled_used_valid = false; > } > } > > -- > 2.33.1
[RFC PATCH v2 3/4] target/ppc: Remove the software TLB model of 7450 CPUs
(Applies to 7441, 7445, 7450, 7451, 7455, 7457, 7447, 7447a and 7448) The QEMU-side software TLB implementation for the 7450 family of CPUs is being removed due to lack of known users in the real world. The last users in the code were removed by the two previous commits. A brief history: The feature was added in QEMU by commit 7dbe11acd8 ("Handle all MMU models in switches...") with the mention that Linux was not able to handle the TLB miss interrupts and the MMU model would be kept disabled. At some point later, commit 8ca3f6c382 ("Allow selection of all defined PowerPC 74xx (aka G4) CPUs.") enabled the model for the 7450 family without further justification. We have since the year 2011 [1] been unable to run OpenBIOS in the 7450s and have not heard of any other software that is used with those CPUs in QEMU. Attempts were made to find a guest OS that implemented the TLB miss handlers and none were found among Linux 5.15, FreeBSD 13, MacOS9, MacOSX and MorphOS 3.15. All CPUs that registered this feature were moved to an MMU model that replaces the software TLB with a QEMU hardware TLB implementation. They can now run the same software as the 7400 CPUs, including the OSes mentioned above. References: - https://bugs.launchpad.net/qemu/+bug/812398 https://gitlab.com/qemu-project/qemu/-/issues/86 - https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00289.html message id: 2029134431.406753-1-faro...@linux.ibm.com Signed-off-by: Fabiano Rosas --- target/ppc/cpu-qom.h | 6 +- target/ppc/cpu.h | 4 +--- target/ppc/cpu_init.c| 26 -- target/ppc/excp_helper.c | 29 - target/ppc/helper.h | 2 -- target/ppc/mmu_common.c | 19 --- target/ppc/mmu_helper.c | 31 --- target/ppc/translate.c | 26 -- 8 files changed, 6 insertions(+), 137 deletions(-) diff --git a/target/ppc/cpu-qom.h b/target/ppc/cpu-qom.h index 5800fa324e..ef9e324474 100644 --- a/target/ppc/cpu-qom.h +++ b/target/ppc/cpu-qom.h @@ -45,7 +45,11 @@ enum powerpc_mmu_t { POWERPC_MMU_32B= 0x0001, /* PowerPC 6xx MMU with software TLB */ POWERPC_MMU_SOFT_6xx = 0x0002, -/* PowerPC 74xx MMU with software TLB */ +/* + * PowerPC 74xx MMU with software TLB (this has been + * disabled, see git history for more information. + * keywords: tlbld tlbli TLBMISS PTEHI PTELO) + */ POWERPC_MMU_SOFT_74xx = 0x0003, /* PowerPC 4xx MMU with software TLB */ POWERPC_MMU_SOFT_4xx = 0x0004, diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h index e946da5f3a..432d609094 100644 --- a/target/ppc/cpu.h +++ b/target/ppc/cpu.h @@ -2138,8 +2138,6 @@ enum { PPC_SEGMENT= 0x0200ULL, /* PowerPC 6xx TLB management instructions */ PPC_6xx_TLB= 0x0400ULL, -/* PowerPC 74xx TLB management instructions */ -PPC_74xx_TLB = 0x0800ULL, /* PowerPC 40x TLB management instructions */ PPC_40x_TLB= 0x1000ULL, /* segment register access instructions for PowerPC 64 "bridge"*/ @@ -2196,7 +2194,7 @@ enum { | PPC_CACHE_DCBZ \ | PPC_CACHE_DCBA | PPC_CACHE_LOCK \ | PPC_EXTERN | PPC_SEGMENT | PPC_6xx_TLB \ -| PPC_74xx_TLB | PPC_40x_TLB | PPC_SEGMENT_64B \ +| PPC_40x_TLB | PPC_SEGMENT_64B \ | PPC_SLBI | PPC_WRTEE | PPC_40x_EXCP \ | PPC_405_MAC | PPC_440_SPEC | PPC_BOOKE \ | PPC_MFAPIDI | PPC_TLBIVA | PPC_TLBIVAX \ diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c index 962acf295f..ed0e2136d9 100644 --- a/target/ppc/cpu_init.c +++ b/target/ppc/cpu_init.c @@ -945,31 +945,6 @@ static void register_l3_ctrl(CPUPPCState *env) 0x); } -static void register_74xx_soft_tlb(CPUPPCState *env, int nb_tlbs, int nb_ways) -{ -#if !defined(CONFIG_USER_ONLY) -env->nb_tlb = nb_tlbs; -env->nb_ways = nb_ways; -env->id_tlbs = 1; -env->tlb_type = TLB_6XX; -/* XXX : not implemented */ -spr_register(env, SPR_PTEHI, "PTEHI", - SPR_NOACCESS, SPR_NOACCESS, - _read_generic, _write_generic, - 0x); -/* XXX : not implemented */ -spr_register(env, SPR_PTELO, "PTELO", - SPR_NOACCESS, SPR_NOACCESS, - _read_generic, _write_generic, - 0x); -/* XXX : not implemented */ -spr_register(env, SPR_TLBMISS, "TLBMISS", - SPR_NOACCESS, SPR_NOACCESS, - _read_generic, _write_generic, - 0x); -#endif -}
[RFC PATCH v2 4/4] tests/avocado: ppc: Add smoke tests for MPC7400 and MPC7450 families
These tests ensure that our emulation for these cpus is not completely broken and we can at least run OpenBIOS on them. $ make check-avocado AVOCADO_TESTS=../tests/avocado/ppc_74xx.py Signed-off-by: Fabiano Rosas Reviewed-by: Willian Rampazzo --- Note that the 7450s depend on an OpenBIOS change adding support for their PVR: https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00290.html --- tests/avocado/ppc_74xx.py | 123 ++ 1 file changed, 123 insertions(+) create mode 100644 tests/avocado/ppc_74xx.py diff --git a/tests/avocado/ppc_74xx.py b/tests/avocado/ppc_74xx.py new file mode 100644 index 00..556a9a7da9 --- /dev/null +++ b/tests/avocado/ppc_74xx.py @@ -0,0 +1,123 @@ +# Smoke tests for 74xx cpus (aka G4). +# +# Copyright (c) 2021, IBM Corp. +# +# This work is licensed under the terms of the GNU GPL, version 2 or +# later. See the COPYING file in the top-level directory. + +from avocado_qemu import QemuSystemTest +from avocado_qemu import wait_for_console_pattern + +class ppc74xxCpu(QemuSystemTest): +""" +:avocado: tags=arch:ppc +""" +timeout = 5 + +def test_ppc_7400(self): +""" +:avocado: tags=cpu:7400 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7410(self): +""" +:avocado: tags=cpu:7410 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,74xx') + +def test_ppc_7441(self): +""" +:avocado: tags=cpu:7441 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7445(self): +""" +:avocado: tags=cpu:7445 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7447(self): +""" +:avocado: tags=cpu:7447 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7447a(self): +""" +:avocado: tags=cpu:7447a +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7448(self): +""" +:avocado: tags=cpu:7448 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,MPC86xx') + +def test_ppc_7450(self): +""" +:avocado: tags=cpu:7450 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7451(self): +""" +:avocado: tags=cpu:7451 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7455(self): +""" +:avocado: tags=cpu:7455 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7457(self): +""" +:avocado: tags=cpu:7457 +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') + +def test_ppc_7457a(self): +""" +:avocado: tags=cpu:7457a +""" +self.vm.set_console() +self.vm.launch() +wait_for_console_pattern(self, '>> OpenBIOS') +wait_for_console_pattern(self, '>> CPU type PowerPC,G4') -- 2.33.1
[RFC PATCH v2 1/4] target/ppc: Disable software TLB for the 7450 family
(Applies to 7441, 7445, 7450, 7451, 7455, 7457, 7447 and 7447a)* We have since 2011 [1] been unable to run OpenBIOS in the 7450s and have not heard of any other software that is used with those CPUs in QEMU. A current discussion [2] shows that the 7450 software TLB is unsupported in Linux 5.15, FreeBSD 13, MacOS9, MacOSX and MorphOS 3.15. With no known support in firmware or OS, this means that no code for any of the 7450 CPUs is ever ran in QEMU. Since the implementation in QEMU of the 7400 MMU is the same as the 7450, except for the software TLB vs. hardware TLB search, this patch changes all 7450 cpus to the 7400 MMU model. This has the practical effect of disabling the software TLB feature while keeping other aspects of address translation working as expected. This allow us to run software on the 7450 family again. *- note that the 7448 is currently aliased in QEMU for a 7400, so it is unaffected by this change. 1- https://bugs.launchpad.net/qemu/+bug/812398 https://gitlab.com/qemu-project/qemu/-/issues/86 2- https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00289.html message id: 2029134431.406753-1-faro...@linux.ibm.com Signed-off-by: Fabiano Rosas --- target/ppc/cpu_init.c | 25 ++--- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c index 6695985e9b..509df35d09 100644 --- a/target/ppc/cpu_init.c +++ b/target/ppc/cpu_init.c @@ -5932,7 +5932,6 @@ static void init_proc_7440(CPUPPCState *env) 0x); /* Memory management */ register_low_BATs(env); -register_74xx_soft_tlb(env, 128, 2); init_excp_7450(env); env->dcache_line_size = 32; env->icache_line_size = 32; @@ -5956,7 +5955,7 @@ POWERPC_FAMILY(7440)(ObjectClass *oc, void *data) PPC_CACHE_DCBA | PPC_CACHE_DCBZ | PPC_MEM_SYNC | PPC_MEM_EIEIO | PPC_MEM_TLBIE | PPC_MEM_TLBSYNC | - PPC_MEM_TLBIA | PPC_74xx_TLB | + PPC_MEM_TLBIA | PPC_SEGMENT | PPC_EXTERN | PPC_ALTIVEC; pcc->msr_mask = (1ull << MSR_VR) | @@ -5976,7 +5975,7 @@ POWERPC_FAMILY(7440)(ObjectClass *oc, void *data) (1ull << MSR_PMM) | (1ull << MSR_RI) | (1ull << MSR_LE); -pcc->mmu_model = POWERPC_MMU_SOFT_74xx; +pcc->mmu_model = POWERPC_MMU_32B; pcc->excp_model = POWERPC_EXCP_74xx; pcc->bus_model = PPC_FLAGS_INPUT_6xx; pcc->bfd_mach = bfd_mach_ppc_7400; @@ -6067,7 +6066,6 @@ static void init_proc_7450(CPUPPCState *env) 0x); /* Memory management */ register_low_BATs(env); -register_74xx_soft_tlb(env, 128, 2); init_excp_7450(env); env->dcache_line_size = 32; env->icache_line_size = 32; @@ -6091,7 +6089,7 @@ POWERPC_FAMILY(7450)(ObjectClass *oc, void *data) PPC_CACHE_DCBA | PPC_CACHE_DCBZ | PPC_MEM_SYNC | PPC_MEM_EIEIO | PPC_MEM_TLBIE | PPC_MEM_TLBSYNC | - PPC_MEM_TLBIA | PPC_74xx_TLB | + PPC_MEM_TLBIA | PPC_SEGMENT | PPC_EXTERN | PPC_ALTIVEC; pcc->msr_mask = (1ull << MSR_VR) | @@ -6111,7 +6109,7 @@ POWERPC_FAMILY(7450)(ObjectClass *oc, void *data) (1ull << MSR_PMM) | (1ull << MSR_RI) | (1ull << MSR_LE); -pcc->mmu_model = POWERPC_MMU_SOFT_74xx; +pcc->mmu_model = POWERPC_MMU_32B; pcc->excp_model = POWERPC_EXCP_74xx; pcc->bus_model = PPC_FLAGS_INPUT_6xx; pcc->bfd_mach = bfd_mach_ppc_7400; @@ -6205,7 +6203,6 @@ static void init_proc_7445(CPUPPCState *env) /* Memory management */ register_low_BATs(env); register_high_BATs(env); -register_74xx_soft_tlb(env, 128, 2); init_excp_7450(env); env->dcache_line_size = 32; env->icache_line_size = 32; @@ -6229,7 +6226,7 @@ POWERPC_FAMILY(7445)(ObjectClass *oc, void *data) PPC_CACHE_DCBA | PPC_CACHE_DCBZ | PPC_MEM_SYNC | PPC_MEM_EIEIO | PPC_MEM_TLBIE | PPC_MEM_TLBSYNC | - PPC_MEM_TLBIA | PPC_74xx_TLB | + PPC_MEM_TLBIA | PPC_SEGMENT | PPC_EXTERN | PPC_ALTIVEC; pcc->msr_mask = (1ull << MSR_VR) | @@ -6249,7 +6246,7 @@ POWERPC_FAMILY(7445)(ObjectClass *oc, void *data) (1ull << MSR_PMM) | (1ull << MSR_RI) | (1ull << MSR_LE); -pcc->mmu_model = POWERPC_MMU_SOFT_74xx; +pcc->mmu_model = POWERPC_MMU_32B; pcc->excp_model = POWERPC_EXCP_74xx; pcc->bus_model = PPC_FLAGS_INPUT_6xx; pcc->bfd_mach = bfd_mach_ppc_7400; @@ -6345,7 +6342,6 @@ static void init_proc_7455(CPUPPCState
[RFC PATCH v2 2/4] target/ppc: Disable unused facilities in the e600 CPU
The e600 CPU is a successor of the 7448 and like all the 7450s CPUs, it has an optional software TLB feature. We have determined that there is no OS software support for the 7450 software TLB available these days. See the previous commit for more information. This patch disables the SPRs and instructions related to software TLB from the e600 CPU. No functional change intended. These facilities should be used by the OS in interrupt handlers for interrupts that QEMU never generates. Signed-off-by: Fabiano Rosas --- target/ppc/cpu_init.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c index 509df35d09..962acf295f 100644 --- a/target/ppc/cpu_init.c +++ b/target/ppc/cpu_init.c @@ -2537,9 +2537,6 @@ static void init_excp_7450(CPUPPCState *env) env->excp_vectors[POWERPC_EXCP_TRACE]= 0x0D00; env->excp_vectors[POWERPC_EXCP_PERFM]= 0x0F00; env->excp_vectors[POWERPC_EXCP_VPU] = 0x0F20; -env->excp_vectors[POWERPC_EXCP_IFTLB]= 0x1000; -env->excp_vectors[POWERPC_EXCP_DLTLB]= 0x1100; -env->excp_vectors[POWERPC_EXCP_DSTLB]= 0x1200; env->excp_vectors[POWERPC_EXCP_IABR] = 0x1300; env->excp_vectors[POWERPC_EXCP_SMI] = 0x1400; env->excp_vectors[POWERPC_EXCP_VPUA] = 0x1600; @@ -6643,7 +6640,6 @@ static void init_proc_e600(CPUPPCState *env) /* Memory management */ register_low_BATs(env); register_high_BATs(env); -register_74xx_soft_tlb(env, 128, 2); init_excp_7450(env); env->dcache_line_size = 32; env->icache_line_size = 32; @@ -6667,7 +6663,7 @@ POWERPC_FAMILY(e600)(ObjectClass *oc, void *data) PPC_CACHE_DCBA | PPC_CACHE_DCBZ | PPC_MEM_SYNC | PPC_MEM_EIEIO | PPC_MEM_TLBIE | PPC_MEM_TLBSYNC | - PPC_MEM_TLBIA | PPC_74xx_TLB | + PPC_MEM_TLBIA | PPC_SEGMENT | PPC_EXTERN | PPC_ALTIVEC; pcc->insns_flags2 = PPC_NONE; -- 2.33.1
[RFC PATCH v2 0/4] QEMU/openbios: PPC Software TLB support in the G4 family
Hi all, Recap: - QEMU enables 7450 SW TLB search by default; - OpenBIOS does not know about SW TLB (vectors 0x1000, 0x1100, 0x1200); - OpenBIOS does not know about 7450s PVRs. Proposed solutions: a) find another firmware/guest OS code that supports the feature; b) implement the switching of the feature in QEMU and have the guest code enable it only when supported. That would take some fiddling with the MMU code to: merge POWERPC_MMU_SOFT_74xx into POWERPC_MMU_32B, check the HID0[STEN] bit, figure out how to switch from HW TLB miss to SW TLB miss on demand, block access to the TLBMISS register (and others) when the feature is off, and so on; c) leave the feature enabled in QEMU and implement the software TLB miss handlers in openbios. The UM provides sample code, so this is easy; d) remove support for software TLB search for the 7450 family and switch the cpus to the POWERPC_MMU_32B model. This is by far the easiest solution, but could cause problems for any (which?) guest OS code that actually uses the feature. All of the existing code for the POWERPC_MMU_SOFT_74xx MMU model would probably be removed since it would be dead code then; v1: https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00289.html v2: This series corresponds to option d) above. - patch 1 moves all of the 7450 CPUs* into the POWERPC_MMU_32B MMU model, which does the same as the POWERPC_MMU_SOFT_74xx minus the software TLB handling. It also removes the instructions (tlbld, tlbli) and SPRs (TLBMISS, PTEHI, PTELO) from the 7450s as these facilities are only used along with the software TLB. *- except for 7448, which is seen by QEMU as a 7400. - patch 2 removes the instructions and SPRs just like above, but from the e600 CPU. The e600 already uses the POWERPC_MMU_32B MMU model, so it already has software TLB disabled. - patch 3 removes all of the now dead code for the 74xx software TLB. I left a note in the code with keywords to help with grep in case people search for the feature in the future. - patch 4 adds smoke tests for all of the 74xx CPUs. These are broken pending the OpenBIOS patch. For OpenBIOS: We'd need to merge the patch 2/2 from the previous series: https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00290.html Message ID: 2029134431.406753-3-faro...@linux.ibm.com this is just for the new PVRs. There is no need to add the handlers. Thanks! Fabiano Rosas (4): target/ppc: Disable software TLB for the 7450 family target/ppc: Disable unused facilities in the e600 CPU target/ppc: Remove the software TLB model of 7450 CPUs tests/avocado: ppc: Add smoke tests for MPC7400 and MPC7450 families target/ppc/cpu-qom.h | 6 +- target/ppc/cpu.h | 4 +- target/ppc/cpu_init.c | 57 -- target/ppc/excp_helper.c | 29 - target/ppc/helper.h | 2 - target/ppc/mmu_common.c | 19 -- target/ppc/mmu_helper.c | 31 -- target/ppc/translate.c| 26 tests/avocado/ppc_74xx.py | 123 ++ 9 files changed, 140 insertions(+), 157 deletions(-) create mode 100644 tests/avocado/ppc_74xx.py -- 2.33.1
Re: [PATCH v8 06/10] target/ppc: enable PMU instruction count
On 11/29/21 01:36, David Gibson wrote: On Thu, Nov 25, 2021 at 12:08:13PM -0300, Daniel Henrique Barboza wrote: The PMU is already counting cycles by calculating time elapsed in nanoseconds. Counting instructions is a different matter and requires another approach. This patch adds the capability of counting completed instructions (Perf event PM_INST_CMPL) by counting the amount of instructions translated in each translation block right before exiting it. A new pmu_count_insns() helper in translation.c was added to do that. After verifying that the PMU is running (MMCR0_FC bit not set), call helper_insns_inc(). This new helper from power8-pmu.c will add the instructions to the relevant counters. It'll also be responsible for triggering counter negative overflows as it is already being done with cycles. Signed-off-by: Daniel Henrique Barboza --- target/ppc/cpu.h | 1 + target/ppc/helper.h | 1 + target/ppc/helper_regs.c | 4 +++ target/ppc/power8-pmu-regs.c.inc | 6 + target/ppc/power8-pmu.c | 38 ++ target/ppc/translate.c | 46 6 files changed, 96 insertions(+) diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h index 9b41b022e2..38cd2b5c43 100644 --- a/target/ppc/cpu.h +++ b/target/ppc/cpu.h @@ -656,6 +656,7 @@ enum { HFLAGS_PR = 14, /* MSR_PR */ HFLAGS_PMCC0 = 15, /* MMCR0 PMCC bit 0 */ HFLAGS_PMCC1 = 16, /* MMCR0 PMCC bit 1 */ +HFLAGS_MMCR0FC = 17, /* MMCR0 FC bit */ Now that the event stuff is a bit more refined, you could narrow this down to specifically marking if any counters are actively counting instructions (not frozen by MMCR0[FC] and not frozen by MMCR0[FC14|FC56] *and* have the right event selected). Since I suspect the instruction counting instrumentation could be quite expensive (helper call on every tb), that might be worthwhile. That was worthwhile. The performance increase is substantial with this change, in particular with tests that exercises only cycle events. HFLAGS_VSX = 23, /* MSR_VSX if cpu has VSX */ HFLAGS_VR = 25, /* MSR_VR if cpu has VRE */ diff --git a/target/ppc/helper.h b/target/ppc/helper.h index 94b4690375..d8a23e054a 100644 --- a/target/ppc/helper.h +++ b/target/ppc/helper.h @@ -24,6 +24,7 @@ DEF_HELPER_2(store_mmcr0, void, env, tl) DEF_HELPER_2(store_mmcr1, void, env, tl) DEF_HELPER_3(store_pmc, void, env, i32, i64) DEF_HELPER_2(read_pmc, tl, env, i32) +DEF_HELPER_2(insns_inc, void, env, i32) #endif DEF_HELPER_1(check_tlb_flush_local, void, env) DEF_HELPER_1(check_tlb_flush_global, void, env) diff --git a/target/ppc/helper_regs.c b/target/ppc/helper_regs.c index 99562edd57..875c2fdfc6 100644 --- a/target/ppc/helper_regs.c +++ b/target/ppc/helper_regs.c @@ -115,6 +115,10 @@ static uint32_t hreg_compute_hflags_value(CPUPPCState *env) if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCC1) { hflags |= 1 << HFLAGS_PMCC1; } +if (env->spr[SPR_POWER_MMCR0] & MMCR0_FC) { +hflags |= 1 << HFLAGS_MMCR0FC; +} + #ifndef CONFIG_USER_ONLY if (!env->has_hv_mode || (msr & (1ull << MSR_HV))) { diff --git a/target/ppc/power8-pmu-regs.c.inc b/target/ppc/power8-pmu-regs.c.inc index 25b13ad564..580e4e41b2 100644 --- a/target/ppc/power8-pmu-regs.c.inc +++ b/target/ppc/power8-pmu-regs.c.inc @@ -113,6 +113,12 @@ static void write_MMCR0_common(DisasContext *ctx, TCGv val) */ gen_icount_io_start(ctx); gen_helper_store_mmcr0(cpu_env, val); + +/* + * End the translation block because MMCR0 writes can change + * ctx->pmu_frozen. + */ +ctx->base.is_jmp = DISAS_EXIT_UPDATE; } void spr_write_MMCR0_ureg(DisasContext *ctx, int sprn, int gprn) diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c index 01e0b9b8fc..59d0def79d 100644 --- a/target/ppc/power8-pmu.c +++ b/target/ppc/power8-pmu.c @@ -112,6 +112,30 @@ static PMUEventType pmc_get_event(CPUPPCState *env, int sprn) return evt_type; } +static bool pmu_increment_insns(CPUPPCState *env, uint32_t num_insns) +{ +bool overflow_triggered = false; +int sprn; + +/* PMC6 never counts instructions */ +for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC5; sprn++) { +if (pmc_get_event(env, sprn) != PMU_EVENT_INSTRUCTIONS) { +continue; +} + +env->spr[sprn] += num_insns; + +if (env->spr[sprn] >= PMC_COUNTER_NEGATIVE_VAL && +pmc_has_overflow_enabled(env, sprn)) { + +overflow_triggered = true; +env->spr[sprn] = PMC_COUNTER_NEGATIVE_VAL; Does the hardware PMU actually guarantee that the event will happen exactly on the overflow? Or could you count a few into the negative zone before the event is delivered? My understand reading the ISA and from testing with the a real PMU is that yes, it'll guarantee that the overflow will happen when the counter reaches exactly
Re: [PATCH 2/2] hw/mips/boston: Fix load_elf error detection
On 11/30/21 22:17, Jiaxun Yang wrote: > load_elf gives negative return in case of error, not zero. > > Fixes: 10e3f30 ("hw/mips/boston: Allow loading elf kernel and dtb") > Signed-off-by: Jiaxun Yang > --- > For 6.2. > --- > hw/mips/boston.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH 1/2] hw/mips: bootloader: Fix write_ulong
On 11/30/21 22:17, Jiaxun Yang wrote: > bl_gen_write_ulong uses sd for both 32 and 64 bit CPU, > while sd is illegal on 32 bit CPUs. > > Replace sd with sw on 32bit CPUs. > > Fixes: 3ebbf86 ("hw/mips: Add a bootloader helper") > Signed-off-by: Jiaxun Yang > --- > Should be backported to 6.0 onwards. > --- > hw/mips/bootloader.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/hw/mips/bootloader.c b/hw/mips/bootloader.c > index 6ec8314490..1f8b2b 100644 > --- a/hw/mips/bootloader.c > +++ b/hw/mips/bootloader.c > @@ -182,7 +182,11 @@ void bl_gen_write_ulong(uint32_t **p, target_ulong addr, > target_ulong val) > { > bl_gen_load_ulong(p, BL_REG_K0, val); > bl_gen_load_ulong(p, BL_REG_K1, addr); > -bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0); > +if (bootcpu_supports_isa(ISA_MIPS3)) { > +bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0); > +} else { > +bl_gen_sw(p, BL_REG_K0, BL_REG_K1, 0x0); > +} We have bl_gen_load_ulong(); having bl_gen_store_ulong() would make the API even. Mind sending a patch? Otherwise: Reviewed-by: Philippe Mathieu-Daudé > } > > void bl_gen_write_u32(uint32_t **p, target_ulong addr, uint32_t val) >
[PATCH 2/2] hw/mips/boston: Fix load_elf error detection
load_elf gives negative return in case of error, not zero. Fixes: 10e3f30 ("hw/mips/boston: Allow loading elf kernel and dtb") Signed-off-by: Jiaxun Yang --- For 6.2. --- hw/mips/boston.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/hw/mips/boston.c b/hw/mips/boston.c index 0e3cca5511..296e9fa2ad 100644 --- a/hw/mips/boston.c +++ b/hw/mips/boston.c @@ -777,14 +777,15 @@ static void boston_mach_init(MachineState *machine) exit(1); } } else if (machine->kernel_filename) { -uint64_t kernel_entry, kernel_high, kernel_size; +uint64_t kernel_entry, kernel_high; +ssize_t kernel_size; kernel_size = load_elf(machine->kernel_filename, NULL, cpu_mips_kseg0_to_phys, NULL, _entry, NULL, _high, NULL, 0, EM_MIPS, 1, 0); -if (kernel_size) { +if (kernel_size > 0) { int dt_size; g_autofree const void *dtb_file_data, *dtb_load_data; hwaddr dtb_paddr = QEMU_ALIGN_UP(kernel_high, 64 * KiB); -- 2.11.0
[PATCH 1/2] hw/mips: bootloader: Fix write_ulong
bl_gen_write_ulong uses sd for both 32 and 64 bit CPU, while sd is illegal on 32 bit CPUs. Replace sd with sw on 32bit CPUs. Fixes: 3ebbf86 ("hw/mips: Add a bootloader helper") Signed-off-by: Jiaxun Yang --- Should be backported to 6.0 onwards. --- hw/mips/bootloader.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/hw/mips/bootloader.c b/hw/mips/bootloader.c index 6ec8314490..1f8b2b 100644 --- a/hw/mips/bootloader.c +++ b/hw/mips/bootloader.c @@ -182,7 +182,11 @@ void bl_gen_write_ulong(uint32_t **p, target_ulong addr, target_ulong val) { bl_gen_load_ulong(p, BL_REG_K0, val); bl_gen_load_ulong(p, BL_REG_K1, addr); -bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0); +if (bootcpu_supports_isa(ISA_MIPS3)) { +bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0); +} else { +bl_gen_sw(p, BL_REG_K0, BL_REG_K1, 0x0); +} } void bl_gen_write_u32(uint32_t **p, target_ulong addr, uint32_t val) -- 2.11.0
[PATCH 0/2] MIPS misc fixes
Two problems caught when I was trying to add CI job for various configurations. Jiaxun Yang (2): hw/mips: bootloader: Fix write_ulong hw/mips/boston: Fix elf_load error detection hw/mips/bootloader.c | 6 +- hw/mips/boston.c | 5 +++-- 2 files changed, 8 insertions(+), 3 deletions(-) -- 2.11.0
Re: [PATCH] target/arm: Correct calculation of tlb range invalidate length
Peter Maydell writes: > The calculation of the length of TLB range invalidate operations > in tlbi_aa64_range_get_length() is incorrect in two ways: > * the NUM field is 5 bits, but we read only 4 bits > * we miscalculate the page_shift value, because of an >off-by-one error: > TG 0b00 is invalid > TG 0b01 is 4K granule size == 4096 == 2^12 > TG 0b10 is 16K granule size == 16384 == 2^14 > TG 0b11 is 64K granule size == 65536 == 2^16 >so page_shift should be (TG - 1) * 2 + 12 > > Thanks to the bug report submitter Cha HyunSoo for identifying > both these errors. > > Fixes: 84940ed82552d3c Fixes: 84940ed825 (target/arm: Add support for FEAT_TLBIRANGE) Reviewed-by: Alex Bennée -- Alex Bennée
Re: [PATCH v4] target/ppc: fix Hash64 MMU update of PTE bit R
On 30/11/2021 05:44, Cédric Le Goater wrote: It would be interesting to boot directly the PowerNV machine from a FreeBSB kernel and a minimum inirtd without using the skiroot images and an iso. Are images available ? AFAIK there are no minimum initrd images available. The closest thing would be the "bootonly" ISOs that can be found in the links below: https://download.freebsd.org/ftp/releases/powerpc/powerpc64/ISO-IMAGES/13.0/ https://download.freebsd.org/ftp/releases/powerpc/powerpc64le/ISO-IMAGES/13.0/ https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64/ISO-IMAGES/14.0/ https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64le/ISO-IMAGES/14.0/ But to avoid using skiroot, you would need to manually extract the kernel from the ISO and also specify the rootfs, using something like: "-append cd9660:cd0". If you don't want to emulate a disk or cd, the ISO can be passed to -initrd and "-append cd9660:md0" may be used to tell FreeBSD where to find the root partition (it loads it as a memory disk). There are also qcow2 snapshots available: https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64/ https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64le/ But you'll also need to extract the kernel from the image or from kernel.txz to boot them. As these images target pseries and lack a FAT32 partition, Petitboot won't be able to boot them. Alfredo (cc'ed) was trying to make them Petitboot compatible.
[Bug 1749393] Re: sbrk() not working under qemu-user with a PIE-compiled binary?
Hello Raphaël, or anyone else affected, Accepted qemu into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/qemu/1:4.2-3ubuntu6.19 in a few hours, and then in the -proposed repository. Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users. If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed- focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification- failed-focal. In either case, without details of your testing we will not be able to proceed. Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping! N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days. ** Changed in: qemu (Ubuntu Focal) Status: In Progress => Fix Committed ** Tags added: verification-needed verification-needed-focal -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1749393 Title: sbrk() not working under qemu-user with a PIE-compiled binary? Status in QEMU: Fix Released Status in qemu package in Ubuntu: Fix Released Status in qemu source package in Focal: Fix Committed Bug description: [Impact] * The current space reserved can be too small and we can end up with no space at all for BRK. It can happen to any case, but is much more likely with the now common PIE binaries. * Backport the upstream fix which reserves a bit more space while loading and giving it back after interpreter and stack is loaded. [Test Plan] * On x86 run: sudo apt install -y qemu-user-static docker.io sudo docker run --rm arm64v8/debian:bullseye bash -c 'apt update && apt install -y wget' ... Running hooks in /etc/ca-certificates/update.d... done. Errors were encountered while processing: libc-bin E: Sub-process /usr/bin/dpkg returned an error code (1) Second test from bug 1928075 $ sudo qemu-debootstrap --arch=arm64 bullseye bullseye-arm64 http://ftp.debian.org/debian In the bad case this is failing like W: Failure trying to run: /sbin/ldconfig W: See //debootstrap/debootstrap.log for detail And in that log file you'll see the segfault $ tail -n 2 bullseye-arm64/debootstrap/debootstrap.log qemu: uncaught target signal 11 (Segmentation fault) - core dumped Segmentation fault (core dumped) [Where problems could occur] * Regressions would be around use-cases of linux-user that is emulation not of a system but of binaries. Commonly uses for cross-tests and cross-builds so that is the space to watch for regressions [Other Info] * n/a --- In Debian unstable, we recently switched bash to be a PIE-compiled binary (for hardening). Unfortunately this resulted in bash being broken when run under qemu-user (for all target architectures, host being amd64 for me). $ sudo chroot /srv/chroots/sid-i386/ qemu-i386-static /bin/bash bash: xmalloc: .././shell.c:1709: cannot allocate 10 bytes (0 bytes allocated) bash has its own malloc implementation based on sbrk(): https://git.savannah.gnu.org/cgit/bash.git/tree/lib/malloc/malloc.c When we disable this internal implementation and rely on glibc's malloc, then everything is fine. But it might be that glibc has a fallback when sbrk() is not working properly and it might hide the underlying problem in qemu-user. This issue has also been reported to the bash upstream author and he suggested that the issue might be in qemu-user so I'm opening a ticket here. Here's the discussion with the bash upstream author: https://lists.gnu.org/archive/html/bug-bash/2018-02/threads.html#00080 You can find the problematic bash binary in that .deb file: http://snapshot.debian.org/archive/debian/20180206T154716Z/pool/main/b/bash/bash_4.4.18-1_i386.deb The version of qemu I have been using is 2.11 (Debian package qemu- user-static version 1:2.11+dfsg-1) but I have had reports that the problem is reproducible with older versions (back to 2.8 at least). Here are the related Debian bug reports: https://bugs.debian.org/889869 https://bugs.debian.org/865599 It's worth noting that bash used to have this problem (when compiled as a PIE binary) even when run directly but then something got fixed in the kernel and now the problem only appears when run under qemu-user:
Re: [PATCH v5 5/6] migration: Add migrate_use_tls() helper
Hello Juan, Thanks for reviewing! Best regards, Leo On Fri, Nov 12, 2021 at 8:04 AM Juan Quintela wrote: > Leonardo Bras wrote: > > A lot of places check parameters.tls_creds in order to evaluate if TLS is > > in use, and sometimes call migrate_get_current() just for that test. > > > > Add new helper function migrate_use_tls() in order to simplify testing > > for TLS usage. > > > > Signed-off-by: Leonardo Bras > > Reviewed-by: Juan Quintela > >
Re: Odd square bracket encoding in QOM names
On 30/11/2021 16:41, Peter Maydell wrote: On Tue, 30 Nov 2021 at 08:36, Mark Cave-Ayland wrote: Has there been a recent change as to how square brackets are encoded within QOM names? I noticed that the output has changed here in the "info qom-tree" output in qemu-system-m68k for the q800 machine. The q800 machine has a set of 256 memory region aliases that used to appear in the "info qom-tree" output as: /mac_m68k.io[100] (memory-region) /mac_m68k.io[101] (memory-region) /mac_m68k.io[102] (memory-region) but they now appear as: /mac_m68k.io\x5b100\x5d[0] (memory-region) /mac_m68k.io\x5b101\x5d[0] (memory-region) /mac_m68k.io\x5b102\x5d[0] (memory-region) I looked at info qom-tree for an Arm machine, and the [..] seem to be OK there. I tried to test with q800 but got stuck on finding a command line to get it to run. Do you have repro instructions? A couple of tests this evening and I think I must have misremembered this - I tried a couple of older versions of my various q800 branches (one within the 6.0 dev cycle and another within 6.1) and both escape the QOM object name in "info qom-tree" the same way as above. Easiest way to see this is to grab Finn's kernel from issue #611 like this: $ wget https://gitlab.com/qemu-project/qemu/uploads/dead759323116fb22cf0f03b8cdbe15a/vmlinux-5.14-multi.xz $ unxz vmlinux-5.14-multi.xz And then start QEMU with this command line: $ ./qemu-system-m68k -M q800 -kernel vmlinux-5.14-multi -monitor stdio -S Obviously the cause is the use of square brackets within the memory region name as per https://gitlab.com/qemu-project/qemu/-/blob/master/hw/m68k/q800.c#L411, so given the escaping has been like this for some time then I guess everything is working as intended, and sorry for the noise. ATB, Mark.
Re: Follow-up on the CXL discussion at OFTC
On 21-11-30 13:09:56, Jonathan Cameron wrote: > On Mon, 29 Nov 2021 18:28:43 + > Alex Bennée wrote: > > > Ben Widawsky writes: > > > > > On 21-11-26 12:08:08, Alex Bennée wrote: > > >> > > >> Ben Widawsky writes: > > >> > > >> > On 21-11-19 02:29:51, Shreyas Shah wrote: > > >> >> Hi Ben > > >> >> > > >> >> Are you planning to add the CXL2.0 switch inside QEMU or already > > >> >> added in one of the version? > > >> >> > > >> > > > >> > From me, there are no plans for QEMU anything until/unless upstream > > >> > thinks it > > >> > will merge the existing patches, or provide feedback as to what it > > >> > would take to > > >> > get them merged. If upstream doesn't see a point in these patches, > > >> > then I really > > >> > don't see much value in continuing to further them. Once hardware > > >> > comes out, the > > >> > value proposition is certainly less. > > >> > > >> I take it: > > >> > > >> Subject: [RFC PATCH v3 00/31] CXL 2.0 Support > > >> Date: Mon, 1 Feb 2021 16:59:17 -0800 > > >> Message-Id: <20210202005948.241655-1-ben.widaw...@intel.com> > > >> > > >> is the current state of the support? I saw there was a fair amount of > > >> discussion on the thread so assumed there would be a v4 forthcoming at > > >> some point. > > > > > > Hi Alex, > > > > > > There is a v4, however, we never really had a solid plan for the primary > > > issue > > > which was around handling CXL memory expander devices properly (both from > > > an > > > interleaving standpoint as well as having a device which hosts multiple > > > memory > > > capacities, persistent and volatile). I didn't feel it was worth sending > > > a v4 > > > unless someone could say > > > > > > 1. we will merge what's there and fix later, or > > > 2. you must have a more perfect emulation in place, or > > > 3. we want to see usages for a real guest > > > > I think 1. is acceptable if the community is happy there will be ongoing > > development and it's not just a code dump. Given it will have a > > MAINTAINERS entry I think that is demonstrated. > > My thought is also 1. There are a few hacks we need to clean out but > nothing that should take too long. I'm sure it'll take a rev or two more. > Right now for example, I've added support to arm-virt and maybe need to > move that over to a different machine model... > The most annoying thing about rebasing it is passing the ACPI tests. They keep changing upstream. Being able to at least merge up to there would be huge. > > > > What's the current use case? Testing drivers before real HW comes out? > > Will it still be useful after real HW comes out for people wanting to > > debug things without HW? > > CXL is continuing to expand in scope and capabilities and I don't see that > reducing any time soon (My guess is 3 years+ to just catch up with what is > under discussion today). So I see two long term use cases: > > 1) Automated verification that we haven't broken things. I suspect no > one person is going to have a test farm covering all the corner cases. > So we'll need emulation + firmware + kernel based testing. > Does this exist in other forms? AFAICT for x86, there isn't much example of this. > 2) New feature prove out. We have already used it for some features that > will appear in the next spec version. Obviously I can't say what or > send that code out yet. Its very useful and the spec draft has changed > in various ways a result. I can't commit others, but Huawei will be > doing more of this going forwards. For that we need a stable base to > which we add the new stuff once spec publication allows it. > I can't commit for Intel but I will say there's more latitude now to work on projects like this compared to when I first wrote the patches. I have interesting in continuing to develop this as well. I'm very interested in supporting interleave and hotplug specifically. > > > > > > > > I had hoped we could merge what was there mostly as is and fix it up as > > > we go. > > > It's useful in the state it is now, and as time goes on, we find more > > > usecases > > > for it in a VMM, and not just driver development. > > > > > >> > > >> Adding new subsystems to QEMU does seem to be a pain point for new > > >> contributors. Patches tend to fall through the cracks of existing > > >> maintainers who spend most of their time looking at stuff that directly > > >> touches their files. There is also a reluctance to merge large chunks of > > >> functionality without an identified maintainer (and maybe reviewers) who > > >> can be the contact point for new patches. So in short you need: > > >> > > >> - Maintainer Reviewed-by/Acked-by on patches that touch other > > >> sub-systems > > > > > > This is the challenging one. I have Cc'd the relevant maintainers (hw/pci > > > and > > > hw/mem are the two) in the past, but I think there interest is lacking > > > (and > > > reasonably so, it is an entirely different subsystem). > > >
Re: [PATCH] target/arm: Correct calculation of tlb range invalidate length
On 11/30/21 6:32 PM, Peter Maydell wrote: The calculation of the length of TLB range invalidate operations in tlbi_aa64_range_get_length() is incorrect in two ways: * the NUM field is 5 bits, but we read only 4 bits * we miscalculate the page_shift value, because of an off-by-one error: TG 0b00 is invalid TG 0b01 is 4K granule size == 4096 == 2^12 TG 0b10 is 16K granule size == 16384 == 2^14 TG 0b11 is 64K granule size == 65536 == 2^16 so page_shift should be (TG - 1) * 2 + 12 Thanks to the bug report submitter Cha HyunSoo for identifying both these errors. Fixes: 84940ed82552d3c Resolves: https://gitlab.com/qemu-project/qemu/-/issues/734 Signed-off-by: Peter Maydell --- Not marked for-6.2 because this isn't a regression: we shipped the TLBI range invalidate support in 6.1. I have no repro case for this issue, but this doesn't break booting an aarch64 kernel, at least. --- Reviewed-by: Richard Henderson r~
[PATCH] target/arm: Correct calculation of tlb range invalidate length
The calculation of the length of TLB range invalidate operations in tlbi_aa64_range_get_length() is incorrect in two ways: * the NUM field is 5 bits, but we read only 4 bits * we miscalculate the page_shift value, because of an off-by-one error: TG 0b00 is invalid TG 0b01 is 4K granule size == 4096 == 2^12 TG 0b10 is 16K granule size == 16384 == 2^14 TG 0b11 is 64K granule size == 65536 == 2^16 so page_shift should be (TG - 1) * 2 + 12 Thanks to the bug report submitter Cha HyunSoo for identifying both these errors. Fixes: 84940ed82552d3c Resolves: https://gitlab.com/qemu-project/qemu/-/issues/734 Signed-off-by: Peter Maydell --- Not marked for-6.2 because this isn't a regression: we shipped the TLBI range invalidate support in 6.1. I have no repro case for this issue, but this doesn't break booting an aarch64 kernel, at least. --- target/arm/helper.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/target/arm/helper.c b/target/arm/helper.c index 9b317899a66..db837d53bd9 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -4519,18 +4519,18 @@ static uint64_t tlbi_aa64_range_get_length(CPUARMState *env, uint64_t exponent; uint64_t length; -num = extract64(value, 39, 4); +num = extract64(value, 39, 5); scale = extract64(value, 44, 2); page_size_granule = extract64(value, 46, 2); -page_shift = page_size_granule * 2 + 12; - if (page_size_granule == 0) { qemu_log_mask(LOG_GUEST_ERROR, "Invalid page size granule %d\n", page_size_granule); return 0; } +page_shift = (page_size_granule - 1) * 2 + 12; + exponent = (5 * scale) + 1; length = (num + 1) << (exponent + page_shift); -- 2.25.1
[PATCH] scripts/entitlement.sh: Use backward-compatible cp flags
Older versions of Mac OS X do not support cp -a. The cp man page indicates that -a is equivalent to -pPR. Signed-off-by: Evan Miller --- scripts/entitlement.sh.orig +++ scripts/entitlement.sh @@ -15,7 +15,7 @@ if $in_place; then trap 'rm "$DST.tmp"' exit - cp -af "$SRC" "$DST.tmp" + cp -pPRf "$SRC" "$DST.tmp" SRC="$DST.tmp" else cd "$MESON_INSTALL_DESTDIR_PREFIX"
Re: Odd square bracket encoding in QOM names
On Tue, 30 Nov 2021 at 08:36, Mark Cave-Ayland wrote: > Has there been a recent change as to how square brackets are encoded within > QOM > names? I noticed that the output has changed here in the "info qom-tree" > output in > qemu-system-m68k for the q800 machine. > > The q800 machine has a set of 256 memory region aliases that used to appear > in the > "info qom-tree" output as: > > /mac_m68k.io[100] (memory-region) > /mac_m68k.io[101] (memory-region) > /mac_m68k.io[102] (memory-region) > > but they now appear as: > > /mac_m68k.io\x5b100\x5d[0] (memory-region) > /mac_m68k.io\x5b101\x5d[0] (memory-region) > /mac_m68k.io\x5b102\x5d[0] (memory-region) I looked at info qom-tree for an Arm machine, and the [..] seem to be OK there. I tried to test with q800 but got stuck on finding a command line to get it to run. Do you have repro instructions? thanks -- PMM
Re: [PATCH v3 2/3] test/tcg/ppc64le: test mtfsf
Hello Lucas, On 11/24/21 18:25, Lucas Mateus Castro (alqotel) wrote: Added tests for the mtfsf to check if FI bit of FPSCR is being set and if exception calls are being made correctly. Signed-off-by: Lucas Mateus Castro (alqotel) Could you please rebase on mainline and resend ? Thanks, C. --- tests/tcg/ppc64/Makefile.target | 1 + tests/tcg/ppc64le/Makefile.target | 1 + tests/tcg/ppc64le/mtfsf.c | 61 +++ 3 files changed, 63 insertions(+) create mode 100644 tests/tcg/ppc64le/mtfsf.c diff --git a/tests/tcg/ppc64/Makefile.target b/tests/tcg/ppc64/Makefile.target index 6ab7934fdf..8f4c7ac4ed 100644 --- a/tests/tcg/ppc64/Makefile.target +++ b/tests/tcg/ppc64/Makefile.target @@ -11,6 +11,7 @@ endif bcdsub: CFLAGS += -mpower8-vector PPC64_TESTS += byte_reverse +PPC64_TESTS += mtfsf ifneq ($(DOCKER_IMAGE)$(CROSS_CC_HAS_POWER10),) run-byte_reverse: QEMU_OPTS+=-cpu POWER10 run-plugin-byte_reverse-with-%: QEMU_OPTS+=-cpu POWER10 diff --git a/tests/tcg/ppc64le/Makefile.target b/tests/tcg/ppc64le/Makefile.target index 5e65b1590d..b8cd9bf73a 100644 --- a/tests/tcg/ppc64le/Makefile.target +++ b/tests/tcg/ppc64le/Makefile.target @@ -10,6 +10,7 @@ endif bcdsub: CFLAGS += -mpower8-vector PPC64LE_TESTS += byte_reverse +PPC64LE_TESTS += mtfsf ifneq ($(DOCKER_IMAGE)$(CROSS_CC_HAS_POWER10),) run-byte_reverse: QEMU_OPTS+=-cpu POWER10 run-plugin-byte_reverse-with-%: QEMU_OPTS+=-cpu POWER10 diff --git a/tests/tcg/ppc64le/mtfsf.c b/tests/tcg/ppc64le/mtfsf.c new file mode 100644 index 00..b3d31f3637 --- /dev/null +++ b/tests/tcg/ppc64le/mtfsf.c @@ -0,0 +1,61 @@ +#include +#include +#include +#include + +#define FPSCR_VE 7 /* Floating-point invalid operation exception enable */ +#define FPSCR_VXSOFT 10 /* Floating-point invalid operation exception (soft) */ +#define FPSCR_FI 17 /* Floating-point fraction inexact */ + +#define FP_VE (1ull << FPSCR_VE) +#define FP_VXSOFT (1ull << FPSCR_VXSOFT) +#define FP_FI (1ull << FPSCR_FI) + +void sigfpe_handler(int sig, siginfo_t *si, void *ucontext) +{ +if (si->si_code == FPE_FLTINV) { +exit(0); +} +exit(1); +} + +int main(void) +{ +union { +double d; +long long ll; +} fpscr; + +struct sigaction sa = { +.sa_sigaction = sigfpe_handler, +.sa_flags = SA_SIGINFO +}; + +/* + * Enable the MSR bits F0 and F1 to enable exceptions. + * This shouldn't be needed in linux-user as these bits are enabled by + * default, but this allows to execute either in a VM or a real machine + * to compare the behaviors. + */ +prctl(PR_SET_FPEXC, PR_FP_EXC_PRECISE); + +/* First test if the FI bit is being set correctly */ +fpscr.ll = FP_FI; +__builtin_mtfsf(0b, fpscr.d); +fpscr.d = __builtin_mffs(); +assert((fpscr.ll & FP_FI) != 0); + +/* Then test if the deferred exception is being called correctly */ +sigaction(SIGFPE, , NULL); + +/* + * Although the VXSOFT exception has been chosen, based on test in a Power9 + * any combination of exception bit + its enabling bit should work. + * But if a different exception is chosen si_code check should + * change accordingly. + */ +fpscr.ll = FP_VE | FP_VXSOFT; +__builtin_mtfsf(0b, fpscr.d); + +return 1; +}
Re: why is iommu_platform set to off by default?
On Tue, Nov 30, 2021 at 02:32:49PM +, Peter Maydell wrote: > I've just spent a day or so trying to track down why PCI passthrough > of a virtio-blk-pci device wasn't working. The problem turns out to be > that by default virtio pci devices don't use the IOMMU, even when the > machine model has created an IOMMU and arranged for the PCI bus to > be underneath it. So when the L2 guest tries to program the virtio device, > the virtio device treats the IPAs it writes as if they were PAs and > of course the data structures it's looking for aren't there. > > Why do we default this to 'off'? It seems pretty unhelpful not to > honour the existence of the IOMMU, and the failure mode is pretty > opaque (L2 guest just hangs)... Historically VIRTIO used guest physical addresses instead of bus addresses (IOVA). I think when IOMMU support was added, a QEMU -device virtio-* parameter was added and it's simply off by default: iommu_platform= - on/off (default: false) Maybe this behavior is for backwards compatibility? Existing guests with IOMMUs shouldn't change behavior, although this could be fixed with machine type compat properties. Stefan signature.asc Description: PGP signature
Re: [PATCH v1 0/8] virtio-mem: Support "prealloc=on"
On 11/30/21 11:41, David Hildenbrand wrote: > Based-on: <20211130092838.24224-1-da...@redhat.com> > > Patch #1 - #7 are fully reviewed [1] but did not get picked up yet, so I'm > sending them along here, as they are required to use os_mem_prealloc() in > a safe way once the VM is running. > > Support preallocation of memory to make virtio-mem safe to use with > scarce memory resources such as hugetlb. Before acknowledging a plug > request from the guest, we'll try preallcoating memory. If that fails, > we'll fail the request gracefully and warn the usr once. > > To fully support huge pages for shared memory, we'll have to adjust shared > memory users, such as virtiofsd, to map guest memory via MAP_NORESERVE as > well, because otherwise, they'll end up overwriting the "reserve=off" > decision made by QEMU and try reserving huge pages for the sparse memory > region. > > In the future, we might want to process guest requests, including > preallocating memory, asynchronously via a dedicated iothread. > > [1] https://lkml.kernel.org/r/20211004120208.7409-1-da...@redhat.com > > Cc: Paolo Bonzini > Cc: "Michael S. Tsirkin" > Cc: Igor Mammedov > Cc: Eduardo Habkost > Cc: Dr. David Alan Gilbert > Cc: Daniel P. Berrangé > Cc: Gavin Shan > Cc: Hui Zhu > Cc: Sebastien Boeuf > Cc: Pankaj Gupta > Cc: Michal Prívozník > > David Hildenbrand (8): > util/oslib-posix: Let touch_all_pages() return an error > util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() > util/oslib-posix: Introduce and use MemsetContext for > touch_all_pages() > util/oslib-posix: Don't create too many threads with small memory or > little pages > util/oslib-posix: Avoid creating a single thread with > MADV_POPULATE_WRITE > util/oslib-posix: Support concurrent os_mem_prealloc() invocation > util/oslib-posix: Forward SIGBUS to MCE handler under Linux > virtio-mem: Support "prealloc=on" option > > hw/virtio/virtio-mem.c | 39 +- > include/hw/virtio/virtio-mem.h | 4 + > include/qemu/osdep.h | 7 + > softmmu/cpus.c | 4 + > util/oslib-posix.c | 231 + > 5 files changed, 226 insertions(+), 59 deletions(-) > Reviewed-by: Michal Privoznik Michal
Re: [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma)
On 11/10/2021 2:48 AM, Zheng Chuan wrote: > > Hi, steve > > On 2021/8/11 1:06, Alex Williamson wrote: >> On Fri, 6 Aug 2021 14:43:53 -0700 >> Steve Sistare wrote: >> [...] >>> +static int >>> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp) >>> +{ >>> +MemoryRegion *mr = section->mr; >>> +VFIOContainer *container = handle; >>> +const char *name = memory_region_name(mr); >>> +ram_addr_t size = int128_get64(section->size); >>> +hwaddr offset, iova, roundup; >>> +void *vaddr; >>> + >>> +if (vfio_listener_skipped_section(section) || >>> memory_region_is_iommu(mr)) { >>> +return 0; >>> +} >>> + >>> +offset = section->offset_within_address_space; >>> +iova = REAL_HOST_PAGE_ALIGN(offset); > We should not do remap if it shares on host page with other structures. > I think a judgement like int128_ge((int128_make64(iova), llend)) in > vfio_listener_region_add() should be also added here to check it, > otherwise it will remap no-exit dma which causes the live update failure. > diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c > index 0981d31..d231841 100644 > --- a/hw/vfio/cpr.c > +++ b/hw/vfio/cpr.c > @@ -58,13 +58,21 @@ vfio_region_remap(MemoryRegionSection *section, void > *handle, Error **errp) > ram_addr_t size = int128_get64(section->size); > hwaddr offset, iova, roundup; > void *vaddr; > - > +Int128 llend; > + > if (vfio_listener_skipped_section(section) || > memory_region_is_iommu(mr)) { > return 0; > } > > offset = section->offset_within_address_space; > iova = REAL_HOST_PAGE_ALIGN(offset); > +llend = int128_make64(section->offset_within_address_space); > +llend = int128_add(llend, section->size); > +llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask)); > +if (int128_ge(int128_make64(iova), llend)) { > +return 0; > +} > + > roundup = iova - offset; > size -= roundup; > size = REAL_HOST_PAGE_ALIGN(size); > >>> +roundup = iova - offset; >>> +size -= roundup; >>> +size = REAL_HOST_PAGE_ALIGN(size); >>> +vaddr = memory_region_get_ram_ptr(mr) + >>> +section->offset_within_region + roundup; >>> + >>> +trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, >>> vaddr); >>> +return vfio_dma_map_vaddr(container, iova, size, vaddr, errp); >>> +} Thank you Zheng. I intended to implement the logic you suggest, using 64-bit arithmetic, but I botched it. This should do the trick: diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c index df334d9..bbdeaea 100644 --- a/hw/vfio/cpr.c +++ b/hw/vfio/cpr.c @@ -66,8 +66,8 @@ vfio_region_remap(MemoryRegionSection *section, void *handle, offset = section->offset_within_address_space; iova = REAL_HOST_PAGE_ALIGN(offset); roundup = iova - offset; -size -= roundup; -size = REAL_HOST_PAGE_ALIGN(size); +size -= roundup;/* adjust for starting alignment */ +size &= qemu_real_host_page_mask; /* adjust for ending alignment */ end = iova + size; if (iova >= end) { return 0; - Steve
Re: [PATCH v7 0/3] support dirty restraint on vCPU
On 11/30/21 22:57, Hyman Huang wrote: 1. Start vm with kernel+initrd.img with qemu command line as following: [root@Hyman_server1 fast_qemu]# cat vm.sh #!/bin/bash /usr/bin/qemu-system-x86_64 \ -display none -vga none \ -name guest=simple_vm,debug-threads=on \ -monitor stdio \ -machine pc-i440fx-2.12 \ -accel kvm,dirty-ring-size=65536 -cpu host \ -kernel /home/work/fast_qemu/vmlinuz-5.13.0-rc4+ \ -initrd /home/work/fast_qemu/initrd-stress.img \ -append "noapic edd=off printk.time=1 noreplace-smp cgroup_disable=memory pci=noearly console=ttyS0 debug ramsize=1500 ratio=1 sleep=1" \ -chardev file,id=charserial0,path=/var/log/vm_console.log \ -serial chardev:charserial0 \ -qmp unix:/tmp/qmp-sock,server,nowait \ -D /var/log/vm.log \ --trace events=/home/work/fast_qemu/events \ -m 4096 -smp 2 -device sga 2. Enable the dirtylimit trace event which will output to /var/log/vm.log [root@Hyman_server1 fast_qemu]# cat /home/work/fast_qemu/events dirtylimit_state_init dirtylimit_vcpu dirtylimit_impose dirtyrate_do_calculate_vcpu 3. Connect the qmp server with low level qmp client and set-dirty-limit [root@Hyman_server1 my_qemu]# python3.6 ./scripts/qmp/qmp-shell -v -p /tmp/qmp-sock Welcome to the QMP low-level shell! Connected to QEMU 6.1.92 (QEMU) set-dirty-limit cpu-index=1 dirty-rate=400 { "arguments": { "cpu-index": 1, "dirty-rate": 400 }, "execute": "set-dirty-limit" } 4. observe the vcpu current dirty rate and quota dirty rate... [root@Hyman_server1 ~]# tail -f /var/log/vm.log dirtylimit_state_init dirtylimit state init: max cpus 2 dirtylimit_vcpu CPU[1] set quota dirtylimit 400 dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 0, percentage 0 dirtyrate_do_calculate_vcpu vcpu[0]: 1075 MB/s dirtyrate_do_calculate_vcpu vcpu[1]: 1061 MB/s dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 1061, percentage 62 dirtyrate_do_calculate_vcpu vcpu[0]: 1133 MB/s dirtyrate_do_calculate_vcpu vcpu[1]: 380 MB/s dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 380, percentage 57 dirtyrate_do_calculate_vcpu vcpu[0]: 1227 MB/s dirtyrate_do_calculate_vcpu vcpu[1]: 464 MB/s We can observe that vcpu-1's dirtyrate is about 400MB/s with dirty page limit set and the vcpu-0 is not affected. 5. observe the vm stress info... [root@Hyman_server1 fast_qemu]# tail -f /var/log/vm_console.log [ 0.838051] Run /init as init process [ 0.839216] with arguments: [ 0.840153] /init [ 0.840882] with environment: [ 0.841884] HOME=/ [ 0.842649] TERM=linux [ 0.843478] edd=off [ 0.844233] ramsize=1500 [ 0.845079] ratio=1 [ 0.845829] sleep=1 /init (1): INFO: RAM 1500 MiB across 2 CPUs, ratio 1, sleep 1 us [ 1.158011] random: init: uninitialized urandom read (4096 bytes read) [ 1.448205] random: init: uninitialized urandom read (4096 bytes read) /init (1): INFO: 1638282593684ms copied 1 GB in 00729ms /init (00110): INFO: 1638282593964ms copied 1 GB in 00719ms /init (1): INFO: 1638282594405ms copied 1 GB in 00719ms /init (00110): INFO: 1638282594677ms copied 1 GB in 00713ms /init (1): INFO: 1638282595093ms copied 1 GB in 00686ms /init (00110): INFO: 1638282595339ms copied 1 GB in 00662ms /init (1): INFO: 1638282595764ms copied 1 GB in 00670m PS: the kernel and initrd images comes from: kernel image: vmlinuz-5.13.0-rc4+, normal centos vmlinuz copied from /boot directory initrd.img: initrd-stress.img, only contains a stress binary, which compiled from qemu source tests/migration/stress.c and run as init in vm. you can view README.md file of my project "met"(https://github.com/newfriday/met) to compile the initrd-stress.img. :) On 11/30/21 20:57, Peter Xu wrote: On Tue, Nov 30, 2021 at 06:28:10PM +0800, huang...@chinatelecom.cn wrote: From: Hyman Huang(黄勇) The patch [2/3] has not been touched so far. Any corrections and suggetions are welcome. I played with it today, but the vcpu didn't got throttled as expected. What I did was starting two workload with 500mb/s, each pinned on one vcpu thread: [root@fedora ~]# pgrep -fa mig_mon 595 ./mig_mon mm_dirty 1000 500 sequential 604 ./mig_mon mm_dirty 1000 500 sequential [root@fedora ~]# taskset -pc 595 pid 595's current affinity list: 2 [root@fedora ~]# taskset -pc 604 pid 604's current affinity list: 3 Then start throttle with 100mb/s: (QEMU) set-dirty-limit cpu-index=3 dirty-rate=100 {"return": {}} (QEMU) set-dirty-limit cpu-index=2 dirty-rate=100 {"return": {}} I can see the workload dropped a tiny little bit (perhaps 500mb -> 499mb), then it keeps going.. The test step above i listed assume that dirtyrate calculated by dirtylimit_calc_func via dirty-ring is accurate, which differ from your test policy. The macro DIRTYLIMIT_CALC_TIME_MS used as calculation period in migration/dirtyrate.c has a big affect on result. So "how we
Re: [PATCH v1 3/3] virtio-mem: Set "unplugged-inaccessible=auto" for the 6.2 machine on x86
On 30.11.21 16:34, Michal Prívozník wrote: > On 11/30/21 10:28, David Hildenbrand wrote: >> Set the new default to "auto", keeping it set to "on" for compat > > I believe you wanted to say: keeping it set to "off", because that's > what the patch does. Right, I switched semantics from a "legacy-guest-support=on/off/auto" to "unplugged-inaccessible=on/off/auto" Thanks a lot for the fast review!!! -- Thanks, David / dhildenb
Re: [PATCH 2/2] hw/core/loader: workaround read() size limit.
Hi Philippe, On Thu, Nov 11, 2021 at 05:04:56PM +, Jamie Iles wrote: > On Thu, Nov 11, 2021 at 04:55:35PM +0100, Philippe Mathieu-Daudé wrote: > > On 11/11/21 16:43, Philippe Mathieu-Daudé wrote: > > > On 11/11/21 16:36, Jamie Iles wrote: > > >> Hi Philippe, > > >> > > >> On Thu, Nov 11, 2021 at 03:55:48PM +0100, Philippe Mathieu-Daudé wrote: > > >>> Hi Jamie, > > >>> > > >>> On 11/11/21 15:11, Jamie Iles wrote: > > On Linux, read() will only ever read a maximum of 0x7000 bytes > > regardless of what is asked. If the file is larger than 0x7000 > > bytes the read will need to be broken up into multiple chunks. > > > > Cc: Luc Michel > > Signed-off-by: Jamie Iles > > --- > > hw/core/loader.c | 40 ++-- > > 1 file changed, 34 insertions(+), 6 deletions(-) > > > > diff --git a/hw/core/loader.c b/hw/core/loader.c > > index 348bbf535bd9..16ca9b99cf0f 100644 > > --- a/hw/core/loader.c > > +++ b/hw/core/loader.c > > @@ -80,6 +80,34 @@ int64_t get_image_size(const char *filename) > > return size; > > } > > > > +static ssize_t read_large(int fd, void *dst, size_t len) > > +{ > > +/* > > + * man 2 read says: > > + * > > + * On Linux, read() (and similar system calls) will transfer at > > most > > + * 0x7000 (2,147,479,552) bytes, returning the number of bytes > > >>> > > >>> Could you mention MAX_RW_COUNT from linux/fs.h? > > >>> > > + * actually transferred. (This is true on both 32-bit and 64-bit > > + * systems.) > > >>> > > >>> Maybe "This is true for both ILP32 and LP64 data models used by Linux"? > > >>> (because that would not be the case for the ILP64 model). > > >>> > > >>> Otherwise s/systems/Linux variants/? > > >>> > > + * > > + * So read in chunks no larger than 0x7000 bytes. > > + */ > > +size_t max_chunk_size = 0x7000; > > >>> > > >>> We can declare it static const. > > >> > > >> Ack, can fix all of those up. > > >> > > +size_t offset = 0; > > + > > +while (offset < len) { > > +size_t chunk_len = MIN(max_chunk_size, len - offset); > > +ssize_t br = read(fd, dst + offset, chunk_len); > > + > > +if (br < 0) { > > +return br; > > +} > > +offset += br; > > +} > > + > > +return (ssize_t)len; > > +} > > >>> > > >>> I see other read()/pread() calls: > > >>> > > >>> hw/9pfs/9p-local.c:472:tsize = read(fd, (void *)buf, bufsz); > > >>> hw/vfio/common.c:269:if (pread(vbasedev->fd, , size, > > >>> region->fd_offset + addr) != size) { > > >>> ... > > >>> > > >>> Maybe the read_large() belongs to "sysemu/os-xxx.h"? > > >> > > >> I think util/osdep.c would be a good fit for this. To make sure we're > > > > > > Yes. > > > > > >> on the same page though are you proposing converting all pread/read > > >> calls to a qemu variant or auditing for ones that could potentially take > > >> a larger size? > > > > > > Yes, I took some time wondering beside loading blob in guest memory, > > > what would be the other issues you might encounter. I couldn't find > > > many cases. Eventually hw/vfio/. I haven't audit much, only noticed > > > hw/9pfs/9p-local.c and qga/commands-*.c (not sure if relevant), but > > > since we want to fix this, I'd rather try to fix it globally. > > > > Actually what you suggest is simpler, add qemu_read() / qemu_pread() > > in util/osdep.c, convert all uses without caring about any audit. > > Okay, this hasn't worked out too badly - I'll do the same for > write/pwrite too and then switch all of the callers over with a > coccinelle patch so it'll be a fairly large diff but simple. > > We could elect to keep any calls with a compile-time constant length > with the unwrapped variants but I think that's probably more confusing > in the long-run. Coming back to this I think this is probably a non-starter because of non-blocking file descriptors. There is already a qemu_write_full so I'm inclined to add qemu_read_full following the same pattern and then convert all of the read calls in the loader to use that. Thanks, Jamie
Re: [PATCH v1 2/3] virtio-mem: Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE
On 11/30/21 10:28, David Hildenbrand wrote: > With VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we signal the VM that reading > unplugged memory is not supported. We have to fail feature negotiation > in case the guest does not support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE. > > First, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE is required to properly handle > memory backends (or architectures) without support for the shared zeropage > in the hypervisor cleanly. Without the shared zeropage, even reading an > unpopulated virtual memory location can populate real memory and > consequently consume memory in the hypervisor. We have a guaranteed shared > zeropage only on MAP_PRIVATE anonymous memory. > > Second, we want VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE to be the default > long-term as even populating the shared zeropage can be problematic: for > example, without THP support (possible) or without support for the shared > huge zeropage with THP (unlikely), the PTE page tables to hold the shared > zeropage entries can consume quite some memory that cannot be reclaimed > easily. > > Third, there are other optimizations+features (e.g., protection of > unplugged memory, reducing the total memory slot size and bitmap sizes) > that will require VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE. > > We really only support x86 targets with virtio-mem for now (and > Linux similarly only support x86), but that might change soon, so prepare > for different targets already. > > Add a new "unplugged-inaccessible" tristate property for x86 targets: > - "off" will keep VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE unset and legacy > guests working. > - "on" will set VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE and stop legacy guests > from using the device. > - "auto" selects the default based on support for the shared zeropage. > > Warn in case the property is set to "off" and we don't have support for the > shared zeropage. > > For existing compat machines, the property will default to "off", to > not change the behavior but eventually warn about a problematic setup. > Short-term, we'll set the property default to "auto" for new QEMU machines. > Mid-term, we'll set the property default to "on" for new QEMU machines. > Long-term, we'll deprecate the parameter and disallow legacy > guests completely. > > The property has to match on the migration source and destination. "auto" > will result in the same VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE setting as long > as the qemu command line (esp. memdev) match -- so "auto" is good enough > for migration purposes and the parameter doesn't have to be migrated > explicitly. > > Signed-off-by: David Hildenbrand > --- > hw/virtio/virtio-mem.c | 63 ++ > include/hw/virtio/virtio-mem.h | 8 + > 2 files changed, 71 insertions(+) > > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c > index d5a578142b..1e57156e81 100644 > --- a/hw/virtio/virtio-mem.c > +++ b/hw/virtio/virtio-mem.c > @@ -32,6 +32,14 @@ > #include CONFIG_DEVICES > #include "trace.h" > > +/* > + * We only had legacy x86 guests that did not support > + * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE. Other targets don't have legacy > guests. > + */ > +#if defined(TARGET_X86_64) || defined(TARGET_I386) > +#define VIRTIO_MEM_HAS_LEGACY_GUESTS > +#endif > + > /* > * Let's not allow blocks smaller than 1 MiB, for example, to keep the > tracking > * bitmap small. > @@ -110,6 +118,19 @@ static uint64_t virtio_mem_default_block_size(RAMBlock > *rb) > return MAX(page_size, VIRTIO_MEM_MIN_BLOCK_SIZE); > } > > +#if defined(VIRTIO_MEM_HAS_LEGACY_GUESTS) > +static bool virtio_mem_has_shared_zeropage(RAMBlock *rb) > +{ > +/* > + * We only have a guaranteed shared zeropage on ordinary MAP_PRIVATE > + * anonymous RAM. In any other case, reading unplugged *can* populate a > + * fresh page, consuming actual memory. > + */ > +return !qemu_ram_is_shared(rb) && rb->fd < 0 && > + qemu_ram_pagesize(rb) == qemu_real_host_page_size; > +} > +#endif /* VIRTIO_MEM_HAS_LEGACY_GUESTS */ > + > /* > * Size the usable region bigger than the requested size if possible. Esp. > * Linux guests will only add (aligned) memory blocks in case they fully > @@ -653,15 +674,29 @@ static uint64_t virtio_mem_get_features(VirtIODevice > *vdev, uint64_t features, > Error **errp) > { > MachineState *ms = MACHINE(qdev_get_machine()); > +VirtIOMEM *vmem = VIRTIO_MEM(vdev); > > if (ms->numa_state) { > #if defined(CONFIG_ACPI) > virtio_add_feature(, VIRTIO_MEM_F_ACPI_PXM); > #endif > } > +assert(vmem->unplugged_inaccessible != ON_OFF_AUTO_AUTO); > +if (vmem->unplugged_inaccessible == ON_OFF_AUTO_ON) { > +virtio_add_feature(, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE); > +} > return features; > } > > +static int virtio_mem_validate_features(VirtIODevice *vdev) > +{ > +if (virtio_host_has_feature(vdev,
Re: [PATCH v1 1/3] linux-headers: sync VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE
On 11/30/21 10:28, David Hildenbrand wrote: > TODO: replace by a proper header sync. > > Signed-off-by: David Hildenbrand > --- > include/standard-headers/linux/virtio_mem.h | 9 ++--- > 1 file changed, 6 insertions(+), 3 deletions(-) > > diff --git a/include/standard-headers/linux/virtio_mem.h > b/include/standard-headers/linux/virtio_mem.h > index 05e5ade75d..18c74c527c 100644 > --- a/include/standard-headers/linux/virtio_mem.h > +++ b/include/standard-headers/linux/virtio_mem.h > @@ -68,9 +68,10 @@ > * explicitly triggered (VIRTIO_MEM_REQ_UNPLUG). > * > * There are no guarantees what will happen if unplugged memory is > - * read/written. Such memory should, in general, not be touched. E.g., > - * even writing might succeed, but the values will simply be discarded at > - * random points in time. > + * read/written. In general, unplugged memory should not be touched, because > + * the resulting action is undefined. There is one exception: without > + * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, unplugged memory inside the usable > + * region can be read, to simplify creation of memory dumps. > * > * It can happen that the device cannot process a request, because it is > * busy. The device driver has to retry later. > @@ -87,6 +88,8 @@ > > /* node_id is an ACPI PXM and is valid */ > #define VIRTIO_MEM_F_ACPI_PXM0 > +/* unplugged memory must not be accessed */ > +#define VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE 1 > > > /* --- virtio-mem: guest -> host requests --- */ > This is verbatim copy of kernel commit of 61082ad6a6e1f. Reviewed-by: Michal Privoznik Michal
Re: [PATCH v1 3/3] virtio-mem: Set "unplugged-inaccessible=auto" for the 6.2 machine on x86
On 11/30/21 10:28, David Hildenbrand wrote: > Set the new default to "auto", keeping it set to "on" for compat I believe you wanted to say: keeping it set to "off", because that's what the patch does. > machines. This property is only available for x86 targets. > > TODO: once 6.2 was released and we have compat machines, target the next > QEMU release. > > Signed-off-by: David Hildenbrand > --- > hw/i386/pc.c | 1 + > hw/virtio/virtio-mem.c | 2 +- > 2 files changed, 2 insertions(+), 1 deletion(-) > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > index a2ef40ecbc..045ba05431 100644 > --- a/hw/i386/pc.c > +++ b/hw/i386/pc.c > @@ -99,6 +99,7 @@ GlobalProperty pc_compat_6_1[] = { > { TYPE_X86_CPU, "hv-version-id-major", "0x0006" }, > { TYPE_X86_CPU, "hv-version-id-minor", "0x0001" }, > { "ICH9-LPC", "x-keep-pci-slot-hpc", "false" }, > +{ "virtio-mem", "unplugged-inaccessible", "off" }, > }; > const size_t pc_compat_6_1_len = G_N_ELEMENTS(pc_compat_6_1); > > diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c > index 1e57156e81..a5d26d414f 100644 > --- a/hw/virtio/virtio-mem.c > +++ b/hw/virtio/virtio-mem.c > @@ -1169,7 +1169,7 @@ static Property virtio_mem_properties[] = { > TYPE_MEMORY_BACKEND, HostMemoryBackend *), > #if defined(VIRTIO_MEM_HAS_LEGACY_GUESTS) > DEFINE_PROP_ON_OFF_AUTO(VIRTIO_MEM_UNPLUGGED_INACCESSIBLE_PROP, > VirtIOMEM, > -unplugged_inaccessible, ON_OFF_AUTO_OFF), > +unplugged_inaccessible, ON_OFF_AUTO_AUTO), > #endif > DEFINE_PROP_END_OF_LIST(), > }; > Reviewed-by: Michal Privoznik Michal
Re: [PATCH v7 3/3] cpus-common: implement dirty page limit on vCPU
On 11/30/21 21:21, Peter Xu wrote: On Tue, Nov 30, 2021 at 06:28:13PM +0800, huang...@chinatelecom.cn wrote: ## +# @set-dirty-limit: +# +# Set the upper limit of dirty page rate for a virtual CPU. +# +# Requires KVM with accelerator property "dirty-ring-size" set. +# A virtual CPU's dirty page rate is a measure of its memory load. +# To observe dirty page rates, use @calc-dirty-rate. +# +# @cpu-index: index of the virtual CPU. +# +# @dirty-rate: upper limit for the specified vCPU's dirty page rate (MB/s) +# +# Since: 7.0 +# +# Example: +# {"execute": "set-dirty-limit"} +#"arguments": { "cpu-index": 0, +# "dirty-rate": 200 } } +# +## +{ 'command': 'set-dirty-limit', + 'data': { 'cpu-index': 'int', 'dirty-rate': 'uint64' } } + +## +# @cancel-dirty-limit: +# +# Cancel the dirty page limit for the vCPU which has been set with +# set-dirty-limit command. Note that this command requires support from +# dirty ring, same as the "set-dirty-limit" command. +# +# @cpu-index: index of the virtual CPU to cancel the dirty page limit +# +# Since: 7.0 +# +# Example: +# {"execute": "cancel-dirty-limit"} +#"arguments": { "cpu-index": 0 } } +# +## +{ 'command': 'cancel-dirty-limit', + 'data': { 'cpu-index': 'int' } } This seems to be overloaded to be a standalone cmd.. How about: { "cmd": "vcpu-dirty-limit", "arguments": { "cpu": $cpu, "enable": true/false, "dirty-rate": 100, } } If "enable"==false, then "dirty-rate" can be ignored and it'll shut down the throttling on vcpu N. Then this command will literally merge the two you proposed. It'll be nice if we provide yet another command: { "cmd": "query-vcpu-dirty-limit", "arguments": { "*cpu": $cpu, } } When $cpu is specified, we return (cpu=$cpu, real_dirty_rate, target_dirty_rate) for this vcpu. When $cpu is not specified, we return an array of that containing all the vcpus. It'll be nicer to enhance the output of the query command to e.g. have a global "enabled"=true/false as long as any vcpu has throttle enabled then the global throttle is enabled. I didn't think more than that, but how's that sound so far? Soud good, it makes the command easier for programmers to use and understand, i'll try this out next version. Thanks, -- Best Regards Hyman Huang(黄勇)
Re: [PATCH v7 1/3] migration/dirtyrate: implement vCPU dirtyrate calculation periodically
On 11/30/21 21:04, Peter Xu wrote: On Tue, Nov 30, 2021 at 06:28:11PM +0800, huang...@chinatelecom.cn wrote: +static void dirtylimit_calc_func(void) +{ +CPUState *cpu; +DirtyPageRecord *dirty_pages; +int64_t start_time, end_time, calc_time; +DirtyRateVcpu rate; +int i = 0; + +dirty_pages = g_malloc0(sizeof(*dirty_pages) * +dirtylimit_calc_state->data.nvcpu); + +dirtylimit_global_dirty_log_start(); + +CPU_FOREACH(cpu) { +record_dirtypages(dirty_pages, cpu, true); +} + +start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); +g_usleep(DIRTYLIMIT_CALC_TIME_MS * 1000); +end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); +calc_time = end_time - start_time; + +dirtylimit_global_dirty_log_stop(); I haven't looked into the details, but.. I'm wondering whether we should just keep the dirty ring enabled during the whole process of throttling. start/stop can be expensive, especially when huge pages are used, start dirty tracking will start to do huge page split. While right after the "stop" all the huge pages will need to be rebuild again. David from Google is even proposing a kernel change to eagerly splitting huge pages when dirty tracking is enabled. So I think we can keep the dirty tracking enabled until all the vcpu throttles are stopped. Yes, it's a good idea and i'll try this out next version. + +CPU_FOREACH(cpu) { +record_dirtypages(dirty_pages, cpu, false); +} -- Best Regards Hyman Huang(黄勇)
Re: [PATCH] memory: Fix incorrect calls of log_global_start/stop
On 30.11.21 09:00, Peter Xu wrote: > We should only call the log_global_start/stop when the global dirty track > bitmask changes from zero<->non-zero. > > No real issue reported for this yet probably because no immediate user to > enable both dirty rate measurement and migration at the same time. However > it'll be good to be prepared for it. > > Fixes: 63b41db4bc ("memory: make global_dirty_tracking a bitmask") > Cc: Hyman Huang > Cc: Paolo Bonzini > Cc: Dr. David Alan Gilbert > Cc: Juan Quintela > Cc: David Hildenbrand > Signed-off-by: Peter Xu LGTM Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: [PATCH v7 0/3] support dirty restraint on vCPU
1. Start vm with kernel+initrd.img with qemu command line as following: [root@Hyman_server1 fast_qemu]# cat vm.sh #!/bin/bash /usr/bin/qemu-system-x86_64 \ -display none -vga none \ -name guest=simple_vm,debug-threads=on \ -monitor stdio \ -machine pc-i440fx-2.12 \ -accel kvm,dirty-ring-size=65536 -cpu host \ -kernel /home/work/fast_qemu/vmlinuz-5.13.0-rc4+ \ -initrd /home/work/fast_qemu/initrd-stress.img \ -append "noapic edd=off printk.time=1 noreplace-smp cgroup_disable=memory pci=noearly console=ttyS0 debug ramsize=1500 ratio=1 sleep=1" \ -chardev file,id=charserial0,path=/var/log/vm_console.log \ -serial chardev:charserial0 \ -qmp unix:/tmp/qmp-sock,server,nowait \ -D /var/log/vm.log \ --trace events=/home/work/fast_qemu/events \ -m 4096 -smp 2 -device sga 2. Enable the dirtylimit trace event which will output to /var/log/vm.log [root@Hyman_server1 fast_qemu]# cat /home/work/fast_qemu/events dirtylimit_state_init dirtylimit_vcpu dirtylimit_impose dirtyrate_do_calculate_vcpu 3. Connect the qmp server with low level qmp client and set-dirty-limit [root@Hyman_server1 my_qemu]# python3.6 ./scripts/qmp/qmp-shell -v -p /tmp/qmp-sock Welcome to the QMP low-level shell! Connected to QEMU 6.1.92 (QEMU) set-dirty-limit cpu-index=1 dirty-rate=400 { "arguments": { "cpu-index": 1, "dirty-rate": 400 }, "execute": "set-dirty-limit" } 4. observe the vcpu current dirty rate and quota dirty rate... [root@Hyman_server1 ~]# tail -f /var/log/vm.log dirtylimit_state_init dirtylimit state init: max cpus 2 dirtylimit_vcpu CPU[1] set quota dirtylimit 400 dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 0, percentage 0 dirtyrate_do_calculate_vcpu vcpu[0]: 1075 MB/s dirtyrate_do_calculate_vcpu vcpu[1]: 1061 MB/s dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 1061, percentage 62 dirtyrate_do_calculate_vcpu vcpu[0]: 1133 MB/s dirtyrate_do_calculate_vcpu vcpu[1]: 380 MB/s dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 380, percentage 57 dirtyrate_do_calculate_vcpu vcpu[0]: 1227 MB/s dirtyrate_do_calculate_vcpu vcpu[1]: 464 MB/s We can observe that vcpu-1's dirtyrate is about 400MB/s with dirty page limit set and the vcpu-0 is not affected. 5. observe the vm stress info... [root@Hyman_server1 fast_qemu]# tail -f /var/log/vm_console.log [0.838051] Run /init as init process [0.839216] with arguments: [0.840153] /init [0.840882] with environment: [0.841884] HOME=/ [0.842649] TERM=linux [0.843478] edd=off [0.844233] ramsize=1500 [0.845079] ratio=1 [0.845829] sleep=1 /init (1): INFO: RAM 1500 MiB across 2 CPUs, ratio 1, sleep 1 us [1.158011] random: init: uninitialized urandom read (4096 bytes read) [1.448205] random: init: uninitialized urandom read (4096 bytes read) /init (1): INFO: 1638282593684ms copied 1 GB in 00729ms /init (00110): INFO: 1638282593964ms copied 1 GB in 00719ms /init (1): INFO: 1638282594405ms copied 1 GB in 00719ms /init (00110): INFO: 1638282594677ms copied 1 GB in 00713ms /init (1): INFO: 1638282595093ms copied 1 GB in 00686ms /init (00110): INFO: 1638282595339ms copied 1 GB in 00662ms /init (1): INFO: 1638282595764ms copied 1 GB in 00670m PS: the kernel and initrd images comes from: kernel image: vmlinuz-5.13.0-rc4+, normal centos vmlinuz copied from /boot directory initrd.img: initrd-stress.img, only contains a stress binary, which compiled from qemu source tests/migration/stress.c and run as init in vm. you can view README.md file of my project "met"(https://github.com/newfriday/met) to compile the initrd-stress.img. :) On 11/30/21 20:57, Peter Xu wrote: On Tue, Nov 30, 2021 at 06:28:10PM +0800, huang...@chinatelecom.cn wrote: From: Hyman Huang(黄勇) The patch [2/3] has not been touched so far. Any corrections and suggetions are welcome. I played with it today, but the vcpu didn't got throttled as expected. What I did was starting two workload with 500mb/s, each pinned on one vcpu thread: [root@fedora ~]# pgrep -fa mig_mon 595 ./mig_mon mm_dirty 1000 500 sequential 604 ./mig_mon mm_dirty 1000 500 sequential [root@fedora ~]# taskset -pc 595 pid 595's current affinity list: 2 [root@fedora ~]# taskset -pc 604 pid 604's current affinity list: 3 Then start throttle with 100mb/s: (QEMU) set-dirty-limit cpu-index=3 dirty-rate=100 {"return": {}} (QEMU) set-dirty-limit cpu-index=2 dirty-rate=100 {"return": {}} I can see the workload dropped a tiny little bit (perhaps 500mb -> 499mb), then it keeps going.. Further throttle won't work too: (QEMU) set-dirty-limit cpu-index=2 dirty-rate=10 {"return": {}} Funnily, the ssh client got slowed down instead... :( Yong, how did you test it? -- Best Regards Hyman Huang(黄勇)
why is iommu_platform set to off by default?
I've just spent a day or so trying to track down why PCI passthrough of a virtio-blk-pci device wasn't working. The problem turns out to be that by default virtio pci devices don't use the IOMMU, even when the machine model has created an IOMMU and arranged for the PCI bus to be underneath it. So when the L2 guest tries to program the virtio device, the virtio device treats the IPAs it writes as if they were PAs and of course the data structures it's looking for aren't there. Why do we default this to 'off'? It seems pretty unhelpful not to honour the existence of the IOMMU, and the failure mode is pretty opaque (L2 guest just hangs)... thanks -- PMM
Re: [PATCH 2/2] intel_iommu: Fix irqchip / X2APIC configuration checks
Acked-by: Claudio Fontana I'll try to find time tonight to give you a tested by. Ciao, Claudio On 11/30/21 2:42 PM, David Woodhouse wrote: > We don't need to check kvm_enable_x2apic(). It's perfectly OK to support > interrupt remapping even if we can't address CPUs above 254. Kind of > pointless, but still functional. > > The check on kvm_enable_x2apic() needs to happen *anyway* in order to > allow CPUs above 254 even without an IOMMU, so allow that to happen > elsewhere. > > However, we do require the *split* irqchip in order to rewrite I/OAPIC > destinations. So fix that check while we're here. > > Signed-off-by: David Woodhouse > --- > hw/i386/intel_iommu.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c > index 294499ee20..b0439d0fbf 100644 > --- a/hw/i386/intel_iommu.c > +++ b/hw/i386/intel_iommu.c > @@ -3746,15 +3746,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, > Error **errp) >ON_OFF_AUTO_ON : > ON_OFF_AUTO_OFF; > } > if (s->intr_eim == ON_OFF_AUTO_ON && !s->buggy_eim) { > -if (!kvm_irqchip_in_kernel()) { > +if (!kvm_irqchip_is_split()) { > error_setg(errp, "eim=on requires > accel=kvm,kernel-irqchip=split"); > return false; > } > -if (!kvm_enable_x2apic()) { > -error_setg(errp, "eim=on requires support on the KVM side" > - "(X2APIC_API, first shipped in v4.7)"); > -return false; > -} > } > > /* Currently only address widths supported are 39 and 48 bits */ >
[PATCH] virtio: signal after wrapping packed used_idx
Packed Virtqueues wrap used_idx instead of letting it run freely like Split Virtqueues do. If the used ring wraps more than once there is no way to compare vq->signalled_used and vq->used_idx in virtio_packed_should_notify() since they are modulo vq->vring.num. This causes the device to stop sending used buffer notifications when when virtio_packed_should_notify() is called less than once each time around the used ring. It is possible to trigger this with virtio-blk's dataplane notify_guest_bh() irq coalescing optimization. The call to virtio_notify_irqfd() (and virtio_packed_should_notify()) is deferred to a BH. If the guest driver is polling it can complete and submit more requests before the BH executes, causing the used ring to wrap more than once. The result is that the virtio-blk device ceases to raise interrupts and I/O hangs. Cc: Tiwei Bie Cc: Jason Wang Cc: Michael S. Tsirkin Signed-off-by: Stefan Hajnoczi --- Smarter solutions welcome, but I think notifying once per vq->vring.num is acceptable. --- hw/virtio/virtio.c | 1 + 1 file changed, 1 insertion(+) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index ea7c079fb0..f7851c2750 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -885,6 +885,7 @@ static void virtqueue_packed_flush(VirtQueue *vq, unsigned int count) if (vq->used_idx >= vq->vring.num) { vq->used_idx -= vq->vring.num; vq->used_wrap_counter ^= 1; +vq->signalled_used_valid = false; } } -- 2.33.1
[PATCH 2/2] intel_iommu: Fix irqchip / X2APIC configuration checks
We don't need to check kvm_enable_x2apic(). It's perfectly OK to support interrupt remapping even if we can't address CPUs above 254. Kind of pointless, but still functional. The check on kvm_enable_x2apic() needs to happen *anyway* in order to allow CPUs above 254 even without an IOMMU, so allow that to happen elsewhere. However, we do require the *split* irqchip in order to rewrite I/OAPIC destinations. So fix that check while we're here. Signed-off-by: David Woodhouse --- hw/i386/intel_iommu.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index 294499ee20..b0439d0fbf 100644 --- a/hw/i386/intel_iommu.c +++ b/hw/i386/intel_iommu.c @@ -3746,15 +3746,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp) ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF; } if (s->intr_eim == ON_OFF_AUTO_ON && !s->buggy_eim) { -if (!kvm_irqchip_in_kernel()) { +if (!kvm_irqchip_is_split()) { error_setg(errp, "eim=on requires accel=kvm,kernel-irqchip=split"); return false; } -if (!kvm_enable_x2apic()) { -error_setg(errp, "eim=on requires support on the KVM side" - "(X2APIC_API, first shipped in v4.7)"); -return false; -} } /* Currently only address widths supported are 39 and 48 bits */ -- 2.31.1 smime.p7s Description: S/MIME cryptographic signature
[PATCH 1/2] target/i386: Fix sanity check on max APIC ID / X2APIC enablement
The check on x86ms->apic_id_limit in pc_machine_done() had two problems. Firstly, we need KVM to support the X2APIC API in order to allow IRQ delivery to APICs >= 255. So we need to call/check kvm_enable_x2apic(), which was done elsewhere in *some* cases but not all. Secondly, microvm needs the same check. So move it from pc_machine_done() to x86_cpus_init() where it will work for both. The check in kvm_cpu_instance_init() is now redundant and can be dropped. Signed-off-by: David Woodhouse --- hw/i386/pc.c | 8 hw/i386/x86.c | 16 target/i386/kvm/kvm-cpu.c | 2 +- 3 files changed, 17 insertions(+), 9 deletions(-) diff --git a/hw/i386/pc.c b/hw/i386/pc.c index a2ef40ecbc..9959f93216 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -736,14 +736,6 @@ void pc_machine_done(Notifier *notifier, void *data) /* update FW_CFG_NB_CPUS to account for -device added CPUs */ fw_cfg_modify_i16(x86ms->fw_cfg, FW_CFG_NB_CPUS, x86ms->boot_cpus); } - - -if (x86ms->apic_id_limit > 255 && !xen_enabled() && -!kvm_irqchip_in_kernel()) { -error_report("current -smp configuration requires kernel " - "irqchip support."); -exit(EXIT_FAILURE); -} } void pc_guest_info_init(PCMachineState *pcms) diff --git a/hw/i386/x86.c b/hw/i386/x86.c index b84840a1bb..660e9413f5 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -39,6 +39,7 @@ #include "sysemu/replay.h" #include "sysemu/sysemu.h" #include "sysemu/cpu-timers.h" +#include "sysemu/xen.h" #include "trace.h" #include "hw/i386/x86.h" @@ -136,6 +137,21 @@ void x86_cpus_init(X86MachineState *x86ms, int default_cpu_version) */ x86ms->apic_id_limit = x86_cpu_apic_id_from_index(x86ms, ms->smp.max_cpus - 1) + 1; + +/* + * Can we support APIC ID 255 or higher? + * + * Under Xen: yes. + * With userspace emulated lapic: no + * With KVM's in-kernel lapic: only if X2APIC API is enabled. + */ +if (x86ms->apic_id_limit > 255 && !xen_enabled() && +(!kvm_irqchip_in_kernel() || !kvm_enable_x2apic())) { +error_report("current -smp configuration requires kernel " + "irqchip and X2APIC API support."); +exit(EXIT_FAILURE); +} + possible_cpus = mc->possible_cpu_arch_ids(ms); for (i = 0; i < ms->smp.cpus; i++) { x86_cpu_new(x86ms, possible_cpus->cpus[i].arch_id, _fatal); diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c index d95028018e..c60cb2dafb 100644 --- a/target/i386/kvm/kvm-cpu.c +++ b/target/i386/kvm/kvm-cpu.c @@ -165,7 +165,7 @@ static void kvm_cpu_instance_init(CPUState *cs) /* only applies to builtin_x86_defs cpus */ if (!kvm_irqchip_in_kernel()) { x86_cpu_change_kvm_default("x2apic", "off"); -} else if (kvm_irqchip_is_split() && kvm_enable_x2apic()) { +} else if (kvm_irqchip_is_split()) { x86_cpu_change_kvm_default("kvm-msi-ext-dest-id", "on"); } -- 2.31.1 smime.p7s Description: S/MIME cryptographic signature
[PATCH v2 1/2] ppc/pnv.c: add a friendly warning when accel=kvm is used
If one tries to use -machine powernv9,accel=kvm in a Power9 host, a cryptic error will be shown: qemu-system-ppc64: Register sync failed... If you're using kvm-hv.ko, only "-cpu host" is possible qemu-system-ppc64: kvm_init_vcpu: kvm_arch_init_vcpu failed (0): Invalid argument Appending '-cpu host' will throw another error: qemu-system-ppc64: invalid chip model 'host' for powernv9 machine The root cause is that in IBM PowerPC we have different specs for the bare-metal and the guests. The bare-metal follows OPAL, the guests follow PAPR. The kernel KVM modules presented in the ppc kernels implements PAPR. This means that we can't use KVM accel when using the powernv machine, which is the emulation of the bare-metal host. All that said, let's give a more informative error in this case. Signed-off-by: Daniel Henrique Barboza --- hw/ppc/pnv.c | 5 + 1 file changed, 5 insertions(+) diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c index 71e45515f1..e5b87e8730 100644 --- a/hw/ppc/pnv.c +++ b/hw/ppc/pnv.c @@ -742,6 +742,11 @@ static void pnv_init(MachineState *machine) DriveInfo *pnor = drive_get(IF_MTD, 0, 0); DeviceState *dev; +if (kvm_enabled()) { +error_report("The powernv machine does not work with KVM acceleration"); +exit(EXIT_FAILURE); +} + /* allocate RAM */ if (machine->ram_size < mc->default_ram_size) { char *sz = size_to_str(mc->default_ram_size); -- 2.31.1
[PATCH v2 2/2] docs/system/ppc/powernv.rst: document KVM support status
Put in a more accessible place the reasoning behind our decision to officially drop KVM support in the powernv machine. Signed-off-by: Daniel Henrique Barboza --- docs/system/ppc/powernv.rst | 13 + 1 file changed, 13 insertions(+) diff --git a/docs/system/ppc/powernv.rst b/docs/system/ppc/powernv.rst index 86186b7d2c..8304e85c51 100644 --- a/docs/system/ppc/powernv.rst +++ b/docs/system/ppc/powernv.rst @@ -58,6 +58,19 @@ Prebuilt images of ``skiboot`` and ``skiroot`` are made available on the `OpenPO QEMU includes a prebuilt image of ``skiboot`` which is updated when a more recent version is required by the models. +Current acceleration status +--- + +KVM acceleration in Linux Power hosts is provided by the kvm-hv and +kvm-pr modules. kvm-hv is adherent to PAPR and it's not compliant with +powernv. kvm-pr in theory could be used as a valid accel option but +this isn't supported by kvm-pr at this moment. + +To spare users from dealing with not so informative errors when attempting +to use accel=kvm, the powernv machine will throw an error informing that +KVM is not supported. This can be revisited in the future if kvm-pr (or +any other KVM alternative) is usable as KVM accel for this machine. + Boot options -- 2.31.1
[PATCH v2 0/2] ppc/pnv.c: add a friendly warning when accel=kvm is used
Hi, In this version a documentation patch was added to explain our motivations to officially disable KVM accel in the powernv machine. Changes from v1: - added a doc patch - v1 link: https://lists.gnu.org/archive/html/qemu-devel/2021-11/msg05207.html Daniel Henrique Barboza (2): ppc/pnv.c: add a friendly warning when accel=kvm is used docs/system/ppc/powernv.rst: document KVM support status docs/system/ppc/powernv.rst | 13 + hw/ppc/pnv.c| 5 + 2 files changed, 18 insertions(+) -- 2.31.1
Re: [PATCH v7 3/3] cpus-common: implement dirty page limit on vCPU
On Tue, Nov 30, 2021 at 06:28:13PM +0800, huang...@chinatelecom.cn wrote: > ## > +# @set-dirty-limit: > +# > +# Set the upper limit of dirty page rate for a virtual CPU. > +# > +# Requires KVM with accelerator property "dirty-ring-size" set. > +# A virtual CPU's dirty page rate is a measure of its memory load. > +# To observe dirty page rates, use @calc-dirty-rate. > +# > +# @cpu-index: index of the virtual CPU. > +# > +# @dirty-rate: upper limit for the specified vCPU's dirty page rate (MB/s) > +# > +# Since: 7.0 > +# > +# Example: > +# {"execute": "set-dirty-limit"} > +#"arguments": { "cpu-index": 0, > +# "dirty-rate": 200 } } > +# > +## > +{ 'command': 'set-dirty-limit', > + 'data': { 'cpu-index': 'int', 'dirty-rate': 'uint64' } } > + > +## > +# @cancel-dirty-limit: > +# > +# Cancel the dirty page limit for the vCPU which has been set with > +# set-dirty-limit command. Note that this command requires support from > +# dirty ring, same as the "set-dirty-limit" command. > +# > +# @cpu-index: index of the virtual CPU to cancel the dirty page limit > +# > +# Since: 7.0 > +# > +# Example: > +# {"execute": "cancel-dirty-limit"} > +#"arguments": { "cpu-index": 0 } } > +# > +## > +{ 'command': 'cancel-dirty-limit', > + 'data': { 'cpu-index': 'int' } } This seems to be overloaded to be a standalone cmd.. How about: { "cmd": "vcpu-dirty-limit", "arguments": { "cpu": $cpu, "enable": true/false, "dirty-rate": 100, } } If "enable"==false, then "dirty-rate" can be ignored and it'll shut down the throttling on vcpu N. Then this command will literally merge the two you proposed. It'll be nice if we provide yet another command: { "cmd": "query-vcpu-dirty-limit", "arguments": { "*cpu": $cpu, } } When $cpu is specified, we return (cpu=$cpu, real_dirty_rate, target_dirty_rate) for this vcpu. When $cpu is not specified, we return an array of that containing all the vcpus. It'll be nicer to enhance the output of the query command to e.g. have a global "enabled"=true/false as long as any vcpu has throttle enabled then the global throttle is enabled. I didn't think more than that, but how's that sound so far? Thanks, -- Peter Xu
Re: [PATCH for-6.2 v2] block/nbd: forbid incompatible change of server options on reconnect
30.11.2021 00:53, Vladimir Sementsov-Ogievskiy wrote: Reconnect feature was never prepared to handle server options changed on reconnect. Let's be stricter and check what exactly is changed. If server capabilities just got richer don't worry. Otherwise fail and drop the established connection. Signed-off-by: Vladimir Sementsov-Ogievskiy --- v2: by Eric's comments: - drop extra check about old->min_block % new->min_block - make context_id check conditional itself - don't handle READ_ONLY flag here (see comment in code) - wording Code seems quite obvious, but honestly I still didn't test that it does what it should :( And I'm afraid, Qemu actually doesn't provide good possibility to do so. Eric, may be you know some simple way to test it with nbdkit? Ok, a simple test I can do: qemu-img create -f qcow2 a 10M qemu-img create -f qcow2 b 20M qemu-nbd b then in parallel: ./qemu-io --image-opts driver=nbd,host=localhost,reconnect-delay=5 check that it works: qemu-io> write 10M 1M wrote 1048576/1048576 bytes at offset 10485760 Now, kill nbd server try again qemu-io> write 10M 1M - it will wait up to 5 seconds for reconnection Now start nbd server with shorter file qemu-nbd a Prepatch, qemu-io will successfully connect, request will fail with "Invalid argument". Afterpatch, qemu-io will refuse to connect to shorter export, write will fail with "Input/output error". include/block/nbd.h | 9 + nbd/client-connection.c | 88 + 2 files changed, 97 insertions(+) diff --git a/include/block/nbd.h b/include/block/nbd.h index 78d101b774..9e1943d24c 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -157,6 +157,10 @@ enum { #define NBD_FLAG_SEND_RESIZE (1 << NBD_FLAG_SEND_RESIZE_BIT) #define NBD_FLAG_SEND_CACHE(1 << NBD_FLAG_SEND_CACHE_BIT) #define NBD_FLAG_SEND_FAST_ZERO(1 << NBD_FLAG_SEND_FAST_ZERO_BIT) +/* + * WARNING! If you add any new NBD_FLAG_ flag, check that logic in + * nbd_is_new_info_compatible() is still good about handling flags. + */ /* New-style handshake (global) flags, sent from server to client, and control what will happen during handshake phase. */ @@ -305,6 +309,11 @@ struct NBDExportInfo { uint32_t context_id; +/* + * WARNING! When adding any new field to the structure, don't forget + * to check and update the nbd_is_new_info_compatible() function. + */ + /* Set by server results during nbd_receive_export_list() */ char *description; int n_contexts; diff --git a/nbd/client-connection.c b/nbd/client-connection.c index 695f855754..d50c187482 100644 --- a/nbd/client-connection.c +++ b/nbd/client-connection.c @@ -37,6 +37,10 @@ struct NBDClientConnection { bool do_negotiation; bool do_retry; +/* Used only by connection thread, does not need mutex protection */ +bool has_prev_info; +NBDExportInfo prev_info; + QemuMutex mutex; /* @@ -160,6 +164,69 @@ static int nbd_connect(QIOChannelSocket *sioc, SocketAddress *addr, return 0; } +static bool nbd_is_new_info_compatible(NBDExportInfo *old, NBDExportInfo *new, + Error **errp) +{ +uint32_t dropped_flags; + +if (old->structured_reply && !new->structured_reply) { +error_setg(errp, "Server options degraded after reconnect: " + "structured_reply is not supported anymore"); +return false; +} + +if (old->base_allocation) { +if (!new->base_allocation) { +error_setg(errp, "Server options degraded after reconnect: " + "base_allocation is not supported anymore"); +return false; +} + +if (old->context_id != new->context_id) { +error_setg(errp, "Meta context id changed after reconnect"); +return false; +} +} + +if (old->size != new->size) { +error_setg(errp, "NBD export size changed after reconnect"); +return false; +} + +/* + * No worry if rotational status changed. + * + * Also, we can't handle NBD_FLAG_READ_ONLY properly at this level: we don't + * actually know, does our client need write access or not. So, it's handled + * in block layer in nbd_handle_updated_info(). + * + * All other flags are feature flags, they should not degrade. + */ +dropped_flags = (old->flags & ~new->flags) & +~(NBD_FLAG_ROTATIONAL | NBD_FLAG_READ_ONLY); +if (dropped_flags) { +error_setg(errp, "Server options degraded after reconnect: flags 0x%" + PRIx32 " are not reported anymore", dropped_flags); +return false; +} + +if (new->min_block > old->min_block) { +error_setg(errp, "Server requires more strict min_block after " + "reconnect: %" PRIu32 " instead of %" PRIu32, + new->min_block,
Re: Follow-up on the CXL discussion at OFTC
On Mon, 29 Nov 2021 18:28:43 + Alex Bennée wrote: > Ben Widawsky writes: > > > On 21-11-26 12:08:08, Alex Bennée wrote: > >> > >> Ben Widawsky writes: > >> > >> > On 21-11-19 02:29:51, Shreyas Shah wrote: > >> >> Hi Ben > >> >> > >> >> Are you planning to add the CXL2.0 switch inside QEMU or already added > >> >> in one of the version? > >> >> > >> > > >> > From me, there are no plans for QEMU anything until/unless upstream > >> > thinks it > >> > will merge the existing patches, or provide feedback as to what it would > >> > take to > >> > get them merged. If upstream doesn't see a point in these patches, then > >> > I really > >> > don't see much value in continuing to further them. Once hardware comes > >> > out, the > >> > value proposition is certainly less. > >> > >> I take it: > >> > >> Subject: [RFC PATCH v3 00/31] CXL 2.0 Support > >> Date: Mon, 1 Feb 2021 16:59:17 -0800 > >> Message-Id: <20210202005948.241655-1-ben.widaw...@intel.com> > >> > >> is the current state of the support? I saw there was a fair amount of > >> discussion on the thread so assumed there would be a v4 forthcoming at > >> some point. > > > > Hi Alex, > > > > There is a v4, however, we never really had a solid plan for the primary > > issue > > which was around handling CXL memory expander devices properly (both from an > > interleaving standpoint as well as having a device which hosts multiple > > memory > > capacities, persistent and volatile). I didn't feel it was worth sending a > > v4 > > unless someone could say > > > > 1. we will merge what's there and fix later, or > > 2. you must have a more perfect emulation in place, or > > 3. we want to see usages for a real guest > > I think 1. is acceptable if the community is happy there will be ongoing > development and it's not just a code dump. Given it will have a > MAINTAINERS entry I think that is demonstrated. My thought is also 1. There are a few hacks we need to clean out but nothing that should take too long. I'm sure it'll take a rev or two more. Right now for example, I've added support to arm-virt and maybe need to move that over to a different machine model... > > What's the current use case? Testing drivers before real HW comes out? > Will it still be useful after real HW comes out for people wanting to > debug things without HW? CXL is continuing to expand in scope and capabilities and I don't see that reducing any time soon (My guess is 3 years+ to just catch up with what is under discussion today). So I see two long term use cases: 1) Automated verification that we haven't broken things. I suspect no one person is going to have a test farm covering all the corner cases. So we'll need emulation + firmware + kernel based testing. 2) New feature prove out. We have already used it for some features that will appear in the next spec version. Obviously I can't say what or send that code out yet. Its very useful and the spec draft has changed in various ways a result. I can't commit others, but Huawei will be doing more of this going forwards. For that we need a stable base to which we add the new stuff once spec publication allows it. > > > > > I had hoped we could merge what was there mostly as is and fix it up as we > > go. > > It's useful in the state it is now, and as time goes on, we find more > > usecases > > for it in a VMM, and not just driver development. > > > >> > >> Adding new subsystems to QEMU does seem to be a pain point for new > >> contributors. Patches tend to fall through the cracks of existing > >> maintainers who spend most of their time looking at stuff that directly > >> touches their files. There is also a reluctance to merge large chunks of > >> functionality without an identified maintainer (and maybe reviewers) who > >> can be the contact point for new patches. So in short you need: > >> > >> - Maintainer Reviewed-by/Acked-by on patches that touch other sub-systems > >> > > > > This is the challenging one. I have Cc'd the relevant maintainers (hw/pci > > and > > hw/mem are the two) in the past, but I think there interest is lacking (and > > reasonably so, it is an entirely different subsystem). > > So the best approach to that is to leave a Cc: tag in the patch itself > on your next posting so we can see the maintainer did see it but didn't > contribute a review tag. This is also a good reason to keep Message-Id > tags in patches so we can go back to the original threads. > > So in my latest PR you'll see: > > Signed-off-by: Willian Rampazzo > Reviewed-by: Beraldo Leal > Message-Id: <20211122191124.31620-1-willi...@redhat.com> > Signed-off-by: Alex Bennée > Reviewed-by: Philippe Mathieu-Daudé > Message-Id: <20211129140932.4115115-7-alex.ben...@linaro.org> > > which shows the Message-Id from Willian's original posting and the > latest Message-Id from my posting of the maintainer tree (I trim off my > old ones). > > >> - Reviewed-by tags on the
Re: [PATCH v7 1/3] migration/dirtyrate: implement vCPU dirtyrate calculation periodically
On Tue, Nov 30, 2021 at 06:28:11PM +0800, huang...@chinatelecom.cn wrote: > +static void dirtylimit_calc_func(void) > +{ > +CPUState *cpu; > +DirtyPageRecord *dirty_pages; > +int64_t start_time, end_time, calc_time; > +DirtyRateVcpu rate; > +int i = 0; > + > +dirty_pages = g_malloc0(sizeof(*dirty_pages) * > +dirtylimit_calc_state->data.nvcpu); > + > +dirtylimit_global_dirty_log_start(); > + > +CPU_FOREACH(cpu) { > +record_dirtypages(dirty_pages, cpu, true); > +} > + > +start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > +g_usleep(DIRTYLIMIT_CALC_TIME_MS * 1000); > +end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); > +calc_time = end_time - start_time; > + > +dirtylimit_global_dirty_log_stop(); I haven't looked into the details, but.. I'm wondering whether we should just keep the dirty ring enabled during the whole process of throttling. start/stop can be expensive, especially when huge pages are used, start dirty tracking will start to do huge page split. While right after the "stop" all the huge pages will need to be rebuild again. David from Google is even proposing a kernel change to eagerly splitting huge pages when dirty tracking is enabled. So I think we can keep the dirty tracking enabled until all the vcpu throttles are stopped. > + > +CPU_FOREACH(cpu) { > +record_dirtypages(dirty_pages, cpu, false); > +} -- Peter Xu
Re: [PATCH v7 0/3] support dirty restraint on vCPU
On Tue, Nov 30, 2021 at 06:28:10PM +0800, huang...@chinatelecom.cn wrote: > From: Hyman Huang(黄勇) > > The patch [2/3] has not been touched so far. Any corrections and > suggetions are welcome. I played with it today, but the vcpu didn't got throttled as expected. What I did was starting two workload with 500mb/s, each pinned on one vcpu thread: [root@fedora ~]# pgrep -fa mig_mon 595 ./mig_mon mm_dirty 1000 500 sequential 604 ./mig_mon mm_dirty 1000 500 sequential [root@fedora ~]# taskset -pc 595 pid 595's current affinity list: 2 [root@fedora ~]# taskset -pc 604 pid 604's current affinity list: 3 Then start throttle with 100mb/s: (QEMU) set-dirty-limit cpu-index=3 dirty-rate=100 {"return": {}} (QEMU) set-dirty-limit cpu-index=2 dirty-rate=100 {"return": {}} I can see the workload dropped a tiny little bit (perhaps 500mb -> 499mb), then it keeps going.. Further throttle won't work too: (QEMU) set-dirty-limit cpu-index=2 dirty-rate=10 {"return": {}} Funnily, the ssh client got slowed down instead... :( Yong, how did you test it? -- Peter Xu
Re: [PATCH 2/3] target/m68k: Implement TRAPcc
On 11/30/21 12:57 PM, Laurent Vivier wrote: +DISAS_INSN(trapcc) +{ + /* Consume and discard the immediate operand. */ + switch (extract32(insn, 0, 3)) { + case 2: /* trapcc.w */ + (void)read_im16(env, s); + break; + case 3: /* trapcc.l */ + (void)read_im32(env, s); + break; Do we need to actually read the memory to trigger a fault if needed or can we only increase the PC? Yes, and to pass the entire instruction to plugins. + case 4: /* trapcc (no operand) */ + break; + default: + /* Illegal insn */ + disas_undef(env, s, insn); + return; + } + do_trapcc(s, extract32(insn, 8, 4)); +} Do we need to change something in m68k_interrupt_all()? Yes, and cpu_loop. Thanks, r~
Re: unable to execute QEMU command 'qom-get': Property 'sgx-epc.unavailable-features' not found
On Thu, Nov 25, 2021 at 08:47:22PM +0800, Yang Zhong wrote: > Hello Paolo, > > Our customer used the Libvirt XML to start a SGX VM, but failed. > > libvirt.libvirtError: internal error: unable to execute QEMU command > 'qom-get': Property 'sgx-epc.unavailable-features' not found > > The XML file, > > > value="host,+sgx,+sgx-debug,+sgx-exinfo,+sgx-kss,+sgx-mode64,+sgx-provisionkey,+sgx-tokenkey,+sgx1,+sgx2,+sgxlc"/> > > > > > > > The new compound property command should be located in /machine path, > which are different with old command '-sgx-epc id=epc1,memdev=mem1'. > > I also tried this from Qemu monitor tool, > (qemu) qom-list /machine > type (string) > kernel (string) > .. > sgx-epc (SgxEPC) > .. > sgx-epc[0] (child) > .. > > We can find sgx-epc from /machine list. > This issue is clear now, which is caused by Libvirt to get the CPU's unavailable-features by below command: {"execute":"qom-get","arguments":{"path":"/machine/unattached/device[0]","property":"unavailable-features"} but in SGX vm, since the sgx is initialized before VCPU because sgx need set the virtual EPC info in the cpuid. So the /machine/unattached/device[0] is occupied by sgx, which fail to get the unvailable-features from /machine/unattached/device[0]. We need fix this issue, but this can be done in Qemu or Libvirt side. 1) Libvirt side If the libvirt support SGX EPCs, libvirt can use /machine/unattached/device[n] to check "unavailable-features". n is the next number of sgx's unattached_count. 2) Qemu side One temp patch to create one /sgx in the /machine in the device_set_realized() diff --git a/hw/core/qdev.c b/hw/core/qdev.c index 84f3019440..4154eef0d8 100644 --- a/hw/core/qdev.c +++ b/hw/core/qdev.c @@ -497,7 +497,7 @@ static void device_set_realized(Object *obj, bool value, Error **errp) NamedClockList *ncl; Error *local_err = NULL; bool unattached_parent = false; -static int unattached_count; +static int unattached_count, sgx_count; if (dev->hotplugged && !dc->hotpluggable) { error_setg(errp, QERR_DEVICE_NO_HOTPLUG, object_get_typename(obj)); @@ -509,7 +509,15 @@ static void device_set_realized(Object *obj, bool value, Error **errp) goto fail; } -if (!obj->parent) { +if (!obj->parent && !strcmp(object_get_typename(obj), "sgx-epc")) { +gchar *name = g_strdup_printf("device[%d]", sgx_count++); + +object_property_add_child(container_get(qdev_get_machine(), +"/sgx"), + name, obj); +unattached_parent = true; +g_free(name); +} else if (!obj->parent) { gchar *name = g_strdup_printf("device[%d]", unattached_count++); object_property_add_child(container_get(qdev_get_machine() This patch can make sure vcpu is still /machine/unattached/device[0]. Which solution is best? thanks! Yang > I am not familiar with Libvirt side, would you please suggest how to implement > this compound command in the XML file? thanks a lot! > > Regards, > > Yang >
Re: [PATCH for-6.1 v2] i386: do not call cpudef-only models functions for max, host, base
On Tue, 2021-11-30 at 10:00 +0100, Claudio Fontana wrote: > I tend to agree that what we want if kvm_enable_x2apic fails is to abort QEMU > if we have been requesting smp > 255, > and if we did not request smp > 255 cpus, we want to not advertise the > feature. > > This is not accomplished, neither by my snippet above, not by the existing > code at any point in git history, and not by yours in itself. > > Your change seems to accomplish the call to kvm_enable_x2apic, and abort if > requested smp > 255, > but it does not stop advertising X2APIC and ext-dest-id on kvm_enable_x2apic > failure for the case < 255, so we'd need to add that. We don't need kvm_enable_x2apic() at all if all APIC IDs are <255. The kvm_enable_x2apic() call sets two flags. The first (USE_32BIT_IDS) makes KVM take the high bits of the target APIC ID from the high bits of the MSI address. So is only relevant if we ever want to deliver interrupts to CPUs above 255. The second (DISABLE_BROADCAST_QUIRK) prevents APIC ID 255 from being interpreted as a broadcast. This is relevant if you ever want to deliver interrupts to APIC #255. So we need kvm_enable_x2apic() *if* we have APICs >= 255, but we don't care about it if we have fewer CPUs. Note the condition is '>= 255'. Not '> 255'. With APIC IDs < 256 it also doesn't matter whether we advertise the ext-dest-id feature to the guest or not, since that only tells them where to put the upper bits... and there aren't any upper bits. > Do I understand it right? Do we need to wrap all of this logic in a if > (kvm_enabled()) ? For Xen because we aren't really emulating CPUs or APICs at all, it doesn't matter and you can have as many CPUs as you like. For TCG or (KVM && !kvm_irqchip_in_kernel()) we are limited to 254 because our userspace lapic doesn't emulate x2apic at all. But in the TCG case, even if KVM isn't *built*, kvm_irqchip_in_kernel() will return false. So all we need to check is kvm_irqchip_in_kernel(). I think the correct check is what I had in pc_machine_done() with the off-by-one error fixed: if (x86ms->apic_id_limit >= 255 && !xen_enabled() && (!kvm_irqchip_in_kernel() || !kvm_enable_x2apic())) { But that doesn't help microvm unless we copy it there too. Let's take a look at the code we were already looking at in kvm_cpu_instance_init(): /* only applies to builtin_x86_defs cpus */ if (!kvm_irqchip_in_kernel()) { x86_cpu_change_kvm_default("x2apic", "off"); That part is purely cosmetic because kvm_arch_get_supported_cpuid() is *already* going to mask out the X2APIC bit if(!kvm_irqchip_in_kernel()) anyway. } else if (kvm_irqchip_is_split() && kvm_enable_x2apic()) { x86_cpu_change_kvm_default("kvm-msi-ext-dest-id", "on"); } That part makes sense but there's a check missing here because even *without* ext-dest-id we can't support APIC ID 255 without using kvm_enable_x2apic(). And by the time we're in kvm_cpu_instance_init() for the CPU with APIC ID #255, it's too late to disable x2apic support for all the previous CPUs. So... we either do the check in pc_machine_done() and also a similar check for microvm, or we actually abort in kvm_cpu_instance_init() if this CPU's APIC ID is >=255 and kvm_enable_x2apic() has failed. Either way, the call in intel_iommu can die. > > > > In that case it needs to turn x2apic support *off*. But simply calling > > (or not calling) x86_cpu_change_kvm_default() makes absolutely no > > difference unless those defaults are *applied* by calling > > x86_cpu_apply_props() > > right, it makes absolutely no difference, and we cannot use > kvm_default_props, as they are for something else entirely. > > > or making the same change by some other means. > > right, we need to change it by other means. > > It is still unclear to me for which cpu classes and versioned models we > should behave like this. Thoughts? > > "max"? "base"? versioned models: depending on the model features? > > > > > > > > > So what I care about (in case ∃ APIC IDs >= 255) is two things: > > > > > > > > 1. Qemu needs to call kvm_enable_x2apic(). > > > > 2. If that *fails* qemu needs to *stop* advertising X2APIC and > > > > ext-dest-id. > > Understand. We also need though: > > 3. Not call kvm_enable_x2apic() when it should not be called (non-KVM > accelerator, which cpu classes and models) > 4. Not stop advertising X2APIC and ext-dest-id when we shouldn't stop > advertising it. > > > > > > > > > > > > > That last patch snippet in pc_machine_done() should suffice to achieve > > > > that, I think. Because if kvm_enable_x2apic() fails and qemu has been > > > > asked for that many CPUs, it aborts completely. Which seems right. > > see comments above, and should we limit that code to when kvm is enabled? > > > > > > > > > > > seems right to abort if requesting > 255 APIC IDs cannot be satisfied, I > > > agree. > > > > > > So I think in the end, we want to: > > > > > > 1)
Re: [PATCH 1/3] target/m68k: Implement TRAPV
Le 30/11/2021 à 11:37, Richard Henderson a écrit : Signed-off-by: Richard Henderson --- target/m68k/translate.c | 17 + 1 file changed, 17 insertions(+) diff --git a/target/m68k/translate.c b/target/m68k/translate.c index af43c8eab8..858ba761fc 100644 --- a/target/m68k/translate.c +++ b/target/m68k/translate.c @@ -4863,6 +4863,22 @@ DISAS_INSN(trap) gen_exception(s, s->base.pc_next, EXCP_TRAP0 + (insn & 0xf)); } +static void do_trapcc(DisasContext *s, int cond) +{ +TCGLabel *over = gen_new_label(); + +/* Jump over if !cond. */ +gen_jmpcc(s, cond ^ 1, over); + +gen_exception(s, s->base.pc_next, EXCP_TRAPCC); +gen_set_label(over); +} + +DISAS_INSN(trapv) +{ +do_trapcc(s, 9); /* VS */ +} + static void gen_load_fcr(DisasContext *s, TCGv res, int reg) { switch (reg) { @@ -6026,6 +6042,7 @@ void register_m68k_insns (CPUM68KState *env) BASE(nop, 4e71, ); INSN(rtd, 4e74, , RTD); BASE(rts, 4e75, ); +INSN(trapv, 4e76, , M68000); INSN(rtr, 4e77, , M68000); BASE(jump, 4e80, ffc0); BASE(jump, 4ec0, ffc0); Same question as for PATCH 2 regarding m68k_interrupt_all() Reviewed-by: Laurent Vivier
Re: [PATCH v3 17/23] multifd: Use normal pages array on the send side
"Dr. David Alan Gilbert" wrote: > * Juan Quintela (quint...@redhat.com) wrote: >> Signed-off-by: Juan Quintela > > Can you explain a bit more what's going on here? Sorry. Until patch 20, we have what we had always have: pages that are sent through multifd (non zero pages). We are going to call it normal pages. So right now, we use the array of pages that we are passed in directly on the multifd send methods. But when we introduce zero pages handling around patch 20, we end having two types of pages sent through multifd: - normal pages (a.k.a. non-zero pages) - zero pages So the options are: - we rename the fields before we introduce the zero page code, and then we introduce the zero page code. - we rename at the same time that we introduce the zero page code. I decided to go with the 1st option. The other thing that we do here is that we introduce the normal array pages, so right now we do: for (i = 0; i < pages->num; i++) { p->narmal[p->normal_num] = pages->offset[i]; p->normal_num++: } Why? Because then patch 20 becomes: for (i = 0; i < pages->num; i++) { if (buffer_is_zero(page->offset[i])) { p->zerol[p->zero_num] = pages->offset[i]; p->zeronum++: } else { p->narmal[p->normal_num] = pages->offset[i]; p->normal_num++: } } i.e. don't have to touch the handling of normal pages at all, only this for loop. As an added benefit, after this patch, multifd methods don't need to know about the pages array, only about the params array (that will allow me to drop the locking earlier). I hope this helps. Later, Juan.
Re: [PATCH 2/3] target/m68k: Implement TRAPcc
Le 30/11/2021 à 11:37, Richard Henderson a écrit : Resolves: https://gitlab.com/qemu-project/qemu/-/issues/754 Signed-off-by: Richard Henderson --- target/m68k/cpu.h | 2 ++ target/m68k/cpu.c | 1 + target/m68k/translate.c | 21 + 3 files changed, 24 insertions(+) diff --git a/target/m68k/cpu.h b/target/m68k/cpu.h index a3423729ef..03f600f7e7 100644 --- a/target/m68k/cpu.h +++ b/target/m68k/cpu.h @@ -527,6 +527,8 @@ enum m68k_features { M68K_FEATURE_MOVEC, /* Unaligned data accesses (680[2346]0) */ M68K_FEATURE_UNALIGNED_DATA, +/* TRAPCC insn. (680[2346]0, and CPU32) */ +M68K_FEATURE_TRAPCC, }; static inline int m68k_feature(CPUM68KState *env, int feature) diff --git a/target/m68k/cpu.c b/target/m68k/cpu.c index c7aeb7da9c..5f778773d1 100644 --- a/target/m68k/cpu.c +++ b/target/m68k/cpu.c @@ -162,6 +162,7 @@ static void m68020_cpu_initfn(Object *obj) m68k_set_feature(env, M68K_FEATURE_CHK2); m68k_set_feature(env, M68K_FEATURE_MSP); m68k_set_feature(env, M68K_FEATURE_UNALIGNED_DATA); +m68k_set_feature(env, M68K_FEATURE_TRAPCC); } /* diff --git a/target/m68k/translate.c b/target/m68k/translate.c index 858ba761fc..cf29f35d91 100644 --- a/target/m68k/translate.c +++ b/target/m68k/translate.c @@ -4879,6 +4879,26 @@ DISAS_INSN(trapv) do_trapcc(s, 9); /* VS */ } +DISAS_INSN(trapcc) +{ +/* Consume and discard the immediate operand. */ +switch (extract32(insn, 0, 3)) { +case 2: /* trapcc.w */ +(void)read_im16(env, s); +break; +case 3: /* trapcc.l */ +(void)read_im32(env, s); +break; Do we need to actually read the memory to trigger a fault if needed or can we only increase the PC? Normally these values are for the trap handler. +case 4: /* trapcc (no operand) */ +break; +default: +/* Illegal insn */ +disas_undef(env, s, insn); +return; +} +do_trapcc(s, extract32(insn, 8, 4)); +} Do we need to change something in m68k_interrupt_all()? if (!is_hw) { switch (cs->exception_index) { case EXCP_RTE: /* Return from an exception. */ m68k_rte(env); return; case EXCP_TRAP0 ... EXCP_TRAP15: /* Move the PC after the trap instruction. */ retaddr += 2; break; } } Thanks, Laurent + static void gen_load_fcr(DisasContext *s, TCGv res, int reg) { switch (reg) { @@ -6051,6 +6071,7 @@ void register_m68k_insns (CPUM68KState *env) INSN(scc, 50c0, f0f8, CF_ISA_A); /* Scc.B Dx */ INSN(scc, 50c0, f0c0, M68000); /* Scc.B */ INSN(dbcc, 50c8, f0f8, M68000); +INSN(trapcc,50f8, f0f8, TRAPCC); INSN(tpf, 51f8, fff8, CF_ISA_A); /* Branch instructions. */
Re: [PATCH 3/3] target/m68k: Implement FTRAPcc
On 11/30/21 11:37 AM, Richard Henderson wrote: +INSN(ftrapcc, f278, ff80, FPU); Whoops, mask should be fff8. r~
[PATCH] aio-posix: split poll check from ready handler
Adaptive polling measures the execution time of the polling check plus handlers called when a polled event becomes ready. Handlers can take a significant amount of time, making it look like polling was running for a long time when in fact the event handler was running for a long time. For example, on Linux the io_submit(2) syscall invoked when a virtio-blk device's virtqueue becomes ready can take 10s of microseconds. This can exceed the default polling interval (32 microseconds) and cause adaptive polling to stop polling. By excluding the handler's execution time from the polling check we make the adaptive polling calculation more accurate. As a result, the event loop now stays in polling mode where previously it would have fallen back to file descriptor monitoring. The following data was collected with virtio-blk num-queues=2 event_idx=off using an IOThread. Before: 168k IOPS, IOThread syscalls: 9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0)= 16 9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8) = 8 9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8) = 8 9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3 9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512)= 8 9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0)= 32 174k IOPS (+3.6%), IOThread syscalls: 9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0)= 32 9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8) = 8 9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8) = 8 9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50)= 32 Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because the IOThread stays in polling mode instead of falling back to file descriptor monitoring. As usual, polling is not implemented on Windows so this patch ignores the new io_poll_read() callback in aio-win32.c. Signed-off-by: Stefan Hajnoczi --- include/block/aio.h | 4 +- util/aio-posix.h | 1 + block/curl.c | 11 ++--- block/io_uring.c | 19 + block/iscsi.c| 4 +- block/linux-aio.c| 16 +--- block/nfs.c | 6 +-- block/nvme.c | 34 +--- block/ssh.c | 4 +- block/win32-aio.c| 4 +- hw/virtio/virtio.c | 16 +--- hw/xen/xen-bus.c | 6 +-- io/channel-command.c | 6 ++- io/channel-file.c| 3 +- io/channel-socket.c | 3 +- migration/rdma.c | 8 ++-- tests/unit/test-aio.c| 4 +- util/aio-posix.c | 88 ++-- util/aio-win32.c | 4 +- util/async.c | 10 - util/main-loop.c | 4 +- util/qemu-coroutine-io.c | 5 ++- util/vhost-user-server.c | 11 ++--- 23 files changed, 184 insertions(+), 87 deletions(-) diff --git a/include/block/aio.h b/include/block/aio.h index 47fbe9d81f..5634173b12 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -469,6 +469,7 @@ void aio_set_fd_handler(AioContext *ctx, IOHandler *io_read, IOHandler *io_write, AioPollFn *io_poll, +IOHandler *io_poll_ready, void *opaque); /* Set polling begin/end callbacks for a file descriptor that has already been @@ -490,7 +491,8 @@ void aio_set_event_notifier(AioContext *ctx, EventNotifier *notifier, bool is_external, EventNotifierHandler *io_read, -AioPollFn *io_poll); +AioPollFn *io_poll, +EventNotifierHandler *io_poll_ready); /* Set polling begin/end callbacks for an event notifier that has already been * registered with aio_set_event_notifier. Do nothing if the event notifier is diff --git a/util/aio-posix.h b/util/aio-posix.h index c80c04506a..7f2c37a684 100644 --- a/util/aio-posix.h +++ b/util/aio-posix.h @@ -24,6 +24,7 @@ struct AioHandler { IOHandler *io_read; IOHandler *io_write; AioPollFn *io_poll; +IOHandler *io_poll_ready; IOHandler
Re: [PATCH v3 17/23] multifd: Use normal pages array on the send side
* Juan Quintela (quint...@redhat.com) wrote: > Signed-off-by: Juan Quintela Can you explain a bit more what's going on here? Dave > --- > migration/multifd.h | 8 ++-- > migration/multifd-zlib.c | 6 +++--- > migration/multifd-zstd.c | 6 +++--- > migration/multifd.c | 30 +++--- > migration/trace-events | 4 ++-- > 5 files changed, 33 insertions(+), 21 deletions(-) > > diff --git a/migration/multifd.h b/migration/multifd.h > index 7496f951a7..78e73df3ec 100644 > --- a/migration/multifd.h > +++ b/migration/multifd.h > @@ -104,14 +104,18 @@ typedef struct { > /* thread local variables */ > /* packets sent through this channel */ > uint64_t num_packets; > -/* pages sent through this channel */ > -uint64_t num_pages; > +/* non zero pages sent through this channel */ > +uint64_t num_normal_pages; > /* syncs main thread and channels */ > QemuSemaphore sem_sync; > /* buffers to send */ > struct iovec *iov; > /* number of iovs used */ > uint32_t iovs_num; > +/* Pages that are not zero */ > +ram_addr_t *normal; > +/* num of non zero pages */ > +uint32_t normal_num; > /* used for compression methods */ > void *data; > } MultiFDSendParams; > diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c > index f65159392a..25ef68a548 100644 > --- a/migration/multifd-zlib.c > +++ b/migration/multifd-zlib.c > @@ -106,16 +106,16 @@ static int zlib_send_prepare(MultiFDSendParams *p, > Error **errp) > int ret; > uint32_t i; > > -for (i = 0; i < p->pages->num; i++) { > +for (i = 0; i < p->normal_num; i++) { > uint32_t available = z->zbuff_len - out_size; > int flush = Z_NO_FLUSH; > > -if (i == p->pages->num - 1) { > +if (i == p->normal_num - 1) { > flush = Z_SYNC_FLUSH; > } > > zs->avail_in = page_size; > -zs->next_in = p->pages->block->host + p->pages->offset[i]; > +zs->next_in = p->pages->block->host + p->normal[i]; > > zs->avail_out = available; > zs->next_out = z->zbuff + out_size; > diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c > index 6933ba622a..61842d713e 100644 > --- a/migration/multifd-zstd.c > +++ b/migration/multifd-zstd.c > @@ -121,13 +121,13 @@ static int zstd_send_prepare(MultiFDSendParams *p, > Error **errp) > z->out.size = z->zbuff_len; > z->out.pos = 0; > > -for (i = 0; i < p->pages->num; i++) { > +for (i = 0; i < p->normal_num; i++) { > ZSTD_EndDirective flush = ZSTD_e_continue; > > -if (i == p->pages->num - 1) { > +if (i == p->normal_num - 1) { > flush = ZSTD_e_flush; > } > -z->in.src = p->pages->block->host + p->pages->offset[i]; > +z->in.src = p->pages->block->host + p->normal[i]; > z->in.size = page_size; > z->in.pos = 0; > > diff --git a/migration/multifd.c b/migration/multifd.c > index 6983ba3e7c..dbe919b764 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -89,13 +89,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, > Error **errp) > MultiFDPages_t *pages = p->pages; > size_t page_size = qemu_target_page_size(); > > -for (int i = 0; i < p->pages->num; i++) { > -p->iov[p->iovs_num].iov_base = pages->block->host + pages->offset[i]; > +for (int i = 0; i < p->normal_num; i++) { > +p->iov[p->iovs_num].iov_base = pages->block->host + p->normal[i]; > p->iov[p->iovs_num].iov_len = page_size; > p->iovs_num++; > } > > -p->next_packet_size = p->pages->num * page_size; > +p->next_packet_size = p->normal_num * page_size; > p->flags |= MULTIFD_FLAG_NOCOMP; > return 0; > } > @@ -262,7 +262,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p) > > packet->flags = cpu_to_be32(p->flags); > packet->pages_alloc = cpu_to_be32(p->pages->allocated); > -packet->pages_used = cpu_to_be32(p->pages->num); > +packet->pages_used = cpu_to_be32(p->normal_num); > packet->next_packet_size = cpu_to_be32(p->next_packet_size); > packet->packet_num = cpu_to_be64(p->packet_num); > > @@ -270,9 +270,9 @@ static void multifd_send_fill_packet(MultiFDSendParams *p) > strncpy(packet->ramblock, p->pages->block->idstr, 256); > } > > -for (i = 0; i < p->pages->num; i++) { > +for (i = 0; i < p->normal_num; i++) { > /* there are architectures where ram_addr_t is 32 bit */ > -uint64_t temp = p->pages->offset[i]; > +uint64_t temp = p->normal[i]; > > packet->offset[i] = cpu_to_be64(temp); > } > @@ -556,6 +556,8 @@ void multifd_save_cleanup(void) > p->packet = NULL; > g_free(p->iov); > p->iov = NULL; > +g_free(p->normal); > +p->normal = NULL; > multifd_send_state->ops->send_cleanup(p, _err); >
[PATCH v1 8/8] virtio-mem: Support "prealloc=on" option
For scarce memory resources, such as hugetlb, we want to be able to prealloc such memory resources in order to not crash later on access. On simple user errors we could otherwise easily run out of memory resources an crash the VM -- pretty much undesired. For ordinary memory devices, such as DIMMs, we preallocate memory via the memory backend for such use cases; however, with virtio-mem we're dealing with sparse memory backends; preallocating the whole memory backend destroys the whole purpose of virtio-mem. Instead, we want to preallocate memory when actually exposing memory to the VM dynamically, and fail plugging memory gracefully + warn the user in case preallocation fails. A common use case for hugetlb will be using "reserve=off,prealloc=off" for the memory backend and "prealloc=on" for the virtio-mem device. This way, no huge pages will be reserved for the process, but we can recover if there are no actual huge pages when plugging memory. Libvirt is already prepared for this. Note that preallocation cannot protect from the OOM killer -- which holds true for any kind of preallocation in QEMU. It's primarily useful only for scarce memory resources such as hugetlb, or shared file-backed memory. It's of little use for ordinary anonymous memory that can be swapped, KSM merged, ... but we won't forbid it. Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 39 ++ include/hw/virtio/virtio-mem.h | 4 2 files changed, 39 insertions(+), 4 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index a5d26d414f..33c7884aa0 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -450,10 +450,40 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, return -EBUSY; } virtio_mem_notify_unplug(vmem, offset, size); -} else if (virtio_mem_notify_plug(vmem, offset, size)) { -/* Could be a mapping attempt resulted in memory getting populated. */ -ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size); -return -EBUSY; +} else { +int ret = 0; + +if (vmem->prealloc) { +void *area = memory_region_get_ram_ptr(>memdev->mr) + offset; +int fd = memory_region_get_fd(>memdev->mr); +Error *local_err = NULL; + +os_mem_prealloc(fd, area, size, 1, _err); +if (local_err) { +static bool warned; + +/* + * Warn only once, we don't want to fill the log with these + * warnings. + */ +if (!warned) { +warn_report_err(local_err); +warned = true; +} else { +error_free(local_err); +} +ret = -EBUSY; +} +} +if (!ret) { +ret = virtio_mem_notify_plug(vmem, offset, size); +} + +if (ret) { +/* Could be preallocation or a notifier populated memory. */ +ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size); +return -EBUSY; +} } virtio_mem_set_bitmap(vmem, start_gpa, size, plug); return 0; @@ -1165,6 +1195,7 @@ static void virtio_mem_instance_init(Object *obj) static Property virtio_mem_properties[] = { DEFINE_PROP_UINT64(VIRTIO_MEM_ADDR_PROP, VirtIOMEM, addr, 0), DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0), +DEFINE_PROP_BOOL(VIRTIO_MEM_PREALLOC_PROP, VirtIOMEM, prealloc, false), DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev, TYPE_MEMORY_BACKEND, HostMemoryBackend *), #if defined(VIRTIO_MEM_HAS_LEGACY_GUESTS) diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h index 38c67a89f2..7745cfc1a3 100644 --- a/include/hw/virtio/virtio-mem.h +++ b/include/hw/virtio/virtio-mem.h @@ -31,6 +31,7 @@ OBJECT_DECLARE_TYPE(VirtIOMEM, VirtIOMEMClass, #define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size" #define VIRTIO_MEM_ADDR_PROP "memaddr" #define VIRTIO_MEM_UNPLUGGED_INACCESSIBLE_PROP "unplugged-inaccessible" +#define VIRTIO_MEM_PREALLOC_PROP "prealloc" struct VirtIOMEM { VirtIODevice parent_obj; @@ -70,6 +71,9 @@ struct VirtIOMEM { */ OnOffAuto unplugged_inaccessible; +/* whether to prealloc memory when plugging new blocks */ +bool prealloc; + /* notifiers to notify when "size" changes */ NotifierList size_change_notifiers; -- 2.31.1
[PATCH v1 6/8] util/oslib-posix: Support concurrent os_mem_prealloc() invocation
Add a mutex to protect the SIGBUS case, as we cannot mess concurrently with the sigbus handler and we have to manage the global variable sigbus_memset_context. The MADV_POPULATE_WRITE path can run concurrently. Note that page_mutex and page_cond are shared between concurrent invocations, which shouldn't be a problem. This is a preparation for future virtio-mem prealloc code, which will call os_mem_prealloc() asynchronously from an iothread when handling guest requests. Reviewed-by: Pankaj Gupta Reviewed-by: Daniel P. Berrangé Signed-off-by: David Hildenbrand --- util/oslib-posix.c | 9 + 1 file changed, 9 insertions(+) diff --git a/util/oslib-posix.c b/util/oslib-posix.c index efa4f96d56..9829149e4b 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -95,6 +95,7 @@ typedef struct MemsetThread MemsetThread; /* used by sigbus_handler() */ static MemsetContext *sigbus_memset_context; +static QemuMutex sigbus_mutex; static QemuMutex page_mutex; static QemuCond page_cond; @@ -625,6 +626,7 @@ static bool madv_populate_write_possible(char *area, size_t pagesize) void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, Error **errp) { +static gsize initialized; int ret; struct sigaction act, oldact; size_t hpagesize = qemu_fd_getpagesize(fd); @@ -638,6 +640,12 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, use_madv_populate_write = madv_populate_write_possible(area, hpagesize); if (!use_madv_populate_write) { +if (g_once_init_enter()) { +qemu_mutex_init(_mutex); +g_once_init_leave(, 1); +} + +qemu_mutex_lock(_mutex); memset(, 0, sizeof(act)); act.sa_handler = _handler; act.sa_flags = 0; @@ -665,6 +673,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, perror("os_mem_prealloc: failed to reinstall signal handler"); exit(1); } +qemu_mutex_unlock(_mutex); } } -- 2.31.1
[PATCH v1 7/8] util/oslib-posix: Forward SIGBUS to MCE handler under Linux
Temporarily modifying the SIGBUS handler is really nasty, as we might be unlucky and receive an MCE SIGBUS while having our handler registered. Unfortunately, there is no way around messing with SIGBUS when MADV_POPULATE_WRITE is not applicable or not around. Let's forward SIGBUS that don't belong to us to the already registered handler and document the situation. Reviewed-by: Daniel P. Berrangé Signed-off-by: David Hildenbrand --- softmmu/cpus.c | 4 util/oslib-posix.c | 36 +--- 2 files changed, 37 insertions(+), 3 deletions(-) diff --git a/softmmu/cpus.c b/softmmu/cpus.c index 071085f840..23bca46b07 100644 --- a/softmmu/cpus.c +++ b/softmmu/cpus.c @@ -352,6 +352,10 @@ static void qemu_init_sigbus(void) { struct sigaction action; +/* + * ALERT: when modifying this, take care that SIGBUS forwarding in + * os_mem_prealloc() will continue working as expected. + */ memset(, 0, sizeof(action)); action.sa_flags = SA_SIGINFO; action.sa_sigaction = sigbus_handler; diff --git a/util/oslib-posix.c b/util/oslib-posix.c index 9829149e4b..5c47aa9cb7 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -95,6 +95,7 @@ typedef struct MemsetThread MemsetThread; /* used by sigbus_handler() */ static MemsetContext *sigbus_memset_context; +struct sigaction sigbus_oldact; static QemuMutex sigbus_mutex; static QemuMutex page_mutex; @@ -446,7 +447,11 @@ const char *qemu_get_exec_dir(void) return exec_dir; } +#ifdef CONFIG_LINUX +static void sigbus_handler(int signal, siginfo_t *siginfo, void *ctx) +#else /* CONFIG_LINUX */ static void sigbus_handler(int signal) +#endif /* CONFIG_LINUX */ { int i; @@ -459,6 +464,26 @@ static void sigbus_handler(int signal) } } } + +#ifdef CONFIG_LINUX +/* + * We assume that the MCE SIGBUS handler could have been registered. We + * should never receive BUS_MCEERR_AO on any of our threads, but only on + * the main thread registered for PR_MCE_KILL_EARLY. Further, we should not + * receive BUS_MCEERR_AR triggered by action of other threads on one of + * our threads. So, no need to check for unrelated SIGBUS when seeing one + * for our threads. + * + * We will forward to the MCE handler, which will either handle the SIGBUS + * or reinstall the default SIGBUS handler and reraise the SIGBUS. The + * default SIGBUS handler will crash the process, so we don't care. + */ +if (sigbus_oldact.sa_flags & SA_SIGINFO) { +sigbus_oldact.sa_sigaction(signal, siginfo, ctx); +return; +} +#endif /* CONFIG_LINUX */ +warn_report("os_mem_prealloc: unrelated SIGBUS detected and ignored"); } static void *do_touch_pages(void *arg) @@ -628,10 +653,10 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, { static gsize initialized; int ret; -struct sigaction act, oldact; size_t hpagesize = qemu_fd_getpagesize(fd); size_t numpages = DIV_ROUND_UP(memory, hpagesize); bool use_madv_populate_write; +struct sigaction act; /* * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for @@ -647,10 +672,15 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, qemu_mutex_lock(_mutex); memset(, 0, sizeof(act)); +#ifdef CONFIG_LINUX +act.sa_sigaction = _handler; +act.sa_flags = SA_SIGINFO; +#else /* CONFIG_LINUX */ act.sa_handler = _handler; act.sa_flags = 0; +#endif /* CONFIG_LINUX */ -ret = sigaction(SIGBUS, , ); +ret = sigaction(SIGBUS, , _oldact); if (ret) { error_setg_errno(errp, errno, "os_mem_prealloc: failed to install signal handler"); @@ -667,7 +697,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, } if (!use_madv_populate_write) { -ret = sigaction(SIGBUS, , NULL); +ret = sigaction(SIGBUS, _oldact, NULL); if (ret) { /* Terminate QEMU since it can't recover from error */ perror("os_mem_prealloc: failed to reinstall signal handler"); -- 2.31.1
Re: [PATCH v3 16/23] multifd: Unfold "used" variable by its value
* Juan Quintela (quint...@redhat.com) wrote: > Signed-off-by: Juan Quintela Reviewed-by: Dr. David Alan Gilbert > --- > migration/multifd.c | 8 +++- > 1 file changed, 3 insertions(+), 5 deletions(-) > > diff --git a/migration/multifd.c b/migration/multifd.c > index 65676d56fd..6983ba3e7c 100644 > --- a/migration/multifd.c > +++ b/migration/multifd.c > @@ -1059,7 +1059,6 @@ static void *multifd_recv_thread(void *opaque) > rcu_register_thread(); > > while (true) { > -uint32_t used; > uint32_t flags; > > if (p->quit) { > @@ -1082,17 +1081,16 @@ static void *multifd_recv_thread(void *opaque) > break; > } > > -used = p->pages->num; > flags = p->flags; > /* recv methods don't know how to handle the SYNC flag */ > p->flags &= ~MULTIFD_FLAG_SYNC; > -trace_multifd_recv(p->id, p->packet_num, used, flags, > +trace_multifd_recv(p->id, p->packet_num, p->pages->num, flags, > p->next_packet_size); > p->num_packets++; > -p->num_pages += used; > +p->num_pages += p->pages->num; > qemu_mutex_unlock(>mutex); > > -if (used) { > +if (p->pages->num) { > ret = multifd_recv_state->ops->recv_pages(p, _err); > if (ret != 0) { > break; > -- > 2.33.1 > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
[PATCH v1 3/8] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages()
Let's minimize the number of global variables to prepare for os_mem_prealloc() getting called concurrently and make the code a bit easier to read. The only consumer that really needs a global variable is the sigbus handler, which will require protection via a mutex in the future either way as we cannot concurrently mess with the SIGBUS handler. Reviewed-by: Daniel P. Berrangé Signed-off-by: David Hildenbrand --- util/oslib-posix.c | 73 +- 1 file changed, 47 insertions(+), 26 deletions(-) diff --git a/util/oslib-posix.c b/util/oslib-posix.c index cb89e07770..cf2ead54ad 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -73,21 +73,30 @@ #define MAX_MEM_PREALLOC_THREAD_COUNT 16 +struct MemsetThread; + +typedef struct MemsetContext { +bool all_threads_created; +bool any_thread_failed; +struct MemsetThread *threads; +int num_threads; +} MemsetContext; + struct MemsetThread { char *addr; size_t numpages; size_t hpagesize; QemuThread pgthread; sigjmp_buf env; +MemsetContext *context; }; typedef struct MemsetThread MemsetThread; -static MemsetThread *memset_thread; -static int memset_num_threads; +/* used by sigbus_handler() */ +static MemsetContext *sigbus_memset_context; static QemuMutex page_mutex; static QemuCond page_cond; -static bool threads_created_flag; int qemu_get_thread_id(void) { @@ -438,10 +447,13 @@ const char *qemu_get_exec_dir(void) static void sigbus_handler(int signal) { int i; -if (memset_thread) { -for (i = 0; i < memset_num_threads; i++) { -if (qemu_thread_is_self(_thread[i].pgthread)) { -siglongjmp(memset_thread[i].env, 1); + +if (sigbus_memset_context) { +for (i = 0; i < sigbus_memset_context->num_threads; i++) { +MemsetThread *thread = _memset_context->threads[i]; + +if (qemu_thread_is_self(>pgthread)) { +siglongjmp(thread->env, 1); } } } @@ -459,7 +471,7 @@ static void *do_touch_pages(void *arg) * clearing until all threads have been created. */ qemu_mutex_lock(_mutex); -while(!threads_created_flag){ +while (!memset_args->context->all_threads_created) { qemu_cond_wait(_cond, _mutex); } qemu_mutex_unlock(_mutex); @@ -502,7 +514,7 @@ static void *do_madv_populate_write_pages(void *arg) /* See do_touch_pages(). */ qemu_mutex_lock(_mutex); -while (!threads_created_flag) { +while (!memset_args->context->all_threads_created) { qemu_cond_wait(_cond, _mutex); } qemu_mutex_unlock(_mutex); @@ -529,6 +541,9 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, int smp_cpus, bool use_madv_populate_write) { static gsize initialized = 0; +MemsetContext context = { +.num_threads = get_memset_num_threads(smp_cpus), +}; size_t numpages_per_thread, leftover; void *(*touch_fn)(void *); int ret = 0, i = 0; @@ -546,35 +561,41 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, touch_fn = do_touch_pages; } -threads_created_flag = false; -memset_num_threads = get_memset_num_threads(smp_cpus); -memset_thread = g_new0(MemsetThread, memset_num_threads); -numpages_per_thread = numpages / memset_num_threads; -leftover = numpages % memset_num_threads; -for (i = 0; i < memset_num_threads; i++) { -memset_thread[i].addr = addr; -memset_thread[i].numpages = numpages_per_thread + (i < leftover); -memset_thread[i].hpagesize = hpagesize; -qemu_thread_create(_thread[i].pgthread, "touch_pages", - touch_fn, _thread[i], +context.threads = g_new0(MemsetThread, context.num_threads); +numpages_per_thread = numpages / context.num_threads; +leftover = numpages % context.num_threads; +for (i = 0; i < context.num_threads; i++) { +context.threads[i].addr = addr; +context.threads[i].numpages = numpages_per_thread + (i < leftover); +context.threads[i].hpagesize = hpagesize; +context.threads[i].context = +qemu_thread_create([i].pgthread, "touch_pages", + touch_fn, [i], QEMU_THREAD_JOINABLE); -addr += memset_thread[i].numpages * hpagesize; +addr += context.threads[i].numpages * hpagesize; +} + +if (!use_madv_populate_write) { +sigbus_memset_context = } qemu_mutex_lock(_mutex); -threads_created_flag = true; +context.all_threads_created = true; qemu_cond_broadcast(_cond); qemu_mutex_unlock(_mutex); -for (i = 0; i < memset_num_threads; i++) { -int tmp = (uintptr_t)qemu_thread_join(_thread[i].pgthread); +for (i = 0; i < context.num_threads; i++) { +int tmp = (uintptr_t)qemu_thread_join([i].pgthread);
[PATCH v1 4/8] util/oslib-posix: Don't create too many threads with small memory or little pages
Let's limit the number of threads to something sane, especially that - We don't have more threads than the number of pages we have - We don't have threads that initialize small (< 64 MiB) memory Reviewed-by: Pankaj Gupta Reviewed-by: Dr. David Alan Gilbert Reviewed-by: Daniel P. Berrangé Signed-off-by: David Hildenbrand --- util/oslib-posix.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/util/oslib-posix.c b/util/oslib-posix.c index cf2ead54ad..67c08a425e 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -40,6 +40,7 @@ #include #include "qemu/cutils.h" #include "qemu/compiler.h" +#include "qemu/units.h" #ifdef CONFIG_LINUX #include @@ -525,7 +526,8 @@ static void *do_madv_populate_write_pages(void *arg) return (void *)(uintptr_t)ret; } -static inline int get_memset_num_threads(int smp_cpus) +static inline int get_memset_num_threads(size_t hpagesize, size_t numpages, + int smp_cpus) { long host_procs = sysconf(_SC_NPROCESSORS_ONLN); int ret = 1; @@ -533,6 +535,12 @@ static inline int get_memset_num_threads(int smp_cpus) if (host_procs > 0) { ret = MIN(MIN(host_procs, MAX_MEM_PREALLOC_THREAD_COUNT), smp_cpus); } + +/* Especially with gigantic pages, don't create more threads than pages. */ +ret = MIN(ret, numpages); +/* Don't start threads to prealloc comparatively little memory. */ +ret = MIN(ret, MAX(1, hpagesize * numpages / (64 * MiB))); + /* In case sysconf() fails, we fall back to single threaded */ return ret; } @@ -542,7 +550,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, { static gsize initialized = 0; MemsetContext context = { -.num_threads = get_memset_num_threads(smp_cpus), +.num_threads = get_memset_num_threads(hpagesize, numpages, smp_cpus), }; size_t numpages_per_thread, leftover; void *(*touch_fn)(void *); -- 2.31.1
[PATCH v1 5/8] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE
Let's simplify the case when we only want a single thread and don't have to mess with signal handlers. Reviewed-by: Pankaj Gupta Reviewed-by: Daniel P. Berrangé Signed-off-by: David Hildenbrand --- util/oslib-posix.c | 8 1 file changed, 8 insertions(+) diff --git a/util/oslib-posix.c b/util/oslib-posix.c index 67c08a425e..efa4f96d56 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -564,6 +564,14 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, } if (use_madv_populate_write) { +/* Avoid creating a single thread for MADV_POPULATE_WRITE */ +if (context.num_threads == 1) { +if (qemu_madvise(area, hpagesize * numpages, + QEMU_MADV_POPULATE_WRITE)) { +return -errno; +} +return 0; +} touch_fn = do_madv_populate_write_pages; } else { touch_fn = do_touch_pages; -- 2.31.1
[PATCH v1 2/8] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
Let's sense support and use it for preallocation. MADV_POPULATE_WRITE does not require a SIGBUS handler, doesn't actually touch page content, and avoids context switches; it is, therefore, faster and easier to handle than our current approach. While MADV_POPULATE_WRITE is, in general, faster than manual prefaulting, and especially faster with 4k pages, there is still value in prefaulting using multiple threads to speed up preallocation. More details on MADV_POPULATE_WRITE can be found in the Linux commits 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") and eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)"), and in the man page proposal [1]. This resolves the TODO in do_touch_pages(). In the future, we might want to look into using fallocate(), eventually combined with MADV_POPULATE_READ, when dealing with shared file/fd mappings and not caring about memory bindings. [1] https://lkml.kernel.org/r/20210816081922.5155-1-da...@redhat.com Reviewed-by: Pankaj Gupta Reviewed-by: Daniel P. Berrangé Signed-off-by: David Hildenbrand --- include/qemu/osdep.h | 7 util/oslib-posix.c | 83 +--- 2 files changed, 69 insertions(+), 21 deletions(-) diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index 60718fc342..d1660d67fa 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -471,6 +471,11 @@ static inline void qemu_cleanup_generic_vfree(void *p) #else #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED #endif +#ifdef MADV_POPULATE_WRITE +#define QEMU_MADV_POPULATE_WRITE MADV_POPULATE_WRITE +#else +#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID +#endif #elif defined(CONFIG_POSIX_MADVISE) @@ -484,6 +489,7 @@ static inline void qemu_cleanup_generic_vfree(void *p) #define QEMU_MADV_HUGEPAGE QEMU_MADV_INVALID #define QEMU_MADV_NOHUGEPAGE QEMU_MADV_INVALID #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED +#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID #else /* no-op */ @@ -497,6 +503,7 @@ static inline void qemu_cleanup_generic_vfree(void *p) #define QEMU_MADV_HUGEPAGE QEMU_MADV_INVALID #define QEMU_MADV_NOHUGEPAGE QEMU_MADV_INVALID #define QEMU_MADV_REMOVE QEMU_MADV_INVALID +#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID #endif diff --git a/util/oslib-posix.c b/util/oslib-posix.c index b146beef78..cb89e07770 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -484,10 +484,6 @@ static void *do_touch_pages(void *arg) * * 'volatile' to stop compiler optimizing this away * to a no-op - * - * TODO: get a better solution from kernel so we - * don't need to write at all so we don't cause - * wear on the storage backing the region... */ *(volatile char *)addr = *addr; addr += hpagesize; @@ -497,6 +493,26 @@ static void *do_touch_pages(void *arg) return (void *)(uintptr_t)ret; } +static void *do_madv_populate_write_pages(void *arg) +{ +MemsetThread *memset_args = (MemsetThread *)arg; +const size_t size = memset_args->numpages * memset_args->hpagesize; +char * const addr = memset_args->addr; +int ret = 0; + +/* See do_touch_pages(). */ +qemu_mutex_lock(_mutex); +while (!threads_created_flag) { +qemu_cond_wait(_cond, _mutex); +} +qemu_mutex_unlock(_mutex); + +if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) { +ret = -errno; +} +return (void *)(uintptr_t)ret; +} + static inline int get_memset_num_threads(int smp_cpus) { long host_procs = sysconf(_SC_NPROCESSORS_ONLN); @@ -510,10 +526,11 @@ static inline int get_memset_num_threads(int smp_cpus) } static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, - int smp_cpus) + int smp_cpus, bool use_madv_populate_write) { static gsize initialized = 0; size_t numpages_per_thread, leftover; +void *(*touch_fn)(void *); int ret = 0, i = 0; char *addr = area; @@ -523,6 +540,12 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, g_once_init_leave(, 1); } +if (use_madv_populate_write) { +touch_fn = do_madv_populate_write_pages; +} else { +touch_fn = do_touch_pages; +} + threads_created_flag = false; memset_num_threads = get_memset_num_threads(smp_cpus); memset_thread = g_new0(MemsetThread, memset_num_threads); @@ -533,7 +556,7 @@ static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, memset_thread[i].numpages = numpages_per_thread + (i < leftover); memset_thread[i].hpagesize = hpagesize; qemu_thread_create(_thread[i].pgthread, "touch_pages", - do_touch_pages, _thread[i], + touch_fn, _thread[i],
[PATCH v7 0/3] support dirty restraint on vCPU
From: Hyman Huang(黄勇) The patch [2/3] has not been touched so far. Any corrections and suggetions are welcome. Please review, thanks! v7: - rebase on master - polish the comments and error message according to the advices given by Markus - introduce dirtylimit_enabled function to pre-check if dirty page limit is enabled before canceling. v6: - rebase on master - fix dirtylimit setup crash found by Markus - polish the comments according to the advice given by Markus - adjust the qemu qmp command tag to 7.0 v5: - rebase on master - adjust the throttle algorithm by removing the tuning in RESTRAINT_RATIO case so that dirty page rate could reachs the quota more quickly. - fix percentage update in throttle iteration. v4: - rebase on master - modify the following points according to the advice given by Markus 1. move the defination into migration.json 2. polish the comments of set-dirty-limit 3. do the syntax check and change dirty rate to dirty page rate Thanks for the carefule reviews made by Markus. Please review, thanks! v3: - rebase on master - modify the following points according to the advice given by Markus 1. remove the DirtyRateQuotaVcpu and use its field as option directly 2. add comments to show details of what dirtylimit setup do 3. explain how to use dirtylimit in combination with existing qmp commands "calc-dirty-rate" and "query-dirty-rate" in documentation. Thanks for the carefule reviews made by Markus. Please review, thanks! Hyman v2: - rebase on master - modify the following points according to the advices given by Juan 1. rename dirtyrestraint to dirtylimit 2. implement the full lifecyle function of dirtylimit_calc, include dirtylimit_calc and dirtylimit_calc_quit 3. introduce 'quit' field in dirtylimit_calc_state to implement the dirtylimit_calc_quit 4. remove the ready_cond and ready_mtx since it may not be suitable 5. put the 'record_dirtypage' function code at the beggining of the file 6. remove the unnecesary return; - other modifications has been made after code review 1. introduce 'bmap' and 'nr' field in dirtylimit_state to record the number of running thread forked by dirtylimit 2. stop the dirtyrate calculation thread if all the dirtylimit thread are stopped 3. do some renaming works dirtyrate calulation thread -> dirtylimit-calc dirtylimit thread -> dirtylimit-{cpu_index} function name do_dirtyrestraint -> dirtylimit_check qmp command dirty-restraint -> set-drity-limit qmp command dirty-restraint-cancel -> cancel-dirty-limit header file dirtyrestraint.h -> dirtylimit.h Please review, thanks ! thanks for the accurate and timely advices given by Juan. we really appreciate it if corrections and suggetions about this patchset are proposed. Best Regards ! Hyman v1: this patchset introduce a mechanism to impose dirty restraint on vCPU, aiming to keep the vCPU running in a certain dirtyrate given by user. dirty restraint on vCPU maybe an alternative method to implement convergence logic for live migration, which could improve guest memory performance during migration compared with traditional method in theory. For the current live migration implementation, the convergence logic throttles all vCPUs of the VM, which has some side effects. -'read processes' on vCPU will be unnecessarily penalized - throttle increase percentage step by step, which seems struggling to find the optimal throttle percentage when dirtyrate is high. - hard to predict the remaining time of migration if the throttling percentage reachs 99% to a certain extent, the dirty restraint machnism can fix these effects by throttling at vCPU granularity during migration. the implementation is rather straightforward, we calculate vCPU dirtyrate via the Dirty Ring mechanism periodically as the commit 0e21bf246 "implement dirty-ring dirtyrate calculation" does, for vCPU that be specified to impose dirty restraint, we throttle it periodically as the auto-converge does, once after throttling, we compare the quota dirtyrate with current dirtyrate, if current dirtyrate is not under the quota, increase the throttling percentage until current dirtyrate is under the quota. this patchset is the basis of implmenting a new auto-converge method for live migration, we introduce two qmp commands for impose/cancel the dirty restraint on specified vCPU, so it also can be an independent api to supply the upper app such as libvirt, which can use it to implement the convergence logic during live migration, supplemented with the qmp 'calc-dirty-rate' command or whatever. we post this patchset for RFC and any corrections and suggetions about the implementation, api, throttleing algorithm or whatever are very appreciated! Please review, thanks ! Best Regards ! Hyman Huang (3): migration/dirtyrate: implement vCPU dirtyrate calculation periodically cpu-throttle: implement vCPU throttle cpus-common: implement dirty
[PATCH v1 0/8] virtio-mem: Support "prealloc=on"
Based-on: <20211130092838.24224-1-da...@redhat.com> Patch #1 - #7 are fully reviewed [1] but did not get picked up yet, so I'm sending them along here, as they are required to use os_mem_prealloc() in a safe way once the VM is running. Support preallocation of memory to make virtio-mem safe to use with scarce memory resources such as hugetlb. Before acknowledging a plug request from the guest, we'll try preallcoating memory. If that fails, we'll fail the request gracefully and warn the usr once. To fully support huge pages for shared memory, we'll have to adjust shared memory users, such as virtiofsd, to map guest memory via MAP_NORESERVE as well, because otherwise, they'll end up overwriting the "reserve=off" decision made by QEMU and try reserving huge pages for the sparse memory region. In the future, we might want to process guest requests, including preallocating memory, asynchronously via a dedicated iothread. [1] https://lkml.kernel.org/r/20211004120208.7409-1-da...@redhat.com Cc: Paolo Bonzini Cc: "Michael S. Tsirkin" Cc: Igor Mammedov Cc: Eduardo Habkost Cc: Dr. David Alan Gilbert Cc: Daniel P. Berrangé Cc: Gavin Shan Cc: Hui Zhu Cc: Sebastien Boeuf Cc: Pankaj Gupta Cc: Michal Prívozník David Hildenbrand (8): util/oslib-posix: Let touch_all_pages() return an error util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() util/oslib-posix: Don't create too many threads with small memory or little pages util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE util/oslib-posix: Support concurrent os_mem_prealloc() invocation util/oslib-posix: Forward SIGBUS to MCE handler under Linux virtio-mem: Support "prealloc=on" option hw/virtio/virtio-mem.c | 39 +- include/hw/virtio/virtio-mem.h | 4 + include/qemu/osdep.h | 7 + softmmu/cpus.c | 4 + util/oslib-posix.c | 231 + 5 files changed, 226 insertions(+), 59 deletions(-) -- 2.31.1
[PATCH v1 1/8] util/oslib-posix: Let touch_all_pages() return an error
Let's prepare touch_all_pages() for returning differing errors. Return an error from the thread and report the last processed error. Translate SIGBUS to -EFAULT, as a SIGBUS can mean all different kind of things (memory error, read error, out of memory). When allocating memory fails via the current SIGBUS-based mechanism, we'll get: os_mem_prealloc: preallocating memory failed: Bad address Reviewed-by: Daniel P. Berrangé Signed-off-by: David Hildenbrand --- util/oslib-posix.c | 28 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/util/oslib-posix.c b/util/oslib-posix.c index e8bdb02e1d..b146beef78 100644 --- a/util/oslib-posix.c +++ b/util/oslib-posix.c @@ -84,7 +84,6 @@ typedef struct MemsetThread MemsetThread; static MemsetThread *memset_thread; static int memset_num_threads; -static bool memset_thread_failed; static QemuMutex page_mutex; static QemuCond page_cond; @@ -452,6 +451,7 @@ static void *do_touch_pages(void *arg) { MemsetThread *memset_args = (MemsetThread *)arg; sigset_t set, oldset; +int ret = 0; /* * On Linux, the page faults from the loop below can cause mmap_sem @@ -470,7 +470,7 @@ static void *do_touch_pages(void *arg) pthread_sigmask(SIG_UNBLOCK, , ); if (sigsetjmp(memset_args->env, 1)) { -memset_thread_failed = true; +ret = -EFAULT; } else { char *addr = memset_args->addr; size_t numpages = memset_args->numpages; @@ -494,7 +494,7 @@ static void *do_touch_pages(void *arg) } } pthread_sigmask(SIG_SETMASK, , NULL); -return NULL; +return (void *)(uintptr_t)ret; } static inline int get_memset_num_threads(int smp_cpus) @@ -509,13 +509,13 @@ static inline int get_memset_num_threads(int smp_cpus) return ret; } -static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages, -int smp_cpus) +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages, + int smp_cpus) { static gsize initialized = 0; size_t numpages_per_thread, leftover; +int ret = 0, i = 0; char *addr = area; -int i = 0; if (g_once_init_enter()) { qemu_mutex_init(_mutex); @@ -523,7 +523,6 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages, g_once_init_leave(, 1); } -memset_thread_failed = false; threads_created_flag = false; memset_num_threads = get_memset_num_threads(smp_cpus); memset_thread = g_new0(MemsetThread, memset_num_threads); @@ -545,12 +544,16 @@ static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages, qemu_mutex_unlock(_mutex); for (i = 0; i < memset_num_threads; i++) { -qemu_thread_join(_thread[i].pgthread); +int tmp = (uintptr_t)qemu_thread_join(_thread[i].pgthread); + +if (tmp) { +ret = tmp; +} } g_free(memset_thread); memset_thread = NULL; -return memset_thread_failed; +return ret; } void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, @@ -573,9 +576,10 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus, } /* touch pages simultaneously */ -if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) { -error_setg(errp, "os_mem_prealloc: Insufficient free host memory " -"pages available to allocate guest RAM"); +ret = touch_all_pages(area, hpagesize, numpages, smp_cpus); +if (ret) { +error_setg_errno(errp, -ret, + "os_mem_prealloc: preallocating memory failed"); } ret = sigaction(SIGBUS, , NULL); -- 2.31.1
[PATCH 2/3] target/m68k: Implement TRAPcc
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/754 Signed-off-by: Richard Henderson --- target/m68k/cpu.h | 2 ++ target/m68k/cpu.c | 1 + target/m68k/translate.c | 21 + 3 files changed, 24 insertions(+) diff --git a/target/m68k/cpu.h b/target/m68k/cpu.h index a3423729ef..03f600f7e7 100644 --- a/target/m68k/cpu.h +++ b/target/m68k/cpu.h @@ -527,6 +527,8 @@ enum m68k_features { M68K_FEATURE_MOVEC, /* Unaligned data accesses (680[2346]0) */ M68K_FEATURE_UNALIGNED_DATA, +/* TRAPCC insn. (680[2346]0, and CPU32) */ +M68K_FEATURE_TRAPCC, }; static inline int m68k_feature(CPUM68KState *env, int feature) diff --git a/target/m68k/cpu.c b/target/m68k/cpu.c index c7aeb7da9c..5f778773d1 100644 --- a/target/m68k/cpu.c +++ b/target/m68k/cpu.c @@ -162,6 +162,7 @@ static void m68020_cpu_initfn(Object *obj) m68k_set_feature(env, M68K_FEATURE_CHK2); m68k_set_feature(env, M68K_FEATURE_MSP); m68k_set_feature(env, M68K_FEATURE_UNALIGNED_DATA); +m68k_set_feature(env, M68K_FEATURE_TRAPCC); } /* diff --git a/target/m68k/translate.c b/target/m68k/translate.c index 858ba761fc..cf29f35d91 100644 --- a/target/m68k/translate.c +++ b/target/m68k/translate.c @@ -4879,6 +4879,26 @@ DISAS_INSN(trapv) do_trapcc(s, 9); /* VS */ } +DISAS_INSN(trapcc) +{ +/* Consume and discard the immediate operand. */ +switch (extract32(insn, 0, 3)) { +case 2: /* trapcc.w */ +(void)read_im16(env, s); +break; +case 3: /* trapcc.l */ +(void)read_im32(env, s); +break; +case 4: /* trapcc (no operand) */ +break; +default: +/* Illegal insn */ +disas_undef(env, s, insn); +return; +} +do_trapcc(s, extract32(insn, 8, 4)); +} + static void gen_load_fcr(DisasContext *s, TCGv res, int reg) { switch (reg) { @@ -6051,6 +6071,7 @@ void register_m68k_insns (CPUM68KState *env) INSN(scc, 50c0, f0f8, CF_ISA_A); /* Scc.B Dx */ INSN(scc, 50c0, f0c0, M68000); /* Scc.B */ INSN(dbcc, 50c8, f0f8, M68000); +INSN(trapcc,50f8, f0f8, TRAPCC); INSN(tpf, 51f8, fff8, CF_ISA_A); /* Branch instructions. */ -- 2.25.1
[PATCH v7 2/3] cpu-throttle: implement vCPU throttle
From: Hyman Huang(黄勇) Impose dirty restraint on vCPU by kicking it and sleep as the auto-converge does during migration, but just kick the specified vCPU instead, not all vCPUs of vm. Start a thread to track the dirtylimit status and adjust the throttle pencentage dynamically depend on current and quota dirtyrate. Introduce the util function in the header for dirtylimit implementation. Signed-off-by: Hyman Huang(黄勇) --- include/sysemu/cpu-throttle.h | 30 softmmu/cpu-throttle.c| 316 ++ softmmu/trace-events | 5 + 3 files changed, 351 insertions(+) diff --git a/include/sysemu/cpu-throttle.h b/include/sysemu/cpu-throttle.h index d65bdef..334e5e2 100644 --- a/include/sysemu/cpu-throttle.h +++ b/include/sysemu/cpu-throttle.h @@ -65,4 +65,34 @@ bool cpu_throttle_active(void); */ int cpu_throttle_get_percentage(void); +/** + * dirtylimit_enabled + * + * Returns: %true if dirty page limit for vCPU is enabled, %false otherwise. + */ +bool dirtylimit_enabled(int cpu_index); + +/** + * dirtylimit_state_init: + * + * initialize golobal state for dirtylimit + */ +void dirtylimit_state_init(int max_cpus); + +/** + * dirtylimit_vcpu: + * + * impose dirtylimit on vcpu util reaching the quota dirtyrate + */ +void dirtylimit_vcpu(int cpu_index, + uint64_t quota); +/** + * dirtylimit_cancel_vcpu: + * + * cancel dirtylimit for the specified vcpu + * + * Returns: the number of running threads for dirtylimit + */ +int dirtylimit_cancel_vcpu(int cpu_index); + #endif /* SYSEMU_CPU_THROTTLE_H */ diff --git a/softmmu/cpu-throttle.c b/softmmu/cpu-throttle.c index 8c2144a..f199d68 100644 --- a/softmmu/cpu-throttle.c +++ b/softmmu/cpu-throttle.c @@ -29,6 +29,8 @@ #include "qemu/main-loop.h" #include "sysemu/cpus.h" #include "sysemu/cpu-throttle.h" +#include "sysemu/dirtylimit.h" +#include "trace.h" /* vcpu throttling controls */ static QEMUTimer *throttle_timer; @@ -38,6 +40,320 @@ static unsigned int throttle_percentage; #define CPU_THROTTLE_PCT_MAX 99 #define CPU_THROTTLE_TIMESLICE_NS 1000 +#define DIRTYLIMIT_TOLERANCE_RANGE 15 /* 15MB/s */ + +#define DIRTYLIMIT_THROTTLE_HEAVY_WATERMARK 75 +#define DIRTYLIMIT_THROTTLE_SLIGHT_WATERMARK90 + +#define DIRTYLIMIT_THROTTLE_HEAVY_STEP_SIZE 5 +#define DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE2 + +typedef enum { +RESTRAIN_KEEP, +RESTRAIN_RATIO, +RESTRAIN_HEAVY, +RESTRAIN_SLIGHT, +} RestrainPolicy; + +typedef struct DirtyLimitState { +int cpu_index; +bool enabled; +uint64_t quota; /* quota dirtyrate MB/s */ +QemuThread thread; +char *name; /* thread name */ +} DirtyLimitState; + +struct { +DirtyLimitState *states; +int max_cpus; +unsigned long *bmap; /* running thread bitmap */ +unsigned long nr; +} *dirtylimit_state; + +bool dirtylimit_enabled(int cpu_index) +{ +return qatomic_read(_state->states[cpu_index].enabled); +} + +static inline void dirtylimit_set_quota(int cpu_index, uint64_t quota) +{ +qatomic_set(_state->states[cpu_index].quota, quota); +} + +static inline uint64_t dirtylimit_quota(int cpu_index) +{ +return qatomic_read(_state->states[cpu_index].quota); +} + +static int64_t dirtylimit_current(int cpu_index) +{ +return dirtylimit_calc_current(cpu_index); +} + +static void dirtylimit_vcpu_thread(CPUState *cpu, run_on_cpu_data data) +{ +double pct; +double throttle_ratio; +int64_t sleeptime_ns, endtime_ns; +int *percentage = (int *)data.host_ptr; + +pct = (double)(*percentage) / 100; +throttle_ratio = pct / (1 - pct); +/* Add 1ns to fix double's rounding error (like 0.999...) */ +sleeptime_ns = (int64_t)(throttle_ratio * CPU_THROTTLE_TIMESLICE_NS + 1); +endtime_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + sleeptime_ns; +while (sleeptime_ns > 0 && !cpu->stop) { +if (sleeptime_ns > SCALE_MS) { +qemu_cond_timedwait_iothread(cpu->halt_cond, + sleeptime_ns / SCALE_MS); +} else { +qemu_mutex_unlock_iothread(); +g_usleep(sleeptime_ns / SCALE_US); +qemu_mutex_lock_iothread(); +} +sleeptime_ns = endtime_ns - qemu_clock_get_ns(QEMU_CLOCK_REALTIME); +} +qatomic_set(>throttle_thread_scheduled, 0); + +free(percentage); +} + +static void dirtylimit_check(int cpu_index, + int percentage) +{ +CPUState *cpu; +int64_t sleeptime_ns, starttime_ms, currenttime_ms; +int *pct_parameter; +double pct; + +pct = (double) percentage / 100; + +starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); + +while (true) { +CPU_FOREACH(cpu) { +if ((cpu_index == cpu->cpu_index) && +(!qatomic_xchg(>throttle_thread_scheduled, 1))) { +pct_parameter = malloc(sizeof(*pct_parameter)); +*pct_parameter = percentage; +
[PATCH v7 1/3] migration/dirtyrate: implement vCPU dirtyrate calculation periodically
From: Hyman Huang(黄勇) Introduce the third method GLOBAL_DIRTY_LIMIT of dirty tracking for calculate dirtyrate periodly for dirty restraint. Implement thread for calculate dirtyrate periodly, which will be used for dirty restraint. Add dirtylimit.h to introduce the util function for dirty limit implementation. Signed-off-by: Hyman Huang(黄勇) --- include/exec/memory.h | 5 +- include/sysemu/dirtylimit.h | 44 ++ migration/dirtyrate.c | 139 migration/dirtyrate.h | 2 + 4 files changed, 179 insertions(+), 11 deletions(-) create mode 100644 include/sysemu/dirtylimit.h diff --git a/include/exec/memory.h b/include/exec/memory.h index 20f1b27..606bec8 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -69,7 +69,10 @@ static inline void fuzz_dma_read_cb(size_t addr, /* Dirty tracking enabled because measuring dirty rate */ #define GLOBAL_DIRTY_DIRTY_RATE (1U << 1) -#define GLOBAL_DIRTY_MASK (0x3) +/* Dirty tracking enabled because dirty limit */ +#define GLOBAL_DIRTY_LIMIT (1U << 2) + +#define GLOBAL_DIRTY_MASK (0x7) extern unsigned int global_dirty_tracking; diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h new file mode 100644 index 000..49298a2 --- /dev/null +++ b/include/sysemu/dirtylimit.h @@ -0,0 +1,44 @@ +/* + * dirty limit helper functions + * + * Copyright (c) 2021 CHINA TELECOM CO.,LTD. + * + * Authors: + * Hyman Huang(黄勇) + * + * This work is licensed under the terms of the GNU GPL, version 2 or later. + * See the COPYING file in the top-level directory. + */ +#ifndef QEMU_DIRTYRLIMIT_H +#define QEMU_DIRTYRLIMIT_H + +#define DIRTYLIMIT_CALC_PERIOD_TIME_S 15 /* 15s */ + +/** + * dirtylimit_calc_current: + * + * get current dirty page rate for specified vCPU. + */ +int64_t dirtylimit_calc_current(int cpu_index); + +/** + * dirtylimit_calc: + * + * start dirty page rate calculation thread. + */ +void dirtylimit_calc(void); + +/** + * dirtylimit_calc_quit: + * + * quit dirty page rate calculation thread. + */ +void dirtylimit_calc_quit(void); + +/** + * dirtylimit_calc_state_init: + * + * initialize dirty page rate calculation state. + */ +void dirtylimit_calc_state_init(int max_cpus); +#endif diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c index d65e744..d370a21 100644 --- a/migration/dirtyrate.c +++ b/migration/dirtyrate.c @@ -27,6 +27,7 @@ #include "qapi/qmp/qdict.h" #include "sysemu/kvm.h" #include "sysemu/runstate.h" +#include "sysemu/dirtylimit.h" #include "exec/memory.h" /* @@ -46,6 +47,134 @@ static struct DirtyRateStat DirtyStat; static DirtyRateMeasureMode dirtyrate_mode = DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING; +#define DIRTYLIMIT_CALC_TIME_MS 1000/* 1000ms */ + +struct { +DirtyRatesData data; +int64_t period; +bool quit; +} *dirtylimit_calc_state; + +static void dirtylimit_global_dirty_log_start(void) +{ +qemu_mutex_lock_iothread(); +memory_global_dirty_log_start(GLOBAL_DIRTY_LIMIT); +qemu_mutex_unlock_iothread(); +} + +static void dirtylimit_global_dirty_log_stop(void) +{ +qemu_mutex_lock_iothread(); +memory_global_dirty_log_sync(); +memory_global_dirty_log_stop(GLOBAL_DIRTY_LIMIT); +qemu_mutex_unlock_iothread(); +} + +static inline void record_dirtypages(DirtyPageRecord *dirty_pages, + CPUState *cpu, bool start) +{ +if (start) { +dirty_pages[cpu->cpu_index].start_pages = cpu->dirty_pages; +} else { +dirty_pages[cpu->cpu_index].end_pages = cpu->dirty_pages; +} +} + +static void dirtylimit_calc_func(void) +{ +CPUState *cpu; +DirtyPageRecord *dirty_pages; +int64_t start_time, end_time, calc_time; +DirtyRateVcpu rate; +int i = 0; + +dirty_pages = g_malloc0(sizeof(*dirty_pages) * +dirtylimit_calc_state->data.nvcpu); + +dirtylimit_global_dirty_log_start(); + +CPU_FOREACH(cpu) { +record_dirtypages(dirty_pages, cpu, true); +} + +start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); +g_usleep(DIRTYLIMIT_CALC_TIME_MS * 1000); +end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME); +calc_time = end_time - start_time; + +dirtylimit_global_dirty_log_stop(); + +CPU_FOREACH(cpu) { +record_dirtypages(dirty_pages, cpu, false); +} + +for (i = 0; i < dirtylimit_calc_state->data.nvcpu; i++) { +uint64_t increased_dirty_pages = +dirty_pages[i].end_pages - dirty_pages[i].start_pages; +uint64_t memory_size_MB = +(increased_dirty_pages * TARGET_PAGE_SIZE) >> 20; +int64_t dirtyrate = (memory_size_MB * 1000) / calc_time; + +rate.id = i; +rate.dirty_rate = dirtyrate; +dirtylimit_calc_state->data.rates[i] = rate; + +trace_dirtyrate_do_calculate_vcpu(i, +dirtylimit_calc_state->data.rates[i].dirty_rate); +} +} + +static void
[PATCH 3/3] target/m68k: Implement FTRAPcc
Signed-off-by: Richard Henderson --- target/m68k/translate.c | 38 ++ 1 file changed, 38 insertions(+) diff --git a/target/m68k/translate.c b/target/m68k/translate.c index cf29f35d91..3c04f9d1a9 100644 --- a/target/m68k/translate.c +++ b/target/m68k/translate.c @@ -5547,6 +5547,43 @@ DISAS_INSN(fscc) tcg_temp_free(tmp); } +DISAS_INSN(ftrapcc) +{ +DisasCompare c; +TCGLabel *over; +uint16_t ext; +int cond; + +ext = read_im16(env, s); +cond = ext & 0x3f; + +/* Consume and discard the immediate operand. */ +switch (extract32(insn, 0, 3)) { +case 2: /* ftrapcc.w */ +(void)read_im16(env, s); +break; +case 3: /* ftrapcc.l */ +(void)read_im32(env, s); +break; +case 4: /* ftrapcc (no operand) */ +break; +default: +/* Illegal insn */ +disas_undef(env, s, insn); +return; +} + +/* Jump over if !cond. */ +gen_fcc_cond(, s, cond); +update_cc_op(s); +over = gen_new_label(); +tcg_gen_brcond_i32(tcg_invert_cond(c.tcond), c.v1, c.v2, over); +free_cond(); + +gen_exception(s, s->base.pc_next, EXCP_TRAPCC); +gen_set_label(over); +} + #if defined(CONFIG_SOFTMMU) DISAS_INSN(frestore) { @@ -6170,6 +6207,7 @@ void register_m68k_insns (CPUM68KState *env) INSN(fbcc, f280, ffc0, CF_FPU); INSN(fpu, f200, ffc0, FPU); INSN(fscc, f240, ffc0, FPU); +INSN(ftrapcc, f278, ff80, FPU); INSN(fbcc, f280, ff80, FPU); #if defined(CONFIG_SOFTMMU) INSN(frestore, f340, ffc0, CF_FPU); -- 2.25.1
[PATCH 1/3] target/m68k: Implement TRAPV
Signed-off-by: Richard Henderson --- target/m68k/translate.c | 17 + 1 file changed, 17 insertions(+) diff --git a/target/m68k/translate.c b/target/m68k/translate.c index af43c8eab8..858ba761fc 100644 --- a/target/m68k/translate.c +++ b/target/m68k/translate.c @@ -4863,6 +4863,22 @@ DISAS_INSN(trap) gen_exception(s, s->base.pc_next, EXCP_TRAP0 + (insn & 0xf)); } +static void do_trapcc(DisasContext *s, int cond) +{ +TCGLabel *over = gen_new_label(); + +/* Jump over if !cond. */ +gen_jmpcc(s, cond ^ 1, over); + +gen_exception(s, s->base.pc_next, EXCP_TRAPCC); +gen_set_label(over); +} + +DISAS_INSN(trapv) +{ +do_trapcc(s, 9); /* VS */ +} + static void gen_load_fcr(DisasContext *s, TCGv res, int reg) { switch (reg) { @@ -6026,6 +6042,7 @@ void register_m68k_insns (CPUM68KState *env) BASE(nop, 4e71, ); INSN(rtd, 4e74, , RTD); BASE(rts, 4e75, ); +INSN(trapv, 4e76, , M68000); INSN(rtr, 4e77, , M68000); BASE(jump, 4e80, ffc0); BASE(jump, 4ec0, ffc0); -- 2.25.1
[PATCH for-7.0 0/3] target/m68k: Implement conditional traps
While looking at #754 for trapcc, I noticed that the other conditional traps, trapv and ftrapcc, are also missing. r~ Richard Henderson (3): target/m68k: Implement TRAPV target/m68k: Implement TRAPcc target/m68k: Implement FTRAPcc target/m68k/cpu.h | 2 ++ target/m68k/cpu.c | 1 + target/m68k/translate.c | 76 + 3 files changed, 79 insertions(+) -- 2.25.1
[PATCH v7 3/3] cpus-common: implement dirty page limit on vCPU
From: Hyman Huang(黄勇) Implement dirtyrate calculation periodically basing on dirty-ring and throttle vCPU until it reachs the quota dirty page rate given by user. Introduce qmp commands set-dirty-limit/cancel-dirty-limit to set/cancel dirty page limit on vCPU. Signed-off-by: Hyman Huang(黄勇) --- cpus-common.c | 48 include/hw/core/cpu.h | 9 + qapi/migration.json | 43 +++ softmmu/vl.c | 1 + 4 files changed, 101 insertions(+) diff --git a/cpus-common.c b/cpus-common.c index 6e73d3e..86c7712 100644 --- a/cpus-common.c +++ b/cpus-common.c @@ -23,6 +23,11 @@ #include "hw/core/cpu.h" #include "sysemu/cpus.h" #include "qemu/lockable.h" +#include "sysemu/dirtylimit.h" +#include "sysemu/cpu-throttle.h" +#include "sysemu/kvm.h" +#include "qapi/error.h" +#include "qapi/qapi-commands-migration.h" static QemuMutex qemu_cpu_list_lock; static QemuCond exclusive_cond; @@ -352,3 +357,46 @@ void process_queued_cpu_work(CPUState *cpu) qemu_mutex_unlock(>work_mutex); qemu_cond_broadcast(_work_cond); } + +void qmp_set_dirty_limit(int64_t idx, + uint64_t dirtyrate, + Error **errp) +{ +if (!kvm_enabled() || !kvm_dirty_ring_enabled()) { +error_setg(errp, "setting a dirty page limit requires KVM with" + " accelerator property 'dirty-ring-size' set'"); +return; +} + +dirtylimit_calc(); +dirtylimit_vcpu(idx, dirtyrate); +} + +void qmp_cancel_dirty_limit(int64_t idx, +Error **errp) +{ +if (!kvm_enabled() || !kvm_dirty_ring_enabled()) { +error_setg(errp, "KVM with accelerator property 'dirty-ring-size'" + " not set, abort canceling a dirty page limit"); +return; +} + +if (!dirtylimit_enabled(idx)) { +error_setg(errp, "dirty page limit for the CPU %ld not set", idx); +return; +} + +if (unlikely(!dirtylimit_cancel_vcpu(idx))) { +dirtylimit_calc_quit(); +} +} + +void dirtylimit_setup(int max_cpus) +{ +if (!kvm_enabled() || !kvm_dirty_ring_enabled()) { +return; +} + +dirtylimit_calc_state_init(max_cpus); +dirtylimit_state_init(max_cpus); +} diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h index e948e81..11df012 100644 --- a/include/hw/core/cpu.h +++ b/include/hw/core/cpu.h @@ -881,6 +881,15 @@ void end_exclusive(void); */ void qemu_init_vcpu(CPUState *cpu); +/** + * dirtylimit_setup: + * + * Initializes the global state of dirtylimit calculation and + * dirtylimit itself. This is prepared for vCPU dirtylimit which + * could be triggered during vm lifecycle. + */ +void dirtylimit_setup(int max_cpus); + #define SSTEP_ENABLE 0x1 /* Enable simulated HW single stepping */ #define SSTEP_NOIRQ 0x2 /* Do not use IRQ while single stepping */ #define SSTEP_NOTIMER 0x4 /* Do not Timers while single stepping */ diff --git a/qapi/migration.json b/qapi/migration.json index bbfd48c..57c9a63 100644 --- a/qapi/migration.json +++ b/qapi/migration.json @@ -1850,6 +1850,49 @@ { 'command': 'query-dirty-rate', 'returns': 'DirtyRateInfo' } ## +# @set-dirty-limit: +# +# Set the upper limit of dirty page rate for a virtual CPU. +# +# Requires KVM with accelerator property "dirty-ring-size" set. +# A virtual CPU's dirty page rate is a measure of its memory load. +# To observe dirty page rates, use @calc-dirty-rate. +# +# @cpu-index: index of the virtual CPU. +# +# @dirty-rate: upper limit for the specified vCPU's dirty page rate (MB/s) +# +# Since: 7.0 +# +# Example: +# {"execute": "set-dirty-limit"} +#"arguments": { "cpu-index": 0, +# "dirty-rate": 200 } } +# +## +{ 'command': 'set-dirty-limit', + 'data': { 'cpu-index': 'int', 'dirty-rate': 'uint64' } } + +## +# @cancel-dirty-limit: +# +# Cancel the dirty page limit for the vCPU which has been set with +# set-dirty-limit command. Note that this command requires support from +# dirty ring, same as the "set-dirty-limit" command. +# +# @cpu-index: index of the virtual CPU to cancel the dirty page limit +# +# Since: 7.0 +# +# Example: +# {"execute": "cancel-dirty-limit"} +#"arguments": { "cpu-index": 0 } } +# +## +{ 'command': 'cancel-dirty-limit', + 'data': { 'cpu-index': 'int' } } + +## # @snapshot-save: # # Save a VM snapshot diff --git a/softmmu/vl.c b/softmmu/vl.c index 620a1f1..0f83ce3 100644 --- a/softmmu/vl.c +++ b/softmmu/vl.c @@ -3777,5 +3777,6 @@ void qemu_init(int argc, char **argv, char **envp) qemu_init_displays(); accel_setup_post(current_machine); os_setup_post(); +dirtylimit_setup(current_machine->smp.max_cpus); resume_mux_open(); } -- 1.8.3.1
[Bug 1749393] Re: sbrk() not working under qemu-user with a PIE-compiled binary?
Uploaded to F-unapproved, waiting for the SRU team to accept it. -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1749393 Title: sbrk() not working under qemu-user with a PIE-compiled binary? Status in QEMU: Fix Released Status in qemu package in Ubuntu: Fix Released Status in qemu source package in Focal: In Progress Bug description: [Impact] * The current space reserved can be too small and we can end up with no space at all for BRK. It can happen to any case, but is much more likely with the now common PIE binaries. * Backport the upstream fix which reserves a bit more space while loading and giving it back after interpreter and stack is loaded. [Test Plan] * On x86 run: sudo apt install -y qemu-user-static docker.io sudo docker run --rm arm64v8/debian:bullseye bash -c 'apt update && apt install -y wget' ... Running hooks in /etc/ca-certificates/update.d... done. Errors were encountered while processing: libc-bin E: Sub-process /usr/bin/dpkg returned an error code (1) Second test from bug 1928075 $ sudo qemu-debootstrap --arch=arm64 bullseye bullseye-arm64 http://ftp.debian.org/debian In the bad case this is failing like W: Failure trying to run: /sbin/ldconfig W: See //debootstrap/debootstrap.log for detail And in that log file you'll see the segfault $ tail -n 2 bullseye-arm64/debootstrap/debootstrap.log qemu: uncaught target signal 11 (Segmentation fault) - core dumped Segmentation fault (core dumped) [Where problems could occur] * Regressions would be around use-cases of linux-user that is emulation not of a system but of binaries. Commonly uses for cross-tests and cross-builds so that is the space to watch for regressions [Other Info] * n/a --- In Debian unstable, we recently switched bash to be a PIE-compiled binary (for hardening). Unfortunately this resulted in bash being broken when run under qemu-user (for all target architectures, host being amd64 for me). $ sudo chroot /srv/chroots/sid-i386/ qemu-i386-static /bin/bash bash: xmalloc: .././shell.c:1709: cannot allocate 10 bytes (0 bytes allocated) bash has its own malloc implementation based on sbrk(): https://git.savannah.gnu.org/cgit/bash.git/tree/lib/malloc/malloc.c When we disable this internal implementation and rely on glibc's malloc, then everything is fine. But it might be that glibc has a fallback when sbrk() is not working properly and it might hide the underlying problem in qemu-user. This issue has also been reported to the bash upstream author and he suggested that the issue might be in qemu-user so I'm opening a ticket here. Here's the discussion with the bash upstream author: https://lists.gnu.org/archive/html/bug-bash/2018-02/threads.html#00080 You can find the problematic bash binary in that .deb file: http://snapshot.debian.org/archive/debian/20180206T154716Z/pool/main/b/bash/bash_4.4.18-1_i386.deb The version of qemu I have been using is 2.11 (Debian package qemu- user-static version 1:2.11+dfsg-1) but I have had reports that the problem is reproducible with older versions (back to 2.8 at least). Here are the related Debian bug reports: https://bugs.debian.org/889869 https://bugs.debian.org/865599 It's worth noting that bash used to have this problem (when compiled as a PIE binary) even when run directly but then something got fixed in the kernel and now the problem only appears when run under qemu-user: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518483 To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1749393/+subscriptions
Re: [PATCH for-7.0 0/4] qemu-common.h include cleanup
Peter Maydell writes: > On Mon, 29 Nov 2021 at 20:05, Peter Maydell wrote: >> >> qemu-common.h has a comment at the top: >> >> * This file is supposed to be included only by .c files. No header file >> should >> * depend on qemu-common.h, as this would easily lead to circular header >> * dependencies. > > As a side note, that comment was added back in 2012 when qemu-common.h > was bigger, included other headers, and did some of the work we currently > use osdep.h for. As it stands today qemu-common.h includes no other > files so it isn't a source of possible circular dependencies -- it's > just a grab-bag of miscellaneous prototypes that in an ideal world > would be in more focused individual headers[*]. So there's an argument > for deleting this comment... First, thank you for this cleanup series. The comment is from commit 04509ad939a: qemu-common.h: Comment about usage rules Every time we make a tiny change on a header file, we often find circular header dependency problems. To avoid this nightmare, we need to stop including qemu-common.h from other headers, and we should gradually move the declarations from the catch-all qemu-common.h header to their specific headers. This simply adds a comment documenting the rules about qemu-common.h, hoping that people will see it before including qemu-common.h from other header files, and before adding more declarations to qemu-common.h. Signed-off-by: Eduardo Habkost Signed-off-by: Andreas Färber You're right: we've since moved all #include out of qemu-common.h, so the risk of circular header dependency is gone, and the comment's claim "this would easily lead to circular header dependencies" is no longer correct. In my opinion, including qemu-common.h in a header is a bad idea regardless. Headers should include the headers they need to make them self-contained, and no more, because more risks slower compiles, in particular frequent recompiles of pretty much everything. Previously discussed at https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg06291.html Message-ID: <877eac82il@dusky.pond.sub.org> Nothing in qemu-common.h should be required to make another header self-contained. This is likely the case for other headers as well, which don't carry a comment forbidding inclusion into headers. You could argue that qemu-common.h should not carry one, either. I can accept that. I'm not attached to the comment. I am interested in keeping unwanted #include in headers under control. > [*] A cleanup that would be nice, and I'm about to send out a patchset > that splits out the rtc related functions; but the grab-bag at the > bottom of osdep.h is probably higher priority because that header > gets pulled in by an order of magnitude more C files. By all "normal" ones, in fact.
[PATCH v2 1/4] block_int: make bdrv_backing_overridden static
bdrv_backing_overridden is only used in block.c, so there is no need to leave it in block_int.h Signed-off-by: Emanuele Giuseppe Esposito --- block.c | 4 +++- include/block/block_int.h | 3 --- 2 files changed, 3 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index 0ac5b163d2..10346b5011 100644 --- a/block.c +++ b/block.c @@ -103,6 +103,8 @@ static int bdrv_reopen_prepare(BDRVReopenState *reopen_state, static void bdrv_reopen_commit(BDRVReopenState *reopen_state); static void bdrv_reopen_abort(BDRVReopenState *reopen_state); +static bool bdrv_backing_overridden(BlockDriverState *bs); + /* If non-zero, use only whitelisted block drivers */ static int use_bdrv_whitelist; @@ -7475,7 +7477,7 @@ static bool append_strong_runtime_options(QDict *d, BlockDriverState *bs) /* Note: This function may return false positives; it may return true * even if opening the backing file specified by bs's image header * would result in exactly bs->backing. */ -bool bdrv_backing_overridden(BlockDriverState *bs) +static bool bdrv_backing_overridden(BlockDriverState *bs) { if (bs->backing) { return strcmp(bs->auto_backing_file, diff --git a/include/block/block_int.h b/include/block/block_int.h index f4c75e8ba9..27008cfb22 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -1122,9 +1122,6 @@ BlockDriver *bdrv_probe_all(const uint8_t *buf, int buf_size, void bdrv_parse_filename_strip_prefix(const char *filename, const char *prefix, QDict *options); -bool bdrv_backing_overridden(BlockDriverState *bs); - - /** * bdrv_add_aio_context_notifier: * -- 2.31.1
[PATCH v2 4/4] include/sysemu/blockdev.h: remove drive_get_max_devs
Remove drive_get_max_devs, as it is not used by anyone. Last use was removed in commit 8f2d75e81d5 ("hw: Drop superfluous special checks for orphaned -drive"). Signed-off-by: Emanuele Giuseppe Esposito Reviewed-by: Philippe Mathieu-Daudé --- blockdev.c| 17 - include/sysemu/blockdev.h | 1 - 2 files changed, 18 deletions(-) diff --git a/blockdev.c b/blockdev.c index e5cb3eb95f..0112f488a4 100644 --- a/blockdev.c +++ b/blockdev.c @@ -173,23 +173,6 @@ void blockdev_auto_del(BlockBackend *blk) } } -/** - * Returns the current mapping of how many units per bus - * a particular interface can support. - * - * A positive integer indicates n units per bus. - * 0 implies the mapping has not been established. - * -1 indicates an invalid BlockInterfaceType was given. - */ -int drive_get_max_devs(BlockInterfaceType type) -{ -if (type >= IF_IDE && type < IF_COUNT) { -return if_max_devs[type]; -} - -return -1; -} - static int drive_index_to_bus_id(BlockInterfaceType type, int index) { int max_devs = if_max_devs[type]; diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h index 21ae0b994a..430abdbdb8 100644 --- a/include/sysemu/blockdev.h +++ b/include/sysemu/blockdev.h @@ -49,7 +49,6 @@ DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit); void drive_check_orphaned(void); DriveInfo *drive_get_by_index(BlockInterfaceType type, int index); int drive_get_max_bus(BlockInterfaceType type); -int drive_get_max_devs(BlockInterfaceType type); DriveInfo *drive_get_next(BlockInterfaceType type); DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type, -- 2.31.1
[PATCH v2 2/4] include/sysemu/blockdev.c: introduce block_if_name
Add a getter function for the if_name array, so that also outside functions can access it. Signed-off-by: Emanuele Giuseppe Esposito --- blockdev.c| 5 + include/sysemu/blockdev.h | 1 + 2 files changed, 6 insertions(+) diff --git a/blockdev.c b/blockdev.c index b35072644e..1581f0d2f5 100644 --- a/blockdev.c +++ b/blockdev.c @@ -83,6 +83,11 @@ static const char *const if_name[IF_COUNT] = { [IF_XEN] = "xen", }; +const char *block_if_name(BlockInterfaceType iface) +{ +return if_name[iface]; +} + static int if_max_devs[IF_COUNT] = { /* * Do not change these numbers! They govern how drive option diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h index 32c2d6023c..b321286dee 100644 --- a/include/sysemu/blockdev.h +++ b/include/sysemu/blockdev.h @@ -42,6 +42,7 @@ DriveInfo *blk_legacy_dinfo(BlockBackend *blk); DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo); BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo); +const char *block_if_name(BlockInterfaceType iface); void override_max_devs(BlockInterfaceType type, int max_devs); DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit); -- 2.31.1