Re: [PATCH 5/6] pcie: fast unplug when slot power is off
On Mon, Oct 11, 2021 at 02:05:03PM +0200, Gerd Hoffmann wrote: > In case the slot is powered off (and the power indicator turned off too) > we can unplug right away, without round-trip to the guest. > > Also clear pending attention button press, there is nothing to care > about any more. > > Signed-off-by: Gerd Hoffmann > --- > hw/pci/pcie.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c > index 70fc11ba4c7d..f3ac04399969 100644 > --- a/hw/pci/pcie.c > +++ b/hw/pci/pcie.c > @@ -561,6 +561,16 @@ void pcie_cap_slot_unplug_request_cb(HotplugHandler > *hotplug_dev, > return; > } > > +if (((sltctl & PCI_EXP_SLTCTL_PIC) == PCI_EXP_SLTCTL_PWR_IND_OFF) && > +((sltctl & PCI_EXP_SLTCTL_PCC) == PCI_EXP_SLTCTL_PWR_OFF)) { > +/* slot is powered off -> unplug without round-trip to the guest */ > +pcie_cap_slot_do_unplug(hotplug_pdev); > +hotplug_event_notify(hotplug_pdev); > +pci_word_test_and_clear_mask(exp_cap + PCI_EXP_SLTSTA, > + PCI_EXP_SLTSTA_ABP); Does this handle all the things including link status etc btw? I don't remember off-hand. > +return; > +} > + > pcie_cap_slot_push_attention_button(hotplug_pdev); > } > > -- > 2.31.1
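The fast-unplug condition in the hunk above is a pair of masked comparisons on the Slot Control register. A minimal standalone sketch in Python (the register constants mirror the PCIe SLTCTL layout as found in Linux's pci_regs.h; verify against your tree before relying on them):

```python
# Slot Control register fields (values as in Linux's pci_regs.h):
PCI_EXP_SLTCTL_PIC = 0x0300          # Power Indicator Control mask (bits 9:8)
PCI_EXP_SLTCTL_PWR_IND_OFF = 0x0300  # both bits set -> indicator off
PCI_EXP_SLTCTL_PCC = 0x0400          # Power Controller Control (bit 10)
PCI_EXP_SLTCTL_PWR_OFF = 0x0400      # bit set -> slot power off

def slot_is_powered_off(sltctl: int) -> bool:
    """Mirror of the patch's condition: indicator off AND power off."""
    return ((sltctl & PCI_EXP_SLTCTL_PIC) == PCI_EXP_SLTCTL_PWR_IND_OFF and
            (sltctl & PCI_EXP_SLTCTL_PCC) == PCI_EXP_SLTCTL_PWR_OFF)

# Power off and indicator off -> the fast-unplug path would be taken:
print(slot_is_powered_off(0x0700))
# Power off but indicator blinking (PIC = 0x0200) -> normal round-trip:
print(slot_is_powered_off(0x0600))
```

Both fields must match: a blinking indicator means the guest is mid-operation, so only the fully-off state skips the round-trip.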
Re: [PATCH v4 3/3] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)
Leonardo Bras Soares Passos writes: > Hello Eric, > > On Mon, Oct 11, 2021 at 4:32 PM Eric Blake wrote: >> >> On Sat, Oct 09, 2021 at 04:56:13AM -0300, Leonardo Bras wrote: >> > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel >> > zerocopy interface. >> > >> > Change multifd_send_sync_main() so it can distinguish the last sync from >> > the setup and per-iteration ones, so a flush_zerocopy() can be called >> > at the last sync in order to make sure all RAM is sent before finishing >> > the migration. >> > >> > Also make it return -1 if flush_zerocopy() fails, in order to cancel >> > the migration process, and avoid resuming the guest in the target host >> > without receiving all current RAM. >> > >> > This will work fine on RAM migration because the RAM pages are not usually >> > freed, >> > and there is no problem on changing the pages content between async_send() >> > and >> > the actual sending of the buffer, because this change will dirty the page >> > and >> > cause it to be re-sent on a next iteration anyway. >> > >> > Given a lot of locked memory may be needed in order to use multid migration >> > with zerocopy enabled, make it optional by creating a new parameter >> > multifd-zerocopy on qapi, so low-privileged users can still perform multifd >> > migrations. >> > >> > Signed-off-by: Leonardo Bras >> > --- >> > qapi/migration.json | 18 ++ >> > migration/migration.h | 1 + >> > migration/multifd.h | 2 +- >> > migration/migration.c | 20 >> > migration/multifd.c | 33 - >> > migration/ram.c | 20 +--- >> > monitor/hmp-cmds.c| 4 >> > 7 files changed, 85 insertions(+), 13 deletions(-) >> > >> > diff --git a/qapi/migration.json b/qapi/migration.json >> > index 88f07baedd..c4890cbb54 100644 >> > --- a/qapi/migration.json >> > +++ b/qapi/migration.json >> > @@ -724,6 +724,11 @@ >> > # will consume more CPU. >> > # Defaults to 1. (Since 5.0) >> > # >> > +# @multifd-zerocopy: Controls behavior on sending memory pages on multifd >> > migration. 
>> > +#When true, enables a zerocopy mechanism for sending >> > memory >> > +#pages, if host does support it. >> >> s/does support/supports/ (several instances this patch) > > I will make sure to fix that in v5. > >> >> > +#Defaults to false. (Since 6.2) >> > +# >> > # @block-bitmap-mapping: Maps block nodes and bitmaps on them to >> > #aliases for the purpose of dirty bitmap >> > migration. Such >> > #aliases may for example be the corresponding >> > names on the >> > @@ -758,6 +763,7 @@ >> > 'xbzrle-cache-size', 'max-postcopy-bandwidth', >> > 'max-cpu-throttle', 'multifd-compression', >> > 'multifd-zlib-level' ,'multifd-zstd-level', >> > +'multifd-zerocopy', >> > 'block-bitmap-mapping' ] } >> >> Should this feature be guarded with 'if':'CONFIG_LINUX', since that's >> the only platform where you even compile the code (even then, it can >> still fail if the kernel doesn't support it). > > I think it makes sense for the feature to always be available, even > though it's not supported > outside linux > v4.14. > > IMHO it makes more sense for the user to get an error when it starts > migration, due to host > not supporting zerocopy, than the error happening in the config step, > which may cause the user > to question if it was the right parameter. > > The config message error here could also be ignored, and users can > think zerocopy is working, while it's not. > > For automated migrations, this feature should never be enabled for > non-linux / older linux hosts anyway. > > Is there a good argument I am missing for this feature being disabled > on non-linux? The general argument for having QAPI schema 'if' mirror the C implementation's #if is introspection. Let me explain why that matters. Consider a management application that supports a range of QEMU versions, say 5.0 to 6.2. Say it wants to use an QMP command that is new in QEMU 6.2. The sane way to do that is to probe for the command with query-qmp-schema. Same for command arguments, and anything else QMP. 
If you doubt "sane", check out Part II of "QEMU interface
introspection: From hacks to solutions"[*].

The same technique works when a QMP command / argument / whatever is
compile-time conditional ('if' in the schema). The code the management
application needs anyway to deal with older QEMU now also deals with
"compiled out". Nice.

Of course, a command or argument present in QEMU can still fail, and
the management application still needs to handle failure.
Distinguishing different failure modes can be bothersome and/or
fragile.

By making the QAPI schema conditional mirror the C conditional, you
squash the failure mode "this version of QEMU
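The probing described above might look roughly like the sketch below. The schema fragment is hand-written and illustrative only (a real query-qmp-schema reply is far larger and uses internal type names), and the helper is an invented name, not a real API:

```python
import json

# Illustrative fragment of a query-qmp-schema reply; the shape loosely
# follows the QMP introspection format, heavily simplified.
schema_reply = json.loads("""
[{"meta-type": "object", "name": "MigrationParameters",
  "members": [{"name": "multifd-compression", "type": "MultiFDCompression"},
              {"name": "multifd-zerocopy", "type": "bool"}]}]
""")

def has_member(schema, type_name, member):
    """True if the named object type exposes the member in this build."""
    for entity in schema:
        if entity.get("meta-type") == "object" and entity["name"] == type_name:
            return any(m["name"] == member for m in entity["members"])
    return False

# A management app probes once, before ever attempting to set the flag:
print(has_member(schema_reply, "MigrationParameters", "multifd-zerocopy"))
```

With an 'if' in the schema, a build with the feature compiled out simply omits the member, and the probe returns False with no error-message parsing needed.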
Re: [PATCH 6/6] pcie: expire pending delete
On Tue, Oct 12, 2021 at 07:30:34AM +0200, Gerd Hoffmann wrote: > > > index f3ac04399969..477c8776aa27 100644 > > > --- a/hw/pci/pcie.c > > > +++ b/hw/pci/pcie.c > > > @@ -549,6 +549,8 @@ void pcie_cap_slot_unplug_request_cb(HotplugHandler > > > *hotplug_dev, > > > } > > > > > > dev->pending_deleted_event = true; > > > +dev->pending_deleted_expires_ms = > > > +qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 5000; /* 5 secs */ > > > > > > /* In case user cancel the operation of multi-function hot-add, > > > * remove the function that is unexposed to guest individually, > > > > > > Well this will be barely enough, right? > > > > Once the Power > > Indicator begins blinking, a 5-second abort interval exists during > > which a second depression of the > > Attention Button cancels the operation. > > Well, canceling the hot-plug is not supported in qemu right now (there > is no qmp command for that). I'm also not sure it makes sense in the > first place for virtual machines. Yes. However if you resend an attention button press within the 5 second window, guest will think you cancelled hot-plug and act accordingly. It's a fundamentally racy algorithm :( > > So I guess it needs to be more. Problem is of course if guest is > > busy because of interrupts and whatnot, it might not get to > > handling that in time ... > > See patch #3, that one should take care of a busy guest ... > > take care, > Gerd
Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> Hello,

Hi

Sorry for top-posting, just wanted to provide a quick update. We are
currently working on the support for ioregionfd in QEMU and will be
posting the patches soon. The KVM patches will also be posted, based on
the RFC v3 with some fixes, if there are no objections from Elena's
side, who originally posted the KVM RFC patchset.

Thanks!
Elena Ufimtseva

> I'm an Outreachy intern with QEMU and I'm working on implementing the
> ioregionfd API in KVM. So I'd like to resume the ioregionfd design
> discussion. The latest version of the ioregionfd API document is
> provided below.
>
> Overview
>
> ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses
> over a file descriptor without returning from ioctl(KVM_RUN). This
> allows device emulation to run in another task separate from the vCPU
> task.
>
> This is achieved through KVM ioctls for registering MMIO/PIO regions
> and a wire protocol that KVM uses to communicate with a task handling
> an MMIO/PIO access.
>
> The traditional ioctl(KVM_RUN) dispatch mechanism with device
> emulation in a separate task looks like this:
>
>    kvm.ko <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task
>
> ioregionfd improves performance by eliminating the need for the vCPU
> task to forward MMIO/PIO exits to device emulation tasks:
>
>    kvm.ko <---ioctl(KVM_RUN)---> VMM vCPU task
>        ^
>        `---ioregionfd---> device task
>
> Both multi-threaded and multi-process VMMs can take advantage of
> ioregionfd to run device emulation in dedicated threads and processes,
> respectively.
>
> This mechanism is similar to ioeventfd except it supports all read and
> write accesses, whereas ioeventfd only supports posted doorbell writes.
>
> Traditional ioctl(KVM_RUN) dispatch and ioeventfd continue to work
> alongside the new mechanism, but only one mechanism handles an
> MMIO/PIO access.
> > KVM_CREATE_IOREGIONFD > - > :Capability: KVM_CAP_IOREGIONFD > :Architectures: all > :Type: system ioctl > :Parameters: none > :Returns: an ioregionfd file descriptor, -1 on error > > This ioctl creates a new ioregionfd and returns the file descriptor. The fd > can > be used to handle MMIO/PIO accesses instead of returning from ioctl(KVM_RUN) > with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must be > registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses on the > fd. An ioregionfd can be used with multiple VMs and its lifecycle is not tied > to a specific VM. > > When the last file descriptor for an ioregionfd is closed, all regions > registered with KVM_SET_IOREGION are dropped and guest accesses to those > regions cause ioctl(KVM_RUN) to return again. > > KVM_SET_IOREGION > > :Capability: KVM_CAP_IOREGIONFD > :Architectures: all > :Type: vm ioctl > :Parameters: struct kvm_ioregion (in) > :Returns: 0 on success, -1 on error > > This ioctl adds, modifies, or removes an ioregionfd MMIO or PIO region. Guest > read and write accesses are dispatched through the given ioregionfd instead of > returning from ioctl(KVM_RUN). > > :: > > struct kvm_ioregion { > __u64 guest_paddr; /* guest physical address */ > __u64 memory_size; /* bytes */ > __u64 user_data; > __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */ > __u32 flags; > __u8 pad[32]; > }; > > /* for kvm_ioregion::flags */ > #define KVM_IOREGION_PIO (1u << 0) > #define KVM_IOREGION_POSTED_WRITES (1u << 1) > > If a new region would split an existing region -1 is returned and errno is > EINVAL. > > Regions can be deleted by setting fd to -1. If no existing region matches > guest_paddr and memory_size then -1 is returned and errno is ENOENT. > > Existing regions can be modified as long as guest_paddr and memory_size > match an existing region. > > MMIO is the default. The KVM_IOREGION_PIO flag selects PIO instead. 
> > The user_data value is included in messages KVM writes to the ioregionfd upon > guest access. KVM does not interpret user_data. > > Both read and write guest accesses wait for a response before entering the > guest again. The KVM_IOREGION_POSTED_WRITES flag does not wait for a response > and immediately enters the guest again. This is suitable for accesses that do > not require synchronous emulation, such as posted doorbell register writes. > Note that guest writes may block the vCPU despite KVM_IOREGION_POSTED_WRITES > if > the device is too slow in reading from the ioregionfd. > > Wire protocol > - > The protocol spoken over the file descriptor is as follows. The device reads > commands from the file descriptor with the following layout:: > > struct ioregionfd_cmd { > __u32 info; > __u32 padding; > __u64 user_data; > __u64 offset; > __u64 data; > }; > > The info field layout is as follows:: > > bits: | 31 ... 8
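The command layout quoted above (the info bit layout is truncated in the quote) can be exercised with a quick sketch. The struct packing assumes a little-endian host, and parse_cmd is a hypothetical helper standing in for what a device task would do after read()ing the fd:

```python
import struct

# struct ioregionfd_cmd from the proposal:
#   __u32 info; __u32 padding; __u64 user_data; __u64 offset; __u64 data;
# "<" pins little-endian for the sketch; a real consumer would use the
# host's native layout.
CMD_FMT = struct.Struct("<IIQQQ")

def parse_cmd(buf: bytes) -> dict:
    """Unpack one wire command into its named fields."""
    info, _pad, user_data, offset, data = CMD_FMT.unpack(buf)
    return {"info": info, "user_data": user_data,
            "offset": offset, "data": data}

# Round-trip a made-up command the way a device task would see it:
raw = CMD_FMT.pack(0x1, 0, 0xDEADBEEF, 0x10, 0xAB)
cmd = parse_cmd(raw)
print(CMD_FMT.size, hex(cmd["offset"]), hex(cmd["data"]))
```

The fixed 32-byte size means the device task can read commands off the fd in whole-struct units without framing logic.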
Re: [PATCH 6/6] pcie: expire pending delete
> > index f3ac04399969..477c8776aa27 100644 > > --- a/hw/pci/pcie.c > > +++ b/hw/pci/pcie.c > > @@ -549,6 +549,8 @@ void pcie_cap_slot_unplug_request_cb(HotplugHandler > > *hotplug_dev, > > } > > > > dev->pending_deleted_event = true; > > +dev->pending_deleted_expires_ms = > > +qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 5000; /* 5 secs */ > > > > /* In case user cancel the operation of multi-function hot-add, > > * remove the function that is unexposed to guest individually, > > > Well this will be barely enough, right? > > Once the Power > Indicator begins blinking, a 5-second abort interval exists during > which a second depression of the > Attention Button cancels the operation. Well, canceling the hot-plug is not supported in qemu right now (there is no qmp command for that). I'm also not sure it makes sense in the first place for virtual machines. > So I guess it needs to be more. Problem is of course if guest is > busy because of interrupts and whatnot, it might not get to > handling that in time ... See patch #3, that one should take care of a busy guest ... take care, Gerd
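The expiry logic the patch adds amounts to stamping the unplug request with a deadline and consulting it later. A hedged sketch with a fake millisecond clock (the two dict keys follow the field names in the patch; the helper functions are invented for illustration):

```python
EXPIRE_MS = 5000  # the 5-second window from the patch

def unplug_request(dev: dict, now_ms: int) -> None:
    """Record a pending delete that expires EXPIRE_MS after the request."""
    dev["pending_deleted_event"] = True
    dev["pending_deleted_expires_ms"] = now_ms + EXPIRE_MS

def is_pending(dev: dict, now_ms: int) -> bool:
    """Still inside the window? If not, a repeated request can retry."""
    return (dev.get("pending_deleted_event", False) and
            now_ms < dev["pending_deleted_expires_ms"])

dev = {}
unplug_request(dev, now_ms=1000)
print(is_pending(dev, 3000))   # still within the window
print(is_pending(dev, 6001))   # window expired
```

As the thread notes, the 5-second value interacts badly with the spec's 5-second attention-button abort interval, so the constant is a judgment call rather than a hard requirement.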
Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
Eugenio Pérez writes:

> Initial version of shadow virtqueue that actually forwards buffers.
> There is no IOMMU support at the moment, and that will be addressed
> in future patches of this series. Since all vhost-vdpa devices use a
> forced IOMMU, this means that SVQ is not usable at this point of the
> series on any device.
>
> For simplicity it only supports modern devices, which expect the
> vring in little endian, with a split ring and no event idx or
> indirect descriptors. Support for them will not be added in this
> series.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review. Later commits add simpler
> ones.
>
> SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> retrieve its status (next available idx the device was going to
> consume) race-free. It can later reset the device to replace vring
> addresses etc. When SVQ starts, qemu can resume consuming the guest's
> driver ring from that state, without notice from the latter.
>
> This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> commits.
>
> Removal of the _S_DEVICE_STOPPED bit (in other words, resuming the
> device) can be done in the future if a use case arises. At this
> moment we can just rely on resetting the full device.
> > Signed-off-by: Eugenio Pérez > --- > qapi/net.json | 2 +- > hw/virtio/vhost-shadow-virtqueue.c | 237 - > hw/virtio/vhost-vdpa.c | 109 - > 3 files changed, 337 insertions(+), 11 deletions(-) > > diff --git a/qapi/net.json b/qapi/net.json > index fe546b0e7c..1f4a55f2c5 100644 > --- a/qapi/net.json > +++ b/qapi/net.json > @@ -86,7 +86,7 @@ > # > # @name: the device name of the VirtIO device > # > -# @enable: true to use the alternate shadow VQ notifications > +# @enable: true to use the alternate shadow VQ buffers fowarding path Uh, why does the flag change meaning half-way through this series? > # > # Returns: Error if failure, or 'no error' for success. > # [...]
Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
Eugenio Pérez writes: > Shadow virtqueue notifications forwarding is disabled when vhost_dev > stops, so code flow follows usual cleanup. > > Also, host notifiers must be disabled at SVQ start, and they will not > start if SVQ has been enabled when device is stopped. This is trivial > to address, but it is left out for simplicity at this moment. > > Signed-off-by: Eugenio Pérez > --- > qapi/net.json | 2 +- > hw/virtio/vhost-shadow-virtqueue.h | 8 ++ > include/hw/virtio/vhost-vdpa.h | 4 + > hw/virtio/vhost-shadow-virtqueue.c | 138 - > hw/virtio/vhost-vdpa.c | 116 +++- > 5 files changed, 264 insertions(+), 4 deletions(-) > > diff --git a/qapi/net.json b/qapi/net.json > index a2c30fd455..fe546b0e7c 100644 > --- a/qapi/net.json > +++ b/qapi/net.json > @@ -88,7 +88,7 @@ > # > # @enable: true to use the alternate shadow VQ notifications > # > -# Returns: Always error, since SVQ is not implemented at the moment. > +# Returns: Error if failure, or 'no error' for success. Delete the whole line, please. > # > # Since: 6.2 > # [...]
Re: [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp
Eugenio Pérez writes: > Command to enable shadow virtqueue. > > Signed-off-by: Eugenio Pérez > --- > qapi/net.json | 23 +++ > hw/virtio/vhost-vdpa.c | 8 > 2 files changed, 31 insertions(+) > > diff --git a/qapi/net.json b/qapi/net.json > index 7fab2e7cd8..a2c30fd455 100644 > --- a/qapi/net.json > +++ b/qapi/net.json > @@ -79,6 +79,29 @@ > { 'command': 'netdev_del', 'data': {'id': 'str'}, >'allow-preconfig': true } > > +## > +# @x-vhost-enable-shadow-vq: > +# > +# Use vhost shadow virtqueue. > +# > +# @name: the device name of the VirtIO device Is this a qdev ID? A network client name? > +# > +# @enable: true to use the alternate shadow VQ notifications > +# > +# Returns: Always error, since SVQ is not implemented at the moment. > +# > +# Since: 6.2 > +# > +# Example: > +# > +# -> { "execute": "x-vhost-enable-shadow-vq", > +# "arguments": { "name": "virtio-net", "enable": false } } > +# > +## > +{ 'command': 'x-vhost-enable-shadow-vq', > + 'data': {'name': 'str', 'enable': 'bool'}, > + 'if': 'defined(CONFIG_VHOST_KERNEL)' } > + Adding an command just for controlling a flag in some object is fine for quick experiments. As a permanent interface, it's problematic: one command per flag would result in way too many commands. Better: one command to control a set of related properties. I hesitate to suggest qom-set, because qom-set is not introspectable. Recurring complaint about QOM: poor integration with QAPI/QMP. Naming nitpick: since the command can both enable and disable, I'd call it -set-vq instead of -enable-vq. 
> ## > # @NetLegacyNicOptions: > # > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index 4fa414feea..c63e311d7c 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -23,6 +23,8 @@ > #include "cpu.h" > #include "trace.h" > #include "qemu-common.h" > +#include "qapi/qapi-commands-net.h" > +#include "qapi/error.h" > > static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > { > @@ -656,6 +658,12 @@ static bool vhost_vdpa_force_iommu(struct vhost_dev > *dev) > return true; > } > > + > +void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error > **errp) > +{ > +error_setg(errp, "Shadow virtqueue still not implemented"); > +} > + > const VhostOps vdpa_ops = { > .backend_type = VHOST_BACKEND_TYPE_VDPA, > .vhost_backend_init = vhost_vdpa_init,
Re: [PATCH] Trim some trailing space from human-readable output
Philippe Mathieu-Daudé writes: > On 10/9/21 17:24, Markus Armbruster wrote: >> I noticed -cpu help printing enough trailing spaces to make the output >> at least 84 characters wide. Looks ugly unless the terminal is wider. >> Ugly or not, trailing spaces are stupid. >> >> The culprit is this line in x86_cpu_list_entry(): >> >> qemu_printf("x86 %-20s %-58s\n", name, desc); >> >> This prints a string with minimum field left-justified right before a >> newline. Change it to >> >> qemu_printf("x86 %-20s %s\n", name, desc); >> >> which avoids the trailing spaces and is simpler to boot. >> >> A search for the pattern with "git-grep -E '%-[0-9]+s\\n'" found a few >> more instances. Change them similarly. >> >> Signed-off-by: Markus Armbruster >> --- >> monitor/hmp-cmds.c | 2 +- >> target/i386/cpu-dump.c | 4 ++-- >> target/i386/cpu.c | 2 +- >> target/ppc/cpu_init.c | 2 +- >> target/s390x/cpu_models.c | 4 ++-- >> target/xtensa/mmu_helper.c | 2 +- >> 6 files changed, 8 insertions(+), 8 deletions(-) > > Nitpicking, do you mind prefixing the patch subject with 'monitor:'? You're right, the code I patch is called from HMP commands, and probably not from anywhere else. > Otherwise: > Reviewed-by: Philippe Mathieu-Daudé Thanks!
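For illustration, the effect of dropping the field width before the newline can be reproduced in a few lines (Python %-formatting standing in for qemu_printf; the CPU name and description are made-up sample data):

```python
name, desc = "qemu64", "QEMU Virtual CPU version 2.5+"

before = "x86 %-20s %-58s" % (name, desc)  # pads desc out to 58 columns
after  = "x86 %-20s %s"   % (name, desc)   # no trailing padding

print(repr(before[-5:]))       # the trailing spaces the patch removes
print(len(before), len(after))
```

Since the padded field sits right before a newline, the only difference between the two is invisible trailing whitespace, which is exactly why the simpler format is preferable.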
[PATCH] tests/tcg/xtensa: allow testing big-endian cores
Don't disable all big-endian tests, instead check whether $(CORE) is supported by the configured $(QEMU) and enable tests if it is. Signed-off-by: Max Filippov --- MAINTAINERS| 1 + tests/tcg/xtensa/Makefile.softmmu-target | 4 ++-- tests/tcg/xtensaeb/Makefile.softmmu-target | 5 + 3 files changed, 8 insertions(+), 2 deletions(-) create mode 100644 tests/tcg/xtensaeb/Makefile.softmmu-target diff --git a/MAINTAINERS b/MAINTAINERS index 50435b8d2f50..8b5ed46a5f1c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -343,6 +343,7 @@ S: Maintained F: target/xtensa/ F: hw/xtensa/ F: tests/tcg/xtensa/ +F: tests/tcg/xtensaeb/ F: disas/xtensa.c F: include/hw/xtensa/xtensa-isa.h F: configs/devices/xtensa*/default.mak diff --git a/tests/tcg/xtensa/Makefile.softmmu-target b/tests/tcg/xtensa/Makefile.softmmu-target index 9530cac2ad95..f1cf2a6496d2 100644 --- a/tests/tcg/xtensa/Makefile.softmmu-target +++ b/tests/tcg/xtensa/Makefile.softmmu-target @@ -2,7 +2,8 @@ # Xtensa softmmu tests # -ifneq ($(TARGET_WORDS_BIGENDIAN),y) +CORE=dc232b +ifneq ($(shell $(QEMU) -cpu help | grep -w $(CORE)),) XTENSA_SRC = $(SRC_PATH)/tests/tcg/xtensa XTENSA_ALL = $(filter-out $(XTENSA_SRC)/linker.ld.S,$(wildcard $(XTENSA_SRC)/*.S)) @@ -15,7 +16,6 @@ XTENSA_USABLE_TESTS = $(filter-out $(XTENSA_BROKEN_TESTS), $(XTENSA_TESTS)) TESTS += $(XTENSA_USABLE_TESTS) VPATH += $(XTENSA_SRC) -CORE=dc232b QEMU_OPTS+=-M sim -cpu $(CORE) -nographic -semihosting -icount 6 $(EXTFLAGS) -kernel INCLUDE_DIRS = $(SRC_PATH)/target/xtensa/core-$(CORE) diff --git a/tests/tcg/xtensaeb/Makefile.softmmu-target b/tests/tcg/xtensaeb/Makefile.softmmu-target new file mode 100644 index ..4204a96d53c0 --- /dev/null +++ b/tests/tcg/xtensaeb/Makefile.softmmu-target @@ -0,0 +1,5 @@ +# +# Xtensa softmmu tests +# + +include $(SRC_PATH)/tests/tcg/xtensa/Makefile.softmmu-target -- 2.20.1
Re: [RFC PATCH v4 00/20] vDPA shadow virtqueue
On Tue, Oct 12, 2021 at 11:59 AM Jason Wang wrote: > > > 在 2021/10/1 下午3:05, Eugenio Pérez 写道: > > This series enable shadow virtqueue (SVQ) for vhost-vdpa devices. This > > is intended as a new method of tracking the memory the devices touch > > during a migration process: Instead of relay on vhost device's dirty > > logging capability, SVQ intercepts the VQ dataplane forwarding the > > descriptors between VM and device. This way qemu is the effective > > writer of guests memory, like in qemu's virtio device operation. > > > > When SVQ is enabled qemu offers a new vring to the device to read > > and write into, and also intercepts kicks and calls between the device > > and the guest. Used buffers relay would cause dirty memory being > > tracked, but at this RFC SVQ is not enabled on migration automatically. > > > > It is based on the ideas of DPDK SW assisted LM, in the series of > > DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does > > not map the shadow vq in guest's VA, but in qemu's. > > > > For qemu to use shadow virtqueues the guest virtio driver must not use > > features like event_idx or indirect descriptors. These limitations will > > be addressed in later series, but they are left out for simplicity at > > the moment. > > > > SVQ needs to be enabled with QMP command: > > > > { "execute": "x-vhost-enable-shadow-vq", > >"arguments": { "name": "dev0", "enable": true } } > > > > This series includes some patches to delete in the final version that > > helps with its testing. The first two of the series freely implements > > the feature to stop the device and be able to retrieve its status. It's > > intended to be used with vp_vpda driver in a nested environment. This > > driver also need modifications to forward the new status bit. > > > > Patches 2-8 prepares the SVQ and QMP command to support guest to host > > notifications forwarding. 
If the SVQ is enabled with these ones > > applied and the device supports it, that part can be tested in > > isolation (for example, with networking), hopping through SVQ. > > > > Same thing is true with patches 9-13, but with device to guest > > notifications. > > > > The rest of the patches implements the actual buffer forwarding. > > > > Comments are welcome. > > > Hi Eugenio: > > > It would be helpful to have a public git repo for us to ease the review. > > Thanks > Btw, we also need to measure the performance impact of the shadow virtqueue. Thanks
Re: [RFC PATCH v4 00/20] vDPA shadow virtqueue
On 2021/10/1 3:05 PM, Eugenio Pérez wrote:

This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. This
is intended as a new method of tracking the memory the devices touch
during a migration process: instead of relying on the vhost device's
dirty logging capability, SVQ intercepts the VQ dataplane, forwarding
the descriptors between VM and device. This way qemu is the effective
writer of the guest's memory, like in qemu's virtio device operation.

When SVQ is enabled qemu offers a new vring to the device to read and
write into, and also intercepts kicks and calls between the device and
the guest. Used buffer relay would cause dirty memory to be tracked,
but at this RFC stage SVQ is not enabled on migration automatically.

It is based on the ideas of DPDK SW assisted LM, in the series of
DPDK's https://patchwork.dpdk.org/cover/48370/ . However, this does
not map the shadow vq in the guest's VA, but in qemu's.

For qemu to use shadow virtqueues the guest virtio driver must not use
features like event_idx or indirect descriptors. These limitations
will be addressed in later series, but they are left out for
simplicity at the moment.

SVQ needs to be enabled with the QMP command:

{ "execute": "x-vhost-enable-shadow-vq",
  "arguments": { "name": "dev0", "enable": true } }

This series includes some patches, to be deleted in the final version,
that help with its testing. The first two of the series freely
implement the feature to stop the device and be able to retrieve its
status. It's intended to be used with the vp_vdpa driver in a nested
environment. This driver also needs modifications to forward the new
status bit.

Patches 2-8 prepare the SVQ and QMP command to support guest-to-host
notifications forwarding. If the SVQ is enabled with these ones
applied and the device supports it, that part can be tested in
isolation (for example, with networking), hopping through SVQ.

The same is true of patches 9-13, but with device-to-guest
notifications.
The rest of the patches implement the actual buffer forwarding.

Comments are welcome.

Hi Eugenio:

It would be helpful to have a public git repo for us to ease the review.

Thanks

TODO:
* Event, indirect, packed, and other features of virtio - waiting for
  confirmation of the big picture.
* Use the already available iova tree to track mappings.
* Separate buffer forwarding into its own AIO context, so we can throw
  more threads at that task and we don't need to stop the main event
  loop.
* Unmap iommu memory. Now the tree can only grow from SVQ enable, but
  it should be fine as long as not a lot of memory is added to the
  guest.
* Rebase on top of latest qemu (and, hopefully, on top of multiqueue
  vdpa).
* Some assertions need to become appropriate error handling paths.
* Proper documentation.

Changes from v3 RFC:
* Move everything to the vhost-vdpa backend. A big change; this
  allowed some cleanup, but more code has been added in other places.
* More use of glib utilities, especially to manage memory.
v3 link: https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html

Changes from v2 RFC:
* Adding vhost-vdpa devices support
* Fixed some memory leaks pointed out by different comments
v2 link: https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html

Changes from v1 RFC:
* Use QMP instead of migration to start SVQ mode.
* Only accepting IOMMU devices, closer behavior with target devices
  (vDPA)
* Fix invalid masking/unmasking of vhost call fd.
* Use of proper methods for synchronization.
* No need to modify VirtIO device code; all of the changes are
  contained in vhost code.
* Delete superfluous code.
* An intermediate RFC was sent with only the notifications forwarding
  changes.
It can be seen in
https://patchew.org/QEMU/20210129205415.876290-1-epere...@redhat.com/

v1 link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html

Eugenio Pérez (20):
  virtio: Add VIRTIO_F_QUEUE_STATE
  virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  virtio: Add virtio_queue_is_host_notifier_enabled
  vhost: Make vhost_virtqueue_{start,stop} public
  vhost: Add x-vhost-enable-shadow-vq qmp
  vhost: Add VhostShadowVirtqueue
  vdpa: Register vdpa devices in a list
  vhost: Route guest->host notification through shadow virtqueue
  Add vhost_svq_get_svq_call_notifier
  Add vhost_svq_set_guest_call_notifier
  vdpa: Save call_fd in vhost-vdpa
  vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
  vhost: Route host->guest notification through shadow virtqueue
  virtio: Add vhost_shadow_vq_get_vring_addr
  vdpa: Save host and guest features
  vhost: Add vhost_svq_valid_device_features to shadow vq
  vhost: Shadow virtqueue buffers forwarding
  vhost: Add VhostIOVATree
  vhost: Use a tree to store memory mappings
  vdpa: Add custom IOTLB translations to SVQ
Re: [PATCH v2 3/3] vdpa: Check for iova range at mappings changes
On Tue, Oct 5, 2021 at 9:49 PM Eugenio Pérez wrote: > > Check vdpa device range before updating memory regions so we don't add > any outside of it, and report the invalid change if any. > > Signed-off-by: Eugenio Pérez > --- > include/hw/virtio/vhost-vdpa.h | 2 + > hw/virtio/vhost-vdpa.c | 68 ++ > hw/virtio/trace-events | 1 + > 3 files changed, 55 insertions(+), 16 deletions(-) > > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h > index a8963da2d9..c288cf7ecb 100644 > --- a/include/hw/virtio/vhost-vdpa.h > +++ b/include/hw/virtio/vhost-vdpa.h > @@ -13,6 +13,7 @@ > #define HW_VIRTIO_VHOST_VDPA_H > > #include "hw/virtio/virtio.h" > +#include "standard-headers/linux/vhost_types.h" > > typedef struct VhostVDPAHostNotifier { > MemoryRegion mr; > @@ -24,6 +25,7 @@ typedef struct vhost_vdpa { > uint32_t msg_type; > bool iotlb_batch_begin_sent; > MemoryListener listener; > +struct vhost_vdpa_iova_range iova_range; > struct vhost_dev *dev; > VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX]; > } VhostVDPA; > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index be7c63b4ba..6654287050 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -37,20 +37,34 @@ static Int128 vhost_vdpa_section_end(const > MemoryRegionSection *section) > return llend; > } > > -static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > -{ > -return (!memory_region_is_ram(section->mr) && > -!memory_region_is_iommu(section->mr)) || > -memory_region_is_protected(section->mr) || > - /* vhost-vDPA doesn't allow MMIO to be mapped */ > -memory_region_is_ram_device(section->mr) || > - /* > -* Sizing an enabled 64-bit BAR can cause spurious mappings to > -* addresses in the upper part of the 64-bit address space. These > -* are never accessed by the CPU and beyond the address width of > -* some IOMMU hardware. TODO: VDPA should tell us the IOMMU > width. 
> -*/ > - section->offset_within_address_space & (1ULL << 63); > +static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section, > +uint64_t iova_min, > +uint64_t iova_max) > +{ > +Int128 llend; > + > +if ((!memory_region_is_ram(section->mr) && > + !memory_region_is_iommu(section->mr)) || > +memory_region_is_protected(section->mr) || > +/* vhost-vDPA doesn't allow MMIO to be mapped */ > +memory_region_is_ram_device(section->mr)) { > +return true; > +} > + > +if (section->offset_within_address_space < iova_min) { > +error_report("RAM section out of device range (min=%lu, addr=%lu)", > + iova_min, section->offset_within_address_space); > +return true; > +} > + > +llend = vhost_vdpa_section_end(section); > +if (int128_gt(llend, int128_make64(iova_max))) { > +error_report("RAM section out of device range (max=%lu, end > addr=%lu)", > + iova_max, int128_get64(llend)); > +return true; > +} > + > +return false; > } > > static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size, > @@ -162,7 +176,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener > *listener, > void *vaddr; > int ret; > > -if (vhost_vdpa_listener_skipped_section(section)) { > +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first, > +v->iova_range.last)) { > return; > } > > @@ -220,7 +235,8 @@ static void vhost_vdpa_listener_region_del(MemoryListener > *listener, > Int128 llend, llsize; > int ret; > > -if (vhost_vdpa_listener_skipped_section(section)) { > +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first, > +v->iova_range.last)) { > return; > } > > @@ -288,9 +304,24 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, > uint8_t status) > vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, ); > } > > +static int vhost_vdpa_get_iova_range(struct vhost_vdpa *v) > +{ > +int ret; > + > +ret = vhost_vdpa_call(v->dev, VHOST_VDPA_GET_IOVA_RANGE, >iova_range); > +if (ret != 0) { > +return ret; > +} I think we need a fallback for the 
kernel that does not support VHOST_VDPA_GET_IOVA_RANGE? Thanks > + > +trace_vhost_vdpa_get_iova_range(v->dev, v->iova_range.first, > +v->iova_range.last); > +return ret; > +} > + > static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp) > { > struct vhost_vdpa *v; > +int r; > assert(dev->vhost_ops->backend_type ==
Re: [PATCH v2 2/3] vdpa: Add vhost_vdpa_section_end
On Tue, Oct 5, 2021 at 9:49 PM Eugenio Pérez wrote: > > Abstract this operation, that will be reused when validating the region > against the iova range that the device supports. > > Signed-off-by: Eugenio Pérez Acked-by: Jason Wang > --- > hw/virtio/vhost-vdpa.c | 22 +++--- > 1 file changed, 15 insertions(+), 7 deletions(-) > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index ea1aa71ad8..be7c63b4ba 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -24,6 +24,19 @@ > #include "trace.h" > #include "qemu-common.h" > > +/* > + * Return one past the end of the end of section. Be careful with uint64_t > + * conversions! > + */ > +static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section) > +{ > +Int128 llend = int128_make64(section->offset_within_address_space); > +llend = int128_add(llend, section->size); > +llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK)); > + > +return llend; > +} > + > static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > { > return (!memory_region_is_ram(section->mr) && > @@ -160,10 +173,7 @@ static void > vhost_vdpa_listener_region_add(MemoryListener *listener, > } > > iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); > -llend = int128_make64(section->offset_within_address_space); > -llend = int128_add(llend, section->size); > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK)); > - > +llend = vhost_vdpa_section_end(section); > if (int128_ge(int128_make64(iova), llend)) { > return; > } > @@ -221,9 +231,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener > *listener, > } > > iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); > -llend = int128_make64(section->offset_within_address_space); > -llend = int128_add(llend, section->size); > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK)); > +llend = vhost_vdpa_section_end(section); > > trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend)); > > 
-- > 2.27.0 >
Re: [PATCH v2 1/3] vdpa: Skip protected ram IOMMU mappings
On Tue, Oct 5, 2021 at 9:49 PM Eugenio Pérez wrote: > > Following the logic of commit 56918a126ae ("memory: Add RAM_PROTECTED > flag to skip IOMMU mappings") with VFIO, skip memory sections > inaccessible via normal mechanisms, including DMA. > > Signed-off-by: Eugenio Pérez Acked-by: Jason Wang > --- > hw/virtio/vhost-vdpa.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index 47d7a5a23d..ea1aa71ad8 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -28,6 +28,7 @@ static bool > vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > { > return (!memory_region_is_ram(section->mr) && > !memory_region_is_iommu(section->mr)) || > +memory_region_is_protected(section->mr) || > /* vhost-vDPA doesn't allow MMIO to be mapped */ > memory_region_is_ram_device(section->mr) || > /* > -- > 2.27.0 >
[PATCH v2] hw/riscv: virt: bugfix the memory-backend-file command is invalid
From: Mingwang Li When I start the VM with the following command: $ ./qemu-system-riscv64 -M virt,accel=kvm -m 4096M -cpu host -nographic \ -name guest=riscv-guset \ -smp 2 \ -bios none \ -kernel ./Image \ -drive file=./guest.img,format=raw,id=hd0 \ -device virtio-blk-device,drive=hd0 \ -append "root=/dev/vda rw console=ttyS0 earlycon=sbi" \ -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on \ -numa node,memdev=mem -mem-prealloc \ -chardev socket,id=char0,path=/mnt/vhost-net0 \ -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ -device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,mrg_rxbuf=on,csum=on,guest_csum=on,guest_ecn=on \ Then, QEMU displays the following error information: qemu-system-riscv64: Failed initializing vhost-user memory map, consider using -object memory-backend-file share=on qemu-system-riscv64: vhost_set_mem_table failed: Interrupted system call (4) qemu-system-riscv64: unable to start vhost net: 4: falling back on userspace virtio Note that, before starting the KVM-accelerated QEMU VM, the following temporarily unaccepted QEMU patches should be used: https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg02516.html This error occurs because only the default main_mem is registered as the system memory, so other memory cannot be initialized. Therefore, the system memory should be initialized to the machine->ram, which consists of the default main_mem and other possible memory required by applications, such as shared hugepage memory in DPDK. Also, the mc->default_ram_id should be set to the default main_mem, such as "riscv_virt_board.ram" for the virt machine.
Signed-off-by: Mingwang Li Signed-off-by: Yifei Jiang Reviewed-by: Alistair Francis --- hw/riscv/virt.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c index ec0cb69b8c..b3b431c847 100644 --- a/hw/riscv/virt.c +++ b/hw/riscv/virt.c @@ -771,7 +771,6 @@ static void virt_machine_init(MachineState *machine) const MemMapEntry *memmap = virt_memmap; RISCVVirtState *s = RISCV_VIRT_MACHINE(machine); MemoryRegion *system_memory = get_system_memory(); -MemoryRegion *main_mem = g_new(MemoryRegion, 1); MemoryRegion *mask_rom = g_new(MemoryRegion, 1); char *plic_hart_config, *soc_name; target_ulong start_addr = memmap[VIRT_DRAM].base; @@ -890,10 +889,8 @@ static void virt_machine_init(MachineState *machine) } /* register system main memory (actual RAM) */ -memory_region_init_ram(main_mem, NULL, "riscv_virt_board.ram", - machine->ram_size, _fatal); memory_region_add_subregion(system_memory, memmap[VIRT_DRAM].base, -main_mem); +machine->ram); /* create device tree */ create_fdt(s, memmap, machine->ram_size, machine->kernel_cmdline, @@ -1032,6 +1029,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data) mc->cpu_index_to_instance_props = riscv_numa_cpu_index_to_props; mc->get_default_cpu_node_id = riscv_numa_get_default_cpu_node_id; mc->numa_mem_supported = true; +mc->default_ram_id = "riscv_virt_board.ram"; machine_class_allow_dynamic_sysbus_dev(mc, TYPE_RAMFB_DEVICE); -- 2.19.1
Re: [PATCH 02/13] target/riscv: Create RISCVMXL enumeration
On Fri, Oct 8, 2021 at 3:47 AM Richard Henderson wrote: > > Move the MXL_RV* defines to enumerators. > > Signed-off-by: Richard Henderson Reviewed-by: Alistair Francis Alistair > --- > target/riscv/cpu_bits.h | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h > index 999187a9ee..e248c6bf6d 100644 > --- a/target/riscv/cpu_bits.h > +++ b/target/riscv/cpu_bits.h > @@ -364,9 +364,11 @@ > #define MISA32_MXL 0xC000 > #define MISA64_MXL 0xC000ULL > > -#define MXL_RV321 > -#define MXL_RV642 > -#define MXL_RV128 3 > +typedef enum { > +MXL_RV32 = 1, > +MXL_RV64 = 2, > +MXL_RV128 = 3, > +} RISCVMXL; > > /* sstatus CSR bits */ > #define SSTATUS_UIE 0x0001 > -- > 2.25.1 > >
Re: [PATCH] hw/riscv: virt: bugfix the memory-backend-file command is invalid
On Thu, Sep 30, 2021 at 12:03 AM MingWang Li wrote: > > From: Mingwang Li > > If only the default main_mem is registered as the system memory, > other memory cannot be initialized. Therefore, the system memory > should be initialized to the machine->ram, which consists of the > default main_mem and other possible memory required by applications, > such as shared hugepage memory in DPDK. > Also, the mc->default_ram_id should be set to the default main_mem, > which is named "riscv_virt_board.ram". > > Signed-off-by: Mingwang Li > Signed-off-by: Yifei Jiang Reviewed-by: Alistair Francis Alistair
Re: [PATCH] target/riscv: line up all of the registers in the info register dump
On Sat, Oct 9, 2021 at 3:51 PM Travis Geiselbrecht wrote: > > Ensure the columns for all of the register names and values line up. > No functional change, just a minor tweak to the output. > > Signed-off-by: Travis Geiselbrecht Reviewed-by: Alistair Francis Alistair > --- > target/riscv/cpu.c | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c > index 1d69d1887e..660f9ce131 100644 > --- a/target/riscv/cpu.c > +++ b/target/riscv/cpu.c > @@ -258,7 +258,7 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) > } > if (riscv_has_ext(env, RVH)) { > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "hstatus ", env->hstatus); > -qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "vsstatus ", > +qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "vsstatus", > (target_ulong)env->vsstatus); > } > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mip ", env->mip); > @@ -289,8 +289,8 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mtval ", env->mtval); > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "stval ", env->stval); > if (riscv_has_ext(env, RVH)) { > -qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "htval ", env->htval); > -qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mtval2 ", env->mtval2); > +qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "htval ", env->htval); > +qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mtval2 ", env->mtval2); > } > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mscratch", env->mscratch); > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "sscratch", env->sscratch); > @@ -298,7 +298,7 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) > #endif > > for (i = 0; i < 32; i++) { > -qemu_fprintf(f, " %s " TARGET_FMT_lx, > +qemu_fprintf(f, " %-8s " TARGET_FMT_lx, > riscv_int_regnames[i], env->gpr[i]); > if ((i & 3) == 3) { > qemu_fprintf(f, "\n"); > @@ -306,7 +306,7 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) 
> } > if (flags & CPU_DUMP_FPU) { > for (i = 0; i < 32; i++) { > -qemu_fprintf(f, " %s %016" PRIx64, > +qemu_fprintf(f, " %-8s %016" PRIx64, > riscv_fpr_regnames[i], env->fpr[i]); > if ((i & 3) == 3) { > qemu_fprintf(f, "\n"); > -- > 2.25.1 > >
Re: [PATCH 2/8] accel/tcg: Split out g2h_tlbe
On Mon, Oct 11, 2021 at 3:44 AM Richard Henderson wrote: > > Create a new function to combine a CPUTLBEntry addend > with the guest address to form a host address. > > Signed-off-by: Richard Henderson Reviewed-by: Alistair Francis Alistair > --- > accel/tcg/cputlb.c | 24 ++-- > 1 file changed, 14 insertions(+), 10 deletions(-) > > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c > index 46140ccff3..761f726722 100644 > --- a/accel/tcg/cputlb.c > +++ b/accel/tcg/cputlb.c > @@ -90,6 +90,11 @@ static inline size_t sizeof_tlb(CPUTLBDescFast *fast) > return fast->mask + (1 << CPU_TLB_ENTRY_BITS); > } > > +static inline uintptr_t g2h_tlbe(const CPUTLBEntry *tlb, target_ulong gaddr) > +{ > +return tlb->addend + (uintptr_t)gaddr; > +} > + > static void tlb_window_reset(CPUTLBDesc *desc, int64_t ns, > size_t max_entries) > { > @@ -976,8 +981,7 @@ static void tlb_reset_dirty_range_locked(CPUTLBEntry > *tlb_entry, > > if ((addr & (TLB_INVALID_MASK | TLB_MMIO | > TLB_DISCARD_WRITE | TLB_NOTDIRTY)) == 0) { > -addr &= TARGET_PAGE_MASK; > -addr += tlb_entry->addend; > +addr = g2h_tlbe(tlb_entry, addr & TARGET_PAGE_MASK); > if ((addr - start) < length) { > #if TCG_OVERSIZED_GUEST > tlb_entry->addr_write |= TLB_NOTDIRTY; > @@ -1527,7 +1531,7 @@ tb_page_addr_t get_page_addr_code_hostp(CPUArchState > *env, target_ulong addr, > return -1; > } > > -p = (void *)((uintptr_t)addr + entry->addend); > +p = (void *)g2h_tlbe(entry, addr); > if (hostp) { > *hostp = p; > } > @@ -1619,7 +1623,7 @@ static int probe_access_internal(CPUArchState *env, > target_ulong addr, > } > > /* Everything else is RAM. 
*/ > -*phost = (void *)((uintptr_t)addr + entry->addend); > +*phost = (void *)g2h_tlbe(entry, addr); > return flags; > } > > @@ -1727,7 +1731,7 @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong > addr, int mmu_idx, > data->v.io.offset = (iotlbentry->addr & TARGET_PAGE_MASK) + addr; > } else { > data->is_io = false; > -data->v.ram.hostaddr = (void *)((uintptr_t)addr + tlbe->addend); > +data->v.ram.hostaddr = (void *)g2h_tlbe(tlbe, addr); > } > return true; > } else { > @@ -1826,7 +1830,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, > target_ulong addr, > goto stop_the_world; > } > > -hostaddr = (void *)((uintptr_t)addr + tlbe->addend); > +hostaddr = (void *)g2h_tlbe(tlbe, addr); > > if (unlikely(tlb_addr & TLB_NOTDIRTY)) { > notdirty_write(env_cpu(env), addr, size, > @@ -1938,7 +1942,7 @@ load_helper(CPUArchState *env, target_ulong addr, > MemOpIdx oi, > access_type, op ^ (need_swap * MO_BSWAP)); > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > > /* > * Keep these two load_memop separate to ensure that the compiler > @@ -1975,7 +1979,7 @@ load_helper(CPUArchState *env, target_ulong addr, > MemOpIdx oi, > return res & MAKE_64BIT_MASK(0, size * 8); > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > return load_memop(haddr, op); > } > > @@ -2467,7 +2471,7 @@ store_helper(CPUArchState *env, target_ulong addr, > uint64_t val, > notdirty_write(env_cpu(env), addr, size, iotlbentry, retaddr); > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > > /* > * Keep these two store_memop separate to ensure that the compiler > @@ -2492,7 +2496,7 @@ store_helper(CPUArchState *env, target_ulong addr, > uint64_t val, > return; > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > store_memop(haddr, val, op); > } > > -- > 2.25.1 > >
Re: [PATCH 1/8] tcg: Add TCG_TARGET_SIGNED_ADDR32
On Mon, Oct 11, 2021 at 3:49 AM Richard Henderson wrote: > > Define as 0 for all tcg hosts. Put this in a separate header, > because we'll want this in places that do not ordinarily have > access to all of tcg/tcg.h. > > Signed-off-by: Richard Henderson Reviewed-by: Alistair Francis Alistair > --- > tcg/aarch64/tcg-target-sa32.h | 1 + > tcg/arm/tcg-target-sa32.h | 1 + > tcg/i386/tcg-target-sa32.h| 1 + > tcg/mips/tcg-target-sa32.h| 1 + > tcg/ppc/tcg-target-sa32.h | 1 + > tcg/riscv/tcg-target-sa32.h | 1 + > tcg/s390x/tcg-target-sa32.h | 1 + > tcg/sparc/tcg-target-sa32.h | 1 + > tcg/tci/tcg-target-sa32.h | 1 + > 9 files changed, 9 insertions(+) > create mode 100644 tcg/aarch64/tcg-target-sa32.h > create mode 100644 tcg/arm/tcg-target-sa32.h > create mode 100644 tcg/i386/tcg-target-sa32.h > create mode 100644 tcg/mips/tcg-target-sa32.h > create mode 100644 tcg/ppc/tcg-target-sa32.h > create mode 100644 tcg/riscv/tcg-target-sa32.h > create mode 100644 tcg/s390x/tcg-target-sa32.h > create mode 100644 tcg/sparc/tcg-target-sa32.h > create mode 100644 tcg/tci/tcg-target-sa32.h > > diff --git a/tcg/aarch64/tcg-target-sa32.h b/tcg/aarch64/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/aarch64/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/arm/tcg-target-sa32.h b/tcg/arm/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/arm/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/i386/tcg-target-sa32.h b/tcg/i386/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/i386/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/mips/tcg-target-sa32.h b/tcg/mips/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/mips/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git 
a/tcg/ppc/tcg-target-sa32.h b/tcg/ppc/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/ppc/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/riscv/tcg-target-sa32.h b/tcg/riscv/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/riscv/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/s390x/tcg-target-sa32.h b/tcg/s390x/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/s390x/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/sparc/tcg-target-sa32.h b/tcg/sparc/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/sparc/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/tci/tcg-target-sa32.h b/tcg/tci/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/tci/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > -- > 2.25.1 > >
Re: [PATCH] Trim some trailing space from human-readable output
On Sat, Oct 9, 2021 at 8:24 AM Markus Armbruster wrote: > > I noticed -cpu help printing enough trailing spaces to make the output > at least 84 characters wide. Looks ugly unless the terminal is wider. > Ugly or not, trailing spaces are stupid. > > The culprit is this line in x86_cpu_list_entry(): > > qemu_printf("x86 %-20s %-58s\n", name, desc); > > This prints a string with minimum field left-justified right before a > newline. Change it to > > qemu_printf("x86 %-20s %s\n", name, desc); > > which avoids the trailing spaces and is simpler to boot. > > A search for the pattern with "git-grep -E '%-[0-9]+s\\n'" found a few > more instances. Change them similarly. > > Signed-off-by: Markus Armbruster > --- > monitor/hmp-cmds.c | 2 +- > target/i386/cpu-dump.c | 4 ++-- > target/i386/cpu.c | 2 +- > target/ppc/cpu_init.c | 2 +- > target/s390x/cpu_models.c | 4 ++-- > target/xtensa/mmu_helper.c | 2 +- > 6 files changed, 8 insertions(+), 8 deletions(-) For target/xtensa: Acked-by: Max Filippov -- Thanks. -- Max
Re: [PATCH] Trim some trailing space from human-readable output
On 10/9/21 17:24, Markus Armbruster wrote: > I noticed -cpu help printing enough trailing spaces to make the output > at least 84 characters wide. Looks ugly unless the terminal is wider. > Ugly or not, trailing spaces are stupid. > > The culprit is this line in x86_cpu_list_entry(): > > qemu_printf("x86 %-20s %-58s\n", name, desc); > > This prints a string with minimum field left-justified right before a > newline. Change it to > > qemu_printf("x86 %-20s %s\n", name, desc); > > which avoids the trailing spaces and is simpler to boot. > > A search for the pattern with "git-grep -E '%-[0-9]+s\\n'" found a few > more instances. Change them similarly. > > Signed-off-by: Markus Armbruster > --- > monitor/hmp-cmds.c | 2 +- > target/i386/cpu-dump.c | 4 ++-- > target/i386/cpu.c | 2 +- > target/ppc/cpu_init.c | 2 +- > target/s390x/cpu_models.c | 4 ++-- > target/xtensa/mmu_helper.c | 2 +- > 6 files changed, 8 insertions(+), 8 deletions(-) Nitpicking, do you mind prefixing the patch subject with 'monitor:'? Otherwise: Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH v3 12/12] vfio-user: acceptance test
On 10/11/21 07:31, Jagannathan Raman wrote: > Acceptance test for libvfio-user in QEMU > > Signed-off-by: Elena Ufimtseva > Signed-off-by: John G Johnson > Signed-off-by: Jagannathan Raman > --- > MAINTAINERS | 1 + > tests/acceptance/vfio-user.py | 96 +++ > 2 files changed, 97 insertions(+) > create mode 100644 tests/acceptance/vfio-user.py > +class VfioUser(Test): > +""" > +:avocado: tags=vfiouser > +""" > +KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 ' > + > +def do_test(self, kernel_url, initrd_url, kernel_command_line, > +machine_type): > +"""Main test method""" > +self.require_accelerator('kvm') > + > +kernel_path = self.fetch_asset(kernel_url) > +initrd_path = self.fetch_asset(initrd_url) > + > +socket = os.path.join('/tmp', str(uuid.uuid4())) > +if os.path.exists(socket): > +os.remove(socket) > + > +# Create remote process > +remote_vm = self.get_vm() > +remote_vm.add_args('-machine', 'x-remote') > +remote_vm.add_args('-nodefaults') > +remote_vm.add_args('-device', 'lsi53c895a,id=lsi1') > +remote_vm.add_args('-object', 'vfio-user,id=vfioobj1,' > + 'devid=lsi1,socket='+socket) Nitpicking for style: spaces around '+' here, > +remote_vm.launch() > + > +# Create proxy process > +self.vm.set_console() > +self.vm.add_args('-machine', machine_type) > +self.vm.add_args('-accel', 'kvm') > +self.vm.add_args('-cpu', 'host') > +self.vm.add_args('-object', > + 'memory-backend-memfd,id=sysmem-file,size=2G') > +self.vm.add_args('--numa', 'node,memdev=sysmem-file') > +self.vm.add_args('-m', '2048') > +self.vm.add_args('-kernel', kernel_path, > + '-initrd', initrd_path, > + '-append', kernel_command_line) > +self.vm.add_args('-device', > + 'vfio-user-pci,' > + 'socket='+socket) and here. Otherwise: Reviewed-by: Philippe Mathieu-Daudé Thanks for adding this test :)
Re: [PATCH 0/8] target/mips: Use tcg_constant_*
On 10/3/21 19:57, Philippe Mathieu-Daudé wrote: > Replace temporary TCG registers by tcg_constant_*() when possible. > > Philippe Mathieu-Daudé (8): > target/mips: Remove unused register from MSA 2R/2RF instruction format > target/mips: Use tcg_constant_i32() in gen_msa_elm_df() > target/mips: Use tcg_constant_i32() in gen_msa_2rf() > target/mips: Use tcg_constant_i32() in gen_msa_2r() > target/mips: Use tcg_constant_i32() in gen_msa_3rf() > target/mips: Use explicit extract32() calls in gen_msa_i5() > target/mips: Use tcg_constant_i32() in gen_msa_i5() > target/mips: Use tcg_constant_tl() in gen_compute_compact_branch() > > target/mips/tcg/msa_translate.c | 87 + > target/mips/tcg/translate.c | 4 +- > 2 files changed, 45 insertions(+), 46 deletions(-) Thanks, series applied to mips-next tree.
Re: [PATCH 2/4] MAINTAINERS: Add entries to cover MIPS CPS / GIC hardware
Hi Paul, You are the maintainer of the Boston machine which is the only one using these peripherals; would you agree to be (co-)maintainer of these files? Regards, Phil. On 10/4/21 11:25, Philippe Mathieu-Daudé wrote: > MIPS CPS and GIC models are unrelated to the TCG frontend. > Move them as new sections under the 'Devices' group. > > Cc: Paul Burton > Signed-off-by: Philippe Mathieu-Daudé > --- > MAINTAINERS | 22 -- > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index cfee52a3046..a5268ad0651 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -239,14 +239,8 @@ F: target/mips/ > F: configs/devices/mips*/* > F: disas/mips.c > F: docs/system/cpu-models-mips.rst.inc > -F: hw/intc/mips_gic.c > F: hw/mips/ > -F: hw/misc/mips_* > -F: hw/timer/mips_gictimer.c > -F: include/hw/intc/mips_gic.h > F: include/hw/mips/ > -F: include/hw/misc/mips_* > -F: include/hw/timer/mips_gictimer.h > F: tests/tcg/mips/ > > MIPS TCG CPUs (nanoMIPS ISA) > @@ -2271,6 +2265,22 @@ S: Odd Fixes > F: hw/intc/openpic.c > F: include/hw/ppc/openpic.h > > +MIPS CPS > +M: Paul Burton > +M: Philippe Mathieu-Daudé > +S: Odd Fixes > +F: hw/misc/mips_* > +F: include/hw/misc/mips_* > + > +MIPS GIC > +M: Paul Burton > +M: Philippe Mathieu-Daudé > +S: Odd Fixes > +F: hw/intc/mips_gic.c > +F: hw/timer/mips_gictimer.c > +F: include/hw/intc/mips_gic.h > +F: include/hw/timer/mips_gictimer.h > + > Subsystems > -- > Overall Audio backends >
Re: [PATCH v1 2/2] memory: Update description of memory_region_is_mapped()
On 10/11/21 23:21, Richard Henderson wrote: > On 10/11/21 10:45 AM, David Hildenbrand wrote: >> /** >> * memory_region_is_mapped: returns true if #MemoryRegion is mapped >> - * into any address space. >> + * into another #MemoryRegion directly. Will return false if the >> + * #MemoryRegion is mapped indirectly via an alias. > > Hmm. I guess. It kinda sorta sounds like a bug, but I don't know the > interface well enough to tell. I tend to agree there is a generic issue with aliases, see: https://www.mail-archive.com/qemu-devel@nongnu.org/msg732527.html then https://www.mail-archive.com/qemu-devel@nongnu.org/msg799622.html "memory: Directly dispatch alias accesses on origin memory region" The API description looks OK to me, I'd rather change the implementation... Maybe we need a MR_ALIAS_FOREACH() macro?
Re: [PATCH 1/8] tcg: Add TCG_TARGET_SIGNED_ADDR32
On 10/10/21 19:43, Richard Henderson wrote: > Define as 0 for all tcg hosts. Put this in a separate header, > because we'll want this in places that do not ordinarily have > access to all of tcg/tcg.h. > > Signed-off-by: Richard Henderson > --- > tcg/aarch64/tcg-target-sa32.h | 1 + > tcg/arm/tcg-target-sa32.h | 1 + > tcg/i386/tcg-target-sa32.h| 1 + > tcg/mips/tcg-target-sa32.h| 1 + > tcg/ppc/tcg-target-sa32.h | 1 + > tcg/riscv/tcg-target-sa32.h | 1 + > tcg/s390x/tcg-target-sa32.h | 1 + > tcg/sparc/tcg-target-sa32.h | 1 + > tcg/tci/tcg-target-sa32.h | 1 + > 9 files changed, 9 insertions(+) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH 4/8] accel/tcg: Add guest_base_signed_addr32 for user-only
On 10/10/21 19:43, Richard Henderson wrote: > While the host may prefer to treat 32-bit addresses as signed, > there are edge cases of guests that cannot be implemented with > addresses 0x7fff_ and 0x8000_ being non-consecutive. > > Therefore, default to guest_base_signed_addr32 false, and allow > probe_guest_base to determine whether it is possible to set it > to true. A tcg backend which sets TCG_TARGET_SIGNED_ADDR32 will > have to cope with either setting for user-only. > > Signed-off-by: Richard Henderson > --- > include/exec/cpu-all.h | 16 > include/exec/cpu_ldst.h | 3 ++- > bsd-user/main.c | 4 > linux-user/main.c | 3 +++ > 4 files changed, 25 insertions(+), 1 deletion(-) > > diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h > index 32cfb634c6..80b5e17329 100644 > --- a/include/exec/cpu-all.h > +++ b/include/exec/cpu-all.h > @@ -146,6 +146,7 @@ static inline void tswap64s(uint64_t *s) > > #if defined(CONFIG_USER_ONLY) > #include "exec/user/abitypes.h" > +#include "tcg-target-sa32.h" Unrelated but this header could be simplified by moving this block to a new header such "exec/user/address.h". > > /* On some host systems the guest address space is reserved on the host. > * This allows the guest address space to be offset to a convenient location. > @@ -154,6 +155,21 @@ extern uintptr_t guest_base; > extern bool have_guest_base; > extern unsigned long reserved_va; > > +#if TCG_TARGET_SIGNED_ADDR32 && TARGET_LONG_BITS == 32 > +extern bool guest_base_signed_addr32; > +#else > +#define guest_base_signed_addr32 false > +#endif > + > +static inline void set_guest_base_signed_addr32(void) > +{ > +#ifdef guest_base_signed_addr32 > +qemu_build_not_reached(); > +#else > +guest_base_signed_addr32 = true; > +#endif > +} > + > /* > * Limit the guest addresses as best we can. > * Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH 8/8] target/riscv: Support TCG_TARGET_SIGNED_ADDR32
On 10/10/21 19:44, Richard Henderson wrote: > All RV64 32-bit operations sign-extend the output, so we are easily > able to keep TCG_TYPE_I32 values sign-extended in host registers. > > Signed-off-by: Richard Henderson > --- > tcg/riscv/tcg-target-sa32.h | 6 +- > tcg/riscv/tcg-target.c.inc | 8 ++-- > 2 files changed, 7 insertions(+), 7 deletions(-) Nice. Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH] memory: Log access direction for invalid accesses
On 10/11/21 10:32 AM, BALATON Zoltan wrote: In memory_region_access_valid() invalid accesses are logged to help debugging but the log message does not say if it was a read or write. Log that too to better identify the access causing the problem. Signed-off-by: BALATON Zoltan --- softmmu/memory.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) Thanks, queued for tcg-next. r~
Re: [PATCH 2/8] accel/tcg: Split out g2h_tlbe
On 10/10/21 19:43, Richard Henderson wrote: > Create a new function to combine a CPUTLBEntry addend > with the guest address to form a host address. > > Signed-off-by: Richard Henderson > --- > accel/tcg/cputlb.c | 24 ++-- > 1 file changed, 14 insertions(+), 10 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH v3 6/9] bsd-user/mmap.c: Convert to qemu_log logging for mmap debugging
On 10/8/21 23:23, Warner Losh wrote: > Convert DEBUG_MMAP to qemu_log CPU_LOG_PAGE. > > Signed-off-by: Warner Losh > --- > bsd-user/mmap.c | 53 + > 1 file changed, 23 insertions(+), 30 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH] s390x/ipl: check kernel command line size
On 10/11/21 15:38, Thomas Huth wrote: > On 06/10/2021 11.26, Marc Hartmayer wrote: >> Check if the provided kernel command line exceeds the maximum size of >> the s390x >> Linux kernel command line size, which is 896 bytes. >> >> Reported-by: Sven Schnelle >> Signed-off-by: Marc Hartmayer >> --- >> hw/s390x/ipl.c | 12 +++- >> 1 file changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/hw/s390x/ipl.c b/hw/s390x/ipl.c >> index 1821c6faeef3..a58ea58cc736 100644 >> --- a/hw/s390x/ipl.c >> +++ b/hw/s390x/ipl.c >> @@ -38,6 +38,7 @@ >> #define KERN_IMAGE_START 0x01UL >> #define LINUX_MAGIC_ADDR 0x010008UL >> #define KERN_PARM_AREA 0x010480UL >> +#define KERN_PARM_AREA_SIZE 0x000380UL >> #define INITRD_START 0x80UL >> #define INITRD_PARM_START 0x010408UL >> #define PARMFILE_START 0x001000UL >> @@ -190,10 +191,19 @@ static void s390_ipl_realize(DeviceState *dev, >> Error **errp) >> * loader) and it won't work. For this case we force it to >> 0x1, too. >> */ >> if (pentry == KERN_IMAGE_START || pentry == 0x800) { >> - char *parm_area = rom_ptr(KERN_PARM_AREA, >> strlen(ipl->cmdline) + 1); >> + size_t cmdline_size = strlen(ipl->cmdline) + 1; >> + char *parm_area = rom_ptr(KERN_PARM_AREA, cmdline_size); >> + >> ipl->start_addr = KERN_IMAGE_START; >> /* Overwrite parameters in the kernel image, which are >> "rom" */ >> if (parm_area) { >> + if (cmdline_size > KERN_PARM_AREA_SIZE) { >> + error_setg(errp, >> + "kernel command line exceeds maximum >> size: %lu > %lu", > > I think the first %lu should be %zd instead? %zu ;) Reviewed-by: Philippe Mathieu-Daudé > > Apart from that, the patch looks fine to me... so if you agree, I can > fix that up when picking up the patch. > > Thomas
Re: [RFC PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
Richard Henderson writes: > On 10/11/21 8:57 AM, Alex Bennée wrote: >> We use INDEX_op_insn_start to make the start of instruction >> boundaries. If we don't do it in the .insn_start hook things get >> confused especially now plugins want to use that marking to identify >> the start of instructions and will bomb out if it sees instrumented >> ops before the first instruction boundary. >> Signed-off-by: Alex Bennée >> --- >> target/s390x/tcg/translate.c | 25 ++--- >> 1 file changed, 22 insertions(+), 3 deletions(-) >> diff --git a/target/s390x/tcg/translate.c >> b/target/s390x/tcg/translate.c >> index f284870cd2..fe145ff2eb 100644 >> --- a/target/s390x/tcg/translate.c >> +++ b/target/s390x/tcg/translate.c >> @@ -6380,9 +6380,6 @@ static DisasJumpType translate_one(CPUS390XState *env, >> DisasContext *s) >> /* Search for the insn in the table. */ >> insn = extract_insn(env, s); >> -/* Emit insn_start now that we know the ILEN. */ >> -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); >> - >> /* Not found means unimplemented/illegal opcode. */ >> if (insn == NULL) { >> qemu_log_mask(LOG_UNIMP, "unimplemented opcode 0x%02x%02x\n", >> @@ -6550,8 +6547,30 @@ static void s390x_tr_tb_start(DisasContextBase *db, >> CPUState *cs) >> { >> } >> +/* >> + * We just enough partial instruction decoding here to calculate the >> + * length of the instruction so we can drop the INDEX_op_insn_start >> + * before anything else is emitted in the TCGOp stream. >> + * >> + * See extract_insn for the full decode. 
>> + */ >> static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) >> { >> +CPUS390XState *env = cs->env_ptr; >> +DisasContext *s = container_of(dcbase, DisasContext, base); >> +uint64_t insn, pc = s->base.pc_next; >> +int op, ilen; >> + >> +if (unlikely(s->ex_value)) { >> +ilen = s->ex_value & 0xf; >> +} else { >> +insn = ld_code2(env, s, pc); /* FIXME: don't reload same pc twice >> */ > > I think we might as well delay the set of ilen until after the normal > load, rather than introduce a fixme. I'd got as far as this before thinking it was getting messy: --8<---cut here---start->8--- squash! target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start 1 file changed, 16 insertions(+), 19 deletions(-) target/s390x/tcg/translate.c | 35 --- modified target/s390x/tcg/translate.c @@ -147,6 +147,7 @@ struct DisasContext { */ uint64_t pc_tmp; uint32_t ilen; +uint16_t start_of_insn; /* collected at s390x_tr_insn_start */ enum cc_op cc_op; bool do_debug; }; @@ -388,10 +389,10 @@ static void update_cc_op(DisasContext *s) } } -static inline uint64_t ld_code2(CPUS390XState *env, DisasContext *s, +static inline uint16_t ld_code2(CPUS390XState *env, DisasContext *s, uint64_t pc) { -return (uint64_t)translator_lduw(env, >base, pc); +return translator_lduw(env, >base, pc); } static inline uint64_t ld_code4(CPUS390XState *env, DisasContext *s, @@ -6261,7 +6262,7 @@ static void extract_field(DisasFields *o, const DisasField *f, uint64_t insn) static const DisasInsn *extract_insn(CPUS390XState *env, DisasContext *s) { uint64_t insn, pc = s->base.pc_next; -int op, op2, ilen; +int op, op2; const DisasInsn *info; if (unlikely(s->ex_value)) { @@ -6272,28 +6273,25 @@ static const DisasInsn *extract_insn(CPUS390XState *env, DisasContext *s) /* Extract the values saved by EXECUTE. 
 */ insn = s->ex_value & 0xffffffffffff0000ull; -ilen = s->ex_value & 0xf; +s->ilen = s->ex_value & 0xf; op = insn >> 56; } else { -insn = ld_code2(env, s, pc); -op = (insn >> 8) & 0xff; -ilen = get_ilen(op); -switch (ilen) { +op = extract32(s->start_of_insn, 8, 8); +insn = deposit64(0, 48, 16, s->start_of_insn); +switch (s->ilen) { case 2: -insn = insn << 48; break; case 4: -insn = ld_code4(env, s, pc) << 32; +insn = deposit64(insn, 32, 16, ld_code2(env, s, pc + 2)); break; case 6: -insn = (insn << 48) | (ld_code4(env, s, pc + 2) << 16); +insn = deposit64(insn, 16, 32, ld_code4(env, s, pc + 2)); break; default: g_assert_not_reached(); } } -s->pc_tmp = s->base.pc_next + ilen; -s->ilen = ilen; +s->pc_tmp = s->base.pc_next + s->ilen; /* We can't actually determine the insn format until we've looked up the full insn opcode. Which we can't do without locating the @@ -6558,18 +6556,17 @@ static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { CPUS390XState *env = cs->env_ptr; DisasContext *s = container_of(dcbase,
Re: [PATCH v1 2/2] memory: Update description of memory_region_is_mapped()
On 10/11/21 10:45 AM, David Hildenbrand wrote: /** * memory_region_is_mapped: returns true if #MemoryRegion is mapped - * into any address space. + * into another #MemoryRegion directly. Will return false if the + * #MemoryRegion is mapped indirectly via an alias. Hmm. I guess. It kinda sorta sounds like a bug, but I don't know the interface well enough to tell. r~
Re: [PATCH v1 1/2] machine: Use host_memory_backend_is_mapped() in machine_consume_memdev()
On 10/11/21 10:45 AM, David Hildenbrand wrote: memory_region_is_mapped() is the wrong check, we actually want to check whether the backend is already marked mapped. For example, memory regions mapped via an alias, such as NVDIMMs, currently don't make memory_region_is_mapped() return "true". As the machine is initialized before any memory devices (and thereby before NVDIMMs are initialized), this isn't a fix but merely a cleanup. Signed-off-by: David Hildenbrand --- hw/core/machine.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Richard Henderson r~
Re: [PATCH v2 09/15] softmmu/qdev-monitor: add error handling in qdev_set_id
On Fri, Oct 08, 2021 at 03:34:36PM +0200, Kevin Wolf wrote: > From: Damien Hedde > > qdev_set_id() is mostly used when the user adds a device (using > -device cli option or device_add qmp command). This commit adds > an error parameter to handle the case where the given id is > already taken. > > Also document the function and add a return value in order to > be able to capture success/failure: the function now returns the > id in case of success, or NULL in case of failure. > > The commit modifies the 2 calling places (qdev-monitor and > xen-legacy-backend) to add the error object parameter. > > Note that the id is, right now, guaranteed to be unique because > all ids came from the "device" QemuOptsList where the id is used > as key. This addition is a preparation for a future commit which > will relax the uniqueness. > > Signed-off-by: Damien Hedde > Signed-off-by: Kevin Wolf > --- > +++ b/softmmu/qdev-monitor.c > @@ -593,22 +593,34 @@ static BusState *qbus_find(const char *path, Error > **errp) > } > > /* Takes ownership of @id, will be freed when deleting the device */ > -void qdev_set_id(DeviceState *dev, char *id) > +const char *qdev_set_id(DeviceState *dev, char *id, Error **errp) > { > -if (id) { > -dev->id = id; > -} > +ObjectProperty *prop; > > -if (dev->id) { > -object_property_add_child(qdev_get_peripheral(), dev->id, > - OBJECT(dev)); > +assert(!dev->id && !dev->realized); > + > +/* > + * object_property_[try_]add_child() below will assert the device > + * has no parent > + */ > +if (id) { > +prop = object_property_try_add_child(qdev_get_peripheral(), id, > + OBJECT(dev), NULL); > +if (prop) { > +dev->id = id; > +} else { > +error_setg(errp, "Duplicate device ID '%s'", id); > +return NULL; id is not freed on this error path... 
> +} > } else { > static int anon_count; > gchar *name = g_strdup_printf("device[%d]", anon_count++); > -object_property_add_child(qdev_get_peripheral_anon(), name, > - OBJECT(dev)); > +prop = object_property_add_child(qdev_get_peripheral_anon(), name, > + OBJECT(dev)); > g_free(name); > } > + > +return prop->name; > } > > DeviceState *qdev_device_add(QemuOpts *opts, Error **errp) > @@ -691,7 +703,13 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error > **errp) > } > } > > -qdev_set_id(dev, g_strdup(qemu_opts_id(opts))); > +/* > + * set dev's parent and register its id. > + * If it fails it means the id is already taken. > + */ > +if (!qdev_set_id(dev, g_strdup(qemu_opts_id(opts)), errp)) { > +goto err_del_dev; ...nor on this, which means the g_strdup() leaks on failure. > +} > > /* set properties */ > if (qemu_opt_foreach(opts, set_property, dev, errp)) { > -- > 2.31.1 > -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
On Mon, Oct 11, 2021 at 5:45 PM Eric Blake wrote: > > On Mon, Oct 11, 2021 at 04:38:23PM -0300, Leonardo Bras Soares Passos wrote: > > > > /** > > > > - * qio_channel_writev_full: > > > > + * qio_channel_writev_full_flags: > > > > * @ioc: the channel object > > > > * @iov: the array of memory regions to write data from > > > > * @niov: the length of the @iov array > > > > * @fds: an array of file handles to send > > > > * @nfds: number of file handles in @fds > > > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > > > * @errp: pointer to a NULL-initialized error object > > > > * > > > > * Write data to the IO channel, reading it from the > > > > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > > * guaranteed. If the channel is non-blocking and no > > > > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > > > > * > > > > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > > > > + * function will return once each buffer was queued for > > > > + * sending. > > > > > > This would be a good place to document the requirement to keep the > > > buffer unchanged until the zerocopy sequence completes. > > > > That makes sense, even though that may be true for just some > > implementations, > > it makes sense to document it here. > > > > > > > Ok, > > Is it enough to document it in a single one of the places suggested, or > > would you recommend documenting it in all suggested places? > > Ah, the curse of maintaining copy-and-paste. If you can find a way to > say "see this other type for limitations" that sounds fine, it avoids > the risk of later edits touching one but not all identical copies. > But our current process for generating sphinx documentation from the > qapi generator does not have cross-referencing abilities that I'm > aware of. Markus or John, any thoughts? 
> > > > > > > > +int qio_channel_flush_zerocopy(QIOChannel *ioc, > > > > + Error **errp) > > > > +{ > > > > +QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc); > > > > + > > > > +if (!klass->io_flush_zerocopy || > > > > +!qio_channel_has_feature(ioc, > > > > QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY)) { > > > > +return 0; > > > > > > Matches your documentation, but an ideal app should not be trying to > > > flush if the write failed in the first place. So wouldn't it be > > > better to return -1 or even abort() on a coding error? > > > > The point here is that any valid user of zerocopy_flush would have > > already used zerocopy_writev > > at some point, and failed if not supported / enabled. > > > > Having this not returning error can help the user keep a simpler > > approach when using > > a setup in which the writev can happen in both zerocopy or default behavior. > > > > I mean, the user will not need to check if zerocopy was or was not > > enabled, and just flush anyway. > > > > But if it's not good behavior, or you guys think it's a better > > approach to fail here, I can also do that. > > Either flush is supposed to be a no-op when zerocopy fails (so > returning 0 is okay), or should not be attempted unless zerocopy > succeeded (in which case abort()ing seems like the best way to point > out the programmer's error). But I wasn't clear from your > documentation which of the two behaviors you had in mind. Oh, sorry about that. Yeah, I intend to use it as a no-op. If it's fine I will update the docs for v5. > > -- > Eric Blake, Principal Software Engineer > Red Hat, Inc. +1-919-301-3266 > Virtualization: qemu.org | libvirt.org > Thanks!
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
On Mon, Oct 11, 2021 at 04:38:23PM -0300, Leonardo Bras Soares Passos wrote: > > > /** > > > - * qio_channel_writev_full: > > > + * qio_channel_writev_full_flags: > > > * @ioc: the channel object > > > * @iov: the array of memory regions to write data from > > > * @niov: the length of the @iov array > > > * @fds: an array of file handles to send > > > * @nfds: number of file handles in @fds > > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > > * @errp: pointer to a NULL-initialized error object > > > * > > > * Write data to the IO channel, reading it from the > > > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > * guaranteed. If the channel is non-blocking and no > > > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > > > * > > > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > > > + * function will return once each buffer was queued for > > > + * sending. > > > > This would be a good place to document the requirement to keep the > > buffer unchanged until the zerocopy sequence completes. > > That makes sense, even though that may be true for just some implementations, > it makes sense to document it here. > > > Ok, > Is it enough to document it in a single one of the places suggested, or > would you recommend documenting it in all suggested places? Ah, the curse of maintaining copy-and-paste. If you can find a way to say "see this other type for limitations" that sounds fine, it avoids the risk of later edits touching one but not all identical copies. But our current process for generating sphinx documentation from the qapi generator does not have cross-referencing abilities that I'm aware of. Markus or John, any thoughts? 
> > > > > +int qio_channel_flush_zerocopy(QIOChannel *ioc, > > > + Error **errp) > > > +{ > > > +QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc); > > > + > > > +if (!klass->io_flush_zerocopy || > > > +!qio_channel_has_feature(ioc, > > > QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY)) { > > > +return 0; > > > > Matches your documentation, but an ideal app should not be trying to > > flush if the write failed in the first place. So wouldn't it be > > better to return -1 or even abort() on a coding error? > > The point here is that any valid user of zerocopy_flush would have > already used zerocopy_writev > at some point, and failed if not supported / enabled. > > Having this not returning error can help the user keep a simpler > approach when using > a setup in which the writev can happen in both zerocopy or default behavior. > > I mean, the user will not need to check if zerocopy was or was not > enabled, and just flush anyway. > > But if it's not good behavior, or you guys think it's a better > approach to fail here, I can also do that. Either flush is supposed to be a no-op when zerocopy fails (so returning 0 is okay), or should not be attempted unless zerocopy succeeded (in which case abort()ing seems like the best way to point out the programmer's error). But I wasn't clear from your documentation which of the two behaviors you had in mind. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
[PATCH v1] libvhost-user: fix VHOST_USER_REM_MEM_REG skipping mmap_addr
We end up not copying the mmap_addr of all existing regions, resulting in a SEGFAULT once we actually try to map/access anything within our memory regions. Fixes: 875b9fd97b34 ("Support individual region unmap in libvhost-user") Cc: qemu-sta...@nongnu.org Cc: Michael S. Tsirkin Cc: Raphael Norwitz Cc: "Marc-André Lureau" Cc: Stefan Hajnoczi Cc: Paolo Bonzini Cc: Coiby Xu Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 1 + 1 file changed, 1 insertion(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index bf09693255..787f4d2d4f 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -816,6 +816,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { shadow_regions[j].gpa = dev->regions[i].gpa; shadow_regions[j].size = dev->regions[i].size; shadow_regions[j].qva = dev->regions[i].qva; +shadow_regions[j].mmap_addr = dev->regions[i].mmap_addr; shadow_regions[j].mmap_offset = dev->regions[i].mmap_offset; j++; } else { -- 2.31.1
Re: [PATCH v4 3/3] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)
Hello Eric, On Mon, Oct 11, 2021 at 4:32 PM Eric Blake wrote: > > On Sat, Oct 09, 2021 at 04:56:13AM -0300, Leonardo Bras wrote: > > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel > > zerocopy interface. > > > > Change multifd_send_sync_main() so it can distinguish the last sync from > > the setup and per-iteration ones, so a flush_zerocopy() can be called > > at the last sync in order to make sure all RAM is sent before finishing > > the migration. > > > > Also make it return -1 if flush_zerocopy() fails, in order to cancel > > the migration process, and avoid resuming the guest in the target host > > without receiving all current RAM. > > > > This will work fine on RAM migration because the RAM pages are not usually > > freed, > > and there is no problem on changing the pages content between async_send() > > and > > the actual sending of the buffer, because this change will dirty the page > > and > > cause it to be re-sent on a next iteration anyway. > > > > Given a lot of locked memory may be needed in order to use multid migration > > with zerocopy enabled, make it optional by creating a new parameter > > multifd-zerocopy on qapi, so low-privileged users can still perform multifd > > migrations. > > > > Signed-off-by: Leonardo Bras > > --- > > qapi/migration.json | 18 ++ > > migration/migration.h | 1 + > > migration/multifd.h | 2 +- > > migration/migration.c | 20 > > migration/multifd.c | 33 - > > migration/ram.c | 20 +--- > > monitor/hmp-cmds.c| 4 > > 7 files changed, 85 insertions(+), 13 deletions(-) > > > > diff --git a/qapi/migration.json b/qapi/migration.json > > index 88f07baedd..c4890cbb54 100644 > > --- a/qapi/migration.json > > +++ b/qapi/migration.json > > @@ -724,6 +724,11 @@ > > # will consume more CPU. > > # Defaults to 1. (Since 5.0) > > # > > +# @multifd-zerocopy: Controls behavior on sending memory pages on multifd > > migration. 
> > +#When true, enables a zerocopy mechanism for sending > > memory > > +#pages, if host does support it. > > s/does support/supports/ (several instances this patch) I will make sure to fix that in v5. > > > +#Defaults to false. (Since 6.2) > > +# > > # @block-bitmap-mapping: Maps block nodes and bitmaps on them to > > #aliases for the purpose of dirty bitmap > > migration. Such > > #aliases may for example be the corresponding > > names on the > > @@ -758,6 +763,7 @@ > > 'xbzrle-cache-size', 'max-postcopy-bandwidth', > > 'max-cpu-throttle', 'multifd-compression', > > 'multifd-zlib-level' ,'multifd-zstd-level', > > +'multifd-zerocopy', > > 'block-bitmap-mapping' ] } > > Should this feature be guarded with 'if':'CONFIG_LINUX', since that's > the only platform where you even compile the code (even then, it can > still fail if the kernel doesn't support it). I think it makes sense for the feature to always be available, even though it's not supported outside linux > v4.14. IMHO it makes more sense for the user to get an error when it starts migration, due to host not supporting zerocopy, than the error happening in the config step, which may cause the user to question if it was the right parameter. The config message error here could also be ignored, and users can think zerocopy is working, while it's not. For automated migrations, this feature should never be enabled for non-linux / older linux hosts anyway. Is there a good argument I am missing for this feature being disabled on non-linux? > > -- > Eric Blake, Principal Software Engineer > Red Hat, Inc. +1-919-301-3266 > Virtualization: qemu.org | libvirt.org > Best regards, Leo
Re: [PATCH v4 2/3] QIOChannelSocket: Implement io_writev_zerocopy & io_flush_zerocopy for CONFIG_LINUX
Hello Eric, On Mon, Oct 11, 2021 at 4:28 PM Eric Blake wrote: > > On Sat, Oct 09, 2021 at 04:56:12AM -0300, Leonardo Bras wrote: > > For CONFIG_LINUX, implement the new optional callbacks io_write_zerocopy and > > io_flush_zerocopy on QIOChannelSocket, but enables it only when MSG_ZEROCOPY > > feature is available in the host kernel, which is checked on > > qio_channel_socket_connect_sync() > > > > qio_channel_socket_writev() contents were moved to a helper function > > qio_channel_socket_writev_flags() which accepts an extra argument for flags. > > (This argument is passed directly to sendmsg(). > > > > The above helper function is used to implement qio_channel_socket_writev(), > > with flags = 0, keeping it's behavior unchanged, and > > its (remember, "it's" is shorthand for "it is", which does not fit here) > > > qio_channel_socket_writev_zerocopy() with flags = MSG_ZEROCOPY. > > > > qio_channel_socket_flush_zerocopy() was implemented by counting how many > > times > > sendmsg(...,MSG_ZEROCOPY) was sucessfully called, and then reading the > > socket's error queue, in order to find how many of them finished sending. > > Flush will loop until those counters are the same, or until some error > > ocurs. > > occurs Thanks! (I will check if my codespell is enabled in this setup) > > > > > A new function qio_channel_socket_poll() was also created in order to avoid > > busy-looping recvmsg() in qio_channel_socket_flush_zerocopy() while waiting > > for > > updates in socket's error queue. > > > > Notes on using writev_zerocopy(): > > 1: Buffer > > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid > > copying, > > some caution is necessary to avoid overwriting any buffer before it's sent. > > If something like this happen, a newer version of the buffer may be sent > > instead. > > - If this is a problem, it's recommended to call flush_zerocopy() before > > freeing > > or re-using the buffer. 
> > > > 2: Locked memory > > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, > > and > > unlocked after it's sent. > > - Depending on the size of each buffer, and how often it's sent, it may > > require > > a larger amount of locked memory than usually available to non-root user. > > - If the required amount of locked memory is not available, writev_zerocopy > > will return an error, which can abort an operation like migration, > > - Because of this, when an user code wants to add zerocopy as a feature, it > > requires a mechanism to disable it, so it can still be acessible to less > > privileged users. > > > > Signed-off-by: Leonardo Bras > > --- > > include/io/channel-socket.h | 2 + > > include/io/channel.h| 1 + > > io/channel-socket.c | 180 ++-- > > 3 files changed, 173 insertions(+), 10 deletions(-) > > > > +static int qio_channel_socket_flush_zerocopy(QIOChannel *ioc, > > + Error **errp) > > +{ > > > + > > +/* No errors, count sucessfully finished sendmsg()*/ > > Space before */ Thanks! Also, typo on 'successfully'. > > > +sioc->zerocopy_sent += serr->ee_data - serr->ee_info + 1; > > +} > > +return 0; > > +} > > + > > +#endif /* CONFIG_LINUX */ > > + > > static int > > qio_channel_socket_set_blocking(QIOChannel *ioc, > > bool enabled, > > @@ -787,6 +943,10 @@ static void qio_channel_socket_class_init(ObjectClass > > *klass, > > ioc_klass->io_set_delay = qio_channel_socket_set_delay; > > ioc_klass->io_create_watch = qio_channel_socket_create_watch; > > ioc_klass->io_set_aio_fd_handler = > > qio_channel_socket_set_aio_fd_handler; > > +#ifdef CONFIG_LINUX > > +ioc_klass->io_writev_zerocopy = qio_channel_socket_writev_zerocopy; > > +ioc_klass->io_flush_zerocopy = qio_channel_socket_flush_zerocopy; > > +#endif > > } > > I did a high-level look at the code, rather than an in-depth review of > whether zero-copy was being used correctly. It's so far been very helpful. Thanks! Best regards, Leo
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
Hello Eric, thank you for the feedback! On Mon, Oct 11, 2021 at 4:17 PM Eric Blake wrote: > > On Sat, Oct 09, 2021 at 04:56:11AM -0300, Leonardo Bras wrote: > > Adds io_async_writev and io_async_flush as optional callback to > > QIOChannelClass, > > Are these names accurate? I am sorry, I think I missed some renaming before generating the patchset. As you suggested those names are incorrect (as they reflect a previous naming used in v3). I will replace them for io_{writev,flush}_zerocopy in v5. > > > allowing the implementation of asynchronous writes by subclasses. > > > > How to use them: > > - Write data using qio_channel_writev_zerocopu(), > > s/copu/copy/ Thanks, I will fix that. > > > - Wait write completion with qio_channel_flush_zerocopy(). > > > > Notes: > > As some zerocopy implementations work asynchronously, it's > > recommended to keep the write buffer untouched until the return of > > qio_channel_flush_zerocopy(), by the risk of sending an updated buffer > > s/by/to avoid/ > > > instead of the one at the write. > > > > As the new callbacks are optional, if a subclass does not implement them, > > then: > > - io_async_writev will return -1, > > - io_async_flush will return 0 without changing anything. > > Are these names accurate? They are not, I will replace them for io_{writev,flush}_zerocopy in v5. > > > > > Also, some functions like qio_channel_writev_full_all() were adapted to > > receive a flag parameter. That allows shared code between zerocopy and > > non-zerocopy writev. 
> > > > Signed-off-by: Leonardo Bras > > --- > > include/io/channel.h | 103 +++ > > io/channel.c | 74 +++ > > 2 files changed, 141 insertions(+), 36 deletions(-) > > > > diff --git a/include/io/channel.h b/include/io/channel.h > > index 88988979f8..e7d4e1521f 100644 > > --- a/include/io/channel.h > > +++ b/include/io/channel.h > > @@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass, > > > > #define QIO_CHANNEL_ERR_BLOCK -2 > > > > +#define QIO_CHANNEL_WRITE_FLAG_ZEROCOPY 0x1 > > + > > typedef enum QIOChannelFeature QIOChannelFeature; > > > > enum QIOChannelFeature { > > QIO_CHANNEL_FEATURE_FD_PASS, > > QIO_CHANNEL_FEATURE_SHUTDOWN, > > QIO_CHANNEL_FEATURE_LISTEN, > > +QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY, > > }; > > > > > > @@ -136,6 +139,12 @@ struct QIOChannelClass { > >IOHandler *io_read, > >IOHandler *io_write, > >void *opaque); > > +ssize_t (*io_writev_zerocopy)(QIOChannel *ioc, > > + const struct iovec *iov, > > + size_t niov, > > + Error **errp); > > +int (*io_flush_zerocopy)(QIOChannel *ioc, > > + Error **errp); > > Indentation is off by one. Thanks, I will fix that for v5. > > > }; > > > > /* General I/O handling functions */ > > @@ -222,12 +231,13 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > > > > /** > > - * qio_channel_writev_full: > > + * qio_channel_writev_full_flags: > > * @ioc: the channel object > > * @iov: the array of memory regions to write data from > > * @niov: the length of the @iov array > > * @fds: an array of file handles to send > > * @nfds: number of file handles in @fds > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > * @errp: pointer to a NULL-initialized error object > > * > > * Write data to the IO channel, reading it from the > > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > * guaranteed. 
If the channel is non-blocking and no > > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > > * > > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > > + * function will return once each buffer was queued for > > + * sending. > > This would be a good place to document the requirement to keep the > buffer unchanged until the zerocopy sequence completes. That makes sense, even though that may be true for just some implementations, it makes sense to document it here. > > > Error **errp); > > > > /** > > - * qio_channel_writev_full_all: > > + * qio_channel_writev_full_all_flags: > > * @ioc: the channel object > > * @iov: the array of memory regions to write data from > > * @niov: the length of the @iov array > > * @fds: an array of file handles to send > > * @nfds: number of file handles in @fds > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > * @errp: pointer to a NULL-initialized error object > > * > > * > > @@ -846,13 +868,58 @@ int qio_channel_readv_full_all(QIOChannel *ioc, > > * to be written, yielding from the current coroutine > > * if required. > > * > > + * If QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed in flags, > > + * instead of waiting for all requested data to be written,
Re: [PATCH v4 3/3] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)
On Sat, Oct 09, 2021 at 04:56:13AM -0300, Leonardo Bras wrote: > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel > zerocopy interface. > > Change multifd_send_sync_main() so it can distinguish the last sync from > the setup and per-iteration ones, so a flush_zerocopy() can be called > at the last sync in order to make sure all RAM is sent before finishing > the migration. > > Also make it return -1 if flush_zerocopy() fails, in order to cancel > the migration process, and avoid resuming the guest in the target host > without receiving all current RAM. > > This will work fine on RAM migration because the RAM pages are not usually > freed, > and there is no problem on changing the pages content between async_send() and > the actual sending of the buffer, because this change will dirty the page and > cause it to be re-sent on a next iteration anyway. > > Given a lot of locked memory may be needed in order to use multid migration > with zerocopy enabled, make it optional by creating a new parameter > multifd-zerocopy on qapi, so low-privileged users can still perform multifd > migrations. > > Signed-off-by: Leonardo Bras > --- > qapi/migration.json | 18 ++ > migration/migration.h | 1 + > migration/multifd.h | 2 +- > migration/migration.c | 20 > migration/multifd.c | 33 - > migration/ram.c | 20 +--- > monitor/hmp-cmds.c| 4 > 7 files changed, 85 insertions(+), 13 deletions(-) > > diff --git a/qapi/migration.json b/qapi/migration.json > index 88f07baedd..c4890cbb54 100644 > --- a/qapi/migration.json > +++ b/qapi/migration.json > @@ -724,6 +724,11 @@ > # will consume more CPU. > # Defaults to 1. (Since 5.0) > # > +# @multifd-zerocopy: Controls behavior on sending memory pages on multifd > migration. > +#When true, enables a zerocopy mechanism for sending > memory > +#pages, if host does support it. s/does support/supports/ (several instances this patch) > +#Defaults to false. 
(Since 6.2) > +# > # @block-bitmap-mapping: Maps block nodes and bitmaps on them to > #aliases for the purpose of dirty bitmap migration. > Such > #aliases may for example be the corresponding names > on the > @@ -758,6 +763,7 @@ > 'xbzrle-cache-size', 'max-postcopy-bandwidth', > 'max-cpu-throttle', 'multifd-compression', > 'multifd-zlib-level' ,'multifd-zstd-level', > +'multifd-zerocopy', > 'block-bitmap-mapping' ] } Should this feature be guarded with 'if':'CONFIG_LINUX', since that's the only platform where you even compile the code (even then, it can still fail if the kernel doesn't support it). -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
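[Editor's note] If the CONFIG_LINUX guard Eric suggests were adopted, the enum member would become a conditional entry, roughly like the fragment below. This is a sketch, not the merged schema; in particular the exact 'if' spelling (bare 'CONFIG_LINUX' vs. 'defined(CONFIG_LINUX)') depends on the QAPI generator version in the tree at the time.

```
{ 'enum': 'MigrationParameter',
  'data': [ ...,
            'multifd-zlib-level', 'multifd-zstd-level',
            { 'name': 'multifd-zerocopy', 'if': 'CONFIG_LINUX' },
            'block-bitmap-mapping' ] }
```

The trade-off discussed in the thread remains: a compile-time guard turns "unsupported host" into a schema-level error at set-parameter time, while an unconditional member defers the failure to migration start.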
Re: [PATCH v4 2/3] QIOChannelSocket: Implement io_writev_zerocopy & io_flush_zerocopy for CONFIG_LINUX
On Sat, Oct 09, 2021 at 04:56:12AM -0300, Leonardo Bras wrote: > For CONFIG_LINUX, implement the new optional callbacks io_write_zerocopy and > io_flush_zerocopy on QIOChannelSocket, but enables it only when MSG_ZEROCOPY > feature is available in the host kernel, which is checked on > qio_channel_socket_connect_sync() > > qio_channel_socket_writev() contents were moved to a helper function > qio_channel_socket_writev_flags() which accepts an extra argument for flags. > (This argument is passed directly to sendmsg(). > > The above helper function is used to implement qio_channel_socket_writev(), > with flags = 0, keeping it's behavior unchanged, and its (remember, "it's" is shorthand for "it is", which does not fit here) > qio_channel_socket_writev_zerocopy() with flags = MSG_ZEROCOPY. > > qio_channel_socket_flush_zerocopy() was implemented by counting how many times > sendmsg(...,MSG_ZEROCOPY) was sucessfully called, and then reading the > socket's error queue, in order to find how many of them finished sending. > Flush will loop until those counters are the same, or until some error ocurs. occurs > > A new function qio_channel_socket_poll() was also created in order to avoid > busy-looping recvmsg() in qio_channel_socket_flush_zerocopy() while waiting > for > updates in socket's error queue. > > Notes on using writev_zerocopy(): > 1: Buffer > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid > copying, > some caution is necessary to avoid overwriting any buffer before it's sent. > If something like this happen, a newer version of the buffer may be sent > instead. > - If this is a problem, it's recommended to call flush_zerocopy() before > freeing > or re-using the buffer. > > 2: Locked memory > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and > unlocked after it's sent. 
> - Depending on the size of each buffer, and how often it's sent, it may > require > a larger amount of locked memory than usually available to non-root user. > - If the required amount of locked memory is not available, writev_zerocopy > will return an error, which can abort an operation like migration, > - Because of this, when an user code wants to add zerocopy as a feature, it > requires a mechanism to disable it, so it can still be acessible to less > privileged users. > > Signed-off-by: Leonardo Bras > --- > include/io/channel-socket.h | 2 + > include/io/channel.h| 1 + > io/channel-socket.c | 180 ++-- > 3 files changed, 173 insertions(+), 10 deletions(-) > > +static int qio_channel_socket_flush_zerocopy(QIOChannel *ioc, > + Error **errp) > +{ > + > +/* No errors, count sucessfully finished sendmsg()*/ Space before */ > +sioc->zerocopy_sent += serr->ee_data - serr->ee_info + 1; > +} > +return 0; > +} > + > +#endif /* CONFIG_LINUX */ > + > static int > qio_channel_socket_set_blocking(QIOChannel *ioc, > bool enabled, > @@ -787,6 +943,10 @@ static void qio_channel_socket_class_init(ObjectClass > *klass, > ioc_klass->io_set_delay = qio_channel_socket_set_delay; > ioc_klass->io_create_watch = qio_channel_socket_create_watch; > ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler; > +#ifdef CONFIG_LINUX > +ioc_klass->io_writev_zerocopy = qio_channel_socket_writev_zerocopy; > +ioc_klass->io_flush_zerocopy = qio_channel_socket_flush_zerocopy; > +#endif > } I did a high-level look at the code, rather than an in-depth review of whether zero-copy was being used correctly. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
On Sat, Oct 09, 2021 at 04:56:11AM -0300, Leonardo Bras wrote: > Adds io_async_writev and io_async_flush as optional callback to > QIOChannelClass, Are these names accurate? > allowing the implementation of asynchronous writes by subclasses. > > How to use them: > - Write data using qio_channel_writev_zerocopu(), s/copu/copy/ > - Wait write completion with qio_channel_flush_zerocopy(). > > Notes: > As some zerocopy implementations work asynchronously, it's > recommended to keep the write buffer untouched until the return of > qio_channel_flush_zerocopy(), by the risk of sending an updated buffer s/by/to avoid/ > instead of the one at the write. > > As the new callbacks are optional, if a subclass does not implement them, > then: > - io_async_writev will return -1, > - io_async_flush will return 0 without changing anything. Are these names accurate? > > Also, some functions like qio_channel_writev_full_all() were adapted to > receive a flag parameter. That allows shared code between zerocopy and > non-zerocopy writev. 
> > Signed-off-by: Leonardo Bras > --- > include/io/channel.h | 103 +++ > io/channel.c | 74 +++ > 2 files changed, 141 insertions(+), 36 deletions(-) > > diff --git a/include/io/channel.h b/include/io/channel.h > index 88988979f8..e7d4e1521f 100644 > --- a/include/io/channel.h > +++ b/include/io/channel.h > @@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass, > > #define QIO_CHANNEL_ERR_BLOCK -2 > > +#define QIO_CHANNEL_WRITE_FLAG_ZEROCOPY 0x1 > + > typedef enum QIOChannelFeature QIOChannelFeature; > > enum QIOChannelFeature { > QIO_CHANNEL_FEATURE_FD_PASS, > QIO_CHANNEL_FEATURE_SHUTDOWN, > QIO_CHANNEL_FEATURE_LISTEN, > +QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY, > }; > > > @@ -136,6 +139,12 @@ struct QIOChannelClass { >IOHandler *io_read, >IOHandler *io_write, >void *opaque); > +ssize_t (*io_writev_zerocopy)(QIOChannel *ioc, > + const struct iovec *iov, > + size_t niov, > + Error **errp); > +int (*io_flush_zerocopy)(QIOChannel *ioc, > + Error **errp); Indentation is off by one. > }; > > /* General I/O handling functions */ > @@ -222,12 +231,13 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > /** > - * qio_channel_writev_full: > + * qio_channel_writev_full_flags: > * @ioc: the channel object > * @iov: the array of memory regions to write data from > * @niov: the length of the @iov array > * @fds: an array of file handles to send > * @nfds: number of file handles in @fds > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > * @errp: pointer to a NULL-initialized error object > * > * Write data to the IO channel, reading it from the > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > * guaranteed. If the channel is non-blocking and no > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > * > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > + * function will return once each buffer was queued for > + * sending. 
This would be a good place to document the requirement to keep the buffer unchanged until the zerocopy sequence completes. > Error **errp); > > /** > - * qio_channel_writev_full_all: > + * qio_channel_writev_full_all_flags: > * @ioc: the channel object > * @iov: the array of memory regions to write data from > * @niov: the length of the @iov array > * @fds: an array of file handles to send > * @nfds: number of file handles in @fds > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > * @errp: pointer to a NULL-initialized error object > * > * > @@ -846,13 +868,58 @@ int qio_channel_readv_full_all(QIOChannel *ioc, > * to be written, yielding from the current coroutine > * if required. > * > + * If QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed in flags, > + * instead of waiting for all requested data to be written, > + * this function will wait until it's all queued for writing. Another good place to document restrictions on buffer stability. > + * > * Returns: 0 if all bytes were written, or -1 on error > */ > > -int qio_channel_writev_full_all(QIOChannel *ioc, > -const struct iovec *iov, > -size_t niov, > -int *fds, size_t nfds, > -Error **errp); > +int qio_channel_writev_full_all_flags(QIOChannel *ioc, > + const struct iovec *iov, > + size_t niov, > + int *fds, size_t nfds, > + int flags, Error
Re: [PATCH 0/9] nios2: Enable cross compile and fix signals
Ping. On 10/1/21 8:33 AM, Richard Henderson wrote: Patches 2, 3, and 5 have appeared before. The patch for the kuser page has been updated to use the commpage infrastructure, which needed expanding just a bit to handle the page being at the beginning of the address space. Getting the toolchain built allowed the code to actually be tested, which showed up a few more problems in the testsuite. I have already pushed the debian-nios2-cross image to gitlab, much like we did for hexagon and its locally built toolchain. r~ Richard Henderson (9): tests/docker: Add debian-nios2-cross image linux-user/nios2: Properly emulate EXCP_TRAP linux-user/nios2: Fixes for signal frame setup linux-user/elfload: Rename ARM_COMMPAGE to HI_COMMPAGE linux-user/nios2: Map a real kuser page linux-user/nios2: Fix EA vs PC confusion linux-user/nios2: Fix sigmask in setup_rt_frame linux-user/nios2: Use set_sigmask in do_rt_sigreturn tests/tcg: Enable container_cross_cc for nios2 target/nios2/cpu.h| 2 +- linux-user/elfload.c | 66 +++-- linux-user/nios2/cpu_loop.c | 93 +- linux-user/nios2/signal.c | 56 +-- linux-user/signal.c | 2 - target/nios2/translate.c | 26 +++-- tests/docker/Makefile.include | 19 .../dockerfiles/debian-nios2-cross.docker | 34 +++ .../build-toolchain.sh| 97 +++ tests/tcg/configure.sh| 6 ++ 10 files changed, 302 insertions(+), 99 deletions(-) create mode 100644 tests/docker/dockerfiles/debian-nios2-cross.docker create mode 100755 tests/docker/dockerfiles/debian-nios2-cross.docker.d/build-toolchain.sh
Re: [PATCH RFC v2 5/5] block: Deprecate transaction type drive-backup
On Sat, Oct 09, 2021 at 02:09:44PM +0200, Markus Armbruster wrote: > Several moons ago, Vladimir posted > > Subject: [PATCH v2 3/3] qapi: deprecate drive-backup > Date: Wed, 5 May 2021 16:58:03 +0300 > Message-Id: <20210505135803.67896-4-vsement...@virtuozzo.com> > https://lists.gnu.org/archive/html/qemu-devel/2021-05/msg01394.html > > with this > > TODO: We also need to deprecate drive-backup transaction action.. > But union members in QAPI doesn't support 'deprecated' feature. I tried > to dig a bit, but failed :/ Markus, could you please help with it? At > least by advice? > > This is one way to resolve it. Sorry it took so long. > > John explored another way, namely adding feature flags to union > branches. Could also be useful, say to add different features to > branches in multiple unions sharing the same tag enum. > > Signed-off-by: Markus Armbruster > --- > qapi/transaction.json | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/qapi/transaction.json b/qapi/transaction.json > index d175b5f863..0564a893b3 100644 > --- a/qapi/transaction.json > +++ b/qapi/transaction.json > @@ -54,6 +54,9 @@ > # @blockdev-snapshot-sync: since 1.1 > # @drive-backup: Since 1.6 > # > +# Features: > +# @deprecated: Member @drive-backup is deprecated. Use FIXME instead. Obviously, we'd need to flesh this out ("'blockdev-backup' with proper node names"? something else?) before dropping RFC on this patch. And we'd want to edit docs/about/deprecated.rst to match. > +# > # Since: 1.1 > ## > { 'enum': 'TransactionActionKind', > @@ -62,7 +65,7 @@ > 'block-dirty-bitmap-disable', 'block-dirty-bitmap-merge', > 'blockdev-backup', 'blockdev-snapshot', > 'blockdev-snapshot-internal-sync', 'blockdev-snapshot-sync', > -'drive-backup' ] } > +{ 'name': 'drive-backup', 'features': [ 'deprecated' ] } ] } > > ## > # @AbortWrapper: > -- > 2.31.1 > But the idea is reasonable, and I'm not sure if we're any closer to John's idea of feature flags on union branches. 
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v2 4/5] qapi: Implement deprecated-input={reject,crash} for enum values
On Sat, Oct 09, 2021 at 02:09:43PM +0200, Markus Armbruster wrote: > This copies the code implementing the policy from qapi/qmp-dispatch.c > to qapi/qobject-input-visitor.c. Tolerable, but if we acquire more > copes, we should look into factoring them out. copies > > Signed-off-by: Markus Armbruster > --- > docs/devel/qapi-code-gen.rst | 6 -- > qapi/compat.json | 3 ++- > include/qapi/util.h | 6 +- > qapi/qapi-visit-core.c | 18 +++--- > scripts/qapi/types.py| 17 - > 5 files changed, 42 insertions(+), 8 deletions(-) > > diff --git a/docs/devel/qapi-code-gen.rst b/docs/devel/qapi-code-gen.rst > index 00334e9fb8..006a6f4a9a 100644 > --- a/docs/devel/qapi-code-gen.rst > +++ b/docs/devel/qapi-code-gen.rst > @@ -708,8 +708,10 @@ QEMU shows a certain behaviour. > Special features > > > -Feature "deprecated" marks a command, event, or struct member as > -deprecated. It is not supported elsewhere so far. > +Feature "deprecated" marks a command, event, struct or enum member as Do we want the comma before the conjunction? (I've seen style guides that recommend it, and style guides that discourage it; while I tend to write by the former style, I usually don't care about the latter. Rather, switching styles mid-patch caught my attention). > +deprecated. It is not supported elsewhere so far. Interfaces so > +marked may be withdrawn in future releases in accordance with QEMU's > +deprecation policy. > > > +++ b/qapi/qapi-visit-core.c > @@ -393,7 +393,7 @@ static bool input_type_enum(Visitor *v, const char *name, > int *obj, > const QEnumLookup *lookup, Error **errp) > { > int64_t value; > -char *enum_str; > +g_autofree char *enum_str = NULL; Nice change while touching the code. Is it worth mentioning in the commit message? 
> > if (!visit_type_str(v, name, _str, errp)) { > return false; > @@ -402,11 +402,23 @@ static bool input_type_enum(Visitor *v, const char > *name, int *obj, > value = qapi_enum_parse(lookup, enum_str, -1, NULL); > if (value < 0) { > error_setg(errp, QERR_INVALID_PARAMETER, enum_str); > -g_free(enum_str); > return false; > } > > -g_free(enum_str); > +if (lookup->flags && (lookup->flags[value] & QAPI_ENUM_DEPRECATED)) { > +switch (v->compat_policy.deprecated_input) { > +case COMPAT_POLICY_INPUT_ACCEPT: > +break; > +case COMPAT_POLICY_INPUT_REJECT: > +error_setg(errp, "Deprecated value '%s' disabled by policy", > + enum_str); > +return false; > +case COMPAT_POLICY_INPUT_CRASH: > +default: > +abort(); > +} > +} > + > *obj = value; > return true; > } Grammar fixes are minor, so: Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
[PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
We use INDEX_op_insn_start to mark the start of instruction boundaries. If we don't do it in the .insn_start hook, things get confused, especially now that plugins want to use that marking to identify the start of instructions and will bomb out if they see instrumented ops before the first instruction boundary. Signed-off-by: Richard Henderson --- target/s390x/tcg/translate.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c index f284870cd2..a2d6fa5cca 100644 --- a/target/s390x/tcg/translate.c +++ b/target/s390x/tcg/translate.c @@ -138,6 +138,7 @@ struct DisasFields { struct DisasContext { DisasContextBase base; const DisasInsn *insn; +TCGOp *insn_start; DisasFields fields; uint64_t ex_value; /* @@ -6380,8 +6381,8 @@ static DisasJumpType translate_one(CPUS390XState *env, DisasContext *s) /* Search for the insn in the table. */ insn = extract_insn(env, s); -/* Emit insn_start now that we know the ILEN. */ -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); +/* Update insn_start now that we know the ILEN. */ +tcg_set_insn_start_param(s->insn_start, 2, s->ilen); /* Not found means unimplemented/illegal opcode. */ if (insn == NULL) { @@ -6552,6 +6553,11 @@ static void s390x_tr_tb_start(DisasContextBase *db, CPUState *cs) static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { +DisasContext *dc = container_of(dcbase, DisasContext, base); + +/* Delay the set of ilen until we've read the insn. */ +tcg_gen_insn_start(dc->base.pc_next, dc->cc_op, 0); +dc->insn_start = tcg_last_op(); } static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs) -- 2.25.1
Re: [PATCH v2 1/6] net/vmnet: dependencies setup, initial preparations
On Tue, Aug 31, 2021 at 10:27:15PM +0300, Vladislav Yaroshchuk wrote: > Add 'vmnet' customizable option and 'vmnet.framework' probe into > configure; > > Create separate netdev per each vmnet operating mode > because they use quite different settings. Especially since > macOS 11.0 (vmnet.framework API gets lots of updates) > Create source files for network client driver, update meson.build; > > Three new netdevs are added: > - vmnet-host > - vmnet-shared > - vmnet-bridged > > Signed-off-by: Vladislav Yaroshchuk > --- While I'm not the best for reviewing the overall series, I can at least comment on the proposed QAPI changes: > +++ b/qapi/net.json > @@ -452,6 +452,89 @@ > '*vhostdev': 'str', > '*queues': 'int' } } > > +## > +# @NetdevVmnetHostOptions: > +# > +# vmnet (host mode) network backend. > +# > +# Allows the vmnet interface to communicate with > +# other vmnet interfaces that are in host mode and also with the native host. > +# > +# @dhcpstart: The starting IPv4 address to use for the interface. Must be in > the > +# private IP range (RFC 1918). Must be specified along > +# with @dhcpend and @subnetmask. > +# This address is used as the gateway address. The subsequent > address > +# up to and including dhcpend are placed in the DHCP pool. Long lines. Most of the qapi docs wrap around 70 or so columns rather than pushing right up to the 80 column limit. > +# > +# @dhcpend: The DHCP IPv4 range end address to use for the interface. Must > be in > +# the private IP range (RFC 1918). Must be specified along > +# with @dhcpstart and @subnetmask. > +# > +# @subnetmask: The IPv4 subnet mask to use on the interface. Must be > specified > +# along with @dhcpstart and @subnetmask. 
> +# > +# > +# Since: 6.2, Drop the trailing comma > +## > +{ 'struct': 'NetdevVmnetHostOptions', > + 'data': { > +'*dhcpstart': 'str', > +'*dhcpend': 'str', > +'*subnetmask': 'str' > + }, > + 'if': 'CONFIG_VMNET' } > + > +## > +# @NetdevVmnetSharedOptions: > +# > +# vmnet (shared mode) network backend. > +# > +# Allows traffic originating from the vmnet interface to reach the > +# Internet through a network address translator (NAT). The vmnet interface > +# can also communicate with the native host. By default, the vmnet interface > +# is able to communicate with other shared mode interfaces. If a subnet range > +# is specified, the vmnet interface can communicate with other shared mode > +# interfaces on the same subnet. > +# > +# @dhcpstart: The starting IPv4 address to use for the interface. Must be in > the > +# private IP range (RFC 1918). Must be specified along > +# with @dhcpend and @subnetmask. > +# This address is used as the gateway address. The subsequent > address > +# up to and including dhcpend are placed in the DHCP pool. extra space > +# > +# @dhcpend: The DHCP IPv4 range end address to use for the interface. Must > be in > +# the private IP range (RFC 1918). Must be specified along > +# with @dhcpstart and @subnetmask. > +# > +# @subnetmask: The IPv4 subnet mask to use on the interface. Must be > specified > +# along with @dhcpstart and @subnetmask. > +# > +# > +# Since: 6.2, Another odd comma > +## > +{ 'struct': 'NetdevVmnetSharedOptions', > + 'data': { > +'*dhcpstart':'str', > +'*dhcpend': 'str', > +'*subnetmask': 'str' > + }, > + 'if': 'CONFIG_VMNET' } The NetdevVmnetHostOptions and NetdevVmnetSharedOptions types are identical; why do you need two types? > + > +## > +# @NetdevVmnetBridgedOptions: > +# > +# vmnet (bridged mode) network backend. > +# > +# Bridges the vmnet interface with a physical network interface. > +# > +# @ifname: The name of the physical interface to be bridged. 
> +# > +# Since: 6.2 > +## > +{ 'struct': 'NetdevVmnetBridgedOptions', > + 'data': { 'ifname': 'str' }, > + 'if': 'CONFIG_VMNET' } > + > ## > # @NetClientDriver: > # > @@ -460,10 +543,16 @@ > # Since: 2.7 > # > #@vhost-vdpa since 5.1 > +#@vmnet-host since 6.2 > +#@vmnet-shared since 6.2 > +#@vmnet-bridged since 6.2 > ## > { 'enum': 'NetClientDriver', >'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'vde', > -'bridge', 'hubport', 'netmap', 'vhost-user', 'vhost-vdpa' ] } > +'bridge', 'hubport', 'netmap', 'vhost-user', 'vhost-vdpa', > +{ 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' }, > +{ 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' }, > +{ 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] } > > ## > # @Netdev: > @@ -477,6 +566,9 @@ > # Since: 1.2 > # > #'l2tpv3' - since 2.1 > +#'vmnet-host' - since 6.2 > +#'vmnet-shared' - since 6.2 > +#'vmnet-bridged' - since 6.2 > ## > { 'union': 'Netdev', >'base': { 'id': 'str', 'type':
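Since the reviewer notes that NetdevVmnetHostOptions and NetdevVmnetSharedOptions are identical, one possible follow-up would be a single shared struct referenced by both union branches. A purely illustrative QAPI fragment (the struct name is made up, not part of the series):

```
{ 'struct': 'NetdevVmnetCommonOptions',
  'data': {
    '*dhcpstart': 'str',
    '*dhcpend': 'str',
    '*subnetmask': 'str' },
  'if': 'CONFIG_VMNET' }
```

Both the 'vmnet-host' and 'vmnet-shared' branches of the Netdev union could then reference this one type. Whether the two modes are expected to diverge later (and therefore deserve separate types) is a design question for the series author.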
Re: [PATCH v2 1/5] qapi: Enable enum member introspection to show more than name
On Sat, Oct 09, 2021 at 02:09:40PM +0200, Markus Armbruster wrote: > The next commit will add feature flags to enum members. There's a > problem, though: query-qmp-schema shows an enum type's members as an > array of member names (SchemaInfoEnum member @values). If it showed > an array of objects with a name member, we could simply add more > members to these objects. Since it's just strings, we can't. > > I can see three ways to correct this design mistake: > > 1. Do it the way we should have done it, plus compatibility goo. ... > 2. Like 1, but omit "boring" elements of @member, and empty @member. > 3. Versioned query-qmp-schema. > This commit implements 1. Libvirt developers prefer it. > > Signed-off-by: Markus Armbruster > --- > qapi/introspect.json | 21 +++-- > scripts/qapi/introspect.py | 18 ++ > 2 files changed, 33 insertions(+), 6 deletions(-) Reviewed-by: Eric Blake > > diff --git a/qapi/introspect.json b/qapi/introspect.json > index 39bd303778..f806bd7281 100644 > --- a/qapi/introspect.json > +++ b/qapi/introspect.json > @@ -142,14 +142,31 @@ > # > # Additional SchemaInfo members for meta-type 'enum'. > # > -# @values: the enumeration type's values, in no particular order. > +# @members: the enum type's members, in no particular order > +# (since 6.2). > +# > +# @values: the enumeration type's member names, in no particular order. > +# Redundant with @members. Just for backward compatibility. > # > # Values of this type are JSON string on the wire. > # > # Since: 2.5 > ## > { 'struct': 'SchemaInfoEnum', > - 'data': { 'values': ['str'] } } > + 'data': { 'members': [ 'SchemaInfoEnumMember' ], > +'values': ['str'] } } Not deprecated at this time, I agree with your choice to make the actual deprecation of 'values' to be in a later patch (if at all). > + > +## > +# @SchemaInfoEnumMember: > +# > +# An object member. > +# > +# @name: the member's name, as defined in the QAPI schema. 
> +# > +# Since: 6.2 > +## > +{ 'struct': 'SchemaInfoEnumMember', > + 'data': { 'name': 'str' } } Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
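For context, with this change a query-qmp-schema reply for an enum type would carry both the new @members array and the redundant @values array, roughly like the following (hand-written illustration, not actual QEMU output):

```
{ "meta-type": "enum",
  "name": "ExampleEnum",
  "members": [ { "name": "value-a" },
               { "name": "value-b" } ],
  "values": [ "value-a", "value-b" ] }
```

Later additions (such as the feature flags the next commit introduces) can then extend the per-member objects without breaking clients that still read @values.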
Re: [RFC PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
On 10/11/21 8:57 AM, Alex Bennée wrote: We use INDEX_op_insn_start to make the start of instruction boundaries. If we don't do it in the .insn_start hook things get confused especially now plugins want to use that marking to identify the start of instructions and will bomb out if it sees instrumented ops before the first instruction boundary. Signed-off-by: Alex Bennée --- target/s390x/tcg/translate.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c index f284870cd2..fe145ff2eb 100644 --- a/target/s390x/tcg/translate.c +++ b/target/s390x/tcg/translate.c @@ -6380,9 +6380,6 @@ static DisasJumpType translate_one(CPUS390XState *env, DisasContext *s) /* Search for the insn in the table. */ insn = extract_insn(env, s); -/* Emit insn_start now that we know the ILEN. */ -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); - /* Not found means unimplemented/illegal opcode. */ if (insn == NULL) { qemu_log_mask(LOG_UNIMP, "unimplemented opcode 0x%02x%02x\n", @@ -6550,8 +6547,30 @@ static void s390x_tr_tb_start(DisasContextBase *db, CPUState *cs) { } +/* + * We just enough partial instruction decoding here to calculate the + * length of the instruction so we can drop the INDEX_op_insn_start + * before anything else is emitted in the TCGOp stream. + * + * See extract_insn for the full decode. + */ static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { +CPUS390XState *env = cs->env_ptr; +DisasContext *s = container_of(dcbase, DisasContext, base); +uint64_t insn, pc = s->base.pc_next; +int op, ilen; + +if (unlikely(s->ex_value)) { +ilen = s->ex_value & 0xf; +} else { +insn = ld_code2(env, s, pc); /* FIXME: don't reload same pc twice */ I think we might as well delay the set of ilen until after the normal load, rather than introduce a fixme. r~ +op = (insn >> 8) & 0xff; +ilen = get_ilen(op); +} + +/* Emit insn_start now that we know the ILEN. 
*/ +tcg_gen_insn_start(s->base.pc_next, s->cc_op, ilen); } static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs)
Re: [PULL 0/2] Block patches
On 10/11/21 5:40 AM, Stefan Hajnoczi wrote: The following changes since commit ca61fa4b803e5d0abaf6f1ceb690f23bb78a4def: Merge remote-tracking branch 'remotes/quic/tags/pull-hex-20211006' into staging (2021-10-06 12:11:14 -0700) are available in the Git repository at: https://gitlab.com/stefanha/qemu.git tags/block-pull-request for you to fetch changes up to 1cc7eada97914f090125e588497986f6f7900514: iothread: use IOThreadParamInfo in iothread_[set|get]_param() (2021-10-07 15:29:50 +0100) Pull request Stefano Garzarella (2): iothread: rename PollParamInfo to IOThreadParamInfo iothread: use IOThreadParamInfo in iothread_[set|get]_param() iothread.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) Applied, thanks. r~
Re: [RFC PATCH v1 2/2] s390x/kvm: Pass SIGP Stop flags
On Mon, 2021-10-11 at 11:21 +0200, David Hildenbrand wrote: > On 11.10.21 10:40, Christian Borntraeger wrote: > > > > Am 11.10.21 um 09:09 schrieb David Hildenbrand: > > > On 08.10.21 22:38, Eric Farman wrote: > > > > When building a Stop IRQ to pass to KVM, we should incorporate > > > > the flags if handling the SIGP Stop and Store Status order. > > > > With that, KVM can reject other orders that are submitted for > > > > the same CPU while the operation is fully processed. > > > > > > > > Signed-off-by: Eric Farman > > > > Acked-by: Janosch Frank > > > > --- > > > >target/s390x/kvm/kvm.c | 4 > > > >1 file changed, 4 insertions(+) > > > > > > > > diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c > > > > index 5b1fdb55c4..701b9ddc88 100644 > > > > --- a/target/s390x/kvm/kvm.c > > > > +++ b/target/s390x/kvm/kvm.c > > > > @@ -2555,6 +2555,10 @@ void kvm_s390_stop_interrupt(S390CPU > > > > *cpu) > > > >.type = KVM_S390_SIGP_STOP, > > > >}; > > > > +if (cpu->env.sigp_order == SIGP_STOP_STORE_STATUS) { > > > > +irq.u.stop.flags = KVM_S390_STOP_FLAG_STORE_STATUS; > > > > +} > > > > + > > > > > > KVM_S390_STOP_FLAG_STORE_STATUS tells KVM to perform the store > > > status as well ... is that really what we want? > > At least it should not hurt I guess. QEMU then does it again? > > The thing is, that before we officially completed the action in user > space (and let other SIGP actions actually succeed in user space on > the > CPU), the target CPU will be reported as !busy in the kernel > already. > And before we even inject the stop interrupt, the CPU will be > detected > as !busy in the kernel. I guess it will fix some cases where we poll > via > SENSE if the stop and store happened, because the store *did* happen > in > the kernel and we'll simply store again in user space. > > However, I wonder if we want to handle it more generically: Properly > flag a CPU as busy for SIGP when we start processing the order until > we > completed processing the order. 
That would allow to handle other > SIGP > operations in user space cleanly, without any chance for races with > SENSE code running in the kernel. > I think a generic solution would be ideal, but I'm wrestling with the race with the kernel's SENSE code. Today, handle_sigp_single_dst already checks to see if a CPU is currently processing an order and returns a CC2 when it does, but of course the kernel's SENSE code doesn't know that. We could flag the CPU as busy in the kernel when sending a SIGP to userspace, so that the SENSE code indicates BUSY, but then how do we know when userspace is finished and the CPU is no longer BUSY? Eric
[PATCH v1 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
Let's factor out prefaulting/populating to make further changes easier to review and add a comment what we are actually expecting to happen. While at it, use the actual page size of the ramblock, which defaults to qemu_real_host_page_size for anonymous memory. Further, rename ram_block_populate_pages() to ram_block_populate_read() as well, to make it clearer what we are doing. In the future, we might want to use MADV_POPULATE_READ to speed up population. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/ram.c | 35 ++- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index b225ec7507..c212081f85 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1639,26 +1639,35 @@ out: return ret; } +static inline void populate_read_range(RAMBlock *block, ram_addr_t offset, + ram_addr_t size) +{ +/* + * We read one byte of each page; this will preallocate page tables if + * required and populate the shared zeropage on MAP_PRIVATE anonymous memory + * where no page was populated yet. This might require adaption when + * supporting other mappings, like shmem. + */ +for (; offset < size; offset += block->page_size) { +char tmp = *((char *)block->host + offset); + +/* Don't optimize the read out */ +asm volatile("" : "+r" (tmp)); +} +} + /* - * ram_block_populate_pages: populate memory in the RAM block by reading - * an integer from the beginning of each page. + * ram_block_populate_read: preallocate page tables and populate pages in the + * RAM block by reading a byte of each page. * * Since it's solely used for userfault_fd WP feature, here we just * hardcode page size to qemu_real_host_page_size. 
* * @block: RAM block to populate */ -static void ram_block_populate_pages(RAMBlock *block) +static void ram_block_populate_read(RAMBlock *block) { -char *ptr = (char *) block->host; - -for (ram_addr_t offset = 0; offset < block->used_length; -offset += qemu_real_host_page_size) { -char tmp = *(ptr + offset); - -/* Don't optimize the read out */ -asm volatile("" : "+r" (tmp)); -} +populate_read_range(block, 0, block->used_length); } /* @@ -1684,7 +1693,7 @@ void ram_write_tracking_prepare(void) * UFFDIO_WRITEPROTECT_MODE_WP mode setting would silently skip * pages with pte_none() entries in page table. */ -ram_block_populate_pages(block); +ram_block_populate_read(block); } } -- 2.31.1
Re: [PATCH 2/2] target/i386/sev: Use local variable for kvm_sev_launch_measure
* Dov Murik (dovmu...@linux.ibm.com) wrote: > The struct kvm_sev_launch_measure has a constant and small size, and > therefore we can use a regular local variable for it instead of > allocating and freeing heap memory for it. > > Signed-off-by: Dov Murik Reviewed-by: Dr. David Alan Gilbert > --- > target/i386/sev.c | 16 +++- > 1 file changed, 7 insertions(+), 9 deletions(-) > > diff --git a/target/i386/sev.c b/target/i386/sev.c > index 0062566c71..eede07f11d 100644 > --- a/target/i386/sev.c > +++ b/target/i386/sev.c > @@ -729,7 +729,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) > SevGuestState *sev = sev_guest; > int ret, error; > g_autofree guchar *data = NULL; > -g_autofree struct kvm_sev_launch_measure *measurement = NULL; > +struct kvm_sev_launch_measure measurement = {}; > > if (!sev_check_state(sev, SEV_STATE_LAUNCH_UPDATE)) { > return; > @@ -743,23 +743,21 @@ sev_launch_get_measure(Notifier *notifier, void *unused) > } > } > > -measurement = g_new0(struct kvm_sev_launch_measure, 1); > - > /* query the measurement blob length */ > ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, > -measurement, ); > -if (!measurement->len) { > +, ); > +if (!measurement.len) { > error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", > __func__, ret, error, fw_error_to_str(errno)); > return; > } > > -data = g_new0(guchar, measurement->len); > -measurement->uaddr = (unsigned long)data; > +data = g_new0(guchar, measurement.len); > +measurement.uaddr = (unsigned long)data; > > /* get the measurement blob */ > ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, > -measurement, ); > +, ); > if (ret) { > error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", > __func__, ret, error, fw_error_to_str(errno)); > @@ -769,7 +767,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) > sev_set_guest_state(sev, SEV_STATE_LAUNCH_SECRET); > > /* encode the measurement value and emit the event */ > -sev->measurement = g_base64_encode(data, measurement->len); 
> +sev->measurement = g_base64_encode(data, measurement.len); > trace_kvm_sev_launch_measurement(sev->measurement); > } > > -- > 2.25.1 > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [RFC PATCH v1 2/2] s390x/kvm: Pass SIGP Stop flags
On 11.10.21 19:58, Eric Farman wrote: On Mon, 2021-10-11 at 11:21 +0200, David Hildenbrand wrote: On 11.10.21 10:40, Christian Borntraeger wrote: Am 11.10.21 um 09:09 schrieb David Hildenbrand: On 08.10.21 22:38, Eric Farman wrote: When building a Stop IRQ to pass to KVM, we should incorporate the flags if handling the SIGP Stop and Store Status order. With that, KVM can reject other orders that are submitted for the same CPU while the operation is fully processed. Signed-off-by: Eric Farman Acked-by: Janosch Frank --- target/s390x/kvm/kvm.c | 4 1 file changed, 4 insertions(+) diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c index 5b1fdb55c4..701b9ddc88 100644 --- a/target/s390x/kvm/kvm.c +++ b/target/s390x/kvm/kvm.c @@ -2555,6 +2555,10 @@ void kvm_s390_stop_interrupt(S390CPU *cpu) .type = KVM_S390_SIGP_STOP, }; +if (cpu->env.sigp_order == SIGP_STOP_STORE_STATUS) { +irq.u.stop.flags = KVM_S390_STOP_FLAG_STORE_STATUS; +} + KVM_S390_STOP_FLAG_STORE_STATUS tells KVM to perform the store status as well ... is that really what we want? At least it should not hurt I guess. QEMU then does it again? The thing is, that before we officially completed the action in user space (and let other SIGP actions actually succeed in user space on the CPU), the target CPU will be reported as !busy in the kernel already. And before we even inject the stop interrupt, the CPU will be detected as !busy in the kernel. I guess it will fix some cases where we poll via SENSE if the stop and store happened, because the store *did* happen in the kernel and we'll simply store again in user space. However, I wonder if we want to handle it more generically: Properly flag a CPU as busy for SIGP when we start processing the order until we completed processing the order. That would allow to handle other SIGP operations in user space cleanly, without any chance for races with SENSE code running in the kernel. 
I think a generic solution would be ideal, but I'm wrestling with the race with the kernel's SENSE code. Today, handle_sigp_single_dst already checks to see if a CPU is currently processing an order and returns a CC2 when it does, but of course the kernel's SENSE code doesn't know that. We could flag the CPU as busy in the kernel when sending a SIGP to userspace, so that the SENSE code indicates BUSY, but then how do we know when userspace is finished and the CPU is no longer BUSY? I'd just add a new IOCTL for marking a CPU busy/!busy for SIGP from user space. You can then either let user space perform both actions (set+unset), or let the kernel automatically set "busy" and user space only clear "busy". You can define a new capability to enable the "automatically set busy when going to user space on sigp" -- might require some thoughts on some corner cases. Maybe there might be other scenarios in the future where we might want to set a CPU busy for sigp without that CPU triggering a sigp action itself (e.g., externally triggered reset of a CPU? Simulation of check-stop? store status?), so at least having a way to set/reset a CPU busy for SIGP might be valuable. Once we go to user space to process a SIGP, we usually don't care too much about some additional overhead due to 1 or 2 ioctls IMHO. -- Thanks, David / dhildenb
[PATCH v1 5/9] virtio-mem: Drop precopy notifier
Migration code now properly handles RAMBlocks which are indirectly managed by a RamDiscardManager. No need for manual handling via the free page optimization interface, let's get rid of it. Acked-by: Michael S. Tsirkin Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 34 -- include/hw/virtio/virtio-mem.h | 3 --- 2 files changed, 37 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index 284096ec5f..d5a578142b 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -776,7 +776,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) host_memory_backend_set_mapped(vmem->memdev, true); vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem)); qemu_register_reset(virtio_mem_system_reset, vmem); -precopy_add_notifier(&vmem->precopy_notifier); /* * Set ourselves as RamDiscardManager before the plug handler maps the @@ -796,7 +795,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev) * found via an address space anymore. Unset ourselves. 
*/ memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL); -precopy_remove_notifier(&vmem->precopy_notifier); qemu_unregister_reset(virtio_mem_system_reset, vmem); vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem)); host_memory_backend_set_mapped(vmem->memdev, false); @@ -1089,43 +1087,11 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name, vmem->block_size = value; } -static int virtio_mem_precopy_exclude_range_cb(const VirtIOMEM *vmem, void *arg, - uint64_t offset, uint64_t size) -{ -void * const host = qemu_ram_get_host_addr(vmem->memdev->mr.ram_block); - -qemu_guest_free_page_hint(host + offset, size); -return 0; -} - -static void virtio_mem_precopy_exclude_unplugged(VirtIOMEM *vmem) -{ -virtio_mem_for_each_unplugged_range(vmem, NULL, -virtio_mem_precopy_exclude_range_cb); -} - -static int virtio_mem_precopy_notify(NotifierWithReturn *n, void *data) -{ -VirtIOMEM *vmem = container_of(n, VirtIOMEM, precopy_notifier); -PrecopyNotifyData *pnd = data; - -switch (pnd->reason) { -case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC: -virtio_mem_precopy_exclude_unplugged(vmem); -break; -default: -break; -} - -return 0; -} - static void virtio_mem_instance_init(Object *obj) { VirtIOMEM *vmem = VIRTIO_MEM(obj); notifier_list_init(&vmem->size_change_notifiers); -vmem->precopy_notifier.notify = virtio_mem_precopy_notify; QLIST_INIT(&vmem->rdl_list); object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size, diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h index 9a6e348fa2..a5dd6a493b 100644 --- a/include/hw/virtio/virtio-mem.h +++ b/include/hw/virtio/virtio-mem.h @@ -65,9 +65,6 @@ struct VirtIOMEM { /* notifiers to notify when "size" changes */ NotifierList size_change_notifiers; -/* don't migrate unplugged memory */ -NotifierWithReturn precopy_notifier; - /* listeners to notify on plug/unplug activity. */ QLIST_HEAD(, RamDiscardListener) rdl_list; }; -- 2.31.1
[PATCH v1 3/9] migration/ram: Don't pass RAMState to migration_clear_memory_region_dirty_bitmap_*()
The parameter is unused, let's drop it. Reviewed-by: Peter Xu Reviewed-by: Philippe Mathieu-Daudé Reviewed-by: Juan Quintela Signed-off-by: David Hildenbrand --- migration/ram.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 7a43bfd7af..bb908822d5 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -789,8 +789,7 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb, return find_next_bit(bitmap, size, start); } -static void migration_clear_memory_region_dirty_bitmap(RAMState *rs, - RAMBlock *rb, +static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb, unsigned long page) { uint8_t shift; @@ -818,8 +817,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMState *rs, } static void -migration_clear_memory_region_dirty_bitmap_range(RAMState *rs, - RAMBlock *rb, +migration_clear_memory_region_dirty_bitmap_range(RAMBlock *rb, unsigned long start, unsigned long npages) { @@ -832,7 +830,7 @@ migration_clear_memory_region_dirty_bitmap_range(RAMState *rs, * exclusive. */ for (i = chunk_start; i < chunk_end; i += chunk_pages) { -migration_clear_memory_region_dirty_bitmap(rs, rb, i); +migration_clear_memory_region_dirty_bitmap(rb, i); } } @@ -850,7 +848,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs, * the page in the chunk we clear the remote dirty bitmap for all. * Clearing it earlier won't be a problem, but too late will. */ -migration_clear_memory_region_dirty_bitmap(rs, rb, page); +migration_clear_memory_region_dirty_bitmap(rb, page); ret = test_and_clear_bit(page, rb->bmap); if (ret) { @@ -2777,8 +2775,7 @@ void qemu_guest_free_page_hint(void *addr, size_t len) * are initially set. Otherwise those skipped pages will be sent in * the next round after syncing from the memory region bitmap. 
*/ -migration_clear_memory_region_dirty_bitmap_range(ram_state, block, - start, npages); +migration_clear_memory_region_dirty_bitmap_range(block, start, npages); ram_state->migration_dirty_pages -= bitmap_count_one_with_offset(block->bmap, start, npages); bitmap_clear(block->bmap, start, npages); -- 2.31.1
[PATCH v1 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots
We already don't ever migrate memory that corresponds to discarded ranges as managed by a RamDiscardManager responsible for the mapped memory region of the RAMBlock. virtio-mem uses this mechanism to logically unplug parts of a RAMBlock. Right now, we still populate zeropages for the whole usable part of the RAMBlock, which is undesired because: 1. Even populating the shared zeropage will result in memory getting consumed for page tables. 2. Memory backends without a shared zeropage (like hugetlbfs and shmem) will populate an actual, fresh page, resulting in an unintended memory consumption. Discarded ("logically unplugged") parts have to remain discarded. As these pages are never part of the migration stream, there is no need to track modifications via userfaultfd WP reliably for these parts. Further, any writes to these ranges by the VM are invalid and the behavior is undefined. Note that Linux only supports userfaultfd WP on private anonymous memory for now, which usually results in the shared zeropage getting populated. The issue will become more relevant once userfaultfd WP supports shmem and hugetlb. Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/ram.c | 38 -- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index c212081f85..dbbb1e6712 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1656,6 +1656,17 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset, } } +static inline int populate_read_section(MemoryRegionSection *section, +void *opaque) +{ +const hwaddr size = int128_get64(section->size); +hwaddr offset = section->offset_within_region; +RAMBlock *block = section->mr->ram_block; + +populate_read_range(block, offset, size); +return 0; +} + /* * ram_block_populate_read: preallocate page tables and populate pages in the * RAM block by reading a byte of each page. 
@@ -1665,9 +1676,32 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset, * * @block: RAM block to populate */ -static void ram_block_populate_read(RAMBlock *block) +static void ram_block_populate_read(RAMBlock *rb) { -populate_read_range(block, 0, block->used_length); +/* + * Skip populating all pages that fall into a discarded range as managed by + * a RamDiscardManager responsible for the mapped memory region of the + * RAMBlock. Such discarded ("logically unplugged") parts of a RAMBlock + * must not get populated automatically. We don't have to track + * modifications via userfaultfd WP reliably, because these pages will + * not be part of the migration stream either way -- see + * ramblock_dirty_bitmap_exclude_discarded_pages(). + * + * Note: The result is only stable while migrating (precopy/postcopy). + */ +if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) { +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr); +MemoryRegionSection section = { +.mr = rb->mr, +.offset_within_region = 0, +.size = rb->mr->size, +}; + +ram_discard_manager_replay_populated(rdm, &section, + populate_read_section, NULL); +} else { +populate_read_range(rb, 0, rb->used_length); +} } /* -- 2.31.1
[PATCH v1 7/9] migration: Simplify alignment and alignment checks
Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to read. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/migration.c| 6 +++--- migration/postcopy-ram.c | 9 - migration/ram.c | 2 +- 3 files changed, 8 insertions(+), 9 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 6ac807ef3d..aa89ebac0c 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis, int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb, ram_addr_t start, uint64_t haddr) { -void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb))); +void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb)); bool received = false; WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) { @@ -2619,8 +2619,8 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname, * Since we currently insist on matching page sizes, just sanity check * we're being asked for whole host pages. 
*/ -if (start & (our_host_ps - 1) || - (len & (our_host_ps - 1))) { +if (!QEMU_IS_ALIGNED(start, our_host_ps) || +!QEMU_IS_ALIGNED(len, our_host_ps)) { error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT " len: %zd", __func__, start, len); mark_source_rp_bad(ms); diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index 3609ce7e52..e721f69d0f 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -402,7 +402,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis) strerror(errno)); goto out; } -g_assert(((size_t)testarea & (pagesize - 1)) == 0); +g_assert(QEMU_PTR_IS_ALIGNED(testarea, pagesize)); reg_struct.range.start = (uintptr_t)testarea; reg_struct.range.len = pagesize; @@ -660,7 +660,7 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, struct uffdio_range range; int ret; trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb)); -range.start = client_addr & ~(pagesize - 1); +range.start = ROUND_DOWN(client_addr, pagesize); range.len = pagesize; ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range); if (ret) { @@ -702,8 +702,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb, int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb, uint64_t client_addr, uint64_t rb_offset) { -size_t pagesize = qemu_ram_pagesize(rb); -uint64_t aligned_rbo = rb_offset & ~(pagesize - 1); +uint64_t aligned_rbo = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb)); MigrationIncomingState *mis = migration_incoming_get_current(); trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb), @@ -993,7 +992,7 @@ static void *postcopy_ram_fault_thread(void *opaque) break; } -rb_offset &= ~(qemu_ram_pagesize(rb) - 1); +rb_offset = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb)); trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address, qemu_ram_get_idstr(rb), rb_offset, diff --git a/migration/ram.c b/migration/ram.c index 56240f0f17..b225ec7507 100644 --- a/migration/ram.c +++ 
b/migration/ram.c @@ -811,7 +811,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb, assert(shift >= 6); size = 1ULL << (TARGET_PAGE_BITS + shift); -start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size); +start = QEMU_ALIGN_DOWN((ram_addr_t)page << TARGET_PAGE_BITS, size); trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page); memory_region_clear_dirty_bitmap(rb->mr, start, size); } -- 2.31.1
[PATCH v1 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination
Currently, when someone (i.e., the VM) accesses discarded parts inside a RAMBlock with a RamDiscardManager managing the corresponding mapped memory region, postcopy will request migration of the corresponding page from the source. The source, however, will never answer, because it refuses to migrate such pages with undefined content ("logically unplugged"): the pages are never dirty, and get_queued_page() will consequently skip processing these postcopy requests. Especially reading discarded ("logically unplugged") ranges is supposed to work in some setups (for example with current virtio-mem), although it barely ever happens: still, not placing a page would currently stall the VM, as it cannot make forward progress. Let's check the state via the RamDiscardManager (the state e.g., of virtio-mem is migrated during precopy) and avoid sending a request that will never get answered. Place a fresh zero page instead to keep the VM working. This is the same behavior that would happen automatically without userfaultfd being active, when accessing virtual memory regions without populated pages -- "populate on demand". For now, there are valid cases (as documented in the virtio-mem spec) where a VM might read discarded memory; in the future, we will disallow that. Then, we might want to handle that case differently, e.g., warning the user that the VM seems to be mis-behaving. 
Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/postcopy-ram.c | 31 +++ migration/ram.c | 21 + migration/ram.h | 1 + 3 files changed, 49 insertions(+), 4 deletions(-) diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index 2e9697bdd2..3609ce7e52 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -671,6 +671,29 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, return ret; } +static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb, + ram_addr_t start, uint64_t haddr) +{ +void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb)); + +/* + * Discarded pages (via RamDiscardManager) are never migrated. On unlikely + * access, place a zeropage, which will also set the relevant bits in the + * recv_bitmap accordingly, so we won't try placing a zeropage twice. + * + * Checking a single bit is sufficient to handle pagesize > TPS as either + * all relevant bits are set or not. + */ +assert(QEMU_IS_ALIGNED(start, qemu_ram_pagesize(rb))); +if (ramblock_page_is_discarded(rb, start)) { +bool received = ramblock_recv_bitmap_test_byte_offset(rb, start); + +return received ? 
0 : postcopy_place_page_zero(mis, aligned, rb); +} + +return migrate_send_rp_req_pages(mis, rb, start, haddr); +} + /* * Callback from shared fault handlers to ask for a page, * the page must be specified by a RAMBlock and an offset in that rb @@ -690,7 +713,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb, qemu_ram_get_idstr(rb), rb_offset); return postcopy_wake_shared(pcfd, client_addr, rb); } -migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr); +postcopy_request_page(mis, rb, aligned_rbo, client_addr); return 0; } @@ -984,8 +1007,8 @@ retry: * Send the request to the source - we want to request one * of our host page sizes (which is >= TPS) */ -ret = migrate_send_rp_req_pages(mis, rb, rb_offset, -msg.arg.pagefault.address); +ret = postcopy_request_page(mis, rb, rb_offset, +msg.arg.pagefault.address); if (ret) { /* May be network failure, try to wait for recovery */ if (ret == -EIO && postcopy_pause_fault_thread(mis)) { @@ -993,7 +1016,7 @@ retry: goto retry; } else { /* This is a unavoidable fault */ -error_report("%s: migrate_send_rp_req_pages() get %d", +error_report("%s: postcopy_request_page() get %d", __func__, ret); break; } diff --git a/migration/ram.c b/migration/ram.c index 3be969f749..56240f0f17 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -912,6 +912,27 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb) return cleared_bits; } +/* + * Check if a host-page aligned page falls into a discarded range as managed by + * a RamDiscardManager responsible for the mapped memory region of the RAMBlock. + * + * Note: The result is only stable while migrating (precopy/postcopy). + */ +bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start) +{ +if (rb->mr &&
[PATCH v1 1/9] memory: Introduce replay_discarded callback for RamDiscardManager
Introduce replay_discarded callback similar to our existing replay_populated callback, to be used by migration code to never migrate discarded memory. Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- include/exec/memory.h | 21 + softmmu/memory.c | 11 +++ 2 files changed, 32 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index a185b6dcb8..75b4f600e3 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -540,6 +540,7 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl, } typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque); +typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque); /* * RamDiscardManagerClass: @@ -628,6 +629,21 @@ struct RamDiscardManagerClass { MemoryRegionSection *section, ReplayRamPopulate replay_fn, void *opaque); +/** + * @replay_discarded: + * + * Call the #ReplayRamDiscard callback for all discarded parts within the + * #MemoryRegionSection via the #RamDiscardManager. 
+ * + * @rdm: the #RamDiscardManager + * @section: the #MemoryRegionSection + * @replay_fn: the #ReplayRamDiscard callback + * @opaque: pointer to forward to the callback + */ +void (*replay_discarded)(const RamDiscardManager *rdm, + MemoryRegionSection *section, + ReplayRamDiscard replay_fn, void *opaque); + /** * @register_listener: * @@ -672,6 +688,11 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm, ReplayRamPopulate replay_fn, void *opaque); +void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm, + MemoryRegionSection *section, + ReplayRamDiscard replay_fn, + void *opaque); + void ram_discard_manager_register_listener(RamDiscardManager *rdm, RamDiscardListener *rdl, MemoryRegionSection *section); diff --git a/softmmu/memory.c b/softmmu/memory.c index db182e5d3d..3bcfc3899b 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -2081,6 +2081,17 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm, return rdmc->replay_populated(rdm, section, replay_fn, opaque); } +void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm, + MemoryRegionSection *section, + ReplayRamDiscard replay_fn, + void *opaque) +{ +RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm); + +g_assert(rdmc->replay_discarded); +rdmc->replay_discarded(rdm, section, replay_fn, opaque); +} + void ram_discard_manager_register_listener(RamDiscardManager *rdm, RamDiscardListener *rdl, MemoryRegionSection *section) -- 2.31.1
[PATCH v1 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source
We don't want to migrate memory that corresponds to discarded ranges as managed by a RamDiscardManager responsible for the mapped memory region of the RAMBlock. The content of these pages is essentially stale and without any guarantees for the VM ("logically unplugged"). Depending on the underlying memory type, even reading memory might populate memory on the source, resulting in an undesired memory consumption. Of course, on the destination, even writing a zeropage consumes memory, which we also want to avoid (similar to free page hinting). Currently, virtio-mem tries achieving that goal (not migrating "unplugged" memory that was discarded) by going via qemu_guest_free_page_hint() - but it's hackish and incomplete. For example, background snapshots still end up reading all memory, as they don't do bitmap syncs. Postcopy recovery code will re-add previously cleared bits to the dirty bitmap and migrate them. Let's consult the RamDiscardManager after setting up our dirty bitmap initially and when postcopy recovery code reinitializes it: clear corresponding bits in the dirty bitmaps (e.g., of the RAMBlock and inside KVM). It's important to fixup the dirty bitmap *after* our initial bitmap sync, such that the corresponding dirty bits in KVM are actually cleared. As colo is incompatible with discarding of RAM and inhibits it, we don't have to bother. Note: if a misbehaving guest would use discarded ranges after migration started we would still migrate that memory: however, then we already populated that memory on the migration source. 
Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/ram.c | 77 + 1 file changed, 77 insertions(+) diff --git a/migration/ram.c b/migration/ram.c index bb908822d5..3be969f749 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -858,6 +858,60 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs, return ret; } +static void dirty_bitmap_clear_section(MemoryRegionSection *section, + void *opaque) +{ +const hwaddr offset = section->offset_within_region; +const hwaddr size = int128_get64(section->size); +const unsigned long start = offset >> TARGET_PAGE_BITS; +const unsigned long npages = size >> TARGET_PAGE_BITS; +RAMBlock *rb = section->mr->ram_block; +uint64_t *cleared_bits = opaque; + +/* + * We don't grab ram_state->bitmap_mutex because we expect to run + * only when starting migration or during postcopy recovery where + * we don't have concurrent access. + */ +if (!migration_in_postcopy() && !migrate_background_snapshot()) { +migration_clear_memory_region_dirty_bitmap_range(rb, start, npages); +} +*cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages); +bitmap_clear(rb->bmap, start, npages); +} + +/* + * Exclude all dirty pages from migration that fall into a discarded range as + * managed by a RamDiscardManager responsible for the mapped memory region of + * the RAMBlock. Clear the corresponding bits in the dirty bitmaps. + * + * Discarded pages ("logically unplugged") have undefined content and must + * not get migrated, because even reading these pages for migration might + * result in undesired behavior. + * + * Returns the number of cleared bits in the RAMBlock dirty bitmap. + * + * Note: The result is only stable while migrating (precopy/postcopy). 
+ */ +static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb) +{ +uint64_t cleared_bits = 0; + +if (rb->mr && rb->bmap && memory_region_has_ram_discard_manager(rb->mr)) { +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr); +MemoryRegionSection section = { +.mr = rb->mr, +.offset_within_region = 0, +.size = int128_make64(qemu_ram_get_used_length(rb)), +}; + +ram_discard_manager_replay_discarded(rdm, &section, + dirty_bitmap_clear_section, + &cleared_bits); +} +return cleared_bits; +} + /* Called with RCU critical section */ static void ramblock_sync_dirty_bitmap(RAMState *rs, RAMBlock *rb) { @@ -2668,6 +2722,19 @@ static void ram_list_init_bitmaps(void) } } +static void migration_bitmap_clear_discarded_pages(RAMState *rs) +{ +unsigned long pages; +RAMBlock *rb; + +RCU_READ_LOCK_GUARD(); + +RAMBLOCK_FOREACH_NOT_IGNORED(rb) { +pages = ramblock_dirty_bitmap_clear_discarded_pages(rb); +rs->migration_dirty_pages -= pages; +} +} + static void ram_init_bitmaps(RAMState *rs) { /* For memory_global_dirty_log_start below. */ @@ -2684,6 +2751,12 @@ static void ram_init_bitmaps(RAMState *rs) } qemu_mutex_unlock_ramlist(); qemu_mutex_unlock_iothread(); + +/* + * After an eventual first bitmap sync, fixup the initial bitmap + * containing all 1s to
[PATCH v1 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback
Implement it similar to the replay_populated callback. Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 58 ++ 1 file changed, 58 insertions(+) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index df91e454b2..284096ec5f 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -228,6 +228,38 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem, return ret; } +static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem, + MemoryRegionSection *s, + void *arg, + virtio_mem_section_cb cb) +{ +unsigned long first_bit, last_bit; +uint64_t offset, size; +int ret = 0; + +first_bit = s->offset_within_region / vmem->block_size; +first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, first_bit); +while (first_bit < vmem->bitmap_size) { +MemoryRegionSection tmp = *s; + +offset = first_bit * vmem->block_size; +last_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, + first_bit + 1) - 1; +size = (last_bit - first_bit + 1) * vmem->block_size; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +break; +} +ret = cb(&tmp, arg); +if (ret) { +break; +} +first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, + last_bit + 2); +} +return ret; +} + static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg) { RamDiscardListener *rdl = arg; @@ -1170,6 +1202,31 @@ static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm, virtio_mem_rdm_replay_populated_cb); } +static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s, + void *arg) +{ +struct VirtIOMEMReplayData *data = arg; + +((ReplayRamDiscard)data->fn)(s, data->opaque); +return 0; +} + +static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm, +MemoryRegionSection *s, +ReplayRamDiscard replay_fn, +void *opaque) +{ +const VirtIOMEM *vmem = VIRTIO_MEM(rdm); +struct VirtIOMEMReplayData data = { +.fn = replay_fn, +.opaque = opaque, +}; + +g_assert(s->mr == &vmem->memdev->mr); 
+virtio_mem_for_each_unplugged_section(vmem, s, &data, + virtio_mem_rdm_replay_discarded_cb); +} + static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm, RamDiscardListener *rdl, MemoryRegionSection *s) @@ -1234,6 +1291,7 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data) rdmc->get_min_granularity = virtio_mem_rdm_get_min_granularity; rdmc->is_populated = virtio_mem_rdm_is_populated; rdmc->replay_populated = virtio_mem_rdm_replay_populated; +rdmc->replay_discarded = virtio_mem_rdm_replay_discarded; rdmc->register_listener = virtio_mem_rdm_register_listener; rdmc->unregister_listener = virtio_mem_rdm_unregister_listener; } -- 2.31.1
Re: [PATCH 1/2] target/i386/sev: Use local variable for kvm_sev_launch_start
* Dov Murik (dovmu...@linux.ibm.com) wrote: > The struct kvm_sev_launch_start has a constant and small size, and > therefore we can use a regular local variable for it instead of > allocating and freeing heap memory for it. > > Signed-off-by: Dov Murik Reviewed-by: Dr. David Alan Gilbert > --- > target/i386/sev.c | 23 ++- > 1 file changed, 10 insertions(+), 13 deletions(-) > > diff --git a/target/i386/sev.c b/target/i386/sev.c > index 4c64c68244..0062566c71 100644 > --- a/target/i386/sev.c > +++ b/target/i386/sev.c > @@ -647,31 +647,29 @@ sev_launch_start(SevGuestState *sev) > gsize sz; > int ret = 1; > int fw_error, rc; > -struct kvm_sev_launch_start *start; > +struct kvm_sev_launch_start start = { > +.handle = sev->handle, .policy = sev->policy > +}; > guchar *session = NULL, *dh_cert = NULL; > > -start = g_new0(struct kvm_sev_launch_start, 1); > - > -start->handle = sev->handle; > -start->policy = sev->policy; > if (sev->session_file) { > if (sev_read_file_base64(sev->session_file, &session, &sz) < 0) { > goto out; > } > -start->session_uaddr = (unsigned long)session; > -start->session_len = sz; > +start.session_uaddr = (unsigned long)session; > +start.session_len = sz; > } > > if (sev->dh_cert_file) { > if (sev_read_file_base64(sev->dh_cert_file, &dh_cert, &sz) < 0) { > goto out; > } > -start->dh_uaddr = (unsigned long)dh_cert; > -start->dh_len = sz; > +start.dh_uaddr = (unsigned long)dh_cert; > +start.dh_len = sz; > } > > -trace_kvm_sev_launch_start(start->policy, session, dh_cert); > -rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, start, &fw_error); > +trace_kvm_sev_launch_start(start.policy, session, dh_cert); > +rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, &start, &fw_error); > if (rc < 0) { > error_report("%s: LAUNCH_START ret=%d fw_error=%d '%s'", > __func__, ret, fw_error, fw_error_to_str(fw_error)); > @@ -679,11 +677,10 @@ sev_launch_start(SevGuestState *sev) > } > > sev_set_guest_state(sev, SEV_STATE_LAUNCH_UPDATE); > -sev->handle = 
start.handle; > ret = 0; > > out: > -g_free(start); > g_free(session); > g_free(dh_cert); > return ret; > -- > 2.25.1 > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
[PATCH v1 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager
This series is fully reviewed by Peter and I hope we can get either more review feedback or get it merged via the migration tree soonish. Thanks. --- virtio-mem exposes a dynamic amount of memory within RAMBlocks by coordinating with the VM. Memory within a RAMBlock can either get plugged and consequently used by the VM, or unplugged and consequently no longer used by the VM. Logical unplug is realized by discarding the physical memory backing for virtual memory ranges, similar to memory ballooning. However, important differences to virtio-balloon are: a) A virtio-mem device only operates on its assigned memory region / RAMBlock ("device memory") b) Initially, all device memory is logically unplugged c) Virtual machines will never accidentally reuse memory that is currently logically unplugged. The spec defines most accesses to unplugged memory as "undefined behavior" -- except reading unplugged memory, which is currently expected to work, but that will change in the future. d) The (un)plug granularity is in the range of megabytes -- "memory blocks" e) The state (plugged/unplugged) of a memory block is always known and properly tracked. Whenever memory blocks within the RAMBlock get (un)plugged, changes are communicated via the RamDiscardManager to other QEMU subsystems, most prominently vfio which updates the DMA mapping accordingly. "Unplugging" corresponds to "discarding" and "plugging" corresponds to "populating". While migrating (precopy/postcopy) the state of such memory blocks cannot change, as virtio-mem will reject any guest requests that would change the state of blocks with "busy". We don't want to migrate such logically unplugged memory, because it can result in an unintended memory consumption both on the source (when reading memory from some memory backends) and on the destination (when writing memory). Further, migration time can be heavily reduced when skipping logically unplugged blocks and we avoid populating unnecessary page tables in Linux. 
Right now, virtio-mem reuses the free page hinting infrastructure during precopy to exclude all logically unplugged ("discarded") parts from the migration stream. However, there are some scenarios that are not handled properly and need fixing. Further, there are some ugly corner cases in postcopy code and background snapshotting code that similarly have to handle such special RAMBlocks. Let's reuse the RamDiscardManager infrastructure to essentially handle precopy, postcopy and background snapshots cleanly, which means: a) In precopy code, fixing up the initial dirty bitmaps (in the RAMBlock and e.g., KVM) to exclude discarded ranges. b) In postcopy code, placing a zeropage when requested to handle a page falling into a discarded range -- because the source will never send it. Further, fix up the dirty bitmap when overwriting it in recovery mode. c) In background snapshot code, never populating discarded ranges, not even with the shared zeropage, to avoid unintended memory consumption, especially in the future with hugetlb and shmem. Detail: When realizing a virtio-mem devices, it will register the RAM for migration via vmstate_register_ram(). Further, it will set itself as the RamDiscardManager for the corresponding memory region of the RAMBlock via memory_region_set_ram_discard_manager(). Last but not least, memory device code will actually map the memory region into guest physical address space. So migration code can always properly identify such RAMBlocks. Tested with precopy/postcopy on shmem, where even reading unpopulated memory ranges will populate actual memory and not the shared zeropage. Tested with background snapshots on anonymous memory, because other backends are not supported yet with upstream Linux. 
v5 -> v6: - Rebased and added ACKs v4 -> v5: - "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination" -- Use ROUND_DOWN and fix compile warning on 32 bit -- Use int128_make64() instead of wrongly int128_get64() - "migration: Simplify alignment and alignment checks" -- Use ROUND_DOWN where possible instead of QEMU_ALIGN_DOWN and fix compilation warning on 32 bit - "migration/ram: Factor out populating pages readable in ram_block_populate_pages()" -- Rename functions, add a comment. - "migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots" -- Adjust to changed function names v3 -> v4: - Added ACKs - "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination" -- Use QEMU_ALIGN_DOWN() to align to ram pagesize - "migration: Simplify alignment and alignment checks" -- Added - "migration/ram: Factor out populating pages readable in ram_block_populate_pages()" -- Added - "migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots" -- Simplified due to factored out code v2 -> v3: - "migration/ram: Don't passs RAMState to
Re: [PATCH] memory: Log access direction for invalid accesses
On 11.10.21 19:32, BALATON Zoltan wrote: In memory_region_access_valid() invalid accesses are logged to help debugging but the log message does not say if it was a read or write. Log that too to better identify the access causing the problem. Signed-off-by: BALATON Zoltan --- softmmu/memory.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index db182e5d3d..e5826faa0c 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1378,17 +1378,17 @@ bool memory_region_access_valid(MemoryRegion *mr, { if (mr->ops->valid.accepts && !mr->ops->valid.accepts(mr->opaque, addr, size, is_write, attrs)) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: rejected\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: rejected\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } if (!mr->ops->valid.unaligned && (addr & (size - 1))) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: unaligned\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: unaligned\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } @@ -1400,10 +1400,10 @@ bool memory_region_access_valid(MemoryRegion *mr, if (size > mr->ops->valid.max_access_size || size < mr->ops->valid.min_access_size) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: invalid size " - "(min:%u max:%u)\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: invalid size " + "(min:%u max:%u)\n", + is_write ? 
"write" : "read", addr, size, memory_region_name(mr), mr->ops->valid.min_access_size, mr->ops->valid.max_access_size); Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: ACPI endianness
On Mon, 11 Oct 2021, Michael S. Tsirkin wrote: On Mon, Oct 11, 2021 at 03:51:01PM +0200, BALATON Zoltan wrote: ... but given we did not previously do the read, maybe we should keep it that way at least for the time being. How do you know there was no read before this write? Did you check it? I've only added a printf in the write method and saw the value was swapped but did not check if there was a read before that. There are no traces in these methods so maybe I would not see it unless adding a printf there too. All I am saying is that qemu did not convert a write into a read+write. OK confirmed that by disabling pm_io_space_update() in hw/isa/vt82c686.c so the via-pm region is never mapped and then modifying the log messages for invalid accesses (patch sent separately) I get: ~ # poweroff Sent SIGKILL to all processes Requesting system poweroff pci_cfg_write vt82c686b-usb-uhci 12:3 @0xc0 <- 0x8f00 pci_cfg_write vt82c686b-usb-uhci 12:2 @0xc0 <- 0x8f00 [ 16.498465] reboot: Power down Invalid write at addr 0xFE000F05, size 1, region '(null)', reason: rejected Invalid write at addr 0xF05, size 1, region '(null)', reason: rejected So the guest only tries to write one byte but not sure if the read generated within memory.c would show up in these logs. I suspect if you fixed that you'd get all sorts of problems with other device models so the less likely to break anything fix would be to go back to native endian for acpi but I wait for what you come up with. I'd like this to be fixed one way or another for 6.2 and fixing endianness would probably be enough for that. Regards, BALATON Zoltan
[PATCH v1 2/2] memory: Update description of memory_region_is_mapped()
memory_region_is_mapped() only indicates if the memory region is mapped into a different memory region, and only if it is mapped directly (->container), not via an alias. Update the documentation to make this clearer. Signed-off-by: David Hildenbrand --- include/exec/memory.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/exec/memory.h b/include/exec/memory.h index a185b6dcb8..abc17fc3c0 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -2265,7 +2265,8 @@ bool memory_region_present(MemoryRegion *container, hwaddr addr); /** * memory_region_is_mapped: returns true if #MemoryRegion is mapped - * into any address space. + * into another #MemoryRegion directly. Will return false if the + * #MemoryRegion is mapped indirectly via an alias. * * @mr: a #MemoryRegion which should be checked if it's mapped */ -- 2.31.1
[PATCH v1 1/2] machine: Use host_memory_backend_is_mapped() in machine_consume_memdev()
memory_region_is_mapped() is the wrong check, we actually want to check whether the backend is already marked mapped. For example, memory regions mapped via an alias, such as NVDIMMs, currently don't make memory_region_is_mapped() return "true". As the machine is initialized before any memory devices (and thereby before NVDIMMs are initialized), this isn't a fix but merely a cleanup. Signed-off-by: David Hildenbrand --- hw/core/machine.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index b8d95eec32..a1db865939 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -1260,7 +1260,7 @@ MemoryRegion *machine_consume_memdev(MachineState *machine, { MemoryRegion *ret = host_memory_backend_get_memory(backend); -if (memory_region_is_mapped(ret)) { +if (host_memory_backend_is_mapped(backend)) { error_report("memory backend %s can't be used multiple times.", object_get_canonical_path_component(OBJECT(backend))); exit(EXIT_FAILURE); -- 2.31.1
[PATCH v1 0/2] memory: Update description of memory_region_is_mapped()
Playing with memory_region_is_mapped(), I realized that memory regions mapped via an alias behave a little bit "differently", as they don't have their ->container set. * memory_region_is_mapped() will never succeed for memory regions mapped via an alias * memory_region_to_address_space(), memory_region_find(), memory_region_find_rcu(), memory_region_present() won't work, which seems okay, because we don't expect such memory regions getting passed to these functions. * memory_region_to_absolute_addr() will result in a wrong address. As the result is only used for tracing, that is tolerable. Let's clarify the documentation of memory_region_is_mapped() and change one user that really should be checking something else. Cc: Eduardo Habkost Cc: Marcel Apfelbaum Cc: Paolo Bonzini Cc: Peter Xu Cc: Igor Mammedov Cc: Richard Henderson Cc: Philippe Mathieu-Daudé David Hildenbrand (2): machine: Use host_memory_backend_is_mapped() in machine_consume_memdev() memory: Update description of memory_region_is_mapped() hw/core/machine.c | 2 +- include/exec/memory.h | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) -- 2.31.1
[PATCH] memory: Log access direction for invalid accesses
In memory_region_access_valid() invalid accesses are logged to help debugging but the log message does not say if it was a read or write. Log that too to better identify the access causing the problem. Signed-off-by: BALATON Zoltan --- softmmu/memory.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index db182e5d3d..e5826faa0c 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1378,17 +1378,17 @@ bool memory_region_access_valid(MemoryRegion *mr, { if (mr->ops->valid.accepts && !mr->ops->valid.accepts(mr->opaque, addr, size, is_write, attrs)) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: rejected\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: rejected\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } if (!mr->ops->valid.unaligned && (addr & (size - 1))) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: unaligned\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: unaligned\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } @@ -1400,10 +1400,10 @@ bool memory_region_access_valid(MemoryRegion *mr, if (size > mr->ops->valid.max_access_size || size < mr->ops->valid.min_access_size) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: invalid size " - "(min:%u max:%u)\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: invalid size " + "(min:%u max:%u)\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr), mr->ops->valid.min_access_size, mr->ops->valid.max_access_size); -- 2.21.4
Re: [PATCH v4 00/11] virtio-iommu: Add ACPI support
Hi Haiwei, On Mon, Oct 11, 2021 at 06:10:07PM +0800, Haiwei Li wrote: [...] > Gave up waiting for root file system device. Common problems: > - Boot args (cat /proc/cmdline) >- Check rootdelay= (did the system wait long enough?) > - Missing modules (cat /proc/modules; ls /dev) > ALERT! UUID=3caf26b5-4d08-43e0-8634-7573269c4f70 does not exist. > Dropping to a shell! > > Any suggestions? Thanks. It's possible that the rootfs is on a disk behind the IOMMU, and the IOMMU driver doesn't get loaded. That could happen, for example, if the virtio-iommu module is not present in the initramfs. Since IOMMU drivers are typically built into the kernel rather than modules, distro tools that build the initramfs might not pick up IOMMU modules. I'm guessing this could be the issue here because of the hints and "Dropping to a shell" line. The clean solution will be to patch the initramfs tools to learn about IOMMU drivers (I'm somewhat working on that). In the meantime, if this is indeed the problem, you could try explicitly adding the virtio-iommu module to the initramfs, or building the kernel with CONFIG_VIRTIO_IOMMU=y rather than =m, though that requires VIRTIO and VIRTIO_PCI to be built-in as well. Thanks, Jean
[PATCH v1] virtio-mem: Don't skip alignment checks when warning about block size
If we warn about the block size being smaller than the default, we skip some alignment checks. This can currently only fail on x86-64, when specifying a block size of 1 MiB, however, we detect the THP size of 2 MiB. Fixes: 228957fea3a9 ("virtio-mem: Probe THP size to determine default block size") Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index df91e454b2..7ce9901791 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -701,7 +701,8 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) warn_report("'%s' property is smaller than the default block size (%" PRIx64 " MiB)", VIRTIO_MEM_BLOCK_SIZE_PROP, virtio_mem_default_block_size(rb) / MiB); -} else if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) { +} +if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) { error_setg(errp, "'%s' property has to be multiples of '%s' (0x%" PRIx64 ")", VIRTIO_MEM_REQUESTED_SIZE_PROP, VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size); -- 2.31.1
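The bug fixed here is purely control flow: with `else if`, the alignment check was silently skipped whenever the size warning fired. A reduced, QEMU-independent illustration (function names invented; `is_aligned` mimics QEMU's `QEMU_IS_ALIGNED` semantics for this example):

```c
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024)

/* Same effect as QEMU_IS_ALIGNED() for the purposes of this sketch. */
static bool is_aligned(uint64_t n, uint64_t m)
{
    return n % m == 0;
}

/* Returns true if the configuration is accepted. */
static bool check_config(uint64_t block_size, uint64_t requested_size,
                         uint64_t default_block_size, bool *warned)
{
    *warned = (block_size < default_block_size); /* warn_report() in QEMU */
    /* Independent 'if', not 'else if': warning must not skip validation. */
    if (!is_aligned(requested_size, block_size)) {
        return false;                            /* error_setg() in QEMU */
    }
    return true;
}
```

With the independent `if`, a 1 MiB block size on a host with a 2 MiB THP default both warns and still rejects an unaligned requested size.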
[PATCH 1/2] target/i386/sev: Use local variable for kvm_sev_launch_start
The struct kvm_sev_launch_start has a constant and small size, and therefore we can use a regular local variable for it instead of allocating and freeing heap memory for it. Signed-off-by: Dov Murik --- target/i386/sev.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/target/i386/sev.c b/target/i386/sev.c index 4c64c68244..0062566c71 100644 --- a/target/i386/sev.c +++ b/target/i386/sev.c @@ -647,31 +647,29 @@ sev_launch_start(SevGuestState *sev) gsize sz; int ret = 1; int fw_error, rc; -struct kvm_sev_launch_start *start; +struct kvm_sev_launch_start start = { +.handle = sev->handle, .policy = sev->policy +}; guchar *session = NULL, *dh_cert = NULL; -start = g_new0(struct kvm_sev_launch_start, 1); - -start->handle = sev->handle; -start->policy = sev->policy; if (sev->session_file) { if (sev_read_file_base64(sev->session_file, , ) < 0) { goto out; } -start->session_uaddr = (unsigned long)session; -start->session_len = sz; +start.session_uaddr = (unsigned long)session; +start.session_len = sz; } if (sev->dh_cert_file) { if (sev_read_file_base64(sev->dh_cert_file, _cert, ) < 0) { goto out; } -start->dh_uaddr = (unsigned long)dh_cert; -start->dh_len = sz; +start.dh_uaddr = (unsigned long)dh_cert; +start.dh_len = sz; } -trace_kvm_sev_launch_start(start->policy, session, dh_cert); -rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, start, _error); +trace_kvm_sev_launch_start(start.policy, session, dh_cert); +rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, , _error); if (rc < 0) { error_report("%s: LAUNCH_START ret=%d fw_error=%d '%s'", __func__, ret, fw_error, fw_error_to_str(fw_error)); @@ -679,11 +677,10 @@ sev_launch_start(SevGuestState *sev) } sev_set_guest_state(sev, SEV_STATE_LAUNCH_UPDATE); -sev->handle = start->handle; +sev->handle = start.handle; ret = 0; out: -g_free(start); g_free(session); g_free(dh_cert); return ret; -- 2.25.1
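The transformation in this patch is a plain-C idiom: a small, fixed-size ioctl argument can live on the stack, with designated initializers replacing `g_new0()`'s zeroing, so no exit path has to free anything. A generic sketch with an invented struct (not the actual KVM ABI layout):

```c
#include <stdint.h>

struct launch_params {       /* stand-in for a small, fixed-size ioctl arg */
    uint32_t handle;
    uint32_t policy;
    uint64_t session_uaddr;
    uint32_t session_len;
};

static struct launch_params make_start(uint32_t handle, uint32_t policy)
{
    /* Designated initializers zero every unnamed field, as g_new0() did. */
    struct launch_params start = {
        .handle = handle,
        .policy = policy,
    };
    return start;
}
```

The C standard guarantees that fields not named in the initializer list are zero-initialized, so the ioctl still sees a fully defined struct.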
[PATCH 0/2] target/i386/sev: Replace malloc with local variables
In two places in sev.c we use malloc+free to manage memory for small constant struct variables. Modify this to use local variables. This small series can be applied on top of master or on top of Phil's Housekeeping SEV series [1]. [1] https://lore.kernel.org/qemu-devel/20211007161716.453984-1-phi...@redhat.com/ Dov Murik (2): target/i386/sev: Use local variable for kvm_sev_launch_start target/i386/sev: Use local variable for kvm_sev_launch_measure target/i386/sev.c | 39 +-- 1 file changed, 17 insertions(+), 22 deletions(-) -- 2.25.1
[PATCH 2/2] target/i386/sev: Use local variable for kvm_sev_launch_measure
The struct kvm_sev_launch_measure has a constant and small size, and therefore we can use a regular local variable for it instead of allocating and freeing heap memory for it. Signed-off-by: Dov Murik --- target/i386/sev.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/target/i386/sev.c b/target/i386/sev.c index 0062566c71..eede07f11d 100644 --- a/target/i386/sev.c +++ b/target/i386/sev.c @@ -729,7 +729,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) SevGuestState *sev = sev_guest; int ret, error; g_autofree guchar *data = NULL; -g_autofree struct kvm_sev_launch_measure *measurement = NULL; +struct kvm_sev_launch_measure measurement = {}; if (!sev_check_state(sev, SEV_STATE_LAUNCH_UPDATE)) { return; @@ -743,23 +743,21 @@ sev_launch_get_measure(Notifier *notifier, void *unused) } } -measurement = g_new0(struct kvm_sev_launch_measure, 1); - /* query the measurement blob length */ ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, -measurement, ); -if (!measurement->len) { +, ); +if (!measurement.len) { error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", __func__, ret, error, fw_error_to_str(errno)); return; } -data = g_new0(guchar, measurement->len); -measurement->uaddr = (unsigned long)data; +data = g_new0(guchar, measurement.len); +measurement.uaddr = (unsigned long)data; /* get the measurement blob */ ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, -measurement, ); +, ); if (ret) { error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", __func__, ret, error, fw_error_to_str(errno)); @@ -769,7 +767,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) sev_set_guest_state(sev, SEV_STATE_LAUNCH_SECRET); /* encode the measurement value and emit the event */ -sev->measurement = g_base64_encode(data, measurement->len); +sev->measurement = g_base64_encode(data, measurement.len); trace_kvm_sev_launch_measurement(sev->measurement); } -- 2.25.1
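`sev_launch_get_measure()` relies on a common ioctl convention: call once with a zero-length buffer to learn the required size, allocate, then call again to fetch the data. A generic sketch of that convention, where `fake_measure` is an invented stand-in for the kernel side of `KVM_SEV_LAUNCH_MEASURE`:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct measure_arg {
    uint64_t uaddr;  /* userspace buffer address, 0 on the query call */
    uint32_t len;    /* in: buffer size; out: required/used size */
};

/* Invented stand-in for the kernel side of the measurement ioctl. */
static int fake_measure(struct measure_arg *m)
{
    static const uint8_t blob[] = {0xde, 0xad, 0xbe, 0xef};
    if (m->len < sizeof(blob)) {
        m->len = sizeof(blob);   /* report the required length */
        return -1;               /* the kernel would set fw_error here */
    }
    memcpy((void *)(uintptr_t)m->uaddr, blob, sizeof(blob));
    m->len = sizeof(blob);
    return 0;
}

/* Two-call pattern: query length, allocate, fetch. */
static uint8_t *get_measurement(uint32_t *len_out)
{
    struct measure_arg m = {0};
    fake_measure(&m);                    /* 1st call: m.len now set */
    if (m.len == 0) {
        return NULL;
    }
    uint8_t *data = calloc(m.len, 1);
    m.uaddr = (uintptr_t)data;
    if (fake_measure(&m) != 0) {         /* 2nd call: fills the buffer */
        free(data);
        return NULL;
    }
    *len_out = m.len;
    return data;
}
```

Note that only the variable-length blob needs heap allocation; the argument struct itself, as the patch shows, can stay on the stack.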
Re: [PATCH 3/6] numa: Add SGXEPCSection list for multiple sections
On Mon, Oct 11, 2021 at 07:15:51PM +0800, Yang Zhong wrote: > The SGXEPCSection list added into SGXInfo to show the multiple > SGX EPC sections detailed info, not the total size like before. > > Signed-off-by: Yang Zhong > --- > qapi/misc-target.json | 19 +-- > 1 file changed, 17 insertions(+), 2 deletions(-) > > diff --git a/qapi/misc-target.json b/qapi/misc-target.json > index 594fbd1577..89a5a4250a 100644 > --- a/qapi/misc-target.json > +++ b/qapi/misc-target.json > @@ -334,6 +334,21 @@ >'returns': 'SevAttestationReport', >'if': 'TARGET_I386' } > > +## > +# @SGXEPCSection: > +# > +# Information about intel SGX EPC section info > +# > +# @index: the SGX epc section index > +# > +# @size: the size of epc section > +# > +# Since: 6.2 > +## > +{ 'struct': 'SGXEPCSection', > + 'data': { 'index': 'uint64', > +'size': 'uint64'}} > + > ## > # @SGXInfo: > # > @@ -347,7 +362,7 @@ > # > # @flc: true if FLC is supported > # > -# @section-size: The EPC section size for guest > +# @sections: The EPC sections info for guest > # > # Since: 6.2 Given this has not yet been in a stable release, we can make this change... > ## > @@ -356,7 +371,7 @@ > 'sgx1': 'bool', > 'sgx2': 'bool', > 'flc': 'bool', > -'section-size': 'uint64'}, > +'sections': ['SGXEPCSection']}, > 'if': 'TARGET_I386' } ...but are we sure we have the best interface possible if we are still expressing uncertainty about the QAPI used to represent it? -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 0/7] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
On 04.10.21 14:02, David Hildenbrand wrote: #1 is a preparation for improved error reporting, #2 adds support for MADV_POPULATE_WRITE, #3 cleans up the code to avoid global variables and prepare for concurrency, #4 and #5 optimize thread handling, #6 makes os_mem_prealloc() safe to be called from multiple threads concurrently and #7 makes the SIGBUS handler coexist cleanly with the MCE SIGBUS handler under Linux. Details regarding MADV_POPULATE_WRITE can be found in introducing upstream Linux commits 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") and eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)"), and in the man page update [1]. Paolo, are you planning on taking this via your tree (POSIX)? Thanks! -- Thanks, David / dhildenb
Re: [PATCH 1/6] numa: Enable numa for SGX EPC sections
On Mon, Oct 11, 2021 at 07:15:49PM +0800, Yang Zhong wrote: > The basic SGX did not enable numa for SGX EPC sections, which > result in all EPC sections located in numa node 0. This patch > enable SGX numa function in the guest and the EPC section can > work with RAM as one numa node. > > The Guest kernel related log: > [0.009981] ACPI: SRAT: Node 0 PXM 0 [mem 0x18000-0x183ff] > [0.009982] ACPI: SRAT: Node 1 PXM 1 [mem 0x18400-0x185bf] > The SRAT table can normally show SGX EPC sections menory info in different > numa nodes. > > The SGX EPC numa related command: > .. > -m 4G,maxmem=20G \ > -smp sockets=2,cores=2 \ > -cpu host,+sgx-provisionkey \ > -object memory-backend-ram,size=2G,host-nodes=0,policy=bind,id=node0 \ > -object > memory-backend-epc,id=mem0,size=64M,prealloc=on,host-nodes=0,policy=bind \ > -numa node,nodeid=0,cpus=0-1,memdev=node0 \ > -object memory-backend-ram,size=2G,host-nodes=1,policy=bind,id=node1 \ > -object > memory-backend-epc,id=mem1,size=28M,prealloc=on,host-nodes=1,policy=bind \ > -numa node,nodeid=1,cpus=2-3,memdev=node1 \ > -M > sgx-epc.0.memdev=mem0,sgx-epc.0.node=0,sgx-epc.1.memdev=mem1,sgx-epc.1.node=1 > \ > .. > > Signed-off-by: Yang Zhong > --- > qapi/machine.json | 6 +- > include/hw/i386/sgx-epc.h | 3 +++ > hw/i386/acpi-build.c | 4 > hw/i386/sgx-epc.c | 3 +++ > hw/i386/sgx.c | 44 +++ > monitor/hmp-cmds.c| 1 + > qemu-options.hx | 4 ++-- > 7 files changed, 62 insertions(+), 3 deletions(-) > > diff --git a/qapi/machine.json b/qapi/machine.json > index 5db54df298..09b6188e6f 100644 > --- a/qapi/machine.json > +++ b/qapi/machine.json > @@ -1213,6 +1213,7 @@ >'data': { '*id': 'str', > 'memaddr': 'size', > 'size': 'size', > +'node': 'int', > 'memdev': 'str' >} > } > @@ -1288,7 +1289,10 @@ > # Since: 6.2 > ## > { 'struct': 'SgxEPC', > - 'data': { 'memdev': 'str' } } > + 'data': { 'memdev': 'str', > +'node': 'int' > + } > +} Missing documentation of the new member. -- Eric Blake, Principal Software Engineer Red Hat, Inc. 
+1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: Moving QEMU downloads to GitLab Releases?
Hi, > > So while I'm all for making things a little more independent, > > let's not do it in a way that makes life difficult for downstreams. > > There are more there than just a couple of big distro builders. > > I think this is fine. Firmware blobs aren't needed to build QEMU, only > to run the system emulator. Yep. But doing firmware builds outside gitlab should be easy, so we should probably keep the core logic in a script / Makefile / whatever (similar to today's roms/Makefile), then simply invoke that from gitlab ci yaml files. take care, Gerd
Re: [PATCH] qcow2: Silence clang -m32 compiler warning
On Mon, Oct 11, 2021 at 05:50:31PM +0200, Hanna Reitz wrote: > With -m32, size_t is generally only a uint32_t. That makes clang > complain that in the assertion > > assert(qiov->size <= INT64_MAX); > > the range of the type of qiov->size (size_t) is too small for any of its > values to ever exceed INT64_MAX. Yep, I'm not surprised that we hit that. > > Cast qiov->size to uint64_t to silence clang. > > Fixes: f7ef38dd1310d7d9db76d0aa16899cbc5744f36d >("block: use int64_t instead of uint64_t in driver read >handlers") > Signed-off-by: Hanna Reitz > --- > I don't know whether this is the best possible solution, or whether we > should care about this at all (I personally think it's basically just > wrong for clang to warn about always-true conditions in assertions), but > I thought I might as well just send this patch as the basis for a > discussion. I agree that the compiler is overly noisy, but the fix is simple enough that I'm fine with it as written. Reviewed-by: Eric Blake Since the original went through my tree, I'm happy to take this one through my NBD tree as well. > --- > block/qcow2-cluster.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c > index 5727f92dcb..21884a1ab9 100644 > --- a/block/qcow2-cluster.c > +++ b/block/qcow2-cluster.c > @@ -513,7 +513,8 @@ static int coroutine_fn > do_perform_cow_read(BlockDriverState *bs, > */ > assert(src_cluster_offset <= INT64_MAX); > assert(src_cluster_offset + offset_in_cluster <= INT64_MAX); > -assert(qiov->size <= INT64_MAX); > +/* Cast qiov->size to uint64_t to silence a compiler warning on -m32 */ > +assert((uint64_t)qiov->size <= INT64_MAX); > bdrv_check_qiov_request(src_cluster_offset + offset_in_cluster, > qiov->size, > qiov, 0, _abort); > /* > -- > 2.31.1 > -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 2/2] monitor: refactor set/expire_password and allow VNC display id
On Mon, Oct 11, 2021 at 04:03:01PM +0100, Dr. David Alan Gilbert wrote: > * Stefan Reiter (s.rei...@proxmox.com) wrote: > > It is possible to specify more than one VNC server on the command line, > > either with an explicit ID or the auto-generated ones à la "default", > > "vnc2", "vnc3", ... > > > > +++ b/monitor/hmp-cmds.c > > @@ -1451,10 +1451,41 @@ void hmp_set_password(Monitor *mon, const QDict > > *qdict) > > { > > const char *protocol = qdict_get_str(qdict, "protocol"); > > const char *password = qdict_get_str(qdict, "password"); > > +const char *display = qdict_get_try_str(qdict, "display"); > > const char *connected = qdict_get_try_str(qdict, "connected"); > > Error *err = NULL; > > +DisplayProtocol proto; > > > > -qmp_set_password(protocol, password, !!connected, connected, ); > > +SetPasswordOptions opts = { > > +.password = g_strdup(password), > > You're leaking that strdup on the error returns; you probably want to > add an error: exit and goto it to do all the cleanup. Or maybe there's a way to use g_autofree to let the compiler clean it up automatically. > > > +.u.vnc.display = NULL, > > +}; > > + > > +proto = qapi_enum_parse(_lookup, protocol, > > +DISPLAY_PROTOCOL_VNC, ); > > +if (err) { > > +hmp_handle_error(mon, err); > > +return; > > +} > > +opts.protocol = proto; > > + > > +if (proto == DISPLAY_PROTOCOL_VNC) { > > +opts.u.vnc.has_display = !!display; > > +opts.u.vnc.display = g_strdup(display); > > +} else if (proto == DISPLAY_PROTOCOL_SPICE) { > > +opts.u.spice.has_connected = !!connected; > > +opts.u.spice.connected = > > +qapi_enum_parse(_lookup, connected, > > +SET_PASSWORD_ACTION_KEEP, ); > > +if (err) { > > +hmp_handle_error(mon, err); > > +return; > > +} > > +} > > + > > +qmp_set_password(, ); > > +g_free(opts.password); > > +g_free(opts.u.vnc.display); Hmm. Why are we hand-cleaning only portions of the QAPI type instead of using the generated qapi_free_SetPasswordOptions() do to things? 
> > hmp_handle_error(mon, err); > > } > > > > @@ -1462,9 +1493,31 @@ void hmp_expire_password(Monitor *mon, const QDict > > *qdict) > > { > > const char *protocol = qdict_get_str(qdict, "protocol"); > > const char *whenstr = qdict_get_str(qdict, "time"); > > +const char *display = qdict_get_try_str(qdict, "display"); > > Error *err = NULL; > > +DisplayProtocol proto; > > > > -qmp_expire_password(protocol, whenstr, ); > > +ExpirePasswordOptions opts = { > > +.time = g_strdup(whenstr), > > +.u.vnc.display = NULL, > > +}; > > Same here; that 'whenstr' gets leaked on errors. > > > +proto = qapi_enum_parse(_lookup, protocol, > > +DISPLAY_PROTOCOL_VNC, ); > > +if (err) { > > +hmp_handle_error(mon, err); > > +return; > > +} > > +opts.protocol = proto; > > + > > +if (proto == DISPLAY_PROTOCOL_VNC) { > > +opts.u.vnc.has_display = !!display; > > +opts.u.vnc.display = g_strdup(display); > > +} > > + > > +qmp_expire_password(, ); > > +g_free(opts.time); > > +g_free(opts.u.vnc.display); > > hmp_handle_error(mon, err); Same here - using the generated qapi_free_ function rather than doing things by hand, and/or smart use of g_autofree, may make this easier to maintain. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
[RFC PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
We use INDEX_op_insn_start to mark the start of instruction boundaries. If we don't do it in the .insn_start hook, things get confused, especially now that plugins want to use that marking to identify the start of instructions and will bomb out if they see instrumented ops before the first instruction boundary. Signed-off-by: Alex Bennée --- target/s390x/tcg/translate.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c index f284870cd2..fe145ff2eb 100644 --- a/target/s390x/tcg/translate.c +++ b/target/s390x/tcg/translate.c @@ -6380,9 +6380,6 @@ static DisasJumpType translate_one(CPUS390XState *env, DisasContext *s) /* Search for the insn in the table. */ insn = extract_insn(env, s); -/* Emit insn_start now that we know the ILEN. */ -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); - /* Not found means unimplemented/illegal opcode. */ if (insn == NULL) { qemu_log_mask(LOG_UNIMP, "unimplemented opcode 0x%02x%02x\n", @@ -6550,8 +6547,30 @@ static void s390x_tr_tb_start(DisasContextBase *db, CPUState *cs) { } +/* + * We do just enough partial instruction decoding here to calculate the + * length of the instruction so we can drop the INDEX_op_insn_start + * before anything else is emitted in the TCGOp stream. + * + * See extract_insn for the full decode. + */ static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { +CPUS390XState *env = cs->env_ptr; +DisasContext *s = container_of(dcbase, DisasContext, base); +uint64_t insn, pc = s->base.pc_next; +int op, ilen; + +if (unlikely(s->ex_value)) { +ilen = s->ex_value & 0xf; +} else { +insn = ld_code2(env, s, pc); /* FIXME: don't reload same pc twice */ +op = (insn >> 8) & 0xff; +ilen = get_ilen(op); +} + +/* Emit insn_start now that we know the ILEN. */ +tcg_gen_insn_start(s->base.pc_next, s->cc_op, ilen); } static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs) -- 2.30.2
Re: [PATCH v4 08/11] tests/acpi: allow updates of VIOT expected data files
On Fri, 8 Oct 2021 16:26:22 +0100 Jean-Philippe Brucker wrote: > On Wed, Oct 06, 2021 at 10:12:15AM +0200, Igor Mammedov wrote: > > On Fri, 1 Oct 2021 18:33:56 +0100 > > Jean-Philippe Brucker wrote: > > > > > Create empty data files and allow updates for the upcoming VIOT tests. > > > > > > Signed-off-by: Jean-Philippe Brucker > > > --- > > > tests/qtest/bios-tables-test-allowed-diff.h | 3 +++ > > > tests/data/acpi/q35/DSDT.viot | 0 > > > > does default tests/data/acpi/q35/DSDT differs from > > DSDT.viot? > > Yes the VIOT test has one more PCI device (virtio-iommu) and PXB devices, > so there are additional descriptors in the DSDT also see tests/qtest/bios-tables-test.c step 6 (---include diff--- part) > > Thanks, > Jean > > > > > > tests/data/acpi/q35/VIOT.viot | 0 > > > tests/data/acpi/virt/VIOT | 0 > > > 4 files changed, 3 insertions(+) > > > create mode 100644 tests/data/acpi/q35/DSDT.viot > > > create mode 100644 tests/data/acpi/q35/VIOT.viot > > > create mode 100644 tests/data/acpi/virt/VIOT > > > > > > diff --git a/tests/qtest/bios-tables-test-allowed-diff.h > > > b/tests/qtest/bios-tables-test-allowed-diff.h > > > index dfb8523c8b..29b5b1eabc 100644 > > > --- a/tests/qtest/bios-tables-test-allowed-diff.h > > > +++ b/tests/qtest/bios-tables-test-allowed-diff.h > > > @@ -1 +1,4 @@ > > > /* List of comma-separated changed AML files to ignore */ > > > +"tests/data/acpi/virt/VIOT", > > > +"tests/data/acpi/q35/DSDT.viot", > > > +"tests/data/acpi/q35/VIOT.viot", > > > diff --git a/tests/data/acpi/q35/DSDT.viot b/tests/data/acpi/q35/DSDT.viot > > > new file mode 100644 > > > index 00..e69de29bb2 > > > diff --git a/tests/data/acpi/q35/VIOT.viot b/tests/data/acpi/q35/VIOT.viot > > > new file mode 100644 > > > index 00..e69de29bb2 > > > diff --git a/tests/data/acpi/virt/VIOT b/tests/data/acpi/virt/VIOT > > > new file mode 100644 > > > index 00..e69de29bb2 > > >
Re: ACPI endianness
On Mon, Oct 11, 2021 at 03:51:01PM +0200, BALATON Zoltan wrote: > > ... but given we did not previously do the read, maybe we should keep > > it that way at least for the time being. > > How do you know there was no read before this write? Did you check it? I've > only added a printf in the write method and saw the value was swapped but > did not check if there was a read before that. There are no traces in these > methods so maybe I would not see it unless adding a printf there too. All I am saying is that qemu did not convert a write into a read+write.
[PATCH] qcow2: Silence clang -m32 compiler warning
With -m32, size_t is generally only a uint32_t. That makes clang complain that in the assertion assert(qiov->size <= INT64_MAX); the range of the type of qiov->size (size_t) is too small for any of its values to ever exceed INT64_MAX. Cast qiov->size to uint64_t to silence clang. Fixes: f7ef38dd1310d7d9db76d0aa16899cbc5744f36d ("block: use int64_t instead of uint64_t in driver read handlers") Signed-off-by: Hanna Reitz --- I don't know whether this is the best possible solution, or whether we should care about this at all (I personally think it's basically just wrong for clang to warn about always-true conditions in assertions), but I thought I might as well just send this patch as the basis for a discussion. --- block/qcow2-cluster.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 5727f92dcb..21884a1ab9 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -513,7 +513,8 @@ static int coroutine_fn do_perform_cow_read(BlockDriverState *bs, */ assert(src_cluster_offset <= INT64_MAX); assert(src_cluster_offset + offset_in_cluster <= INT64_MAX); -assert(qiov->size <= INT64_MAX); +/* Cast qiov->size to uint64_t to silence a compiler warning on -m32 */ +assert((uint64_t)qiov->size <= INT64_MAX); bdrv_check_qiov_request(src_cluster_offset + offset_in_cluster, qiov->size, qiov, 0, _abort); /* -- 2.31.1
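For reference, the pattern under discussion is easy to reproduce outside QEMU. A minimal stand-alone sketch (the helper name is made up, it is not the qcow2 code) of the always-true assertion and the widening cast that silences clang:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On an ILP32 target (e.g. clang -m32), size_t is 32 bits wide, so a
 * plain "sz <= INT64_MAX" is trivially true and clang may flag the
 * comparison as tautological.  Widening the operand to uint64_t first
 * keeps the assertion meaningful on LP64 hosts while silencing the
 * warning on 32-bit ones. */
static void check_request_size(size_t sz)
{
    assert((uint64_t)sz <= INT64_MAX);
}
```

The cast is a no-op on 64-bit hosts, where size_t is already 64 bits wide, so the check loses nothing there.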
Re: Moving QEMU downloads to GitLab Releases?
On Mon, Oct 11, 2021 at 08:28:34AM -0600, Warner Losh wrote: > On Mon, Oct 11, 2021 at 4:59 AM Stefan Hajnoczi wrote: > > > On Mon, Oct 11, 2021 at 09:21:30AM +0200, Gerd Hoffmann wrote: > > > Hi, > > > > > > > > I guess the main question is who is using the ROM/BIOS sources in the > > > > > tarballs, and would this 2-step process work for those users? If > > there > > > > > are distros relying on them then maybe there are some more logistics > > to > > > > > consider since the make-release downloads are both time-consuming and > > > > > prone to network errors/stalls since it relies on the availability > > of a > > > > > good number of different hosts. > > > > > > > > Great, maybe Gerd can comment on splitting out firmware. > > > > > > I think the binaries are mostly a convenience feature for developers. > > > Distros typically build from source anyway, and typically they build > > > from upstream source instead of qemu submodule. > > > > > > The idea to split them out to a separate repo is exists for quite a > > > while. I have an old qemu-firmware repo laying around on my disk, which > > > basically moves roms/ + submodules and the binaries built from it over. > > > > > > I think with the switch to gitlab it doesn't make sense any more to > > > commit pre-built firmware binaries to a git repo. Instead we should > > > build the firmware in gitlab ci, i.e. effectively move the build rules > > > we have right now in roms/Makefile to .gitlab-ci.d/, then publish the > > > firmware binaries as build artifacts or gitlab pages. > > > > > > When done just remove the pre-build binaries from git and add a helper > > > script to fetch firmware binaries from gitlab instead. > > > > > > > QEMU mirrors firmware sources on GitLab too. We're able to provide the > > > > same level of download availability on our mirror firmware repos as for > > > > the main qemu.git repo. 
> > > > > > I think enabling CI for the firmware mirrors should work given that it > > > is possible to specify a custom CI/CD configuration file, i.e. use > > > something like > > > > > > /qemu-project/qemu/.gitlab-ci.d/firmware/seabios.yml > > > > > > or > > > > > > /qemu-project/qemu-firmware/seabios.yml > > > > > > as config file for the > > > > > > /qemu-project/seabios/ > > > > > > mirror. Then we can publish binaries using gitlab pages at > > > > > > https://qemu-project.gitlab.io/seabios/ > > > > > > That way we also don't need the roms/ submodules any more, i.e. we > > > can remove both sources and binaries for the firmware from the > > > release tarballs. > > > > Thanks! > > > > For developer convenience it would be nice to avoid manual steps after > > git clone qemu.git. Maybe ./configure should check for firmware blobs > > and automatically download them. There could be a ./configure > > --disable-firmware-download option that distros can use to skip the > > download when building everything from source. > > > > One thing to keep in mind about the downstream consumers: Many > of them have two phases to their build process that run asynchronously > to each other. There is a fetch phase that grabs everything, and a build > phase that builds the binaries. The second phase may not have access > to the internet for a variety of reasons (these days being a security > measure, for good or ill). Please make sure that any such plans > allow for this model. > > Today, that's all dealt with by grabbing tarballs from github which > is how the submodules are dealt with. So long as the images > had well known names, and could be fetched with the normal > wget/cgit/fetch programs, that would suffice. Requiring use of > some weird bespoke script would cause friction for the downstreams > using qemu. > > So while I'm all for making things a little more independent, > let's not do it in a way that makes life difficult for downstreams. 
> There are more there than just a couple of big distro builders. I think this is fine. Firmware blobs aren't needed to build QEMU, only to run the system emulator. Stefan
Re: [PATCH 5/8] linux-user: Support TCG_TARGET_SIGNED_ADDR32
On 10/11/21 3:22 AM, Alex Bennée wrote: Richard Henderson writes: When using reserved_va, which is the default for a 64-bit host and a 32-bit guest, set guest_base_signed_addr32 if requested by TCG_TARGET_SIGNED_ADDR32, and the executable layout allows. Signed-off-by: Richard Henderson --- include/exec/cpu-all.h | 4 --- linux-user/elfload.c | 62 ++ 2 files changed, 50 insertions(+), 16 deletions(-) diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h index 80b5e17329..71d8e1de7a 100644 --- a/include/exec/cpu-all.h +++ b/include/exec/cpu-all.h @@ -278,11 +278,7 @@ extern intptr_t qemu_host_page_mask; #define PAGE_RESET 0x0040 /* For linux-user, indicates that the page is MAP_ANON. */ #define PAGE_ANON 0x0080 - -#if defined(CONFIG_BSD) && defined(CONFIG_USER_ONLY) -/* FIXME: Code that sets/uses this is broken and needs to go away. */ #define PAGE_RESERVED 0x0100 -#endif Can we reference why this FIXME is being dropped in the commit message? I'm not sure to what pbrook was referring with "... and is already broken" there. I need something here to reserve a page, PAGE_RESERVED seems like a good name, so I took it out of the cupboard. I'll do some archaeology. r~ Looking at the current tree state I can see several uses of it due to moves in 5b6dd8683d (exec: move TB handling to translate-all.c) which post-date 2e9a5713f0 (Remove PAGE_RESERVED). Otherwise looks reasonable: Reviewed-by: Alex Bennée
Re: [PATCH 3/8] accel/tcg: Support TCG_TARGET_SIGNED_ADDR32 for softmmu
On 10/10/21 9:30 PM, WANG Xuerui wrote: @@ -92,6 +93,9 @@ static inline size_t sizeof_tlb(CPUTLBDescFast *fast) static inline uintptr_t g2h_tlbe(const CPUTLBEntry *tlb, target_ulong gaddr) { +if (TCG_TARGET_SIGNED_ADDR32 && TARGET_LONG_BITS == 32) { It seems this branch's direction should always match that of the branch added below, so if TARGET_LONG_BITS == TARGET_LONG_BITS == 32 this invariant is broken? Or is this expected behavior? The conditions should match, yes. In revising the patch set I decided that the tcg backend should simply not set this flag for a 32-bit host. r~
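For readers following along, the difference between the two treatments of a 32-bit guest address can be shown with a small stand-alone sketch (the helper is illustrative only, not the actual g2h_tlbe() code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: with TCG_TARGET_SIGNED_ADDR32, a 32-bit guest
 * address is sign-extended to the host width before the guest_base
 * offset is applied; otherwise it is zero-extended.  The fast path and
 * the slow path must pick the same extension, which is the invariant
 * questioned above. */
static uint64_t extend_guest_addr(uint32_t gaddr, int signed_addr32)
{
    return signed_addr32 ? (uint64_t)(int64_t)(int32_t)gaddr
                         : (uint64_t)gaddr;
}
```

Addresses below 2 GiB extend identically either way; the two schemes only diverge once bit 31 of the guest address is set.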
Re: [PATCH 6/8] tcg/aarch64: Support TCG_TARGET_SIGNED_ADDR32
On 10/11/21 3:28 AM, Alex Bennée wrote: Richard Henderson writes: AArch64 has both sign and zero-extending addressing modes, which means that either treatment of guest addresses is equally efficient. Enabling this for AArch64 gives us testing of the feature in CI. So which guests front ends will exercise this backend? All 32-bit guests. Is this something we can exercise in 32 bit user mode tests? Yes. Which is why I enabled this for aarch64, so that we'd have a major platform testing it daily. r~
Re: [RFC v2 1/2] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
On Tue, 5 Oct 2021 10:53:12 +0200 Eric Auger wrote: > Add a 'preserve_config' field in struct GPEXConfig and > if set generate the DSM #5 for preserving PCI boot configurations. > The DSM presence is needed to expose RMRs. here should be pointers to spec and location within it where it says preserving PCI boot configuration is necessary or in absence of that a bit more detailed explanation why it's necessary. > > At the moment the DSM generation is not yet enabled. > > Signed-off-by: Eric Auger > --- > include/hw/pci-host/gpex.h | 1 + > hw/pci-host/gpex-acpi.c| 12 > 2 files changed, 13 insertions(+) > > diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h > index fcf8b63820..3f8f8ec38d 100644 > --- a/include/hw/pci-host/gpex.h > +++ b/include/hw/pci-host/gpex.h > @@ -64,6 +64,7 @@ struct GPEXConfig { > MemMapEntry pio; > int irq; > PCIBus *bus; > +boolpreserve_config; s/^^^/preserve_fw_config/ > }; > > int gpex_set_irq_num(GPEXHost *s, int index, int gsi); > diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c > index e7e162a00a..7dab259379 100644 > --- a/hw/pci-host/gpex-acpi.c > +++ b/hw/pci-host/gpex-acpi.c > @@ -164,6 +164,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig > *cfg) > aml_append(dev, aml_name_decl("_PXM", aml_int(numa_node))); > } > > +if (cfg->preserve_config) { > +method = aml_method("_DSM", 5, AML_SERIALIZED); > +aml_append(method, aml_return(aml_int(0))); > +aml_append(dev, method); > +} > + > acpi_dsdt_add_pci_route_table(dev, cfg->irq); > > /* > @@ -191,6 +197,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig > *cfg) > aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0 Device"))); > aml_append(dev, aml_name_decl("_CCA", aml_int(1))); > > +if (cfg->preserve_config) { > +method = aml_method("_DSM", 5, AML_SERIALIZED); > +aml_append(method, aml_return(aml_int(0))); > +aml_append(dev, method); > +} > + these ones seem to wrong , it adds duplicate _DSM methods with wrong ARGs number. 
virt board already has _DSM defined, see acpi_dsdt_add_pci_osc() E5C937D0-3553-4D7A-9117-EA4D19C3434D you need to modify that one (and possibly move out DSM into a separate function), also preserving config might regress what commit 40c3472a29c9a was fixing. > acpi_dsdt_add_pci_route_table(dev, cfg->irq); > > method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
Re: [PATCH v2 04/15] qom: Reduce use of error_propagate()
Kevin Wolf writes: > ERRP_GUARD() makes debugging easier by making sure that _abort > still fails at the real origin of the error instead of > error_propagate(). > > Signed-off-by: Kevin Wolf Yes, please! Reviewed-by: Markus Armbruster
Re: [PATCH v4 2/2] monitor: refactor set/expire_password and allow VNC display id
* Stefan Reiter (s.rei...@proxmox.com) wrote: > It is possible to specify more than one VNC server on the command line, > either with an explicit ID or the auto-generated ones à la "default", > "vnc2", "vnc3", ... > > It is not possible to change the password on one of these extra VNC > displays though. Fix this by adding a "display" parameter to the > "set_password" and "expire_password" QMP and HMP commands. > > For HMP, the display is specified using the "-d" value flag. > > For QMP, the schema is updated to explicitly express the supported > variants of the commands with protocol-discriminated unions. > > Suggested-by: Eric Blake > Suggested-by: Markus Armbruster > Signed-off-by: Stefan Reiter > --- > hmp-commands.hx| 24 --- > monitor/hmp-cmds.c | 57 ++- > monitor/qmp-cmds.c | 62 ++-- > qapi/ui.json | 173 + > 4 files changed, 235 insertions(+), 81 deletions(-) > > diff --git a/hmp-commands.hx b/hmp-commands.hx > index cf723c69ac..d78e4cfc47 100644 > --- a/hmp-commands.hx > +++ b/hmp-commands.hx > @@ -1514,33 +1514,35 @@ ERST > > { > .name = "set_password", > -.args_type = "protocol:s,password:s,connected:s?", > -.params = "protocol password action-if-connected", > +.args_type = "protocol:s,password:s,display:-dS,connected:s?", > +.params = "protocol password [-d display] [action-if-connected]", > .help = "set spice/vnc password", > .cmd= hmp_set_password, > }, > > SRST > -``set_password [ vnc | spice ] password [ action-if-connected ]`` > - Change spice/vnc password. *action-if-connected* specifies what > - should happen in case a connection is established: *fail* makes the > - password change fail. *disconnect* changes the password and > +``set_password [ vnc | spice ] password [ -d display ] [ action-if-connected > ]`` > + Change spice/vnc password. *display* can be used with 'vnc' to specify > + which display to set the password on. 
*action-if-connected* specifies > + what should happen in case a connection is established: *fail* makes > + the password change fail. *disconnect* changes the password and >disconnects the client. *keep* changes the password and keeps the >connection up. *keep* is the default. > ERST > > { > .name = "expire_password", > -.args_type = "protocol:s,time:s", > -.params = "protocol time", > +.args_type = "protocol:s,time:s,display:-dS", > +.params = "protocol time [-d display]", > .help = "set spice/vnc password expire-time", > .cmd= hmp_expire_password, > }, > > SRST > -``expire_password [ vnc | spice ]`` *expire-time* > - Specify when a password for spice/vnc becomes > - invalid. *expire-time* accepts: > +``expire_password [ vnc | spice ] expire-time [ -d display ]`` > + Specify when a password for spice/vnc becomes invalid. > + *display* behaves the same as in ``set_password``. > + *expire-time* accepts: > >``now`` > Invalidate password instantly. > diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c > index b5e71d9e6f..48b3fe4519 100644 > --- a/monitor/hmp-cmds.c > +++ b/monitor/hmp-cmds.c > @@ -1451,10 +1451,41 @@ void hmp_set_password(Monitor *mon, const QDict > *qdict) > { > const char *protocol = qdict_get_str(qdict, "protocol"); > const char *password = qdict_get_str(qdict, "password"); > +const char *display = qdict_get_try_str(qdict, "display"); > const char *connected = qdict_get_try_str(qdict, "connected"); > Error *err = NULL; > +DisplayProtocol proto; > > -qmp_set_password(protocol, password, !!connected, connected, ); > +SetPasswordOptions opts = { > +.password = g_strdup(password), You're leaking that strdup on the error returns; you probably want to add an error: exit and goto it to do all the cleanup. 
> +.u.vnc.display = NULL, > +}; > + > +proto = qapi_enum_parse(_lookup, protocol, > +DISPLAY_PROTOCOL_VNC, ); > +if (err) { > +hmp_handle_error(mon, err); > +return; > +} > +opts.protocol = proto; > + > +if (proto == DISPLAY_PROTOCOL_VNC) { > +opts.u.vnc.has_display = !!display; > +opts.u.vnc.display = g_strdup(display); > +} else if (proto == DISPLAY_PROTOCOL_SPICE) { > +opts.u.spice.has_connected = !!connected; > +opts.u.spice.connected = > +qapi_enum_parse(_lookup, connected, > +SET_PASSWORD_ACTION_KEEP, ); > +if (err) { > +hmp_handle_error(mon, err); > +return; > +} > +} > + > +qmp_set_password(, ); > +g_free(opts.password); > +g_free(opts.u.vnc.display); > hmp_handle_error(mon, err); > } > > @@ -1462,9 +1493,31 @@ void hmp_expire_password(Monitor
Re: Moving QEMU downloads to GitLab Releases?
On Fri, Oct 1, 2021 at 12:20 AM Stefan Hajnoczi wrote: > On Thu, Sep 30, 2021 at 03:57:49PM +, Eldon Stegall wrote: > > Hello! > > I'd be happy to help with this. I'm mostly a consumer of QEMU, but > > greatly appreciate all the work this community has done, and was able > > to contribute a little by helping with QEMU advent this past year. I > > would be happy to help streamline some of this activities if that would > > be welcome, and would gratefully contribute time and resources. Hosting > > and serving data like this has been core to my recent experience. > > > > I would be happy to suggest and build out a distribution strategy for > > these packages, and believe I could cut some costs, and even convince a > > small consultancy I am a part of here that uses QEMU to foot a > > reasonable bill. > > > > A brief introduction, since I haven't had the pleasure of attending > > FOSDEM or any other QEMU meetups: I am a startup-oriented Cloud Security > > Architect, based out of Atlanta, previously with companies like > > DataStax, but now working on AWS video pipelines for a startup here. > > Thanks for joining the discussion and for running last year's QEMU > Advent Calendar, Eldon. > > Any ideas for moving download.qemu.org to a hosted service would be > appreciated! We haven't compared CDN and cloud providers closely yet. If > you have experience in this area or time to check them out, then that > would be valuable. > > QEMU has funds if there is a cost for file hosting (probably less than > $100/month). Some providers may be willing to support an open source > project for free. Possible providers include CloudFlare, Akamai, Fastly, > Microsoft Azure, Google Cloud Storage, etc. > https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/ Let me know if ya'll apply and I'm happy to push it through. Regards, Anthony Liguori
RE: [PATCH v2] vmdk: allow specification of tools version
Hi Hanna, > -Original Message- > From: Hanna Reitz > Sent: Monday, October 11, 2021 4:09 PM > Subject: Re: [PATCH v2] vmdk: allow specification of tools version > > On 13.09.21 15:04, Thomas Weißschuh wrote: > > VMDK files support an attribute that represents the version of the > > guest tools that are installed on the disk. > > This attribute is used by vSphere before a machine has been started to > > determine if the VM has the guest tools installed. > > This is important when configuring "Operating system customizations" > > in vSphere, as it checks for the presence of the guest tools before > > allowing those customizations. > > Thus when the VM has not yet booted normally it would be impossible to > > customize it, therefore preventing a customized first-boot. > > > > The attribute should not hurt on disks that do not have the guest > > tools installed and indeed the VMware tools also unconditionally add > > this attribute. > > (Defaulting to the value "2147483647", as is done in this patch) > > > > Signed-off-by: Thomas Weißschuh > > --- > > > > v1: > > https://lore.kernel.org/qemu-devel/20210908174250.12946-1-thomas.weissschuh.ext@zeiss.com/ > > > > v1 -> v2: > > * Expand QAPI docs (Eric Blake) > > > > block/vmdk.c | 24 > > qapi/block-core.json | 3 +++ > > 2 files changed, 23 insertions(+), 4 deletions(-) > > [...] > > > diff --git a/qapi/block-core.json b/qapi/block-core.json index > > c8ce1d9d5d..42431f52d0 100644 > > --- a/qapi/block-core.json > > +++ b/qapi/block-core.json > > @@ -4597,6 +4597,8 @@ > > # @adapter-type: The adapter type used to fill in the descriptor. Default: ide. 
> > # @hwversion: Hardware version. The meaningful options are "4" or "6". > > # Default: "4". > > +# @toolsversion: VMware guest tools version. > > + Default: "2147483647" (Since 6.2) > > There's a # missing at the start of the line, and so this doesn't compile. > > I've added it (hope that's OK for you) and taken this patch to my block > branch: Absolutely, thanks for noticing and fixing. > https://gitlab.com/hreitz/qemu/-/commits/block/ Thomas
Re: [PATCH] s390x/ipl: check kernel command line size
Thomas Huth writes: > On 06/10/2021 11.26, Marc Hartmayer wrote: >> Check if the provided kernel command line exceeds the maximum size of the >> s390x >> Linux kernel command line size, which is 896 bytes. >> >> Reported-by: Sven Schnelle >> Signed-off-by: Marc Hartmayer >> --- >> hw/s390x/ipl.c | 12 +++- >> 1 file changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/hw/s390x/ipl.c b/hw/s390x/ipl.c >> index 1821c6faeef3..a58ea58cc736 100644 >> --- a/hw/s390x/ipl.c >> +++ b/hw/s390x/ipl.c >> @@ -38,6 +38,7 @@ >> #define KERN_IMAGE_START0x01UL >> #define LINUX_MAGIC_ADDR0x010008UL >> #define KERN_PARM_AREA 0x010480UL >> +#define KERN_PARM_AREA_SIZE 0x000380UL >> #define INITRD_START0x80UL >> #define INITRD_PARM_START 0x010408UL >> #define PARMFILE_START 0x001000UL >> @@ -190,10 +191,19 @@ static void s390_ipl_realize(DeviceState *dev, Error >> **errp) >>* loader) and it won't work. For this case we force it to >> 0x1, too. >>*/ >> if (pentry == KERN_IMAGE_START || pentry == 0x800) { >> -char *parm_area = rom_ptr(KERN_PARM_AREA, strlen(ipl->cmdline) >> + 1); >> +size_t cmdline_size = strlen(ipl->cmdline) + 1; >> +char *parm_area = rom_ptr(KERN_PARM_AREA, cmdline_size); >> + >> ipl->start_addr = KERN_IMAGE_START; >> /* Overwrite parameters in the kernel image, which are "rom" */ >> if (parm_area) { >> +if (cmdline_size > KERN_PARM_AREA_SIZE) { >> +error_setg(errp, >> + "kernel command line exceeds maximum size: >> %lu > %lu", > > I think the first %lu should be %zd instead? Yep, makes sense - thanks! > > Apart from that, the patch looks fine to me... so if you agree, I can fix > that up when picking up the patch. Thanks. 
> > Thomas > > >> + cmdline_size, KERN_PARM_AREA_SIZE); >> +return; >> +} >> + >> strcpy(parm_area, ipl->cmdline); >> } >> } else { >> > -- Kind regards / Beste Grüße Marc Hartmayer IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Gregor Pillen Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294
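To round off the format-specifier point raised above: size_t wants %zu (and ssize_t wants %zd); %lu only happens to match on LP64 hosts. A stand-alone sketch of the bounds check with the portable specifier (illustrative names, not the ipl.c code):

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define PARM_AREA_SIZE ((size_t)0x380)   /* 896 bytes, as in the patch */

/* Returns -1 and formats a diagnostic if the command line would overflow
 * the kernel parameter area; %zu is the portable specifier for size_t. */
static int check_cmdline_size(size_t cmdline_size, char *msg, size_t msglen)
{
    if (cmdline_size > PARM_AREA_SIZE) {
        snprintf(msg, msglen,
                 "kernel command line exceeds maximum size: %zu > %zu",
                 cmdline_size, PARM_AREA_SIZE);
        return -1;
    }
    return 0;
}
```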