Re: [PATCH 1/2] block: pass BlockDriver reference to the .bdrv_co_create
On 3/25/20 8:12 PM, Maxim Levitsky wrote:
> This will allow to reuse a single generic .bdrv_co_create

"allow to ${verb}" is not idiomatic; better is "allow ${subject} to ${verb}" or "allow ${verb}ing". In this case, I'd go with: "This will allow the reuse of a single..."

> implementation for several drivers. No functional changes.
>
> Signed-off-by: Maxim Levitsky
> ---

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
RE: [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API
Hi Eric, I'm also considering how to inject iommu fault to vIOMMU. As our previous discussion (long time ago), MemoryRegion way doesn't work well for VTd case. So I'd like see your opinion on the proposal below: I've a patch to make vIOMMUs register PCIIOMMUOps to PCI layer. Current usage is to get address space and set/unset HostIOMMUContext (added by me). I think it may be also nice to add the fault injection callback in the PCIIOMMUOps. Thoughts? https://patchwork.ozlabs.org/patch/1259645/ Regards, Yi Liu > From: Eric Auger > Sent: Saturday, March 21, 2020 12:58 AM > To: eric.auger@gmail.com; eric.au...@redhat.com; qemu-devel@nongnu.org; > Subject: [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults > API > > This new API allows to inject @count iommu_faults into > the IOMMU memory region. > > Signed-off-by: Eric Auger > --- > include/exec/memory.h | 25 + > memory.c | 10 ++ > 2 files changed, 35 insertions(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index f2c773163f..141a5dc197 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -57,6 +57,8 @@ struct MemoryRegionMmio { > CPUWriteMemoryFunc *write[3]; > }; > > +struct iommu_fault; > + > typedef struct IOMMUTLBEntry IOMMUTLBEntry; > > /* See address_space_translate: bit 0 is read, bit 1 is write. 
*/ > @@ -357,6 +359,19 @@ typedef struct IOMMUMemoryRegionClass { > * @iommu: the IOMMUMemoryRegion > */ > int (*num_indexes)(IOMMUMemoryRegion *iommu); > + > +/* > + * Inject @count faults into the IOMMU memory region > + * > + * Optional method: if this method is not provided, then > + * memory_region_injection_faults() will return -ENOENT > + * > + * @iommu: the IOMMU memory region to inject the faults in > + * @count: number of faults to inject > + * @buf: fault buffer > + */ > +int (*inject_faults)(IOMMUMemoryRegion *iommu, int count, > + struct iommu_fault *buf); > } IOMMUMemoryRegionClass; > > typedef struct CoalescedMemoryRange CoalescedMemoryRange; > @@ -1365,6 +1380,16 @@ int > memory_region_iommu_attrs_to_index(IOMMUMemoryRegion *iommu_mr, > */ > int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr); > > +/** > + * memory_region_inject_faults : inject @count faults stored in @buf > + * > + * @iommu_mr: the IOMMU memory region > + * @count: number of faults to be injected > + * @buf: buffer containing the faults > + */ > +int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count, > +struct iommu_fault *buf); > + > /** > * memory_region_name: get a memory region's name > * > diff --git a/memory.c b/memory.c > index 09be40edd2..9cdd77e0de 100644 > --- a/memory.c > +++ b/memory.c > @@ -2001,6 +2001,16 @@ int > memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr) > return imrc->num_indexes(iommu_mr); > } > > +int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count, > +struct iommu_fault *buf) > +{ > +IOMMUMemoryRegionClass *imrc = > IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr); > +if (!imrc->inject_faults) { > +return -ENOENT; > +} > +return imrc->inject_faults(iommu_mr, count, buf); > +} > + > void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client) > { > uint8_t mask = 1 << client; > -- > 2.20.1
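The dispatch pattern this patch adds — an optional class method that degrades to -ENOENT when a vIOMMU model doesn't implement it — can be sketched in isolation. The types below (IOMMURegion, IOMMURegionOps, count_faults) are illustrative stand-ins, not QEMU's real IOMMUMemoryRegionClass:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for QEMU's IOMMUMemoryRegion types. */
struct iommu_fault { int reason; };

typedef struct IOMMURegionOps {
    /* Optional: may be NULL if the IOMMU model cannot inject faults. */
    int (*inject_faults)(void *iommu, int count, struct iommu_fault *buf);
} IOMMURegionOps;

typedef struct IOMMURegion {
    const IOMMURegionOps *ops;
} IOMMURegion;

/* Mirrors memory_region_inject_faults(): absent callback -> -ENOENT. */
int iommu_region_inject_faults(IOMMURegion *mr, int count,
                               struct iommu_fault *buf)
{
    if (!mr->ops->inject_faults) {
        return -ENOENT;
    }
    return mr->ops->inject_faults(mr, count, buf);
}

/* A toy implementation that just reports how many faults it was given. */
static int count_faults(void *iommu, int count, struct iommu_fault *buf)
{
    (void)iommu; (void)buf;
    return count;
}

static const IOMMURegionOps with_cb    = { .inject_faults = count_faults };
static const IOMMURegionOps without_cb = { .inject_faults = NULL };
static IOMMURegion region_a = { .ops = &with_cb };
static IOMMURegion region_b = { .ops = &without_cb };
```

Callers can thus treat fault injection uniformly and simply propagate the -ENOENT when the underlying model lacks support.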
[Bug 1867519] Re: qemu 4.2 segfaults on VF detach
** Merge proposal linked:
   https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/381033

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1867519

Title: qemu 4.2 segfaults on VF detach

Status in QEMU: Fix Committed
Status in qemu package in Ubuntu: Fix Released

Bug description:
  After updating Ubuntu 20.04 to the Beta version, we get the following error and the virtual machine gets stuck when detaching PCI devices using the virsh command:

  Error: error: Failed to detach device from /tmp/vf_interface_attached.xml
  error: internal error: End of file from qemu monitor

  Steps to reproduce:
  1. create a VM over Ubuntu 20.04 (5.4.0-14-generic)
  2. attach a PCI device to this VM (a Mellanox VF, for example)
  3. try to detach the PCI device using the virsh command:
     a. create a pci interface xml file:
     b. #virsh detach-device

  - Ubuntu release:
    Description: Ubuntu Focal Fossa (development branch)
    Release: 20.04
  - Package ver:
    libvirt0:
      Installed: 6.0.0-0ubuntu3
      Candidate: 6.0.0-0ubuntu5
      Version table:
        6.0.0-0ubuntu5 500
          500 http://il.archive.ubuntu.com/ubuntu focal/main amd64 Packages
        *** 6.0.0-0ubuntu3 100
          100 /var/lib/dpkg/status
  - What you expected to happen: PCI device detached without any errors.
  - What happened instead: getting the errors above and the VM is stuck.

  Additional info: after downgrading the libvirt0 package and all the dependent packages to 5.4, the previous version, it seems that the issue disappeared.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1867519/+subscriptions
Re: [PATCH v2] block: make BlockConf.*_size properties 32-bit
On 3/26/20 1:52 AM, Roman Kagan wrote:
> Devices (virtio-blk, scsi, etc.) and the block layer are happy to use
> 32-bit for logical_block_size, physical_block_size, and min_io_size.
> However, the properties in BlockConf are defined as uint16_t, limiting
> the values to 32768. This appears unnecessarily tight, and we've seen
> bigger block sizes handy at times.
>
> Make them 32 bit instead and lift the limitation up to 2 MiB, which
> appears to be good enough for everybody.

and matches the current qemu limit for qcow2 cluster sizes

> As the values can now be fairly big and awkward to type, make the
> property setter accept common size suffixes (k, m).
>
> Signed-off-by: Roman Kagan
> ---
> v1 -> v2:
> - cap the property at 2 MiB [Eric]
> - accept size suffixes
>
> +++ b/hw/core/qdev-properties.c
> @@ -14,6 +14,7 @@
>  #include "qapi/visitor.h"
>  #include "chardev/char.h"
>  #include "qemu/uuid.h"
> +#include "qemu/units.h"
>
>  void qdev_prop_set_after_realize(DeviceState *dev, const char *name,
>                                   Error **errp)
> @@ -729,30 +730,39 @@ const PropertyInfo qdev_prop_pci_devfn = {
>  /* --- blocksize --- */
>
> +/* lower limit is sector size */
> +#define MIN_BLOCK_SIZE 512
> +#define MIN_BLOCK_SIZE_STR "512 B"
> +/* upper limit is arbitrary, 2 MiB looks sufficient */
> +#define MAX_BLOCK_SIZE (2 * MiB)
> +#define MAX_BLOCK_SIZE_STR "2 MiB"

For this comment, I might add that it matches our limit for qcow2 cluster size.

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
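The property semantics discussed above (512-byte lower bound, 2 MiB cap, k/m suffixes) can be sketched as a standalone parser. This is an illustration of the intended behaviour, not QEMU's actual qemu_strtosz()-based setter; the power-of-two check is an extra sanity assumption here, since block sizes are conventionally powers of two:

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative limits from the patch discussion. */
#define MIN_BLOCK_SIZE 512
#define MAX_BLOCK_SIZE (2 * 1024 * 1024)   /* 2 MiB, matches qcow2 cluster cap */

/*
 * Parse a block-size string with an optional k/m suffix and range-check it.
 * Returns the value in bytes, or 0 on any error.
 */
static uint32_t parse_blocksize(const char *str)
{
    char *end;
    unsigned long long val = strtoull(str, &end, 10);

    switch (tolower((unsigned char)*end)) {
    case 'k': val *= 1024; end++; break;
    case 'm': val *= 1024 * 1024; end++; break;
    case '\0': break;
    default: return 0;
    }
    if (*end != '\0') {
        return 0;                      /* trailing garbage after suffix */
    }
    if (val < MIN_BLOCK_SIZE || val > MAX_BLOCK_SIZE) {
        return 0;                      /* outside the 512 B .. 2 MiB window */
    }
    if (val & (val - 1)) {
        return 0;                      /* block sizes must be powers of two */
    }
    return (uint32_t)val;
}
```

With this, "4k" and "2m" are accepted while "4m" (above the cap) and anything below 512 bytes are rejected.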
Re: [PATCH v1 20/22] intel_iommu: propagate PASID-based iotlb invalidation to host
On Thu, Mar 26, 2020 at 05:41:39AM +, Liu, Yi L wrote: > > From: Liu, Yi L > > Sent: Wednesday, March 25, 2020 9:22 PM > > To: 'Peter Xu' > > Subject: RE: [PATCH v1 20/22] intel_iommu: propagate PASID-based iotlb > > invalidation to host > > > > > From: Peter Xu > > > Sent: Wednesday, March 25, 2020 2:34 AM > > > To: Liu, Yi L > > > Subject: Re: [PATCH v1 20/22] intel_iommu: propagate PASID-based iotlb > > > invalidation to host > > > > > > On Sun, Mar 22, 2020 at 05:36:17AM -0700, Liu Yi L wrote: > > > > This patch propagates PASID-based iotlb invalidation to host. > > > > > > > > Intel VT-d 3.0 supports nested translation in PASID granular. > > > > Guest SVA support could be implemented by configuring nested > > > > translation on specific PASID. This is also known as dual stage DMA > > > > translation. > > > > > > > > Under such configuration, guest owns the GVA->GPA translation which > > > > is configured as first level page table in host side for a specific > > > > pasid, and host owns GPA->HPA translation. As guest owns first level > > > > translation table, piotlb invalidation should be propagated to host > > > > since host IOMMU will cache first level page table related mappings > > > > during DMA address translation. > > > > > > > > This patch traps the guest PASID-based iotlb flush and propagate it > > > > to host. 
> > > > > > > > Cc: Kevin Tian > > > > Cc: Jacob Pan > > > > Cc: Peter Xu > > > > Cc: Yi Sun > > > > Cc: Paolo Bonzini > > > > Cc: Richard Henderson > > > > Cc: Eduardo Habkost > > > > Signed-off-by: Liu Yi L > > > > --- > > > > hw/i386/intel_iommu.c | 139 > > > + > > > > hw/i386/intel_iommu_internal.h | 7 +++ > > > > 2 files changed, 146 insertions(+) > > > > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index > > > > b9ac07d..10d314d 100644 > > > > --- a/hw/i386/intel_iommu.c > > > > +++ b/hw/i386/intel_iommu.c > > > > @@ -3134,15 +3134,154 @@ static bool > > > vtd_process_pasid_desc(IntelIOMMUState *s, > > > > return (ret == 0) ? true : false; } > > > > > > > > +/** > > > > + * Caller of this function should hold iommu_lock. > > > > + */ > > > > +static void vtd_invalidate_piotlb(IntelIOMMUState *s, > > > > + VTDBus *vtd_bus, > > > > + int devfn, > > > > + DualIOMMUStage1Cache > > > > +*stage1_cache) { > > > > +VTDHostIOMMUContext *vtd_dev_icx; > > > > +HostIOMMUContext *host_icx; > > > > + > > > > +vtd_dev_icx = vtd_bus->dev_icx[devfn]; > > > > +if (!vtd_dev_icx) { > > > > +goto out; > > > > +} > > > > +host_icx = vtd_dev_icx->host_icx; > > > > +if (!host_icx) { > > > > +goto out; > > > > +} > > > > +if (host_iommu_ctx_flush_stage1_cache(host_icx, stage1_cache)) { > > > > +error_report("Cache flush failed"); > > > > > > I think this should not easily be triggered by the guest, but just in > > > case... Let's use > > > error_report_once() to be safe. > > > > Agreed. > > > > > > +} > > > > +out: > > > > +return; > > > > +} > > > > + > > > > +static inline bool vtd_pasid_cache_valid( > > > > + VTDPASIDAddressSpace *vtd_pasid_as) { > > > > +return vtd_pasid_as->iommu_state && ^ > > > > > > This check can be dropped because always true? > > > > > > If you agree with both the changes, please add: > > > > > > Reviewed-by: Peter Xu > > > > I think the code should ensure all the pasid_as in hash table is valid. 
And we can, since all the operations are under the protection of iommu_lock.

> Peter,
>
> I think my reply was wrong. pasid_as in the hash table may be stale since
> the per-pasid_as cache_gen may not be identical with the cache_gen
> in iommu_state. e.g. vtd_pasid_cache_reset() only increases the
> cache_gen in iommu_state. So there will be pasid_as in the hash table
> which have a cached pasid entry but whose cache_gen is not equal to the
> one in iommu_state. For such pasid_as, we should treat them as stale.
> So I guess the vtd_pasid_cache_valid() is still necessary.

I guess you misread my comment. :) I was saying the "vtd_pasid_as->iommu_state" check is not needed, because iommu_state is always set if the address space is created. vtd_pasid_cache_valid() is needed.

Also, please double-confirm that vtd_pasid_cache_reset() should drop all the address spaces (as I think it should), not "only increase the cache_gen". IMHO you should increase the cache_gen only in the PSI hook (vtd_pasid_cache_psi()).

Thanks,

-- 
Peter Xu
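The cache_gen scheme being debated can be modelled in a few lines. Instead of walking the hash table on every reset, the owner bumps a global generation counter and entries whose recorded generation lags behind are treated as stale. The names below are illustrative, not the actual VT-d structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct IOMMUState { uint32_t cache_gen; } IOMMUState;

typedef struct PASIDEntry {
    IOMMUState *state;
    uint32_t cache_gen;     /* generation at the time the entry was filled */
} PASIDEntry;

/* An entry is valid only if its generation matches the global one. */
static bool pasid_entry_valid(const PASIDEntry *e)
{
    return e->state && e->cache_gen == e->state->cache_gen;
}

static void pasid_entry_fill(PASIDEntry *e, IOMMUState *s)
{
    e->state = s;
    e->cache_gen = s->cache_gen;
}

/* Invalidate every cached entry in O(1) by bumping the generation. */
static void pasid_cache_reset(IOMMUState *s)
{
    s->cache_gen++;
}
```

This is why a lagging cache_gen makes the validity check necessary: after pasid_cache_reset(), stale entries still sit in the table until they are lazily noticed, which is the trade-off Peter questions above (lazy invalidation versus actually dropping the address spaces).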
RE: [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice
Hi Eric, Not sure about your preference. I've modified my patch as below, which uses HostIOMMUContext to provide callbacks for vIOMMU to call into VFIO. Please feel free to give your suggestions. https://patchwork.ozlabs.org/patch/1259665/ Regards, Yi Liu > From: Eric Auger > Sent: Saturday, March 21, 2020 12:58 AM > To: eric.auger@gmail.com; eric.au...@redhat.com; qemu-devel@nongnu.org; > Subject: [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice > > From: Liu Yi L > > This patch introduces PCIPASIDOps for IOMMU related operations. > > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00940.html > > So far, to set up virt-SVA for an assigned SVA-capable device, one needs to > configure host translation structures for a specific pasid (e.g. bind > guest page table to host and enable nested translation in host). > Besides, the vIOMMU emulator needs to forward the guest's cache invalidation > to the host since host nested translation is enabled. e.g. on VT-d, the guest > owns the 1st level translation table, thus cache invalidation for the 1st > level should be propagated to the host. > > This patch adds two functions: alloc_pasid and free_pasid to support > guest pasid allocation and free. The implementations of the callbacks > would be device passthrough modules, like vfio. 
> > Cc: Kevin Tian > Cc: Jacob Pan > Cc: Peter Xu > Cc: Eric Auger > Cc: Yi Sun > Cc: David Gibson > Signed-off-by: Liu Yi L > Signed-off-by: Yi Sun > --- > hw/pci/pci.c | 34 ++ > include/hw/pci/pci.h | 11 +++ > 2 files changed, 45 insertions(+) > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > index e1ed6677e1..67e03b8db1 100644 > --- a/hw/pci/pci.c > +++ b/hw/pci/pci.c > @@ -2695,6 +2695,40 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, > void *opaque) > bus->iommu_opaque = opaque; > } > > +void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops) > +{ > +assert(ops && !dev->pasid_ops); > +dev->pasid_ops = ops; > +} > + > +bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn) > +{ > +PCIDevice *dev; > + > +if (!bus) { > +return false; > +} > + > +dev = bus->devices[devfn]; > +return !!(dev && dev->pasid_ops); > +} > + > +int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn, > + IOMMUConfig *config) > +{ > +PCIDevice *dev; > + > +if (!bus) { > +return -EINVAL; > +} > + > +dev = bus->devices[devfn]; > +if (dev && dev->pasid_ops && dev->pasid_ops->set_pasid_table) { > +return dev->pasid_ops->set_pasid_table(bus, devfn, config); > +} > +return -ENOENT; > +} > + > static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque) > { > Range *range = opaque; > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h > index cfedf5a995..2146cb7519 100644 > --- a/include/hw/pci/pci.h > +++ b/include/hw/pci/pci.h > @@ -8,6 +8,7 @@ > #include "hw/isa/isa.h" > > #include "hw/pci/pcie.h" > +#include "hw/iommu/iommu.h" > > extern bool pci_available; > > @@ -264,6 +265,11 @@ struct PCIReqIDCache { > }; > typedef struct PCIReqIDCache PCIReqIDCache; > > +struct PCIPASIDOps { > +int (*set_pasid_table)(PCIBus *bus, int32_t devfn, IOMMUConfig *config); > +}; > +typedef struct PCIPASIDOps PCIPASIDOps; > + > struct PCIDevice { > DeviceState qdev; > bool partially_hotplugged; > @@ -357,6 +363,7 @@ struct PCIDevice { > > /* ID of standby device in net_failover pair 
*/ > char *failover_pair_id; > +PCIPASIDOps *pasid_ops; > }; > > void pci_register_bar(PCIDevice *pci_dev, int region_num, > @@ -490,6 +497,10 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void > *, int); > AddressSpace *pci_device_iommu_address_space(PCIDevice *dev); > void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque); > > +void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops); > +bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn); > +int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn, IOMMUConfig > *config); > + > static inline void > pci_set_byte(uint8_t *config, uint8_t val) > { > -- > 2.20.1
RE: [RFC v6 01/24] update-linux-headers: Import iommu.h
> From: Eric Auger
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger@gmail.com; eric.au...@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 01/24] update-linux-headers: Import iommu.h
>
> Update the script to import the new iommu.h uapi header.
>
> Signed-off-by: Eric Auger
> ---
>  scripts/update-linux-headers.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
> index 29c27f4681..5b64ee3912 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -141,7 +141,7 @@ done
>
>  rm -rf "$output/linux-headers/linux"
>  mkdir -p "$output/linux-headers/linux"
> -for header in kvm.h vfio.h vfio_ccw.h vhost.h \
> +for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
>      psci.h psp-sev.h userfaultfd.h mman.h; do
>      cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
> done

Hi Eric,

This patch already got an Acked-by from Cornelia. :-)
https://patchwork.ozlabs.org/patch/1259643/

Regards,
Yi Liu
[Bug 1868116] Re: QEMU monitor no longer works
Thank you for investigating this. I would bisect QEMU, but wouldn't investigate its libraries. Consequently, I would never find the cause of this problem. For now, I am using -monitor telnet:127.0.0.1:5,server,nowait to have access to the monitor on QEMU guests. -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1868116 Title: QEMU monitor no longer works Status in QEMU: New Status in qemu package in Ubuntu: Triaged Status in vte2.91 package in Ubuntu: New Bug description: Repro: VTE $ meson _build && ninja -C _build && ninja -C _build install qemu: $ ../configure --python=/usr/bin/python3 --disable-werror --disable-user --disable-linux-user --disable-docs --disable-guest-agent --disable-sdl --enable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-fdt --disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi --disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir --disable-seccomp --disable-glusterfs --disable-tpm --disable-numa --disable-opengl --disable-virglrenderer --disable-xfsctl --disable-vxhs --disable-slirp --disable-blobs --target-list=x86_64-softmmu --disable-rdma --disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock --disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user --disable-spice --disable-qom-cast-debug --disable-vxhs --disable-bochs --disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat --disable-qed --disable-parallels --disable-sheepdog --disable-avx2 --disable-nettle --disable-gnutls --disable-capstone --disable-tools --disable-libpmem --disable-iconv --disable-cap-ng $ make Test: $ LD_LIBRARY_PATH=/usr/local/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH ./build/x86_64-softmmu/qemu-system-x86_64 -enable-kvm --drive media=cdrom,file=http://archive.ubuntu.com/ubuntu/dists/bionic/main/installer-amd64/current/images/netboot/mini.iso - switch to monitor with 
CTRL+ALT+2 - try to enter something Affects head of both usptream git repos. --- original bug --- It was observed that the QEMU console (normally accessible using Ctrl+Alt+2) accepts no input, so it can't be used. This is being problematic because there are cases where it's required to send commands to the guest, or key combinations that the host would grab (as Ctrl-Alt-F1 or Alt-F4). ProblemType: Bug DistroRelease: Ubuntu 20.04 Package: qemu 1:4.2-3ubuntu2 Uname: Linux 5.6.0-rc6+ x86_64 ApportVersion: 2.20.11-0ubuntu20 Architecture: amd64 CurrentDesktop: XFCE Date: Thu Mar 19 12:16:31 2020 Dependencies: InstallationDate: Installed on 2017-06-13 (1009 days ago) InstallationMedia: Xubuntu 17.04 "Zesty Zapus" - Release amd64 (20170412) KvmCmdLine: COMMAND STAT EUID RUID PIDPPID %CPU COMMAND qemu-system-x86 Sl+ 1000 1000 34275 25235 29.2 qemu-system-x86_64 -m 4G -cpu Skylake-Client -device virtio-vga,virgl=true,xres=1280,yres=720 -accel kvm -device nec-usb-xhci -serial vc -serial stdio -hda /home/usuario/Sistemas/androidx86.img -display gtk,gl=on -device usb-audio kvm-nx-lpage-re S0 0 34284 2 0.0 [kvm-nx-lpage-re] kvm-pit/34275 S0 0 34286 2 0.0 [kvm-pit/34275] MachineType: LENOVO 80UG ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.6.0-rc6+ root=UUID=6b4ae5c0-c78c-49a6-a1ba-029192618a7a ro quiet ro kvm.ignore_msrs=1 kvm.report_ignored_msrs=0 kvm.halt_poll_ns=0 kvm.halt_poll_ns_grow=0 i915.enable_gvt=1 i915.fastboot=1 cgroup_enable=memory swapaccount=1 zswap.enabled=1 zswap.zpool=z3fold resume=UUID=a82e38a0-8d20-49dd-9cbd-de7216b589fc log_buf_len=16M usbhid.quirks=0x0079:0x0006:0x10 config_scsi_mq_default=y scsi_mod.use_blk_mq=1 mtrr_gran_size=64M mtrr_chunk_size=64M nbd.nbds_max=2 nbd.max_part=63 SourcePackage: qemu UpgradeStatus: Upgraded to focal on 2019-12-22 (87 days ago) dmi.bios.date: 08/09/2018 dmi.bios.vendor: LENOVO dmi.bios.version: 0XCN45WW dmi.board.asset.tag: NO Asset Tag dmi.board.name: Toronto 4A2 dmi.board.vendor: LENOVO dmi.board.version: SDK0J40679 WIN 
dmi.chassis.asset.tag: NO Asset Tag dmi.chassis.type: 10 dmi.chassis.vendor: LENOVO dmi.chassis.version: Lenovo ideapad 310-14ISK dmi.modalias: dmi:bvnLENOVO:bvr0XCN45WW:bd08/09/2018:svnLENOVO:pn80UG:pvrLenovoideapad310-14ISK:rvnLENOVO:rnToronto4A2:rvrSDK0J40679WIN:cvnLENOVO:ct10:cvrLenovoideapad310-14ISK: dmi.product.family: IDEAPAD dmi.product.name: 80UG dmi.product.sku: LENOVO_MT_80UG_BU_idea_FM_Lenovo ideapad 310-14ISK dmi.product.version: Lenovo ideapad 310-14ISK dmi.sys.vendor: LENOVO mtime.conffile..etc.apport.crashdb.conf: 2019-08-29T08:39:36.787240 To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1868116/+subscriptions
Re: acpi_pcihp_eject_slot() bug if passed 'slots == 0'
On Thu, 26 Mar 2020 13:29:01 +0100
Igor Mammedov wrote:

> On Thu, 26 Mar 2020 11:52:36 +
> Peter Maydell wrote:
>
> > Hi; Coverity spots that if hw/acpi/pcihp.c:acpi_pcihp_eject_slot()
> > is passed a zero 'slots' argument then ctz32(slots) will return 32,
> > and then the code that does '1U << slot' is C undefined behaviour
> > because it's an oversized shift. (This is CID 1421896.)
> >
> > Since the pci_write() function in this file can call
> > acpi_pcihp_eject_slot() with an arbitrary value from the guest,
> > I think we need to handle 'slots == 0' safely. But what should
> > the behaviour be?
>
> 0 is not a valid value, we should ignore it and return early in this case,
> like we do with bsel. I'll post a patch shortly.

Well, looking more, that is only true for the main bus. For bridges the slot number can be zero; AML then left-shifts it and writes into B0EJ, which traps into pci_write(, data), and that is supposed to eject slot 0 according to the guest (AML).

Michael, what's your take on it?

> > thanks
> > -- PMM
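The undefined behaviour Coverity flags, and the early-return guard Igor proposes, can be shown in a minimal sketch. acpi_eject_slot() below is a simplified stand-in for acpi_pcihp_eject_slot(), not the real function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ctz32() as in QEMU: count trailing zeros, defined to return 32 for 0.
 * (__builtin_ctz itself is undefined for 0, hence the explicit check;
 * this assumes a gcc/clang-style compiler.) */
static int ctz32(uint32_t val)
{
    return val ? __builtin_ctz(val) : 32;
}

/*
 * Hardened eject path: returning early for slots == 0 avoids the
 * oversized shift, since 1U << 32 is C undefined behaviour.
 */
static bool acpi_eject_slot(uint32_t slots)
{
    int slot;
    uint32_t mask;

    if (!slots) {
        return false;           /* guest wrote 0: nothing to eject */
    }
    slot = ctz32(slots);        /* lowest requested slot, now 0..31 */
    mask = 1U << slot;          /* guaranteed well-defined */
    (void)mask;                 /* real code would eject matching devices */
    return true;
}
```

Without the guard, a guest write of 0 to the eject register would drive the shift count to 32, which a compiler is free to miscompile.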
Re: [PATCH] i386/cpu: Expand MAX_FIXED_COUNTERS from 3 to 4 for Icelake
Anyone to help review this change?

Thanks,
Like Xu

On 2020/3/17 13:54, Like Xu wrote:
> In the Intel SDM, "Table 18-2. Association of Fixed-Function Performance
> Counters with Architectural Performance Events", we may have a new fixed
> counter 'TOPDOWN.SLOTS' (since Icelake), which counts the number of
> available slots for an unhalted logical processor.
> Check commit 6017608936 in the kernel tree.
>
> Signed-off-by: Like Xu
> ---
>  target/i386/cpu.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 576f309bbf..ec2b67d425 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1185,7 +1185,7 @@ typedef struct {
>  #define CPU_NB_REGS CPU_NB_REGS32
>  #endif
>
> -#define MAX_FIXED_COUNTERS 3
> +#define MAX_FIXED_COUNTERS 4
>  #define MAX_GP_COUNTERS    (MSR_IA32_PERF_STATUS - MSR_P6_EVNTSEL0)
>
>  #define TARGET_INSN_START_EXTRA_WORDS 1
Re: [PATCH] backup: don't acquire aio_context in backup_clean
26.03.2020 12:43, Stefan Reiter wrote: On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote: 25.03.2020 18:50, Stefan Reiter wrote:

backup_clean is only ever called as a handler via job_exit, which

Hmm.. I'm afraid it's not quite correct.

job_clean
  job_finalize_single
    job_completed_txn_abort (lock aio context)
    job_do_finalize

Hmm. job_do_finalize calls job_completed_txn_abort, which cares to lock the aio context.. And at the same time, it directly calls job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?

I think, as you say, the idea is that job_do_finalize is always called with the lock acquired. That's why job_completed_txn_abort takes care to release the lock (at least of the "outer_ctx" as it calls it) before reacquiring it.

And, even if job_do_finalize is always called with a locked context, where is the guarantee that all contexts of all jobs in the txn are locked?

I also don't see anything that guarantees that... I guess it could be adapted to handle locks like job_completed_txn_abort does? Haven't looked into transactions too much, but does it even make sense to have jobs in different contexts in one transaction?

Why not? Assume backing up two disks in one transaction, each in a separate io thread.. (honestly, I don't know whether it works)

Still, let's look through its callers.

job_finalize
  qmp_block_job_finalize (lock aio context)
  qmp_job_finalize (lock aio context)
  test_cancel_concluded (doesn't lock, but it's a test)

job_completed_txn_success
  job_completed
    job_exit (lock aio context)

job_cancel
  blockdev_mark_auto_del (lock aio context)
  job_user_cancel
    qmp_block_job_cancel (locks context)
    qmp_job_cancel (locks context)
  job_cancel_err
    job_cancel_sync (return job_finish_sync(job, &job_cancel_err, NULL);, job_finish_sync just calls callback)
      replication_close (it's .bdrv_close.. Hmm, I don't see context locking, where is it?)

Hm, don't see it either. This might indeed be a way to get to job_clean without a lock held.
I don't have any testing set up for replication atm, but if you believe this would be correct I can send a patch for that as well (just acquire the lock in replication_close before job_cancel_async?).

I don't know.. But sending a patch is a good way to start a discussion)

replication_stop (locks context)

drive_backup_abort (locks context)
blockdev_backup_abort (locks context)
job_cancel_sync_all (locks context)
cancel_common (locks context)
test_* (I don't care)

To clarify, aside from the commit message the patch itself does not appear to be wrong? All paths (aside from replication_close mentioned above) guarantee the job lock to be held.

I mostly worry about the case of a transaction with jobs from different aio contexts more than about replication.. Anyway, I hope that someone who has a better understanding of these things will look at this.

It's usually not a good idea to send a [PATCH] inside a discussion thread; it'd better be a separate thread, to be more visible. Maybe you could send a separate series, which would include this patch, some fix for replication, and an attempt to fix job_do_finalize in some way, and we continue the discussion from that new series?

already acquires the job's context. The job's context is guaranteed to be the same as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock once, thus deadlocking with the IO thread.

Signed-off-by: Stefan Reiter

Just note that this was recently touched by 0abf2581717a19, so add Sergio (its author) to CC.

---

This is a fix for the issue discussed in this part of the thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
...not the original problem (core dump) posted by Dietmar.

I've still seen it occasionally hang during a backup abort. 
I'm trying to figure out why that happens; the stack trace indicates a similar problem with the main thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet.

 block/backup.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
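The deadlock mechanism described in the commit message can be illustrated with a toy model of a recursive context lock: if both the job machinery and backup_clean() acquire the AioContext, the hold depth is 2, and a drain that releases exactly once still leaves the context held, so the iothread can never take it. This is an analogy, not QEMU's real AioContext code:

```c
#include <assert.h>

/* Toy recursive lock: a depth counter owned by one thread. */
typedef struct Ctx { int depth; } Ctx;

static void ctx_acquire(Ctx *c)
{
    c->depth++;                 /* recursive acquire just nests */
}

static void ctx_release(Ctx *c)
{
    assert(c->depth > 0);
    c->depth--;
}

/* True while any nesting level is still outstanding. */
static int ctx_held(const Ctx *c)
{
    return c->depth > 0;
}
```

With this model, "acquire twice, release once" leaves ctx_held() true, which is exactly why the patch removes the redundant acquire from backup_clean() rather than adding a matching release.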
Re: [PATCH v6 7/7] virtio-net: add migration support for RSS and hash report
ping On Fri, Mar 20, 2020 at 1:58 PM Yuri Benditovich < yuri.benditov...@daynix.com> wrote: > Save and restore RSS/hash report configuration. > > Signed-off-by: Yuri Benditovich > --- > hw/net/virtio-net.c | 37 + > 1 file changed, 37 insertions(+) > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c > index a0614ad4e6..7de7587abd 100644 > --- a/hw/net/virtio-net.c > +++ b/hw/net/virtio-net.c > @@ -2842,6 +2842,13 @@ static int virtio_net_post_load_device(void > *opaque, int version_id) > } > } > > +if (n->rss_data.enabled) { > +trace_virtio_net_rss_enable(n->rss_data.hash_types, > +n->rss_data.indirections_len, > +sizeof(n->rss_data.key)); > +} else { > +trace_virtio_net_rss_disable(); > +} > return 0; > } > > @@ -3019,6 +3026,32 @@ static const VMStateDescription > vmstate_virtio_net_has_vnet = { > }, > }; > > +static bool virtio_net_rss_needed(void *opaque) > +{ > +return VIRTIO_NET(opaque)->rss_data.enabled; > +} > + > +static const VMStateDescription vmstate_virtio_net_rss = { > +.name = "virtio-net-device/rss", > +.version_id = 1, > +.minimum_version_id = 1, > +.needed = virtio_net_rss_needed, > +.fields = (VMStateField[]) { > +VMSTATE_BOOL(rss_data.enabled, VirtIONet), > +VMSTATE_BOOL(rss_data.redirect, VirtIONet), > +VMSTATE_BOOL(rss_data.populate_hash, VirtIONet), > +VMSTATE_UINT32(rss_data.hash_types, VirtIONet), > +VMSTATE_UINT16(rss_data.indirections_len, VirtIONet), > +VMSTATE_UINT16(rss_data.default_queue, VirtIONet), > +VMSTATE_UINT8_ARRAY(rss_data.key, VirtIONet, > +VIRTIO_NET_RSS_MAX_KEY_SIZE), > +VMSTATE_VARRAY_UINT16_ALLOC(rss_data.indirections_table, > VirtIONet, > +rss_data.indirections_len, 0, > +vmstate_info_uint16, uint16_t), > +VMSTATE_END_OF_LIST() > +}, > +}; > + > static const VMStateDescription vmstate_virtio_net_device = { > .name = "virtio-net-device", > .version_id = VIRTIO_NET_VM_VERSION, > @@ -3069,6 +3102,10 @@ static const VMStateDescription > vmstate_virtio_net_device = { > has_ctrl_guest_offloads), > VMSTATE_END_OF_LIST() 
> }, > +.subsections = (const VMStateDescription * []) { > +&vmstate_virtio_net_rss, > +NULL > +} > }; > > static NetClientInfo net_virtio_info = { > -- > 2.17.1 > >
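The vmstate_virtio_net_rss subsection above relies on VMState's optional-subsection mechanism: a subsection is put on the wire only when its .needed callback returns true, so destinations never see RSS state from a source that has it disabled. A minimal model of that dispatch (types and names are illustrative, not QEMU's actual VMStateDescription machinery):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the device state the .needed callback inspects. */
typedef struct Device { bool rss_enabled; } Device;

typedef struct Subsection {
    const char *name;
    bool (*needed)(void *opaque);   /* NULL means "always emit" */
} Subsection;

static bool rss_needed(void *opaque)
{
    return ((Device *)opaque)->rss_enabled;
}

static const Subsection rss_subsection = {
    .name = "virtio-net-device/rss",
    .needed = rss_needed,
};

/* Returns true if the subsection would be written to the stream. */
static bool subsection_emitted(const Subsection *sub, void *opaque)
{
    return !sub->needed || sub->needed(opaque);
}
```

The practical benefit is migration compatibility: a guest that never enabled RSS produces a stream identical to what an older QEMU would emit.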
Re: acpi_pcihp_eject_slot() bug if passed 'slots == 0'
On Thu, 26 Mar 2020 11:52:36 +
Peter Maydell wrote:

> Hi; Coverity spots that if hw/acpi/pcihp.c:acpi_pcihp_eject_slot()
> is passed a zero 'slots' argument then ctz32(slots) will return 32,
> and then the code that does '1U << slot' is C undefined behaviour
> because it's an oversized shift. (This is CID 1421896.)
>
> Since the pci_write() function in this file can call
> acpi_pcihp_eject_slot() with an arbitrary value from the guest,
> I think we need to handle 'slots == 0' safely. But what should
> the behaviour be?

0 is not a valid value, we should ignore it and return early in this case, like we do with bsel. I'll post a patch shortly.

> thanks
> -- PMM
Re: [RFC for-5.1 4/4] spapr: Don't allow unplug of NVLink2 devices
On Thu, 26 Mar 2020 16:40:09 +1100
David Gibson wrote:

> Currently, we can't properly handle unplug of NVLink2 devices, because we
> don't have code to tear down their special memory resources. There's not
> a lot of impetus to implement that. Since hardware NVLink2 devices can't
> be hot unplugged, the guest side drivers don't usually support unplug
> anyway.
>
> Therefore, simply prevent unplug of NVLink2 devices.

This could maybe be considered a valid fix for 5.0, since this prevents guest crashes IIUC. But since this requires the two preliminary cleanup patches, I understand you may prefer to postpone that to 5.1.

> Signed-off-by: David Gibson
> ---

Reviewed-by: Greg Kurz

>  hw/ppc/spapr_pci.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 55ca9dee1e..5c8262413a 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1666,6 +1666,11 @@ static void spapr_pci_unplug_request(HotplugHandler *plug_handler,
>          return;
>      }
>
> +    if (spapr_phb_is_nvlink_dev(pdev, phb)) {
> +        error_setg(errp, "PCI: Cannot unplug NVLink2 devices");
> +        return;
> +    }
> +
>      /* ensure any other present functions are pending unplug */
>      if (PCI_FUNC(pdev->devfn) == 0) {
>          for (i = 1; i < 8; i++) {
Re: [PATCH 0/2] Fix the generic image creation code
On 26.03.20 02:12, Maxim Levitsky wrote: > The recent patches from Max Reitz allowed some block drivers to not > provide the .bdrv_co_create_opts and still allow qemu-img to > create/format images as long as the image already exists > (that is the case with various block storage drivers like nbd/iscsi/nvme, etc) > > However it was found out that some places in the code depend on the > .bdrv_co_create_opts/.create_opts to be != NULL to decide whether to allow > image creation. > > To avoid adding fallback code to all these places, just make the generic fallback > code be used by the drivers that need it, so that for an outside user, there > is no difference whether the fallback was used or not. > > Best regards, > Maxim Levitsky > > Maxim Levitsky (2): > block: pass BlockDriver reference to the .bdrv_co_create > block: trickle down the fallback image creation function use to the > block drivers Thanks, fixed the function parameter alignment, moved the declarations from block.h into block_int.h, and applied the series to my block branch: https://git.xanclic.moe/XanClic/qemu/commits/branch/block Max signature.asc Description: OpenPGP digital signature
Re: [RFC for-5.1 3/4] spapr: Fix failure path for attempting to hot unplug PCI bridges
On Thu, 26 Mar 2020 16:40:08 +1100 David Gibson wrote: > For various technical reasons we can't currently allow unplugging a PCI-to-PCI > bridge on the pseries machine. spapr_pci_unplug_request() correctly > generates an error message if that's attempted. > > But.. if the given errp is not error_abort or error_fatal, Which is always the case when trying to unplug a device AFAICT: void qdev_unplug(DeviceState *dev, Error **errp) { DeviceClass *dc = DEVICE_GET_CLASS(dev); HotplugHandler *hotplug_ctrl; HotplugHandlerClass *hdc; Error *local_err = NULL; [...] hdc = HOTPLUG_HANDLER_GET_CLASS(hotplug_ctrl); if (hdc->unplug_request) { hotplug_handler_unplug_request(hotplug_ctrl, dev, &local_err); And anyway, spapr_pci_unplug_request() shouldn't rely on the caller passing &error_fatal or &error_abort to do the right thing. Calling error_setg() without returning right away is a dangerous practice since it would cause a subsequent call to error_setg() with the same errp to abort QEMU. > it doesn't actually stop trying to unplug the bridge anyway. > This looks like a bug fix that could go to 5.0 IMHO. Maybe add this tag? Fixes: 14e714900f6b "spapr: Allow hot plug/unplug of PCI bridges and devices under PCI bridges" > Signed-off-by: David Gibson > --- Reviewed-by: Greg Kurz > hw/ppc/spapr_pci.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c > index 709a52780d..55ca9dee1e 100644 > --- a/hw/ppc/spapr_pci.c > +++ b/hw/ppc/spapr_pci.c > @@ -1663,6 +1663,7 @@ static void spapr_pci_unplug_request(HotplugHandler > *plug_handler, > > if (pc->is_bridge) { > error_setg(errp, "PCI: Hot unplug of PCI bridges not supported"); > +return; > } > > /* ensure any other present functions are pending unplug */
Re: [RFC for-5.1 2/4] spapr: Helper to determine if a device is NVLink2 related
On Thu, 26 Mar 2020 16:40:07 +1100 David Gibson wrote: > This adds a simple exported helper function which determines if a given > (supposedly) PCI device is actually an NVLink2 device, which has some > special considerations. > > Signed-off-by: David Gibson > --- Reviewed-by: Greg Kurz > hw/ppc/spapr_pci_nvlink2.c | 5 + > include/hw/pci-host/spapr.h | 1 + > 2 files changed, 6 insertions(+) > > diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c > index 7d3a685421..0cec1ae02b 100644 > --- a/hw/ppc/spapr_pci_nvlink2.c > +++ b/hw/ppc/spapr_pci_nvlink2.c > @@ -449,6 +449,11 @@ static bool is_nvnpu(PCIDevice *dev, SpaprPhbState > *sphb, int *slot, int *link) > return false; > } > > +bool spapr_phb_is_nvlink_dev(PCIDevice *dev, SpaprPhbState *sphb) > +{ > +return is_nvgpu(dev, sphb, NULL) || is_nvnpu(dev, sphb, NULL, NULL); > +} > + > void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int > offset, > SpaprPhbState *sphb) > { > diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h > index 8877ff51fb..eaba4a5825 100644 > --- a/include/hw/pci-host/spapr.h > +++ b/include/hw/pci-host/spapr.h > @@ -156,6 +156,7 @@ void spapr_phb_nvgpu_free(SpaprPhbState *sphb); > void spapr_phb_nvgpu_populate_dt(SpaprPhbState *sphb, void *fdt, int bus_off, > Error **errp); > void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt); > +bool spapr_phb_is_nvlink_dev(PCIDevice *dev, SpaprPhbState *sphb); > void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int > offset, > SpaprPhbState *sphb); > #else
Re: [RFC for-5.1 1/4] spapr: Refactor locating NVLink2 devices for device tree creation
On Thu, 26 Mar 2020 16:40:06 +1100 David Gibson wrote: > Currently spapr_phb_nvgpu_populate_pcidev_dt() works a little cryptically. > It steps through all the NVLink2 GPUs and NPUs and if they match the device > we're called for, we generate the relevant device tree information. > > Make this a little more obvious by introducing helpers to determine it a ... to determine if a > given PCI device is an NVLink2 GPU or NPU, returning the NVLink2 slot and > link number information as well. > > Signed-off-by: David Gibson > --- LGTM Reviewed-by: Greg Kurz > hw/ppc/spapr_pci_nvlink2.c | 115 + > 1 file changed, 79 insertions(+), 36 deletions(-) > > diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c > index 8332d5694e..7d3a685421 100644 > --- a/hw/ppc/spapr_pci_nvlink2.c > +++ b/hw/ppc/spapr_pci_nvlink2.c > @@ -390,13 +390,12 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState > *sphb, void *fdt) > > } > > -void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int > offset, > -SpaprPhbState *sphb) > +static bool is_nvgpu(PCIDevice *dev, SpaprPhbState *sphb, int *slot) > { > -int i, j; > +int i; > > if (!sphb->nvgpus) { > -return; > +return false; > } > > for (i = 0; i < sphb->nvgpus->num; ++i) { > @@ -406,47 +405,91 @@ void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, > void *fdt, int offset, > if (!nvslot->gpdev) { > continue; > } > + > if (dev == nvslot->gpdev) { > -uint32_t npus[nvslot->linknum]; > +if (slot) { > +*slot = i; > +} > +return true; > +} > +} > > -for (j = 0; j < nvslot->linknum; ++j) { > -PCIDevice *npdev = nvslot->links[j].npdev; > +return false; > +} > > -npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev)); > -} > -_FDT(fdt_setprop(fdt, offset, "ibm,npu", npus, > - j * sizeof(npus[0]))); > -_FDT((fdt_setprop_cell(fdt, offset, "phandle", > - PHANDLE_PCIDEV(sphb, dev; > +static bool is_nvnpu(PCIDevice *dev, SpaprPhbState *sphb, int *slot, int > *link) > +{ > +int i, j; > + > +if (!sphb->nvgpus) { > +return false; > +} > 
+ > +for (i = 0; i < sphb->nvgpus->num; ++i) { > +SpaprPhbPciNvGpuSlot *nvslot = &sphb->nvgpus->slots[i]; > + > +/* Skip "slot" without attached GPU */ > +if (!nvslot->gpdev) { > continue; > } > > for (j = 0; j < nvslot->linknum; ++j) { > -if (dev != nvslot->links[j].npdev) { > -continue; > +if (dev == nvslot->links[j].npdev) { > +if (slot) { > +*slot = i; > +} > +if (link) { > +*link = j; > +} > +return true; > } > +} > +} > > -_FDT((fdt_setprop_cell(fdt, offset, "phandle", > - PHANDLE_PCIDEV(sphb, dev; > -_FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu", > - PHANDLE_PCIDEV(sphb, nvslot->gpdev))); > -_FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink", > - PHANDLE_NVLINK(sphb, i, j; > -/* > - * If we ever want to emulate GPU RAM at the same location as on > - * the host - here is the encoding GPA->TGT: > - * > - * gta = ((sphb->nv2_gpa >> 42) & 0x1) << 42; > - * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43; > - * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45; > - * gta |= sphb->nv2_gpa & ((1UL << 43) - 1); > - */ > -_FDT(fdt_setprop_cell(fdt, offset, "memory-region", > - PHANDLE_GPURAM(sphb, i))); > -_FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr", > - nvslot->tgt)); > -_FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed", > - nvslot->links[j].link_speed)); > +return false; > +} > + > +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int > offset, > +SpaprPhbState *sphb) > +{ > +int slot, link; > + > +if (is_nvgpu(dev, sphb, &slot)) { > +SpaprPhbPciNvGpuSlot *nvslot = &sphb->nvgpus->slots[slot]; > +uint32_t npus[nvslot->linknum]; > + > +for (link = 0; link < nvslot->linknum; ++link) { > +PCIDevice *npdev = nvslot->links[link].npdev; > + > +npus[link] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev)); > } > +_FDT(fdt_setprop(fdt, offset, "ibm,npu", npus, > +
Re: [PATCH] backup: don't acquire aio_context in backup_clean
On Thu, Mar 26, 2020 at 08:54:53AM +0300, Vladimir Sementsov-Ogievskiy wrote: > 25.03.2020 18:50, Stefan Reiter wrote: > > backup_clean is only ever called as a handler via job_exit, which > > Hmm.. I'm afraid it's not quite correct. > > job_clean > > job_finalize_single > > job_completed_txn_abort (lock aio context) > > job_do_finalize > > > Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to lock the aio > context.. > And at the same time, it directly calls job_txn_apply(job->txn, > job_finalize_single) > without locking. Is it a bug? Indeed, looks like a bug to me. In fact, that's the one causing the issue that Dietmar initially reported. I think the proper fix is to drop the context acquisition/release in backup_clean that I added in 0abf2581717a19, as Stefan proposed, and also acquire the context of "foreign" jobs at job_txn_apply, just as job_completed_txn_abort does. Thanks, Sergio. > And, even if job_do_finalize is always called with a locked context, where is the guarantee that the > contexts of all jobs in the txn are locked? > > Still, let's look through its callers. > > job_finalize > >qmp_block_job_finalize (lock aio context) >qmp_job_finalize (lock aio context) >test_cancel_concluded (doesn't lock, but it's a test) > > job_completed_txn_success > >job_completed > > job_exit (lock aio context) > > job_cancel > > blockdev_mark_auto_del (lock aio context) > > job_user_cancel > > qmp_block_job_cancel (locks context) > qmp_job_cancel (locks context) > > job_cancel_err > > job_cancel_sync (return job_finish_sync(job, > &job_cancel_err, NULL);, job_finish_sync just calls callback) > >replication_close (it's .bdrv_close.. Hmm, > I don't see context locking, where is it ?) > >replication_stop (locks context) > >drive_backup_abort (locks context) > >blockdev_backup_abort (locks context) > >job_cancel_sync_all (locks context) > >cancel_common (locks context) > > test_* (I don't care) > > > already acquires the job's context. 
The job's context is guaranteed to > > be the same as the one used by backup_top via backup_job_create. > > > > Since the previous logic effectively acquired the lock twice, this > > broke cleanup of backups for disks using IO threads, since the > > BDRV_POLL_WHILE > > in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock > > once, thus deadlocking with the IO thread. > > > > Signed-off-by: Stefan Reiter > > Just note that this thing was recently touched by 0abf2581717a19, so add > Sergio (its author) to CC. > > > --- > > > > This is a fix for the issue discussed in this part of the thread: > > https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html > > ...not the original problem (core dump) posted by Dietmar. > > > > I've still seen it occasionally hang during a backup abort. I'm trying to > > figure > > out why that happens, stack trace indicates a similar problem with the main > > thread hanging at bdrv_do_drained_begin, though I have no clue why as of > > yet. > > > > block/backup.c | 4 > > 1 file changed, 4 deletions(-) > > > > diff --git a/block/backup.c b/block/backup.c > > index 7430ca5883..a7a7dcaf4c 100644 > > --- a/block/backup.c > > +++ b/block/backup.c > > @@ -126,11 +126,7 @@ static void backup_abort(Job *job) > > static void backup_clean(Job *job) > > { > > BackupBlockJob *s = container_of(job, BackupBlockJob, common.job); > > -AioContext *aio_context = bdrv_get_aio_context(s->backup_top); > > - > > -aio_context_acquire(aio_context); > > bdrv_backup_top_drop(s->backup_top); > > -aio_context_release(aio_context); > > } > > void backup_do_checkpoint(BlockJob *job, Error **errp) > > > > > -- > Best regards, > Vladimir > signature.asc Description: PGP signature
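The deadlock Stefan describes is easy to model in isolation: job_exit already holds the job's AioContext (a recursive lock), backup_clean acquired it a second time, and the single release inside BDRV_POLL_WHILE then left the lock still held while polling. The following is a toy lock-depth counter, not QEMU code; all names here are illustrative:

```c
/* Toy model (not QEMU code) of the reported deadlock: the AioContext is
 * a recursive lock, so a second acquire means the poll's single release
 * leaves the lock depth at 1, blocking the IO thread. */
typedef struct { int depth; } ToyAioContext;

static void toy_acquire(ToyAioContext *c) { c->depth++; }
static void toy_release(ToyAioContext *c) { c->depth--; }

/* Returns 1 if the lock is still held after the poll's single release,
 * i.e. another thread waiting on it would deadlock. */
static int poll_would_deadlock(ToyAioContext *c)
{
    int held;

    toy_release(c);          /* BDRV_POLL_WHILE drops the lock once... */
    held = c->depth > 0;
    toy_acquire(c);          /* ...and re-takes it afterwards */
    return held;
}

static int run_cleanup(int extra_acquire_in_clean)
{
    ToyAioContext c = { 0 };
    int deadlock;

    toy_acquire(&c);                  /* job_exit acquires the context */
    if (extra_acquire_in_clean) {
        toy_acquire(&c);              /* the acquire the patch removes */
    }
    deadlock = poll_would_deadlock(&c);
    while (c.depth > 0) {             /* unwind on the way out */
        toy_release(&c);
    }
    return deadlock;
}
```

With the extra acquire removed (run_cleanup(0)), the poll's single release actually drops the lock, which is what the patch restores.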
acpi_pcihp_eject_slot() bug if passed 'slots == 0'
Hi; Coverity spots that if hw/acpi/pcihp.c:acpi_pcihp_eject_slot() is passed a zero 'slots' argument then ctz32(slots) will return 32, and then the code that does '1U << slot' is C undefined behaviour because it's an oversized shift. (This is CID 1421896.) Since the pci_write() function in this file can call acpi_pcihp_eject_slot() with an arbitrary value from the guest, I think we need to handle 'slots == 0' safely. But what should the behaviour be? thanks -- PMM
Re: [RFC v1] arm/virt: Add memory hot remove support
Hi Shameer, On 3/26/20 12:14 PM, Shameerali Kolothum Thodi wrote: > Hi Eric, > >> -Original Message- >> From: Auger Eric [mailto:eric.au...@redhat.com] >> Sent: 26 March 2020 11:01 >> To: Shameerali Kolothum Thodi ; >> qemu-devel@nongnu.org; qemu-...@nongnu.org >> Cc: imamm...@redhat.com; peter.mayd...@linaro.org; m...@redhat.com; >> xuwei (O) ; Zengtao (B) ; >> Linuxarm ; Anshuman Khandual >> >> Subject: Re: [RFC v1] arm/virt: Add memory hot remove support >> >> Hi Shameer, >> >> On 3/18/20 1:37 PM, Shameer Kolothum wrote: >>> This adds support for memory hot remove on arm/virt that >>> uses acpi ged device. >> >> I gave this a try and it works fine if the PCDIMM slot was initially >> hotplugged: >> (QEMU) object-add qom-type=memory-backend-ram id=mem1 >> props.size=4294967296 >> {"return": {}} >> (QEMU) device_add driver=pc-dimm id=pcdimm1 memdev=mem1 >> (QEMU) device_del id=pcdimm1 >> {"return": {}} >> >> on guest I can see: >> [ 82.466321] Offlined Pages 262144 >> [ 82.541712] Offlined Pages 262144 >> [ 82.589236] Offlined Pages 262144 >> [ 82.969166] Offlined Pages 262144 >> >> However I noticed that if qemu is launched directly with >> >> -m 16G,maxmem=32G,slots=2 \ >> -object memory-backend-ram,id=mem1,size=4G \ >> -device pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm -device >> >> and then in the qmp shell: >> (QEMU) device_del id=dimm1 >> >> the hot-unplug fails in guest: >> >> [ 78.897407] Offlined Pages 262144 >> [ 79.260811] Offlined Pages 262144 >> [ 79.308105] Offlined Pages 262144 >> [ 79.333675] page:fe00137d1f40 refcount:1 mapcount:0 >> mapping:0004ea9f20b1 index:0xaaab11c6e >> [ 79.335927] anon flags: 0x17880024(uptodate|active|swapbacked) >> [ 79.337571] raw: 17880024 dead0100 >> dead0122 >> 0004ea9f20b1 >> [ 79.339502] raw: 000aaab11c6e 0001 >> 0004fd4e3000 >> [ 79.341701] page dumped because: unmovable page >> [ 79.342887] page->mem_cgroup:0004fd4e3000 >> [ 79.354729] page:fe00137d1f40 refcount:1 mapcount:0 >> mapping:0004ea9f20b1 
index:0xaaab11c6e >> [ 79.357012] anon flags: 0x17880024(uptodate|active|swapbacked) >> [ 79.358658] raw: 17880024 dead0100 >> dead0122 >> 0004ea9f20b1 >> [ 79.360611] raw: 000aaab11c6e 0001 >> 0004fd4e3000 >> [ 79.362560] page dumped because: unmovable page >> [ 79.363742] page->mem_cgroup:0004fd4e3000 >> [ 79.368636] memory memory20: Offline failed. >> >> I did not expect this. The PCDIMM slot in that case does not seem to be >> interpreted as a hot-unpluggable one (?). I added Anshuman in cc. > > Could you please try adding "movable_node" to qemu guest kernel command line > params. > This will prevent any kernel allocation from hotplugable memory nodes which I > think is > causing the behavior you are seeing. Effectively, when adding the movable_node option in the guest kernel parameters, I get the following traces: [ 29.581418] Offlined Pages 262144 [ 29.663605] Offlined Pages 262144 [ 29.714225] Offlined Pages 262144 [ 30.222953] Offlined Pages 262144 [ 30.314288] Built 1 zonelists, mobility grouping on. Total pages: 4076898 [ 30.316067] Policy zone: Normal Thanks Eric > > Thanks, > Shameer > > >> Thanks >> >> Eric >> >> >> >>> >>> Signed-off-by: Shameer Kolothum >>> --- >>> -RFC because linux kernel support for mem hot remove is just queued >>> for 5.7[1]. >>> -Tested with guest kernel 5.6-rc5 + [1] >>> >>> 1. 
https://patchwork.kernel.org/cover/11419301/ >>> --- >>> hw/acpi/generic_event_device.c | 28 + >>> hw/arm/virt.c | 56 >> -- >>> 2 files changed, 82 insertions(+), 2 deletions(-) >>> >>> diff --git a/hw/acpi/generic_event_device.c >> b/hw/acpi/generic_event_device.c >>> index 021ed2bf23..3e28c110fa 100644 >>> --- a/hw/acpi/generic_event_device.c >>> +++ b/hw/acpi/generic_event_device.c >>> @@ -182,6 +182,32 @@ static void >> acpi_ged_device_plug_cb(HotplugHandler *hotplug_dev, >>> } >>> } >>> >>> +static void acpi_ged_unplug_request_cb(HotplugHandler *hotplug_dev, >>> + DeviceState *dev, Error >> **errp) >>> +{ >>> +AcpiGedState *s = ACPI_GED(hotplug_dev); >>> + >>> +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { >>> +acpi_memory_unplug_request_cb(hotplug_dev, >> &s->memhp_state, dev, errp); >>> +} else { >>> +error_setg(errp, "acpi: device unplug request for unsupported >> device" >>> + " type: %s", object_get_typename(OBJECT(dev))); >>> +} >>> +} >>> + >>> +static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev, >>> + DeviceState *dev, Error **errp) >>> +{ >>> +AcpiGedState *s = ACPI_GED(hotplug_dev); >>> + >>> +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { >>> +acpi_memory_unplu
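For reference, the failing case discussed above is a cold-plugged DIMM; per Shameer's suggestion, the guest kernel needs movable_node so that no unmovable kernel allocations land in hotpluggable memory. A sketch of the combination (a hedged example — image paths and machine options are illustrative, not from the original report):

```shell
# Cold-plug the DIMM on the QEMU command line (arm/virt, ACPI GED):
qemu-system-aarch64 -M virt -cpu host -enable-kvm \
    -m 16G,maxmem=32G,slots=2 \
    -object memory-backend-ram,id=mem1,size=4G \
    -device pc-dimm,memdev=mem1,id=dimm1 \
    -kernel Image -append "root=/dev/vda movable_node"

# Then hot-unplug from the qmp shell:
# (QEMU) device_del id=dimm1
```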
Re: [PULL for 5.0-rc1 00/11] testing updates (+ one mttcg change)
On Wed, 25 Mar 2020 at 15:15, Alex Bennée wrote: > > The following changes since commit 736cf607e40674776d752acc201f565723e86045: > > Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +) > > are available in the Git repository at: > > https://github.com/stsquad/qemu.git tags/pull-testing-250320-1 > > for you to fetch changes up to 6a5970988e7c9ce7f9fa9747397b63e8455144c6: > > .travis.yml: Add a KVM-only s390x job (2020-03-25 14:40:14 +) > > > Testing updates: > > - docker updates (various dependencies) > - travis updates (s390x KVM build) > - test/vm updates (NetBSD -> 9.0, FreeBSD -> 12.1) > - disable MTTCG for mips64/mips64el This produces some new warnings for the freebsd test: In file included from /home/qemu/qemu-test.XaCd3t/src/net/netmap.c:36: In file included from /home/qemu/qemu-test.XaCd3t/src/include/sysemu/sysemu.h:5: In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/timer.h:4: In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/bitops.h:17: /home/qemu/qemu-test.XaCd3t/src/include/qemu/atomic.h:211:9: warning: 'atomic_fetch_add' macro redefined [-Wmacro-redefined] #define atomic_fetch_add(ptr, n) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST) ^ /usr/include/stdatomic.h:350:9: note: previous definition is here #define atomic_fetch_add(object, operand) \ ^ In file included from /home/qemu/qemu-test.XaCd3t/src/net/netmap.c:36: In file included from /home/qemu/qemu-test.XaCd3t/src/include/sysemu/sysemu.h:5: In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/timer.h:4: In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/bitops.h:17: /home/qemu/qemu-test.XaCd3t/src/include/qemu/atomic.h:212:9: warning: 'atomic_fetch_sub' macro redefined [-Wmacro-redefined] #define atomic_fetch_sub(ptr, n) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST) ^ /usr/include/stdatomic.h:356:9: note: previous definition is here #define atomic_fetch_sub(object, operand) \ ^ (similarly for _and, _or, _xor). thanks -- PMM
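The warnings come from qemu/atomic.h unconditionally redefining names that FreeBSD's C11 <stdatomic.h> already defines as macros. A minimal reproduction of the clash and one defensive pattern (undefine before redefining) — this is an illustrative sketch, not a proposed QEMU patch:

```c
#include <stdatomic.h>

/* On hosts whose libc <stdatomic.h> defines atomic_fetch_add as a
 * macro (e.g. FreeBSD), redefining it triggers -Wmacro-redefined.
 * Undefining first silences the warning while keeping the
 * sequentially-consistent semantics. */
#ifdef atomic_fetch_add
#undef atomic_fetch_add
#endif
#define atomic_fetch_add(ptr, n) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST)

static int demo_fetch_add(void)
{
    int counter = 10;
    int old = atomic_fetch_add(&counter, 5);  /* returns the prior value */
    return old * 100 + counter;               /* 10 and 15 -> 1015 */
}
```

Whether QEMU should rename its macros instead of shadowing the standard ones is a separate design question; the sketch only shows why the warning fires and how a guard avoids it.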
Re: [PATCH for-5.0] arm:virt: fix broken IPA range with KVM enabled
Hi Igor, On 3/26/20 12:28 PM, Igor Mammedov wrote: > Commit a1b18df9a4848, broke virt_kvm_type() logic, which depends on > maxram_size, ram_size, ram_slots being parsed/set on machine instance > at the time accelerator (KVM) is initialized. > > set_memory_options() part was already reverted by commit 2a7b18a3205b, > so revert remaining initialization of above machine fields to make > virt_kvm_type() work as it used to. Oh I did not notice set_memory_options() change was already reverted. > > Signed-off-by: Igor Mammedov > Reported-by: Auger Eric Reviewed-by: Eric Auger Tested-by: Eric Auger > --- > softmmu/vl.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/softmmu/vl.c b/softmmu/vl.c > index 814537bb42..132c6e73dc 100644 > --- a/softmmu/vl.c > +++ b/softmmu/vl.c > @@ -4137,6 +4137,9 @@ void qemu_init(int argc, char **argv, char **envp) > machine_opts = qemu_get_machine_opts(); > qemu_opt_foreach(machine_opts, machine_set_property, current_machine, > &error_fatal); > +current_machine->ram_size = ram_size; > +current_machine->maxram_size = maxram_size; > +current_machine->ram_slots = ram_slots; Nit: Before configure_accelerators() call there is a comment stating that configure_accelerators uses machine properties and must be called after machine_set_property. Maybe we should add it also uses memory properties and should be called after set_memory_options. This may avoid other changes of the same style. Thanks Eric > > /* > * Note: uses machine properties such as kernel-irqchip, must run > @@ -4315,10 +4318,6 @@ void qemu_init(int argc, char **argv, char **envp) > } > } > > -current_machine->ram_size = ram_size; > -current_machine->maxram_size = maxram_size; > -current_machine->ram_slots = ram_slots; > - > parse_numa_opts(current_machine); > > if (machine_class->default_ram_id && current_machine->ram_size && >
Re: [PATCH qemu] vfio/spapr: Fix page size calculation
On Thu, Mar 26, 2020 at 11:21:47AM +, Peter Maydell wrote: > On Thu, 26 Mar 2020 at 00:39, David Gibson > wrote: > > > > On Tue, Mar 24, 2020 at 05:39:12PM +1100, Alexey Kardashevskiy wrote: > > > Coverity detected an issue (CID 1421903) with potential call of clz64(0) > > > which returns 64 which makes it do "<<" with a negative number. > > > > > > This checks the mask and avoids undefined behaviour. > > > > > > In practice pgsizes and memory_region_iommu_get_min_page_size() always > > > have some common page sizes and even if they did not, the resulting page > > > size would be 0x8000... (gcc 9.2) and > > > ioctl(VFIO_IOMMU_SPAPR_TCE_CREATE) would fail anyway. > > > > > > Signed-off-by: Alexey Kardashevskiy > > > > Applied to ppc-for-5.1. > > As a coverity-issue-fix it would be nice to have this in > 5.0 unless you think it's particularly risky. In fact, I realized that shortly after and moved it. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson signature.asc Description: PGP signature
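The shape of the guard being applied can be sketched independently of VFIO (the helper names are illustrative; the real change lives in hw/vfio/spapr.c):

```c
#include <stdint.h>

/* QEMU-style clz64(): returns 64 for a zero input, which would turn
 * '1ULL << (63 - clz64(mask))' into a shift by a negative amount. */
static unsigned clz64(uint64_t value)
{
    return value ? (unsigned)__builtin_clzll(value) : 64;
}

/* Return the largest page size present in @mask, or 0 if the mask is
 * empty -- avoiding the undefined shift Coverity flagged.  A zero
 * result lets the caller fail gracefully (the TCE-create ioctl would
 * reject it anyway, per the commit message). */
static uint64_t largest_pagesize(uint64_t mask)
{
    if (!mask) {
        return 0;
    }
    return UINT64_C(1) << (63 - clz64(mask));
}
```

For a mask combining 64 KiB and 4 KiB page-size bits, the helper picks the largest common size; for an empty mask it reports "no common page size" instead of invoking undefined behaviour.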
[PATCH for-5.0] arm:virt: fix broken IPA range with KVM enabled
Commit a1b18df9a4848 broke virt_kvm_type() logic, which depends on maxram_size, ram_size, ram_slots being parsed/set on machine instance at the time accelerator (KVM) is initialized. set_memory_options() part was already reverted by commit 2a7b18a3205b, so revert remaining initialization of above machine fields to make virt_kvm_type() work as it used to. Signed-off-by: Igor Mammedov Reported-by: Auger Eric --- softmmu/vl.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/softmmu/vl.c b/softmmu/vl.c index 814537bb42..132c6e73dc 100644 --- a/softmmu/vl.c +++ b/softmmu/vl.c @@ -4137,6 +4137,9 @@ void qemu_init(int argc, char **argv, char **envp) machine_opts = qemu_get_machine_opts(); qemu_opt_foreach(machine_opts, machine_set_property, current_machine, &error_fatal); +current_machine->ram_size = ram_size; +current_machine->maxram_size = maxram_size; +current_machine->ram_slots = ram_slots; /* * Note: uses machine properties such as kernel-irqchip, must run @@ -4315,10 +4318,6 @@ void qemu_init(int argc, char **argv, char **envp) } } -current_machine->ram_size = ram_size; -current_machine->maxram_size = maxram_size; -current_machine->ram_slots = ram_slots; - parse_numa_opts(current_machine); if (machine_class->default_ram_id && current_machine->ram_size && -- 2.18.1
Re: [PATCH] Refactor vhost_user_set_mem_table functions
On Thu, Mar 26, 2020 at 6:39 AM Raphael Norwitz wrote: > > vhost_user_set_mem_table() and vhost_user_set_mem_table_postcopy() have > gotten convoluted, and have some identical code. > > This change moves the logic populating the VhostUserMemory struct and > fds array from vhost_user_set_mem_table() and > vhost_user_set_mem_table_postcopy() to a new function, > vhost_user_fill_set_mem_table_msg(). > > No functionality is impacted. > > Signed-off-by: Raphael Norwitz > Signed-off-by: Peter Turschmid Reviewed-by: Marc-André Lureau > --- > hw/virtio/vhost-user.c | 143 > +++-- > 1 file changed, 67 insertions(+), 76 deletions(-) > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > index 08e7e63..ec21e8f 100644 > --- a/hw/virtio/vhost-user.c > +++ b/hw/virtio/vhost-user.c > @@ -407,18 +407,79 @@ static int vhost_user_set_log_base(struct vhost_dev > *dev, uint64_t base, > return 0; > } > > +static int vhost_user_fill_set_mem_table_msg(struct vhost_user *u, > + struct vhost_dev *dev, > + VhostUserMsg *msg, > + int *fds, size_t *fd_num, > + bool track_ramblocks) > +{ > +int i, fd; > +ram_addr_t offset; > +MemoryRegion *mr; > +struct vhost_memory_region *reg; > + > +msg->hdr.request = VHOST_USER_SET_MEM_TABLE; > + > +for (i = 0; i < dev->mem->nregions; ++i) { > +reg = dev->mem->regions + i; > + > +assert((uintptr_t)reg->userspace_addr == reg->userspace_addr); > +mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr, > + &offset); > +fd = memory_region_get_fd(mr); > +if (fd > 0) { > +if (track_ramblocks) { > +assert(*fd_num < VHOST_MEMORY_MAX_NREGIONS); > +trace_vhost_user_set_mem_table_withfd(*fd_num, mr->name, > + reg->memory_size, > + reg->guest_phys_addr, > + reg->userspace_addr, > + offset); > +u->region_rb_offset[i] = offset; > +u->region_rb[i] = mr->ram_block; > +} else if (*fd_num == VHOST_MEMORY_MAX_NREGIONS) { > +error_report("Failed preparing vhost-user memory table msg"); > +return -1; > +} > 
+msg->payload.memory.regions[*fd_num].userspace_addr = > +reg->userspace_addr; > +msg->payload.memory.regions[*fd_num].memory_size = > +reg->memory_size; > +msg->payload.memory.regions[*fd_num].guest_phys_addr = > +reg->guest_phys_addr; > +msg->payload.memory.regions[*fd_num].mmap_offset = offset; > +fds[(*fd_num)++] = fd; > +} else if (track_ramblocks) { > +u->region_rb_offset[i] = 0; > +u->region_rb[i] = NULL; > +} > +} > + > +msg->payload.memory.nregions = *fd_num; > + > +if (!*fd_num) { > +error_report("Failed initializing vhost-user memory map, " > + "consider using -object memory-backend-file share=on"); > +return -1; > +} > + > +msg->hdr.size = sizeof(msg->payload.memory.nregions); > +msg->hdr.size += sizeof(msg->payload.memory.padding); > +msg->hdr.size += *fd_num * sizeof(VhostUserMemoryRegion); > + > +return 1; > +} > + > static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev, > struct vhost_memory *mem) > { > struct vhost_user *u = dev->opaque; > int fds[VHOST_MEMORY_MAX_NREGIONS]; > -int i, fd; > size_t fd_num = 0; > VhostUserMsg msg_reply; > int region_i, msg_i; > > VhostUserMsg msg = { > -.hdr.request = VHOST_USER_SET_MEM_TABLE, > .hdr.flags = VHOST_USER_VERSION, > }; > > @@ -433,48 +494,11 @@ static int vhost_user_set_mem_table_postcopy(struct > vhost_dev *dev, > u->region_rb_len = dev->mem->nregions; > } > > -for (i = 0; i < dev->mem->nregions; ++i) { > -struct vhost_memory_region *reg = dev->mem->regions + i; > -ram_addr_t offset; > -MemoryRegion *mr; > - > -assert((uintptr_t)reg->userspace_addr == reg->userspace_addr); > -mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr, > - &offset); > -fd = memory_region_get_fd(mr); > -if (fd > 0) { > -assert(fd_num < VHOST_MEMORY_MAX_NREGIONS); > -trace_vhost_user_set_mem_table_withfd(fd_num, mr->name, > - reg->memory_size, > -
Re: [PATCH qemu] vfio/spapr: Fix page size calculation
On Thu, 26 Mar 2020 at 00:39, David Gibson wrote: > > On Tue, Mar 24, 2020 at 05:39:12PM +1100, Alexey Kardashevskiy wrote: > > Coverity detected an issue (CID 1421903) with potential call of clz64(0) > > which returns 64 which makes it do "<<" with a negative number. > > > > This checks the mask and avoids undefined behaviour. > > > > In practice pgsizes and memory_region_iommu_get_min_page_size() always > > have some common page sizes and even if they did not, the resulting page > > size would be 0x8000... (gcc 9.2) and > > ioctl(VFIO_IOMMU_SPAPR_TCE_CREATE) would fail anyway. > > > > Signed-off-by: Alexey Kardashevskiy > > Applied to ppc-for-5.1. As a coverity-issue-fix it would be nice to have this in 5.0 unless you think it's particularly risky. thanks -- PMM
Re: [PATCH 2/2] block: trickle down the fallback image creation function use to the block drivers
On 26.03.20 02:12, Maxim Levitsky wrote: > Instead of checking the .bdrv_co_create_opts to see if we need the fallback, > just implement the .bdrv_co_create_opts in the drivers that need it. > > This way we don't break various places that need to know if the underlying > protocol/format really supports image creation, and this way we still > allow some drivers to not support image creation. > > Fixes: fd17146cd93d1704cd96d7c2757b325fc7aac6fd > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1816007 Thanks, the series looks good to me, just some thoughts below. > Note that technically this patch reverts the image creation fallback > for the vxhs driver since I don't have a means to test it, > and IMHO it is better to leave it not supported as it was prior to > generic image creation patches. There’s also file-win32. I don’t really have the means to test that either, though, so I suppose it’s reasonable to wait until someone requests it. OTOH, it shouldn’t be different from file-posix, so maybe it wouldn’t hurt to support it, too. We could also take this series for 5.0 as-is, and queue a file-win32 patch for 5.1. What do you think? 
> Also drop iscsi_create_opts which was left accidentally > > Signed-off-by: Maxim Levitsky > --- > block.c | 35 --- > block/file-posix.c| 7 ++- > block/iscsi.c | 16 > block/nbd.c | 6 ++ > block/nvme.c | 3 +++ > include/block/block.h | 7 +++ > 6 files changed, 46 insertions(+), 28 deletions(-) > > diff --git a/block.c b/block.c > index ff23e20443..72fdf56af7 100644 > --- a/block.c > +++ b/block.c > @@ -598,8 +598,15 @@ static int > create_file_fallback_zero_first_sector(BlockBackend *blk, > return 0; > } > > -static int bdrv_create_file_fallback(const char *filename, BlockDriver *drv, > - QemuOpts *opts, Error **errp) > +/** > + * Simple implementation of bdrv_co_create_opts for protocol drivers > + * which only support creation via opening a file > + * (usually existing raw storage device) > + */ > +int coroutine_fn bdrv_co_create_opts_simple(BlockDriver *drv, > + const char *filename, > + QemuOpts *opts, > + Error **errp) The alignment’s off (in the header, too), but that can be fixed when this series is applied. > { > BlockBackend *blk; > QDict *options; > @@ -663,11 +670,7 @@ int bdrv_create_file(const char *filename, QemuOpts > *opts, Error **errp) > return -ENOENT; > } > > -if (drv->bdrv_co_create_opts) { > -return bdrv_create(drv, filename, opts, errp); > -} else { > -return bdrv_create_file_fallback(filename, drv, opts, errp); > -} > +return bdrv_create(drv, filename, opts, errp); I thought we’d just let the drivers set BlockDriver.create_opts to &bdrv_create_opts_simple and keep this bit of code (maybe with an “else if (drv->create_opts != NULL)” and an “assert(drv->create_opts == &bdrv_create_opts_simple)”). That would make the first patch unnecessary. OTOH, it’s completely reasonable to pass the object as the first argument to a class method, so why not. (Well, technically the BlockDriver kind of is the class, and the BDS is the object, it’s weird.) 
And it definitely follows what we do elsewhere (to provide default implementations for block drivers to use). > } > > int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp) [...] > diff --git a/block/file-posix.c b/block/file-posix.c > index 65bc980bc4..7e19bbff5f 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c [...] > @@ -3639,10 +3641,11 @@ static BlockDriver bdrv_host_cdrom = { > .bdrv_reopen_prepare = raw_reopen_prepare, > .bdrv_reopen_commit = raw_reopen_commit, > .bdrv_reopen_abort = raw_reopen_abort, > +.bdrv_co_create_opts = bdrv_co_create_opts_simple, > +.create_opts = &bdrv_create_opts_simple, > .mutable_opts= mutable_opts, > .bdrv_co_invalidate_cache = raw_co_invalidate_cache, > > - This line removal seems unrelated, but why not. Max > .bdrv_co_preadv = raw_co_preadv, > .bdrv_co_pwritev= raw_co_pwritev, > .bdrv_co_flush_to_disk = raw_co_flush_to_disk, signature.asc Description: OpenPGP digital signature
RE: [RFC v1] arm/virt: Add memory hot remove support
Hi Eric, > -Original Message- > From: Auger Eric [mailto:eric.au...@redhat.com] > Sent: 26 March 2020 11:01 > To: Shameerali Kolothum Thodi ; > qemu-devel@nongnu.org; qemu-...@nongnu.org > Cc: imamm...@redhat.com; peter.mayd...@linaro.org; m...@redhat.com; > xuwei (O) ; Zengtao (B) ; > Linuxarm ; Anshuman Khandual > > Subject: Re: [RFC v1] arm/virt: Add memory hot remove support > > Hi Shameer, > > On 3/18/20 1:37 PM, Shameer Kolothum wrote: > > This adds support for memory hot remove on arm/virt that > > uses acpi ged device. > > I gave this a try and it works fine if the PCDIMM slot was initially > hotplugged: > (QEMU) object-add qom-type=memory-backend-ram id=mem1 > props.size=4294967296 > {"return": {}} > (QEMU) device_add driver=pc-dimm id=pcdimm1 memdev=mem1 > (QEMU) device_del id=pcdimm1 > {"return": {}} > > on guest I can see: > [ 82.466321] Offlined Pages 262144 > [ 82.541712] Offlined Pages 262144 > [ 82.589236] Offlined Pages 262144 > [ 82.969166] Offlined Pages 262144 > > However I noticed that if qemu is launched directly with > > -m 16G,maxmem=32G,slots=2 \ > -object memory-backend-ram,id=mem1,size=4G \ > -device pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm -device > > and then in the qmp shell: > (QEMU) device_del id=dimm1 > > the hot-unplug fails in guest: > > [ 78.897407] Offlined Pages 262144 > [ 79.260811] Offlined Pages 262144 > [ 79.308105] Offlined Pages 262144 > [ 79.333675] page:fe00137d1f40 refcount:1 mapcount:0 > mapping:0004ea9f20b1 index:0xaaab11c6e > [ 79.335927] anon flags: 0x17880024(uptodate|active|swapbacked) > [ 79.337571] raw: 17880024 dead0100 > dead0122 > 0004ea9f20b1 > [ 79.339502] raw: 000aaab11c6e 0001 > 0004fd4e3000 > [ 79.341701] page dumped because: unmovable page > [ 79.342887] page->mem_cgroup:0004fd4e3000 > [ 79.354729] page:fe00137d1f40 refcount:1 mapcount:0 > mapping:0004ea9f20b1 index:0xaaab11c6e > [ 79.357012] anon flags: 0x17880024(uptodate|active|swapbacked) > [ 79.358658] raw: 17880024 dead0100 > dead0122 > 
0004ea9f20b1 > [ 79.360611] raw: 000aaab11c6e 0001 > 0004fd4e3000 > [ 79.362560] page dumped because: unmovable page > [ 79.363742] page->mem_cgroup:0004fd4e3000 > [ 79.368636] memory memory20: Offline failed. > > I did not expect this. The PCDIMM slot in that case does not seem to be > interpreted as a hot-unpluggable one (?). I added Anshuman in cc. Could you please try adding "movable_node" to qemu guest kernel command line params. This will prevent any kernel allocation from hotplugable memory nodes which I think is causing the behavior you are seeing. Thanks, Shameer > Thanks > > Eric > > > > > > > Signed-off-by: Shameer Kolothum > > --- > > -RFC because linux kernel support for mem hot remove is just queued > > for 5.7[1]. > > -Tested with guest kernel 5.6-rc5 + [1] > > > > 1. https://patchwork.kernel.org/cover/11419301/ > > --- > > hw/acpi/generic_event_device.c | 28 + > > hw/arm/virt.c | 56 > -- > > 2 files changed, 82 insertions(+), 2 deletions(-) > > > > diff --git a/hw/acpi/generic_event_device.c > b/hw/acpi/generic_event_device.c > > index 021ed2bf23..3e28c110fa 100644 > > --- a/hw/acpi/generic_event_device.c > > +++ b/hw/acpi/generic_event_device.c > > @@ -182,6 +182,32 @@ static void > acpi_ged_device_plug_cb(HotplugHandler *hotplug_dev, > > } > > } > > > > +static void acpi_ged_unplug_request_cb(HotplugHandler *hotplug_dev, > > + DeviceState *dev, Error > **errp) > > +{ > > +AcpiGedState *s = ACPI_GED(hotplug_dev); > > + > > +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { > > +acpi_memory_unplug_request_cb(hotplug_dev, > &s->memhp_state, dev, errp); > > +} else { > > +error_setg(errp, "acpi: device unplug request for unsupported > device" > > + " type: %s", object_get_typename(OBJECT(dev))); > > +} > > +} > > + > > +static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev, > > + DeviceState *dev, Error **errp) > > +{ > > +AcpiGedState *s = ACPI_GED(hotplug_dev); > > + > > +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { > > 
+acpi_memory_unplug_cb(&s->memhp_state, dev, errp); > > +} else { > > +error_setg(errp, "acpi: device unplug for unsupported device" > > + " type: %s", object_get_typename(OBJECT(dev))); > > +} > > +} > > + > > static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits > ev) > > { > > AcpiGedState *s = ACPI_GED(adev); > > @@ -286,6 +312,8 @@ static void acpi_ged_class_init(ObjectClass *class, > void *data) > > dc->vmsd = &vmstate_acpi_ged; > > > > hc->plug = acpi_ged_device_plug_cb; > > +hc-
Re: [PATCH v7] net: tulip: check frame size and r/w data length
On Wed, Mar 25, 2020 at 1:31 AM, P J P wrote: > From: Prasad J Pandit > > Tulip network driver while copying tx/rx buffers does not check > frame size against r/w data length. This may lead to OOB buffer > access. Add check to avoid it. > > Limit iterations over descriptors to avoid potential infinite > loop issue in tulip_xmit_list_update. > > Reported-by: Li Qiang > Reported-by: Ziming Zhang > Reported-by: Jason Wang > Signed-off-by: Prasad J Pandit > Tested-by: Li Qiang Reviewed-by: Li Qiang Thanks, Li Qiang > --- > hw/net/tulip.c | 36 +++- > 1 file changed, 27 insertions(+), 9 deletions(-) > > Update v7: fix length check expression to replace '>=' with '>' > -> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07160.html > > diff --git a/hw/net/tulip.c b/hw/net/tulip.c > index cfac2719d3..1295f51d07 100644 > --- a/hw/net/tulip.c > +++ b/hw/net/tulip.c > @@ -170,6 +170,10 @@ static void tulip_copy_rx_bytes(TULIPState *s, struct > tulip_descriptor *desc) > } else { > len = s->rx_frame_len; > } > + > +if (s->rx_frame_len + len > sizeof(s->rx_frame)) { > +return; > +} > pci_dma_write(&s->dev, desc->buf_addr1, s->rx_frame + > (s->rx_frame_size - s->rx_frame_len), len); > s->rx_frame_len -= len; > @@ -181,6 +185,10 @@ static void tulip_copy_rx_bytes(TULIPState *s, struct > tulip_descriptor *desc) > } else { > len = s->rx_frame_len; > } > + > +if (s->rx_frame_len + len > sizeof(s->rx_frame)) { > +return; > +} > pci_dma_write(&s->dev, desc->buf_addr2, s->rx_frame + > (s->rx_frame_size - s->rx_frame_len), len); > s->rx_frame_len -= len; > @@ -227,7 +235,8 @@ static ssize_t tulip_receive(TULIPState *s, const > uint8_t *buf, size_t size) > > trace_tulip_receive(buf, size); > > -if (size < 14 || size > 2048 || s->rx_frame_len || > tulip_rx_stopped(s)) { > +if (size < 14 || size > sizeof(s->rx_frame) - 4 > +|| s->rx_frame_len || tulip_rx_stopped(s)) { > return 0; > } > > @@ -275,7 +284,6 @@ static ssize_t tulip_receive_nc(NetClientState *nc, > return
tulip_receive(qemu_get_nic_opaque(nc), buf, size); > } > > - > static NetClientInfo net_tulip_info = { > .type = NET_CLIENT_DRIVER_NIC, > .size = sizeof(NICState), > @@ -558,7 +566,7 @@ static void tulip_tx(TULIPState *s, struct > tulip_descriptor *desc) > if ((s->csr[6] >> CSR6_OM_SHIFT) & CSR6_OM_MASK) { > /* Internal or external Loopback */ > tulip_receive(s, s->tx_frame, s->tx_frame_len); > -} else { > +} else if (s->tx_frame_len <= sizeof(s->tx_frame)) { > qemu_send_packet(qemu_get_queue(s->nic), > s->tx_frame, s->tx_frame_len); > } > @@ -570,23 +578,31 @@ static void tulip_tx(TULIPState *s, struct > tulip_descriptor *desc) > } > } > > -static void tulip_copy_tx_buffers(TULIPState *s, struct tulip_descriptor > *desc) > +static int tulip_copy_tx_buffers(TULIPState *s, struct tulip_descriptor > *desc) > { > int len1 = (desc->control >> TDES1_BUF1_SIZE_SHIFT) & > TDES1_BUF1_SIZE_MASK; > int len2 = (desc->control >> TDES1_BUF2_SIZE_SHIFT) & > TDES1_BUF2_SIZE_MASK; > > +if (s->tx_frame_len + len1 > sizeof(s->tx_frame)) { > +return -1; > +} > if (len1) { > pci_dma_read(&s->dev, desc->buf_addr1, > s->tx_frame + s->tx_frame_len, len1); > s->tx_frame_len += len1; > } > > +if (s->tx_frame_len + len2 > sizeof(s->tx_frame)) { > +return -1; > +} > if (len2) { > pci_dma_read(&s->dev, desc->buf_addr2, > s->tx_frame + s->tx_frame_len, len2); > s->tx_frame_len += len2; > } > desc->status = (len1 + len2) ? 
0 : 0x7fff; > + > +return 0; > } > > static void tulip_setup_filter_addr(TULIPState *s, uint8_t *buf, int n) > @@ -651,13 +667,15 @@ static uint32_t tulip_ts(TULIPState *s) > > static void tulip_xmit_list_update(TULIPState *s) > { > +#define TULIP_DESC_MAX 128 > +uint8_t i = 0; > struct tulip_descriptor desc; > > if (tulip_ts(s) != CSR5_TS_SUSPENDED) { > return; > } > > -for (;;) { > +for (i = 0; i < TULIP_DESC_MAX; i++) { > tulip_desc_read(s, s->current_tx_desc, &desc); > tulip_dump_tx_descriptor(s, &desc); > > @@ -675,10 +693,10 @@ static void tulip_xmit_list_update(TULIPState *s) > s->tx_frame_len = 0; > } > > -tulip_copy_tx_buffers(s, &desc); > - > -if (desc.control & TDES1_LS) { > -tulip_tx(s, &desc); > +if (!tulip_copy_tx_buffers(s, &desc)) { > +if (desc.control & TDES1_LS) { > +tulip_tx(s, &desc); > +} > } > } > tulip_desc_write(s, s->current_tx_desc, &des
Re: [RFC v1] arm/virt: Add memory hot remove support
Hi Shameer, On 3/18/20 1:37 PM, Shameer Kolothum wrote: > This adds support for memory hot remove on arm/virt that > uses acpi ged device. I gave this a try and it works fine if the PCDIMM slot was initially hotplugged: (QEMU) object-add qom-type=memory-backend-ram id=mem1 props.size=4294967296 {"return": {}} (QEMU) device_add driver=pc-dimm id=pcdimm1 memdev=mem1 (QEMU) device_del id=pcdimm1 {"return": {}} on guest I can see: [ 82.466321] Offlined Pages 262144 [ 82.541712] Offlined Pages 262144 [ 82.589236] Offlined Pages 262144 [ 82.969166] Offlined Pages 262144 However I noticed that if qemu is launched directly with -m 16G,maxmem=32G,slots=2 \ -object memory-backend-ram,id=mem1,size=4G \ -device pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm -device and then in the qmp shell: (QEMU) device_del id=dimm1 the hot-unplug fails in guest: [ 78.897407] Offlined Pages 262144 [ 79.260811] Offlined Pages 262144 [ 79.308105] Offlined Pages 262144 [ 79.333675] page:fe00137d1f40 refcount:1 mapcount:0 mapping:0004ea9f20b1 index:0xaaab11c6e [ 79.335927] anon flags: 0x17880024(uptodate|active|swapbacked) [ 79.337571] raw: 17880024 dead0100 dead0122 0004ea9f20b1 [ 79.339502] raw: 000aaab11c6e 0001 0004fd4e3000 [ 79.341701] page dumped because: unmovable page [ 79.342887] page->mem_cgroup:0004fd4e3000 [ 79.354729] page:fe00137d1f40 refcount:1 mapcount:0 mapping:0004ea9f20b1 index:0xaaab11c6e [ 79.357012] anon flags: 0x17880024(uptodate|active|swapbacked) [ 79.358658] raw: 17880024 dead0100 dead0122 0004ea9f20b1 [ 79.360611] raw: 000aaab11c6e 0001 0004fd4e3000 [ 79.362560] page dumped because: unmovable page [ 79.363742] page->mem_cgroup:0004fd4e3000 [ 79.368636] memory memory20: Offline failed. I did not expect this. The PCDIMM slot in that case does not seem to be interpreted as a hot-unpluggable one (?). I added Anshuman in cc. Thanks Eric > > Signed-off-by: Shameer Kolothum > --- > -RFC because linux kernel support for mem hot remove is just queued > for 5.7[1]. 
> -Tested with guest kernel 5.6-rc5 + [1] > > 1. https://patchwork.kernel.org/cover/11419301/ > --- > hw/acpi/generic_event_device.c | 28 + > hw/arm/virt.c | 56 -- > 2 files changed, 82 insertions(+), 2 deletions(-) > > diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c > index 021ed2bf23..3e28c110fa 100644 > --- a/hw/acpi/generic_event_device.c > +++ b/hw/acpi/generic_event_device.c > @@ -182,6 +182,32 @@ static void acpi_ged_device_plug_cb(HotplugHandler > *hotplug_dev, > } > } > > +static void acpi_ged_unplug_request_cb(HotplugHandler *hotplug_dev, > + DeviceState *dev, Error **errp) > +{ > +AcpiGedState *s = ACPI_GED(hotplug_dev); > + > +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { > +acpi_memory_unplug_request_cb(hotplug_dev, &s->memhp_state, dev, > errp); > +} else { > +error_setg(errp, "acpi: device unplug request for unsupported device" > + " type: %s", object_get_typename(OBJECT(dev))); > +} > +} > + > +static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev, > + DeviceState *dev, Error **errp) > +{ > +AcpiGedState *s = ACPI_GED(hotplug_dev); > + > +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { > +acpi_memory_unplug_cb(&s->memhp_state, dev, errp); > +} else { > +error_setg(errp, "acpi: device unplug for unsupported device" > + " type: %s", object_get_typename(OBJECT(dev))); > +} > +} > + > static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev) > { > AcpiGedState *s = ACPI_GED(adev); > @@ -286,6 +312,8 @@ static void acpi_ged_class_init(ObjectClass *class, void > *data) > dc->vmsd = &vmstate_acpi_ged; > > hc->plug = acpi_ged_device_plug_cb; > +hc->unplug_request = acpi_ged_unplug_request_cb; > +hc->unplug = acpi_ged_unplug_cb; > > adevc->send_event = acpi_ged_send_event; > } > diff --git a/hw/arm/virt.c b/hw/arm/virt.c > index 94f93dda54..91974e4e80 100644 > --- a/hw/arm/virt.c > +++ b/hw/arm/virt.c > @@ -2096,11 +2096,62 @@ static void > virt_machine_device_plug_cb(HotplugHandler 
*hotplug_dev, > } > } > > +static void virt_dimm_unplug_request(HotplugHandler *hotplug_dev, > + DeviceState *dev, Error **errp) > +{ > +VirtMachineState *vms = VIRT_MACHINE(hotplug_dev); > +Error *local_err = NULL; > + > +if (!vms->acpi_dev) { > +error_setg(errp, > + "memory hotplug is not enabled: missing acpi-ged device"); > +goto out; > +} > + > +hotplug_handler_unplug_request(HOTPLUG_HANDLER(vms->acpi_dev
Re: [PATCH 00/13] microvm: add acpi support
On Thu, 26 Mar 2020 03:33:35 -0400 "Michael S. Tsirkin" wrote: > On Wed, Mar 25, 2020 at 07:44:34PM +0100, Igor Mammedov wrote: > > On Wed, 25 Mar 2020 16:03:39 +0100 > > Gerd Hoffmann wrote: > > > > > On Wed, Mar 25, 2020 at 01:32:12PM +0100, Igor Mammedov wrote: > > > > On Thu, 19 Mar 2020 09:01:04 +0100 > > > > Gerd Hoffmann wrote: > > > > > > > > > I know that not supporting ACPI in microvm is intentional. If you > > > > > still > > > > > don't want ACPI this is perfectly fine, you can use the usual -no-acpi > > > > > switch to toggle ACPI support. > > > > > > > > > > These are the advantages you are going to loose then: > > > > > > > > > > (1) virtio-mmio device discovery without command line hacks > > > > > (tweaking > > > > > the command line is a problem when not using direct kernel > > > > > boot). > > > > > (2) Better IO-APIC support, we can use IRQ lines 16-23. > > > > > (3) ACPI power button (aka powerdown request) works. > > > > > (4) machine poweroff (aka S5 state) works. > > > > > > > > > > Together with seabios patches for virtio-mmio support this allows to > > > > > boot standard fedora images (cloud, coreos, workstation live) with the > > > > > microvm machine type. > > > > > > > > what CLI do you use to test it? > > > > > > Test script below. "qemu-default" is a wrapper script which starts > > > qemu-system-x86_64 from my build directory. "qemu-firmware" is the > > > same plus isa-debugcon configured for a firmware log on stdout. 
> > > > > > Latest bits (with some of the review comments addressed) just pushed > > > to git://git.kraxel.org/qemu sirius/microvm > > > > thanks, below are test results I got on my system, > > spoiler hw-reduced reduces boot time on ~0.02s compared to full blown acpi > > > > using timestamp at "Run /init as init process" as measuring point > > > > no acpi > > 1.967316 > > 1.975272 > > 1.981267 > > 1.974316 > > 1.962452 > > 1.960988 > > > > hw reduced acpi > > 0.893838 > > 0.892573 > > 0.890585 > > 0.900306 > > 0.897902 > > > > normal acpi: > > 0.921647 > > 0.916298 > > 0.923518 > > 0.916298 > > 0.913234 > > > > PS: > > I just quickly hacked hw-reduced acpi (using arm/virt as model) > > without implementing power button but I doubt that would affect results > > noticeably > > on qemu side it probably also will save some time since there are less > > things to setup for qemu. > > And no ACPI is faster because of PS/2 probing, right? I suppose you've meant -slower- because of ... if I compare at 'i8042: PNP: No PS/2 controller found.' to rule out probing, then no-acpi and hw-reduced are about in the same ballpark (3-4ms difference in favor of no-acpi).
no-acpi 0.765785 0.783540 0.785269 0.51 0.774474 0.770611 0.789309 hw-reduced 0.788733 0.775982 0.793132 0.769077 0.774771 0.775191 0.771170 > > > > > > > HTH, > > > Gerd > > > > > > cut here > > > #!/bin/sh > > > > > > mode="${1}" > > > shift > > > > > > back=() > > > devs=() > > > args=() > > > qemu="qemu-firmware -monitor none -boot menu=on" > > > disk="" > > > liso="" > > > krnl="" > > > karg="console=ttyS0,115200" > > > > > > case "$mode" in > > > kernel) > > > qemu="qemu-default -nographic" > > > disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2" > > > krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage" > > > karg="$karg root=/dev/sda4" > > > karg="$karg quiet" > > > ;; > > > seabios) > > > disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2" > > > krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage" > > > karg="$karg root=/dev/sda4" > > > args+=("-bios" > > > "/home/kraxel/projects/seabios/out-bios-microvm/bios.bin") > > > ;; > > > cloud) > > > disk="/vmdisk/iso/Fedora-Cloud-Base-31-1.9.x86_64.raw" > > > ;; > > > coreos) > > > disk="/vmdisk/iso/fedora-coreos-31.20200210.3.0-metal.x86_64.raw" > > > ;; > > > live) > > > liso="/vmdisk/iso/Fedora-Workstation-Live-x86_64-30-1.2.iso" > > > devs+=("-device" "virtio-gpu-device") > > > devs+=("-device" "virtio-keyboard-device") > > > devs+=("-device" "virtio-tablet-device") > > > ;; > > > *) > > > echo "unknown mode: $mode" > > > echo "known modes: kernel seabios cloud coreos live" > > > exit 1 > > > ;; > > > esac > > > > > > if test "$disk" != ""; then > > > format="${disk##*.}" > > > back+=("-drive" "if=none,id=disk,format=${format},file=${disk}") > > > devs+=("-device" "scsi-hd,drive=disk,bootindex=1") > > > fi > > > if test "$liso" != ""; then > > > back+=("-drive" > > > "if=none,id=cdrom,media=cdrom,readonly,format=raw,file=${liso}") > > > devs+=("-device" "scsi-cd,drive=cdrom,bootindex=2") > > > fi > > > if test "$krnl" != ""; then > > > 
args+=("-kernel" "$krnl") > > > args+=("-append" "$karg") > > > fi > > > > > > set -ex > > > $qemu \ > > > -enable-kvm \ > > > -cpu host \ > > > -M microvm,
Re: [PATCH 0/2] Fix the generic image creation code
On 3/26/20 4:12 AM, Maxim Levitsky wrote: > The recent patches from Max Reitz allowed some block drivers to not > provide the .bdrv_co_create_opts and still allow qemu-img to > create/format images as long as the image is already existing > (that is the case with various block storage drivers like nbd/iscsi/nvme, etc) > > However it was found out that some places in the code depend on the > .bdrv_co_create_opts/.create_opts to be != NULL to decide whether to allow > image creation. > > To avoid adding fallback code to all these places, just make the generic fallback > code be used by the drivers that need it, so that for an outside user, there > is no difference whether the fallback was used or not. > > Best regards, > Maxim Levitsky > > Maxim Levitsky (2): > block: pass BlockDriver reference to the .bdrv_co_create > block: trickle down the fallback image creation function use to the > block drivers > > block.c | 38 ++ > block/crypto.c| 3 ++- > block/file-posix.c| 11 +-- > block/file-win32.c| 4 +++- > block/gluster.c | 3 ++- > block/iscsi.c | 16 > block/nbd.c | 6 ++ > block/nfs.c | 4 +++- > block/nvme.c | 3 +++ > block/parallels.c | 3 ++- > block/qcow.c | 3 ++- > block/qcow2.c | 4 +++- > block/qed.c | 3 ++- > block/raw-format.c| 4 +++- > block/rbd.c | 3 ++- > block/sheepdog.c | 4 +++- > block/ssh.c | 4 +++- > block/vdi.c | 4 +++- > block/vhdx.c | 3 ++- > block/vmdk.c | 4 +++- > block/vpc.c | 6 -- > include/block/block.h | 7 +++ > include/block/block_int.h | 3 ++- > 23 files changed, 95 insertions(+), 48 deletions(-) > Reviewed-by: Denis V. Lunev
[PATCH for-5.0] hw/i386/amd_iommu.c: Fix corruption of log events passed to guest
In the function amdvi_log_event(), we write an event log buffer entry into guest ram, whose contents are passed to the function via the "uint64_t *evt" argument. Unfortunately, a spurious '&' in the call to dma_memory_write() meant that instead of writing the event to the guest we would write the literal value of the pointer, plus whatever was in the following 8 bytes on the stack. This error was spotted by Coverity. Fix the bug by removing the '&'. Fixes: CID 1421945 Cc: qemu-sta...@nongnu.org Signed-off-by: Peter Maydell --- Disclaimer: only tested with 'make check' and 'make check-acceptance' which probably don't exercise this corner of the code. --- hw/i386/amd_iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index b1175e52c7d..fd75cae0243 100644 --- a/hw/i386/amd_iommu.c +++ b/hw/i386/amd_iommu.c @@ -181,7 +181,7 @@ static void amdvi_log_event(AMDVIState *s, uint64_t *evt) } if (dma_memory_write(&address_space_memory, s->evtlog + s->evtlog_tail, -&evt, AMDVI_EVENT_LEN)) { + evt, AMDVI_EVENT_LEN)) { trace_amdvi_evntlog_fail(s->evtlog, s->evtlog_tail); } -- 2.20.1
Re: [PATCH v16 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages
On Wed, 25 Mar 2020 01:02:34 +0530 Kirti Wankhede wrote: > vfio_pfn.ref_count is always updated by holding iommu->lock, using atomic s/by/while/ > variable is overkill. > > Signed-off-by: Kirti Wankhede > Reviewed-by: Neo Jia > Reviewed-by: Eric Auger > --- > drivers/vfio/vfio_iommu_type1.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > Reviewed-by: Cornelia Huck
Re: [PULL 0/9] migration queue
On Wed, 25 Mar 2020 at 13:17, Dr. David Alan Gilbert (git) wrote: > > From: "Dr. David Alan Gilbert" > > The following changes since commit 736cf607e40674776d752acc201f565723e86045: > > Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +) > > are available in the Git repository at: > > git://github.com/dagrh/qemu.git tags/pull-migration-20200325b > > for you to fetch changes up to 7cd75cbdb8a45d9e2d5912f774d8194cbafdfa97: > > migration: use "" instead of (null) for tls-authz (2020-03-25 12:31:38 > +) > > > Combo Migration/HMP/virtiofs pull > > Small fixes all around. > Ones that are noticeable: > a) Igor's migration compatibility fix affecting older machine types > has been seen in the wild > b) Philippe's autconverge fix should fix an intermittently > failing migration test. > c) Mao's makes a small change to the output of 'info > migrate_parameters' for tls-authz. > Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/5.0 for any user-visible changes. -- PMM
Re: [PULL 006/136] vl.c: move -m parsing after memory backends has been processed
On Thu, 26 Mar 2020 10:20:31 +0100 Auger Eric wrote: > Hi > > On 2/25/20 12:48 PM, Paolo Bonzini wrote: > > From: Igor Mammedov > > > > It will be possible for main RAM to come from memory-backend > > and we should check that size specified in -m matches the size > > of the backend and [MachineState::]ram_size also matches > > backend's size. > > > > However -m parsing (set_memory_options()) happens before backends > > are intialized (object_create_delayed()) which complicates it. > > Consolidate set_memory_options() and assigning parsed results to > > current_machine after backends are initialized, so it would be > > possible access the initialized backend instance to compare > > sizes. > > > > This patch only consolidates scattered places touching ram_size > > within vl.c. And follow up patch will integrate backend handling > > to set_memory_options(). > > > > Signed-off-by: Igor Mammedov > > Message-Id: <20200219160953.13771-7-imamm...@redhat.com> > > Unfortunately this patch breaks ARM virt memory map setting in KVM mode. > If you launch ARM virt with PCDIMM you now get > > qemu-system-aarch64: -device > pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm: memory devices (e.g. for > memory hotplug) are not supported by the machine > > With machvirt/KVM > configure_accelerators() calls ARM virt_kvm_type(). This function > freezes the machine memory map and computes the size of the IPA to be > supported by KVM. See: > c9650222b8 ("hw/arm/virt: Implement kvm_type function for 4.0 machine") > > virt_kvm_type() uses machine memory settings. Igor's patch moves those > after the mem backend init, so the virt memory map does not properly > take into account the machine memory settings anymore. > > So we need to find a way to rework this. 
set_memory_options() was already reverted, so safest way to fix it during freeze is to revert back lines -current_machine->ram_size = ram_size; -current_machine->maxram_size = maxram_size; -current_machine->ram_slots = ram_slots; I'll post a patch > > Thanks > > Eric > > > > --- > > vl.c | 27 ++- > > 1 file changed, 14 insertions(+), 13 deletions(-) > > > > diff --git a/vl.c b/vl.c > > index 2103804..72ffc06 100644 > > --- a/vl.c > > +++ b/vl.c > > @@ -2655,6 +2655,14 @@ static void set_memory_options(uint64_t *ram_slots, > > ram_addr_t *maxram_size, > > exit(EXIT_FAILURE); > > } > > > > +if (!xen_enabled()) { > > +/* On 32-bit hosts, QEMU is limited by virtual address space */ > > +if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) { > > +error_report("at most 2047 MB RAM can be simulated"); > > +exit(1); > > +} > > +} > > + > > loc_pop(&loc); > > } > > > > @@ -3819,8 +3827,6 @@ int main(int argc, char **argv, char **envp) > > machine_class = select_machine(); > > object_set_machine_compat_props(machine_class->compat_props); > > > > -set_memory_options(&ram_slots, &maxram_size, machine_class); > > - > > os_daemonize(); > > rcu_disable_atfork(); > > > > @@ -4122,9 +4128,6 @@ int main(int argc, char **argv, char **envp) > > machine_opts = qemu_get_machine_opts(); > > qemu_opt_foreach(machine_opts, machine_set_property, current_machine, > > &error_fatal); > > -current_machine->ram_size = ram_size; > > -current_machine->maxram_size = maxram_size; > > -current_machine->ram_slots = ram_slots; > > > > /* > > * Note: uses machine properties such as kernel-irqchip, must run > > @@ -4235,14 +4238,6 @@ int main(int argc, char **argv, char **envp) > > > > tpm_init(); > > > > -if (!xen_enabled()) { > > -/* On 32-bit hosts, QEMU is limited by virtual address space */ > > -if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) { > > -error_report("at most 2047 MB RAM can be simulated"); > > -exit(1); > > -} > > -} > > - > > blk_mig_init(); > > ram_mig_init(); > > 
dirty_bitmap_mig_init(); > > @@ -4287,6 +4282,12 @@ int main(int argc, char **argv, char **envp) > > if (cpu_option) { > > current_machine->cpu_type = parse_cpu_option(cpu_option); > > } > > + > > +set_memory_options(&ram_slots, &maxram_size, machine_class); > > +current_machine->ram_size = ram_size; > > +current_machine->maxram_size = maxram_size; > > +current_machine->ram_slots = ram_slots; > > + > > parse_numa_opts(current_machine); > > > > if (machine_class->default_ram_id && current_machine->ram_size && > >
Re: [PATCH v16 Kernel 1/7] vfio: KABI for migration interface for device state
On Wed, 25 Mar 2020 01:02:33 +0530 Kirti Wankhede wrote: > - Defined MIGRATION region type and sub-type. > > - Defined vfio_device_migration_info structure which will be placed at the > 0th offset of migration region to get/set VFIO device related > information. Defined members of structure and usage on read/write access. > > - Defined device states and state transition details. > > - Defined sequence to be followed while saving and resuming VFIO device. > > Signed-off-by: Kirti Wankhede > Reviewed-by: Neo Jia > --- > include/uapi/linux/vfio.h | 228 > ++ > 1 file changed, 228 insertions(+) (...) > +struct vfio_device_migration_info { > + __u32 device_state; /* VFIO device state */ > +#define VFIO_DEVICE_STATE_STOP (0) > +#define VFIO_DEVICE_STATE_RUNNING (1 << 0) > +#define VFIO_DEVICE_STATE_SAVING(1 << 1) > +#define VFIO_DEVICE_STATE_RESUMING (1 << 2) > +#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \ > + VFIO_DEVICE_STATE_SAVING | \ > + VFIO_DEVICE_STATE_RESUMING) > + > +#define VFIO_DEVICE_STATE_VALID(state) \ > + (state & VFIO_DEVICE_STATE_RESUMING ? \ > + (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1) > + > +#define VFIO_DEVICE_STATE_IS_ERROR(state) \ > + ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \ > + VFIO_DEVICE_STATE_RESUMING)) > + > +#define VFIO_DEVICE_STATE_SET_ERROR(state) \ > + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \ > + VFIO_DEVICE_STATE_RESUMING) > + > + __u32 reserved; > + __u64 pending_bytes; > + __u64 data_offset; > + __u64 data_size; > +} __attribute__((packed)); The 'packed' should not even be needed, I think? 
> + > /* > * The MSIX mappable capability informs that MSIX data of a BAR can be > mmapped > * which allows direct access to non-MSIX registers which happened to be > within Generally, this looks sane to me; however, we should really have something under Documentation/ in the long run that describes how this works, so that you can find out about the protocol without having to dig through headers.
Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue
On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote: > > > > On 26.03.2020 at 08:21, Michael S. Tsirkin wrote: > > > > On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote: > >>> On 12.03.20 09:47, Michael S. Tsirkin wrote: > >>> On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote: > 2. You are essentially stealing THPs in the guest. So the fastest > mapping (THP in guest and host) is gone. The guest won't be able to make > use of THP where it previously was able to. I can imagine this implies a > performance degradation for some workloads. This needs a proper > performance evaluation. > >>> > >>> I think the problem is more with the alloc_pages API. > >>> That gives you exactly the given order, and if there's > >>> a larger chunk available, it will split it up. > >>> > >>> But for balloon - I suspect lots of other users, > >>> we do not want to stress the system but if a large > >>> chunk is available anyway, then we could handle > >>> that more optimally by getting it all in one go. > >>> > >>> > >>> So if we want to address this, IMHO this calls for a new API. > >>> Along the lines of > >>> > >>>struct page *alloc_page_range(gfp_t gfp, unsigned int min_order, > >>>unsigned int max_order, unsigned int *order) > >>> > >>> the idea would then be to return at a number of pages in the given > >>> range. > >>> > >>> What do you think? Want to try implementing that? > >> > >> You can just start with the highest order and decrement the order until > >> your allocation succeeds using alloc_pages(), which would be enough for > >> a first version. At least I don't see the immediate need for a new > >> kernel API. > > > > OK I remember now. The problem is with reclaim. Unless reclaim is > > completely disabled, any of these calls can sleep. After it wakes up, > > we would like to get the larger order that has become available > > meanwhile. > > > > Yes, but that's a pure optimization IMHO.
> So I think we should do a trivial implementation first and then see what we gain from a new allocator API. Then we might also be able to justify it using real numbers. > Well how do you propose to implement the necessary semantics? I think we are both agreed that alloc_page_range is more or less what's necessary anyway - so how would you approximate it on top of existing APIs? > > > >> -- > >> Thanks, > >> > >> David / dhildenb > >
RFC: use VFIO over a UNIX domain socket to implement device offloading
I want to continue the discussion regarding using MUSER (https://github.com/nutanix/muser) as a device offloading mechanism. The main drawback of MUSER is that it requires a kernel module, so I've experimented with a proof of concept of how MUSER would look like if we somehow didn't need a kernel module. I did this by implementing a wrapper library (https://github.com/tmakatos/libpathtrap) that intercepts accesses to VFIO-related paths and forwards them to the MUSER process providing device emulation over a UNIX domain socket. This does not require any changes to QEMU (4.1.0). Obviously this is a massive hack and is done only for the needs of this PoC. The result is a fully working PCI device in QEMU (the gpio sample explained in https://github.com/nutanix/muser/blob/master/README.md#running-gpio-pci-idio-16), which is as simple as possible. I've also tested with a much more complicated device emulation, https://github.com/tmakatos/spdk, which provides NVMe device emulation and requires accessing guest memory for DMA, allowing BAR0 to be memory mapped into the guest, using MSI-X interrupts, etc. The changes required in MUSER are fairly small, all that is needed is to introduce a new concept of "transport" to receive requests from a UNIX domain socket instead of the kernel (from a character device) and to send/receive file descriptors for sharing memory and firing interrupts. My experience is that VFIO is so intuitive to use for offloading device emulation from one process to another that makes this feature quite straightforward. There's virtually nothing specific to the kernel in the VFIO API. Therefore I strongly agree with Stefan's suggestion to use it for device offloading when interacting with QEMU. Using 'muser.ko' is still interesting when QEMU is not the client, but if everyone is happy to proceed with the vfio-over-socket alternative the kernel module can become a second-class citizen. (QEMU is, after all, our first and most relevant client.) 
Next I explain how to test the PoC.

Build MUSER with vfio-over-socket:

git clone --single-branch --branch vfio-over-socket g...@github.com:tmakatos/muser.git
cd muser/
git submodule update --init
make

Run the device emulation, e.g.:

./build/dbg/samples/gpio-pci-idio-16 -s

where the argument to -s is an available IOMMU group (essentially the device ID), which must not previously exist in /dev/vfio/.

Run QEMU using the vfio wrapper library and specifying the MUSER device:

LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
    ... \
    -device vfio-pci,sysfsdev=/dev/vfio/ \
    -object memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824 \
    -numa node,nodeid=0,cpus=0,memdev=ram-node0

Bear in mind that since this is just a PoC lots of things can break, e.g. some system call not being intercepted, etc.
Re: Reply: [question] vhost-user: auto fix network link broken during migration
On 2020/3/24 7:08 PM, yangke (J) wrote: We found an issue where a host MCE triggers an openvswitch (DPDK) restart on the source host during guest migration. Did you mean the vhost-user netdev was deleted from the source host? The vhost-user netdev was not deleted from the source host. I mean that: in the normal scenario, OVS (DPDK) begins to restart, then qemu_chr disconnects from OVS and the link status is set to link down; once OVS (DPDK) has started, qemu_chr reconnects to OVS and the link status is set to link up. But in our scenario, the migration finishes before qemu_chr reconnects to OVS. The frontend's link_down is loaded from n->status on the destination, which causes the network in the guest to never come up again. I'm not sure we should fix this in qemu. Generally, it's the task of management to make sure the destination device configuration is the same as the source's. E.g. in this case, management should bring up the link once re-connection on the source has completed. What's more, the qmp_set_link() done in vhost-user.c looks hacky, as it changes the link status without the care of management.
qemu_chr disconnect:

#0 vhost_user_write (msg=msg@entry=0x7fff59ecb2b0, fds=fds@entry=0x0, fd_num=fd_num@entry=0, dev=0x295c730, dev=0x295c730) at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost_user.c:239
#1 0x004e6bad in vhost_user_get_vring_base (dev=0x295c730, ring=0x7fff59ecb510) at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost_user.c:497
#2 0x004e2e88 in vhost_virtqueue_stop (dev=dev@entry=0x295c730, vdev=vdev@entry=0x2ca36c0, vq=0x295c898, idx=0) at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost.c:1036
#3 0x004e45ab in vhost_dev_stop (hdev=hdev@entry=0x295c730, vdev=vdev@entry=0x2ca36c0) at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost.c:1556
#4 0x004bc56a in vhost_net_stop_one (net=0x295c730, dev=dev@entry=0x2ca36c0) at /usr/src/debug/qemu-kvm-2.8.1/hw/net/vhost_net.c:326
#5 0x004bcc3b in vhost_net_stop (dev=dev@entry=0x2ca36c0, ncs=, total_queues=4) at /usr/src/debug/qemu-kvm-2.8.1/hw/net/vhost_net.c:407
#6 0x004b85f6 in virtio_net_vhost_status (n=n@entry=0x2ca36c0, status=status@entry=7 '\a') at /usr/src/debug/qemu-kvm-2.8.1/hw/net/virtio_net.c:177
#7 0x004b869f in virtio_net_set_status (vdev=, status=) at /usr/src/debug/qemu-kvm-2.8.1/hw/net/virtio_net.c:243
#8 0x0073d00d in qmp_set_link (name=name@entry=0x2956d40 "hostnet0", up=up@entry=false, errp=errp@entry=0x7fff59ecd718) at net/net.c:1437
#9 0x007460c1 in net_vhost_user_event (opaque=0x2956d40, event=4) at net/vhost_user.c:217 // qemu_chr_be_event
#10 0x00574f0d in tcp_chr_disconnect (chr=0x2951a40) at qemu_char.c:3220
#11 0x0057511f in tcp_chr_hup (channel=, cond=, opaque=) at qemu_char.c:3265

The frontend is still link down after migration, which causes the network in the VM to never come up again.
virtio_net_load_device:

/* nc.link_down can't be migrated, so infer link_down according
 * to link status bit in n->status */
link_down = (n->status & VIRTIO_NET_S_LINK_UP) == 0;
for (i = 0; i < n->max_queues; i++) {
    qemu_get_subqueue(n->nic, i)->link_down = link_down;
}

guest:                          migrate begin/vCPU pause | vmstate load | migrate finish
openvswitch in source host:     begin to restart         | restarting   | started
nc in frontend in source:       link down                | link down    | link down
nc in frontend in destination:  link up                  | link up      | link down
guest network:                  broken                   | broken       | broken
nc in backend in source:        link down                | link down    | link up
nc in backend in destination:   link up                  | link up      | link up

The frontend's link_down is loaded from n->status; n->status is link down on the source, so the frontend's link_down is true. The backend in the destination host is link up, but the frontend in the destination host is link down, which causes the network in the guest to never come up again until a guest cold reboot. Is there a way to auto fix this?
Re: [PATCH 2/2] util/bufferiszero: improve avx2 accelerator
On 26/03/20 03:09, Hu, Robert wrote: > BTW, do I need to resend these 2 patches? No, thanks! I have queued them. Paolo
Re: [PATCH] backup: don't acquire aio_context in backup_clean
On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote: 25.03.2020 18:50, Stefan Reiter wrote: backup_clean is only ever called as a handler via job_exit, which

Hmm.. I'm afraid it's not quite correct:

job_clean
   job_finalize_single
      job_completed_txn_abort (lock aio context)
      job_do_finalize

Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to lock the aio context.. And at the same time, it directly calls job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?

I think, as you say, the idea is that job_do_finalize is always called with the lock acquired. That's why job_completed_txn_abort takes care to release the lock (at least of the "outer_ctx", as it calls it) before reacquiring it.

And, even if job_do_finalize is always called with a locked context, where is the guarantee that the contexts of all jobs in the txn are locked?

I also don't see anything that guarantees that... I guess it could be adapted to handle locks like job_completed_txn_abort does? Haven't looked into transactions too much, but does it even make sense to have jobs in different contexts in one transaction?

Still, let's look through its callers:

job_finalize
   qmp_block_job_finalize (lock aio context)
   qmp_job_finalize (lock aio context)
   test_cancel_concluded (doesn't lock, but it's a test)
job_completed_txn_success
   job_completed
      job_exit (lock aio context)
job_cancel
   blockdev_mark_auto_del (lock aio context)
   job_user_cancel
      qmp_block_job_cancel (locks context)
      qmp_job_cancel (locks context)
   job_cancel_err
      job_cancel_sync (return job_finish_sync(job, &job_cancel_err, NULL);, job_finish_sync just calls the callback)
         replication_close (it's .bdrv_close.. Hmm, I don't see context locking, where is it?)

Hm, don't see it either. This might indeed be a way to get to job_clean without a lock held.
I don't have any testing set up for replication atm, but if you believe this would be correct I can send a patch for that as well (just acquire the lock in replication_close before job_cancel_async?).

         replication_stop (locks context)
      drive_backup_abort (locks context)
      blockdev_backup_abort (locks context)
      job_cancel_sync_all (locks context)
      cancel_common (locks context)
   test_* (I don't care)

To clarify, aside from the commit message the patch itself does not appear to be wrong? All paths (aside from replication_close mentioned above) guarantee the job lock to be held.

already acquires the job's context. The job's context is guaranteed to be the same as the one used by backup_top via backup_job_create. Since the previous logic effectively acquired the lock twice, this broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock once, thus deadlocking with the IO thread.

Signed-off-by: Stefan Reiter

Just note that this thing was recently touched by 0abf2581717a19, so add Sergio (its author) to CC.

---

This is a fix for the issue discussed in this part of the thread: https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html ...not the original problem (core dump) posted by Dietmar.

I've still seen it occasionally hang during a backup abort. I'm trying to figure out why that happens; the stack trace indicates a similar problem with the main thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet.
 block/backup.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
 static void backup_clean(Job *job)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-    aio_context_acquire(aio_context);
     bdrv_backup_top_drop(s->backup_top);
-    aio_context_release(aio_context);
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
Re: [PATCH for-5.0 3/3] object-add: don't create return value if failed
On 25/03/20 19:47, Marc-André Lureau wrote: > If object-add failed, no need to create a return value that may later > be leaked. > > Signed-off-by: Marc-André Lureau > --- > qom/qom-qmp-cmds.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/qom/qom-qmp-cmds.c b/qom/qom-qmp-cmds.c > index 435193b036..6bd137ccbf 100644 > --- a/qom/qom-qmp-cmds.c > +++ b/qom/qom-qmp-cmds.c > @@ -287,8 +287,8 @@ void qmp_object_add(QDict *qdict, QObject **ret_data, > Error **errp) > visit_free(v); > if (obj) { > object_unref(obj); > +*ret_data = QOBJECT(qdict_new()); > } > -*ret_data = QOBJECT(qdict_new()); > } > > void qmp_object_del(const char *id, Error **errp) > It can be slightly simplified: --- 8< -- From: Paolo Bonzini Subject: [PATCH] object-add: don't create return value if failed No need to return an empty value from object-add (it would also leak if the command failed). While at it, remove the "if" around object_unref since object_unref handles NULL arguments just fine. Reported-by: Marc-André Lureau Signed-off-by: Paolo Bonzini diff --git a/qom/qom-qmp-cmds.c b/qom/qom-qmp-cmds.c index 435193b036..e47ebe8ed1 100644 --- a/qom/qom-qmp-cmds.c +++ b/qom/qom-qmp-cmds.c @@ -285,10 +285,7 @@ void qmp_object_add(QDict *qdict, QObject **ret_data, Error **errp) v = qobject_input_visitor_new(QOBJECT(qdict)); obj = user_creatable_add_type(type, id, qdict, v, errp); visit_free(v); -if (obj) { -object_unref(obj); -} -*ret_data = QOBJECT(qdict_new()); +object_unref(obj); } void qmp_object_del(const char *id, Error **errp) I queued this patch and your other two. Thanks, Paolo
Re: [PATCH v15 Kernel 1/7] vfio: KABI for migration interface for device state
s/KABI/UAPI/ in the subject and anywhere else in the series. Please avoid __packed__ structures and just properly pad them; packed structures have a major performance impact on some platforms and will cause compiler warnings when taking addresses of members.
Re: [PULL 0/2] Fixes 20200325 patches
On Wed, 25 Mar 2020 at 11:05, Gerd Hoffmann wrote: > > The following changes since commit 736cf607e40674776d752acc201f565723e86045: > > Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +) > > are available in the Git repository at: > > git://git.kraxel.org/qemu tags/fixes-20200325-pull-request > > for you to fetch changes up to 95fad99cb28e9970944b01fd7af452f6f9f37484: > > hw/audio/fmopl: fix segmentation fault (2020-03-25 09:55:40 +0100) > > > fixes: input error handling & audio segfault > > Applied, thanks. Please update the changelog at https://wiki.qemu.org/ChangeLog/5.0 for any user-visible changes. -- PMM
Re: backup transaction with io-thread core dumps
> > > As mentioned earlier, even a totally simple/normal backup job fails when
> > > using io-threads and the VM is under load. It results in a total
> > > VM freeze!
> >
> > This is definitely a different issue. I'll take a look at it today.
>
> Thanks. Stefan found a way to avoid that bug with:
>
> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07749.html
>
> But there are doubts that this is the correct way to fix it ...

And I just ran into another freeze (with Stefan's patch applied). This time when I cancel a running backup:

#0 0x75cb3916 in __GI_ppoll (fds=0x7fff63d35c40, nfds=2, timeout=, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:39
#1 0x55c5fcd9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=, __fds=) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2 0x55c5fcd9 in qemu_poll_ns (fds=, nfds=, timeout=timeout@entry=-1) at util/qemu-timer.c:335
#3 0x55c624c1 in fdmon_poll_wait (ctx=0x7fffe8905e80, ready_list=0x7fffd2a8, timeout=-1) at util/fdmon-poll.c:79
#4 0x55c61aa7 in aio_poll (ctx=0x7fffe8905e80, blocking=blocking@entry=true) at util/aio-posix.c:589
#5 0x55bc2c83 in bdrv_do_drained_begin (poll=, ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x7fffe8954bc0) at block/io.c:429
#6 0x55bc2c83 in bdrv_do_drained_begin (bs=0x7fffe8954bc0, recursive=, parent=0x0, ignore_bds_parents=, poll=) at block/io.c:395
#7 0x55bb3c37 in blk_drain (blk=0x7fffe8ebcf80) at block/block-backend.c:1617
#8 0x55bb481d in blk_unref (blk=0x7fffe8ebcf80) at block/block-backend.c:473
#9 0x55b6c835 in block_job_free (job=0x7fff64505000) at blockjob.c:89
#10 0x55b6dd19 in job_unref (job=0x7fff64505000) at job.c:360
#11 0x55b6dd19 in job_unref (job=0x7fff64505000) at job.c:352
#12 0x55b6e69a in job_finish_sync (job=job@entry=0x7fff64505000, finish=finish@entry=0x55b6ec80 , errp=errp@entry=0x0) at job.c:988
#13 0x55b6ec9e in job_cancel_sync (job=job@entry=0x7fff64505000) at job.c:931
...

Any ideas?
Re: [PULL 006/136] vl.c: move -m parsing after memory backends has been processed
Hi On 2/25/20 12:48 PM, Paolo Bonzini wrote: > From: Igor Mammedov > > It will be possible for main RAM to come from memory-backend > and we should check that size specified in -m matches the size > of the backend and [MachineState::]ram_size also matches > backend's size. > > However -m parsing (set_memory_options()) happens before backends > are intialized (object_create_delayed()) which complicates it. > Consolidate set_memory_options() and assigning parsed results to > current_machine after backends are initialized, so it would be > possible access the initialized backend instance to compare > sizes. > > This patch only consolidates scattered places touching ram_size > within vl.c. And follow up patch will integrate backend handling > to set_memory_options(). > > Signed-off-by: Igor Mammedov > Message-Id: <20200219160953.13771-7-imamm...@redhat.com> Unfortunately this patch breaks ARM virt memory map setting in KVM mode. If you launch ARM virt with PCDIMM you now get qemu-system-aarch64: -device pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm: memory devices (e.g. for memory hotplug) are not supported by the machine With machvirt/KVM configure_accelerators() calls ARM virt_kvm_type(). This function freezes the machine memory map and computes the size of the IPA to be supported by KVM. See: c9650222b8 ("hw/arm/virt: Implement kvm_type function for 4.0 machine") virt_kvm_type() uses machine memory settings. Igor's patch moves those after the mem backend init, so the virt memory map does not properly take into account the machine memory settings anymore. So we need to find a way to rework this. 
Thanks Eric > --- > vl.c | 27 ++- > 1 file changed, 14 insertions(+), 13 deletions(-) > > diff --git a/vl.c b/vl.c > index 2103804..72ffc06 100644 > --- a/vl.c > +++ b/vl.c > @@ -2655,6 +2655,14 @@ static void set_memory_options(uint64_t *ram_slots, > ram_addr_t *maxram_size, > exit(EXIT_FAILURE); > } > > +if (!xen_enabled()) { > +/* On 32-bit hosts, QEMU is limited by virtual address space */ > +if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) { > +error_report("at most 2047 MB RAM can be simulated"); > +exit(1); > +} > +} > + > loc_pop(&loc); > } > > @@ -3819,8 +3827,6 @@ int main(int argc, char **argv, char **envp) > machine_class = select_machine(); > object_set_machine_compat_props(machine_class->compat_props); > > -set_memory_options(&ram_slots, &maxram_size, machine_class); > - > os_daemonize(); > rcu_disable_atfork(); > > @@ -4122,9 +4128,6 @@ int main(int argc, char **argv, char **envp) > machine_opts = qemu_get_machine_opts(); > qemu_opt_foreach(machine_opts, machine_set_property, current_machine, > &error_fatal); > -current_machine->ram_size = ram_size; > -current_machine->maxram_size = maxram_size; > -current_machine->ram_slots = ram_slots; > > /* > * Note: uses machine properties such as kernel-irqchip, must run > @@ -4235,14 +4238,6 @@ int main(int argc, char **argv, char **envp) > > tpm_init(); > > -if (!xen_enabled()) { > -/* On 32-bit hosts, QEMU is limited by virtual address space */ > -if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) { > -error_report("at most 2047 MB RAM can be simulated"); > -exit(1); > -} > -} > - > blk_mig_init(); > ram_mig_init(); > dirty_bitmap_mig_init(); > @@ -4287,6 +4282,12 @@ int main(int argc, char **argv, char **envp) > if (cpu_option) { > current_machine->cpu_type = parse_cpu_option(cpu_option); > } > + > +set_memory_options(&ram_slots, &maxram_size, machine_class); > +current_machine->ram_size = ram_size; > +current_machine->maxram_size = maxram_size; > +current_machine->ram_slots = ram_slots; > + > 
parse_numa_opts(current_machine); > > if (machine_class->default_ram_id && current_machine->ram_size && >
Re: backup transaction with io-thread core dumps
> > > > So the solution is to disable backups when using io-threads? > > > > > > > > > > I meant forbidding transactions with completion-mode == grouped. It > > > would be still possible running transactions (and thus, backups) with > > > completion-mode == individual, which is the default. > > > > As mentioned earlier, even a totally simple/normal backup job fails when > > using io-threads and the VM is under load. It results in a total > > VM freeze! > > > > This is definitely a different issue. I'll take a look at it today. Thanks. Stefan found a way to avoid that bug with: https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07749.html But there are doubts that this is the correct way to fix it ...
Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue
> On 26.03.2020 at 08:21, Michael S. Tsirkin wrote:
>
> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
>>> On 12.03.20 09:47, Michael S. Tsirkin wrote:
>>> On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
>>> 2. You are essentially stealing THPs in the guest. So the fastest mapping (THP in guest and host) is gone. The guest won't be able to make use of THP where it previously was able to. I can imagine this implies a performance degradation for some workloads. This needs a proper performance evaluation.
>>>
>>> I think the problem is more with the alloc_pages API.
>>> That gives you exactly the given order, and if there's
>>> a larger chunk available, it will split it up.
>>>
>>> But for balloon - I suspect lots of other users,
>>> we do not want to stress the system but if a large
>>> chunk is available anyway, then we could handle
>>> that more optimally by getting it all in one go.
>>>
>>> So if we want to address this, IMHO this calls for a new API.
>>> Along the lines of
>>>
>>>     struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
>>>                                   unsigned int max_order, unsigned int *order)
>>>
>>> the idea would then be to return a number of pages in the given
>>> range.
>>>
>>> What do you think? Want to try implementing that?
>>
>> You can just start with the highest order and decrement the order until
>> your allocation succeeds using alloc_pages(), which would be enough for
>> a first version. At least I don't see the immediate need for a new
>> kernel API.
>
> OK I remember now. The problem is with reclaim. Unless reclaim is
> completely disabled, any of these calls can sleep. After it wakes up,
> we would like to get the larger order that has become available
> meanwhile.

Yes, but that's a pure optimization IMHO. So I think we should do a trivial implementation first and then see what we gain from a new allocator API. Then we might also be able to justify it using real numbers.
Re: backup transaction with io-thread core dumps
On Wed, Mar 25, 2020 at 04:40:48PM +0100, Dietmar Maurer wrote: > > On March 25, 2020 1:39 PM Sergio Lopez wrote: > > > > > > On Wed, Mar 25, 2020 at 01:29:48PM +0100, Dietmar Maurer wrote: > > > > As expected, if both BDS are running on the same IOThread (and thus, > > > > the same AioContext), the problem is not reproducible. > > > > > > > > In a general sense, we could say that completion modes other than > > > > "individual" are not supported for a transaction that may access > > > > different AioContexts. I don't see a safe an easy way to fix this. We > > > > could opt for simply detect and forbid such completion modes when the > > > > BDS's are assigned to different AioContexts. Perhaps Kevin (in CC) has > > > > a better idea. > > > > > > So the solution is to disable backups when using io-threads? > > > > > > > I meant forbidding transactions with completion-mode == grouped. It > > would be still possible running transactions (and thus, backups) with > > completion-mode == individual, which is the default. > > As mentioned earlier, even a totally simple/normal backup job fails when > using io-threads and the VM is under load. It results in a total > VM freeze! > This is definitely a different issue. I'll take a look at it today. Cheers, Sergio. signature.asc Description: PGP signature
Re: [PATCH 00/13] microvm: add acpi support
On Wed, Mar 25, 2020 at 07:44:34PM +0100, Igor Mammedov wrote: > On Wed, 25 Mar 2020 16:03:39 +0100 > Gerd Hoffmann wrote: > > > On Wed, Mar 25, 2020 at 01:32:12PM +0100, Igor Mammedov wrote: > > > On Thu, 19 Mar 2020 09:01:04 +0100 > > > Gerd Hoffmann wrote: > > > > > > > I know that not supporting ACPI in microvm is intentional. If you still > > > > don't want ACPI this is perfectly fine, you can use the usual -no-acpi > > > > switch to toggle ACPI support. > > > > > > > > These are the advantages you are going to loose then: > > > > > > > > (1) virtio-mmio device discovery without command line hacks (tweaking > > > > the command line is a problem when not using direct kernel boot). > > > > (2) Better IO-APIC support, we can use IRQ lines 16-23. > > > > (3) ACPI power button (aka powerdown request) works. > > > > (4) machine poweroff (aka S5 state) works. > > > > > > > > Together with seabios patches for virtio-mmio support this allows to > > > > boot standard fedora images (cloud, coreos, workstation live) with the > > > > microvm machine type. > > > > > > what CLI do you use to test it? > > > > Test script below. "qemu-default" is a wrapper script which starts > > qemu-system-x86_64 from my build directory. "qemu-firmware" is the > > same plus isa-debugcon configured for a firmware log on stdout. 
> > > > Latest bits (with some of the review comments addressed) just pushed > > to git://git,kraxel.org/qemu sirius/microvm > > thanks, below are test results I got on my system, > spoiler hw-reduced reduces boot time on ~0.02s compared to full blown acpi > > using timestamp at "Run /init as init process" as measuring point > > no acpi > 1.967316 > 1.975272 > 1.981267 > 1.974316 > 1.962452 > 1.960988 > > hw reduced acpi > 0.893838 > 0.892573 > 0.890585 > 0.900306 > 0.897902 > > normal acpi: > 0.921647 > 0.916298 > 0.923518 > 0.916298 > 0.913234 > > PS: > I just quickly hacked hw-reduced acpi (using arm/virt as model) > without implementing power button but I doubt that would affect results > noticeably > on qemu side it probably also will save some time since there are less > things to setup for qemu. And no ACPI is faster because of PS/2 probing, right? > > > > HTH, > > Gerd > > > > cut here > > #!/bin/sh > > > > mode="${1}" > > shift > > > > back=() > > devs=() > > args=() > > qemu="qemu-firmware -monitor none -boot menu=on" > > disk="" > > liso="" > > krnl="" > > karg="console=ttyS0,115200" > > > > case "$mode" in > > kernel) > > qemu="qemu-default -nographic" > > disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2" > > krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage" > > karg="$karg root=/dev/sda4" > > karg="$karg quiet" > > ;; > > seabios) > > disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2" > > krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage" > > karg="$karg root=/dev/sda4" > > args+=("-bios" > > "/home/kraxel/projects/seabios/out-bios-microvm/bios.bin") > > ;; > > cloud) > > disk="/vmdisk/iso/Fedora-Cloud-Base-31-1.9.x86_64.raw" > > ;; > > coreos) > > disk="/vmdisk/iso/fedora-coreos-31.20200210.3.0-metal.x86_64.raw" > > ;; > > live) > > liso="/vmdisk/iso/Fedora-Workstation-Live-x86_64-30-1.2.iso" > > devs+=("-device" "virtio-gpu-device") > > devs+=("-device" "virtio-keyboard-device") > 
> devs+=("-device" "virtio-tablet-device") > > ;; > > *) > > echo "unknown mode: $mode" > > echo "known modes: kernel seabios cloud coreos live" > > exit 1 > > ;; > > esac > > > > if test "$disk" != ""; then > > format="${disk##*.}" > > back+=("-drive" "if=none,id=disk,format=${format},file=${disk}") > > devs+=("-device" "scsi-hd,drive=disk,bootindex=1") > > fi > > if test "$liso" != ""; then > > back+=("-drive" > > "if=none,id=cdrom,media=cdrom,readonly,format=raw,file=${liso}") > > devs+=("-device" "scsi-cd,drive=cdrom,bootindex=2") > > fi > > if test "$krnl" != ""; then > > args+=("-kernel" "$krnl") > > args+=("-append" "$karg") > > fi > > > > set -ex > > $qemu \ > > -enable-kvm \ > > -cpu host \ > > -M microvm,graphics=off,pit=off,pic=on,rtc=on \ > > \ > > -m 4G \ > > \ > > -netdev user,id=net \ > > "${back[@]}" \ > > \ > > -global virtio-mmio.force-legacy=false \ > > -device virtio-net-device,netdev=net \ > > -device virtio-scsi-device \ > > "${devs[@]}" \ > > \ > > "${args[@]}" \ > > "$@"
[Bug 1868116] Re: QEMU monitor no longer works
I'm not sure how many of you are tracking the Vte bug [1] so here a summary of the latest insight from there. - Short term it seems that new behavior will be reverted in Vte 0.60.1. - Long term the Vte devs might want to deprecate no-pty use cases or at least better understand why apps use it that way. For more details please read [1]. [1]: https://gitlab.gnome.org/GNOME/vte/issues/222 -- You received this bug notification because you are a member of qemu- devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1868116 Title: QEMU monitor no longer works Status in QEMU: New Status in qemu package in Ubuntu: Triaged Status in vte2.91 package in Ubuntu: New Bug description: Repro: VTE $ meson _build && ninja -C _build && ninja -C _build install qemu: $ ../configure --python=/usr/bin/python3 --disable-werror --disable-user --disable-linux-user --disable-docs --disable-guest-agent --disable-sdl --enable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-fdt --disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi --disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir --disable-seccomp --disable-glusterfs --disable-tpm --disable-numa --disable-opengl --disable-virglrenderer --disable-xfsctl --disable-vxhs --disable-slirp --disable-blobs --target-list=x86_64-softmmu --disable-rdma --disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock --disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user --disable-spice --disable-qom-cast-debug --disable-vxhs --disable-bochs --disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat --disable-qed --disable-parallels --disable-sheepdog --disable-avx2 --disable-nettle --disable-gnutls --disable-capstone --disable-tools --disable-libpmem --disable-iconv --disable-cap-ng $ make Test: $ LD_LIBRARY_PATH=/usr/local/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH ./build/x86_64-softmmu/qemu-system-x86_64 -enable-kvm --drive 
media=cdrom,file=http://archive.ubuntu.com/ubuntu/dists/bionic/main/installer-amd64/current/images/netboot/mini.iso - switch to monitor with CTRL+ALT+2 - try to enter something Affects head of both usptream git repos. --- original bug --- It was observed that the QEMU console (normally accessible using Ctrl+Alt+2) accepts no input, so it can't be used. This is being problematic because there are cases where it's required to send commands to the guest, or key combinations that the host would grab (as Ctrl-Alt-F1 or Alt-F4). ProblemType: Bug DistroRelease: Ubuntu 20.04 Package: qemu 1:4.2-3ubuntu2 Uname: Linux 5.6.0-rc6+ x86_64 ApportVersion: 2.20.11-0ubuntu20 Architecture: amd64 CurrentDesktop: XFCE Date: Thu Mar 19 12:16:31 2020 Dependencies: InstallationDate: Installed on 2017-06-13 (1009 days ago) InstallationMedia: Xubuntu 17.04 "Zesty Zapus" - Release amd64 (20170412) KvmCmdLine: COMMAND STAT EUID RUID PIDPPID %CPU COMMAND qemu-system-x86 Sl+ 1000 1000 34275 25235 29.2 qemu-system-x86_64 -m 4G -cpu Skylake-Client -device virtio-vga,virgl=true,xres=1280,yres=720 -accel kvm -device nec-usb-xhci -serial vc -serial stdio -hda /home/usuario/Sistemas/androidx86.img -display gtk,gl=on -device usb-audio kvm-nx-lpage-re S0 0 34284 2 0.0 [kvm-nx-lpage-re] kvm-pit/34275 S0 0 34286 2 0.0 [kvm-pit/34275] MachineType: LENOVO 80UG ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.6.0-rc6+ root=UUID=6b4ae5c0-c78c-49a6-a1ba-029192618a7a ro quiet ro kvm.ignore_msrs=1 kvm.report_ignored_msrs=0 kvm.halt_poll_ns=0 kvm.halt_poll_ns_grow=0 i915.enable_gvt=1 i915.fastboot=1 cgroup_enable=memory swapaccount=1 zswap.enabled=1 zswap.zpool=z3fold resume=UUID=a82e38a0-8d20-49dd-9cbd-de7216b589fc log_buf_len=16M usbhid.quirks=0x0079:0x0006:0x10 config_scsi_mq_default=y scsi_mod.use_blk_mq=1 mtrr_gran_size=64M mtrr_chunk_size=64M nbd.nbds_max=2 nbd.max_part=63 SourcePackage: qemu UpgradeStatus: Upgraded to focal on 2019-12-22 (87 days ago) dmi.bios.date: 08/09/2018 dmi.bios.vendor: LENOVO 
dmi.bios.version: 0XCN45WW
dmi.board.asset.tag: NO Asset Tag
dmi.board.name: Toronto 4A2
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40679 WIN
dmi.chassis.asset.tag: NO Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Lenovo ideapad 310-14ISK
dmi.modalias: dmi:bvnLENOVO:bvr0XCN45WW:bd08/09/2018:svnLENOVO:pn80UG:pvrLenovoideapad310-14ISK:rvnLENOVO:rnToronto4A2:rvrSDK0J40679WIN:cvnLENOVO:ct10:cvrLenovoideapad310-14ISK:
dmi.product.family: IDEAPAD
dmi.product.name: 80UG
dmi.product.sku: LENOVO_MT_80UG_BU_idea_FM_Lenovo ideapad 310-14ISK
dmi.product.version: Lenovo ideapad 310-14ISK
dmi.sys.vendor: LENOVO
mtime.conffile..etc.apport.crashdb.conf: 2019-08-29T08:39:36.787240
[PULL 3/6] linux-user/i386: Split out gen_signal
From: Richard Henderson This is a bit tidier than open-coding the 5 lines necessary to initialize the target_siginfo_t. In addition, this zeros the remaining bytes of the target_siginfo_t, rather than passing in garbage. Reviewed-by: Paolo Bonzini Reviewed-by: Alex Bennée Reviewed-by: Philippe Mathieu-Daudé Signed-off-by: Richard Henderson Message-Id: <20200213032223.14643-3-richard.hender...@linaro.org> Signed-off-by: Laurent Vivier --- linux-user/i386/cpu_loop.c | 93 ++ 1 file changed, 33 insertions(+), 60 deletions(-) diff --git a/linux-user/i386/cpu_loop.c b/linux-user/i386/cpu_loop.c index 024b6f4d588c..e217cca5ee1e 100644 --- a/linux-user/i386/cpu_loop.c +++ b/linux-user/i386/cpu_loop.c @@ -81,13 +81,23 @@ static void set_idt(int n, unsigned int dpl) } #endif +static void gen_signal(CPUX86State *env, int sig, int code, abi_ptr addr) +{ +target_siginfo_t info = { +.si_signo = sig, +.si_code = code, +._sifields._sigfault._addr = addr +}; + +queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info); +} + void cpu_loop(CPUX86State *env) { CPUState *cs = env_cpu(env); int trapnr; abi_ulong pc; abi_ulong ret; -target_siginfo_t info; for(;;) { cpu_exec_start(cs); @@ -134,70 +144,45 @@ void cpu_loop(CPUX86State *env) #endif case EXCP0B_NOSEG: case EXCP0C_STACK: -info.si_signo = TARGET_SIGBUS; -info.si_errno = 0; -info.si_code = TARGET_SI_KERNEL; -info._sifields._sigfault._addr = 0; -queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info); +gen_signal(env, TARGET_SIGBUS, TARGET_SI_KERNEL, 0); break; case EXCP0D_GPF: /* XXX: potential problem if ABI32 */ #ifndef TARGET_X86_64 if (env->eflags & VM_MASK) { handle_vm86_fault(env); -} else -#endif -{ -info.si_signo = TARGET_SIGSEGV; -info.si_errno = 0; -info.si_code = TARGET_SI_KERNEL; -info._sifields._sigfault._addr = 0; -queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info); +break; } +#endif +gen_signal(env, TARGET_SIGSEGV, TARGET_SI_KERNEL, 0); break; case EXCP0E_PAGE: -info.si_signo = TARGET_SIGSEGV; -info.si_errno = 0; 
-if (!(env->error_code & 1)) -info.si_code = TARGET_SEGV_MAPERR; -else -info.si_code = TARGET_SEGV_ACCERR; -info._sifields._sigfault._addr = env->cr[2]; -queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info); +gen_signal(env, TARGET_SIGSEGV, + (env->error_code & 1 ? +TARGET_SEGV_ACCERR : TARGET_SEGV_MAPERR), + env->cr[2]); break; case EXCP00_DIVZ: #ifndef TARGET_X86_64 if (env->eflags & VM_MASK) { handle_vm86_trap(env, trapnr); -} else -#endif -{ -/* division by zero */ -info.si_signo = TARGET_SIGFPE; -info.si_errno = 0; -info.si_code = TARGET_FPE_INTDIV; -info._sifields._sigfault._addr = env->eip; -queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info); +break; } +#endif +gen_signal(env, TARGET_SIGFPE, TARGET_FPE_INTDIV, env->eip); break; case EXCP01_DB: case EXCP03_INT3: #ifndef TARGET_X86_64 if (env->eflags & VM_MASK) { handle_vm86_trap(env, trapnr); -} else +break; +} #endif -{ -info.si_signo = TARGET_SIGTRAP; -info.si_errno = 0; -if (trapnr == EXCP01_DB) { -info.si_code = TARGET_TRAP_BRKPT; -info._sifields._sigfault._addr = env->eip; -} else { -info.si_code = TARGET_SI_KERNEL; -info._sifields._sigfault._addr = 0; -} -queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info); +if (trapnr == EXCP01_DB) { +gen_signal(env, TARGET_SIGTRAP, TARGET_TRAP_BRKPT, env->eip); +} else { +gen_signal(env, TARGET_SIGTRAP, TARGET_SI_KERNEL, 0); } break; case EXCP04_INTO: @@ -205,31 +190,19 @@ void cpu_loop(CPUX86State *env) #ifndef TARGET_X86_64 if (env->eflags & VM_MASK) { handle_vm86_trap(env, trapnr); -} else -#endif -{ -info.si_signo = TARGET_SIGSEGV; -info.si_errno = 0; -info.si_code = TARGET_SI_KERNEL; -info._sifields._sigfault
[PULL 4/6] linux-user/i386: Emulate x86_64 vsyscalls
From: Richard Henderson Notice the magic page during translate, much like we already do for the arm32 commpage. At runtime, raise an exception to return cpu_loop for emulation. Reviewed-by: Paolo Bonzini Signed-off-by: Richard Henderson Message-Id: <20200213032223.14643-4-richard.hender...@linaro.org> Signed-off-by: Laurent Vivier --- linux-user/i386/cpu_loop.c | 108 + target/i386/cpu.h | 7 +++ target/i386/translate.c| 14 - 3 files changed, 128 insertions(+), 1 deletion(-) diff --git a/linux-user/i386/cpu_loop.c b/linux-user/i386/cpu_loop.c index e217cca5ee1e..70cde417e605 100644 --- a/linux-user/i386/cpu_loop.c +++ b/linux-user/i386/cpu_loop.c @@ -92,6 +92,109 @@ static void gen_signal(CPUX86State *env, int sig, int code, abi_ptr addr) queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info); } +#ifdef TARGET_X86_64 +static bool write_ok_or_segv(CPUX86State *env, abi_ptr addr, size_t len) +{ +/* + * For all the vsyscalls, NULL means "don't write anything" not + * "write it at address 0". + */ +if (addr == 0 || access_ok(VERIFY_WRITE, addr, len)) { +return true; +} + +env->error_code = PG_ERROR_W_MASK | PG_ERROR_U_MASK; +gen_signal(env, TARGET_SIGSEGV, TARGET_SEGV_MAPERR, addr); +return false; +} + +/* + * Since v3.1, the kernel traps and emulates the vsyscall page. + * Entry points other than the official generate SIGSEGV. + */ +static void emulate_vsyscall(CPUX86State *env) +{ +int syscall; +abi_ulong ret; +uint64_t caller; + +/* + * Validate the entry point. We have already validated the page + * during translation to get here; now verify the offset. + */ +switch (env->eip & ~TARGET_PAGE_MASK) { +case 0x000: +syscall = TARGET_NR_gettimeofday; +break; +case 0x400: +syscall = TARGET_NR_time; +break; +case 0x800: +syscall = TARGET_NR_getcpu; +break; +default: +goto sigsegv; +} + +/* + * Validate the return address. + * Note that the kernel treats this the same as an invalid entry point. 
+ */ +if (get_user_u64(caller, env->regs[R_ESP])) { +goto sigsegv; +} + +/* + * Validate the the pointer arguments. + */ +switch (syscall) { +case TARGET_NR_gettimeofday: +if (!write_ok_or_segv(env, env->regs[R_EDI], + sizeof(struct target_timeval)) || +!write_ok_or_segv(env, env->regs[R_ESI], + sizeof(struct target_timezone))) { +return; +} +break; +case TARGET_NR_time: +if (!write_ok_or_segv(env, env->regs[R_EDI], sizeof(abi_long))) { +return; +} +break; +case TARGET_NR_getcpu: +if (!write_ok_or_segv(env, env->regs[R_EDI], sizeof(uint32_t)) || +!write_ok_or_segv(env, env->regs[R_ESI], sizeof(uint32_t))) { +return; +} +break; +default: +g_assert_not_reached(); +} + +/* + * Perform the syscall. None of the vsyscalls should need restarting. + */ +ret = do_syscall(env, syscall, env->regs[R_EDI], env->regs[R_ESI], + env->regs[R_EDX], env->regs[10], env->regs[8], + env->regs[9], 0, 0); +g_assert(ret != -TARGET_ERESTARTSYS); +g_assert(ret != -TARGET_QEMU_ESIGRETURN); +if (ret == -TARGET_EFAULT) { +goto sigsegv; +} +env->regs[R_EAX] = ret; + +/* Emulate a ret instruction to leave the vsyscall page. */ +env->eip = caller; +env->regs[R_ESP] += 8; +return; + + sigsegv: +/* Like force_sig(SIGSEGV). */ +gen_signal(env, TARGET_SIGSEGV, TARGET_SI_KERNEL, 0); +} +#endif + void cpu_loop(CPUX86State *env) { CPUState *cs = env_cpu(env); @@ -141,6 +244,11 @@ void cpu_loop(CPUX86State *env) env->regs[R_EAX] = ret; } break; +#endif +#ifdef TARGET_X86_64 +case EXCP_VSYSCALL: +emulate_vsyscall(env); +break; #endif case EXCP0B_NOSEG: case EXCP0C_STACK: diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 49ecc23104c9..9af1b0c12e8e 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1003,6 +1003,7 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS]; #define EXCP_VMEXIT 0x100 /* only for system emulation */ #define EXCP_SYSCALL0x101 /* only for user emulation */ +#define EXCP_VSYSCALL 0x102 /* only for user emulation */ /* i386-specific interrupt pending bits. 
*/ #define CPU_INTERRUPT_POLL CPU_INTERRUPT_TGT_EXT_1 @@ -2218,4 +2219,10 @@ static inline bool hyperv_feat_enabled(X86CPU *cpu, int feat) return !!(cpu->hyperv_features & BIT(feat)); } +#if defined(TARGET_X86_64) && \ +defined(CONFIG_USER_ONLY) && \ +defined(CONFIG_LINUX) +# define TARGET_VSYSCALL_PAGE (UINT64_C(-10) << 20) +#endif + #endif /* I3
Re: [PATCH for QEMU v2] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_VQ_INFLATE_CONT
On Mon, Mar 23, 2020 at 12:04:57AM +0800, Hui Zhu wrote: > If the guest kernel has many fragmentation pages, use virtio_balloon > will split THP of QEMU when it calls MADV_DONTNEED madvise to release > the balloon pages. > Set option cont-pages to on will open flags VIRTIO_BALLOON_VQ_INFLATE_CONT > and set continuous pages order to THP order. > Then It will get continuous pages PFN from VQ icvq use use madvise > MADV_DONTNEED release the THP page. > This will handle the THP split issue. > > Signed-off-by: Hui Zhu > --- > hw/virtio/virtio-balloon.c | 32 > + > include/hw/virtio/virtio-balloon.h | 4 +++- > include/standard-headers/linux/virtio_balloon.h | 4 > 3 files changed, 35 insertions(+), 5 deletions(-) > > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c > index a4729f7..88bdaca 100644 > --- a/hw/virtio/virtio-balloon.c > +++ b/hw/virtio/virtio-balloon.c > @@ -34,6 +34,7 @@ > #include "hw/virtio/virtio-access.h" > > #define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT) > +#define CONT_PAGES_ORDER 9 > > typedef struct PartiallyBalloonedPage { > ram_addr_t base_gpa; This doesn't look right to me. I suspect this is different between different hosts. Fixing this would also be tricky as we might need to migrate beween two hosts with different huge page sizes. My proposal is to instead enhance the PartiallyBalloonedPage machinery, teaching it to handle the case where host page size is smaller than the supported number of subpages. 
> @@ -65,7 +66,8 @@ static bool > virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp, > > static void balloon_inflate_page(VirtIOBalloon *balloon, > MemoryRegion *mr, hwaddr mr_offset, > - PartiallyBalloonedPage *pbp) > + PartiallyBalloonedPage *pbp, > + bool is_cont_pages) > { > void *addr = memory_region_get_ram_ptr(mr) + mr_offset; > ram_addr_t rb_offset, rb_aligned_offset, base_gpa; > @@ -76,6 +78,13 @@ static void balloon_inflate_page(VirtIOBalloon *balloon, > /* XXX is there a better way to get to the RAMBlock than via a > * host address? */ > rb = qemu_ram_block_from_host(addr, false, &rb_offset); > + > +if (is_cont_pages) { > +ram_block_discard_range(rb, rb_offset, > +BALLOON_PAGE_SIZE << CONT_PAGES_ORDER); > +return; > +} > + > rb_page_size = qemu_ram_pagesize(rb); > > if (rb_page_size == BALLOON_PAGE_SIZE) { > @@ -361,9 +370,10 @@ static void virtio_balloon_handle_output(VirtIODevice > *vdev, VirtQueue *vq) > > trace_virtio_balloon_handle_output(memory_region_name(section.mr), > pa); > if (!qemu_balloon_is_inhibited()) { > -if (vq == s->ivq) { > +if (vq == s->ivq || vq == s->icvq) { > balloon_inflate_page(s, section.mr, > - section.offset_within_region, &pbp); > + section.offset_within_region, &pbp, > + vq == s->icvq); > } else if (vq == s->dvq) { > balloon_deflate_page(s, section.mr, > section.offset_within_region); > } else { > @@ -618,9 +628,12 @@ static size_t virtio_balloon_config_size(VirtIOBalloon > *s) > if (s->qemu_4_0_config_size) { > return sizeof(struct virtio_balloon_config); > } > -if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) { > +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) { > return sizeof(struct virtio_balloon_config); > } > +if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) { > +return offsetof(struct virtio_balloon_config, pages_order); > +} > if (virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) { > return offsetof(struct virtio_balloon_config, poison_val); > 
} > @@ -646,6 +659,10 @@ static void virtio_balloon_get_config(VirtIODevice > *vdev, uint8_t *config_data) > cpu_to_le32(VIRTIO_BALLOON_CMD_ID_DONE); > } > > +if (virtio_has_feature(dev->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) > { > +config.pages_order = cpu_to_le32(CONT_PAGES_ORDER); > +} > + > trace_virtio_balloon_get_config(config.num_pages, config.actual); > memcpy(config_data, &config, virtio_balloon_config_size(dev)); > } > @@ -816,6 +833,11 @@ static void virtio_balloon_device_realize(DeviceState > *dev, Error **errp) > virtio_error(vdev, "iothread is missing"); > } > } > + > +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) { > +s->icvq = virtio_add_queue(vdev, 128, virtio_bal
[PULL 1/6] linux-user, configure: fix (again) syscall_nr.h dependencies cleanup
This patch fixes two problems: - it cleans up linux-user variants (for instance ppc64-linux-user and ppc64le-linux-user) - it removes the .o file when it removes the .d file, otherwise the .o file is never updated Fixes: 5f29856b852d ("linux-user, configure: improve syscall_nr.h dependencies checking") Fixes: 4d6a835dea47 ("linux-user: introduce parameters to generate syscall_nr.h") Signed-off-by: Laurent Vivier Reviewed-by: Richard Henderson Message-Id: <20200325075757.1959961-1-laur...@vivier.eu> --- configure | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/configure b/configure index da09c3589572..89fe881dd46f 100755 --- a/configure +++ b/configure @@ -1910,9 +1910,11 @@ for arch in alpha hppa m68k xtensa sh4 microblaze arm ppc s390x sparc sparc64 \ # remove the file if it has been generated in the source directory rm -f "${source_path}/linux-user/${arch}/syscall_nr.h" # remove the dependency files -test -d ${arch}-linux-user && find ${arch}-linux-user -type f -name "*.d" \ - -exec grep -q "${source_path}/linux-user/${arch}/syscall_nr.h" {} \; \ - -exec rm {} \; +for target in ${arch}*-linux-user ; do +test -d "${target}" && find "${target}" -type f -name "*.d" \ + -exec grep -q "${source_path}/linux-user/${arch}/syscall_nr.h" {} \; \ + -print | while read file ; do rm "${file}" "${file%.d}.o" ; done +done done if test -z "$python" -- 2.25.1
[PULL 6/6] linux-user: Flush out implementation of gettimeofday
From: Richard Henderson The first argument, timeval, is allowed to be NULL. The second argument, timezone, was missing. While its use is deprecated, it is still present in the syscall. Reviewed-by: Alex Bennée Reviewed-by: Philippe Mathieu-Daudé Signed-off-by: Richard Henderson Message-Id: <20200213032223.14643-6-richard.hender...@linaro.org> [lv: add "#if defined(TARGET_NR_gettimeofday)"] Signed-off-by: Laurent Vivier --- linux-user/syscall.c | 29 +++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/linux-user/syscall.c b/linux-user/syscall.c index dbdd56e42077..49395dcea978 100644 --- a/linux-user/syscall.c +++ b/linux-user/syscall.c @@ -1273,6 +1273,25 @@ static inline abi_long host_to_target_timespec64(abi_ulong target_addr, return 0; } +#if defined(TARGET_NR_gettimeofday) +static inline abi_long copy_to_user_timezone(abi_ulong target_tz_addr, + struct timezone *tz) +{ +struct target_timezone *target_tz; + +if (!lock_user_struct(VERIFY_WRITE, target_tz, target_tz_addr, 1)) { +return -TARGET_EFAULT; +} + +__put_user(tz->tz_minuteswest, &target_tz->tz_minuteswest); +__put_user(tz->tz_dsttime, &target_tz->tz_dsttime); + +unlock_user_struct(target_tz, target_tz_addr, 1); + +return 0; +} +#endif + #if defined(TARGET_NR_settimeofday) static inline abi_long copy_from_user_timezone(struct timezone *tz, abi_ulong target_tz_addr) @@ -8710,10 +8729,16 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1, case TARGET_NR_gettimeofday: { struct timeval tv; -ret = get_errno(gettimeofday(&tv, NULL)); +struct timezone tz; + +ret = get_errno(gettimeofday(&tv, &tz)); if (!is_error(ret)) { -if (copy_to_user_timeval(arg1, &tv)) +if (arg1 && copy_to_user_timeval(arg1, &tv)) { +return -TARGET_EFAULT; +} +if (arg2 && copy_to_user_timezone(arg2, &tz)) { return -TARGET_EFAULT; +} } } return ret; -- 2.25.1
[PULL 2/6] target/i386: Renumber EXCP_SYSCALL
From: Richard Henderson We are not short of numbers for EXCP_*. There is no need to confuse things by having EXCP_VMEXIT and EXCP_SYSCALL overlap, even though the former is only used for system mode and the latter is only used for user mode. Reviewed-by: Paolo Bonzini Reviewed-by: Alex Bennée Reviewed-by: Philippe Mathieu-Daudé Signed-off-by: Richard Henderson Message-Id: <20200213032223.14643-2-richard.hender...@linaro.org> Signed-off-by: Laurent Vivier --- target/i386/cpu.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index 60d797d5941f..49ecc23104c9 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -1001,9 +1001,8 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS]; #define EXCP11_ALGN17 #define EXCP12_MCHK18 -#define EXCP_SYSCALL0x100 /* only happens in user only emulation - for syscall instruction */ -#define EXCP_VMEXIT 0x100 +#define EXCP_VMEXIT 0x100 /* only for system emulation */ +#define EXCP_SYSCALL0x101 /* only for user emulation */ /* i386-specific interrupt pending bits. */ #define CPU_INTERRUPT_POLL CPU_INTERRUPT_TGT_EXT_1 -- 2.25.1
[PULL 5/6] linux-user: Add x86_64 vsyscall page to /proc/self/maps
From: Richard Henderson The page isn't (necessarily) present in the host /proc/self/maps, and even if it might be it isn't present in page_flags, and even if it was it might not have the same set of page permissions. The easiest thing to do, particularly when it comes to the "[vsyscall]" note at the end of line, is to special case it. Signed-off-by: Richard Henderson Message-Id: <20200213032223.14643-5-richard.hender...@linaro.org> [lv: remove trailing space] Signed-off-by: Laurent Vivier --- linux-user/syscall.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/linux-user/syscall.c b/linux-user/syscall.c index 35f414666243..dbdd56e42077 100644 --- a/linux-user/syscall.c +++ b/linux-user/syscall.c @@ -7144,6 +7144,16 @@ static int open_self_maps(void *cpu_env, int fd) } } +#ifdef TARGET_VSYSCALL_PAGE +/* + * We only support execution from the vsyscall page. + * This is as if CONFIG_LEGACY_VSYSCALL_XONLY=y from v5.3. + */ +dprintf(fd, TARGET_FMT_lx "-" TARGET_FMT_lx +" --xp 00:00 0 [vsyscall]\n", +TARGET_VSYSCALL_PAGE, TARGET_VSYSCALL_PAGE + TARGET_PAGE_SIZE); +#endif + free(line); fclose(fp); -- 2.25.1
[PULL 0/6] Linux user for 5.0 patches
The following changes since commit 736cf607e40674776d752acc201f565723e86045: Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +) are available in the Git repository at: git://github.com/vivier/qemu.git tags/linux-user-for-5.0-pull-request for you to fetch changes up to a52f5f87bece827a338d6eb3332e3def86fb9c33: linux-user: Flush out implementation of gettimeofday (2020-03-26 08:08:54 +0100) Emulate x86_64 vsyscalls Fix syscall_nr.h cleanup Laurent Vivier (1): linux-user, configure: fix (again) syscall_nr.h dependencies cleanup Richard Henderson (5): target/i386: Renumber EXCP_SYSCALL linux-user/i386: Split out gen_signal linux-user/i386: Emulate x86_64 vsyscalls linux-user: Add x86_64 vsyscall page to /proc/self/maps linux-user: Flush out implementation of gettimeofday configure | 8 +- linux-user/i386/cpu_loop.c | 201 ++--- linux-user/syscall.c | 39 ++- target/i386/cpu.h | 12 ++- target/i386/translate.c| 14 ++- 5 files changed, 205 insertions(+), 69 deletions(-) -- 2.25.1
Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP split issue
On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote: > On 12.03.20 09:47, Michael S. Tsirkin wrote: > > On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote: > >> 2. You are essentially stealing THPs in the guest. So the fastest > >> mapping (THP in guest and host) is gone. The guest won't be able to make > >> use of THP where it previously was able to. I can imagine this implies a > >> performance degradation for some workloads. This needs a proper > >> performance evaluation. > > > > I think the problem is more with the alloc_pages API. > > That gives you exactly the given order, and if there's > > a larger chunk available, it will split it up. > > > > But for balloon - I suspect lots of other users, > > we do not want to stress the system but if a large > > chunk is available anyway, then we could handle > > that more optimally by getting it all in one go. > > > > > > So if we want to address this, IMHO this calls for a new API. > > Along the lines of > > > > struct page *alloc_page_range(gfp_t gfp, unsigned int min_order, > > unsigned int max_order, unsigned int > > *order) > > > > the idea would then be to return at a number of pages in the given > > range. > > > > What do you think? Want to try implementing that? > > You can just start with the highest order and decrement the order until > your allocation succeeds using alloc_pages(), which would be enough for > a first version. At least I don't see the immediate need for a new > kernel API. OK I remember now. The problem is with reclaim. Unless reclaim is completely disabled, any of these calls can sleep. After it wakes up, we would like to get the larger order that has become available meanwhile. > -- > Thanks, > > David / dhildenb
Re: [PATCH for QEMU v2] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_VQ_INFLATE_CONT
> Am 26.03.2020 um 08:06 schrieb teawater : > > Ping. > On paid leave this week. Will try to have a look next week, but it could take a bit longer. Cheers > Thanks, > Hui > >> 2020年3月23日 00:04,Hui Zhu 写道: >> >> If the guest kernel has many fragmentation pages, use virtio_balloon >> will split THP of QEMU when it calls MADV_DONTNEED madvise to release >> the balloon pages. >> Set option cont-pages to on will open flags VIRTIO_BALLOON_VQ_INFLATE_CONT >> and set continuous pages order to THP order. >> Then It will get continuous pages PFN from VQ icvq use use madvise >> MADV_DONTNEED release the THP page. >> This will handle the THP split issue. >> >> Signed-off-by: Hui Zhu >> --- >> hw/virtio/virtio-balloon.c | 32 >> + >> include/hw/virtio/virtio-balloon.h | 4 +++- >> include/standard-headers/linux/virtio_balloon.h | 4 >> 3 files changed, 35 insertions(+), 5 deletions(-) >> >> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c >> index a4729f7..88bdaca 100644 >> --- a/hw/virtio/virtio-balloon.c >> +++ b/hw/virtio/virtio-balloon.c >> @@ -34,6 +34,7 @@ >> #include "hw/virtio/virtio-access.h" >> >> #define BALLOON_PAGE_SIZE (1 << VIRTIO_BALLOON_PFN_SHIFT) >> +#define CONT_PAGES_ORDER 9 >> >> typedef struct PartiallyBalloonedPage { >>ram_addr_t base_gpa; >> @@ -65,7 +66,8 @@ static bool >> virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp, >> >> static void balloon_inflate_page(VirtIOBalloon *balloon, >> MemoryRegion *mr, hwaddr mr_offset, >> - PartiallyBalloonedPage *pbp) >> + PartiallyBalloonedPage *pbp, >> + bool is_cont_pages) >> { >>void *addr = memory_region_get_ram_ptr(mr) + mr_offset; >>ram_addr_t rb_offset, rb_aligned_offset, base_gpa; >> @@ -76,6 +78,13 @@ static void balloon_inflate_page(VirtIOBalloon *balloon, >>/* XXX is there a better way to get to the RAMBlock than via a >> * host address? 
*/ >>rb = qemu_ram_block_from_host(addr, false, &rb_offset); >> + >> +if (is_cont_pages) { >> +ram_block_discard_range(rb, rb_offset, >> +BALLOON_PAGE_SIZE << CONT_PAGES_ORDER); >> +return; >> +} >> + >>rb_page_size = qemu_ram_pagesize(rb); >> >>if (rb_page_size == BALLOON_PAGE_SIZE) { >> @@ -361,9 +370,10 @@ static void virtio_balloon_handle_output(VirtIODevice >> *vdev, VirtQueue *vq) >>trace_virtio_balloon_handle_output(memory_region_name(section.mr), >> pa); >>if (!qemu_balloon_is_inhibited()) { >> -if (vq == s->ivq) { >> +if (vq == s->ivq || vq == s->icvq) { >>balloon_inflate_page(s, section.mr, >> - section.offset_within_region, >> &pbp); >> + section.offset_within_region, &pbp, >> + vq == s->icvq); >>} else if (vq == s->dvq) { >>balloon_deflate_page(s, section.mr, >> section.offset_within_region); >>} else { >> @@ -618,9 +628,12 @@ static size_t virtio_balloon_config_size(VirtIOBalloon >> *s) >>if (s->qemu_4_0_config_size) { >>return sizeof(struct virtio_balloon_config); >>} >> -if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) { >> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) { >>return sizeof(struct virtio_balloon_config); >>} >> +if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) { >> +return offsetof(struct virtio_balloon_config, pages_order); >> +} >>if (virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) { >>return offsetof(struct virtio_balloon_config, poison_val); >>} >> @@ -646,6 +659,10 @@ static void virtio_balloon_get_config(VirtIODevice >> *vdev, uint8_t *config_data) >> cpu_to_le32(VIRTIO_BALLOON_CMD_ID_DONE); >>} >> >> +if (virtio_has_feature(dev->host_features, >> VIRTIO_BALLOON_F_CONT_PAGES)) { >> +config.pages_order = cpu_to_le32(CONT_PAGES_ORDER); >> +} >> + >>trace_virtio_balloon_get_config(config.num_pages, config.actual); >>memcpy(config_data, &config, virtio_balloon_config_size(dev)); >> } >> @@ -816,6 +833,11 @@ static void virtio_balloon_device_realize(DeviceState >> 
*dev, Error **errp) >>virtio_error(vdev, "iothread is missing"); >>} >>} >> + >> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) { >> +s->icvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output); >> +} >> + >>reset_stats(s); >> } >> >> @@ -916,6 +938,8 @@ static Property virtio_balloon_properties[] = { >>VIRTIO_BALLOON_F_DE
Re: [PATCH for Linux v2] virtio_balloon: Add VIRTIO_BALLOON_VQ_INFLATE_CONT to handle THP split issue
First, either QEMU or Linux version of any interface changes should be copied to the virtio TC. : On Mon, Mar 23, 2020 at 12:04:56AM +0800, Hui Zhu wrote: > The first version is in [1]. > According to the comments from Michael and David, I updated the patch. > 1. Added a separate vq inflate_cont_vq to transport inflate continuous >pages. > 2. Set all the pages in the continuous pages movable then they can be >compaction. > 3. Added a new element pages_order to virtio_balloon_config. It is the >inflate pages order that is set by the QEMU. > 4. If the balloon cannot get any continuous pages from the system. >Go back to use the single page to fill the balloon. > 5. Use balloon_pages_alloc to allocate the single page and continuous > pages. Replace __GFP_NORETRY with __GFP_RETRY_MAYFAIL when allocating > the continuous pages because it can increase the success rate of > allocating large chunks of memory. > > Following is the introduction of the function. > If the guest kernel has many fragmentation pages, use virtio_balloon > will split THP of QEMU when it calls MADV_DONTNEED madvise to release > the balloon pages. > This is an example in a VM with 1G memory 1CPU: > // This is the THP number before VM execution in the host. > // None use THP. > cat /proc/meminfo | grep AnonHugePages: > AnonHugePages: 0 kB > > // After VM start, use usemem > // (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git) > // punch-holes function generates 400m fragmentation pages in the guest > // kernel. > usemem --punch-holes -s -1 800m & > > // This is the THP number after this command in the host. > // Some THP is used by VM because usemem will access 800M memory > // in the guest. > cat /proc/meminfo | grep AnonHugePages: > AnonHugePages:976896 kB > > // Connect to the QEMU monitor, setup balloon, and set it size to 600M. 
> (qemu) device_add virtio-balloon-pci,id=balloon1 > (qemu) info balloon > balloon: actual=1024 > (qemu) balloon 600 > (qemu) info balloon > balloon: actual=600 > > // This is the THP number after inflate the balloon in the host. > cat /proc/meminfo | grep AnonHugePages: > AnonHugePages:151552 kB > > THP number decreased more than 800M. > The reason is usemem with punch-holes option will free every other > page after allocation. Then 400M free memory inside the guest kernel > is fragmentation pages. > The guest kernel will use them to inflate the balloon. When these > fragmentation pages are freed, THP will be split. > > This commit tries to handle this with add a new flag > VIRTIO_BALLOON_VQ_INFLATE_CONT. > When this flag is set, the balloon will try to use continuous pages > inflate the balloon. And the pages order is set to THP order. > Then THP pages will be freed together in the host. > This is an example in a VM with 1G memory 1CPU: > // This is the THP number before VM execution in the host. > // None use THP. > cat /proc/meminfo | grep AnonHugePages: > AnonHugePages: 0 kB > > // After VM start, use usemem punch-holes function generates 400M > // fragmentation pages in the guest kernel. > usemem --punch-holes -s -1 800m & > > // This is the THP number after this command in the host. > // Some THP is used by VM because usemem will access 800M memory > // in the guest. > cat /proc/meminfo | grep AnonHugePages: > AnonHugePages:976896 kB > > // Connect to the QEMU monitor, setup balloon, and set it size to 600M. > (qemu) device_add virtio-balloon-pci,id=balloon1,cont-pages=on > (qemu) info balloon > balloon: actual=1024 > (qemu) balloon 600 > (qemu) info balloon > balloon: actual=600 > > // This is the THP number after inflate the balloon in the host. > cat /proc/meminfo | grep AnonHugePages: > AnonHugePages:610304 kB > > The THP number decreases 358M. This shows that > VIRTIO_BALLOON_VQ_INFLATE_CONT can help handle the THP split issue. 
> > [1] https://lkml.org/lkml/2020/3/12/144 > > Signed-off-by: Hui Zhu I'd like to repeat my original idea of doing large page allocations unconditionally within guest. > --- > drivers/virtio/virtio_balloon.c | 78 > ++--- > include/linux/balloon_compaction.h | 9 - > include/uapi/linux/virtio_balloon.h | 3 ++ > mm/balloon_compaction.c | 40 +++ > 4 files changed, 109 insertions(+), 21 deletions(-) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index 341458f..fbd2b02f 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -47,6 +47,7 @@ enum virtio_balloon_vq { > VIRTIO_BALLOON_VQ_DEFLATE, > VIRTIO_BALLOON_VQ_STATS, > VIRTIO_BALLOON_VQ_FREE_PAGE, > + VIRTIO_BALLOON_VQ_INFLATE_CONT, > VIRTIO_BALLOON_VQ_MAX > }; > > @@ -56,7 +57,8 @@ enum virtio_balloon_config_read { > > struct virtio_balloon { > struct virtio_device *vdev; > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *
[Bug 1869073] Re: qemu-arm-static crashes "segmentation fault" when running "git clone -s"
What is the version of QEMU you are using? ** Changed in: qemu Assignee: (unassigned) => Laurent Vivier (laurent-vivier) -- You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU. https://bugs.launchpad.net/bugs/1869073 Title: qemu-arm-static crashes "segmentation fault" when running "git clone -s" Status in QEMU: New Bug description: I want to use qemu-arm-static to cross-compile software. The compiler itself is a native cross-compiler connected via "distcc". The problem is that a script tries to do some stuff with "git" and with a "git clone -s" command the whole story reproducibly stops with a "segmentation fault". I don't know how to properly debug the issue but it happens 100% of the time that I get the "crash" or git just hangs forever with 100% CPU usage. To manage notifications about this bug go to: https://bugs.launchpad.net/qemu/+bug/1869073/+subscriptions
Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP split issue
On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote: > On 12.03.20 09:47, Michael S. Tsirkin wrote: > > On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote: > >> 2. You are essentially stealing THPs in the guest. So the fastest > >> mapping (THP in guest and host) is gone. The guest won't be able to make > >> use of THP where it previously was able to. I can imagine this implies a > >> performance degradation for some workloads. This needs a proper > >> performance evaluation. > > > > I think the problem is more with the alloc_pages API. > > That gives you exactly the given order, and if there's > > a larger chunk available, it will split it up. > > > > But for balloon - I suspect lots of other users, > > we do not want to stress the system but if a large > > chunk is available anyway, then we could handle > > that more optimally by getting it all in one go. > > > > > > So if we want to address this, IMHO this calls for a new API. > > Along the lines of > > > > struct page *alloc_page_range(gfp_t gfp, unsigned int min_order, > > unsigned int max_order, unsigned int > > *order) > > > > the idea would then be to return at a number of pages in the given > > range. > > > > What do you think? Want to try implementing that? > > You can just start with the highest order and decrement the order until > your allocation succeeds using alloc_pages(), which would be enough for > a first version. At least I don't see the immediate need for a new > kernel API. Well there's still a chance of splitting a big page if one becomes available meanwhile. But OK. > -- > Thanks, > > David / dhildenb
Re: [RFC for QEMU] virtio-balloon: Add option thp-order to set VIRTIO_BALLOON_F_THP_ORDER
On Tue, Mar 17, 2020 at 06:13:32PM +0800, teawater wrote: > > > > 2020年3月12日 16:25,Michael S. Tsirkin 写道: > > > > On Thu, Mar 12, 2020 at 03:49:55PM +0800, Hui Zhu wrote: > >> If the guest kernel has many fragmentation pages, use virtio_balloon > >> will split THP of QEMU when it calls MADV_DONTNEED madvise to release > >> the balloon pages. > >> Set option thp-order to on will open flags VIRTIO_BALLOON_F_THP_ORDER. > >> It will set balloon size to THP size to handle the THP split issue. > >> > >> Signed-off-by: Hui Zhu > > > > What's wrong with just using the PartiallyBalloonedPage machinery > > instead? That would make it guest transparent. > > In balloon_inflate_page: > rb_page_size = qemu_ram_pagesize(rb); > > if (rb_page_size == BALLOON_PAGE_SIZE) { > /* Easy case */ > > It seems that PartiallyBalloonedPage is only used when rb_page_size is > greater than BALLOON_PAGE_SIZE. > Do you mean I should modify the working mechanism of balloon_inflate_page > function? > > Thanks, > Hui Yes, we can tweak it to unconditionally combine pages to a huge page. 
> >
> >> ---
> >>  hw/virtio/virtio-balloon.c                      | 67 -
> >>  include/standard-headers/linux/virtio_balloon.h |  4 ++
> >>  2 files changed, 47 insertions(+), 24 deletions(-)
> >>
> >> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> >> index a4729f7..cfe86b0 100644
> >> --- a/hw/virtio/virtio-balloon.c
> >> +++ b/hw/virtio/virtio-balloon.c
> >> @@ -340,37 +340,49 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >>      while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
> >>          unsigned int p = virtio_ldl_p(vdev, &pfn);
> >>          hwaddr pa;
> >> +        size_t handle_size = BALLOON_PAGE_SIZE;
> >>
> >>          pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
> >>          offset += 4;
> >>
> >> -        section = memory_region_find(get_system_memory(), pa,
> >> -                                     BALLOON_PAGE_SIZE);
> >> -        if (!section.mr) {
> >> -            trace_virtio_balloon_bad_addr(pa);
> >> -            continue;
> >> -        }
> >> -        if (!memory_region_is_ram(section.mr) ||
> >> -            memory_region_is_rom(section.mr) ||
> >> -            memory_region_is_romd(section.mr)) {
> >> -            trace_virtio_balloon_bad_addr(pa);
> >> -            memory_region_unref(section.mr);
> >> -            continue;
> >> -        }
> >> +        if (virtio_has_feature(s->host_features,
> >> +                               VIRTIO_BALLOON_F_THP_ORDER))
> >> +            handle_size = BALLOON_PAGE_SIZE << VIRTIO_BALLOON_THP_ORDER;
> >> +
> >> +        while (handle_size > 0) {
> >> +            section = memory_region_find(get_system_memory(), pa,
> >> +                                         BALLOON_PAGE_SIZE);
> >> +            if (!section.mr) {
> >> +                trace_virtio_balloon_bad_addr(pa);
> >> +                continue;
> >> +            }
> >> +            if (!memory_region_is_ram(section.mr) ||
> >> +                memory_region_is_rom(section.mr) ||
> >> +                memory_region_is_romd(section.mr)) {
> >> +                trace_virtio_balloon_bad_addr(pa);
> >> +                memory_region_unref(section.mr);
> >> +                continue;
> >> +            }
> >>
> >> -        trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >> -                                           pa);
> >> -        if (!qemu_balloon_is_inhibited()) {
> >> -            if (vq == s->ivq) {
> >> -                balloon_inflate_page(s, section.mr,
> >> -                                     section.offset_within_region, &pbp);
> >> -            } else if (vq == s->dvq) {
> >> -                balloon_deflate_page(s, section.mr,
> >> -                                     section.offset_within_region);
> >> -            } else {
> >> -                g_assert_not_reached();
> >> +            trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >> +                                               pa);
> >> +            if (!qemu_balloon_is_inhibited()) {
> >> +                if (vq == s->ivq) {
> >> +                    balloon_inflate_page(s, section.mr,
> >> +                                         section.offset_within_region,
> >> +                                         &pbp);
> >> +                } else if (vq == s->dvq) {
> >> +                    balloon_deflate_page(s, section.mr,
> >> +                                         section.offset_within_region);
> >> +                } else {
> >> +                    g_assert_not_reached();
> >> +                }
> >>              }
> >> +
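The structure of the v1 hunk above can be restated as a small standalone sketch: when VIRTIO_BALLOON_F_THP_ORDER is negotiated, a single guest PFN stands for a THP-order run of balloon pages, and the `while (handle_size > 0)` loop visits it in 4 KiB steps. The helper below is hypothetical, not part of the patch, and it explicitly advances `pa` each step, which the quoted (truncated) hunk does not visibly do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BALLOON_PFN_SHIFT 12
#define BALLOON_PAGE_SIZE        (1u << VIRTIO_BALLOON_PFN_SHIFT)
#define VIRTIO_BALLOON_THP_ORDER 9   /* 2 MiB with 4 KiB balloon pages */

/* Expand one guest PFN into consecutive balloon-page-sized chunks,
 * mimicking the v1 while (handle_size > 0) loop. cb (may be NULL) is
 * invoked per chunk; the return value is the number of chunks visited. */
static size_t walk_thp_request(uint64_t pfn, int thp_enabled,
                               void (*cb)(uint64_t pa, void *opaque),
                               void *opaque)
{
    uint64_t pa = pfn << VIRTIO_BALLOON_PFN_SHIFT;
    size_t handle_size = BALLOON_PAGE_SIZE;
    size_t n = 0;

    if (thp_enabled) {
        /* One request now covers a whole THP-order run. */
        handle_size = (size_t)BALLOON_PAGE_SIZE << VIRTIO_BALLOON_THP_ORDER;
    }
    while (handle_size > 0) {
        if (cb) {
            cb(pa, opaque);
        }
        pa += BALLOON_PAGE_SIZE;           /* next 4 KiB sub-page */
        handle_size -= BALLOON_PAGE_SIZE;
        n++;
    }
    return n;
}
```

With the feature off the walk degenerates to a single 4 KiB page, matching the pre-patch behavior; with it on, one PFN yields 512 chunks.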
Re: [PATCH for QEMU v2] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_VQ_INFLATE_CONT
Ping.

Thanks,
Hui

> On Mar 23, 2020, at 00:04, Hui Zhu wrote:
>
> If the guest kernel has many fragmentation pages, use virtio_balloon
> will split THP of QEMU when it calls MADV_DONTNEED madvise to release
> the balloon pages.
> Set option cont-pages to on will open flags VIRTIO_BALLOON_VQ_INFLATE_CONT
> and set continuous pages order to THP order.
> Then it will get continuous pages PFN from VQ icvq and use madvise
> MADV_DONTNEED to release the THP page.
> This will handle the THP split issue.
>
> Signed-off-by: Hui Zhu
> ---
> hw/virtio/virtio-balloon.c                      | 32 +
> include/hw/virtio/virtio-balloon.h              |  4 +++-
> include/standard-headers/linux/virtio_balloon.h |  4
> 3 files changed, 35 insertions(+), 5 deletions(-)
>
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index a4729f7..88bdaca 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -34,6 +34,7 @@
>  #include "hw/virtio/virtio-access.h"
>
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define CONT_PAGES_ORDER   9
>
>  typedef struct PartiallyBalloonedPage {
>      ram_addr_t base_gpa;
> @@ -65,7 +66,8 @@ static bool virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
>
>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>                                   MemoryRegion *mr, hwaddr mr_offset,
> -                                 PartiallyBalloonedPage *pbp)
> +                                 PartiallyBalloonedPage *pbp,
> +                                 bool is_cont_pages)
>  {
>      void *addr = memory_region_get_ram_ptr(mr) + mr_offset;
>      ram_addr_t rb_offset, rb_aligned_offset, base_gpa;
> @@ -76,6 +78,13 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
>      /* XXX is there a better way to get to the RAMBlock than via a
>       * host address?
>       */
>      rb = qemu_ram_block_from_host(addr, false, &rb_offset);
> +
> +    if (is_cont_pages) {
> +        ram_block_discard_range(rb, rb_offset,
> +                                BALLOON_PAGE_SIZE << CONT_PAGES_ORDER);
> +        return;
> +    }
> +
>      rb_page_size = qemu_ram_pagesize(rb);
>
>      if (rb_page_size == BALLOON_PAGE_SIZE) {
> @@ -361,9 +370,10 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
>          trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>                                             pa);
>          if (!qemu_balloon_is_inhibited()) {
> -            if (vq == s->ivq) {
> +            if (vq == s->ivq || vq == s->icvq) {
>                  balloon_inflate_page(s, section.mr,
> -                                     section.offset_within_region, &pbp);
> +                                     section.offset_within_region, &pbp,
> +                                     vq == s->icvq);
>              } else if (vq == s->dvq) {
>                  balloon_deflate_page(s, section.mr,
>                                       section.offset_within_region);
>              } else {
> @@ -618,9 +628,12 @@ static size_t virtio_balloon_config_size(VirtIOBalloon *s)
>      if (s->qemu_4_0_config_size) {
>          return sizeof(struct virtio_balloon_config);
>      }
> -    if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
> +    if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
>          return sizeof(struct virtio_balloon_config);
>      }
> +    if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
> +        return offsetof(struct virtio_balloon_config, pages_order);
> +    }
>      if (virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>          return offsetof(struct virtio_balloon_config, poison_val);
>      }
> @@ -646,6 +659,10 @@ static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>                  cpu_to_le32(VIRTIO_BALLOON_CMD_ID_DONE);
>      }
>
> +    if (virtio_has_feature(dev->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
> +        config.pages_order = cpu_to_le32(CONT_PAGES_ORDER);
> +    }
> +
>      trace_virtio_balloon_get_config(config.num_pages, config.actual);
>      memcpy(config_data, &config, virtio_balloon_config_size(dev));
>  }
> @@ -816,6 +833,11 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
>              virtio_error(vdev, "iothread is missing");
>          }
>      }
> +
> +    if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
> +        s->icvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> +    }
> +
>      reset_stats(s);
>  }
>
> @@ -916,6 +938,8 @@ static Property virtio_balloon_properties[] = {
>                      VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false),
>      DEFINE_PROP_BIT("free-page-hint", VirtIOBalloon, host_features,
>                      VIRTIO_BALLOON_F_FREE_PAGE_HINT, false),
> +    DEFINE_PROP_BIT("cont-pages", VirtIOBalloon, host_features,
> +                    VIRTIO_