Re: [PATCH 1/2] block: pass BlockDriver reference to the .bdrv_co_create

2020-03-26 Thread Eric Blake

On 3/25/20 8:12 PM, Maxim Levitsky wrote:

This will allow to reuse a single generic .bdrv_co_create


"allow to ${verb}" is not idiomatic, better is "allow ${subject} to 
${verb}" or "allow ${verb}ing".  In this case, I'd go with:


This will allow the reuse of a single...


implementation for several drivers.
No functional changes.

Signed-off-by: Maxim Levitsky 
---
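
(For readers of the archive: the change described above is, roughly, the
following signature change — a sketch under the assumption that the driver
pointer is added as a parameter of the create hook; the patch itself is
authoritative for the exact hook and argument order.)

/* Sketch only (assumed shape): the create callback gains a BlockDriver
 * argument so a single generic implementation can tell which driver it
 * is creating an image for. */
int coroutine_fn (*bdrv_co_create_opts)(BlockDriver *drv,
                                        const char *filename,
                                        QemuOpts *opts,
                                        Error **errp);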

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




RE: [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults API

2020-03-26 Thread Liu, Yi L
Hi Eric,

I'm also considering how to inject IOMMU faults into the vIOMMU. As per our
previous discussion (a long time ago), the MemoryRegion way doesn't work
well for the VT-d case. So I'd like to see your opinion on the proposal
below:
I have a patch that makes vIOMMUs register PCIIOMMUOps with the PCI layer.
The current usage is to get the address space and set/unset the
HostIOMMUContext (added by me). I think it may also be nice to add the
fault injection callback to PCIIOMMUOps. Thoughts?

https://patchwork.ozlabs.org/patch/1259645/
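
For illustration, the shape I have in mind is something like the sketch
below — the struct layout and member names here are assumptions on my side,
not taken from the referenced patch:

struct iommu_fault;

typedef struct PCIIOMMUOps {
    AddressSpace *(*get_address_space)(PCIBus *bus, void *opaque, int devfn);
    /* proposed addition: let the host report faults up to the vIOMMU */
    int (*inject_faults)(PCIBus *bus, int32_t devfn, int count,
                         struct iommu_fault *buf);
} PCIIOMMUOps;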

Regards,
Yi Liu

> From: Eric Auger 
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger@gmail.com; eric.au...@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 05/24] memory: Introduce IOMMU Memory Region inject_faults
> API
> 
> This new API allows the injection of @count iommu_faults into
> the IOMMU memory region.
> 
> Signed-off-by: Eric Auger 
> ---
>  include/exec/memory.h | 25 +
>  memory.c  | 10 ++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index f2c773163f..141a5dc197 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -57,6 +57,8 @@ struct MemoryRegionMmio {
>  CPUWriteMemoryFunc *write[3];
>  };
> 
> +struct iommu_fault;
> +
>  typedef struct IOMMUTLBEntry IOMMUTLBEntry;
> 
>  /* See address_space_translate: bit 0 is read, bit 1 is write.  */
> @@ -357,6 +359,19 @@ typedef struct IOMMUMemoryRegionClass {
>   * @iommu: the IOMMUMemoryRegion
>   */
>  int (*num_indexes)(IOMMUMemoryRegion *iommu);
> +
> +/*
> + * Inject @count faults into the IOMMU memory region
> + *
> + * Optional method: if this method is not provided, then
> + * memory_region_inject_faults() will return -ENOENT
> + *
> + * @iommu: the IOMMU memory region to inject the faults in
> + * @count: number of faults to inject
> + * @buf: fault buffer
> + */
> +int (*inject_faults)(IOMMUMemoryRegion *iommu, int count,
> + struct iommu_fault *buf);
>  } IOMMUMemoryRegionClass;
> 
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> @@ -1365,6 +1380,16 @@ int
> memory_region_iommu_attrs_to_index(IOMMUMemoryRegion *iommu_mr,
>   */
>  int memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr);
> 
> +/**
> + * memory_region_inject_faults : inject @count faults stored in @buf
> + *
> + * @iommu_mr: the IOMMU memory region
> + * @count: number of faults to be injected
> + * @buf: buffer containing the faults
> + */
> +int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
> +struct iommu_fault *buf);
> +
>  /**
>   * memory_region_name: get a memory region's name
>   *
> diff --git a/memory.c b/memory.c
> index 09be40edd2..9cdd77e0de 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -2001,6 +2001,16 @@ int
> memory_region_iommu_num_indexes(IOMMUMemoryRegion *iommu_mr)
>  return imrc->num_indexes(iommu_mr);
>  }
> 
> +int memory_region_inject_faults(IOMMUMemoryRegion *iommu_mr, int count,
> +struct iommu_fault *buf)
> +{
> +IOMMUMemoryRegionClass *imrc =
> IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
> +if (!imrc->inject_faults) {
> +return -ENOENT;
> +}
> +return imrc->inject_faults(iommu_mr, count, buf);
> +}
> +
>  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>  {
>  uint8_t mask = 1 << client;
> --
> 2.20.1
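
(As an illustration of how the hook would be consumed, a hypothetical vIOMMU
model could wire it up as below; the "my_iommu" names are invented for this
example and are not part of the patch.)

static int my_iommu_inject_faults(IOMMUMemoryRegion *iommu, int count,
                                  struct iommu_fault *buf)
{
    /* translate each host fault record into a guest-visible fault */
    return 0;
}

static void my_iommu_mr_class_init(ObjectClass *klass, void *data)
{
    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);

    imrc->inject_faults = my_iommu_inject_faults;
}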




[Bug 1867519] Re: qemu 4.2 segfaults on VF detach

2020-03-26 Thread Launchpad Bug Tracker
** Merge proposal linked:
   
https://code.launchpad.net/~paelzer/ubuntu/+source/qemu/+git/qemu/+merge/381033

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1867519

Title:
  qemu 4.2 segfaults on VF detach

Status in QEMU:
  Fix Committed
Status in qemu package in Ubuntu:
  Fix Released

Bug description:
  After updating Ubuntu 20.04 to the Beta version, we get the following
  error and the virtual machines get stuck when detaching PCI devices using
  the virsh command:

  Error:
  error: Failed to detach device from /tmp/vf_interface_attached.xml
  error: internal error: End of file from qemu monitor

  steps to reproduce:
   1. create a VM over Ubuntu 20.04 (5.4.0-14-generic)
   2. attach PCI device to this VM (Mellanox VF for example)
   3. try to detach the PCI device using the virsh command:
 a. create a pci interface xml file:
  






  
 b.  #virsh detach-device  


  - Ubuntu release:
Description:Ubuntu Focal Fossa (development branch)
Release:20.04

  - Package ver:
libvirt0:
Installed: 6.0.0-0ubuntu3
Candidate: 6.0.0-0ubuntu5
Version table:
   6.0.0-0ubuntu5 500
  500 http://il.archive.ubuntu.com/ubuntu focal/main amd64 Packages
   *** 6.0.0-0ubuntu3 100
  100 /var/lib/dpkg/status

  - What you expected to happen: 
PCI device detached without any errors.

  - What happened instead:
getting the errors above and the VM is stuck

  additional info:
  after downgrading the libvirt0 package and all the dependent packages to
  5.4, the previous version, it seems that the issue disappeared

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1867519/+subscriptions



Re: [PATCH v2] block: make BlockConf.*_size properties 32-bit

2020-03-26 Thread Eric Blake

On 3/26/20 1:52 AM, Roman Kagan wrote:

Devices (virtio-blk, scsi, etc.) and the block layer are happy to use
32-bit for logical_block_size, physical_block_size, and min_io_size.
However, the properties in BlockConf are defined as uint16_t limiting
the values to 32768.

This appears unnecessarily tight, and we've seen bigger block sizes handy
at times.

Make them 32 bit instead and lift the limitation up to 2 MiB which
appears to be good enough for everybody.


and matches the current qemu limit for qcow2 cluster sizes



As the values can now be fairly big and awkward to type, make the
property setter accept common size suffixes (k, m).

Signed-off-by: Roman Kagan 
---
v1 -> v2:
- cap the property at 2 MiB [Eric]
- accept size suffixes




+++ b/hw/core/qdev-properties.c
@@ -14,6 +14,7 @@
  #include "qapi/visitor.h"
  #include "chardev/char.h"
  #include "qemu/uuid.h"
+#include "qemu/units.h"
  
  void qdev_prop_set_after_realize(DeviceState *dev, const char *name,
                                   Error **errp)
@@ -729,30 +730,39 @@ const PropertyInfo qdev_prop_pci_devfn = {
  
  /* --- blocksize --- */
  
+/* lower limit is sector size */

+#define MIN_BLOCK_SIZE  512
+#define MIN_BLOCK_SIZE_STR  "512 B"
+/* upper limit is arbitrary, 2 MiB looks sufficient */
+#define MAX_BLOCK_SIZE  (2 * MiB)
+#define MAX_BLOCK_SIZE_STR  "2 MiB"


For this comment, I might add that it matches our limit for qcow2 
cluster size.
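
As an aside for readers of the archive: the kind of range check the widened
setter needs is roughly the sketch below — illustrative only; the actual
patch additionally makes the property visitor accept the size suffixes.

/* Sketch only: validate a blocksize value against the new limits.
 * MIN/MAX_BLOCK_SIZE(_STR) are the macros from the hunk above. */
static bool blocksize_in_range(uint64_t value, Error **errp)
{
    if (value < MIN_BLOCK_SIZE || value > MAX_BLOCK_SIZE) {
        error_setg(errp, "block size %" PRIu64 " out of range (min "
                   MIN_BLOCK_SIZE_STR ", max " MAX_BLOCK_SIZE_STR ")",
                   value);
        return false;
    }
    return true;
}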


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v1 20/22] intel_iommu: propagate PASID-based iotlb invalidation to host

2020-03-26 Thread Peter Xu
On Thu, Mar 26, 2020 at 05:41:39AM +, Liu, Yi L wrote:
> > From: Liu, Yi L
> > Sent: Wednesday, March 25, 2020 9:22 PM
> > To: 'Peter Xu' 
> > Subject: RE: [PATCH v1 20/22] intel_iommu: propagate PASID-based iotlb
> > invalidation to host
> > 
> > > From: Peter Xu 
> > > Sent: Wednesday, March 25, 2020 2:34 AM
> > > To: Liu, Yi L 
> > > Subject: Re: [PATCH v1 20/22] intel_iommu: propagate PASID-based iotlb
> > > invalidation to host
> > >
> > > On Sun, Mar 22, 2020 at 05:36:17AM -0700, Liu Yi L wrote:
> > > > This patch propagates PASID-based iotlb invalidation to host.
> > > >
> > > > Intel VT-d 3.0 supports nested translation in PASID granular.
> > > > Guest SVA support could be implemented by configuring nested
> > > > translation on specific PASID. This is also known as dual stage DMA
> > > > translation.
> > > >
> > > > Under such configuration, guest owns the GVA->GPA translation which
> > > > is configured as first level page table in host side for a specific
> > > > pasid, and host owns GPA->HPA translation. As guest owns first level
> > > > translation table, piotlb invalidation should be propagated to host
> > > > since host IOMMU will cache first level page table related mappings
> > > > during DMA address translation.
> > > >
> > > > This patch traps the guest PASID-based iotlb flush and propagate it
> > > > to host.
> > > >
> > > > Cc: Kevin Tian 
> > > > Cc: Jacob Pan 
> > > > Cc: Peter Xu 
> > > > Cc: Yi Sun 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Richard Henderson 
> > > > Cc: Eduardo Habkost 
> > > > Signed-off-by: Liu Yi L 
> > > > ---
> > > >  hw/i386/intel_iommu.c  | 139
> > > +
> > > >  hw/i386/intel_iommu_internal.h |   7 +++
> > > >  2 files changed, 146 insertions(+)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c index
> > > > b9ac07d..10d314d 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -3134,15 +3134,154 @@ static bool
> > > vtd_process_pasid_desc(IntelIOMMUState *s,
> > > >  return (ret == 0) ? true : false;  }
> > > >
> > > > +/**
> > > > + * Caller of this function should hold iommu_lock.
> > > > + */
> > > > +static void vtd_invalidate_piotlb(IntelIOMMUState *s,
> > > > +  VTDBus *vtd_bus,
> > > > +  int devfn,
> > > > +  DualIOMMUStage1Cache
> > > > +*stage1_cache) {
> > > > +VTDHostIOMMUContext *vtd_dev_icx;
> > > > +HostIOMMUContext *host_icx;
> > > > +
> > > > +vtd_dev_icx = vtd_bus->dev_icx[devfn];
> > > > +if (!vtd_dev_icx) {
> > > > +goto out;
> > > > +}
> > > > +host_icx = vtd_dev_icx->host_icx;
> > > > +if (!host_icx) {
> > > > +goto out;
> > > > +}
> > > > +if (host_iommu_ctx_flush_stage1_cache(host_icx, stage1_cache)) {
> > > > +error_report("Cache flush failed");
> > >
> > > I think this should not easily be triggered by the guest, but just in
> > > case... Let's use
> > > error_report_once() to be safe.
> > 
> > Agreed.
> > 
> > > > +}
> > > > +out:
> > > > +return;
> > > > +}
> > > > +
> > > > +static inline bool vtd_pasid_cache_valid(
> > > > +  VTDPASIDAddressSpace *vtd_pasid_as) {
> > > > +return vtd_pasid_as->iommu_state &&
^

> > >
> > > This check can be dropped because always true?
> > >
> > > If you agree with both the changes, please add:
> > >
> > > Reviewed-by: Peter Xu 
> > 
> > I think the code should ensure all the pasid_as entries in the hash table
> > are valid. And we can, since all the operations are under the protection
> > of iommu_lock.
> > 
> Peter,
> 
> I think my reply was wrong. A pasid_as in the hash table may be stale since
> the per-pasid_as cache_gen may not be identical to the cache_gen
> in iommu_state. e.g. vtd_pasid_cache_reset() only increases the
> cache_gen in iommu_state. So there may be a pasid_as in the hash table
> which has a cached pasid entry but whose cache_gen is not equal to the
> one in iommu_state. Such a pasid_as should be treated as stale.
> So I guess vtd_pasid_cache_valid() is still necessary.

I guess you misread my comment. :)

I was saying the "vtd_pasid_as->iommu_state" check is not needed,
because iommu_state was always set if the address space is created.
vtd_pasid_cache_valid() is needed.
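
Concretely, the check I'd expect is just the generation comparison — a
sketch, with field names assumed from this discussion rather than taken
from the series:

static inline bool vtd_pasid_cache_valid(VTDPASIDAddressSpace *vtd_pasid_as)
{
    /* stale entries keep an old generation after a cache reset */
    return vtd_pasid_as->pasid_cache_gen ==
           vtd_pasid_as->iommu_state->pasid_cache_gen;
}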

Also, please double-confirm that vtd_pasid_cache_reset() should drop
all the address spaces (as I think it should), not "only increase the
cache_gen".  IMHO you should increase the cache_gen only in the PSI
hook (vtd_pasid_cache_psi()).

Thanks,

-- 
Peter Xu




RE: [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice

2020-03-26 Thread Liu, Yi L
Hi Eric,

Not sure about your preference. I've modified my patch as below, which
uses HostIOMMUContext to provide callbacks for the vIOMMU to call into
VFIO. Please feel free to give your suggestions.

https://patchwork.ozlabs.org/patch/1259665/

Regards,
Yi Liu

> From: Eric Auger 
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger@gmail.com; eric.au...@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 08/24] pci: introduce PCIPASIDOps to PCIDevice
> 
> From: Liu Yi L 
> 
> This patch introduces PCIPASIDOps for IOMMU related operations.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00078.html
> https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg00940.html
> 
> So far, to set up virt-SVA for an assigned SVA-capable device, one needs to
> configure host translation structures for a specific pasid (e.g. bind the
> guest page table to the host and enable nested translation in the host).
> Besides, the vIOMMU emulator needs to forward the guest's cache
> invalidations to the host since host nested translation is enabled. e.g. on
> VT-d, the guest owns the 1st-level translation table, thus cache
> invalidations for the 1st level should be propagated to the host.
> 
> This patch adds two functions, alloc_pasid and free_pasid, to support
> guest pasid allocation and free. The implementations of the callbacks
> would be in device passthrough modules, like vfio.
> 
> Cc: Kevin Tian 
> Cc: Jacob Pan 
> Cc: Peter Xu 
> Cc: Eric Auger 
> Cc: Yi Sun 
> Cc: David Gibson 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Yi Sun 
> ---
>  hw/pci/pci.c | 34 ++
>  include/hw/pci/pci.h | 11 +++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index e1ed6677e1..67e03b8db1 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2695,6 +2695,40 @@ void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn,
> void *opaque)
>  bus->iommu_opaque = opaque;
>  }
> 
> +void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops)
> +{
> +assert(ops && !dev->pasid_ops);
> +dev->pasid_ops = ops;
> +}
> +
> +bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn)
> +{
> +PCIDevice *dev;
> +
> +if (!bus) {
> +return false;
> +}
> +
> +dev = bus->devices[devfn];
> +return !!(dev && dev->pasid_ops);
> +}
> +
> +int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn,
> +   IOMMUConfig *config)
> +{
> +PCIDevice *dev;
> +
> +if (!bus) {
> +return -EINVAL;
> +}
> +
> +dev = bus->devices[devfn];
> +if (dev && dev->pasid_ops && dev->pasid_ops->set_pasid_table) {
> +return dev->pasid_ops->set_pasid_table(bus, devfn, config);
> +}
> +return -ENOENT;
> +}
> +
>  static void pci_dev_get_w64(PCIBus *b, PCIDevice *dev, void *opaque)
>  {
>  Range *range = opaque;
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cfedf5a995..2146cb7519 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -8,6 +8,7 @@
>  #include "hw/isa/isa.h"
> 
>  #include "hw/pci/pcie.h"
> +#include "hw/iommu/iommu.h"
> 
>  extern bool pci_available;
> 
> @@ -264,6 +265,11 @@ struct PCIReqIDCache {
>  };
>  typedef struct PCIReqIDCache PCIReqIDCache;
> 
> +struct PCIPASIDOps {
> +int (*set_pasid_table)(PCIBus *bus, int32_t devfn, IOMMUConfig *config);
> +};
> +typedef struct PCIPASIDOps PCIPASIDOps;
> +
>  struct PCIDevice {
>  DeviceState qdev;
>  bool partially_hotplugged;
> @@ -357,6 +363,7 @@ struct PCIDevice {
> 
>  /* ID of standby device in net_failover pair */
>  char *failover_pair_id;
> +PCIPASIDOps *pasid_ops;
>  };
> 
>  void pci_register_bar(PCIDevice *pci_dev, int region_num,
> @@ -490,6 +497,10 @@ typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void
> *, int);
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>  void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque);
> 
> +void pci_setup_pasid_ops(PCIDevice *dev, PCIPASIDOps *ops);
> +bool pci_device_is_pasid_ops_set(PCIBus *bus, int32_t devfn);
> +int pci_device_set_pasid_table(PCIBus *bus, int32_t devfn, IOMMUConfig 
> *config);
> +
>  static inline void
>  pci_set_byte(uint8_t *config, uint8_t val)
>  {
> --
> 2.20.1




RE: [RFC v6 01/24] update-linux-headers: Import iommu.h

2020-03-26 Thread Liu, Yi L
> From: Eric Auger 
> Sent: Saturday, March 21, 2020 12:58 AM
> To: eric.auger@gmail.com; eric.au...@redhat.com; qemu-devel@nongnu.org;
> Subject: [RFC v6 01/24] update-linux-headers: Import iommu.h
> 
> Update the script to import the new iommu.h uapi header.
> 
> Signed-off-by: Eric Auger 
> ---
>  scripts/update-linux-headers.sh | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/scripts/update-linux-headers.sh 
> b/scripts/update-linux-headers.sh index
> 29c27f4681..5b64ee3912 100755
> --- a/scripts/update-linux-headers.sh
> +++ b/scripts/update-linux-headers.sh
> @@ -141,7 +141,7 @@ done
> 
>  rm -rf "$output/linux-headers/linux"
>  mkdir -p "$output/linux-headers/linux"
> -for header in kvm.h vfio.h vfio_ccw.h vhost.h \
> +for header in kvm.h vfio.h vfio_ccw.h vhost.h iommu.h \
>psci.h psp-sev.h userfaultfd.h mman.h; do
>  cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
>  done

Hi Eric,

This patch already got an Ack from Cornelia. :-)

https://patchwork.ozlabs.org/patch/1259643/

Regards,
Yi Liu



[Bug 1868116] Re: QEMU monitor no longer works

2020-03-26 Thread Leonardo Müller
Thank you for investigating this. I would have bisected QEMU, but wouldn't
have investigated its libraries. Consequently, I would never have found the
cause of this problem.

For now, I am using -monitor telnet:127.0.0.1:5,server,nowait to
have access to the monitor on QEMU guests.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1868116

Title:
  QEMU monitor no longer works

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  Triaged
Status in vte2.91 package in Ubuntu:
  New

Bug description:
  Repro:
  VTE
  $ meson _build && ninja -C _build && ninja -C _build install

  qemu:
  $ ../configure --python=/usr/bin/python3 --disable-werror --disable-user 
--disable-linux-user --disable-docs --disable-guest-agent --disable-sdl 
--enable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-fdt 
--disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi 
--disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir 
--disable-seccomp --disable-glusterfs --disable-tpm --disable-numa 
--disable-opengl --disable-virglrenderer --disable-xfsctl --disable-vxhs 
--disable-slirp --disable-blobs --target-list=x86_64-softmmu --disable-rdma 
--disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock 
--disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user 
--disable-spice --disable-qom-cast-debug --disable-vxhs --disable-bochs 
--disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat 
--disable-qed --disable-parallels --disable-sheepdog --disable-avx2 
--disable-nettle --disable-gnutls --disable-capstone --disable-tools 
--disable-libpmem --disable-iconv --disable-cap-ng
  $ make

  Test:
  $ LD_LIBRARY_PATH=/usr/local/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH 
./build/x86_64-softmmu/qemu-system-x86_64 -enable-kvm --drive 
media=cdrom,file=http://archive.ubuntu.com/ubuntu/dists/bionic/main/installer-amd64/current/images/netboot/mini.iso
  - switch to monitor with CTRL+ALT+2
  - try to enter something

  Affects head of both usptream git repos.

  
  --- original bug ---

  It was observed that the QEMU console (normally accessible using
  Ctrl+Alt+2) accepts no input, so it can't be used. This is being
  problematic because there are cases where it's required to send
  commands to the guest, or key combinations that the host would grab
  (as Ctrl-Alt-F1 or Alt-F4).

  ProblemType: Bug
  DistroRelease: Ubuntu 20.04
  Package: qemu 1:4.2-3ubuntu2
  Uname: Linux 5.6.0-rc6+ x86_64
  ApportVersion: 2.20.11-0ubuntu20
  Architecture: amd64
  CurrentDesktop: XFCE
  Date: Thu Mar 19 12:16:31 2020
  Dependencies:

  InstallationDate: Installed on 2017-06-13 (1009 days ago)
  InstallationMedia: Xubuntu 17.04 "Zesty Zapus" - Release amd64 (20170412)
  KvmCmdLine:
   COMMAND STAT  EUID  RUID PIDPPID %CPU COMMAND
   qemu-system-x86 Sl+   1000  1000   34275   25235 29.2 qemu-system-x86_64 -m 
4G -cpu Skylake-Client -device virtio-vga,virgl=true,xres=1280,yres=720 -accel 
kvm -device nec-usb-xhci -serial vc -serial stdio -hda 
/home/usuario/Sistemas/androidx86.img -display gtk,gl=on -device usb-audio
   kvm-nx-lpage-re S0 0   34284   2  0.0 [kvm-nx-lpage-re]
   kvm-pit/34275   S0 0   34286   2  0.0 [kvm-pit/34275]
  MachineType: LENOVO 80UG
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.6.0-rc6+ 
root=UUID=6b4ae5c0-c78c-49a6-a1ba-029192618a7a ro quiet ro kvm.ignore_msrs=1 
kvm.report_ignored_msrs=0 kvm.halt_poll_ns=0 kvm.halt_poll_ns_grow=0 
i915.enable_gvt=1 i915.fastboot=1 cgroup_enable=memory swapaccount=1 
zswap.enabled=1 zswap.zpool=z3fold 
resume=UUID=a82e38a0-8d20-49dd-9cbd-de7216b589fc log_buf_len=16M 
usbhid.quirks=0x0079:0x0006:0x10 config_scsi_mq_default=y 
scsi_mod.use_blk_mq=1 mtrr_gran_size=64M mtrr_chunk_size=64M nbd.nbds_max=2 
nbd.max_part=63
  SourcePackage: qemu
  UpgradeStatus: Upgraded to focal on 2019-12-22 (87 days ago)
  dmi.bios.date: 08/09/2018
  dmi.bios.vendor: LENOVO
  dmi.bios.version: 0XCN45WW
  dmi.board.asset.tag: NO Asset Tag
  dmi.board.name: Toronto 4A2
  dmi.board.vendor: LENOVO
  dmi.board.version: SDK0J40679 WIN
  dmi.chassis.asset.tag: NO Asset Tag
  dmi.chassis.type: 10
  dmi.chassis.vendor: LENOVO
  dmi.chassis.version: Lenovo ideapad 310-14ISK
  dmi.modalias: 
dmi:bvnLENOVO:bvr0XCN45WW:bd08/09/2018:svnLENOVO:pn80UG:pvrLenovoideapad310-14ISK:rvnLENOVO:rnToronto4A2:rvrSDK0J40679WIN:cvnLENOVO:ct10:cvrLenovoideapad310-14ISK:
  dmi.product.family: IDEAPAD
  dmi.product.name: 80UG
  dmi.product.sku: LENOVO_MT_80UG_BU_idea_FM_Lenovo ideapad 310-14ISK
  dmi.product.version: Lenovo ideapad 310-14ISK
  dmi.sys.vendor: LENOVO
  mtime.conffile..etc.apport.crashdb.conf: 2019-08-29T08:39:36.787240

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1868116/+subscriptions



Re: acpi_pcihp_eject_slot() bug if passed 'slots == 0'

2020-03-26 Thread Igor Mammedov
On Thu, 26 Mar 2020 13:29:01 +0100
Igor Mammedov  wrote:

> On Thu, 26 Mar 2020 11:52:36 +
> Peter Maydell  wrote:
> 
> > Hi; Coverity spots that if hw/acpi/pcihp.c:acpi_pcihp_eject_slot()
> > is passed a zero 'slots' argument then ctz32(slots) will return 32,
> > and then the code that does '1U << slot' is C undefined behaviour
> > because it's an oversized shift. (This is CID 1421896.)
> > 
> > Since the pci_write() function in this file can call
> > acpi_pcihp_eject_slot() with an arbitrary value from the guest,
> > I think we need to handle 'slots == 0' safely. But what should
> > the behaviour be?  
> 
> 0 is not a valid value; we should ignore it and return early in this case,
> like we do with bsel. I'll post a patch shortly.
Well, looking more, that is only true for the main bus; for bridges the
slot number can be zero. AML then left-shifts it and writes into B0EJ,
which traps into pci_write(, data), and that is supposed to eject
slot 0 according to the guest (AML).

Michael,
what's your take on it?
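
(For reference, the early-return guard under discussion would look roughly
like this — a sketch, not the posted patch:)

static void acpi_pcihp_eject_slot(AcpiPciHpState *s, unsigned bsel,
                                  unsigned slots)
{
    int slot;

    /* slots is a guest-written bitmask; 0 selects nothing, and
     * ctz32(0) == 32 would make '1U << slot' undefined behaviour */
    if (slots == 0) {
        return;
    }
    slot = ctz32(slots);
    /* ... existing ejection logic using (1U << slot) ... */
}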

> 
> > 
> > thanks
> > -- PMM
> >   
> 
> 




Re: [PATCH] i386/cpu: Expand MAX_FIXED_COUNTERS from 3 to 4 to for Icelake

2020-03-26 Thread Like Xu

Could anyone help review this change?

Thanks,
Like Xu

On 2020/3/17 13:54, Like Xu wrote:

In the Intel SDM, "Table 18-2. Association of Fixed-Function
Performance Counters with Architectural Performance Events",
we may have a new fixed counter 'TOPDOWN.SLOTS' (since Icelake),
which counts the number of available slots for an unhalted
logical processor. Check commit 6017608936 in the kernel tree.

Signed-off-by: Like Xu 
---
  target/i386/cpu.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 576f309bbf..ec2b67d425 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1185,7 +1185,7 @@ typedef struct {
  #define CPU_NB_REGS CPU_NB_REGS32
  #endif
  
-#define MAX_FIXED_COUNTERS 3

+#define MAX_FIXED_COUNTERS 4
  #define MAX_GP_COUNTERS(MSR_IA32_PERF_STATUS - MSR_P6_EVNTSEL0)
  
  #define TARGET_INSN_START_EXTRA_WORDS 1







Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Vladimir Sementsov-Ogievskiy

26.03.2020 12:43, Stefan Reiter wrote:

On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:

25.03.2020 18:50, Stefan Reiter wrote:

backup_clean is only ever called as a handler via job_exit, which


Hmm.. I'm afraid it's not quite correct.

job_clean
   job_finalize_single
  job_completed_txn_abort (lock aio context)
  job_do_finalize


Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to lock
the aio context. And at the same time, it directly calls
job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?



I think, as you say, the idea is that job_do_finalize is always called with the lock 
acquired. That's why job_completed_txn_abort takes care to release the lock (at least of 
the "outer_ctx" as it calls it) before reacquiring it.


And, even if job_do_finalize is always called with a locked context, where
is the guarantee that the contexts of all jobs in the txn are locked?



I also don't see anything that guarantees that... I guess it could be adapted 
to handle locks like job_completed_txn_abort does?

Haven't looked into transactions too much, but does it even make sense to have 
jobs in different contexts in one transaction?


Why not? Assume backing up two disks in one transaction, each in a separate
IO thread... (honestly, I don't know whether it works)




Still, let's look through its callers.

   job_finalize
    qmp_block_job_finalize (lock aio context)
    qmp_job_finalize (lock aio context)
    test_cancel_concluded (doesn't lock, but it's a test)
   job_completed_txn_success
    job_completed
 job_exit (lock aio context)
 job_cancel
  blockdev_mark_auto_del (lock aio context)
  job_user_cancel
  qmp_block_job_cancel (locks context)
  qmp_job_cancel  (locks context)
  job_cancel_err
   job_cancel_sync (return job_finish_sync(job, &job_cancel_err, NULL);, job_finish_sync just calls callback)
    replication_close (it's .bdrv_close.. Hmm, I don't see context locking, where is it?)

Hm, don't see it either. This might indeed be a way to get to job_clean without 
a lock held.

I don't have any testing set up for replication atm, but if you believe this 
would be correct I can send a patch for that as well (just acquire the lock in 
replication_close before job_cancel_async?).


I don't know.. But sending a patch is a good way to start a discussion)





    replication_stop (locks context)
    drive_backup_abort (locks context)
    blockdev_backup_abort (locks context)
    job_cancel_sync_all (locks context)
    cancel_common (locks context)
  test_* (I don't care)



To clarify, aside from the commit message the patch itself does not appear to 
be wrong? All paths (aside from replication_close mentioned above) guarantee 
the job lock to be held.


I worry more about the case of a transaction with jobs from different aio
contexts than about replication..

Anyway, I hope that someone who has a better understanding of these things
will look at this.

It's usually not a good idea to send a [PATCH] inside a discussion thread;
it'd better be a separate thread, to be more visible.

Maybe you could send a separate series which includes this patch, some fix
for replication, and an attempt to fix job_do_finalize in some way, and we
continue the discussion from this new series?




already acquires the job's context. The job's context is guaranteed to
be the same as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the BDRV_POLL_WHILE
in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
once, thus deadlocking with the IO thread.

Signed-off-by: Stefan Reiter 


Just note that this thing was recently touched by 0abf2581717a19, so adding
Sergio (its author) to CC.


---

This is a fix for the issue discussed in this part of the thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
...not the original problem (core dump) posted by Dietmar.

I've still seen it occasionally hang during a backup abort. I'm trying to figure
out why that happens, stack trace indicates a similar problem with the main
thread hanging at bdrv_do_drained_begin, though I have no clue why as of yet.

  block/backup.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
  static void backup_clean(Job *job)

Re: [PATCH v6 7/7] virtio-net: add migration support for RSS and hash report

2020-03-26 Thread Yuri Benditovich
ping

On Fri, Mar 20, 2020 at 1:58 PM Yuri Benditovich <
yuri.benditov...@daynix.com> wrote:

> Save and restore RSS/hash report configuration.
>
> Signed-off-by: Yuri Benditovich 
> ---
>  hw/net/virtio-net.c | 37 +
>  1 file changed, 37 insertions(+)
>
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index a0614ad4e6..7de7587abd 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -2842,6 +2842,13 @@ static int virtio_net_post_load_device(void
> *opaque, int version_id)
>  }
>  }
>
> +if (n->rss_data.enabled) {
> +trace_virtio_net_rss_enable(n->rss_data.hash_types,
> +n->rss_data.indirections_len,
> +sizeof(n->rss_data.key));
> +} else {
> +trace_virtio_net_rss_disable();
> +}
>  return 0;
>  }
>
> @@ -3019,6 +3026,32 @@ static const VMStateDescription
> vmstate_virtio_net_has_vnet = {
>  },
>  };
>
> +static bool virtio_net_rss_needed(void *opaque)
> +{
> +return VIRTIO_NET(opaque)->rss_data.enabled;
> +}
> +
> +static const VMStateDescription vmstate_virtio_net_rss = {
> +.name  = "virtio-net-device/rss",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.needed = virtio_net_rss_needed,
> +.fields = (VMStateField[]) {
> +VMSTATE_BOOL(rss_data.enabled, VirtIONet),
> +VMSTATE_BOOL(rss_data.redirect, VirtIONet),
> +VMSTATE_BOOL(rss_data.populate_hash, VirtIONet),
> +VMSTATE_UINT32(rss_data.hash_types, VirtIONet),
> +VMSTATE_UINT16(rss_data.indirections_len, VirtIONet),
> +VMSTATE_UINT16(rss_data.default_queue, VirtIONet),
> +VMSTATE_UINT8_ARRAY(rss_data.key, VirtIONet,
> +VIRTIO_NET_RSS_MAX_KEY_SIZE),
> +VMSTATE_VARRAY_UINT16_ALLOC(rss_data.indirections_table,
> VirtIONet,
> +rss_data.indirections_len, 0,
> +vmstate_info_uint16, uint16_t),
> +VMSTATE_END_OF_LIST()
> +},
> +};
> +
>  static const VMStateDescription vmstate_virtio_net_device = {
>  .name = "virtio-net-device",
>  .version_id = VIRTIO_NET_VM_VERSION,
> @@ -3069,6 +3102,10 @@ static const VMStateDescription
> vmstate_virtio_net_device = {
>  has_ctrl_guest_offloads),
>  VMSTATE_END_OF_LIST()
> },
> +.subsections = (const VMStateDescription * []) {
> +&vmstate_virtio_net_rss,
> +NULL
> +}
>  };
>
>  static NetClientInfo net_virtio_info = {
> --
> 2.17.1
>
>


Re: acpi_pcihp_eject_slot() bug if passed 'slots == 0'

2020-03-26 Thread Igor Mammedov
On Thu, 26 Mar 2020 11:52:36 +
Peter Maydell  wrote:

> Hi; Coverity spots that if hw/acpi/pcihp.c:acpi_pcihp_eject_slot()
> is passed a zero 'slots' argument then ctz32(slots) will return 32,
> and then the code that does '1U << slot' is C undefined behaviour
> because it's an oversized shift. (This is CID 1421896.)
> 
> Since the pci_write() function in this file can call
> acpi_pcihp_eject_slot() with an arbitrary value from the guest,
> I think we need to handle 'slots == 0' safely. But what should
> the behaviour be?

0 is not a valid value; we should ignore it and return early in this case,
like we do with bsel. I'll post a patch shortly.

> 
> thanks
> -- PMM
> 




Re: [RFC for-5.1 4/4] spapr: Don't allow unplug of NVLink2 devices

2020-03-26 Thread Greg Kurz
On Thu, 26 Mar 2020 16:40:09 +1100
David Gibson  wrote:

> Currently, we can't properly handle unplug of NVLink2 devices, because we
> don't have code to tear down their special memory resources.  There's not
> a lot of impetus to implement that. Since hardware NVLink2 devices can't
> be hot unplugged, the guest side drivers don't usually support unplug
> anyway.
> 
> Therefore, simply prevent unplug of NVLink2 devices.
> 

This could maybe be considered a valid fix for 5.0, since it prevents
guest crashes, IIUC. But since it requires the two preliminary cleanup
patches, I understand you may prefer to postpone it to 5.1.

> Signed-off-by: David Gibson 
> ---

Reviewed-by: Greg Kurz 

>  hw/ppc/spapr_pci.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 55ca9dee1e..5c8262413a 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1666,6 +1666,11 @@ static void spapr_pci_unplug_request(HotplugHandler 
> *plug_handler,
>  return;
>  }
>  
> +if (spapr_phb_is_nvlink_dev(pdev, phb)) {
> +error_setg(errp, "PCI: Cannot unplug NVLink2 devices");
> +return;
> +}
> +
>  /* ensure any other present functions are pending unplug */
>  if (PCI_FUNC(pdev->devfn) == 0) {
>  for (i = 1; i < 8; i++) {




Re: [PATCH 0/2] Fix the generic image creation code

2020-03-26 Thread Max Reitz
On 26.03.20 02:12, Maxim Levitsky wrote:
> The recent patches from Max Reitz allowed some block drivers to not
> provide .bdrv_co_create_opts and still allow qemu-img to
> create/format images as long as the image already exists
> (that is the case with various block storage drivers like nbd/iscsi/nvme, etc.)
> 
> However, it was found that some places in the code depend on
> .bdrv_co_create_opts/.create_opts being != NULL to decide whether to allow
> image creation.
> 
> To avoid adding fallback code to all these places, just make the generic
> fallback code be used by the drivers that need it, so that for an outside
> user there is no difference whether the fallback was used or not.
> 
> Best regards,
>   Maxim Levitsky
> 
> Maxim Levitsky (2):
>   block: pass BlockDriver reference to the .bdrv_co_create
>   block: trickle down the fallback image creation function use to the
> block drivers

Thanks, fixed the function parameter alignment, moved the declarations
from block.h into block_int.h, and applied the series to my block branch:

https://git.xanclic.moe/XanClic/qemu/commits/branch/block

Max
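
(For readers of the archive, the wiring described in the cover letter looks
roughly like the sketch below; the helper names are assumptions based on the
series, so check the branch above for the real ones.)

/* Sketch: a driver that can only format pre-existing images points its
 * create hooks at the shared fallback, so code that tests the hooks for
 * NULL keeps working unchanged. */
static BlockDriver bdrv_example = {
    .format_name          = "example",
    .bdrv_co_create_opts  = bdrv_co_create_opts_simple,  /* generic fallback */
    .create_opts          = &bdrv_create_opts_simple,
};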





Re: [RFC for-5.1 3/4] spapr: Fix failure path for attempting to hot unplug PCI bridges

2020-03-26 Thread Greg Kurz
On Thu, 26 Mar 2020 16:40:08 +1100
David Gibson  wrote:

> For various technical reasons we can't currently allow unplug a PCI to PCI
> bridge on the pseries machine.  spapr_pci_unplug_request() correctly
> generates an error message if that's attempted.
> 
> But.. if the given errp is not error_abort or error_fatal,

Which is always the case when trying to unplug a device, AFAICT:

void qdev_unplug(DeviceState *dev, Error **errp)
{
DeviceClass *dc = DEVICE_GET_CLASS(dev);
HotplugHandler *hotplug_ctrl;
HotplugHandlerClass *hdc;
Error *local_err = NULL;

[...]
hdc = HOTPLUG_HANDLER_GET_CLASS(hotplug_ctrl);
if (hdc->unplug_request) {
hotplug_handler_unplug_request(hotplug_ctrl, dev, &local_err);

And anyway, spapr_pci_unplug_request() shouldn't rely on the caller
passing &error_fatal or &error_abort to do the right thing. Calling
error_setg() without returning right away is a dangerous practice
since it would cause a subsequent call to error_setg() with the
same errp to abort QEMU.

> it doesn't actually stop trying to unplug the bridge anyway.
> 

This looks like a bug fix that could go to 5.0 IMHO.

Maybe add this tag ?

   Fixes: 14e714900f6b "spapr: Allow hot plug/unplug of PCI bridges and devices 
under PCI bridges"
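
(To illustrate the error_setg() point above — a contrived sketch, not code
from the patch:)

/* Contrived example: without the 'return', a later error_setg() on the
 * same errp trips the assertion that *errp is still NULL and aborts. */
static void unplug_example(Error **errp)
{
    error_setg(errp, "first failure");   /* must be followed by a return... */
    error_setg(errp, "second failure");  /* ...or this aborts QEMU */
}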

> Signed-off-by: David Gibson 
> ---

Reviewed-by: Greg Kurz 

>  hw/ppc/spapr_pci.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 709a52780d..55ca9dee1e 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1663,6 +1663,7 @@ static void spapr_pci_unplug_request(HotplugHandler 
> *plug_handler,
>  
>  if (pc->is_bridge) {
>  error_setg(errp, "PCI: Hot unplug of PCI bridges not supported");
> +return;
>  }
>  
>  /* ensure any other present functions are pending unplug */




Re: [RFC for-5.1 2/4] spapr: Helper to determine if a device is NVLink2 related

2020-03-26 Thread Greg Kurz
On Thu, 26 Mar 2020 16:40:07 +1100
David Gibson  wrote:

> This adds a simple exported helper function which determines if a given
> (supposedly) PCI device is actually an NVLink2 device, which has some
> special considerations.
> 
> Signed-off-by: David Gibson 
> ---

Reviewed-by: Greg Kurz 

>  hw/ppc/spapr_pci_nvlink2.c  | 5 +
>  include/hw/pci-host/spapr.h | 1 +
>  2 files changed, 6 insertions(+)
> 
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 7d3a685421..0cec1ae02b 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -449,6 +449,11 @@ static bool is_nvnpu(PCIDevice *dev, SpaprPhbState 
> *sphb, int *slot, int *link)
>  return false;
>  }
>  
> +bool spapr_phb_is_nvlink_dev(PCIDevice *dev, SpaprPhbState *sphb)
> +{
> +return is_nvgpu(dev, sphb, NULL) || is_nvnpu(dev, sphb, NULL, NULL);
> +}
> +
>  void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int 
> offset,
>  SpaprPhbState *sphb)
>  {
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 8877ff51fb..eaba4a5825 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -156,6 +156,7 @@ void spapr_phb_nvgpu_free(SpaprPhbState *sphb);
>  void spapr_phb_nvgpu_populate_dt(SpaprPhbState *sphb, void *fdt, int bus_off,
>   Error **errp);
>  void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt);
> +bool spapr_phb_is_nvlink_dev(PCIDevice *dev, SpaprPhbState *sphb);
>  void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int 
> offset,
>  SpaprPhbState *sphb);
>  #else




Re: [RFC for-5.1 1/4] spapr: Refactor locating NVLink2 devices for device tree creation

2020-03-26 Thread Greg Kurz
On Thu, 26 Mar 2020 16:40:06 +1100
David Gibson  wrote:

> Currently spapr_phb_nvgpu_populate_pcidev_dt() works a little cryptically.
> It steps through all the NVLink2 GPUs and NPUs and if they match the device
> we're called for, we generate the relevant device tree information.
> 
> Make this a little more obvious by introducing helpers to determine it a

... to determine if a

> given PCI device is an NVLink2 GPU or NPU, returning the NVLink2 slot and
> link number information as well.
> 
> Signed-off-by: David Gibson 
> ---

LGTM

Reviewed-by: Greg Kurz 

>  hw/ppc/spapr_pci_nvlink2.c | 115 +
>  1 file changed, 79 insertions(+), 36 deletions(-)
> 
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 8332d5694e..7d3a685421 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -390,13 +390,12 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState 
> *sphb, void *fdt)
>  
>  }
>  
> -void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int 
> offset,
> -SpaprPhbState *sphb)
> +static bool is_nvgpu(PCIDevice *dev, SpaprPhbState *sphb, int *slot)
>  {
> -int i, j;
> +int i;
>  
>  if (!sphb->nvgpus) {
> -return;
> +return false;
>  }
>  
>  for (i = 0; i < sphb->nvgpus->num; ++i) {
> @@ -406,47 +405,91 @@ void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, 
> void *fdt, int offset,
>  if (!nvslot->gpdev) {
>  continue;
>  }
> +
>  if (dev == nvslot->gpdev) {
> -uint32_t npus[nvslot->linknum];
> +if (slot) {
> +*slot = i;
> +}
> +return true;
> +}
> +}
>  
> -for (j = 0; j < nvslot->linknum; ++j) {
> -PCIDevice *npdev = nvslot->links[j].npdev;
> +return false;
> +}
>  
> -npus[j] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
> -}
> -_FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
> - j * sizeof(npus[0])));
> -_FDT((fdt_setprop_cell(fdt, offset, "phandle",
> -   PHANDLE_PCIDEV(sphb, dev;
> +static bool is_nvnpu(PCIDevice *dev, SpaprPhbState *sphb, int *slot, int 
> *link)
> +{
> +int i, j;
> +
> +if (!sphb->nvgpus) {
> +return false;
> +}
> +
> +for (i = 0; i < sphb->nvgpus->num; ++i) {
> +SpaprPhbPciNvGpuSlot *nvslot = &sphb->nvgpus->slots[i];
> +
> +/* Skip "slot" without attached GPU */
> +if (!nvslot->gpdev) {
>  continue;
>  }
>  
>  for (j = 0; j < nvslot->linknum; ++j) {
> -if (dev != nvslot->links[j].npdev) {
> -continue;
> +if (dev == nvslot->links[j].npdev) {
> +if (slot) {
> +*slot = i;
> +}
> +if (link) {
> +*link = j;
> +}
> +return true;
>  }
> +}
> +}
>  
> -_FDT((fdt_setprop_cell(fdt, offset, "phandle",
> -   PHANDLE_PCIDEV(sphb, dev;
> -_FDT(fdt_setprop_cell(fdt, offset, "ibm,gpu",
> -  PHANDLE_PCIDEV(sphb, nvslot->gpdev)));
> -_FDT((fdt_setprop_cell(fdt, offset, "ibm,nvlink",
> -   PHANDLE_NVLINK(sphb, i, j;
> -/*
> - * If we ever want to emulate GPU RAM at the same location as on
> - * the host - here is the encoding GPA->TGT:
> - *
> - * gta  = ((sphb->nv2_gpa >> 42) & 0x1) << 42;
> - * gta |= ((sphb->nv2_gpa >> 45) & 0x3) << 43;
> - * gta |= ((sphb->nv2_gpa >> 49) & 0x3) << 45;
> - * gta |= sphb->nv2_gpa & ((1UL << 43) - 1);
> - */
> -_FDT(fdt_setprop_cell(fdt, offset, "memory-region",
> -  PHANDLE_GPURAM(sphb, i)));
> -_FDT(fdt_setprop_u64(fdt, offset, "ibm,device-tgt-addr",
> - nvslot->tgt));
> -_FDT(fdt_setprop_cell(fdt, offset, "ibm,nvlink-speed",
> -  nvslot->links[j].link_speed));
> +return false;
> +}
> +
> +void spapr_phb_nvgpu_populate_pcidev_dt(PCIDevice *dev, void *fdt, int 
> offset,
> +SpaprPhbState *sphb)
> +{
> +int slot, link;
> +
> +if (is_nvgpu(dev, sphb, &slot)) {
> +SpaprPhbPciNvGpuSlot *nvslot = &sphb->nvgpus->slots[slot];
> +uint32_t npus[nvslot->linknum];
> +
> +for (link = 0; link < nvslot->linknum; ++link) {
> +PCIDevice *npdev = nvslot->links[link].npdev;
> +
> +npus[link] = cpu_to_be32(PHANDLE_PCIDEV(sphb, npdev));
>  }
> +_FDT(fdt_setprop(fdt, offset, "ibm,npu", npus,
> +

Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Sergio Lopez
On Thu, Mar 26, 2020 at 08:54:53AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 25.03.2020 18:50, Stefan Reiter wrote:
> > backup_clean is only ever called as a handler via job_exit, which
> 
> Hmm.. I'm afraid it's not quite correct.
> 
> job_clean
> 
>   job_finalize_single
> 
>  job_completed_txn_abort (lock aio context)
> 
>  job_do_finalize
> 
> 
> Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to
> lock the aio context. And at the same time, it directly calls
> job_txn_apply(job->txn, job_finalize_single) without locking. Is it a bug?

Indeed, looks like a bug to me. In fact, that's the one causing the
issue that Dietmar initially reported.

I think the proper fix is to drop the context acquisition/release in
backup_clean that I added in 0abf2581717a19, as Stefan proposed, and
also to acquire the context of "foreign" jobs in job_txn_apply, just
as job_completed_txn_abort does.
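
Roughly like the sketch below, I'd imagine — illustrative only, not a tested
patch, and relying on aio_context_acquire() being a recursive lock so that
re-acquiring the caller's own context nests safely:

static void job_txn_apply(JobTxn *txn, void fn(Job *))
{
    Job *job, *next;

    QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
        AioContext *ctx = job->aio_context;

        /* take each job's context around the per-job callback */
        aio_context_acquire(ctx);
        fn(job);
        aio_context_release(ctx);
    }
}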

Thanks,
Sergio.

> And, even if job_do_finalize is always called with a locked context, where
> is the guarantee that the contexts of all jobs in the txn are locked?
> 
> Still, let's look through its callers.
> 
> job_finalize
>qmp_block_job_finalize (lock aio context)
>qmp_job_finalize (lock aio context)
>test_cancel_concluded (doesn't lock, but it's a test)
>   job_completed_txn_success
>job_completed
> job_exit (lock aio context)
> job_cancel
>  blockdev_mark_auto_del (lock aio context)
>  job_user_cancel
>  qmp_block_job_cancel (locks context)
>  qmp_job_cancel  (locks context)
>  job_cancel_err
>   job_cancel_sync (return job_finish_sync(job, &job_cancel_err, NULL);, job_finish_sync just calls callback)
>replication_close (it's .bdrv_close.. Hmm, I don't see context locking, where is it?)
>replication_stop (locks context)
>drive_backup_abort (locks context)
>blockdev_backup_abort (locks context)
>job_cancel_sync_all (locks context)
>cancel_common (locks context)
>  test_* (I don't care)
> 
> > already acquires the job's context. The job's context is guaranteed to
> > be the same as the one used by backup_top via backup_job_create.
> > 
> > Since the previous logic effectively acquired the lock twice, this
> > broke cleanup of backups for disks using IO threads, since the 
> > BDRV_POLL_WHILE
> > in bdrv_backup_top_drop -> bdrv_do_drained_begin would only release the lock
> > once, thus deadlocking with the IO thread.
> > 
> > Signed-off-by: Stefan Reiter 
> 
> Just note that this thing was recently touched by 0abf2581717a19, so adding
> Sergio (its author) to CC.
> 
> > ---
> > 
> > This is a fix for the issue discussed in this part of the thread:
> > https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
> > ...not the original problem (core dump) posted by Dietmar.
> > 
> > I've still seen it occasionally hang during a backup abort. I'm trying to 
> > figure
> > out why that happens, stack trace indicates a similar problem with the main
> > thread hanging at bdrv_do_drained_begin, though I have no clue why as of 
> > yet.
> > 
> >   block/backup.c | 4 
> >   1 file changed, 4 deletions(-)
> > 
> > diff --git a/block/backup.c b/block/backup.c
> > index 7430ca5883..a7a7dcaf4c 100644
> > --- a/block/backup.c
> > +++ b/block/backup.c
> > @@ -126,11 +126,7 @@ static void backup_abort(Job *job)
> >   static void backup_clean(Job *job)
> >   {
> >   BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
> > -AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
> > -
> > -aio_context_acquire(aio_context);
> >   bdrv_backup_top_drop(s->backup_top);
> > -aio_context_release(aio_context);
> >   }
> >   void backup_do_checkpoint(BlockJob *job, Error **errp)
> > 
> 
> 
> -- 
> Best regards,
> Vladimir
> 




acpi_pcihp_eject_slot() bug if passed 'slots == 0'

2020-03-26 Thread Peter Maydell
Hi; Coverity spots that if hw/acpi/pcihp.c:acpi_pcihp_eject_slot()
is passed a zero 'slots' argument then ctz32(slots) will return 32,
and then the code that does '1U << slot' is C undefined behaviour
because it's an oversized shift. (This is CID 1421896.)

Since the pci_write() function in this file can call
acpi_pcihp_eject_slot() with an arbitrary value from the guest,
I think we need to handle 'slots == 0' safely. But what should
the behaviour be?

thanks
-- PMM



Re: [RFC v1] arm/virt: Add memory hot remove support

2020-03-26 Thread Auger Eric
Hi Shameer,

On 3/26/20 12:14 PM, Shameerali Kolothum Thodi wrote:
> Hi Eric,
> 
>> -Original Message-
>> From: Auger Eric [mailto:eric.au...@redhat.com]
>> Sent: 26 March 2020 11:01
>> To: Shameerali Kolothum Thodi ;
>> qemu-devel@nongnu.org; qemu-...@nongnu.org
>> Cc: imamm...@redhat.com; peter.mayd...@linaro.org; m...@redhat.com;
>> xuwei (O) ; Zengtao (B) ;
>> Linuxarm ; Anshuman Khandual
>> 
>> Subject: Re: [RFC v1] arm/virt: Add memory hot remove support
>>
>> Hi Shameer,
>>
>> On 3/18/20 1:37 PM, Shameer Kolothum wrote:
>>> This adds support for memory hot remove on arm/virt that
>>> uses acpi ged device.
>>
>> I gave this a try and it works fine if the PCDIMM slot was initially
>> hotplugged:
>> (QEMU) object-add qom-type=memory-backend-ram id=mem1
>> props.size=4294967296
>> {"return": {}}
>> (QEMU) device_add driver=pc-dimm  id=pcdimm1 memdev=mem1
>> (QEMU) device_del id=pcdimm1
>> {"return": {}}
>>
>> on guest I can see:
>> [   82.466321] Offlined Pages 262144
>> [   82.541712] Offlined Pages 262144
>> [   82.589236] Offlined Pages 262144
>> [   82.969166] Offlined Pages 262144
>>
>> However I noticed that if qemu is launched directly with
>>
>> -m 16G,maxmem=32G,slots=2 \
>> -object memory-backend-ram,id=mem1,size=4G \
>> -device pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm -device
>>
>> and then in the qmp shell:
>> (QEMU) device_del id=dimm1
>>
>> the hot-unplug fails in guest:
>>
>> [   78.897407] Offlined Pages 262144
>> [   79.260811] Offlined Pages 262144
>> [   79.308105] Offlined Pages 262144
>> [   79.333675] page:fe00137d1f40 refcount:1 mapcount:0
>> mapping:0004ea9f20b1 index:0xaaab11c6e
>> [   79.335927] anon flags: 0x17880024(uptodate|active|swapbacked)
>> [   79.337571] raw: 17880024 dead0100
>> dead0122
>> 0004ea9f20b1
>> [   79.339502] raw: 000aaab11c6e  0001
>> 0004fd4e3000
>> [   79.341701] page dumped because: unmovable page
>> [   79.342887] page->mem_cgroup:0004fd4e3000
>> [   79.354729] page:fe00137d1f40 refcount:1 mapcount:0
>> mapping:0004ea9f20b1 index:0xaaab11c6e
>> [   79.357012] anon flags: 0x17880024(uptodate|active|swapbacked)
>> [   79.358658] raw: 17880024 dead0100
>> dead0122
>> 0004ea9f20b1
>> [   79.360611] raw: 000aaab11c6e  0001
>> 0004fd4e3000
>> [   79.362560] page dumped because: unmovable page
>> [   79.363742] page->mem_cgroup:0004fd4e3000
>> [   79.368636] memory memory20: Offline failed.
>>
>> I did not expect this. The PCDIMM slot in that case does not seem to be
>> interpreted as a hot-unpluggable one (?). I added Anshuman in cc.
> 
> Could you please try adding "movable_node" to the qemu guest kernel
> command-line params? This will prevent kernel allocations from
> hotpluggable memory nodes, which I think is what's causing the behavior
> you are seeing.

Effectively, when adding the movable_node option in the guest kernel
parameters, I get the following traces:

[   29.581418] Offlined Pages 262144
[   29.663605] Offlined Pages 262144
[   29.714225] Offlined Pages 262144
[   30.222953] Offlined Pages 262144
[   30.314288] Built 1 zonelists, mobility grouping on.  Total pages:
4076898
[   30.316067] Policy zone: Normal

Thanks

Eric

> 
> Thanks,
> Shameer
> 
> 
>> Thanks
>>
>> Eric
>>
>>
>>
>>>
>>> Signed-off-by: Shameer Kolothum 
>>> ---
>>>  -RFC because linux kernel support for mem hot remove is just queued
>>>   for 5.7[1].
>>>  -Tested with guest kernel 5.6-rc5 + [1]
>>>
>>> 1. https://patchwork.kernel.org/cover/11419301/
>>> ---
>>>  hw/acpi/generic_event_device.c | 28 +
>>>  hw/arm/virt.c  | 56
>> --
>>>  2 files changed, 82 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/acpi/generic_event_device.c
>> b/hw/acpi/generic_event_device.c
>>> index 021ed2bf23..3e28c110fa 100644
>>> --- a/hw/acpi/generic_event_device.c
>>> +++ b/hw/acpi/generic_event_device.c
>>> @@ -182,6 +182,32 @@ static void
>> acpi_ged_device_plug_cb(HotplugHandler *hotplug_dev,
>>>  }
>>>  }
>>>
>>> +static void acpi_ged_unplug_request_cb(HotplugHandler *hotplug_dev,
>>> +   DeviceState *dev, Error
>> **errp)
>>> +{
>>> +AcpiGedState *s = ACPI_GED(hotplug_dev);
>>> +
>>> +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
>>> +acpi_memory_unplug_request_cb(hotplug_dev,
>> &s->memhp_state, dev, errp);
>>> +} else {
>>> +error_setg(errp, "acpi: device unplug request for unsupported
>> device"
>>> +   " type: %s", object_get_typename(OBJECT(dev)));
>>> +}
>>> +}
>>> +
>>> +static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev,
>>> +   DeviceState *dev, Error **errp)
>>> +{
>>> +AcpiGedState *s = ACPI_GED(hotplug_dev);
>>> +
>>> +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
>>> +acpi_memory_unplu

Re: [PULL for 5.0-rc1 00/11] testing updates (+ one mttcg change)

2020-03-26 Thread Peter Maydell
On Wed, 25 Mar 2020 at 15:15, Alex Bennée  wrote:
>
> The following changes since commit 736cf607e40674776d752acc201f565723e86045:
>
>   Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +)
>
> are available in the Git repository at:
>
>   https://github.com/stsquad/qemu.git tags/pull-testing-250320-1
>
> for you to fetch changes up to 6a5970988e7c9ce7f9fa9747397b63e8455144c6:
>
>   .travis.yml: Add a KVM-only s390x job (2020-03-25 14:40:14 +)
>
> 
> Testing updates:
>
>   - docker updates (various dependencies)
>   - travis updates (s390x KVM build)
>   - test/vm updates (NetBSD -> 9.0, FreeBSD -> 12.1)
>   - disable MTTCG for mips64/mips64el

This produces some new warnings for the freebsd test:

In file included from /home/qemu/qemu-test.XaCd3t/src/net/netmap.c:36:
In file included from /home/qemu/qemu-test.XaCd3t/src/include/sysemu/sysemu.h:5:
In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/timer.h:4:
In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/bitops.h:17:
/home/qemu/qemu-test.XaCd3t/src/include/qemu/atomic.h:211:9: warning:
'atomic_fetch_add' macro redefined [-Wmacro-redefined]
#define atomic_fetch_add(ptr, n) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST)
^
/usr/include/stdatomic.h:350:9: note: previous definition is here
#define atomic_fetch_add(object, operand)   \
^
In file included from /home/qemu/qemu-test.XaCd3t/src/net/netmap.c:36:
In file included from /home/qemu/qemu-test.XaCd3t/src/include/sysemu/sysemu.h:5:
In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/timer.h:4:
In file included from /home/qemu/qemu-test.XaCd3t/src/include/qemu/bitops.h:17:
/home/qemu/qemu-test.XaCd3t/src/include/qemu/atomic.h:212:9: warning:
'atomic_fetch_sub' macro redefined [-Wmacro-redefined]
#define atomic_fetch_sub(ptr, n) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST)
^
/usr/include/stdatomic.h:356:9: note: previous definition is here
#define atomic_fetch_sub(object, operand)   \
^

(similarly for _and, _or, _xor).
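
(One possible shape for a fix, assuming the clash is only with the C11 names
that FreeBSD's <stdatomic.h> defines — a sketch, not a tested patch:)

/* include/qemu/atomic.h sketch: skip our definition when a system
 * <stdatomic.h> already provides the C11 macro of the same name. */
#ifndef atomic_fetch_add
#define atomic_fetch_add(ptr, n) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST)
#endif
#ifndef atomic_fetch_sub
#define atomic_fetch_sub(ptr, n) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST)
#endif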

thanks
-- PMM



Re: [PATCH for-5.0] arm:virt: fix broken IPA range with KVM enabled

2020-03-26 Thread Auger Eric
Hi Igor,

On 3/26/20 12:28 PM, Igor Mammedov wrote:
> Commit a1b18df9a4848 broke the virt_kvm_type() logic, which depends on
> maxram_size, ram_size, and ram_slots being parsed/set on the machine
> instance at the time the accelerator (KVM) is initialized.
> 
> The set_memory_options() part was already reverted by commit 2a7b18a3205b,
> so revert the remaining initialization of the above machine fields to make
> virt_kvm_type() work as it used to.
Oh, I did not notice the set_memory_options() change was already reverted.
> 
> Signed-off-by: Igor Mammedov 
> Reported-by: Auger Eric 
Reviewed-by: Eric Auger 
Tested-by: Eric Auger 


> ---
>  softmmu/vl.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 814537bb42..132c6e73dc 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -4137,6 +4137,9 @@ void qemu_init(int argc, char **argv, char **envp)
>  machine_opts = qemu_get_machine_opts();
>  qemu_opt_foreach(machine_opts, machine_set_property, current_machine,
>   &error_fatal);
> +current_machine->ram_size = ram_size;
> +current_machine->maxram_size = maxram_size;
> +current_machine->ram_slots = ram_slots;

Nit: Before the configure_accelerators() call there is a comment stating
that configure_accelerators() uses machine properties and must be called
after machine_set_property(). Maybe we should add that it also uses memory
properties and should be called after set_memory_options(). This may avoid
other changes of the same style.

Thanks

Eric
>  
>  /*
>   * Note: uses machine properties such as kernel-irqchip, must run
> @@ -4315,10 +4318,6 @@ void qemu_init(int argc, char **argv, char **envp)
>  }
>  }
>  
> -current_machine->ram_size = ram_size;
> -current_machine->maxram_size = maxram_size;
> -current_machine->ram_slots = ram_slots;
> -
>  parse_numa_opts(current_machine);
>  
>  if (machine_class->default_ram_id && current_machine->ram_size &&
> 




Re: [PATCH qemu] vfio/spapr: Fix page size calculation

2020-03-26 Thread David Gibson
On Thu, Mar 26, 2020 at 11:21:47AM +, Peter Maydell wrote:
> On Thu, 26 Mar 2020 at 00:39, David Gibson  
> wrote:
> >
> > On Tue, Mar 24, 2020 at 05:39:12PM +1100, Alexey Kardashevskiy wrote:
> > > Coverity detected an issue (CID 1421903) with a potential call of
> > > clz64(0), which returns 64, which makes it do "<<" with a negative
> > > number.
> > >
> > > This checks the mask and avoids undefined behaviour.
> > >
> > > In practice pgsizes and memory_region_iommu_get_min_page_size() always
> > > have some common page sizes and even if they did not, the resulting page
> > > size would be 0x8000... (gcc 9.2) and
> > > ioctl(VFIO_IOMMU_SPAPR_TCE_CREATE) would fail anyway.
> > >
> > > Signed-off-by: Alexey Kardashevskiy 
> >
> > Applied to ppc-for-5.1.
> 
> As a coverity-issue-fix it would be nice to have this in
> 5.0 unless you think it's particularly risky.

In fact, I realized that shortly after and moved it.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




[PATCH for-5.0] arm:virt: fix broken IPA range with KVM enabled

2020-03-26 Thread Igor Mammedov
Commit a1b18df9a4848 broke the virt_kvm_type() logic, which depends on
maxram_size, ram_size, and ram_slots being parsed/set on the machine
instance at the time the accelerator (KVM) is initialized.

The set_memory_options() part was already reverted by commit 2a7b18a3205b,
so revert the remaining initialization of the above machine fields to make
virt_kvm_type() work as it used to.

Signed-off-by: Igor Mammedov 
Reported-by: Auger Eric 
---
 softmmu/vl.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 814537bb42..132c6e73dc 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -4137,6 +4137,9 @@ void qemu_init(int argc, char **argv, char **envp)
 machine_opts = qemu_get_machine_opts();
 qemu_opt_foreach(machine_opts, machine_set_property, current_machine,
  &error_fatal);
+current_machine->ram_size = ram_size;
+current_machine->maxram_size = maxram_size;
+current_machine->ram_slots = ram_slots;
 
 /*
  * Note: uses machine properties such as kernel-irqchip, must run
@@ -4315,10 +4318,6 @@ void qemu_init(int argc, char **argv, char **envp)
 }
 }
 
-current_machine->ram_size = ram_size;
-current_machine->maxram_size = maxram_size;
-current_machine->ram_slots = ram_slots;
-
 parse_numa_opts(current_machine);
 
 if (machine_class->default_ram_id && current_machine->ram_size &&
-- 
2.18.1




Re: [PATCH] Refactor vhost_user_set_mem_table functions

2020-03-26 Thread Marc-André Lureau
On Thu, Mar 26, 2020 at 6:39 AM Raphael Norwitz
 wrote:
>
> vhost_user_set_mem_table() and vhost_user_set_mem_table_postcopy() have
> gotten convoluted, and have some identical code.
>
> This change moves the logic populating the VhostUserMemory struct and
> fds array from vhost_user_set_mem_table() and
> vhost_user_set_mem_table_postcopy() to a new function,
> vhost_user_fill_set_mem_table_msg().
>
> No functionality is impacted.
>
> Signed-off-by: Raphael Norwitz 
> Signed-off-by: Peter Turschmid 

Reviewed-by: Marc-André Lureau 


> ---
>  hw/virtio/vhost-user.c | 143 
> +++--
>  1 file changed, 67 insertions(+), 76 deletions(-)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 08e7e63..ec21e8f 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -407,18 +407,79 @@ static int vhost_user_set_log_base(struct vhost_dev 
> *dev, uint64_t base,
>  return 0;
>  }
>
> +static int vhost_user_fill_set_mem_table_msg(struct vhost_user *u,
> + struct vhost_dev *dev,
> + VhostUserMsg *msg,
> + int *fds, size_t *fd_num,
> + bool track_ramblocks)
> +{
> +int i, fd;
> +ram_addr_t offset;
> +MemoryRegion *mr;
> +struct vhost_memory_region *reg;
> +
> +msg->hdr.request = VHOST_USER_SET_MEM_TABLE;
> +
> +for (i = 0; i < dev->mem->nregions; ++i) {
> +reg = dev->mem->regions + i;
> +
> +assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> +mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> + &offset);
> +fd = memory_region_get_fd(mr);
> +if (fd > 0) {
> +if (track_ramblocks) {
> +assert(*fd_num < VHOST_MEMORY_MAX_NREGIONS);
> +trace_vhost_user_set_mem_table_withfd(*fd_num, mr->name,
> +  reg->memory_size,
> +  reg->guest_phys_addr,
> +  reg->userspace_addr,
> +  offset);
> +u->region_rb_offset[i] = offset;
> +u->region_rb[i] = mr->ram_block;
> +} else if (*fd_num == VHOST_MEMORY_MAX_NREGIONS) {
> +error_report("Failed preparing vhost-user memory table msg");
> +return -1;
> +}
> +msg->payload.memory.regions[*fd_num].userspace_addr =
> +reg->userspace_addr;
> +msg->payload.memory.regions[*fd_num].memory_size =
> +reg->memory_size;
> +msg->payload.memory.regions[*fd_num].guest_phys_addr =
> +reg->guest_phys_addr;
> +msg->payload.memory.regions[*fd_num].mmap_offset = offset;
> +fds[(*fd_num)++] = fd;
> +} else if (track_ramblocks) {
> +u->region_rb_offset[i] = 0;
> +u->region_rb[i] = NULL;
> +}
> +}
> +
> +msg->payload.memory.nregions = *fd_num;
> +
> +if (!*fd_num) {
> +error_report("Failed initializing vhost-user memory map, "
> + "consider using -object memory-backend-file share=on");
> +return -1;
> +}
> +
> +msg->hdr.size = sizeof(msg->payload.memory.nregions);
> +msg->hdr.size += sizeof(msg->payload.memory.padding);
> +msg->hdr.size += *fd_num * sizeof(VhostUserMemoryRegion);
> +
> +return 1;
> +}
> +
>  static int vhost_user_set_mem_table_postcopy(struct vhost_dev *dev,
>   struct vhost_memory *mem)
>  {
>  struct vhost_user *u = dev->opaque;
>  int fds[VHOST_MEMORY_MAX_NREGIONS];
> -int i, fd;
>  size_t fd_num = 0;
>  VhostUserMsg msg_reply;
>  int region_i, msg_i;
>
>  VhostUserMsg msg = {
> -.hdr.request = VHOST_USER_SET_MEM_TABLE,
>  .hdr.flags = VHOST_USER_VERSION,
>  };
>
> @@ -433,48 +494,11 @@ static int vhost_user_set_mem_table_postcopy(struct 
> vhost_dev *dev,
>  u->region_rb_len = dev->mem->nregions;
>  }
>
> -for (i = 0; i < dev->mem->nregions; ++i) {
> -struct vhost_memory_region *reg = dev->mem->regions + i;
> -ram_addr_t offset;
> -MemoryRegion *mr;
> -
> -assert((uintptr_t)reg->userspace_addr == reg->userspace_addr);
> -mr = memory_region_from_host((void *)(uintptr_t)reg->userspace_addr,
> - &offset);
> -fd = memory_region_get_fd(mr);
> -if (fd > 0) {
> -assert(fd_num < VHOST_MEMORY_MAX_NREGIONS);
> -trace_vhost_user_set_mem_table_withfd(fd_num, mr->name,
> -  reg->memory_size,
> -  

Re: [PATCH qemu] vfio/spapr: Fix page size calculation

2020-03-26 Thread Peter Maydell
On Thu, 26 Mar 2020 at 00:39, David Gibson  wrote:
>
> On Tue, Mar 24, 2020 at 05:39:12PM +1100, Alexey Kardashevskiy wrote:
> > Coverity detected an issue (CID 1421903) with a potential call of clz64(0),
> > which returns 64, which makes it do "<<" with a negative number.
> >
> > This checks the mask and avoids undefined behaviour.
> >
> > In practice pgsizes and memory_region_iommu_get_min_page_size() always
> > have some common page sizes and even if they did not, the resulting page
> > size would be 0x8000... (gcc 9.2) and
> > ioctl(VFIO_IOMMU_SPAPR_TCE_CREATE) would fail anyway.
> >
> > Signed-off-by: Alexey Kardashevskiy 
>
> Applied to ppc-for-5.1.

As a coverity-issue-fix it would be nice to have this in
5.0 unless you think it's particularly risky.

thanks
-- PMM



Re: [PATCH 2/2] block: trickle down the fallback image creation function use to the block drivers

2020-03-26 Thread Max Reitz
On 26.03.20 02:12, Maxim Levitsky wrote:
> Instead of checking the .bdrv_co_create_opts to see if we need the fallback,
> just implement the .bdrv_co_create_opts in the drivers that need it.
> 
> This way we don't break various places that need to know if the underlying
> protocol/format really supports image creation, while still
> allowing some drivers to not support image creation.
> 
> Fixes: fd17146cd93d1704cd96d7c2757b325fc7aac6fd
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1816007

Thanks, the series looks good to me, just some thoughts below.

> Note that technically this patch reverts the image creation fallback
> for the vxhs driver, since I don't have a means to test it,
> and IMHO it is better to leave it unsupported, as it was prior to the
> generic image creation patches.

There’s also file-win32.  I don’t really have the means to test that
either, though, so I suppose it’s reasonable to wait until someone
requests it.  OTOH, it shouldn’t be different from file-posix, so maybe
it wouldn’t hurt to support it, too.

We could also take this series for 5.0 as-is, and queue a file-win32
patch for 5.1.

What do you think?

> Also drop iscsi_create_opts, which was left behind accidentally
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  block.c   | 35 ---
>  block/file-posix.c|  7 ++-
>  block/iscsi.c | 16 
>  block/nbd.c   |  6 ++
>  block/nvme.c  |  3 +++
>  include/block/block.h |  7 +++
>  6 files changed, 46 insertions(+), 28 deletions(-)
> 
> diff --git a/block.c b/block.c
> index ff23e20443..72fdf56af7 100644
> --- a/block.c
> +++ b/block.c
> @@ -598,8 +598,15 @@ static int 
> create_file_fallback_zero_first_sector(BlockBackend *blk,
>  return 0;
>  }
>  
> -static int bdrv_create_file_fallback(const char *filename, BlockDriver *drv,
> - QemuOpts *opts, Error **errp)
> +/**
> + * Simple implementation of bdrv_co_create_opts for protocol drivers
> + * which only support creation via opening a file
> + * (usually existing raw storage device)
> + */
> +int coroutine_fn bdrv_co_create_opts_simple(BlockDriver *drv,
> +   const char *filename,
> +   QemuOpts *opts,
> +   Error **errp)

The alignment’s off (in the header, too), but that can be fixed when
this series is applied.

>  {
>  BlockBackend *blk;
>  QDict *options;
> @@ -663,11 +670,7 @@ int bdrv_create_file(const char *filename, QemuOpts 
> *opts, Error **errp)
>  return -ENOENT;
>  }
>  
> -if (drv->bdrv_co_create_opts) {
> -return bdrv_create(drv, filename, opts, errp);
> -} else {
> -return bdrv_create_file_fallback(filename, drv, opts, errp);
> -}
> +return bdrv_create(drv, filename, opts, errp);

I thought we’d just let the drivers set BlockDriver.create_opts to
&bdrv_create_opts_simple and keep this bit of code (maybe with an
“else if (drv->create_opts != NULL)” and an
“assert(drv->create_opts == &bdrv_create_opts_simple)”).  That would
make the first patch unnecessary.
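
For reference, a rough sketch of that alternative dispatch (untested; it
assumes bdrv_create_file_fallback() stays around under some name):

int bdrv_create_file(const char *filename, QemuOpts *opts, Error **errp)
{
    BlockDriver *drv = bdrv_find_protocol(filename, true, errp);

    if (drv == NULL) {
        return -ENOENT;
    }

    if (drv->bdrv_co_create_opts) {
        return bdrv_create(drv, filename, opts, errp);
    } else if (drv->create_opts) {
        /* drivers opt in by pointing create_opts at the simple one */
        assert(drv->create_opts == &bdrv_create_opts_simple);
        return bdrv_create_file_fallback(filename, drv, opts, errp);
    } else {
        error_setg(errp, "Driver '%s' does not support image creation",
                   drv->format_name);
        return -ENOTSUP;
    }
}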

OTOH, it’s completely reasonable to pass the object as the first
argument to a class method, so why not.  (Well, technically the
BlockDriver kind of is the class, and the BDS is the object, it’s
weird.)  And it definitely follows what we do elsewhere (to provide
default implementations for block drivers to use).

>  }
>  
>  int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp)

[...]

> diff --git a/block/file-posix.c b/block/file-posix.c
> index 65bc980bc4..7e19bbff5f 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c

[...]

> @@ -3639,10 +3641,11 @@ static BlockDriver bdrv_host_cdrom = {
>  .bdrv_reopen_prepare = raw_reopen_prepare,
>  .bdrv_reopen_commit  = raw_reopen_commit,
>  .bdrv_reopen_abort   = raw_reopen_abort,
> +.bdrv_co_create_opts = bdrv_co_create_opts_simple,
> +.create_opts = &bdrv_create_opts_simple,
>  .mutable_opts= mutable_opts,
>  .bdrv_co_invalidate_cache = raw_co_invalidate_cache,
>  
> -

This line removal seems unrelated, but why not.

Max

>  .bdrv_co_preadv = raw_co_preadv,
>  .bdrv_co_pwritev= raw_co_pwritev,
>  .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,





RE: [RFC v1] arm/virt: Add memory hot remove support

2020-03-26 Thread Shameerali Kolothum Thodi
Hi Eric,

> -Original Message-
> From: Auger Eric [mailto:eric.au...@redhat.com]
> Sent: 26 March 2020 11:01
> To: Shameerali Kolothum Thodi ;
> qemu-devel@nongnu.org; qemu-...@nongnu.org
> Cc: imamm...@redhat.com; peter.mayd...@linaro.org; m...@redhat.com;
> xuwei (O) ; Zengtao (B) ;
> Linuxarm ; Anshuman Khandual
> 
> Subject: Re: [RFC v1] arm/virt: Add memory hot remove support
> 
> Hi Shameer,
> 
> On 3/18/20 1:37 PM, Shameer Kolothum wrote:
> > This adds support for memory hot remove on arm/virt that
> > uses the acpi ged device.
> 
> I gave this a try and it works fine if the PCDIMM slot was initially
> hotplugged:
> (QEMU) object-add qom-type=memory-backend-ram id=mem1
> props.size=4294967296
> {"return": {}}
> (QEMU) device_add driver=pc-dimm  id=pcdimm1 memdev=mem1
> (QEMU) device_del id=pcdimm1
> {"return": {}}
> 
> on guest I can see:
> [   82.466321] Offlined Pages 262144
> [   82.541712] Offlined Pages 262144
> [   82.589236] Offlined Pages 262144
> [   82.969166] Offlined Pages 262144
> 
> However I noticed that if qemu is launched directly with
> 
> -m 16G,maxmem=32G,slots=2 \
> -object memory-backend-ram,id=mem1,size=4G \
> -device pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm -device
> 
> and then in the qmp shell:
> (QEMU) device_del id=dimm1
> 
> the hot-unplug fails in guest:
> 
> [   78.897407] Offlined Pages 262144
> [   79.260811] Offlined Pages 262144
> [   79.308105] Offlined Pages 262144
> [   79.333675] page:fe00137d1f40 refcount:1 mapcount:0
> mapping:0004ea9f20b1 index:0xaaab11c6e
> [   79.335927] anon flags: 0x17880024(uptodate|active|swapbacked)
> [   79.337571] raw: 17880024 dead0100
> dead0122
> 0004ea9f20b1
> [   79.339502] raw: 000aaab11c6e  0001
> 0004fd4e3000
> [   79.341701] page dumped because: unmovable page
> [   79.342887] page->mem_cgroup:0004fd4e3000
> [   79.354729] page:fe00137d1f40 refcount:1 mapcount:0
> mapping:0004ea9f20b1 index:0xaaab11c6e
> [   79.357012] anon flags: 0x17880024(uptodate|active|swapbacked)
> [   79.358658] raw: 17880024 dead0100
> dead0122
> 0004ea9f20b1
> [   79.360611] raw: 000aaab11c6e  0001
> 0004fd4e3000
> [   79.362560] page dumped because: unmovable page
> [   79.363742] page->mem_cgroup:0004fd4e3000
> [   79.368636] memory memory20: Offline failed.
> 
> I did not expect this. The PCDIMM slot in that case does not seem to be
> interpreted as a hot-unpluggable one (?). I added Anshuman in cc.

Could you please try adding "movable_node" to the qemu guest kernel command
line params? This will prevent any kernel allocation from hotpluggable memory
nodes, which I think is causing the behavior you are seeing.
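
For a direct kernel boot that would be something like the following (the
other parameters are just placeholders):

-kernel Image -append "console=ttyAMA0 root=/dev/vda rw movable_node"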

Thanks,
Shameer


> Thanks
> 
> Eric
> 
> 
> 
> >
> > Signed-off-by: Shameer Kolothum 
> > ---
> >  -RFC because linux kernel support for mem hot remove is just queued
> >   for 5.7[1].
> >  -Tested with guest kernel 5.6-rc5 + [1]
> >
> > 1. https://patchwork.kernel.org/cover/11419301/
> > ---
> >  hw/acpi/generic_event_device.c | 28 +
> >  hw/arm/virt.c  | 56
> --
> >  2 files changed, 82 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/acpi/generic_event_device.c
> b/hw/acpi/generic_event_device.c
> > index 021ed2bf23..3e28c110fa 100644
> > --- a/hw/acpi/generic_event_device.c
> > +++ b/hw/acpi/generic_event_device.c
> > @@ -182,6 +182,32 @@ static void
> acpi_ged_device_plug_cb(HotplugHandler *hotplug_dev,
> >  }
> >  }
> >
> > +static void acpi_ged_unplug_request_cb(HotplugHandler *hotplug_dev,
> > +   DeviceState *dev, Error
> **errp)
> > +{
> > +AcpiGedState *s = ACPI_GED(hotplug_dev);
> > +
> > +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> > +acpi_memory_unplug_request_cb(hotplug_dev,
> &s->memhp_state, dev, errp);
> > +} else {
> > +error_setg(errp, "acpi: device unplug request for unsupported
> device"
> > +   " type: %s", object_get_typename(OBJECT(dev)));
> > +}
> > +}
> > +
> > +static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev,
> > +   DeviceState *dev, Error **errp)
> > +{
> > +AcpiGedState *s = ACPI_GED(hotplug_dev);
> > +
> > +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> > +acpi_memory_unplug_cb(&s->memhp_state, dev, errp);
> > +} else {
> > +error_setg(errp, "acpi: device unplug for unsupported device"
> > +   " type: %s", object_get_typename(OBJECT(dev)));
> > +}
> > +}
> > +
> >  static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits
> ev)
> >  {
> >  AcpiGedState *s = ACPI_GED(adev);
> > @@ -286,6 +312,8 @@ static void acpi_ged_class_init(ObjectClass *class,
> void *data)
> >  dc->vmsd = &vmstate_acpi_ged;
> >
> >  hc->plug = acpi_ged_device_plug_cb;
> > +hc-

Re: [PATCH v7] net: tulip: check frame size and r/w data length

2020-03-26 Thread Li Qiang
P J P  wrote on Wed, Mar 25, 2020 at 1:31 AM:

> From: Prasad J Pandit 
>
> The Tulip network driver does not check the frame size against the r/w
> data length while copying tx/rx buffers. This may lead to OOB buffer
> access. Add checks to avoid it.
>
> Limit iterations over descriptors to avoid a potential infinite loop
> in tulip_xmit_list_update.
>
> Reported-by: Li Qiang 
> Reported-by: Ziming Zhang 
> Reported-by: Jason Wang 
> Signed-off-by: Prasad J Pandit 
>


Tested-by: Li Qiang 
Reviewed-by: Li Qiang 

Thanks,
Li Qiang



> ---
>  hw/net/tulip.c | 36 +++-
>  1 file changed, 27 insertions(+), 9 deletions(-)
>
> Update v7: fix length check expression to replace '>=' with '>'
>   -> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07160.html
>
> diff --git a/hw/net/tulip.c b/hw/net/tulip.c
> index cfac2719d3..1295f51d07 100644
> --- a/hw/net/tulip.c
> +++ b/hw/net/tulip.c
> @@ -170,6 +170,10 @@ static void tulip_copy_rx_bytes(TULIPState *s, struct
> tulip_descriptor *desc)
>  } else {
>  len = s->rx_frame_len;
>  }
> +
> +if (s->rx_frame_len + len > sizeof(s->rx_frame)) {
> +return;
> +}
>  pci_dma_write(&s->dev, desc->buf_addr1, s->rx_frame +
>  (s->rx_frame_size - s->rx_frame_len), len);
>  s->rx_frame_len -= len;
> @@ -181,6 +185,10 @@ static void tulip_copy_rx_bytes(TULIPState *s, struct
> tulip_descriptor *desc)
>  } else {
>  len = s->rx_frame_len;
>  }
> +
> +if (s->rx_frame_len + len > sizeof(s->rx_frame)) {
> +return;
> +}
>  pci_dma_write(&s->dev, desc->buf_addr2, s->rx_frame +
>  (s->rx_frame_size - s->rx_frame_len), len);
>  s->rx_frame_len -= len;
> @@ -227,7 +235,8 @@ static ssize_t tulip_receive(TULIPState *s, const
> uint8_t *buf, size_t size)
>
>  trace_tulip_receive(buf, size);
>
> -if (size < 14 || size > 2048 || s->rx_frame_len ||
> tulip_rx_stopped(s)) {
> +if (size < 14 || size > sizeof(s->rx_frame) - 4
> +|| s->rx_frame_len || tulip_rx_stopped(s)) {
>  return 0;
>  }
>
> @@ -275,7 +284,6 @@ static ssize_t tulip_receive_nc(NetClientState *nc,
>  return tulip_receive(qemu_get_nic_opaque(nc), buf, size);
>  }
>
> -
>  static NetClientInfo net_tulip_info = {
>  .type = NET_CLIENT_DRIVER_NIC,
>  .size = sizeof(NICState),
> @@ -558,7 +566,7 @@ static void tulip_tx(TULIPState *s, struct
> tulip_descriptor *desc)
>  if ((s->csr[6] >> CSR6_OM_SHIFT) & CSR6_OM_MASK) {
>  /* Internal or external Loopback */
>  tulip_receive(s, s->tx_frame, s->tx_frame_len);
> -} else {
> +} else if (s->tx_frame_len <= sizeof(s->tx_frame)) {
>  qemu_send_packet(qemu_get_queue(s->nic),
>  s->tx_frame, s->tx_frame_len);
>  }
> @@ -570,23 +578,31 @@ static void tulip_tx(TULIPState *s, struct
> tulip_descriptor *desc)
>  }
>  }
>
> -static void tulip_copy_tx_buffers(TULIPState *s, struct tulip_descriptor
> *desc)
> +static int tulip_copy_tx_buffers(TULIPState *s, struct tulip_descriptor
> *desc)
>  {
>  int len1 = (desc->control >> TDES1_BUF1_SIZE_SHIFT) &
> TDES1_BUF1_SIZE_MASK;
>  int len2 = (desc->control >> TDES1_BUF2_SIZE_SHIFT) &
> TDES1_BUF2_SIZE_MASK;
>
> +if (s->tx_frame_len + len1 > sizeof(s->tx_frame)) {
> +return -1;
> +}
>  if (len1) {
>  pci_dma_read(&s->dev, desc->buf_addr1,
>  s->tx_frame + s->tx_frame_len, len1);
>  s->tx_frame_len += len1;
>  }
>
> +if (s->tx_frame_len + len2 > sizeof(s->tx_frame)) {
> +return -1;
> +}
>  if (len2) {
>  pci_dma_read(&s->dev, desc->buf_addr2,
>  s->tx_frame + s->tx_frame_len, len2);
>  s->tx_frame_len += len2;
>  }
>  desc->status = (len1 + len2) ? 0 : 0x7fff;
> +
> +return 0;
>  }
>
>  static void tulip_setup_filter_addr(TULIPState *s, uint8_t *buf, int n)
> @@ -651,13 +667,15 @@ static uint32_t tulip_ts(TULIPState *s)
>
>  static void tulip_xmit_list_update(TULIPState *s)
>  {
> +#define TULIP_DESC_MAX 128
> +uint8_t i = 0;
>  struct tulip_descriptor desc;
>
>  if (tulip_ts(s) != CSR5_TS_SUSPENDED) {
>  return;
>  }
>
> -for (;;) {
> +for (i = 0; i < TULIP_DESC_MAX; i++) {
>  tulip_desc_read(s, s->current_tx_desc, &desc);
>  tulip_dump_tx_descriptor(s, &desc);
>
> @@ -675,10 +693,10 @@ static void tulip_xmit_list_update(TULIPState *s)
>  s->tx_frame_len = 0;
>  }
>
> -tulip_copy_tx_buffers(s, &desc);
> -
> -if (desc.control & TDES1_LS) {
> -tulip_tx(s, &desc);
> +if (!tulip_copy_tx_buffers(s, &desc)) {
> +if (desc.control & TDES1_LS) {
> +tulip_tx(s, &desc);
> +}
>  }
>  }
>  tulip_desc_write(s, s->current_tx_desc, &des

Re: [RFC v1] arm/virt: Add memory hot remove support

2020-03-26 Thread Auger Eric
Hi Shameer,

On 3/18/20 1:37 PM, Shameer Kolothum wrote:
> This adds support for memory hot remove on arm/virt that
> uses the acpi ged device.

I gave this a try and it works fine if the PCDIMM slot was initially
hotplugged:
(QEMU) object-add qom-type=memory-backend-ram id=mem1 props.size=4294967296
{"return": {}}
(QEMU) device_add driver=pc-dimm  id=pcdimm1 memdev=mem1
(QEMU) device_del id=pcdimm1
{"return": {}}

on guest I can see:
[   82.466321] Offlined Pages 262144
[   82.541712] Offlined Pages 262144
[   82.589236] Offlined Pages 262144
[   82.969166] Offlined Pages 262144

However I noticed that if qemu is launched directly with

-m 16G,maxmem=32G,slots=2 \
-object memory-backend-ram,id=mem1,size=4G \
-device pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm -device

and then in the qmp shell:
(QEMU) device_del id=dimm1

the hot-unplug fails in guest:

[   78.897407] Offlined Pages 262144
[   79.260811] Offlined Pages 262144
[   79.308105] Offlined Pages 262144
[   79.333675] page:fe00137d1f40 refcount:1 mapcount:0
mapping:0004ea9f20b1 index:0xaaab11c6e
[   79.335927] anon flags: 0x17880024(uptodate|active|swapbacked)
[   79.337571] raw: 17880024 dead0100 dead0122
0004ea9f20b1
[   79.339502] raw: 000aaab11c6e  0001
0004fd4e3000
[   79.341701] page dumped because: unmovable page
[   79.342887] page->mem_cgroup:0004fd4e3000
[   79.354729] page:fe00137d1f40 refcount:1 mapcount:0
mapping:0004ea9f20b1 index:0xaaab11c6e
[   79.357012] anon flags: 0x17880024(uptodate|active|swapbacked)
[   79.358658] raw: 17880024 dead0100 dead0122
0004ea9f20b1
[   79.360611] raw: 000aaab11c6e  0001
0004fd4e3000
[   79.362560] page dumped because: unmovable page
[   79.363742] page->mem_cgroup:0004fd4e3000
[   79.368636] memory memory20: Offline failed.

I did not expect this. The PCDIMM slot in that case does not seem to be
interpreted as a hot-unpluggable one (?). I added Anshuman in cc.

Thanks

Eric



> 
> Signed-off-by: Shameer Kolothum 
> ---
>  -RFC because linux kernel support for mem hot remove is just queued
>   for 5.7[1].
>  -Tested with guest kernel 5.6-rc5 + [1]
> 
> 1. https://patchwork.kernel.org/cover/11419301/
> ---
>  hw/acpi/generic_event_device.c | 28 +
>  hw/arm/virt.c  | 56 --
>  2 files changed, 82 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/generic_event_device.c b/hw/acpi/generic_event_device.c
> index 021ed2bf23..3e28c110fa 100644
> --- a/hw/acpi/generic_event_device.c
> +++ b/hw/acpi/generic_event_device.c
> @@ -182,6 +182,32 @@ static void acpi_ged_device_plug_cb(HotplugHandler 
> *hotplug_dev,
>  }
>  }
>  
> +static void acpi_ged_unplug_request_cb(HotplugHandler *hotplug_dev,
> +   DeviceState *dev, Error **errp)
> +{
> +AcpiGedState *s = ACPI_GED(hotplug_dev);
> +
> +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +acpi_memory_unplug_request_cb(hotplug_dev, &s->memhp_state, dev, 
> errp);
> +} else {
> +error_setg(errp, "acpi: device unplug request for unsupported device"
> +   " type: %s", object_get_typename(OBJECT(dev)));
> +}
> +}
> +
> +static void acpi_ged_unplug_cb(HotplugHandler *hotplug_dev,
> +   DeviceState *dev, Error **errp)
> +{
> +AcpiGedState *s = ACPI_GED(hotplug_dev);
> +
> +if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
> +acpi_memory_unplug_cb(&s->memhp_state, dev, errp);
> +} else {
> +error_setg(errp, "acpi: device unplug for unsupported device"
> +   " type: %s", object_get_typename(OBJECT(dev)));
> +}
> +}
> +
>  static void acpi_ged_send_event(AcpiDeviceIf *adev, AcpiEventStatusBits ev)
>  {
>  AcpiGedState *s = ACPI_GED(adev);
> @@ -286,6 +312,8 @@ static void acpi_ged_class_init(ObjectClass *class, void 
> *data)
>  dc->vmsd = &vmstate_acpi_ged;
>  
>  hc->plug = acpi_ged_device_plug_cb;
> +hc->unplug_request = acpi_ged_unplug_request_cb;
> +hc->unplug = acpi_ged_unplug_cb;
>  
>  adevc->send_event = acpi_ged_send_event;
>  }
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 94f93dda54..91974e4e80 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2096,11 +2096,62 @@ static void 
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>  }
>  }
>  
> +static void virt_dimm_unplug_request(HotplugHandler *hotplug_dev,
> + DeviceState *dev, Error **errp)
> +{
> +VirtMachineState *vms = VIRT_MACHINE(hotplug_dev);
> +Error *local_err = NULL;
> +
> +if (!vms->acpi_dev) {
> +error_setg(errp,
> +   "memory hotplug is not enabled: missing acpi-ged device");
> +goto out;
> +}
> +
> +hotplug_handler_unplug_request(HOTPLUG_HANDLER(vms->acpi_dev

Re: [PATCH 00/13] microvm: add acpi support

2020-03-26 Thread Igor Mammedov
On Thu, 26 Mar 2020 03:33:35 -0400
"Michael S. Tsirkin"  wrote:

> On Wed, Mar 25, 2020 at 07:44:34PM +0100, Igor Mammedov wrote:
> > On Wed, 25 Mar 2020 16:03:39 +0100
> > Gerd Hoffmann  wrote:
> >   
> > > On Wed, Mar 25, 2020 at 01:32:12PM +0100, Igor Mammedov wrote:  
> > > > On Thu, 19 Mar 2020 09:01:04 +0100
> > > > Gerd Hoffmann  wrote:
> > > > 
> > > > > I know that not supporting ACPI in microvm is intentional.  If you 
> > > > > still
> > > > > don't want ACPI this is perfectly fine, you can use the usual -no-acpi
> > > > > switch to toggle ACPI support.
> > > > > 
> > > > > These are the advantages you are going to lose then:
> > > > > 
> > > > >   (1) virtio-mmio device discovery without command line hacks 
> > > > > (tweaking
> > > > >   the command line is a problem when not using direct kernel 
> > > > > boot).
> > > > >   (2) Better IO-APIC support, we can use IRQ lines 16-23.
> > > > >   (3) ACPI power button (aka powerdown request) works.
> > > > >   (4) machine poweroff (aka S5 state) works.
> > > > > 
> > > > > Together with seabios patches for virtio-mmio support this allows
> > > > > booting standard fedora images (cloud, coreos, workstation live) with the
> > > > > microvm machine type.
> > > > 
> > > > what CLI do you use to test it?
> > > 
> > > Test script below.  "qemu-default" is a wrapper script which starts
> > > qemu-system-x86_64 from my build directory.  "qemu-firmware" is the
> > > same plus isa-debugcon configured for a firmware log on stdout.
> > > 
> > > Latest bits (with some of the review comments addressed) just pushed
> > > to git://git,kraxel.org/qemu sirius/microvm  
> > 
> > thanks, below are test results I got on my system,
> > spoiler: hw-reduced acpi reduces boot time by ~0.02s compared to full-blown acpi
> > 
> > using timestamp at "Run /init as init process" as measuring point
> > 
> > no acpi
> > 1.967316
> > 1.975272
> > 1.981267
> > 1.974316
> > 1.962452
> > 1.960988
> > 
> > hw reduced acpi
> > 0.893838
> > 0.892573
> > 0.890585
> > 0.900306
> > 0.897902
> > 
> > normal acpi:
> > 0.921647
> > 0.916298
> > 0.923518
> > 0.916298
> > 0.913234
> > 
> > PS:
> > I just quickly hacked up hw-reduced acpi (using arm/virt as a model)
> > without implementing the power button, but I doubt that would affect the
> > results noticeably.
> > On the qemu side it will probably also save some time, since there are
> > fewer things to set up.
> 
> And no ACPI is faster because of PS/2 probing, right?
I suppose you meant -slower- because of ...

if I compare at 'i8042: PNP: No PS/2 controller found.' to rule out
probing, then no-acpi and hw-reduced are about in the same ballpark
(3-4ms difference in favor of no-acpi).

no-acpi
0.765785
0.783540
0.785269
0.51
0.774474
0.770611
0.789309

hw-reduced
0.788733
0.775982
0.793132
0.769077
0.774771
0.775191
0.771170

 

> 
> > > 
> > > HTH,
> > >   Gerd
> > > 
> > >  cut here 
> > > #!/bin/sh
> > > 
> > > mode="${1}"
> > > shift
> > > 
> > > back=()
> > > devs=()
> > > args=()
> > > qemu="qemu-firmware -monitor none -boot menu=on"
> > > disk=""
> > > liso=""
> > > krnl=""
> > > karg="console=ttyS0,115200"
> > > 
> > > case "$mode" in
> > > kernel)
> > >   qemu="qemu-default -nographic"
> > >   disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2"
> > >   krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage"
> > >   karg="$karg root=/dev/sda4"
> > >   karg="$karg quiet"
> > >   ;;
> > > seabios)
> > >   disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2"
> > >   krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage"
> > >   karg="$karg root=/dev/sda4"
> > >   args+=("-bios" 
> > > "/home/kraxel/projects/seabios/out-bios-microvm/bios.bin")
> > >   ;;
> > > cloud)
> > >   disk="/vmdisk/iso/Fedora-Cloud-Base-31-1.9.x86_64.raw"
> > >   ;;
> > > coreos)
> > >   disk="/vmdisk/iso/fedora-coreos-31.20200210.3.0-metal.x86_64.raw"
> > >   ;;
> > > live)
> > >   liso="/vmdisk/iso/Fedora-Workstation-Live-x86_64-30-1.2.iso"
> > >   devs+=("-device" "virtio-gpu-device")
> > >   devs+=("-device" "virtio-keyboard-device")
> > >   devs+=("-device" "virtio-tablet-device")
> > >   ;;
> > > *)
> > >   echo "unknown mode: $mode"
> > >   echo "known modes: kernel seabios cloud coreos live"
> > >   exit 1
> > >   ;;
> > > esac
> > > 
> > > if test "$disk" != ""; then
> > >   format="${disk##*.}"
> > >   back+=("-drive" "if=none,id=disk,format=${format},file=${disk}")
> > >   devs+=("-device" "scsi-hd,drive=disk,bootindex=1")
> > > fi
> > > if test "$liso" != ""; then
> > >   back+=("-drive" 
> > > "if=none,id=cdrom,media=cdrom,readonly,format=raw,file=${liso}")
> > >   devs+=("-device" "scsi-cd,drive=cdrom,bootindex=2")
> > > fi
> > > if test "$krnl" != ""; then
> > >   args+=("-kernel" "$krnl")
> > >   args+=("-append" "$karg")
> > > fi
> > > 
> > > set -ex
> > > $qemu \
> > >   -enable-kvm \
> > >   -cpu host \
> > >   -M microvm,

Re: [PATCH 0/2] Fix the generic image creation code

2020-03-26 Thread Denis V. Lunev
On 3/26/20 4:12 AM, Maxim Levitsky wrote:
> The recent patches from Max Reitz allowed some block drivers to not
> provide the .bdrv_co_create_opts and still allow qemu-img to
> create/format images as long as the image already exists
> (that is the case with various block storage drivers like nbd/iscsi/nvme, etc)
>
> However it was found out that some places in the code depend on the
> .bdrv_co_create_opts/.create_opts being != NULL to decide whether to allow
> image creation.
>
> To avoid adding fallback code to all these places, just make the generic
> fallback code be used by the drivers that need it, so that to an outside
> user there is no difference whether the fallback was used or not.
>
> Best regards,
>   Maxim Levitsky
>
> Maxim Levitsky (2):
>   block: pass BlockDriver reference to the .bdrv_co_create
>   block: trickle down the fallback image creation function use to the
> block drivers
>
>  block.c   | 38 ++
>  block/crypto.c|  3 ++-
>  block/file-posix.c| 11 +--
>  block/file-win32.c|  4 +++-
>  block/gluster.c   |  3 ++-
>  block/iscsi.c | 16 
>  block/nbd.c   |  6 ++
>  block/nfs.c   |  4 +++-
>  block/nvme.c  |  3 +++
>  block/parallels.c |  3 ++-
>  block/qcow.c  |  3 ++-
>  block/qcow2.c |  4 +++-
>  block/qed.c   |  3 ++-
>  block/raw-format.c|  4 +++-
>  block/rbd.c   |  3 ++-
>  block/sheepdog.c  |  4 +++-
>  block/ssh.c   |  4 +++-
>  block/vdi.c   |  4 +++-
>  block/vhdx.c  |  3 ++-
>  block/vmdk.c  |  4 +++-
>  block/vpc.c   |  6 --
>  include/block/block.h |  7 +++
>  include/block/block_int.h |  3 ++-
>  23 files changed, 95 insertions(+), 48 deletions(-)
>
Reviewed-by: Denis V. Lunev 



[PATCH for-5.0] hw/i386/amd_iommu.c: Fix corruption of log events passed to guest

2020-03-26 Thread Peter Maydell
In the function amdvi_log_event(), we write an event log buffer
entry into guest ram, whose contents are passed to the function
via the "uint64_t *evt" argument. Unfortunately, a spurious
'&' in the call to dma_memory_write() meant that instead of
writing the event to the guest we would write the literal value
of the pointer, plus whatever was in the following 8 bytes
on the stack. This error was spotted by Coverity.

Fix the bug by removing the '&'.
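
For illustration, the bug pattern in miniature (hypothetical code, not the
actual driver source):

uint64_t evt_buf[4];                 /* the event data */
uint64_t *evt = evt_buf;
/* wrong: writes the 8 bytes of the pointer variable itself, plus
 * whatever follows it on the stack */
dma_memory_write(as, addr, &evt, sizeof(evt_buf));
/* right: writes the event data the pointer refers to */
dma_memory_write(as, addr, evt, sizeof(evt_buf));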

Fixes: CID 1421945
Cc: qemu-sta...@nongnu.org
Signed-off-by: Peter Maydell 
---
Disclaimer: only tested with 'make check' and 'make check-acceptance'
which probably don't exercise this corner of the code.
---
 hw/i386/amd_iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
index b1175e52c7d..fd75cae0243 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -181,7 +181,7 @@ static void amdvi_log_event(AMDVIState *s, uint64_t *evt)
 }
 
 if (dma_memory_write(&address_space_memory, s->evtlog + s->evtlog_tail,
-&evt, AMDVI_EVENT_LEN)) {
+ evt, AMDVI_EVENT_LEN)) {
 trace_amdvi_evntlog_fail(s->evtlog, s->evtlog_tail);
 }
 
-- 
2.20.1




Re: [PATCH v16 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages

2020-03-26 Thread Cornelia Huck
On Wed, 25 Mar 2020 01:02:34 +0530
Kirti Wankhede  wrote:

> vfio_pfn.ref_count is always updated by holding iommu->lock, using atomic

s/by/while/

> variable is overkill.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> Reviewed-by: Eric Auger 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 

Reviewed-by: Cornelia Huck 




Re: [PULL 0/9] migration queue

2020-03-26 Thread Peter Maydell
On Wed, 25 Mar 2020 at 13:17, Dr. David Alan Gilbert (git)
 wrote:
>
> From: "Dr. David Alan Gilbert" 
>
> The following changes since commit 736cf607e40674776d752acc201f565723e86045:
>
>   Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +)
>
> are available in the Git repository at:
>
>   git://github.com/dagrh/qemu.git tags/pull-migration-20200325b
>
> for you to fetch changes up to 7cd75cbdb8a45d9e2d5912f774d8194cbafdfa97:
>
>   migration: use "" instead of (null) for tls-authz (2020-03-25 12:31:38 
> +)
>
> 
> Combo Migration/HMP/virtiofs pull
>
> Small fixes all around.
> Ones that are noticeable:
>   a) Igor's migration compatibility fix affecting older machine types
>  has been seen in the wild
>   b) Philippe's autconverge fix should fix an intermittently
>  failing migration test.
>   c) Mao's makes a small change to the output of 'info
>  migrate_parameters'  for tls-authz.
>

Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/5.0
for any user-visible changes.

-- PMM



Re: [PULL 006/136] vl.c: move -m parsing after memory backends has been processed

2020-03-26 Thread Igor Mammedov
On Thu, 26 Mar 2020 10:20:31 +0100
Auger Eric  wrote:

> Hi
> 
> On 2/25/20 12:48 PM, Paolo Bonzini wrote:
> > From: Igor Mammedov 
> > 
> > It will be possible for main RAM to come from memory-backend
> > and we should check that size specified in -m matches the size
> > of the backend and [MachineState::]ram_size also matches
> > backend's size.
> > 
> > However -m parsing (set_memory_options()) happens before backends
> > are initialized (object_create_delayed()), which complicates it.
> > Consolidate set_memory_options() and assigning parsed results to
> > current_machine after backends are initialized, so it would be
> > possible access the initialized backend instance to compare
> > sizes.
> > 
> > This patch only consolidates scattered places touching ram_size
> > within vl.c. And follow up patch will integrate backend handling
> > to set_memory_options().
> > 
> > Signed-off-by: Igor Mammedov 
> > Message-Id: <20200219160953.13771-7-imamm...@redhat.com>  
> 
> Unfortunately this patch breaks ARM virt memory map setting in KVM mode.
> If you launch ARM virt with PCDIMM you now get
> 
> qemu-system-aarch64: -device
> pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm: memory devices (e.g. for
> memory hotplug) are not supported by the machine
> 
> With machvirt/KVM
> configure_accelerators() calls ARM virt_kvm_type(). This function
> freezes the machine memory map and computes the size of the IPA to be
> supported by KVM. See:
> c9650222b8 ("hw/arm/virt: Implement kvm_type function for 4.0 machine")
> 
> virt_kvm_type() uses machine memory settings. Igor's patch moves those
> after the mem backend init, so the virt memory map does not properly
> take into account the machine memory settings anymore.
> 
> So we need to find a way to rework this.
set_memory_options() was already reverted, so the safest way to fix it during
the freeze is to revert back the following lines:

-current_machine->ram_size = ram_size;
-current_machine->maxram_size = maxram_size;
-current_machine->ram_slots = ram_slots;

I'll post a patch

> 
> Thanks
> 
> Eric
> 
> 
> > ---
> >  vl.c | 27 ++-
> >  1 file changed, 14 insertions(+), 13 deletions(-)
> > 
> > diff --git a/vl.c b/vl.c
> > index 2103804..72ffc06 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -2655,6 +2655,14 @@ static void set_memory_options(uint64_t *ram_slots, 
> > ram_addr_t *maxram_size,
> >  exit(EXIT_FAILURE);
> >  }
> >  
> > +if (!xen_enabled()) {
> > +/* On 32-bit hosts, QEMU is limited by virtual address space */
> > +if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) {
> > +error_report("at most 2047 MB RAM can be simulated");
> > +exit(1);
> > +}
> > +}
> > +
> >  loc_pop(&loc);
> >  }
> >  
> > @@ -3819,8 +3827,6 @@ int main(int argc, char **argv, char **envp)
> >  machine_class = select_machine();
> >  object_set_machine_compat_props(machine_class->compat_props);
> >  
> > -set_memory_options(&ram_slots, &maxram_size, machine_class);
> > -
> >  os_daemonize();
> >  rcu_disable_atfork();
> >  
> > @@ -4122,9 +4128,6 @@ int main(int argc, char **argv, char **envp)
> >  machine_opts = qemu_get_machine_opts();
> >  qemu_opt_foreach(machine_opts, machine_set_property, current_machine,
> >   &error_fatal);
> > -current_machine->ram_size = ram_size;
> > -current_machine->maxram_size = maxram_size;
> > -current_machine->ram_slots = ram_slots;
> >  
> >  /*
> >   * Note: uses machine properties such as kernel-irqchip, must run
> > @@ -4235,14 +4238,6 @@ int main(int argc, char **argv, char **envp)
> >  
> >  tpm_init();
> >  
> > -if (!xen_enabled()) {
> > -/* On 32-bit hosts, QEMU is limited by virtual address space */
> > -if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) {
> > -error_report("at most 2047 MB RAM can be simulated");
> > -exit(1);
> > -}
> > -}
> > -
> >  blk_mig_init();
> >  ram_mig_init();
> >  dirty_bitmap_mig_init();
> > @@ -4287,6 +4282,12 @@ int main(int argc, char **argv, char **envp)
> >  if (cpu_option) {
> >  current_machine->cpu_type = parse_cpu_option(cpu_option);
> >  }
> > +
> > +set_memory_options(&ram_slots, &maxram_size, machine_class);
> > +current_machine->ram_size = ram_size;
> > +current_machine->maxram_size = maxram_size;
> > +current_machine->ram_slots = ram_slots;
> > +
> >  parse_numa_opts(current_machine);
> >  
> >  if (machine_class->default_ram_id && current_machine->ram_size &&
> >   




Re: [PATCH v16 Kernel 1/7] vfio: KABI for migration interface for device state

2020-03-26 Thread Cornelia Huck
On Wed, 25 Mar 2020 01:02:33 +0530
Kirti Wankhede  wrote:

> - Defined MIGRATION region type and sub-type.
> 
> - Defined vfio_device_migration_info structure which will be placed at the
>   0th offset of migration region to get/set VFIO device related
>   information. Defined members of structure and usage on read/write access.
> 
> - Defined device states and state transition details.
> 
> - Defined sequence to be followed while saving and resuming VFIO device.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  include/uapi/linux/vfio.h | 228 
> ++
>  1 file changed, 228 insertions(+)

(...)

> +struct vfio_device_migration_info {
> + __u32 device_state; /* VFIO device state */
> +#define VFIO_DEVICE_STATE_STOP  (0)
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING(1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK  (VFIO_DEVICE_STATE_RUNNING | \
> +  VFIO_DEVICE_STATE_SAVING |  \
> +  VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_VALID(state) \
> + (state & VFIO_DEVICE_STATE_RESUMING ? \
> + (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> +
> +#define VFIO_DEVICE_STATE_IS_ERROR(state) \
> + ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \
> +   VFIO_DEVICE_STATE_RESUMING))
> +
> +#define VFIO_DEVICE_STATE_SET_ERROR(state) \
> + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \
> +  VFIO_DEVICE_STATE_RESUMING)
> +
> + __u32 reserved;
> + __u64 pending_bytes;
> + __u64 data_offset;
> + __u64 data_size;
> +} __attribute__((packed));

The 'packed' should not even be needed, I think?

> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be 
> mmapped
>   * which allows direct access to non-MSIX registers which happened to be 
> within

Generally, this looks sane to me; however, we should really have
something under Documentation/ in the long run that describes how this
works, so that you can find out about the protocol without having to
dig through headers.




Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP split issue

2020-03-26 Thread Michael S. Tsirkin
On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
> 
> 
> > Am 26.03.2020 um 08:21 schrieb Michael S. Tsirkin :
> > 
> > On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
> >>> On 12.03.20 09:47, Michael S. Tsirkin wrote:
> >>> On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
>  2. You are essentially stealing THPs in the guest. So the fastest
>  mapping (THP in guest and host) is gone. The guest won't be able to make
>  use of THP where it previously was able to. I can imagine this implies a
>  performance degradation for some workloads. This needs a proper
>  performance evaluation.
> >>> 
> >>> I think the problem is more with the alloc_pages API.
> >>> That gives you exactly the given order, and if there's
> >>> a larger chunk available, it will split it up.
> >>> 
> >>> But for balloon - I suspect lots of other users,
> >>> we do not want to stress the system but if a large
> >>> chunk is available anyway, then we could handle
> >>> that more optimally by getting it all in one go.
> >>> 
> >>> 
> >>> So if we want to address this, IMHO this calls for a new API.
> >>> Along the lines of
> >>> 
> >>>struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
> >>>unsigned int max_order, unsigned int *order)
> >>> 
> >>> the idea would then be to return a number of pages in the given
> >>> range.
> >>> 
> >>> What do you think? Want to try implementing that?
> >> 
> >> You can just start with the highest order and decrement the order until
> >> your allocation succeeds using alloc_pages(), which would be enough for
> >> a first version. At least I don't see the immediate need for a new
> >> kernel API.
> > 
> > OK I remember now.  The problem is with reclaim. Unless reclaim is
> > completely disabled, any of these calls can sleep. After it wakes up,
> > we would like to get the larger order that has become available
> > meanwhile.
> > 
> 
> Yes, but that's a pure optimization IMHO.
> So I think we should do a trivial implementation first and then see what we 
> gain from a new allocator API. Then we might also be able to justify it using 
> real numbers.
> 

Well, how do you propose to implement the necessary semantics?
I think we are both agreed that alloc_page_range is more or
less what's necessary anyway - so how would you approximate it
on top of existing APIs?
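
For concreteness, the trivial approximation discussed above would look
roughly like this (a sketch only; it ignores the reclaim race described
above, which is exactly what alloc_page_range would fix):

static struct page *alloc_pages_range_approx(gfp_t gfp, unsigned int min_order,
                                             unsigned int max_order,
                                             unsigned int *order)
{
    unsigned int o;

    for (o = max_order; o > min_order; o--) {
        /* don't stress the system while probing for large chunks */
        struct page *page = alloc_pages(gfp | __GFP_NOWARN | __GFP_NORETRY, o);

        if (page) {
            *order = o;
            return page;
        }
    }
    *order = min_order;
    return alloc_pages(gfp, min_order); /* may enter reclaim and sleep */
}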


> > 
> >> -- 
> >> Thanks,
> >> 
> >> David / dhildenb
> > 




RFC: use VFIO over a UNIX domain socket to implement device offloading

2020-03-26 Thread Thanos Makatos
I want to continue the discussion regarding using MUSER
(https://github.com/nutanix/muser) as a device offloading mechanism. The main
drawback of MUSER is that it requires a kernel module, so I've experimented
with a proof of concept of how MUSER would look if we somehow didn't need
a kernel module. I did this by implementing a wrapper library
(https://github.com/tmakatos/libpathtrap) that intercepts accesses to
VFIO-related paths and forwards them to the MUSER process providing device
emulation over a UNIX domain socket. This does not require any changes to QEMU
(4.1.0). Obviously this is a massive hack and is done only for the needs of
this PoC.

The result is a fully working PCI device in QEMU (the gpio sample explained in
https://github.com/nutanix/muser/blob/master/README.md#running-gpio-pci-idio-16),
which is as simple as possible. I've also tested with a much more complicated
device emulation, https://github.com/tmakatos/spdk, which provides NVMe device
emulation and requires accessing guest memory for DMA, allowing BAR0 to be
memory mapped into the guest, using MSI-X interrupts, etc.

The changes required in MUSER are fairly small: all that is needed is to
introduce a new concept of a "transport" to receive requests from a UNIX domain
socket instead of from the kernel (a character device), and to send/receive
file descriptors for sharing memory and firing interrupts.
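
The fd passing referred to above is the standard SCM_RIGHTS ancillary-data
mechanism; a minimal sketch, independent of the actual MUSER code:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u = { 0 };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;           /* kernel duplicates the fd */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}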

My experience is that VFIO is so intuitive to use for offloading device
emulation from one process to another that implementing this feature is quite
straightforward. There's virtually nothing specific to the kernel in the VFIO
API. Therefore I strongly agree with Stefan's suggestion to use it for device
offloading when interacting with QEMU. Using 'muser.ko' is still interesting
when QEMU is not the client, but if everyone is happy to proceed with the
vfio-over-socket alternative the kernel module can become a second-class
citizen. (QEMU is, after all, our first and most relevant client.)

Next I explain how to test the PoC.

Build MUSER with vfio-over-socket:

git clone --single-branch --branch vfio-over-socket 
g...@github.com:tmakatos/muser.git
cd muser/
git submodule update --init
make

Run device emulation, e.g.

./build/dbg/samples/gpio-pci-idio-16 -s <N>

Where <N> is an available IOMMU group, essentially the device ID, which must
not previously exist in /dev/vfio/.

Run QEMU using the vfio wrapper library and specifying the MUSER device:

LD_PRELOAD=muser/build/dbg/libvfio/libvfio.so qemu-system-x86_64 \
... \
-device vfio-pci,sysfsdev=/dev/vfio/<N> \
-object 
memory-backend-file,id=ram-node0,prealloc=yes,mem-path=mem,share=yes,size=1073741824
 \
-numa node,nodeid=0,cpus=0,memdev=ram-node0

Bear in mind that since this is just a PoC lots of things can break, e.g. some
system calls may not be intercepted, etc.



Re: Reply: [question] vhost-user: auto fix network link broken during migration

2020-03-26 Thread Jason Wang



On 2020/3/24 7:08 PM, yangke (J) wrote:

We found an issue where a host MCE triggers an openvswitch (dpdk) restart
on the source host during guest migration,


Did you mean the vhost-user netdev was deleted from the source host?


The vhost-user netdev was not deleted from the source host. I mean that:
in the normal scenario, OVS (DPDK) begins to restart, then qemu_chr disconnects
from OVS and the link status is set to link down; once OVS (DPDK) has started,
qemu_chr reconnects to OVS and the link status is set to link up. But in our
scenario, the VM migration finished before qemu_chr could reconnect to OVS.
The link_down of the frontend was loaded from n->status on the destination,
which causes the network in the guest to never come up again.



I'm not sure we should fix this in qemu.

Generally, it's the task of management to make sure the destination
device configuration is the same as the source's.


E.g. in this case, management should bring up the link once the
re-connection on the source side has completed.


What's more, the qmp_set_link() done in vhost-user.c looks hacky, as it
changes the link status without management being aware of it.
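
E.g. with qmp-shell, once the source side has reconnected, management could
issue something like (a sketch; "hostnet0" as in the backtrace below):

(QEMU) set_link name=hostnet0 up=true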





qemu_chr disconnect:
#0  vhost_user_write (msg=msg@entry=0x7fff59ecb2b0, fds=fds@entry=0x0, 
fd_num=fd_num@entry=0, dev=0x295c730, dev=0x295c730)
 at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost_user.c:239
#1  0x004e6bad in vhost_user_get_vring_base (dev=0x295c730, 
ring=0x7fff59ecb510)
 at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost_user.c:497
#2  0x004e2e88 in vhost_virtqueue_stop (dev=dev@entry=0x295c730, 
vdev=vdev@entry=0x2ca36c0, vq=0x295c898, idx=0)
 at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost.c:1036
#3  0x004e45ab in vhost_dev_stop (hdev=hdev@entry=0x295c730, 
vdev=vdev@entry=0x2ca36c0)
 at /usr/src/debug/qemu-kvm-2.8.1/hw/virtio/vhost.c:1556
#4  0x004bc56a in vhost_net_stop_one (net=0x295c730, 
dev=dev@entry=0x2ca36c0)
 at /usr/src/debug/qemu-kvm-2.8.1/hw/net/vhost_net.c:326
#5  0x004bcc3b in vhost_net_stop (dev=dev@entry=0x2ca36c0, ncs=,   total_queues=4)
 at /usr/src/debug/qemu-kvm-2.8.1/hw/net/vhost_net.c:407
#6  0x004b85f6 in virtio_net_vhost_status (n=n@entry=0x2ca36c0, 
status=status@entry=7 '\a')
 at /usr/src/debug/qemu-kvm-2.8.1/hw/net/virtio_net.c:177
#7  0x004b869f in virtio_net_set_status (vdev=, 
status=)
 at /usr/src/debug/qemu-kvm-2.8.1/hw/net/virtio_net.c:243
#8  0x0073d00d in qmp_set_link (name=name@entry=0x2956d40 "hostnet0", 
up=up@entry=false, errp=errp@entry=0x7fff59ecd718)
 at net/net.c:1437
#9  0x007460c1 in net_vhost_user_event (opaque=0x2956d40, event=4) at 
net/vhost_user.c:217//qemu_chr_be_event
#10 0x00574f0d in tcp_chr_disconnect (chr=0x2951a40) at qemu_char.c:3220
#11 0x0057511f in tcp_chr_hup (channel=,   cond=, opaque=) at qemu_char.c:3265





The frontend is still link down in the VM after migration, which causes the
network in the VM to never come up again.

virtio_net_load_device:
  /* nc.link_down can't be migrated, so infer link_down according
   * to link status bit in n->status */
  link_down = (n->status & VIRTIO_NET_S_LINK_UP) == 0;
  for (i = 0; i < n->max_queues; i++) {
  qemu_get_subqueue(n->nic, i)->link_down = link_down;
  }

guset:   migrate begin -> vCPU pause ---> vmsate load ---> 
migrate finish
  ^^^
  |||
openvswitch in source host:   begin to restart   restartingstarted
  ^^^
  |||
nc in frontend in source:link downlink downlink down
  ^^^
  |||
nc in frontend in destination:   link up  link up  link down
  ^^^
  |||
guset network:broken   broken   broken
  ^^^
  |||
nc in backend in source: link downlink downlink up
  ^^^
  |||
nc in backend in destination:link up  link up  link up

The link_down of the frontend was loaded from n->status; n->status is link
down in the source, so the link_down of the frontend is true. The backend on
the destination host is link up, but the frontend on the destination host is
link down, which causes the network in the guest to never come up again until
a guest cold reboot.

Is there a way to auto fix 

Re: [PATCH 2/2] util/bufferiszero: improve avx2 accelerator

2020-03-26 Thread Paolo Bonzini
On 26/03/20 03:09, Hu, Robert wrote:
> BTW, do I need to resend these 2 patches?

No, thanks!  I have queued them.

Paolo




Re: [PATCH] backup: don't acquire aio_context in backup_clean

2020-03-26 Thread Stefan Reiter

On 26/03/2020 06:54, Vladimir Sementsov-Ogievskiy wrote:

25.03.2020 18:50, Stefan Reiter wrote:

backup_clean is only ever called as a handler via job_exit, which


Hmm.. I'm afraid it's not quite correct.

job_clean

   job_finalize_single

  job_completed_txn_abort (lock aio context)

  job_do_finalize


Hmm. job_do_finalize calls job_completed_txn_abort, which takes care to lock
the aio context..
And at the same time, it directly calls job_txn_apply(job->txn,
job_finalize_single)

without locking. Is it a bug?



I think, as you say, the idea is that job_do_finalize is always called 
with the lock acquired. That's why job_completed_txn_abort takes care to 
release the lock (at least of the "outer_ctx" as it calls it) before 
reacquiring it.


And, even if job_do_finalize is always called with the context locked, where
is the guarantee that the contexts of all jobs in the txn are locked?



I also don't see anything that guarantees that... I guess it could be 
adapted to handle locks like job_completed_txn_abort does?


Haven't looked into transactions too much, but does it even make sense 
to have jobs in different contexts in one transaction?



Still, let's look through its callers.

   job_finalize

    qmp_block_job_finalize (lock aio context)
    qmp_job_finalize (lock aio context)
    test_cancel_concluded (doesn't lock, but it's a test)

   job_completed_txn_success

    job_completed

     job_exit (lock aio context)

     job_cancel

  blockdev_mark_auto_del (lock aio context)

  job_user_cancel

  qmp_block_job_cancel (locks context)
  qmp_job_cancel  (locks context)

  job_cancel_err

   job_cancel_sync (returns
job_finish_sync(job, &job_cancel_err, NULL); job_finish_sync just calls
the callback)


    replication_close (it's 
.bdrv_close.. Hmm, I don't see context locking, where is it ?)
Hm, don't see it either. This might indeed be a way to get to job_clean 
without a lock held.


I don't have any testing set up for replication atm, but if you believe 
this would be correct I can send a patch for that as well (just acquire 
the lock in replication_close before job_cancel_async?).
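
Something like this, perhaps (a rough sketch only, assuming the job's own
AioContext is the right one to take, as the other job_cancel_sync() callers
do):

AioContext *ctx = job->aio_context;

aio_context_acquire(ctx);
job_cancel_sync(job);
aio_context_release(ctx);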




    replication_stop (locks context)

    drive_backup_abort (locks context)

    blockdev_backup_abort (locks context)

    job_cancel_sync_all (locks context)

    cancel_common (locks context)

  test_* (I don't care)



To clarify, aside from the commit message the patch itself does not 
appear to be wrong? All paths (aside from replication_close mentioned 
above) guarantee the job lock to be held.



already acquires the job's context. The job's context is guaranteed to
be the same as the one used by backup_top via backup_job_create.

Since the previous logic effectively acquired the lock twice, this
broke cleanup of backups for disks using IO threads, since the
BDRV_POLL_WHILE in bdrv_backup_top_drop -> bdrv_do_drained_begin would only
release the lock once, thus deadlocking with the IO thread.
once, thus deadlocking with the IO thread.

Signed-off-by: Stefan Reiter 


Just note that this code was recently touched by 0abf2581717a19, so adding
Sergio (its author) to CC.



---

This is a fix for the issue discussed in this part of the thread:
https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07639.html
...not the original problem (core dump) posted by Dietmar.

I've still seen it occasionally hang during a backup abort. I'm trying to
figure out why that happens; the stack trace indicates a similar problem,
with the main thread hanging at bdrv_do_drained_begin, though I have no
clue why as of yet.


  block/backup.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 7430ca5883..a7a7dcaf4c 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -126,11 +126,7 @@ static void backup_abort(Job *job)
  static void backup_clean(Job *job)
  {
  BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
-    AioContext *aio_context = bdrv_get_aio_context(s->backup_top);
-
-    aio_context_acquire(aio_context);
  bdrv_backup_top_drop(s->backup_top);
-    aio_context_release(aio_context);
  }
  void backup_do_checkpoint(BlockJob *job, Error **errp)









Re: [PATCH for-5.0 3/3] object-add: don't create return value if failed

2020-03-26 Thread Paolo Bonzini
On 25/03/20 19:47, Marc-André Lureau wrote:
> If object-add failed, no need to create a return value that may later
> be leaked.
> 
> Signed-off-by: Marc-André Lureau 
> ---
>  qom/qom-qmp-cmds.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/qom/qom-qmp-cmds.c b/qom/qom-qmp-cmds.c
> index 435193b036..6bd137ccbf 100644
> --- a/qom/qom-qmp-cmds.c
> +++ b/qom/qom-qmp-cmds.c
> @@ -287,8 +287,8 @@ void qmp_object_add(QDict *qdict, QObject **ret_data, 
> Error **errp)
>  visit_free(v);
>  if (obj) {
>  object_unref(obj);
> +*ret_data = QOBJECT(qdict_new());
>  }
> -*ret_data = QOBJECT(qdict_new());
>  }
>  
>  void qmp_object_del(const char *id, Error **errp)
> 

It can be slightly simplified:

--- 8< --
From: Paolo Bonzini 
Subject: [PATCH] object-add: don't create return value if failed

No need to return an empty value from object-add (it would also leak
if the command failed).  While at it, remove the "if" around object_unref
since object_unref handles NULL arguments just fine.

Reported-by: Marc-André Lureau 
Signed-off-by: Paolo Bonzini 

diff --git a/qom/qom-qmp-cmds.c b/qom/qom-qmp-cmds.c
index 435193b036..e47ebe8ed1 100644
--- a/qom/qom-qmp-cmds.c
+++ b/qom/qom-qmp-cmds.c
@@ -285,10 +285,7 @@ void qmp_object_add(QDict *qdict, QObject **ret_data, 
Error **errp)
 v = qobject_input_visitor_new(QOBJECT(qdict));
 obj = user_creatable_add_type(type, id, qdict, v, errp);
 visit_free(v);
-if (obj) {
-object_unref(obj);
-}
-*ret_data = QOBJECT(qdict_new());
+object_unref(obj);
 }
 
 void qmp_object_del(const char *id, Error **errp)


I queued this patch and your other two.  Thanks,

Paolo




Re: [PATCH v15 Kernel 1/7] vfio: KABI for migration interface for device state

2020-03-26 Thread Christoph Hellwig
s/KABI/UAPI/ in the subject and anywhere else in the series.

Please avoid __packed__ structures and just properly pad them, they
have a major performance impact on some platforms and will cause
compiler warnings when taking addresses of members.



Re: [PULL 0/2] Fixes 20200325 patches

2020-03-26 Thread Peter Maydell
On Wed, 25 Mar 2020 at 11:05, Gerd Hoffmann  wrote:
>
> The following changes since commit 736cf607e40674776d752acc201f565723e86045:
>
>   Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +)
>
> are available in the Git repository at:
>
>   git://git.kraxel.org/qemu tags/fixes-20200325-pull-request
>
> for you to fetch changes up to 95fad99cb28e9970944b01fd7af452f6f9f37484:
>
>   hw/audio/fmopl: fix segmentation fault (2020-03-25 09:55:40 +0100)
>
> 
> fixes: input error handling & audio segfault
>
> 


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/5.0
for any user-visible changes.

-- PMM



Re: backup transaction with io-thread core dumps

2020-03-26 Thread Dietmar Maurer
> > > As mentioned earlier, even a totally simple/normal backup job fails when
> > > using io-threads and the VM is under load. It results in a total
> > > VM freeze!
> > > 
> > 
> > This is definitely a different issue. I'll take a look at it today.
> 
> Thanks. Stefan found a way to avoid that bug with:
> 
> https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07749.html
> 
> But there are doubts that this is the correct way to fix it ...

And I just ran into another freeze (with Stefan's patch applied). This time
when I cancel a running backup:

#0  0x75cb3916 in __GI_ppoll (fds=0x7fff63d35c40, nfds=2, 
timeout=, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at 
../sysdeps/unix/sysv/linux/ppoll.c:39
#1  0x55c5fcd9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=, __fds=) at /usr/include/x86_64-linux-gnu/bits/poll2.h:77
#2  0x55c5fcd9 in qemu_poll_ns (fds=, nfds=, timeout=timeout@entry=-1) at util/qemu-timer.c:335
#3  0x55c624c1 in fdmon_poll_wait (ctx=0x7fffe8905e80, 
ready_list=0x7fffd2a8, timeout=-1) at util/fdmon-poll.c:79
#4  0x55c61aa7 in aio_poll (ctx=0x7fffe8905e80, 
blocking=blocking@entry=true) at util/aio-posix.c:589
#5  0x55bc2c83 in bdrv_do_drained_begin (poll=, 
ignore_bds_parents=false, parent=0x0, recursive=false, bs=0x7fffe8954bc0) at 
block/io.c:429
#6  0x55bc2c83 in bdrv_do_drained_begin (bs=0x7fffe8954bc0, 
recursive=, parent=0x0, ignore_bds_parents=, 
poll=) at block/io.c:395
#7  0x55bb3c37 in blk_drain (blk=0x7fffe8ebcf80) at 
block/block-backend.c:1617
#8  0x55bb481d in blk_unref (blk=0x7fffe8ebcf80) at 
block/block-backend.c:473
#9  0x55b6c835 in block_job_free (job=0x7fff64505000) at blockjob.c:89
#10 0x55b6dd19 in job_unref (job=0x7fff64505000) at job.c:360
#11 0x55b6dd19 in job_unref (job=0x7fff64505000) at job.c:352
#12 0x55b6e69a in job_finish_sync (job=job@entry=0x7fff64505000, 
finish=finish@entry=0x55b6ec80 , errp=errp@entry=0x0) at 
job.c:988
#13 0x55b6ec9e in job_cancel_sync (job=job@entry=0x7fff64505000) at 
job.c:931
...

Any ideas?




Re: [PULL 006/136] vl.c: move -m parsing after memory backends has been processed

2020-03-26 Thread Auger Eric
Hi

On 2/25/20 12:48 PM, Paolo Bonzini wrote:
> From: Igor Mammedov 
> 
> It will be possible for main RAM to come from a memory-backend,
> and we should check that the size specified in -m matches the size
> of the backend and that [MachineState::]ram_size also matches the
> backend's size.
> 
> However, -m parsing (set_memory_options()) happens before backends
> are initialized (object_create_delayed()), which complicates things.
> Consolidate set_memory_options() and the assignment of parsed results to
> current_machine after backends are initialized, so it is
> possible to access the initialized backend instance to compare
> sizes.
> 
> This patch only consolidates the scattered places touching ram_size
> within vl.c. A follow-up patch will integrate backend handling
> into set_memory_options().
> 
> Signed-off-by: Igor Mammedov 
> Message-Id: <20200219160953.13771-7-imamm...@redhat.com>

Unfortunately this patch breaks ARM virt memory map setting in KVM mode.
If you launch ARM virt with PCDIMM you now get

qemu-system-aarch64: -device
pc-dimm,memdev=mem1,id=dimm1,driver=pc-dimm: memory devices (e.g. for
memory hotplug) are not supported by the machine

With machvirt/KVM
configure_accelerators() calls ARM virt_kvm_type(). This function
freezes the machine memory map and computes the size of the IPA to be
supported by KVM. See:
c9650222b8 ("hw/arm/virt: Implement kvm_type function for 4.0 machine")

virt_kvm_type() uses machine memory settings. Igor's patch moves those
after the mem backend init, so the virt memory map does not properly
take into account the machine memory settings anymore.

So we need to find a way to rework this.

Thanks

Eric


> ---
>  vl.c | 27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/vl.c b/vl.c
> index 2103804..72ffc06 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2655,6 +2655,14 @@ static void set_memory_options(uint64_t *ram_slots, 
> ram_addr_t *maxram_size,
>  exit(EXIT_FAILURE);
>  }
>  
> +if (!xen_enabled()) {
> +/* On 32-bit hosts, QEMU is limited by virtual address space */
> +if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) {
> +error_report("at most 2047 MB RAM can be simulated");
> +exit(1);
> +}
> +}
> +
>  loc_pop(&loc);
>  }
>  
> @@ -3819,8 +3827,6 @@ int main(int argc, char **argv, char **envp)
>  machine_class = select_machine();
>  object_set_machine_compat_props(machine_class->compat_props);
>  
> -set_memory_options(&ram_slots, &maxram_size, machine_class);
> -
>  os_daemonize();
>  rcu_disable_atfork();
>  
> @@ -4122,9 +4128,6 @@ int main(int argc, char **argv, char **envp)
>  machine_opts = qemu_get_machine_opts();
>  qemu_opt_foreach(machine_opts, machine_set_property, current_machine,
>   &error_fatal);
> -current_machine->ram_size = ram_size;
> -current_machine->maxram_size = maxram_size;
> -current_machine->ram_slots = ram_slots;
>  
>  /*
>   * Note: uses machine properties such as kernel-irqchip, must run
> @@ -4235,14 +4238,6 @@ int main(int argc, char **argv, char **envp)
>  
>  tpm_init();
>  
> -if (!xen_enabled()) {
> -/* On 32-bit hosts, QEMU is limited by virtual address space */
> -if (ram_size > (2047 << 20) && HOST_LONG_BITS == 32) {
> -error_report("at most 2047 MB RAM can be simulated");
> -exit(1);
> -}
> -}
> -
>  blk_mig_init();
>  ram_mig_init();
>  dirty_bitmap_mig_init();
> @@ -4287,6 +4282,12 @@ int main(int argc, char **argv, char **envp)
>  if (cpu_option) {
>  current_machine->cpu_type = parse_cpu_option(cpu_option);
>  }
> +
> +set_memory_options(&ram_slots, &maxram_size, machine_class);
> +current_machine->ram_size = ram_size;
> +current_machine->maxram_size = maxram_size;
> +current_machine->ram_slots = ram_slots;
> +
>  parse_numa_opts(current_machine);
>  
>  if (machine_class->default_ram_id && current_machine->ram_size &&
> 




Re: backup transaction with io-thread core dumps

2020-03-26 Thread Dietmar Maurer


> > > > So the solution is to disable backups when using io-threads?
> > > >
> > > 
> > > I meant forbidding transactions with completion-mode == grouped. It
> > > would be still possible running transactions (and thus, backups) with
> > > completion-mode == individual, which is the default.
> > 
> > As mentioned earlier, even a totally simple/normal backup job fails when
> > using io-threads and the VM is under load. It results in a total
> > VM freeze!
> > 
> 
> This is definitely a different issue. I'll take a look at it today.

Thanks. Stefan found a way to avoid that bug with:

https://lists.gnu.org/archive/html/qemu-devel/2020-03/msg07749.html

But there are doubts that this is the correct way to fix it ...




Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

2020-03-26 Thread David Hildenbrand



> On 26.03.2020 at 08:21, Michael S. Tsirkin wrote:
> 
> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
>>> On 12.03.20 09:47, Michael S. Tsirkin wrote:
>>> On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
 2. You are essentially stealing THPs in the guest. So the fastest
 mapping (THP in guest and host) is gone. The guest won't be able to make
 use of THP where it previously was able to. I can imagine this implies a
 performance degradation for some workloads. This needs a proper
 performance evaluation.
>>> 
>>> I think the problem is more with the alloc_pages API.
>>> That gives you exactly the given order, and if there's
>>> a larger chunk available, it will split it up.
>>> 
>>> But for balloon - I suspect lots of other users,
>>> we do not want to stress the system but if a large
>>> chunk is available anyway, then we could handle
>>> that more optimally by getting it all in one go.
>>> 
>>> 
>>> So if we want to address this, IMHO this calls for a new API.
>>> Along the lines of
>>> 
>>>struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
>>>unsigned int max_order, unsigned int *order)
>>> 
>>> the idea would then be to return at a number of pages in the given
>>> range.
>>> 
>>> What do you think? Want to try implementing that?
>> 
>> You can just start with the highest order and decrement the order until
>> your allocation succeeds using alloc_pages(), which would be enough for
>> a first version. At least I don't see the immediate need for a new
>> kernel API.
> 
> OK I remember now.  The problem is with reclaim. Unless reclaim is
> completely disabled, any of these calls can sleep. After it wakes up,
> we would like to get the larger order that has become available
> meanwhile.
> 

Yes, but that's a pure optimization IMHO.

So I think we should do a trivial implementation first and then see what we 
gain from a new allocator API. Then we might also be able to justify it using 
real numbers.

> 
>> -- 
>> Thanks,
>> 
>> David / dhildenb
> 
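
A minimal sketch of that trivial first version — illustrative only, the
helper name is invented and this is not code from the balloon driver:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /*
     * Try the largest order first and step down until an allocation
     * succeeds; report the order actually obtained via *order.
     */
    static struct page *alloc_pages_best_effort(gfp_t gfp,
                                                unsigned int max_order,
                                                unsigned int *order)
    {
        unsigned int o;

        for (o = max_order; ; o--) {
            /* Keep high-order attempts cheap: no retries, no warnings. */
            gfp_t g = o ? gfp | __GFP_NORETRY | __GFP_NOWARN : gfp;
            struct page *page = alloc_pages(g, o);

            if (page) {
                *order = o;
                return page;
            }
            if (!o) {
                return NULL;
            }
        }
    }

As Michael notes, a sleeping allocation can still miss a larger chunk that
frees up while it waits; this sketch simply takes whatever order succeeds
first.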




Re: backup transaction with io-thread core dumps

2020-03-26 Thread Sergio Lopez
On Wed, Mar 25, 2020 at 04:40:48PM +0100, Dietmar Maurer wrote:
> > On March 25, 2020 1:39 PM Sergio Lopez  wrote:
> > 
> >  
> > On Wed, Mar 25, 2020 at 01:29:48PM +0100, Dietmar Maurer wrote:
> > > > As expected, if both BDS are running on the same IOThread (and thus,
> > > > the same AioContext), the problem is not reproducible.
> > > >
> > > > In a general sense, we could say that completion modes other than
> > > > "individual" are not supported for a transaction that may access
> > > different AioContexts. I don't see a safe and easy way to fix this. We
> > > > could opt for simply detect and forbid such completion modes when the
> > > > BDS's are assigned to different AioContexts. Perhaps Kevin (in CC) has
> > > > a better idea.
> > >
> > > So the solution is to disable backups when using io-threads?
> > >
> > 
> > I meant forbidding transactions with completion-mode == grouped. It
> > would be still possible running transactions (and thus, backups) with
> > completion-mode == individual, which is the default.
> 
> As mentioned earlier, even a totally simple/normal backup job fails when
> using io-threads and the VM is under load. It results in a total
> VM freeze!
> 

This is definitely a different issue. I'll take a look at it today.

Cheers,
Sergio.




Re: [PATCH 00/13] microvm: add acpi support

2020-03-26 Thread Michael S. Tsirkin
On Wed, Mar 25, 2020 at 07:44:34PM +0100, Igor Mammedov wrote:
> On Wed, 25 Mar 2020 16:03:39 +0100
> Gerd Hoffmann  wrote:
> 
> > On Wed, Mar 25, 2020 at 01:32:12PM +0100, Igor Mammedov wrote:
> > > On Thu, 19 Mar 2020 09:01:04 +0100
> > > Gerd Hoffmann  wrote:
> > >   
> > > > I know that not supporting ACPI in microvm is intentional.  If you still
> > > > don't want ACPI this is perfectly fine, you can use the usual -no-acpi
> > > > switch to toggle ACPI support.
> > > > 
> > > > These are the advantages you are going to lose then:
> > > > 
> > > >   (1) virtio-mmio device discovery without command line hacks (tweaking
> > > >   the command line is a problem when not using direct kernel boot).
> > > >   (2) Better IO-APIC support, we can use IRQ lines 16-23.
> > > >   (3) ACPI power button (aka powerdown request) works.
> > > >   (4) machine poweroff (aka S5 state) works.
> > > > 
> > > > Together with the seabios patches for virtio-mmio support this allows
> > > > booting standard Fedora images (cloud, coreos, workstation live) with the
> > > > microvm machine type.
> > > 
> > > what CLI do you use to test it?  
> > 
> > Test script below.  "qemu-default" is a wrapper script which starts
> > qemu-system-x86_64 from my build directory.  "qemu-firmware" is the
> > same plus isa-debugcon configured for a firmware log on stdout.
> > 
> > Latest bits (with some of the review comments addressed) just pushed
to git://git.kraxel.org/qemu sirius/microvm
> 
> thanks, below are test results I got on my system,
> spoiler: hw-reduced ACPI reduces boot time by ~0.02s compared to full-blown ACPI
> 
> using the timestamp at "Run /init as init process" as the measuring point
> 
> no acpi
> 1.967316
> 1.975272
> 1.981267
> 1.974316
> 1.962452
> 1.960988
> 
> hw reduced acpi
> 0.893838
> 0.892573
> 0.890585
> 0.900306
> 0.897902
> 
> normal acpi:
> 0.921647
> 0.916298
> 0.923518
> 0.916298
> 0.913234
> 
> PS:
> I just quickly hacked up hw-reduced ACPI (using arm/virt as a model)
> without implementing the power button, but I doubt that would affect the
> results noticeably.
> On the QEMU side it will probably also save some setup time, since there
> are fewer things for QEMU to set up.

And no ACPI is slower because of PS/2 probing, right?

> > 
> > HTH,
> >   Gerd
> > 
> >  cut here 
> > #!/bin/sh
> > 
> > mode="${1}"
> > shift
> > 
> > back=()
> > devs=()
> > args=()
> > qemu="qemu-firmware -monitor none -boot menu=on"
> > disk=""
> > liso=""
> > krnl=""
> > karg="console=ttyS0,115200"
> > 
> > case "$mode" in
> > kernel)
> > qemu="qemu-default -nographic"
> > disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2"
> > krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage"
> > karg="$karg root=/dev/sda4"
> > karg="$karg quiet"
> > ;;
> > seabios)
> > disk="/vmdisk/imagefish/rhel-8.1.0-ks-x86_64-testboot-sys-disk.qcow2"
> > krnl="$HOME/build/linux-sirius-x86_64-qemu/arch/x86/boot/bzImage"
> > karg="$karg root=/dev/sda4"
> > args+=("-bios" 
> > "/home/kraxel/projects/seabios/out-bios-microvm/bios.bin")
> > ;;
> > cloud)
> > disk="/vmdisk/iso/Fedora-Cloud-Base-31-1.9.x86_64.raw"
> > ;;
> > coreos)
> > disk="/vmdisk/iso/fedora-coreos-31.20200210.3.0-metal.x86_64.raw"
> > ;;
> > live)
> > liso="/vmdisk/iso/Fedora-Workstation-Live-x86_64-30-1.2.iso"
> > devs+=("-device" "virtio-gpu-device")
> > devs+=("-device" "virtio-keyboard-device")
> > devs+=("-device" "virtio-tablet-device")
> > ;;
> > *)
> > echo "unknown mode: $mode"
> > echo "known modes: kernel seabios cloud coreos live"
> > exit 1
> > ;;
> > esac
> > 
> > if test "$disk" != ""; then
> > format="${disk##*.}"
> > back+=("-drive" "if=none,id=disk,format=${format},file=${disk}")
> > devs+=("-device" "scsi-hd,drive=disk,bootindex=1")
> > fi
> > if test "$liso" != ""; then
> > back+=("-drive" 
> > "if=none,id=cdrom,media=cdrom,readonly,format=raw,file=${liso}")
> > devs+=("-device" "scsi-cd,drive=cdrom,bootindex=2")
> > fi
> > if test "$krnl" != ""; then
> > args+=("-kernel" "$krnl")
> > args+=("-append" "$karg")
> > fi
> > 
> > set -ex
> > $qemu \
> > -enable-kvm \
> > -cpu host \
> > -M microvm,graphics=off,pit=off,pic=on,rtc=on \
> > \
> > -m 4G \
> > \
> > -netdev user,id=net \
> > "${back[@]}" \
> > \
> > -global virtio-mmio.force-legacy=false \
> > -device virtio-net-device,netdev=net \
> > -device virtio-scsi-device \
> > "${devs[@]}" \
> > \
> > "${args[@]}" \
> > "$@"




[Bug 1868116] Re: QEMU monitor no longer works

2020-03-26 Thread Christian Ehrhardt 
I'm not sure how many of you are tracking the Vte bug [1], so here is a
summary of the latest insights from there.

- Short term, it seems the new behavior will be reverted in Vte 0.60.1.
- Long term, the Vte devs might want to deprecate no-pty use cases, or at least
better understand why apps use them that way.

For more details please read [1].

[1]: https://gitlab.gnome.org/GNOME/vte/issues/222

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1868116

Title:
  QEMU monitor no longer works

Status in QEMU:
  New
Status in qemu package in Ubuntu:
  Triaged
Status in vte2.91 package in Ubuntu:
  New

Bug description:
  Repro:
  VTE
  $ meson _build && ninja -C _build && ninja -C _build install

  qemu:
  $ ../configure --python=/usr/bin/python3 --disable-werror --disable-user 
--disable-linux-user --disable-docs --disable-guest-agent --disable-sdl 
--enable-gtk --disable-vnc --disable-xen --disable-brlapi --disable-fdt 
--disable-hax --disable-vde --disable-netmap --disable-rbd --disable-libiscsi 
--disable-libnfs --disable-smartcard --disable-libusb --disable-usb-redir 
--disable-seccomp --disable-glusterfs --disable-tpm --disable-numa 
--disable-opengl --disable-virglrenderer --disable-xfsctl --disable-vxhs 
--disable-slirp --disable-blobs --target-list=x86_64-softmmu --disable-rdma 
--disable-pvrdma --disable-attr --disable-vhost-net --disable-vhost-vsock 
--disable-vhost-scsi --disable-vhost-crypto --disable-vhost-user 
--disable-spice --disable-qom-cast-debug --disable-vxhs --disable-bochs 
--disable-cloop --disable-dmg --disable-qcow1 --disable-vdi --disable-vvfat 
--disable-qed --disable-parallels --disable-sheepdog --disable-avx2 
--disable-nettle --disable-gnutls --disable-capstone --disable-tools 
--disable-libpmem --disable-iconv --disable-cap-ng
  $ make

  Test:
  $ LD_LIBRARY_PATH=/usr/local/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH 
./build/x86_64-softmmu/qemu-system-x86_64 -enable-kvm --drive 
media=cdrom,file=http://archive.ubuntu.com/ubuntu/dists/bionic/main/installer-amd64/current/images/netboot/mini.iso
  - switch to monitor with CTRL+ALT+2
  - try to enter something

  Affects head of both upstream git repos.

  
  --- original bug ---

  It was observed that the QEMU console (normally accessible using
  Ctrl+Alt+2) accepts no input, so it can't be used. This is
  problematic because there are cases where it's required to send
  commands to the guest, or key combinations that the host would grab
  (such as Ctrl-Alt-F1 or Alt-F4).

  ProblemType: Bug
  DistroRelease: Ubuntu 20.04
  Package: qemu 1:4.2-3ubuntu2
  Uname: Linux 5.6.0-rc6+ x86_64
  ApportVersion: 2.20.11-0ubuntu20
  Architecture: amd64
  CurrentDesktop: XFCE
  Date: Thu Mar 19 12:16:31 2020
  Dependencies:

  InstallationDate: Installed on 2017-06-13 (1009 days ago)
  InstallationMedia: Xubuntu 17.04 "Zesty Zapus" - Release amd64 (20170412)
  KvmCmdLine:
   COMMAND STAT  EUID  RUID PIDPPID %CPU COMMAND
   qemu-system-x86 Sl+   1000  1000   34275   25235 29.2 qemu-system-x86_64 -m 
4G -cpu Skylake-Client -device virtio-vga,virgl=true,xres=1280,yres=720 -accel 
kvm -device nec-usb-xhci -serial vc -serial stdio -hda 
/home/usuario/Sistemas/androidx86.img -display gtk,gl=on -device usb-audio
   kvm-nx-lpage-re S0 0   34284   2  0.0 [kvm-nx-lpage-re]
   kvm-pit/34275   S0 0   34286   2  0.0 [kvm-pit/34275]
  MachineType: LENOVO 80UG
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.6.0-rc6+ 
root=UUID=6b4ae5c0-c78c-49a6-a1ba-029192618a7a ro quiet ro kvm.ignore_msrs=1 
kvm.report_ignored_msrs=0 kvm.halt_poll_ns=0 kvm.halt_poll_ns_grow=0 
i915.enable_gvt=1 i915.fastboot=1 cgroup_enable=memory swapaccount=1 
zswap.enabled=1 zswap.zpool=z3fold 
resume=UUID=a82e38a0-8d20-49dd-9cbd-de7216b589fc log_buf_len=16M 
usbhid.quirks=0x0079:0x0006:0x10 config_scsi_mq_default=y 
scsi_mod.use_blk_mq=1 mtrr_gran_size=64M mtrr_chunk_size=64M nbd.nbds_max=2 
nbd.max_part=63
  SourcePackage: qemu
  UpgradeStatus: Upgraded to focal on 2019-12-22 (87 days ago)
  dmi.bios.date: 08/09/2018
  dmi.bios.vendor: LENOVO
  dmi.bios.version: 0XCN45WW
  dmi.board.asset.tag: NO Asset Tag
  dmi.board.name: Toronto 4A2
  dmi.board.vendor: LENOVO
  dmi.board.version: SDK0J40679 WIN
  dmi.chassis.asset.tag: NO Asset Tag
  dmi.chassis.type: 10
  dmi.chassis.vendor: LENOVO
  dmi.chassis.version: Lenovo ideapad 310-14ISK
  dmi.modalias: 
dmi:bvnLENOVO:bvr0XCN45WW:bd08/09/2018:svnLENOVO:pn80UG:pvrLenovoideapad310-14ISK:rvnLENOVO:rnToronto4A2:rvrSDK0J40679WIN:cvnLENOVO:ct10:cvrLenovoideapad310-14ISK:
  dmi.product.family: IDEAPAD
  dmi.product.name: 80UG
  dmi.product.sku: LENOVO_MT_80UG_BU_idea_FM_Lenovo ideapad 310-14ISK
  dmi.product.version: Lenovo ideapad 310-14ISK
  dmi.sys.vendor: LENOVO
  mtime.conffile..etc.apport.crashdb.conf: 2019-08-29T08:39:36.787240

To manage notifications about this bug go to:
https://bugs.launchpad.net/q

[PULL 3/6] linux-user/i386: Split out gen_signal

2020-03-26 Thread Laurent Vivier
From: Richard Henderson 

This is a bit tidier than open-coding the 5 lines necessary
to initialize the target_siginfo_t.  In addition, this zeros
the remaining bytes of the target_siginfo_t, rather than
passing in garbage.

Reviewed-by: Paolo Bonzini 
Reviewed-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
Message-Id: <20200213032223.14643-3-richard.hender...@linaro.org>
Signed-off-by: Laurent Vivier 
---
 linux-user/i386/cpu_loop.c | 93 ++
 1 file changed, 33 insertions(+), 60 deletions(-)

diff --git a/linux-user/i386/cpu_loop.c b/linux-user/i386/cpu_loop.c
index 024b6f4d588c..e217cca5ee1e 100644
--- a/linux-user/i386/cpu_loop.c
+++ b/linux-user/i386/cpu_loop.c
@@ -81,13 +81,23 @@ static void set_idt(int n, unsigned int dpl)
 }
 #endif
 
+static void gen_signal(CPUX86State *env, int sig, int code, abi_ptr addr)
+{
+target_siginfo_t info = {
+.si_signo = sig,
+.si_code = code,
+._sifields._sigfault._addr = addr
+};
+
+queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info);
+}
+
 void cpu_loop(CPUX86State *env)
 {
 CPUState *cs = env_cpu(env);
 int trapnr;
 abi_ulong pc;
 abi_ulong ret;
-target_siginfo_t info;
 
 for(;;) {
 cpu_exec_start(cs);
@@ -134,70 +144,45 @@ void cpu_loop(CPUX86State *env)
 #endif
 case EXCP0B_NOSEG:
 case EXCP0C_STACK:
-info.si_signo = TARGET_SIGBUS;
-info.si_errno = 0;
-info.si_code = TARGET_SI_KERNEL;
-info._sifields._sigfault._addr = 0;
-queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info);
+gen_signal(env, TARGET_SIGBUS, TARGET_SI_KERNEL, 0);
 break;
 case EXCP0D_GPF:
 /* XXX: potential problem if ABI32 */
 #ifndef TARGET_X86_64
 if (env->eflags & VM_MASK) {
 handle_vm86_fault(env);
-} else
-#endif
-{
-info.si_signo = TARGET_SIGSEGV;
-info.si_errno = 0;
-info.si_code = TARGET_SI_KERNEL;
-info._sifields._sigfault._addr = 0;
-queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info);
+break;
 }
+#endif
+gen_signal(env, TARGET_SIGSEGV, TARGET_SI_KERNEL, 0);
 break;
 case EXCP0E_PAGE:
-info.si_signo = TARGET_SIGSEGV;
-info.si_errno = 0;
-if (!(env->error_code & 1))
-info.si_code = TARGET_SEGV_MAPERR;
-else
-info.si_code = TARGET_SEGV_ACCERR;
-info._sifields._sigfault._addr = env->cr[2];
-queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info);
+gen_signal(env, TARGET_SIGSEGV,
+   (env->error_code & 1 ?
+TARGET_SEGV_ACCERR : TARGET_SEGV_MAPERR),
+   env->cr[2]);
 break;
 case EXCP00_DIVZ:
 #ifndef TARGET_X86_64
 if (env->eflags & VM_MASK) {
 handle_vm86_trap(env, trapnr);
-} else
-#endif
-{
-/* division by zero */
-info.si_signo = TARGET_SIGFPE;
-info.si_errno = 0;
-info.si_code = TARGET_FPE_INTDIV;
-info._sifields._sigfault._addr = env->eip;
-queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info);
+break;
 }
+#endif
+gen_signal(env, TARGET_SIGFPE, TARGET_FPE_INTDIV, env->eip);
 break;
 case EXCP01_DB:
 case EXCP03_INT3:
 #ifndef TARGET_X86_64
 if (env->eflags & VM_MASK) {
 handle_vm86_trap(env, trapnr);
-} else
+break;
+}
 #endif
-{
-info.si_signo = TARGET_SIGTRAP;
-info.si_errno = 0;
-if (trapnr == EXCP01_DB) {
-info.si_code = TARGET_TRAP_BRKPT;
-info._sifields._sigfault._addr = env->eip;
-} else {
-info.si_code = TARGET_SI_KERNEL;
-info._sifields._sigfault._addr = 0;
-}
-queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info);
+if (trapnr == EXCP01_DB) {
+gen_signal(env, TARGET_SIGTRAP, TARGET_TRAP_BRKPT, env->eip);
+} else {
+gen_signal(env, TARGET_SIGTRAP, TARGET_SI_KERNEL, 0);
 }
 break;
 case EXCP04_INTO:
@@ -205,31 +190,19 @@ void cpu_loop(CPUX86State *env)
 #ifndef TARGET_X86_64
 if (env->eflags & VM_MASK) {
 handle_vm86_trap(env, trapnr);
-} else
-#endif
-{
-info.si_signo = TARGET_SIGSEGV;
-info.si_errno = 0;
-info.si_code = TARGET_SI_KERNEL;
-info._sifields._sigfault

[PULL 4/6] linux-user/i386: Emulate x86_64 vsyscalls

2020-03-26 Thread Laurent Vivier
From: Richard Henderson 

Notice the magic page during translate, much like we already
do for the arm32 commpage.  At runtime, raise an exception to
return to cpu_loop for emulation.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Richard Henderson 
Message-Id: <20200213032223.14643-4-richard.hender...@linaro.org>
Signed-off-by: Laurent Vivier 
---
 linux-user/i386/cpu_loop.c | 108 +
 target/i386/cpu.h  |   7 +++
 target/i386/translate.c|  14 -
 3 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/linux-user/i386/cpu_loop.c b/linux-user/i386/cpu_loop.c
index e217cca5ee1e..70cde417e605 100644
--- a/linux-user/i386/cpu_loop.c
+++ b/linux-user/i386/cpu_loop.c
@@ -92,6 +92,109 @@ static void gen_signal(CPUX86State *env, int sig, int code, 
abi_ptr addr)
 queue_signal(env, info.si_signo, QEMU_SI_FAULT, &info);
 }
 
+#ifdef TARGET_X86_64
+static bool write_ok_or_segv(CPUX86State *env, abi_ptr addr, size_t len)
+{
+/*
+ * For all the vsyscalls, NULL means "don't write anything" not
+ * "write it at address 0".
+ */
+if (addr == 0 || access_ok(VERIFY_WRITE, addr, len)) {
+return true;
+}
+
+env->error_code = PG_ERROR_W_MASK | PG_ERROR_U_MASK;
+gen_signal(env, TARGET_SIGSEGV, TARGET_SEGV_MAPERR, addr);
+return false;
+}
+
+/*
+ * Since v3.1, the kernel traps and emulates the vsyscall page.
+ * Entry points other than the official generate SIGSEGV.
+ */
+static void emulate_vsyscall(CPUX86State *env)
+{
+int syscall;
+abi_ulong ret;
+uint64_t caller;
+
+/*
+ * Validate the entry point.  We have already validated the page
+ * during translation to get here; now verify the offset.
+ */
+switch (env->eip & ~TARGET_PAGE_MASK) {
+case 0x000:
+syscall = TARGET_NR_gettimeofday;
+break;
+case 0x400:
+syscall = TARGET_NR_time;
+break;
+case 0x800:
+syscall = TARGET_NR_getcpu;
+break;
+default:
+goto sigsegv;
+}
+
+/*
+ * Validate the return address.
+ * Note that the kernel treats this the same as an invalid entry point.
+ */
+if (get_user_u64(caller, env->regs[R_ESP])) {
+goto sigsegv;
+}
+
+/*
+ * Validate the pointer arguments.
+ */
+switch (syscall) {
+case TARGET_NR_gettimeofday:
+if (!write_ok_or_segv(env, env->regs[R_EDI],
+  sizeof(struct target_timeval)) ||
+!write_ok_or_segv(env, env->regs[R_ESI],
+  sizeof(struct target_timezone))) {
+return;
+}
+break;
+case TARGET_NR_time:
+if (!write_ok_or_segv(env, env->regs[R_EDI], sizeof(abi_long))) {
+return;
+}
+break;
+case TARGET_NR_getcpu:
+if (!write_ok_or_segv(env, env->regs[R_EDI], sizeof(uint32_t)) ||
+!write_ok_or_segv(env, env->regs[R_ESI], sizeof(uint32_t))) {
+return;
+}
+break;
+default:
+g_assert_not_reached();
+}
+
+/*
+ * Perform the syscall.  None of the vsyscalls should need restarting.
+ */
+ret = do_syscall(env, syscall, env->regs[R_EDI], env->regs[R_ESI],
+ env->regs[R_EDX], env->regs[10], env->regs[8],
+ env->regs[9], 0, 0);
+g_assert(ret != -TARGET_ERESTARTSYS);
+g_assert(ret != -TARGET_QEMU_ESIGRETURN);
+if (ret == -TARGET_EFAULT) {
+goto sigsegv;
+}
+env->regs[R_EAX] = ret;
+
+/* Emulate a ret instruction to leave the vsyscall page.  */
+env->eip = caller;
+env->regs[R_ESP] += 8;
+return;
+
+ sigsegv:
+/* Like force_sig(SIGSEGV).  */
+gen_signal(env, TARGET_SIGSEGV, TARGET_SI_KERNEL, 0);
+}
+#endif
+
 void cpu_loop(CPUX86State *env)
 {
 CPUState *cs = env_cpu(env);
@@ -141,6 +244,11 @@ void cpu_loop(CPUX86State *env)
 env->regs[R_EAX] = ret;
 }
 break;
+#endif
+#ifdef TARGET_X86_64
+case EXCP_VSYSCALL:
+emulate_vsyscall(env);
+break;
 #endif
 case EXCP0B_NOSEG:
 case EXCP0C_STACK:
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 49ecc23104c9..9af1b0c12e8e 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1003,6 +1003,7 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
 
 #define EXCP_VMEXIT 0x100 /* only for system emulation */
 #define EXCP_SYSCALL0x101 /* only for user emulation */
+#define EXCP_VSYSCALL   0x102 /* only for user emulation */
 
 /* i386-specific interrupt pending bits.  */
 #define CPU_INTERRUPT_POLL  CPU_INTERRUPT_TGT_EXT_1
@@ -2218,4 +2219,10 @@ static inline bool hyperv_feat_enabled(X86CPU *cpu, int 
feat)
 return !!(cpu->hyperv_features & BIT(feat));
 }
 
+#if defined(TARGET_X86_64) && \
+defined(CONFIG_USER_ONLY) && \
+defined(CONFIG_LINUX)
+# define TARGET_VSYSCALL_PAGE  (UINT64_C(-10) << 20)
+#endif
+
 #endif /* I3

Re: [PATCH for QEMU v2] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_VQ_INFLATE_CONT

2020-03-26 Thread Michael S. Tsirkin
On Mon, Mar 23, 2020 at 12:04:57AM +0800, Hui Zhu wrote:
> If the guest kernel has many fragmented pages, using virtio_balloon
> will split QEMU's THPs when it calls the MADV_DONTNEED madvise to release
> the balloon pages.
> Setting the option cont-pages to on will enable the flag
> VIRTIO_BALLOON_VQ_INFLATE_CONT and set the continuous pages order to the
> THP order.
> Then it will get the PFNs of continuous pages from the icvq VQ and use
> madvise MADV_DONTNEED to release the THP pages.
> This will handle the THP split issue.
> 
> Signed-off-by: Hui Zhu 
> ---
>  hw/virtio/virtio-balloon.c  | 32 
> +
>  include/hw/virtio/virtio-balloon.h  |  4 +++-
>  include/standard-headers/linux/virtio_balloon.h |  4 
>  3 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index a4729f7..88bdaca 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -34,6 +34,7 @@
>  #include "hw/virtio/virtio-access.h"
>  
>  #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define CONT_PAGES_ORDER   9
>  
>  typedef struct PartiallyBalloonedPage {
>  ram_addr_t base_gpa;

This doesn't look right to me. I suspect this is different between
different hosts. Fixing this would also be tricky, as we
might need to migrate between two hosts with different
huge page sizes.

My proposal is to instead enhance the PartiallyBalloonedPage
machinery, teaching it to handle the case where host page
size is smaller than the supported number of subpages.
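
(As an aside on the hardcoded order: a hypothetical sketch — not part of the
patch — of deriving it from the host THP size at startup instead. The
migration concern would additionally require carrying the negotiated order
in the migrated state.)

    #include <inttypes.h>
    #include <stdio.h>

    #define BALLOON_PAGE_SIZE 4096ULL

    /* Read the host THP (PMD) size from sysfs; 0 if unavailable. */
    static uint64_t host_thp_size(void)
    {
        uint64_t size = 0;
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
                        "r");

        if (f) {
            if (fscanf(f, "%" SCNu64, &size) != 1) {
                size = 0;
            }
            fclose(f);
        }
        return size;
    }

    /* Order such that BALLOON_PAGE_SIZE << order equals the host THP size;
     * falls back to 0 (plain 4K pages) when the size is unknown. */
    static unsigned int cont_pages_order(void)
    {
        uint64_t thp = host_thp_size();
        unsigned int order = 0;

        while ((BALLOON_PAGE_SIZE << order) < thp) {
            order++;
        }
        return order;   /* 9 on x86-64 with 2M THP, but host-dependent */
    }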


> @@ -65,7 +66,8 @@ static bool 
> virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
>  
>  static void balloon_inflate_page(VirtIOBalloon *balloon,
>   MemoryRegion *mr, hwaddr mr_offset,
> - PartiallyBalloonedPage *pbp)
> + PartiallyBalloonedPage *pbp, 
> + bool is_cont_pages)
>  {
>  void *addr = memory_region_get_ram_ptr(mr) + mr_offset;
>  ram_addr_t rb_offset, rb_aligned_offset, base_gpa;
> @@ -76,6 +78,13 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
>  /* XXX is there a better way to get to the RAMBlock than via a
>   * host address? */
>  rb = qemu_ram_block_from_host(addr, false, &rb_offset);
> +
> +if (is_cont_pages) {
> +ram_block_discard_range(rb, rb_offset,
> +BALLOON_PAGE_SIZE << CONT_PAGES_ORDER);
> +return;
> +}
> +
>  rb_page_size = qemu_ram_pagesize(rb);
>  
>  if (rb_page_size == BALLOON_PAGE_SIZE) {
> @@ -361,9 +370,10 @@ static void virtio_balloon_handle_output(VirtIODevice 
> *vdev, VirtQueue *vq)
>  
> trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> pa);
>  if (!qemu_balloon_is_inhibited()) {
> -if (vq == s->ivq) {
> +if (vq == s->ivq || vq == s->icvq) {
>  balloon_inflate_page(s, section.mr,
> - section.offset_within_region, &pbp);
> + section.offset_within_region, &pbp,
> + vq == s->icvq);
>  } else if (vq == s->dvq) {
>  balloon_deflate_page(s, section.mr, 
> section.offset_within_region);
>  } else {
> @@ -618,9 +628,12 @@ static size_t virtio_balloon_config_size(VirtIOBalloon 
> *s)
>  if (s->qemu_4_0_config_size) {
>  return sizeof(struct virtio_balloon_config);
>  }
> -if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
>  return sizeof(struct virtio_balloon_config);
>  }
> +if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
> +return offsetof(struct virtio_balloon_config, pages_order);
> +}
>  if (virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>  return offsetof(struct virtio_balloon_config, poison_val);
>  }
> @@ -646,6 +659,10 @@ static void virtio_balloon_get_config(VirtIODevice 
> *vdev, uint8_t *config_data)
> cpu_to_le32(VIRTIO_BALLOON_CMD_ID_DONE);
>  }
>  
> +if (virtio_has_feature(dev->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) 
> {
> +config.pages_order = cpu_to_le32(CONT_PAGES_ORDER);
> +}
> +
>  trace_virtio_balloon_get_config(config.num_pages, config.actual);
>  memcpy(config_data, &config, virtio_balloon_config_size(dev));
>  }
> @@ -816,6 +833,11 @@ static void virtio_balloon_device_realize(DeviceState 
> *dev, Error **errp)
>  virtio_error(vdev, "iothread is missing");
>  }
>  }
> +
> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
> +s->icvq = virtio_add_queue(vdev, 128, virtio_bal

[PULL 1/6] linux-user, configure: fix (again) syscall_nr.h dependencies cleanup

2020-03-26 Thread Laurent Vivier
This patch fixes two problems:
- it cleans up linux-user variants (for instance ppc64-linux-user
  and ppc64le-linux-user)
- it removes the .o file when it removes the .d file, otherwise the .o
  file is never updated

Fixes: 5f29856b852d ("linux-user, configure: improve syscall_nr.h dependencies 
checking")
Fixes: 4d6a835dea47 ("linux-user: introduce parameters to generate 
syscall_nr.h")
Signed-off-by: Laurent Vivier 
Reviewed-by: Richard Henderson 
Message-Id: <20200325075757.1959961-1-laur...@vivier.eu>
---
 configure | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index da09c3589572..89fe881dd46f 100755
--- a/configure
+++ b/configure
@@ -1910,9 +1910,11 @@ for arch in alpha hppa m68k xtensa sh4 microblaze arm 
ppc s390x sparc sparc64 \
 # remove the file if it has been generated in the source directory
 rm -f "${source_path}/linux-user/${arch}/syscall_nr.h"
 # remove the dependency files
-test -d ${arch}-linux-user && find ${arch}-linux-user -type f -name "*.d" \
- -exec grep -q "${source_path}/linux-user/${arch}/syscall_nr.h" {} \; \
- -exec rm {} \;
+for target in ${arch}*-linux-user ; do
+test -d "${target}" && find "${target}" -type f -name "*.d" \
+ -exec grep -q "${source_path}/linux-user/${arch}/syscall_nr.h" {} 
\; \
+ -print | while read file ; do rm "${file}" "${file%.d}.o" ; done
+done
 done
 
 if test -z "$python"
-- 
2.25.1




[PULL 6/6] linux-user: Flush out implementation of gettimeofday

2020-03-26 Thread Laurent Vivier
From: Richard Henderson 

The first argument, timeval, is allowed to be NULL.

The second argument, timezone, was missing.  While its use is
deprecated, it is still present in the syscall.

Reviewed-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
Message-Id: <20200213032223.14643-6-richard.hender...@linaro.org>
[lv: add "#if defined(TARGET_NR_gettimeofday)"]
Signed-off-by: Laurent Vivier 
---
 linux-user/syscall.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index dbdd56e42077..49395dcea978 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -1273,6 +1273,25 @@ static inline abi_long 
host_to_target_timespec64(abi_ulong target_addr,
 return 0;
 }
 
+#if defined(TARGET_NR_gettimeofday)
+static inline abi_long copy_to_user_timezone(abi_ulong target_tz_addr,
+ struct timezone *tz)
+{
+struct target_timezone *target_tz;
+
+if (!lock_user_struct(VERIFY_WRITE, target_tz, target_tz_addr, 1)) {
+return -TARGET_EFAULT;
+}
+
+__put_user(tz->tz_minuteswest, &target_tz->tz_minuteswest);
+__put_user(tz->tz_dsttime, &target_tz->tz_dsttime);
+
+unlock_user_struct(target_tz, target_tz_addr, 1);
+
+return 0;
+}
+#endif
+
 #if defined(TARGET_NR_settimeofday)
 static inline abi_long copy_from_user_timezone(struct timezone *tz,
abi_ulong target_tz_addr)
@@ -8710,10 +8729,16 @@ static abi_long do_syscall1(void *cpu_env, int num, 
abi_long arg1,
 case TARGET_NR_gettimeofday:
 {
 struct timeval tv;
-ret = get_errno(gettimeofday(&tv, NULL));
+struct timezone tz;
+
+ret = get_errno(gettimeofday(&tv, &tz));
 if (!is_error(ret)) {
-if (copy_to_user_timeval(arg1, &tv))
+if (arg1 && copy_to_user_timeval(arg1, &tv)) {
+return -TARGET_EFAULT;
+}
+if (arg2 && copy_to_user_timezone(arg2, &tz)) {
 return -TARGET_EFAULT;
+}
 }
 }
 return ret;
-- 
2.25.1
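
A quick host-side illustration (a sketch, not part of the series) of the
calling convention the patch now honors — both pointers are nullable, and
the kernel still fills in the obsolete timezone structure:

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval tv;
        struct timezone tz;     /* deprecated, but still filled in */

        if (gettimeofday(&tv, &tz) == 0) {
            printf("sec=%ld usec=%ld minuteswest=%d dsttime=%d\n",
                   (long)tv.tv_sec, (long)tv.tv_usec,
                   tz.tz_minuteswest, tz.tz_dsttime);
        }
        return 0;
    }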




[PULL 2/6] target/i386: Renumber EXCP_SYSCALL

2020-03-26 Thread Laurent Vivier
From: Richard Henderson 

We are not short of numbers for EXCP_*.  There is no need to confuse things
by having EXCP_VMEXIT and EXCP_SYSCALL overlap, even though the former is
only used for system mode and the latter is only used for user mode.

Reviewed-by: Paolo Bonzini 
Reviewed-by: Alex Bennée 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
Message-Id: <20200213032223.14643-2-richard.hender...@linaro.org>
Signed-off-by: Laurent Vivier 
---
 target/i386/cpu.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 60d797d5941f..49ecc23104c9 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1001,9 +1001,8 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
 #define EXCP11_ALGN17
 #define EXCP12_MCHK18
 
-#define EXCP_SYSCALL0x100 /* only happens in user only emulation
- for syscall instruction */
-#define EXCP_VMEXIT 0x100
+#define EXCP_VMEXIT 0x100 /* only for system emulation */
+#define EXCP_SYSCALL0x101 /* only for user emulation */
 
 /* i386-specific interrupt pending bits.  */
 #define CPU_INTERRUPT_POLL  CPU_INTERRUPT_TGT_EXT_1
-- 
2.25.1




[PULL 5/6] linux-user: Add x86_64 vsyscall page to /proc/self/maps

2020-03-26 Thread Laurent Vivier
From: Richard Henderson 

The page isn't (necessarily) present in the host /proc/self/maps,
and even if it might be it isn't present in page_flags, and even
if it was it might not have the same set of page permissions.

The easiest thing to do, particularly when it comes to the
"[vsyscall]" note at the end of line, is to special case it.

Signed-off-by: Richard Henderson 
Message-Id: <20200213032223.14643-5-richard.hender...@linaro.org>
[lv: remove trailing space]
Signed-off-by: Laurent Vivier 
---
 linux-user/syscall.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 35f414666243..dbdd56e42077 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -7144,6 +7144,16 @@ static int open_self_maps(void *cpu_env, int fd)
 }
 }
 
+#ifdef TARGET_VSYSCALL_PAGE
+/*
+ * We only support execution from the vsyscall page.
+ * This is as if CONFIG_LEGACY_VSYSCALL_XONLY=y from v5.3.
+ */
+dprintf(fd, TARGET_FMT_lx "-" TARGET_FMT_lx
+" --xp  00:00 0 [vsyscall]\n",
+TARGET_VSYSCALL_PAGE, TARGET_VSYSCALL_PAGE + TARGET_PAGE_SIZE);
+#endif
+
 free(line);
 fclose(fp);
 
-- 
2.25.1




[PULL 0/6] Linux user for 5.0 patches

2020-03-26 Thread Laurent Vivier
The following changes since commit 736cf607e40674776d752acc201f565723e86045:

  Update version for v5.0.0-rc0 release (2020-03-24 17:50:00 +)

are available in the Git repository at:

  git://github.com/vivier/qemu.git tags/linux-user-for-5.0-pull-request

for you to fetch changes up to a52f5f87bece827a338d6eb3332e3def86fb9c33:

  linux-user: Flush out implementation of gettimeofday (2020-03-26 08:08:54 
+0100)


Emulate x86_64 vsyscalls
Fix syscall_nr.h cleanup



Laurent Vivier (1):
  linux-user, configure: fix (again) syscall_nr.h dependencies cleanup

Richard Henderson (5):
  target/i386: Renumber EXCP_SYSCALL
  linux-user/i386: Split out gen_signal
  linux-user/i386: Emulate x86_64 vsyscalls
  linux-user: Add x86_64 vsyscall page to /proc/self/maps
  linux-user: Flush out implementation of gettimeofday

 configure  |   8 +-
 linux-user/i386/cpu_loop.c | 201 ++---
 linux-user/syscall.c   |  39 ++-
 target/i386/cpu.h  |  12 ++-
 target/i386/translate.c|  14 ++-
 5 files changed, 205 insertions(+), 69 deletions(-)

-- 
2.25.1




Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

2020-03-26 Thread Michael S. Tsirkin
On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
> On 12.03.20 09:47, Michael S. Tsirkin wrote:
> > On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
> >> 2. You are essentially stealing THPs in the guest. So the fastest
> >> mapping (THP in guest and host) is gone. The guest won't be able to make
> >> use of THP where it previously was able to. I can imagine this implies a
> >> performance degradation for some workloads. This needs a proper
> >> performance evaluation.
> > 
> > I think the problem is more with the alloc_pages API.
> > That gives you exactly the given order, and if there's
> > a larger chunk available, it will split it up.
> > 
> > But for balloon - I suspect lots of other users,
> > we do not want to stress the system but if a large
> > chunk is available anyway, then we could handle
> > that more optimally by getting it all in one go.
> > 
> > 
> > So if we want to address this, IMHO this calls for a new API.
> > Along the lines of
> > 
> > struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
> > unsigned int max_order, unsigned int 
> > *order)
> > 
> > the idea would then be to return at a number of pages in the given
> > range.
> > 
> > What do you think? Want to try implementing that?
> 
> You can just start with the highest order and decrement the order until
> your allocation succeeds using alloc_pages(), which would be enough for
> a first version. At least I don't see the immediate need for a new
> kernel API.

OK I remember now.  The problem is with reclaim. Unless reclaim is
completely disabled, any of these calls can sleep. After it wakes up,
we would like to get the larger order that has become available
meanwhile.


> -- 
> Thanks,
> 
> David / dhildenb




Re: [PATCH for QEMU v2] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_VQ_INFLATE_CONT

2020-03-26 Thread David Hildenbrand



> On 26.03.2020 at 08:06, teawater wrote:
> 
> Ping.
> 

On paid leave this week. Will try to have a look next week, but it could take a 
bit longer.

Cheers

> Thanks,
> Hui
> 
>> On 2020-03-23 at 00:04, Hui Zhu wrote:
>> 
>> If the guest kernel has many fragmented pages, using virtio_balloon
>> will split QEMU's THPs when it calls the MADV_DONTNEED madvise to release
>> the balloon pages.
>> Setting the option cont-pages to on will enable the flag
>> VIRTIO_BALLOON_VQ_INFLATE_CONT and set the continuous pages order to the
>> THP order.
>> Then it will get the PFNs of continuous pages from the icvq VQ and use
>> madvise MADV_DONTNEED to release the THP pages.
>> This will handle the THP split issue.
>> 
>> Signed-off-by: Hui Zhu 
>> ---
>> hw/virtio/virtio-balloon.c  | 32 
>> +
>> include/hw/virtio/virtio-balloon.h  |  4 +++-
>> include/standard-headers/linux/virtio_balloon.h |  4 
>> 3 files changed, 35 insertions(+), 5 deletions(-)
>> 
>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
>> index a4729f7..88bdaca 100644
>> --- a/hw/virtio/virtio-balloon.c
>> +++ b/hw/virtio/virtio-balloon.c
>> @@ -34,6 +34,7 @@
>> #include "hw/virtio/virtio-access.h"
>> 
>> #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
>> +#define CONT_PAGES_ORDER   9
>> 
>> typedef struct PartiallyBalloonedPage {
>>ram_addr_t base_gpa;
>> @@ -65,7 +66,8 @@ static bool 
>> virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
>> 
>> static void balloon_inflate_page(VirtIOBalloon *balloon,
>> MemoryRegion *mr, hwaddr mr_offset,
>> - PartiallyBalloonedPage *pbp)
>> + PartiallyBalloonedPage *pbp, 
>> + bool is_cont_pages)
>> {
>>void *addr = memory_region_get_ram_ptr(mr) + mr_offset;
>>ram_addr_t rb_offset, rb_aligned_offset, base_gpa;
>> @@ -76,6 +78,13 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
>>/* XXX is there a better way to get to the RAMBlock than via a
>> * host address? */
>>rb = qemu_ram_block_from_host(addr, false, &rb_offset);
>> +
>> +if (is_cont_pages) {
>> +ram_block_discard_range(rb, rb_offset,
>> +BALLOON_PAGE_SIZE << CONT_PAGES_ORDER);
>> +return;
>> +}
>> +
>>rb_page_size = qemu_ram_pagesize(rb);
>> 
>>if (rb_page_size == BALLOON_PAGE_SIZE) {
>> @@ -361,9 +370,10 @@ static void virtio_balloon_handle_output(VirtIODevice 
>> *vdev, VirtQueue *vq)
>>trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>>   pa);
>>if (!qemu_balloon_is_inhibited()) {
>> -if (vq == s->ivq) {
>> +if (vq == s->ivq || vq == s->icvq) {
>>balloon_inflate_page(s, section.mr,
>> - section.offset_within_region, 
>> &pbp);
>> + section.offset_within_region, &pbp,
>> + vq == s->icvq);
>>} else if (vq == s->dvq) {
>>balloon_deflate_page(s, section.mr, 
>> section.offset_within_region);
>>} else {
>> @@ -618,9 +628,12 @@ static size_t virtio_balloon_config_size(VirtIOBalloon 
>> *s)
>>if (s->qemu_4_0_config_size) {
>>return sizeof(struct virtio_balloon_config);
>>}
>> -if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
>> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
>>return sizeof(struct virtio_balloon_config);
>>}
>> +if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
>> +return offsetof(struct virtio_balloon_config, pages_order);
>> +}
>>if (virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
>>return offsetof(struct virtio_balloon_config, poison_val);
>>}
>> @@ -646,6 +659,10 @@ static void virtio_balloon_get_config(VirtIODevice 
>> *vdev, uint8_t *config_data)
>>   cpu_to_le32(VIRTIO_BALLOON_CMD_ID_DONE);
>>}
>> 
>> +if (virtio_has_feature(dev->host_features, 
>> VIRTIO_BALLOON_F_CONT_PAGES)) {
>> +config.pages_order = cpu_to_le32(CONT_PAGES_ORDER);
>> +}
>> +
>>trace_virtio_balloon_get_config(config.num_pages, config.actual);
>>memcpy(config_data, &config, virtio_balloon_config_size(dev));
>> }
>> @@ -816,6 +833,11 @@ static void virtio_balloon_device_realize(DeviceState 
>> *dev, Error **errp)
>>virtio_error(vdev, "iothread is missing");
>>}
>>}
>> +
>> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
>> +s->icvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
>> +}
>> +
>>reset_stats(s);
>> }
>> 
>> @@ -916,6 +938,8 @@ static Property virtio_balloon_properties[] = {
>>VIRTIO_BALLOON_F_DE

Re: [PATCH for Linux v2] virtio_balloon: Add VIRTIO_BALLOON_VQ_INFLATE_CONT to handle THP split issue

2020-03-26 Thread Michael S. Tsirkin
First, either the QEMU or the Linux version of any interface change
should be copied to the virtio TC.



On Mon, Mar 23, 2020 at 12:04:56AM +0800, Hui Zhu wrote:
> The first version is in [1].
> According to the comments from Michael and David, I updated the patch.
> 1. Added a separate vq, inflate_cont_vq, to transport inflated continuous
>    pages.
> 2. Set all the pages in the continuous range movable so that they can be
>    compacted.
> 3. Added a new element, pages_order, to virtio_balloon_config.  It is the
>    inflate pages order that is set by QEMU.
> 4. If the balloon cannot get any continuous pages from the system, it
>    goes back to using single pages to fill the balloon.
> 5. Use balloon_pages_alloc to allocate both single pages and continuous
>    pages.  Replace __GFP_NORETRY with __GFP_RETRY_MAYFAIL when allocating
>    continuous pages, because it increases the success rate of allocating
>    large chunks of memory.
> 
> The following is an introduction to the feature.
> If the guest kernel has many fragmented pages, using virtio_balloon
> will split QEMU's THPs when it calls the MADV_DONTNEED madvise to release
> the balloon pages.
> This is an example in a VM with 1G memory 1CPU:
> // This is the THP number before VM execution in the host.
> // None use THP.
> cat /proc/meminfo | grep AnonHugePages:
> AnonHugePages: 0 kB
> 
> // After VM start, use usemem
> // (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
> // The punch-holes function generates 400M of fragmented pages in the
> // guest kernel.
> usemem --punch-holes -s -1 800m &
> 
> // This is the THP number after this command in the host.
> // Some THP is used by VM because usemem will access 800M memory
> // in the guest.
> cat /proc/meminfo | grep AnonHugePages:
> AnonHugePages:976896 kB
> 
> // Connect to the QEMU monitor, setup balloon, and set it size to 600M.
> (qemu) device_add virtio-balloon-pci,id=balloon1
> (qemu) info balloon
> balloon: actual=1024
> (qemu) balloon 600
> (qemu) info balloon
> balloon: actual=600
> 
> // This is the THP number after inflate the balloon in the host.
> cat /proc/meminfo | grep AnonHugePages:
> AnonHugePages:151552 kB
> 
> The THP number decreased by more than 800M.
> The reason is that usemem with the punch-holes option frees every other
> page after allocation.  Then 400M of free memory inside the guest kernel
> consists of fragmented pages.
> The guest kernel will use them to inflate the balloon.  When these
> fragmented pages are freed, THPs will be split.
> 
> This commit tries to handle this by adding a new flag,
> VIRTIO_BALLOON_VQ_INFLATE_CONT.
> When this flag is set, the balloon will try to use continuous pages to
> inflate the balloon, and the pages order is set to the THP order.
> Then THP pages will be freed together in the host.
> This is an example in a VM with 1G memory 1CPU:
> // This is the THP number before VM execution in the host.
> // None use THP.
> cat /proc/meminfo | grep AnonHugePages:
> AnonHugePages: 0 kB
> 
> // After VM start, usemem's punch-holes function generates 400M of
> // fragmented pages in the guest kernel.
> usemem --punch-holes -s -1 800m &
> 
> // This is the THP number after this command in the host.
> // Some THP is used by VM because usemem will access 800M memory
> // in the guest.
> cat /proc/meminfo | grep AnonHugePages:
> AnonHugePages:976896 kB
> 
> // Connect to the QEMU monitor, setup balloon, and set it size to 600M.
> (qemu) device_add virtio-balloon-pci,id=balloon1,cont-pages=on
> (qemu) info balloon
> balloon: actual=1024
> (qemu) balloon 600
> (qemu) info balloon
> balloon: actual=600
> 
> // This is the THP number after inflate the balloon in the host.
> cat /proc/meminfo | grep AnonHugePages:
> AnonHugePages:610304 kB
> 
> The THP number decreases by 358M.  This shows that
> VIRTIO_BALLOON_VQ_INFLATE_CONT can help handle the THP split issue.
> 
> [1] https://lkml.org/lkml/2020/3/12/144
> 
> Signed-off-by: Hui Zhu 

I'd like to repeat my original idea of doing large page allocations
unconditionally within the guest.


> ---
>  drivers/virtio/virtio_balloon.c | 78 
> ++---
>  include/linux/balloon_compaction.h  |  9 -
>  include/uapi/linux/virtio_balloon.h |  3 ++
>  mm/balloon_compaction.c | 40 +++
>  4 files changed, 109 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 341458f..fbd2b02f 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -47,6 +47,7 @@ enum virtio_balloon_vq {
>   VIRTIO_BALLOON_VQ_DEFLATE,
>   VIRTIO_BALLOON_VQ_STATS,
>   VIRTIO_BALLOON_VQ_FREE_PAGE,
> + VIRTIO_BALLOON_VQ_INFLATE_CONT,
>   VIRTIO_BALLOON_VQ_MAX
>  };
>  
> @@ -56,7 +57,8 @@ enum virtio_balloon_config_read {
>  
>  struct virtio_balloon {
>   struct virtio_device *vdev;
> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *

[Bug 1869073] Re: qemu-arm-static crashes "segmentation fault" when running "git clone -s"

2020-03-26 Thread Laurent Vivier
What is the version of QEMU you are using?

** Changed in: qemu
 Assignee: (unassigned) => Laurent Vivier (laurent-vivier)

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1869073

Title:
  qemu-arm-static crashes "segmentation fault" when running "git clone
  -s"

Status in QEMU:
  New

Bug description:
  I want to use qemu-arm-static to cross-compile software. The compiler
  itself is a native cross-compiler connected via "distcc".

  The problem is that a script tries to do some stuff with "git", and
  with a "git clone -s" command the whole run reproducibly stops with
  a "segmentation fault".

  I don't know how to properly debug the issue, but 100% of the time I
  either get the "crash" or git just hangs forever with 100% CPU usage.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1869073/+subscriptions



Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

2020-03-26 Thread Michael S. Tsirkin
On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
> On 12.03.20 09:47, Michael S. Tsirkin wrote:
> > On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
> >> 2. You are essentially stealing THPs in the guest. So the fastest
> >> mapping (THP in guest and host) is gone. The guest won't be able to make
> >> use of THP where it previously was able to. I can imagine this implies a
> >> performance degradation for some workloads. This needs a proper
> >> performance evaluation.
> > 
> > I think the problem is more with the alloc_pages API.
> > That gives you exactly the given order, and if there's
> > a larger chunk available, it will split it up.
> > 
> > But for balloon - I suspect lots of other users,
> > we do not want to stress the system but if a large
> > chunk is available anyway, then we could handle
> > that more optimally by getting it all in one go.
> > 
> > 
> > So if we want to address this, IMHO this calls for a new API.
> > Along the lines of
> > 
> > struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
> > unsigned int max_order, unsigned int 
> > *order)
> > 
> > the idea would then be to return at a number of pages in the given
> > range.
> > 
> > What do you think? Want to try implementing that?
> 
> You can just start with the highest order and decrement the order until
> your allocation succeeds using alloc_pages(), which would be enough for
> a first version. At least I don't see the immediate need for a new
> kernel API.

Well there's still a chance of splitting a big page if one
becomes available meanwhile. But OK.

> -- 
> Thanks,
> 
> David / dhildenb




Re: [RFC for QEMU] virtio-balloon: Add option thp-order to set VIRTIO_BALLOON_F_THP_ORDER

2020-03-26 Thread Michael S. Tsirkin
On Tue, Mar 17, 2020 at 06:13:32PM +0800, teawater wrote:
> 
> 
> > On 2020-03-12 at 16:25, Michael S. Tsirkin wrote:
> > 
> > On Thu, Mar 12, 2020 at 03:49:55PM +0800, Hui Zhu wrote:
> >> If the guest kernel has many fragmented pages, using virtio_balloon
> >> will split QEMU's THPs when it calls the MADV_DONTNEED madvise to release
> >> the balloon pages.
> >> Setting the option thp-order to on will enable the flag
> >> VIRTIO_BALLOON_F_THP_ORDER.
> >> It will set the balloon page size to the THP size to handle the THP split
> >> issue.
> >> 
> >> Signed-off-by: Hui Zhu 
> > 
> > What's wrong with just using the PartiallyBalloonedPage machinery
> > instead? That would make it guest transparent.
> 
> In balloon_inflate_page:
> rb_page_size = qemu_ram_pagesize(rb);
> 
> if (rb_page_size == BALLOON_PAGE_SIZE) {
> /* Easy case */
> 
> It seems that PartiallyBalloonedPage is only used when rb_page_size is 
> greater than BALLOON_PAGE_SIZE.
> Do you mean I should modify the working mechanism of the
> balloon_inflate_page function?
> 
> Thanks,
> Hui

Yes, we can tweak it to unconditionally combine pages into
a huge page.


> > 
> >> ---
> >> hw/virtio/virtio-balloon.c  | 67 
> >> -
> >> include/standard-headers/linux/virtio_balloon.h |  4 ++
> >> 2 files changed, 47 insertions(+), 24 deletions(-)
> >> 
> >> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> >> index a4729f7..cfe86b0 100644
> >> --- a/hw/virtio/virtio-balloon.c
> >> +++ b/hw/virtio/virtio-balloon.c
> >> @@ -340,37 +340,49 @@ static void 
> >> virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> >> while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 
> >> 4) {
> >> unsigned int p = virtio_ldl_p(vdev, &pfn);
> >> hwaddr pa;
> >> +size_t handle_size = BALLOON_PAGE_SIZE;
> >> 
> >> pa = (hwaddr) p << VIRTIO_BALLOON_PFN_SHIFT;
> >> offset += 4;
> >> 
> >> -section = memory_region_find(get_system_memory(), pa,
> >> - BALLOON_PAGE_SIZE);
> >> -if (!section.mr) {
> >> -trace_virtio_balloon_bad_addr(pa);
> >> -continue;
> >> -}
> >> -if (!memory_region_is_ram(section.mr) ||
> >> -memory_region_is_rom(section.mr) ||
> >> -memory_region_is_romd(section.mr)) {
> >> -trace_virtio_balloon_bad_addr(pa);
> >> -memory_region_unref(section.mr);
> >> -continue;
> >> -}
> >> +if (virtio_has_feature(s->host_features,
> >> +   VIRTIO_BALLOON_F_THP_ORDER))
> >> +handle_size = BALLOON_PAGE_SIZE << 
> >> VIRTIO_BALLOON_THP_ORDER;
> >> +
> >> +while (handle_size > 0) {
> >> +section = memory_region_find(get_system_memory(), pa,
> >> + BALLOON_PAGE_SIZE);
> >> +if (!section.mr) {
> >> +trace_virtio_balloon_bad_addr(pa);
> >> +continue;
> >> +}
> >> +if (!memory_region_is_ram(section.mr) ||
> >> +memory_region_is_rom(section.mr) ||
> >> +memory_region_is_romd(section.mr)) {
> >> +trace_virtio_balloon_bad_addr(pa);
> >> +memory_region_unref(section.mr);
> >> +continue;
> >> +}
> >> 
> >> -
> >> trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >> -   pa);
> >> -if (!qemu_balloon_is_inhibited()) {
> >> -if (vq == s->ivq) {
> >> -balloon_inflate_page(s, section.mr,
> >> - section.offset_within_region, &pbp);
> >> -} else if (vq == s->dvq) {
> >> -balloon_deflate_page(s, section.mr, section.offset_within_region);
> >> -} else {
> >> -g_assert_not_reached();
> >> +trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >> +   pa);
> >> +if (!qemu_balloon_is_inhibited()) {
> >> +if (vq == s->ivq) {
> >> +balloon_inflate_page(s, section.mr,
> >> + section.offset_within_region,
> >> + &pbp);
> >> +} else if (vq == s->dvq) {
> >> +balloon_deflate_page(s, section.mr,
> >> + section.offset_within_region);
> >> +} else {
> >> +g_assert_not_reached();
> >> +}
> >> }
> >> +   
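
As an aside, the intent of the new while (handle_size > 0) loop is easier
to see in isolation. A minimal sketch follows (the per-page handler name
is hypothetical); note that both pa and handle_size advance on every
iteration so the walk terminates:

#include <stddef.h>
#include <stdint.h>

#define BALLOON_PAGE_SIZE 4096u

/* Hypothetical stand-in for the memory_region_find() +
 * balloon_inflate_page()/balloon_deflate_page() sequence above. */
void handle_one_balloon_page(uint64_t pa);

/*
 * Walk a request of handle_size bytes, one THP when
 * VIRTIO_BALLOON_F_THP_ORDER is negotiated and one 4 KiB page otherwise,
 * in BALLOON_PAGE_SIZE steps.
 */
static void handle_balloon_request(uint64_t pa, size_t handle_size)
{
    while (handle_size > 0) {
        handle_one_balloon_page(pa);
        pa += BALLOON_PAGE_SIZE;
        handle_size -= BALLOON_PAGE_SIZE;
    }
}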

Re: [PATCH for QEMU v2] virtio-balloon: Add option cont-pages to set VIRTIO_BALLOON_VQ_INFLATE_CONT

2020-03-26 Thread teawater
Ping.

Thanks,
Hui

> On Mar 23, 2020, at 00:04, Hui Zhu  wrote:
> 
> If the guest kernel has many fragmented pages, using virtio_balloon
> will split THPs in QEMU when it calls the MADV_DONTNEED madvise to
> release the balloon pages.
> Setting the option cont-pages to on enables the flag
> VIRTIO_BALLOON_VQ_INFLATE_CONT and sets the continuous-pages order to
> the THP order.
> QEMU then gets the PFNs of continuous pages from the VQ icvq and uses
> madvise MADV_DONTNEED to release the whole THP at once.
> This handles the THP split issue.
> 
> Signed-off-by: Hui Zhu 
> ---
> hw/virtio/virtio-balloon.c  | 32 +
> include/hw/virtio/virtio-balloon.h  |  4 +++-
> include/standard-headers/linux/virtio_balloon.h |  4 
> 3 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index a4729f7..88bdaca 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -34,6 +34,7 @@
> #include "hw/virtio/virtio-access.h"
> 
> #define BALLOON_PAGE_SIZE  (1 << VIRTIO_BALLOON_PFN_SHIFT)
> +#define CONT_PAGES_ORDER   9
> 
> typedef struct PartiallyBalloonedPage {
> ram_addr_t base_gpa;
> @@ -65,7 +66,8 @@ static bool virtio_balloon_pbp_matches(PartiallyBalloonedPage *pbp,
> 
> static void balloon_inflate_page(VirtIOBalloon *balloon,
>  MemoryRegion *mr, hwaddr mr_offset,
> - PartiallyBalloonedPage *pbp)
> + PartiallyBalloonedPage *pbp, 
> + bool is_cont_pages)
> {
> void *addr = memory_region_get_ram_ptr(mr) + mr_offset;
> ram_addr_t rb_offset, rb_aligned_offset, base_gpa;
> @@ -76,6 +78,13 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
> /* XXX is there a better way to get to the RAMBlock than via a
>  * host address? */
> rb = qemu_ram_block_from_host(addr, false, &rb_offset);
> +
> +if (is_cont_pages) {
> +ram_block_discard_range(rb, rb_offset,
> +BALLOON_PAGE_SIZE << CONT_PAGES_ORDER);
> +return;
> +}
> +
> rb_page_size = qemu_ram_pagesize(rb);
> 
> if (rb_page_size == BALLOON_PAGE_SIZE) {
> @@ -361,9 +370,10 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> trace_virtio_balloon_handle_output(memory_region_name(section.mr),
>pa);
> if (!qemu_balloon_is_inhibited()) {
> -if (vq == s->ivq) {
> +if (vq == s->ivq || vq == s->icvq) {
> balloon_inflate_page(s, section.mr,
> - section.offset_within_region, &pbp);
> + section.offset_within_region, &pbp,
> + vq == s->icvq);
> } else if (vq == s->dvq) {
> balloon_deflate_page(s, section.mr, section.offset_within_region);
> } else {
> @@ -618,9 +628,12 @@ static size_t virtio_balloon_config_size(VirtIOBalloon *s)
> if (s->qemu_4_0_config_size) {
> return sizeof(struct virtio_balloon_config);
> }
> -if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
> return sizeof(struct virtio_balloon_config);
> }
> +if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
> +return offsetof(struct virtio_balloon_config, pages_order);
> +}
> if (virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
> return offsetof(struct virtio_balloon_config, poison_val);
> }
> @@ -646,6 +659,10 @@ static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
>cpu_to_le32(VIRTIO_BALLOON_CMD_ID_DONE);
> }
> 
> +if (virtio_has_feature(dev->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
> +config.pages_order = cpu_to_le32(CONT_PAGES_ORDER);
> +}
> +
> trace_virtio_balloon_get_config(config.num_pages, config.actual);
> memcpy(config_data, &config, virtio_balloon_config_size(dev));
> }
> @@ -816,6 +833,11 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> virtio_error(vdev, "iothread is missing");
> }
> }
> +
> +if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_CONT_PAGES)) {
> +s->icvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> +}
> +
> reset_stats(s);
> }
> 
> @@ -916,6 +938,8 @@ static Property virtio_balloon_properties[] = {
> VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false),
> DEFINE_PROP_BIT("free-page-hint", VirtIOBalloon, host_features,
> VIRTIO_BALLOON_F_FREE_PAGE_HINT, false),
> +DEFINE_PROP_BIT("cont-pages", VirtIOBalloon, host_features,
> +VIRTIO_
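
For context on why discarding the whole range at once matters:
ram_block_discard_range() on anonymous memory comes down to
madvise(MADV_DONTNEED), and issuing it over a full, aligned THP-sized
range lets the kernel drop the huge page whole instead of splitting it.
A standalone sketch of that effect (the 2 MiB THP size and the manual
alignment are assumptions for illustration):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL
#define THP_ORDER 9                   /* assumed, matches CONT_PAGES_ORDER */
#define THP_SIZE  (PAGE_SIZE << THP_ORDER)

int main(void)
{
    /* Over-allocate so a THP-aligned 2 MiB region can be carved out. */
    size_t span = THP_SIZE * 2;
    char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }
    char *buf = (char *)(((uintptr_t)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));

    madvise(buf, THP_SIZE, MADV_HUGEPAGE);  /* hint: back this with a THP */

    /* Touch every page so the range is actually populated. */
    for (size_t off = 0; off < THP_SIZE; off += PAGE_SIZE) {
        buf[off] = 1;
    }

    /*
     * One MADV_DONTNEED over the full aligned range: the kernel can drop
     * the THP whole. Releasing it 4 KiB at a time would split it first.
     */
    if (madvise(buf, THP_SIZE, MADV_DONTNEED) != 0) {
        perror("madvise");
        return EXIT_FAILURE;
    }

    munmap(raw, span);
    return EXIT_SUCCESS;
}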

<    1   2   3   4