[Bug 1749393] Re: sbrk() not working under qemu-user with a PIE-compiled binary?

2021-11-30 Thread Christian Ehrhardt 
Focal

old

$ sudo apt install --reinstall qemu-user-static=1:4.2-3ubuntu6.18
Reading package lists... Done
Building dependency tree   
Reading state information... Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 0 not upgraded.
Need to get 21.3 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 
qemu-user-static amd64 1:4.2-3ubuntu6.18 [21.3 MB]
Fetched 21.3 MB in 1s (16.4 MB/s)   
(Reading database ... 126154 files and directories currently installed.)
Preparing to unpack .../qemu-user-static_1%3a4.2-3ubuntu6.18_amd64.deb ...
Unpacking qemu-user-static (1:4.2-3ubuntu6.18) over (1:4.2-3ubuntu6.18) ...
Setting up qemu-user-static (1:4.2-3ubuntu6.18) ...
Processing triggers for man-db (2.9.1-1) ...

ubuntu@f-1928075-qemuuserstatic:~$ sudo chroot /home/ubuntu/bullseye-arm64 
/bin/sh /debootstrap/debootstrap --second-stage
W: Failure trying to run:  /sbin/ldconfig
W: See //debootstrap/debootstrap.log for details
ubuntu@f-1928075-qemuuserstatic:~$ tail -n 2 
bullseye-arm64/debootstrap/debootstrap.log
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault (core dumped)

Upgrade


ubuntu@f-1928075-qemuuserstatic:~$ apt-cache policy qemu-user-static
qemu-user-static:
  Installed: 1:4.2-3ubuntu6.18
  Candidate: 1:4.2-3ubuntu6.19
  Version table:
 1:4.2-3ubuntu6.19 500
500 http://archive.ubuntu.com/ubuntu focal-proposed/universe amd64 
Packages
 *** 1:4.2-3ubuntu6.18 500
500 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 
Packages
100 /var/lib/dpkg/status
 1:4.2-3ubuntu6.17 500
500 http://security.ubuntu.com/ubuntu focal-security/universe amd64 
Packages
 1:4.2-3ubuntu6 500
500 http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages
ubuntu@f-1928075-qemuuserstatic:~$ sudo apt install qemu-user-static
Reading package lists... Done
Building dependency tree   
Reading state information... Done
The following packages will be upgraded:
  qemu-user-static
1 upgraded, 0 newly installed, 0 to remove and 65 not upgraded.
Need to get 21.3 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-proposed/universe amd64 
qemu-user-static amd64 1:4.2-3ubuntu6.19 [21.3 MB]
Fetched 21.3 MB in 2s (9092 kB/s)   
(Reading database ... 126160 files and directories currently installed.)
Preparing to unpack .../qemu-user-static_1%3a4.2-3ubuntu6.19_amd64.deb ...
Unpacking qemu-user-static (1:4.2-3ubuntu6.19) over (1:4.2-3ubuntu6.18) ...
Setting up qemu-user-static (1:4.2-3ubuntu6.19) ...
Processing triggers for man-db (2.9.1-1) ...
ubuntu@f-1928075-qemuuserstatic:~$ sudo update-binfmts  --test --display  
qemu-aarch64
qemu-aarch64 (enabled):
 package = qemu-user-static
type = magic
  offset = 0
   magic = 
\x7f\x45\x4c\x46\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\xb7\x00
mask = 
\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff
 interpreter = /usr/bin/qemu-aarch64-static
detector = 
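
For reference, the registered magic/mask pair simply matches the start of an
AArch64 ELF binary; a decoding of the 20 bytes (standard Elf64_Ehdr layout,
annotated here for clarity, not part of the original report):

/*
 * bytes  0-3   7f 45 4c 46   ELF magic ("\x7fELF")
 * byte   4     02            ELFCLASS64
 * byte   5     01            little-endian
 * byte   6     01            EV_CURRENT
 * byte   7     00            EI_OSABI, masked out (mask byte is 0x00)
 * bytes  8-15  00 ... 00     padding
 * bytes 16-17  02 00         e_type = ET_EXEC; the 0xfe mask byte also
 *                            admits ET_DYN (3), i.e. PIE executables
 * bytes 18-19  b7 00         e_machine = EM_AARCH64
 */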


Test with new version

ubuntu@f-1928075-qemuuserstatic:~$ sudo chroot /home/ubuntu/bullseye-arm64 
/bin/sh /debootstrap/debootstrap --second-stage
I: Installing core packages...
W: Failure trying to run:  dpkg --force-depends --install 
/var/cache/apt/archives/base-passwd_3.5.51_arm64.deb
W: See //debootstrap/debootstrap.log for details
ubuntu@f-1928075-qemuuserstatic:~$ tail -n 2 
bullseye-arm64/debootstrap/debootstrap.log
dpkg: error: parsing file '/var/lib/dpkg/status' near line 5 package 'dpkg':
 duplicate value for 'Package' field


That is the good case, and a full run now completes as well.

$ sudo rm -rf bullseye-arm64; sudo qemu-debootstrap --arch=arm64 bullseye 
bullseye-arm64 http://ftp.debian.org/debian
I: Running command: debootstrap --arch arm64 --foreign bullseye bullseye-arm64 
http://ftp.debian.org/debian
W: Cannot check Release signature; keyring file not available 
/usr/share/keyrings/debian-archive-keyring.gpg
I: Retrieving InRelease 
I: Retrieving Packages 
...
I: Configuring tasksel...
I: Configuring libc-bin...
I: Base system installed successfully.


I can't run the docker test due to networking restrictions, but it was
the same fault and the same fix - so that should be ok. If anyone else
can test -proposed with docker please feel free to do so.
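
For reference, the underlying sbrk()-under-PIE problem can be exercised with
a small test program along these lines (hypothetical reproducer, not part of
the SRU verification; build it as a PIE aarch64 binary, e.g. with -fPIE -pie,
and run it through qemu-aarch64-static):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);

    /* With the broken qemu-user-static this fails or the process faults. */
    if (sbrk(4096) == (void *) -1) {
        perror("sbrk");
        return 1;
    }
    printf("brk moved from %p to %p\n", before, sbrk(0));
    return 0;
}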

** Tags removed: verification-needed verification-needed-focal
** Tags added: verification-done verification-done-focal

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1749393

Title:
  sbrk() not working under qemu-user with a PIE-compiled binary?

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

 

Re: [PATCH v4] target/ppc: fix Hash64 MMU update of PTE bit R

2021-11-30 Thread Cédric Le Goater

On 11/30/21 21:00, Leandro Lupori wrote:

On 30/11/2021 05:44, Cédric Le Goater wrote:

It would be interesting to boot the PowerNV machine directly from a
FreeBSD kernel and a minimal initrd, without using the skiroot images
and an ISO. Are images available?


AFAIK there are no minimum initrd images available. The closest thing would be the 
"bootonly" ISOs that can be found in the links below:

https://download.freebsd.org/ftp/releases/powerpc/powerpc64/ISO-IMAGES/13.0/

https://download.freebsd.org/ftp/releases/powerpc/powerpc64le/ISO-IMAGES/13.0/

https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64/ISO-IMAGES/14.0/

https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64le/ISO-IMAGES/14.0/


But to avoid using skiroot, you would need to manually extract the kernel from the ISO 
and also specify the rootfs, using something like "-append cd9660:cd0".
If you don't want to emulate a disk or CD, the ISO can be passed to -initrd and 
"-append cd9660:md0" may be used to tell FreeBSD where to find the root 
partition (it loads it as a memory disk).


The ISO is too big for quick tests. Isn't there a minimal initrd? Can we build
a buildroot-like image for FreeBSD?



There are also qcow2 snapshots available:

https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64/

https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64le/

But you'll also need to extract the kernel from the image or from kernel.txz to 
boot them.


ok. I will take a look.


As these images target pseries and lack a FAT32 partition, Petitboot won't be 
able to boot them.


The idea would be to skip Petitboot and load the FreeBSD kernel directly from
skiboot with a minimal initrd or disk image. See:

  
https://github.com/legoater/qemu-ppc-boot/blob/main/buildroot/qemu_ppc64le_powernv-2021.11-rc1-199-g927444a04e-20211120/start-qemu.sh

or

  
https://github.com/legoater/qemu-ppc-boot/blob/main/buildroot/qemu_ppc64le_pseries-2021.11-rc1-199-g927444a04e-20211120/start-qemu.sh

Thanks,

C.



Alfredo (cc'ed) was trying to make them Petitboot compatible.






[PATCH] QIO: Add force_quit to the QIOChannel to ensure QIO exits cleanly in some corner case

2021-11-30 Thread Rao, Lei
We found that the QIO channel coroutine could not be awakened in some 
corner cases during our stress tests for COLO.
This patch fixes the following issue:
#0  0x7fad72e24bf6 in __ppoll (fds=0x5563d75861f0, nfds=1, 
timeout=, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x5563d6977c68 in qemu_poll_ns (fds=0x5563d75861f0, nfds=1, 
timeout=59697630) at util/qemu-timer.c:348
#2  0x5563d697ac44 in aio_poll (ctx=0x5563d755dfa0, blocking=true) 
at util/aio-posix.c:669
#3  0x5563d68ba24f in bdrv_do_drained_begin (bs=0x5563d75ea0c0, 
recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at 
block/io.c:432
#4  0x5563d68ba338 in bdrv_drained_begin (bs=0x5563d75ea0c0) at 
block/io.c:438
#5  0x5563d689c6d2 in quorum_del_child (bs=0x5563d75ea0c0, 
child=0x5563d7908600, errp=0x7fff3cf5b960) at block/quorum.c:1063
#6  0x5563d684328f in bdrv_del_child (parent_bs=0x5563d75ea0c0, 
child=0x5563d7908600, errp=0x7fff3cf5b960) at block.c:6568
#7  0x5563d658499e in qmp_x_blockdev_change (parent=0x5563d80ec4c0 
"colo-disk0", has_child=true, child=0x5563d7577d30 "children.1", 
has_node=false, node=0x0,errp=0x7fff3cf5b960) at blockdev.c:4494
#8  0x5563d67e8b4e in qmp_marshal_x_blockdev_change 
(args=0x7fad6400ada0, ret=0x7fff3cf5b9f8, errp=0x7fff3cf5b9f0) at 
qapi/qapi-commands-block-core.c:1538
#9  0x5563d691cd9e in do_qmp_dispatch (cmds=0x5563d719a4f0 
, request=0x7fad64009d80, allow_oob=true, errp=0x7fff3cf5ba98)
at qapi/qmp-dispatch.c:132
#10 0x5563d691cfab in qmp_dispatch (cmds=0x5563d719a4f0 
, request=0x7fad64009d80, allow_oob=true) at 
qapi/qmp-dispatch.c:175
#11 0x5563d67b7787 in monitor_qmp_dispatch (mon=0x5563d7605d40, 
req=0x7fad64009d80) at monitor/qmp.c:145
#12 0x5563d67b7b24 in monitor_qmp_bh_dispatcher (data=0x0) at 
monitor/qmp.c:234
#13 0x5563d69754c2 in aio_bh_call (bh=0x5563d7473230) at 
util/async.c:89
#14 0x5563d697555e in aio_bh_poll (ctx=0x5563d7471da0) at 
util/async.c:117
#15 0x5563d697a423 in aio_dispatch (ctx=0x5563d7471da0) at 
util/aio-posix.c:459
#16 0x5563d6975917 in aio_ctx_dispatch (source=0x5563d7471da0, 
callback=0x0, user_data=0x0) at util/async.c:260
#17 0x7fad730e1fbd in g_main_context_dispatch () from 
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x5563d6978cda in glib_pollfds_poll () at util/main-loop.c:219
#19 0x5563d6978d58 in os_host_main_loop_wait (timeout=977650) at 
util/main-loop.c:242
#20 0x5563d6978e69 in main_loop_wait (nonblocking=0) at 
util/main-loop.c:518
#21 0x5563d658f5ed in main_loop () at vl.c:1814
#22 0x5563d6596bb7 in main (argc=61, argv=0x7fff3cf5c0c8, 
envp=0x7fff3cf5c2b8) at vl.c:450

From the call trace, we can see that the QEMU main thread is waiting for 
in_flight to return to zero, but in_flight never reaches 0. After analysis, 
the root cause is that the NBD coroutine was not awakened. Although this bug 
was found in the COLO test, I think it is a general problem in the QIO module; 
it also affects other modules that depend on QIO, such as NBD. We dumped the 
following data:
$2 = {
  in_flight = 2,
  state = NBD_CLIENT_QUIT,
  connect_status = 0,
  connect_err = 0x0,
  wait_in_flight = false,
  requests = {{
  coroutine = 0x0,
  offset = 273077071872,
  receiving = false,
}, {
  coroutine = 0x7f1164571df0,
  offset = 498792529920,
  receiving = false,
}, {
  coroutine = 0x0,
  offset = 500663590912,
  receiving = false,
}, {
  ...
} },
  reply = {
simple = {
  magic = 1732535960,
  error = 0,
  handle = 0
},
structured = {
  magic = 1732535960,
  flags = 0,
  type = 0,
  handle = 0,
  length = 0
},
{
  magic = 1732535960,
  _skip = 0,
  handle = 0
}
  },
  bs = 0x7f11640e2140,
  reconnect_delay = 0,
  saddr = 0x7f11640e1f80,
  export = 0x7f11640dc560 "parent0",
}
From the data, we can see that the NBD coroutine does not exit normally when 
the NBD server is killed (we kill the Secondary VM in the COLO stress test). 
It then never executes in_flight--, so the QEMU main thread hangs there. On 
further analysis, I found that the NBD coroutine can yield in 
nbd_send_request() or qio_channel_write_all() within nbd_co_send_request(); 
the yield happens because the kernel returns EAGAIN (under the stress test). 
However, NBD did not expect to yield there, so nbd_recv_coroutines_wake won't 
wake it up because req->receiving is false. If the NBD server 
is terminated abnormally at the same time, the G_IO_OUT will be invalid, 
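
For reference, the wake-up helper mentioned above (nbd_recv_coroutines_wake_all()
in block/nbd.c at the time, shown here approximately) only wakes requests whose
receiving flag is set, which is why a coroutine parked in the send path is missed:

static void nbd_recv_coroutines_wake_all(BDRVNBDState *s)
{
    int i;

    for (i = 0; i < MAX_NBD_REQUESTS; i++) {
        NBDClientRequest *req = &s->requests[i];

        /* Only receivers are woken; a sender blocked on EAGAIN is not. */
        if (req->coroutine && req->receiving) {
            aio_co_wake(req->coroutine);
        }
    }
}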

[PATCH v2] docs: fix section numbering in pcie.txt

2021-11-30 Thread Ani Sinha
There is no 5.2 section. Section 5.3 should really be 5.2. Fix it.

fixes: 453ac8835b0022 ("docs: add PCIe devices placement guidelines")

Reviewed-by: Liu Yi L 
Signed-off-by: Ani Sinha 
---
 docs/pcie.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/pcie.txt b/docs/pcie.txt
index 89e3502075..90310b0c5e 100644
--- a/docs/pcie.txt
+++ b/docs/pcie.txt
@@ -262,7 +262,7 @@ PCI Express Root Ports (and PCI Express Downstream Ports).
 Port, which may come handy for hot-plugging another device.
 
 
-5.3 Hot-plug example:
+5.2 Hot-plug example:
 Using HMP: (add -monitor stdio to QEMU command line)
  device_add <dev>,id=<id>,bus=<PCI Express Root Port Id/PCI Express Downstream Port Id/PCI-PCI Bridge Id/>
 
-- 
2.25.1




RE: [PATCH] docs: fix section numbering in pcie.txt

2021-11-30 Thread Liu, Yi L
> From: Qemu-devel 
> On Behalf Of Ani Sinha
> Sent: Wednesday, December 1, 2021 2:43 PM
> 
> There is no 5.2 section. Section 5.3 should really be 5.2. Fix it.

Reviewed-by: Liu Yi L 

BTW, is a Fixes tag needed?

Regards,
Yi Liu

> Signed-off-by: Ani Sinha 
> ---
>  docs/pcie.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/docs/pcie.txt b/docs/pcie.txt
> index 89e3502075..90310b0c5e 100644
> --- a/docs/pcie.txt
> +++ b/docs/pcie.txt
> @@ -262,7 +262,7 @@ PCI Express Root Ports (and PCI Express Downstream
> Ports).
>  Port, which may come handy for hot-plugging another device.
> 
> 
> -5.3 Hot-plug example:
> +5.2 Hot-plug example:
>  Using HMP: (add -monitor stdio to QEMU command line)
>    device_add <dev>,id=<id>,bus=<PCI Express Root Port Id/PCI Express Downstream Port Id/PCI-PCI Bridge Id/>
> 
> --
> 2.25.1
> 




[PATCH] docs: fix section numbering in pcie.txt

2021-11-30 Thread Ani Sinha
There is no 5.2 section. Section 5.3 should really be 5.2. Fix it.

Signed-off-by: Ani Sinha 
---
 docs/pcie.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/pcie.txt b/docs/pcie.txt
index 89e3502075..90310b0c5e 100644
--- a/docs/pcie.txt
+++ b/docs/pcie.txt
@@ -262,7 +262,7 @@ PCI Express Root Ports (and PCI Express Downstream Ports).
 Port, which may come handy for hot-plugging another device.
 
 
-5.3 Hot-plug example:
+5.2 Hot-plug example:
 Using HMP: (add -monitor stdio to QEMU command line)
  device_add <dev>,id=<id>,bus=<PCI Express Root Port Id/PCI Express Downstream Port Id/PCI-PCI Bridge Id/>
 
-- 
2.25.1




Re: [PULL 0/1] MAINTAINERS update

2021-11-30 Thread Richard Henderson

On 11/30/21 9:47 PM, Eduardo Habkost wrote:

* MAINTAINERS: Change my email address (Eduardo Habkost)

Eduardo Habkost (1):
   MAINTAINERS: Change my email address

  MAINTAINERS | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)



Not a pull request.  But since it's just one patch that needs no regression testing, I 
have applied it directly.



r~



Re: [PATCH] memory: Fix incorrect calls of log_global_start/stop

2021-11-30 Thread Peter Xu
On Tue, Nov 30, 2021 at 04:03:10PM +0100, David Hildenbrand wrote:
> On 30.11.21 09:00, Peter Xu wrote:
> > We should only call the log_global_start/stop when the global dirty track
> > bitmask changes from zero<->non-zero.
> > 
> > No real issue reported for this yet probably because no immediate user to
> > enable both dirty rate measurement and migration at the same time.  However
> > it'll be good to be prepared for it.
> > 
> > Fixes: 63b41db4bc ("memory: make global_dirty_tracking a bitmask")
> > Cc: Hyman Huang 
> > Cc: Paolo Bonzini 
> > Cc: Dr. David Alan Gilbert 
> > Cc: Juan Quintela 
> > Cc: David Hildenbrand 
> > Signed-off-by: Peter Xu 
> 
> LGTM
> 
> Reviewed-by: David Hildenbrand 

Thanks for the quick review, David.

Just a heads-up that I think it'll be nice to have this as 6.2-rc3 material.
The QEMU planning page told me rc3 has just passed (Nov 30th), but I still
haven't seen it released, so I'm not sure..

I won't call it a blocker though, so if it misses 6.2 we'll just copy stable, and
it'll be needed for the 6.2 stable branch only.

-- 
Peter Xu




Re: [PATCH 1/1] hw/arm/virt: Support for virtio-mem-pci

2021-11-30 Thread Gavin Shan

On 11/30/21 8:37 PM, David Hildenbrand wrote:

On 30.11.21 01:33, Gavin Shan wrote:

This adds support for the virtio-mem-pci device on the "virt" platform, by
simply following the x86 implementation.


Thanks for picking this up!



Thanks, David.



* The patch was written by David Hildenbrand and
  modified by Jonathan Cameron


Maybe replace this section by

Co-developed-by: David Hildenbrand 
Co-developed-by: Jonathan Cameron 



Yes, it will be included into v2.



* This implements the hotplug handlers to support virtio-mem-pci
  device hot-add, while hot-remove isn't supported, as on x86.

* The block size is 1GB on ARM64 instead of 128MB on x86.


See below, isn't it actually 512 MiB nowadays?



I think so.



* It has been passing the tests with various combinations like 64KB
  and 4KB page sizes on host and guest, different memory device
  backends like normal, transparent huge page and HugeTLB, plus
  migration.


Perfect. A note that hugetlbfs isn't fully supported/safe to use until
we have preallocation support in QEMU (WIP).



Yes, there are some warnings raised asking to enlarge 'request-size' on
hosts with a 64KB page size. Note that the memory backends I used
in the tests always have the "prealloc=on" property, though.



Signed-off-by: Gavin Shan 
---
  hw/arm/Kconfig |  1 +
  hw/arm/virt.c  | 68 +-
  hw/virtio/virtio-mem.c |  2 ++
  3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 2d37d29f02..15aff8efb8 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -27,6 +27,7 @@ config ARM_VIRT
  select DIMM
  select ACPI_HW_REDUCED
  select ACPI_APEI
+select VIRTIO_MEM_SUPPORTED
  
  config CHEETAH

  bool
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 369552ad45..f4599a5ef0 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -71,9 +71,11 @@
  #include "hw/arm/smmuv3.h"
  #include "hw/acpi/acpi.h"
  #include "target/arm/internals.h"
+#include "hw/mem/memory-device.h"
  #include "hw/mem/pc-dimm.h"
  #include "hw/mem/nvdimm.h"
  #include "hw/acpi/generic_event_device.h"
+#include "hw/virtio/virtio-mem-pci.h"
  #include "hw/virtio/virtio-iommu.h"
  #include "hw/char/pl011.h"
  #include "qemu/guest-random.h"
@@ -2480,6 +2482,63 @@ static void virt_memory_plug(HotplugHandler *hotplug_dev,
   dev, &error_abort);
  }
  
+static void virt_virtio_md_pci_pre_plug(HotplugHandler *hotplug_dev,

+DeviceState *dev, Error **errp)
+{
+HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
+Error *local_err = NULL;
+
+if (!hotplug_dev2 && dev->hotplugged) {
+/*
+ * Without a bus hotplug handler, we cannot control the plug/unplug
+ * order. We should never reach this point when hotplugging on x86,
+ * however, better add a safety net.
+ */
+error_setg(errp, "hotplug of virtio based memory devices not supported"
+   " on this bus.");
+return;
+}
+/*
+ * First, see if we can plug this memory device at all. If that
+ * succeeds, branch of to the actual hotplug handler.
+ */
+memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev), NULL,
+   &local_err);
+if (!local_err && hotplug_dev2) {
+hotplug_handler_pre_plug(hotplug_dev2, dev, &local_err);
+}
+error_propagate(errp, local_err);
+}
+
+static void virt_virtio_md_pci_plug(HotplugHandler *hotplug_dev,
+DeviceState *dev, Error **errp)
+{
+HotplugHandler *hotplug_dev2 = qdev_get_bus_hotplug_handler(dev);
+Error *local_err = NULL;
+
+/*
+ * Plug the memory device first and then branch off to the actual
+ * hotplug handler. If that one fails, we can easily undo the memory
+ * device bits.
+ */
+memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+if (hotplug_dev2) {
+hotplug_handler_plug(hotplug_dev2, dev, &local_err);
+if (local_err) {
+memory_device_unplug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
+}
+}
+error_propagate(errp, local_err);
+}
+
+static void virt_virtio_md_pci_unplug_request(HotplugHandler *hotplug_dev,
+  DeviceState *dev, Error **errp)
+{
+/* We don't support hot unplug of virtio based memory devices */
+error_setg(errp, "virtio based memory devices cannot be unplugged.");
+}
+
+
  static void virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
  DeviceState *dev, Error **errp)
  {
@@ -2513,6 +2572,8 @@ static void 
virt_machine_device_pre_plug_cb(HotplugHandler *hotplug_dev,
  qdev_prop_set_uint32(dev, "len-reserved-regions", 1);
  qdev_prop_set_string(dev, "reserved-regions[0]", resv_prop_str);
  g_free(resv_prop_str);
+} else if 

Re: [PATCH v2 1/2] ppc/pnv.c: add a friendly warning when accel=kvm is used

2021-11-30 Thread David Gibson
On Tue, Nov 30, 2021 at 10:31:52AM -0300, Daniel Henrique Barboza wrote:
> If one tries to use -machine powernv9,accel=kvm in a Power9 host, a
> cryptic error will be shown:
> 
> qemu-system-ppc64: Register sync failed... If you're using kvm-hv.ko, only 
> "-cpu host" is possible
> qemu-system-ppc64: kvm_init_vcpu: kvm_arch_init_vcpu failed (0): Invalid 
> argument
> 
> Appending '-cpu host' will throw another error:
> 
> qemu-system-ppc64: invalid chip model 'host' for powernv9 machine
> 
> The root cause is that in IBM PowerPC we have different specs for bare metal
> and for guests. Bare metal follows OPAL, guests follow PAPR. The KVM modules
> present in ppc kernels implement PAPR. This means that we can't use KVM
> acceleration with the powernv machine, which emulates the bare-metal host.
> 
> All that said, let's give a more informative error in this case.
> 
> Signed-off-by: Daniel Henrique Barboza 

Reviewed-by: David Gibson 

> ---
>  hw/ppc/pnv.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
> index 71e45515f1..e5b87e8730 100644
> --- a/hw/ppc/pnv.c
> +++ b/hw/ppc/pnv.c
> @@ -742,6 +742,11 @@ static void pnv_init(MachineState *machine)
>  DriveInfo *pnor = drive_get(IF_MTD, 0, 0);
>  DeviceState *dev;
>  
> +if (kvm_enabled()) {
> +error_report("The powernv machine does not work with KVM 
> acceleration");
> +exit(EXIT_FAILURE);
> +}
> +
>  /* allocate RAM */
>  if (machine->ram_size < mc->default_ram_size) {
>  char *sz = size_to_str(mc->default_ram_size);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v8 06/10] target/ppc: enable PMU instruction count

2021-11-30 Thread David Gibson
On Tue, Nov 30, 2021 at 07:24:04PM -0300, Daniel Henrique Barboza wrote:
> 
> 
> On 11/29/21 01:36, David Gibson wrote:
> > On Thu, Nov 25, 2021 at 12:08:13PM -0300, Daniel Henrique Barboza wrote:
> > > The PMU is already counting cycles by calculating time elapsed in
> > > nanoseconds. Counting instructions is a different matter and requires
> > > another approach.
> > > 
> > > This patch adds the capability of counting completed instructions
> > > (Perf event PM_INST_CMPL) by counting the number of instructions
> > > translated in each translation block right before exiting it.
> > > 
> > > A new pmu_count_insns() helper in translation.c was added to do that.
> > > After verifying that the PMU is running (MMCR0_FC bit not set), call
> > > helper_insns_inc(). This new helper from power8-pmu.c will add the
> > > instructions to the relevant counters. It'll also be responsible for
> > > triggering counter negative overflows as it is already being done with
> > > cycles.
> > > 
> > > Signed-off-by: Daniel Henrique Barboza 
> > > ---
> > >   target/ppc/cpu.h |  1 +
> > >   target/ppc/helper.h  |  1 +
> > >   target/ppc/helper_regs.c |  4 +++
> > >   target/ppc/power8-pmu-regs.c.inc |  6 +
> > >   target/ppc/power8-pmu.c  | 38 ++
> > >   target/ppc/translate.c   | 46 
> > >   6 files changed, 96 insertions(+)
> > > 
> > > diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
> > > index 9b41b022e2..38cd2b5c43 100644
> > > --- a/target/ppc/cpu.h
> > > +++ b/target/ppc/cpu.h
> > > @@ -656,6 +656,7 @@ enum {
> > >   HFLAGS_PR = 14,  /* MSR_PR */
> > >   HFLAGS_PMCC0 = 15,  /* MMCR0 PMCC bit 0 */
> > >   HFLAGS_PMCC1 = 16,  /* MMCR0 PMCC bit 1 */
> > > +HFLAGS_MMCR0FC = 17, /* MMCR0 FC bit */
> > 
> > Now that the event stuff is a bit more refined, you could narrow this
> > down to specifically marking if any counters are actively counting
> > instructions (not frozen by MMCR0[FC] and not frozen by
> > MMCR0[FC14|FC56] *and* have the right event selected).
> > 
> > Since I suspect the instruction counting instrumentation could be
> > quite expensive (helper call on every tb), that might be worthwhile.
> 
> That was worthwhile. The performance increase is substantial with this
> change, in particular with tests that exercise only cycle events.

Good to know.
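
A minimal sketch of the kind of check being suggested (helper and event names
assumed from this series, not the final code): only set the hflag when some
PMC is actually programmed to count instructions and counting is not frozen.

static bool pmu_insn_counting_enabled(CPUPPCState *env)
{
    int sprn;

    if (env->spr[SPR_POWER_MMCR0] & MMCR0_FC) {
        return false;   /* all counters frozen */
    }

    /* PMC6 never counts instructions, so only PMC1..PMC5 matter here. */
    for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC5; sprn++) {
        if (pmc_get_event(env, sprn) == PMU_EVENT_INSTRUCTIONS) {
            return true;
        }
    }
    return false;
}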

> > >   HFLAGS_VSX = 23, /* MSR_VSX if cpu has VSX */
> > >   HFLAGS_VR = 25,  /* MSR_VR if cpu has VRE */
> > > diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> > > index 94b4690375..d8a23e054a 100644
> > > --- a/target/ppc/helper.h
> > > +++ b/target/ppc/helper.h
> > > @@ -24,6 +24,7 @@ DEF_HELPER_2(store_mmcr0, void, env, tl)
> > >   DEF_HELPER_2(store_mmcr1, void, env, tl)
> > >   DEF_HELPER_3(store_pmc, void, env, i32, i64)
> > >   DEF_HELPER_2(read_pmc, tl, env, i32)
> > > +DEF_HELPER_2(insns_inc, void, env, i32)
> > >   #endif
> > >   DEF_HELPER_1(check_tlb_flush_local, void, env)
> > >   DEF_HELPER_1(check_tlb_flush_global, void, env)
> > > diff --git a/target/ppc/helper_regs.c b/target/ppc/helper_regs.c
> > > index 99562edd57..875c2fdfc6 100644
> > > --- a/target/ppc/helper_regs.c
> > > +++ b/target/ppc/helper_regs.c
> > > @@ -115,6 +115,10 @@ static uint32_t 
> > > hreg_compute_hflags_value(CPUPPCState *env)
> > >   if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCC1) {
> > >   hflags |= 1 << HFLAGS_PMCC1;
> > >   }
> > > +if (env->spr[SPR_POWER_MMCR0] & MMCR0_FC) {
> > > +hflags |= 1 << HFLAGS_MMCR0FC;
> > > +}
> > > +
> > >   #ifndef CONFIG_USER_ONLY
> > >   if (!env->has_hv_mode || (msr & (1ull << MSR_HV))) {
> > > diff --git a/target/ppc/power8-pmu-regs.c.inc 
> > > b/target/ppc/power8-pmu-regs.c.inc
> > > index 25b13ad564..580e4e41b2 100644
> > > --- a/target/ppc/power8-pmu-regs.c.inc
> > > +++ b/target/ppc/power8-pmu-regs.c.inc
> > > @@ -113,6 +113,12 @@ static void write_MMCR0_common(DisasContext *ctx, 
> > > TCGv val)
> > >*/
> > >   gen_icount_io_start(ctx);
> > >   gen_helper_store_mmcr0(cpu_env, val);
> > > +
> > > +/*
> > > + * End the translation block because MMCR0 writes can change
> > > + * ctx->pmu_frozen.
> > > + */
> > > +ctx->base.is_jmp = DISAS_EXIT_UPDATE;
> > >   }
> > >   void spr_write_MMCR0_ureg(DisasContext *ctx, int sprn, int gprn)
> > > diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
> > > index 01e0b9b8fc..59d0def79d 100644
> > > --- a/target/ppc/power8-pmu.c
> > > +++ b/target/ppc/power8-pmu.c
> > > @@ -112,6 +112,30 @@ static PMUEventType pmc_get_event(CPUPPCState *env, 
> > > int sprn)
> > >   return evt_type;
> > >   }
> > > +static bool pmu_increment_insns(CPUPPCState *env, uint32_t num_insns)
> > > +{
> > > +bool overflow_triggered = false;
> > > +int sprn;
> > > +
> > > +/* PMC6 never counts instructions */
> > > +for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC5; sprn++) {
> > 

Re: [PATCH v4] target/ppc: fix Hash64 MMU update of PTE bit R

2021-11-30 Thread David Gibson
On Tue, Nov 30, 2021 at 09:44:58AM +0100, Cédric le Goater wrote:
> On 11/30/21 01:30, David Gibson wrote:
> > On Mon, Nov 29, 2021 at 03:57:51PM -0300, Leandro Lupori wrote:
> > > When updating the R bit of a PTE, the Hash64 MMU was using a wrong byte
> > > offset, causing the first byte of the adjacent PTE to be corrupted.
> > > This caused a panic when booting FreeBSD, using the Hash MMU.
> > > 
> > > Fixes: a2dd4e83e76b ("ppc/hash64: Rework R and C bit updates")
> > > Signed-off-by: Leandro Lupori 
> > 
> > Reviewed-by: David Gibson 
> 
> Sorry, I didn't wait for your Rb because the patch was a good candidate
> for -rc3. It is merged now.

No worries.

> 
> > Thanks for your patience with our nitpicking :).
> 
> Yes,
> 
> Here is another QEMU bug found by FreeBSD :
> https://lore.kernel.org/qemu-devel/427ef2ee-6871-0d27-f485-90ad142f6...@kaod.org/
> 
> It would be interesting to boot the PowerNV machine directly from a
> FreeBSD kernel and a minimal initrd without using the skiroot images
> and an ISO. Are images available?
> 
> Thanks.
> 
> C.
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


[PULL 0/1] MAINTAINERS update

2021-11-30 Thread Eduardo Habkost
* MAINTAINERS: Change my email address (Eduardo Habkost)

Eduardo Habkost (1):
  MAINTAINERS: Change my email address

 MAINTAINERS | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

-- 
2.32.0




Re: [PATCH] fw_cfg: Fix memory leak in fw_cfg_register_file

2021-11-30 Thread Johan Hovold
On Tue, Nov 16, 2021 at 04:28:34PM +0100, Philippe Mathieu-Daudé wrote:
> On 11/16/21 12:42, Miaoqian Lin wrote:
> > When kobject_init_and_add() fails, entry should be freed just like
> > when sysfs_create_bin_file() fails.
> > 
> 
> Fixes: fe3c60684377 ("firmware: Fix a reference count leak.")
> Reviewed-by: Philippe Mathieu-Daudé 

No, no. This patch is completely bogus and would introduce a double
free.

> > Signed-off-by: Miaoqian Lin 
> > ---
> >  drivers/firmware/qemu_fw_cfg.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/drivers/firmware/qemu_fw_cfg.c b/drivers/firmware/qemu_fw_cfg.c
> > index 172c751a4f6c..0f404777f016 100644
> > --- a/drivers/firmware/qemu_fw_cfg.c
> > +++ b/drivers/firmware/qemu_fw_cfg.c
> > @@ -608,6 +608,7 @@ static int fw_cfg_register_file(const struct 
> > fw_cfg_file *f)
> >fw_cfg_sel_ko, "%d", entry->select);
> > if (err) {
> > kobject_put(&entry->kobj);
> > +   kfree(entry);

entry would already have been freed by kobject_put() and
fw_cfg_sysfs_release_entry() here.
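
For reference, the ktype release callback (paraphrased from qemu_fw_cfg.c) is
what frees the entry once the last reference is dropped, so the error path
must only drop the reference and never kfree() the entry directly:

static void fw_cfg_sysfs_release_entry(struct kobject *kobj)
{
    struct fw_cfg_sysfs_entry *entry = to_entry(kobj);

    kfree(entry);
}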

> > return err;
> > }
> >  
> > 

Doesn't look like this patch has been picked up yet, so:

NAK.

Johan



[PULL 1/1] MAINTAINERS: Change my email address

2021-11-30 Thread Eduardo Habkost
The ehabk...@redhat.com email address will stop working on
2021-12-01, change it to my personal email address.

Signed-off-by: Eduardo Habkost 
Message-Id: <20211129163053.2506734-1-ehabk...@redhat.com>
Signed-off-by: Eduardo Habkost 
---
 MAINTAINERS | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index d3879aa3c12..7e8a586b2ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -324,7 +324,7 @@ F: disas/sparc.c
 X86 TCG CPUs
 M: Paolo Bonzini 
 M: Richard Henderson 
-M: Eduardo Habkost 
+M: Eduardo Habkost 
 S: Maintained
 F: target/i386/tcg/
 F: tests/tcg/i386/
@@ -1628,7 +1628,7 @@ F: include/hw/i386/microvm.h
 F: pc-bios/bios-microvm.bin
 
 Machine core
-M: Eduardo Habkost 
+M: Eduardo Habkost 
 M: Marcel Apfelbaum 
 R: Philippe Mathieu-Daudé 
 S: Supported
@@ -2648,13 +2648,13 @@ F: backends/cryptodev*.c
 Python library
 M: John Snow 
 M: Cleber Rosa 
-R: Eduardo Habkost 
+R: Eduardo Habkost 
 S: Maintained
 F: python/
 T: git https://gitlab.com/jsnow/qemu.git python
 
 Python scripts
-M: Eduardo Habkost 
+M: Eduardo Habkost 
 M: Cleber Rosa 
 S: Odd Fixes
 F: scripts/*.py
@@ -2730,7 +2730,7 @@ T: git https://github.com/mdroth/qemu.git qga
 QOM
 M: Paolo Bonzini 
 R: Daniel P. Berrange 
-R: Eduardo Habkost 
+R: Eduardo Habkost 
 S: Supported
 F: docs/qdev-device-use.txt
 F: hw/core/qdev*
@@ -2750,7 +2750,7 @@ F: tests/unit/check-qom-proplist.c
 F: tests/unit/test-qdev-global-props.c
 
 QOM boilerplate conversion script
-M: Eduardo Habkost 
+M: Eduardo Habkost 
 S: Maintained
 F: scripts/codeconverter/
 
-- 
2.32.0




Re: why is iommu_platform set to off by default?

2021-11-30 Thread Michael S. Tsirkin
On Tue, Nov 30, 2021 at 04:38:11PM +, Stefan Hajnoczi wrote:
> On Tue, Nov 30, 2021 at 02:32:49PM +, Peter Maydell wrote:
> > I've just spent a day or so trying to track down why PCI passthrough
> > of a virtio-blk-pci device wasn't working. The problem turns out to be
> > that by default virtio pci devices don't use the IOMMU, even when the
> > machine model has created an IOMMU and arranged for the PCI bus to
> > be underneath it. So when the L2 guest tries to program the virtio device,
> > the virtio device treats the IPAs it writes as if they were PAs and
> > of course the data structures it's looking for aren't there.
> > 
> > Why do we default this to 'off'? It seems pretty unhelpful not to
> > honour the existence of the IOMMU, and the failure mode is pretty
> > opaque (L2 guest just hangs)...
> 
> Historically VIRTIO used guest physical addresses instead of bus
> addresses (IOVA). I think when IOMMU support was added, a QEMU -device
> virtio-* parameter was added and it's simply off by default:
> 
>   iommu_platform=  - on/off (default: false)
> 
> Maybe this behavior is for backwards compatibility? Existing guests with
> IOMMUs shouldn't change behavior, although this could be fixed with
> machine type compat properties.
> 
> Stefan

Unfortunately the IOMMU is special. Bypassing it makes the config
insecure for nested virt, so it's important that we do not
allow the guest driver to disable the IOMMU through the
feature bit.
This means the normal compat machinery (make the feature on by default
but disable it for legacy drivers) does not work here.
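
(For context, the "normal compat machinery" refers to per-machine-type
property defaults; the sketch below is illustrative only and this exact entry
does not exist in QEMU: if the default were flipped to on, older machine
types would pin the old behaviour for existing guests like this.)

GlobalProperty hw_compat_6_1[] = {
    /* illustrative entry, not actual QEMU code */
    { "virtio-device", "iommu_platform", "off" },
};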

-- 
MST




Re: [PATCH v1 0/3] virtio-mem: Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE

2021-11-30 Thread Michael S. Tsirkin
On Tue, Nov 30, 2021 at 10:28:35AM +0100, David Hildenbrand wrote:
> Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE in QEMU, which indicates to
> a guest that we don't support reading unplugged memory. We indicate
> the feature based on a new "unplugged-inaccessible" property available
> for x86 targets only (the only ones with legacy guests). Guests that don't
> support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE will fail initialization if
> indicated/required by the hypervisor.
> 
> For example, Linux guests starting with v5.16 will support
> VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE.
> 
> For future targets that don't have legacy guests (especially arm64), we'll
> always indicate VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE.
> 
> More details can be found in the description of patch #2.
> 
> "
> For existing compat machines, the property will default to "off", to
> not change the behavior but eventually warn about a problematic setup.
> Short-term, we'll set the property default to "auto" for new QEMU machines.
> Mid-term, we'll set the property default to "on" for new QEMU machines.
> Long-term, we'll deprecate the parameter and disallow legacy guests
> completely.
> "
> 
> TODO: Once 6.2 is released, adjust patch #3. Replace patch #1 by a proper
>   Linux header sync.


oh so it's not for 6.2. got it.

> Cc: "Michael S. Tsirkin" 
> Cc: Marcel Apfelbaum 
> Cc: Paolo Bonzini 
> Cc: Richard Henderson 
> Cc: Eduardo Habkost 
> Cc: Gavin Shan 
> Cc: Hui Zhu 
> Cc: Sebastien Boeuf 
> Cc: Pankaj Gupta 
> 
> David Hildenbrand (3):
>   linux-headers: sync VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE
>   virtio-mem: Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE
>   virtio-mem: Set "unplugged-inaccessible=auto" for the 6.2 machine on
> x86
> 
>  hw/i386/pc.c|  1 +
>  hw/virtio/virtio-mem.c  | 63 +
>  include/hw/virtio/virtio-mem.h  |  8 +++
>  include/standard-headers/linux/virtio_mem.h |  9 ++-
>  4 files changed, 78 insertions(+), 3 deletions(-)
> 
> -- 
> 2.31.1




Re: [PATCH v1 1/3] linux-headers: sync VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE

2021-11-30 Thread Michael S. Tsirkin
On Tue, Nov 30, 2021 at 10:28:36AM +0100, David Hildenbrand wrote:
> TODO: replace by a proper header sync.

what's the plan for this patchset then?

> Signed-off-by: David Hildenbrand 
> ---
>  include/standard-headers/linux/virtio_mem.h | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/include/standard-headers/linux/virtio_mem.h 
> b/include/standard-headers/linux/virtio_mem.h
> index 05e5ade75d..18c74c527c 100644
> --- a/include/standard-headers/linux/virtio_mem.h
> +++ b/include/standard-headers/linux/virtio_mem.h
> @@ -68,9 +68,10 @@
>   * explicitly triggered (VIRTIO_MEM_REQ_UNPLUG).
>   *
>   * There are no guarantees what will happen if unplugged memory is
> - * read/written. Such memory should, in general, not be touched. E.g.,
> - * even writing might succeed, but the values will simply be discarded at
> - * random points in time.
> + * read/written. In general, unplugged memory should not be touched, because
> + * the resulting action is undefined. There is one exception: without
> + * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, unplugged memory inside the usable
> + * region can be read, to simplify creation of memory dumps.
>   *
>   * It can happen that the device cannot process a request, because it is
>   * busy. The device driver has to retry later.
> @@ -87,6 +88,8 @@
>  
>  /* node_id is an ACPI PXM and is valid */
>  #define VIRTIO_MEM_F_ACPI_PXM0
> +/* unplugged memory must not be accessed */
> +#define VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE  1
>  
>  
>  /* --- virtio-mem: guest -> host requests --- */
> -- 
> 2.31.1




Re: why is iommu_platform set to off by default?

2021-11-30 Thread Michael S. Tsirkin
On Tue, Nov 30, 2021 at 02:32:49PM +, Peter Maydell wrote:
> I've just spent a day or so trying to track down why PCI passthrough
> of a virtio-blk-pci device wasn't working. The problem turns out to be
> that by default virtio pci devices don't use the IOMMU, even when the
> machine model has created an IOMMU and arranged for the PCI bus to
> be underneath it. So when the L2 guest tries to program the virtio device,
> the virtio device treats the IPAs it writes as if they were PAs and
> of course the data structures it's looking for aren't there.

Because this is what legacy guests expect, and legacy configs are
much more common than nested.

> Why do we default this to 'off'? It seems pretty unhelpful not to
> honour the existence of the IOMMU, and the failure mode is pretty
> opaque (L2 guest just hangs)...
> 
> thanks
> -- PMM

This should be handled by VFIO in L1, really; it can check for a
device quirk and refuse binding if the feature bit
is disabled.


-- 
MST




Re: [PATCH] virtio: signal after wrapping packed used_idx

2021-11-30 Thread Michael S. Tsirkin
On Tue, Nov 30, 2021 at 01:45:10PM +, Stefan Hajnoczi wrote:
> Packed Virtqueues wrap used_idx instead of letting it run freely like
> Split Virtqueues do. If the used ring wraps more than once there is no
> way to compare vq->signalled_used and vq->used_idx in
> virtio_packed_should_notify() since they are modulo vq->vring.num.
> 
> This causes the device to stop sending used buffer notifications when
> virtio_packed_should_notify() is called less than once each time
> around the used ring.
> 
> It is possible to trigger this with virtio-blk's dataplane
> notify_guest_bh() irq coalescing optimization. The call to
> virtio_notify_irqfd() (and virtio_packed_should_notify()) is deferred to
> a BH. If the guest driver is polling it can complete and submit more
> requests before the BH executes, causing the used ring to wrap more than
> once. The result is that the virtio-blk device ceases to raise
> interrupts and I/O hangs.
> 
> Cc: Tiwei Bie 
> Cc: Jason Wang 
> Cc: Michael S. Tsirkin 
> Signed-off-by: Stefan Hajnoczi 

Makes sense.  Fixes tag?

> ---
> Smarter solutions welcome, but I think notifying once per vq->vring.num
> is acceptable.
> ---
>  hw/virtio/virtio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index ea7c079fb0..f7851c2750 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -885,6 +885,7 @@ static void virtqueue_packed_flush(VirtQueue *vq, 
> unsigned int count)
>  if (vq->used_idx >= vq->vring.num) {
>  vq->used_idx -= vq->vring.num;
>  vq->used_wrap_counter ^= 1;
> +vq->signalled_used_valid = false;
>  }
>  }
>  
> -- 
> 2.33.1
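
For context, used-buffer notification suppression relies on comparing a
free-running index against the event index, as in vring_need_event() (from
the virtio spec; the packed-ring check is built on the same test):

static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                                   uint16_t old)
{
    /* Only notify if new_idx has crossed event_idx since old. */
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

Once the packed ring's used_idx is reduced modulo vring.num, old and new are
no longer on the same free-running number line, so the cached signalled_used
has to be invalidated; that is what the added line above does.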




[RFC PATCH v2 3/4] target/ppc: Remove the software TLB model of 7450 CPUs

2021-11-30 Thread Fabiano Rosas
(Applies to 7441, 7445, 7450, 7451, 7455, 7457, 7447, 7447a and 7448)

The QEMU-side software TLB implementation for the 7450 family of CPUs
is being removed due to lack of known users in the real world. The
last users in the code were removed by the two previous commits.

A brief history:

The feature was added in QEMU by commit 7dbe11acd8 ("Handle all MMU
models in switches...") with the mention that Linux was not able to
handle the TLB miss interrupts and the MMU model would be kept
disabled.

At some point later, commit 8ca3f6c382 ("Allow selection of all
defined PowerPC 74xx (aka G4) CPUs.") enabled the model for the 7450
family without further justification.

We have since the year 2011 [1] been unable to run OpenBIOS in the
7450s and have not heard of any other software that is used with those
CPUs in QEMU. Attempts were made to find a guest OS that implemented
the TLB miss handlers and none were found among Linux 5.15, FreeBSD 13,
MacOS9, MacOSX and MorphOS 3.15.

All CPUs that registered this feature were moved to an MMU model that
replaces the software TLB with a QEMU hardware TLB
implementation. They can now run the same software as the 7400 CPUs,
including the OSes mentioned above.

References:

- https://bugs.launchpad.net/qemu/+bug/812398
  https://gitlab.com/qemu-project/qemu/-/issues/86

- https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00289.html
  message id: 2029134431.406753-1-faro...@linux.ibm.com

Signed-off-by: Fabiano Rosas 
---
 target/ppc/cpu-qom.h |  6 +-
 target/ppc/cpu.h |  4 +---
 target/ppc/cpu_init.c| 26 --
 target/ppc/excp_helper.c | 29 -
 target/ppc/helper.h  |  2 --
 target/ppc/mmu_common.c  | 19 ---
 target/ppc/mmu_helper.c  | 31 ---
 target/ppc/translate.c   | 26 --
 8 files changed, 6 insertions(+), 137 deletions(-)

diff --git a/target/ppc/cpu-qom.h b/target/ppc/cpu-qom.h
index 5800fa324e..ef9e324474 100644
--- a/target/ppc/cpu-qom.h
+++ b/target/ppc/cpu-qom.h
@@ -45,7 +45,11 @@ enum powerpc_mmu_t {
 POWERPC_MMU_32B= 0x0001,
 /* PowerPC 6xx MMU with software TLB   */
 POWERPC_MMU_SOFT_6xx   = 0x0002,
-/* PowerPC 74xx MMU with software TLB  */
+/*
+ * PowerPC 74xx MMU with software TLB (this has been
+ * disabled, see git history for more information.
+ * keywords: tlbld tlbli TLBMISS PTEHI PTELO)
+ */
 POWERPC_MMU_SOFT_74xx  = 0x0003,
 /* PowerPC 4xx MMU with software TLB   */
 POWERPC_MMU_SOFT_4xx   = 0x0004,
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index e946da5f3a..432d609094 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -2138,8 +2138,6 @@ enum {
 PPC_SEGMENT= 0x0200ULL,
 /*   PowerPC 6xx TLB management instructions */
 PPC_6xx_TLB= 0x0400ULL,
-/* PowerPC 74xx TLB management instructions  */
-PPC_74xx_TLB   = 0x0800ULL,
 /*   PowerPC 40x TLB management instructions */
 PPC_40x_TLB= 0x1000ULL,
 /*   segment register access instructions for PowerPC 64 "bridge"*/
@@ -2196,7 +2194,7 @@ enum {
 | PPC_CACHE_DCBZ \
 | PPC_CACHE_DCBA | PPC_CACHE_LOCK \
 | PPC_EXTERN | PPC_SEGMENT | PPC_6xx_TLB \
-| PPC_74xx_TLB | PPC_40x_TLB | PPC_SEGMENT_64B \
+| PPC_40x_TLB | PPC_SEGMENT_64B \
 | PPC_SLBI | PPC_WRTEE | PPC_40x_EXCP \
 | PPC_405_MAC | PPC_440_SPEC | PPC_BOOKE \
 | PPC_MFAPIDI | PPC_TLBIVA | PPC_TLBIVAX \
diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c
index 962acf295f..ed0e2136d9 100644
--- a/target/ppc/cpu_init.c
+++ b/target/ppc/cpu_init.c
@@ -945,31 +945,6 @@ static void register_l3_ctrl(CPUPPCState *env)
 0x00000000);
 }
 
-static void register_74xx_soft_tlb(CPUPPCState *env, int nb_tlbs, int nb_ways)
-{
-#if !defined(CONFIG_USER_ONLY)
-env->nb_tlb = nb_tlbs;
-env->nb_ways = nb_ways;
-env->id_tlbs = 1;
-env->tlb_type = TLB_6XX;
-/* XXX : not implemented */
-spr_register(env, SPR_PTEHI, "PTEHI",
- SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
- 0x00000000);
-/* XXX : not implemented */
-spr_register(env, SPR_PTELO, "PTELO",
- SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
- 0x00000000);
-/* XXX : not implemented */
-spr_register(env, SPR_TLBMISS, "TLBMISS",
- SPR_NOACCESS, SPR_NOACCESS,
- &spr_read_generic, &spr_write_generic,
- 0x00000000);
-#endif
-}

[RFC PATCH v2 4/4] tests/avocado: ppc: Add smoke tests for MPC7400 and MPC7450 families

2021-11-30 Thread Fabiano Rosas
These tests ensure that our emulation for these CPUs is not completely
broken and that we can at least run OpenBIOS on them.

$ make check-avocado AVOCADO_TESTS=../tests/avocado/ppc_74xx.py

Signed-off-by: Fabiano Rosas 
Reviewed-by: Willian Rampazzo 
---
Note that the 7450s depend on an OpenBIOS change adding support for
their PVR:

https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00290.html
---
 tests/avocado/ppc_74xx.py | 123 ++
 1 file changed, 123 insertions(+)
 create mode 100644 tests/avocado/ppc_74xx.py

diff --git a/tests/avocado/ppc_74xx.py b/tests/avocado/ppc_74xx.py
new file mode 100644
index 00..556a9a7da9
--- /dev/null
+++ b/tests/avocado/ppc_74xx.py
@@ -0,0 +1,123 @@
+# Smoke tests for 74xx cpus (aka G4).
+#
+# Copyright (c) 2021, IBM Corp.
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later.  See the COPYING file in the top-level directory.
+
+from avocado_qemu import QemuSystemTest
+from avocado_qemu import wait_for_console_pattern
+
+class ppc74xxCpu(QemuSystemTest):
+"""
+:avocado: tags=arch:ppc
+"""
+timeout = 5
+
+def test_ppc_7400(self):
+"""
+:avocado: tags=cpu:7400
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7410(self):
+"""
+:avocado: tags=cpu:7410
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,74xx')
+
+def test_ppc_7441(self):
+"""
+:avocado: tags=cpu:7441
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7445(self):
+"""
+:avocado: tags=cpu:7445
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7447(self):
+"""
+:avocado: tags=cpu:7447
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7447a(self):
+"""
+:avocado: tags=cpu:7447a
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7448(self):
+"""
+:avocado: tags=cpu:7448
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,MPC86xx')
+
+def test_ppc_7450(self):
+"""
+:avocado: tags=cpu:7450
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7451(self):
+"""
+:avocado: tags=cpu:7451
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7455(self):
+"""
+:avocado: tags=cpu:7455
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7457(self):
+"""
+:avocado: tags=cpu:7457
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
+
+def test_ppc_7457a(self):
+"""
+:avocado: tags=cpu:7457a
+"""
+self.vm.set_console()
+self.vm.launch()
+wait_for_console_pattern(self, '>> OpenBIOS')
+wait_for_console_pattern(self, '>> CPU type PowerPC,G4')
-- 
2.33.1




[RFC PATCH v2 1/4] target/ppc: Disable software TLB for the 7450 family

2021-11-30 Thread Fabiano Rosas
(Applies to 7441, 7445, 7450, 7451, 7455, 7457, 7447 and 7447a)*

We have since 2011 [1] been unable to run OpenBIOS in the 7450s and
have not heard of any other software that is used with those CPUs in
QEMU. A current discussion [2] shows that the 7450 software TLB is
unsupported in Linux 5.15, FreeBSD 13, MacOS9, MacOSX and MorphOS
3.15. With no known support in firmware or OS, this means that no code
for any of the 7450 CPUs is ever run in QEMU.

Since the implementation in QEMU of the 7400 MMU is the same as the
7450, except for the software TLB vs. hardware TLB search, this patch
changes all 7450 cpus to the 7400 MMU model. This has the practical
effect of disabling the software TLB feature while keeping other
aspects of address translation working as expected.

This allows us to run software on the 7450 family again.

*- note that the 7448 is currently aliased in QEMU to a 7400, so it
   is unaffected by this change.

1- https://bugs.launchpad.net/qemu/+bug/812398
   https://gitlab.com/qemu-project/qemu/-/issues/86

2- https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00289.html
   message id: 2029134431.406753-1-faro...@linux.ibm.com

Signed-off-by: Fabiano Rosas 
---
 target/ppc/cpu_init.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c
index 6695985e9b..509df35d09 100644
--- a/target/ppc/cpu_init.c
+++ b/target/ppc/cpu_init.c
@@ -5932,7 +5932,6 @@ static void init_proc_7440(CPUPPCState *env)
  0x);
 /* Memory management */
 register_low_BATs(env);
-register_74xx_soft_tlb(env, 128, 2);
 init_excp_7450(env);
 env->dcache_line_size = 32;
 env->icache_line_size = 32;
@@ -5956,7 +5955,7 @@ POWERPC_FAMILY(7440)(ObjectClass *oc, void *data)
PPC_CACHE_DCBA | PPC_CACHE_DCBZ |
PPC_MEM_SYNC | PPC_MEM_EIEIO |
PPC_MEM_TLBIE | PPC_MEM_TLBSYNC |
-   PPC_MEM_TLBIA | PPC_74xx_TLB |
+   PPC_MEM_TLBIA |
PPC_SEGMENT | PPC_EXTERN |
PPC_ALTIVEC;
 pcc->msr_mask = (1ull << MSR_VR) |
@@ -5976,7 +5975,7 @@ POWERPC_FAMILY(7440)(ObjectClass *oc, void *data)
 (1ull << MSR_PMM) |
 (1ull << MSR_RI) |
 (1ull << MSR_LE);
-pcc->mmu_model = POWERPC_MMU_SOFT_74xx;
+pcc->mmu_model = POWERPC_MMU_32B;
 pcc->excp_model = POWERPC_EXCP_74xx;
 pcc->bus_model = PPC_FLAGS_INPUT_6xx;
 pcc->bfd_mach = bfd_mach_ppc_7400;
@@ -6067,7 +6066,6 @@ static void init_proc_7450(CPUPPCState *env)
  0x);
 /* Memory management */
 register_low_BATs(env);
-register_74xx_soft_tlb(env, 128, 2);
 init_excp_7450(env);
 env->dcache_line_size = 32;
 env->icache_line_size = 32;
@@ -6091,7 +6089,7 @@ POWERPC_FAMILY(7450)(ObjectClass *oc, void *data)
PPC_CACHE_DCBA | PPC_CACHE_DCBZ |
PPC_MEM_SYNC | PPC_MEM_EIEIO |
PPC_MEM_TLBIE | PPC_MEM_TLBSYNC |
-   PPC_MEM_TLBIA | PPC_74xx_TLB |
+   PPC_MEM_TLBIA |
PPC_SEGMENT | PPC_EXTERN |
PPC_ALTIVEC;
 pcc->msr_mask = (1ull << MSR_VR) |
@@ -6111,7 +6109,7 @@ POWERPC_FAMILY(7450)(ObjectClass *oc, void *data)
 (1ull << MSR_PMM) |
 (1ull << MSR_RI) |
 (1ull << MSR_LE);
-pcc->mmu_model = POWERPC_MMU_SOFT_74xx;
+pcc->mmu_model = POWERPC_MMU_32B;
 pcc->excp_model = POWERPC_EXCP_74xx;
 pcc->bus_model = PPC_FLAGS_INPUT_6xx;
 pcc->bfd_mach = bfd_mach_ppc_7400;
@@ -6205,7 +6203,6 @@ static void init_proc_7445(CPUPPCState *env)
 /* Memory management */
 register_low_BATs(env);
 register_high_BATs(env);
-register_74xx_soft_tlb(env, 128, 2);
 init_excp_7450(env);
 env->dcache_line_size = 32;
 env->icache_line_size = 32;
@@ -6229,7 +6226,7 @@ POWERPC_FAMILY(7445)(ObjectClass *oc, void *data)
PPC_CACHE_DCBA | PPC_CACHE_DCBZ |
PPC_MEM_SYNC | PPC_MEM_EIEIO |
PPC_MEM_TLBIE | PPC_MEM_TLBSYNC |
-   PPC_MEM_TLBIA | PPC_74xx_TLB |
+   PPC_MEM_TLBIA |
PPC_SEGMENT | PPC_EXTERN |
PPC_ALTIVEC;
 pcc->msr_mask = (1ull << MSR_VR) |
@@ -6249,7 +6246,7 @@ POWERPC_FAMILY(7445)(ObjectClass *oc, void *data)
 (1ull << MSR_PMM) |
 (1ull << MSR_RI) |
 (1ull << MSR_LE);
-pcc->mmu_model = POWERPC_MMU_SOFT_74xx;
+pcc->mmu_model = POWERPC_MMU_32B;
 pcc->excp_model = POWERPC_EXCP_74xx;
 pcc->bus_model = PPC_FLAGS_INPUT_6xx;
 pcc->bfd_mach = bfd_mach_ppc_7400;
@@ -6345,7 +6342,6 @@ static void init_proc_7455(CPUPPCState 

[RFC PATCH v2 2/4] target/ppc: Disable unused facilities in the e600 CPU

2021-11-30 Thread Fabiano Rosas
The e600 CPU is a successor of the 7448 and, like all the 7450 CPUs,
it has an optional software TLB feature.

We have determined that there is no OS software support for the 7450
software TLB available these days. See the previous commit for more
information.

This patch disables the SPRs and instructions related to software TLB
from the e600 CPU.

No functional change intended. These facilities would only be used by the
OS in interrupt handlers for interrupts that QEMU never generates.

Signed-off-by: Fabiano Rosas 
---
 target/ppc/cpu_init.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/target/ppc/cpu_init.c b/target/ppc/cpu_init.c
index 509df35d09..962acf295f 100644
--- a/target/ppc/cpu_init.c
+++ b/target/ppc/cpu_init.c
@@ -2537,9 +2537,6 @@ static void init_excp_7450(CPUPPCState *env)
 env->excp_vectors[POWERPC_EXCP_TRACE]= 0x0D00;
 env->excp_vectors[POWERPC_EXCP_PERFM]= 0x0F00;
 env->excp_vectors[POWERPC_EXCP_VPU]  = 0x0F20;
-env->excp_vectors[POWERPC_EXCP_IFTLB]= 0x1000;
-env->excp_vectors[POWERPC_EXCP_DLTLB]= 0x1100;
-env->excp_vectors[POWERPC_EXCP_DSTLB]= 0x1200;
 env->excp_vectors[POWERPC_EXCP_IABR] = 0x1300;
 env->excp_vectors[POWERPC_EXCP_SMI]  = 0x1400;
 env->excp_vectors[POWERPC_EXCP_VPUA] = 0x1600;
@@ -6643,7 +6640,6 @@ static void init_proc_e600(CPUPPCState *env)
 /* Memory management */
 register_low_BATs(env);
 register_high_BATs(env);
-register_74xx_soft_tlb(env, 128, 2);
 init_excp_7450(env);
 env->dcache_line_size = 32;
 env->icache_line_size = 32;
@@ -6667,7 +6663,7 @@ POWERPC_FAMILY(e600)(ObjectClass *oc, void *data)
PPC_CACHE_DCBA | PPC_CACHE_DCBZ |
PPC_MEM_SYNC | PPC_MEM_EIEIO |
PPC_MEM_TLBIE | PPC_MEM_TLBSYNC |
-   PPC_MEM_TLBIA | PPC_74xx_TLB |
+   PPC_MEM_TLBIA |
PPC_SEGMENT | PPC_EXTERN |
PPC_ALTIVEC;
 pcc->insns_flags2 = PPC_NONE;
-- 
2.33.1




[RFC PATCH v2 0/4] QEMU/openbios: PPC Software TLB support in the G4 family

2021-11-30 Thread Fabiano Rosas
Hi all,

Recap:

- QEMU enables 7450 SW TLB search by default;
- OpenBIOS does not know about SW TLB (vectors 0x1000, 0x1100, 0x1200);
- OpenBIOS does not know about 7450s PVRs.

Proposed solutions:

a) find another firmware/guest OS code that supports the feature;

b) implement the switching of the feature in QEMU and have the guest
code enable it only when supported. That would take some fiddling with
the MMU code to: merge POWERPC_MMU_SOFT_74xx into POWERPC_MMU_32B,
check the HID0[STEN] bit, figure out how to switch from HW TLB miss to
SW TLB miss on demand, block access to the TLBMISS register (and
others) when the feature is off, and so on;

c) leave the feature enabled in QEMU and implement the software TLB
miss handlers in openbios. The UM provides sample code, so this is
easy;

d) remove support for software TLB search for the 7450 family and
switch the cpus to the POWERPC_MMU_32B model. This is by far the
easiest solution, but could cause problems for any (which?) guest OS
code that actually uses the feature. All of the existing code for the
POWERPC_MMU_SOFT_74xx MMU model would probably be removed since it
would be dead code then;

v1:
https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00289.html

v2:

This series corresponds to option d) above.

- patch 1 moves all of the 7450 CPUs* into the POWERPC_MMU_32B MMU
  model, which does the same as the POWERPC_MMU_SOFT_74xx minus the
  software TLB handling. It also removes the instructions (tlbld, tlbli)
  and SPRs (TLBMISS, PTEHI, PTELO) from the 7450s as these facilities
  are only used along with the software TLB.

  *- except for 7448, which is seen by QEMU as a 7400.

- patch 2 removes the instructions and SPRs just like above, but from
  the e600 CPU. The e600 already uses the POWERPC_MMU_32B MMU model,
  so it already has software TLB disabled.

- patch 3 removes all of the now dead code for the 74xx software
  TLB. I left a note in the code with keywords to help with grep in
  case people search for the feature in the future.

- patch 4 adds smoke tests for all of the 74xx CPUs. These are broken
  pending the OpenBIOS patch.

For OpenBIOS:

We'd need to merge the patch 2/2 from the previous series:

https://lists.nongnu.org/archive/html/qemu-ppc/2021-11/msg00290.html
Message ID: 2029134431.406753-3-faro...@linux.ibm.com

this is just for the new PVRs. There is no need to add the handlers.

Thanks!

Fabiano Rosas (4):
  target/ppc: Disable software TLB for the 7450 family
  target/ppc: Disable unused facilities in the e600 CPU
  target/ppc: Remove the software TLB model of 7450 CPUs
  tests/avocado: ppc: Add smoke tests for MPC7400 and MPC7450 families

 target/ppc/cpu-qom.h  |   6 +-
 target/ppc/cpu.h  |   4 +-
 target/ppc/cpu_init.c |  57 --
 target/ppc/excp_helper.c  |  29 -
 target/ppc/helper.h   |   2 -
 target/ppc/mmu_common.c   |  19 --
 target/ppc/mmu_helper.c   |  31 --
 target/ppc/translate.c|  26 
 tests/avocado/ppc_74xx.py | 123 ++
 9 files changed, 140 insertions(+), 157 deletions(-)
 create mode 100644 tests/avocado/ppc_74xx.py

-- 
2.33.1




Re: [PATCH v8 06/10] target/ppc: enable PMU instruction count

2021-11-30 Thread Daniel Henrique Barboza




On 11/29/21 01:36, David Gibson wrote:

On Thu, Nov 25, 2021 at 12:08:13PM -0300, Daniel Henrique Barboza wrote:

The PMU is already counting cycles by calculating time elapsed in
nanoseconds. Counting instructions is a different matter and requires
another approach.

This patch adds the capability of counting completed instructions
(Perf event PM_INST_CMPL) by counting the amount of instructions
translated in each translation block right before exiting it.

A new pmu_count_insns() helper in translate.c was added to do that.
After verifying that the PMU is running (MMCR0_FC bit not set), call
helper_insns_inc(). This new helper from power8-pmu.c will add the
instructions to the relevant counters. It'll also be responsible for
triggering counter negative overflows as it is already being done with
cycles.

Signed-off-by: Daniel Henrique Barboza 
---
  target/ppc/cpu.h |  1 +
  target/ppc/helper.h  |  1 +
  target/ppc/helper_regs.c |  4 +++
  target/ppc/power8-pmu-regs.c.inc |  6 +
  target/ppc/power8-pmu.c  | 38 ++
  target/ppc/translate.c   | 46 
  6 files changed, 96 insertions(+)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 9b41b022e2..38cd2b5c43 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -656,6 +656,7 @@ enum {
  HFLAGS_PR = 14,  /* MSR_PR */
  HFLAGS_PMCC0 = 15,  /* MMCR0 PMCC bit 0 */
  HFLAGS_PMCC1 = 16,  /* MMCR0 PMCC bit 1 */
+HFLAGS_MMCR0FC = 17, /* MMCR0 FC bit */


Now that the event stuff is a bit more refined, you could narrow this
down to specifically marking if any counters are actively counting
instructions (not frozen by MMCR0[FC] and not frozen by
MMCR0[FC14|FC56] *and* have the right event selected).

Since I suspect the instruction counting instrumentation could be
quite expensive (helper call on every tb), that might be worthwhile.


That was worthwhile. The performance increase is substantial with this
change, in particular with tests that exercise only cycle events.
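
To illustrate what is being discussed, here is a minimal sketch of the per-TB
gating. The helper name comes from the patch above; the v8 code gates on the
MMCR0[FC]-derived ctx->pmu_frozen flag mentioned in power8-pmu-regs.c.inc, and
per David's suggestion the final version would instead check a narrower
"any PMC counting instructions" flag, so treat this as illustrative only:

/*
 * Sketch only: emitted once per translation block, right before exiting it,
 * so the completed instructions can be accounted to the PMCs.
 */
static void pmu_count_insns(DisasContext *ctx)
{
    /* Counters are frozen (MMCR0[FC] set): skip the expensive helper call. */
    if (ctx->pmu_frozen) {
        return;
    }

    gen_helper_insns_inc(cpu_env, tcg_constant_i32(ctx->base.num_insns));
}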






  HFLAGS_VSX = 23, /* MSR_VSX if cpu has VSX */
  HFLAGS_VR = 25,  /* MSR_VR if cpu has VRE */
  
diff --git a/target/ppc/helper.h b/target/ppc/helper.h

index 94b4690375..d8a23e054a 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -24,6 +24,7 @@ DEF_HELPER_2(store_mmcr0, void, env, tl)
  DEF_HELPER_2(store_mmcr1, void, env, tl)
  DEF_HELPER_3(store_pmc, void, env, i32, i64)
  DEF_HELPER_2(read_pmc, tl, env, i32)
+DEF_HELPER_2(insns_inc, void, env, i32)
  #endif
  DEF_HELPER_1(check_tlb_flush_local, void, env)
  DEF_HELPER_1(check_tlb_flush_global, void, env)
diff --git a/target/ppc/helper_regs.c b/target/ppc/helper_regs.c
index 99562edd57..875c2fdfc6 100644
--- a/target/ppc/helper_regs.c
+++ b/target/ppc/helper_regs.c
@@ -115,6 +115,10 @@ static uint32_t hreg_compute_hflags_value(CPUPPCState *env)
  if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCC1) {
  hflags |= 1 << HFLAGS_PMCC1;
  }
+if (env->spr[SPR_POWER_MMCR0] & MMCR0_FC) {
+hflags |= 1 << HFLAGS_MMCR0FC;
+}
+
  
  #ifndef CONFIG_USER_ONLY

  if (!env->has_hv_mode || (msr & (1ull << MSR_HV))) {
diff --git a/target/ppc/power8-pmu-regs.c.inc b/target/ppc/power8-pmu-regs.c.inc
index 25b13ad564..580e4e41b2 100644
--- a/target/ppc/power8-pmu-regs.c.inc
+++ b/target/ppc/power8-pmu-regs.c.inc
@@ -113,6 +113,12 @@ static void write_MMCR0_common(DisasContext *ctx, TCGv val)
   */
  gen_icount_io_start(ctx);
  gen_helper_store_mmcr0(cpu_env, val);
+
+/*
+ * End the translation block because MMCR0 writes can change
+ * ctx->pmu_frozen.
+ */
+ctx->base.is_jmp = DISAS_EXIT_UPDATE;
  }
  
  void spr_write_MMCR0_ureg(DisasContext *ctx, int sprn, int gprn)

diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index 01e0b9b8fc..59d0def79d 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -112,6 +112,30 @@ static PMUEventType pmc_get_event(CPUPPCState *env, int 
sprn)
  return evt_type;
  }
  
+static bool pmu_increment_insns(CPUPPCState *env, uint32_t num_insns)

+{
+bool overflow_triggered = false;
+int sprn;
+
+/* PMC6 never counts instructions */
+for (sprn = SPR_POWER_PMC1; sprn <= SPR_POWER_PMC5; sprn++) {
+if (pmc_get_event(env, sprn) != PMU_EVENT_INSTRUCTIONS) {
+continue;
+}
+
+env->spr[sprn] += num_insns;
+
+if (env->spr[sprn] >= PMC_COUNTER_NEGATIVE_VAL &&
+pmc_has_overflow_enabled(env, sprn)) {
+
+overflow_triggered = true;
+env->spr[sprn] = PMC_COUNTER_NEGATIVE_VAL;


Does the hardware PMU actually guarantee that the event will happen
exactly on the overflow?  Or could you count a few into the negative
zone before the event is delivered?


My understanding from reading the ISA and from testing with a real PMU is that yes,
it'll guarantee that the overflow will happen when the counter reaches exactly

Re: [PATCH 2/2] hw/mips/boston: Fix load_elf error detection

2021-11-30 Thread Philippe Mathieu-Daudé
On 11/30/21 22:17, Jiaxun Yang wrote:
> load_elf gives negative return in case of error, not zero.
> 
> Fixes: 10e3f30 ("hw/mips/boston: Allow loading elf kernel and dtb")
> Signed-off-by: Jiaxun Yang 
> ---
> For 6.2.
> ---
>  hw/mips/boston.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 



Re: [PATCH 1/2] hw/mips: bootloader: Fix write_ulong

2021-11-30 Thread Philippe Mathieu-Daudé
On 11/30/21 22:17, Jiaxun Yang wrote:
> bl_gen_write_ulong uses sd for both 32 and 64 bit CPU,
> while sd is illegal on 32 bit CPUs.
> 
> Replace sd with sw on 32bit CPUs.
> 
> Fixes: 3ebbf86 ("hw/mips: Add a bootloader helper")
> Signed-off-by: Jiaxun Yang 
> ---
> Should be backported to 6.0 onwards.
> ---
>  hw/mips/bootloader.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/mips/bootloader.c b/hw/mips/bootloader.c
> index 6ec8314490..1f8b2b 100644
> --- a/hw/mips/bootloader.c
> +++ b/hw/mips/bootloader.c
> @@ -182,7 +182,11 @@ void bl_gen_write_ulong(uint32_t **p, target_ulong addr, 
> target_ulong val)
>  {
>  bl_gen_load_ulong(p, BL_REG_K0, val);
>  bl_gen_load_ulong(p, BL_REG_K1, addr);
> -bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0);
> +if (bootcpu_supports_isa(ISA_MIPS3)) {
> +bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0);
> +} else {
> +bl_gen_sw(p, BL_REG_K0, BL_REG_K1, 0x0);
> +}

We have bl_gen_load_ulong(); having bl_gen_store_ulong()
would make the API even. Mind sending a patch? Otherwise:

Reviewed-by: Philippe Mathieu-Daudé 

>  }
>  
>  void bl_gen_write_u32(uint32_t **p, target_ulong addr, uint32_t val)
> 
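
For reference, a rough sketch of what the suggested bl_gen_store_ulong() helper
could look like: the helper name is the one proposed above, the ISA check is
taken from the patch, and the parameter types are assumptions (the real
register/offset types in hw/mips/bootloader.c may differ).

static void bl_gen_store_ulong(uint32_t **p, int rt, int base, uint16_t offset)
{
    if (bootcpu_supports_isa(ISA_MIPS3)) {
        bl_gen_sd(p, rt, base, offset);     /* 64-bit store on MIPS III+ CPUs */
    } else {
        bl_gen_sw(p, rt, base, offset);     /* 32-bit store on 32-bit-only CPUs */
    }
}

void bl_gen_write_ulong(uint32_t **p, target_ulong addr, target_ulong val)
{
    bl_gen_load_ulong(p, BL_REG_K0, val);
    bl_gen_load_ulong(p, BL_REG_K1, addr);
    bl_gen_store_ulong(p, BL_REG_K0, BL_REG_K1, 0x0);
}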



[PATCH 2/2] hw/mips/boston: Fix load_elf error detection

2021-11-30 Thread Jiaxun Yang
load_elf gives negative return in case of error, not zero.

Fixes: 10e3f30 ("hw/mips/boston: Allow loading elf kernel and dtb")
Signed-off-by: Jiaxun Yang 
---
For 6.2.
---
 hw/mips/boston.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/mips/boston.c b/hw/mips/boston.c
index 0e3cca5511..296e9fa2ad 100644
--- a/hw/mips/boston.c
+++ b/hw/mips/boston.c
@@ -777,14 +777,15 @@ static void boston_mach_init(MachineState *machine)
 exit(1);
 }
 } else if (machine->kernel_filename) {
-uint64_t kernel_entry, kernel_high, kernel_size;
+uint64_t kernel_entry, kernel_high;
+ssize_t kernel_size;
 
 kernel_size = load_elf(machine->kernel_filename, NULL,
cpu_mips_kseg0_to_phys, NULL,
                               &kernel_entry, NULL, &kernel_high,
NULL, 0, EM_MIPS, 1, 0);
 
-if (kernel_size) {
+if (kernel_size > 0) {
 int dt_size;
 g_autofree const void *dtb_file_data, *dtb_load_data;
 hwaddr dtb_paddr = QEMU_ALIGN_UP(kernel_high, 64 * KiB);
-- 
2.11.0




[PATCH 1/2] hw/mips: bootloader: Fix write_ulong

2021-11-30 Thread Jiaxun Yang
bl_gen_write_ulong uses sd for both 32 and 64 bit CPU,
while sd is illegal on 32 bit CPUs.

Replace sd with sw on 32bit CPUs.

Fixes: 3ebbf86 ("hw/mips: Add a bootloader helper")
Signed-off-by: Jiaxun Yang 
---
Should be backported to 6.0 onwards.
---
 hw/mips/bootloader.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/mips/bootloader.c b/hw/mips/bootloader.c
index 6ec8314490..1f8b2b 100644
--- a/hw/mips/bootloader.c
+++ b/hw/mips/bootloader.c
@@ -182,7 +182,11 @@ void bl_gen_write_ulong(uint32_t **p, target_ulong addr, 
target_ulong val)
 {
 bl_gen_load_ulong(p, BL_REG_K0, val);
 bl_gen_load_ulong(p, BL_REG_K1, addr);
-bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0);
+if (bootcpu_supports_isa(ISA_MIPS3)) {
+bl_gen_sd(p, BL_REG_K0, BL_REG_K1, 0x0);
+} else {
+bl_gen_sw(p, BL_REG_K0, BL_REG_K1, 0x0);
+}
 }
 
 void bl_gen_write_u32(uint32_t **p, target_ulong addr, uint32_t val)
-- 
2.11.0




[PATCH 0/2] MIPS misc fixes

2021-11-30 Thread Jiaxun Yang
Two problems caught when I was trying to add CI job for various configurations.

Jiaxun Yang (2):
  hw/mips: bootloader: Fix write_ulong
  hw/mips/boston: Fix elf_load error detection

 hw/mips/bootloader.c | 6 +-
 hw/mips/boston.c | 5 +++--
 2 files changed, 8 insertions(+), 3 deletions(-)

-- 
2.11.0




Re: [PATCH] target/arm: Correct calculation of tlb range invalidate length

2021-11-30 Thread Alex Bennée


Peter Maydell  writes:

> The calculation of the length of TLB range invalidate operations
> in tlbi_aa64_range_get_length() is incorrect in two ways:
>  * the NUM field is 5 bits, but we read only 4 bits
>  * we miscalculate the page_shift value, because of an
>off-by-one error:
> TG 0b00 is invalid
> TG 0b01 is 4K granule size == 4096 == 2^12
> TG 0b10 is 16K granule size == 16384 == 2^14
> TG 0b11 is 64K granule size == 65536 == 2^16
>so page_shift should be (TG - 1) * 2 + 12
>
> Thanks to the bug report submitter Cha HyunSoo for identifying
> both these errors.
>
> Fixes: 84940ed82552d3c

Fixes: 84940ed825 (target/arm: Add support for FEAT_TLBIRANGE)

Reviewed-by: Alex Bennée 

-- 
Alex Bennée



Re: [PATCH v4] target/ppc: fix Hash64 MMU update of PTE bit R

2021-11-30 Thread Leandro Lupori

On 30/11/2021 05:44, Cédric Le Goater wrote:

It would be interesting to boot the PowerNV machine directly from a
FreeBSD kernel and a minimal initrd without using the skiroot images
and an ISO. Are images available?


AFAIK there are no minimal initrd images available. The closest thing 
would be the "bootonly" ISOs that can be found in the links below:


https://download.freebsd.org/ftp/releases/powerpc/powerpc64/ISO-IMAGES/13.0/

https://download.freebsd.org/ftp/releases/powerpc/powerpc64le/ISO-IMAGES/13.0/

https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64/ISO-IMAGES/14.0/

https://download.freebsd.org/ftp/snapshots/powerpc/powerpc64le/ISO-IMAGES/14.0/


But to avoid using skiroot, you would need to manually extract the 
kernel from the ISO and also specify the rootfs, using something like: 
"-append cd9660:cd0".
If you don't want to emulate a disk or cd, the ISO can be passed to 
-initrd and "-append cd9660:md0" may be used to tell FreeBSD where to 
find the root partition (it loads it as a memory disk).


There are also qcow2 snapshots available:

https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64/

https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64le/

But you'll also need to extract the kernel from the image or from 
kernel.txz to boot them.
As these images target pseries and lack a FAT32 partition, Petitboot 
won't be able to boot them.


Alfredo (cc'ed) was trying to make them Petitboot compatible.



[Bug 1749393] Re: sbrk() not working under qemu-user with a PIE-compiled binary?

2021-11-30 Thread Brian Murray
Hello Raphaël, or anyone else affected,

Accepted qemu into focal-proposed. The package will build now and be
available at https://launchpad.net/ubuntu/+source/qemu/1:4.2-3ubuntu6.19
in a few hours, and then in the -proposed repository.

Please help us by testing this new package.  See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed.  Your feedback will aid us getting this
update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
focal to verification-done-focal. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-focal. In either case, without details of your testing we will
not be able to proceed.

Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification .  Thank you in
advance for helping!

N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.

** Changed in: qemu (Ubuntu Focal)
   Status: In Progress => Fix Committed

** Tags added: verification-needed verification-needed-focal

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1749393

Title:
  sbrk() not working under qemu-user with a PIE-compiled binary?

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Focal:
  Fix Committed

Bug description:
  [Impact]

   * The current space reserved can be too small and we can end up
     with no space at all for BRK. It can happen in any case, but is
     much more likely with the now common PIE binaries.

   * Backport the upstream fix which reserves a bit more space while loading
     and gives it back after the interpreter and stack are loaded.

  [Test Plan]

   * On x86 run:
  sudo apt install -y qemu-user-static docker.io
  sudo docker run --rm arm64v8/debian:bullseye bash -c 'apt update && apt 
install -y wget'
  ...
  Running hooks in /etc/ca-certificates/update.d...
  done.
  Errors were encountered while processing:
   libc-bin
  E: Sub-process /usr/bin/dpkg returned an error code (1)

  
  Second test from bug 1928075

  $ sudo qemu-debootstrap --arch=arm64 bullseye bullseye-arm64
  http://ftp.debian.org/debian

  In the bad case this is failing like
  W: Failure trying to run: /sbin/ldconfig
  W: See //debootstrap/debootstrap.log for detail

  And in that log file you'll see the segfault
  $ tail -n 2 bullseye-arm64/debootstrap/debootstrap.log
  qemu: uncaught target signal 11 (Segmentation fault) - core dumped
  Segmentation fault (core dumped)

  [Where problems could occur]

   * Regressions would be around use-cases of linux-user, that is,
     emulation not of a system but of binaries.
     Commonly used for cross-tests and cross-builds, so that is the
     space to watch for regressions

  [Other Info]

   * n/a

  ---

  In Debian unstable, we recently switched bash to be a PIE-compiled
  binary (for hardening). Unfortunately this resulted in bash being
  broken when run under qemu-user (for all target architectures, host
  being amd64 for me).

  $ sudo chroot /srv/chroots/sid-i386/ qemu-i386-static /bin/bash
  bash: xmalloc: .././shell.c:1709: cannot allocate 10 bytes (0 bytes allocated)

  bash has its own malloc implementation based on sbrk():
  https://git.savannah.gnu.org/cgit/bash.git/tree/lib/malloc/malloc.c

  When we disable this internal implementation and rely on glibc's
  malloc, then everything is fine. But it might be that glibc has a
  fallback when sbrk() is not working properly and it might hide the
  underlying problem in qemu-user.

  This issue has also been reported to the bash upstream author and he 
suggested that the issue might be in qemu-user so I'm opening a ticket here. 
Here's the discussion with the bash upstream author:
  https://lists.gnu.org/archive/html/bug-bash/2018-02/threads.html#00080

  You can find the problematic bash binary in that .deb file:
  
http://snapshot.debian.org/archive/debian/20180206T154716Z/pool/main/b/bash/bash_4.4.18-1_i386.deb

  The version of qemu I have been using is 2.11 (Debian package qemu-
  user-static version 1:2.11+dfsg-1) but I have had reports that the
  problem is reproducible with older versions (back to 2.8 at least).

  Here are the related Debian bug reports:
  https://bugs.debian.org/889869
  https://bugs.debian.org/865599

  It's worth noting that bash used to have this problem (when compiled as a PIE 
binary) even when run directly but then something got fixed in the kernel and 
now the problem only appears when run under qemu-user:
  

Re: [PATCH v5 5/6] migration: Add migrate_use_tls() helper

2021-11-30 Thread Leonardo Bras Soares Passos
Hello Juan,

Thanks for reviewing!

Best regards,
Leo

On Fri, Nov 12, 2021 at 8:04 AM Juan Quintela  wrote:

> Leonardo Bras  wrote:
> > A lot of places check parameters.tls_creds in order to evaluate if TLS is
> > in use, and sometimes call migrate_get_current() just for that test.
> >
> > Add new helper function migrate_use_tls() in order to simplify testing
> > for TLS usage.
> >
> > Signed-off-by: Leonardo Bras 
>
> Reviewed-by: Juan Quintela 
>
>


Re: Odd square bracket encoding in QOM names

2021-11-30 Thread Mark Cave-Ayland

On 30/11/2021 16:41, Peter Maydell wrote:


On Tue, 30 Nov 2021 at 08:36, Mark Cave-Ayland
 wrote:

Has there been a recent change as to how square brackets are encoded within QOM
names? I noticed that the output has changed here in the "info qom-tree" output 
in
qemu-system-m68k for the q800 machine.

The q800 machine has a set of 256 memory region aliases that used to appear in 
the
"info qom-tree" output as:

  /mac_m68k.io[100] (memory-region)
  /mac_m68k.io[101] (memory-region)
  /mac_m68k.io[102] (memory-region)

but they now appear as:

  /mac_m68k.io\x5b100\x5d[0] (memory-region)
  /mac_m68k.io\x5b101\x5d[0] (memory-region)
  /mac_m68k.io\x5b102\x5d[0] (memory-region)


I looked at info qom-tree for an Arm machine, and the [..] seem to be
OK there. I tried to test with q800 but got stuck on finding a
command line to get it to run. Do you have repro instructions?


A couple of tests this evening and I think I must have misremembered this - I tried a 
couple of older versions of my various q800 branches (one within the 6.0 dev cycle 
and another within 6.1) and both escape the QOM object name in "info qom-tree" the 
same way as above.


Easiest way to see this is to grab Finn's kernel from issue #611 like this:

$ wget 
https://gitlab.com/qemu-project/qemu/uploads/dead759323116fb22cf0f03b8cdbe15a/vmlinux-5.14-multi.xz

$ unxz vmlinux-5.14-multi.xz

And then start QEMU with this command line:

$ ./qemu-system-m68k -M q800 -kernel vmlinux-5.14-multi -monitor stdio -S

Obviously the cause is the use of square brackets within the memory region name as 
per https://gitlab.com/qemu-project/qemu/-/blob/master/hw/m68k/q800.c#L411, so given 
the escaping has been like this for some time then I guess everything is working as 
intended, and sorry for the noise.



ATB,

Mark.



Re: Follow-up on the CXL discussion at OFTC

2021-11-30 Thread Ben Widawsky
On 21-11-30 13:09:56, Jonathan Cameron wrote:
> On Mon, 29 Nov 2021 18:28:43 +0000
> Alex Bennée  wrote:
> 
> > Ben Widawsky  writes:
> > 
> > > On 21-11-26 12:08:08, Alex Bennée wrote:  
> > >> 
> > >> Ben Widawsky  writes:
> > >>   
> > >> > On 21-11-19 02:29:51, Shreyas Shah wrote:  
> > >> >> Hi Ben
> > >> >> 
> > >> >> Are you planning to add the CXL2.0 switch inside QEMU or already 
> > >> >> added in one of the version? 
> > >> >>
> > >> >
> > >> > From me, there are no plans for QEMU anything until/unless upstream 
> > >> > thinks it
> > >> > will merge the existing patches, or provide feedback as to what it 
> > >> > would take to
> > >> > get them merged. If upstream doesn't see a point in these patches, 
> > >> > then I really
> > >> > don't see much value in continuing to further them. Once hardware 
> > >> > comes out, the
> > >> > value proposition is certainly less.  
> > >> 
> > >> I take it:
> > >> 
> > >>   Subject: [RFC PATCH v3 00/31] CXL 2.0 Support
> > >>   Date: Mon,  1 Feb 2021 16:59:17 -0800
> > >>   Message-Id: <20210202005948.241655-1-ben.widaw...@intel.com>
> > >> 
> > >> is the current state of the support? I saw there was a fair amount of
> > >> discussion on the thread so assumed there would be a v4 forthcoming at
> > >> some point.  
> > >
> > > Hi Alex,
> > >
> > > There is a v4, however, we never really had a solid plan for the primary 
> > > issue
> > > which was around handling CXL memory expander devices properly (both from 
> > > an
> > > interleaving standpoint as well as having a device which hosts multiple 
> > > memory
> > > capacities, persistent and volatile). I didn't feel it was worth sending 
> > > a v4
> > > unless someone could say
> > >
> > > 1. we will merge what's there and fix later, or
> > > 2. you must have a more perfect emulation in place, or
> > > 3. we want to see usages for a real guest  
> > 
> > I think 1. is acceptable if the community is happy there will be ongoing
> > development and it's not just a code dump. Given it will have a
> > MAINTAINERS entry I think that is demonstrated.
> 
> My thought is also 1.  There are a few hacks we need to clean out but
> nothing that should take too long.  I'm sure it'll take a rev or two more.
> Right now for example, I've added support to arm-virt and maybe need to
> move that over to a different machine model...
> 

The most annoying thing about rebasing it is passing the ACPI tests. They keep
changing upstream. Being able to at least merge up to there would be huge.

> > 
> > What's the current use case? Testing drivers before real HW comes out?
> > Will it still be useful after real HW comes out for people wanting to
> > debug things without HW?
> 
> CXL is continuing to expand in scope and capabilities and I don't see that
> reducing any time soon (My guess is 3 years+ to just catch up with what is
> under discussion today).  So I see two long term use cases:
> 
> 1) Automated verification that we haven't broken things.  I suspect no
> one person is going to have a test farm covering all the corner cases.
> So we'll need emulation + firmware + kernel based testing.
> 

Does this exist in other forms? AFAICT for x86, there aren't many examples of
this.

> 2) New feature prove out.  We have already used it for some features that
> will appear in the next spec version. Obviously I can't say what or
> send that code out yet.  Its very useful and the spec draft has changed
> in various ways a result.  I can't commit others, but Huawei will be
> doing more of this going forwards.  For that we need a stable base to
> which we add the new stuff once spec publication allows it.
> 

I can't commit for Intel but I will say there's more latitude now to work on
projects like this compared to when I first wrote the patches. I have
an interest in continuing to develop this as well. I'm very interested in
supporting interleave and hotplug specifically.

> > 
> > >
> > > I had hoped we could merge what was there mostly as is and fix it up as 
> > > we go.
> > > It's useful in the state it is now, and as time goes on, we find more 
> > > usecases
> > > for it in a VMM, and not just driver development.
> > >  
> > >> 
> > >> Adding new subsystems to QEMU does seem to be a pain point for new
> > >> contributors. Patches tend to fall through the cracks of existing
> > >> maintainers who spend most of their time looking at stuff that directly
> > >> touches their files. There is also a reluctance to merge large chunks of
> > >> functionality without an identified maintainer (and maybe reviewers) who
> > >> can be the contact point for new patches. So in short you need:
> > >> 
> > >>  - Maintainer Reviewed-by/Acked-by on patches that touch other 
> > >> sub-systems  
> > >
> > > This is the challenging one. I have Cc'd the relevant maintainers (hw/pci 
> > > and
> > > hw/mem are the two) in the past, but I think there interest is lacking 
> > > (and
> > > reasonably so, it is an entirely different subsystem).  
> > 
> 

Re: [PATCH] target/arm: Correct calculation of tlb range invalidate length

2021-11-30 Thread Richard Henderson

On 11/30/21 6:32 PM, Peter Maydell wrote:

The calculation of the length of TLB range invalidate operations
in tlbi_aa64_range_get_length() is incorrect in two ways:
  * the NUM field is 5 bits, but we read only 4 bits
  * we miscalculate the page_shift value, because of an
off-by-one error:
 TG 0b00 is invalid
 TG 0b01 is 4K granule size == 4096 == 2^12
 TG 0b10 is 16K granule size == 16384 == 2^14
 TG 0b11 is 64K granule size == 65536 == 2^16
so page_shift should be (TG - 1) * 2 + 12

Thanks to the bug report submitter Cha HyunSoo for identifying
both these errors.

Fixes: 84940ed82552d3c
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/734
Signed-off-by: Peter Maydell 
---
Not marked for-6.2 because this isn't a regression: we
shipped the TLBI range invalidate support in 6.1.
I have no repro case for this issue, but this doesn't break
booting an aarch64 kernel, at least.
---


Reviewed-by: Richard Henderson 


r~



[PATCH] target/arm: Correct calculation of tlb range invalidate length

2021-11-30 Thread Peter Maydell
The calculation of the length of TLB range invalidate operations
in tlbi_aa64_range_get_length() is incorrect in two ways:
 * the NUM field is 5 bits, but we read only 4 bits
 * we miscalculate the page_shift value, because of an
   off-by-one error:
TG 0b00 is invalid
TG 0b01 is 4K granule size == 4096 == 2^12
TG 0b10 is 16K granule size == 16384 == 2^14
TG 0b11 is 64K granule size == 65536 == 2^16
   so page_shift should be (TG - 1) * 2 + 12

Thanks to the bug report submitter Cha HyunSoo for identifying
both these errors.

Fixes: 84940ed82552d3c
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/734
Signed-off-by: Peter Maydell 
---
Not marked for-6.2 because this isn't a regression: we
shipped the TLBI range invalidate support in 6.1.
I have no repro case for this issue, but this doesn't break
booting an aarch64 kernel, at least.
---
 target/arm/helper.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 9b317899a66..db837d53bd9 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -4519,18 +4519,18 @@ static uint64_t tlbi_aa64_range_get_length(CPUARMState 
*env,
 uint64_t exponent;
 uint64_t length;
 
-num = extract64(value, 39, 4);
+num = extract64(value, 39, 5);
 scale = extract64(value, 44, 2);
 page_size_granule = extract64(value, 46, 2);
 
-page_shift = page_size_granule * 2 + 12;
-
 if (page_size_granule == 0) {
 qemu_log_mask(LOG_GUEST_ERROR, "Invalid page size granule %d\n",
   page_size_granule);
 return 0;
 }
 
+page_shift = (page_size_granule - 1) * 2 + 12;
+
 exponent = (5 * scale) + 1;
 length = (num + 1) << (exponent + page_shift);
 
-- 
2.25.1
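
As a quick sanity check of the corrected calculation, here is a standalone
computation with made-up field values (illustrative only, not taken from the
bug report): TG=0b01 (4K granule), SCALE=1, NUM=3 should describe a 1 MiB range.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t num = 3, scale = 1, tg = 1;        /* TG 0b01 => 4K granule */

    uint64_t page_shift = (tg - 1) * 2 + 12;    /* 12 */
    uint64_t exponent = (5 * scale) + 1;        /* 6 */
    uint64_t length = (num + 1) << (exponent + page_shift);

    /* (3 + 1) * 2^6 pages of 4 KiB = 256 pages = 0x100000 bytes (1 MiB) */
    printf("length = 0x%" PRIx64 "\n", length);
    return 0;
}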




[PATCH] scripts/entitlement.sh: Use backward-compatible cp flags

2021-11-30 Thread Evan Miller
Older versions of Mac OS X do not support cp -a. The cp man page indicates that 
-a is equivalent to -pPR.

Signed-off-by: Evan Miller 

--- scripts/entitlement.sh.orig
+++ scripts/entitlement.sh
@@ -15,7 +15,7 @@

 if $in_place; then
   trap 'rm "$DST.tmp"' exit
-  cp -af "$SRC" "$DST.tmp"
+  cp -pPRf "$SRC" "$DST.tmp"
   SRC="$DST.tmp"
 else
   cd "$MESON_INSTALL_DESTDIR_PREFIX"


Re: Odd square bracket encoding in QOM names

2021-11-30 Thread Peter Maydell
On Tue, 30 Nov 2021 at 08:36, Mark Cave-Ayland
 wrote:
> Has there been a recent change as to how square brackets are encoded within 
> QOM
> names? I noticed that the output has changed here in the "info qom-tree" 
> output in
> qemu-system-m68k for the q800 machine.
>
> The q800 machine has a set of 256 memory region aliases that used to appear 
> in the
> "info qom-tree" output as:
>
>  /mac_m68k.io[100] (memory-region)
>  /mac_m68k.io[101] (memory-region)
>  /mac_m68k.io[102] (memory-region)
>
> but they now appear as:
>
>  /mac_m68k.io\x5b100\x5d[0] (memory-region)
>  /mac_m68k.io\x5b101\x5d[0] (memory-region)
>  /mac_m68k.io\x5b102\x5d[0] (memory-region)

I looked at info qom-tree for an Arm machine, and the [..] seem to be
OK there. I tried to test with q800 but got stuck on finding a
command line to get it to run. Do you have repro instructions?

thanks
-- PMM



Re: [PATCH v3 2/3] test/tcg/ppc64le: test mtfsf

2021-11-30 Thread Cédric Le Goater

Hello Lucas,

On 11/24/21 18:25, Lucas Mateus Castro (alqotel) wrote:

Added tests for the mtfsf to check if FI bit of FPSCR is being set
and if exception calls are being made correctly.

Signed-off-by: Lucas Mateus Castro (alqotel) 


Could you please rebase on mainline and resend ?

Thanks,

C.
 

---
  tests/tcg/ppc64/Makefile.target   |  1 +
  tests/tcg/ppc64le/Makefile.target |  1 +
  tests/tcg/ppc64le/mtfsf.c | 61 +++
  3 files changed, 63 insertions(+)
  create mode 100644 tests/tcg/ppc64le/mtfsf.c

diff --git a/tests/tcg/ppc64/Makefile.target b/tests/tcg/ppc64/Makefile.target
index 6ab7934fdf..8f4c7ac4ed 100644
--- a/tests/tcg/ppc64/Makefile.target
+++ b/tests/tcg/ppc64/Makefile.target
@@ -11,6 +11,7 @@ endif
  bcdsub: CFLAGS += -mpower8-vector
  
  PPC64_TESTS += byte_reverse

+PPC64_TESTS += mtfsf
  ifneq ($(DOCKER_IMAGE)$(CROSS_CC_HAS_POWER10),)
  run-byte_reverse: QEMU_OPTS+=-cpu POWER10
  run-plugin-byte_reverse-with-%: QEMU_OPTS+=-cpu POWER10
diff --git a/tests/tcg/ppc64le/Makefile.target 
b/tests/tcg/ppc64le/Makefile.target
index 5e65b1590d..b8cd9bf73a 100644
--- a/tests/tcg/ppc64le/Makefile.target
+++ b/tests/tcg/ppc64le/Makefile.target
@@ -10,6 +10,7 @@ endif
  bcdsub: CFLAGS += -mpower8-vector
  
  PPC64LE_TESTS += byte_reverse

+PPC64LE_TESTS += mtfsf
  ifneq ($(DOCKER_IMAGE)$(CROSS_CC_HAS_POWER10),)
  run-byte_reverse: QEMU_OPTS+=-cpu POWER10
  run-plugin-byte_reverse-with-%: QEMU_OPTS+=-cpu POWER10
diff --git a/tests/tcg/ppc64le/mtfsf.c b/tests/tcg/ppc64le/mtfsf.c
new file mode 100644
index 00..b3d31f3637
--- /dev/null
+++ b/tests/tcg/ppc64le/mtfsf.c
@@ -0,0 +1,61 @@
+#include <stdlib.h>
+#include <assert.h>
+#include <signal.h>
+#include <sys/prctl.h>
+
+#define FPSCR_VE 7  /* Floating-point invalid operation exception enable */
+#define FPSCR_VXSOFT 10 /* Floating-point invalid operation exception (soft) */
+#define FPSCR_FI 17 /* Floating-point fraction inexact   */
+
+#define FP_VE   (1ull << FPSCR_VE)
+#define FP_VXSOFT   (1ull << FPSCR_VXSOFT)
+#define FP_FI   (1ull << FPSCR_FI)
+
+void sigfpe_handler(int sig, siginfo_t *si, void *ucontext)
+{
+if (si->si_code == FPE_FLTINV) {
+exit(0);
+}
+exit(1);
+}
+
+int main(void)
+{
+union {
+double d;
+long long ll;
+} fpscr;
+
+struct sigaction sa = {
+.sa_sigaction = sigfpe_handler,
+.sa_flags = SA_SIGINFO
+};
+
+/*
+ * Enable the MSR bits F0 and F1 to enable exceptions.
+ * This shouldn't be needed in linux-user as these bits are enabled by
+ * default, but this allows to execute either in a VM or a real machine
+ * to compare the behaviors.
+ */
+prctl(PR_SET_FPEXC, PR_FP_EXC_PRECISE);
+
+/* First test if the FI bit is being set correctly */
+fpscr.ll = FP_FI;
+__builtin_mtfsf(0b11111111, fpscr.d);
+fpscr.d = __builtin_mffs();
+assert((fpscr.ll & FP_FI) != 0);
+
+/* Then test if the deferred exception is being called correctly */
+sigaction(SIGFPE, &sa, NULL);
+
+/*
+ * Although the VXSOFT exception has been chosen, based on test in a Power9
+ * any combination of exception bit + its enabling bit should work.
+ * But if a different exception is chosen si_code check should
+ * change accordingly.
+ */
+fpscr.ll = FP_VE | FP_VXSOFT;
+__builtin_mtfsf(0b11111111, fpscr.d);
+
+return 1;
+}






Re: why is iommu_platform set to off by default?

2021-11-30 Thread Stefan Hajnoczi
On Tue, Nov 30, 2021 at 02:32:49PM +0000, Peter Maydell wrote:
> I've just spent a day or so trying to track down why PCI passthrough
> of a virtio-blk-pci device wasn't working. The problem turns out to be
> that by default virtio pci devices don't use the IOMMU, even when the
> machine model has created an IOMMU and arranged for the PCI bus to
> be underneath it. So when the L2 guest tries to program the virtio device,
> the virtio device treats the IPAs it writes as if they were PAs and
> of course the data structures it's looking for aren't there.
> 
> Why do we default this to 'off'? It seems pretty unhelpful not to
> honour the existence of the IOMMU, and the failure mode is pretty
> opaque (L2 guest just hangs)...

Historically VIRTIO used guest physical addresses instead of bus
addresses (IOVA). I think when IOMMU support was added, a QEMU -device
virtio-* parameter was added and it's simply off by default:

  iommu_platform=  - on/off (default: false)

Maybe this behavior is for backwards compatibility? Existing guests with
IOMMUs shouldn't change behavior, although this could be fixed with
machine type compat properties.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v1 0/8] virtio-mem: Support "prealloc=on"

2021-11-30 Thread Michal Prívozník
On 11/30/21 11:41, David Hildenbrand wrote:
> Based-on: <20211130092838.24224-1-da...@redhat.com>
> 
> Patch #1 - #7 are fully reviewed [1] but did not get picked up yet, so I'm
> sending them along here, as they are required to use os_mem_prealloc() in
> a safe way once the VM is running.
> 
> Support preallocation of memory to make virtio-mem safe to use with
> scarce memory resources such as hugetlb. Before acknowledging a plug
> request from the guest, we'll try preallocating memory. If that fails,
> we'll fail the request gracefully and warn the user once.
> 
> To fully support huge pages for shared memory, we'll have to adjust shared
> memory users, such as virtiofsd, to map guest memory via MAP_NORESERVE as
> well, because otherwise, they'll end up overwriting the "reserve=off"
> decision made by QEMU and try reserving huge pages for the sparse memory
> region.
> 
> In the future, we might want to process guest requests, including
> preallocating memory, asynchronously via a dedicated iothread.
> 
> [1] https://lkml.kernel.org/r/20211004120208.7409-1-da...@redhat.com
> 
> Cc: Paolo Bonzini 
> Cc: "Michael S. Tsirkin" 
> Cc: Igor Mammedov 
> Cc: Eduardo Habkost 
> Cc: Dr. David Alan Gilbert 
> Cc: Daniel P. Berrangé 
> Cc: Gavin Shan 
> Cc: Hui Zhu 
> Cc: Sebastien Boeuf 
> Cc: Pankaj Gupta 
> Cc: Michal Prívozník 
> 
> David Hildenbrand (8):
>   util/oslib-posix: Let touch_all_pages() return an error
>   util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
>   util/oslib-posix: Introduce and use MemsetContext for
> touch_all_pages()
>   util/oslib-posix: Don't create too many threads with small memory or
> little pages
>   util/oslib-posix: Avoid creating a single thread with
> MADV_POPULATE_WRITE
>   util/oslib-posix: Support concurrent os_mem_prealloc() invocation
>   util/oslib-posix: Forward SIGBUS to MCE handler under Linux
>   virtio-mem: Support "prealloc=on" option
> 
>  hw/virtio/virtio-mem.c |  39 +-
>  include/hw/virtio/virtio-mem.h |   4 +
>  include/qemu/osdep.h   |   7 +
>  softmmu/cpus.c |   4 +
>  util/oslib-posix.c | 231 +
>  5 files changed, 226 insertions(+), 59 deletions(-)
> 

Reviewed-by: Michal Privoznik 

Michal




Re: [PATCH V6 19/27] vfio-pci: cpr part 1 (fd and dma)

2021-11-30 Thread Steven Sistare
On 11/10/2021 2:48 AM, Zheng Chuan wrote:
> 
> Hi, steve
> 
> On 2021/8/11 1:06, Alex Williamson wrote:
>> On Fri,  6 Aug 2021 14:43:53 -0700
>> Steve Sistare  wrote:
>> [...]
>>> +static int
>>> +vfio_region_remap(MemoryRegionSection *section, void *handle, Error **errp)
>>> +{
>>> +MemoryRegion *mr = section->mr;
>>> +VFIOContainer *container = handle;
>>> +const char *name = memory_region_name(mr);
>>> +ram_addr_t size = int128_get64(section->size);
>>> +hwaddr offset, iova, roundup;
>>> +void *vaddr;
>>> +
>>> +if (vfio_listener_skipped_section(section) || 
>>> memory_region_is_iommu(mr)) {
>>> +return 0;
>>> +}
>>> +
>>> +offset = section->offset_within_address_space;
>>> +iova = REAL_HOST_PAGE_ALIGN(offset);
> We should not do the remap if it shares one host page with other structures.
> I think a judgement like int128_ge(int128_make64(iova), llend) in 
> vfio_listener_region_add() should also be added here to check it,
> otherwise it will remap no-exit dma, which causes the live update failure.
> diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
> index 0981d31..d231841 100644
> --- a/hw/vfio/cpr.c
> +++ b/hw/vfio/cpr.c
> @@ -58,13 +58,21 @@ vfio_region_remap(MemoryRegionSection *section, void 
> *handle, Error **errp)
>  ram_addr_t size = int128_get64(section->size);
>  hwaddr offset, iova, roundup;
>  void *vaddr;
> -
> +Int128 llend;
> +
>  if (vfio_listener_skipped_section(section) || 
> memory_region_is_iommu(mr)) {
>  return 0;
>  }
> 
>  offset = section->offset_within_address_space;
>  iova = REAL_HOST_PAGE_ALIGN(offset);
> +llend = int128_make64(section->offset_within_address_space);
> +llend = int128_add(llend, section->size);
> +llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask));
> +if (int128_ge(int128_make64(iova), llend)) {
> +return 0;
> +}
> +
>  roundup = iova - offset;
>  size -= roundup;
>  size = REAL_HOST_PAGE_ALIGN(size);
> 
>>> +roundup = iova - offset;
>>> +size -= roundup;
>>> +size = REAL_HOST_PAGE_ALIGN(size);
>>> +vaddr = memory_region_get_ram_ptr(mr) +
>>> +section->offset_within_region + roundup;
>>> +
>>> +trace_vfio_region_remap(name, container->fd, iova, iova + size - 1, 
>>> vaddr);
>>> +return vfio_dma_map_vaddr(container, iova, size, vaddr, errp);
>>> +}

Thank you Zheng.  I intended to implement the logic you suggest, using 64-bit 
arithmetic,
but I botched it.  This should do the trick:

diff --git a/hw/vfio/cpr.c b/hw/vfio/cpr.c
index df334d9..bbdeaea 100644
--- a/hw/vfio/cpr.c
+++ b/hw/vfio/cpr.c
@@ -66,8 +66,8 @@ vfio_region_remap(MemoryRegionSection *section, void *handle,
 offset = section->offset_within_address_space;
 iova = REAL_HOST_PAGE_ALIGN(offset);
 roundup = iova - offset;
-size -= roundup;
-size = REAL_HOST_PAGE_ALIGN(size);
+size -= roundup;/* adjust for starting alignment */
+size &= qemu_real_host_page_mask;   /* adjust for ending alignment */
 end = iova + size;
 if (iova >= end) {
 return 0;

- Steve
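
To illustrate the adjusted arithmetic with made-up numbers (4 KiB host pages):
the start is rounded up into the section and the end rounded down, so only
fully covered host pages get remapped, and a section smaller than one page
ends up with size 0 and is skipped by the iova >= end check.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    const uint64_t page_size = 4096;
    const uint64_t page_mask = ~(page_size - 1);

    /* Made-up section: starts 0x100 into a page and is 3 pages + 0x300 long. */
    uint64_t offset = 0x10000100;
    uint64_t size   = 3 * page_size + 0x300;

    uint64_t iova    = (offset + page_size - 1) & page_mask;  /* round start up */
    uint64_t roundup = iova - offset;

    size -= roundup;        /* adjust for starting alignment */
    size &= page_mask;      /* adjust for ending alignment */

    /* Prints iova=0x10001000 size=0x2000 end=0x10003000. */
    printf("iova=0x%" PRIx64 " size=0x%" PRIx64 " end=0x%" PRIx64 "\n",
           iova, size, iova + size);
    return 0;
}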



Re: [PATCH v7 0/3] support dirty restraint on vCPU

2021-11-30 Thread Hyman Huang




On 11/30/21 22:57, Hyman Huang wrote:

1.
Start the VM with kernel+initrd.img, using a QEMU command line like the following:

[root@Hyman_server1 fast_qemu]# cat vm.sh
#!/bin/bash
/usr/bin/qemu-system-x86_64 \
     -display none -vga none \
     -name guest=simple_vm,debug-threads=on \
     -monitor stdio \
     -machine pc-i440fx-2.12 \
     -accel kvm,dirty-ring-size=65536 -cpu host \
     -kernel /home/work/fast_qemu/vmlinuz-5.13.0-rc4+ \
     -initrd /home/work/fast_qemu/initrd-stress.img \
     -append "noapic edd=off printk.time=1 noreplace-smp 
cgroup_disable=memory pci=noearly console=ttyS0 debug ramsize=1500 
ratio=1 sleep=1" \

     -chardev file,id=charserial0,path=/var/log/vm_console.log \
     -serial chardev:charserial0 \
     -qmp unix:/tmp/qmp-sock,server,nowait \
     -D /var/log/vm.log \
     --trace events=/home/work/fast_qemu/events \
     -m 4096 -smp 2 -device sga

2.
Enable the dirtylimit trace events, which will be written to /var/log/vm.log:
[root@Hyman_server1 fast_qemu]# cat /home/work/fast_qemu/events
dirtylimit_state_init
dirtylimit_vcpu
dirtylimit_impose
dirtyrate_do_calculate_vcpu


3.
Connect to the QMP server with the low-level QMP client and issue set-dirty-limit:

[root@Hyman_server1 my_qemu]# python3.6 ./scripts/qmp/qmp-shell -v -p 
/tmp/qmp-sock


Welcome to the QMP low-level shell!
Connected to QEMU 6.1.92

(QEMU) set-dirty-limit cpu-index=1 dirty-rate=400


{
     "arguments": {
     "cpu-index": 1,
     "dirty-rate": 400
     },
     "execute": "set-dirty-limit"
}

4.
Observe the vCPU's current dirty rate and quota dirty rate...

[root@Hyman_server1 ~]# tail -f /var/log/vm.log
dirtylimit_state_init dirtylimit state init: max cpus 2
dirtylimit_vcpu CPU[1] set quota dirtylimit 400
dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 0, 
percentage 0

dirtyrate_do_calculate_vcpu vcpu[0]: 1075 MB/s
dirtyrate_do_calculate_vcpu vcpu[1]: 1061 MB/s
dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 1061, 
percentage 62

dirtyrate_do_calculate_vcpu vcpu[0]: 1133 MB/s
dirtyrate_do_calculate_vcpu vcpu[1]: 380 MB/s
dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 380, 
percentage 57

dirtyrate_do_calculate_vcpu vcpu[0]: 1227 MB/s
dirtyrate_do_calculate_vcpu vcpu[1]: 464 MB/s

We can observe that vcpu-1's dirty rate is about 400 MB/s with the dirty page 
limit set, and vcpu-0 is not affected.


5.
Observe the VM stress info...
[root@Hyman_server1 fast_qemu]# tail -f /var/log/vm_console.log
[    0.838051] Run /init as init process
[    0.839216]   with arguments:
[    0.840153] /init
[    0.840882]   with environment:
[    0.841884] HOME=/
[    0.842649] TERM=linux
[    0.843478] edd=off
[    0.844233] ramsize=1500
[    0.845079] ratio=1
[    0.845829] sleep=1
/init (1): INFO: RAM 1500 MiB across 2 CPUs, ratio 1, sleep 1 us
[    1.158011] random: init: uninitialized urandom read (4096 bytes read)
[    1.448205] random: init: uninitialized urandom read (4096 bytes read)
/init (1): INFO: 1638282593684ms copied 1 GB in 00729ms
/init (00110): INFO: 1638282593964ms copied 1 GB in 00719ms
/init (1): INFO: 1638282594405ms copied 1 GB in 00719ms
/init (00110): INFO: 1638282594677ms copied 1 GB in 00713ms
/init (1): INFO: 1638282595093ms copied 1 GB in 00686ms
/init (00110): INFO: 1638282595339ms copied 1 GB in 00662ms
/init (1): INFO: 1638282595764ms copied 1 GB in 00670m

PS: the kernel and initrd images come from:

kernel image: vmlinuz-5.13.0-rc4+, a normal CentOS vmlinuz copied from the 
/boot directory


initrd.img: initrd-stress.img, which only contains a stress binary 
compiled from the QEMU source tests/migration/stress.c and run as init 
in the VM.

You can view the README.md file of my project 
"met" (https://github.com/newfriday/met) for how to compile the 
initrd-stress.img. :)


On 11/30/21 20:57, Peter Xu wrote:

On Tue, Nov 30, 2021 at 06:28:10PM +0800, huang...@chinatelecom.cn wrote:

From: Hyman Huang(黄勇) 

The patch [2/3] has not been touched so far. Any corrections and
suggestions are welcome.


I played with it today, but the vcpu didn't get throttled as expected.

What I did was start two workloads at 500MB/s, each pinned to one vcpu
thread:

[root@fedora ~]# pgrep -fa mig_mon
595 ./mig_mon mm_dirty 1000 500 sequential
604 ./mig_mon mm_dirty 1000 500 sequential
[root@fedora ~]# taskset -pc 595
pid 595's current affinity list: 2
[root@fedora ~]# taskset -pc 604
pid 604's current affinity list: 3

Then start throttle with 100mb/s:

(QEMU) set-dirty-limit cpu-index=3 dirty-rate=100
{"return": {}}
(QEMU) set-dirty-limit cpu-index=2 dirty-rate=100
{"return": {}}

I can see the workload dropped a tiny little bit (perhaps 500mb -> 499mb), then
it keeps going..
The test steps I listed above assume that the dirty rate calculated by 
dirtylimit_calc_func via the dirty ring is accurate, which differs from
your test policy.

The macro DIRTYLIMIT_CALC_TIME_MS used as the calculation period in 
migration/dirtyrate.c has a big effect on the result. So "how we 

Re: [PATCH v1 3/3] virtio-mem: Set "unplugged-inaccessible=auto" for the 6.2 machine on x86

2021-11-30 Thread David Hildenbrand
On 30.11.21 16:34, Michal Prívozník wrote:
> On 11/30/21 10:28, David Hildenbrand wrote:
>> Set the new default to "auto", keeping it set to "on" for compat
> 
> I believe you wanted to say: keeping it set to "off", because that's
> what the patch does.

Right, I switched semantics from a "legacy-guest-support=on/off/auto" to
"unplugged-inaccessible=on/off/auto"

Thanks a lot for the fast review!!!

-- 
Thanks,

David / dhildenb




Re: [PATCH 2/2] hw/core/loader: workaround read() size limit.

2021-11-30 Thread Jamie Iles
Hi Philippe,

On Thu, Nov 11, 2021 at 05:04:56PM +0000, Jamie Iles wrote:
> On Thu, Nov 11, 2021 at 04:55:35PM +0100, Philippe Mathieu-Daudé wrote:
> > On 11/11/21 16:43, Philippe Mathieu-Daudé wrote:
> > > On 11/11/21 16:36, Jamie Iles wrote:
> > >> Hi Philippe,
> > >>
> > >> On Thu, Nov 11, 2021 at 03:55:48PM +0100, Philippe Mathieu-Daudé wrote:
> > >>> Hi Jamie,
> > >>>
> > >>> On 11/11/21 15:11, Jamie Iles wrote:
> >  On Linux, read() will only ever read a maximum of 0x7ffff000 bytes
> >  regardless of what is asked.  If the file is larger than 0x7ffff000
> >  bytes the read will need to be broken up into multiple chunks.
> > 
> >  Cc: Luc Michel 
> >  Signed-off-by: Jamie Iles 
> >  ---
> >   hw/core/loader.c | 40 ++--
> >   1 file changed, 34 insertions(+), 6 deletions(-)
> > 
> >  diff --git a/hw/core/loader.c b/hw/core/loader.c
> >  index 348bbf535bd9..16ca9b99cf0f 100644
> >  --- a/hw/core/loader.c
> >  +++ b/hw/core/loader.c
> >  @@ -80,6 +80,34 @@ int64_t get_image_size(const char *filename)
> >   return size;
> >   }
> >   
> >  +static ssize_t read_large(int fd, void *dst, size_t len)
> >  +{
> >  +/*
> >  + * man 2 read says:
> >  + *
> >  + * On Linux, read() (and similar system calls) will transfer at 
> >  most
> >  + * 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes
> > >>>
> > >>> Could you mention MAX_RW_COUNT from linux/fs.h?
> > >>>
> >  + * actually transferred.  (This is true on both 32-bit and 64-bit
> >  + * systems.)
> > >>>
> > >>> Maybe "This is true for both ILP32 and LP64 data models used by Linux"?
> > >>> (because that would not be the case for the ILP64 model).
> > >>>
> > >>> Otherwise s/systems/Linux variants/?
> > >>>
> >  + *
> >  + * So read in chunks no larger than 0x7ffff000 bytes.
> >  + */
> >  +size_t max_chunk_size = 0x7ffff000;
> > >>>
> > >>> We can declare it static const.
> > >>
> > >> Ack, can fix all of those up.
> > >>
> >  +size_t offset = 0;
> >  +
> >  +while (offset < len) {
> >  +size_t chunk_len = MIN(max_chunk_size, len - offset);
> >  +ssize_t br = read(fd, dst + offset, chunk_len);
> >  +
> >  +if (br < 0) {
> >  +return br;
> >  +}
> >  +offset += br;
> >  +}
> >  +
> >  +return (ssize_t)len;
> >  +}
> > >>>
> > >>> I see other read()/pread() calls:
> > >>>
> > >>> hw/9pfs/9p-local.c:472:tsize = read(fd, (void *)buf, bufsz);
> > >>> hw/vfio/common.c:269:if (pread(vbasedev->fd, , size,
> > >>> region->fd_offset + addr) != size) {
> > >>> ...
> > >>>
> > >>> Maybe the read_large() belongs to "sysemu/os-xxx.h"?
> > >>
> > >> I think util/osdep.c would be a good fit for this.  To make sure we're 
> > > 
> > > Yes.
> > > 
> > >> on the same page though are you proposing converting all pread/read 
> > >> calls to a qemu variant or auditing for ones that could potentially take 
> > >> a larger size?
> > > 
> > > Yes, I took some time wondering beside loading blob in guest memory,
> > > what would be the other issues you might encounter. I couldn't find
> > > many cases. Eventually hw/vfio/. I haven't audit much, only noticed
> > > hw/9pfs/9p-local.c and qga/commands-*.c (not sure if relevant), but
> > > since we want to fix this, I'd rather try to fix it globally.
> > 
> > Actually what you suggest is simpler, add qemu_read() / qemu_pread()
> > in util/osdep.c, convert all uses without caring about any audit.
> 
> Okay, this hasn't worked out too badly - I'll do the same for 
> write/pwrite too and then switch all of the callers over with a 
> coccinelle patch so it'll be a fairly large diff but simple.
> 
> We could elect to keep any calls with a compile-time constant length 
> with the unwrapped variants but I think that's probably more confusing 
> in the long-run.

Coming back to this I think this is probably a non-starter because of 
non-blocking file descriptors.  There is already a qemu_write_full so 
I'm inclined to add qemu_read_full following the same pattern and then 
convert all of the read calls in the loader to use that.

Thanks,

Jamie
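
For what it's worth, here is a minimal sketch of a qemu_read_full() modelled
on the existing qemu_write_full() pattern mentioned above; the name and exact
semantics are hypothetical until such a helper is actually posted. It retries
on EINTR and loops on short reads, which also covers the 0x7ffff000-per-call
limit discussed earlier in the thread.

#include <errno.h>
#include <unistd.h>

/* Sketch only: read 'count' bytes unless EOF or an error occurs first. */
ssize_t qemu_read_full(int fd, void *buf, size_t count)
{
    char *p = buf;
    ssize_t total = 0;

    while (count > 0) {
        ssize_t ret = read(fd, p, count);

        if (ret < 0) {
            if (errno == EINTR) {
                continue;               /* retry interrupted reads */
            }
            return total ? total : -1;  /* data read so far, else the error */
        }
        if (ret == 0) {
            break;                      /* EOF: return the short count */
        }
        p += ret;
        count -= ret;
        total += ret;
    }
    return total;
}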



Re: [PATCH v1 2/3] virtio-mem: Support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE

2021-11-30 Thread Michal Prívozník
On 11/30/21 10:28, David Hildenbrand wrote:
> With VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, we signal the VM that reading
> unplugged memory is not supported. We have to fail feature negotiation
> in case the guest does not support VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE.
> 
> First, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE is required to properly handle
> memory backends (or architectures) without support for the shared zeropage
> in the hypervisor cleanly. Without the shared zeropage, even reading an
> unpopulated virtual memory location can populate real memory and
> consequently consume memory in the hypervisor. We have a guaranteed shared
> zeropage only on MAP_PRIVATE anonymous memory.
> 
> Second, we want VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE to be the default
> long-term as even populating the shared zeropage can be problematic: for
> example, without THP support (possible) or without support for the shared
> huge zeropage with THP (unlikely), the PTE page tables to hold the shared
> zeropage entries can consume quite some memory that cannot be reclaimed
> easily.
> 
> Third, there are other optimizations+features (e.g., protection of
> unplugged memory, reducing the total memory slot size and bitmap sizes)
> that will require VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE.
> 
> We really only support x86 targets with virtio-mem for now (and
> Linux similarly only support x86), but that might change soon, so prepare
> for different targets already.
> 
> Add a new "unplugged-inaccessible" tristate property for x86 targets:
> - "off" will keep VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE unset and legacy
>   guests working.
> - "on" will set VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE and stop legacy guests
>   from using the device.
> - "auto" selects the default based on support for the shared zeropage.
> 
> Warn in case the property is set to "off" and we don't have support for the
> shared zeropage.
> 
> For existing compat machines, the property will default to "off", to
> not change the behavior but eventually warn about a problematic setup.
> Short-term, we'll set the property default to "auto" for new QEMU machines.
> Mid-term, we'll set the property default to "on" for new QEMU machines.
> Long-term, we'll deprecate the parameter and disallow legacy
> guests completely.
> 
> The property has to match on the migration source and destination. "auto"
> will result in the same VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE setting as long
> as the qemu command line (esp. memdev) match -- so "auto" is good enough
> for migration purposes and the parameter doesn't have to be migrated
> explicitly.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  hw/virtio/virtio-mem.c | 63 ++
>  include/hw/virtio/virtio-mem.h |  8 +
>  2 files changed, 71 insertions(+)
> 
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index d5a578142b..1e57156e81 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -32,6 +32,14 @@
>  #include CONFIG_DEVICES
>  #include "trace.h"
>  
> +/*
> + * We only had legacy x86 guests that did not support
> + * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE. Other targets don't have legacy 
> guests.
> + */
> +#if defined(TARGET_X86_64) || defined(TARGET_I386)
> +#define VIRTIO_MEM_HAS_LEGACY_GUESTS
> +#endif
> +
>  /*
>   * Let's not allow blocks smaller than 1 MiB, for example, to keep the 
> tracking
>   * bitmap small.
> @@ -110,6 +118,19 @@ static uint64_t virtio_mem_default_block_size(RAMBlock 
> *rb)
>  return MAX(page_size, VIRTIO_MEM_MIN_BLOCK_SIZE);
>  }
>  
> +#if defined(VIRTIO_MEM_HAS_LEGACY_GUESTS)
> +static bool virtio_mem_has_shared_zeropage(RAMBlock *rb)
> +{
> +/*
> + * We only have a guaranteed shared zeropage on ordinary MAP_PRIVATE
> + * anonymous RAM. In any other case, reading unplugged *can* populate a
> + * fresh page, consuming actual memory.
> + */
> +return !qemu_ram_is_shared(rb) && rb->fd < 0 &&
> +   qemu_ram_pagesize(rb) == qemu_real_host_page_size;
> +}
> +#endif /* VIRTIO_MEM_HAS_LEGACY_GUESTS */
> +
>  /*
>   * Size the usable region bigger than the requested size if possible. Esp.
>   * Linux guests will only add (aligned) memory blocks in case they fully
> @@ -653,15 +674,29 @@ static uint64_t virtio_mem_get_features(VirtIODevice 
> *vdev, uint64_t features,
>  Error **errp)
>  {
>  MachineState *ms = MACHINE(qdev_get_machine());
> +VirtIOMEM *vmem = VIRTIO_MEM(vdev);
>  
>  if (ms->numa_state) {
>  #if defined(CONFIG_ACPI)
>  virtio_add_feature(&features, VIRTIO_MEM_F_ACPI_PXM);
>  #endif
>  }
> +assert(vmem->unplugged_inaccessible != ON_OFF_AUTO_AUTO);
> +if (vmem->unplugged_inaccessible == ON_OFF_AUTO_ON) {
> +virtio_add_feature(&features, VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE);
> +}
>  return features;
>  }
>  
> +static int virtio_mem_validate_features(VirtIODevice *vdev)
> +{
> +if (virtio_host_has_feature(vdev, 

Re: [PATCH v1 1/3] linux-headers: sync VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE

2021-11-30 Thread Michal Prívozník
On 11/30/21 10:28, David Hildenbrand wrote:
> TODO: replace by a proper header sync.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  include/standard-headers/linux/virtio_mem.h | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/include/standard-headers/linux/virtio_mem.h 
> b/include/standard-headers/linux/virtio_mem.h
> index 05e5ade75d..18c74c527c 100644
> --- a/include/standard-headers/linux/virtio_mem.h
> +++ b/include/standard-headers/linux/virtio_mem.h
> @@ -68,9 +68,10 @@
>   * explicitly triggered (VIRTIO_MEM_REQ_UNPLUG).
>   *
>   * There are no guarantees what will happen if unplugged memory is
> - * read/written. Such memory should, in general, not be touched. E.g.,
> - * even writing might succeed, but the values will simply be discarded at
> - * random points in time.
> + * read/written. In general, unplugged memory should not be touched, because
> + * the resulting action is undefined. There is one exception: without
> + * VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE, unplugged memory inside the usable
> + * region can be read, to simplify creation of memory dumps.
>   *
>   * It can happen that the device cannot process a request, because it is
>   * busy. The device driver has to retry later.
> @@ -87,6 +88,8 @@
>  
>  /* node_id is an ACPI PXM and is valid */
>  #define VIRTIO_MEM_F_ACPI_PXM0
> +/* unplugged memory must not be accessed */
> +#define VIRTIO_MEM_F_UNPLUGGED_INACCESSIBLE  1
>  
>  
>  /* --- virtio-mem: guest -> host requests --- */
> 

This is a verbatim copy of kernel commit 61082ad6a6e1f.

Reviewed-by: Michal Privoznik 

Michal




Re: [PATCH v1 3/3] virtio-mem: Set "unplugged-inaccessible=auto" for the 6.2 machine on x86

2021-11-30 Thread Michal Prívozník
On 11/30/21 10:28, David Hildenbrand wrote:
> Set the new default to "auto", keeping it set to "on" for compat

I believe you wanted to say: keeping it set to "off", because that's
what the patch does.

> machines. This property is only available for x86 targets.
> 
> TODO: once 6.2 was released and we have compat machines, target the next
>   QEMU release.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  hw/i386/pc.c   | 1 +
>  hw/virtio/virtio-mem.c | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index a2ef40ecbc..045ba05431 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -99,6 +99,7 @@ GlobalProperty pc_compat_6_1[] = {
>  { TYPE_X86_CPU, "hv-version-id-major", "0x0006" },
>  { TYPE_X86_CPU, "hv-version-id-minor", "0x0001" },
>  { "ICH9-LPC", "x-keep-pci-slot-hpc", "false" },
> +{ "virtio-mem", "unplugged-inaccessible", "off" },
>  };
>  const size_t pc_compat_6_1_len = G_N_ELEMENTS(pc_compat_6_1);
>  
> diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
> index 1e57156e81..a5d26d414f 100644
> --- a/hw/virtio/virtio-mem.c
> +++ b/hw/virtio/virtio-mem.c
> @@ -1169,7 +1169,7 @@ static Property virtio_mem_properties[] = {
>   TYPE_MEMORY_BACKEND, HostMemoryBackend *),
>  #if defined(VIRTIO_MEM_HAS_LEGACY_GUESTS)
>  DEFINE_PROP_ON_OFF_AUTO(VIRTIO_MEM_UNPLUGGED_INACCESSIBLE_PROP, 
> VirtIOMEM,
> -unplugged_inaccessible, ON_OFF_AUTO_OFF),
> +unplugged_inaccessible, ON_OFF_AUTO_AUTO),
>  #endif
>  DEFINE_PROP_END_OF_LIST(),
>  };
> 

Reviewed-by: Michal Privoznik 

Michal




Re: [PATCH v7 3/3] cpus-common: implement dirty page limit on vCPU

2021-11-30 Thread Hyman Huang




On 11/30/21 21:21, Peter Xu wrote:

On Tue, Nov 30, 2021 at 06:28:13PM +0800, huang...@chinatelecom.cn wrote:

  ##
+# @set-dirty-limit:
+#
+# Set the upper limit of dirty page rate for a virtual CPU.
+#
+# Requires KVM with accelerator property "dirty-ring-size" set.
+# A virtual CPU's dirty page rate is a measure of its memory load.
+# To observe dirty page rates, use @calc-dirty-rate.
+#
+# @cpu-index: index of the virtual CPU.
+#
+# @dirty-rate: upper limit for the specified vCPU's dirty page rate (MB/s)
+#
+# Since: 7.0
+#
+# Example:
+#   {"execute": "set-dirty-limit"}
+#"arguments": { "cpu-index": 0,
+#   "dirty-rate": 200 } }
+#
+##
+{ 'command': 'set-dirty-limit',
+  'data': { 'cpu-index': 'int', 'dirty-rate': 'uint64' } }
+
+##
+# @cancel-dirty-limit:
+#
+# Cancel the dirty page limit for the vCPU which has been set with
+# set-dirty-limit command. Note that this command requires support from
+# dirty ring, same as the "set-dirty-limit" command.
+#
+# @cpu-index: index of the virtual CPU to cancel the dirty page limit
+#
+# Since: 7.0
+#
+# Example:
+#   {"execute": "cancel-dirty-limit"}
+#"arguments": { "cpu-index": 0 } }
+#
+##
+{ 'command': 'cancel-dirty-limit',
+  'data': { 'cpu-index': 'int' } }


This seems to be overloaded to be a standalone cmd..

How about:

   { "cmd": "vcpu-dirty-limit",
 "arguments": {
   "cpu": $cpu,
   "enable": true/false,
   "dirty-rate": 100,
 }
   }

If "enable"==false, then "dirty-rate" can be ignored and it'll shut down the
throttling on vcpu N.  Then this command will literally merge the two you
proposed.

It'll be nice if we provide yet another command:

   { "cmd": "query-vcpu-dirty-limit",
 "arguments": {
   "*cpu": $cpu,
 }
   }

When $cpu is specified, we return (cpu=$cpu, real_dirty_rate,
target_dirty_rate) for this vcpu.  When $cpu is not specified, we return an
array of that containing all the vcpus.

It'll be nicer to enhance the output of the query command to e.g. have a global
"enabled"=true/false as long as any vcpu has throttle enabled then the global
throttle is enabled.  I didn't think more than that, but how's that sound so
far?
Sounds good; it makes the command easier for programmers to use and
understand. I'll try this out in the next version.


Thanks,



--
Best Regards
Hyman Huang(黄勇)



Re: [PATCH v7 1/3] migration/dirtyrate: implement vCPU dirtyrate calculation periodically

2021-11-30 Thread Hyman Huang




On 11/30/21 21:04, Peter Xu wrote:

On Tue, Nov 30, 2021 at 06:28:11PM +0800, huang...@chinatelecom.cn wrote:

+static void dirtylimit_calc_func(void)
+{
+CPUState *cpu;
+DirtyPageRecord *dirty_pages;
+int64_t start_time, end_time, calc_time;
+DirtyRateVcpu rate;
+int i = 0;
+
+dirty_pages = g_malloc0(sizeof(*dirty_pages) *
+dirtylimit_calc_state->data.nvcpu);
+
+dirtylimit_global_dirty_log_start();
+
+CPU_FOREACH(cpu) {
+record_dirtypages(dirty_pages, cpu, true);
+}
+
+start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+g_usleep(DIRTYLIMIT_CALC_TIME_MS * 1000);
+end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+calc_time = end_time - start_time;
+
+dirtylimit_global_dirty_log_stop();


I haven't looked into the details, but..  I'm wondering whether we should just
keep the dirty ring enabled during the whole process of throttling.

start/stop can be expensive, especially when huge pages are used, start dirty
tracking will start to do huge page split. While right after the "stop" all the
huge pages will need to be rebuild again.

David from Google is even proposing a kernel change to eagerly splitting huge
pages when dirty tracking is enabled.

So I think we can keep the dirty tracking enabled until all the vcpu throttles
are stopped.

Yes, it's a good idea and I'll try this out in the next version.



+
+CPU_FOREACH(cpu) {
+record_dirtypages(dirty_pages, cpu, false);
+}




--
Best Regards
Hyman Huang(黄勇)



Re: [PATCH] memory: Fix incorrect calls of log_global_start/stop

2021-11-30 Thread David Hildenbrand
On 30.11.21 09:00, Peter Xu wrote:
> We should only call the log_global_start/stop when the global dirty track
> bitmask changes from zero<->non-zero.
> 
> No real issue reported for this yet probably because no immediate user to
> enable both dirty rate measurement and migration at the same time.  However
> it'll be good to be prepared for it.
> 
> Fixes: 63b41db4bc ("memory: make global_dirty_tracking a bitmask")
> Cc: Hyman Huang 
> Cc: Paolo Bonzini 
> Cc: Dr. David Alan Gilbert 
> Cc: Juan Quintela 
> Cc: David Hildenbrand 
> Signed-off-by: Peter Xu 

LGTM

Reviewed-by: David Hildenbrand 


-- 
Thanks,

David / dhildenb




Re: [PATCH v7 0/3] support dirty restraint on vCPU

2021-11-30 Thread Hyman Huang

1.
Start the VM with kernel+initrd.img using the following QEMU command line:

[root@Hyman_server1 fast_qemu]# cat vm.sh
#!/bin/bash
/usr/bin/qemu-system-x86_64 \
-display none -vga none \
-name guest=simple_vm,debug-threads=on \
-monitor stdio \
-machine pc-i440fx-2.12 \
-accel kvm,dirty-ring-size=65536 -cpu host \
-kernel /home/work/fast_qemu/vmlinuz-5.13.0-rc4+ \
-initrd /home/work/fast_qemu/initrd-stress.img \
-append "noapic edd=off printk.time=1 noreplace-smp 
cgroup_disable=memory pci=noearly console=ttyS0 debug ramsize=1500 
ratio=1 sleep=1" \

-chardev file,id=charserial0,path=/var/log/vm_console.log \
-serial chardev:charserial0 \
-qmp unix:/tmp/qmp-sock,server,nowait \
-D /var/log/vm.log \
--trace events=/home/work/fast_qemu/events \
-m 4096 -smp 2 -device sga

2.
Enable the dirtylimit trace events, which will be written to /var/log/vm.log:
[root@Hyman_server1 fast_qemu]# cat /home/work/fast_qemu/events
dirtylimit_state_init
dirtylimit_vcpu
dirtylimit_impose
dirtyrate_do_calculate_vcpu


3.
Connect to the QMP server with the low-level QMP client and issue set-dirty-limit:

[root@Hyman_server1 my_qemu]# python3.6 ./scripts/qmp/qmp-shell -v -p 
/tmp/qmp-sock 



Welcome to the QMP low-level shell!
Connected to QEMU 6.1.92

(QEMU) set-dirty-limit cpu-index=1 dirty-rate=400 




{
"arguments": {
"cpu-index": 1,
"dirty-rate": 400
},
"execute": "set-dirty-limit"
}

4.
Observe each vCPU's current dirty rate and quota dirty rate:

[root@Hyman_server1 ~]# tail -f /var/log/vm.log
dirtylimit_state_init dirtylimit state init: max cpus 2
dirtylimit_vcpu CPU[1] set quota dirtylimit 400
dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 0, 
percentage 0

dirtyrate_do_calculate_vcpu vcpu[0]: 1075 MB/s
dirtyrate_do_calculate_vcpu vcpu[1]: 1061 MB/s
dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 1061, 
percentage 62

dirtyrate_do_calculate_vcpu vcpu[0]: 1133 MB/s
dirtyrate_do_calculate_vcpu vcpu[1]: 380 MB/s
dirtylimit_impose CPU[1] impose dirtylimit: quota 400, current 380, 
percentage 57

dirtyrate_do_calculate_vcpu vcpu[0]: 1227 MB/s
dirtyrate_do_calculate_vcpu vcpu[1]: 464 MB/s

We can observe that vcpu-1's dirty rate is about 400 MB/s with the dirty page
limit set, and vcpu-0 is not affected.


5.
Observe the VM stress info:
[root@Hyman_server1 fast_qemu]# tail -f /var/log/vm_console.log
[0.838051] Run /init as init process
[0.839216]   with arguments:
[0.840153] /init
[0.840882]   with environment:
[0.841884] HOME=/
[0.842649] TERM=linux
[0.843478] edd=off
[0.844233] ramsize=1500
[0.845079] ratio=1
[0.845829] sleep=1
/init (1): INFO: RAM 1500 MiB across 2 CPUs, ratio 1, sleep 1 us
[1.158011] random: init: uninitialized urandom read (4096 bytes read)
[1.448205] random: init: uninitialized urandom read (4096 bytes read)
/init (1): INFO: 1638282593684ms copied 1 GB in 00729ms
/init (00110): INFO: 1638282593964ms copied 1 GB in 00719ms
/init (1): INFO: 1638282594405ms copied 1 GB in 00719ms
/init (00110): INFO: 1638282594677ms copied 1 GB in 00713ms
/init (1): INFO: 1638282595093ms copied 1 GB in 00686ms
/init (00110): INFO: 1638282595339ms copied 1 GB in 00662ms
/init (1): INFO: 1638282595764ms copied 1 GB in 00670m

PS: the kernel and initrd images come from:

kernel image: vmlinuz-5.13.0-rc4+, a normal CentOS vmlinuz copied from the
/boot directory

initrd.img: initrd-stress.img, which contains only a stress binary, compiled
from the QEMU source tests/migration/stress.c and run as init in the VM.

You can view the README.md file of my project
"met" (https://github.com/newfriday/met) to compile the initrd-stress.img. :)


On 11/30/21 20:57, Peter Xu wrote:

On Tue, Nov 30, 2021 at 06:28:10PM +0800, huang...@chinatelecom.cn wrote:

From: Hyman Huang(黄勇) 

The patch [2/3] has not been touched so far. Any corrections and
suggestions are welcome.


I played with it today, but the vcpu didn't get throttled as expected.

What I did was starting two workload with 500mb/s, each pinned on one vcpu
thread:

[root@fedora ~]# pgrep -fa mig_mon
595 ./mig_mon mm_dirty 1000 500 sequential
604 ./mig_mon mm_dirty 1000 500 sequential
[root@fedora ~]# taskset -pc 595
pid 595's current affinity list: 2
[root@fedora ~]# taskset -pc 604
pid 604's current affinity list: 3

Then start throttle with 100mb/s:

(QEMU) set-dirty-limit cpu-index=3 dirty-rate=100
{"return": {}}
(QEMU) set-dirty-limit cpu-index=2 dirty-rate=100
{"return": {}}

I can see the workload dropped a tiny little bit (perhaps 500mb -> 499mb), then
it keeps going..

Further throttle won't work too:

(QEMU) set-dirty-limit cpu-index=2 dirty-rate=10
{"return": {}}

Funnily, the ssh client got slowed down instead... :(

Yong, how did you test it?



--
Best Regards
Hyman Huang(黄勇)



why is iommu_platform set to off by default?

2021-11-30 Thread Peter Maydell
I've just spent a day or so trying to track down why PCI passthrough
of a virtio-blk-pci device wasn't working. The problem turns out to be
that by default virtio pci devices don't use the IOMMU, even when the
machine model has created an IOMMU and arranged for the PCI bus to
be underneath it. So when the L2 guest tries to program the virtio device,
the virtio device treats the IPAs it writes as if they were PAs and
of course the data structures it's looking for aren't there.

Why do we default this to 'off'? It seems pretty unhelpful not to
honour the existence of the IOMMU, and the failure mode is pretty
opaque (L2 guest just hangs)...

thanks
-- PMM



Re: [PATCH 2/2] intel_iommu: Fix irqchip / X2APIC configuration checks

2021-11-30 Thread Claudio Fontana
Acked-by: Claudio Fontana 

I'll try to find time tonight to give you a tested by.

Ciao,

Claudio

On 11/30/21 2:42 PM, David Woodhouse wrote:
> We don't need to check kvm_enable_x2apic(). It's perfectly OK to support
> interrupt remapping even if we can't address CPUs above 254. Kind of
> pointless, but still functional.
> 
> The check on kvm_enable_x2apic() needs to happen *anyway* in order to
> allow CPUs above 254 even without an IOMMU, so allow that to happen
> elsewhere.
> 
> However, we do require the *split* irqchip in order to rewrite I/OAPIC
> destinations. So fix that check while we're here.
> 
> Signed-off-by: David Woodhouse 
> ---
>  hw/i386/intel_iommu.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 294499ee20..b0439d0fbf 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3746,15 +3746,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, 
> Error **errp)
>ON_OFF_AUTO_ON : 
> ON_OFF_AUTO_OFF;
>  }
>  if (s->intr_eim == ON_OFF_AUTO_ON && !s->buggy_eim) {
> -if (!kvm_irqchip_in_kernel()) {
> +if (!kvm_irqchip_is_split()) {
>  error_setg(errp, "eim=on requires 
> accel=kvm,kernel-irqchip=split");
>  return false;
>  }
> -if (!kvm_enable_x2apic()) {
> -error_setg(errp, "eim=on requires support on the KVM side"
> - "(X2APIC_API, first shipped in v4.7)");
> -return false;
> -}
>  }
>  
>  /* Currently only address widths supported are 39 and 48 bits */
> 




[PATCH] virtio: signal after wrapping packed used_idx

2021-11-30 Thread Stefan Hajnoczi
Packed Virtqueues wrap used_idx instead of letting it run freely like
Split Virtqueues do. If the used ring wraps more than once there is no
way to compare vq->signalled_used and vq->used_idx in
virtio_packed_should_notify() since they are modulo vq->vring.num.

This causes the device to stop sending used buffer notifications when
virtio_packed_should_notify() is called less than once each time
around the used ring.

It is possible to trigger this with virtio-blk's dataplane
notify_guest_bh() irq coalescing optimization. The call to
virtio_notify_irqfd() (and virtio_packed_should_notify()) is deferred to
a BH. If the guest driver is polling it can complete and submit more
requests before the BH executes, causing the used ring to wrap more than
once. The result is that the virtio-blk device ceases to raise
interrupts and I/O hangs.

Cc: Tiwei Bie 
Cc: Jason Wang 
Cc: Michael S. Tsirkin 
Signed-off-by: Stefan Hajnoczi 
---
Smarter solutions welcome, but I think notifying once per vq->vring.num
is acceptable.
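
As an aside, here is a tiny standalone illustration (not QEMU code; the ring
size and indices are made up) of why two snapshots of an index kept modulo
vq->vring.num cannot distinguish "no progress" from "one full lap":

#include <assert.h>
#include <stdint.h>

int main(void)
{
    const uint16_t vring_num = 256;   /* ring size, illustrative */
    uint16_t signalled_used = 42;     /* index at the last notification */

    /* The device completes exactly vring_num buffers before the deferred
     * notification runs, so used_idx wraps back to the same value. */
    uint16_t used_idx = (uint16_t)((signalled_used + vring_num) % vring_num);

    /* Comparing the wrapped indices shows no difference, so no notification
     * is sent even though the ring advanced a full lap. */
    assert(used_idx == signalled_used);
    return 0;
}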
---
 hw/virtio/virtio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index ea7c079fb0..f7851c2750 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -885,6 +885,7 @@ static void virtqueue_packed_flush(VirtQueue *vq, unsigned 
int count)
 if (vq->used_idx >= vq->vring.num) {
 vq->used_idx -= vq->vring.num;
 vq->used_wrap_counter ^= 1;
+vq->signalled_used_valid = false;
 }
 }
 
-- 
2.33.1




[PATCH 2/2] intel_iommu: Fix irqchip / X2APIC configuration checks

2021-11-30 Thread David Woodhouse
We don't need to check kvm_enable_x2apic(). It's perfectly OK to support
interrupt remapping even if we can't address CPUs above 254. Kind of
pointless, but still functional.

The check on kvm_enable_x2apic() needs to happen *anyway* in order to
allow CPUs above 254 even without an IOMMU, so allow that to happen
elsewhere.

However, we do require the *split* irqchip in order to rewrite I/OAPIC
destinations. So fix that check while we're here.

Signed-off-by: David Woodhouse 
---
 hw/i386/intel_iommu.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 294499ee20..b0439d0fbf 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3746,15 +3746,10 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error 
**errp)
   ON_OFF_AUTO_ON : ON_OFF_AUTO_OFF;
 }
 if (s->intr_eim == ON_OFF_AUTO_ON && !s->buggy_eim) {
-if (!kvm_irqchip_in_kernel()) {
+if (!kvm_irqchip_is_split()) {
 error_setg(errp, "eim=on requires accel=kvm,kernel-irqchip=split");
 return false;
 }
-if (!kvm_enable_x2apic()) {
-error_setg(errp, "eim=on requires support on the KVM side"
- "(X2APIC_API, first shipped in v4.7)");
-return false;
-}
 }
 
 /* Currently only address widths supported are 39 and 48 bits */
-- 
2.31.1





[PATCH 1/2] target/i386: Fix sanity check on max APIC ID / X2APIC enablement

2021-11-30 Thread David Woodhouse
The check on x86ms->apic_id_limit in pc_machine_done() had two problems.

Firstly, we need KVM to support the X2APIC API in order to allow IRQ
delivery to APICs >= 255. So we need to call/check kvm_enable_x2apic(),
which was done elsewhere in *some* cases but not all.

Secondly, microvm needs the same check. So move it from pc_machine_done()
to x86_cpus_init() where it will work for both.

The check in kvm_cpu_instance_init() is now redundant and can be dropped.

Signed-off-by: David Woodhouse 
---
 hw/i386/pc.c  |  8 
 hw/i386/x86.c | 16 
 target/i386/kvm/kvm-cpu.c |  2 +-
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a2ef40ecbc..9959f93216 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -736,14 +736,6 @@ void pc_machine_done(Notifier *notifier, void *data)
 /* update FW_CFG_NB_CPUS to account for -device added CPUs */
 fw_cfg_modify_i16(x86ms->fw_cfg, FW_CFG_NB_CPUS, x86ms->boot_cpus);
 }
-
-
-if (x86ms->apic_id_limit > 255 && !xen_enabled() &&
-!kvm_irqchip_in_kernel()) {
-error_report("current -smp configuration requires kernel "
- "irqchip support.");
-exit(EXIT_FAILURE);
-}
 }
 
 void pc_guest_info_init(PCMachineState *pcms)
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index b84840a1bb..660e9413f5 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -39,6 +39,7 @@
 #include "sysemu/replay.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/cpu-timers.h"
+#include "sysemu/xen.h"
 #include "trace.h"
 
 #include "hw/i386/x86.h"
@@ -136,6 +137,21 @@ void x86_cpus_init(X86MachineState *x86ms, int 
default_cpu_version)
  */
 x86ms->apic_id_limit = x86_cpu_apic_id_from_index(x86ms,
   ms->smp.max_cpus - 1) + 
1;
+
+/*
+ * Can we support APIC ID 255 or higher?
+ *
+ * Under Xen: yes.
+ * With userspace emulated lapic: no
+ * With KVM's in-kernel lapic: only if X2APIC API is enabled.
+ */
+if (x86ms->apic_id_limit > 255 && !xen_enabled() &&
+(!kvm_irqchip_in_kernel() || !kvm_enable_x2apic())) {
+error_report("current -smp configuration requires kernel "
+ "irqchip and X2APIC API support.");
+exit(EXIT_FAILURE);
+}
+
 possible_cpus = mc->possible_cpu_arch_ids(ms);
 for (i = 0; i < ms->smp.cpus; i++) {
 x86_cpu_new(x86ms, possible_cpus->cpus[i].arch_id, _fatal);
diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index d95028018e..c60cb2dafb 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -165,7 +165,7 @@ static void kvm_cpu_instance_init(CPUState *cs)
 /* only applies to builtin_x86_defs cpus */
 if (!kvm_irqchip_in_kernel()) {
 x86_cpu_change_kvm_default("x2apic", "off");
-} else if (kvm_irqchip_is_split() && kvm_enable_x2apic()) {
+} else if (kvm_irqchip_is_split()) {
 x86_cpu_change_kvm_default("kvm-msi-ext-dest-id", "on");
 }
 
-- 
2.31.1





[PATCH v2 1/2] ppc/pnv.c: add a friendly warning when accel=kvm is used

2021-11-30 Thread Daniel Henrique Barboza
If one tries to use -machine powernv9,accel=kvm in a Power9 host, a
cryptic error will be shown:

qemu-system-ppc64: Register sync failed... If you're using kvm-hv.ko, only 
"-cpu host" is possible
qemu-system-ppc64: kvm_init_vcpu: kvm_arch_init_vcpu failed (0): Invalid 
argument

Appending '-cpu host' will throw another error:

qemu-system-ppc64: invalid chip model 'host' for powernv9 machine

The root cause is that in IBM PowerPC we have different specs for the bare-metal
and the guests. The bare-metal follows OPAL, the guests follow PAPR. The kernel
KVM modules presented in the ppc kernels implements PAPR. This means that we
can't use KVM accel when using the powernv machine, which is the emulation of
the bare-metal host.

All that said, let's give a more informative error in this case.

Signed-off-by: Daniel Henrique Barboza 
---
 hw/ppc/pnv.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 71e45515f1..e5b87e8730 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -742,6 +742,11 @@ static void pnv_init(MachineState *machine)
 DriveInfo *pnor = drive_get(IF_MTD, 0, 0);
 DeviceState *dev;
 
+if (kvm_enabled()) {
+error_report("The powernv machine does not work with KVM 
acceleration");
+exit(EXIT_FAILURE);
+}
+
 /* allocate RAM */
 if (machine->ram_size < mc->default_ram_size) {
 char *sz = size_to_str(mc->default_ram_size);
-- 
2.31.1




[PATCH v2 2/2] docs/system/ppc/powernv.rst: document KVM support status

2021-11-30 Thread Daniel Henrique Barboza
Put in a more accessible place the reasoning behind our decision
to officially drop KVM support in the powernv machine.

Signed-off-by: Daniel Henrique Barboza 
---
 docs/system/ppc/powernv.rst | 13 +
 1 file changed, 13 insertions(+)

diff --git a/docs/system/ppc/powernv.rst b/docs/system/ppc/powernv.rst
index 86186b7d2c..8304e85c51 100644
--- a/docs/system/ppc/powernv.rst
+++ b/docs/system/ppc/powernv.rst
@@ -58,6 +58,19 @@ Prebuilt images of ``skiboot`` and ``skiroot`` are made 
available on the `OpenPO
 QEMU includes a prebuilt image of ``skiboot`` which is updated when a
 more recent version is required by the models.
 
+Current acceleration status
+---
+
+KVM acceleration in Linux Power hosts is provided by the kvm-hv and
+kvm-pr modules. kvm-hv is adherent to PAPR and it's not compliant with
+powernv. kvm-pr in theory could be used as a valid accel option but
+this isn't supported by kvm-pr at this moment.
+
+To spare users from dealing with not so informative errors when attempting
+to use accel=kvm, the powernv machine will throw an error informing that
+KVM is not supported. This can be revisited in the future if kvm-pr (or
+any other KVM alternative) is usable as KVM accel for this machine.
+
 Boot options
 
 
-- 
2.31.1




[PATCH v2 0/2] ppc/pnv.c: add a friendly warning when accel=kvm is used

2021-11-30 Thread Daniel Henrique Barboza
Hi,

In this version a documentation patch was added to explain
our motivations to officially disable KVM accel in the powernv
machine.

Changes from v1:
- added a doc patch
- v1 link: https://lists.gnu.org/archive/html/qemu-devel/2021-11/msg05207.html 

Daniel Henrique Barboza (2):
  ppc/pnv.c: add a friendly warning when accel=kvm is used
  docs/system/ppc/powernv.rst: document KVM support status

 docs/system/ppc/powernv.rst | 13 +
 hw/ppc/pnv.c|  5 +
 2 files changed, 18 insertions(+)

-- 
2.31.1




Re: [PATCH v7 3/3] cpus-common: implement dirty page limit on vCPU

2021-11-30 Thread Peter Xu
On Tue, Nov 30, 2021 at 06:28:13PM +0800, huang...@chinatelecom.cn wrote:
>  ##
> +# @set-dirty-limit:
> +#
> +# Set the upper limit of dirty page rate for a virtual CPU.
> +#
> +# Requires KVM with accelerator property "dirty-ring-size" set.
> +# A virtual CPU's dirty page rate is a measure of its memory load.
> +# To observe dirty page rates, use @calc-dirty-rate.
> +#
> +# @cpu-index: index of the virtual CPU.
> +#
> +# @dirty-rate: upper limit for the specified vCPU's dirty page rate (MB/s)
> +#
> +# Since: 7.0
> +#
> +# Example:
> +#   {"execute": "set-dirty-limit"}
> +#"arguments": { "cpu-index": 0,
> +#   "dirty-rate": 200 } }
> +#
> +##
> +{ 'command': 'set-dirty-limit',
> +  'data': { 'cpu-index': 'int', 'dirty-rate': 'uint64' } }
> +
> +##
> +# @cancel-dirty-limit:
> +#
> +# Cancel the dirty page limit for the vCPU which has been set with
> +# set-dirty-limit command. Note that this command requires support from
> +# dirty ring, same as the "set-dirty-limit" command.
> +#
> +# @cpu-index: index of the virtual CPU to cancel the dirty page limit
> +#
> +# Since: 7.0
> +#
> +# Example:
> +#   {"execute": "cancel-dirty-limit"}
> +#"arguments": { "cpu-index": 0 } }
> +#
> +##
> +{ 'command': 'cancel-dirty-limit',
> +  'data': { 'cpu-index': 'int' } }

This seems to be overloaded to be a standalone cmd..

How about:

  { "cmd": "vcpu-dirty-limit",
"arguments": {
  "cpu": $cpu,
  "enable": true/false,
  "dirty-rate": 100,
}
  }

If "enable"==false, then "dirty-rate" can be ignored and it'll shut down the
throttling on vcpu N.  Then this command will literally merge the two you
proposed.

It'll be nice if we provide yet another command:

  { "cmd": "query-vcpu-dirty-limit",
"arguments": {
  "*cpu": $cpu,
}
  }

When $cpu is specified, we return (cpu=$cpu, real_dirty_rate,
target_dirty_rate) for this vcpu.  When $cpu is not specified, we return an
array of that containing all the vcpus.

It'll be nicer to enhance the output of the query command to e.g. have a global
"enabled"=true/false as long as any vcpu has throttle enabled then the global
throttle is enabled.  I didn't think more than that, but how's that sound so
far?
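
Just to illustrate the shape of it on the QEMU side, the handler for the merged
command could look roughly like this (sketch only; dirtylimit_set_vcpu() and
dirtylimit_cancel_vcpu() are placeholder names, not what your series implements):

    void qmp_vcpu_dirty_limit(int64_t cpu_index, bool enable,
                              bool has_dirty_rate, uint64_t dirty_rate,
                              Error **errp)
    {
        if (!enable) {
            /* "dirty-rate" is ignored when disabling the throttle. */
            dirtylimit_cancel_vcpu(cpu_index);
            return;
        }

        if (!has_dirty_rate) {
            error_setg(errp, "dirty-rate is required when enable is true");
            return;
        }

        dirtylimit_set_vcpu(cpu_index, dirty_rate);
    }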

Thanks,

-- 
Peter Xu




Re: [PATCH for-6.2 v2] block/nbd: forbid incompatible change of server options on reconnect

2021-11-30 Thread Vladimir Sementsov-Ogievskiy

30.11.2021 00:53, Vladimir Sementsov-Ogievskiy wrote:

Reconnect feature was never prepared to handle server options changed
on reconnect. Let's be stricter and check what exactly is changed. If
server capabilities just got richer don't worry. Otherwise fail and
drop the established connection.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---

v2: by Eric's comments:
  - drop extra check about old->min_block % new->min_block
  - make context_id check conditional itself
  - don't handle READ_ONLY flag here (see comment in code)
  - wording

  Code seems quite obvious, but honestly I still didn't test that it does
  what it should :( And I'm afraid, Qemu actually doesn't provide good
  possibility to do so.

  Eric, may be you know some simple way to test it with nbdkit?



Ok, a simple test I can do:

qemu-img create -f qcow2 a 10M
qemu-img create -f qcow2 b 20M
qemu-nbd b


then in parallel:

./qemu-io --image-opts driver=nbd,host=localhost,reconnect-delay=5

check that it works:
qemu-io> write 10M 1M
wrote 1048576/1048576 bytes at offset 10485760

Now, kill nbd server

try again

qemu-io> write 10M 1M

- it will wait up to 5 seconds for reconnection

Now start nbd server with shorter file

qemu-nbd a


Pre-patch, qemu-io will successfully connect; the request will fail with "Invalid
argument".

After the patch, qemu-io will refuse to connect to the shorter export; the write will fail with
"Input/output error".




  include/block/nbd.h |  9 +
  nbd/client-connection.c | 88 +
  2 files changed, 97 insertions(+)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 78d101b774..9e1943d24c 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -157,6 +157,10 @@ enum {
  #define NBD_FLAG_SEND_RESIZE   (1 << NBD_FLAG_SEND_RESIZE_BIT)
  #define NBD_FLAG_SEND_CACHE(1 << NBD_FLAG_SEND_CACHE_BIT)
  #define NBD_FLAG_SEND_FAST_ZERO(1 << NBD_FLAG_SEND_FAST_ZERO_BIT)
+/*
+ * WARNING! If you add any new NBD_FLAG_ flag, check that logic in
+ * nbd_is_new_info_compatible() is still good about handling flags.
+ */
  
  /* New-style handshake (global) flags, sent from server to client, and

 control what will happen during handshake phase. */
@@ -305,6 +309,11 @@ struct NBDExportInfo {
  
  uint32_t context_id;
  
+/*

+ * WARNING! When adding any new field to the structure, don't forget
+ * to check and update the nbd_is_new_info_compatible() function.
+ */
+
  /* Set by server results during nbd_receive_export_list() */
  char *description;
  int n_contexts;
diff --git a/nbd/client-connection.c b/nbd/client-connection.c
index 695f855754..d50c187482 100644
--- a/nbd/client-connection.c
+++ b/nbd/client-connection.c
@@ -37,6 +37,10 @@ struct NBDClientConnection {
  bool do_negotiation;
  bool do_retry;
  
+/* Used only by connection thread, does not need mutex protection */

+bool has_prev_info;
+NBDExportInfo prev_info;
+
  QemuMutex mutex;
  
  /*

@@ -160,6 +164,69 @@ static int nbd_connect(QIOChannelSocket *sioc, 
SocketAddress *addr,
  return 0;
  }
  
+static bool nbd_is_new_info_compatible(NBDExportInfo *old, NBDExportInfo *new,

+   Error **errp)
+{
+uint32_t dropped_flags;
+
+if (old->structured_reply && !new->structured_reply) {
+error_setg(errp, "Server options degraded after reconnect: "
+   "structured_reply is not supported anymore");
+return false;
+}
+
+if (old->base_allocation) {
+if (!new->base_allocation) {
+error_setg(errp, "Server options degraded after reconnect: "
+   "base_allocation is not supported anymore");
+return false;
+}
+
+if (old->context_id != new->context_id) {
+error_setg(errp, "Meta context id changed after reconnect");
+return false;
+}
+}
+
+if (old->size != new->size) {
+error_setg(errp, "NBD export size changed after reconnect");
+return false;
+}
+
+/*
+ * No worry if rotational status changed.
+ *
+ * Also, we can't handle NBD_FLAG_READ_ONLY properly at this level: we 
don't
+ * actually know, does our client need write access or not. So, it's 
handled
+ * in block layer in nbd_handle_updated_info().
+ *
+ * All other flags are feature flags, they should not degrade.
+ */
+dropped_flags = (old->flags & ~new->flags) &
+~(NBD_FLAG_ROTATIONAL | NBD_FLAG_READ_ONLY);
+if (dropped_flags) {
+error_setg(errp, "Server options degraded after reconnect: flags 0x%"
+   PRIx32 " are not reported anymore", dropped_flags);
+return false;
+}
+
+if (new->min_block > old->min_block) {
+error_setg(errp, "Server requires more strict min_block after "
+   "reconnect: %" PRIu32 " instead of %" PRIu32,
+   new->min_block, 

Re: Follow-up on the CXL discussion at OFTC

2021-11-30 Thread Jonathan Cameron via
On Mon, 29 Nov 2021 18:28:43 +
Alex Bennée  wrote:

> Ben Widawsky  writes:
> 
> > On 21-11-26 12:08:08, Alex Bennée wrote:  
> >> 
> >> Ben Widawsky  writes:
> >>   
> >> > On 21-11-19 02:29:51, Shreyas Shah wrote:  
> >> >> Hi Ben
> >> >> 
> >> >> Are you planning to add the CXL2.0 switch inside QEMU or already added 
> >> >> in one of the version? 
> >> >>
> >> >
> >> > From me, there are no plans for QEMU anything until/unless upstream 
> >> > thinks it
> >> > will merge the existing patches, or provide feedback as to what it would 
> >> > take to
> >> > get them merged. If upstream doesn't see a point in these patches, then 
> >> > I really
> >> > don't see much value in continuing to further them. Once hardware comes 
> >> > out, the
> >> > value proposition is certainly less.  
> >> 
> >> I take it:
> >> 
> >>   Subject: [RFC PATCH v3 00/31] CXL 2.0 Support
> >>   Date: Mon,  1 Feb 2021 16:59:17 -0800
> >>   Message-Id: <20210202005948.241655-1-ben.widaw...@intel.com>
> >> 
> >> is the current state of the support? I saw there was a fair amount of
> >> discussion on the thread so assumed there would be a v4 forthcoming at
> >> some point.  
> >
> > Hi Alex,
> >
> > There is a v4, however, we never really had a solid plan for the primary 
> > issue
> > which was around handling CXL memory expander devices properly (both from an
> > interleaving standpoint as well as having a device which hosts multiple 
> > memory
> > capacities, persistent and volatile). I didn't feel it was worth sending a 
> > v4
> > unless someone could say
> >
> > 1. we will merge what's there and fix later, or
> > 2. you must have a more perfect emulation in place, or
> > 3. we want to see usages for a real guest  
> 
> I think 1. is acceptable if the community is happy there will be ongoing
> development and it's not just a code dump. Given it will have a
> MAINTAINERS entry I think that is demonstrated.

My thought is also 1.  There are a few hacks we need to clean out but
nothing that should take too long.  I'm sure it'll take a rev or two more.
Right now for example, I've added support to arm-virt and maybe need to
move that over to a different machine model...

> 
> What's the current use case? Testing drivers before real HW comes out?
> Will it still be useful after real HW comes out for people wanting to
> debug things without HW?

CXL is continuing to expand in scope and capabilities and I don't see that
reducing any time soon (My guess is 3 years+ to just catch up with what is
under discussion today).  So I see two long term use cases:

1) Automated verification that we haven't broken things.  I suspect no
one person is going to have a test farm covering all the corner cases.
So we'll need emulation + firmware + kernel based testing.

2) New feature prove out.  We have already used it for some features that
will appear in the next spec version. Obviously I can't say what or
send that code out yet.  It's very useful and the spec draft has changed
in various ways as a result.  I can't commit for others, but Huawei will be
doing more of this going forwards.  For that we need a stable base to
which we add the new stuff once spec publication allows it.

> 
> >
> > I had hoped we could merge what was there mostly as is and fix it up as we 
> > go.
> > It's useful in the state it is now, and as time goes on, we find more 
> > usecases
> > for it in a VMM, and not just driver development.
> >  
> >> 
> >> Adding new subsystems to QEMU does seem to be a pain point for new
> >> contributors. Patches tend to fall through the cracks of existing
> >> maintainers who spend most of their time looking at stuff that directly
> >> touches their files. There is also a reluctance to merge large chunks of
> >> functionality without an identified maintainer (and maybe reviewers) who
> >> can be the contact point for new patches. So in short you need:
> >> 
> >>  - Maintainer Reviewed-by/Acked-by on patches that touch other sub-systems 
> >>  
> >
> > This is the challenging one. I have Cc'd the relevant maintainers (hw/pci 
> > and
> > hw/mem are the two) in the past, but I think their interest is lacking (and
> > reasonably so, it is an entirely different subsystem).  
> 
> So the best approach to that is to leave a Cc: tag in the patch itself
> on your next posting so we can see the maintainer did see it but didn't
> contribute a review tag. This is also a good reason to keep Message-Id
> tags in patches so we can go back to the original threads.
> 
> So in my latest PR you'll see:
> 
>   Signed-off-by: Willian Rampazzo 
>   Reviewed-by: Beraldo Leal 
>   Message-Id: <20211122191124.31620-1-willi...@redhat.com>
>   Signed-off-by: Alex Bennée 
>   Reviewed-by: Philippe Mathieu-Daudé 
>   Message-Id: <20211129140932.4115115-7-alex.ben...@linaro.org>
> 
> which shows the Message-Id from Willian's original posting and the
> latest Message-Id from my posting of the maintainer tree (I trim off my
> old ones).
> 
> >>  - Reviewed-by tags on the 

Re: [PATCH v7 1/3] migration/dirtyrate: implement vCPU dirtyrate calculation periodically

2021-11-30 Thread Peter Xu
On Tue, Nov 30, 2021 at 06:28:11PM +0800, huang...@chinatelecom.cn wrote:
> +static void dirtylimit_calc_func(void)
> +{
> +CPUState *cpu;
> +DirtyPageRecord *dirty_pages;
> +int64_t start_time, end_time, calc_time;
> +DirtyRateVcpu rate;
> +int i = 0;
> +
> +dirty_pages = g_malloc0(sizeof(*dirty_pages) *
> +dirtylimit_calc_state->data.nvcpu);
> +
> +dirtylimit_global_dirty_log_start();
> +
> +CPU_FOREACH(cpu) {
> +record_dirtypages(dirty_pages, cpu, true);
> +}
> +
> +start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +g_usleep(DIRTYLIMIT_CALC_TIME_MS * 1000);
> +end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +calc_time = end_time - start_time;
> +
> +dirtylimit_global_dirty_log_stop();

I haven't looked into the details, but..  I'm wondering whether we should just
keep the dirty ring enabled during the whole process of throttling.

start/stop can be expensive, especially when huge pages are used, start dirty
tracking will start to do huge page split. While right after the "stop" all the
huge pages will need to be rebuild again.

David from Google is even proposing a kernel change to eagerly splitting huge
pages when dirty tracking is enabled.

So I think we can keep the dirty tracking enabled until all the vcpu throttles
are stopped.

> +
> +CPU_FOREACH(cpu) {
> +record_dirtypages(dirty_pages, cpu, false);
> +}
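
In other words, something along these lines (illustrative only, reusing the
helper names from this series; dirtylimit_in_service() is a placeholder):

    static void dirtylimit_calc_loop(void)
    {
        DirtyPageRecord *dirty_pages =
            g_malloc0(sizeof(*dirty_pages) * dirtylimit_calc_state->data.nvcpu);
        CPUState *cpu;

        /* Enable dirty logging once for the whole throttle lifetime... */
        dirtylimit_global_dirty_log_start();

        while (dirtylimit_in_service()) {
            CPU_FOREACH(cpu) {
                record_dirtypages(dirty_pages, cpu, true);
            }
            g_usleep(DIRTYLIMIT_CALC_TIME_MS * 1000);
            CPU_FOREACH(cpu) {
                record_dirtypages(dirty_pages, cpu, false);
            }
            /* ... compute per-vCPU rates and adjust the throttle here ... */
        }

        /* ... and disable it only after the last vCPU throttle is cancelled. */
        dirtylimit_global_dirty_log_stop();
        g_free(dirty_pages);
    }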

-- 
Peter Xu




Re: [PATCH v7 0/3] support dirty restraint on vCPU

2021-11-30 Thread Peter Xu
On Tue, Nov 30, 2021 at 06:28:10PM +0800, huang...@chinatelecom.cn wrote:
> From: Hyman Huang(黄勇) 
> 
> The patch [2/3] has not been touched so far. Any corrections and
> suggestions are welcome.

I played with it today, but the vcpu didn't get throttled as expected.

What I did was starting two workload with 500mb/s, each pinned on one vcpu
thread:

[root@fedora ~]# pgrep -fa mig_mon
595 ./mig_mon mm_dirty 1000 500 sequential
604 ./mig_mon mm_dirty 1000 500 sequential
[root@fedora ~]# taskset -pc 595
pid 595's current affinity list: 2
[root@fedora ~]# taskset -pc 604
pid 604's current affinity list: 3

Then start throttle with 100mb/s:

(QEMU) set-dirty-limit cpu-index=3 dirty-rate=100
{"return": {}}
(QEMU) set-dirty-limit cpu-index=2 dirty-rate=100
{"return": {}}

I can see the workload dropped a tiny little bit (perhaps 500mb -> 499mb), then
it keeps going..

Further throttle won't work too:

(QEMU) set-dirty-limit cpu-index=2 dirty-rate=10
{"return": {}}

Funnily, the ssh client got slowed down instead... :(

Yong, how did you test it?

-- 
Peter Xu




Re: [PATCH 2/3] target/m68k: Implement TRAPcc

2021-11-30 Thread Richard Henderson

On 11/30/21 12:57 PM, Laurent Vivier wrote:

+DISAS_INSN(trapcc)
+{
+    /* Consume and discard the immediate operand. */
+    switch (extract32(insn, 0, 3)) {
+    case 2: /* trapcc.w */
+    (void)read_im16(env, s);
+    break;
+    case 3: /* trapcc.l */
+    (void)read_im32(env, s);
+    break;


Do we need to actually read the memory to trigger a fault if needed or can we only 
increase the PC?


Yes, and to pass the entire instruction to plugins.


+    case 4: /* trapcc (no operand) */
+    break;
+    default:
+    /* Illegal insn */
+    disas_undef(env, s, insn);
+    return;
+    }
+    do_trapcc(s, extract32(insn, 8, 4));
+}


Do we need to change something in m68k_interrupt_all()?


Yes, and cpu_loop.  Thanks,


r~



Re: unable to execute QEMU command 'qom-get': Property 'sgx-epc.unavailable-features' not found

2021-11-30 Thread Yang Zhong
On Thu, Nov 25, 2021 at 08:47:22PM +0800, Yang Zhong wrote:
> Hello Paolo,
> 
> Our customer used the Libvirt XML to start a SGX VM, but failed.
> 
> libvirt.libvirtError: internal error: unable to execute QEMU command 
> 'qom-get': Property 'sgx-epc.unavailable-features' not found
> 
> The XML file,
> 
> 
>  value="host,+sgx,+sgx-debug,+sgx-exinfo,+sgx-kss,+sgx-mode64,+sgx-provisionkey,+sgx-tokenkey,+sgx1,+sgx2,+sgxlc"/>
> 
> 
> 
> 
>   
> 
> The new compound property command should be located in /machine path,
> which is different from the old command '-sgx-epc id=epc1,memdev=mem1'.
> 
> I also tried this from Qemu monitor tool, 
> (qemu) qom-list /machine
> type (string)
> kernel (string)
> ..
> sgx-epc (SgxEPC)
> ..
> sgx-epc[0] (child)
> ..
> 
> We can find sgx-epc from /machine list.
> 

  This issue is clear now: it is caused by Libvirt getting the CPU's
unavailable-features with the command below:
  
{"execute":"qom-get","arguments":{"path":"/machine/unattached/device[0]","property":"unavailable-features"}

  But in an SGX VM, SGX is initialized before the vCPUs, because SGX needs to set
the virtual EPC info in the CPUID.

  So /machine/unattached/device[0] is occupied by SGX, and the query fails to get
the unavailable-features from
  /machine/unattached/device[0].


  We need to fix this issue; it can be done on either the QEMU or the Libvirt side.

  1) Libvirt side
 If libvirt supports SGX EPCs, it can use
/machine/unattached/device[n] to check "unavailable-features",
 where n is the next index after sgx's unattached_count.

  2) Qemu side

     One temporary patch to create an /sgx container under /machine in
device_set_realized():
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 84f3019440..4154eef0d8 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -497,7 +497,7 @@ static void device_set_realized(Object *obj, bool value, 
Error **errp)
 NamedClockList *ncl;
 Error *local_err = NULL;
 bool unattached_parent = false;
-static int unattached_count;
+static int unattached_count, sgx_count;

 if (dev->hotplugged && !dc->hotpluggable) {
 error_setg(errp, QERR_DEVICE_NO_HOTPLUG, object_get_typename(obj));
@@ -509,7 +509,15 @@ static void device_set_realized(Object *obj, bool value, 
Error **errp)
 goto fail;
 }

-if (!obj->parent) {
+if (!obj->parent && !strcmp(object_get_typename(obj), "sgx-epc")) {
+gchar *name = g_strdup_printf("device[%d]", sgx_count++);
+
+object_property_add_child(container_get(qdev_get_machine(),
+"/sgx"),
+  name, obj);
+unattached_parent = true;
+g_free(name);
+} else if (!obj->parent) {
 gchar *name = g_strdup_printf("device[%d]", unattached_count++);

 object_property_add_child(container_get(qdev_get_machine()
   
This patch makes sure the vCPU is still /machine/unattached/device[0].


Which solution is best?  thanks!

Yang




> I am not familiar with Libvirt side, would you please suggest how to implement
> this compound command in the XML file?  thanks a lot!
> 
> Regards,
> 
> Yang  
> 



Re: [PATCH for-6.1 v2] i386: do not call cpudef-only models functions for max, host, base

2021-11-30 Thread David Woodhouse
On Tue, 2021-11-30 at 10:00 +0100, Claudio Fontana wrote:
> I tend to agree that what we want if kvm_enable_x2apic fails is to abort QEMU 
> if we have been requesting smp > 255,
> and if we did not request smp > 255 cpus, we want to not advertise the 
> feature.
> 
> This is not accomplished by my snippet above, nor by the existing
> code at any point in git history, nor by yours in itself.
> 
> Your change seems to accomplish the call to kvm_enable_x2apic, and abort if 
> requested smp > 255,
> but it does not stop advertising X2APIC and ext-dest-id on kvm_enable_x2apic 
> failure for the case < 255, so we'd need to add that.

We don't need kvm_enable_x2apic() at all if all APIC IDs are <255.

The kvm_enable_x2apic() call sets two flags. The first (USE_32BIT_IDS)
makes KVM take the high bits of the target APIC ID from the high bits
of the MSI address. So is only relevant if we ever want to deliver
interrupts to CPUs above 255.

The second (DISABLE_BROADCAST_QUIRK) prevents APIC ID 255 from being
interpreted as a broadcast. This is relevant if you ever want to
deliver interrupts to APIC #255.
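
(Roughly speaking, that call is just a thin wrapper along these lines --
paraphrased, not quoted verbatim:)

    int kvm_enable_x2apic(void)
    {
        return kvm_vm_enable_cap(kvm_state, KVM_CAP_X2APIC_API, 0,
                                 KVM_X2APIC_API_USE_32BIT_IDS |
                                 KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK);
    }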

So we need kvm_enable_x2apic() *if* we have APICs >= 255, but we don't
care about it if we have fewer CPUs.

Note the condition is '>= 255'. Not '> 255'.

With APIC IDs < 256 it also doesn't matter whether we advertise the
ext-dest-id feature to the guest or not, since that only tells them
where to put the upper bits... and there aren't any upper bits.

> Do I understand it right? Do we need to wrap all of this logic in a if 
> (kvm_enabled()) ?

For Xen because we aren't really emulating CPUs or APICs at all, it
doesn't matter and you can have as many CPUs as you like.

For TCG or (KVM && !kvm_irqchip_in_kernel()) we are limited to 254
because our userspace lapic doesn't emulate x2apic at all. But in the
TCG case, even if KVM isn't *built*, kvm_irqchip_in_kernel() will
return false. So all we need to check is kvm_irqchip_in_kernel().

I think the correct check is what I had in pc_machine_done() with the
off-by-one error fixed:

if (x86ms->apic_id_limit >= 255 && !xen_enabled() &&
(!kvm_irqchip_in_kernel() || !kvm_enable_x2apic())) {

But that doesn't help microvm unless we copy it there too. 

Let's take a look at the code we were already looking at in
kvm_cpu_instance_init():

/* only applies to builtin_x86_defs cpus */
if (!kvm_irqchip_in_kernel()) {
x86_cpu_change_kvm_default("x2apic", "off");

That part is purely cosmetic because kvm_arch_get_supported_cpuid() is
*already* going to mask out the X2APIC bit if(!kvm_irqchip_in_kernel())
anyway.

} else if (kvm_irqchip_is_split() && kvm_enable_x2apic()) {
x86_cpu_change_kvm_default("kvm-msi-ext-dest-id", "on");
}

That part makes sense but there's a check missing here because even
*without* ext-dest-id we can't support APIC ID 255 without using
kvm_enable_x2apic().

And by the time we're in kvm_cpu_instance_init() for the CPU with APIC
ID #255, it's too late to disable x2apic support for all the previous
CPUs.

So... we either do the check in pc_machine_done() and also a similar
check for microvm, or we actually abort in kvm_cpu_instance_init() if
this CPU's APIC ID is >=255 and kvm_enable_x2apic() has failed.

Either way, the call in intel_iommu can die.




> > 
> > In that case it needs to turn x2apic support *off*. But simply calling
> > (or not calling) x86_cpu_change_kvm_default() makes absolutely no
> > difference unless those defaults are *applied* by calling
> > x86_cpu_apply_props()
> 
> right, it makes absolutely no difference, and we cannot use 
> kvm_default_props, as they are for something else entirely.
> 
> > or making the same change by some other means.
> 
> right, we need to change it by other means.
> 
> It is still unclear to me for which cpu classes and versioned models we 
> should behave like this. Thoughts?
> 
> "max"? "base"? versioned models: depending on the model features?
> 
> > 
> > 
> > > > So what I care about (in case ∃ APIC IDs >= 255) is two things:
> > > > 
> > > >  1. Qemu needs to call kvm_enable_x2apic().
> > > >  2. If that *fails* qemu needs to *stop* advertising X2APIC and 
> > > > ext-dest-id.
> 
> Understand. We also need though:
> 
> 3. Not call kvm_enable_x2apic() when it should not be called (non-KVM 
> accelerator, which cpu classes and models)
> 4. Not stop advertising X2APIC and ext-dest-id when we shouldn't stop 
> advertising it.
> 
> > > > 
> > > > 
> > > > That last patch snippet in pc_machine_done() should suffice to achieve
> > > > that, I think. Because if kvm_enable_x2apic() fails and qemu has been
> > > > asked for that many CPUs, it aborts completely. Which seems right.
> 
> see comments above, and should we limit that code to when kvm is enabled?
> 
> > > > 
> > > 
> > > seems right to abort if requesting > 255 APIC IDs cannot be satisfied, I 
> > > agree.
> > > 
> > > So I think in the end, we want to:
> > > 
> > > 1) 

Re: [PATCH 1/3] target/m68k: Implement TRAPV

2021-11-30 Thread Laurent Vivier

Le 30/11/2021 à 11:37, Richard Henderson a écrit :

Signed-off-by: Richard Henderson 
---
  target/m68k/translate.c | 17 +
  1 file changed, 17 insertions(+)

diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index af43c8eab8..858ba761fc 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -4863,6 +4863,22 @@ DISAS_INSN(trap)
  gen_exception(s, s->base.pc_next, EXCP_TRAP0 + (insn & 0xf));
  }
  
+static void do_trapcc(DisasContext *s, int cond)

+{
+TCGLabel *over = gen_new_label();
+
+/* Jump over if !cond. */
+gen_jmpcc(s, cond ^ 1, over);
+
+gen_exception(s, s->base.pc_next, EXCP_TRAPCC);
+gen_set_label(over);
+}
+
+DISAS_INSN(trapv)
+{
+do_trapcc(s, 9); /* VS */
+}
+
  static void gen_load_fcr(DisasContext *s, TCGv res, int reg)
  {
  switch (reg) {
@@ -6026,6 +6042,7 @@ void register_m68k_insns (CPUM68KState *env)
  BASE(nop,   4e71, );
  INSN(rtd,   4e74, , RTD);
  BASE(rts,   4e75, );
+INSN(trapv, 4e76, , M68000);
  INSN(rtr,   4e77, , M68000);
  BASE(jump,  4e80, ffc0);
  BASE(jump,  4ec0, ffc0);



Same question as for PATCH 2 regarding m68k_interrupt_all()

Reviewed-by: Laurent Vivier 



Re: [PATCH v3 17/23] multifd: Use normal pages array on the send side

2021-11-30 Thread Juan Quintela
"Dr. David Alan Gilbert"  wrote:
> * Juan Quintela (quint...@redhat.com) wrote:
>> Signed-off-by: Juan Quintela 
>
> Can you explain a bit more what's going on here?

Sorry.

Until patch 20, we have what we had always have:

pages that are sent through multifd (non zero pages).  We are going to
call it normal pages.  So right now, we use the array of pages that we
are passed in directly on the multifd send methods.

But when we introduce zero pages handling around patch 20, we end having
two types of pages sent through multifd:
- normal pages (a.k.a. non-zero pages)
- zero pages

So the options are:
- we rename the fields before we introduce the zero page code, and then
  we introduce the zero page code.
- we rename at the same time that we introduce the zero page code.

I decided to go with the 1st option.

The other thing that we do here is that we introduce the normal array
pages, so right now we do:

for (i = 0; i < pages->num; i++) {
    p->normal[p->normal_num] = pages->offset[i];
    p->normal_num++;
}


Why?

Because then patch 20 becomes:

for (i = 0; i < pages->num; i++) {
    if (buffer_is_zero(pages->block->host + pages->offset[i], page_size)) {
        p->zero[p->zero_num] = pages->offset[i];
        p->zero_num++;
    } else {
        p->normal[p->normal_num] = pages->offset[i];
        p->normal_num++;
    }
}

i.e. don't have to touch the handling of normal pages at all, only this
for loop.

As an added benefit, after this patch, multifd methods don't need to
know about the pages array, only about the params array (that will allow
me to drop the locking earlier).

I hope this helps.

Later, Juan.




Re: [PATCH 2/3] target/m68k: Implement TRAPcc

2021-11-30 Thread Laurent Vivier

Le 30/11/2021 à 11:37, Richard Henderson a écrit :

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/754
Signed-off-by: Richard Henderson 
---
  target/m68k/cpu.h   |  2 ++
  target/m68k/cpu.c   |  1 +
  target/m68k/translate.c | 21 +
  3 files changed, 24 insertions(+)

diff --git a/target/m68k/cpu.h b/target/m68k/cpu.h
index a3423729ef..03f600f7e7 100644
--- a/target/m68k/cpu.h
+++ b/target/m68k/cpu.h
@@ -527,6 +527,8 @@ enum m68k_features {
  M68K_FEATURE_MOVEC,
  /* Unaligned data accesses (680[2346]0) */
  M68K_FEATURE_UNALIGNED_DATA,
+/* TRAPCC insn. (680[2346]0, and CPU32) */
+M68K_FEATURE_TRAPCC,
  };
  
  static inline int m68k_feature(CPUM68KState *env, int feature)

diff --git a/target/m68k/cpu.c b/target/m68k/cpu.c
index c7aeb7da9c..5f778773d1 100644
--- a/target/m68k/cpu.c
+++ b/target/m68k/cpu.c
@@ -162,6 +162,7 @@ static void m68020_cpu_initfn(Object *obj)
  m68k_set_feature(env, M68K_FEATURE_CHK2);
  m68k_set_feature(env, M68K_FEATURE_MSP);
  m68k_set_feature(env, M68K_FEATURE_UNALIGNED_DATA);
+m68k_set_feature(env, M68K_FEATURE_TRAPCC);
  }
  
  /*

diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index 858ba761fc..cf29f35d91 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -4879,6 +4879,26 @@ DISAS_INSN(trapv)
  do_trapcc(s, 9); /* VS */
  }
  
+DISAS_INSN(trapcc)

+{
+/* Consume and discard the immediate operand. */
+switch (extract32(insn, 0, 3)) {
+case 2: /* trapcc.w */
+(void)read_im16(env, s);
+break;
+case 3: /* trapcc.l */
+(void)read_im32(env, s);
+break;


Do we need to actually read the memory to trigger a fault if needed or can we 
only increase the PC?

Normally these values are for the trap handler.


+case 4: /* trapcc (no operand) */
+break;
+default:
+/* Illegal insn */
+disas_undef(env, s, insn);
+return;
+}
+do_trapcc(s, extract32(insn, 8, 4));
+}


Do we need to change something in m68k_interrupt_all()?

if (!is_hw) {
switch (cs->exception_index) {
case EXCP_RTE:
/* Return from an exception.  */
m68k_rte(env);
return;
case EXCP_TRAP0 ...  EXCP_TRAP15:
/* Move the PC after the trap instruction.  */
retaddr += 2;
break;
}
}

Thanks,
Laurent


+
  static void gen_load_fcr(DisasContext *s, TCGv res, int reg)
  {
  switch (reg) {
@@ -6051,6 +6071,7 @@ void register_m68k_insns (CPUM68KState *env)
  INSN(scc,   50c0, f0f8, CF_ISA_A); /* Scc.B Dx   */
  INSN(scc,   50c0, f0c0, M68000);   /* Scc.B  */
  INSN(dbcc,  50c8, f0f8, M68000);
+INSN(trapcc,50f8, f0f8, TRAPCC);
  INSN(tpf,   51f8, fff8, CF_ISA_A);
  
  /* Branch instructions.  */








Re: [PATCH 3/3] target/m68k: Implement FTRAPcc

2021-11-30 Thread Richard Henderson

On 11/30/21 11:37 AM, Richard Henderson wrote:

+INSN(ftrapcc,   f278, ff80, FPU);


Whoops, mask should be fff8.


r~



[PATCH] aio-posix: split poll check from ready handler

2021-11-30 Thread Stefan Hajnoczi
Adaptive polling measures the execution time of the polling check plus
handlers called when a polled event becomes ready. Handlers can take a
significant amount of time, making it look like polling was running for
a long time when in fact the event handler was running for a long time.

For example, on Linux the io_submit(2) syscall invoked when a virtio-blk
device's virtqueue becomes ready can take 10s of microseconds. This
can exceed the default polling interval (32 microseconds) and cause
adaptive polling to stop polling.

By excluding the handler's execution time from the polling check we make
the adaptive polling calculation more accurate. As a result, the event
loop now stays in polling mode where previously it would have fallen
back to file descriptor monitoring.

The following data was collected with virtio-blk num-queues=2
event_idx=off using an IOThread. Before:

168k IOPS, IOThread syscalls:

  9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, 
nr: 16, iocbpp: 0x7fcb9f937db0)= 16
  9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, 
count: 8) = 8
  9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, 
count: 8) = 8
  9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 
4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3
  9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, 
count: 512)= 8
  9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, 
count: 512)= 8
  9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, 
count: 512)= 8
  9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, 
nr: 32, iocbpp: 0x7fca7d0cebe0)= 32

174k IOPS (+3.6%), IOThread syscalls:

  9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, 
nr: 32, iocbpp: 0x7fd0cdd62be0)= 32
  9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, 
count: 8) = 8
  9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, 
count: 8) = 8
  9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, 
nr: 32, iocbpp: 0x7fd0d0388b50)= 32

Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because
the IOThread stays in polling mode instead of falling back to file
descriptor monitoring.

As usual, polling is not implemented on Windows so this patch ignores
the new io_poll_ready() callback in aio-win32.c.

Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h  |  4 +-
 util/aio-posix.h |  1 +
 block/curl.c | 11 ++---
 block/io_uring.c | 19 +
 block/iscsi.c|  4 +-
 block/linux-aio.c| 16 +---
 block/nfs.c  |  6 +--
 block/nvme.c | 34 +---
 block/ssh.c  |  4 +-
 block/win32-aio.c|  4 +-
 hw/virtio/virtio.c   | 16 +---
 hw/xen/xen-bus.c |  6 +--
 io/channel-command.c |  6 ++-
 io/channel-file.c|  3 +-
 io/channel-socket.c  |  3 +-
 migration/rdma.c |  8 ++--
 tests/unit/test-aio.c|  4 +-
 util/aio-posix.c | 88 ++--
 util/aio-win32.c |  4 +-
 util/async.c | 10 -
 util/main-loop.c |  4 +-
 util/qemu-coroutine-io.c |  5 ++-
 util/vhost-user-server.c | 11 ++---
 23 files changed, 184 insertions(+), 87 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 47fbe9d81f..5634173b12 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -469,6 +469,7 @@ void aio_set_fd_handler(AioContext *ctx,
 IOHandler *io_read,
 IOHandler *io_write,
 AioPollFn *io_poll,
+IOHandler *io_poll_ready,
 void *opaque);
 
 /* Set polling begin/end callbacks for a file descriptor that has already been
@@ -490,7 +491,8 @@ void aio_set_event_notifier(AioContext *ctx,
 EventNotifier *notifier,
 bool is_external,
 EventNotifierHandler *io_read,
-AioPollFn *io_poll);
+AioPollFn *io_poll,
+EventNotifierHandler *io_poll_ready);
 
 /* Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier.  Do nothing if the event notifier is
diff --git a/util/aio-posix.h b/util/aio-posix.h
index c80c04506a..7f2c37a684 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -24,6 +24,7 @@ struct AioHandler {
 IOHandler *io_read;
 IOHandler *io_write;
 AioPollFn *io_poll;
+IOHandler *io_poll_ready;
 IOHandler 

Re: [PATCH v3 17/23] multifd: Use normal pages array on the send side

2021-11-30 Thread Dr. David Alan Gilbert
* Juan Quintela (quint...@redhat.com) wrote:
> Signed-off-by: Juan Quintela 

Can you explain a bit more what's going on here?

Dave

> ---
>  migration/multifd.h  |  8 ++--
>  migration/multifd-zlib.c |  6 +++---
>  migration/multifd-zstd.c |  6 +++---
>  migration/multifd.c  | 30 +++---
>  migration/trace-events   |  4 ++--
>  5 files changed, 33 insertions(+), 21 deletions(-)
> 
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 7496f951a7..78e73df3ec 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -104,14 +104,18 @@ typedef struct {
>  /* thread local variables */
>  /* packets sent through this channel */
>  uint64_t num_packets;
> -/* pages sent through this channel */
> -uint64_t num_pages;
> +/* non zero pages sent through this channel */
> +uint64_t num_normal_pages;
>  /* syncs main thread and channels */
>  QemuSemaphore sem_sync;
>  /* buffers to send */
>  struct iovec *iov;
>  /* number of iovs used */
>  uint32_t iovs_num;
> +/* Pages that are not zero */
> +ram_addr_t *normal;
> +/* num of non zero pages */
> +uint32_t normal_num;
>  /* used for compression methods */
>  void *data;
>  }  MultiFDSendParams;
> diff --git a/migration/multifd-zlib.c b/migration/multifd-zlib.c
> index f65159392a..25ef68a548 100644
> --- a/migration/multifd-zlib.c
> +++ b/migration/multifd-zlib.c
> @@ -106,16 +106,16 @@ static int zlib_send_prepare(MultiFDSendParams *p, 
> Error **errp)
>  int ret;
>  uint32_t i;
>  
> -for (i = 0; i < p->pages->num; i++) {
> +for (i = 0; i < p->normal_num; i++) {
>  uint32_t available = z->zbuff_len - out_size;
>  int flush = Z_NO_FLUSH;
>  
> -if (i == p->pages->num - 1) {
> +if (i == p->normal_num - 1) {
>  flush = Z_SYNC_FLUSH;
>  }
>  
>  zs->avail_in = page_size;
> -zs->next_in = p->pages->block->host + p->pages->offset[i];
> +zs->next_in = p->pages->block->host + p->normal[i];
>  
>  zs->avail_out = available;
>  zs->next_out = z->zbuff + out_size;
> diff --git a/migration/multifd-zstd.c b/migration/multifd-zstd.c
> index 6933ba622a..61842d713e 100644
> --- a/migration/multifd-zstd.c
> +++ b/migration/multifd-zstd.c
> @@ -121,13 +121,13 @@ static int zstd_send_prepare(MultiFDSendParams *p, 
> Error **errp)
>  z->out.size = z->zbuff_len;
>  z->out.pos = 0;
>  
> -for (i = 0; i < p->pages->num; i++) {
> +for (i = 0; i < p->normal_num; i++) {
>  ZSTD_EndDirective flush = ZSTD_e_continue;
>  
> -if (i == p->pages->num - 1) {
> +if (i == p->normal_num - 1) {
>  flush = ZSTD_e_flush;
>  }
> -z->in.src = p->pages->block->host + p->pages->offset[i];
> +z->in.src = p->pages->block->host + p->normal[i];
>  z->in.size = page_size;
>  z->in.pos = 0;
>  
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 6983ba3e7c..dbe919b764 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -89,13 +89,13 @@ static int nocomp_send_prepare(MultiFDSendParams *p, 
> Error **errp)
>  MultiFDPages_t *pages = p->pages;
>  size_t page_size = qemu_target_page_size();
>  
> -for (int i = 0; i < p->pages->num; i++) {
> -p->iov[p->iovs_num].iov_base = pages->block->host + pages->offset[i];
> +for (int i = 0; i < p->normal_num; i++) {
> +p->iov[p->iovs_num].iov_base = pages->block->host + p->normal[i];
>  p->iov[p->iovs_num].iov_len = page_size;
>  p->iovs_num++;
>  }
>  
> -p->next_packet_size = p->pages->num * page_size;
> +p->next_packet_size = p->normal_num * page_size;
>  p->flags |= MULTIFD_FLAG_NOCOMP;
>  return 0;
>  }
> @@ -262,7 +262,7 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>  
>  packet->flags = cpu_to_be32(p->flags);
>  packet->pages_alloc = cpu_to_be32(p->pages->allocated);
> -packet->pages_used = cpu_to_be32(p->pages->num);
> +packet->pages_used = cpu_to_be32(p->normal_num);
>  packet->next_packet_size = cpu_to_be32(p->next_packet_size);
>  packet->packet_num = cpu_to_be64(p->packet_num);
>  
> @@ -270,9 +270,9 @@ static void multifd_send_fill_packet(MultiFDSendParams *p)
>  strncpy(packet->ramblock, p->pages->block->idstr, 256);
>  }
>  
> -for (i = 0; i < p->pages->num; i++) {
> +for (i = 0; i < p->normal_num; i++) {
>  /* there are architectures where ram_addr_t is 32 bit */
> -uint64_t temp = p->pages->offset[i];
> +uint64_t temp = p->normal[i];
>  
>  packet->offset[i] = cpu_to_be64(temp);
>  }
> @@ -556,6 +556,8 @@ void multifd_save_cleanup(void)
>  p->packet = NULL;
>  g_free(p->iov);
>  p->iov = NULL;
> +g_free(p->normal);
> +p->normal = NULL;
>  multifd_send_state->ops->send_cleanup(p, &local_err);
>  

[PATCH v1 8/8] virtio-mem: Support "prealloc=on" option

2021-11-30 Thread David Hildenbrand
For scarce memory resources, such as hugetlb, we want to be able to
prealloc such memory resources in order to not crash later on access. On
simple user errors we could otherwise easily run out of memory resources
and crash the VM -- pretty much undesired.

For ordinary memory devices, such as DIMMs, we preallocate memory via the
memory backend for such use cases; however, with virtio-mem we're dealing
with sparse memory backends; preallocating the whole memory backend
destroys the whole purpose of virtio-mem.

Instead, we want to preallocate memory when actually exposing memory to the
VM dynamically, and fail plugging memory gracefully + warn the user in case
preallocation fails.

A common use case for hugetlb will be using "reserve=off,prealloc=off" for
the memory backend and "prealloc=on" for the virtio-mem device. This
way, no huge pages will be reserved for the process, but we can recover
if there are no actual huge pages when plugging memory. Libvirt is
already prepared for this.

Note that preallocation cannot protect from the OOM killer -- which
holds true for any kind of preallocation in QEMU. It's primarily useful
only for scarce memory resources such as hugetlb, or shared file-backed
memory. It's of little use for ordinary anonymous memory that can be
swapped, KSM merged, ... but we won't forbid it.

Signed-off-by: David Hildenbrand 
---
 hw/virtio/virtio-mem.c | 39 ++
 include/hw/virtio/virtio-mem.h |  4 
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index a5d26d414f..33c7884aa0 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -450,10 +450,40 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, 
uint64_t start_gpa,
 return -EBUSY;
 }
 virtio_mem_notify_unplug(vmem, offset, size);
-} else if (virtio_mem_notify_plug(vmem, offset, size)) {
-/* Could be a mapping attempt resulted in memory getting populated. */
-ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size);
-return -EBUSY;
+} else {
+int ret = 0;
+
+if (vmem->prealloc) {
+void *area = memory_region_get_ram_ptr(&vmem->memdev->mr) + offset;
+int fd = memory_region_get_fd(&vmem->memdev->mr);
+Error *local_err = NULL;
+
+os_mem_prealloc(fd, area, size, 1, &local_err);
+if (local_err) {
+static bool warned;
+
+/*
+ * Warn only once, we don't want to fill the log with these
+ * warnings.
+ */
+if (!warned) {
+warn_report_err(local_err);
+warned = true;
+} else {
+error_free(local_err);
+}
+ret = -EBUSY;
+}
+}
+if (!ret) {
+ret = virtio_mem_notify_plug(vmem, offset, size);
+}
+
+if (ret) {
+/* Could be preallocation or a notifier populated memory. */
+ram_block_discard_range(vmem->memdev->mr.ram_block, offset, size);
+return -EBUSY;
+}
 }
 virtio_mem_set_bitmap(vmem, start_gpa, size, plug);
 return 0;
@@ -1165,6 +1195,7 @@ static void virtio_mem_instance_init(Object *obj)
 static Property virtio_mem_properties[] = {
 DEFINE_PROP_UINT64(VIRTIO_MEM_ADDR_PROP, VirtIOMEM, addr, 0),
 DEFINE_PROP_UINT32(VIRTIO_MEM_NODE_PROP, VirtIOMEM, node, 0),
+DEFINE_PROP_BOOL(VIRTIO_MEM_PREALLOC_PROP, VirtIOMEM, prealloc, false),
 DEFINE_PROP_LINK(VIRTIO_MEM_MEMDEV_PROP, VirtIOMEM, memdev,
  TYPE_MEMORY_BACKEND, HostMemoryBackend *),
 #if defined(VIRTIO_MEM_HAS_LEGACY_GUESTS)
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index 38c67a89f2..7745cfc1a3 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -31,6 +31,7 @@ OBJECT_DECLARE_TYPE(VirtIOMEM, VirtIOMEMClass,
 #define VIRTIO_MEM_BLOCK_SIZE_PROP "block-size"
 #define VIRTIO_MEM_ADDR_PROP "memaddr"
 #define VIRTIO_MEM_UNPLUGGED_INACCESSIBLE_PROP "unplugged-inaccessible"
+#define VIRTIO_MEM_PREALLOC_PROP "prealloc"
 
 struct VirtIOMEM {
 VirtIODevice parent_obj;
@@ -70,6 +71,9 @@ struct VirtIOMEM {
  */
 OnOffAuto unplugged_inaccessible;
 
+/* whether to prealloc memory when plugging new blocks */
+bool prealloc;
+
 /* notifiers to notify when "size" changes */
 NotifierList size_change_notifiers;
 
-- 
2.31.1




[PATCH v1 6/8] util/oslib-posix: Support concurrent os_mem_prealloc() invocation

2021-11-30 Thread David Hildenbrand
Add a mutex to protect the SIGBUS case, as we cannot mess concurrently
with the sigbus handler and we have to manage the global variable
sigbus_memset_context. The MADV_POPULATE_WRITE path can run
concurrently.

Note that page_mutex and page_cond are shared between concurrent
invocations, which shouldn't be a problem.

This is a preparation for future virtio-mem prealloc code, which will call
os_mem_prealloc() asynchronously from an iothread when handling guest
requests.
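
As an illustration of what this enables (a sketch only: the job struct, worker and thread names are made up for the example; only the os_mem_prealloc() signature is taken from the code being changed), two threads could then preallocate disjoint regions at the same time:

/* Sketch, not part of the patch: two workers preallocating disjoint
 * regions concurrently.  The SIGBUS fallback path is serialized by the
 * new mutex; the MADV_POPULATE_WRITE path runs fully in parallel. */
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/error-report.h"
#include "qemu/thread.h"

typedef struct PreallocJob {
    int fd;        /* backing fd, or -1 for anonymous memory */
    char *area;    /* start of the region to preallocate */
    size_t size;   /* size of the region */
} PreallocJob;

static void *prealloc_worker(void *opaque)
{
    PreallocJob *job = opaque;
    Error *local_err = NULL;

    os_mem_prealloc(job->fd, job->area, job->size, 1, &local_err);
    if (local_err) {
        error_report_err(local_err);
    }
    return NULL;
}

/* e.g. qemu_thread_create(&t0, "prealloc0", prealloc_worker, &job0,
 *      QEMU_THREAD_JOINABLE); and likewise for a second job/thread. */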

Reviewed-by: Pankaj Gupta 
Reviewed-by: Daniel P. Berrangé 
Signed-off-by: David Hildenbrand 
---
 util/oslib-posix.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index efa4f96d56..9829149e4b 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -95,6 +95,7 @@ typedef struct MemsetThread MemsetThread;
 
 /* used by sigbus_handler() */
 static MemsetContext *sigbus_memset_context;
+static QemuMutex sigbus_mutex;
 
 static QemuMutex page_mutex;
 static QemuCond page_cond;
@@ -625,6 +626,7 @@ static bool madv_populate_write_possible(char *area, size_t 
pagesize)
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
  Error **errp)
 {
+static gsize initialized;
 int ret;
 struct sigaction act, oldact;
 size_t hpagesize = qemu_fd_getpagesize(fd);
@@ -638,6 +640,12 @@ void os_mem_prealloc(int fd, char *area, size_t memory, 
int smp_cpus,
 use_madv_populate_write = madv_populate_write_possible(area, hpagesize);
 
 if (!use_madv_populate_write) {
+if (g_once_init_enter(&initialized)) {
+qemu_mutex_init(&sigbus_mutex);
+g_once_init_leave(&initialized, 1);
+}
+
+qemu_mutex_lock(&sigbus_mutex);
 memset(&act, 0, sizeof(act));
 act.sa_handler = &sigbus_handler;
 act.sa_flags = 0;
@@ -665,6 +673,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int 
smp_cpus,
 perror("os_mem_prealloc: failed to reinstall signal handler");
 exit(1);
 }
+qemu_mutex_unlock(&sigbus_mutex);
 }
 }
 
-- 
2.31.1




[PATCH v1 7/8] util/oslib-posix: Forward SIGBUS to MCE handler under Linux

2021-11-30 Thread David Hildenbrand
Temporarily modifying the SIGBUS handler is really nasty, as we might be
unlucky and receive an MCE SIGBUS while having our handler registered.
Unfortunately, there is no way around messing with SIGBUS when
MADV_POPULATE_WRITE is not applicable or not around.

Let's forward SIGBUS signals that don't belong to us to the already registered
handler and document the situation.

Reviewed-by: Daniel P. Berrangé 
Signed-off-by: David Hildenbrand 
---
 softmmu/cpus.c |  4 
 util/oslib-posix.c | 36 +---
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 071085f840..23bca46b07 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -352,6 +352,10 @@ static void qemu_init_sigbus(void)
 {
 struct sigaction action;
 
+/*
+ * ALERT: when modifying this, take care that SIGBUS forwarding in
+ * os_mem_prealloc() will continue working as expected.
+ */
 memset(, 0, sizeof(action));
 action.sa_flags = SA_SIGINFO;
 action.sa_sigaction = sigbus_handler;
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 9829149e4b..5c47aa9cb7 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -95,6 +95,7 @@ typedef struct MemsetThread MemsetThread;
 
 /* used by sigbus_handler() */
 static MemsetContext *sigbus_memset_context;
+struct sigaction sigbus_oldact;
 static QemuMutex sigbus_mutex;
 
 static QemuMutex page_mutex;
@@ -446,7 +447,11 @@ const char *qemu_get_exec_dir(void)
 return exec_dir;
 }
 
+#ifdef CONFIG_LINUX
+static void sigbus_handler(int signal, siginfo_t *siginfo, void *ctx)
+#else /* CONFIG_LINUX */
 static void sigbus_handler(int signal)
+#endif /* CONFIG_LINUX */
 {
 int i;
 
@@ -459,6 +464,26 @@ static void sigbus_handler(int signal)
 }
 }
 }
+
+#ifdef CONFIG_LINUX
+/*
+ * We assume that the MCE SIGBUS handler could have been registered. We
+ * should never receive BUS_MCEERR_AO on any of our threads, but only on
+ * the main thread registered for PR_MCE_KILL_EARLY. Further, we should not
+ * receive BUS_MCEERR_AR triggered by action of other threads on one of
+ * our threads. So, no need to check for unrelated SIGBUS when seeing one
+ * for our threads.
+ *
+ * We will forward to the MCE handler, which will either handle the SIGBUS
+ * or reinstall the default SIGBUS handler and reraise the SIGBUS. The
+ * default SIGBUS handler will crash the process, so we don't care.
+ */
+if (sigbus_oldact.sa_flags & SA_SIGINFO) {
+sigbus_oldact.sa_sigaction(signal, siginfo, ctx);
+return;
+}
+#endif /* CONFIG_LINUX */
+warn_report("os_mem_prealloc: unrelated SIGBUS detected and ignored");
 }
 
 static void *do_touch_pages(void *arg)
@@ -628,10 +653,10 @@ void os_mem_prealloc(int fd, char *area, size_t memory, 
int smp_cpus,
 {
 static gsize initialized;
 int ret;
-struct sigaction act, oldact;
 size_t hpagesize = qemu_fd_getpagesize(fd);
 size_t numpages = DIV_ROUND_UP(memory, hpagesize);
 bool use_madv_populate_write;
+struct sigaction act;
 
 /*
  * Sense on every invocation, as MADV_POPULATE_WRITE cannot be used for
@@ -647,10 +672,15 @@ void os_mem_prealloc(int fd, char *area, size_t memory, 
int smp_cpus,
 
 qemu_mutex_lock(&sigbus_mutex);
 memset(&act, 0, sizeof(act));
+#ifdef CONFIG_LINUX
+act.sa_sigaction = &sigbus_handler;
+act.sa_flags = SA_SIGINFO;
+#else /* CONFIG_LINUX */
 act.sa_handler = &sigbus_handler;
 act.sa_flags = 0;
+#endif /* CONFIG_LINUX */
 
-ret = sigaction(SIGBUS, &act, &oldact);
+ret = sigaction(SIGBUS, &act, &sigbus_oldact);
 if (ret) {
 error_setg_errno(errp, errno,
 "os_mem_prealloc: failed to install signal handler");
@@ -667,7 +697,7 @@ void os_mem_prealloc(int fd, char *area, size_t memory, int 
smp_cpus,
 }
 
 if (!use_madv_populate_write) {
-ret = sigaction(SIGBUS, &oldact, NULL);
+ret = sigaction(SIGBUS, &sigbus_oldact, NULL);
 if (ret) {
 /* Terminate QEMU since it can't recover from error */
 perror("os_mem_prealloc: failed to reinstall signal handler");
-- 
2.31.1




Re: [PATCH v3 16/23] multifd: Unfold "used" variable by its value

2021-11-30 Thread Dr. David Alan Gilbert
* Juan Quintela (quint...@redhat.com) wrote:
> Signed-off-by: Juan Quintela 

Reviewed-by: Dr. David Alan Gilbert 

> ---
>  migration/multifd.c | 8 +++-
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 65676d56fd..6983ba3e7c 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -1059,7 +1059,6 @@ static void *multifd_recv_thread(void *opaque)
>  rcu_register_thread();
>  
>  while (true) {
> -uint32_t used;
>  uint32_t flags;
>  
>  if (p->quit) {
> @@ -1082,17 +1081,16 @@ static void *multifd_recv_thread(void *opaque)
>  break;
>  }
>  
> -used = p->pages->num;
>  flags = p->flags;
>  /* recv methods don't know how to handle the SYNC flag */
>  p->flags &= ~MULTIFD_FLAG_SYNC;
> -trace_multifd_recv(p->id, p->packet_num, used, flags,
> +trace_multifd_recv(p->id, p->packet_num, p->pages->num, flags,
> p->next_packet_size);
>  p->num_packets++;
> -p->num_pages += used;
> +p->num_pages += p->pages->num;
>  qemu_mutex_unlock(&p->mutex);
>  
> -if (used) {
> +if (p->pages->num) {
>  ret = multifd_recv_state->ops->recv_pages(p, &local_err);
>  if (ret != 0) {
>  break;
> -- 
> 2.33.1
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK




[PATCH v1 3/8] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages()

2021-11-30 Thread David Hildenbrand
Let's minimize the number of global variables to prepare for
os_mem_prealloc() getting called concurrently and make the code a bit
easier to read.

The only consumer that really needs a global variable is the sigbus
handler, which will require protection via a mutex in the future either way
as we cannot concurrently mess with the SIGBUS handler.

Reviewed-by: Daniel P. Berrangé 
Signed-off-by: David Hildenbrand 
---
 util/oslib-posix.c | 73 +-
 1 file changed, 47 insertions(+), 26 deletions(-)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index cb89e07770..cf2ead54ad 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -73,21 +73,30 @@
 
 #define MAX_MEM_PREALLOC_THREAD_COUNT 16
 
+struct MemsetThread;
+
+typedef struct MemsetContext {
+bool all_threads_created;
+bool any_thread_failed;
+struct MemsetThread *threads;
+int num_threads;
+} MemsetContext;
+
 struct MemsetThread {
 char *addr;
 size_t numpages;
 size_t hpagesize;
 QemuThread pgthread;
 sigjmp_buf env;
+MemsetContext *context;
 };
 typedef struct MemsetThread MemsetThread;
 
-static MemsetThread *memset_thread;
-static int memset_num_threads;
+/* used by sigbus_handler() */
+static MemsetContext *sigbus_memset_context;
 
 static QemuMutex page_mutex;
 static QemuCond page_cond;
-static bool threads_created_flag;
 
 int qemu_get_thread_id(void)
 {
@@ -438,10 +447,13 @@ const char *qemu_get_exec_dir(void)
 static void sigbus_handler(int signal)
 {
 int i;
-if (memset_thread) {
-for (i = 0; i < memset_num_threads; i++) {
-if (qemu_thread_is_self(&memset_thread[i].pgthread)) {
-siglongjmp(memset_thread[i].env, 1);
+
+if (sigbus_memset_context) {
+for (i = 0; i < sigbus_memset_context->num_threads; i++) {
+MemsetThread *thread = &sigbus_memset_context->threads[i];
+
+if (qemu_thread_is_self(&thread->pgthread)) {
+siglongjmp(thread->env, 1);
 }
 }
 }
@@ -459,7 +471,7 @@ static void *do_touch_pages(void *arg)
  * clearing until all threads have been created.
  */
 qemu_mutex_lock(&page_mutex);
-while(!threads_created_flag){
+while (!memset_args->context->all_threads_created) {
 qemu_cond_wait(&page_cond, &page_mutex);
 }
 qemu_mutex_unlock(&page_mutex);
@@ -502,7 +514,7 @@ static void *do_madv_populate_write_pages(void *arg)
 
 /* See do_touch_pages(). */
 qemu_mutex_lock(&page_mutex);
-while (!threads_created_flag) {
+while (!memset_args->context->all_threads_created) {
 qemu_cond_wait(&page_cond, &page_mutex);
 }
 qemu_mutex_unlock(&page_mutex);
@@ -529,6 +541,9 @@ static int touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
int smp_cpus, bool use_madv_populate_write)
 {
 static gsize initialized = 0;
+MemsetContext context = {
+.num_threads = get_memset_num_threads(smp_cpus),
+};
 size_t numpages_per_thread, leftover;
 void *(*touch_fn)(void *);
 int ret = 0, i = 0;
@@ -546,35 +561,41 @@ static int touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
 touch_fn = do_touch_pages;
 }
 
-threads_created_flag = false;
-memset_num_threads = get_memset_num_threads(smp_cpus);
-memset_thread = g_new0(MemsetThread, memset_num_threads);
-numpages_per_thread = numpages / memset_num_threads;
-leftover = numpages % memset_num_threads;
-for (i = 0; i < memset_num_threads; i++) {
-memset_thread[i].addr = addr;
-memset_thread[i].numpages = numpages_per_thread + (i < leftover);
-memset_thread[i].hpagesize = hpagesize;
-qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-   touch_fn, &memset_thread[i],
+context.threads = g_new0(MemsetThread, context.num_threads);
+numpages_per_thread = numpages / context.num_threads;
+leftover = numpages % context.num_threads;
+for (i = 0; i < context.num_threads; i++) {
+context.threads[i].addr = addr;
+context.threads[i].numpages = numpages_per_thread + (i < leftover);
+context.threads[i].hpagesize = hpagesize;
+context.threads[i].context = &context;
+qemu_thread_create(&context.threads[i].pgthread, "touch_pages",
+   touch_fn, &context.threads[i],
QEMU_THREAD_JOINABLE);
-addr += memset_thread[i].numpages * hpagesize;
+addr += context.threads[i].numpages * hpagesize;
+}
+
+if (!use_madv_populate_write) {
+sigbus_memset_context = &context;
 }
 
 qemu_mutex_lock(&page_mutex);
-threads_created_flag = true;
+context.all_threads_created = true;
 qemu_cond_broadcast(&page_cond);
 qemu_mutex_unlock(&page_mutex);
 
-for (i = 0; i < memset_num_threads; i++) {
-int tmp = (uintptr_t)qemu_thread_join(&memset_thread[i].pgthread);
+for (i = 0; i < context.num_threads; i++) {
+int tmp = (uintptr_t)qemu_thread_join(&context.threads[i].pgthread);
 
 

[PATCH v1 4/8] util/oslib-posix: Don't create too many threads with small memory or little pages

2021-11-30 Thread David Hildenbrand
Let's limit the number of threads to something sane, ensuring that
- we don't have more threads than the number of pages we have
- we don't start threads to initialize small (< 64 MiB) amounts of memory
(a short sketch of the resulting clamping follows below)
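
A rough sketch of the clamping, with the numbers worked through for two cases; the function name here is made up for illustration, the real logic lives in get_memset_num_threads() in the diff below (MIN/MAX/MiB as provided by QEMU's osdep.h and units.h):

/* Illustrative only -- mirrors the MIN/MAX clamping added below.
 * Example 1: 2 GiB backed by 1 GiB gigantic pages -> numpages = 2,
 *            so at most 2 threads regardless of host CPUs.
 * Example 2: 32 MiB of 4 KiB pages -> 32 MiB / 64 MiB == 0,
 *            MAX(1, 0) == 1, so a single thread is used.
 */
static int clamp_memset_threads(int threads, size_t hpagesize, size_t numpages)
{
    threads = MIN(threads, numpages);   /* not more threads than pages */
    threads = MIN(threads, MAX(1, hpagesize * numpages / (64 * MiB)));
    return threads;                     /* roughly >= 64 MiB per thread */
}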

Reviewed-by: Pankaj Gupta 
Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Daniel P. Berrangé 
Signed-off-by: David Hildenbrand 
---
 util/oslib-posix.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index cf2ead54ad..67c08a425e 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -40,6 +40,7 @@
 #include 
 #include "qemu/cutils.h"
 #include "qemu/compiler.h"
+#include "qemu/units.h"
 
 #ifdef CONFIG_LINUX
 #include 
@@ -525,7 +526,8 @@ static void *do_madv_populate_write_pages(void *arg)
 return (void *)(uintptr_t)ret;
 }
 
-static inline int get_memset_num_threads(int smp_cpus)
+static inline int get_memset_num_threads(size_t hpagesize, size_t numpages,
+ int smp_cpus)
 {
 long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
 int ret = 1;
@@ -533,6 +535,12 @@ static inline int get_memset_num_threads(int smp_cpus)
 if (host_procs > 0) {
 ret = MIN(MIN(host_procs, MAX_MEM_PREALLOC_THREAD_COUNT), smp_cpus);
 }
+
+/* Especially with gigantic pages, don't create more threads than pages. */
+ret = MIN(ret, numpages);
+/* Don't start threads to prealloc comparatively little memory. */
+ret = MIN(ret, MAX(1, hpagesize * numpages / (64 * MiB)));
+
 /* In case sysconf() fails, we fall back to single threaded */
 return ret;
 }
@@ -542,7 +550,7 @@ static int touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
 {
 static gsize initialized = 0;
 MemsetContext context = {
-.num_threads = get_memset_num_threads(smp_cpus),
+.num_threads = get_memset_num_threads(hpagesize, numpages, smp_cpus),
 };
 size_t numpages_per_thread, leftover;
 void *(*touch_fn)(void *);
-- 
2.31.1




[PATCH v1 5/8] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE

2021-11-30 Thread David Hildenbrand
Let's simplify the case when we only want a single thread and don't have
to mess with signal handlers.

Reviewed-by: Pankaj Gupta 
Reviewed-by: Daniel P. Berrangé 
Signed-off-by: David Hildenbrand 
---
 util/oslib-posix.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 67c08a425e..efa4f96d56 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -564,6 +564,14 @@ static int touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
 }
 
 if (use_madv_populate_write) {
+/* Avoid creating a single thread for MADV_POPULATE_WRITE */
+if (context.num_threads == 1) {
+if (qemu_madvise(area, hpagesize * numpages,
+ QEMU_MADV_POPULATE_WRITE)) {
+return -errno;
+}
+return 0;
+}
 touch_fn = do_madv_populate_write_pages;
 } else {
 touch_fn = do_touch_pages;
-- 
2.31.1




[PATCH v1 2/8] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()

2021-11-30 Thread David Hildenbrand
Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
does not require a SIGBUS handler, doesn't actually touch page content,
and avoids context switches; it is, therefore, faster and easier to handle
than our current approach.
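
For readers unfamiliar with the flag, here is a minimal standalone sketch (outside QEMU, assuming Linux 5.14+ and matching libc headers; sizes and messages are illustrative) of what preallocation via MADV_POPULATE_WRITE boils down to:

/* Preallocate an anonymous mapping with MADV_POPULATE_WRITE; the flag is
 * only compiled in if the headers know it. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 16 * 1024 * 1024;
    char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (area == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
#ifdef MADV_POPULATE_WRITE
    /* Prefaults page tables with writable pages; fails with EFAULT
     * instead of raising SIGBUS when allocation fails. */
    if (madvise(area, size, MADV_POPULATE_WRITE)) {
        fprintf(stderr, "MADV_POPULATE_WRITE: %s\n", strerror(errno));
        return 1;
    }
    printf("preallocated %zu bytes\n", size);
#else
    printf("MADV_POPULATE_WRITE not known to these headers\n");
#endif
    munmap(area, size);
    return 0;
}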

While MADV_POPULATE_WRITE is, in general, faster than manual
prefaulting, and especially faster with 4k pages, there is still value in
prefaulting using multiple threads to speed up preallocation.

More details on MADV_POPULATE_WRITE can be found in the Linux commits
4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
page tables") and eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for
MADV_POPULATE_(READ|WRITE)"), and in the man page proposal [1].

This resolves the TODO in do_touch_pages().

In the future, we might want to look into using fallocate(), eventually
combined with MADV_POPULATE_READ, when dealing with shared file/fd
mappings and not caring about memory bindings.

[1] https://lkml.kernel.org/r/20210816081922.5155-1-da...@redhat.com

Reviewed-by: Pankaj Gupta 
Reviewed-by: Daniel P. Berrangé 
Signed-off-by: David Hildenbrand 
---
 include/qemu/osdep.h |  7 
 util/oslib-posix.c   | 83 +---
 2 files changed, 69 insertions(+), 21 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 60718fc342..d1660d67fa 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -471,6 +471,11 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #else
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
 #endif
+#ifdef MADV_POPULATE_WRITE
+#define QEMU_MADV_POPULATE_WRITE MADV_POPULATE_WRITE
+#else
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
+#endif
 
 #elif defined(CONFIG_POSIX_MADVISE)
 
@@ -484,6 +489,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_DONTNEED
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #else /* no-op */
 
@@ -497,6 +503,7 @@ static inline void qemu_cleanup_generic_vfree(void *p)
 #define QEMU_MADV_HUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_NOHUGEPAGE  QEMU_MADV_INVALID
 #define QEMU_MADV_REMOVE QEMU_MADV_INVALID
+#define QEMU_MADV_POPULATE_WRITE QEMU_MADV_INVALID
 
 #endif
 
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index b146beef78..cb89e07770 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -484,10 +484,6 @@ static void *do_touch_pages(void *arg)
  *
  * 'volatile' to stop compiler optimizing this away
  * to a no-op
- *
- * TODO: get a better solution from kernel so we
- * don't need to write at all so we don't cause
- * wear on the storage backing the region...
  */
 *(volatile char *)addr = *addr;
 addr += hpagesize;
@@ -497,6 +493,26 @@ static void *do_touch_pages(void *arg)
 return (void *)(uintptr_t)ret;
 }
 
+static void *do_madv_populate_write_pages(void *arg)
+{
+MemsetThread *memset_args = (MemsetThread *)arg;
+const size_t size = memset_args->numpages * memset_args->hpagesize;
+char * const addr = memset_args->addr;
+int ret = 0;
+
+/* See do_touch_pages(). */
+qemu_mutex_lock(&page_mutex);
+while (!threads_created_flag) {
+qemu_cond_wait(&page_cond, &page_mutex);
+}
+qemu_mutex_unlock(&page_mutex);
+
+if (size && qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE)) {
+ret = -errno;
+}
+return (void *)(uintptr_t)ret;
+}
+
 static inline int get_memset_num_threads(int smp_cpus)
 {
 long host_procs = sysconf(_SC_NPROCESSORS_ONLN);
@@ -510,10 +526,11 @@ static inline int get_memset_num_threads(int smp_cpus)
 }
 
 static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-   int smp_cpus)
+   int smp_cpus, bool use_madv_populate_write)
 {
 static gsize initialized = 0;
 size_t numpages_per_thread, leftover;
+void *(*touch_fn)(void *);
 int ret = 0, i = 0;
 char *addr = area;
 
@@ -523,6 +540,12 @@ static int touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
 g_once_init_leave(, 1);
 }
 
+if (use_madv_populate_write) {
+touch_fn = do_madv_populate_write_pages;
+} else {
+touch_fn = do_touch_pages;
+}
+
 threads_created_flag = false;
 memset_num_threads = get_memset_num_threads(smp_cpus);
 memset_thread = g_new0(MemsetThread, memset_num_threads);
@@ -533,7 +556,7 @@ static int touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
 memset_thread[i].numpages = numpages_per_thread + (i < leftover);
 memset_thread[i].hpagesize = hpagesize;
 qemu_thread_create(&memset_thread[i].pgthread, "touch_pages",
-   do_touch_pages, &memset_thread[i],
+   touch_fn, &memset_thread[i],
   

[PATCH v7 0/3] support dirty restraint on vCPU

2021-11-30 Thread huangy81
From: Hyman Huang(黄勇) 

The patch [2/3] has not been touched so far. Any corrections and
suggestions are welcome.

Please review, thanks!

v7:
- rebase on master
- polish the comments and error message according to the
  advice given by Markus
- introduce dirtylimit_enabled function to pre-check if dirty
  page limit is enabled before canceling.

v6:
- rebase on master
- fix dirtylimit setup crash found by Markus
- polish the comments according to the advice given by Markus
- adjust the qemu qmp command tag to 7.0

v5:
- rebase on master
- adjust the throttle algorithm by removing the tuning in 
  RESTRAINT_RATIO case so that the dirty page rate can reach the quota
  more quickly.
- fix percentage update in throttle iteration.

v4:
- rebase on master
- modify the following points according to the advice given by Markus
  1. move the definition into migration.json
  2. polish the comments of set-dirty-limit
  3. do the syntax check and change dirty rate to dirty page rate

Thanks for the careful reviews made by Markus.

Please review, thanks!

v3:
- rebase on master
- modify the following points according to the advice given by Markus
  1. remove the DirtyRateQuotaVcpu and use its field as option directly
  2. add comments to show details of what dirtylimit setup do
  3. explain how to use dirtylimit in combination with existing qmp
 commands "calc-dirty-rate" and "query-dirty-rate" in documentation.

Thanks for the careful reviews made by Markus.

Please review, thanks!

Hyman

v2:
- rebase on master
- modify the following points according to the advice given by Juan
  1. rename dirtyrestraint to dirtylimit
  2. implement the full lifecycle functions of dirtylimit_calc, including
 dirtylimit_calc and dirtylimit_calc_quit
  3. introduce 'quit' field in dirtylimit_calc_state to implement the
 dirtylimit_calc_quit
  4. remove the ready_cond and ready_mtx since it may not be suitable
  5. put the 'record_dirtypage' function code at the beginning of the
 file
  6. remove the unnecessary return;
- other modifications have been made after code review
  1. introduce 'bmap' and 'nr' field in dirtylimit_state to record the
 number of running thread forked by dirtylimit
  2. stop the dirtyrate calculation thread if all the dirtylimit threads
 are stopped
  3. do some renaming works
     dirtyrate calculation thread -> dirtylimit-calc
 dirtylimit thread -> dirtylimit-{cpu_index}
 function name do_dirtyrestraint -> dirtylimit_check
     qmp command dirty-restraint -> set-dirty-limit
 qmp command dirty-restraint-cancel -> cancel-dirty-limit
 header file dirtyrestraint.h -> dirtylimit.h

Please review, thanks !

Thanks for the accurate and timely advice given by Juan. We really
appreciate it if corrections and suggestions about this patchset are
proposed.

Best Regards !

Hyman

v1:
This patchset introduces a mechanism to impose dirty restraint
on a vCPU, aiming to keep the vCPU running within a certain dirty
rate given by the user. Dirty restraint on a vCPU may be an
alternative method to implement the convergence logic for live
migration, which in theory could improve guest memory performance
during migration compared with the traditional method.

For the current live migration implementation, the convergence
logic throttles all vCPUs of the VM, which has some side effects:
- 'read processes' on a vCPU will be unnecessarily penalized
- the throttle increases the percentage step by step, which seems
  to struggle to find the optimal throttle percentage when the
  dirty rate is high.
- it is hard to predict the remaining time of migration if the
  throttling percentage reaches 99%

To a certain extent, the dirty restraint mechanism can fix these
effects by throttling at vCPU granularity during migration.

The implementation is rather straightforward: we calculate the vCPU
dirty rate periodically via the Dirty Ring mechanism, as commit
0e21bf246 "implement dirty-ring dirtyrate calculation" does. For a
vCPU that has dirty restraint imposed on it, we throttle it
periodically as auto-converge does; after each throttling period we
compare the current dirty rate with the quota, and if the current
dirty rate is not under the quota, we increase the throttling
percentage until it is.
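
Sketched as a toy simulation (all names and numbers here are illustrative; in the real series the dirty rate comes from the KVM dirty ring and throttling is done by kicking the vCPU and making it sleep):

/* Toy model of the feedback loop: raise the throttle percentage step by
 * step until the measured dirty rate drops under the user-given quota. */
#include <stdio.h>
#include <stdint.h>

static uint64_t unthrottled_mbps = 800;   /* pretend dirty rate with no throttle */

/* stand-in for the dirty-ring based measurement */
static uint64_t measure_dirty_rate(int pct)
{
    return unthrottled_mbps * (100 - pct) / 100;
}

int main(void)
{
    uint64_t quota_mbps = 200;            /* quota given by the user */
    int pct = 0;                          /* current throttle percentage */

    while (measure_dirty_rate(pct) > quota_mbps && pct < 99) {
        pct += 5;                         /* still over quota: throttle harder */
        printf("throttle %d%% -> dirty rate %llu MB/s\n",
               pct, (unsigned long long)measure_dirty_rate(pct));
    }
    return 0;
}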

This patchset is the basis for implementing a new auto-converge method
for live migration. We introduce two QMP commands to impose/cancel
the dirty restraint on a specified vCPU, so it can also serve as an
independent API for upper-layer applications such as libvirt, which can
use it to implement the convergence logic during live migration,
supplemented with the QMP 'calc-dirty-rate' command or whatever.

We post this patchset as an RFC, and any corrections and suggestions about
the implementation, API, throttling algorithm or whatever are very much
appreciated!

Please review, thanks !

Best Regards !

Hyman Huang (3):
  migration/dirtyrate: implement vCPU dirtyrate calculation periodically
  cpu-throttle: implement vCPU throttle
  cpus-common: implement dirty 

[PATCH v1 0/8] virtio-mem: Support "prealloc=on"

2021-11-30 Thread David Hildenbrand
Based-on: <20211130092838.24224-1-da...@redhat.com>

Patch #1 - #7 are fully reviewed [1] but did not get picked up yet, so I'm
sending them along here, as they are required to use os_mem_prealloc() in
a safe way once the VM is running.

Support preallocation of memory to make virtio-mem safe to use with
scarce memory resources such as hugetlb. Before acknowledging a plug
request from the guest, we'll try preallocating memory. If that fails,
we'll fail the request gracefully and warn the user once.

To fully support huge pages for shared memory, we'll have to adjust shared
memory users, such as virtiofsd, to map guest memory via MAP_NORESERVE as
well, because otherwise, they'll end up overwriting the "reserve=off"
decision made by QEMU and try reserving huge pages for the sparse memory
region.

In the future, we might want to process guest requests, including
preallocating memory, asynchronously via a dedicated iothread.

[1] https://lkml.kernel.org/r/20211004120208.7409-1-da...@redhat.com

Cc: Paolo Bonzini 
Cc: "Michael S. Tsirkin" 
Cc: Igor Mammedov 
Cc: Eduardo Habkost 
Cc: Dr. David Alan Gilbert 
Cc: Daniel P. Berrangé 
Cc: Gavin Shan 
Cc: Hui Zhu 
Cc: Sebastien Boeuf 
Cc: Pankaj Gupta 
Cc: Michal Prívozník 

David Hildenbrand (8):
  util/oslib-posix: Let touch_all_pages() return an error
  util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
  util/oslib-posix: Introduce and use MemsetContext for
touch_all_pages()
  util/oslib-posix: Don't create too many threads with small memory or
little pages
  util/oslib-posix: Avoid creating a single thread with
MADV_POPULATE_WRITE
  util/oslib-posix: Support concurrent os_mem_prealloc() invocation
  util/oslib-posix: Forward SIGBUS to MCE handler under Linux
  virtio-mem: Support "prealloc=on" option

 hw/virtio/virtio-mem.c |  39 +-
 include/hw/virtio/virtio-mem.h |   4 +
 include/qemu/osdep.h   |   7 +
 softmmu/cpus.c |   4 +
 util/oslib-posix.c | 231 +
 5 files changed, 226 insertions(+), 59 deletions(-)

-- 
2.31.1




[PATCH v1 1/8] util/oslib-posix: Let touch_all_pages() return an error

2021-11-30 Thread David Hildenbrand
Let's prepare touch_all_pages() for returning differing errors. Return
an error from the thread and report the last processed error.

Translate SIGBUS to -EFAULT, as a SIGBUS can mean all kinds of different
things (memory error, read error, out of memory). When allocating memory
fails via the current SIGBUS-based mechanism, we'll get:
os_mem_prealloc: preallocating memory failed: Bad address

Reviewed-by: Daniel P. Berrangé 
Signed-off-by: David Hildenbrand 
---
 util/oslib-posix.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index e8bdb02e1d..b146beef78 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -84,7 +84,6 @@ typedef struct MemsetThread MemsetThread;
 
 static MemsetThread *memset_thread;
 static int memset_num_threads;
-static bool memset_thread_failed;
 
 static QemuMutex page_mutex;
 static QemuCond page_cond;
@@ -452,6 +451,7 @@ static void *do_touch_pages(void *arg)
 {
 MemsetThread *memset_args = (MemsetThread *)arg;
 sigset_t set, oldset;
+int ret = 0;
 
 /*
  * On Linux, the page faults from the loop below can cause mmap_sem
@@ -470,7 +470,7 @@ static void *do_touch_pages(void *arg)
 pthread_sigmask(SIG_UNBLOCK, &set, &oldset);
 
 if (sigsetjmp(memset_args->env, 1)) {
-memset_thread_failed = true;
+ret = -EFAULT;
 } else {
 char *addr = memset_args->addr;
 size_t numpages = memset_args->numpages;
@@ -494,7 +494,7 @@ static void *do_touch_pages(void *arg)
 }
 }
 pthread_sigmask(SIG_SETMASK, &oldset, NULL);
-return NULL;
+return (void *)(uintptr_t)ret;
 }
 
 static inline int get_memset_num_threads(int smp_cpus)
@@ -509,13 +509,13 @@ static inline int get_memset_num_threads(int smp_cpus)
 return ret;
 }
 
-static bool touch_all_pages(char *area, size_t hpagesize, size_t numpages,
-int smp_cpus)
+static int touch_all_pages(char *area, size_t hpagesize, size_t numpages,
+   int smp_cpus)
 {
 static gsize initialized = 0;
 size_t numpages_per_thread, leftover;
+int ret = 0, i = 0;
 char *addr = area;
-int i = 0;
 
 if (g_once_init_enter(&initialized)) {
 qemu_mutex_init(&page_mutex);
@@ -523,7 +523,6 @@ static bool touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
 g_once_init_leave(&initialized, 1);
 }
 
-memset_thread_failed = false;
 threads_created_flag = false;
 memset_num_threads = get_memset_num_threads(smp_cpus);
 memset_thread = g_new0(MemsetThread, memset_num_threads);
@@ -545,12 +544,16 @@ static bool touch_all_pages(char *area, size_t hpagesize, 
size_t numpages,
 qemu_mutex_unlock(&page_mutex);
 
 for (i = 0; i < memset_num_threads; i++) {
-qemu_thread_join(&memset_thread[i].pgthread);
+int tmp = (uintptr_t)qemu_thread_join(&memset_thread[i].pgthread);
+
+if (tmp) {
+ret = tmp;
+}
 }
 g_free(memset_thread);
 memset_thread = NULL;
 
-return memset_thread_failed;
+return ret;
 }
 
 void os_mem_prealloc(int fd, char *area, size_t memory, int smp_cpus,
@@ -573,9 +576,10 @@ void os_mem_prealloc(int fd, char *area, size_t memory, 
int smp_cpus,
 }
 
 /* touch pages simultaneously */
-if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
-error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
-"pages available to allocate guest RAM");
+ret = touch_all_pages(area, hpagesize, numpages, smp_cpus);
+if (ret) {
+error_setg_errno(errp, -ret,
+ "os_mem_prealloc: preallocating memory failed");
 }
 
 ret = sigaction(SIGBUS, &oldact, NULL);
-- 
2.31.1




[PATCH 2/3] target/m68k: Implement TRAPcc

2021-11-30 Thread Richard Henderson
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/754
Signed-off-by: Richard Henderson 
---
 target/m68k/cpu.h   |  2 ++
 target/m68k/cpu.c   |  1 +
 target/m68k/translate.c | 21 +
 3 files changed, 24 insertions(+)

diff --git a/target/m68k/cpu.h b/target/m68k/cpu.h
index a3423729ef..03f600f7e7 100644
--- a/target/m68k/cpu.h
+++ b/target/m68k/cpu.h
@@ -527,6 +527,8 @@ enum m68k_features {
 M68K_FEATURE_MOVEC,
 /* Unaligned data accesses (680[2346]0) */
 M68K_FEATURE_UNALIGNED_DATA,
+/* TRAPCC insn. (680[2346]0, and CPU32) */
+M68K_FEATURE_TRAPCC,
 };
 
 static inline int m68k_feature(CPUM68KState *env, int feature)
diff --git a/target/m68k/cpu.c b/target/m68k/cpu.c
index c7aeb7da9c..5f778773d1 100644
--- a/target/m68k/cpu.c
+++ b/target/m68k/cpu.c
@@ -162,6 +162,7 @@ static void m68020_cpu_initfn(Object *obj)
 m68k_set_feature(env, M68K_FEATURE_CHK2);
 m68k_set_feature(env, M68K_FEATURE_MSP);
 m68k_set_feature(env, M68K_FEATURE_UNALIGNED_DATA);
+m68k_set_feature(env, M68K_FEATURE_TRAPCC);
 }
 
 /*
diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index 858ba761fc..cf29f35d91 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -4879,6 +4879,26 @@ DISAS_INSN(trapv)
 do_trapcc(s, 9); /* VS */
 }
 
+DISAS_INSN(trapcc)
+{
+/* Consume and discard the immediate operand. */
+switch (extract32(insn, 0, 3)) {
+case 2: /* trapcc.w */
+(void)read_im16(env, s);
+break;
+case 3: /* trapcc.l */
+(void)read_im32(env, s);
+break;
+case 4: /* trapcc (no operand) */
+break;
+default:
+/* Illegal insn */
+disas_undef(env, s, insn);
+return;
+}
+do_trapcc(s, extract32(insn, 8, 4));
+}
+
 static void gen_load_fcr(DisasContext *s, TCGv res, int reg)
 {
 switch (reg) {
@@ -6051,6 +6071,7 @@ void register_m68k_insns (CPUM68KState *env)
 INSN(scc,   50c0, f0f8, CF_ISA_A); /* Scc.B Dx   */
 INSN(scc,   50c0, f0c0, M68000);   /* Scc.B  */
 INSN(dbcc,  50c8, f0f8, M68000);
+INSN(trapcc,50f8, f0f8, TRAPCC);
 INSN(tpf,   51f8, fff8, CF_ISA_A);
 
 /* Branch instructions.  */
-- 
2.25.1




[PATCH v7 2/3] cpu-throttle: implement vCPU throttle

2021-11-30 Thread huangy81
From: Hyman Huang(黄勇) 

Impose dirty restraint on a vCPU by kicking it and making it sleep,
as auto-converge does during migration, but kick only the specified
vCPU instead of all vCPUs of the VM.

Start a thread to track the dirtylimit status and adjust the throttle
percentage dynamically, depending on the current and quota dirty rates.

Introduce the utility functions in the header for the dirtylimit
implementation.

Signed-off-by: Hyman Huang(黄勇) 
---
 include/sysemu/cpu-throttle.h |  30 
 softmmu/cpu-throttle.c| 316 ++
 softmmu/trace-events  |   5 +
 3 files changed, 351 insertions(+)

diff --git a/include/sysemu/cpu-throttle.h b/include/sysemu/cpu-throttle.h
index d65bdef..334e5e2 100644
--- a/include/sysemu/cpu-throttle.h
+++ b/include/sysemu/cpu-throttle.h
@@ -65,4 +65,34 @@ bool cpu_throttle_active(void);
  */
 int cpu_throttle_get_percentage(void);
 
+/**
+ * dirtylimit_enabled
+ *
+ * Returns: %true if dirty page limit for vCPU is enabled, %false otherwise.
+ */
+bool dirtylimit_enabled(int cpu_index);
+
+/**
+ * dirtylimit_state_init:
+ *
+ * initialize global state for dirtylimit
+ */
+void dirtylimit_state_init(int max_cpus);
+
+/**
+ * dirtylimit_vcpu:
+ *
+ * impose dirtylimit on vcpu until reaching the quota dirtyrate
+ */
+void dirtylimit_vcpu(int cpu_index,
+ uint64_t quota);
+/**
+ * dirtylimit_cancel_vcpu:
+ *
+ * cancel dirtylimit for the specified vcpu
+ *
+ * Returns: the number of running threads for dirtylimit
+ */
+int dirtylimit_cancel_vcpu(int cpu_index);
+
 #endif /* SYSEMU_CPU_THROTTLE_H */
diff --git a/softmmu/cpu-throttle.c b/softmmu/cpu-throttle.c
index 8c2144a..f199d68 100644
--- a/softmmu/cpu-throttle.c
+++ b/softmmu/cpu-throttle.c
@@ -29,6 +29,8 @@
 #include "qemu/main-loop.h"
 #include "sysemu/cpus.h"
 #include "sysemu/cpu-throttle.h"
+#include "sysemu/dirtylimit.h"
+#include "trace.h"
 
 /* vcpu throttling controls */
 static QEMUTimer *throttle_timer;
@@ -38,6 +40,320 @@ static unsigned int throttle_percentage;
 #define CPU_THROTTLE_PCT_MAX 99
 #define CPU_THROTTLE_TIMESLICE_NS 1000
 
+#define DIRTYLIMIT_TOLERANCE_RANGE  15  /* 15MB/s */
+
+#define DIRTYLIMIT_THROTTLE_HEAVY_WATERMARK 75
+#define DIRTYLIMIT_THROTTLE_SLIGHT_WATERMARK90
+
+#define DIRTYLIMIT_THROTTLE_HEAVY_STEP_SIZE 5
+#define DIRTYLIMIT_THROTTLE_SLIGHT_STEP_SIZE2
+
+typedef enum {
+RESTRAIN_KEEP,
+RESTRAIN_RATIO,
+RESTRAIN_HEAVY,
+RESTRAIN_SLIGHT,
+} RestrainPolicy;
+
+typedef struct DirtyLimitState {
+int cpu_index;
+bool enabled;
+uint64_t quota; /* quota dirtyrate MB/s */
+QemuThread thread;
+char *name; /* thread name */
+} DirtyLimitState;
+
+struct {
+DirtyLimitState *states;
+int max_cpus;
+unsigned long *bmap; /* running thread bitmap */
+unsigned long nr;
+} *dirtylimit_state;
+
+bool dirtylimit_enabled(int cpu_index)
+{
+return qatomic_read(&dirtylimit_state->states[cpu_index].enabled);
+}
+
+static inline void dirtylimit_set_quota(int cpu_index, uint64_t quota)
+{
+qatomic_set(&dirtylimit_state->states[cpu_index].quota, quota);
+}
+
+static inline uint64_t dirtylimit_quota(int cpu_index)
+{
+return qatomic_read(&dirtylimit_state->states[cpu_index].quota);
+}
+
+static int64_t dirtylimit_current(int cpu_index)
+{
+return dirtylimit_calc_current(cpu_index);
+}
+
+static void dirtylimit_vcpu_thread(CPUState *cpu, run_on_cpu_data data)
+{
+double pct;
+double throttle_ratio;
+int64_t sleeptime_ns, endtime_ns;
+int *percentage = (int *)data.host_ptr;
+
+pct = (double)(*percentage) / 100;
+throttle_ratio = pct / (1 - pct);
+/* Add 1ns to fix double's rounding error (like 0.999...) */
+sleeptime_ns = (int64_t)(throttle_ratio * CPU_THROTTLE_TIMESLICE_NS + 1);
+endtime_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + sleeptime_ns;
+while (sleeptime_ns > 0 && !cpu->stop) {
+if (sleeptime_ns > SCALE_MS) {
+qemu_cond_timedwait_iothread(cpu->halt_cond,
+ sleeptime_ns / SCALE_MS);
+} else {
+qemu_mutex_unlock_iothread();
+g_usleep(sleeptime_ns / SCALE_US);
+qemu_mutex_lock_iothread();
+}
+sleeptime_ns = endtime_ns - qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+}
+qatomic_set(&cpu->throttle_thread_scheduled, 0);
+
+free(percentage);
+}
+
+static void dirtylimit_check(int cpu_index,
+ int percentage)
+{
+CPUState *cpu;
+int64_t sleeptime_ns, starttime_ms, currenttime_ms;
+int *pct_parameter;
+double pct;
+
+pct = (double) percentage / 100;
+
+starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+while (true) {
+CPU_FOREACH(cpu) {
+if ((cpu_index == cpu->cpu_index) &&
+(!qatomic_xchg(&cpu->throttle_thread_scheduled, 1))) {
+pct_parameter = malloc(sizeof(*pct_parameter));
+*pct_parameter = percentage;
+  

[PATCH v7 1/3] migration/dirtyrate: implement vCPU dirtyrate calculation periodically

2021-11-30 Thread huangy81
From: Hyman Huang(黄勇) 

Introduce the third dirty tracking method, GLOBAL_DIRTY_LIMIT,
used to calculate the dirty rate periodically for dirty restraint.

Implement a thread that calculates the dirty rate periodically, which
will be used for dirty restraint.

Add dirtylimit.h to introduce the utility functions for the dirty
limit implementation.

Signed-off-by: Hyman Huang(黄勇) 
---
 include/exec/memory.h   |   5 +-
 include/sysemu/dirtylimit.h |  44 ++
 migration/dirtyrate.c   | 139 
 migration/dirtyrate.h   |   2 +
 4 files changed, 179 insertions(+), 11 deletions(-)
 create mode 100644 include/sysemu/dirtylimit.h

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 20f1b27..606bec8 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -69,7 +69,10 @@ static inline void fuzz_dma_read_cb(size_t addr,
 /* Dirty tracking enabled because measuring dirty rate */
 #define GLOBAL_DIRTY_DIRTY_RATE (1U << 1)
 
-#define GLOBAL_DIRTY_MASK  (0x3)
+/* Dirty tracking enabled because dirty limit */
+#define GLOBAL_DIRTY_LIMIT  (1U << 2)
+
+#define GLOBAL_DIRTY_MASK  (0x7)
 
 extern unsigned int global_dirty_tracking;
 
diff --git a/include/sysemu/dirtylimit.h b/include/sysemu/dirtylimit.h
new file mode 100644
index 000..49298a2
--- /dev/null
+++ b/include/sysemu/dirtylimit.h
@@ -0,0 +1,44 @@
+/*
+ * dirty limit helper functions
+ *
+ * Copyright (c) 2021 CHINA TELECOM CO.,LTD.
+ *
+ * Authors:
+ *  Hyman Huang(黄勇) 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_DIRTYRLIMIT_H
+#define QEMU_DIRTYRLIMIT_H
+
+#define DIRTYLIMIT_CALC_PERIOD_TIME_S   15  /* 15s */
+
+/**
+ * dirtylimit_calc_current:
+ *
+ * get current dirty page rate for specified vCPU.
+ */
+int64_t dirtylimit_calc_current(int cpu_index);
+
+/**
+ * dirtylimit_calc:
+ *
+ * start dirty page rate calculation thread.
+ */
+void dirtylimit_calc(void);
+
+/**
+ * dirtylimit_calc_quit:
+ *
+ * quit dirty page rate calculation thread.
+ */
+void dirtylimit_calc_quit(void);
+
+/**
+ * dirtylimit_calc_state_init:
+ *
+ * initialize dirty page rate calculation state.
+ */
+void dirtylimit_calc_state_init(int max_cpus);
+#endif
diff --git a/migration/dirtyrate.c b/migration/dirtyrate.c
index d65e744..d370a21 100644
--- a/migration/dirtyrate.c
+++ b/migration/dirtyrate.c
@@ -27,6 +27,7 @@
 #include "qapi/qmp/qdict.h"
 #include "sysemu/kvm.h"
 #include "sysemu/runstate.h"
+#include "sysemu/dirtylimit.h"
 #include "exec/memory.h"
 
 /*
@@ -46,6 +47,134 @@ static struct DirtyRateStat DirtyStat;
 static DirtyRateMeasureMode dirtyrate_mode =
 DIRTY_RATE_MEASURE_MODE_PAGE_SAMPLING;
 
+#define DIRTYLIMIT_CALC_TIME_MS 1000/* 1000ms */
+
+struct {
+DirtyRatesData data;
+int64_t period;
+bool quit;
+} *dirtylimit_calc_state;
+
+static void dirtylimit_global_dirty_log_start(void)
+{
+qemu_mutex_lock_iothread();
+memory_global_dirty_log_start(GLOBAL_DIRTY_LIMIT);
+qemu_mutex_unlock_iothread();
+}
+
+static void dirtylimit_global_dirty_log_stop(void)
+{
+qemu_mutex_lock_iothread();
+memory_global_dirty_log_sync();
+memory_global_dirty_log_stop(GLOBAL_DIRTY_LIMIT);
+qemu_mutex_unlock_iothread();
+}
+
+static inline void record_dirtypages(DirtyPageRecord *dirty_pages,
+ CPUState *cpu, bool start)
+{
+if (start) {
+dirty_pages[cpu->cpu_index].start_pages = cpu->dirty_pages;
+} else {
+dirty_pages[cpu->cpu_index].end_pages = cpu->dirty_pages;
+}
+}
+
+static void dirtylimit_calc_func(void)
+{
+CPUState *cpu;
+DirtyPageRecord *dirty_pages;
+int64_t start_time, end_time, calc_time;
+DirtyRateVcpu rate;
+int i = 0;
+
+dirty_pages = g_malloc0(sizeof(*dirty_pages) *
+dirtylimit_calc_state->data.nvcpu);
+
+dirtylimit_global_dirty_log_start();
+
+CPU_FOREACH(cpu) {
+record_dirtypages(dirty_pages, cpu, true);
+}
+
+start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+g_usleep(DIRTYLIMIT_CALC_TIME_MS * 1000);
+end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+calc_time = end_time - start_time;
+
+dirtylimit_global_dirty_log_stop();
+
+CPU_FOREACH(cpu) {
+record_dirtypages(dirty_pages, cpu, false);
+}
+
+for (i = 0; i < dirtylimit_calc_state->data.nvcpu; i++) {
+uint64_t increased_dirty_pages =
+dirty_pages[i].end_pages - dirty_pages[i].start_pages;
+uint64_t memory_size_MB =
+(increased_dirty_pages * TARGET_PAGE_SIZE) >> 20;
+int64_t dirtyrate = (memory_size_MB * 1000) / calc_time;
+
+rate.id = i;
+rate.dirty_rate  = dirtyrate;
+dirtylimit_calc_state->data.rates[i] = rate;
+
+trace_dirtyrate_do_calculate_vcpu(i,
+dirtylimit_calc_state->data.rates[i].dirty_rate);
+}
+}
+
+static void 

[PATCH 3/3] target/m68k: Implement FTRAPcc

2021-11-30 Thread Richard Henderson
Signed-off-by: Richard Henderson 
---
 target/m68k/translate.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index cf29f35d91..3c04f9d1a9 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -5547,6 +5547,43 @@ DISAS_INSN(fscc)
 tcg_temp_free(tmp);
 }
 
+DISAS_INSN(ftrapcc)
+{
+DisasCompare c;
+TCGLabel *over;
+uint16_t ext;
+int cond;
+
+ext = read_im16(env, s);
+cond = ext & 0x3f;
+
+/* Consume and discard the immediate operand. */
+switch (extract32(insn, 0, 3)) {
+case 2: /* ftrapcc.w */
+(void)read_im16(env, s);
+break;
+case 3: /* ftrapcc.l */
+(void)read_im32(env, s);
+break;
+case 4: /* ftrapcc (no operand) */
+break;
+default:
+/* Illegal insn */
+disas_undef(env, s, insn);
+return;
+}
+
+/* Jump over if !cond. */
+gen_fcc_cond(&c, s, cond);
+update_cc_op(s);
+over = gen_new_label();
+tcg_gen_brcond_i32(tcg_invert_cond(c.tcond), c.v1, c.v2, over);
+free_cond(&c);
+
+gen_exception(s, s->base.pc_next, EXCP_TRAPCC);
+gen_set_label(over);
+}
+
 #if defined(CONFIG_SOFTMMU)
 DISAS_INSN(frestore)
 {
@@ -6170,6 +6207,7 @@ void register_m68k_insns (CPUM68KState *env)
 INSN(fbcc,  f280, ffc0, CF_FPU);
 INSN(fpu,   f200, ffc0, FPU);
 INSN(fscc,  f240, ffc0, FPU);
+INSN(ftrapcc,   f278, ff80, FPU);
 INSN(fbcc,  f280, ff80, FPU);
 #if defined(CONFIG_SOFTMMU)
 INSN(frestore,  f340, ffc0, CF_FPU);
-- 
2.25.1




[PATCH 1/3] target/m68k: Implement TRAPV

2021-11-30 Thread Richard Henderson
Signed-off-by: Richard Henderson 
---
 target/m68k/translate.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index af43c8eab8..858ba761fc 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -4863,6 +4863,22 @@ DISAS_INSN(trap)
 gen_exception(s, s->base.pc_next, EXCP_TRAP0 + (insn & 0xf));
 }
 
+static void do_trapcc(DisasContext *s, int cond)
+{
+TCGLabel *over = gen_new_label();
+
+/* Jump over if !cond. */
+gen_jmpcc(s, cond ^ 1, over);
+
+gen_exception(s, s->base.pc_next, EXCP_TRAPCC);
+gen_set_label(over);
+}
+
+DISAS_INSN(trapv)
+{
+do_trapcc(s, 9); /* VS */
+}
+
 static void gen_load_fcr(DisasContext *s, TCGv res, int reg)
 {
 switch (reg) {
@@ -6026,6 +6042,7 @@ void register_m68k_insns (CPUM68KState *env)
 BASE(nop,   4e71, );
 INSN(rtd,   4e74, , RTD);
 BASE(rts,   4e75, );
+INSN(trapv, 4e76, , M68000);
 INSN(rtr,   4e77, , M68000);
 BASE(jump,  4e80, ffc0);
 BASE(jump,  4ec0, ffc0);
-- 
2.25.1




[PATCH for-7.0 0/3] target/m68k: Implement conditional traps

2021-11-30 Thread Richard Henderson
While looking at #754 for trapcc, I noticed that the other
conditional traps, trapv and ftrapcc, are also missing.


r~


Richard Henderson (3):
  target/m68k: Implement TRAPV
  target/m68k: Implement TRAPcc
  target/m68k: Implement FTRAPcc

 target/m68k/cpu.h   |  2 ++
 target/m68k/cpu.c   |  1 +
 target/m68k/translate.c | 76 +
 3 files changed, 79 insertions(+)

-- 
2.25.1




[PATCH v7 3/3] cpus-common: implement dirty page limit on vCPU

2021-11-30 Thread huangy81
From: Hyman Huang(黄勇) 

Implement periodic dirty rate calculation based on the dirty
ring and throttle the vCPU until it reaches the quota dirty
page rate given by the user.

Introduce the QMP commands set-dirty-limit/cancel-dirty-limit to
set/cancel the dirty page limit on a vCPU.

Signed-off-by: Hyman Huang(黄勇) 
---
 cpus-common.c | 48 
 include/hw/core/cpu.h |  9 +
 qapi/migration.json   | 43 +++
 softmmu/vl.c  |  1 +
 4 files changed, 101 insertions(+)

diff --git a/cpus-common.c b/cpus-common.c
index 6e73d3e..86c7712 100644
--- a/cpus-common.c
+++ b/cpus-common.c
@@ -23,6 +23,11 @@
 #include "hw/core/cpu.h"
 #include "sysemu/cpus.h"
 #include "qemu/lockable.h"
+#include "sysemu/dirtylimit.h"
+#include "sysemu/cpu-throttle.h"
+#include "sysemu/kvm.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-migration.h"
 
 static QemuMutex qemu_cpu_list_lock;
 static QemuCond exclusive_cond;
@@ -352,3 +357,46 @@ void process_queued_cpu_work(CPUState *cpu)
 qemu_mutex_unlock(&cpu->work_mutex);
 qemu_cond_broadcast(&qemu_work_cond);
 }
+
+void qmp_set_dirty_limit(int64_t idx,
+ uint64_t dirtyrate,
+ Error **errp)
+{
+if (!kvm_enabled() || !kvm_dirty_ring_enabled()) {
+error_setg(errp, "setting a dirty page limit requires KVM with"
+   " accelerator property 'dirty-ring-size' set");
+return;
+}
+
+dirtylimit_calc();
+dirtylimit_vcpu(idx, dirtyrate);
+}
+
+void qmp_cancel_dirty_limit(int64_t idx,
+Error **errp)
+{
+if (!kvm_enabled() || !kvm_dirty_ring_enabled()) {
+error_setg(errp, "KVM with accelerator property 'dirty-ring-size'"
+   " not set, abort canceling a dirty page limit");
+return;
+}
+
+if (!dirtylimit_enabled(idx)) {
+error_setg(errp, "dirty page limit for the CPU %ld not set", idx);
+return;
+}
+
+if (unlikely(!dirtylimit_cancel_vcpu(idx))) {
+dirtylimit_calc_quit();
+}
+}
+
+void dirtylimit_setup(int max_cpus)
+{
+if (!kvm_enabled() || !kvm_dirty_ring_enabled()) {
+return;
+}
+
+dirtylimit_calc_state_init(max_cpus);
+dirtylimit_state_init(max_cpus);
+}
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index e948e81..11df012 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -881,6 +881,15 @@ void end_exclusive(void);
  */
 void qemu_init_vcpu(CPUState *cpu);
 
+/**
+ * dirtylimit_setup:
+ *
+ * Initializes the global state of dirtylimit calculation and
+ * dirtylimit itself. This is prepared for vCPU dirtylimit which
+ * could be triggered during vm lifecycle.
+ */
+void dirtylimit_setup(int max_cpus);
+
 #define SSTEP_ENABLE  0x1  /* Enable simulated HW single stepping */
 #define SSTEP_NOIRQ   0x2  /* Do not use IRQ while single stepping */
 #define SSTEP_NOTIMER 0x4  /* Do not Timers while single stepping */
diff --git a/qapi/migration.json b/qapi/migration.json
index bbfd48c..57c9a63 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -1850,6 +1850,49 @@
 { 'command': 'query-dirty-rate', 'returns': 'DirtyRateInfo' }
 
 ##
+# @set-dirty-limit:
+#
+# Set the upper limit of dirty page rate for a virtual CPU.
+#
+# Requires KVM with accelerator property "dirty-ring-size" set.
+# A virtual CPU's dirty page rate is a measure of its memory load.
+# To observe dirty page rates, use @calc-dirty-rate.
+#
+# @cpu-index: index of the virtual CPU.
+#
+# @dirty-rate: upper limit for the specified vCPU's dirty page rate (MB/s)
+#
+# Since: 7.0
+#
+# Example:
+#   { "execute": "set-dirty-limit",
+#     "arguments": { "cpu-index": 0, "dirty-rate": 200 } }
+#
+##
+{ 'command': 'set-dirty-limit',
+  'data': { 'cpu-index': 'int', 'dirty-rate': 'uint64' } }
+
+##
+# @cancel-dirty-limit:
+#
+# Cancel the dirty page limit for the vCPU which has been set with
+# set-dirty-limit command. Note that this command requires support from
+# dirty ring, same as the "set-dirty-limit" command.
+#
+# @cpu-index: index of the virtual CPU to cancel the dirty page limit
+#
+# Since: 7.0
+#
+# Example:
+#   { "execute": "cancel-dirty-limit",
+#     "arguments": { "cpu-index": 0 } }
+#
+##
+{ 'command': 'cancel-dirty-limit',
+  'data': { 'cpu-index': 'int' } }
+
+##
 # @snapshot-save:
 #
 # Save a VM snapshot
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 620a1f1..0f83ce3 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -3777,5 +3777,6 @@ void qemu_init(int argc, char **argv, char **envp)
 qemu_init_displays();
 accel_setup_post(current_machine);
 os_setup_post();
+dirtylimit_setup(current_machine->smp.max_cpus);
 resume_mux_open();
 }
-- 
1.8.3.1




[Bug 1749393] Re: sbrk() not working under qemu-user with a PIE-compiled binary?

2021-11-30 Thread Christian Ehrhardt 
Uploaded to F-unapproved, waiting for the SRU team to accept it.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1749393

Title:
  sbrk() not working under qemu-user with a PIE-compiled binary?

Status in QEMU:
  Fix Released
Status in qemu package in Ubuntu:
  Fix Released
Status in qemu source package in Focal:
  In Progress

Bug description:
  [Impact]

   * The current space reserved can be too small and we can end up
     with no space at all for BRK. It can happen to any case, but is
     much more likely with the now common PIE binaries.

   * Backport the upstream fix which reserves a bit more space while loading
     and giving it back after interpreter and stack is loaded.

  [Test Plan]

   * On x86 run:
  sudo apt install -y qemu-user-static docker.io
  sudo docker run --rm arm64v8/debian:bullseye bash -c 'apt update && apt 
install -y wget'
  ...
  Running hooks in /etc/ca-certificates/update.d...
  done.
  Errors were encountered while processing:
   libc-bin
  E: Sub-process /usr/bin/dpkg returned an error code (1)

  
  Second test from bug 1928075

  $ sudo qemu-debootstrap --arch=arm64 bullseye bullseye-arm64
  http://ftp.debian.org/debian

  In the bad case this is failing like
  W: Failure trying to run: /sbin/ldconfig
  W: See //debootstrap/debootstrap.log for detail

  And in that log file you'll see the segfault
  $ tail -n 2 bullseye-arm64/debootstrap/debootstrap.log
  qemu: uncaught target signal 11 (Segmentation fault) - core dumped
  Segmentation fault (core dumped)

  [Where problems could occur]

   * Regressions would be around use-cases of linux-user, that is,
     emulation not of a full system but of individual binaries.
     It is commonly used for cross-tests and cross-builds, so that is
     the space to watch for regressions.

  [Other Info]

   * n/a

  ---

  In Debian unstable, we recently switched bash to be a PIE-compiled
  binary (for hardening). Unfortunately this resulted in bash being
  broken when run under qemu-user (for all target architectures, host
  being amd64 for me).

  $ sudo chroot /srv/chroots/sid-i386/ qemu-i386-static /bin/bash
  bash: xmalloc: .././shell.c:1709: cannot allocate 10 bytes (0 bytes allocated)

  bash has its own malloc implementation based on sbrk():
  https://git.savannah.gnu.org/cgit/bash.git/tree/lib/malloc/malloc.c

  When we disable this internal implementation and rely on glibc's
  malloc, then everything is fine. But it might be that glibc has a
  fallback when sbrk() is not working properly and it might hide the
  underlying problem in qemu-user.
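
  A minimal standalone probe of sbrk(), purely illustrative and not the
  bash allocator, can be built for the target architecture and run under
  qemu-user to compare against native behaviour:

    /* sbrk probe, illustration only (not the bash allocator). */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        void *before = sbrk(0);      /* current program break */
        void *chunk  = sbrk(4096);   /* try to grow the heap by one page */

        if (chunk == (void *) -1) {
            perror("sbrk");
            return 1;
        }
        printf("break moved from %p to %p\n", before, sbrk(0));
        return 0;
    }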

  This issue has also been reported to the bash upstream author, who 
suggested that the problem might be in qemu-user, so I'm opening a ticket 
here. Here's the discussion with the bash upstream author:
  https://lists.gnu.org/archive/html/bug-bash/2018-02/threads.html#00080

  You can find the problematic bash binary in that .deb file:
  
http://snapshot.debian.org/archive/debian/20180206T154716Z/pool/main/b/bash/bash_4.4.18-1_i386.deb

  The version of qemu I have been using is 2.11 (Debian package qemu-
  user-static version 1:2.11+dfsg-1) but I have had reports that the
  problem is reproducible with older versions (back to 2.8 at least).

  Here are the related Debian bug reports:
  https://bugs.debian.org/889869
  https://bugs.debian.org/865599

  It's worth noting that bash used to have this problem (when compiled as a PIE 
binary) even when run directly, but then something got fixed in the kernel and 
now the problem only appears when run under qemu-user:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518483

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1749393/+subscriptions




Re: [PATCH for-7.0 0/4] qemu-common.h include cleanup

2021-11-30 Thread Markus Armbruster
Peter Maydell  writes:

> On Mon, 29 Nov 2021 at 20:05, Peter Maydell  wrote:
>>
>> qemu-common.h has a comment at the top:
>>
>>  * This file is supposed to be included only by .c files. No header file 
>> should
>>  * depend on qemu-common.h, as this would easily lead to circular header
>>  * dependencies.
>
> As a side note, that comment was added back in 2012 when qemu-common.h
> was bigger, included other headers, and did some of the work we currently
> use osdep.h for. As it stands today qemu-common.h includes no other
> files so it isn't a source of possible circular dependencies -- it's
> just a grab-bag of miscellaneous prototypes that in an ideal world
> would be in more focused individual headers[*]. So there's an argument
> for deleting this comment...

First, thank you for this cleanup series.

The comment is from commit 04509ad939a:

qemu-common.h: Comment about usage rules

Every time we make a tiny change on a header file, we often find
circular header dependency problems. To avoid this nightmare, we need to
stop including qemu-common.h from other headers, and we should gradually
move the declarations from the catch-all qemu-common.h header to their
specific headers.

This simply adds a comment documenting the rules about qemu-common.h,
hoping that people will see it before including qemu-common.h from other
header files, and before adding more declarations to qemu-common.h.

Signed-off-by: Eduardo Habkost 
Signed-off-by: Andreas Färber 

You're right: we've since moved all #include out of qemu-common.h, so
the risk of circular header dependency is gone, and the comment's claim
"this would easily lead to circular header dependencies" is no longer
correct.

In my opinion, including qemu-common.h in a header is a bad idea
regardless.  Headers should include the headers they need to make them
self-contained, and no more, because more risks slower compiles, in
particular frequent recompiles of pretty much everything.  Previously
discussed at

https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg06291.html
Message-ID: <877eac82il@dusky.pond.sub.org>

Nothing in qemu-common.h should be required to make another header
self-contained.
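
As a small illustration (hypothetical file name), a self-contained
header includes exactly what its own declarations need, and no more:

  /* include/foo/frobnicate.h (hypothetical) */
  #ifndef FOO_FROBNICATE_H
  #define FOO_FROBNICATE_H

  #include "qemu/typedefs.h"   /* for the Error typedef used below */

  /* Any .c file can include this header first and it still compiles,
   * without pulling in qemu-common.h or other catch-all headers. */
  int frobnicate(const char *name, Error **errp);

  #endif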

This is likely the case for other headers as well, which don't carry a
comment forbidding inclusion into headers.  You could argue that
qemu-common.h should not carry one, either.  I can accept that.  I'm not
attached to the comment.  I am interested in keeping unwanted #include
in headers under control.

> [*] A cleanup that would be nice, and I'm about to send out a patchset
> that splits out the rtc related functions; but the grab-bag at the
> bottom of osdep.h is probably higher priority because that header
> gets pulled in by an order of magnitude more C files.

By all "normal" ones, in fact.




[PATCH v2 1/4] block_int: make bdrv_backing_overridden static

2021-11-30 Thread Emanuele Giuseppe Esposito
bdrv_backing_overridden is only used in block.c, so there is
no need to leave it in block_int.h

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block.c   | 4 +++-
 include/block/block_int.h | 3 ---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index 0ac5b163d2..10346b5011 100644
--- a/block.c
+++ b/block.c
@@ -103,6 +103,8 @@ static int bdrv_reopen_prepare(BDRVReopenState 
*reopen_state,
 static void bdrv_reopen_commit(BDRVReopenState *reopen_state);
 static void bdrv_reopen_abort(BDRVReopenState *reopen_state);
 
+static bool bdrv_backing_overridden(BlockDriverState *bs);
+
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
 
@@ -7475,7 +7477,7 @@ static bool append_strong_runtime_options(QDict *d, 
BlockDriverState *bs)
 /* Note: This function may return false positives; it may return true
  * even if opening the backing file specified by bs's image header
  * would result in exactly bs->backing. */
-bool bdrv_backing_overridden(BlockDriverState *bs)
+static bool bdrv_backing_overridden(BlockDriverState *bs)
 {
 if (bs->backing) {
 return strcmp(bs->auto_backing_file,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f4c75e8ba9..27008cfb22 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1122,9 +1122,6 @@ BlockDriver *bdrv_probe_all(const uint8_t *buf, int 
buf_size,
 void bdrv_parse_filename_strip_prefix(const char *filename, const char *prefix,
   QDict *options);
 
-bool bdrv_backing_overridden(BlockDriverState *bs);
-
-
 /**
  * bdrv_add_aio_context_notifier:
  *
-- 
2.31.1
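
The pattern above is the usual one when a function loses external
linkage but its first use in the file precedes its definition: keep a
static forward declaration near the top. A generic, standalone sketch
with hypothetical names:

  #include <stdbool.h>
  #include <stdio.h>

  struct Widget { int overridden; };

  /* Forward declaration: the definition appears further down. */
  static bool widget_overridden(struct Widget *w);

  static void widget_report(struct Widget *w)
  {
      /* Call site comes before the definition, hence the declaration. */
      printf("overridden: %d\n", widget_overridden(w));
  }

  static bool widget_overridden(struct Widget *w)
  {
      return w && w->overridden;
  }

  int main(void)
  {
      struct Widget w = { .overridden = 1 };
      widget_report(&w);
      return 0;
  }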




[PATCH v2 4/4] include/sysemu/blockdev.h: remove drive_get_max_devs

2021-11-30 Thread Emanuele Giuseppe Esposito
Remove drive_get_max_devs, as it is not used by anyone.

Last use was removed in commit 8f2d75e81d5
("hw: Drop superfluous special checks for orphaned -drive").

Signed-off-by: Emanuele Giuseppe Esposito 
Reviewed-by: Philippe Mathieu-Daudé 
---
 blockdev.c| 17 -
 include/sysemu/blockdev.h |  1 -
 2 files changed, 18 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index e5cb3eb95f..0112f488a4 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -173,23 +173,6 @@ void blockdev_auto_del(BlockBackend *blk)
 }
 }
 
-/**
- * Returns the current mapping of how many units per bus
- * a particular interface can support.
- *
- *  A positive integer indicates n units per bus.
- *  0 implies the mapping has not been established.
- * -1 indicates an invalid BlockInterfaceType was given.
- */
-int drive_get_max_devs(BlockInterfaceType type)
-{
-if (type >= IF_IDE && type < IF_COUNT) {
-return if_max_devs[type];
-}
-
-return -1;
-}
-
 static int drive_index_to_bus_id(BlockInterfaceType type, int index)
 {
 int max_devs = if_max_devs[type];
diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
index 21ae0b994a..430abdbdb8 100644
--- a/include/sysemu/blockdev.h
+++ b/include/sysemu/blockdev.h
@@ -49,7 +49,6 @@ DriveInfo *drive_get(BlockInterfaceType type, int bus, int 
unit);
 void drive_check_orphaned(void);
 DriveInfo *drive_get_by_index(BlockInterfaceType type, int index);
 int drive_get_max_bus(BlockInterfaceType type);
-int drive_get_max_devs(BlockInterfaceType type);
 DriveInfo *drive_get_next(BlockInterfaceType type);
 
 DriveInfo *drive_new(QemuOpts *arg, BlockInterfaceType block_default_type,
-- 
2.31.1




[PATCH v2 2/4] include/sysemu/blockdev.h: introduce block_if_name

2021-11-30 Thread Emanuele Giuseppe Esposito
Add a getter function for the if_name array, so that code outside
blockdev.c can also access it.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 blockdev.c| 5 +
 include/sysemu/blockdev.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index b35072644e..1581f0d2f5 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -83,6 +83,11 @@ static const char *const if_name[IF_COUNT] = {
 [IF_XEN] = "xen",
 };
 
+const char *block_if_name(BlockInterfaceType iface)
+{
+return if_name[iface];
+}
+
 static int if_max_devs[IF_COUNT] = {
 /*
  * Do not change these numbers!  They govern how drive option
diff --git a/include/sysemu/blockdev.h b/include/sysemu/blockdev.h
index 32c2d6023c..b321286dee 100644
--- a/include/sysemu/blockdev.h
+++ b/include/sysemu/blockdev.h
@@ -42,6 +42,7 @@ DriveInfo *blk_legacy_dinfo(BlockBackend *blk);
 DriveInfo *blk_set_legacy_dinfo(BlockBackend *blk, DriveInfo *dinfo);
 BlockBackend *blk_by_legacy_dinfo(DriveInfo *dinfo);
 
+const char *block_if_name(BlockInterfaceType iface);
 void override_max_devs(BlockInterfaceType type, int max_devs);
 
 DriveInfo *drive_get(BlockInterfaceType type, int bus, int unit);
-- 
2.31.1
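
With the getter in place, code outside blockdev.c can translate an
interface type to its name without seeing the array itself. A
hypothetical caller (not part of this series) might look like:

  /* Hypothetical user of the new getter, outside blockdev.c. */
  #include "qemu/osdep.h"
  #include "qemu/error-report.h"
  #include "sysemu/blockdev.h"

  static void report_interface(BlockInterfaceType type)
  {
      /* if_name[] stays private to blockdev.c; only the getter is used. */
      error_report("drive interface: %s", block_if_name(type));
  }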




  1   2   >