date:20220414

Re: [PATCH 3/5] virtio-net: add RSS support for Vhost backends

2022-04-14 Thread Jason Wang




在 2022/4/8 20:28, Maxime Coquelin 写道:

This patch introduces new Vhost backend callbacks to
support RSS, and makes them called in Virtio-net
device.

It will be used by Vhost-user backend implementation to
support RSS feature.

Signed-off-by: Maxime Coquelin 
---
  hw/net/vhost_net-stub.c   | 10 ++
  hw/net/vhost_net.c| 22 +
  hw/net/virtio-net.c   | 53 +--
  include/hw/virtio/vhost-backend.h |  7 
  include/net/vhost_net.h   |  4 +++
  5 files changed, 79 insertions(+), 17 deletions(-)

diff --git a/hw/net/vhost_net-stub.c b/hw/net/vhost_net-stub.c
index 89d71cfb8e..cc05e07c1f 100644
--- a/hw/net/vhost_net-stub.c
+++ b/hw/net/vhost_net-stub.c
@@ -101,3 +101,13 @@ int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu)
  {
  return 0;
  }
+
+int vhost_net_get_rss(struct vhost_net *net, VirtioNetRssCapa *rss_capa)
+{
+return 0;
+}
+
+int vhost_net_set_rss(struct vhost_net *net, VirtioNetRssData *rss_data)
+{
+return 0;
+}
diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index 30379d2ca4..aa2a1e8e5f 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -512,3 +512,25 @@ int vhost_net_set_mtu(struct vhost_net *net, uint16_t mtu)
  
  return vhost_ops->vhost_net_set_mtu(>dev, mtu);

  }
+
+int vhost_net_get_rss(struct vhost_net *net, VirtioNetRssCapa *rss_capa)
+{
+const VhostOps *vhost_ops = net->dev.vhost_ops;
+
+if (!vhost_ops->vhost_net_get_rss) {
+return 0;
+}
+
+return vhost_ops->vhost_net_get_rss(>dev, rss_capa);
+}
+
+int vhost_net_set_rss(struct vhost_net *net, VirtioNetRssData *rss_data)
+{
+const VhostOps *vhost_ops = net->dev.vhost_ops;
+
+if (!vhost_ops->vhost_net_set_rss) {
+return 0;
+}
+
+return vhost_ops->vhost_net_set_rss(>dev, rss_data);
+}
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38436e472b..237bbdb1b3 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -741,8 +741,10 @@ static uint64_t virtio_net_get_features(VirtIODevice 
*vdev, uint64_t features,
  return features;
  }
  
-if (!ebpf_rss_is_loaded(>ebpf_rss)) {

-virtio_clear_feature(, VIRTIO_NET_F_RSS);
+if (nc->peer->info->type == NET_CLIENT_DRIVER_TAP) {
+if (!ebpf_rss_is_loaded(>ebpf_rss)) {
+virtio_clear_feature(, VIRTIO_NET_F_RSS);
+}
  }
  features = vhost_net_get_features(get_vhost_net(nc->peer), features);
  vdev->backend_features = features;
@@ -1161,11 +1163,17 @@ static void virtio_net_detach_epbf_rss(VirtIONet *n);
  
  static void virtio_net_disable_rss(VirtIONet *n)

  {
+NetClientState *nc = qemu_get_queue(n->nic);
+
  if (n->rss_data.enabled) {
  trace_virtio_net_rss_disable();
  }
  n->rss_data.enabled = false;
  
+if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_USER) {

+vhost_net_set_rss(get_vhost_net(nc->peer), >rss_data);
+}
+
  virtio_net_detach_epbf_rss(n);
  }
  
@@ -1239,6 +1247,7 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,

bool do_rss)
  {
  VirtIODevice *vdev = VIRTIO_DEVICE(n);
+NetClientState *nc = qemu_get_queue(n->nic);
  struct virtio_net_rss_config cfg;
  size_t s, offset = 0, size_get;
  uint16_t queue_pairs, i;
@@ -1354,22 +1363,29 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
  }
  n->rss_data.enabled = true;
  
-if (!n->rss_data.populate_hash) {

-if (!virtio_net_attach_epbf_rss(n)) {
-/* EBPF must be loaded for vhost */
-if (get_vhost_net(qemu_get_queue(n->nic)->peer)) {
-warn_report("Can't load eBPF RSS for vhost");
-goto error;
+if (nc->peer && nc->peer->info->type == NET_CLIENT_DRIVER_VHOST_USER) {
+if (vhost_net_set_rss(get_vhost_net(nc->peer), >rss_data)) {
+warn_report("Failed to configure RSS for vhost-user");
+goto error;
+}
+} else {
+if (!n->rss_data.populate_hash) {
+if (!virtio_net_attach_epbf_rss(n)) {
+/* EBPF must be loaded for vhost */
+if (get_vhost_net(nc->peer)) {
+warn_report("Can't load eBPF RSS for vhost");
+goto error;
+}
+/* fallback to software RSS */
+warn_report("Can't load eBPF RSS - fallback to software RSS");
+n->rss_data.enabled_software_rss = true;
  }
-/* fallback to software RSS */
-warn_report("Can't load eBPF RSS - fallback to software RSS");
+} else {
+/* use software RSS for hash populating */
+/* and detach eBPF if was loaded before */
+virtio_net_detach_epbf_rss(n);
  n->rss_data.enabled_software_rss = true;
  }
-} else {
-/* use software RSS for hash populating */
-

Re: [PATCH 0/5] Vhost-user: add Virtio RSS support

2022-04-14 Thread Jason Wang




在 2022/4/8 20:28, Maxime Coquelin 写道:

The goal of this series is to add support for Virtio RSS
feature to the Vhost-user backend.

First patches are preliminary reworks to support variable
RSS key and indirection table length. eBPF change only adds
checks on whether the key length is 40B, it does not add
support for longer keys.

Vhost-user implementation supports up to 52B RSS key, in
order to match with the maximum supported by physical
NICs (Intel E810). Idea is that it could be used for
application like Virtio-forwarder, by programming the
Virtio device RSS key into the physical NIC and let the
physical NIC do the packets distribution.

DPDK Vhost-user backend PoC implementing the new requests
can be found here [0], it only implements the messages
handling, it does not perform any RSS for now.

[0]: https://gitlab.com/mcoquelin/dpdk-next-virtio/-/commits/vhost_user_rss_poc/



Not directly related to this series. I wonder if vhost-user consider to 
support control virtqueue then all the RSS stuffs could be done at 
vhost-user backend without introducing new commands.


Thanks




Maxime Coquelin (5):
   ebpf: pass and check RSS key length to the loader
   virtio-net: prepare for variable RSS key and indir table lengths
   virtio-net: add RSS support for Vhost backends
   docs: introduce RSS support in Vhost-user specification
   vhost-user: add RSS support

  docs/interop/vhost-user.rst   |  57 
  ebpf/ebpf_rss-stub.c  |   3 +-
  ebpf/ebpf_rss.c   |  17 ++--
  ebpf/ebpf_rss.h   |   3 +-
  hw/net/vhost_net-stub.c   |  10 ++
  hw/net/vhost_net.c|  22 +
  hw/net/virtio-net.c   |  87 +-
  hw/virtio/vhost-user.c| 146 +-
  include/hw/virtio/vhost-backend.h |   7 ++
  include/hw/virtio/virtio-net.h|  16 +++-
  include/migration/vmstate.h   |  10 ++
  include/net/vhost_net.h   |   4 +
  12 files changed, 344 insertions(+), 38 deletions(-)

Re: [PATCH 2/5] virtio-net: prepare for variable RSS key and indir table lengths

2022-04-14 Thread Jason Wang




在 2022/4/8 20:28, Maxime Coquelin 写道:

This patch is a preliminary rework to support RSS with
Vhost-user backends. It enables supporting different types
of hashes, key lengths and indirection table lengths.

This patch does not introduces behavioral changes.

Signed-off-by: Maxime Coquelin 
---
  ebpf/ebpf_rss.c|  8 
  hw/net/virtio-net.c| 35 +-
  include/hw/virtio/virtio-net.h | 16 +---
  include/migration/vmstate.h| 10 ++
  4 files changed, 53 insertions(+), 16 deletions(-)

diff --git a/ebpf/ebpf_rss.c b/ebpf/ebpf_rss.c
index 4a63854175..f03be5f919 100644
--- a/ebpf/ebpf_rss.c
+++ b/ebpf/ebpf_rss.c
@@ -96,7 +96,7 @@ static bool ebpf_rss_set_indirections_table(struct 
EBPFRSSContext *ctx,
  uint32_t i = 0;
  
  if (!ebpf_rss_is_loaded(ctx) || indirections_table == NULL ||

-   len > VIRTIO_NET_RSS_MAX_TABLE_LEN) {
+   len > VIRTIO_NET_RSS_DEFAULT_TABLE_LEN) {
  return false;
  }
  
@@ -116,13 +116,13 @@ static bool ebpf_rss_set_toepliz_key(struct EBPFRSSContext *ctx,

  uint32_t map_key = 0;
  
  /* prepare toeplitz key */

-uint8_t toe[VIRTIO_NET_RSS_MAX_KEY_SIZE] = {};
+uint8_t toe[VIRTIO_NET_RSS_DEFAULT_KEY_SIZE] = {};
  
  if (!ebpf_rss_is_loaded(ctx) || toeplitz_key == NULL ||

-len != VIRTIO_NET_RSS_MAX_KEY_SIZE) {
+len != VIRTIO_NET_RSS_DEFAULT_KEY_SIZE) {
  return false;
  }
-memcpy(toe, toeplitz_key, VIRTIO_NET_RSS_MAX_KEY_SIZE);
+memcpy(toe, toeplitz_key, VIRTIO_NET_RSS_DEFAULT_KEY_SIZE);
  *(uint32_t *)toe = ntohl(*(uint32_t *)toe);
  
  if (bpf_map_update_elem(ctx->map_toeplitz_key, _key, toe,

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 73145d6390..38436e472b 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -137,12 +137,11 @@ static void virtio_net_get_config(VirtIODevice *vdev, 
uint8_t *config)
  memcpy(netcfg.mac, n->mac, ETH_ALEN);
  virtio_stl_p(vdev, , n->net_conf.speed);
  netcfg.duplex = n->net_conf.duplex;
-netcfg.rss_max_key_size = VIRTIO_NET_RSS_MAX_KEY_SIZE;
+netcfg.rss_max_key_size = n->rss_capa.max_key_size;
  virtio_stw_p(vdev, _max_indirection_table_length,
- virtio_host_has_feature(vdev, VIRTIO_NET_F_RSS) ?
- VIRTIO_NET_RSS_MAX_TABLE_LEN : 1);
+ n->rss_capa.max_indirection_len);
  virtio_stl_p(vdev, _hash_types,
- VIRTIO_NET_RSS_SUPPORTED_HASHES);
+ n->rss_capa.supported_hashes);
  memcpy(config, , n->config_size);
  
  /*

@@ -1202,7 +1201,7 @@ static bool virtio_net_attach_epbf_rss(VirtIONet *n)
  
  if (!ebpf_rss_set_all(>ebpf_rss, ,

n->rss_data.indirections_table, n->rss_data.key,
-  VIRTIO_NET_RSS_MAX_KEY_SIZE)) {
+  n->rss_data.key_len)) {
  return false;
  }
  
@@ -1277,7 +1276,7 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,

  err_value = n->rss_data.indirections_len;
  goto error;
  }
-if (n->rss_data.indirections_len > VIRTIO_NET_RSS_MAX_TABLE_LEN) {
+if (n->rss_data.indirections_len > n->rss_capa.max_indirection_len) {
  err_msg = "Too large indirection table";
  err_value = n->rss_data.indirections_len;
  goto error;
@@ -1323,7 +1322,7 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
  err_value = queue_pairs;
  goto error;
  }
-if (temp.b > VIRTIO_NET_RSS_MAX_KEY_SIZE) {
+if (temp.b > n->rss_capa.max_key_size) {
  err_msg = "Invalid key size";
  err_value = temp.b;
  goto error;
@@ -1339,6 +1338,14 @@ static uint16_t virtio_net_handle_rss(VirtIONet *n,
  }
  offset += size_get;
  size_get = temp.b;
+n->rss_data.key_len = temp.b;
+g_free(n->rss_data.key);
+n->rss_data.key = g_malloc(size_get);
+if (!n->rss_data.key) {
+err_msg = "Can't allocate key";
+err_value = n->rss_data.key_len;
+goto error;
+}
  s = iov_to_buf(iov, iov_cnt, offset, n->rss_data.key, size_get);
  if (s != size_get) {
  err_msg = "Can get key buffer";
@@ -3093,8 +3100,9 @@ static const VMStateDescription vmstate_virtio_net_rss = {
  VMSTATE_UINT32(rss_data.hash_types, VirtIONet),
  VMSTATE_UINT16(rss_data.indirections_len, VirtIONet),
  VMSTATE_UINT16(rss_data.default_queue, VirtIONet),
-VMSTATE_UINT8_ARRAY(rss_data.key, VirtIONet,
-VIRTIO_NET_RSS_MAX_KEY_SIZE),
+VMSTATE_VARRAY_UINT8_ALLOC(rss_data.key, VirtIONet,
+   rss_data.key_len, 0,
+   vmstate_info_uint8, uint8_t),



I wonder if we may break the migration compatibility here.

Thanks



  VMSTATE_VARRAY_UINT16_ALLOC(rss_data.indirections_table, VirtIONet,

Re: [PATCH] hw/riscv: virt: fix DT property mmu-type when CPU mmu option is disabled

2022-04-14 Thread Frank Chang

On Thu, Apr 14, 2022 at 11:57 PM Niklas Cassel via 
wrote:

> The device tree property "mmu-type" is currently exported as either
> "riscv,sv32" or "riscv,sv48".
>
> However, the riscv cpu device tree binding [1] has a specific value
> "riscv,none" for a HART without a MMU.
>
> Set the device tree property "mmu-type" to "riscv,none" when the CPU mmu
> option is disabled using rv32,mmu=off or rv64,mmu=off.
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/riscv/cpus.yaml?h=v5.17
>
> Signed-off-by: Niklas Cassel 
> ---
>  hw/riscv/virt.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
> index da50cbed43..3be6be9ad3 100644
> --- a/hw/riscv/virt.c
> +++ b/hw/riscv/virt.c
> @@ -230,8 +230,14 @@ static void create_fdt_socket_cpus(RISCVVirtState *s,
> int socket,
>  cpu_name = g_strdup_printf("/cpus/cpu@%d",
>  s->soc[socket].hartid_base + cpu);
>  qemu_fdt_add_subnode(mc->fdt, cpu_name);
> -qemu_fdt_setprop_string(mc->fdt, cpu_name, "mmu-type",
> -(is_32_bit) ? "riscv,sv32" : "riscv,sv48");
> +if (riscv_feature(>soc[socket].harts[cpu].env,
> +  RISCV_FEATURE_MMU)) {
> +qemu_fdt_setprop_string(mc->fdt, cpu_name, "mmu-type",
> +(is_32_bit) ? "riscv,sv32" :
> "riscv,sv48");
> +} else {
> +qemu_fdt_setprop_string(mc->fdt, cpu_name, "mmu-type",
> +"riscv,none");
> +}
>  name = riscv_isa_string(>soc[socket].harts[cpu]);
>  qemu_fdt_setprop_string(mc->fdt, cpu_name, "riscv,isa", name);
>  g_free(name);
> --
> 2.35.1
>
>
>
Reviewed-by: Frank Chang

Re: [PATCH] hw/riscv: virt: fix DT property mmu-type when CPU mmu option is disabled

2022-04-14 Thread Bin Meng

On Thu, Apr 14, 2022 at 11:55 PM Niklas Cassel via
 wrote:
>
> The device tree property "mmu-type" is currently exported as either
> "riscv,sv32" or "riscv,sv48".
>
> However, the riscv cpu device tree binding [1] has a specific value
> "riscv,none" for a HART without a MMU.
>
> Set the device tree property "mmu-type" to "riscv,none" when the CPU mmu
> option is disabled using rv32,mmu=off or rv64,mmu=off.
>
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/riscv/cpus.yaml?h=v5.17
>
> Signed-off-by: Niklas Cassel 
> ---
>  hw/riscv/virt.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
>

Reviewed-by: Bin Meng

Re:

2022-04-14 Thread Alex Bennée



Eric Auger  writes:

> Hi Alex,
>
> On 4/7/22 5:00 PM, Alex Bennée wrote:
>> When trying to work out what the virtio-net-tests where doing it was
>> hard because the g_test_trap_subprocess redirects all output to
>> /dev/null. Lift this restriction by using the appropriate flags so you
>> can see something similar to what the vhost-user-blk tests show when
>> running.
>> 
>> While we are at it remove the g_test_verbose() check so we always show
>> how the QEMU is run.
>> 
>> Signed-off-by: Alex Bennée 
>> ---
>>  tests/qtest/qos-test.c | 7 +++
>>  1 file changed, 3 insertions(+), 4 deletions(-)
>> 
>> diff --git a/tests/qtest/qos-test.c b/tests/qtest/qos-test.c
>> index f97d0a08fd..c6c196cc95 100644
>> --- a/tests/qtest/qos-test.c
>> +++ b/tests/qtest/qos-test.c
>> @@ -89,9 +89,7 @@ static void qos_set_machines_devices_available(void)
>>  
>>  static void restart_qemu_or_continue(char *path)
>>  {
>> -if (g_test_verbose()) {
>> -qos_printf("Run QEMU with: '%s'\n", path);
>> -}
>> +qos_printf("Run QEMU with: '%s'\n", path);
>>  /* compares the current command line with the
>>   * one previously executed: if they are the same,
>>   * don't restart QEMU, if they differ, stop previous
>> @@ -185,7 +183,8 @@ static void run_one_test(const void *arg)
>>  static void subprocess_run_one_test(const void *arg)
>>  {
>>  const gchar *path = arg;
>> -g_test_trap_subprocess(path, 0, 0);
>> +g_test_trap_subprocess(path, 0,
>> +   G_TEST_SUBPROCESS_INHERIT_STDOUT | 
>> G_TEST_SUBPROCESS_INHERIT_STDERR);
> While workling on libqos/pci tests on aarch64 I also did that but I
> noticed there were a bunch of errors such as:
>
> /aarch64/virt/generic-pcihost/pci-bus-generic/pci-bus/virtio-net-pci/virtio-net/virtio-net-tests/vhost-user/multiqueue:
> qemu-system-aarch64: Failed to set msg fds.
> qemu-system-aarch64: vhost VQ 0 ring restore failed: -22: Invalid
> argument (22)
> qemu-system-aarch64: Failed to set msg fds.
> qemu-system-aarch64: vhost VQ 1 ring restore failed: -22: Invalid
> argument (22)
> qemu-system-aarch64: Failed to set msg fds.
> qemu-system-aarch64: vhost VQ 2 ring restore failed: -22: Invalid
> argument (22)
> qemu-system-aarch64: Failed to set msg fds.
> qemu-system-aarch64: vhost VQ 3 ring restore failed: -22: Invalid
> argument (22)
>
> I see those also when running with x86_64-softmmu/qemu-system-x86_64
> (this is no aarch64 specific).

I think this is a symptom of an unclean tear down (which might be too
much to ask for our fake vhost backend) or something we should handle
better. I still have to get my gpio test working so I'll have a look
tomorrow.


>
> I don't know if it is an issue to get those additional errors?
>
> Thanks
>
> Eric
>
>>  g_test_trap_assert_passed();
>>  }
>>  
>> 


-- 
Alex Bennée

RE: [RFC PATCH 0/4] 9pfs: Add 9pfs support for Windows host

2022-04-14 Thread Shi, Guohuai



> -Original Message-
> From: Christian Schoenebeck 
> Sent: 2022年4月14日 19:24
> To: qemu-devel@nongnu.org; Shi, Guohuai 
> Cc: Bin Meng ; Greg Kurz 
> Subject: Re: [RFC PATCH 0/4] 9pfs: Add 9pfs support for Windows host
> 
> [Please note: This e-mail is from an EXTERNAL e-mail address]
> 
> On Mittwoch, 13. April 2022 05:30:57 CEST Shi, Guohuai wrote:
> > > We have 3 fs drivers: local, synth, proxy. I don't mind about proxy,
> > > it is in  bad shape and we will probably deprecate it in near future 
> > > anyway.
> > > But it would be good to have support for the synth driver, because
> > > we are using it for running test cases and fuzzing tests (QA).
> >
> > synth driver can not be built on Windows platform (or cross build on
> > Linux). So the test cases can not work on Windows.
> 
> Could you please be more specific what kind of challenge you see for making 
> the
> synth driver working on Windows? The synth driver is just a simple mockup 
> driver
> [1] that simulates in-RAM-only a filesystem with a bunch of hard coded dirs 
> and
> files, solely for the purpose to run test cases. So the synth driver does not 
> interact
> with any real filesystem on host at all. My expectation therefore would be 
> that
> it just needs to tweak some header includes and maybe declaring missing POSIX 
> data
> types, which you have done for the local driver already anyway.
> 

For 9p-synth:

I had enabled 9p-synth.c and built it successfully on Windows platform.
However, test cases code are not built on Windows host.
So I think it is useless that enable synth on Windows host (no way to run it).

> BTW support for macOS hosts has just been recently added for 9p, I know it is
> different as its a POSIX OS, but maybe you might still find the diff [2] 
> helpful.
> 
> [1] https://wiki.qemu.org/Documentation/9p#9p_Filesystem_Drivers
> [2] https://gitlab.com/qemu-project/qemu/-/commit/f45cc81911adc772
> 
> > > What are the limitations against security_model=mapped on Windows?
> > > Keep in  mind that with security_model=none you are very limited in
> > > what you can do with 9p.
> >
> > MSYS library does not support extend attribute (e.g. getxattr), And
> > does not support POSIX permission APIs (e.g. chmod, chown).
> > Security model is useless on Windows host.
> 
> That would be security_model=passthrough, yes, that's not possible with msys.
> The recommended way in practice though is using security_model=mapped [3] for 
> all
> systems, which should be possible to achieve with msys as well ...
> 
> [3] https://wiki.qemu.org/Documentation/9psetup#Starting_the_Guest_directly
> 
> > It is possible that to "map" extend attribute to NTFS stream data.
> > However, if Windows host media is not NTFS (e.g. FAT) which does not
> > support stream data, then the "map" can not work.
> 
> ... yes exactly, it would make sense to use ADS [4] instead of xattr on 
> Windows.
> ADS are available with NTFS and ReFS and maybe also with exFAT nowadays (?), 
> not
> sure about the latter though. But I think it is fair enough to assume Windows 
> users
> to either use NTFS or ReFS. And if they don't, you can still call 
> error_report_once()
> to make user aware that
> seucrity_model=mapped(-xattr) requires a fileystem on Windows that supports 
> ADS.
> 
> [4] https://en.wikipedia.org/wiki/NTFS#Alternate_data_stream_(ADS)
> 

Windows does not support POSIX permission. 
So I think that only allow user to use security_model=none is reasonable on 
Windows host.

There is a difficulty to support "mapped" or "mapped-file" on Windows host:
There are many functions in 9p-code using APIs like "openat", "mkdirat", etc.
MSYS does not support that (openat is not valid on Windows host).
I remember that 9p replaced "open" by "openat" for a long time.
To fully support "security_model=mapped", 9p for Windows need to replace 
"openat" by "open".
This may impact too many functions.

I would have a try to enable "mapped" by using ADS, but it looks like a big 
refactor for 9p-local.c

Best Regards,
Guohuai

Re: [PATCH v2 for-7.1 9/9] nbd: document what is protected by the CoMutexes

2022-04-14 Thread Eric Blake

On Thu, Apr 14, 2022 at 07:57:56PM +0200, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2 for-7.1 8/9] nbd: take receive_mutex when reading requests[].receiving

2022-04-14 Thread Eric Blake

On Thu, Apr 14, 2022 at 07:57:55PM +0200, Paolo Bonzini wrote:
> requests[].receiving is set by nbd_receive_replies() under the receive_mutex;
> Read it under the same mutex as well.  Waking up receivers on errors happens
> after each reply finishes processing, in nbd_co_receive_one_chunk().
> If there is no currently-active reply, there are two cases:
> 
> * either there is no active request at all, in which case no
> element of request[] can have .receiving = true
> 
> * or nbd_receive_replies() must be running and owns receive_mutex;
> in that case it will get back to nbd_co_receive_one_chunk() because
> the socket has been shutdown, and all waiting coroutines will wake up
> in turn.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2 for-7.1 7/9] nbd: move s->state under requests_lock

2022-04-14 Thread Eric Blake

On Thu, Apr 14, 2022 at 07:57:54PM +0200, Paolo Bonzini wrote:
> Remove the confusing, and most likely wrong, atomics.  The only function
> that used to be somewhat in a hot path was nbd_client_connected(),
> but it is not anymore after the previous patches.
> 
> The same logic is used both to check if a request had to be reissued
> and also in nbd_reconnecting_attempt().  The former cases are outside
> requests_lock, while nbd_reconnecting_attempt() does have the lock,
> therefore the two have been separated in the previous commit.
> nbd_client_will_reconnect() can simply take s->requests_lock, while
> nbd_reconnecting_attempt() can inline the access now that no
> complicated atomics are involved.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd.c | 78 -
>  1 file changed, 41 insertions(+), 37 deletions(-)
>

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2 for-7.1 6/9] nbd: code motion and function renaming

2022-04-14 Thread Eric Blake

On Thu, Apr 14, 2022 at 07:57:53PM +0200, Paolo Bonzini wrote:
> Prepare for the next patch, so that the diff is less confusing.
> 
> nbd_client_connecting is moved closer to the definition point.
> 
> nbd_client_connecting_wait() is kept only for the reconnection
> logic; when it is used to check if a request has to be reissued,
> use the renamed function nbd_client_will_reconnect().  In the
> next patch, the two cases will have different locking requirements.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd.c | 24 ++--
>  1 file changed, 14 insertions(+), 10 deletions(-)

Reviewed-by: Eric Blake 

Yes, this split makes the next patch easier to read ;)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2 for-7.1 5/9] nbd: use a QemuMutex to synchronize yanking, reconnection and coroutines

2022-04-14 Thread Eric Blake

On Thu, Apr 14, 2022 at 07:57:52PM +0200, Paolo Bonzini wrote:
> The condition for waiting on the s->free_sema queue depends on
> both s->in_flight and s->state.  The latter is currently using
> atomics, but this is quite dubious and probably wrong.
> 
> Because s->state is written in the main thread too, for example by
> the yank callback, it cannot be protected by a CoMutex.  Introduce a
> separate lock that can be used by nbd_co_send_request(); later on this
> lock will also be used for s->state.  There will not be any contention
> on the lock unless there is a yank or reconnect, so this is not
> performance sensitive.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd.c | 46 +++---
>  1 file changed, 27 insertions(+), 19 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2 for-7.1 4/9] nbd: keep send_mutex/free_sema handling outside nbd_co_do_establish_connection

2022-04-14 Thread Eric Blake

On Thu, Apr 14, 2022 at 07:57:51PM +0200, Paolo Bonzini wrote:
> Elevate s->in_flight early so that other incoming requests will wait
> on the CoQueue in nbd_co_send_request; restart them after getting back
> from nbd_reconnect_attempt.  This could be after the reconnect timer or
> nbd_cancel_in_flight have cancelled the attempt, so there is no
> need anymore to cancel the requests there.
> 
> nbd_co_send_request now handles both stopping and restarting pending
> requests after a successful connection, and there is no need to
> hold send_mutex in nbd_co_do_establish_connection.  The current setup
> is confusing because nbd_co_do_establish_connection is called both with
> send_mutex taken and without it.  Before the patch it uses free_sema which
> (at least in theory...) is protected by send_mutex, after the patch it
> does not anymore.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/coroutines.h |  4 +--
>  block/nbd.c| 66 ++
>  2 files changed, 33 insertions(+), 37 deletions(-)
> 
> diff --git a/block/coroutines.h b/block/coroutines.h
> index b293e943c8..8f6e438ef3 100644
> --- a/block/coroutines.h
> +++ b/block/coroutines.h
> @@ -59,7 +59,7 @@ int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState 
> *bs,
>  QEMUIOVector *qiov, int64_t pos);
>  
>  int coroutine_fn
> -nbd_co_do_establish_connection(BlockDriverState *bs, Error **errp);
> +nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking, Error 
> **errp);

Long line; probably worth wrapping.  But that's cosmetic; I could do
it while applying to my tree.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2 for-7.1 1/9] nbd: safeguard against waking up invalid coroutine

2022-04-14 Thread Eric Blake

On Thu, Apr 14, 2022 at 07:57:48PM +0200, Paolo Bonzini wrote:
> The .reply_possible field of s->requests is never set to false.  This is
> not a problem as it is only a safeguard to detect protocol errors,
> but it's sloppy.  In fact, the field is actually not necessary at all,
> because .coroutine is set to NULL in NBD_FOREACH_REPLY_CHUNK after
> receiving the last chunk.  Thus, replace .reply_possible with .coroutine
> and move the check before deciding the fate of this request.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/nbd.c | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)

Ah, indeed nicer than the v1 approach.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[PATCH v2 for-7.1 7/9] nbd: move s->state under requests_lock

2022-04-14 Thread Paolo Bonzini

Remove the confusing, and most likely wrong, atomics.  The only function
that used to be somewhat in a hot path was nbd_client_connected(),
but it is not anymore after the previous patches.

The same logic is used both to check if a request had to be reissued
and also in nbd_reconnecting_attempt().  The former cases are outside
requests_lock, while nbd_reconnecting_attempt() does have the lock,
therefore the two have been separated in the previous commit.
nbd_client_will_reconnect() can simply take s->requests_lock, while
nbd_reconnecting_attempt() can inline the access now that no
complicated atomics are involved.

Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 78 -
 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 37d466e435..b5aac2548c 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -35,7 +35,6 @@
 #include "qemu/option.h"
 #include "qemu/cutils.h"
 #include "qemu/main-loop.h"
-#include "qemu/atomic.h"
 
 #include "qapi/qapi-visit-sockets.h"
 #include "qapi/qmp/qstring.h"
@@ -72,10 +71,11 @@ typedef struct BDRVNBDState {
 NBDExportInfo info;
 
 /*
- * Protects free_sema, in_flight, requests[].coroutine,
+ * Protects state, free_sema, in_flight, requests[].coroutine,
  * reconnect_delay_timer.
  */
 QemuMutex requests_lock;
+NBDClientState state;
 CoQueue free_sema;
 int in_flight;
 NBDClientRequest requests[MAX_NBD_REQUESTS];
@@ -83,7 +83,6 @@ typedef struct BDRVNBDState {
 
 CoMutex send_mutex;
 CoMutex receive_mutex;
-NBDClientState state;
 
 QEMUTimer *open_timer;
 
@@ -132,11 +131,6 @@ static void nbd_clear_bdrvstate(BlockDriverState *bs)
 s->x_dirty_bitmap = NULL;
 }
 
-static bool nbd_client_connected(BDRVNBDState *s)
-{
-return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTED;
-}
-
 static bool coroutine_fn nbd_recv_coroutine_wake_one(NBDClientRequest *req)
 {
 if (req->receiving) {
@@ -159,14 +153,15 @@ static void coroutine_fn 
nbd_recv_coroutines_wake(BDRVNBDState *s, bool all)
 }
 }
 
-static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
+/* Called with s->requests_lock held.  */
+static void coroutine_fn nbd_channel_error_locked(BDRVNBDState *s, int ret)
 {
-if (nbd_client_connected(s)) {
+if (s->state == NBD_CLIENT_CONNECTED) {
 qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
 }
 
 if (ret == -EIO) {
-if (nbd_client_connected(s)) {
+if (s->state == NBD_CLIENT_CONNECTED) {
 s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
 NBD_CLIENT_CONNECTING_NOWAIT;
 }
@@ -177,6 +172,12 @@ static void coroutine_fn nbd_channel_error(BDRVNBDState 
*s, int ret)
 nbd_recv_coroutines_wake(s, true);
 }
 
+static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
+{
+QEMU_LOCK_GUARD(>requests_lock);
+nbd_channel_error_locked(s, ret);
+}
+
 static void reconnect_delay_timer_del(BDRVNBDState *s)
 {
 if (s->reconnect_delay_timer) {
@@ -189,20 +190,18 @@ static void reconnect_delay_timer_cb(void *opaque)
 {
 BDRVNBDState *s = opaque;
 
-if (qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT) {
-s->state = NBD_CLIENT_CONNECTING_NOWAIT;
-nbd_co_establish_connection_cancel(s->conn);
-}
-
 reconnect_delay_timer_del(s);
+WITH_QEMU_LOCK_GUARD(>requests_lock) {
+if (s->state != NBD_CLIENT_CONNECTING_WAIT) {
+return;
+}
+s->state = NBD_CLIENT_CONNECTING_NOWAIT;
+}
+nbd_co_establish_connection_cancel(s->conn);
 }
 
 static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t 
expire_time_ns)
 {
-if (qatomic_load_acquire(>state) != NBD_CLIENT_CONNECTING_WAIT) {
-return;
-}
-
 assert(!s->reconnect_delay_timer);
 s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
  QEMU_CLOCK_REALTIME,
@@ -225,7 +224,9 @@ static void nbd_teardown_connection(BlockDriverState *bs)
 s->ioc = NULL;
 }
 
-s->state = NBD_CLIENT_QUIT;
+WITH_QEMU_LOCK_GUARD(>requests_lock) {
+s->state = NBD_CLIENT_QUIT;
+}
 }
 
 static void open_timer_del(BDRVNBDState *s)
@@ -254,15 +255,15 @@ static void open_timer_init(BDRVNBDState *s, uint64_t 
expire_time_ns)
 timer_mod(s->open_timer, expire_time_ns);
 }
 
-static bool nbd_client_connecting_wait(BDRVNBDState *s)
-{
-return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
-}
-
 static bool nbd_client_will_reconnect(BDRVNBDState *s)
 {
-return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
+/*
+ * Called only after a socket error, so this is not performance sensitive.
+ */
+QEMU_LOCK_GUARD(>requests_lock);
+return s->state == NBD_CLIENT_CONNECTING_WAIT;
 }
+
 /*
  * Update @bs with information learned

[PATCH v2 for-7.1 8/9] nbd: take receive_mutex when reading requests[].receiving

2022-04-14 Thread Paolo Bonzini

requests[].receiving is set by nbd_receive_replies() under the receive_mutex;
Read it under the same mutex as well.  Waking up receivers on errors happens
after each reply finishes processing, in nbd_co_receive_one_chunk().
If there is no currently-active reply, there are two cases:

* either there is no active request at all, in which case no
element of request[] can have .receiving = true

* or nbd_receive_replies() must be running and owns receive_mutex;
in that case it will get back to nbd_co_receive_one_chunk() because
the socket has been shutdown, and all waiting coroutines will wake up
in turn.

Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index b5aac2548c..31c684772e 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -131,6 +131,7 @@ static void nbd_clear_bdrvstate(BlockDriverState *bs)
 s->x_dirty_bitmap = NULL;
 }
 
+/* Called with s->receive_mutex taken.  */
 static bool coroutine_fn nbd_recv_coroutine_wake_one(NBDClientRequest *req)
 {
 if (req->receiving) {
@@ -142,12 +143,13 @@ static bool coroutine_fn 
nbd_recv_coroutine_wake_one(NBDClientRequest *req)
 return false;
 }
 
-static void coroutine_fn nbd_recv_coroutines_wake(BDRVNBDState *s, bool all)
+static void coroutine_fn nbd_recv_coroutines_wake(BDRVNBDState *s)
 {
 int i;
 
+QEMU_LOCK_GUARD(>receive_mutex);
 for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-if (nbd_recv_coroutine_wake_one(>requests[i]) && !all) {
+if (nbd_recv_coroutine_wake_one(>requests[i])) {
 return;
 }
 }
@@ -168,8 +170,6 @@ static void coroutine_fn 
nbd_channel_error_locked(BDRVNBDState *s, int ret)
 } else {
 s->state = NBD_CLIENT_QUIT;
 }
-
-nbd_recv_coroutines_wake(s, true);
 }
 
 static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
@@ -432,11 +432,10 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t handle)
 
 qemu_coroutine_yield();
 /*
- * We may be woken for 3 reasons:
+ * We may be woken for 2 reasons:
  * 1. From this function, executing in parallel coroutine, when our
  *handle is received.
- * 2. From nbd_channel_error(), when connection is lost.
- * 3. From nbd_co_receive_one_chunk(), when previous request is
+ * 2. From nbd_co_receive_one_chunk(), when previous request is
  *finished and s->reply.handle set to 0.
  * Anyway, it's OK to lock the mutex and go to the next iteration.
  */
@@ -928,7 +927,7 @@ static coroutine_fn int nbd_co_receive_one_chunk(
 }
 s->reply.handle = 0;
 
-nbd_recv_coroutines_wake(s, false);
+nbd_recv_coroutines_wake(s);
 
 return ret;
 }
-- 
2.35.1

[PATCH v2 for-7.1 6/9] nbd: code motion and function renaming

2022-04-14 Thread Paolo Bonzini

Prepare for the next patch, so that the diff is less confusing.

nbd_client_connecting is moved closer to the definition point.

nbd_client_connecting_wait() is kept only for the reconnection
logic; when it is used to check if a request has to be reissued,
use the renamed function nbd_client_will_reconnect().  In the
next patch, the two cases will have different locking requirements.

Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index a2414566d1..37d466e435 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -254,18 +254,15 @@ static void open_timer_init(BDRVNBDState *s, uint64_t 
expire_time_ns)
 timer_mod(s->open_timer, expire_time_ns);
 }
 
-static bool nbd_client_connecting(BDRVNBDState *s)
-{
-NBDClientState state = qatomic_load_acquire(>state);
-return state == NBD_CLIENT_CONNECTING_WAIT ||
-state == NBD_CLIENT_CONNECTING_NOWAIT;
-}
-
 static bool nbd_client_connecting_wait(BDRVNBDState *s)
 {
 return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
 }
 
+static bool nbd_client_will_reconnect(BDRVNBDState *s)
+{
+return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT;
+}
 /*
  * Update @bs with information learned during a completed negotiation process.
  * Return failure if the server's advertised options are incompatible with the
@@ -355,6 +352,13 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,
 return 0;
 }
 
+static bool nbd_client_connecting(BDRVNBDState *s)
+{
+NBDClientState state = qatomic_load_acquire(>state);
+return state == NBD_CLIENT_CONNECTING_WAIT ||
+state == NBD_CLIENT_CONNECTING_NOWAIT;
+}
+
 /* Called with s->requests_lock taken.  */
 static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
 {
@@ -1190,7 +1194,7 @@ static int coroutine_fn nbd_co_request(BlockDriverState 
*bs, NBDRequest *request
 error_free(local_err);
 local_err = NULL;
 }
-} while (ret < 0 && nbd_client_connecting_wait(s));
+} while (ret < 0 && nbd_client_will_reconnect(s));
 
 return ret ? ret : request_ret;
 }
@@ -1249,7 +1253,7 @@ static int coroutine_fn 
nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
 error_free(local_err);
 local_err = NULL;
 }
-} while (ret < 0 && nbd_client_connecting_wait(s));
+} while (ret < 0 && nbd_client_will_reconnect(s));
 
 return ret ? ret : request_ret;
 }
@@ -1407,7 +1411,7 @@ static int coroutine_fn nbd_client_co_block_status(
 error_free(local_err);
 local_err = NULL;
 }
-} while (ret < 0 && nbd_client_connecting_wait(s));
+} while (ret < 0 && nbd_client_will_reconnect(s));
 
 if (ret < 0 || request_ret < 0) {
 return ret ? ret : request_ret;
-- 
2.35.1

[PATCH v2 for-7.1 5/9] nbd: use a QemuMutex to synchronize yanking, reconnection and coroutines

2022-04-14 Thread Paolo Bonzini

The condition for waiting on the s->free_sema queue depends on
both s->in_flight and s->state.  The latter is currently using
atomics, but this is quite dubious and probably wrong.

Because s->state is written in the main thread too, for example by
the yank callback, it cannot be protected by a CoMutex.  Introduce a
separate lock that can be used by nbd_co_send_request(); later on this
lock will also be used for s->state.  There will not be any contention
on the lock unless there is a yank or reconnect, so this is not
performance sensitive.

Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 46 +++---
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 62dd338ef3..a2414566d1 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -71,17 +71,22 @@ typedef struct BDRVNBDState {
 QIOChannel *ioc; /* The current I/O channel */
 NBDExportInfo info;
 
-CoMutex send_mutex;
+/*
+ * Protects free_sema, in_flight, requests[].coroutine,
+ * reconnect_delay_timer.
+ */
+QemuMutex requests_lock;
 CoQueue free_sema;
-
-CoMutex receive_mutex;
 int in_flight;
+NBDClientRequest requests[MAX_NBD_REQUESTS];
+QEMUTimer *reconnect_delay_timer;
+
+CoMutex send_mutex;
+CoMutex receive_mutex;
 NBDClientState state;
 
-QEMUTimer *reconnect_delay_timer;
 QEMUTimer *open_timer;
 
-NBDClientRequest requests[MAX_NBD_REQUESTS];
 NBDReply reply;
 BlockDriverState *bs;
 
@@ -350,7 +355,7 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,
 return 0;
 }
 
-/* called under s->send_mutex */
+/* Called with s->requests_lock taken.  */
 static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
 {
 bool blocking = nbd_client_connecting_wait(s);
@@ -382,9 +387,9 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState 
*s)
 s->ioc = NULL;
 }
 
-qemu_co_mutex_unlock(>send_mutex);
+qemu_mutex_unlock(>requests_lock);
 nbd_co_do_establish_connection(s->bs, blocking, NULL);
-qemu_co_mutex_lock(>send_mutex);
+qemu_mutex_lock(>requests_lock);
 
 /*
  * The reconnect attempt is done (maybe successfully, maybe not), so
@@ -466,11 +471,10 @@ static int coroutine_fn 
nbd_co_send_request(BlockDriverState *bs,
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 int rc, i = -1;
 
-qemu_co_mutex_lock(>send_mutex);
-
+qemu_mutex_lock(>requests_lock);
 while (s->in_flight == MAX_NBD_REQUESTS ||
(!nbd_client_connected(s) && s->in_flight > 0)) {
-qemu_co_queue_wait(>free_sema, >send_mutex);
+qemu_co_queue_wait(>free_sema, >requests_lock);
 }
 
 s->in_flight++;
@@ -491,13 +495,13 @@ static int coroutine_fn 
nbd_co_send_request(BlockDriverState *bs,
 }
 }
 
-g_assert(qemu_in_coroutine());
 assert(i < MAX_NBD_REQUESTS);
-
 s->requests[i].coroutine = qemu_coroutine_self();
 s->requests[i].offset = request->from;
 s->requests[i].receiving = false;
+qemu_mutex_unlock(>requests_lock);
 
+qemu_co_mutex_lock(>send_mutex);
 request->handle = INDEX_TO_HANDLE(s, i);
 
 assert(s->ioc);
@@ -517,17 +521,19 @@ static int coroutine_fn 
nbd_co_send_request(BlockDriverState *bs,
 } else {
 rc = nbd_send_request(s->ioc, request);
 }
+qemu_co_mutex_unlock(>send_mutex);
 
-err:
 if (rc < 0) {
+qemu_mutex_lock(>requests_lock);
+err:
 nbd_channel_error(s, rc);
 if (i != -1) {
 s->requests[i].coroutine = NULL;
 }
 s->in_flight--;
 qemu_co_queue_next(>free_sema);
+qemu_mutex_unlock(>requests_lock);
 }
-qemu_co_mutex_unlock(>send_mutex);
 return rc;
 }
 
@@ -1017,12 +1023,11 @@ static bool nbd_reply_chunk_iter_receive(BDRVNBDState 
*s,
 return true;
 
 break_loop:
+qemu_mutex_lock(>requests_lock);
 s->requests[HANDLE_TO_INDEX(s, handle)].coroutine = NULL;
-
-qemu_co_mutex_lock(>send_mutex);
 s->in_flight--;
 qemu_co_queue_next(>free_sema);
-qemu_co_mutex_unlock(>send_mutex);
+qemu_mutex_unlock(>requests_lock);
 
 return false;
 }
@@ -1855,8 +1860,9 @@ static int nbd_open(BlockDriverState *bs, QDict *options, 
int flags,
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
 s->bs = bs;
-qemu_co_mutex_init(>send_mutex);
+qemu_mutex_init(>requests_lock);
 qemu_co_queue_init(>free_sema);
+qemu_co_mutex_init(>send_mutex);
 qemu_co_mutex_init(>receive_mutex);
 
 if (!yank_register_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name), errp)) {
@@ -2044,9 +2050,11 @@ static void nbd_cancel_in_flight(BlockDriverState *bs)
 
 reconnect_delay_timer_del(s);
 
+qemu_mutex_lock(>requests_lock);
 if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
 }
+qemu_mutex_unlock(>requests_lock);
 
 nbd_co_establish_connection_cancel(s->conn);
 }
--

[PATCH v2 for-7.1 9/9] nbd: document what is protected by the CoMutexes

2022-04-14 Thread Paolo Bonzini

Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/nbd.c b/block/nbd.c
index 31c684772e..d0d94b40bd 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -81,12 +81,18 @@ typedef struct BDRVNBDState {
 NBDClientRequest requests[MAX_NBD_REQUESTS];
 QEMUTimer *reconnect_delay_timer;
 
+/* Protects sending data on the socket.  */
 CoMutex send_mutex;
+
+/*
+ * Protects receiving reply headers from the socket, as well as the
+ * fields reply and requests[].receiving
+ */
 CoMutex receive_mutex;
+NBDReply reply;
 
 QEMUTimer *open_timer;
 
-NBDReply reply;
 BlockDriverState *bs;
 
 /* Connection parameters */
-- 
2.35.1

[PATCH v2 for-7.1 1/9] nbd: safeguard against waking up invalid coroutine

2022-04-14 Thread Paolo Bonzini

The .reply_possible field of s->requests is never set to false.  This is
not a problem as it is only a safeguard to detect protocol errors,
but it's sloppy.  In fact, the field is actually not necessary at all,
because .coroutine is set to NULL in NBD_FOREACH_REPLY_CHUNK after
receiving the last chunk.  Thus, replace .reply_possible with .coroutine
and move the check before deciding the fate of this request.

Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 691d4b05dc..d29bee1122 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -58,7 +58,6 @@ typedef struct {
 Coroutine *coroutine;
 uint64_t offset;/* original offset of the request */
 bool receiving; /* sleeping in the yield in nbd_receive_replies */
-bool reply_possible;/* reply header not yet received */
 } NBDClientRequest;
 
 typedef enum NBDClientState {
@@ -454,15 +453,15 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t handle)
 nbd_channel_error(s, -EINVAL);
 return -EINVAL;
 }
+ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
+if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].coroutine) {
+nbd_channel_error(s, -EINVAL);
+return -EINVAL;
+}
 if (s->reply.handle == handle) {
 /* We are done */
 return 0;
 }
-ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
-if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].reply_possible) {
-nbd_channel_error(s, -EINVAL);
-return -EINVAL;
-}
 nbd_recv_coroutine_wake_one(>requests[ind2]);
 }
 }
@@ -505,7 +504,6 @@ static int nbd_co_send_request(BlockDriverState *bs,
 s->requests[i].coroutine = qemu_coroutine_self();
 s->requests[i].offset = request->from;
 s->requests[i].receiving = false;
-s->requests[i].reply_possible = true;
 
 request->handle = INDEX_TO_HANDLE(s, i);
 
-- 
2.35.1

[PATCH v2 for-7.1 3/9] nbd: remove peppering of nbd_client_connected

2022-04-14 Thread Paolo Bonzini

It is unnecessary to check nbd_client_connected() because every time
s->state is moved out of NBD_CLIENT_CONNECTED the socket is shut down
and all coroutines are resumed.

The only case where it was actually needed is when the NBD
server disconnects and there is no reconnect-delay.  In that
case, nbd_receive_replies() does not set s->reply.handle and
nbd_co_do_receive_one_chunk() cannot continue.  For that one case,
check the return value of nbd_receive_replies().

As to the others:

* nbd_receive_replies() can put the current coroutine to sleep if another
reply is ongoing; then it will be woken by nbd_channel_error(), called
by the ongoing reply.  Or it can try itself to read a reply header and
fail, thus calling nbd_channel_error() itself.

* nbd_co_send_request() will write the body of the request and fail

* nbd_reply_chunk_iter_receive() will call nbd_co_receive_one_chunk()
and then nbd_co_do_receive_one_chunk(), which will handle the failure as
above; or it will just detect a previous call to nbd_iter_channel_error()
via iter->ret < 0.

Reviewed-by: Eric Blake 
Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 3cc4ea4430..a30603ce87 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -409,10 +409,6 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t handle)
 return 0;
 }
 
-if (!nbd_client_connected(s)) {
-return -EIO;
-}
-
 if (s->reply.handle != 0) {
 /*
  * Some other request is being handled now. It should already be
@@ -512,7 +508,7 @@ static int coroutine_fn 
nbd_co_send_request(BlockDriverState *bs,
 if (qiov) {
 qio_channel_set_cork(s->ioc, true);
 rc = nbd_send_request(s->ioc, request);
-if (nbd_client_connected(s) && rc >= 0) {
+if (rc >= 0) {
 if (qio_channel_writev_all(s->ioc, qiov->iov, qiov->niov,
NULL) < 0) {
 rc = -EIO;
@@ -829,8 +825,8 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
 }
 *request_ret = 0;
 
-nbd_receive_replies(s, handle);
-if (!nbd_client_connected(s)) {
+ret = nbd_receive_replies(s, handle);
+if (ret < 0) {
 error_setg(errp, "Connection closed");
 return -EIO;
 }
@@ -982,11 +978,6 @@ static bool nbd_reply_chunk_iter_receive(BDRVNBDState *s,
 NBDReply local_reply;
 NBDStructuredReplyChunk *chunk;
 Error *local_err = NULL;
-if (!nbd_client_connected(s)) {
-error_setg(_err, "Connection closed");
-nbd_iter_channel_error(iter, -EIO, _err);
-goto break_loop;
-}
 
 if (iter->done) {
 /* Previous iteration was last. */
@@ -1007,7 +998,7 @@ static bool nbd_reply_chunk_iter_receive(BDRVNBDState *s,
 }
 
 /* Do not execute the body of NBD_FOREACH_REPLY_CHUNK for simple reply. */
-if (nbd_reply_is_simple(reply) || !nbd_client_connected(s)) {
+if (nbd_reply_is_simple(reply) || iter->ret < 0) {
 goto break_loop;
 }
 
-- 
2.35.1

[PATCH v2 for-7.1 4/9] nbd: keep send_mutex/free_sema handling outside nbd_co_do_establish_connection

2022-04-14 Thread Paolo Bonzini

Elevate s->in_flight early so that other incoming requests will wait
on the CoQueue in nbd_co_send_request; restart them after getting back
from nbd_reconnect_attempt.  This could be after the reconnect timer or
nbd_cancel_in_flight have cancelled the attempt, so there is no
need anymore to cancel the requests there.

nbd_co_send_request now handles both stopping and restarting pending
requests after a successful connection, and there is no need to
hold send_mutex in nbd_co_do_establish_connection.  The current setup
is confusing because nbd_co_do_establish_connection is called both with
send_mutex taken and without it.  Before the patch it uses free_sema which
(at least in theory...) is protected by send_mutex, after the patch it
does not anymore.

Signed-off-by: Paolo Bonzini 
---
 block/coroutines.h |  4 +--
 block/nbd.c| 66 ++
 2 files changed, 33 insertions(+), 37 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index b293e943c8..8f6e438ef3 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -59,7 +59,7 @@ int coroutine_fn bdrv_co_writev_vmstate(BlockDriverState *bs,
 QEMUIOVector *qiov, int64_t pos);
 
 int coroutine_fn
-nbd_co_do_establish_connection(BlockDriverState *bs, Error **errp);
+nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking, Error 
**errp);
 
 
 int coroutine_fn
@@ -109,7 +109,7 @@ bdrv_common_block_status_above(BlockDriverState *bs,
BlockDriverState **file,
int *depth);
 int generated_co_wrapper
-nbd_do_establish_connection(BlockDriverState *bs, Error **errp);
+nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
 
 int generated_co_wrapper
 blk_do_preadv(BlockBackend *blk, int64_t offset, int64_t bytes,
diff --git a/block/nbd.c b/block/nbd.c
index a30603ce87..62dd338ef3 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -187,9 +187,6 @@ static void reconnect_delay_timer_cb(void *opaque)
 if (qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTING_WAIT) {
 s->state = NBD_CLIENT_CONNECTING_NOWAIT;
 nbd_co_establish_connection_cancel(s->conn);
-while (qemu_co_enter_next(>free_sema, NULL)) {
-/* Resume all queued requests */
-}
 }
 
 reconnect_delay_timer_del(s);
@@ -310,11 +307,10 @@ static int nbd_handle_updated_info(BlockDriverState *bs, 
Error **errp)
 }
 
 int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
-Error **errp)
+bool blocking, Error **errp)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 int ret;
-bool blocking = nbd_client_connecting_wait(s);
 IO_CODE();
 
 assert(!s->ioc);
@@ -350,7 +346,6 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,
 
 /* successfully connected */
 s->state = NBD_CLIENT_CONNECTED;
-qemu_co_queue_restart_all(>free_sema);
 
 return 0;
 }
@@ -358,25 +353,25 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,
 /* called under s->send_mutex */
 static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
 {
-assert(nbd_client_connecting(s));
-assert(s->in_flight == 0);
-
-if (nbd_client_connecting_wait(s) && s->reconnect_delay &&
-!s->reconnect_delay_timer)
-{
-/*
- * It's first reconnect attempt after switching to
- * NBD_CLIENT_CONNECTING_WAIT
- */
-reconnect_delay_timer_init(s,
-qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
-s->reconnect_delay * NANOSECONDS_PER_SECOND);
-}
+bool blocking = nbd_client_connecting_wait(s);
 
 /*
  * Now we are sure that nobody is accessing the channel, and no one will
  * try until we set the state to CONNECTED.
  */
+assert(nbd_client_connecting(s));
+assert(s->in_flight == 1);
+
+if (blocking && !s->reconnect_delay_timer) {
+/*
+ * It's the first reconnect attempt after switching to
+ * NBD_CLIENT_CONNECTING_WAIT
+ */
+g_assert(s->reconnect_delay);
+reconnect_delay_timer_init(s,
+qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
+s->reconnect_delay * NANOSECONDS_PER_SECOND);
+}
 
 /* Finalize previous connection if any */
 if (s->ioc) {
@@ -387,7 +382,9 @@ static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState 
*s)
 s->ioc = NULL;
 }
 
-nbd_co_do_establish_connection(s->bs, NULL);
+qemu_co_mutex_unlock(>send_mutex);
+nbd_co_do_establish_connection(s->bs, blocking, NULL);
+qemu_co_mutex_lock(>send_mutex);
 
 /*
  * The reconnect attempt is done (maybe successfully, maybe not), so
@@ -472,21 +469,21 @@ static int coroutine_fn 
nbd_co_send_request(BlockDriverState *bs,
 qemu_co_mutex_lock(>send_mutex);
 
 while

[PATCH v2 for-7.1 0/9] nbd: actually make s->state thread-safe

2022-04-14 Thread Paolo Bonzini

The main point of this series is patch 7, which removes the dubious and
probably wrong use of atomics in block/nbd.c.  This in turn is enabled
mostly by the cleanups in patches 3-5.  Together, they introduce a
QemuMutex that synchronizes the NBD client coroutines, the reconnect_delay
timer and nbd_cancel_in_flight() as well.

The fixes happen to remove an incorrect use of qemu_co_queue_restart_all
and qemu_co_enter_next on the s->free_sema CoQueue, which was not guarded
by s->send_mutex.

The rest is bugfixes, simplifying the code a bit, and extra documentation.

v1->v2:
- cleaner patch 1
- fix grammar in patch 4
- split out patch 6

Paolo Bonzini (9):
  nbd: safeguard against waking up invalid coroutine
  nbd: mark more coroutine_fns
  nbd: remove peppering of nbd_client_connected
  nbd: keep send_mutex/free_sema handling outside
nbd_co_do_establish_connection
  nbd: use a QemuMutex to synchronize yanking, reconnection and
coroutines
  nbd: code motion and function renaming
  nbd: move s->state under requests_lock
  nbd: take receive_mutex when reading requests[].receiving
  nbd: document what is protected by the CoMutexes

 block/coroutines.h |   4 +-
 block/nbd.c| 298 +++--
 2 files changed, 154 insertions(+), 148 deletions(-)

-- 
2.35.1

[PATCH v2 for-7.1 2/9] nbd: mark more coroutine_fns

2022-04-14 Thread Paolo Bonzini

Several coroutine functions in block/nbd.c are not marked as such.  This
patch adds a few more markers; it is not exhaustive, but it focuses
especially on:

- places that wake other coroutines, because aio_co_wake() has very
different semantics inside a coroutine (queuing after yield vs. entering
immediately);

- functions with _co_ in their names, to avoid confusion

Reviewed-by: Eric Blake 
Signed-off-by: Paolo Bonzini 
---
 block/nbd.c | 64 ++---
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index d29bee1122..3cc4ea4430 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -132,7 +132,7 @@ static bool nbd_client_connected(BDRVNBDState *s)
 return qatomic_load_acquire(>state) == NBD_CLIENT_CONNECTED;
 }
 
-static bool nbd_recv_coroutine_wake_one(NBDClientRequest *req)
+static bool coroutine_fn nbd_recv_coroutine_wake_one(NBDClientRequest *req)
 {
 if (req->receiving) {
 req->receiving = false;
@@ -143,7 +143,7 @@ static bool nbd_recv_coroutine_wake_one(NBDClientRequest 
*req)
 return false;
 }
 
-static void nbd_recv_coroutines_wake(BDRVNBDState *s, bool all)
+static void coroutine_fn nbd_recv_coroutines_wake(BDRVNBDState *s, bool all)
 {
 int i;
 
@@ -154,7 +154,7 @@ static void nbd_recv_coroutines_wake(BDRVNBDState *s, bool 
all)
 }
 }
 
-static void nbd_channel_error(BDRVNBDState *s, int ret)
+static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
 {
 if (nbd_client_connected(s)) {
 qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
@@ -466,9 +466,9 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState 
*s, uint64_t handle)
 }
 }
 
-static int nbd_co_send_request(BlockDriverState *bs,
-   NBDRequest *request,
-   QEMUIOVector *qiov)
+static int coroutine_fn nbd_co_send_request(BlockDriverState *bs,
+NBDRequest *request,
+QEMUIOVector *qiov)
 {
 BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 int rc, i = -1;
@@ -721,9 +721,9 @@ static int nbd_parse_error_payload(NBDStructuredReplyChunk 
*chunk,
 return 0;
 }
 
-static int nbd_co_receive_offset_data_payload(BDRVNBDState *s,
-  uint64_t orig_offset,
-  QEMUIOVector *qiov, Error **errp)
+static int coroutine_fn
+nbd_co_receive_offset_data_payload(BDRVNBDState *s, uint64_t orig_offset,
+   QEMUIOVector *qiov, Error **errp)
 {
 QEMUIOVector sub_qiov;
 uint64_t offset;
@@ -1039,8 +1039,8 @@ break_loop:
 return false;
 }
 
-static int nbd_co_receive_return_code(BDRVNBDState *s, uint64_t handle,
-  int *request_ret, Error **errp)
+static int coroutine_fn nbd_co_receive_return_code(BDRVNBDState *s, uint64_t 
handle,
+   int *request_ret, Error 
**errp)
 {
 NBDReplyChunkIter iter;
 
@@ -1053,9 +1053,9 @@ static int nbd_co_receive_return_code(BDRVNBDState *s, 
uint64_t handle,
 return iter.ret;
 }
 
-static int nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t handle,
-uint64_t offset, QEMUIOVector *qiov,
-int *request_ret, Error **errp)
+static int coroutine_fn nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t 
handle,
+ uint64_t offset, 
QEMUIOVector *qiov,
+ int *request_ret, Error 
**errp)
 {
 NBDReplyChunkIter iter;
 NBDReply reply;
@@ -1105,10 +1105,10 @@ static int nbd_co_receive_cmdread_reply(BDRVNBDState 
*s, uint64_t handle,
 return iter.ret;
 }
 
-static int nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
-uint64_t handle, uint64_t length,
-NBDExtent *extent,
-int *request_ret, Error **errp)
+static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
+ uint64_t handle, 
uint64_t length,
+ NBDExtent *extent,
+ int *request_ret, 
Error **errp)
 {
 NBDReplyChunkIter iter;
 NBDReply reply;
@@ -1165,8 +1165,8 @@ static int nbd_co_receive_blockstatus_reply(BDRVNBDState 
*s,
 return iter.ret;
 }
 
-static int nbd_co_request(BlockDriverState *bs, NBDRequest *request,
-  QEMUIOVector *write_qiov)
+static int coroutine_fn nbd_co_request(BlockDriverState *bs, NBDRequest 
*request,
+   QEMUIOVector *write_qiov)
 {
 int ret, request_ret;

Re: [PATCH v2 27/39] util/log: Introduce qemu_set_log_filename_flags

2022-04-14 Thread Richard Henderson


On 4/14/22 07:56, Alex Bennée wrote:

  #ifdef CONFIG_TRACE_LOG
-qemu_loglevel |= LOG_TRACE;
+log_flags |= LOG_TRACE;
  #endif
+qemu_loglevel = log_flags;
+


This looked weird - so should we consider a qatomic_set here to avoid an
inconsistent set of flags being read non-atomically elsewhere?


I suppose we could do, as a separate change, since this has not been considered before. 
But I don't believe in tears to aligned 'int' on any qemu host.



+logfile = g_new0(QemuLogFile, 1);
+logfile->fd = fd;
  qatomic_rcu_set(_logfile, logfile);


I was also pondering if flags should be part of the QemuLogFile
structure so it's consistent with each opened file. However I see it
gets repurposed just for clean-up later...


I actually had this at one point in development.  But yes, there's no point in it for just 
the release.



r~

Re: [RFC PATCH] tests/qtest: pass stdout/stderr down to subtests

2022-04-14 Thread Eric Auger

Hi Alex,

On 4/7/22 5:00 PM, Alex Bennée wrote:
> When trying to work out what the virtio-net-tests where doing it was
> hard because the g_test_trap_subprocess redirects all output to
> /dev/null. Lift this restriction by using the appropriate flags so you
> can see something similar to what the vhost-user-blk tests show when
> running.
> 
> While we are at it remove the g_test_verbose() check so we always show
> how the QEMU is run.
> 
> Signed-off-by: Alex Bennée 
> ---
>  tests/qtest/qos-test.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/tests/qtest/qos-test.c b/tests/qtest/qos-test.c
> index f97d0a08fd..c6c196cc95 100644
> --- a/tests/qtest/qos-test.c
> +++ b/tests/qtest/qos-test.c
> @@ -89,9 +89,7 @@ static void qos_set_machines_devices_available(void)
>  
>  static void restart_qemu_or_continue(char *path)
>  {
> -if (g_test_verbose()) {
> -qos_printf("Run QEMU with: '%s'\n", path);
> -}
> +qos_printf("Run QEMU with: '%s'\n", path);
>  /* compares the current command line with the
>   * one previously executed: if they are the same,
>   * don't restart QEMU, if they differ, stop previous
> @@ -185,7 +183,8 @@ static void run_one_test(const void *arg)
>  static void subprocess_run_one_test(const void *arg)
>  {
>  const gchar *path = arg;
> -g_test_trap_subprocess(path, 0, 0);
> +g_test_trap_subprocess(path, 0,
> +   G_TEST_SUBPROCESS_INHERIT_STDOUT | 
> G_TEST_SUBPROCESS_INHERIT_STDERR);
While workling on libqos/pci tests on aarch64 I also did that but I
noticed there were a bunch of errors such as:

/aarch64/virt/generic-pcihost/pci-bus-generic/pci-bus/virtio-net-pci/virtio-net/virtio-net-tests/vhost-user/multiqueue:
qemu-system-aarch64: Failed to set msg fds.
qemu-system-aarch64: vhost VQ 0 ring restore failed: -22: Invalid
argument (22)
qemu-system-aarch64: Failed to set msg fds.
qemu-system-aarch64: vhost VQ 1 ring restore failed: -22: Invalid
argument (22)
qemu-system-aarch64: Failed to set msg fds.
qemu-system-aarch64: vhost VQ 2 ring restore failed: -22: Invalid
argument (22)
qemu-system-aarch64: Failed to set msg fds.
qemu-system-aarch64: vhost VQ 3 ring restore failed: -22: Invalid
argument (22)

I see those also when running with x86_64-softmmu/qemu-system-x86_64
(this is no aarch64 specific).

I don't know if it is an issue to get those additional errors?

Thanks

Eric

>  g_test_trap_assert_passed();
>  }
>  
>

Re: [PATCH for-7.1] hw/block/fdc-sysbus: Always mark sysbus floppy controllers as not having DMA

2022-04-14 Thread Peter Maydell

On Tue, 12 Apr 2022 at 17:49, Peter Maydell  wrote:
>
> The sysbus floppy controllers (devices sysbus-fdc and sun-fdtwo)
> don't support DMA.  The core floppy controller code expects this to
> be indicated by setting FDCtrl::dma_chann to -1.  This used to be
> done in the device instance_init functions sysbus_fdc_initfn() and
> sun4m_fdc_initfn(), but in commit 1430759ec3e we refactored this code
> and accidentally lost the setting of dma_chann.
>
> For sysbus-fdc this has no ill effects because we were redundantly
> also setting dma_chann in fdctrl_init_sysbus(), but for sun-fdtwo
> this means that guests which try to enable DMA on the floppy
> controller will cause QEMU to crash because FDCtrl::dma is NULL.
>
> Set dma_chann to -1 in the common instance init, and remove the
> redundant code in fdctrl_init_sysbus() that is also setting it.
>
> There is a six-year-old FIXME comment in the jazz board code to the
> effect that in theory it should support doing DMA via a custom DMA
> controller.  If anybody ever chooses to fix that they can do it by
> adding support for setting both FDCtrl::dma_chann and FDCtrl::dma.
> (A QOM link property 'dma-controller' on the sysbus device which can
> be set to an instance of IsaDmaClass is probably the way to go.)
>
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/958
> Signed-off-by: Peter Maydell 

> -void fdctrl_init_sysbus(qemu_irq irq, int dma_chann,
> -hwaddr mmio_base, DriveInfo **fds)
> +void fdctrl_init_sysbus(qemu_irq irq, hwaddr mmio_base, DriveInfo **fds)
>  {
>  FDCtrl *fdctrl;
>  DeviceState *dev;
> @@ -105,7 +104,6 @@ void fdctrl_init_sysbus(qemu_irq irq, int dma_chann,
>  dev = qdev_new("sysbus-fdc");
>  sys = SYSBUS_FDC(dev);
>  fdctrl = >state;
> -fdctrl->dma_chann = dma_chann; /* FIXME */
>  sbd = SYS_BUS_DEVICE(dev);
>  sysbus_realize_and_unref(sbd, _fatal);
>  sysbus_connect_irq(sbd, 0, irq);

Just noticed that deleting this line removes the only use
of the 'fdctrl' local in this function, which then means
we can delete it. I'll send a v2 that does that.

-- PMM

[PATCH for-7.1 3/5] machine: add mem compound property

2022-04-14 Thread Paolo Bonzini

Make -m syntactic sugar for a compound property "-machine
mem.{size,max-size,slots}".  The new property does not have
the magic conversion to megabytes of unsuffixed arguments,
and also does not understand that "0" means the default size
(you have to leave it out to get the default).  This means
that we need to convert the QemuOpts by hand to a QDict.

Signed-off-by: Paolo Bonzini 
---
 hw/core/machine.c |  80 ++
 qapi/machine.json |  18 +++
 softmmu/vl.c  | 123 +++---
 3 files changed, 138 insertions(+), 83 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 710dfbd982..d5cfffe8a0 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -520,6 +520,78 @@ static void machine_set_hmat(Object *obj, bool value, 
Error **errp)
 ms->numa_state->hmat_enabled = value;
 }
 
+static void machine_get_mem(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+MachineState *ms = MACHINE(obj);
+MemorySizeConfiguration mem = {
+.has_size = true,
+.size = ms->ram_size,
+.has_max_size = !!ms->ram_slots,
+.max_size = ms->maxram_size,
+.has_slots = !!ms->ram_slots,
+.slots = ms->ram_slots,
+};
+MemorySizeConfiguration *p_mem = 
+
+visit_type_MemorySizeConfiguration(v, name, _mem, _abort);
+}
+
+static void machine_set_mem(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+MachineState *ms = MACHINE(obj);
+MachineClass *mc = MACHINE_GET_CLASS(obj);
+MemorySizeConfiguration *mem;
+
+ERRP_GUARD();
+
+if (!visit_type_MemorySizeConfiguration(v, name, , errp)) {
+return;
+}
+
+if (!mem->has_size) {
+mem->has_size = true;
+mem->size = mc->default_ram_size;
+}
+mem->size = QEMU_ALIGN_UP(mem->size, 8192);
+if (mc->fixup_ram_size) {
+mem->size = mc->fixup_ram_size(mem->size);
+}
+if ((ram_addr_t)mem->size != mem->size) {
+error_setg(errp, "ram size too large");
+goto out_free;
+}
+
+if (mem->has_max_size) {
+if (mem->max_size < mem->size) {
+error_setg(errp, "invalid value of maxmem: "
+   "maximum memory size (0x%" PRIx64 ") must be at least "
+   "the initial memory size (0x%" PRIx64 ")",
+   mem->max_size, mem->size);
+goto out_free;
+}
+if (mem->has_slots && mem->slots && mem->max_size == mem->size) {
+error_setg(errp, "invalid value of maxmem: "
+   "memory slots were specified but maximum memory size "
+   "(0x%" PRIx64 ") is equal to the initial memory size "
+   "(0x%" PRIx64 ")", mem->max_size, mem->size);
+goto out_free;
+}
+ms->maxram_size = mem->max_size;
+} else {
+if (mem->has_slots) {
+error_setg(errp, "slots specified but no max-size");
+goto out_free;
+}
+ms->maxram_size = mem->size;
+}
+ms->ram_size = mem->size;
+ms->ram_slots = mem->has_slots ? mem->slots : 0;
+out_free:
+qapi_free_MemorySizeConfiguration(mem);
+}
+
 static char *machine_get_nvdimm_persistence(Object *obj, Error **errp)
 {
 MachineState *ms = MACHINE(obj);
@@ -940,6 +1012,12 @@ static void machine_class_init(ObjectClass *oc, void 
*data)
 object_class_property_set_description(oc, "memory-backend",
   "Set RAM backend"
   "Valid value is ID of hostmem based 
backend");
+
+object_class_property_add(oc, "memory", "MemorySizeConfiguration",
+machine_get_mem, machine_set_mem,
+NULL, NULL);
+object_class_property_set_description(oc, "memory",
+"Memory size configuration");
 }
 
 static void machine_class_base_init(ObjectClass *oc, void *data)
@@ -970,6 +1048,8 @@ static void machine_initfn(Object *obj)
 ms->mem_merge = true;
 ms->enable_graphics = true;
 ms->kernel_cmdline = g_strdup("");
+ms->ram_size = mc->default_ram_size;
+ms->maxram_size = mc->default_ram_size;
 
 if (mc->nvdimm_supported) {
 Object *obj = OBJECT(ms);
diff --git a/qapi/machine.json b/qapi/machine.json
index fb06f02ed1..ca6b854670 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1612,3 +1612,21 @@
 ##
 { 'enum': 'SmbiosEntryPointType',
   'data': [ '32', '64' ] }
+
+##
+# @MemorySizeConfiguration:
+#
+# Schema for memory size configuration.
+#
+# @size: memory size in bytes
+#
+# @max-size: maximum hotpluggable memory size in bytes
+#
+# @slots: number of available memory slots for hotplug
+#
+# Since: 6.1
+##
+{ 'struct': 'MemorySizeConfiguration', 'data': {
+ '*size': 'size',
+ '*max-size': 'size',
+ '*slots': 'uint64' } }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index

[PATCH for-7.1 1/5] machine: use QAPI struct for boot configuration

2022-04-14 Thread Paolo Bonzini

As part of converting -boot to a property with a QAPI type, define
the struct and use it throughout QEMU to access boot configuration.
machine_boot_parse takes care of doing the QemuOpts->QAPI conversion by
hand, for now.

Signed-off-by: Paolo Bonzini 
---
 hw/arm/nseries.c|  2 +-
 hw/core/machine.c   | 68 +++--
 hw/hppa/machine.c   |  6 ++--
 hw/i386/pc.c|  2 +-
 hw/nvram/fw_cfg.c   | 27 +---
 hw/ppc/mac_newworld.c   |  2 +-
 hw/ppc/mac_oldworld.c   |  2 +-
 hw/ppc/prep.c   |  2 +-
 hw/ppc/spapr.c  |  4 +--
 hw/s390x/ipl.c  | 20 
 hw/sparc/sun4m.c|  4 +--
 hw/sparc64/sun4u.c  |  4 +--
 include/hw/boards.h |  4 +--
 include/sysemu/sysemu.h |  2 --
 qapi/machine.json   | 30 ++
 softmmu/bootdevice.c|  3 +-
 softmmu/globals.c   |  2 --
 softmmu/vl.c| 25 +--
 18 files changed, 127 insertions(+), 82 deletions(-)

diff --git a/hw/arm/nseries.c b/hw/arm/nseries.c
index 9c1cafae86..692c94ceb4 100644
--- a/hw/arm/nseries.c
+++ b/hw/arm/nseries.c
@@ -1365,7 +1365,7 @@ static void n8x0_init(MachineState *machine,
 }
 
 if (option_rom[0].name &&
-(machine->boot_order[0] == 'n' || !machine->kernel_filename)) {
+(machine->boot_config.order[0] == 'n' || !machine->kernel_filename)) {
 uint8_t *nolo_tags = g_new(uint8_t, 0x1);
 /* No, wait, better start at the ROM.  */
 s->mpu->cpu->env.regs[15] = OMAP2_Q2_BASE + 0x40;
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 1e23fdc14b..dc059cfab5 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -771,6 +771,68 @@ static void machine_set_smp(Object *obj, Visitor *v, const 
char *name,
 machine_parse_smp_config(ms, config, errp);
 }
 
+void machine_boot_parse(MachineState *ms, QemuOpts *opts, Error **errp)
+{
+MachineClass *machine_class = MACHINE_GET_CLASS(ms);
+const char *s;
+ERRP_GUARD();
+
+ms->boot_config = (BootConfiguration) {
+.has_order = true,
+.order = (char *)machine_class->default_boot_order,
+.has_strict = true,
+.strict = false,
+};
+if (!opts) {
+return;
+}
+
+s = qemu_opt_get(opts, "order");
+if (s) {
+validate_bootdevices(s, errp);
+if (*errp) {
+return;
+}
+ms->boot_config.order = (char *)s;
+}
+
+s = qemu_opt_get(opts, "once");
+if (s) {
+validate_bootdevices(s, errp);
+if (*errp) {
+return;
+}
+ms->boot_config.has_once = true;
+ms->boot_config.once = (char *)s;
+}
+
+s = qemu_opt_get(opts, "splash");
+if (s) {
+ms->boot_config.has_splash = true;
+ms->boot_config.splash = (char *)s;
+}
+
+s = qemu_opt_get(opts, "splash-time");
+if (s) {
+ms->boot_config.has_splash_time = true;
+ms->boot_config.splash_time = qemu_opt_get_number(opts, "splash-time", 
-1);
+}
+
+s = qemu_opt_get(opts, "reboot-timeout");
+if (s) {
+ms->boot_config.has_reboot_timeout = true;
+ms->boot_config.reboot_timeout = qemu_opt_get_number(opts, 
"reboot-timeout", -1);
+}
+
+s = qemu_opt_get(opts, "menu");
+if (s) {
+ms->boot_config.has_menu = true;
+ms->boot_config.menu = qemu_opt_get_bool(opts, "menu", false);
+}
+
+ms->boot_config.strict = qemu_opt_get_bool(opts, "strict", false);
+}
+
 static void machine_class_init(ObjectClass *oc, void *data)
 {
 MachineClass *mc = MACHINE_CLASS(oc);
@@ -1210,9 +1272,9 @@ void qdev_machine_creation_done(void)
 {
 cpu_synchronize_all_post_init();
 
-if (current_machine->boot_once) {
-qemu_boot_set(current_machine->boot_once, _fatal);
-qemu_register_reset(restore_boot_order, 
g_strdup(current_machine->boot_order));
+if (current_machine->boot_config.has_once) {
+qemu_boot_set(current_machine->boot_config.once, _fatal);
+qemu_register_reset(restore_boot_order, 
g_strdup(current_machine->boot_config.order));
 }
 
 /*
diff --git a/hw/hppa/machine.c b/hw/hppa/machine.c
index 98b30e0395..716a831464 100644
--- a/hw/hppa/machine.c
+++ b/hw/hppa/machine.c
@@ -116,7 +116,7 @@ static FWCfgState *create_fw_cfg(MachineState *ms)
 fw_cfg_add_file(fw_cfg, "/etc/power-button-addr",
 g_memdup(, sizeof(val)), sizeof(val));
 
-fw_cfg_add_i16(fw_cfg, FW_CFG_BOOT_DEVICE, ms->boot_order[0]);
+fw_cfg_add_i16(fw_cfg, FW_CFG_BOOT_DEVICE, ms->boot_config.order[0]);
 qemu_register_boot_set(fw_cfg_boot_set, fw_cfg);
 
 return fw_cfg;
@@ -309,8 +309,8 @@ static void machine_hppa_init(MachineState *machine)
  * mode (kernel_entry=1), and to boot from CD (gr[24]='d')
  * or hard disc * (gr[24]='c').
  */
-kernel_entry = boot_menu ? 1 : 0;
-cpu[0]->env.gr[24] = machine->boot_order[0];
+

[PATCH for-7.1 2/5] machine: add boot compound property

2022-04-14 Thread Paolo Bonzini

Make -boot syntactic sugar for a compound property "-machine 
boot.{order,menu,...}".
machine_boot_parse is replaced by the setter for the property.

Signed-off-by: Paolo Bonzini 
---
 hw/core/machine.c   | 99 -
 include/hw/boards.h |  1 -
 softmmu/vl.c| 16 +++-
 3 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index dc059cfab5..710dfbd982 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -771,66 +771,63 @@ static void machine_set_smp(Object *obj, Visitor *v, 
const char *name,
 machine_parse_smp_config(ms, config, errp);
 }
 
-void machine_boot_parse(MachineState *ms, QemuOpts *opts, Error **errp)
+static void machine_get_boot(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
+MachineState *ms = MACHINE(obj);
+BootConfiguration *config = >boot_config;
+visit_type_BootConfiguration(v, name, , _abort);
+}
+
+static void machine_free_boot_config(MachineState *ms)
+{
+g_free(ms->boot_config.order);
+g_free(ms->boot_config.once);
+g_free(ms->boot_config.splash);
+}
+
+static void machine_copy_boot_config(MachineState *ms, BootConfiguration 
*config)
 {
 MachineClass *machine_class = MACHINE_GET_CLASS(ms);
-const char *s;
+
+machine_free_boot_config(ms);
+ms->boot_config = *config;
+if (!config->has_order) {
+ms->boot_config.has_order = true;
+ms->boot_config.order = g_strdup(machine_class->default_boot_order);
+}
+}
+
+static void machine_set_boot(Object *obj, Visitor *v, const char *name,
+void *opaque, Error **errp)
+{
 ERRP_GUARD();
+MachineState *ms = MACHINE(obj);
+BootConfiguration *config = NULL;
 
-ms->boot_config = (BootConfiguration) {
-.has_order = true,
-.order = (char *)machine_class->default_boot_order,
-.has_strict = true,
-.strict = false,
-};
-if (!opts) {
+if (!visit_type_BootConfiguration(v, name, , errp)) {
 return;
 }
-
-s = qemu_opt_get(opts, "order");
-if (s) {
-validate_bootdevices(s, errp);
+if (config->has_order) {
+validate_bootdevices(config->order, errp);
 if (*errp) {
-return;
+goto out_free;
 }
-ms->boot_config.order = (char *)s;
 }
-
-s = qemu_opt_get(opts, "once");
-if (s) {
-validate_bootdevices(s, errp);
+if (config->has_once) {
+validate_bootdevices(config->once, errp);
 if (*errp) {
-return;
+goto out_free;
 }
-ms->boot_config.has_once = true;
-ms->boot_config.once = (char *)s;
 }
 
-s = qemu_opt_get(opts, "splash");
-if (s) {
-ms->boot_config.has_splash = true;
-ms->boot_config.splash = (char *)s;
-}
-
-s = qemu_opt_get(opts, "splash-time");
-if (s) {
-ms->boot_config.has_splash_time = true;
-ms->boot_config.splash_time = qemu_opt_get_number(opts, "splash-time", 
-1);
-}
+machine_copy_boot_config(ms, config);
+/* Strings live in ms->boot_config.  */
+free(config);
+return;
 
-s = qemu_opt_get(opts, "reboot-timeout");
-if (s) {
-ms->boot_config.has_reboot_timeout = true;
-ms->boot_config.reboot_timeout = qemu_opt_get_number(opts, 
"reboot-timeout", -1);
-}
-
-s = qemu_opt_get(opts, "menu");
-if (s) {
-ms->boot_config.has_menu = true;
-ms->boot_config.menu = qemu_opt_get_bool(opts, "menu", false);
-}
-
-ms->boot_config.strict = qemu_opt_get_bool(opts, "strict", false);
+out_free:
+qapi_free_BootConfiguration(config);
 }
 
 static void machine_class_init(ObjectClass *oc, void *data)
@@ -871,6 +868,12 @@ static void machine_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, "dumpdtb",
 "Dump current dtb to a file and quit");
 
+object_class_property_add(oc, "boot", "BootConfiguration",
+machine_get_boot, machine_set_boot,
+NULL, NULL);
+object_class_property_set_description(oc, "boot",
+"Boot configuration");
+
 object_class_property_add(oc, "smp", "SMPConfiguration",
 machine_get_smp, machine_set_smp,
 NULL, NULL);
@@ -1004,12 +1007,16 @@ static void machine_initfn(Object *obj)
 ms->smp.clusters = 1;
 ms->smp.cores = 1;
 ms->smp.threads = 1;
+ms->smp.sockets = 1;
+
+machine_copy_boot_config(ms, &(BootConfiguration){ 0 });
 }
 
 static void machine_finalize(Object *obj)
 {
 MachineState *ms = MACHINE(obj);
 
+machine_free_boot_config(ms);
 g_free(ms->kernel_filename);
 g_free(ms->initrd_filename);
 g_free(ms->kernel_cmdline);
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 4b4e8d6991..8994f6c93b 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -26,7 +26,6 @@

[PATCH for-7.1 5/5] machine: move more memory validation to Machine object

2022-04-14 Thread Paolo Bonzini

This allows setting memory properties without going through
vl.c, and have them validated just the same.

Signed-off-by: Paolo Bonzini 
---
 hw/core/machine.c | 21 +++--
 softmmu/vl.c  | 17 +++--
 2 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 3dca927e3f..4f537c10c7 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -21,12 +21,14 @@
 #include "qapi/qapi-visit-common.h"
 #include "qapi/qapi-visit-machine.h"
 #include "qapi/visitor.h"
+#include "qom/object_interfaces.h"
 #include "hw/sysbus.h"
 #include "sysemu/cpus.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/reset.h"
 #include "sysemu/runstate.h"
 #include "sysemu/numa.h"
+#include "sysemu/xen.h"
 #include "qemu/error-report.h"
 #include "sysemu/qtest.h"
 #include "hw/pci/pci.h"
@@ -1283,8 +1285,23 @@ void machine_run_board_init(MachineState *machine, const 
char *mem_path, Error *
clock values from the log. */
 replay_checkpoint(CHECKPOINT_INIT);
 
-if (machine_class->default_ram_id && machine->ram_size &&
-numa_uses_legacy_mem() && !machine->memdev) {
+if (!xen_enabled()) {
+/* On 32-bit hosts, QEMU is limited by virtual address space */
+if (machine->ram_size > (2047 << 20) && HOST_LONG_BITS == 32) {
+error_setg(errp, "at most 2047 MB RAM can be simulated");
+return;
+}
+}
+
+if (machine->memdev) {
+ram_addr_t backend_size = 
object_property_get_uint(OBJECT(machine->memdev),
+   "size",  
_abort);
+if (backend_size != machine->ram_size) {
+error_setg(errp, "Machine memory size does not match the size of 
the memory backend");
+return;
+}
+} else if (machine_class->default_ram_id && machine->ram_size &&
+   numa_uses_legacy_mem()) {
 if (!create_default_memdev(current_machine, mem_path, errp)) {
 return;
 }
diff --git a/softmmu/vl.c b/softmmu/vl.c
index 03494760d5..a9913f0ae1 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2021,24 +2021,13 @@ static void qemu_resolve_machine_memdev(void)
 error_report("Memory backend '%s' not found", ram_memdev_id);
 exit(EXIT_FAILURE);
 }
-backend_size = object_property_get_uint(backend, "size",  
_abort);
-if (have_custom_ram_size && backend_size != current_machine->ram_size) 
{
-error_report("Size specified by -m option must match size of "
- "explicitly specified 'memory-backend' property");
-exit(EXIT_FAILURE);
+if (!have_custom_ram_size) {
+backend_size = object_property_get_uint(backend, "size",  
_abort);
+current_machine->ram_size = backend_size;
 }
-current_machine->ram_size = backend_size;
 object_property_set_link(OBJECT(current_machine),
  "memory-backend", backend, _fatal);
 }
-
-if (!xen_enabled()) {
-/* On 32-bit hosts, QEMU is limited by virtual address space */
-if (current_machine->ram_size > (2047 << 20) && HOST_LONG_BITS == 32) {
-error_report("at most 2047 MB RAM can be simulated");
-exit(1);
-}
-}
 }
 
 static void parse_memory_options(const char *arg)
-- 
2.31.1

[PATCH for-7.1 4/5] machine: make memory-backend a link property

2022-04-14 Thread Paolo Bonzini

Handle HostMemoryBackend creation and setting of ms->ram entirely in
machine_run_board_init.

Signed-off-by: Paolo Bonzini 
---
 hw/core/machine.c   | 70 ++---
 hw/core/numa.c  |  2 +-
 hw/sparc/sun4m.c|  5 ++--
 include/hw/boards.h |  4 +--
 softmmu/vl.c| 62 ++-
 5 files changed, 74 insertions(+), 69 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index d5cfffe8a0..3dca927e3f 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -36,6 +36,7 @@
 #include "exec/confidential-guest-support.h"
 #include "hw/virtio/virtio.h"
 #include "hw/virtio/virtio-pci.h"
+#include "qom/object_interfaces.h"
 
 GlobalProperty hw_compat_6_2[] = {
 { "PIIX4_PM", "x-not-migrate-acpi-index", "on"},
@@ -650,21 +651,6 @@ bool device_type_is_dynamic_sysbus(MachineClass *mc, const 
char *type)
 return allowed;
 }
 
-static char *machine_get_memdev(Object *obj, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-return g_strdup(ms->ram_memdev_id);
-}
-
-static void machine_set_memdev(Object *obj, const char *value, Error **errp)
-{
-MachineState *ms = MACHINE(obj);
-
-g_free(ms->ram_memdev_id);
-ms->ram_memdev_id = g_strdup(value);
-}
-
 HotpluggableCPUList *machine_query_hotpluggable_cpus(MachineState *machine)
 {
 int i;
@@ -1007,8 +993,9 @@ static void machine_class_init(ObjectClass *oc, void *data)
 object_class_property_set_description(oc, "memory-encryption",
 "Set memory encryption object to use");
 
-object_class_property_add_str(oc, "memory-backend",
-  machine_get_memdev, machine_set_memdev);
+object_class_property_add_link(oc, "memory-backend", TYPE_MEMORY_BACKEND,
+   offsetof(MachineState, memdev), 
object_property_allow_set_link,
+   OBJ_PROP_LINK_STRONG);
 object_class_property_set_description(oc, "memory-backend",
   "Set RAM backend"
   "Valid value is ID of hostmem based 
backend");
@@ -1252,7 +1239,40 @@ MemoryRegion *machine_consume_memdev(MachineState 
*machine,
 return ret;
 }
 
-void machine_run_board_init(MachineState *machine)
+static bool create_default_memdev(MachineState *ms, const char *path, Error 
**errp)
+{
+Object *obj;
+MachineClass *mc = MACHINE_GET_CLASS(ms);
+bool r = false;
+
+obj = object_new(path ? TYPE_MEMORY_BACKEND_FILE : 
TYPE_MEMORY_BACKEND_RAM);
+if (path) {
+if (!object_property_set_str(obj, "mem-path", path, errp)) {
+goto out;
+}
+}
+if (!object_property_set_int(obj, "size", ms->ram_size, errp)) {
+goto out;
+}
+object_property_add_child(object_get_objects_root(), mc->default_ram_id,
+  obj);
+/* Ensure backend's memory region name is equal to mc->default_ram_id */
+if (!object_property_set_bool(obj, "x-use-canonical-path-for-ramblock-id",
+ false, errp)) {
+goto out;
+}
+if (!user_creatable_complete(USER_CREATABLE(obj), errp)) {
+goto out;
+}
+r = object_property_set_link(OBJECT(ms), "memory-backend", obj, errp);
+
+out:
+object_unref(obj);
+return r;
+}
+
+
+void machine_run_board_init(MachineState *machine, const char *mem_path, Error 
**errp)
 {
 MachineClass *machine_class = MACHINE_GET_CLASS(machine);
 ObjectClass *oc = object_class_by_name(machine->cpu_type);
@@ -1263,11 +1283,11 @@ void machine_run_board_init(MachineState *machine)
clock values from the log. */
 replay_checkpoint(CHECKPOINT_INIT);
 
-if (machine->ram_memdev_id) {
-Object *o;
-o = object_resolve_path_type(machine->ram_memdev_id,
- TYPE_MEMORY_BACKEND, NULL);
-machine->ram = machine_consume_memdev(machine, MEMORY_BACKEND(o));
+if (machine_class->default_ram_id && machine->ram_size &&
+numa_uses_legacy_mem() && !machine->memdev) {
+if (!create_default_memdev(current_machine, mem_path, errp)) {
+return;
+}
 }
 
 if (machine->numa_state) {
@@ -1277,6 +1297,10 @@ void machine_run_board_init(MachineState *machine)
 }
 }
 
+if (!machine->ram && machine->memdev) {
+machine->ram = machine_consume_memdev(machine, machine->memdev);
+}
+
 /* If the machine supports the valid_cpu_types check and the user
  * specified a CPU with -cpu check here that the user CPU is supported.
  */
diff --git a/hw/core/numa.c b/hw/core/numa.c
index 1aa05dcf42..26d8e5f616 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -695,7 +695,7 @@ void numa_complete_configuration(MachineState *ms)
 }
 
 if (!numa_uses_legacy_mem() && mc->default_ram_id) {
-if (ms->ram_memdev_id) {
+if (ms->memdev) {

[PATCH for-7.1 0/5] Move memory and boot to -machine

2022-04-14 Thread Paolo Bonzini

As the next step in turning command line options into shortcuts, this series
does -boot and -m.  It also makes -M memory-backend a link instead of special
casing it in vl.c, and makes the MachineState validate memory configuration
without needing help from vl.c.

Paolo Bonzini (5):
  machine: use QAPI struct for boot configuration
  machine: add boot compound property
  machine: add mem compound property
  machine: make memory-backend a link property
  machine: move more memory validation to Machine object

 hw/arm/nseries.c|   2 +-
 hw/core/machine.c   | 242 +++-
 hw/core/numa.c  |   2 +-
 hw/hppa/machine.c   |   6 +-
 hw/i386/pc.c|   2 +-
 hw/nvram/fw_cfg.c   |  27 ++---
 hw/ppc/mac_newworld.c   |   2 +-
 hw/ppc/mac_oldworld.c   |   2 +-
 hw/ppc/prep.c   |   2 +-
 hw/ppc/spapr.c  |   4 +-
 hw/s390x/ipl.c  |  20 +---
 hw/sparc/sun4m.c|   9 +-
 hw/sparc64/sun4u.c  |   4 +-
 include/hw/boards.h |   7 +-
 include/sysemu/sysemu.h |   2 -
 qapi/machine.json   |  48 
 softmmu/bootdevice.c|   3 +-
 softmmu/globals.c   |   2 -
 softmmu/vl.c| 229 +++--
 19 files changed, 363 insertions(+), 252 deletions(-)

-- 
2.31.1

Re: [PATCH for-7.1] hw/block/fdc-sysbus: Always mark sysbus floppy controllers as not having DMA

2022-04-14 Thread Philippe Mathieu-Daudé


On 12/4/22 18:49, Peter Maydell wrote:

The sysbus floppy controllers (devices sysbus-fdc and sun-fdtwo)
don't support DMA.  The core floppy controller code expects this to
be indicated by setting FDCtrl::dma_chann to -1.  This used to be
done in the device instance_init functions sysbus_fdc_initfn() and
sun4m_fdc_initfn(), but in commit 1430759ec3e we refactored this code
and accidentally lost the setting of dma_chann.


My bad :( Thanks for cleaning up.

Reviewed-by: Philippe Mathieu-Daudé 


For sysbus-fdc this has no ill effects because we were redundantly
also setting dma_chann in fdctrl_init_sysbus(), but for sun-fdtwo
this means that guests which try to enable DMA on the floppy
controller will cause QEMU to crash because FDCtrl::dma is NULL.

Set dma_chann to -1 in the common instance init, and remove the
redundant code in fdctrl_init_sysbus() that is also setting it.

There is a six-year-old FIXME comment in the jazz board code to the
effect that in theory it should support doing DMA via a custom DMA
controller.  If anybody ever chooses to fix that they can do it by
adding support for setting both FDCtrl::dma_chann and FDCtrl::dma.
(A QOM link property 'dma-controller' on the sysbus device which can
be set to an instance of IsaDmaClass is probably the way to go.)

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/958
Signed-off-by: Peter Maydell 
---
  include/hw/block/fdc.h |  3 +--
  hw/block/fdc-sysbus.c  | 14 +++---
  hw/mips/jazz.c |  2 +-
  3 files changed, 13 insertions(+), 6 deletions(-)

Re: [PATCH v2 39/39] util/log: Support per-thread log files

2022-04-14 Thread Richard Henderson


On 4/14/22 08:35, Alex Bennée wrote:

+/**
+ * valid_filename_template:
+ *
+ * Validate the filename template.  Require %d if per_thread, allow it
+ * otherwise; require no other % within the template.
+ * Return 0 if invalid, 1 if stderr, 2 if strdup, 3 if pid printf.


 From a neatness point of view having an enum would make it easier to
read in the switch down bellow...


Fair, and...


+switch (valid_filename_template(filename, per_thread, errp)) {
+case 0:
+return false; /* error */
+case 1:
+break;  /* stderr */
+case 2:
+newname = g_strdup(filename);
+break;
+case 3:
+newname = g_strdup_printf(filename, getpid());
+break;


default: g_assert_not_reached?


... using an enum with an extra enumerator for testing gives us

../src/util/log.c: In function ‘qemu_set_log_internal’:
../src/util/log.c:213:9: error: enumeration value ‘vft_new_case’ not handled in switch 
[-Werror=switch]

  213 | switch (valid_filename_template(filename, per_thread, errp)) {
  | ^~
cc1: all warnings being treated as errors

though that does require *not* having a default case.


r~

Re: [PATCH] hw/arm/smmuv3: Pass the real perm to returned IOMMUTLBEntry in smmuv3_translate()

2022-04-14 Thread Eric Auger

Hi Chenxiang,

On 4/7/22 9:57 AM, chenxiang via wrote:
> From: Xiang Chen 
>
> In function memory_region_iommu_replay(), it decides to notify() or not
> according to the perm of returned IOMMUTLBEntry. But for smmuv3, the
> returned perm is always IOMMU_NONE even if the translation success.
I think you should precise in the commit message that
memory_region_iommu_replay() always calls the IOMMU MR translate()
callback with flag=IOMMU_NONE and thus, currently, translate() returns
an IOMMUTLBEntry with perm set to IOMMU_NONE if the translation
succeeds, whereas it is expected to return the actual permission set in
the table entry.



> Pass the real perm to returned IOMMUTLBEntry to avoid the issue.
>
> Signed-off-by: Xiang Chen 
> ---
>  hw/arm/smmuv3.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 674623aabe..707eb430c2 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -760,7 +760,7 @@ epilogue:
>  qemu_mutex_unlock(>mutex);
>  switch (status) {
>  case SMMU_TRANS_SUCCESS:
> -entry.perm = flag;
> +entry.perm = cached_entry->entry.perm;
With that clarification
Reviewed-by: Eric Auger 

the translate() doc in ./include/exec/memory.h states
"
If IOMMU_NONE is passed then the IOMMU must do the
 * full page table walk and report the permissions in the returned
 * IOMMUTLBEntry. (Note that this implies that an IOMMU may not
 * return different mappings for reads and writes.)
"


Thanks

Eric
>  entry.translated_addr = cached_entry->entry.translated_addr +
>  (addr & cached_entry->entry.addr_mask);
>  entry.addr_mask = cached_entry->entry.addr_mask;

Re: [PATCH v2 32/39] util/log: Rename logfilename to global_filename

2022-04-14 Thread Richard Henderson


On 4/14/22 08:18, Alex Bennée wrote:


Richard Henderson  writes:


Rename to emphasize this is the file-scope global variable.

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
---
  util/log.c | 13 ++---
  1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/util/log.c b/util/log.c
index 491a8f97f9..b3f79deb6c 100644
--- a/util/log.c
+++ b/util/log.c
@@ -34,7 +34,7 @@ typedef struct QemuLogFile {
  FILE *fd;
  } QemuLogFile;
  
-static char *logfilename;

+static char *global_filename;
  static QemuMutex qemu_logfile_mutex;
  static QemuLogFile *qemu_logfile;
  int qemu_loglevel;
@@ -151,8 +151,8 @@ static bool qemu_set_log_internal(const char *filename, 
bool changed_name,
  }
  }
  
-g_free(logfilename);

-logfilename = newname;
+g_free(global_filename);
+global_filename = newname;
  filename = newname;


I guess there is no conceivable failure mode in which we get a torn
pointer without qatomic_set?


No.  Word-sized quantities like pointers will never tear.
Although what this comment has to do with a renaming, I don't know.


r~



Anyway:

Reviewed-by: Alex Bennée

[PATCH] hw/riscv: virt: fix DT property mmu-type when CPU mmu option is disabled

2022-04-14 Thread Niklas Cassel via

The device tree property "mmu-type" is currently exported as either
"riscv,sv32" or "riscv,sv48".

However, the riscv cpu device tree binding [1] has a specific value
"riscv,none" for a HART without a MMU.

Set the device tree property "mmu-type" to "riscv,none" when the CPU mmu
option is disabled using rv32,mmu=off or rv64,mmu=off.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/riscv/cpus.yaml?h=v5.17

Signed-off-by: Niklas Cassel 
---
 hw/riscv/virt.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c
index da50cbed43..3be6be9ad3 100644
--- a/hw/riscv/virt.c
+++ b/hw/riscv/virt.c
@@ -230,8 +230,14 @@ static void create_fdt_socket_cpus(RISCVVirtState *s, int 
socket,
 cpu_name = g_strdup_printf("/cpus/cpu@%d",
 s->soc[socket].hartid_base + cpu);
 qemu_fdt_add_subnode(mc->fdt, cpu_name);
-qemu_fdt_setprop_string(mc->fdt, cpu_name, "mmu-type",
-(is_32_bit) ? "riscv,sv32" : "riscv,sv48");
+if (riscv_feature(>soc[socket].harts[cpu].env,
+  RISCV_FEATURE_MMU)) {
+qemu_fdt_setprop_string(mc->fdt, cpu_name, "mmu-type",
+(is_32_bit) ? "riscv,sv32" : "riscv,sv48");
+} else {
+qemu_fdt_setprop_string(mc->fdt, cpu_name, "mmu-type",
+"riscv,none");
+}
 name = riscv_isa_string(>soc[socket].harts[cpu]);
 qemu_fdt_setprop_string(mc->fdt, cpu_name, "riscv,isa", name);
 g_free(name);
-- 
2.35.1

Re: [PATCH v2 39/39] util/log: Support per-thread log files

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Add a new log flag, tid, to turn this feature on.
> Require the log filename to be set, and to contain %d.
>
> Do not allow tid to be turned off once it is on, nor let
> the filename be change thereafter.  This avoids the need
> for signalling each thread to re-open on a name change.
>
> Signed-off-by: Richard Henderson 
> ---
> v2: Make use of CONFIG_GETTID, and fallback to SYS_gettid.
> ---
>  include/qemu/log.h |   1 +
>  util/log.c | 149 ++---
>  2 files changed, 115 insertions(+), 35 deletions(-)
>
> diff --git a/include/qemu/log.h b/include/qemu/log.h
> index a325bca661..c5643d8dd5 100644
> --- a/include/qemu/log.h
> +++ b/include/qemu/log.h
> @@ -34,6 +34,7 @@ bool qemu_log_separate(void);
>  #define CPU_LOG_PLUGIN (1 << 18)
>  /* LOG_STRACE is used for user-mode strace logging. */
>  #define LOG_STRACE (1 << 19)
> +#define LOG_PER_THREAD (1 << 20)
>  
>  /* Lock/unlock output. */
>  
> diff --git a/util/log.c b/util/log.c
> index df0710720f..0bb2233788 100644
> --- a/util/log.c
> +++ b/util/log.c
> @@ -27,6 +27,9 @@
>  #include "qemu/thread.h"
>  #include "qemu/lockable.h"
>  #include "qemu/rcu.h"
> +#ifdef CONFIG_LINUX
> +#include 
> +#endif
>  
>  
>  typedef struct RCUCloseFILE {
> @@ -38,22 +41,36 @@ typedef struct RCUCloseFILE {
>  static QemuMutex global_mutex;
>  static char *global_filename;
>  static FILE *global_file;
> +static __thread FILE *thread_file;
>  
>  int qemu_loglevel;
> -static int log_append = 0;
> +static bool log_append;
> +static bool log_per_thread;
>  static GArray *debug_regions;
>  
>  /* Returns true if qemu_log() will really write somewhere. */
>  bool qemu_log_enabled(void)
>  {
> -return qatomic_read(_file) != NULL;
> +return log_per_thread || qatomic_read(_file) != NULL;
>  }
>  
>  /* Returns true if qemu_log() will write somewhere other than stderr. */
>  bool qemu_log_separate(void)
>  {
>  FILE *logfile = qatomic_read(_file);
> -return logfile && logfile != stderr;
> +return log_per_thread || (logfile && logfile != stderr);
> +}
> +
> +static int log_thread_id(void)
> +{
> +#ifdef CONFIG_GETTID
> +return gettid();
> +#elif defined(SYS_gettid)
> +return syscall(SYS_gettid);
> +#else
> +static int counter;
> +return qatomic_fetch_inc();
> +#endif
>  }
>  
>  /* Lock/unlock output. */
> @@ -62,20 +79,34 @@ FILE *qemu_log_trylock(void)
>  {
>  FILE *logfile;
>  
> -rcu_read_lock();
> -/*
> - * FIXME: typeof_strip_qual, as used by qatomic_rcu_read,
> - * does not work with pointers to undefined structures,
> - * such as we have with struct _IO_FILE and musl libc.
> - * Since all we want is a read of a pointer, cast to void**,
> - * which does work with typeof_strip_qual.
> - */
> -logfile = qatomic_rcu_read((void **)_file);
> -if (logfile) {
> -qemu_flockfile(logfile);
> -} else {
> -rcu_read_unlock();
> +logfile = thread_file;
> +if (!logfile) {
> +if (log_per_thread) {
> +g_autofree char *filename
> += g_strdup_printf(global_filename, log_thread_id());
> +logfile = fopen(filename, "w");
> +if (!logfile) {
> +return NULL;
> +}
> +thread_file = logfile;
> +} else {
> +rcu_read_lock();
> +/*
> + * FIXME: typeof_strip_qual, as used by qatomic_rcu_read,
> + * does not work with pointers to undefined structures,
> + * such as we have with struct _IO_FILE and musl libc.
> + * Since all we want is a read of a pointer, cast to void**,
> + * which does work with typeof_strip_qual.
> + */
> +logfile = qatomic_rcu_read((void **)_file);
> +if (!logfile) {
> +rcu_read_unlock();
> +return NULL;
> +}
> +}
>  }
> +
> +qemu_flockfile(logfile);
>  return logfile;
>  }
>  
> @@ -84,7 +115,9 @@ void qemu_log_unlock(FILE *logfile)
>  if (logfile) {
>  fflush(logfile);
>  qemu_funlockfile(logfile);
> -rcu_read_unlock();
> +if (!log_per_thread) {
> +rcu_read_unlock();
> +}
>  }
>  }
>  
> @@ -112,40 +145,74 @@ static void rcu_close_file(RCUCloseFILE *r)
>  g_free(r);
>  }
>  
> +/**
> + * valid_filename_template:
> + *
> + * Validate the filename template.  Require %d if per_thread, allow it
> + * otherwise; require no other % within the template.
> + * Return 0 if invalid, 1 if stderr, 2 if strdup, 3 if pid printf.

>From a neatness point of view having an enum would make it easier to
read in the switch down bellow...

> + */
> +static int valid_filename_template(const char *filename,
> +   bool per_thread, Error **errp)
> +{
> +if (filename) {
> +char *pidstr = strstr(filename, "%");
> +
>

Re: [PATCH] tests/qtest: Move the fuzz tests to x86 only

2022-04-14 Thread Laurent Vivier


On 14/04/2022 15:01, Thomas Huth wrote:

The fuzz tests are currently scheduled for all targets, but their setup
code limits the run to "i386", so that these tests always show "SKIP"
on other targets. Move it to the right x86 list in meson.build, then
we can drop the architecture check during runtime, too.

Signed-off-by: Thomas Huth 
---
  tests/qtest/fuzz-lsi53c895a-test.c  |  8 ++--
  tests/qtest/fuzz-megasas-test.c | 12 
  tests/qtest/fuzz-sb16-test.c| 12 
  tests/qtest/fuzz-sdcard-test.c  | 12 
  tests/qtest/fuzz-virtio-scsi-test.c |  8 ++--
  tests/qtest/meson.build | 13 ++---
  6 files changed, 22 insertions(+), 43 deletions(-)



Reviewed-by: Laurent Vivier

Re: [PATCH v2 38/39] util/log: Limit RCUCloseFILE to file closing

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Use FILE* for global_file.  We can perform an rcu_read on that
> just as easily as RCUCloseFILE*.  This simplifies a couple of
> places, where previously we required taking the rcu_read_lock
> simply to avoid racing to dereference RCUCloseFile->fd.
>
> Only allocate the RCUCloseFile prior to call_rcu.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 36/39] util/log: Combine two logfile closes

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Merge the close from the changed_name block with the close
> from the !need_to_open_file block.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 37/39] util/log: Rename QemuLogFile to RCUCloseFILE

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> s/QemuLogFile/RCUCloseFILE/
> s/qemu_logfile_free/rcu_close_file/
>
> Emphasize that this is only a carrier for passing a pointer
> to call_rcu for closing, and not the real logfile.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 35/39] util/log: Hoist the eval of is_daemonized in qemu_set_log_internal

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Only call is_daemonized once.
> We require the result on all paths after this point.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 34/39] util/log: Rename qemu_logfile_mutex to global_mutex

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Rename to emphasize this covers the file-scope global variables.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 33/39] util/log: Rename qemu_logfile to global_file

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Rename to emphasize this is the file-scope global variable.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 32/39] util/log: Rename logfilename to global_filename

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Rename to emphasize this is the file-scope global variable.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 
> ---
>  util/log.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/util/log.c b/util/log.c
> index 491a8f97f9..b3f79deb6c 100644
> --- a/util/log.c
> +++ b/util/log.c
> @@ -34,7 +34,7 @@ typedef struct QemuLogFile {
>  FILE *fd;
>  } QemuLogFile;
>  
> -static char *logfilename;
> +static char *global_filename;
>  static QemuMutex qemu_logfile_mutex;
>  static QemuLogFile *qemu_logfile;
>  int qemu_loglevel;
> @@ -151,8 +151,8 @@ static bool qemu_set_log_internal(const char *filename, 
> bool changed_name,
>  }
>  }
>  
> -g_free(logfilename);
> -logfilename = newname;
> +g_free(global_filename);
> +global_filename = newname;
>  filename = newname;

I guess there is no conceivable failure mode in which we get a torn
pointer without qatomic_set?

Anyway:

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 30/39] softmmu: Use qemu_set_log_filename_flags

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Perform all logfile setup at startup in one step.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 31/39] util/log: Remove qemu_log_close

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> The only real use is in cpu_abort, where we have just
> flushed the file via qemu_log_unlock, and are just about
> to force-crash the application via abort.  We do not
> really need to close the FILE before the abort.
>
> The two uses in test-logging.c can be handled with
> qemu_set_log_filename_flags.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 29/39] linux-user: Use qemu_set_log_filename_flags

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Perform all logfile setup in one step.
>
> Signed-off-by: Richard Henderson 
> ---
>  linux-user/main.c | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/linux-user/main.c b/linux-user/main.c
> index d263b2a669..0297ae8321 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -85,6 +85,7 @@ static bool enable_strace;
>   * Used to support command line arguments overriding environment variables.
>   */
>  static int last_log_mask;
> +static const char *last_log_filename;
>  
>  /*
>   * When running 32-on-64 we should make sure we can fit all of the possible
> @@ -257,7 +258,7 @@ static void handle_arg_dfilter(const char *arg)
>  
>  static void handle_arg_log_filename(const char *arg)
>  {
> -qemu_set_log_filename(arg, _fatal);
> +last_log_filename = arg;
>  }
>  
>  static void handle_arg_set_env(const char *arg)
> @@ -643,7 +644,6 @@ int main(int argc, char **argv, char **envp)
>  int i;
>  int ret;
>  int execfd;
> -int log_mask;
>  unsigned long max_reserved_va;
>  bool preserve_argv0;
>  
> @@ -677,10 +677,9 @@ int main(int argc, char **argv, char **envp)
>  
>  optind = parse_args(argc, argv);
>  
> -log_mask = last_log_mask | (enable_strace ? LOG_STRACE : 0);
> -if (log_mask) {
> -qemu_set_log(log_mask, _fatal);
> -}
> +qemu_set_log_filename_flags(last_log_filename,
> +last_log_mask | (enable_strace * LOG_STRACE),
> +_fatal);

I guess enable_strace ? LOG_STRACE : 0 should generate the same code
either way.

Anyway:

Reviewed-by: Alex Bennée 


>  
>  if (!trace_init_backends()) {
>  exit(1);


-- 
Alex Bennée

Re: [PATCH v2 28/39] bsd-user: Use qemu_set_log_filename_flags

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Perform all logfile setup in one step.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 27/39] util/log: Introduce qemu_set_log_filename_flags

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Provide a function to set both filename and flags at
> the same time.  This is the common case at startup.
>
> Signed-off-by: Richard Henderson 
> ---
> v2: Return bool, per recommendations in qapi/error.h (phil).
> ---
>  include/qemu/log.h |   1 +
>  util/log.c | 122 -
>  2 files changed, 77 insertions(+), 46 deletions(-)
>
> diff --git a/include/qemu/log.h b/include/qemu/log.h
> index 42d545f77a..b6c73376b5 100644
> --- a/include/qemu/log.h
> +++ b/include/qemu/log.h
> @@ -82,6 +82,7 @@ extern const QEMULogItem qemu_log_items[];
>  
>  bool qemu_set_log(int log_flags, Error **errp);
>  bool qemu_set_log_filename(const char *filename, Error **errp);
> +bool qemu_set_log_filename_flags(const char *name, int flags, Error **errp);
>  void qemu_set_dfilter_ranges(const char *ranges, Error **errp);
>  bool qemu_log_in_addr_range(uint64_t addr);
>  int qemu_str_to_log_mask(const char *str);
> diff --git a/util/log.c b/util/log.c
> index 8b8b6a5d83..2152d5591e 100644
> --- a/util/log.c
> +++ b/util/log.c
> @@ -117,15 +117,58 @@ static void qemu_logfile_free(QemuLogFile *logfile)
>  }
>  
>  /* enable or disable low levels log */
> -bool qemu_set_log(int log_flags, Error **errp)
> +static bool qemu_set_log_internal(const char *filename, bool changed_name,
> +  int log_flags, Error **errp)
>  {
> -bool need_to_open_file = false;
> +bool need_to_open_file;
>  QemuLogFile *logfile;
>  
> -qemu_loglevel = log_flags;
> +QEMU_LOCK_GUARD(_logfile_mutex);
> +logfile = qemu_logfile;
> +
> +if (changed_name) {
> +char *newname = NULL;
> +
> +/*
> + * Allow the user to include %d in their logfile which will be
> + * substituted with the current PID. This is useful for debugging 
> many
> + * nested linux-user tasks but will result in lots of logs.
> + *
> + * filename may be NULL. In that case, log output is sent to stderr
> + */
> +if (filename) {
> +char *pidstr = strstr(filename, "%");
> +
> +if (pidstr) {
> +/* We only accept one %d, no other format strings */
> +if (pidstr[1] != 'd' || strchr(pidstr + 2, '%')) {
> +error_setg(errp, "Bad logfile format: %s", filename);
> +return false;
> +}
> +newname = g_strdup_printf(filename, getpid());
> +} else {
> +newname = g_strdup(filename);
> +}
> +}
> +
> +g_free(logfilename);
> +logfilename = newname;
> +filename = newname;
> +
> +if (logfile) {
> +qatomic_rcu_set(_logfile, NULL);
> +call_rcu(logfile, qemu_logfile_free, rcu);
> +logfile = NULL;
> +}
> +} else {
> +filename = logfilename;
> +}
> +
>  #ifdef CONFIG_TRACE_LOG
> -qemu_loglevel |= LOG_TRACE;
> +log_flags |= LOG_TRACE;
>  #endif
> +qemu_loglevel = log_flags;
> +

This looked weird - so should we consider a qatomic_set here to avoid an
inconsistent set of flags being read non-atomically elsewhere?

>  /*
>   * In all cases we only log if qemu_loglevel is set.
>   * Also:
> @@ -134,71 +177,58 @@ bool qemu_set_log(int log_flags, Error **errp)
>   *   If we are daemonized,
>   * we will only log if there is a logfilename.
>   */
> -if (qemu_loglevel && (!is_daemonized() || logfilename)) {
> -need_to_open_file = true;
> -}
> -QEMU_LOCK_GUARD(_logfile_mutex);
> -if (qemu_logfile && !need_to_open_file) {
> -logfile = qemu_logfile;
> +need_to_open_file = log_flags && (!is_daemonized() || filename);
> +
> +if (logfile && !need_to_open_file) {
>  qatomic_rcu_set(_logfile, NULL);
>  call_rcu(logfile, qemu_logfile_free, rcu);
> -} else if (!qemu_logfile && need_to_open_file) {
> -logfile = g_new0(QemuLogFile, 1);
> -if (logfilename) {
> -logfile->fd = fopen(logfilename, log_append ? "a" : "w");
> -if (!logfile->fd) {
> +return true;
> +}
> +if (!logfile && need_to_open_file) {
> +FILE *fd;
> +
> +if (filename) {
> +fd = fopen(filename, log_append ? "a" : "w");
> +if (!fd) {
>  error_setg_errno(errp, errno, "Error opening logfile %s",
> - logfilename);
> + filename);
>  return false;
>  }
>  /* In case we are a daemon redirect stderr to logfile */
>  if (is_daemonized()) {
> -dup2(fileno(logfile->fd), STDERR_FILENO);
> -fclose(logfile->fd);
> +dup2(fileno(fd), STDERR_FILENO);
> +fclose(fd);
>  /* This will skip closing logfile in qemu_log_close() */

[PATCH] remove -writeconfig

2022-04-14 Thread Paolo Bonzini

Like -set and -readconfig, it would not really be too hard to
extend -writeconfig to parsing mechanisms other than QemuOpts.
However, the uses of -writeconfig are substantially more
limited, as it is generally easier to write the configuration
by hand in the first place.  In addition, -writeconfig does
not even try to detect cases where it prints incorrect
syntax (for example if values have a quote in them, since
qemu_config_parse does not support any kind of escaping.
Just remove it.

Signed-off-by: Paolo Bonzini 
---
 docs/about/deprecated.rst   |  7 --
 docs/about/removed-features.rst |  6 +
 include/qemu/config-file.h  |  1 -
 qemu-options.hx |  8 ++-
 softmmu/climain.c   | 20 
 util/qemu-config.c  | 42 -
 6 files changed, 8 insertions(+), 76 deletions(-)

diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 3fb330116d..bd1163eed1 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -79,13 +79,6 @@ library enabled as a cryptography provider.
 Neither the ``nettle`` library, or the built-in cryptography provider are
 supported on FIPS enabled hosts.
 
-``-writeconfig`` (since 6.0)
-'
-
-The ``-writeconfig`` option is not able to serialize the entire contents
-of the QEMU command line.  It is thus considered a failed experiment
-and deprecated, with no current replacement.
-
 Userspace local APIC with KVM (x86, since 6.0)
 ''
 
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index 4b831ea291..c313a22fd0 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -336,6 +336,12 @@ for the RISC-V ``virt`` machine and ``sifive_u`` machine.
 The ``-no-quit`` was a synonym for ``-display ...,window-close=off`` which
 should be used instead.
 
+``-writeconfig`` (removed in 7.1)
+'
+
+The ``-writeconfig`` option was not able to serialize the entire contents
+of the QEMU command line.  It is thus considered a failed experiment
+and removed without a replacement.
 
 QEMU Machine Protocol (QMP) commands
 
diff --git a/include/qemu/config-file.h b/include/qemu/config-file.h
index f605423321..321e7c7c03 100644
--- a/include/qemu/config-file.h
+++ b/include/qemu/config-file.h
@@ -12,7 +12,6 @@ void qemu_add_opts(QemuOptsList *list);
 void qemu_add_drive_opts(QemuOptsList *list);
 int qemu_global_option(const char *str);
 
-void qemu_config_write(FILE *fp);
 int qemu_config_parse(FILE *fp, QemuOptsList **lists, const char *fname,
   Error **errp);
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 34e9b32a5c..4c8e47a73c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -4622,18 +4622,14 @@ SRST
 ERST
 
 DEF("readconfig", HAS_ARG, QEMU_OPTION_readconfig,
-"-readconfig \n", QEMU_ARCH_ALL)
+"-readconfig \n"
+"read config file\n", QEMU_ARCH_ALL)
 SRST
 ``-readconfig file``
 Read device configuration from file. This approach is useful when
 you want to spawn QEMU process with many command line options but
 you don't want to exceed the command line character limit.
 ERST
-DEF("writeconfig", HAS_ARG, QEMU_OPTION_writeconfig,
-"-writeconfig \n"
-"read/write config file (deprecated)\n", QEMU_ARCH_ALL)
-SRST
-ERST
 
 DEF("no-user-config", 0, QEMU_OPTION_nouserconfig,
 "-no-user-config\n"
diff --git a/softmmu/climain.c b/softmmu/climain.c
index 92307b2bbd..160e4c201c 100644
--- a/softmmu/climain.c
+++ b/softmmu/climain.c
@@ -3534,26 +3534,6 @@ void qemu_init(int argc, char **argv, char **envp)
 display_remote++;
 break;
 #endif
-case QEMU_OPTION_writeconfig:
-{
-FILE *fp;
-warn_report("-writeconfig is deprecated and will go away 
without a replacement");
-if (strcmp(optarg, "-") == 0) {
-fp = stdout;
-} else {
-fp = fopen(optarg, "w");
-if (fp == NULL) {
-error_report("open %s: %s", optarg,
- strerror(errno));
-exit(1);
-}
-}
-qemu_config_write(fp);
-if (fp != stdout) {
-fclose(fp);
-}
-break;
-}
 case QEMU_OPTION_qtest:
 qtest_chrdev = optarg;
 break;
diff --git a/util/qemu-config.c b/util/qemu-config.c
index 436ab63b16..433488aa56 100644
--- a/util/qemu-config.c
+++ b/util/qemu-config.c
@@ -314,48 +314,6 @@ void qemu_add_opts(QemuOptsList *list)
 abort();
 }
 
-struct

Re: [PATCH v2 26/39] sysemu/os-win32: Test for and use _lock_file/_unlock_file

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> The bug referenced in os-win32.h was fixed in mingw-w64 v6.
>
> According to repology, version 5 used by ubuntu 18, which is
> not yet out of support, so provide a meson link test for it.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 25/39] include/qemu/log: Move entire implementation out-of-line

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Move QemuLogFile, qemu_logfile, and all inline functions into qemu/log.c.
> No need to expose these implementation details in the api.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

> ---
>  include/qemu/log.h| 38 --
>  tests/unit/test-logging.c |  1 +
>  util/log.c| 30 +-
>  3 files changed, 34 insertions(+), 35 deletions(-)
>
> diff --git a/include/qemu/log.h b/include/qemu/log.h
> index 75973111bb..42d545f77a 100644
> --- a/include/qemu/log.h
> +++ b/include/qemu/log.h
> @@ -3,46 +3,16 @@
>  
>  /* A small part of this API is split into its own header */
>  #include "qemu/log-for-trace.h"
> -#include "qemu/rcu.h"
> -
> -typedef struct QemuLogFile {
> -struct rcu_head rcu;
> -FILE *fd;
> -} QemuLogFile;
> -
> -/* Private global variable, don't use */
> -extern QemuLogFile *qemu_logfile;
> -
>  
>  /* 
>   * The new API:
> - *
>   */
>  
> -/* Log settings checking macros: */
> +/* Returns true if qemu_log() will really write somewhere. */
> +bool qemu_log_enabled(void);
>  
> -/* Returns true if qemu_log() will really write somewhere
> - */
> -static inline bool qemu_log_enabled(void)
> -{
> -return qemu_logfile != NULL;
> -}
> -
> -/* Returns true if qemu_log() will write somewhere else than stderr
> - */
> -static inline bool qemu_log_separate(void)
> -{
> -QemuLogFile *logfile;
> -bool res = false;
> -
> -rcu_read_lock();
> -logfile = qatomic_rcu_read(_logfile);
> -if (logfile && logfile->fd != stderr) {
> -res = true;
> -}
> -rcu_read_unlock();
> -return res;
> -}
> +/* Returns true if qemu_log() will write somewhere other than stderr. */
> +bool qemu_log_separate(void);
>  
>  #define CPU_LOG_TB_OUT_ASM (1 << 0)
>  #define CPU_LOG_TB_IN_ASM  (1 << 1)
> diff --git a/tests/unit/test-logging.c b/tests/unit/test-logging.c
> index dcb8ac70df..9b87af75af 100644
> --- a/tests/unit/test-logging.c
> +++ b/tests/unit/test-logging.c
> @@ -30,6 +30,7 @@
>  #include "qemu-common.h"
>  #include "qapi/error.h"
>  #include "qemu/log.h"
> +#include "qemu/rcu.h"
>  
>  static void test_parse_range(void)
>  {
> diff --git a/util/log.c b/util/log.c
> index caa38e707b..8b8b6a5d83 100644
> --- a/util/log.c
> +++ b/util/log.c
> @@ -26,14 +26,42 @@
>  #include "trace/control.h"
>  #include "qemu/thread.h"
>  #include "qemu/lockable.h"
> +#include "qemu/rcu.h"
> +
> +
> +typedef struct QemuLogFile {
> +struct rcu_head rcu;
> +FILE *fd;
> +} QemuLogFile;
>  
>  static char *logfilename;
>  static QemuMutex qemu_logfile_mutex;
> -QemuLogFile *qemu_logfile;
> +static QemuLogFile *qemu_logfile;
>  int qemu_loglevel;
>  static int log_append = 0;
>  static GArray *debug_regions;
>  
> +/* Returns true if qemu_log() will really write somewhere. */
> +bool qemu_log_enabled(void)
> +{
> +return qemu_logfile != NULL;
> +}
> +
> +/* Returns true if qemu_log() will write somewhere other than stderr. */
> +bool qemu_log_separate(void)
> +{
> +QemuLogFile *logfile;
> +bool res = false;
> +
> +rcu_read_lock();
> +logfile = qatomic_rcu_read(_logfile);
> +if (logfile && logfile->fd != stderr) {
> +res = true;
> +}
> +rcu_read_unlock();
> +return res;
> +}
> +
>  /* Lock/unlock output. */
>  
>  FILE *qemu_log_trylock(void)


-- 
Alex Bennée

Re: [PATCH v2 23/39] tests/unit: Do not reference QemuLogFile directly

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Use qemu_log_lock/unlock instead of the raw rcu_read.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH 1/1] qemu-img: properly list formats which have consistency check implemented

2022-04-14 Thread Kevin Wolf

Am 07.04.2022 um 10:39 hat Denis V. Lunev geschrieben:
> Simple grep for the .bdrv_co_check callback presence gives the following
> list of block drivers
> * QED
> * VDI
> * VHDX
> * VMDK
> * Parallels
> which have this callback. The presense of the callback means that
> consistency check is supported.
> 
> The patch updates documentation accordingly.
> 
> Signed-off-by: Denis V. Lunev 
> CC: Kevin Wolf 
> CC: Hanna Reitz 

Thanks, applied to the block branch.

Kevin

Re: [PATCH v2 19/39] util/log: Remove qemu_log_flush

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> All uses flush output immediately before or after qemu_log_unlock.
> Instead of a separate call, move the flush into qemu_log_unlock.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 22/39] linux-user: Expand log_page_dump inline

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> We have extra stuff to log at the same time.
> Hoist the qemu_log_lock/unlock to the caller and use fprintf.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 19/39] util/log: Remove qemu_log_flush

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> All uses flush output immediately before or after qemu_log_unlock.
> Instead of a separate call, move the flush into qemu_log_unlock.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 21/39] bsd-user: Expand log_page_dump inline

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> We have extra stuff to log at the same time.
> Hoist the qemu_log_trylock/unlock to the caller and use fprintf.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 20/39] util/log: Drop call to setvbuf

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Now that the log buffer is flushed after every qemu_log_unlock,
> which includes every call to qemu_log, we do not need to force
> line buffering (or unbuffering for windows).  Block buffer the
> entire loggable unit.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 18/39] util/log: Mark qemu_log_trylock as G_GNUC_WARN_UNUSED_RESULT

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Now that all uses have been updated, consider a missing
> test of the result of qemu_log_trylock a bug and Werror.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 17/39] util/log: Drop return value from qemu_log

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> The only user of this feature, tcg_dump_ops, has been
> converted to use fprintf directly.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

[PATCH v1] virtio/virtio.c: include virtio prefix in error message

2022-04-14 Thread Moteen Shah

From: Moteen Shah 

The error message in virtio_init_region_cache()
is given a prefix virtio.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/230
Buglink: https://bugs.launchpad.net/qemu/+bug/1919021``

Signed-off-by: Moteen Shah 
---
 hw/virtio/virtio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 9d637e043e..f31427bd41 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -174,7 +174,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, 
int n)
 len = address_space_cache_init(>desc, vdev->dma_as,
addr, size, packed);
 if (len < size) {
-virtio_error(vdev, "Cannot map desc");
+virtio_error(vdev, "Virtio cannot map desc");
 goto err_desc;
 }
 
@@ -182,7 +182,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, 
int n)
 len = address_space_cache_init(>used, vdev->dma_as,
vq->vring.used, size, true);
 if (len < size) {
-virtio_error(vdev, "Cannot map used");
+virtio_error(vdev, "Virtio cannot map used");
 goto err_used;
 }
 
@@ -190,7 +190,7 @@ static void virtio_init_region_cache(VirtIODevice *vdev, 
int n)
 len = address_space_cache_init(>avail, vdev->dma_as,
vq->vring.avail, size, false);
 if (len < size) {
-virtio_error(vdev, "Cannot map avail");
+virtio_error(vdev, "Virtio cannot map avail");
 goto err_avail;
 }
 
-- 
2.35.1

Re: [PATCH v2 15/39] target/nios2: Remove log_cpu_state from reset

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> This is redundant with the logging done in cpu_common_reset.
>
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

Re: [PATCH v2 14/39] accel/tcg: Use cpu_dump_state between qemu_log_trylock/unlock

2022-04-14 Thread Alex Bennée



Richard Henderson  writes:

> Inside log_cpu_state, we perform qemu_log_trylock/unlock, which need
> not be done if we have already performed the lock beforehand.
>
> Signed-off-by: Richard Henderson 

Reviewed-by: Alex Bennée 

-- 
Alex Bennée

[PATCH] tests/qtest: Move the fuzz tests to x86 only

2022-04-14 Thread Thomas Huth

The fuzz tests are currently scheduled for all targets, but their setup
code limits the run to "i386", so that these tests always show "SKIP"
on other targets. Move it to the right x86 list in meson.build, then
we can drop the architecture check during runtime, too.

Signed-off-by: Thomas Huth 
---
 tests/qtest/fuzz-lsi53c895a-test.c  |  8 ++--
 tests/qtest/fuzz-megasas-test.c | 12 
 tests/qtest/fuzz-sb16-test.c| 12 
 tests/qtest/fuzz-sdcard-test.c  | 12 
 tests/qtest/fuzz-virtio-scsi-test.c |  8 ++--
 tests/qtest/meson.build | 13 ++---
 6 files changed, 22 insertions(+), 43 deletions(-)

diff --git a/tests/qtest/fuzz-lsi53c895a-test.c 
b/tests/qtest/fuzz-lsi53c895a-test.c
index ba5d468970..031d9de241 100644
--- a/tests/qtest/fuzz-lsi53c895a-test.c
+++ b/tests/qtest/fuzz-lsi53c895a-test.c
@@ -39,14 +39,10 @@ static void test_lsi_do_dma_empty_queue(void)
 
 int main(int argc, char **argv)
 {
-const char *arch = qtest_get_arch();
-
 g_test_init(, , NULL);
 
-if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
-qtest_add_func("fuzz/lsi53c895a/lsi_do_dma_empty_queue",
-   test_lsi_do_dma_empty_queue);
-}
+qtest_add_func("fuzz/lsi53c895a/lsi_do_dma_empty_queue",
+   test_lsi_do_dma_empty_queue);
 
 return g_test_run();
 }
diff --git a/tests/qtest/fuzz-megasas-test.c b/tests/qtest/fuzz-megasas-test.c
index e1141c58a4..129b182f83 100644
--- a/tests/qtest/fuzz-megasas-test.c
+++ b/tests/qtest/fuzz-megasas-test.c
@@ -64,16 +64,12 @@ static void test_gitlab_issue521_megasas_sgl_ovf(void)
 
 int main(int argc, char **argv)
 {
-const char *arch = qtest_get_arch();
-
 g_test_init(, , NULL);
 
-if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
-qtest_add_func("fuzz/test_lp1878263_megasas_zero_iov_cnt",
-   test_lp1878263_megasas_zero_iov_cnt);
-qtest_add_func("fuzz/gitlab_issue521_megasas_sgl_ovf",
-   test_gitlab_issue521_megasas_sgl_ovf);
-}
+qtest_add_func("fuzz/test_lp1878263_megasas_zero_iov_cnt",
+   test_lp1878263_megasas_zero_iov_cnt);
+qtest_add_func("fuzz/gitlab_issue521_megasas_sgl_ovf",
+   test_gitlab_issue521_megasas_sgl_ovf);
 
 return g_test_run();
 }
diff --git a/tests/qtest/fuzz-sb16-test.c b/tests/qtest/fuzz-sb16-test.c
index f47a8bcdbd..91fdcd1e8a 100644
--- a/tests/qtest/fuzz-sb16-test.c
+++ b/tests/qtest/fuzz-sb16-test.c
@@ -55,15 +55,11 @@ static void test_fuzz_sb16_0xd4(void)
 
 int main(int argc, char **argv)
 {
-const char *arch = qtest_get_arch();
-
 g_test_init(, , NULL);
 
-   if (strcmp(arch, "i386") == 0) {
-qtest_add_func("fuzz/test_fuzz_sb16/1c", test_fuzz_sb16_0x1c);
-qtest_add_func("fuzz/test_fuzz_sb16/91", test_fuzz_sb16_0x91);
-qtest_add_func("fuzz/test_fuzz_sb16/d4", test_fuzz_sb16_0xd4);
-   }
+qtest_add_func("fuzz/test_fuzz_sb16/1c", test_fuzz_sb16_0x1c);
+qtest_add_func("fuzz/test_fuzz_sb16/91", test_fuzz_sb16_0x91);
+qtest_add_func("fuzz/test_fuzz_sb16/d4", test_fuzz_sb16_0xd4);
 
-   return g_test_run();
+return g_test_run();
 }
diff --git a/tests/qtest/fuzz-sdcard-test.c b/tests/qtest/fuzz-sdcard-test.c
index 0f94965a66..d0f4e0e93c 100644
--- a/tests/qtest/fuzz-sdcard-test.c
+++ b/tests/qtest/fuzz-sdcard-test.c
@@ -164,15 +164,11 @@ static void oss_fuzz_36391(void)
 
 int main(int argc, char **argv)
 {
-const char *arch = qtest_get_arch();
-
 g_test_init(, , NULL);
 
-   if (strcmp(arch, "i386") == 0) {
-qtest_add_func("fuzz/sdcard/oss_fuzz_29225", oss_fuzz_29225);
-qtest_add_func("fuzz/sdcard/oss_fuzz_36217", oss_fuzz_36217);
-qtest_add_func("fuzz/sdcard/oss_fuzz_36391", oss_fuzz_36391);
-   }
+qtest_add_func("fuzz/sdcard/oss_fuzz_29225", oss_fuzz_29225);
+qtest_add_func("fuzz/sdcard/oss_fuzz_36217", oss_fuzz_36217);
+qtest_add_func("fuzz/sdcard/oss_fuzz_36391", oss_fuzz_36391);
 
-   return g_test_run();
+return g_test_run();
 }
diff --git a/tests/qtest/fuzz-virtio-scsi-test.c 
b/tests/qtest/fuzz-virtio-scsi-test.c
index aaf6d10e18..c9b6fe2123 100644
--- a/tests/qtest/fuzz-virtio-scsi-test.c
+++ b/tests/qtest/fuzz-virtio-scsi-test.c
@@ -62,14 +62,10 @@ static void test_mmio_oob_from_memory_region_cache(void)
 
 int main(int argc, char **argv)
 {
-const char *arch = qtest_get_arch();
-
 g_test_init(, , NULL);
 
-if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
-qtest_add_func("fuzz/test_mmio_oob_from_memory_region_cache",
-   test_mmio_oob_from_memory_region_cache);
-}
+qtest_add_func("fuzz/test_mmio_oob_from_memory_region_cache",
+   test_mmio_oob_from_memory_region_cache);
 
 return g_test_run();
 }
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index 1709fc6ccb..22e1361210 100644
---

Re: [PATCH] tests/qtest: Run the fuzz-sdcard-test only on i386 and x86_64

2022-04-14 Thread Thomas Huth


On 14/04/2022 14.28, Thomas Huth wrote:

The fuzz-sdcard-test is currently scheduled for all targets,
but the code limits itself to "i386". Move it to the right
list in meson.build and allow it to be run on "x86_64", too.

While we're at it, also clean up the wrong indentation in
fuzz-sdcard-test.c (it was using 3 spaces instead of 4 in some
lines).

Signed-off-by: Thomas Huth 
---
  tests/qtest/fuzz-sdcard-test.c | 6 +++---


Oh, well, I just noticed that the other fuzz tests have the same issue, to 
(they only work on x86) ... so never mind this patch, I'll create a new one 
that moves the other fuzz tests, too.


 Thomas

Re: XIVE VFIO kernel resample failure in INTx mode under heavy load

2022-04-14 Thread Cédric Le Goater





After re-reading what I just wrote, I am leaning towards disabling use of 
KVM_CAP_IRQFD_RESAMPLE as it seems last worked on POWER8 and never since :)

Did I miss something in the picture (hey Cedric)?


How about disabling it like this?

=
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 5bfd4aa9e5aa..c999f7b1ab1b 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -732,7 +732,7 @@ static PCIINTxRoute spapr_route_intx_pin_to_irq(void 
*opaque, int pin)
  SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(opaque);
  PCIINTxRoute route;

-    route.mode = PCI_INTX_ENABLED;
+    route.mode = PCI_INTX_DISABLED;

=


I like it.


You now know how to test all the combinations :) Prepare your matrix,
variables are :

 * Host OS  POWER8, POWER9+
 * KVM device   XICS (P8), XICS-on-XIVE (P9), XIVE-on-XIVE (P9)
 * kernel_irqchip   off, on
 * ic-mode  xics, xive
 * Guest OS msi or nomsi

Ideally you should check TCG, but that's like kernel_irqchip=off.

Cheers,

C.
 


(btw what the heck is PCI_INTX_INVERTED for?)


--
Alexey

Re: XIVE VFIO kernel resample failure in INTx mode under heavy load

2022-04-14 Thread Cédric Le Goater


Hello Alexey,

Thanks for taking over.


On 4/13/22 06:56, Alexey Kardashevskiy wrote:



On 3/17/22 06:16, Cédric Le Goater wrote:

Timothy,

On 3/16/22 17:29, Cédric Le Goater wrote:

Hello,



I've been struggling for some time with what is looking like a
potential bug in QEMU/KVM on the POWER9 platform.  It appears that
in XIVE mode, when the in-kernel IRQ chip is enabled, an external
device that rapidly asserts IRQs via the legacy INTx level mechanism
will only receive one interrupt in the KVM guest.


Indeed. I could reproduce with a pass-through PCI adapter using
'pci=nomsi'. The virtio devices operate correctly but the network
adapter only receives one event (*):


$ cat /proc/interrupts
    CPU0   CPU1   CPU2   CPU3   CPU4 CPU5   CPU6
   CPU7
  16:   2198   1378   1519   1216  0 0  0   
   0  XIVE-IPI   0 Edge  IPI-0
  17:  0  0  0  0   2003 1936   1335
   1507  XIVE-IPI   1 Edge  IPI-1
  18:  0   6401  0  0  0 0  0   
   0  XIVE-IRQ 4609 Level virtio3, virtio0, virtio2
  19:  0  0  0  0  0 204  0 
 0  XIVE-IRQ 4610 Level virtio1
  20:  0  0  0  0  0 0  0   
   0  XIVE-IRQ 4608 Level xhci-hcd:usb1
  21:  0  1  0  0  0 0  0   
   0  XIVE-IRQ 4612 Level eth1 (*)
  23:  0  0  0  0  0 0  0   
   0  XIVE-IRQ 4096 Edge  RAS_EPOW
  24:  0  0  0  0  0 0  0   
   0  XIVE-IRQ 4592 Edge  hvc_console
  26:  0  0  0  0  0 0  0   
   0  XIVE-IRQ 4097 Edge  RAS_HOTPLUG


Changing any one of those items appears to avoid the glitch, e.g. XICS


XICS is very different from XIVE. The driver implements the previous
interrupt controller architecture (P5-P8) and the hypervisor mediates
the delivery to the guest. With XIVE, vCPUs are directly signaled by
the IC. When under KVM, we use different KVM devices for each mode :

* KVM XIVE is a XICS-on-XIVE implementation (P9/P10 hosts) for guests
   not using the XIVE native interface. RHEL7 for instance.
* KVM XIVE native is a XIVE implementation (P9/P10 hosts) for guests
   using the XIVE native interface. Linux > 4.14.
* KVM XICS is for P8 hosts (no XIVE HW)

VFIO adds some complexity with the source events. I think the problem
comes from the assertion state. I will talk about it later.


mode with the in-kernel IRQ chip works (all interrupts are passed
through),


All interrupts are passed through using XIVE also. Run 'info pic' in
the monitor. On the host, check the IRQ mapping in the debugfs file :

   /sys/kernel/debug/powerpc/kvm-xive-*

and XIVE mode with the in-kernel IRQ chip disabled also works. 


In that case, no KVM device backs the QEMU device and all state
is in one place.


We
are also not seeing any problems in XIVE mode with the in-kernel
chip from MSI/MSI-X devices.


Yes. pass-through devices are expected to operate correctly :)


The device in question is a real time card that needs to raise an
interrupt every 1ms.  It works perfectly on the host, but fails in
the guest -- with the in-kernel IRQ chip and XIVE enabled, it
receives exactly one interrupt, at which point the host continues to
see INTx+ but the guest sees INTX-, and the IRQ handler in the guest
kernel is never reentered.


ok. Same symptom as the scenario above.


We have also seen some very rare glitches where, over a long period
of time, we can enter a similar deadlock in XICS mode.


with the in-kernel XICS IRQ chip ?


Disabling
the in-kernel IRQ chip in XIVE mode will also lead to the lockup
with this device, since the userspace IRQ emulation cannot keep up
with the rapid interrupt firing (measurements show around 100ms
required for processing each interrupt in the user mode).


MSI emulation in QEMU is slower indeed (35%). LSI is very slow because
it is handled as a special case in the device/driver. To maintain the
assertion state, all LSI handling is done with a special HCALL :
H_INT_ESB which is implemented in QEMU. This generates a lot of back
and forth in the KVM stack.


My understanding is the resample mechanism does some clever tricks
with level IRQs, but that QEMU needs to check if the IRQ is still
asserted by the device on guest EOI.


Yes. the problem is in that area.


Since a failure here would
explain these symptoms I'm wondering if there is a bug in either
QEMU or KVM for POWER / pSeries (SPAPr) where the IRQ is not
resampled and therefore not re-fired in the guest?


KVM I would say. The assertion state is maintained in KVM for the KVM
XICS-on-XIVE implementation and in QEMU for the KVM XIVE native
device. These are good candidates. I will take a look.


All works

[PATCH] tests/qtest: Run the fuzz-sdcard-test only on i386 and x86_64

2022-04-14 Thread Thomas Huth

The fuzz-sdcard-test is currently scheduled for all targets,
but the code limits itself to "i386". Move it to the right
list in meson.build and allow it to be run on "x86_64", too.

While we're at it, also clean up the wrong indentation in
fuzz-sdcard-test.c (it was using 3 spaces instead of 4 in some
lines).

Signed-off-by: Thomas Huth 
---
 tests/qtest/fuzz-sdcard-test.c | 6 +++---
 tests/qtest/meson.build| 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tests/qtest/fuzz-sdcard-test.c b/tests/qtest/fuzz-sdcard-test.c
index 0f94965a66..7707038a71 100644
--- a/tests/qtest/fuzz-sdcard-test.c
+++ b/tests/qtest/fuzz-sdcard-test.c
@@ -168,11 +168,11 @@ int main(int argc, char **argv)
 
 g_test_init(, , NULL);
 
-   if (strcmp(arch, "i386") == 0) {
+if (g_str_equal(arch, "i386") || g_str_equal(arch, "x86_64")) {
 qtest_add_func("fuzz/sdcard/oss_fuzz_29225", oss_fuzz_29225);
 qtest_add_func("fuzz/sdcard/oss_fuzz_36217", oss_fuzz_36217);
 qtest_add_func("fuzz/sdcard/oss_fuzz_36391", oss_fuzz_36391);
-   }
+}
 
-   return g_test_run();
+return g_test_run();
 }
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index 1709fc6ccb..f3bee0b4c5 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -22,7 +22,6 @@ qtests_generic = \
   (config_all_devices.has_key('CONFIG_LSI_SCSI_PCI') ? 
['fuzz-lsi53c895a-test'] : []) + \
   (config_all_devices.has_key('CONFIG_VIRTIO_SCSI') ? 
['fuzz-virtio-scsi-test'] : []) + \
   (config_all_devices.has_key('CONFIG_SB16') ? ['fuzz-sb16-test'] : []) + \
-  (config_all_devices.has_key('CONFIG_SDHCI_PCI') ? ['fuzz-sdcard-test'] : []) 
+ \
   [
   'cdrom-test',
   'device-introspect-test',
@@ -67,6 +66,7 @@ qtests_i386 = \
   (config_all_devices.has_key('CONFIG_TPM_TIS_ISA') ? ['tpm-tis-swtpm-test'] : 
[]) +\
   (config_all_devices.has_key('CONFIG_RTL8139_PCI') ? ['rtl8139-test'] : []) + 
 \
   (config_all_devices.has_key('CONFIG_E1000E_PCI_EXPRESS') ? 
['fuzz-e1000e-test'] : []) +   \
+  (config_all_devices.has_key('CONFIG_SDHCI_PCI') ? ['fuzz-sdcard-test'] : []) 
+\
   (config_all_devices.has_key('CONFIG_ESP_PCI') ? ['am53c974-test'] : []) +
 \
   (config_all_devices.has_key('CONFIG_ACPI_ERST') ? ['erst-test'] : []) +  
  \
   (config_all_devices.has_key('CONFIG_VIRTIO_NET') and 
 \
-- 
2.27.0

Re: Support for x86_64 on aarch64 emulation

2022-04-14 Thread Redha Gouicem

I will start working on a cleaner patch to allow the correct memory
ordering enforcement to be implemented, as my current PoC is very hacky.
I'll post it to the mailing list as soon as I'm done.

Redha

On 08/04/2022 17:27, Richard Henderson wrote:
> On 4/8/22 05:21, Redha Gouicem wrote:
>> We are working on support for x86_64 emulation on aarch64, mainly
>> related to memory ordering issues. We first wanted to know what the
>> community thinks about our proposal, and its chance of getting merged
>> one day.
>>
>> Note that we worked with qemu-user, so there may be issues in system
>> mode that we missed.
>>
>> # Problem
>>
>> When generating the TCG instructions for memory accesses, fences are
>> always inserted *before* the access, following this translation rule:
>>
>>  x86   --> TCG -->    aarch64
>>  -
>>  RMOV  -->  Fm_ld; ld  -->  DMBLD; LDR
>>  WMOV  -->  Fm_st; st  -->  DMBFF; STR
>>
>> Here, Fm_ld is a fence that orders any preceding memory access with
>> the subsequent load. F_m_st is a fence that orders any preceding
>> memory access with the subsequent store. This means that, in TCG, all
>> memory accesses are ordered by fences. Thus, no memory accesses can be
>> re-ordered in TCG. This is a problem, because it is *stricter than
>> x86*. Consider when a program contains:
>>
>>  WMOV; RMOV
>>
>>
>> x86 allows re-ordering independent store-load pairs, so the above pair
>> can safely re-order on an x86 host. However, with QEMU's current
>> translation, it becomes:
>>
>>  DMBFF; STR; DMBLD; LDR
>>
>> In this target aarch64 code, no re-ordering is possible. Hence, QEMU
>> enforces a stronger model than x86. While that is correct, it harms
>> performance.
>>
>> # Solution
>>
>> We propose an alternative scheme, which we formally proved correct
>> (paper under review):
>>
>>  x86   -->  TCG    -->    aarch64
>>  -
>>  RMOV  -->  ld; Fld_m  -->  LDR; DMBLD
>>  WMOV  -->  Fst_st; st -->  DMBST; STR
>>
>> This new scheme precisely captures the observable behaviors of the
>> input program (in x86's memory model). This behavior is preserved in
>> the resulting TCG and aarch64 programs. Which the inserted fences
>> enforce (formally verified). Note that this scheme enforces fewer
>> ordering than the previous (unnecessarily strong) mapping scheme. This
>> new scheme benefits performance. We evaluated this on benchmarks
>> (PARSEC) and got up to 19.7% improvement, 6.7% on average.
>>
>> # Implementation Considerations
>>   Different (source and host) architectures may demand different such
>> mapping schemes. Some schemes may place fences before an instruction,
>> while others place them after. The implementation of fence placement
>> should thus be sufficiently flexible that either is possible. Though,
>> note that write-read pairs are unordered in almost all architectures.
>>   We see two ways of doing this:
>> - extracting the placement of the fence from the
>>    tcg_gen_qemu_ld/st_i32/i64 functions, and have each architecture
>>    explicitly generate the fence at the correct place
>> - adding two parameters to these functions specifying the strength of
>>    the "before" and "after" fences. The function would then generate
>>    both fences in the IR (one of them may be a NOP fence), which in
>>    turn will be translated back to the host
> 
> This has been on my to-do list for quite some time.  My previous work was
> 
> https://patchew.org/QEMU/20210316220735.2048137-1-richard.hender...@linaro.org/
> 
> I have some further work (possibly not posted?  I can't find a reference) 
> which attempted to strength reduce the barriers, and to use 
> load-acquire/store-release insns when alignment of the operation allows.  
> Unfortunately, for the interesting cases in question (x86 and s390x guests, 
> with the strongest guest memory models), it was rare that we could prove the 
> alignment was sufficient, so it was a fair amount of work being done for no 
> gain.
> 
> 
> r~

-- 
Redha Gouicem
Post doctoral researcher
Chair of Decentralized Systems Engineering
Department of Informatics, Technical University of Munich (TUM)

Re: [PATCH] docs: Correct the default thread-pool-size

2022-04-14 Thread Vivek Goyal

On Wed, Apr 13, 2022 at 12:20:54PM +0800, Liu Yiding wrote:
> Refer to 26ec190964 virtiofsd: Do not use a thread pool by default
> 
> Signed-off-by: Liu Yiding 

Looks good. Our default used to be --thread-pool-size=64. But we changed
it to using no thread pool because on lower end of workloads it performed
better. When multiple threads are doing parallel I/O then, thread pool
helps. So people who want to do lots of parallel I/O should manually
enable thread pool.

Acked-by: Vivek Goyal 

Vivek
> ---
>  docs/tools/virtiofsd.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
> index 0c0560203c..33fed08c6f 100644
> --- a/docs/tools/virtiofsd.rst
> +++ b/docs/tools/virtiofsd.rst
> @@ -127,7 +127,7 @@ Options
>  .. option:: --thread-pool-size=NUM
>  
>Restrict the number of worker threads per request queue to NUM.  The 
> default
> -  is 64.
> +  is 0.
>  
>  .. option:: --cache=none|auto|always
>  
> -- 
> 2.31.1
> 
> 
> 
>

[PATCH] tests/qtest: Enable more tests for the "mipsel" target

2022-04-14 Thread Thomas Huth

Allow the same set of tests for all MIPS targets, so that "mipsel"
now gets some additional test coverage, too. While we're at it,
simplify the definitions for qtests_mips64 and qtests_mips64el.

Signed-off-by: Thomas Huth 
---
 tests/qtest/endianness-test.c |  1 +
 tests/qtest/meson.build   | 14 +++---
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/tests/qtest/endianness-test.c b/tests/qtest/endianness-test.c
index 9c03b72dc9..2f5a88bf4c 100644
--- a/tests/qtest/endianness-test.c
+++ b/tests/qtest/endianness-test.c
@@ -28,6 +28,7 @@ struct TestCase {
 static const TestCase test_cases[] = {
 { "i386", "pc", -1 },
 { "mips", "malta", 0x1000, .bswap = true },
+{ "mipsel", "malta", 0x1000 },
 { "mips64", "magnum", 0x9000, .bswap = true },
 { "mips64", "pica61", 0x9000, .bswap = true },
 { "mips64", "malta", 0x1000, .bswap = true },
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index d25f82bb5a..1709fc6ccb 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -143,17 +143,9 @@ qtests_mips = \
   (config_all_devices.has_key('CONFIG_ISA_TESTDEV') ? ['endianness-test'] : 
[]) +\
   (config_all_devices.has_key('CONFIG_VGA') ? ['display-vga-test'] : [])
 
-qtests_mips64 = \
-  ['test-filter-mirror', 'test-filter-redirector'] + \
-  (slirp.found() ? ['test-netfilter'] : []) + \
-  (config_all_devices.has_key('CONFIG_ISA_TESTDEV') ? ['endianness-test'] : 
[]) +\
-  (config_all_devices.has_key('CONFIG_VGA') ? ['display-vga-test'] : [])
-
-qtests_mips64el = \
-  ['test-filter-mirror', 'test-filter-redirector'] + \
-  (slirp.found() ? ['test-netfilter'] : []) + \
-  (config_all_devices.has_key('CONFIG_ISA_TESTDEV') ? ['endianness-test'] : 
[]) +\
-  (config_all_devices.has_key('CONFIG_VGA') ? ['display-vga-test'] : [])
+qtests_mipsel = qtests_mips
+qtests_mips64 = qtests_mips
+qtests_mips64el = qtests_mips
 
 qtests_ppc = \
   ['test-filter-mirror', 'test-filter-redirector'] + \
-- 
2.27.0

Re: [RFC PATCH 0/4] 9pfs: Add 9pfs support for Windows host

2022-04-14 Thread Christian Schoenebeck

On Mittwoch, 13. April 2022 05:30:57 CEST Shi, Guohuai wrote:
> > We have 3 fs drivers: local, synth, proxy. I don't mind about proxy, it is
> > in  bad shape and we will probably deprecate it in near future anyway.
> > But it would be good to have support for the synth driver, because we are
> > using it for running test cases and fuzzing tests (QA).
> 
> synth driver can not be built on Windows platform (or cross build on
> Linux). So the test cases can not work on Windows.

Could you please be more specific what kind of challenge you see for making 
the synth driver working on Windows? The synth driver is just a simple mockup 
driver [1] that simulates in-RAM-only a filesystem with a bunch of hard coded 
dirs and files, solely for the purpose to run test cases. So the synth driver 
does not interact with any real filesystem on host at all. My expectation 
therefore would be that it just needs to tweak some header includes and maybe 
declaring missing POSIX data types, which you have done for the local driver 
already anyway.

BTW support for macOS hosts has just been recently added for 9p, I know it is 
different as its a POSIX OS, but maybe you might still find the diff [2] 
helpful.

[1] https://wiki.qemu.org/Documentation/9p#9p_Filesystem_Drivers
[2] https://gitlab.com/qemu-project/qemu/-/commit/f45cc81911adc772

> > What are the limitations against security_model=mapped on Windows? Keep in
> >  mind that with security_model=none you are very limited in what you can
> > do with 9p.
> 
> MSYS library does not support extend attribute (e.g. getxattr),
> And does not support POSIX permission APIs (e.g. chmod, chown).
> Security model is useless on Windows host.

That would be security_model=passthrough, yes, that's not possible with msys. 
The recommended way in practice though is using security_model=mapped [3] for 
all systems, which should be possible to achieve with msys as well ...

[3] https://wiki.qemu.org/Documentation/9psetup#Starting_the_Guest_directly

> It is possible that to "map" extend attribute to NTFS stream data.
> However, if Windows host media is not NTFS (e.g. FAT) which does not support
> stream data, then the "map" can not work.

... yes exactly, it would make sense to use ADS [4] instead of xattr on 
Windows. ADS are available with NTFS and ReFS and maybe also with exFAT 
nowadays (?), not sure about the latter though. But I think it is fair enough 
to assume Windows users to either use NTFS or ReFS. And if they don't, you can 
still call error_report_once() to make user aware that
seucrity_model=mapped(-xattr) requires a fileystem on Windows that supports 
ADS.

[4] https://en.wikipedia.org/wiki/NTFS#Alternate_data_stream_(ADS)

Best regards,
Christian Schoenebeck

[RFC 15/18] vfio/iommufd: Implement iommufd backend

2022-04-14 Thread Yi Liu

Add the iommufd backend. The IOMMUFD container class is implemented
based on the new /dev/iommu user API. This backend obviously depends
on CONFIG_IOMMUFD.

So far, the iommufd backend doesn't support live migration and
cache coherency yet due to missing support in the host kernel meaning
that only a subset of the container class callbacks is implemented.

Co-authored-by: Eric Auger 
Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/as.c |   2 +-
 hw/vfio/iommufd.c| 545 +++
 hw/vfio/meson.build  |   3 +
 hw/vfio/pci.c|  10 +
 hw/vfio/trace-events |  11 +
 include/hw/vfio/vfio-common.h|  18 +
 include/hw/vfio/vfio-container-obj.h |   1 +
 7 files changed, 589 insertions(+), 1 deletion(-)
 create mode 100644 hw/vfio/iommufd.c

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 4abaa4068f..94618efd1f 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -41,7 +41,7 @@
 #include "qapi/error.h"
 #include "migration/migration.h"
 
-static QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces =
+VFIOAddressSpaceList vfio_address_spaces =
 QLIST_HEAD_INITIALIZER(vfio_address_spaces);
 
 void vfio_host_win_add(VFIOContainer *container,
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
new file mode 100644
index 00..f8375f1672
--- /dev/null
+++ b/hw/vfio/iommufd.c
@@ -0,0 +1,545 @@
+/*
+ * iommufd container backend
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu 
+ *  Eric Auger 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see .
+ */
+
+#include "qemu/osdep.h"
+#include 
+#include 
+
+#include "hw/vfio/vfio-common.h"
+#include "qemu/error-report.h"
+#include "trace.h"
+#include "qapi/error.h"
+#include "hw/iommufd/iommufd.h"
+#include "hw/qdev-core.h"
+#include "sysemu/reset.h"
+#include "qemu/cutils.h"
+
+static bool iommufd_check_extension(VFIOContainer *bcontainer,
+VFIOContainerFeature feat)
+{
+switch (feat) {
+default:
+return false;
+};
+}
+
+static int iommufd_map(VFIOContainer *bcontainer, hwaddr iova,
+   ram_addr_t size, void *vaddr, bool readonly)
+{
+VFIOIOMMUFDContainer *container = container_of(bcontainer,
+   VFIOIOMMUFDContainer, obj);
+
+return iommufd_map_dma(container->iommufd, container->ioas_id,
+   iova, size, vaddr, readonly);
+}
+
+static int iommufd_unmap(VFIOContainer *bcontainer,
+ hwaddr iova, ram_addr_t size,
+ IOMMUTLBEntry *iotlb)
+{
+VFIOIOMMUFDContainer *container = container_of(bcontainer,
+   VFIOIOMMUFDContainer, obj);
+
+/* TODO: Handle dma_unmap_bitmap with iotlb args (migration) */
+return iommufd_unmap_dma(container->iommufd,
+ container->ioas_id, iova, size);
+}
+
+static int vfio_get_devicefd(const char *sysfs_path, Error **errp)
+{
+long int vfio_id = -1, ret = -ENOTTY;
+char *path, *tmp = NULL;
+DIR *dir;
+struct dirent *dent;
+struct stat st;
+gchar *contents;
+gsize length;
+int major, minor;
+dev_t vfio_devt;
+
+path = g_strdup_printf("%s/vfio-device", sysfs_path);
+if (stat(path, ) < 0) {
+error_setg_errno(errp, errno, "no such host device");
+goto out;
+}
+
+dir = opendir(path);
+if (!dir) {
+error_setg_errno(errp, errno, "couldn't open dirrectory %s", path);
+goto out;
+}
+
+while ((dent = readdir(dir))) {
+const char *end_name;
+
+if (!strncmp(dent->d_name, "vfio", 4)) {
+ret = qemu_strtol(dent->d_name + 4, _name, 10, _id);
+if (ret) {
+error_setg(errp, "suspicious vfio* file in %s", path);
+goto out;
+}
+break;
+}
+}
+
+/* check if the major:minor matches */
+tmp = g_strdup_printf("%s/%s/dev", path, dent->d_name);
+if (!g_file_get_contents(tmp, , , NULL)) {
+error_setg(errp, "failed to load \"%s\"", tmp);
+goto out;
+}
+
+if (sscanf(contents, "%d:%d", , ) != 2) {
+error_setg(errp, "failed to load \"%s\"", tmp);
+goto out;
+

[RFC 11/18] vfio/ccw: Use vfio_[attach/detach]_device

2022-04-14 Thread Yi Liu

From: Eric Auger 

Let the vfio-ccw device use vfio_attach_device() and
vfio_detach_device(), hence hiding the details of the used
IOMMU backend.

Also now all the devices have been migrated to use the new
vfio_attach_device/vfio_detach_device API, let's turn the
legacy functions into static functions, local to container.c.

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/ccw.c | 118 --
 hw/vfio/container.c   |   8 +--
 include/hw/vfio/vfio-common.h |   4 --
 3 files changed, 32 insertions(+), 98 deletions(-)

diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 0354737666..6fde7849cc 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -579,27 +579,32 @@ static void vfio_ccw_put_region(VFIOCCWDevice *vcdev)
 g_free(vcdev->io_region);
 }
 
-static void vfio_ccw_put_device(VFIOCCWDevice *vcdev)
-{
-g_free(vcdev->vdev.name);
-vfio_put_base_device(>vdev);
-}
-
-static void vfio_ccw_get_device(VFIOGroup *group, VFIOCCWDevice *vcdev,
-Error **errp)
+static void vfio_ccw_realize(DeviceState *dev, Error **errp)
 {
+CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
+S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
+VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
+S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
+VFIODevice *vbasedev = >vdev;
+Error *err = NULL;
 char *name = g_strdup_printf("%x.%x.%04x", vcdev->cdev.hostid.cssid,
  vcdev->cdev.hostid.ssid,
  vcdev->cdev.hostid.devid);
-VFIODevice *vbasedev;
+int ret;
 
-QLIST_FOREACH(vbasedev, >device_list, next) {
-if (strcmp(vbasedev->name, name) == 0) {
-error_setg(errp, "vfio: subchannel %s has already been attached",
-   name);
-goto out_err;
+/* Call the class init function for subchannel. */
+if (cdc->realize) {
+cdc->realize(cdev, vcdev->vdev.sysfsdev, );
+if (err) {
+goto out_err_propagate;
 }
 }
+vbasedev->sysfsdev = g_strdup_printf("/sys/bus/css/devices/%s/%s",
+ name, cdev->mdevid);
+vbasedev->ops = _ccw_ops;
+vbasedev->type = VFIO_DEVICE_TYPE_CCW;
+vbasedev->name = name;
+vbasedev->dev = >cdev.parent_obj.parent_obj;
 
 /*
  * All vfio-ccw devices are believed to operate in a way compatible with
@@ -609,80 +614,18 @@ static void vfio_ccw_get_device(VFIOGroup *group, 
VFIOCCWDevice *vcdev,
  * needs to be set before vfio_get_device() for vfio common to handle
  * ram_block_discard_disable().
  */
-vcdev->vdev.ram_block_discard_allowed = true;
-
-if (vfio_get_device(group, vcdev->cdev.mdevid, >vdev, errp)) {
-goto out_err;
-}
-
-vcdev->vdev.ops = _ccw_ops;
-vcdev->vdev.type = VFIO_DEVICE_TYPE_CCW;
-vcdev->vdev.name = name;
-vcdev->vdev.dev = >cdev.parent_obj.parent_obj;
-
-return;
-
-out_err:
-g_free(name);
-}
-
-static VFIOGroup *vfio_ccw_get_group(S390CCWDevice *cdev, Error **errp)
-{
-char *tmp, group_path[PATH_MAX];
-ssize_t len;
-int groupid;
 
-tmp = g_strdup_printf("/sys/bus/css/devices/%x.%x.%04x/%s/iommu_group",
-  cdev->hostid.cssid, cdev->hostid.ssid,
-  cdev->hostid.devid, cdev->mdevid);
-len = readlink(tmp, group_path, sizeof(group_path));
-g_free(tmp);
+vbasedev->ram_block_discard_allowed = true;
 
-if (len <= 0 || len >= sizeof(group_path)) {
-error_setg(errp, "vfio: no iommu_group found");
-return NULL;
-}
-
-group_path[len] = 0;
-
-if (sscanf(basename(group_path), "%d", ) != 1) {
-error_setg(errp, "vfio: failed to read %s", group_path);
-return NULL;
-}
-
-return vfio_get_group(groupid, _space_memory, errp);
-}
-
-static void vfio_ccw_realize(DeviceState *dev, Error **errp)
-{
-VFIOGroup *group;
-CcwDevice *ccw_dev = DO_UPCAST(CcwDevice, parent_obj, dev);
-S390CCWDevice *cdev = DO_UPCAST(S390CCWDevice, parent_obj, ccw_dev);
-VFIOCCWDevice *vcdev = DO_UPCAST(VFIOCCWDevice, cdev, cdev);
-S390CCWDeviceClass *cdc = S390_CCW_DEVICE_GET_CLASS(cdev);
-Error *err = NULL;
-
-/* Call the class init function for subchannel. */
-if (cdc->realize) {
-cdc->realize(cdev, vcdev->vdev.sysfsdev, );
-if (err) {
-goto out_err_propagate;
-}
-}
-
-group = vfio_ccw_get_group(cdev, );
-if (!group) {
-goto out_group_err;
-}
-
-vfio_ccw_get_device(group, vcdev, );
-if (err) {
-goto out_device_err;
+ret = vfio_attach_device(vbasedev, _space_memory, errp);
+if (ret) {
+g_free(vbasedev->name);
+g_free(vbasedev->sysfsdev);
 }
 
 vfio_ccw_get_region(vcdev, );
 if (err) {
-goto out_region_err;
+goto

[RFC 10/18] vfio/ap: Use vfio_[attach/detach]_device

2022-04-14 Thread Yi Liu

From: Eric Auger 

Let the vfio-ap device use vfio_attach_device() and
vfio_detach_device(), hence hiding the details of the used
IOMMU backend.

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/ap.c | 62 
 1 file changed, 9 insertions(+), 53 deletions(-)

diff --git a/hw/vfio/ap.c b/hw/vfio/ap.c
index e0dd561e85..286ac638e5 100644
--- a/hw/vfio/ap.c
+++ b/hw/vfio/ap.c
@@ -50,58 +50,17 @@ struct VFIODeviceOps vfio_ap_ops = {
 .vfio_compute_needs_reset = vfio_ap_compute_needs_reset,
 };
 
-static void vfio_ap_put_device(VFIOAPDevice *vapdev)
-{
-g_free(vapdev->vdev.name);
-vfio_put_base_device(>vdev);
-}
-
-static VFIOGroup *vfio_ap_get_group(VFIOAPDevice *vapdev, Error **errp)
-{
-GError *gerror = NULL;
-char *symlink, *group_path;
-int groupid;
-
-symlink = g_strdup_printf("%s/iommu_group", vapdev->vdev.sysfsdev);
-group_path = g_file_read_link(symlink, );
-g_free(symlink);
-
-if (!group_path) {
-error_setg(errp, "%s: no iommu_group found for %s: %s",
-   TYPE_VFIO_AP_DEVICE, vapdev->vdev.sysfsdev, 
gerror->message);
-g_error_free(gerror);
-return NULL;
-}
-
-if (sscanf(basename(group_path), "%d", ) != 1) {
-error_setg(errp, "vfio: failed to read %s", group_path);
-g_free(group_path);
-return NULL;
-}
-
-g_free(group_path);
-
-return vfio_get_group(groupid, _space_memory, errp);
-}
-
 static void vfio_ap_realize(DeviceState *dev, Error **errp)
 {
-int ret;
-char *mdevid;
-VFIOGroup *vfio_group;
 APDevice *apdev = AP_DEVICE(dev);
 VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
+VFIODevice *vbasedev = >vdev;
+int ret;
 
-vfio_group = vfio_ap_get_group(vapdev, errp);
-if (!vfio_group) {
-return;
-}
-
-vapdev->vdev.ops = _ap_ops;
-vapdev->vdev.type = VFIO_DEVICE_TYPE_AP;
-mdevid = basename(vapdev->vdev.sysfsdev);
-vapdev->vdev.name = g_strdup_printf("%s", mdevid);
-vapdev->vdev.dev = dev;
+vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+vbasedev->ops = _ap_ops;
+vbasedev->type = VFIO_DEVICE_TYPE_AP;
+vbasedev->dev = dev;
 
 /*
  * vfio-ap devices operate in a way compatible with discarding of
@@ -111,7 +70,7 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
  */
 vapdev->vdev.ram_block_discard_allowed = true;
 
-ret = vfio_get_device(vfio_group, mdevid, >vdev, errp);
+ret = vfio_attach_device(vbasedev, _space_memory, errp);
 if (ret) {
 goto out_get_dev_err;
 }
@@ -119,18 +78,15 @@ static void vfio_ap_realize(DeviceState *dev, Error **errp)
 return;
 
 out_get_dev_err:
-vfio_ap_put_device(vapdev);
-vfio_put_group(vfio_group);
+vfio_detach_device(vbasedev);
 }
 
 static void vfio_ap_unrealize(DeviceState *dev)
 {
 APDevice *apdev = AP_DEVICE(dev);
 VFIOAPDevice *vapdev = VFIO_AP_DEVICE(apdev);
-VFIOGroup *group = vapdev->vdev.group;
 
-vfio_ap_put_device(vapdev);
-vfio_put_group(group);
+vfio_detach_device(>vdev);
 }
 
 static Property vfio_ap_properties[] = {
-- 
2.27.0

[RFC 18/18] vfio/pci: Add an iommufd option

2022-04-14 Thread Yi Liu

From: Eric Auger 

This auto/on/off option allows the user to force a the select
the iommu BE (iommufd or legacy).

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/pci.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cf5703f94b..70a4c2b0a8 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,6 +42,8 @@
 #include "qapi/error.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
+#include "qapi/visitor.h"
+#include "qapi/qapi-visit-common.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -3246,6 +3248,26 @@ static Property vfio_pci_dev_properties[] = {
 DEFINE_PROP_END_OF_LIST(),
 };
 
+static void get_iommu_be(Object *obj, Visitor *v, const char *name,
+ void *opaque, Error **errp)
+{
+VFIOPCIDevice *vdev = VFIO_PCI(obj);
+VFIODevice *vbasedev = >vbasedev;
+OnOffAuto iommufd_be = vbasedev->iommufd_be;
+
+visit_type_OnOffAuto(v, name, _be, errp);
+}
+
+static void set_iommu_be(Object *obj, Visitor *v, const char *name,
+ void *opaque, Error **errp)
+{
+VFIOPCIDevice *vdev = VFIO_PCI(obj);
+VFIODevice *vbasedev = >vbasedev;
+
+visit_type_OnOffAuto(v, name, >iommufd_be, errp);
+}
+
+
 static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
@@ -3253,6 +3275,10 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, 
void *data)
 
 dc->reset = vfio_pci_reset;
 device_class_set_props(dc, vfio_pci_dev_properties);
+object_class_property_add(klass, "iommufd", "OnOffAuto",
+  get_iommu_be, set_iommu_be, NULL, NULL);
+object_class_property_set_description(klass, "iommufd",
+  "Enable iommufd backend");
 dc->desc = "VFIO-based PCI device assignment";
 set_bit(DEVICE_CATEGORY_MISC, dc->categories);
 pdc->realize = vfio_realize;
-- 
2.27.0

[RFC 17/18] vfio/as: Allow the selection of a given iommu backend

2022-04-14 Thread Yi Liu

From: Eric Auger 

Now we support two types of iommu backends, let's add the capability
to select one of them. This is based on a VFIODevice auto/on/off
iommu_be field. This field is likely to be forced to a given value or
set by a device option.

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/as.c  | 31 ++-
 include/hw/vfio/vfio-common.h |  1 +
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 13a6653a0d..fce7a088e9 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -985,16 +985,45 @@ vfio_get_container_class(VFIOIOMMUBackendType be)
 case VFIO_IOMMU_BACKEND_TYPE_LEGACY:
 klass = object_class_by_name(TYPE_VFIO_LEGACY_CONTAINER);
 return VFIO_CONTAINER_OBJ_CLASS(klass);
+case VFIO_IOMMU_BACKEND_TYPE_IOMMUFD:
+klass = object_class_by_name(TYPE_VFIO_IOMMUFD_CONTAINER);
+return VFIO_CONTAINER_OBJ_CLASS(klass);
 default:
 return NULL;
 }
 }
 
+static VFIOContainerClass *
+select_iommu_backend(OnOffAuto value, Error **errp)
+{
+VFIOContainerClass *vccs = NULL;
+
+if (value == ON_OFF_AUTO_OFF) {
+return vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+} else {
+int iommufd = qemu_open_old("/dev/iommu", O_RDWR);
+
+vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_IOMMUFD);
+if (iommufd < 0 || !vccs) {
+if (value == ON_OFF_AUTO_AUTO) {
+vccs = 
vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+} else { /* ON */
+error_setg(errp, "iommufd backend is not supported by %s",
+   iommufd < 0 ? "the host" : "QEMU");
+error_append_hint(errp, "set iommufd=off\n");
+vccs = NULL;
+}
+}
+close(iommufd);
+}
+return vccs;
+}
+
 int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
 {
 VFIOContainerClass *vccs;
 
-vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+vccs = select_iommu_backend(vbasedev->iommufd_be, errp);
 if (!vccs) {
 return -ENOENT;
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index bef48ddfaf..2d941aae70 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -126,6 +126,7 @@ typedef struct VFIODevice {
 VFIOMigration *migration;
 Error *migration_blocker;
 OnOffAuto pre_copy_dirty_page_tracking;
+OnOffAuto iommufd_be;
 } VFIODevice;
 
 struct VFIODeviceOps {
-- 
2.27.0

[RFC 07/18] vfio: Add base object for VFIOContainer

2022-04-14 Thread Yi Liu

Qomify the VFIOContainer object which acts as a base class for a
container. This base class is derived into the legacy VFIO container
and later on, into the new iommufd based container.

The base class implements generic code such as code related to
memory_listener and address space management whereas the derived
class implements callbacks that depend on the kernel user space
being used.

'as.c' only manipulates the base class object with wrapper functions
that call the right class functions. Existing 'container.c' code is
converted to implement the legacy container class functions.

Existing migration code only works with the legacy container.
Also 'spapr.c' isn't BE agnostic.

Below is the object. It's named as VFIOContainer, old VFIOContainer
is replaced with VFIOLegacyContainer.

struct VFIOContainer {
/* private */
Object parent_obj;

VFIOAddressSpace *space;
MemoryListener listener;
Error *error;
bool initialized;
bool dirty_pages_supported;
uint64_t dirty_pgsizes;
uint64_t max_dirty_bitmap_size;
unsigned long pgsizes;
unsigned int dma_max_mappings;
QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
QLIST_HEAD(, VFIORamDiscardListener) vrdl_list;
QLIST_ENTRY(VFIOContainer) next;
};

struct VFIOLegacyContainer {
VFIOContainer obj;
int fd; /* /dev/vfio/vfio, empowered by the attached groups */
MemoryListener prereg_listener;
unsigned iommu_type;
QLIST_HEAD(, VFIOGroup) group_list;
};

Co-authored-by: Eric Auger 
Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/as.c |  48 +++---
 hw/vfio/container-obj.c  | 195 +++
 hw/vfio/container.c  | 224 ---
 hw/vfio/meson.build  |   1 +
 hw/vfio/migration.c  |   4 +-
 hw/vfio/pci.c|   4 +-
 hw/vfio/spapr.c  |  22 +--
 include/hw/vfio/vfio-common.h|  78 ++
 include/hw/vfio/vfio-container-obj.h | 154 ++
 9 files changed, 540 insertions(+), 190 deletions(-)
 create mode 100644 hw/vfio/container-obj.c
 create mode 100644 include/hw/vfio/vfio-container-obj.h

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 4181182808..37423d2c89 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -215,9 +215,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
  * of vaddr will always be there, even if the memory object is
  * destroyed and its backing memory munmap-ed.
  */
-ret = vfio_dma_map(container, iova,
-   iotlb->addr_mask + 1, vaddr,
-   read_only);
+ret = vfio_container_dma_map(container, iova,
+ iotlb->addr_mask + 1, vaddr,
+ read_only);
 if (ret) {
 error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
  "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -225,7 +225,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
IOMMUTLBEntry *iotlb)
  iotlb->addr_mask + 1, vaddr, ret);
 }
 } else {
-ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
+ret = vfio_container_dma_unmap(container, iova,
+   iotlb->addr_mask + 1, iotlb);
 if (ret) {
 error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
  "0x%"HWADDR_PRIx") = %d (%m)",
@@ -242,12 +243,13 @@ static void 
vfio_ram_discard_notify_discard(RamDiscardListener *rdl,
 {
 VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
 listener);
+VFIOContainer *container = vrdl->container;
 const hwaddr size = int128_get64(section->size);
 const hwaddr iova = section->offset_within_address_space;
 int ret;
 
 /* Unmap with a single call. */
-ret = vfio_dma_unmap(vrdl->container, iova, size , NULL);
+ret = vfio_container_dma_unmap(container, iova, size , NULL);
 if (ret) {
 error_report("%s: vfio_dma_unmap() failed: %s", __func__,
  strerror(-ret));
@@ -259,6 +261,7 @@ static int 
vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
 {
 VFIORamDiscardListener *vrdl = container_of(rdl, VFIORamDiscardListener,
 listener);
+VFIOContainer *container = vrdl->container;
 const hwaddr end = section->offset_within_region +
int128_get64(section->size);
 hwaddr start, next, iova;
@@ -277,8 +280,8 @@ static int 
vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
section->offset_within_address_space;
 vaddr = memory_region_get_ram_ptr(section->mr) + start;
 
-ret = vfio_dma_map(vrdl->container, iova, next -

[RFC 16/18] vfio/iommufd: Add IOAS_COPY_DMA support

2022-04-14 Thread Yi Liu

Compared with legacy vfio container BE, one of the benefits provided by
iommufd is to reduce the redundant page pinning on kernel side through
the usage of IOAS_COPY_DMA. For iommufd containers within the same address
space, IOVA mappings can be copied from a source container to destination
container.

To achieve this, move the vfio_memory_listener to be per address space.
In the memory listener callbacks, all the containers within the address
space will be looped. For the iommufd containers, QEMU uses IOAS_MAP_DMA
on the first one, and then uses IOAS_COPY_DMA to copy the IOVA mappings
from the first iommufd container to other iommufd containers within the
address space. For legacy containers, IOVA mapping is done by
VFIO_IOMMU_MAP_DMA.

Signed-off-by: Yi Liu 
---
 hw/vfio/as.c | 117 +++
 hw/vfio/container-obj.c  |  17 +++-
 hw/vfio/container.c  |  19 ++---
 hw/vfio/iommufd.c|  43 +++---
 include/hw/vfio/vfio-common.h|   6 +-
 include/hw/vfio/vfio-container-obj.h |   8 +-
 6 files changed, 167 insertions(+), 43 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 94618efd1f..13a6653a0d 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -388,16 +388,16 @@ static void 
vfio_unregister_ram_discard_listener(VFIOContainer *container,
 g_free(vrdl);
 }
 
-static void vfio_listener_region_add(MemoryListener *listener,
- MemoryRegionSection *section)
+static void vfio_container_region_add(VFIOContainer *container,
+  VFIOContainer **src_container,
+  MemoryRegionSection *section)
 {
-VFIOContainer *container = container_of(listener, VFIOContainer, listener);
 hwaddr iova, end;
 Int128 llend, llsize;
 void *vaddr;
 int ret;
 VFIOHostDMAWindow *hostwin;
-bool hostwin_found;
+bool hostwin_found, copy_dma_supported = false;
 Error *err = NULL;
 
 if (vfio_listener_skipped_section(section)) {
@@ -533,12 +533,25 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 }
 }
 
+copy_dma_supported = vfio_container_check_extension(container,
+VFIO_FEAT_DMA_COPY);
+
+if (copy_dma_supported && *src_container) {
+if (!vfio_container_dma_copy(*src_container, container,
+ iova, int128_get64(llsize),
+ section->readonly)) {
+return;
+} else {
+info_report("IOAS copy failed try map for container: %p", 
container);
+}
+}
+
 ret = vfio_container_dma_map(container, iova, int128_get64(llsize),
  vaddr, section->readonly);
 if (ret) {
-error_setg(, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-   "0x%"HWADDR_PRIx", %p) = %d (%m)",
-   container, iova, int128_get64(llsize), vaddr, ret);
+error_setg(, "vfio_container_dma_map(%p, 0x%"HWADDR_PRIx", "
+   "0x%"HWADDR_PRIx", %p) = %d (%m)", container, iova,
+   int128_get64(llsize), vaddr, ret);
 if (memory_region_is_ram_device(section->mr)) {
 /* Allow unexpected mappings not to be fatal for RAM devices */
 error_report_err(err);
@@ -547,6 +560,9 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 goto fail;
 }
 
+if (copy_dma_supported) {
+*src_container = container;
+}
 return;
 
 fail:
@@ -573,10 +589,22 @@ fail:
 }
 }
 
-static void vfio_listener_region_del(MemoryListener *listener,
+static void vfio_listener_region_add(MemoryListener *listener,
  MemoryRegionSection *section)
 {
-VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+VFIOAddressSpace *space = container_of(listener,
+   VFIOAddressSpace, listener);
+VFIOContainer *container, *src_container;
+
+src_container = NULL;
+QLIST_FOREACH(container, >containers, next) {
+vfio_container_region_add(container, _container, section);
+}
+}
+
+static void vfio_container_region_del(VFIOContainer *container,
+  MemoryRegionSection *section)
+{
 hwaddr iova, end;
 Int128 llend, llsize;
 int ret;
@@ -682,18 +710,38 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
 vfio_container_del_section_window(container, section);
 }
 
+static void vfio_listener_region_del(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+VFIOAddressSpace *space = container_of(listener,
+   VFIOAddressSpace, listener);
+VFIOContainer *container;
+
+QLIST_FOREACH(container, >containers, next) {
+

[RFC 14/18] hw/iommufd: Creation

2022-04-14 Thread Yi Liu

Introduce iommufd utility library which can be compiled out with
CONFIG_IOMMUFD configuration. This code is bound to be called by
several subsystems: vdpa, and vfio.

Co-authored-by: Eric Auger 
Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 MAINTAINERS  |   7 ++
 hw/Kconfig   |   1 +
 hw/iommufd/Kconfig   |   4 +
 hw/iommufd/iommufd.c | 209 +++
 hw/iommufd/meson.build   |   1 +
 hw/iommufd/trace-events  |  11 ++
 hw/iommufd/trace.h   |   1 +
 hw/meson.build   |   1 +
 include/hw/iommufd/iommufd.h |  37 +++
 meson.build  |   1 +
 10 files changed, 273 insertions(+)
 create mode 100644 hw/iommufd/Kconfig
 create mode 100644 hw/iommufd/iommufd.c
 create mode 100644 hw/iommufd/meson.build
 create mode 100644 hw/iommufd/trace-events
 create mode 100644 hw/iommufd/trace.h
 create mode 100644 include/hw/iommufd/iommufd.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 4ad2451e03..f6bcb25f7f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1954,6 +1954,13 @@ F: hw/vfio/ap.c
 F: docs/system/s390x/vfio-ap.rst
 L: qemu-s3...@nongnu.org
 
+iommufd
+M: Yi Liu 
+M: Eric Auger 
+S: Supported
+F: hw/iommufd/*
+F: include/hw/iommufd/*
+
 vhost
 M: Michael S. Tsirkin 
 S: Supported
diff --git a/hw/Kconfig b/hw/Kconfig
index ad20cce0a9..d270d44760 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -63,6 +63,7 @@ source sparc/Kconfig
 source sparc64/Kconfig
 source tricore/Kconfig
 source xtensa/Kconfig
+source iommufd/Kconfig
 
 # Symbols used by multiple targets
 config TEST_DEVICES
diff --git a/hw/iommufd/Kconfig b/hw/iommufd/Kconfig
new file mode 100644
index 00..4b1b00e36b
--- /dev/null
+++ b/hw/iommufd/Kconfig
@@ -0,0 +1,4 @@
+config IOMMUFD
+bool
+default y
+depends on LINUX
diff --git a/hw/iommufd/iommufd.c b/hw/iommufd/iommufd.c
new file mode 100644
index 00..4e8179d612
--- /dev/null
+++ b/hw/iommufd/iommufd.c
@@ -0,0 +1,209 @@
+/*
+ * QEMU IOMMUFD
+ *
+ * Copyright (C) 2022 Intel Corporation.
+ * Copyright Red Hat, Inc. 2022
+ *
+ * Authors: Yi Liu 
+ *  Eric Auger 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see .
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/thread.h"
+#include "qemu/module.h"
+#include 
+#include 
+#include "hw/iommufd/iommufd.h"
+#include "trace.h"
+
+static QemuMutex iommufd_lock;
+static uint32_t iommufd_users;
+static int iommufd = -1;
+
+static int iommufd_get(void)
+{
+qemu_mutex_lock(_lock);
+if (iommufd == -1) {
+iommufd = qemu_open_old("/dev/iommu", O_RDWR);
+if (iommufd < 0) {
+error_report("Failed to open /dev/iommu!");
+} else {
+iommufd_users = 1;
+}
+trace_iommufd_get(iommufd);
+} else if (++iommufd_users == UINT32_MAX) {
+error_report("Failed to get iommufd: %d, count overflow", iommufd);
+iommufd_users--;
+qemu_mutex_unlock(_lock);
+return -E2BIG;
+}
+qemu_mutex_unlock(_lock);
+return iommufd;
+}
+
+static void iommufd_put(int fd)
+{
+qemu_mutex_lock(_lock);
+if (--iommufd_users) {
+qemu_mutex_unlock(_lock);
+return;
+}
+iommufd = -1;
+trace_iommufd_put(fd);
+close(fd);
+qemu_mutex_unlock(_lock);
+}
+
+static int iommufd_alloc_ioas(int iommufd, uint32_t *ioas)
+{
+int ret;
+struct iommu_ioas_alloc alloc_data  = {
+.size = sizeof(alloc_data),
+.flags = 0,
+};
+
+ret = ioctl(iommufd, IOMMU_IOAS_ALLOC, _data);
+if (ret) {
+error_report("Failed to allocate ioas %m");
+}
+
+*ioas = alloc_data.out_ioas_id;
+trace_iommufd_alloc_ioas(iommufd, *ioas, ret);
+
+return ret;
+}
+
+static void iommufd_free_ioas(int iommufd, uint32_t ioas)
+{
+int ret;
+struct iommu_destroy des = {
+.size = sizeof(des),
+.id = ioas,
+};
+
+ret = ioctl(iommufd, IOMMU_DESTROY, );
+trace_iommufd_free_ioas(iommufd, ioas, ret);
+if (ret) {
+error_report("Failed to free ioas: %u %m", ioas);
+}
+}
+
+int iommufd_get_ioas(int *fd, uint32_t *ioas_id)
+{
+int ret;
+
+*fd = iommufd_get();
+if (*fd < 0) {
+return *fd;
+}
+
+ret = iommufd_alloc_ioas(*fd, ioas_id);
+trace_iommufd_get_ioas(*fd, *ioas_id, ret);
+

[RFC 09/18] vfio/platform: Use vfio_[attach/detach]_device

2022-04-14 Thread Yi Liu

From: Eric Auger 

Let the vfio-platform device use vfio_attach_device() and
vfio_detach_device(), hence hiding the details of the used
IOMMU backend.

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/platform.c | 42 ++
 1 file changed, 2 insertions(+), 40 deletions(-)

diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c
index 5af73f9287..3bcdc20667 100644
--- a/hw/vfio/platform.c
+++ b/hw/vfio/platform.c
@@ -529,12 +529,7 @@ static VFIODeviceOps vfio_platform_ops = {
  */
 static int vfio_base_device_init(VFIODevice *vbasedev, Error **errp)
 {
-VFIOGroup *group;
-VFIODevice *vbasedev_iter;
-char *tmp, group_path[PATH_MAX], *group_name;
-ssize_t len;
 struct stat st;
-int groupid;
 int ret;
 
 /* @sysfsdev takes precedence over @host */
@@ -557,47 +552,14 @@ static int vfio_base_device_init(VFIODevice *vbasedev, 
Error **errp)
 return -errno;
 }
 
-tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-len = readlink(tmp, group_path, sizeof(group_path));
-g_free(tmp);
-
-if (len < 0 || len >= sizeof(group_path)) {
-ret = len < 0 ? -errno : -ENAMETOOLONG;
-error_setg_errno(errp, -ret, "no iommu_group found");
-return ret;
-}
-
-group_path[len] = 0;
-
-group_name = basename(group_path);
-if (sscanf(group_name, "%d", ) != 1) {
-error_setg_errno(errp, errno, "failed to read %s", group_path);
-return -errno;
-}
-
-trace_vfio_platform_base_device_init(vbasedev->name, groupid);
-
-group = vfio_get_group(groupid, _space_memory, errp);
-if (!group) {
-return -ENOENT;
-}
-
-QLIST_FOREACH(vbasedev_iter, >device_list, next) {
-if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-error_setg(errp, "device is already attached");
-vfio_put_group(group);
-return -EBUSY;
-}
-}
-ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+ret = vfio_attach_device(vbasedev, _space_memory, errp);
 if (ret) {
-vfio_put_group(group);
 return ret;
 }
 
 ret = vfio_populate_device(vbasedev, errp);
 if (ret) {
-vfio_put_group(group);
+vfio_detach_device(vbasedev);
 }
 
 return ret;
-- 
2.27.0

[RFC 08/18] vfio/container: Introduce vfio_[attach/detach]_device

2022-04-14 Thread Yi Liu

From: Eric Auger 

We want the VFIO devices to be able to use two different
IOMMU callbacks, the legacy VFIO one and the new iommufd one.

Introduce vfio_[attach/detach]_device which aim at hiding the
underlying IOMMU backend (IOCTLs, datatypes, ...).

Once vfio_attach_device completes, the device is attached
to a security context and its fd can be used. Conversely
When vfio_detach_device completes, the device has been
detached to the security context.

In this patch, only the vfio-pci device gets converted to use
the new API. Subsequent patches will handle other devices.

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/container.c   | 65 +++
 hw/vfio/pci.c | 50 +++
 include/hw/vfio/vfio-common.h |  2 ++
 3 files changed, 72 insertions(+), 45 deletions(-)

diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 79972064d3..c74a3cd4ae 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -1214,6 +1214,71 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
 return vfio_eeh_container_op(container, op);
 }
 
+static int vfio_device_groupid(VFIODevice *vbasedev, Error **errp)
+{
+char *tmp, group_path[PATH_MAX], *group_name;
+int ret, groupid;
+ssize_t len;
+
+tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
+len = readlink(tmp, group_path, sizeof(group_path));
+g_free(tmp);
+
+if (len <= 0 || len >= sizeof(group_path)) {
+ret = len < 0 ? -errno : -ENAMETOOLONG;
+error_setg_errno(errp, -ret, "no iommu_group found");
+return ret;
+}
+
+group_path[len] = 0;
+
+group_name = basename(group_path);
+if (sscanf(group_name, "%d", ) != 1) {
+error_setg_errno(errp, errno, "failed to read %s", group_path);
+return -errno;
+}
+return groupid;
+}
+
+int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
+{
+int groupid = vfio_device_groupid(vbasedev, errp);
+VFIODevice *vbasedev_iter;
+VFIOGroup *group;
+int ret;
+
+if (groupid < 0) {
+return groupid;
+}
+
+trace_vfio_realize(vbasedev->name, groupid);
+group = vfio_get_group(groupid, as, errp);
+if (!group) {
+return -1;
+}
+
+QLIST_FOREACH(vbasedev_iter, >device_list, next) {
+if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
+error_setg(errp, "device is already attached");
+vfio_put_group(group);
+return -1;
+}
+}
+ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
+if (ret) {
+vfio_put_group(group);
+return -1;
+}
+
+return 0;
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+vfio_put_base_device(vbasedev);
+vfio_put_group(vbasedev->group);
+}
+
 static void vfio_legacy_container_class_init(ObjectClass *klass,
  void *data)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a00a485e46..0363f81017 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2654,10 +2654,9 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, 
Error **errp)
 
 static void vfio_put_device(VFIOPCIDevice *vdev)
 {
-g_free(vdev->vbasedev.name);
 g_free(vdev->msix);
 
-vfio_put_base_device(>vbasedev);
+vfio_detach_device(>vbasedev);
 }
 
 static void vfio_err_notifier_handler(void *opaque)
@@ -2804,13 +2803,9 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
 VFIOPCIDevice *vdev = VFIO_PCI(pdev);
 VFIODevice *vbasedev = >vbasedev;
-VFIODevice *vbasedev_iter;
-VFIOGroup *group;
-char *tmp, *subsys, group_path[PATH_MAX], *group_name;
+char *tmp, *subsys;
 Error *err = NULL;
-ssize_t len;
 struct stat st;
-int groupid;
 int i, ret;
 bool is_mdev;
 
@@ -2839,39 +2834,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 vbasedev->type = VFIO_DEVICE_TYPE_PCI;
 vbasedev->dev = DEVICE(vdev);
 
-tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
-len = readlink(tmp, group_path, sizeof(group_path));
-g_free(tmp);
-
-if (len <= 0 || len >= sizeof(group_path)) {
-error_setg_errno(errp, len < 0 ? errno : ENAMETOOLONG,
- "no iommu_group found");
-goto error;
-}
-
-group_path[len] = 0;
-
-group_name = basename(group_path);
-if (sscanf(group_name, "%d", ) != 1) {
-error_setg_errno(errp, errno, "failed to read %s", group_path);
-goto error;
-}
-
-trace_vfio_realize(vbasedev->name, groupid);
-
-group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), 
errp);
-if (!group) {
-goto error;
-}
-
-QLIST_FOREACH(vbasedev_iter, >device_list, next) {
-if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
-error_setg(errp, "device is already attached");
-vfio_put_group(group);
-goto error;
-

[RFC 12/18] vfio/container-obj: Introduce [attach/detach]_device container callbacks

2022-04-14 Thread Yi Liu

From: Eric Auger 

Let's turn attach/detach_device as container callbacks. That way,
their implementation can be easily customized for a given backend.

For the time being, only the legacy container is supported.

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/as.c | 36 
 hw/vfio/container.c  | 11 +
 hw/vfio/pci.c|  2 +-
 include/hw/vfio/vfio-common.h|  7 ++
 include/hw/vfio/vfio-container-obj.h |  6 +
 5 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 37423d2c89..30e86f6833 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -874,3 +874,39 @@ void vfio_put_address_space(VFIOAddressSpace *space)
 g_free(space);
 }
 }
+
+static VFIOContainerClass *
+vfio_get_container_class(VFIOIOMMUBackendType be)
+{
+ObjectClass *klass;
+
+switch (be) {
+case VFIO_IOMMU_BACKEND_TYPE_LEGACY:
+klass = object_class_by_name(TYPE_VFIO_LEGACY_CONTAINER);
+return VFIO_CONTAINER_OBJ_CLASS(klass);
+default:
+return NULL;
+}
+}
+
+int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
+{
+VFIOContainerClass *vccs;
+
+vccs = vfio_get_container_class(VFIO_IOMMU_BACKEND_TYPE_LEGACY);
+if (!vccs) {
+return -ENOENT;
+}
+return vccs->attach_device(vbasedev, as, errp);
+}
+
+void vfio_detach_device(VFIODevice *vbasedev)
+{
+VFIOContainerClass *vccs;
+
+if (!vbasedev->container) {
+return;
+}
+vccs = VFIO_CONTAINER_OBJ_GET_CLASS(vbasedev->container);
+vccs->detach_device(vbasedev);
+}
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 5d73f8285e..74febc1567 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -50,8 +50,6 @@
 static int vfio_kvm_device_fd = -1;
 #endif
 
-#define TYPE_VFIO_LEGACY_CONTAINER "qemu:vfio-legacy-container"
-
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
 
@@ -1240,7 +1238,8 @@ static int vfio_device_groupid(VFIODevice *vbasedev, 
Error **errp)
 return groupid;
 }
 
-int vfio_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
+static int
+legacy_attach_device(VFIODevice *vbasedev, AddressSpace *as, Error **errp)
 {
 int groupid = vfio_device_groupid(vbasedev, errp);
 VFIODevice *vbasedev_iter;
@@ -1269,14 +1268,16 @@ int vfio_attach_device(VFIODevice *vbasedev, 
AddressSpace *as, Error **errp)
 vfio_put_group(group);
 return -1;
 }
+vbasedev->container = >container->obj;
 
 return 0;
 }
 
-void vfio_detach_device(VFIODevice *vbasedev)
+static void legacy_detach_device(VFIODevice *vbasedev)
 {
 vfio_put_base_device(vbasedev);
 vfio_put_group(vbasedev->group);
+vbasedev->container = NULL;
 }
 
 static void vfio_legacy_container_class_init(ObjectClass *klass,
@@ -1292,6 +1293,8 @@ static void vfio_legacy_container_class_init(ObjectClass 
*klass,
 vccs->add_window = vfio_legacy_container_add_section_window;
 vccs->del_window = vfio_legacy_container_del_section_window;
 vccs->check_extension = vfio_legacy_container_check_extension;
+vccs->attach_device = legacy_attach_device;
+vccs->detach_device = legacy_detach_device;
 }
 
 static const TypeInfo vfio_legacy_container_info = {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0363f81017..e1ab6d339d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3063,7 +3063,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 
 if (!pdev->failover_pair_id &&
-vfio_container_check_extension(>group->container->obj,
+vfio_container_check_extension(vbasedev->container,
VFIO_FEAT_LIVE_MIGRATION)) {
 ret = vfio_migration_probe(vbasedev, errp);
 if (ret) {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 7d7898717e..2040c27cda 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -83,9 +83,15 @@ typedef struct VFIOLegacyContainer {
 
 typedef struct VFIODeviceOps VFIODeviceOps;
 
+typedef enum VFIOIOMMUBackendType {
+VFIO_IOMMU_BACKEND_TYPE_LEGACY = 0,
+VFIO_IOMMU_BACKEND_TYPE_IOMMUFD = 1,
+} VFIOIOMMUBackendType;
+
 typedef struct VFIODevice {
 QLIST_ENTRY(VFIODevice) next;
 struct VFIOGroup *group;
+VFIOContainer *container;
 char *sysfsdev;
 char *name;
 DeviceState *dev;
@@ -97,6 +103,7 @@ typedef struct VFIODevice {
 bool ram_block_discard_allowed;
 bool enable_migration;
 VFIODeviceOps *ops;
+VFIOIOMMUBackendType be;
 unsigned int num_irqs;
 unsigned int num_regions;
 unsigned int flags;
diff --git a/include/hw/vfio/vfio-container-obj.h 
b/include/hw/vfio/vfio-container-obj.h
index 7ffbbb299f..ebc1340530 100644
--- a/include/hw/vfio/vfio-container-obj.h
+++ b/include/hw/vfio/vfio-container-obj.h
@@ -42,6 +42,8 @@

[RFC 02/18] linux-headers: Import latest vfio.h and iommufd.h

2022-04-14 Thread Yi Liu

From: Eric Auger 

Imported from https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 linux-headers/linux/iommufd.h | 223 ++
 linux-headers/linux/vfio.h|  84 +
 2 files changed, 307 insertions(+)
 create mode 100644 linux-headers/linux/iommufd.h

diff --git a/linux-headers/linux/iommufd.h b/linux-headers/linux/iommufd.h
new file mode 100644
index 00..6c3cd9e259
--- /dev/null
+++ b/linux-headers/linux/iommufd.h
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _IOMMUFD_H
+#define _IOMMUFD_H
+
+#include 
+#include 
+
+#define IOMMUFD_TYPE (';')
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl mechanims follows a general format to allow for extensibility. 
Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *non-zero in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *understood, however a known field has a value the kernel does not
+ *understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *correct.
+ *  - ENOENT: An ID or IOVA provided does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Mathematics oveflowed.
+ *
+ * As well as additional errnos. within specific ioctls.
+ */
+enum {
+   IOMMUFD_CMD_BASE = 0x80,
+   IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
+   IOMMUFD_CMD_IOAS_ALLOC,
+   IOMMUFD_CMD_IOAS_IOVA_RANGES,
+   IOMMUFD_CMD_IOAS_MAP,
+   IOMMUFD_CMD_IOAS_COPY,
+   IOMMUFD_CMD_IOAS_UNMAP,
+   IOMMUFD_CMD_VFIO_IOAS,
+};
+
+/**
+ * struct iommu_destroy - ioctl(IOMMU_DESTROY)
+ * @size: sizeof(struct iommu_destroy)
+ * @id: iommufd object ID to destroy. Can by any destroyable object type.
+ *
+ * Destroy any object held within iommufd.
+ */
+struct iommu_destroy {
+   __u32 size;
+   __u32 id;
+};
+#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)
+
+/**
+ * struct iommu_ioas_alloc - ioctl(IOMMU_IOAS_ALLOC)
+ * @size: sizeof(struct iommu_ioas_alloc)
+ * @flags: Must be 0
+ * @out_ioas_id: Output IOAS ID for the allocated object
+ *
+ * Allocate an IO Address Space (IOAS) which holds an IO Virtual Address (IOVA)
+ * to memory mapping.
+ */
+struct iommu_ioas_alloc {
+   __u32 size;
+   __u32 flags;
+   __u32 out_ioas_id;
+};
+#define IOMMU_IOAS_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_ALLOC)
+
+/**
+ * struct iommu_ioas_iova_ranges - ioctl(IOMMU_IOAS_IOVA_RANGES)
+ * @size: sizeof(struct iommu_ioas_iova_ranges)
+ * @ioas_id: IOAS ID to read ranges from
+ * @out_num_iovas: Output total number of ranges in the IOAS
+ * @__reserved: Must be 0
+ * @out_valid_iovas: Array of valid IOVA ranges. The array length is the 
smaller
+ *   of out_num_iovas or the length implied by size.
+ * @out_valid_iovas.start: First IOVA in the allowed range
+ * @out_valid_iovas.last: Inclusive last IOVA in the allowed range
+ *
+ * Query an IOAS for ranges of allowed IOVAs. Operation outside these ranges is
+ * not allowed. out_num_iovas will be set to the total number of iovas
+ * and the out_valid_iovas[] will be filled in as space permits.
+ * size should include the allocated flex array.
+ */
+struct iommu_ioas_iova_ranges {
+   __u32 size;
+   __u32 ioas_id;
+   __u32 out_num_iovas;
+   __u32 __reserved;
+   struct iommu_valid_iovas {
+   __aligned_u64 start;
+   __aligned_u64 last;
+   } out_valid_iovas[];
+};
+#define IOMMU_IOAS_IOVA_RANGES _IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_IOVA_RANGES)
+
+/**
+ * enum iommufd_ioas_map_flags - Flags for map and copy
+ * @IOMMU_IOAS_MAP_FIXED_IOVA: If clear the kernel will compute an appropriate
+ * IOVA to place the mapping at
+ * @IOMMU_IOAS_MAP_WRITEABLE: DMA is allowed to write to this mapping
+ * @IOMMU_IOAS_MAP_READABLE: DMA is allowed to read from this mapping
+ */
+enum iommufd_ioas_map_flags {
+   IOMMU_IOAS_MAP_FIXED_IOVA = 1 << 0,
+   IOMMU_IOAS_MAP_WRITEABLE = 1 << 1,
+   IOMMU_IOAS_MAP_READABLE = 1 << 2,
+};
+
+/**
+ * struct iommu_ioas_map - ioctl(IOMMU_IOAS_MAP)
+ * @size: sizeof(struct iommu_ioas_map)
+ * @flags: Combination of enum iommufd_ioas_map_flags
+ * @ioas_id: IOAS ID to change the mapping of
+ * @__reserved: Must be 0
+ * @user_va: Userspace pointer to start mapping from

[RFC 04/18] vfio/pci: Use vbasedev local variable in vfio_realize()

2022-04-14 Thread Yi Liu

From: Eric Auger 

Using a VFIODevice handle local variable to improve the code readability.

no functional change intended

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/pci.c | 49 +
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index e26e65bb1f..e707329394 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2803,6 +2803,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice 
*vdev)
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
 VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+VFIODevice *vbasedev = >vbasedev;
 VFIODevice *vbasedev_iter;
 VFIOGroup *group;
 char *tmp, *subsys, group_path[PATH_MAX], *group_name;
@@ -2813,7 +2814,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 int i, ret;
 bool is_mdev;
 
-if (!vdev->vbasedev.sysfsdev) {
+if (!vbasedev->sysfsdev) {
 if (!(~vdev->host.domain || ~vdev->host.bus ||
   ~vdev->host.slot || ~vdev->host.function)) {
 error_setg(errp, "No provided host device");
@@ -2821,24 +2822,24 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
   "or -device vfio-pci,sysfsdev=PATH_TO_DEVICE\n");
 return;
 }
-vdev->vbasedev.sysfsdev =
+vbasedev->sysfsdev =
 g_strdup_printf("/sys/bus/pci/devices/%04x:%02x:%02x.%01x",
 vdev->host.domain, vdev->host.bus,
 vdev->host.slot, vdev->host.function);
 }
 
-if (stat(vdev->vbasedev.sysfsdev, ) < 0) {
+if (stat(vbasedev->sysfsdev, ) < 0) {
 error_setg_errno(errp, errno, "no such host device");
-error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.sysfsdev);
+error_prepend(errp, VFIO_MSG_PREFIX, vbasedev->sysfsdev);
 return;
 }
 
-vdev->vbasedev.name = g_path_get_basename(vdev->vbasedev.sysfsdev);
-vdev->vbasedev.ops = _pci_ops;
-vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
-vdev->vbasedev.dev = DEVICE(vdev);
+vbasedev->name = g_path_get_basename(vbasedev->sysfsdev);
+vbasedev->ops = _pci_ops;
+vbasedev->type = VFIO_DEVICE_TYPE_PCI;
+vbasedev->dev = DEVICE(vdev);
 
-tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
+tmp = g_strdup_printf("%s/iommu_group", vbasedev->sysfsdev);
 len = readlink(tmp, group_path, sizeof(group_path));
 g_free(tmp);
 
@@ -2856,7 +2857,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 goto error;
 }
 
-trace_vfio_realize(vdev->vbasedev.name, groupid);
+trace_vfio_realize(vbasedev->name, groupid);
 
 group = vfio_get_group(groupid, pci_device_iommu_address_space(pdev), 
errp);
 if (!group) {
@@ -2864,7 +2865,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 
 QLIST_FOREACH(vbasedev_iter, >device_list, next) {
-if (strcmp(vbasedev_iter->name, vdev->vbasedev.name) == 0) {
+if (strcmp(vbasedev_iter->name, vbasedev->name) == 0) {
 error_setg(errp, "device is already attached");
 vfio_put_group(group);
 goto error;
@@ -2877,22 +2878,22 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
  * stays in sync with the active working set of the guest driver.  Prevent
  * the x-balloon-allowed option unless this is minimally an mdev device.
  */
-tmp = g_strdup_printf("%s/subsystem", vdev->vbasedev.sysfsdev);
+tmp = g_strdup_printf("%s/subsystem", vbasedev->sysfsdev);
 subsys = realpath(tmp, NULL);
 g_free(tmp);
 is_mdev = subsys && (strcmp(subsys, "/sys/bus/mdev") == 0);
 free(subsys);
 
-trace_vfio_mdev(vdev->vbasedev.name, is_mdev);
+trace_vfio_mdev(vbasedev->name, is_mdev);
 
-if (vdev->vbasedev.ram_block_discard_allowed && !is_mdev) {
+if (vbasedev->ram_block_discard_allowed && !is_mdev) {
 error_setg(errp, "x-balloon-allowed only potentially compatible "
"with mdev devices");
 vfio_put_group(group);
 goto error;
 }
 
-ret = vfio_get_device(group, vdev->vbasedev.name, >vbasedev, errp);
+ret = vfio_get_device(group, vbasedev->name, vbasedev, errp);
 if (ret) {
 vfio_put_group(group);
 goto error;
@@ -2905,7 +2906,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 }
 
 /* Get a copy of config space */
-ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
+ret = pread(vbasedev->fd, vdev->pdev.config,
 MIN(pci_config_size(>pdev), vdev->config_size),
 vdev->config_offset);
 if (ret < (int)MIN(pci_config_size(>pdev), vdev->config_size)) {
@@ -2933,7 +2934,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 goto error;
 }
 vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0);
-

[RFC 05/18] vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr

2022-04-14 Thread Yi Liu

Rename VFIOGuestIOMMU iommu field into iommu_mr. Then it becomes clearer
it is an IOMMU memory region.

no functional change intended

Signed-off-by: Yi Liu 
---
 hw/vfio/common.c  | 16 
 include/hw/vfio/vfio-common.h |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 080046e3f5..b05f68b5c7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -992,7 +992,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
  * device emulation the VFIO iommu handles to use).
  */
 giommu = g_malloc0(sizeof(*giommu));
-giommu->iommu = iommu_mr;
+giommu->iommu_mr = iommu_mr;
 giommu->iommu_offset = section->offset_within_address_space -
section->offset_within_region;
 giommu->container = container;
@@ -1007,7 +1007,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 int128_get64(llend),
 iommu_idx);
 
-ret = memory_region_iommu_set_page_size_mask(giommu->iommu,
+ret = memory_region_iommu_set_page_size_mask(giommu->iommu_mr,
  container->pgsizes,
  );
 if (ret) {
@@ -1022,7 +1022,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 goto fail;
 }
 QLIST_INSERT_HEAD(>giommu_list, giommu, giommu_next);
-memory_region_iommu_replay(giommu->iommu, >n);
+memory_region_iommu_replay(giommu->iommu_mr, >n);
 
 return;
 }
@@ -1128,7 +1128,7 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
 VFIOGuestIOMMU *giommu;
 
 QLIST_FOREACH(giommu, >giommu_list, giommu_next) {
-if (MEMORY_REGION(giommu->iommu) == section->mr &&
+if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
 giommu->n.start == section->offset_within_region) {
 memory_region_unregister_iommu_notifier(section->mr,
 >n);
@@ -1393,11 +1393,11 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 VFIOGuestIOMMU *giommu;
 
 QLIST_FOREACH(giommu, >giommu_list, giommu_next) {
-if (MEMORY_REGION(giommu->iommu) == section->mr &&
+if (MEMORY_REGION(giommu->iommu_mr) == section->mr &&
 giommu->n.start == section->offset_within_region) {
 Int128 llend;
 vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
-int idx = memory_region_iommu_attrs_to_index(giommu->iommu,
+int idx = memory_region_iommu_attrs_to_index(giommu->iommu_mr,
MEMTXATTRS_UNSPECIFIED);
 
 llend = 
int128_add(int128_make64(section->offset_within_region),
@@ -1410,7 +1410,7 @@ static int vfio_sync_dirty_bitmap(VFIOContainer 
*container,
 section->offset_within_region,
 int128_get64(llend),
 idx);
-memory_region_iommu_replay(giommu->iommu, );
+memory_region_iommu_replay(giommu->iommu_mr, );
 break;
 }
 }
@@ -2246,7 +2246,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
 
 QLIST_FOREACH_SAFE(giommu, >giommu_list, giommu_next, tmp) {
 memory_region_unregister_iommu_notifier(
-MEMORY_REGION(giommu->iommu), >n);
+MEMORY_REGION(giommu->iommu_mr), >n);
 QLIST_REMOVE(giommu, giommu_next);
 g_free(giommu);
 }
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0a76..e573f5a9f1 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -98,7 +98,7 @@ typedef struct VFIOContainer {
 
 typedef struct VFIOGuestIOMMU {
 VFIOContainer *container;
-IOMMUMemoryRegion *iommu;
+IOMMUMemoryRegion *iommu_mr;
 hwaddr iommu_offset;
 IOMMUNotifier n;
 QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
-- 
2.27.0

[RFC 13/18] vfio/container-obj: Introduce VFIOContainer reset callback

2022-04-14 Thread Yi Liu

From: Eric Auger 

Reset implementation depends on the container backend. Let's
introduce a VFIOContainer class function and register a generic
reset handler that will be able to call the right reset function
depending on the container type. Also, let's move the
registration/unregistration to a place that is not backend-specific
(first vfio address space created instead of the first group).

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/as.c | 18 ++
 hw/vfio/container-obj.c  | 13 +
 hw/vfio/container.c  | 26 ++
 include/hw/vfio/vfio-container-obj.h |  2 ++
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/hw/vfio/as.c b/hw/vfio/as.c
index 30e86f6833..4abaa4068f 100644
--- a/hw/vfio/as.c
+++ b/hw/vfio/as.c
@@ -847,6 +847,18 @@ const MemoryListener vfio_memory_listener = {
 .log_sync = vfio_listener_log_sync,
 };
 
+void vfio_reset_handler(void *opaque)
+{
+VFIOAddressSpace *space;
+VFIOContainer *bcontainer;
+
+QLIST_FOREACH(space, _address_spaces, list) {
+ QLIST_FOREACH(bcontainer, >containers, next) {
+ vfio_container_reset(bcontainer);
+ }
+}
+}
+
 VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
 {
 VFIOAddressSpace *space;
@@ -862,6 +874,9 @@ VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
 space->as = as;
 QLIST_INIT(>containers);
 
+if (QLIST_EMPTY(_address_spaces)) {
+qemu_register_reset(vfio_reset_handler, NULL);
+}
 QLIST_INSERT_HEAD(_address_spaces, space, list);
 
 return space;
@@ -873,6 +888,9 @@ void vfio_put_address_space(VFIOAddressSpace *space)
 QLIST_REMOVE(space, list);
 g_free(space);
 }
+if (QLIST_EMPTY(_address_spaces)) {
+qemu_unregister_reset(vfio_reset_handler, NULL);
+}
 }
 
 static VFIOContainerClass *
diff --git a/hw/vfio/container-obj.c b/hw/vfio/container-obj.c
index 40c1e2a2b5..c4220336af 100644
--- a/hw/vfio/container-obj.c
+++ b/hw/vfio/container-obj.c
@@ -68,6 +68,19 @@ int vfio_container_dma_unmap(VFIOContainer *container,
 return vccs->dma_unmap(container, iova, size, iotlb);
 }
 
+int vfio_container_reset(VFIOContainer *container)
+{
+VFIOContainerClass *vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+vccs = VFIO_CONTAINER_OBJ_GET_CLASS(container);
+
+if (!vccs->reset) {
+return -ENOENT;
+}
+
+return vccs->reset(container);
+}
+
 void vfio_container_set_dirty_page_tracking(VFIOContainer *container,
 bool start)
 {
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 74febc1567..2f59422048 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -458,12 +458,15 @@ vfio_legacy_container_del_section_window(VFIOContainer 
*bcontainer,
 }
 }
 
-void vfio_reset_handler(void *opaque)
+static int vfio_legacy_container_reset(VFIOContainer *bcontainer)
 {
+VFIOLegacyContainer *container = container_of(bcontainer,
+  VFIOLegacyContainer, obj);
 VFIOGroup *group;
 VFIODevice *vbasedev;
+int ret, final_ret = 0;
 
-QLIST_FOREACH(group, _group_list, next) {
+QLIST_FOREACH(group, >group_list, container_next) {
 QLIST_FOREACH(vbasedev, >device_list, next) {
 if (vbasedev->dev->realized) {
 vbasedev->ops->vfio_compute_needs_reset(vbasedev);
@@ -471,13 +474,19 @@ void vfio_reset_handler(void *opaque)
 }
 }
 
-QLIST_FOREACH(group, _group_list, next) {
+QLIST_FOREACH(group, >group_list, next) {
 QLIST_FOREACH(vbasedev, >device_list, next) {
 if (vbasedev->dev->realized && vbasedev->needs_reset) {
-vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+ret = vbasedev->ops->vfio_hot_reset_multi(vbasedev);
+if (ret) {
+error_report("failed to reset %s (%d)",
+ vbasedev->name, ret);
+final_ret = ret;
+}
 }
 }
 }
+return final_ret;
 }
 
 static void vfio_kvm_device_add_group(VFIOGroup *group)
@@ -1004,10 +1013,6 @@ static VFIOGroup *vfio_get_group(int groupid, 
AddressSpace *as, Error **errp)
 goto close_fd_exit;
 }
 
-if (QLIST_EMPTY(_group_list)) {
-qemu_register_reset(vfio_reset_handler, NULL);
-}
-
 QLIST_INSERT_HEAD(_group_list, group, next);
 
 return group;
@@ -1036,10 +1041,6 @@ static void vfio_put_group(VFIOGroup *group)
 trace_vfio_put_group(group->fd);
 close(group->fd);
 g_free(group);
-
-if (QLIST_EMPTY(_group_list)) {
-qemu_unregister_reset(vfio_reset_handler, NULL);
-}
 }
 
 static int vfio_get_device(VFIOGroup *group, const char *name,
@@ -1293,6 +1294,7 @@ static void vfio_legacy_container_class_init(ObjectClass 
*klass,
 vccs->add_window =

[RFC 00/18] vfio: Adopt iommufd

2022-04-14 Thread Yi Liu

With the introduction of iommufd[1], the linux kernel provides a generic
interface for userspace drivers to propagate their DMA mappings to kernel
for assigned devices. This series does the porting of the VFIO devices
onto the /dev/iommu uapi and let it coexist with the legacy implementation.
Other devices like vpda, vfio mdev and etc. are not considered yet.

For vfio devices, the new interface is tied with device fd and iommufd
as the iommufd solution is device-centric. This is different from legacy
vfio which is group-centric. To support both interfaces in QEMU, this
series introduces the iommu backend concept in the form of different
container classes. The existing vfio container is named legacy container
(equivalent with legacy iommu backend in this series), while the new
iommufd based container is named as iommufd container (may also be mentioned
as iommufd backend in this series). The two backend types have their own
way to setup secure context and dma management interface. Below diagram
shows how it looks like with both BEs.

VFIO   AddressSpace/Memory
+---+  +--+  +-+  +-+
|  pci  |  | platform |  |  ap |  | ccw |
+---+---+  ++-+  +--+--+  +--+--+ +--+
|   |   |||   AddressSpace   |
|   |   ||++-+
+---V---V---VV+   /
|   VFIOAddressSpace  | <+
|  |  |  MemoryListener
|  VFIOContainer list |
+---+++
||
||
+---V--++V--+
|   iommufd||vfio legacy|
|  container   || container |
+---+--+++--+
||
| /dev/iommu | /dev/vfio/vfio
| /dev/vfio/devices/vfioX| /dev/vfio/$group_id
 Userspace  ||
 ===++
 Kernel |  device fd |
+---+| group/container fd
| (BIND_IOMMUFD || (SET_CONTAINER/SET_IOMMU)
|  ATTACH_IOAS) || device fd
|   ||
|   +---VV-+
iommufd |   |vfio  |
(map/unmap  |   +-++---+
 ioas_copy) | || map/unmap
| ||
 +--V--++-V--+  +--V+
 | iommfd core ||  device|  |  vfio iommu   |
 +-+++  +---+

[Secure Context setup]
- iommufd BE: uses device fd and iommufd to setup secure context
  (bind_iommufd, attach_ioas)
- vfio legacy BE: uses group fd and container fd to setup secure context
  (set_container, set_iommu)
[Device access]
- iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: device fd is retrieved from group fd ioctl
[DMA Mapping flow]
- VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
- VFIO populates DMA map/unmap via the container BEs
  *) iommufd BE: uses iommufd
  *) vfio legacy BE: uses container fd

This series qomifies the VFIOContainer object which acts as a base class
for a container. This base class is derived into the legacy VFIO container
and the new iommufd based container. The base class implements generic code
such as code related to memory_listener and address space management whereas
the derived class implements callbacks that depend on the kernel user space
being used.

The selection of the backend is made on a device basis using the new
iommufd option (on/off/auto). By default the iommufd backend is selected
if supported by the host and by QEMU (iommufd KConfig). This option is
currently available only for the vfio-pci device. For other types of
devices, it does not yet exist and the legacy BE is chosen by default.

Test done:
- PCI and Platform device were tested
- ccw and ap were only compile-tested
- limited device hotplug test
- vIOMMU test run for both legacy and iommufd backends (limited tests)

This series was co-developed by Eric Auger and me based on the exploration
iommufd kernel[2], complete code of this series is available in[3]. As
iommufd kernel is in the early step (only iommufd generic interface is in
mailing list), so this series hasn't made the iommufd backend fully on par
with legacy backend w.r.t. features like p2p mappings, coherency tracking,
live migration, etc. This

[RFC 03/18] hw/vfio/pci: fix vfio_pci_hot_reset_result trace point

2022-04-14 Thread Yi Liu

From: Eric Auger 

Properly output the errno string.

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 hw/vfio/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 67a183f17b..e26e65bb1f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2337,7 +2337,7 @@ static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool 
single)
 g_free(reset);
 
 trace_vfio_pci_hot_reset_result(vdev->vbasedev.name,
-ret ? "%m" : "Success");
+ret ? strerror(errno) : "Success");
 
 out:
 /* Re-enable INTx on affected devices */
-- 
2.27.0

[RFC 01/18] scripts/update-linux-headers: Add iommufd.h

2022-04-14 Thread Yi Liu

From: Eric Auger 

Update the script to import iommufd.h

Signed-off-by: Eric Auger 
Signed-off-by: Yi Liu 
---
 scripts/update-linux-headers.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 839a5ec614..a89b83e6d6 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -160,7 +160,7 @@ done
 
 rm -rf "$output/linux-headers/linux"
 mkdir -p "$output/linux-headers/linux"
-for header in kvm.h vfio.h vfio_ccw.h vfio_zdev.h vhost.h \
+for header in kvm.h vfio.h iommufd.h vfio_ccw.h vfio_zdev.h vhost.h \
   psci.h psp-sev.h userfaultfd.h mman.h; do
 cp "$tmpdir/include/linux/$header" "$output/linux-headers/linux"
 done
-- 
2.27.0

Re: [PATCH] docs: Correct the default thread-pool-size

2022-04-14 Thread liuyd.f...@fujitsu.com

[+cc dgilb...@redhat.com stefa...@redhat.com]

On 4/14/22 1:05 PM, liuyd.f...@fujitsu.com wrote:
> [+cc vgo...@redhat.com]
>
> On 4/13/22 12:20 PM, Liu Yiding wrote:
>> Refer to 26ec190964 virtiofsd: Do not use a thread pool by default
>>
>> Signed-off-by: Liu Yiding 
>> ---
>>docs/tools/virtiofsd.rst | 2 +-
>>1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
>> index 0c0560203c..33fed08c6f 100644
>> --- a/docs/tools/virtiofsd.rst
>> +++ b/docs/tools/virtiofsd.rst
>> @@ -127,7 +127,7 @@ Options
>>.. option:: --thread-pool-size=NUM
>>
>>  Restrict the number of worker threads per request queue to NUM.  The 
>> default
>> -  is 64.
>> +  is 0.
>>
>>.. option:: --cache=none|auto|always
>>

-- 
Best Regards.
Yiding Liu

[PATCH v5 3/3] tests/qtest/libqos: Add generic pci host bridge in arm-virt machine

2022-04-14 Thread Eric Auger

Up to now the virt-machine node contains a virtio-mmio node.
However no driver produces any PCI interface node. Hence, PCI
tests cannot be run with aarch64 binary.

Add a GPEX driver node that produces a pci interface node. This latter
then can be consumed by all the pci tests. One of the first motivation
was to be able to run the virtio-iommu-pci tests.

We still face an issue with pci hotplug tests as hotplug cannot happen
on the pcie root bus and require a generic root port. This will be
addressed later on.

We force cpu=max along with aarch64/virt machine as some PCI tests
require high MMIO regions to be available.

Signed-off-by: Eric Auger 

---
v3 -> v4:
- properly handle le/cpu conversions on config access

v2 -> v3:
- force cpu=max with aarch64/virt

v1 -> v2:
- copyright updated to 2022
- QPCIBusARM renamed into QGenericPCIBus
- QGenericPCIHost declarations and definitions moved in the same
  place as the generic pci implementation
- rename pci-arm.c/h in generic-pcihost.c/h and remove any ref to
  ARM there
- remove qos_node_produces_opts, qpci_new_arm, qpci_free_arm
- ecam_alloc_ptr now is a field of QGenericPCIBus and not QPCIBus
- new libqos_init to create generic-pcihost driver that contains
  pci-bus-generic
- QGenericPCIHost moved in the same place as the generic pci
  bindings

Signed-off-by: Eric Auger 
---
 tests/qtest/libqos/arm-virt-machine.c |  19 ++-
 tests/qtest/libqos/generic-pcihost.c  | 231 ++
 tests/qtest/libqos/generic-pcihost.h  |  54 ++
 tests/qtest/libqos/meson.build|   1 +
 4 files changed, 301 insertions(+), 4 deletions(-)
 create mode 100644 tests/qtest/libqos/generic-pcihost.c
 create mode 100644 tests/qtest/libqos/generic-pcihost.h

diff --git a/tests/qtest/libqos/arm-virt-machine.c 
b/tests/qtest/libqos/arm-virt-machine.c
index e0f5932284..96da0dde54 100644
--- a/tests/qtest/libqos/arm-virt-machine.c
+++ b/tests/qtest/libqos/arm-virt-machine.c
@@ -22,6 +22,8 @@
 #include "malloc.h"
 #include "qgraph.h"
 #include "virtio-mmio.h"
+#include "generic-pcihost.h"
+#include "hw/pci/pci_regs.h"
 
 #define ARM_PAGE_SIZE   4096
 #define VIRTIO_MMIO_BASE_ADDR   0x0A003E00
@@ -35,6 +37,7 @@ struct QVirtMachine {
 QOSGraphObject obj;
 QGuestAllocator alloc;
 QVirtioMMIODevice virtio_mmio;
+QGenericPCIHost bridge;
 };
 
 static void virt_destructor(QOSGraphObject *obj)
@@ -57,11 +60,13 @@ static void *virt_get_driver(void *object, const char 
*interface)
 static QOSGraphObject *virt_get_device(void *obj, const char *device)
 {
 QVirtMachine *machine = obj;
-if (!g_strcmp0(device, "virtio-mmio")) {
+if (!g_strcmp0(device, "generic-pcihost")) {
+return >bridge.obj;
+} else if (!g_strcmp0(device, "virtio-mmio")) {
 return >virtio_mmio.obj;
 }
 
-fprintf(stderr, "%s not present in arm/virtio\n", device);
+fprintf(stderr, "%s not present in arm/virt\n", device);
 g_assert_not_reached();
 }
 
@@ -76,16 +81,22 @@ static void *qos_create_machine_arm_virt(QTestState *qts)
 qvirtio_mmio_init_device(>virtio_mmio, qts, VIRTIO_MMIO_BASE_ADDR,
  VIRTIO_MMIO_SIZE);
 
+qos_create_generic_pcihost(>bridge, qts, >alloc);
+
 machine->obj.get_device = virt_get_device;
 machine->obj.get_driver = virt_get_driver;
 machine->obj.destructor = virt_destructor;
 return machine;
 }
 
-static void virtio_mmio_register_nodes(void)
+static void virt_machine_register_nodes(void)
 {
 qos_node_create_machine("arm/virt", qos_create_machine_arm_virt);
 qos_node_contains("arm/virt", "virtio-mmio", NULL);
+
+qos_node_create_machine_args("aarch64/virt", qos_create_machine_arm_virt,
+ " -cpu max");
+qos_node_contains("aarch64/virt", "generic-pcihost", NULL);
 }
 
-libqos_init(virtio_mmio_register_nodes);
+libqos_init(virt_machine_register_nodes);
diff --git a/tests/qtest/libqos/generic-pcihost.c 
b/tests/qtest/libqos/generic-pcihost.c
new file mode 100644
index 00..c237f8ca7a
--- /dev/null
+++ b/tests/qtest/libqos/generic-pcihost.c
@@ -0,0 +1,231 @@
+/*
+ * libqos PCI bindings for generic PCI
+ *
+ * Copyright Red Hat Inc., 2022
+ *
+ * Authors:
+ *  Eric Auger   
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "libqtest.h"
+#include "generic-pcihost.h"
+#include "qapi/qmp/qdict.h"
+#include "hw/pci/pci_regs.h"
+#include "qemu/host-utils.h"
+
+#include "qemu/module.h"
+
+/* QGenericPCIHost */
+
+QOSGraphObject *generic_pcihost_get_device(void *obj, const char *device)
+{
+QGenericPCIHost *host = obj;
+if (!g_strcmp0(device, "pci-bus-generic")) {
+return >pci.obj;
+}
+fprintf(stderr, "%s not present in generic-pcihost\n", device);
+g_assert_not_reached();
+}
+
+void qos_create_generic_pcihost(QGenericPCIHost *host,
+

[PATCH v5 2/3] tests/qtest/libqos: Skip hotplug tests if pci root bus is not hotpluggable

2022-04-14 Thread Eric Auger

ARM does not not support hotplug on pcie.0. Add a flag on the bus
which tells if devices can be hotplugged and skip hotplug tests
if the bus cannot be hotplugged. This is a temporary solution to
enable the other pci tests on aarch64.

Signed-off-by: Eric Auger 
Acked-by: Thomas Huth 
Reviewed-by: Alex Bennée 

---

v1 ->v2:
- reword g_test_skip msg into "pci bus does not support hotplug"
---
 tests/qtest/e1000e-test.c |  6 ++
 tests/qtest/libqos/pci.h  |  1 +
 tests/qtest/vhost-user-blk-test.c | 10 ++
 tests/qtest/virtio-blk-test.c |  5 +
 tests/qtest/virtio-net-test.c |  5 +
 tests/qtest/virtio-rng-test.c |  5 +
 6 files changed, 32 insertions(+)

diff --git a/tests/qtest/e1000e-test.c b/tests/qtest/e1000e-test.c
index e648fdd409..ad6d7f9303 100644
--- a/tests/qtest/e1000e-test.c
+++ b/tests/qtest/e1000e-test.c
@@ -235,6 +235,12 @@ static void test_e1000e_multiple_transfers(void *obj, void 
*data,
 static void test_e1000e_hotplug(void *obj, void *data, QGuestAllocator * alloc)
 {
 QTestState *qts = global_qtest;  /* TODO: get rid of global_qtest here */
+QE1000E_PCI *dev = obj;
+
+if (dev->pci_dev.bus->not_hotpluggable) {
+g_test_skip("pci bus does not support hotplug");
+return;
+}
 
 qtest_qmp_device_add(qts, "e1000e", "e1000e_net", "{'addr': '0x06'}");
 qpci_unplug_acpi_device_test(qts, "e1000e_net", 0x06);
diff --git a/tests/qtest/libqos/pci.h b/tests/qtest/libqos/pci.h
index 44f6806fe4..6a28b40522 100644
--- a/tests/qtest/libqos/pci.h
+++ b/tests/qtest/libqos/pci.h
@@ -52,6 +52,7 @@ struct QPCIBus {
 uint64_t pio_alloc_ptr, pio_limit;
 uint64_t mmio_alloc_ptr, mmio_limit;
 bool has_buggy_msi; /* TRUE for spapr, FALSE for pci */
+bool not_hotpluggable; /* TRUE if devices cannot be hotplugged */
 
 };
 
diff --git a/tests/qtest/vhost-user-blk-test.c 
b/tests/qtest/vhost-user-blk-test.c
index 62e670f39b..1316aae0fa 100644
--- a/tests/qtest/vhost-user-blk-test.c
+++ b/tests/qtest/vhost-user-blk-test.c
@@ -676,6 +676,11 @@ static void pci_hotplug(void *obj, void *data, 
QGuestAllocator *t_alloc)
 QVirtioPCIDevice *dev;
 QTestState *qts = dev1->pdev->bus->qts;
 
+if (dev1->pdev->bus->not_hotpluggable) {
+g_test_skip("pci bus does not support hotplug");
+return;
+}
+
 /* plug secondary disk */
 qtest_qmp_device_add(qts, "vhost-user-blk-pci", "drv1",
  "{'addr': %s, 'chardev': 'char2'}",
@@ -703,6 +708,11 @@ static void multiqueue(void *obj, void *data, 
QGuestAllocator *t_alloc)
 uint64_t features;
 uint16_t num_queues;
 
+if (pdev1->pdev->bus->not_hotpluggable) {
+g_test_skip("bus pci.0 does not support hotplug");
+return;
+}
+
 /*
  * The primary device has 1 queue and VIRTIO_BLK_F_MQ is not enabled. The
  * VIRTIO specification allows VIRTIO_BLK_F_MQ to be enabled when there is
diff --git a/tests/qtest/virtio-blk-test.c b/tests/qtest/virtio-blk-test.c
index 2a23698211..acb44c9fb8 100644
--- a/tests/qtest/virtio-blk-test.c
+++ b/tests/qtest/virtio-blk-test.c
@@ -701,6 +701,11 @@ static void pci_hotplug(void *obj, void *data, 
QGuestAllocator *t_alloc)
 QVirtioPCIDevice *dev;
 QTestState *qts = dev1->pdev->bus->qts;
 
+if (dev1->pdev->bus->not_hotpluggable) {
+g_test_skip("pci bus does not support hotplug");
+return;
+}
+
 /* plug secondary disk */
 qtest_qmp_device_add(qts, "virtio-blk-pci", "drv1",
  "{'addr': %s, 'drive': 'drive1'}",
diff --git a/tests/qtest/virtio-net-test.c b/tests/qtest/virtio-net-test.c
index a71395849f..e54817f154 100644
--- a/tests/qtest/virtio-net-test.c
+++ b/tests/qtest/virtio-net-test.c
@@ -174,6 +174,11 @@ static void hotplug(void *obj, void *data, QGuestAllocator 
*t_alloc)
 QTestState *qts = dev->pdev->bus->qts;
 const char *arch = qtest_get_arch();
 
+if (dev->pdev->bus->not_hotpluggable) {
+g_test_skip("pci bus does not support hotplug");
+return;
+}
+
 qtest_qmp_device_add(qts, "virtio-net-pci", "net1",
  "{'addr': %s}", stringify(PCI_SLOT_HP));
 
diff --git a/tests/qtest/virtio-rng-test.c b/tests/qtest/virtio-rng-test.c
index e6b8cd8e0c..5ce444ad72 100644
--- a/tests/qtest/virtio-rng-test.c
+++ b/tests/qtest/virtio-rng-test.c
@@ -20,6 +20,11 @@ static void rng_hotplug(void *obj, void *data, 
QGuestAllocator *alloc)
 QVirtioPCIDevice *dev = obj;
 QTestState *qts = dev->pdev->bus->qts;
 
+   if (dev->pdev->bus->not_hotpluggable) {
+g_test_skip("pci bus does not support hotplug");
+return;
+}
+
 const char *arch = qtest_get_arch();
 
 qtest_qmp_device_add(qts, "virtio-rng-pci", "rng1",
-- 
2.26.3

[PATCH v5 0/3] qtests/libqos: Allow PCI tests to be run with virt-machine

2022-04-14 Thread Eric Auger

Up to now the virt-machine node only contains a virtio-mmio
driver node but no driver that eventually produces any pci-bus
interface.

Hence, PCI libqos tests cannot be run with aarch64 binary.

This series brings the pieces needed to be able to run PCI tests
with the aarch64 binary: a generic-pcihost driver node gets
instantiated by the machine. This later contains a pci-bus-generic
driver which produces a pci-bus interface. Then all tests
consuming the pci-bus interface can be run with the libqos arm
virt machine.

One of the first goal was to be able to run the virtio-iommu-pci
tests as the virtio-iommu was initially targetting ARM and it
was awkard to run the tests with the pc machine. This is now
possible.

Only the tests doing hotplug cannot be run yet as hotplug is
not possible on the root bus. This will be dealt with separately
by adding a root port to the object tree.

Best Regards

Eric

Depends on Alex' fix:
tests/qtest: properly initialise the vring used idx

This series can be found at:
https://github.com/eauger/qemu/tree/libqos-pci-arm-v5

History

v4 -> v5:
- Added Alex' R-b
- Removed [PATCH v3 4/5] tests/qtest/vhost-user-blk-test:
  Temporary hack to get tests passing on aarch64
  following Alex' fix

v3 -> v4:
- handle endianess when accessing the cfg space (fix PPC64
  BE failure). Tested on such machine.

v2 -> v3:
- force -cpu=max along with aarch64/virt
- reduced the vhost-user-block-pci issue workaround to a
  single guest_alloc() instead of enabling MSIs. Call for
  help on this specific issue. The 2 tests which fail are:
  test_basic and indirect.

v1 -> v2:
- copyright updated to 2022
- QPCIBusARM renamed into QGenericPCIBus
- QGenericPCIHost declarations and definitions moved in the same
  place as the generic pci implementation
- rename pci-arm.c/h in generic-pcihost.c/h and remove any ref to
  ARM there
- remove qos_node_produces_opts, qpci_new_arm, qpci_free_arm
- ecam_alloc_ptr now is a field of QGenericPCIBus and not QPCIBus
- new libqos_init to create generic-pcihost driver that contains
  pci-bus-generic
- QGenericPCIHost moved in the same place as the generic pci
  bindings
- collected Thomas A-b/R-b

Eric Auger (3):
  tests/qtest/libqos/pci: Introduce pio_limit
  tests/qtest/libqos: Skip hotplug tests if pci root bus is not
hotpluggable
  tests/qtest/libqos: Add generic pci host bridge in arm-virt machine

 tests/qtest/e1000e-test.c |   6 +
 tests/qtest/libqos/arm-virt-machine.c |  19 ++-
 tests/qtest/libqos/generic-pcihost.c  | 231 ++
 tests/qtest/libqos/generic-pcihost.h  |  54 ++
 tests/qtest/libqos/meson.build|   1 +
 tests/qtest/libqos/pci-pc.c   |   1 +
 tests/qtest/libqos/pci-spapr.c|   1 +
 tests/qtest/libqos/pci.c  |  78 +
 tests/qtest/libqos/pci.h  |   6 +-
 tests/qtest/vhost-user-blk-test.c |  10 ++
 tests/qtest/virtio-blk-test.c |   5 +
 tests/qtest/virtio-net-test.c |   5 +
 tests/qtest/virtio-rng-test.c |   5 +
 13 files changed, 387 insertions(+), 35 deletions(-)
 create mode 100644 tests/qtest/libqos/generic-pcihost.c
 create mode 100644 tests/qtest/libqos/generic-pcihost.h

-- 
2.26.3

[PATCH v5 1/3] tests/qtest/libqos/pci: Introduce pio_limit

2022-04-14 Thread Eric Auger

At the moment the IO space limit is hardcoded to
QPCI_PIO_LIMIT = 0x1. When accesses are performed to a bar,
the base address of this latter is compared against the limit
to decide whether we perform an IO or a memory access.

On ARM, we cannot keep this PIO limit as the arm-virt machine
uses [0x3eff, 0x3f00 ] for the IO space map and we
are mandated to allocate at 0x0.

Add a new flag in QPCIBar indicating whether it is an IO bar
or a memory bar. This flag is set on QPCIBar allocation and
provisionned based on the BAR configuration. Then the new flag
is used in access functions and in iomap() function.

Signed-off-by: Eric Auger 
Reviewed-by: Thomas Huth 
Reviewed-by: Alex Bennée 
---
 tests/qtest/libqos/pci-pc.c|  1 +
 tests/qtest/libqos/pci-spapr.c |  1 +
 tests/qtest/libqos/pci.c   | 78 ++
 tests/qtest/libqos/pci.h   |  5 +--
 4 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/tests/qtest/libqos/pci-pc.c b/tests/qtest/libqos/pci-pc.c
index f97844289f..8051a0881a 100644
--- a/tests/qtest/libqos/pci-pc.c
+++ b/tests/qtest/libqos/pci-pc.c
@@ -150,6 +150,7 @@ void qpci_init_pc(QPCIBusPC *qpci, QTestState *qts, 
QGuestAllocator *alloc)
 
 qpci->bus.qts = qts;
 qpci->bus.pio_alloc_ptr = 0xc000;
+qpci->bus.pio_limit = 0x1;
 qpci->bus.mmio_alloc_ptr = 0xE000;
 qpci->bus.mmio_limit = 0x1ULL;
 
diff --git a/tests/qtest/libqos/pci-spapr.c b/tests/qtest/libqos/pci-spapr.c
index 262226985f..870ffdd8b5 100644
--- a/tests/qtest/libqos/pci-spapr.c
+++ b/tests/qtest/libqos/pci-spapr.c
@@ -197,6 +197,7 @@ void qpci_init_spapr(QPCIBusSPAPR *qpci, QTestState *qts,
 
 qpci->bus.qts = qts;
 qpci->bus.pio_alloc_ptr = 0xc000;
+qpci->bus.pio_limit = 0x1;
 qpci->bus.mmio_alloc_ptr = qpci->mmio32.pci_base;
 qpci->bus.mmio_limit = qpci->mmio32.pci_base + qpci->mmio32.size;
 
diff --git a/tests/qtest/libqos/pci.c b/tests/qtest/libqos/pci.c
index 3a9076ae58..b23d72346b 100644
--- a/tests/qtest/libqos/pci.c
+++ b/tests/qtest/libqos/pci.c
@@ -398,44 +398,56 @@ void qpci_config_writel(QPCIDevice *dev, uint8_t offset, 
uint32_t value)
 
 uint8_t qpci_io_readb(QPCIDevice *dev, QPCIBar token, uint64_t off)
 {
-if (token.addr < QPCI_PIO_LIMIT) {
-return dev->bus->pio_readb(dev->bus, token.addr + off);
+QPCIBus *bus = dev->bus;
+
+if (token.is_io) {
+return bus->pio_readb(bus, token.addr + off);
 } else {
 uint8_t val;
-dev->bus->memread(dev->bus, token.addr + off, , sizeof(val));
+
+bus->memread(dev->bus, token.addr + off, , sizeof(val));
 return val;
 }
 }
 
 uint16_t qpci_io_readw(QPCIDevice *dev, QPCIBar token, uint64_t off)
 {
-if (token.addr < QPCI_PIO_LIMIT) {
-return dev->bus->pio_readw(dev->bus, token.addr + off);
+QPCIBus *bus = dev->bus;
+
+if (token.is_io) {
+return bus->pio_readw(bus, token.addr + off);
 } else {
 uint16_t val;
-dev->bus->memread(dev->bus, token.addr + off, , sizeof(val));
+
+bus->memread(bus, token.addr + off, , sizeof(val));
 return le16_to_cpu(val);
 }
 }
 
 uint32_t qpci_io_readl(QPCIDevice *dev, QPCIBar token, uint64_t off)
 {
-if (token.addr < QPCI_PIO_LIMIT) {
-return dev->bus->pio_readl(dev->bus, token.addr + off);
+QPCIBus *bus = dev->bus;
+
+if (token.is_io) {
+return bus->pio_readl(bus, token.addr + off);
 } else {
 uint32_t val;
-dev->bus->memread(dev->bus, token.addr + off, , sizeof(val));
+
+bus->memread(dev->bus, token.addr + off, , sizeof(val));
 return le32_to_cpu(val);
 }
 }
 
 uint64_t qpci_io_readq(QPCIDevice *dev, QPCIBar token, uint64_t off)
 {
-if (token.addr < QPCI_PIO_LIMIT) {
-return dev->bus->pio_readq(dev->bus, token.addr + off);
+QPCIBus *bus = dev->bus;
+
+if (token.is_io) {
+return bus->pio_readq(bus, token.addr + off);
 } else {
 uint64_t val;
-dev->bus->memread(dev->bus, token.addr + off, , sizeof(val));
+
+bus->memread(bus, token.addr + off, , sizeof(val));
 return le64_to_cpu(val);
 }
 }
@@ -443,57 +455,65 @@ uint64_t qpci_io_readq(QPCIDevice *dev, QPCIBar token, 
uint64_t off)
 void qpci_io_writeb(QPCIDevice *dev, QPCIBar token, uint64_t off,
 uint8_t value)
 {
-if (token.addr < QPCI_PIO_LIMIT) {
-dev->bus->pio_writeb(dev->bus, token.addr + off, value);
+QPCIBus *bus = dev->bus;
+
+if (token.is_io) {
+bus->pio_writeb(bus, token.addr + off, value);
 } else {
-dev->bus->memwrite(dev->bus, token.addr + off, , sizeof(value));
+bus->memwrite(bus, token.addr + off, , sizeof(value));
 }
 }
 
 void qpci_io_writew(QPCIDevice *dev, QPCIBar token, uint64_t off,
 uint16_t value)
 {
-if (token.addr < QPCI_PIO_LIMIT) {
-dev->bus->pio_writew(dev->bus, token.addr + off, value);
+QPCIBus

Re: [PATCH v5 1/4] qapi/machine.json: Add cluster-id

2022-04-14 Thread wangyanan (Y)


On 2022/4/14 15:56, Gavin Shan wrote:

Hi Yanan,

On 4/14/22 10:27 AM, wangyanan (Y) wrote:

On 2022/4/14 8:06, Gavin Shan wrote:

On 4/13/22 7:49 PM, wangyanan (Y) wrote:

On 2022/4/3 22:59, Gavin Shan wrote:

This adds cluster-id in CPU instance properties, which will be used
by arm/virt machine. Besides, the cluster-id is also verified or
dumped in various spots:

   * hw/core/machine.c::machine_set_cpu_numa_node() to associate
 CPU with its NUMA node.

   * hw/core/machine.c::machine_numa_finish_cpu_init() to associate
 CPU with NUMA node when no default association isn't provided.

   * hw/core/machine-hmp-cmds.c::hmp_hotpluggable_cpus() to dump
 cluster-id.

Signed-off-by: Gavin Shan 
---
  hw/core/machine-hmp-cmds.c |  4 
  hw/core/machine.c  | 16 
  qapi/machine.json  |  6 --
  3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/hw/core/machine-hmp-cmds.c b/hw/core/machine-hmp-cmds.c
index 4e2f319aeb..5cb5eecbfc 100644
--- a/hw/core/machine-hmp-cmds.c
+++ b/hw/core/machine-hmp-cmds.c
@@ -77,6 +77,10 @@ void hmp_hotpluggable_cpus(Monitor *mon, const 
QDict *qdict)

  if (c->has_die_id) {
  monitor_printf(mon, "    die-id: \"%" PRIu64 "\"\n", 
c->die_id);

  }
+    if (c->has_cluster_id) {
+    monitor_printf(mon, "    cluster-id: \"%" PRIu64 "\"\n",
+   c->cluster_id);
+    }
  if (c->has_core_id) {
  monitor_printf(mon, "    core-id: \"%" PRIu64 
"\"\n", c->core_id);

  }
diff --git a/hw/core/machine.c b/hw/core/machine.c
index d856485cb4..8748b64657 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -677,6 +677,11 @@ void machine_set_cpu_numa_node(MachineState 
*machine,

  return;
  }
+    if (props->has_cluster_id && !slot->props.has_cluster_id) {
+    error_setg(errp, "cluster-id is not supported");
+    return;
+    }
+
  if (props->has_socket_id && !slot->props.has_socket_id) {
  error_setg(errp, "socket-id is not supported");
  return;
@@ -696,6 +701,11 @@ void machine_set_cpu_numa_node(MachineState 
*machine,

  continue;
  }
+    if (props->has_cluster_id &&
+    props->cluster_id != slot->props.cluster_id) {
+    continue;
+    }
+
  if (props->has_die_id && props->die_id != 
slot->props.die_id) {

  continue;
  }
@@ -990,6 +1000,12 @@ static char *cpu_slot_to_string(const 
CPUArchId *cpu)

  }
  g_string_append_printf(s, "die-id: %"PRId64, 
cpu->props.die_id);

  }
+    if (cpu->props.has_cluster_id) {
+    if (s->len) {
+    g_string_append_printf(s, ", ");
+    }
+    g_string_append_printf(s, "cluster-id: %"PRId64, 
cpu->props.cluster_id);

+    }
  if (cpu->props.has_core_id) {
  if (s->len) {
  g_string_append_printf(s, ", ");
diff --git a/qapi/machine.json b/qapi/machine.json
index 9c460ec450..ea22b574b0 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -868,10 +868,11 @@
  # @node-id: NUMA node ID the CPU belongs to
  # @socket-id: socket number within node/board the CPU belongs to
  # @die-id: die number within socket the CPU belongs to (since 4.1)
-# @core-id: core number within die the CPU belongs to
+# @cluster-id: cluster number within die the CPU belongs to

We also need a "(since 7.1)" tag for cluster-id.

I remember this should be "cluster number within socket..."
according to Igor's comments in v3 ?


Igor had suggestion to correct the description for 'core-id' like
below, but he didn't suggest anything for 'cluster-id'. The question
is clusters are sub-components of die, instead of socket, if die
is supported? You may want to me change it like below and please
confirm.

  @cluster-id: cluster number within die/socket the CPU belongs to

suggestion from Ignor in v3:

   > +# @core-id: core number within cluster the CPU belongs to

   s:cluster:cluster/die:


We want "within cluster/die" description for core-id because we
support both "cores in cluster" for ARM and "cores in die" for X86.
Base on this routine, we only need "within socket" for cluster-id
because we currently only support "clusters in socket". Does this
make sense?



Thanks for the explanation. So ARM64 doesn't have die and x86 doesn't
have cluster?

At least for now, yes. :)

Thanks,
Yanan

If so, I need to adjust the description for 'cluster-id'
as you suggested in v6:

  @cluster-id: cluster number within socket the CPU belongs to (since 
7.1)

Alternatively, the plainest documentation for the IDs is to simply
range **-id only to its next level topo like below. This may avoid
increasing complexity when more topo-ids are inserted middle.
But whether this way is acceptable is up to the Maintainers. :)

# @socket-id: socket number within node/board the CPU belongs to
# @die-id: die number within socket the CPU

1 2 >

1 - 100 of 118 matches

Mail list logo