date:20211207


On 08/12/2021 08.48, Thomas Huth wrote:
> On 08/12/2021 08.44, Michael S. Tsirkin wrote:
> > On Mon, Dec 06, 2021 at 11:20:39PM +0100, Laurent Vivier wrote:
> > > Add some tests to check the state of the machine if the migration
> > > is cancelled while we are using virtio-net failover.
> > > 
> > > Signed-off-by: Laurent Vivier
> > 
> > So this one I think is needed for the release. Thomas, are you
> > merging it there or should I?
> 
> rc4 has already been tagged yesterday. I don't think that Richard will still
> allow another PR at this point in time unless it fixes a really really
> critical problem. Laurent's series only adds a new qtest, so this certainly
> does not qualify, AFAIK.

Never mind, I had patch series v7 before my eyes when I hit the reply button
here. The patch that touches the code outside of the test folder (3/6) has
already been merged, so we should be fine for the release.


 Thomas

Re: [PATCH v5 3/4] failover: fix unplug pending detection


On 08/12/2021 08.36, Michael S. Tsirkin wrote:
> On Fri, Nov 19, 2021 at 10:07:17AM +0100, Laurent Vivier wrote:
> > Failover needs to detect the end of the PCI unplug to start migration
> > after the VFIO card has been unplugged.
> > 
> > To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
> > pcie_unplug_device().
> > 
> > But since
> > 17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
> > we have switched to ACPI unplug and these functions are not called anymore
> > and the flag not set. So failover migration is not able to detect if card
> > is really unplugged and acts as it's done as soon as it's started. So it
> > doesn't wait the end of the unplug to start the migration. We don't see any
> > problem when we test that because ACPI unplug is faster than PCIe native
> > hotplug and when the migration really starts the unplug operation is
> > already done.
> > 
> > See c000a9bd06ea ("pci: mark device having guest unplug request pending")
> > a99c4da9fc2a ("pci: mark devices partially unplugged")
> > 
> > Signed-off-by: Laurent Vivier
> > Reviewed-by: Ani Sinha
> 
> Hmm.  I think this one may be needed for this release actually.
> Isolate from testing changes and repost?

You merged it already here:

 https://gitlab.com/qemu-project/qemu/-/commit/9323f892b39d133eb6

so we should be fine :-)

 Thomas

Re: [PATCH v6 5/6] test/libqtest: add some virtio-net failover migration cancelling tests


On 08/12/2021 08.44, Michael S. Tsirkin wrote:
> On Mon, Dec 06, 2021 at 11:20:39PM +0100, Laurent Vivier wrote:
> > Add some tests to check the state of the machine if the migration
> > is cancelled while we are using virtio-net failover.
> > 
> > Signed-off-by: Laurent Vivier
> 
> So this one I think is needed for the release. Thomas, are you
> merging it there or should I?

rc4 has already been tagged yesterday. I don't think that Richard will still
allow another PR at this point in time unless it fixes a really really
critical problem. Laurent's series only adds a new qtest, so this certainly
does not qualify, AFAIK.


 Thomas

Re: [PATCH v6 0/6] tests/qtest: add some tests for virtio-net failover

On Mon, Dec 06, 2021 at 11:20:34PM +0100, Laurent Vivier wrote:
> This series adds a qtest entry to test virtio-net failover feature.

I think it's a good idea to CC me and Jason on the next version.
Thanks!


> We check following error cases:
> 
> - check missing id on device with failover_pair_id triggers an error
> - check a primary device plugged on a bus that doesn't support hotplug
>   triggers an error
> 
> We check the status of the machine before and after hotplugging cards and
> feature negotiation:
> 
> - check we don't see the primary device at boot if failover is on
> - check we see the primary device at boot if failover is off
> - check we don't see the primary device if failover is on
>   but failover_pair_id is not the one with on (I think this should be changed)
> - check the primary device is plugged after the feature negotiation
> - check the result if the primary device is plugged before standby device and
>   vice-versa
> - check the result if the primary device is coldplugged and the standby device
>   hotplugged and vice-versa
> - check the migration triggers the unplug and the hotplug
> 
> There is one preliminary patch in the series:
> 
> - PATCH 1 introduces a function to enable PCI bridge.
>   Failover needs to be plugged on a pcie-root-port and while
>   the root port is not configured the cards behind it are not
>   available
> 
> v6:
> - manage more than 2 root ports
> - add a function to check if a card is available or not
> - check migration state
> - add cancelled migration test cases
> - rename tests
> 
> v5:
> - re-add the wait-unplug test that has been removed from v4 by mistake.
> 
> v4:
> - rely on query-migrate status to know the migration state rather than
>   to wait the STOP event.
> - remove the patch to add time out to qtest_qmp_eventwait()
> 
> v3:
> - fix a bug with ACPI unplug and add the related test
> 
> v2:
> - remove PATCH 1 that introduced a function that can be replaced by
>   qobject_to_json_pretty() (Markus)
> - Add migration to a file and from the file to check the card is
>   correctly unplugged on the source, and hotplugged on the dest
> - Add an ACPI call to eject the card as the kernel would do
> 
> Laurent Vivier (6):
>   qtest/libqos: add a function to initialize secondary PCI buses
>   tests/qtest: add some tests for virtio-net failover
>   failover: fix unplug pending detection
>   tests/libqtest: update virtio-net failover test
>   test/libqtest: add some virtio-net failover migration cancelling tests
>   tests/libqtest: add a migration test with two couples of failover
> devices
> 
>  hw/acpi/pcihp.c   |   30 +-
>  include/hw/pci/pci_bridge.h   |8 +
>  tests/qtest/libqos/pci.c  |  118 +++
>  tests/qtest/libqos/pci.h  |1 +
>  tests/qtest/meson.build   |3 +
>  tests/qtest/virtio-net-failover.c | 1294 +
>  6 files changed, 1451 insertions(+), 3 deletions(-)
>  create mode 100644 tests/qtest/virtio-net-failover.c
> 
> -- 
> 2.33.1
> 
> 
> 
>

Re: [PATCH v6 5/6] test/libqtest: add some virtio-net failover migration cancelling tests

On Mon, Dec 06, 2021 at 11:20:39PM +0100, Laurent Vivier wrote:
> Add some tests to check the state of the machine if the migration
> is cancelled while we are using virtio-net failover.
> 
> Signed-off-by: Laurent Vivier 

So this one I think is needed for the release. Thomas, are you
merging it there or should I?

> ---
>  tests/qtest/virtio-net-failover.c | 291 ++
>  1 file changed, 291 insertions(+)
> 
> diff --git a/tests/qtest/virtio-net-failover.c 
> b/tests/qtest/virtio-net-failover.c
> index c88f8ddec39a..57abb99e7f6e 100644
> --- a/tests/qtest/virtio-net-failover.c
> +++ b/tests/qtest/virtio-net-failover.c
> @@ -682,6 +682,289 @@ static void test_migrate_in(gconstpointer opaque)
>  machine_stop(qts);
>  }
>  
> +static void test_migrate_abort_wait_unplug(gconstpointer opaque)
> +{
> +QTestState *qts;
> +QDict *resp, *args, *data, *ret;
> +g_autofree gchar *uri = g_strdup_printf("exec: cat > %s", (gchar 
> *)opaque);
> +const gchar *status;
> +
> +qts = machine_start(BASE_MACHINE
> + "-netdev user,id=hs0 "
> + "-netdev user,id=hs1 ",
> + 2);
> +
> +check_one_card(qts, false, "standby0", MAC_STANDBY0);
> +check_one_card(qts, false, "primary0", MAC_PRIMARY0);
> +
> +qtest_qmp_device_add(qts, "virtio-net", "standby0",
> + "{'bus': 'root0',"
> + "'failover': 'on',"
> + "'netdev': 'hs0',"
> + "'mac': '"MAC_STANDBY0"'}");
> +
> +check_one_card(qts, true, "standby0", MAC_STANDBY0);
> +check_one_card(qts, false, "primary0", MAC_PRIMARY0);
> +
> +start_virtio_net(qts, 1, 0, "standby0");
> +
> +check_one_card(qts, true, "standby0", MAC_STANDBY0);
> +check_one_card(qts, false, "primary0", MAC_PRIMARY0);
> +
> +qtest_qmp_device_add(qts, "virtio-net", "primary0",
> + "{'bus': 'root1',"
> + "'failover_pair_id': 'standby0',"
> + "'netdev': 'hs1',"
> + "'rombar': 0,"
> + "'romfile': '',"
> + "'mac': '"MAC_PRIMARY0"'}");
> +
> +check_one_card(qts, true, "standby0", MAC_STANDBY0);
> +check_one_card(qts, true, "primary0", MAC_PRIMARY0);
> +
> +args = qdict_from_jsonf_nofail("{}");
> +g_assert_nonnull(args);
> +qdict_put_str(args, "uri", uri);
> +
> +resp = qtest_qmp(qts, "{ 'execute': 'migrate', 'arguments': %p}", args);
> +g_assert(qdict_haskey(resp, "return"));
> +qobject_unref(resp);
> +
> +/* the event is sent when QEMU asks the OS to unplug the card */
> +resp = qtest_qmp_eventwait_ref(qts, "UNPLUG_PRIMARY");
> +g_assert(qdict_haskey(resp, "data"));
> +
> +data = qdict_get_qdict(resp, "data");
> +g_assert(qdict_haskey(data, "device-id"));
> +g_assert_cmpstr(qdict_get_str(data, "device-id"), ==, "primary0");
> +
> +qobject_unref(resp);
> +
> +resp = qtest_qmp(qts, "{ 'execute': 'migrate_cancel' }");
> +g_assert(qdict_haskey(resp, "return"));
> +qobject_unref(resp);
> +
> +/* migration has been cancelled while the unplug was in progress */
> +
> +/* while the card is not ejected, we must be in "cancelling" state */
> +ret = migrate_status(qts);
> +
> +status = qdict_get_str(ret, "status");
> +g_assert_cmpstr(status, ==, "cancelling");
> +qobject_unref(ret);
> +
> +/* OS unplugs the cards, QEMU can move from wait-unplug state */
> +qtest_outl(qts, ACPI_PCIHP_ADDR_ICH9 + PCI_EJ_BASE, 1);
> +
> +while (true) {
> +ret = migrate_status(qts);
> +
> +status = qdict_get_str(ret, "status");
> +if (strcmp(status, "cancelled") == 0) {
> +break;
> +}
> +g_assert_cmpstr(status, !=, "failed");
> +g_assert_cmpstr(status, !=, "active");
> +qobject_unref(ret);
> +}
> +qobject_unref(ret);
> +
> +check_one_card(qts, true, "standby0", MAC_STANDBY0);
> +check_one_card(qts, true, "primary0", MAC_PRIMARY0);
> +
> +machine_stop(qts);
> +}
> +
> +static void test_migrate_abort_active(gconstpointer opaque)
> +{
> +QTestState *qts;
> +QDict *resp, *args, *data, *ret;
> +g_autofree gchar *uri = g_strdup_printf("exec: cat > %s", (gchar 
> *)opaque);
> +const gchar *status;
> +
> +qts = machine_start(BASE_MACHINE
> + "-netdev user,id=hs0 "
> + "-netdev user,id=hs1 ",
> + 2);
> +
> +check_one_card(qts, false, "standby0", MAC_STANDBY0);
> +check_one_card(qts, false, "primary0", MAC_PRIMARY0);
> +
> +qtest_qmp_device_add(qts, "virtio-net", "standby0",
> + "{'bus': 'root0',"
> + "'failover': 'on',"
> + "'netdev': 'hs0',"
> + "'mac': '"MAC_STANDBY0"'}");
> +
> +check_one_card(qts,

Re: [PATCH v5 3/4] failover: fix unplug pending detection

On Fri, Nov 19, 2021 at 10:07:17AM +0100, Laurent Vivier wrote:
> Failover needs to detect the end of the PCI unplug to start migration
> after the VFIO card has been unplugged.
> 
> To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and reset in
> pcie_unplug_device().
> 
> But since
> 17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
> we have switched to ACPI unplug and these functions are not called anymore
> and the flag not set. So failover migration is not able to detect if card
> is really unplugged and acts as it's done as soon as it's started. So it
> doesn't wait the end of the unplug to start the migration. We don't see any
> problem when we test that because ACPI unplug is faster than PCIe native
> hotplug and when the migration really starts the unplug operation is
> already done.
> 
> See c000a9bd06ea ("pci: mark device having guest unplug request pending")
> a99c4da9fc2a ("pci: mark devices partially unplugged")
> 
> Signed-off-by: Laurent Vivier 
> Reviewed-by: Ani Sinha 

Hmm.  I think this one may be needed for this release actually.
Isolate from testing changes and repost?

> ---
>  hw/acpi/pcihp.c | 30 +++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/acpi/pcihp.c b/hw/acpi/pcihp.c
> index f610a25d2ef9..30405b5113d7 100644
> --- a/hw/acpi/pcihp.c
> +++ b/hw/acpi/pcihp.c
> @@ -222,9 +222,27 @@ static void acpi_pcihp_eject_slot(AcpiPciHpState *s, 
> unsigned bsel, unsigned slo
>  PCIDevice *dev = PCI_DEVICE(qdev);
>  if (PCI_SLOT(dev->devfn) == slot) {
>  if (!acpi_pcihp_pc_no_hotplug(s, dev)) {
> -hotplug_ctrl = qdev_get_hotplug_handler(qdev);
> -hotplug_handler_unplug(hotplug_ctrl, qdev, &error_abort);
> -object_unparent(OBJECT(qdev));
> +/*
> + * partially_hotplugged is used by virtio-net failover:
> + * failover has asked the guest OS to unplug the device
> + * but we need to keep some references to the device
> + * to be able to plug it back in case of failure so
> + * we don't execute hotplug_handler_unplug().
> + */
> +if (dev->partially_hotplugged) {
> +/*
> + * pending_deleted_event is set to true when
> + * virtio-net failover asks to unplug the device,
> + * and set to false here when the operation is done
> + * This is used by the migration loop to detect the
> + * end of the operation and really start the migration.
> + */
> +qdev->pending_deleted_event = false;
> +} else {
> +hotplug_ctrl = qdev_get_hotplug_handler(qdev);
> +hotplug_handler_unplug(hotplug_ctrl, qdev, &error_abort);
> +object_unparent(OBJECT(qdev));
> +}
>  }
>  }
>  }
> @@ -396,6 +414,12 @@ void acpi_pcihp_device_unplug_request_cb(HotplugHandler 
> *hotplug_dev,
>  return;
>  }
>  
> +/*
> + * pending_deleted_event is used by virtio-net failover to detect the
> + * end of the unplug operation, the flag is set to false in
> + * acpi_pcihp_eject_slot() when the operation is completed.
> + */
> +pdev->qdev.pending_deleted_event = true;
>  s->acpi_pcihp_pci_status[bsel].down |= (1U << slot);
>  acpi_send_event(DEVICE(hotplug_dev), ACPI_PCI_HOTPLUG_STATUS);
>  }
> -- 
> 2.33.1
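
The flag handshake described in this patch can be modelled outside QEMU. The following is only a minimal sketch of the idea, not the code in hw/acpi/pcihp.c: ModelDev and the model_*() functions are hypothetical names, and the "present" field is an assumption standing in for the real unparent/unplug path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state discussed in the patch. */
typedef struct {
    bool partially_hotplugged;  /* failover wants to keep the device around */
    bool pending_deleted_event; /* unplug requested, guest has not ejected */
    bool present;               /* assumption: false once really removed */
} ModelDev;

/* Models the unplug request: remember that an unplug is in flight. */
static void model_unplug_request(ModelDev *dev)
{
    assert(dev->present);
    dev->pending_deleted_event = true;
}

/* Models the guest writing the eject register. */
static void model_eject(ModelDev *dev)
{
    if (dev->partially_hotplugged) {
        /* Keep the device so it can be plugged back, only end the wait. */
        dev->pending_deleted_event = false;
    } else {
        /* Regular unplug: the device goes away completely. */
        dev->present = false;
    }
}

/* What a migration loop would poll before leaving the wait-unplug step. */
static bool model_unplug_done(const ModelDev *dev)
{
    return !dev->present || !dev->pending_deleted_event;
}
```

With this model, model_unplug_done() stays false between model_unplug_request() and model_eject(), which is the window the migration code has to wait for.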

Re: [PATCH v7 2/4] tests/qtest: add some tests for virtio-net failover

On Wed, Dec 08, 2021 at 08:33:31AM +0100, Thomas Huth wrote:
> On 07/12/2021 18.23, Laurent Vivier wrote:
> > Add test cases to test several error cases that must be
> > generated by invalid failover configuration.
> > 
> > Add a combination of coldplug and hotplug test cases to be
> > sure the primary is correctly managed according the
> > presence or not of the STANDBY feature.
> > 
> > Signed-off-by: Laurent Vivier 
> > ---
> >   tests/qtest/meson.build   |   4 +
> >   tests/qtest/virtio-net-failover.c | 771 ++
> >   2 files changed, 775 insertions(+)
> >   create mode 100644 tests/qtest/virtio-net-failover.c
> 
> Acked-by: Thomas Huth 
> 
> I'll take this series through my "testing" branch (unless someone speaks up
> that it should go through some virtio/network branch instead).

Pls do.

-- 
MST

Re: [PATCH v5 0/4] tests/qtest: add some tests for virtio-net failover

On Fri, Nov 19, 2021 at 10:07:14AM +0100, Laurent Vivier wrote:
> This series adds a qtest entry to test virtio-net failover feature.

Reviewed-by: Michael S. Tsirkin 


> We check following error cases:
> 
> - check missing id on device with failover_pair_id triggers an error
> - check a primary device plugged on a bus that doesn't support hotplug
>   triggers an error
> 
> We check the status of the machine before and after hotplugging cards and
> feature negotiation:
> 
> - check we don't see the primary device at boot if failover is on
> - check we see the primary device at boot if failover is off
> - check we don't see the primary device if failover is on
>   but failover_pair_id is not the one with on (I think this should be changed)
> - check the primary device is plugged after the feature negotiation
> - check the result if the primary device is plugged before standby device and
>   vice-versa
> - check the result if the primary device is coldplugged and the standby device
>   hotplugged and vice-versa
> - check the migration triggers the unplug and the hotplug
> 
> There is one preliminary patch in the series:
> 
> - PATCH 1 introduces a function to enable PCI bridge.
>   Failover needs to be plugged on a pcie-root-port and while
>   the root port is not configured the cards behind it are not
>   available
> 
> v5:
> - re-add the wait-unplug test that has been removed from v4 by mistake.
> 
> v4:
> - rely on query-migrate status to know the migration state rather than
>   to wait the STOP event.
> - remove the patch to add time out to qtest_qmp_eventwait()
> 
> v3:
> - fix a bug with ACPI unplug and add the related test
> 
> v2:
> - remove PATCH 1 that introduced a function that can be replaced by
>   qobject_to_json_pretty() (Markus)
> - Add migration to a file and from the file to check the card is
>   correctly unplugged on the source, and hotplugged on the dest
> - Add an ACPI call to eject the card as the kernel would do
> 
> Laurent Vivier (4):
>   qtest/libqos: add a function to initialize secondary PCI buses
>   tests/qtest: add some tests for virtio-net failover
>   failover: fix unplug pending detection
>   tests/libqtest: update virtio-net failover test
> 
>  hw/acpi/pcihp.c   |  30 +-
>  include/hw/pci/pci_bridge.h   |   8 +
>  tests/qtest/libqos/pci.c  | 118 ++
>  tests/qtest/libqos/pci.h  |   1 +
>  tests/qtest/meson.build   |   3 +
>  tests/qtest/virtio-net-failover.c | 681 ++
>  6 files changed, 838 insertions(+), 3 deletions(-)
>  create mode 100644 tests/qtest/virtio-net-failover.c
> 
> -- 
> 2.33.1
>

Re: [PATCH v7 2/4] tests/qtest: add some tests for virtio-net failover


On 07/12/2021 18.23, Laurent Vivier wrote:
> Add test cases to test several error cases that must be
> generated by invalid failover configuration.
> 
> Add a combination of coldplug and hotplug test cases to be
> sure the primary is correctly managed according the
> presence or not of the STANDBY feature.
> 
> Signed-off-by: Laurent Vivier
> ---
>   tests/qtest/meson.build   |   4 +
>   tests/qtest/virtio-net-failover.c | 771 ++
>   2 files changed, 775 insertions(+)
>   create mode 100644 tests/qtest/virtio-net-failover.c

Acked-by: Thomas Huth

I'll take this series through my "testing" branch (unless someone speaks up
that it should go through some virtio/network branch instead).

Re: [PATCH RFC 00/11] vl: Explore redesign of startup

2021-12-07 Thread Markus Armbruster

Damien Hedde  writes:

> Hi Markus,
>
> It looks promising. I did not think we could so "easily" have a new
> working startup.

Look at this big axe I got!  ;)

>  But I'm not so sure that I understand how we should 
> progress from here.

I neglected to explain this in my cover letter.  My apologies...

> I see 3 main parts in this:
> A. introducing new binary (meson, ...)
> B. startup api: phase related stuff (maybe more)
> C. cli to qmp parser

Makes sense to me at a high level.

> I think if we want to add a new binary (instead of replace it), there
> will be some common api and every startup will have to
> support/implement it. Probably some part of vl.c will have to go in
> some common code.
> In practice, we probably should introduce/extract this before
> introducing the new binary.

I think there are two practical ways to structure such patches:

* Refactor existing code to make parts available for new code, then
  introduce new code that uses them.

* Copy, cut unwanted parts, refactor to deduplicate.

I think either way can work as patches.  The second way is how I'd start
the work myself.

> One central part of this api is the phase mechanism (even if legacy
> startup can only support it partially or not-at-all).
>
> I think we have 2 choices:
> + we have to use until_phase explicitly
> + we make qmp commands implicitly advances phases when needed.

Yes.

> I think it's better to go the implicit way as much as possible: it
> means we focus on commands and not on some artificial phases we set up
> because of legacy.

An explicit phase control command looked like the fast & easy path to
phase control to me, so that's what I picked for the RFC.

Instead of a single "advance to arbitrary phase" command, we can have
multiple "do X, which requires phase Y and advances to phase Y+1"
commands.  E.g. "create machine" goes from @no-machine to
@machine-created.

We may want additional, automatic phase advances for convenience, but I
feel it's best to get the essential stuff roughly right before talking
about convenience features.

> Either way, we probably should put the phase info in qapi so that we
> don't have to hardcode that in every command in order to have common 
> error handling. One thing we could do is replace "allow-preconfig" in
> qapi by some phase requirement entry(entries?) and make qmp call 
> qemu_until_phase() or some qemu_phase_check() function.

I'd also like some phase support from QAPI.  Manual phase checking code
in commands would be tedious and error prone.  Better to declare
required phase(s) in the schema.

One small step further: declare phase transitions in the schema, too.
Then the phase state machine definition is all *data*.  Data is easier
to reason about than code.  Extracting the complete state machine from
the schema is straightforward.  Extracting it from C code is anything
but.
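
To illustrate the "state machine as data" idea, here is a minimal sketch in C. It is not QAPI schema syntax and not taken from the RFC: the PhaseRule table, the command names other than the "create machine" example, and the phase_run() helper are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical phases; only the first two are named in the discussion. */
typedef enum {
    PHASE_NO_MACHINE,
    PHASE_MACHINE_CREATED,
    PHASE_MACHINE_READY
} Phase;

/* One row per command: the phase it requires and the phase it leaves. */
typedef struct {
    const char *command;
    Phase required;
    Phase next;
} PhaseRule;

/* The whole state machine is this table (command names are made up). */
static const PhaseRule phase_rules[] = {
    { "create-machine", PHASE_NO_MACHINE,      PHASE_MACHINE_CREATED },
    { "finish-machine", PHASE_MACHINE_CREATED, PHASE_MACHINE_READY },
    { "query-status",   PHASE_MACHINE_READY,   PHASE_MACHINE_READY },
};

/*
 * Common phase check for every command: returns 0 and advances *phase,
 * or -1 if the command is unknown or not allowed in the current phase.
 */
static int phase_run(Phase *phase, const char *command)
{
    assert(phase != NULL && command != NULL);

    for (size_t i = 0; i < sizeof(phase_rules) / sizeof(phase_rules[0]); i++) {
        const PhaseRule *rule = &phase_rules[i];

        if (strcmp(rule->command, command) == 0) {
            if (*phase != rule->required) {
                return -1;
            }
            *phase = rule->next;
            return 0;
        }
    }
    return -1;
}
```

Because the transitions live in one table, listing or validating the complete state machine is a loop over phase_rules rather than a read through every command handler.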

> We also maybe need to sort out if we want to merge the phases into the
> runstate.

Yes.

> Thanks for making the effort to do this rfc,

Thanks for your feedback!

[PATCH 7/7] hw/riscv: Use error_fatal for SoC realisation

From: Alistair Francis 

When realising the SoC use error_fatal instead of error_abort as the
process can fail and report useful information to the user.

Currently a user can see this:

   $ ../qemu/bld/qemu-system-riscv64 -M sifive_u -S -monitor stdio -display none -drive if=pflash
   QEMU 6.1.93 monitor - type 'help' for more information
   (qemu) Unexpected error in sifive_u_otp_realize() at ../hw/misc/sifive_u_otp.c:229:
   qemu-system-riscv64: OTP drive size < 16K
   Aborted (core dumped)

This patch addresses that.

Signed-off-by: Alistair Francis 
Reported-by: Markus Armbruster 
---
 hw/riscv/microchip_pfsoc.c | 2 +-
 hw/riscv/opentitan.c   | 2 +-
 hw/riscv/sifive_e.c| 2 +-
 hw/riscv/sifive_u.c| 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/riscv/microchip_pfsoc.c b/hw/riscv/microchip_pfsoc.c
index 57d779fb55..f16e4d10eb 100644
--- a/hw/riscv/microchip_pfsoc.c
+++ b/hw/riscv/microchip_pfsoc.c
@@ -471,7 +471,7 @@ static void microchip_icicle_kit_machine_init(MachineState 
*machine)
 /* Initialize SoC */
 object_initialize_child(OBJECT(machine), "soc", &s->soc,
 TYPE_MICROCHIP_PFSOC);
-qdev_realize(DEVICE(&s->soc), NULL, &error_abort);
+qdev_realize(DEVICE(&s->soc), NULL, &error_fatal);
 
 /* Split RAM into low and high regions using aliases to machine->ram */
 mem_low_size = memmap[MICROCHIP_PFSOC_DRAM_LO].size;
diff --git a/hw/riscv/opentitan.c b/hw/riscv/opentitan.c
index c531450b9f..0856c347e8 100644
--- a/hw/riscv/opentitan.c
+++ b/hw/riscv/opentitan.c
@@ -80,7 +80,7 @@ static void opentitan_board_init(MachineState *machine)
 /* Initialize SoC */
 object_initialize_child(OBJECT(machine), "soc", &s->soc,
 TYPE_RISCV_IBEX_SOC);
-qdev_realize(DEVICE(&s->soc), NULL, &error_abort);
+qdev_realize(DEVICE(&s->soc), NULL, &error_fatal);
 
 memory_region_add_subregion(sys_mem,
 memmap[IBEX_DEV_RAM].base, machine->ram);
diff --git a/hw/riscv/sifive_e.c b/hw/riscv/sifive_e.c
index 9b206407a6..dcb87b6cfd 100644
--- a/hw/riscv/sifive_e.c
+++ b/hw/riscv/sifive_e.c
@@ -88,7 +88,7 @@ static void sifive_e_machine_init(MachineState *machine)
 
 /* Initialize SoC */
 object_initialize_child(OBJECT(machine), "soc", &s->soc, TYPE_RISCV_E_SOC);
-qdev_realize(DEVICE(&s->soc), NULL, &error_abort);
+qdev_realize(DEVICE(&s->soc), NULL, &error_fatal);
 
 /* Data Tightly Integrated Memory */
 memory_region_add_subregion(sys_mem,
diff --git a/hw/riscv/sifive_u.c b/hw/riscv/sifive_u.c
index 589ae72a59..d576484851 100644
--- a/hw/riscv/sifive_u.c
+++ b/hw/riscv/sifive_u.c
@@ -545,7 +545,7 @@ static void sifive_u_machine_init(MachineState *machine)
  &error_abort);
 object_property_set_str(OBJECT(&s->soc), "cpu-type", machine->cpu_type,
  &error_abort);
-qdev_realize(DEVICE(&s->soc), NULL, &error_abort);
+qdev_realize(DEVICE(&s->soc), NULL, &error_fatal);
 
 /* register RAM */
 memory_region_add_subregion(system_memory, memmap[SIFIVE_U_DEV_DRAM].base,
-- 
2.31.1
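
The difference between the two error destinations in the commit message above can be sketched as a small model. This is an illustration only and does not use QEMU's Error API: the Model* names are hypothetical, and the function reports what the user would observe instead of actually exiting or aborting. The 16K threshold comes from the error message quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two error destinations. */
typedef enum { MODEL_ERROR_ABORT, MODEL_ERROR_FATAL } ModelErrDest;

/* What the user observes; the sketch returns it rather than doing it. */
typedef enum {
    MODEL_OK,           /* realisation succeeded */
    MODEL_EXIT_FAILURE, /* message printed, clean exit with failure status */
    MODEL_CORE_DUMP     /* "Unexpected error ..." followed by abort() */
} ModelOutcome;

#define MODEL_OTP_MIN_SIZE ((size_t)16 * 1024)

/* Models realising an SoC whose OTP device rejects small backing drives. */
static ModelOutcome model_realize_soc(size_t otp_drive_size, ModelErrDest dest)
{
    assert(dest == MODEL_ERROR_ABORT || dest == MODEL_ERROR_FATAL);

    if (otp_drive_size >= MODEL_OTP_MIN_SIZE) {
        return MODEL_OK;
    }
    /* A bad drive is a user error, so a core dump is the wrong response. */
    return dest == MODEL_ERROR_ABORT ? MODEL_CORE_DUMP : MODEL_EXIT_FAILURE;
}
```

In this model only the failure path changes: a valid configuration behaves the same with either destination.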

[PATCH 4/7] hw/intc: sifive_plic: Cleanup remaining functions

From: Alistair Francis 

We can remove the original sifive_plic_irqs_pending() function and
instead just use the sifive_plic_claim() function (renamed to
sifive_plic_claimed()) to determine if any interrupts are pending.

This requires moving the side effects outside of sifive_plic_claimed(),
but as they are only invoked once that isn't a problem.

We have also removed all of the old #ifdef debugging logs, so let's
cleanup the last remaining debug function while we are here.

Signed-off-by: Alistair Francis 
---
 hw/intc/sifive_plic.c | 109 +-
 1 file changed, 22 insertions(+), 87 deletions(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 7f9715a584..d9bf01b647 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -31,8 +31,6 @@
 #include "migration/vmstate.h"
 #include "hw/irq.h"
 
-#define RISCV_DEBUG_PLIC 0
-
 static bool addr_between(uint32_t addr, uint32_t base, uint32_t num)
 {
 uint32_t end = base + num;
@@ -57,47 +55,6 @@ static PLICMode char_to_mode(char c)
 }
 }
 
-static char mode_to_char(PLICMode m)
-{
-switch (m) {
-case PLICMode_U: return 'U';
-case PLICMode_S: return 'S';
-case PLICMode_H: return 'H';
-case PLICMode_M: return 'M';
-default: return '?';
-}
-}
-
-static void sifive_plic_print_state(SiFivePLICState *plic)
-{
-int i;
-int addrid;
-
-/* pending */
-qemu_log("pending   : ");
-for (i = plic->bitfield_words - 1; i >= 0; i--) {
-qemu_log("%08x", plic->pending[i]);
-}
-qemu_log("\n");
-
-/* pending */
-qemu_log("claimed   : ");
-for (i = plic->bitfield_words - 1; i >= 0; i--) {
-qemu_log("%08x", plic->claimed[i]);
-}
-qemu_log("\n");
-
-for (addrid = 0; addrid < plic->num_addrs; addrid++) {
-qemu_log("hart%d-%c enable: ",
-plic->addr_config[addrid].hartid,
-mode_to_char(plic->addr_config[addrid].mode));
-for (i = plic->bitfield_words - 1; i >= 0; i--) {
-qemu_log("%08x", plic->enable[addrid * plic->bitfield_words + i]);
-}
-qemu_log("\n");
-}
-}
-
 static uint32_t atomic_set_masked(uint32_t *a, uint32_t mask, uint32_t value)
 {
 uint32_t old, new, cmp = qatomic_read(a);
@@ -121,26 +78,34 @@ static void sifive_plic_set_claimed(SiFivePLICState *plic, 
int irq, bool level)
 atomic_set_masked(&plic->claimed[irq >> 5], 1 << (irq & 31), -!!level);
 }
 
-static int sifive_plic_irqs_pending(SiFivePLICState *plic, uint32_t addrid)
+static uint32_t sifive_plic_claimed(SiFivePLICState *plic, uint32_t addrid)
 {
+uint32_t max_irq = 0;
+uint32_t max_prio = plic->target_priority[addrid];
 int i, j;
+
 for (i = 0; i < plic->bitfield_words; i++) {
 uint32_t pending_enabled_not_claimed =
-(plic->pending[i] & ~plic->claimed[i]) &
-plic->enable[addrid * plic->bitfield_words + i];
+(plic->pending[i] & ~plic->claimed[i]) &
+plic->enable[addrid * plic->bitfield_words + i];
+
 if (!pending_enabled_not_claimed) {
 continue;
 }
+
 for (j = 0; j < 32; j++) {
 int irq = (i << 5) + j;
 uint32_t prio = plic->source_priority[irq];
 int enabled = pending_enabled_not_claimed & (1 << j);
-if (enabled && prio > plic->target_priority[addrid]) {
-return 1;
+
+if (enabled && prio > max_prio) {
+max_irq = irq;
+max_prio = prio;
 }
 }
 }
-return 0;
+
+return max_irq;
 }
 
 static void sifive_plic_update(SiFivePLICState *plic)
@@ -151,7 +116,7 @@ static void sifive_plic_update(SiFivePLICState *plic)
 for (addrid = 0; addrid < plic->num_addrs; addrid++) {
 uint32_t hartid = plic->addr_config[addrid].hartid;
 PLICMode mode = plic->addr_config[addrid].mode;
-int level = sifive_plic_irqs_pending(plic, addrid);
+bool level = !!sifive_plic_claimed(plic, addrid);
 
 switch (mode) {
 case PLICMode_M:
@@ -164,41 +129,6 @@ static void sifive_plic_update(SiFivePLICState *plic)
 break;
 }
 }
-
-if (RISCV_DEBUG_PLIC) {
-sifive_plic_print_state(plic);
-}
-}
-
-static uint32_t sifive_plic_claim(SiFivePLICState *plic, uint32_t addrid)
-{
-int i, j;
-uint32_t max_irq = 0;
-uint32_t max_prio = plic->target_priority[addrid];
-
-for (i = 0; i < plic->bitfield_words; i++) {
-uint32_t pending_enabled_not_claimed =
-(plic->pending[i] & ~plic->claimed[i]) &
-plic->enable[addrid * plic->bitfield_words + i];
-if (!pending_enabled_not_claimed) {
-continue;
-}
-for (j = 0; j < 32; j++) {
-int irq = (i << 5) + j;
-uint32_t prio = plic->source_priority[irq];
-int enabled = pending_enabled_not_claimed & (1 << j);
-

[PATCH 6/7] target/riscv: Enable the Hypervisor extension by default

From: Alistair Francis 

Let's enable the Hypervisor extension by default. This doesn't affect
named CPUs (such as lowrisc-ibex or sifive-u54) but does enable the
Hypervisor extensions by default for the virt machine.

Signed-off-by: Alistair Francis 
---
 target/riscv/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index 1edb2771b4..013a8760b5 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -626,7 +626,7 @@ static Property riscv_cpu_properties[] = {
 DEFINE_PROP_BOOL("c", RISCVCPU, cfg.ext_c, true),
 DEFINE_PROP_BOOL("s", RISCVCPU, cfg.ext_s, true),
 DEFINE_PROP_BOOL("u", RISCVCPU, cfg.ext_u, true),
-DEFINE_PROP_BOOL("h", RISCVCPU, cfg.ext_h, false),
+DEFINE_PROP_BOOL("h", RISCVCPU, cfg.ext_h, true),
 DEFINE_PROP_BOOL("Counters", RISCVCPU, cfg.ext_counters, true),
 DEFINE_PROP_BOOL("Zifencei", RISCVCPU, cfg.ext_ifencei, true),
 DEFINE_PROP_BOOL("Zicsr", RISCVCPU, cfg.ext_icsr, true),
-- 
2.31.1

[PATCH 2/7] hw/intc: sifive_plic: Cleanup the write function

From: Alistair Francis 

Signed-off-by: Alistair Francis 
Reviewed-by: Bin Meng 
---
 hw/intc/sifive_plic.c | 82 +--
 1 file changed, 33 insertions(+), 49 deletions(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 35f097799a..c1fa689868 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -33,6 +33,17 @@
 
 #define RISCV_DEBUG_PLIC 0
 
+static bool addr_between(uint32_t addr, uint32_t base, uint32_t num)
+{
+uint32_t end = base + num;
+
+if (addr >= base && addr < end) {
+return true;
+}
+
+return false;
+}
+
 static PLICMode char_to_mode(char c)
 {
 switch (c) {
@@ -269,80 +280,53 @@ static void sifive_plic_write(void *opaque, hwaddr addr, 
uint64_t value,
 {
 SiFivePLICState *plic = opaque;
 
-/* writes must be 4 byte words */
-if ((addr & 0x3) != 0) {
-goto err;
-}
-
-if (addr >= plic->priority_base && /* 4 bytes per source */
-addr < plic->priority_base + (plic->num_sources << 2))
-{
+if (addr_between(addr, plic->priority_base, plic->num_sources << 2)) {
 uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
+
 plic->source_priority[irq] = value & 7;
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: write priority: irq=%d priority=%d\n",
-irq, plic->source_priority[irq]);
-}
 sifive_plic_update(plic);
-return;
-} else if (addr >= plic->pending_base && /* 1 bit per source */
-   addr < plic->pending_base + (plic->num_sources >> 3))
-{
+} else if (addr_between(addr, plic->pending_base,
+plic->num_sources >> 3)) {
 qemu_log_mask(LOG_GUEST_ERROR,
   "%s: invalid pending write: 0x%" HWADDR_PRIx "",
   __func__, addr);
-return;
-} else if (addr >= plic->enable_base && /* 1 bit per source */
-addr < plic->enable_base + plic->num_addrs * plic->enable_stride)
-{
+} else if (addr_between(addr, plic->enable_base,
+plic->num_addrs * plic->enable_stride)) {
 uint32_t addrid = (addr - plic->enable_base) / plic->enable_stride;
 uint32_t wordid = (addr & (plic->enable_stride - 1)) >> 2;
+
 if (wordid < plic->bitfield_words) {
 plic->enable[addrid * plic->bitfield_words + wordid] = value;
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: write enable: hart%d-%c word=%d value=%x\n",
-plic->addr_config[addrid].hartid,
-mode_to_char(plic->addr_config[addrid].mode), wordid,
-plic->enable[addrid * plic->bitfield_words + wordid]);
-}
-return;
+} else {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Invalid enable write 0x%" HWADDR_PRIx "\n",
+  __func__, addr);
 }
-} else if (addr >= plic->context_base && /* 4 bytes per reg */
-addr < plic->context_base + plic->num_addrs * plic->context_stride)
-{
+} else if (addr_between(addr, plic->context_base,
+plic->num_addrs * plic->context_stride)) {
 uint32_t addrid = (addr - plic->context_base) / plic->context_stride;
 uint32_t contextid = (addr & (plic->context_stride - 1));
+
 if (contextid == 0) {
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: write priority: hart%d-%c priority=%x\n",
-plic->addr_config[addrid].hartid,
-mode_to_char(plic->addr_config[addrid].mode),
-plic->target_priority[addrid]);
-}
 if (value <= plic->num_priorities) {
 plic->target_priority[addrid] = value;
 sifive_plic_update(plic);
 }
-return;
 } else if (contextid == 4) {
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: write claim: hart%d-%c irq=%x\n",
-plic->addr_config[addrid].hartid,
-mode_to_char(plic->addr_config[addrid].mode),
-(uint32_t)value);
-}
 if (value < plic->num_sources) {
 sifive_plic_set_claimed(plic, value, false);
 sifive_plic_update(plic);
 }
-return;
+} else {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Invalid context write 0x%" HWADDR_PRIx "\n",
+  __func__, addr);
 }
+} else {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "%s: Invalid register write 0x%" HWADDR_PRIx "\n",
+  __func__, addr);
 }
-
-err:
-qemu_log_mask(LOG_GUEST_ERROR,
-  "%s: Invalid register write 0x%" HWADDR_PRIx "\n",
-  __func__, addr);
 }
 
 static const MemoryRegionOps sifive_plic_ops = {
--

[PATCH 3/7] hw/intc: sifive_plic: Cleanup the read function

From: Alistair Francis 

Signed-off-by: Alistair Francis 
Reviewed-by: Bin Meng 
---
 hw/intc/sifive_plic.c | 55 +--
 1 file changed, 11 insertions(+), 44 deletions(-)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c1fa689868..7f9715a584 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -205,70 +205,37 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
 {
 SiFivePLICState *plic = opaque;
 
-/* writes must be 4 byte words */
-if ((addr & 0x3) != 0) {
-goto err;
-}
-
-if (addr >= plic->priority_base && /* 4 bytes per source */
-addr < plic->priority_base + (plic->num_sources << 2))
-{
+if (addr_between(addr, plic->priority_base, plic->num_sources << 2)) {
 uint32_t irq = ((addr - plic->priority_base) >> 2) + 1;
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: read priority: irq=%d priority=%d\n",
-irq, plic->source_priority[irq]);
-}
+
 return plic->source_priority[irq];
-} else if (addr >= plic->pending_base && /* 1 bit per source */
-   addr < plic->pending_base + (plic->num_sources >> 3))
-{
+} else if (addr_between(addr, plic->pending_base, plic->num_sources >> 3)) {
 uint32_t word = (addr - plic->pending_base) >> 2;
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: read pending: word=%d value=%d\n",
-word, plic->pending[word]);
-}
+
 return plic->pending[word];
-} else if (addr >= plic->enable_base && /* 1 bit per source */
- addr < plic->enable_base + plic->num_addrs * plic->enable_stride)
-{
+} else if (addr_between(addr, plic->enable_base,
+plic->num_addrs * plic->enable_stride)) {
 uint32_t addrid = (addr - plic->enable_base) / plic->enable_stride;
 uint32_t wordid = (addr & (plic->enable_stride - 1)) >> 2;
+
 if (wordid < plic->bitfield_words) {
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: read enable: hart%d-%c word=%d value=%x\n",
-plic->addr_config[addrid].hartid,
-mode_to_char(plic->addr_config[addrid].mode), wordid,
-plic->enable[addrid * plic->bitfield_words + wordid]);
-}
 return plic->enable[addrid * plic->bitfield_words + wordid];
 }
-} else if (addr >= plic->context_base && /* 1 bit per source */
- addr < plic->context_base + plic->num_addrs * plic->context_stride)
-{
+} else if (addr_between(addr, plic->context_base,
+plic->num_addrs * plic->context_stride)) {
 uint32_t addrid = (addr - plic->context_base) / plic->context_stride;
 uint32_t contextid = (addr & (plic->context_stride - 1));
+
 if (contextid == 0) {
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: read priority: hart%d-%c priority=%x\n",
-plic->addr_config[addrid].hartid,
-mode_to_char(plic->addr_config[addrid].mode),
-plic->target_priority[addrid]);
-}
 return plic->target_priority[addrid];
 } else if (contextid == 4) {
 uint32_t value = sifive_plic_claim(plic, addrid);
-if (RISCV_DEBUG_PLIC) {
-qemu_log("plic: read claim: hart%d-%c irq=%x\n",
-plic->addr_config[addrid].hartid,
-mode_to_char(plic->addr_config[addrid].mode),
-value);
-}
+
 sifive_plic_update(plic);
 return value;
 }
 }
 
-err:
 qemu_log_mask(LOG_GUEST_ERROR,
   "%s: Invalid register read 0x%" HWADDR_PRIx "\n",
   __func__, addr);
-- 
2.31.1
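
Note on the two cleanup patches above: both rely on an addr_between()
helper whose definition is not part of the hunks shown here. The
following is only a minimal sketch of what such a helper could look like,
assuming it takes a base and a byte length and tests the half-open range
[base, base + num). The names addr_between_sketch and priority_irq are
hypothetical and exist only to illustrate the range test and the
priority-register index arithmetic used in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for addr_between(): true when addr lies in the
 * half-open range [base, base + num). Subtracting before comparing
 * avoids an overflow in base + num.
 */
static bool addr_between_sketch(uint64_t addr, uint64_t base, uint64_t num)
{
    return addr >= base && addr - base < num;
}

/*
 * Mirrors the index arithmetic in the patch: priority registers are
 * 4 bytes per source, and the first register maps to irq 1.
 */
static uint32_t priority_irq(uint64_t addr, uint64_t priority_base)
{
    return (uint32_t)((addr - priority_base) >> 2) + 1;
}
```

With a priority region at a made-up base of 0x04 covering 8 sources
(num = 8 << 2), offset 0x0c falls inside the region and decodes to irq 3.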

[PATCH 5/7] target/riscv: Mark the Hypervisor extension as non experimental

From: Alistair Francis 

The Hypervisor spec is now frozen, so remove the experimental tag.

Signed-off-by: Alistair Francis 
---
 target/riscv/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c
index f812998123..1edb2771b4 100644
--- a/target/riscv/cpu.c
+++ b/target/riscv/cpu.c
@@ -626,6 +626,7 @@ static Property riscv_cpu_properties[] = {
 DEFINE_PROP_BOOL("c", RISCVCPU, cfg.ext_c, true),
 DEFINE_PROP_BOOL("s", RISCVCPU, cfg.ext_s, true),
 DEFINE_PROP_BOOL("u", RISCVCPU, cfg.ext_u, true),
+DEFINE_PROP_BOOL("h", RISCVCPU, cfg.ext_h, false),
 DEFINE_PROP_BOOL("Counters", RISCVCPU, cfg.ext_counters, true),
 DEFINE_PROP_BOOL("Zifencei", RISCVCPU, cfg.ext_ifencei, true),
 DEFINE_PROP_BOOL("Zicsr", RISCVCPU, cfg.ext_icsr, true),
@@ -639,7 +640,6 @@ static Property riscv_cpu_properties[] = {
 DEFINE_PROP_BOOL("x-zbb", RISCVCPU, cfg.ext_zbb, false),
 DEFINE_PROP_BOOL("x-zbc", RISCVCPU, cfg.ext_zbc, false),
 DEFINE_PROP_BOOL("x-zbs", RISCVCPU, cfg.ext_zbs, false),
-DEFINE_PROP_BOOL("x-h", RISCVCPU, cfg.ext_h, false),
 DEFINE_PROP_BOOL("x-j", RISCVCPU, cfg.ext_j, false),
 DEFINE_PROP_BOOL("x-v", RISCVCPU, cfg.ext_v, false),
 DEFINE_PROP_STRING("vext_spec", RISCVCPU, cfg.vext_spec),
-- 
2.31.1

Re: [PATCH v10 00/77] support vector extension v1.0

On Mon, Nov 29, 2021 at 1:04 PM  wrote:
>
> From: Frank Chang 
>
> This patchset implements the vector extension v1.0 for RISC-V on QEMU.
>
> The RVV v1.0 spec is now frozen for public review:
> https://github.com/riscv/riscv-v-spec/releases/tag/v1.0
>
> The port is available here:
> https://github.com/sifive/qemu/tree/rvv-1.0-upstream-v10
>
> RVV v1.0 can be enabled with the -cpu options v=true and
> vext_spec=v1.0.
>
> Note: This patchset depends on other patchsets listed in Based-on
>   section below so it is not able to be built unless those patchsets
>   are applied.

I think this is all reviewed now. Once your other patch sets are
merged just re-send this and I can apply it.

Alistair

>
> Changelog:
>
> v10
>   * Add ELEN checks for widening and narrowing instructions.
>
> v9
>   * Remove explicitly set mstatus.SD patches as mstatus.SD is now
> set in add_status_sd().
>   * Rebase on riscv-to-apply.next branch.
>
> v8
>   * Use {get,dest}_gpr APIs.
>   * remove vector AMO instructions.
>   * rename vpopc.m to vcpop.m.
>   * rename vle1.v and vse1.v to vlm.v and vsm.v.
>   * rename vmandnot.mm and vmornot.mm to vmandn.mm and vmorn.mm.
>
> v7
>   * remove hardcoded GDB vector registers list.
>   * add vsetivli instruction.
>   * add vle1.v and vse1.v instructions.
>
> v6
>   * add vector floating-point reciprocal estimate instruction.
>   * add vector floating-point reciprocal square-root estimate instruction.
>   * update check rules for segment register groups, each segment register
> group has to follow overlap rules.
>   * update viota.m instruction check rules.
>
> v5
>   * refactor RVV v1.0 check functions.
> (Thanks to Richard Henderson's bitwise tricks.)
>   * relax RV_VLEN_MAX to 1024-bits.
>   * implement vstart CSR's behaviors.
>   * trigger illegal instruction exception if frm is not valid for
> vector floating-point instructions.
>   * rebase on riscv-to-apply.next.
>
> v4
>   * remove explicit float flmul variable in DisasContext.
>   * replace floating-point calculations with shift operations to
> improve performance.
>   * relax RV_VLEN_MAX to 512-bits.
>
> v3
>   * apply nan-box helpers from Richard Henderson.
>   * remove fp16 api changes as they are sent independently in another
> patchset by Chih-Min Chao.
>   * remove all tail-element clearing functions, as tail elements can
> remain unchanged whether VTA is set to undisturbed or agnostic.
>   * add fp16 nan-box check generator function.
>   * add floating-point rounding mode enum.
>   * replace flmul arithmetic with shifts to avoid floating-point
> conversions.
>   * add Zvqmac extension.
>   * replace gdbstub vector register xml files with dynamic generator.
>   * bumped to RVV v1.0.
>   * RVV v1.0 related changes:
> * add vlre.v and vsr.v vector whole register
>   load/store instructions
> * add vrgatherei16 instruction.
> * rearranged bits in vtype to make vlmul bits into a contiguous
>   field.
>
> v2
>   * drop v0.7.1 support.
>   * replace invisible return check macros with functions.
>   * move mark_vs_dirty() to translators.
>   * add SSTATUS_VS flag for s-mode.
>   * nan-box scalar fp register for floating-point operations.
>   * add gdbstub files for vector registers to allow system-mode
> debugging with GDB.
>
> Based-on: <20211021160847.2748577-1-frank.ch...@sifive.com>
> Based-on: <20211021162956.2772656-1-frank.ch...@sifive.com>
>
> Frank Chang (72):
>   target/riscv: drop vector 0.7.1 and add 1.0 support
>   target/riscv: Use FIELD_EX32() to extract wd field
>   target/riscv: rvv-1.0: set mstatus.SD bit if mstatus.VS is dirty
>   target/riscv: rvv-1.0: introduce writable misa.v field
>   target/riscv: rvv-1.0: add translation-time vector context status
>   target/riscv: rvv-1.0: remove rvv related codes from fcsr registers
>   target/riscv: rvv-1.0: check MSTATUS_VS when accessing vector csr
> registers
>   target/riscv: rvv-1.0: remove MLEN calculations
>   target/riscv: rvv-1.0: add fractional LMUL
>   target/riscv: rvv-1.0: add VMA and VTA
>   target/riscv: rvv-1.0: update check functions
>   target/riscv: introduce more imm value modes in translator functions
>   target/riscv: rvv:1.0: add translation-time nan-box helper function
>   target/riscv: rvv-1.0: remove amo operations instructions
>   target/riscv: rvv-1.0: configure instructions
>   target/riscv: rvv-1.0: stride load and store instructions
>   target/riscv: rvv-1.0: index load and store instructions
>   target/riscv: rvv-1.0: fix address index overflow bug of indexed
> load/store insns
>   target/riscv: rvv-1.0: fault-only-first unit stride load
>   target/riscv: rvv-1.0: load/store whole register instructions
>   target/riscv: rvv-1.0: update vext_max_elems() for load/store insns
>   target/riscv: rvv-1.0: take fractional LMUL into vector max elements
> calculation
>   target/riscv: rvv-1.0: floating-point square-root instruction
>

[PATCH 0/7] A collection of RISC-V cleanups and improvements

From: Alistair Francis 

These are a few patches to clean up some RISC-V hardware and mark the
Hypervisor extension as non-experimental.

Alistair Francis (7):
  hw/intc: sifive_plic: Add a reset function
  hw/intc: sifive_plic: Cleanup the write function
  hw/intc: sifive_plic: Cleanup the read function
  hw/intc: sifive_plic: Cleanup remaining functions
  target/riscv: Mark the Hypervisor extension as non experimental
  target/riscv: Enable the Hypervisor extension by default
  hw/riscv: Use error_fatal for SoC realisation

 hw/intc/sifive_plic.c  | 254 +++--
 hw/riscv/microchip_pfsoc.c |   2 +-
 hw/riscv/opentitan.c   |   2 +-
 hw/riscv/sifive_e.c|   2 +-
 hw/riscv/sifive_u.c|   2 +-
 target/riscv/cpu.c |   2 +-
 6 files changed, 81 insertions(+), 183 deletions(-)

-- 
2.31.1

[PATCH 1/7] hw/intc: sifive_plic: Add a reset function

From: Alistair Francis 

Signed-off-by: Alistair Francis 
---
 hw/intc/sifive_plic.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index 877e76877c..35f097799a 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -355,6 +355,17 @@ static const MemoryRegionOps sifive_plic_ops = {
 }
 };
 
+static void sifive_plic_reset(DeviceState *dev)
+{
+SiFivePLICState *s = SIFIVE_PLIC(dev);
+
+memset(s->source_priority, 0, sizeof(uint32_t) * s->num_sources);
+memset(s->target_priority, 0, sizeof(uint32_t) * s->num_addrs);
+memset(s->pending, 0, sizeof(uint32_t) * s->bitfield_words);
+memset(s->claimed, 0, sizeof(uint32_t) * s->bitfield_words);
+memset(s->enable, 0, sizeof(uint32_t) * s->num_enables);
+}
+
 /*
  * parse PLIC hart/mode address offset config
  *
@@ -501,6 +512,7 @@ static void sifive_plic_class_init(ObjectClass *klass, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(klass);
 
+dc->reset = sifive_plic_reset;
 device_class_set_props(dc, sifive_plic_properties);
 dc->realize = sifive_plic_realize;
 dc->vmsd = &vmstate_sifive_plic;
-- 
2.31.1

Re: [RFC] vhost-vdpa-net: add vhost-vdpa-net host device support

2021-12-07 Thread Jason Wang

On Wed, Dec 8, 2021 at 1:20 PM Longpeng(Mike)  wrote:
>
> From: Longpeng 
>
> Hi guys,
>
> This patch introduces vhost-vdpa-net device, which is inspired
> by vhost-user-blk and the proposal of vhost-vdpa-blk device [1].
>
> I've tested this patch on Huawei's offload card:
> ./x86_64-softmmu/qemu-system-x86_64 \
> -device vhost-vdpa-net-pci,vdpa-dev=/dev/vhost-vdpa-0
>
> For virtio hardware offloading, the most important requirement for us
> is to support live migration between offloading cards from different
> vendors. The combination of netdev and virtio-net seems too heavy, so
> we prefer a lightweight approach.

Could you elaborate more on this? It's mainly the control path when
used with netdev, and it provides a lot of other benefits:

- decoupling the transport-specific stuff from the vhost abstraction:
an mmio device is supported with zero lines of code
- migration compatibility: reusing the migration stream that is already
supported by QEMU virtio-net allows migration among different vhost
backends
- a software mediation facility: not all the virtqueues are assigned to
guests directly. One example is the virtio-net cvq, which QEMU may want
to intercept to record the device state for migration. Reusing the
current virtio-net code simplifies this a lot.
- transparent failover (in the future): the nic model can choose to
switch between vhost backends, etc.

>
> Maybe we could support both in the future?

For the net, we need to figure out the advantages of this approach
first. Note that we didn't have vhost-user-net-pci or vhost-pci in the
past.

For the block, I will leave Stefan and Stefano to comment.

> Such as:
>
> * Lightweight
>  Net: vhost-vdpa-net
>  Storage: vhost-vdpa-blk
>
> * Heavy but more powerful
>  Net: netdev + virtio-net + vhost-vdpa
>  Storage: bdrv + virtio-blk + vhost-vdpa
>
> [1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg797569.html
>
> Signed-off-by: Longpeng(Mike) 
> ---
>  hw/net/meson.build |   1 +
>  hw/net/vhost-vdpa-net.c| 338 
> +
>  hw/virtio/Kconfig  |   5 +
>  hw/virtio/meson.build  |   1 +
>  hw/virtio/vhost-vdpa-net-pci.c | 118 +

I'd expect there's no device type specific code in this approach and
any kind of vDPA devices could be used with a general pci device.

Any reason for having net specific types here?

>  include/hw/virtio/vhost-vdpa-net.h |  31 
>  include/net/vhost-vdpa.h   |   2 +
>  net/vhost-vdpa.c   |   2 +-
>  8 files changed, 497 insertions(+), 1 deletion(-)
>  create mode 100644 hw/net/vhost-vdpa-net.c
>  create mode 100644 hw/virtio/vhost-vdpa-net-pci.c
>  create mode 100644 include/hw/virtio/vhost-vdpa-net.h
>
> diff --git a/hw/net/meson.build b/hw/net/meson.build
> index bdf71f1..139ebc4 100644
> --- a/hw/net/meson.build
> +++ b/hw/net/meson.build
> @@ -44,6 +44,7 @@ specific_ss.add(when: 'CONFIG_XILINX_ETHLITE', if_true: 
> files('xilinx_ethlite.c'
>
>  softmmu_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('net_rx_pkt.c'))
>  specific_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('virtio-net.c'))
> +specific_ss.add(when: 'CONFIG_VHOST_VDPA_NET', if_true: 
> files('vhost-vdpa-net.c'))
>
>  softmmu_ss.add(when: ['CONFIG_VIRTIO_NET', 'CONFIG_VHOST_NET'], if_true: 
> files('vhost_net.c'), if_false: files('vhost_net-stub.c'))
>  softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost_net-stub.c'))
> diff --git a/hw/net/vhost-vdpa-net.c b/hw/net/vhost-vdpa-net.c
> new file mode 100644
> index 000..48b99f9
> --- /dev/null
> +++ b/hw/net/vhost-vdpa-net.c
> @@ -0,0 +1,338 @@
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "qemu/cutils.h"
> +#include "hw/qdev-core.h"
> +#include "hw/qdev-properties.h"
> +#include "hw/qdev-properties-system.h"
> +#include "hw/virtio/vhost.h"
> +#include "hw/virtio/vhost-vdpa-net.h"
> +#include "hw/virtio/virtio.h"
> +#include "hw/virtio/virtio-bus.h"
> +#include "hw/virtio/virtio-access.h"
> +#include "sysemu/sysemu.h"
> +#include "sysemu/runstate.h"
> +#include "net/vhost-vdpa.h"
> +
> +static void vhost_vdpa_net_get_config(VirtIODevice *vdev, uint8_t *config)
> +{
> +VHostVdpaNet *s = VHOST_VDPA_NET(vdev);
> +
> +memcpy(config, &s->netcfg, sizeof(struct virtio_net_config));
> +}
> +
> +static void vhost_vdpa_net_set_config(VirtIODevice *vdev, const uint8_t 
> *config)
> +{
> +VHostVdpaNet *s = VHOST_VDPA_NET(vdev);
> +struct virtio_net_config *netcfg = (struct virtio_net_config *)config;
> +int ret;
> +
> +ret = vhost_dev_set_config(&s->dev, (uint8_t *)netcfg, 0, sizeof(*netcfg),
> +   VHOST_SET_CONFIG_TYPE_MASTER);
> +if (ret) {
> +error_report("set device config space failed");
> +return;
> +}
> +}
> +
> +static uint64_t vhost_vdpa_net_get_features(VirtIODevice *vdev,
> +uint64_t features,
> +

[RFC] vhost-vdpa-net: add vhost-vdpa-net host device support

2021-12-07 Thread Longpeng(Mike)

From: Longpeng 

Hi guys,

This patch introduces vhost-vdpa-net device, which is inspired
by vhost-user-blk and the proposal of vhost-vdpa-blk device [1].

I've tested this patch on Huawei's offload card:
./x86_64-softmmu/qemu-system-x86_64 \
-device vhost-vdpa-net-pci,vdpa-dev=/dev/vhost-vdpa-0

For virtio hardware offloading, the most important requirement for us
is to support live migration between offloading cards from different
vendors. The combination of netdev and virtio-net seems too heavy, so
we prefer a lightweight approach.

Maybe we could support both in the future? Such as:

* Lightweight
 Net: vhost-vdpa-net
 Storage: vhost-vdpa-blk

* Heavy but more powerful
 Net: netdev + virtio-net + vhost-vdpa
 Storage: bdrv + virtio-blk + vhost-vdpa

[1] https://www.mail-archive.com/qemu-devel@nongnu.org/msg797569.html

Signed-off-by: Longpeng(Mike) 
---
 hw/net/meson.build |   1 +
 hw/net/vhost-vdpa-net.c| 338 +
 hw/virtio/Kconfig  |   5 +
 hw/virtio/meson.build  |   1 +
 hw/virtio/vhost-vdpa-net-pci.c | 118 +
 include/hw/virtio/vhost-vdpa-net.h |  31 
 include/net/vhost-vdpa.h   |   2 +
 net/vhost-vdpa.c   |   2 +-
 8 files changed, 497 insertions(+), 1 deletion(-)
 create mode 100644 hw/net/vhost-vdpa-net.c
 create mode 100644 hw/virtio/vhost-vdpa-net-pci.c
 create mode 100644 include/hw/virtio/vhost-vdpa-net.h

diff --git a/hw/net/meson.build b/hw/net/meson.build
index bdf71f1..139ebc4 100644
--- a/hw/net/meson.build
+++ b/hw/net/meson.build
@@ -44,6 +44,7 @@ specific_ss.add(when: 'CONFIG_XILINX_ETHLITE', if_true: files('xilinx_ethlite.c'
 
 softmmu_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('net_rx_pkt.c'))
 specific_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('virtio-net.c'))
+specific_ss.add(when: 'CONFIG_VHOST_VDPA_NET', if_true: files('vhost-vdpa-net.c'))
 
 softmmu_ss.add(when: ['CONFIG_VIRTIO_NET', 'CONFIG_VHOST_NET'], if_true: files('vhost_net.c'), if_false: files('vhost_net-stub.c'))
 softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost_net-stub.c'))
diff --git a/hw/net/vhost-vdpa-net.c b/hw/net/vhost-vdpa-net.c
new file mode 100644
index 000..48b99f9
--- /dev/null
+++ b/hw/net/vhost-vdpa-net.c
@@ -0,0 +1,338 @@
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qemu/cutils.h"
+#include "hw/qdev-core.h"
+#include "hw/qdev-properties.h"
+#include "hw/qdev-properties-system.h"
+#include "hw/virtio/vhost.h"
+#include "hw/virtio/vhost-vdpa-net.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-bus.h"
+#include "hw/virtio/virtio-access.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/runstate.h"
+#include "net/vhost-vdpa.h"
+
+static void vhost_vdpa_net_get_config(VirtIODevice *vdev, uint8_t *config)
+{
+VHostVdpaNet *s = VHOST_VDPA_NET(vdev);
+
+memcpy(config, &s->netcfg, sizeof(struct virtio_net_config));
+}
+
+static void vhost_vdpa_net_set_config(VirtIODevice *vdev, const uint8_t *config)
+{
+VHostVdpaNet *s = VHOST_VDPA_NET(vdev);
+struct virtio_net_config *netcfg = (struct virtio_net_config *)config;
+int ret;
+
+ret = vhost_dev_set_config(&s->dev, (uint8_t *)netcfg, 0, sizeof(*netcfg),
+   VHOST_SET_CONFIG_TYPE_MASTER);
+if (ret) {
+error_report("set device config space failed");
+return;
+}
+}
+
+static uint64_t vhost_vdpa_net_get_features(VirtIODevice *vdev,
+uint64_t features,
+Error **errp)
+{
+VHostVdpaNet *s = VHOST_VDPA_NET(vdev);
+
+virtio_add_feature(&features, VIRTIO_NET_F_CSUM);
+virtio_add_feature(&features, VIRTIO_NET_F_GUEST_CSUM);
+virtio_add_feature(&features, VIRTIO_NET_F_MAC);
+virtio_add_feature(&features, VIRTIO_NET_F_GSO);
+virtio_add_feature(&features, VIRTIO_NET_F_GUEST_TSO4);
+virtio_add_feature(&features, VIRTIO_NET_F_GUEST_TSO6);
+virtio_add_feature(&features, VIRTIO_NET_F_GUEST_ECN);
+virtio_add_feature(&features, VIRTIO_NET_F_GUEST_UFO);
+virtio_add_feature(&features, VIRTIO_NET_F_GUEST_ANNOUNCE);
+virtio_add_feature(&features, VIRTIO_NET_F_HOST_TSO4);
+virtio_add_feature(&features, VIRTIO_NET_F_HOST_TSO6);
+virtio_add_feature(&features, VIRTIO_NET_F_HOST_ECN);
+virtio_add_feature(&features, VIRTIO_NET_F_HOST_UFO);
+virtio_add_feature(&features, VIRTIO_NET_F_MRG_RXBUF);
+virtio_add_feature(&features, VIRTIO_NET_F_STATUS);
+virtio_add_feature(&features, VIRTIO_NET_F_CTRL_VQ);
+virtio_add_feature(&features, VIRTIO_NET_F_CTRL_RX);
+virtio_add_feature(&features, VIRTIO_NET_F_CTRL_VLAN);
+virtio_add_feature(&features, VIRTIO_NET_F_CTRL_RX_EXTRA);
+virtio_add_feature(&features, VIRTIO_NET_F_CTRL_MAC_ADDR);
+virtio_add_feature(&features, VIRTIO_NET_F_MQ);
+
+return vhost_get_features(&s->dev, vdpa_feature_bits, features);
+}
+
+static int vhost_vdpa_net_start(VirtIODevice *vdev, Error **errp)
+{
+VHostVdpaNet *s = VHOST_VDPA_NET(vdev);
+BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
+

[PATCH] Adding Cédric's repos in MAINTAINERS file.

2021-12-07 Thread lagarcia

From: Leonardo Garcia 

Signed-off-by: Leonardo Garcia 
---
 MAINTAINERS | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 7543eb4d59..52c6b99763 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -273,6 +273,7 @@ F: hw/ppc/ppc.c
 F: hw/ppc/ppc_booke.c
 F: include/hw/ppc/ppc.h
 F: disas/ppc.c
+T: git https://gitlab.com/legoater/qemu.git
 
 RISC-V TCG CPUs
 M: Palmer Dabbelt 
@@ -390,6 +391,7 @@ R: David Gibson 
 R: Greg Kurz 
 S: Maintained
 F: target/ppc/kvm.c
+T: git https://gitlab.com/legoater/qemu.git
 
 S390 KVM CPUs
 M: Halil Pasic 
@@ -1343,6 +1345,7 @@ F: tests/qtest/libqos/*spapr*
 F: tests/qtest/rtas*
 F: tests/qtest/libqos/rtas*
 F: tests/avocado/ppc_pseries.py
+T: git https://gitlab.com/legoater/qemu.git
 
 PowerNV (Non-Virtualized)
 M: Cédric Le Goater 
@@ -1356,6 +1359,7 @@ F: include/hw/ppc/pnv*
 F: include/hw/pci-host/pnv*
 F: pc-bios/skiboot.lid
 F: tests/qtest/pnv*
+T: git https://gitlab.com/legoater/qemu.git powernv-next
 
 virtex_ml507
 M: Edgar E. Iglesias 
@@ -1399,6 +1403,7 @@ F: hw/ppc/vof*
 F: include/hw/ppc/vof*
 F: pc-bios/vof/*
 F: pc-bios/vof*
+T: git https://gitlab.com/legoater/qemu.git
 
 RISC-V Machines
 ---
@@ -2244,6 +2249,7 @@ S: Supported
 F: hw/*/*xive*
 F: include/hw/*/*xive*
 F: docs/*/*xive*
+T: git https://gitlab.com/legoater/qemu.git
 
 Renesas peripherals
 R: Yoshinori Sato 
-- 
2.33.1

Re: [PATCH v10 77/77] target/riscv: rvv-1.0: Add ELEN checks for widening and narrowing instructions

On Mon, Nov 29, 2021 at 1:54 PM  wrote:
>
> From: Frank Chang 
>
> SEW is limited in that it cannot exceed ELEN.
>
> Widening instructions have a destination group with EEW = 2*SEW
> and narrowing instructions have a source operand with EEW = 2*SEW.
> Both of the instructions have the limitation of: 2*SEW <= ELEN.
>
> Signed-off-by: Frank Chang 

Acked-by: Alistair Francis 

Alistair

> ---
>  target/riscv/insn_trans/trans_rvv.c.inc | 17 +++--
>  target/riscv/translate.c|  2 ++
>  2 files changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/target/riscv/insn_trans/trans_rvv.c.inc 
> b/target/riscv/insn_trans/trans_rvv.c.inc
> index 47eb3119cbe..5e3f7fdb77c 100644
> --- a/target/riscv/insn_trans/trans_rvv.c.inc
> +++ b/target/riscv/insn_trans/trans_rvv.c.inc
> @@ -386,9 +386,10 @@ static bool vext_check_mss(DisasContext *s, int vd, int 
> vs1, int vs2)
>   *  can not be greater than 8 vector registers (Section 5.2):
>   *  => LMUL < 8.
>   *  => SEW < 64.
> - *   2. Destination vector register number is multiples of 2 * LMUL.
> + *   2. Double-width SEW cannot be greater than ELEN.
> + *   3. Destination vector register number is multiples of 2 * LMUL.
>   *  (Section 3.4.2)
> - *   3. Destination vector register group for a masked vector
> + *   4. Destination vector register group for a masked vector
>   *  instruction cannot overlap the source mask register (v0).
>   *  (Section 5.3)
>   */
> @@ -396,6 +397,7 @@ static bool vext_wide_check_common(DisasContext *s, int 
> vd, int vm)
>  {
>  return (s->lmul <= 2) &&
> (s->sew < MO_64) &&
> +   ((s->sew + 1) <= (s->elen >> 4)) &&
> require_align(vd, s->lmul + 1) &&
> require_vm(vm, vd);
>  }
> @@ -409,11 +411,12 @@ static bool vext_wide_check_common(DisasContext *s, int 
> vd, int vm)
>   *  can not be greater than 8 vector registers (Section 5.2):
>   *  => LMUL < 8.
>   *  => SEW < 64.
> - *   2. Source vector register number is multiples of 2 * LMUL.
> + *   2. Double-width SEW cannot be greater than ELEN.
> + *   3. Source vector register number is multiples of 2 * LMUL.
>   *  (Section 3.4.2)
> - *   3. Destination vector register number is multiples of LMUL.
> + *   4. Destination vector register number is multiples of LMUL.
>   *  (Section 3.4.2)
> - *   4. Destination vector register group for a masked vector
> + *   5. Destination vector register group for a masked vector
>   *  instruction cannot overlap the source mask register (v0).
>   *  (Section 5.3)
>   */
> @@ -422,6 +425,7 @@ static bool vext_narrow_check_common(DisasContext *s, int 
> vd, int vs2,
>  {
>  return (s->lmul <= 2) &&
> (s->sew < MO_64) &&
> +   ((s->sew + 1) <= (s->elen >> 4)) &&
> require_align(vs2, s->lmul + 1) &&
> require_align(vd, s->lmul) &&
> require_vm(vm, vd);
> @@ -2806,7 +2810,8 @@ GEN_OPIVV_TRANS(vredxor_vs, reduction_check)
>  /* Vector Widening Integer Reduction Instructions */
>  static bool reduction_widen_check(DisasContext *s, arg_rmrr *a)
>  {
> -return reduction_check(s, a) && (s->sew < MO_64);
> +return reduction_check(s, a) && (s->sew < MO_64) &&
> +   ((s->sew + 1) <= (s->elen >> 4));
>  }
>
>  GEN_OPIVV_WIDEN_TRANS(vwredsum_vs, reduction_widen_check)
> diff --git a/target/riscv/translate.c b/target/riscv/translate.c
> index 68edaaf6ac7..5df6c0d800b 100644
> --- a/target/riscv/translate.c
> +++ b/target/riscv/translate.c
> @@ -96,6 +96,7 @@ typedef struct DisasContext {
>  int8_t lmul;
>  uint8_t sew;
>  uint16_t vlen;
> +uint16_t elen;
>  target_ulong vstart;
>  bool vl_eq_vlmax;
>  uint8_t ntemp;
> @@ -705,6 +706,7 @@ static void riscv_tr_init_disas_context(DisasContextBase 
> *dcbase, CPUState *cs)
>  ctx->ext_zfh = cpu->cfg.ext_zfh;
>  ctx->ext_zfhmin = cpu->cfg.ext_zfhmin;
>  ctx->vlen = cpu->cfg.vlen;
> +ctx->elen = cpu->cfg.elen;
>  ctx->mstatus_hs_fs = FIELD_EX32(tb_flags, TB_FLAGS, MSTATUS_HS_FS);
>  ctx->mstatus_hs_vs = FIELD_EX32(tb_flags, TB_FLAGS, MSTATUS_HS_VS);
>  ctx->hlsx = FIELD_EX32(tb_flags, TB_FLAGS, HLSX);
> --
> 2.25.1
>
>
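
Note on the check added above: s->sew holds the log2 encoding of SEW
(0 for 8 bits up to 3 for 64 bits), so the new test
(s->sew + 1) <= (s->elen >> 4) is an encoded form of 2*SEW <= ELEN. The
following sketch compares the encoded test, combined with the existing
s->sew < MO_64 guard, against the direct inequality. It assumes MO_64 is
3 and that ELEN is a power of two between 8 and 64; the function names
are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Encoded form, as in the patch: sew is log2(SEW / 8), and the
 * "sew < 3" term stands for the existing (s->sew < MO_64) guard.
 */
static bool widen_ok_encoded(uint8_t sew, uint16_t elen)
{
    return (sew < 3) && ((sew + 1) <= (elen >> 4));
}

/* Direct form of the same rule: 2 * SEW <= ELEN, with SEW in bits. */
static bool widen_ok_direct(uint8_t sew, uint16_t elen)
{
    unsigned int sew_bits = 8u << sew;

    return 2 * sew_bits <= elen;
}
```

Under those assumptions the two forms agree for every SEW encoding. With
ELEN = 32, for example, SEW = 16 (sew = 1) passes and SEW = 32 (sew = 2)
is rejected.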

[PATCH v2] docs: Introducing pseries documentation.

2021-12-07 Thread lagarcia

From: Leonardo Garcia 

The purpose of this document is to substitute the content currently
available in the QEMU wiki at [0]. This initial version does contain
some additional content as well. Whenever this documentation gets
upstream and is reflected in [1], the QEMU wiki will be edited to point
to this documentation, so that we only need to keep it updated in one
place.

0. https://wiki.qemu.org/Documentation/Platforms/POWER
1. https://qemu.readthedocs.io/en/latest/system/ppc/pseries.html

Signed-off-by: Leonardo Garcia 
Reviewed-by: David Gibson 
---
 docs/system/ppc/pseries.rst | 226 
 1 file changed, 226 insertions(+)

diff --git a/docs/system/ppc/pseries.rst b/docs/system/ppc/pseries.rst
index 932d4dd17d..13edccbfd8 100644
--- a/docs/system/ppc/pseries.rst
+++ b/docs/system/ppc/pseries.rst
@@ -1,12 +1,238 @@
 pSeries family boards (``pseries``)
 ===
 
+The Power machine para-virtualized environment described by the `Linux on Power
+Architecture Reference document (LoPAR)
+`_
+is called pSeries. This environment is also known as sPAPR, System p guests, or
+simply Power Linux guests (although it is capable of running other operating
+systems, such as AIX).
+
+Even though pSeries is designed to behave as a guest environment, it is also
+capable of acting as a hypervisor OS, providing, in that role, nested
+virtualization capabilities.
+
 Supported devices
 -
 
+ * Multiprocessor support for many Power processor generations: POWER7,
+   POWER7+, POWER8, POWER8NVL, POWER9, and Power10. Support for POWER5+ exists,
+   but its state is unknown.
+ * Interrupt Controller, XICS (POWER8) and XIVE (POWER9 and Power10)
+ * vPHB PCIe Host bridge.
+ * vscsi and vnet devices, compatible with the same devices available on a
+   PowerVM hypervisor with VIOS managing LPARs.
+ * Virtio based devices.
+ * PCIe device pass through.
+
 Missing devices
 ---
 
+ * SPICE support.
 
 Firmware
 
+
+`SLOF `_ (Slimline Open Firmware) is an
+implementation of the `IEEE 1275-1994, Standard for Boot (Initialization
+Configuration) Firmware: Core Requirements and Practices
+`_.
+
+QEMU includes a prebuilt image of SLOF which is updated when a more recent
+version is required.
+
+Build directions
+
+
+.. code-block:: bash
+
+  ./configure --target-list=ppc64-softmmu && make
+
+Running instructions
+
+
+You can select the pSeries machine type by running QEMU with the following
+options:
+
+.. code-block:: bash
+
+  qemu-system-ppc64 -M pseries 
+
+sPAPR devices
+-
+
+The sPAPR specification defines a set of para-virtualized devices, which are
+also supported by the pSeries machine in QEMU and can be instantiated with the
+``-device`` option:
+
+* ``spapr-vlan``: a virtual network interface.
+* ``spapr-vscsi``: a virtual SCSI disk interface.
+* ``spapr-rng``: a pseudo-device for passing random number generator data to the
+  guest (see the `H_RANDOM hypercall feature
+  `_ for details).
+* ``spapr-vty``: a virtual teletype.
+* ``spapr-pci-host-bridge``: a PCI host bridge.
+* ``tpm-spapr``: a Trusted Platform Module (TPM).
+* ``spapr-tpm-proxy``: a TPM proxy.
+
+These are compatible with the devices historically available for use when
+running the IBM PowerVM hypervisor with LPARs.
+
+However, since these devices were originally specified with another
+hypervisor and non-Linux guests in mind, you should use the virtio counterparts
+(virtio-net, virtio-blk/scsi and virtio-rng) instead if possible, since they
+will most probably give you better performance with Linux guests in a QEMU
+environment.
+
+The pSeries machine in QEMU is always instantiated with the following devices:
+
+* A NVRAM device (``spapr-nvram``).
+* A virtual teletype (``spapr-vty``).
+* A PCI host bridge (``spapr-pci-host-bridge``).
+
+Hence, there is no need to add them manually, unless you use the ``-nodefaults``
+command line option in QEMU.
+
+In the case of the default ``spapr-nvram`` device, if someone wants to make the
+contents of the NVRAM device persistent, they will need to specify a PFLASH
+device when starting QEMU, i.e. either use
+``-drive if=pflash,file=,format=raw`` to set the default PFLASH
+device, or specify one with an ID
+(``-drive if=none,file=,format=raw,id=pfid``) and pass that ID to the
+NVRAM device with ``-global spapr-nvram.drive=pfid``.
+
+sPAPR specification
+^^^
+
+The main source of documentation on the sPAPR standard is the `Linux on Power
+Architecture Reference document (LoPAR)
+`_.
+However, documentation specific to QEMU's implementation of the specification
+can

Re: [PATCH v10 70/77] target/riscv: rvv-1.0: floating-point reciprocal estimate instruction

On Mon, Nov 29, 2021 at 1:58 PM  wrote:
>
> From: Frank Chang 
>
> Implement the floating-point reciprocal estimate to 7 bits instruction.
>
> Signed-off-by: Frank Chang 

Acked-by: Alistair Francis 

Alistair
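
For readers following the subnormal path in the hunk below, here is a minimal
standalone sketch of the normalization step and the overflow test that follows
it. It is only an illustration under stated assumptions, not the patch's code:
the helper names (low_mask, normalize_subnormal, reciprocal_overflows) are
hypothetical, and the exponent is kept as an unsigned 64-bit value so that
decrementing below zero wraps, as in the quoted function.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `len` bits set (1 <= len <= 63 in this sketch). */
static uint64_t low_mask(int len)
{
    return (UINT64_C(1) << len) - 1;
}

/*
 * Normalize a subnormal input: shift the fraction left until its top
 * bit is set, decrementing the unsigned exponent once per shift, then
 * drop the now-implicit leading one. `*frac` must be non-zero.
 */
static void normalize_subnormal(uint64_t *exp, uint64_t *frac, int frac_size)
{
    while (((*frac >> (frac_size - 1)) & 1) == 0) {
        (*exp)--;            /* wraps below zero: 0 -> UINT64_MAX -> ... */
        *frac <<= 1;
    }
    *frac = (*frac << 1) & low_mask(frac_size);
}

/*
 * After normalization, only exp == 0 and exp == UINT64_MAX (that is, -1)
 * take the normal path; any smaller exponent is treated as overflow.
 */
static int reciprocal_overflows(uint64_t exp)
{
    return exp != 0 && exp != UINT64_MAX;
}
```

With a 23-bit fraction, an input fraction of 0x200000 needs one shift and stays
representable, while 0x100000 needs two shifts and lands in the overflow case.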

> ---
>  target/riscv/helper.h   |   4 +
>  target/riscv/insn32.decode  |   1 +
>  target/riscv/insn_trans/trans_rvv.c.inc |   1 +
>  target/riscv/vector_helper.c| 191 
>  4 files changed, 197 insertions(+)
>
> diff --git a/target/riscv/helper.h b/target/riscv/helper.h
> index bdf06dfb24d..ab283d12b79 100644
> --- a/target/riscv/helper.h
> +++ b/target/riscv/helper.h
> @@ -845,6 +845,10 @@ DEF_HELPER_5(vfrsqrt7_v_h, void, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_5(vfrsqrt7_v_w, void, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_5(vfrsqrt7_v_d, void, ptr, ptr, ptr, env, i32)
>
> +DEF_HELPER_5(vfrec7_v_h, void, ptr, ptr, ptr, env, i32)
> +DEF_HELPER_5(vfrec7_v_w, void, ptr, ptr, ptr, env, i32)
> +DEF_HELPER_5(vfrec7_v_d, void, ptr, ptr, ptr, env, i32)
> +
>  DEF_HELPER_6(vfmin_vv_h, void, ptr, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_6(vfmin_vv_w, void, ptr, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_6(vfmin_vv_d, void, ptr, ptr, ptr, ptr, env, i32)
> diff --git a/target/riscv/insn32.decode b/target/riscv/insn32.decode
> index 6e5f288943a..952768f8ded 100644
> --- a/target/riscv/insn32.decode
> +++ b/target/riscv/insn32.decode
> @@ -561,6 +561,7 @@ vfwnmsac_vv     111111 . ..... ..... 001 ..... 1010111 @r_vm
>  vfwnmsac_vf     111111 . ..... ..... 101 ..... 1010111 @r_vm
>  vfsqrt_v        010011 . ..... 00000 001 ..... 1010111 @r2_vm
>  vfrsqrt7_v      010011 . ..... 00100 001 ..... 1010111 @r2_vm
> +vfrec7_v        010011 . ..... 00101 001 ..... 1010111 @r2_vm
>  vfmin_vv        000100 . ..... ..... 001 ..... 1010111 @r_vm
>  vfmin_vf        000100 . ..... ..... 101 ..... 1010111 @r_vm
>  vfmax_vv        000110 . ..... ..... 001 ..... 1010111 @r_vm
> diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
> index 8fe718610a9..ff8f6df8f7b 100644
> --- a/target/riscv/insn_trans/trans_rvv.c.inc
> +++ b/target/riscv/insn_trans/trans_rvv.c.inc
> @@ -2408,6 +2408,7 @@ static bool trans_##NAME(DisasContext *s, arg_rmr *a)  \
>
>  GEN_OPFV_TRANS(vfsqrt_v, opfv_check, RISCV_FRM_DYN)
>  GEN_OPFV_TRANS(vfrsqrt7_v, opfv_check, RISCV_FRM_DYN)
> +GEN_OPFV_TRANS(vfrec7_v, opfv_check, RISCV_FRM_DYN)
>
>  /* Vector Floating-Point MIN/MAX Instructions */
>  GEN_OPFVV_TRANS(vfmin_vv, opfvv_check)
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index d5f3229bcb4..946dca53ffd 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -3587,6 +3587,197 @@ GEN_VEXT_V_ENV(vfrsqrt7_v_h, 2, 2)
>  GEN_VEXT_V_ENV(vfrsqrt7_v_w, 4, 4)
>  GEN_VEXT_V_ENV(vfrsqrt7_v_d, 8, 8)
>
> +/*
> + * Vector Floating-Point Reciprocal Estimate Instruction
> + *
> + * Adapted from riscv-v-spec recip.c:
> + * https://github.com/riscv/riscv-v-spec/blob/master/recip.c
> + */
> +static uint64_t frec7(uint64_t f, int exp_size, int frac_size,
> +  float_status *s)
> +{
> +uint64_t sign = extract64(f, frac_size + exp_size, 1);
> +uint64_t exp = extract64(f, frac_size, exp_size);
> +uint64_t frac = extract64(f, 0, frac_size);
> +
> +const uint8_t lookup_table[] = {
> +127, 125, 123, 121, 119, 117, 116, 114,
> +112, 110, 109, 107, 105, 104, 102, 100,
> +99, 97, 96, 94, 93, 91, 90, 88,
> +87, 85, 84, 83, 81, 80, 79, 77,
> +76, 75, 74, 72, 71, 70, 69, 68,
> +66, 65, 64, 63, 62, 61, 60, 59,
> +58, 57, 56, 55, 54, 53, 52, 51,
> +50, 49, 48, 47, 46, 45, 44, 43,
> +42, 41, 40, 40, 39, 38, 37, 36,
> +35, 35, 34, 33, 32, 31, 31, 30,
> +29, 28, 28, 27, 26, 25, 25, 24,
> +23, 23, 22, 21, 21, 20, 19, 19,
> +18, 17, 17, 16, 15, 15, 14, 14,
> +13, 12, 12, 11, 11, 10, 9, 9,
> +8, 8, 7, 7, 6, 5, 5, 4,
> +4, 3, 3, 2, 2, 1, 1, 0
> +};
> +const int precision = 7;
> +
> +if (exp == 0 && frac != 0) { /* subnormal */
> +/* Normalize the subnormal. */
> +while (extract64(frac, frac_size - 1, 1) == 0) {
> +exp--;
> +frac <<= 1;
> +}
> +
> +frac = (frac << 1) & MAKE_64BIT_MASK(0, frac_size);
> +
> +if (exp != 0 && exp != UINT64_MAX) {
> +/*
> + * Overflow to inf or max value of same sign,
> + * depending on sign and rounding mode.
> + */
> +s->float_exception_flags |= (float_flag_inexact |
> + float_flag_overflow);
> +
> +if ((s->float_rounding_mode == float_round_to_zero) ||
> +((s->float_rounding_mode == float_round_down) && !sign) ||
> +((s->float_rounding_mode == float_round_up) && sign)) {
> +/* Return greatest/negative finite

Re: [PATCH v10 69/77] target/riscv: rvv-1.0: floating-point reciprocal square-root estimate instruction

On Mon, Nov 29, 2021 at 1:57 PM  wrote:
>
> From: Frank Chang 
>
> Implement the floating-point reciprocal square-root estimate to 7 bits
> instruction.
>
> Signed-off-by: Frank Chang 

Acked-by: Alistair Francis 

Alistair
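
As a reading aid for the table lookup in the hunk below, here is a small sketch
of how the index and the output exponent are derived. It assumes nothing beyond
the arithmetic visible in the quoted function; the helper names frsqrt7_index
and frsqrt7_out_exp are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { PRECISION = 7 };

/* Table index: low exponent bit on top, then the 6 leading fraction bits. */
static int frsqrt7_index(uint64_t exp, uint64_t frac, int frac_size)
{
    return (int)(((exp & 1) << (PRECISION - 1)) |
                 (frac >> (frac_size - PRECISION + 1)));
}

/*
 * Output exponent: (3 * bias + ~exp) / 2. In modulo-2^64 arithmetic
 * ~exp equals -exp - 1, so this is (3 * bias - 1 - exp) / 2.
 */
static uint64_t frsqrt7_out_exp(uint64_t exp, int exp_size)
{
    uint64_t bias = (UINT64_C(1) << (exp_size - 1)) - 1;
    return (3 * bias + ~exp) / 2;
}
```

For a single-precision 4.0 (biased exponent 129, fraction 0) this gives index 64
and output exponent 125. Entry 64 of the quoted table is 127, so the estimate is
(1 + 127/128) * 2^-2, roughly 0.498, against an exact value of 0.5.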

> ---
>  target/riscv/helper.h   |   4 +
>  target/riscv/insn32.decode  |   1 +
>  target/riscv/insn_trans/trans_rvv.c.inc |   1 +
>  target/riscv/vector_helper.c| 183 
>  4 files changed, 189 insertions(+)
>
> diff --git a/target/riscv/helper.h b/target/riscv/helper.h
> index a717a87a0e0..bdf06dfb24d 100644
> --- a/target/riscv/helper.h
> +++ b/target/riscv/helper.h
> @@ -841,6 +841,10 @@ DEF_HELPER_5(vfsqrt_v_h, void, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_5(vfsqrt_v_w, void, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_5(vfsqrt_v_d, void, ptr, ptr, ptr, env, i32)
>
> +DEF_HELPER_5(vfrsqrt7_v_h, void, ptr, ptr, ptr, env, i32)
> +DEF_HELPER_5(vfrsqrt7_v_w, void, ptr, ptr, ptr, env, i32)
> +DEF_HELPER_5(vfrsqrt7_v_d, void, ptr, ptr, ptr, env, i32)
> +
>  DEF_HELPER_6(vfmin_vv_h, void, ptr, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_6(vfmin_vv_w, void, ptr, ptr, ptr, ptr, env, i32)
>  DEF_HELPER_6(vfmin_vv_d, void, ptr, ptr, ptr, ptr, env, i32)
> diff --git a/target/riscv/insn32.decode b/target/riscv/insn32.decode
> index c4fdc76a269..6e5f288943a 100644
> --- a/target/riscv/insn32.decode
> +++ b/target/riscv/insn32.decode
> @@ -560,6 +560,7 @@ vfwmsac_vf      111110 . ..... ..... 101 ..... 1010111 @r_vm
>  vfwnmsac_vv     111111 . ..... ..... 001 ..... 1010111 @r_vm
>  vfwnmsac_vf     111111 . ..... ..... 101 ..... 1010111 @r_vm
>  vfsqrt_v        010011 . ..... 00000 001 ..... 1010111 @r2_vm
> +vfrsqrt7_v      010011 . ..... 00100 001 ..... 1010111 @r2_vm
>  vfmin_vv        000100 . ..... ..... 001 ..... 1010111 @r_vm
>  vfmin_vf        000100 . ..... ..... 101 ..... 1010111 @r_vm
>  vfmax_vv        000110 . ..... ..... 001 ..... 1010111 @r_vm
> diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
> index 53c8573f117..8fe718610a9 100644
> --- a/target/riscv/insn_trans/trans_rvv.c.inc
> +++ b/target/riscv/insn_trans/trans_rvv.c.inc
> @@ -2407,6 +2407,7 @@ static bool trans_##NAME(DisasContext *s, arg_rmr *a)  \
>  }
>
>  GEN_OPFV_TRANS(vfsqrt_v, opfv_check, RISCV_FRM_DYN)
> +GEN_OPFV_TRANS(vfrsqrt7_v, opfv_check, RISCV_FRM_DYN)
>
>  /* Vector Floating-Point MIN/MAX Instructions */
>  GEN_OPFVV_TRANS(vfmin_vv, opfvv_check)
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 22848d6b683..d5f3229bcb4 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -18,6 +18,7 @@
>
>  #include "qemu/osdep.h"
>  #include "qemu/host-utils.h"
> +#include "qemu/bitops.h"
>  #include "cpu.h"
>  #include "exec/memop.h"
>  #include "exec/exec-all.h"
> @@ -3404,6 +3405,188 @@ GEN_VEXT_V_ENV(vfsqrt_v_h, 2, 2)
>  GEN_VEXT_V_ENV(vfsqrt_v_w, 4, 4)
>  GEN_VEXT_V_ENV(vfsqrt_v_d, 8, 8)
>
> +/*
> + * Vector Floating-Point Reciprocal Square-Root Estimate Instruction
> + *
> + * Adapted from riscv-v-spec recip.c:
> + * https://github.com/riscv/riscv-v-spec/blob/master/recip.c
> + */
> +static uint64_t frsqrt7(uint64_t f, int exp_size, int frac_size)
> +{
> +uint64_t sign = extract64(f, frac_size + exp_size, 1);
> +uint64_t exp = extract64(f, frac_size, exp_size);
> +uint64_t frac = extract64(f, 0, frac_size);
> +
> +const uint8_t lookup_table[] = {
> +52, 51, 50, 48, 47, 46, 44, 43,
> +42, 41, 40, 39, 38, 36, 35, 34,
> +33, 32, 31, 30, 30, 29, 28, 27,
> +26, 25, 24, 23, 23, 22, 21, 20,
> +19, 19, 18, 17, 16, 16, 15, 14,
> +14, 13, 12, 12, 11, 10, 10, 9,
> +9, 8, 7, 7, 6, 6, 5, 4,
> +4, 3, 3, 2, 2, 1, 1, 0,
> +127, 125, 123, 121, 119, 118, 116, 114,
> +113, 111, 109, 108, 106, 105, 103, 102,
> +100, 99, 97, 96, 95, 93, 92, 91,
> +90, 88, 87, 86, 85, 84, 83, 82,
> +80, 79, 78, 77, 76, 75, 74, 73,
> +72, 71, 70, 70, 69, 68, 67, 66,
> +65, 64, 63, 63, 62, 61, 60, 59,
> +59, 58, 57, 56, 56, 55, 54, 53
> +};
> +const int precision = 7;
> +
> +if (exp == 0 && frac != 0) { /* subnormal */
> +/* Normalize the subnormal. */
> +while (extract64(frac, frac_size - 1, 1) == 0) {
> +exp--;
> +frac <<= 1;
> +}
> +
> +frac = (frac << 1) & MAKE_64BIT_MASK(0, frac_size);
> +}
> +
> +int idx = ((exp & 1) << (precision - 1)) |
> +(frac >> (frac_size - precision + 1));
> +uint64_t out_frac = (uint64_t)(lookup_table[idx]) <<
> +(frac_size - precision);
> +uint64_t out_exp = (3 * MAKE_64BIT_MASK(0, exp_size - 1) + ~exp) / 2;
> +
> +uint64_t val = 0;
> +val = deposit64(val, 0, frac_size, out_frac);
> +val = deposit64(val, frac_size, exp_size, out_exp);
> +val = deposit64(val, frac_size + exp_size,

Re: [PATCH v3 1/1] target/riscv: Fix PMP propagation for tlb

On Tue, Nov 23, 2021 at 7:09 PM LIU Zhiwei  wrote:
>
> Only the pmp index that be checked by pmp_hart_has_privs can be used
> by pmp_get_tlb_size to avoid an error pmp index.
>
> Before modification, we may use an error pmp index. For example,
> we check address 0x4fc, and the size 0x4 in pmp_hart_has_privs. If there
> is an pmp rule, valid range is [0x4fc, 0x500), then pmp_hart_has_privs
> will return true;
>
> However, this checked pmp index is discarded as pmp_hart_has_privs
> return bool value. In pmp_is_range_in_tlb, it will traverse all pmp
> rules. The tlb_sa will be 0x0, and tlb_ea will be 0x4fff. If there is
> a pmp rule [0x10, 0x4]. It will be misused as it is legal in
> pmp_get_tlb_size.
>
> Signed-off-by: LIU Zhiwei 

Thanks!

Applied to riscv-to-apply.next

Alistair
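
To make the new return convention of pmp_hart_has_privs() easier to follow, here
is a hedged sketch of how a caller can tell the three cases apart. The enum and
the classify_pmp_result() helper are hypothetical, and the value used for
MAX_RISCV_PMPS is an assumption made only so the example compiles on its own.

```c
#include <assert.h>

#define MAX_RISCV_PMPS 16   /* assumed value, for illustration only */

enum tlb_action {
    ACCESS_DENIED,      /* negative: neither a rule nor the default grants access */
    KEEP_TLB_SIZE,      /* MAX_RISCV_PMPS: the default privileges matched */
    CLAMP_TO_RULE       /* 0..MAX_RISCV_PMPS-1: TLB size depends on that rule */
};

/* Map the integer result onto what the caller does next. */
static enum tlb_action classify_pmp_result(int pmp_index)
{
    if (pmp_index < 0) {
        return ACCESS_DENIED;
    }
    if (pmp_index == MAX_RISCV_PMPS) {
        return KEEP_TLB_SIZE;
    }
    return CLAMP_TO_RULE;
}
```

Only the last case passes an index on to pmp_get_tlb_size(), which is what keeps
an unrelated rule from being consulted.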

> ---
>  target/riscv/cpu_helper.c | 16 ++-
>  target/riscv/pmp.c| 56 +--
>  target/riscv/pmp.h|  6 ++---
>  3 files changed, 31 insertions(+), 47 deletions(-)
>
> diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> index 9eeed38c7e..4239bd2ca5 100644
> --- a/target/riscv/cpu_helper.c
> +++ b/target/riscv/cpu_helper.c
> @@ -362,24 +362,26 @@ static int get_physical_address_pmp(CPURISCVState *env, int *prot,
>  int mode)
>  {
>  pmp_priv_t pmp_priv;
> -target_ulong tlb_size_pmp = 0;
> +int pmp_index = -1;
>
>  if (!riscv_feature(env, RISCV_FEATURE_PMP)) {
>  *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
>  return TRANSLATE_SUCCESS;
>  }
>
> -if (!pmp_hart_has_privs(env, addr, size, 1 << access_type, &pmp_priv,
> -mode)) {
> +pmp_index = pmp_hart_has_privs(env, addr, size, 1 << access_type,
> +   &pmp_priv, mode);
> +if (pmp_index < 0) {
>  *prot = 0;
>  return TRANSLATE_PMP_FAIL;
>  }
>
>  *prot = pmp_priv_to_page_prot(pmp_priv);
> -if (tlb_size != NULL) {
> -if (pmp_is_range_in_tlb(env, addr & ~(*tlb_size - 1), &tlb_size_pmp)) {
> -*tlb_size = tlb_size_pmp;
> -}
> +if ((tlb_size != NULL) && pmp_index != MAX_RISCV_PMPS) {
> +target_ulong tlb_sa = addr & ~(*tlb_size - 1);
> +target_ulong tlb_ea = tlb_sa + *tlb_size - 1;
> +
> +*tlb_size = pmp_get_tlb_size(env, pmp_index, tlb_sa, tlb_ea);
>  }
>
>  return TRANSLATE_SUCCESS;
> diff --git a/target/riscv/pmp.c b/target/riscv/pmp.c
> index 54abf42583..1172142e34 100644
> --- a/target/riscv/pmp.c
> +++ b/target/riscv/pmp.c
> @@ -297,8 +297,11 @@ static bool pmp_hart_has_privs_default(CPURISCVState *env, target_ulong addr,
>
>  /*
>   * Check if the address has required RWX privs to complete desired operation
> + * Return PMP rule index if a pmp rule match
> + * Return MAX_RISCV_PMPS if default match
> + * Return negative value if no match
>   */
> -bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
> +int pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>  target_ulong size, pmp_priv_t privs, pmp_priv_t *allowed_privs,
>  target_ulong mode)
>  {
> @@ -310,8 +313,10 @@ bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>
>  /* Short cut if no rules */
>  if (0 == pmp_get_num_rules(env)) {
> -return pmp_hart_has_privs_default(env, addr, size, privs,
> -  allowed_privs, mode);
> +if (pmp_hart_has_privs_default(env, addr, size, privs,
> +   allowed_privs, mode)) {
> +ret = MAX_RISCV_PMPS;
> +}
>  }
>
>  if (size == 0) {
> @@ -338,7 +343,7 @@ bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>  if ((s + e) == 1) {
>  qemu_log_mask(LOG_GUEST_ERROR,
>"pmp violation - access is partially inside\n");
> -ret = 0;
> +ret = -1;
>  break;
>  }
>
> @@ -441,18 +446,22 @@ bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>  }
>  }
>
> -ret = ((privs & *allowed_privs) == privs);
> +if ((privs & *allowed_privs) == privs) {
> +ret = i;
> +}
>  break;
>  }
>  }
>
>  /* No rule matched */
>  if (ret == -1) {
> -return pmp_hart_has_privs_default(env, addr, size, privs,
> -  allowed_privs, mode);
> +if (pmp_hart_has_privs_default(env, addr, size, privs,
> +   allowed_privs, mode)) {
> +ret = MAX_RISCV_PMPS;
> +}
>  }
>
> -return ret == 1 ? true : false;
> +return ret;
>  }
>
>  /*
> @@ -595,8 +604,8 @@ target_ulong mseccfg_csr_read(CPURISCVState *env)
>   * Calculate the TLB size if the start address or the end address of
>   * PMP entry is presented in the TLB page.
>   */
> -static

Re: [PATCH v3 1/1] target/riscv: Fix PMP propagation for tlb

On Tue, Nov 23, 2021 at 7:09 PM LIU Zhiwei  wrote:
>
> Only the pmp index that be checked by pmp_hart_has_privs can be used
> by pmp_get_tlb_size to avoid an error pmp index.
>
> Before modification, we may use an error pmp index. For example,
> we check address 0x4fc, and the size 0x4 in pmp_hart_has_privs. If there
> is an pmp rule, valid range is [0x4fc, 0x500), then pmp_hart_has_privs
> will return true;
>
> However, this checked pmp index is discarded as pmp_hart_has_privs
> return bool value. In pmp_is_range_in_tlb, it will traverse all pmp
> rules. The tlb_sa will be 0x0, and tlb_ea will be 0x4fff. If there is
> a pmp rule [0x10, 0x4]. It will be misused as it is legal in
> pmp_get_tlb_size.
>
> Signed-off-by: LIU Zhiwei 

Reviewed-by: Alistair Francis 

Alistair

> ---
>  target/riscv/cpu_helper.c | 16 ++-
>  target/riscv/pmp.c| 56 +--
>  target/riscv/pmp.h|  6 ++---
>  3 files changed, 31 insertions(+), 47 deletions(-)
>
> diff --git a/target/riscv/cpu_helper.c b/target/riscv/cpu_helper.c
> index 9eeed38c7e..4239bd2ca5 100644
> --- a/target/riscv/cpu_helper.c
> +++ b/target/riscv/cpu_helper.c
> @@ -362,24 +362,26 @@ static int get_physical_address_pmp(CPURISCVState *env, int *prot,
>  int mode)
>  {
>  pmp_priv_t pmp_priv;
> -target_ulong tlb_size_pmp = 0;
> +int pmp_index = -1;
>
>  if (!riscv_feature(env, RISCV_FEATURE_PMP)) {
>  *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
>  return TRANSLATE_SUCCESS;
>  }
>
> -if (!pmp_hart_has_privs(env, addr, size, 1 << access_type, &pmp_priv,
> -mode)) {
> +pmp_index = pmp_hart_has_privs(env, addr, size, 1 << access_type,
> +   &pmp_priv, mode);
> +if (pmp_index < 0) {
>  *prot = 0;
>  return TRANSLATE_PMP_FAIL;
>  }
>
>  *prot = pmp_priv_to_page_prot(pmp_priv);
> -if (tlb_size != NULL) {
> -if (pmp_is_range_in_tlb(env, addr & ~(*tlb_size - 1), &tlb_size_pmp)) {
> -*tlb_size = tlb_size_pmp;
> -}
> +if ((tlb_size != NULL) && pmp_index != MAX_RISCV_PMPS) {
> +target_ulong tlb_sa = addr & ~(*tlb_size - 1);
> +target_ulong tlb_ea = tlb_sa + *tlb_size - 1;
> +
> +*tlb_size = pmp_get_tlb_size(env, pmp_index, tlb_sa, tlb_ea);
>  }
>
>  return TRANSLATE_SUCCESS;
> diff --git a/target/riscv/pmp.c b/target/riscv/pmp.c
> index 54abf42583..1172142e34 100644
> --- a/target/riscv/pmp.c
> +++ b/target/riscv/pmp.c
> @@ -297,8 +297,11 @@ static bool pmp_hart_has_privs_default(CPURISCVState *env, target_ulong addr,
>
>  /*
>   * Check if the address has required RWX privs to complete desired operation
> + * Return PMP rule index if a pmp rule match
> + * Return MAX_RISCV_PMPS if default match
> + * Return negative value if no match
>   */
> -bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
> +int pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>  target_ulong size, pmp_priv_t privs, pmp_priv_t *allowed_privs,
>  target_ulong mode)
>  {
> @@ -310,8 +313,10 @@ bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>
>  /* Short cut if no rules */
>  if (0 == pmp_get_num_rules(env)) {
> -return pmp_hart_has_privs_default(env, addr, size, privs,
> -  allowed_privs, mode);
> +if (pmp_hart_has_privs_default(env, addr, size, privs,
> +   allowed_privs, mode)) {
> +ret = MAX_RISCV_PMPS;
> +}
>  }
>
>  if (size == 0) {
> @@ -338,7 +343,7 @@ bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>  if ((s + e) == 1) {
>  qemu_log_mask(LOG_GUEST_ERROR,
>"pmp violation - access is partially inside\n");
> -ret = 0;
> +ret = -1;
>  break;
>  }
>
> @@ -441,18 +446,22 @@ bool pmp_hart_has_privs(CPURISCVState *env, target_ulong addr,
>  }
>  }
>
> -ret = ((privs & *allowed_privs) == privs);
> +if ((privs & *allowed_privs) == privs) {
> +ret = i;
> +}
>  break;
>  }
>  }
>
>  /* No rule matched */
>  if (ret == -1) {
> -return pmp_hart_has_privs_default(env, addr, size, privs,
> -  allowed_privs, mode);
> +if (pmp_hart_has_privs_default(env, addr, size, privs,
> +   allowed_privs, mode)) {
> +ret = MAX_RISCV_PMPS;
> +}
>  }
>
> -return ret == 1 ? true : false;
> +return ret;
>  }
>
>  /*
> @@ -595,8 +604,8 @@ target_ulong mseccfg_csr_read(CPURISCVState *env)
>   * Calculate the TLB size if the start address or the end address of
>   * PMP entry is presented in the TLB page.
>   */
> -static target_ulong

Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13

2021-12-07 Thread Sean Christopherson

On Tue, Dec 07, 2021, Chris Murphy wrote:
> cc: qemu-devel
> 
> Hi,
> 
> I'm trying to help progress a very troublesome and so far elusive bug
> we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
> VMs simultaneously, eventually they become unresponsive, as well as
> new processes as we try to extract information from the host about
> what's gone wrong.

Have you tried bisecting?  IIUC, the issues showed up between v5.11 and v5.12.12,
bisecting should be relatively straightforward.

> Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> state where forking does not work correctly, breaking most things
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585
> 
> In subsequent testing, we used newer kernels with lockdep and other
> debug stuff enabled, and managed to capture a hung task with a bunch
> of locks listed, including kvm and qemu processes. But I can't parse
> it.
> 
> 5.15-rc7
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
> 5.15+
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
> 
> If anyone can take a glance at those kernel messages, and/or give
> hints how we can extract more information for debugging, it'd be
> appreciated. Maybe all of that is normal and the actual problem isn't
> in any of these traces.

All the instances of

  (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]

are uninteresting and expected, that's just each vCPU task taking its associated
vcpu->mutex, likely for KVM_RUN.

At a glance, the XFS stuff looks far more interesting/suspect.

[PATCH 12/12] s390x/pci: let intercept devices have separate PCI groups

Let's use the reserved pool of simulated PCI groups to allow intercept
devices to have separate groups from interpreted devices as some group
values may be different. If we run out of simulated PCI groups, subsequent
intercept devices just get the default group.
Furthermore, if we encounter any PCI groups from hostdevs that are marked
as simulated, let's just assign them to the default group to avoid
conflicts between host simulated groups and our own simulated groups.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c | 19 ++--
 hw/s390x/s390-pci-vfio.c| 40 ++---
 include/hw/s390x/s390-pci-bus.h |  6 -
 3 files changed, 59 insertions(+), 6 deletions(-)
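
For reviewers, the selection policy described above can be sketched as a small
standalone model. This is an illustration under stated assumptions, not the code
in this patch: the struct layout, the pick_intercept_group() helper and the
numeric values chosen for SIM_GRP_START and DEFAULT_FN_GRP are all hypothetical
stand-ins for the real definitions in s390-pci-bus.h.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the real ones live in s390-pci-bus.h. */
#define SIM_GRP_START   0xF0
#define DEFAULT_FN_GRP  0xFF
#define MAX_GROUPS      32

struct group { int id; int host_id; };

struct state {
    struct group groups[MAX_GROUPS];
    size_t ngroups;
    int next_sim_grp;
};

static void state_init(struct state *s)
{
    s->ngroups = 0;
    s->next_sim_grp = SIM_GRP_START;
}

/* Return the group id an intercept device from host group `host_id` gets. */
static int pick_intercept_group(struct state *s, int host_id)
{
    /* Host groups that are themselves simulated always map to the default. */
    if (host_id >= SIM_GRP_START) {
        return DEFAULT_FN_GRP;
    }
    /* Reuse a simulated group already created for this host group. */
    for (size_t i = 0; i < s->ngroups; i++) {
        if (s->groups[i].id >= SIM_GRP_START &&
            s->groups[i].host_id == host_id) {
            return s->groups[i].id;
        }
    }
    /* Pool exhausted: fall back to the default group. */
    if (s->next_sim_grp == DEFAULT_FN_GRP) {
        return DEFAULT_FN_GRP;
    }
    /* Otherwise hand out the next simulated group id. */
    struct group *g = &s->groups[s->ngroups++];
    g->id = s->next_sim_grp++;
    g->host_id = host_id;
    return g->id;
}
```

Two intercept devices from the same host group end up sharing one simulated
group, and once the reserved range is used up every further host group falls
back to the default.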

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index ab442f17fb..8b0f3ef120 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -747,13 +747,14 @@ static void s390_pci_iommu_free(S390pciState *s, PCIBus *bus, int32_t devfn)
 object_unref(OBJECT(iommu));
 }
 
-S390PCIGroup *s390_group_create(int id)
+S390PCIGroup *s390_group_create(int id, int host_id)
 {
 S390PCIGroup *group;
 S390pciState *s = s390_get_phb();
 
 group = g_new0(S390PCIGroup, 1);
 group->id = id;
+group->host_id = host_id;
 QTAILQ_INSERT_TAIL(&s->zpci_groups, group, link);
 return group;
 }
@@ -771,12 +772,25 @@ S390PCIGroup *s390_group_find(int id)
 return NULL;
 }
 
+S390PCIGroup *s390_group_find_host_sim(int host_id)
+{
+S390PCIGroup *group;
+S390pciState *s = s390_get_phb();
+
+QTAILQ_FOREACH(group, &s->zpci_groups, link) {
+if (group->id >= ZPCI_SIM_GRP_START && group->host_id == host_id) {
+return group;
+}
+}
+return NULL;
+}
+
 static void s390_pci_init_default_group(void)
 {
 S390PCIGroup *group;
 ClpRspQueryPciGrp *resgrp;
 
-group = s390_group_create(ZPCI_DEFAULT_FN_GRP);
+group = s390_group_create(ZPCI_DEFAULT_FN_GRP, ZPCI_DEFAULT_FN_GRP);
 resgrp = &group->zpci_group;
 resgrp->fr = 1;
 resgrp->dasm = 0;
@@ -824,6 +838,7 @@ static void s390_pcihost_realize(DeviceState *dev, Error **errp)
NULL, g_free);
 s->zpci_table = g_hash_table_new_full(g_int_hash, g_int_equal, NULL, NULL);
 s->bus_no = 0;
+s->next_sim_grp = ZPCI_SIM_GRP_START;
 QTAILQ_INIT(&s->pending_sei);
 QTAILQ_INIT(&s->zpci_devs);
 QTAILQ_INIT(&s->zpci_dma_limit);
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index c9269683f5..bdc5892287 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -305,13 +305,17 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
 {
 struct vfio_info_cap_header *hdr;
 struct vfio_device_info_cap_zpci_group *cap;
+S390pciState *s = s390_get_phb();
 ClpRspQueryPciGrp *resgrp;
 VFIOPCIDevice *vpci =  container_of(pbdev->pdev, VFIOPCIDevice, pdev);
 
 hdr = vfio_get_device_info_cap(info, VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
 
-/* If capability not provided, just use the default group */
-if (hdr == NULL) {
+/*
+ * If capability not provided or the underlying hostdev is simulated, just
+ * use the default group.
+ */
+if (hdr == NULL || pbdev->zpci_fn.pfgid >= ZPCI_SIM_GRP_START) {
 trace_s390_pci_clp_cap(vpci->vbasedev.name,
VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
 pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
@@ -320,11 +324,41 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
 }
 cap = (void *) hdr;
 
+/*
+ * For an intercept device, let's use an existing simulated group if
+ * one was already created for other intercept devices in this group.
+ * If not, create a new simulated group if any are still available.
+ * If all else fails, just fall back on the default group.
+ */
+if (!pbdev->interp) {
+pbdev->pci_group = s390_group_find_host_sim(pbdev->zpci_fn.pfgid);
+if (pbdev->pci_group) {
+/* Use existing simulated group */
+pbdev->zpci_fn.pfgid = pbdev->pci_group->id;
+return;
+} else {
+if (s->next_sim_grp == ZPCI_DEFAULT_FN_GRP) {
+/* All out of simulated groups, use default */
+trace_s390_pci_clp_cap(vpci->vbasedev.name,
+   VFIO_DEVICE_INFO_CAP_ZPCI_GROUP);
+pbdev->zpci_fn.pfgid = ZPCI_DEFAULT_FN_GRP;
+pbdev->pci_group = s390_group_find(ZPCI_DEFAULT_FN_GRP);
+return;
+} else {
+/* We can assign a new simulated group */
+pbdev->zpci_fn.pfgid = s->next_sim_grp;
+s->next_sim_grp++;
+/* Fall through to create the new sim group using CLP info */
+}
+}
+}
+
 /* See if the PCI group is already defined, create if not */
 pbdev->pci_group =

[PATCH 07/12] s390x/pci: enable for load/store interpretation

Use the associated vfio feature ioctl to enable interpretation for devices
when requested.  As part of this process, we must use the host function
handle rather than a QEMU-generated one -- this is provided as part of the
ioctl payload.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c  | 69 +++-
 hw/s390x/s390-pci-inst.c | 63 -
 hw/s390x/s390-pci-vfio.c | 55 +
 include/hw/s390x/s390-pci-bus.h  |  1 +
 include/hw/s390x/s390-pci-vfio.h | 15 +++
 5 files changed, 201 insertions(+), 2 deletions(-)
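
As a reading aid, the function-handle handling at plug time can be sketched as
follows. This is only an illustration: the helper names guest_initial_fh() and
fh_index() are hypothetical, and the mask values are assumptions chosen so the
example compiles on its own (the real definitions are in s390-pci-bus.h).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mask values, for illustration only. */
#define FH_MASK_ENABLE 0x80000000u
#define FH_MASK_INDEX  0x0000ffffu

/*
 * The host device is enabled, but the guest must first see it as
 * disabled, so the enable bit is cleared from the passthrough handle.
 */
static uint32_t guest_initial_fh(uint32_t host_fh)
{
    return host_fh & ~FH_MASK_ENABLE;
}

/* Index portion of a function handle, used as the zpci_table key. */
static uint32_t fh_index(uint32_t fh)
{
    return fh & FH_MASK_INDEX;
}
```

If the index taken from the host handle differs from the one QEMU generated, the
device is re-keyed in the table under the host's index, provided no other device
already uses it.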

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 01b58ebc70..451bd32d92 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -971,12 +971,57 @@ static void s390_pci_update_subordinate(PCIDevice *dev, uint32_t nr)
 }
 }
 
+static int s390_pci_interp_plug(S390pciState *s, S390PCIBusDevice *pbdev)
+{
+uint32_t idx;
+int rc;
+
+rc = s390_pci_probe_interp(pbdev);
+if (rc) {
+return rc;
+}
+
+rc = s390_pci_update_passthrough_fh(pbdev);
+if (rc) {
+return rc;
+}
+
+/*
+ * The host device is in an enabled state, but the device must
+ * begin as disabled for the guest so mask off the enable bit
+ * from the passthrough handle.
+ */
+pbdev->fh &= ~FH_MASK_ENABLE;
+
+/* Next, see if the idx is already in-use */
+idx = pbdev->fh & FH_MASK_INDEX;
+if (pbdev->idx != idx) {
+if (s390_pci_find_dev_by_idx(s, idx)) {
+return -EINVAL;
+}
+/*
+ * Update the idx entry with the passed through idx
+ * If the relinquished idx is lower than next_idx, use it
+ * to replace next_idx
+ */
+g_hash_table_remove(s->zpci_table, &pbdev->idx);
+if (idx < s->next_idx) {
+s->next_idx = idx;
+}
+pbdev->idx = idx;
+g_hash_table_insert(s->zpci_table, &pbdev->idx, pbdev);
+}
+
+return 0;
+}
+
 static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
   Error **errp)
 {
 S390pciState *s = S390_PCI_HOST_BRIDGE(hotplug_dev);
 PCIDevice *pdev = NULL;
 S390PCIBusDevice *pbdev = NULL;
+int rc;
 
 if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_BRIDGE)) {
 PCIBridge *pb = PCI_BRIDGE(dev);
@@ -1022,12 +1067,33 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 set_pbdev_info(pbdev);
 
 if (object_dynamic_cast(OBJECT(dev), "vfio-pci")) {
-pbdev->fh |= FH_SHM_VFIO;
+/*
+ * By default, interpretation is always requested; if the available
+ * facilities indicate it is not available, fallback to the
+ * intercept model.
+ */
+if (pbdev->interp && !s390_has_feat(S390_FEAT_ZPCI_INTERP)) {
+DPRINTF("zPCI interpretation facilities missing.\n");
+pbdev->interp = false;
+}
+if (pbdev->interp) {
+rc = s390_pci_interp_plug(s, pbdev);
+if (rc) {
+error_setg(errp, "zpci interp plug failed: %d", rc);
+return;
+}
+}
 pbdev->iommu->dma_limit = s390_pci_start_dma_count(s, pbdev);
 /* Fill in CLP information passed via the vfio region */
 s390_pci_get_clp_info(pbdev);
+if (!pbdev->interp) {
+/* Do vfio passthrough but intercept for I/O */
+pbdev->fh |= FH_SHM_VFIO;
+}
 } else {
 pbdev->fh |= FH_SHM_EMUL;
+/* Always intercept emulated devices */
+pbdev->interp = false;
 }
 
 if (s390_pci_msix_init(pbdev)) {
@@ -1360,6 +1426,7 @@ static Property s390_pci_device_properties[] = {
 DEFINE_PROP_UINT16("uid", S390PCIBusDevice, uid, UID_UNDEFINED),
 DEFINE_PROP_S390_PCI_FID("fid", S390PCIBusDevice, fid),
 DEFINE_PROP_STRING("target", S390PCIBusDevice, target),
+DEFINE_PROP_BOOL("interp", S390PCIBusDevice, interp, true),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 0cef7fbace..ba4017474e 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -18,6 +18,7 @@
 #include "sysemu/hw_accel.h"
 #include "hw/s390x/s390-pci-inst.h"
 #include "hw/s390x/s390-pci-bus.h"
+#include "hw/s390x/s390-pci-vfio.h"
 #include "hw/s390x/tod.h"
 
 #ifndef DEBUG_S390PCI_INST
@@ -156,6 +157,47 @@ out:
 return rc;
 }
 
+static int clp_enable_interp(S390PCIBusDevice *pbdev)
+{
+int rc;
+
+rc = s390_pci_set_interp(pbdev, true);
+if (rc) {
+DPRINTF("Failed to enable interpretation\n");
+return rc;
+}
+rc = s390_pci_update_passthrough_fh(pbdev);
+if (rc) {
+DPRINTF("Failed to update passthrough fh\n");
+

[PATCH 06/12] target/s390x: add zpci-interp to cpu models

The zpci-interp feature is used to specify whether zPCI interpretation is
to be used for this guest.

Signed-off-by: Matthew Rosato 
---
 target/s390x/cpu_features_def.h.inc | 1 +
 target/s390x/gen-features.c | 2 ++
 target/s390x/kvm/kvm.c  | 1 +
 3 files changed, 4 insertions(+)

diff --git a/target/s390x/cpu_features_def.h.inc b/target/s390x/cpu_features_def.h.inc
index e86662bb3b..4ade3182aa 100644
--- a/target/s390x/cpu_features_def.h.inc
+++ b/target/s390x/cpu_features_def.h.inc
@@ -146,6 +146,7 @@ DEF_FEAT(SIE_CEI, "cei", SCLP_CPU, 43, "SIE: Conditional-external-interception f
 DEF_FEAT(DAT_ENH_2, "dateh2", MISC, 0, "DAT-enhancement facility 2")
 DEF_FEAT(CMM, "cmm", MISC, 0, "Collaborative-memory-management facility")
 DEF_FEAT(AP, "ap", MISC, 0, "AP instructions installed")
+DEF_FEAT(ZPCI_INTERP, "zpci-interp", MISC, 0, "zPCI interpretation")
 
 /* Features exposed via the PLO instruction. */
 DEF_FEAT(PLO_CL, "plo-cl", PLO, 0, "PLO Compare and load (32 bit in general registers)")
diff --git a/target/s390x/gen-features.c b/target/s390x/gen-features.c
index 7cb1a6ec10..7005d22415 100644
--- a/target/s390x/gen-features.c
+++ b/target/s390x/gen-features.c
@@ -554,6 +554,7 @@ static uint16_t full_GEN14_GA1[] = {
 S390_FEAT_HPMA2,
 S390_FEAT_SIE_KSS,
 S390_FEAT_GROUP_MULTIPLE_EPOCH_PTFF,
+S390_FEAT_ZPCI_INTERP,
 };
 
 #define full_GEN14_GA2 EmptyFeat
@@ -650,6 +651,7 @@ static uint16_t default_GEN14_GA1[] = {
 S390_FEAT_GROUP_MSA_EXT_8,
 S390_FEAT_MULTIPLE_EPOCH,
 S390_FEAT_GROUP_MULTIPLE_EPOCH_PTFF,
+S390_FEAT_ZPCI_INTERP,
 };
 
 #define default_GEN14_GA2 EmptyFeat
diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c
index 5b1fdb55c4..b13d78f988 100644
--- a/target/s390x/kvm/kvm.c
+++ b/target/s390x/kvm/kvm.c
@@ -2290,6 +2290,7 @@ static int kvm_to_feat[][2] = {
 { KVM_S390_VM_CPU_FEAT_PFMFI, S390_FEAT_SIE_PFMFI},
 { KVM_S390_VM_CPU_FEAT_SIGPIF, S390_FEAT_SIE_SIGPIF},
 { KVM_S390_VM_CPU_FEAT_KSS, S390_FEAT_SIE_KSS},
+{ KVM_S390_VM_CPU_FEAT_ZPCI_INTERP, S390_FEAT_ZPCI_INTERP },
 };
 
 static int query_cpu_feat(S390FeatBitmap features)
-- 
2.27.0

[PATCH 04/12] Update linux headers

This is a placeholder that pulls in 5.16-rc4 + unmerged kernel changes
required by this item.  A proper header sync can be done once the
associated kernel code merges.

Signed-off-by: Matthew Rosato 
---
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h | 121 +-
 include/standard-headers/linux/ethtool.h  |  31 +
 include/standard-headers/linux/fuse.h |  15 ++-
 include/standard-headers/linux/pci_regs.h |   6 +
 include/standard-headers/linux/virtio_gpu.h   |  18 ++-
 include/standard-headers/linux/virtio_ids.h   |  24 
 include/standard-headers/linux/virtio_mem.h   |   9 +-
 include/standard-headers/linux/virtio_vsock.h |   3 +-
 linux-headers/asm-arm64/unistd.h  |   1 +
 linux-headers/asm-generic/unistd.h|  22 +++-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-s390/kvm.h  |   1 +
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |   5 +
 linux-headers/asm-x86/unistd_32.h |   3 +
 linux-headers/asm-x86/unistd_64.h |   3 +
 linux-headers/asm-x86/unistd_x32.h|   3 +
 linux-headers/linux/kvm.h |  41 +-
 linux-headers/linux/vfio.h|  22 
 linux-headers/linux/vfio_zdev.h   |  51 
 26 files changed, 370 insertions(+), 24 deletions(-)

diff --git a/include/standard-headers/asm-x86/kvm_para.h b/include/standard-headers/asm-x86/kvm_para.h
index 204cfb8640..f0235e58a1 100644
--- a/include/standard-headers/asm-x86/kvm_para.h
+++ b/include/standard-headers/asm-x86/kvm_para.h
@@ -8,6 +8,7 @@
  * should be used to determine that a VM is running under KVM.
  */
 #define KVM_CPUID_SIGNATURE    0x40000000
+#define KVM_SIGNATURE "KVMKVMKVM\0\0\0"
 
 /* This CPUID returns two feature bitmaps in eax, edx. Before enabling
  * a particular paravirtualization, the appropriate feature bit should
diff --git a/include/standard-headers/drm/drm_fourcc.h b/include/standard-headers/drm/drm_fourcc.h
index 352b51fd0a..2c025cb4fe 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -103,6 +103,12 @@ extern "C" {
 /* 8 bpp Red */
 #define DRM_FORMAT_R8  fourcc_code('R', '8', ' ', ' ') /* [7:0] R */
 
+/* 10 bpp Red */
+#define DRM_FORMAT_R10 fourcc_code('R', '1', '0', ' ') /* [15:0] x:R 6:10 little endian */
+
+/* 12 bpp Red */
+#define DRM_FORMAT_R12 fourcc_code('R', '1', '2', ' ') /* [15:0] x:R 4:12 little endian */
+
 /* 16 bpp Red */
 #define DRM_FORMAT_R16 fourcc_code('R', '1', '6', ' ') /* [15:0] R little endian */
 
@@ -372,6 +378,12 @@ extern "C" {
 
 #define DRM_FORMAT_RESERVED  ((1ULL << 56) - 1)
 
+#define fourcc_mod_get_vendor(modifier) \
+   (((modifier) >> 56) & 0xff)
+
+#define fourcc_mod_is_vendor(modifier, vendor) \
+   (fourcc_mod_get_vendor(modifier) == DRM_FORMAT_MOD_VENDOR_## vendor)
+
 #define fourcc_mod_code(vendor, val) \
     ((((uint64_t)DRM_FORMAT_MOD_VENDOR_## vendor) << 56) | ((val) & 0x00ffffffffffffffULL))
 
@@ -899,9 +911,9 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t modifier)
 
 /*
  * The top 4 bits (out of the 56 bits alloted for specifying vendor specific
- * modifiers) denote the category for modifiers. Currently we have only two
- * categories of modifiers ie AFBC and MISC. We can have a maximum of sixteen
- * different categories.
+ * modifiers) denote the category for modifiers. Currently we have three
+ * categories of modifiers ie AFBC, MISC and AFRC. We can have a maximum of
+ * sixteen different categories.
  */
 #define DRM_FORMAT_MOD_ARM_CODE(__type, __val) \
     fourcc_mod_code(ARM, ((uint64_t)(__type) << 52) | ((__val) & 0x000fffffffffffffULL))
@@ -1016,6 +1028,109 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t modifier)
  */
 #define AFBC_FORMAT_MOD_USM     (1ULL << 12)
 
+/*
+ * Arm Fixed-Rate Compression (AFRC) modifiers
+ *
+ * AFRC is a proprietary fixed rate image compression protocol and format,
+ * designed to provide guaranteed bandwidth and memory footprint
+ * reductions in graphics and media use-cases.
+ *
+ * AFRC buffers consist of one or more planes, with the same components
+ * and meaning as an uncompressed buffer using the same pixel format.
+ *
+ * Within each plane, the pixel/luma/chroma values are grouped into
+ * "coding unit" blocks which are individually compressed to a
+ * fixed size (in bytes). All coding units within a given plane of a buffer
+ * store the same number of values, and have the same compressed size.
+ *
+ * The coding unit size is

[PATCH 11/12] s390x/pci: use dtsm provided from vfio capabilities for interpreted devices

When using the IOAT assist via interpretation, we should advertise what
the host driver supports, not QEMU.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-vfio.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index 6fc03a858a..c9269683f5 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -336,7 +336,11 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
 resgrp->i = cap->noi;
 resgrp->maxstbl = cap->maxstbl;
 resgrp->version = cap->version;
-resgrp->dtsm = ZPCI_DTSM;
+if (hdr->version >= 2 && pbdev->interp) {
+resgrp->dtsm = cap->dtsm;
+} else {
+resgrp->dtsm = ZPCI_DTSM;
+}
 }
 }
 
-- 
2.27.0
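
For illustration only, the selection logic in this patch can be sketched as a small standalone function. The struct and function names below are hypothetical stand-ins rather than QEMU's own types; only the ZPCI_DTSM value and the "version >= 2 and interpreted" condition come from the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value taken from this series (include/hw/s390x/s390-pci-bus.h). */
#define ZPCI_DTSM 0x40

/* Hypothetical, simplified stand-in for the vfio group capability. */
struct group_cap_sketch {
    uint16_t hdr_version; /* capability header version */
    uint8_t dtsm;         /* designation types supported by the host driver */
};

/*
 * Pick the DTSM to advertise to the guest: the host driver's mask when the
 * capability is new enough (version >= 2) and the device is interpreted,
 * otherwise the mask that QEMU itself supports.
 */
static uint8_t select_dtsm(const struct group_cap_sketch *cap, bool interp)
{
    if (cap->hdr_version >= 2 && interp) {
        return cap->dtsm;
    }
    return ZPCI_DTSM;
}
```

The fallback branch keeps intercepted devices, and hosts with an older capability version, on the value QEMU advertised before this patch.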

[PATCH 05/12] virtio-gpu: do not byteswap padding

From: Paolo Bonzini 

In Linux 5.16, the padding of struct virtio_gpu_ctrl_hdr has become a
single-byte field followed by a uint8_t[3] array of padding bytes,
and virtio_gpu_ctrl_hdr_bswap does not compile anymore.

Signed-off-by: Paolo Bonzini 
Reviewed-by: Alex Bennée 
Reviewed-by: Michael S. Tsirkin 
Reviewed-by: Philippe Mathieu-Daudé 
Acked-by: Cornelia Huck 
---
 include/hw/virtio/virtio-gpu-bswap.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/hw/virtio/virtio-gpu-bswap.h b/include/hw/virtio/virtio-gpu-bswap.h
index e2bee8f595..5faac0d8d5 100644
--- a/include/hw/virtio/virtio-gpu-bswap.h
+++ b/include/hw/virtio/virtio-gpu-bswap.h
@@ -24,7 +24,6 @@ virtio_gpu_ctrl_hdr_bswap(struct virtio_gpu_ctrl_hdr *hdr)
 le32_to_cpus(&hdr->flags);
 le64_to_cpus(&hdr->fence_id);
 le32_to_cpus(&hdr->ctx_id);
-le32_to_cpus(&hdr->padding);
 }
 
 static inline void
-- 
2.27.0
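
A minimal sketch of why the padding swap can simply be dropped: only multi-byte fields have a byte order, so a header whose tail is made of single bytes needs no conversion there. The struct below is a hypothetical layout modelled on the commit message (the name ring_idx for the new single-byte field is an assumption), and the swap helpers are written out by hand instead of using QEMU's le32_to_cpus()/le64_to_cpus().

```c
#include <stdint.h>

/* Hypothetical layout modelled on the commit message, not the real header. */
struct ctrl_hdr_sketch {
    uint32_t type;
    uint32_t flags;
    uint64_t fence_id;
    uint32_t ctx_id;
    uint8_t  ring_idx;    /* assumed name for the new single-byte field */
    uint8_t  padding[3];  /* single bytes: no byte order to fix */
};

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Reverse the byte order of a 64-bit value. */
static uint64_t swap64(uint64_t v)
{
    return ((uint64_t)swap32((uint32_t)v) << 32) | swap32((uint32_t)(v >> 32));
}

/*
 * Swap every multi-byte field unconditionally. QEMU's real helpers only
 * swap on big-endian hosts; this sketch always swaps to stay testable.
 */
static void ctrl_hdr_swap(struct ctrl_hdr_sketch *hdr)
{
    hdr->type = swap32(hdr->type);
    hdr->flags = swap32(hdr->flags);
    hdr->fence_id = swap64(hdr->fence_id);
    hdr->ctx_id = swap32(hdr->ctx_id);
    /* ring_idx and padding[] are left untouched on purpose. */
}
```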

[PATCH 09/12] s390x/pci: enable adapter event notification for interpreted devices

Use the associated vfio feature ioctl to enable adapter event notification
and forwarding for devices when requested.  This feature will be set up
with or without firmware assist based upon the 'intassist' setting.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c  | 24 +++--
 hw/s390x/s390-pci-inst.c | 54 +++-
 hw/s390x/s390-pci-vfio.c | 88 
 include/hw/s390x/s390-pci-bus.h  |  1 +
 include/hw/s390x/s390-pci-vfio.h | 20 
 5 files changed, 182 insertions(+), 5 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 503326210a..1ae8792a26 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -189,7 +189,10 @@ void s390_pci_sclp_deconfigure(SCCB *sccb)
 rc = SCLP_RC_NO_ACTION_REQUIRED;
 break;
 default:
-if (pbdev->summary_ind) {
+if (pbdev->interp) {
+/* Interpreted devices were using interrupt forwarding */
+s390_pci_set_aif(pbdev, NULL, false, pbdev->intassist);
+} else if (pbdev->summary_ind) {
 pci_dereg_irqs(pbdev);
 }
 if (pbdev->iommu->enabled) {
@@ -981,6 +984,11 @@ static int s390_pci_interp_plug(S390pciState *s, S390PCIBusDevice *pbdev)
 return rc;
 }
 
+rc = s390_pci_probe_aif(pbdev);
+if (rc) {
+return rc;
+}
+
 rc = s390_pci_update_passthrough_fh(pbdev);
 if (rc) {
 return rc;
@@ -1075,6 +1083,7 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 if (pbdev->interp && !s390_has_feat(S390_FEAT_ZPCI_INTERP)) {
 DPRINTF("zPCI interpretation facilities missing.\n");
 pbdev->interp = false;
+pbdev->intassist = false;
 }
 if (pbdev->interp) {
 rc = s390_pci_interp_plug(s, pbdev);
@@ -1089,11 +1098,13 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 if (!pbdev->interp) {
 /* Do vfio passthrough but intercept for I/O */
 pbdev->fh |= FH_SHM_VFIO;
+pbdev->intassist = false;
 }
 } else {
 pbdev->fh |= FH_SHM_EMUL;
 /* Always intercept emulated devices */
 pbdev->interp = false;
+pbdev->intassist = false;
 }
 
 if (s390_pci_msix_init(pbdev) && !pbdev->interp) {
@@ -1243,7 +1254,10 @@ static void s390_pcihost_reset(DeviceState *dev)
 /* Process all pending unplug requests */
 QTAILQ_FOREACH_SAFE(pbdev, &s->zpci_devs, link, next) {
 if (pbdev->unplug_requested) {
-if (pbdev->summary_ind) {
+if (pbdev->interp) {
+/* Interpreted devices were using interrupt forwarding */
+s390_pci_set_aif(pbdev, NULL, false, pbdev->intassist);
+} else if (pbdev->summary_ind) {
 pci_dereg_irqs(pbdev);
 }
 if (pbdev->iommu->enabled) {
@@ -1381,7 +1395,10 @@ static void s390_pci_device_reset(DeviceState *dev)
 break;
 }
 
-if (pbdev->summary_ind) {
+if (pbdev->interp) {
+/* Interpreted devices were using interrupt forwarding */
+s390_pci_set_aif(pbdev, NULL, false, pbdev->intassist);
+} else if (pbdev->summary_ind) {
 pci_dereg_irqs(pbdev);
 }
 if (pbdev->iommu->enabled) {
@@ -1427,6 +1444,7 @@ static Property s390_pci_device_properties[] = {
 DEFINE_PROP_S390_PCI_FID("fid", S390PCIBusDevice, fid),
 DEFINE_PROP_STRING("target", S390PCIBusDevice, target),
 DEFINE_PROP_BOOL("interp", S390PCIBusDevice, interp, true),
+DEFINE_PROP_BOOL("intassist", S390PCIBusDevice, intassist, true),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index ba4017474e..02093e82f9 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -,6 +,46 @@ static void fmb_update(void *opaque)
 timer_mod(pbdev->fmb_timer, t + DEFAULT_MUI);
 }
 
+static int mpcifc_reg_int_interp(S390PCIBusDevice *pbdev, ZpciFib *fib)
+{
+int rc;
+
+/* Interpreted devices must also use interrupt forwarding */
+rc = s390_pci_get_aif(pbdev, false, pbdev->intassist);
+if (rc) {
+DPRINTF("Bad interrupt forwarding state\n");
+return rc;
+}
+
+rc = s390_pci_set_aif(pbdev, fib, true, pbdev->intassist);
+if (rc) {
+DPRINTF("Failed to enable interrupt forwarding\n");
+return rc;
+}
+
+return 0;
+}
+
+static int mpcifc_dereg_int_interp(S390PCIBusDevice *pbdev, ZpciFib *fib)
+{
+int rc;
+
+/* Interpreted devices were using interrupt forwarding */
+rc = s390_pci_get_aif(pbdev, true, pbdev->intassist);
+if (rc) {
+DPRINTF("Bad interrupt forwarding state\n");
+return rc;
+}
+
+rc = s390_pci_set_aif(pbdev, fib, false,

[PATCH 10/12] s390x/pci: use I/O Address Translation assist when interpreting

Allow the underlying kvm host to handle the Refresh PCI Translation
instruction intercepts.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c  |  6 ++--
 hw/s390x/s390-pci-inst.c | 51 ++--
 hw/s390x/s390-pci-vfio.c | 33 +
 include/hw/s390x/s390-pci-inst.h |  2 +-
 include/hw/s390x/s390-pci-vfio.h | 10 +++
 5 files changed, 95 insertions(+), 7 deletions(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 1ae8792a26..ab442f17fb 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -196,7 +196,7 @@ void s390_pci_sclp_deconfigure(SCCB *sccb)
 pci_dereg_irqs(pbdev);
 }
 if (pbdev->iommu->enabled) {
-pci_dereg_ioat(pbdev->iommu);
+pci_dereg_ioat(pbdev);
 }
 pbdev->state = ZPCI_FS_STANDBY;
 rc = SCLP_RC_NORMAL_COMPLETION;
@@ -1261,7 +1261,7 @@ static void s390_pcihost_reset(DeviceState *dev)
 pci_dereg_irqs(pbdev);
 }
 if (pbdev->iommu->enabled) {
-pci_dereg_ioat(pbdev->iommu);
+pci_dereg_ioat(pbdev);
 }
 pbdev->state = ZPCI_FS_STANDBY;
 s390_pci_perform_unplug(pbdev);
@@ -1402,7 +1402,7 @@ static void s390_pci_device_reset(DeviceState *dev)
 pci_dereg_irqs(pbdev);
 }
 if (pbdev->iommu->enabled) {
-pci_dereg_ioat(pbdev->iommu);
+pci_dereg_ioat(pbdev);
 }
 
 fmb_timer_free(pbdev);
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 02093e82f9..598e5a5d52 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -978,6 +978,24 @@ int pci_dereg_irqs(S390PCIBusDevice *pbdev)
 return 0;
 }
 
+static int reg_ioat_interp(S390PCIBusDevice *pbdev, uint64_t iota)
+{
+int rc;
+
+rc = s390_pci_probe_ioat(pbdev);
+if (rc) {
+return rc;
+}
+
+rc = s390_pci_set_ioat(pbdev, iota);
+if (rc) {
+return rc;
+}
+
+pbdev->iommu->enabled = true;
+return 0;
+}
+
 static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
 uintptr_t ra)
 {
@@ -995,6 +1013,16 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
 return -EINVAL;
 }
 
+/* If this is an interpreted device, we must use the IOAT assist */
+if (pbdev->interp) {
+if (reg_ioat_interp(pbdev, g_iota)) {
+error_report("failure starting ioat assist");
+s390_program_interrupt(env, PGM_OPERAND, ra);
+return -EINVAL;
+}
+return 0;
+}
+
 /* currently we only support designation type 1 with translation */
 if (!(dt == ZPCI_IOTA_RTTO && t)) {
 error_report("unsupported ioat dt %d t %d", dt, t);
@@ -1011,8 +1039,25 @@ static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
 return 0;
 }
 
-void pci_dereg_ioat(S390PCIIOMMU *iommu)
+static void dereg_ioat_interp(S390PCIBusDevice *pbdev)
 {
+if (s390_pci_probe_ioat(pbdev) != 0) {
+return;
+}
+
+s390_pci_set_ioat(pbdev, 0);
+pbdev->iommu->enabled = false;
+}
+
+void pci_dereg_ioat(S390PCIBusDevice *pbdev)
+{
+S390PCIIOMMU *iommu = pbdev->iommu;
+
+if (pbdev->interp) {
+dereg_ioat_interp(pbdev);
+return;
+}
+
 s390_pci_iommu_disable(iommu);
 iommu->pba = 0;
 iommu->pal = 0;
@@ -1251,7 +1296,7 @@ int mpcifc_service_call(S390CPU *cpu, uint8_t r1, uint64_t fiba, uint8_t ar,
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_SEQUENCE);
 } else {
-pci_dereg_ioat(pbdev->iommu);
+pci_dereg_ioat(pbdev);
 }
 break;
 case ZPCI_MOD_FC_REREG_IOAT:
@@ -1262,7 +1307,7 @@ int mpcifc_service_call(S390CPU *cpu, uint8_t r1, uint64_t fiba, uint8_t ar,
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_SEQUENCE);
 } else {
-pci_dereg_ioat(pbdev->iommu);
+pci_dereg_ioat(pbdev);
 if (reg_ioat(env, pbdev, fib, ra)) {
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_INSUF_RES);
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index 6f9271df87..6fc03a858a 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -240,6 +240,39 @@ int s390_pci_get_aif(S390PCIBusDevice *pbdev, bool enable, bool assist)
 return rc;
 }
 
+int s390_pci_probe_ioat(S390PCIBusDevice *pbdev)
+{
+VFIOPCIDevice *vdev = container_of(pbdev->pdev, VFIOPCIDevice, pdev);
+struct vfio_device_feature feat = {
+.argsz = sizeof(struct vfio_device_feature),
+.flags = VFIO_DEVICE_FEATURE_PROBE + VFIO_DEVICE_FEATURE_ZPCI_IOAT
+};
+
+assert(vdev);
+
+return ioctl(vdev->vbasedev.fd, VFIO_DEVICE_FEATURE, &feat);
+}
+
+int

[PATCH 02/12] s390x/pci: don't use hard-coded dma range in reg_ioat

Instead use the values from clp info, they will either be the hard-coded
values or what came from the host driver via vfio.

Fixes: 9670ee752727 ("s390x/pci: use a PCI Function structure")
Reviewed-by: Eric Farman 
Reviewed-by: Pierre Morel 
Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-inst.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 1c8ad91175..11b7f6bfa1 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -916,9 +916,10 @@ int pci_dereg_irqs(S390PCIBusDevice *pbdev)
 return 0;
 }
 
-static int reg_ioat(CPUS390XState *env, S390PCIIOMMU *iommu, ZpciFib fib,
+static int reg_ioat(CPUS390XState *env, S390PCIBusDevice *pbdev, ZpciFib fib,
 uintptr_t ra)
 {
+S390PCIIOMMU *iommu = pbdev->iommu;
 uint64_t pba = ldq_p(&fib.pba);
 uint64_t pal = ldq_p(&fib.pal);
 uint64_t g_iota = ldq_p(&fib.iota);
@@ -927,7 +928,7 @@ static int reg_ioat(CPUS390XState *env, S390PCIIOMMU *iommu, ZpciFib fib,
 
 pba &= ~0xfff;
 pal |= 0xfff;
-if (pba > pal || pba < ZPCI_SDMA_ADDR || pal > ZPCI_EDMA_ADDR) {
+if (pba > pal || pba < pbdev->zpci_fn.sdma || pal > pbdev->zpci_fn.edma) {
 s390_program_interrupt(env, PGM_OPERAND, ra);
 return -EINVAL;
 }
@@ -1125,7 +1126,7 @@ int mpcifc_service_call(S390CPU *cpu, uint8_t r1, uint64_t fiba, uint8_t ar,
 } else if (pbdev->iommu->enabled) {
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_SEQUENCE);
-} else if (reg_ioat(env, pbdev->iommu, fib, ra)) {
+} else if (reg_ioat(env, pbdev, fib, ra)) {
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_INSUF_RES);
 }
@@ -1150,7 +1151,7 @@ int mpcifc_service_call(S390CPU *cpu, uint8_t r1, uint64_t fiba, uint8_t ar,
 s390_set_status_code(env, r1, ZPCI_MOD_ST_SEQUENCE);
 } else {
 pci_dereg_ioat(pbdev->iommu);
-if (reg_ioat(env, pbdev->iommu, fib, ra)) {
+if (reg_ioat(env, pbdev, fib, ra)) {
 cc = ZPCI_PCI_LS_ERR;
 s390_set_status_code(env, r1, ZPCI_MOD_ST_INSUF_RES);
 }
-- 
2.27.0
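
The range check touched by this patch can be sketched on its own as follows. This is a simplified illustration, not the QEMU function: the struct is a hypothetical stand-in for the sdma/edma fields of the PCI function structure, and the sketch returns a boolean instead of raising a program interrupt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-function DMA limits from the CLP info. */
struct dma_limits_sketch {
    uint64_t sdma; /* first usable DMA address */
    uint64_t edma; /* last usable DMA address */
};

/*
 * Return true if the guest-supplied PCI base address (pba) and PCI address
 * limit (pal) describe a range inside the limits reported for this function.
 */
static bool ioat_range_ok(const struct dma_limits_sketch *fn,
                          uint64_t pba, uint64_t pal)
{
    pba &= ~(uint64_t)0xfff; /* round the base down to a 4K boundary */
    pal |= 0xfff;            /* extend the limit to the end of its 4K page */

    return !(pba > pal || pba < fn->sdma || pal > fn->edma);
}
```

Because the limits are read per device, a passthrough device whose host driver reports a different aperture is checked against that aperture and not against the hard-coded constants.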

[PATCH 03/12] s390x/pci: add supported DT information to clp response

The DTSM is a mask that specifies which I/O Address Translation designation
types are supported.  Today QEMU only supports DT=1.

Reviewed-by: Pierre Morel 
Reviewed-by: Eric Farman 
Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c | 1 +
 hw/s390x/s390-pci-inst.c| 1 +
 hw/s390x/s390-pci-vfio.c| 1 +
 include/hw/s390x/s390-pci-bus.h | 1 +
 include/hw/s390x/s390-pci-clp.h | 3 ++-
 5 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 1b51a72838..01b58ebc70 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -782,6 +782,7 @@ static void s390_pci_init_default_group(void)
 resgrp->i = 128;
 resgrp->maxstbl = 128;
 resgrp->version = 0;
+resgrp->dtsm = ZPCI_DTSM;
 }
 
 static void set_pbdev_info(S390PCIBusDevice *pbdev)
diff --git a/hw/s390x/s390-pci-inst.c b/hw/s390x/s390-pci-inst.c
index 11b7f6bfa1..0cef7fbace 100644
--- a/hw/s390x/s390-pci-inst.c
+++ b/hw/s390x/s390-pci-inst.c
@@ -329,6 +329,7 @@ int clp_service_call(S390CPU *cpu, uint8_t r2, uintptr_t ra)
 stw_p(&resgrp->i, group->zpci_group.i);
 stw_p(&resgrp->maxstbl, group->zpci_group.maxstbl);
 resgrp->version = group->zpci_group.version;
+resgrp->dtsm = group->zpci_group.dtsm;
 stw_p(&resgrp->hdr.rsp, CLP_RC_OK);
 break;
 }
diff --git a/hw/s390x/s390-pci-vfio.c b/hw/s390x/s390-pci-vfio.c
index 2a153fa8c9..6f80a47e29 100644
--- a/hw/s390x/s390-pci-vfio.c
+++ b/hw/s390x/s390-pci-vfio.c
@@ -160,6 +160,7 @@ static void s390_pci_read_group(S390PCIBusDevice *pbdev,
 resgrp->i = cap->noi;
 resgrp->maxstbl = cap->maxstbl;
 resgrp->version = cap->version;
+resgrp->dtsm = ZPCI_DTSM;
 }
 }
 
diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
index 2727e7bdef..da3cde2bb4 100644
--- a/include/hw/s390x/s390-pci-bus.h
+++ b/include/hw/s390x/s390-pci-bus.h
@@ -37,6 +37,7 @@
 #define ZPCI_MAX_UID 0xffff
 #define UID_UNDEFINED 0
 #define UID_CHECKING_ENABLED 0x01
+#define ZPCI_DTSM 0x40
 
 OBJECT_DECLARE_SIMPLE_TYPE(S390pciState, S390_PCI_HOST_BRIDGE)
 OBJECT_DECLARE_SIMPLE_TYPE(S390PCIBus, S390_PCI_BUS)
diff --git a/include/hw/s390x/s390-pci-clp.h b/include/hw/s390x/s390-pci-clp.h
index 96b8e3f133..cc8c8662b8 100644
--- a/include/hw/s390x/s390-pci-clp.h
+++ b/include/hw/s390x/s390-pci-clp.h
@@ -163,7 +163,8 @@ typedef struct ClpRspQueryPciGrp {
 uint8_t fr;
 uint16_t maxstbl;
 uint16_t mui;
-uint64_t reserved3;
+uint8_t dtsm;
+uint8_t reserved3[7];
 uint64_t dasm; /* dma address space mask */
 uint64_t msia; /* MSI address */
 uint64_t reserved4;
-- 
2.27.0
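
As a side note on the struct change above, a small sketch of why carving dtsm out of reserved3 leaves the rest of the response where it was. The two structs are hypothetical excerpts containing only the affected fields, not the full ClpRspQueryPciGrp.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical excerpt of the response layout before this patch. */
struct grp_tail_old_sketch {
    uint64_t reserved3;
    uint64_t dasm; /* dma address space mask */
};

/* The same excerpt after this patch: one reserved byte becomes dtsm. */
struct grp_tail_new_sketch {
    uint8_t dtsm;
    uint8_t reserved3[7];
    uint64_t dasm; /* dma address space mask */
};

/* Both checks are evaluated at compile time (C11). */
_Static_assert(sizeof(struct grp_tail_old_sketch) ==
               sizeof(struct grp_tail_new_sketch),
               "carving dtsm out of reserved3 must not change the size");
_Static_assert(offsetof(struct grp_tail_old_sketch, dasm) ==
               offsetof(struct grp_tail_new_sketch, dasm),
               "dasm must stay at the same offset");
```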

[PATCH 08/12] s390x/pci: don't fence interpreted devices without MSI-X

Lack of MSI-X support is not an issue for interpreted passthrough
devices, so let's let these in.  This will allow, for example, ISM
devices to be passed through -- but only when interpretation is
available and being used.

Signed-off-by: Matthew Rosato 
---
 hw/s390x/s390-pci-bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/s390x/s390-pci-bus.c b/hw/s390x/s390-pci-bus.c
index 451bd32d92..503326210a 100644
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -1096,7 +1096,7 @@ static void s390_pcihost_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
 pbdev->interp = false;
 }
 
-if (s390_pci_msix_init(pbdev)) {
+if (s390_pci_msix_init(pbdev) && !pbdev->interp) {
 error_setg(errp, "MSI-X support is mandatory "
"in the S390 architecture");
 return;
-- 
2.27.0

[PATCH 01/12] s390x/pci: use a reserved ID for the default PCI group

The current default PCI group being used can technically collide with a
real group ID passed from a hostdev.  Let's instead use a group ID that
comes from a special pool (0xF0-0xFF) that is architected to be reserved
for simulated devices.

Fixes: 28dc86a072 ("s390x/pci: use a PCI Group structure")
Reviewed-by: Eric Farman 
Reviewed-by: Pierre Morel 
Signed-off-by: Matthew Rosato 
---
 include/hw/s390x/s390-pci-bus.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/hw/s390x/s390-pci-bus.h b/include/hw/s390x/s390-pci-bus.h
index aa891c178d..2727e7bdef 100644
--- a/include/hw/s390x/s390-pci-bus.h
+++ b/include/hw/s390x/s390-pci-bus.h
@@ -313,7 +313,7 @@ typedef struct ZpciFmb {
 } ZpciFmb;
 QEMU_BUILD_BUG_MSG(offsetof(ZpciFmb, fmt0) != 48, "padding in ZpciFmb");
 
-#define ZPCI_DEFAULT_FN_GRP 0x20
+#define ZPCI_DEFAULT_FN_GRP 0xFF
 typedef struct S390PCIGroup {
 ClpRspQueryPciGrp zpci_group;
 int id;
-- 
2.27.0
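
To make the collision argument concrete, a minimal sketch of a range check for the reserved pool. The macro and function names are hypothetical; only the 0xF0-0xFF range and the old and new default values come from the commit message and diff.

```c
#include <stdbool.h>

/* Pool described in the commit message as reserved for simulated devices. */
#define SKETCH_SIM_GRP_FIRST 0xF0
#define SKETCH_SIM_GRP_LAST  0xFF

/* True if a group ID lies in the pool reserved for simulated devices. */
static bool is_simulated_group_id(int id)
{
    return id >= SKETCH_SIM_GRP_FIRST && id <= SKETCH_SIM_GRP_LAST;
}
```

Under this check the old default (0x20) falls outside the pool, so a host device could legitimately report it, while the new default (0xFF) falls inside.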

[PATCH 00/12] s390x/pci: zPCI interpretation support

Note:  The first 3 patches of this series are included as pre-reqs, but
should be pulled via a separate series.  Also, patch 5 is needed to
support the 5.16+ linux header sync; Paolo already wrote it, but it has
not been merged yet, so it is included here as well.

For QEMU, the majority of the work in enabling instruction interpretation
is handled via new VFIO ioctls to SET the appropriate interpretation and
interrupt forwarding modes, and to GET the function handle to use for
interpretive execution.  

This series implements these new ioctls and adds two new, optional zpci
parameters: 'intercept', which requests that interpretation support not
be used, and 'intassist', which determines whether the firmware assist
is used for interrupt delivery or the host is responsible for delivering
all interrupts.
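
As a rough sketch of how these settings interact at plug time (based on the s390_pcihost_plug() hunks in patch 09, where the properties are named "interp" and "intassist"): interpretation is only kept for vfio devices when the CPU feature is present, and the interrupt assist is dropped whenever interpretation is not in use. The struct and function names are hypothetical, and failures while setting up interpretation are ignored here.

```c
#include <stdbool.h>

/* Hypothetical container for the two per-device settings. */
struct zpci_mode_sketch {
    bool interp;    /* use instruction interpretation */
    bool intassist; /* use the firmware assist for interrupt delivery */
};

/*
 * Resolve the effective settings from the requested ones.
 *   is_vfio:         the device is a vfio passthrough device, not emulated
 *   has_interp_feat: the ZPCI_INTERP CPU feature is available
 */
static struct zpci_mode_sketch
resolve_zpci_mode(struct zpci_mode_sketch requested, bool is_vfio,
                  bool has_interp_feat)
{
    struct zpci_mode_sketch eff = requested;

    /* Emulated devices and hosts without the feature always intercept. */
    if (!is_vfio || !has_interp_feat) {
        eff.interp = false;
    }
    /* The interrupt assist is only meaningful for interpreted devices. */
    if (!eff.interp) {
        eff.intassist = false;
    }
    return eff;
}
```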

The ZPCI_INTERP CPU feature is added beginning with the z14 model to
enable this support.

As a consequence of implementing zPCI interpretation, ISM devices now
become eligible for passthrough (but only when zPCI interpretation is
available).

From the perspective of guest configuration, you pass through zPCI devices
in the same manner as before, with interpretation support being used by
default if available in kernel+qemu.

Associated kernel series:
https://lkml.org/lkml/2021/12/7/1179

Matthew Rosato (11):
  s390x/pci: use a reserved ID for the default PCI group
  s390x/pci: don't use hard-coded dma range in reg_ioat
  s390x/pci: add supported DT information to clp response
  Update linux headers
  target/s390x: add zpci-interp to cpu models
  s390x/pci: enable for load/store intepretation
  s390x/pci: don't fence interpreted devices without MSI-X
  s390x/pci: enable adapter event notification for interpreted devices
  s390x/pci: use I/O Address Translation assist when interpreting
  s390x/pci: use dtsm provided from vfio capabilities for interpreted
devices
  s390x/pci: let intercept devices have separate PCI groups

Paolo Bonzini (1):
  virtio-gpu: do not byteswap padding

 hw/s390x/s390-pci-bus.c   | 121 +-
 hw/s390x/s390-pci-inst.c  | 178 +-
 hw/s390x/s390-pci-vfio.c  | 221 +-
 include/hw/s390x/s390-pci-bus.h   |  11 +-
 include/hw/s390x/s390-pci-clp.h   |   3 +-
 include/hw/s390x/s390-pci-inst.h  |   2 +-
 include/hw/s390x/s390-pci-vfio.h  |  45 
 include/hw/virtio/virtio-gpu-bswap.h  |   1 -
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h | 121 +-
 include/standard-headers/linux/ethtool.h  |  31 +++
 include/standard-headers/linux/fuse.h |  15 +-
 include/standard-headers/linux/pci_regs.h |   6 +
 include/standard-headers/linux/virtio_gpu.h   |  18 +-
 include/standard-headers/linux/virtio_ids.h   |  24 ++
 include/standard-headers/linux/virtio_mem.h   |   9 +-
 include/standard-headers/linux/virtio_vsock.h |   3 +-
 linux-headers/asm-arm64/unistd.h  |   1 +
 linux-headers/asm-generic/unistd.h|  22 +-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-s390/kvm.h  |   1 +
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |   5 +
 linux-headers/asm-x86/unistd_32.h |   3 +
 linux-headers/asm-x86/unistd_64.h |   3 +
 linux-headers/asm-x86/unistd_x32.h|   3 +
 linux-headers/linux/kvm.h |  41 +++-
 linux-headers/linux/vfio.h|  22 ++
 linux-headers/linux/vfio_zdev.h   |  51 
 target/s390x/cpu_features_def.h.inc   |   1 +
 target/s390x/gen-features.c   |   2 +
 target/s390x/kvm/kvm.c|   1 +
 37 files changed, 928 insertions(+), 52 deletions(-)

-- 
2.27.0

[PATCH v2 1/2] spice: Update QXLInterface for spice >= 0.15.0

spice updated the spelling (and arguments) of "attache_worker" in
0.15.0. Update QEMU to match, preventing -Wdeprecated-declarations
compilations from reporting build errors.

See also:
https://gitlab.freedesktop.org/spice/spice/-/commit/974692bda1e77af92b71ed43b022439448492cb9

Signed-off-by: John Snow 
Acked-by: Gerd Hoffmann 
---
 include/ui/qemu-spice.h |  6 ++
 hw/display/qxl.c| 14 +-
 ui/spice-display.c  | 11 +++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/ui/qemu-spice.h b/include/ui/qemu-spice.h
index 71ecd6cfd1..21fe195e18 100644
--- a/include/ui/qemu-spice.h
+++ b/include/ui/qemu-spice.h
@@ -40,6 +40,12 @@ int qemu_spice_migrate_info(const char *hostname, int port, int tls_port,
 #define SPICE_NEEDS_SET_MM_TIME 0
 #endif
 
+#if defined(SPICE_SERVER_VERSION) && (SPICE_SERVER_VERSION >= 0x000f00)
+#define SPICE_HAS_ATTACHED_WORKER 1
+#else
+#define SPICE_HAS_ATTACHED_WORKER 0
+#endif
+
 #else  /* CONFIG_SPICE */
 
 #include "qemu/error-report.h"
diff --git a/hw/display/qxl.c b/hw/display/qxl.c
index 29c80b4289..1da6703e44 100644
--- a/hw/display/qxl.c
+++ b/hw/display/qxl.c
@@ -517,13 +517,20 @@ static int qxl_track_command(PCIQXLDevice *qxl, struct QXLCommandExt *ext)
 
 /* spice display interface callbacks */
 
-static void interface_attach_worker(QXLInstance *sin, QXLWorker *qxl_worker)
+static void interface_attached_worker(QXLInstance *sin)
 {
 PCIQXLDevice *qxl = container_of(sin, PCIQXLDevice, ssd.qxl);
 
 trace_qxl_interface_attach_worker(qxl->id);
 }
 
+#if !(SPICE_HAS_ATTACHED_WORKER)
+static void interface_attach_worker(QXLInstance *sin, QXLWorker *qxl_worker)
+{
+interface_attached_worker(sin);
+}
+#endif
+
 static void interface_set_compression_level(QXLInstance *sin, int level)
 {
 PCIQXLDevice *qxl = container_of(sin, PCIQXLDevice, ssd.qxl);
@@ -1131,7 +1138,12 @@ static const QXLInterface qxl_interface = {
 .base.major_version  = SPICE_INTERFACE_QXL_MAJOR,
 .base.minor_version  = SPICE_INTERFACE_QXL_MINOR,
 
+#if SPICE_HAS_ATTACHED_WORKER
+.attached_worker = interface_attached_worker,
+#else
 .attache_worker  = interface_attach_worker,
+#endif
+
 .set_compression_level   = interface_set_compression_level,
 #if SPICE_NEEDS_SET_MM_TIME
 .set_mm_time = interface_set_mm_time,
diff --git a/ui/spice-display.c b/ui/spice-display.c
index f59c69882d..1a60cebb7d 100644
--- a/ui/spice-display.c
+++ b/ui/spice-display.c
@@ -500,10 +500,17 @@ void qemu_spice_display_refresh(SimpleSpiceDisplay *ssd)
 
 /* spice display interface callbacks */
 
+#if SPICE_HAS_ATTACHED_WORKER
+static void interface_attached_worker(QXLInstance *sin)
+{
+/* nothing to do */
+}
+#else
 static void interface_attach_worker(QXLInstance *sin, QXLWorker *qxl_worker)
 {
 /* nothing to do */
 }
+#endif
 
 static void interface_set_compression_level(QXLInstance *sin, int level)
 {
@@ -702,7 +709,11 @@ static const QXLInterface dpy_interface = {
 .base.major_version  = SPICE_INTERFACE_QXL_MAJOR,
 .base.minor_version  = SPICE_INTERFACE_QXL_MINOR,
 
+#if SPICE_HAS_ATTACHED_WORKER
+.attached_worker = interface_attached_worker,
+#else
 .attache_worker  = interface_attach_worker,
+#endif
 .set_compression_level   = interface_set_compression_level,
 #if SPICE_NEEDS_SET_MM_TIME
 .set_mm_time = interface_set_mm_time,
-- 
2.31.1
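
For readers unfamiliar with the version gate used in this patch, a small sketch of the comparison. It assumes SPICE_SERVER_VERSION packs major, minor and micro into one byte each (so 0.15.0 is 0x000f00, matching the constant in the diff); the helper names are hypothetical and not part of spice or QEMU.

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the assumed one-byte-per-part encoding. */
#define SKETCH_SPICE_VERSION(major, minor, micro) \
    (((major) << 16) | ((minor) << 8) | (micro))

/* True when a server of this version provides the attached_worker callback. */
static bool spice_has_attached_worker(unsigned int server_version)
{
    return server_version >= SKETCH_SPICE_VERSION(0, 15, 0);
}
```

The patch performs the same comparison in the preprocessor, since the callback name has to be chosen at compile time.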

[PATCH v2 2/2] ui/clipboard: Don't use g_autoptr just to free a variable

Clang doesn't recognize that the variable is being "used" and will emit
a warning:

  ../ui/clipboard.c:47:34: error: variable 'old' set but not used [-Werror,-Wunused-but-set-variable]
  g_autoptr(QemuClipboardInfo) old = NULL;
 ^
  1 error generated.

OK, fine. Just do things the old way.

Signed-off-by: John Snow 
---
 ui/clipboard.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/ui/clipboard.c b/ui/clipboard.c
index d7b008d62a..9ab65efefb 100644
--- a/ui/clipboard.c
+++ b/ui/clipboard.c
@@ -44,12 +44,11 @@ void qemu_clipboard_peer_release(QemuClipboardPeer *peer,
 
 void qemu_clipboard_update(QemuClipboardInfo *info)
 {
-g_autoptr(QemuClipboardInfo) old = NULL;
 assert(info->selection < QEMU_CLIPBOARD_SELECTION__COUNT);
 
 notifier_list_notify(&clipboard_notifiers, info);
 
-old = cbinfo[info->selection];
+g_free(cbinfo[info->selection]);
 cbinfo[info->selection] = qemu_clipboard_info_ref(info);
 }
 
-- 
2.31.1

Re: [PATCH] docs: Introducing pseries documentation.

2021-12-07 Thread Leonardo Augusto Guimarães Garcia

Hi Cédric,

On 11/18/21 05:57, Cédric Le Goater wrote:
> Hello Leonardo,
>
> On 11/17/21 21:14, lagar...@linux.ibm.com wrote:
>> From: Leonardo Garcia 
>>
>> The purpose of this document is to substitute the content currently
>> available in the QEMU wiki at [0]. This initial version does contain
>> some additional content as well. Whenever this documentation gets
>> upstream and is reflected in [1], the QEMU wiki will be edited to point
>> to this documentation, so that we only need to keep it updated in one
>> place.
>>
>> 0. https://wiki.qemu.org/Documentation/Platforms/POWER
>> 1. https://qemu.readthedocs.io/en/latest/system/ppc/pseries.html
>>
>> Signed-off-by: Leonardo Garcia 
>
>
> Thanks for this update,
>
> Some general comments,
>
> There are other ppc documents :
>
>   docs/specs/ppc-spapr-hcalls.txt
>   docs/specs/ppc-spapr-hotplug.txt
>   docs/specs/ppc-spapr-numa.rst
>   docs/specs/ppc-spapr-uv-hcalls.txt
>   docs/specs/ppc-spapr-xive.rst
>   docs/specs/ppc-xive.rst
>
> that should be moved maybe and/or referenced by this file ? Feel free
> to do a follow up.


Definitely! Thanks for pointing this out. I have included some references
to these files in v2, but referencing them properly will also need some
rework of the above files (such as converting the txt files to rst).
I'll work on that as a follow-up.


>
> On the terminology, I think we should use in the text :
>
>    pSeries, PowerNV, POWER[0-9]


Sure. I updated the terminology according to the above in v2.


>
> When it comes to QEMU machines names, it's 'pseries', 'powernv'
>
> Some small comments,
>
>
>> ---
>>   docs/system/ppc/pseries.rst | 185 
>>   1 file changed, 185 insertions(+)
>>
>> diff --git a/docs/system/ppc/pseries.rst b/docs/system/ppc/pseries.rst
>> index 932d4dd17d..2de3fb4d51 100644
>> --- a/docs/system/ppc/pseries.rst
>> +++ b/docs/system/ppc/pseries.rst
>> @@ -1,12 +1,197 @@
>>   pSeries family boards (``pseries``)
>>   ===
>>   +The Power machine virtualized environment described by the `Linux
>> on Power
>
> para-virtualized ?


Absolutely. Fixed.


>
>> +Architecture Reference document (LoPAR)
>> +`_
>>
>> +is called pseries. This environment is also known as sPAPR, System p
>> guests, or
>> +simply Power Linux guests (although it is capable of running other
>> operating
>> +systems, such as AIX).
>> +
>> +Even though pseries is designed to behave as a guest environment, it
>> is also
>> +capable of acting as a hypervisor OS, providing, on that role, nested
>> +virtualization capabilities.
>
> on POWER9 and above processors. Maybe we should start introducing the
> KVM-pseries term.


We can do nested virtualization with the combination of KVM-HV and
KVM-PR on POWER8 machines, for instance. I would rather not go into
this level of detail at this point of the text, as it is covered
later in the documentation.


>
>> +
>>   Supported devices
>>   -
>>   + * Multi processor support for many Power processors generations:
>> POWER5+,
>> +   POWER7, POWER7+, POWER8, POWER8NVL, Power9, and Power10 (there is
>> no support
>> +   for POWER6 processors).
>
> POWER8NVL is for baremetal only.


You can actually create pseries guests on a POWER8NVL machine using
QEMU/KVM.


>
>> + * Interrupt Controller, XICS (POWER8) and XIVE (Power9 and Power10)
>> + * vPHB PCIe Host bridge.
>> + * vscsi and vnet devices, compatible with the same devices
>> available on a
>> +   PowerVM hypervisor with VIOS managing LPARs.
>> + * Virtio based devices.
>> + * PCIe device pass through.
>> +
>>   Missing devices
>>   ---
>>   + * SPICE support.
>>     Firmware
>>   
>> +
>> +`SLOF `_ (Slimline Open Firmware) is an
>> +implementation of the `IEEE 1275-1994, Standard for Boot
>> (Initialization
>> +Configuration) Firmware: Core Requirements and Practices
>> +`_.
>> +
>> +QEMU includes a prebuilt image of SLOF which is updated when a more
>> recent
>> +version is required.
>> +
>> +Build directions
>> +
>> +
>> +.. code-block:: bash
>> +
>> +  ./configure --target-list=ppc64-softmmu && make
>> +
>> +Running instructions
>> +
>> +
>> +Someone can select the pseries machine type by running QEMU with the
>> following
>> +options:
>> +
>> +.. code-block:: bash
>> +
>> +  qemu-system-ppc64 -M pseries 
>> +
>> +sPAPR devices
>> +-
>> +
>> +The sPAPR specification defines a set of para-virtualized devices,
>> which are
>> +also supported by the pseries machine in QEMU and can be
>> instantiated with the
>> +`-device` option:
>> +
>> +* spapr-vlan : A virtual network interface.
>> +* spapr-vscsi : A virtual SCSI disk interface.
>> +* spapr-rng : A pseudo-device for passing random number generator
>> data to the
>> +  guest (see the `H_RANDOM

[PATCH v2 0/2] Misc: build fixes for Fedora 35, Ubuntu et al

I didn't push this through in time for 6.2, so this series just worries
about the little fixes necessary for building QEMU under Fedora 35 and
the latest Ubuntu distributions.

The actual container changes have been cut off of this series in favor
of Dan's larger series that switches us over to using lcitool.

John Snow (2):
  spice: Update QXLInterface for spice >= 0.15.0
  ui/clipboard: Don't use g_autoptr just to free a variable

 include/ui/qemu-spice.h |  6 ++
 hw/display/qxl.c| 14 +-
 ui/clipboard.c  |  3 +--
 ui/spice-display.c  | 11 +++
 4 files changed, 31 insertions(+), 3 deletions(-)

-- 
2.31.1

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

On Tue, 7 Dec 2021 at 15:49, Damien Hedde  wrote:
>
>
>
> On 12/7/21 16:45, Peter Maydell wrote:
> > On Tue, 7 Dec 2021 at 15:24, Peter Maydell  wrote:
> >> The bug is a bug in any case and we'll fix it, it's just a
> >> question of whether it meets the bar to go into 6.2, which is
> >> hopefully going to have its final RC tagged today. If this
> >> patch had arrived a week ago then the bar would have been
> >> lower and it would definitely have gone in. As it is I have
> >> to weigh up the chances of this change causing a regression
> >> for eg KVM running on emulated QEMU.
> >
> > Looking at the KVM source it doesn't ever set the LRENPIE
> > bit (it doesn't even have a #define for it), which both
> > explains why we didn't notice this bug before and also
> > means we can be pretty certain we're not going to cause a
> > regression for KVM at least if we fix it...

> We are perfectly fine with this not going into 6.2.

I thought about it a bit more, and realized that we could
end up giving KVM spurious maintenance interrupts even though
it doesn't set the LRENPIE bit, because the incorrect OR
meant we'd send a maint irq whenever the EOIcount was nonzero.
So we've put this fix in for 6.2.
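
For illustration, the difference between the old OR test and the fixed AND test can be sketched in isolation. This is only a minimal sketch: the macro and function names below are local stand-ins, not QEMU's, and the bit positions (LRENPIE at bit 2, EOIcount in bits [31:27]) are assumed from the GICv3 architecture specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the ICH_HCR_EL2 fields discussed in the thread.
 * Bit positions are an assumption taken from the GICv3 specification. */
#define HCR_LRENPIE        (1u << 2)
#define HCR_EOICOUNT_SHIFT 27
#define HCR_EOICOUNT_MASK  (0x1fu << HCR_EOICOUNT_SHIFT)

/* Build an ICH_HCR_EL2 value from the two fields of interest. */
static uint32_t make_hcr(bool lrenpie, unsigned eoicount)
{
    assert(eoicount <= 0x1f); /* EOIcount is a 5-bit field */
    return (lrenpie ? HCR_LRENPIE : 0) |
           ((uint32_t)eoicount << HCR_EOICOUNT_SHIFT);
}

/* Old test: true as soon as either field is non-zero. */
static bool lrenp_old(uint32_t hcr)
{
    return (hcr & (HCR_LRENPIE | HCR_EOICOUNT_MASK)) != 0;
}

/* Fixed test: true only when LRENPIE is set and EOIcount is non-zero. */
static bool lrenp_fixed(uint32_t hcr)
{
    return (hcr & HCR_LRENPIE) && (hcr & HCR_EOICOUNT_MASK);
}
```

With LRENPIE clear and a non-zero EOIcount, which is the KVM case described above, lrenp_old() reports the bit as set while lrenp_fixed() does not.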

Thanks for the patch and the discussion.

-- PMM

[PATCH v3 1/1] multifd: Shut down the QIO channels to avoid blocking the send threads when they are terminated

When doing live migration with 8, 16 or more multifd channels,
the guest hangs in the presence of network errors such as missing TCP ACKs.

On the sender's side:
migrate_fd_cleanup is called because one thread fails in
qio_channel_write_all when the network problem happens, while the other
send threads are blocked in sendmsg and cannot be terminated.
The main thread therefore blocks in qemu_thread_join, waiting for those
threads to terminate.

(gdb) bt
0  0x7f30c8dcffc0 in __pthread_clockjoin_ex () at /lib64/libpthread.so.0
1  0x55cbb716084b in qemu_thread_join (thread=0x55cbb881f418) at 
../util/qemu-thread-posix.c:627
2  0x55cbb6b54e40 in multifd_save_cleanup () at ../migration/multifd.c:542
3  0x55cbb6b4de06 in migrate_fd_cleanup (s=0x55cbb8024000) at 
../migration/migration.c:1808
4  0x55cbb6b4dfb4 in migrate_fd_cleanup_bh (opaque=0x55cbb8024000) at 
../migration/migration.c:1850
5  0x55cbb7173ac1 in aio_bh_call (bh=0x55cbb7eb98e0) at ../util/async.c:141
6  0x55cbb7173bcb in aio_bh_poll (ctx=0x55cbb7ebba80) at ../util/async.c:169
7  0x55cbb715ba4b in aio_dispatch (ctx=0x55cbb7ebba80) at 
../util/aio-posix.c:381
8  0x55cbb7173ffe in aio_ctx_dispatch (source=0x55cbb7ebba80, callback=0x0, 
user_data=0x0) at ../util/async.c:311
9  0x7f30c9c8cdf4 in g_main_context_dispatch () at 
/usr/lib64/libglib-2.0.so.0
10 0x55cbb71851a2 in glib_pollfds_poll () at ../util/main-loop.c:232
11 0x55cbb718521c in os_host_main_loop_wait (timeout=42251070366) at 
../util/main-loop.c:255
12 0x55cbb7185321 in main_loop_wait (nonblocking=0) at 
../util/main-loop.c:531
13 0x55cbb6e6ba27 in qemu_main_loop () at ../softmmu/runstate.c:726
14 0x55cbb6ad6fd7 in main (argc=68, argv=0x7ffc0c57, 
envp=0x7ffc0c578ab0) at ../softmmu/main.c:50

To make sure that the send threads can be terminated, the IO channels should
be shut down so that the threads do not keep waiting on IO.
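
The socket behaviour this relies on can be sketched outside QEMU with plain POSIX calls. This is a minimal, single-threaded sketch under stated assumptions: it uses a local AF_UNIX socketpair instead of a QIOChannel, the helper and struct names are hypothetical, and it only shows that I/O on a shut-down socket returns at once instead of waiting. On Linux, a thread already blocked in recv() or sendmsg() on that socket is woken in the same way.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* What I/O on a socket reports once that socket has been shut down. */
struct shutdown_probe {
    ssize_t recv_ret;   /* 0 means end-of-file */
    ssize_t send_ret;   /* -1 means the write was refused */
    int send_errno;     /* errno left by the refused send */
};

static struct shutdown_probe probe_after_shutdown(void)
{
    struct shutdown_probe r = { -2, -2, 0 };
    int sv[2];
    char byte = 'x';
    int rc;

    rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    /* The peer sv[1] stays open and silent, so without the shutdown
     * a recv() on sv[0] would wait indefinitely. */
    rc = shutdown(sv[0], SHUT_RDWR);
    assert(rc == 0);

    r.recv_ret = recv(sv[0], &byte, 1, 0);
    /* MSG_NOSIGNAL avoids SIGPIPE so the error is visible as errno. */
    r.send_ret = send(sv[0], &byte, 1, MSG_NOSIGNAL);
    r.send_errno = errno;

    close(sv[0]);
    close(sv[1]);
    return r;
}
```

After the shutdown, recv() reports end-of-file and send() fails with EPIPE, so a send thread gets control back and can notice its quit flag.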

Reviewed-by: Dr. David Alan Gilbert 
Reviewed-by: Daniel P. Berrangé 
Signed-off-by: Li Zhang 
---
v3 - >v2: 
Move the channel shutdown before the semaphore post.

 migration/multifd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 7c9deb1921..f9423be12d 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -522,6 +522,9 @@ static void multifd_send_terminate_threads(Error *err)
 
 qemu_mutex_lock(&p->mutex);
 p->quit = true;
+if (p->c) {
+qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+}
 qemu_sem_post(&p->sem);
 qemu_mutex_unlock(&p->mutex);
 }
-- 
2.31.1

dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13

2021-12-07 Thread Chris Murphy

cc: qemu-devel

Hi,

I'm trying to help progress a very troublesome and so far elusive bug
we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
VMs simultaneously, eventually they become unresponsive, as do the
new processes we start when trying to extract information from the host
about what's gone wrong.

Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
state where forking does not work correctly, breaking most things
https://bugzilla.redhat.com/show_bug.cgi?id=2009585

In subsequent testing, we used newer kernels with lockdep and other
debug stuff enabled, and managed to capture a hung task with a bunch
of locks listed, including kvm and qemu processes. But I can't parse
it.

5.15-rc7
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
5.15+
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939

If anyone can take a glance at those kernel messages, and/or give
hints how we can extract more information for debugging, it'd be
appreciated. Maybe all of that is normal and the actual problem isn't
in any of these traces.

Thanks,

--
Chris Murphy

Re: [PULL for-6.2 0/1] target-arm queue


On 12/7/21 9:25 AM, Peter Maydell wrote:

Last minute pullreq with one patch, fixing the GICv3 ICH_MISR_EL2.LRENP
calculation. I went back-and-forth on whether to put this in, but:
  * it's an effective regression from 6.1 (the bug itself has been
present since before then, but it was previously masked by the
other bug which we fixed in 9cee1efe92)
  * I just realized it could cause a screaming maintenance interrupt
even for hypervisors like KVM that don't set LRENPIE

On the other hand this is very late and we haven't seen it be a
problem with any guest except Qualcomm's hypervisor. So if you want
to decide it's better not going in that's OK too.

Tested on the gitlab CI and with a local test of nested KVM.

-- PMM

The following changes since commit 7635eff97104242d618400e4b6746d0a5c97af82:

   Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into 
staging (2021-12-06 11:18:06 -0800)

are available in the Git repository at:

   https://git.linaro.org/people/pmaydell/qemu-arm.git 
tags/pull-target-arm-20211207

for you to fetch changes up to 2958e5150dfa297dd5a51fe57a29156b8744f07f:

   gicv3: fix ICH_MISR's LRENP computation (2021-12-07 15:30:08 +)


target-arm queue:
  * Fix calculation of ICH_MISR_EL2.LRENP to avoid incorrect generation
of maintenance interrupts


Damien Hedde (1):
   gicv3: fix ICH_MISR's LRENP computation

  hw/intc/arm_gicv3_cpuif.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)


Applied, thanks.

r~

Re: [PATCH v2 1/3] qmp: Support for querying stats

2021-12-07 Thread Daniel P . Berrangé

Copying in Markus as QAPI maintainer, since I feel this proposed
design is a little oddball compared to the typical QAPI design approach.

On Fri, Nov 19, 2021 at 01:51:51PM -0600, Mark Kanda wrote:
> Introduce qmp support for querying stats. Provide a framework for
> adding new stats and support for the following commands:
> 
> - query-stats
> Returns a list of all stats, with options for specifying a stat
> name and schema type. A schema type is the set of stats associated
> with a given component (e.g. vm or vcpu).
> 
> - query-stats-schemas
> Returns a list of stats included in each schema type, with an
> option for specifying the schema name.
> 
> - query-stats-instances
> Returns a list of stat instances and their associated schema type.
> 
> The framework provides a method to register callbacks for these qmp
> commands.
> 
> The first usecase will be for fd-based KVM stats (in an upcoming
> patch).
> 
> Examples (with fd-based KVM stats):
> 
> { "execute": "query-stats" }
> { "return": [
> { "name": "vcpu_1",
>   "type": "kvm-vcpu",
>   "stats": [
> { "name": "guest_mode",
>   "unit": "none",
>   "base": 10,
>   "val": [ 0 ],
>   "exponent": 0,
>   "type": "instant" },
> { "name": "directed_yield_successful",
>   "unit": "none",
>   "base": 10,
>   "val": [ 0 ],
>   "exponent": 0,
>   "type": "cumulative" },
> ...
> },
> { "name": "vcpu_0",
>   "type": "kvm-vcpu",
>   "stats": ...
> ...
>  },
> { "name": "vm",
>   "type": "kvm-vm",
>   "stats": [
> { "name": "max_mmu_page_hash_collisions",
>   "unit": "none",
>   "base": 10,
>   "val": [ 0 ],
>   "exponent": 0,
>   "type": "peak" },
>   ...

So this is essentially exposing the low level kernel data structure
'struct kvm_stats_desc' mapped 1-to-1 into QAPI.

There are pros/cons to doing that which should be explored to see whether
this actually makes sense for the QMP design.

I understand this design is intended to be fully self-describing
such that we can add arbitrarily more fields without ever
changing QEMU code, and with a simple mapping from the kernel
kvm_stats_desc.

Taking the first level of data returned, we see the natural
structure of the data wrt vCPUs is flattened:

 { "return": [
 { "name": "vcpu_0",
   "type": "kvm-vcpu",
   "stats": [...],  // stats for vcpu 0
 },
 { "name": "vcpu_1",
   "type": "kvm-vcpu",
   "stats": [...],  // stats for vcpu 1
 },
 ...other vCPUs...
 { "name": "vm",
   "type": "kvm-vm",
   "stats": [...],  // stats for the VM
 },
 ] }

This name+type stuff is all unnecessarily indirect. If we ever
have to add more data unrelated to the kvm stats, we're going
to need QEMU changes no matter what, so this indirect structure
isn't future proofing it.

I'd rather expect us to have named struct fields for each
different provider of data, respecting the natural hierarchy,
i.e. use an array for vCPU data.

I understand this is future proofed to be able to support
non-KVM stats. If we have KVM per-vCPU stat and non-KVM
per-VCPU stats at the same time, I'd expect them all to
appear in the same place.  IOW, overall I'd expect to
see grouping more like

 { "return": {
     "vcpus": [

       [ // stats for vcpu 0
         { "provider": 'kvm',
           "stats": [...] },
         { "provider": 'qemu',
           "stats": [...] }
       ],

       [ // stats for vcpu 1
         { "provider": 'kvm',
           "stats": [...] },
         { "provider": 'qemu',
           "stats": [...] }
       ],

       ...other vCPUs...
     ],
     "vm": [
       { "provider": 'kvm',
         "stats": [...] },
       { "provider": 'qemu',
         "stats": [...] }
     ]
   }
 }

Now onto the values being reported. AFAICT from the kernel
docs, for all the types of data it currently reports
(cumulative, instant, peak, none), there is only ever going
to be a single value. I assume the ability to report multiple
values is future proofing for a later requirement. This is
fine from a kernel POV since they're trying to fit this into
a flat struct. QAPI is way more flexible. It can switch
between reporting a scalar or an array of scalars for the
same field. So if we know the stat will only ever have
1 value, we should be reporting a scalar, not an array
which will only ever have one value.

Second, for a given named statistic, AFAICT, the data type,
unit, base and exponent are all fixed. I don't see a reason
for us to be reporting that information every time we call
'query-stats'. Just report the name + value(s).  Apps that
want a specific named stat won't care about the dimensions,
because they'll already know what the value means.

Apps that want to be metadata driven to handle arbitrary
stats, can just call 'query-stats-schemas' to learn about
the dimensions one time.
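
To make the "dimensions" concrete: assuming base and exponent carry the meaning given in the kernel's KVM binary stats documentation (real value = val * base^exponent), a client that has fetched them once from the schema could scale each reported value as in the sketch below. The function name is hypothetical and not part of any proposed API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a raw stat value by base^exponent.
 * A negative exponent divides, e.g. base 10, exponent -9 for nanoseconds.
 */
static double stat_scaled(uint64_t val, unsigned base, int exponent)
{
    double result = (double)val;
    int i;

    assert(base > 0);
    for (i = 0; i < exponent; i++) {
        result *= base;
    }
    for (i = 0; i > exponent; i--) {
        result /= base;
    }
    return result;
}
```

For the stats shown in the example output (base 10, exponent 0) this is the identity, which is part of why repeating those fields on every 'query-stats' call adds little.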

This will give

Re: [PATCH 0/3] iotests: multiprocessing!!

On Mon, Dec 6, 2021 at 3:26 PM Vladimir Sementsov-Ogievskiy <
vsement...@virtuozzo.com> wrote:

> 06.12.2021 21:37, John Snow wrote:
> >
> >
> > On Fri, Dec 3, 2021 at 7:22 AM Vladimir Sementsov-Ogievskiy <
> vsement...@virtuozzo.com > wrote:
> >
> > Hi all!
> >
> > Finally, I can not stand it any longer. So, I'm happy to present
> > multiprocessing support for iotests test runner.
> >
> > testing on tmpfs:
> >
> > Before:
> >
> > time check -qcow2
> > ...
> > real12m28.095s
> > user9m53.398s
> > sys 2m55.548s
> >
> > After:
> >
> > time check -qcow2 -j 12
> > ...
> > real2m12.136s
> > user12m40.948s
> > sys 4m7.449s
> >
> >
> > VERY excellent. And this will probably flush a lot more bugs loose, too.
> (Which I consider a good thing!)
>
> Thanks!)
>
> > We could look into utilizing it for 'make check', but we'll have to be
> prepared for a greater risk of race conditions on the CI if we do. But...
> it's seriously hard to argue with this kind of optimization, very well done!
>
> I thought about this too. I think we can at least pass the -j flag of
> "make -j9 check" through to ./check
>
> I think CIs mostly call make check without the -j flag. But I always call
> make -j9 check. And it always upsets me that all tests run in parallel
> except for iotests. So if it is possible to detect that we are called through
> "make -j9 check" and use "-j 9" for iotests/check in this case, it would be
> good.
>
> >
> >
> > Hmm, seems -j 6 should be enough. I have 6 cores, 2 threads per core.
> > Anyway, that's so fast!
> >
> > Vladimir Sementsov-Ogievskiy (3):
> >iotests/testrunner.py: add doc string for run_test()
> >iotests/testrunner.py: move updating last_elapsed to run_tests
> >iotests: check: multiprocessing support
> >
> >   tests/qemu-iotests/check |  4 +-
> >   tests/qemu-iotests/testrunner.py | 86
> 
> >   2 files changed, 80 insertions(+), 10 deletions(-)
> >
> > --
> > 2.31.1
> >
>
>
>
I'll also now add:

Tested-by: John Snow 

I tried to find a different workaround in just a few minutes, but that just
made it clear that your solution was right. While I had it checked out, I
ran it a few times and it looks good to me!
(And no new problems from the Python CI stuff, so it looks good to me.)

Re: [PATCH] fuzz: pass failures from child process into libfuzzer engine

2021-12-07 Thread Alexander Bulekov

On 211206 2348, Konstantin Khlebnikov wrote:
> 
> 
>06.12.2021, 19:35, "Alexander Bulekov" <[1]alx...@bu.edu>:
> 
>  On 211205 1917, Konstantin Khlebnikov wrote:
> 
> Fuzzer is supposed to stop when first bug is found and report
>failure.
> Present fuzzers fork new child at each iteration to isolate
>side-effects.
> But child's exit code is ignored, i.e. libfuzzer does not see any
>crashes.
> 
> Right now virtio-net fuzzer instantly falls on assert in iov_copy and
> dumps crash-*, but fuzzing continues and ends successfully if global
> timeout is set.
> 
> Let's put required logic into helper function "fork_fuzzer_and_wait".
> 
> 
>  Hi Konstantin,
>  Can you provide more details about the problem this is meant to solve?
>  Currently, the fuzzer would just output a "crash-" file and continue
>  fuzzing. So the crash isn't lost - it can still be reproduced later.
>  This means the fuzzer can progress faster (no need to restart the whole
>  process each time there is a crash).
> 
>  However, this is of course, not the default libfuzzer behavior. That's
>  why I wonder whether you encountered some issue that depended on
>  libfuzzer exiting immediately. We have had some problems on OSS-Fuzz,
>  with incomplete coverage reports, and I wonder if this could be related.
> 
>  For the example you gave, OSS-Fuzz picked up on the crash, so even
>  though we don't conform to the default libfuzzer behavior, the crashes
>  are still detected.
>  
> [2]https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=23985=iov_copy=2
> 
>Oh well. So, this is known "feature". That was unexpected. =)
>Recent libfuzzer has options for that behaviour: "-fork=1
>-ignore_crashes=1".
> 
>I'm trying to fuzz various virtio devices and was really surprised that
>present
>fuzzing targets still find crashes in seconds.
>I thought they might be missed due to unhandled exit status.

For some reason, I never created a report for this issue. Just created
one: https://gitlab.com/qemu-project/qemu/-/issues/762

> 
>It seems fuzzing targets like "virtio-net" in present state wastes
>resources.
>oss-fuzz could instead focus on not yet broken targets.

I don't think so. For example, the generic-fuzz-virtio-net-pci-slirp
fuzzer also found the same issue, but it continued making progress, and
eventually found CVE-2021-3748
https://access.redhat.com/security/cve/cve-2021-3748
(the reproducer was almost 200 lines long - much more complex than issue #762)
So with the fork approach, the fuzzer might be slowed down (due to
outputting stacktraces and creating crash- files), but it can still
continue to make progress.
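
For reference, the parent-side choice being discussed (keep fuzzing after a child crash, or propagate it) comes down to how the wait status is classified. Below is a minimal sketch using only POSIX calls; the enum and helper names are hypothetical and are not taken from the fuzzer sources.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum child_result {
    CHILD_OK,            /* exited with status 0 */
    CHILD_EXIT_NONZERO,  /* exited with a non-zero status */
    CHILD_SIGNALED,      /* killed by a signal, e.g. SIGABRT from assert */
    CHILD_ERROR          /* fork() or waitpid() itself failed */
};

/* Run fn() in a forked child and classify how the child ended. */
static enum child_result run_in_child(void (*fn)(void))
{
    int wstatus;
    pid_t pid = fork();

    if (pid < 0) {
        return CHILD_ERROR;
    }
    if (pid == 0) {
        fn();
        _exit(0); /* child: fn() returned normally */
    }
    if (waitpid(pid, &wstatus, 0) < 0) {
        return CHILD_ERROR;
    }
    if (WIFSIGNALED(wstatus)) {
        return CHILD_SIGNALED;
    }
    if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == 0) {
        return CHILD_OK;
    }
    return CHILD_EXIT_NONZERO;
}

/* Three toy "inputs" standing in for fuzzer iterations. */
static void input_ok(void) { }
static void input_crash(void) { abort(); }
static void input_exit3(void) { _exit(3); }
```

A parent that wants the current behaviour ignores everything except CHILD_ERROR and moves on to the next input. A parent that wants stock libfuzzer behaviour stops on CHILD_SIGNALED or CHILD_EXIT_NONZERO.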

> 
>Or "abort/assert' in device emulation code should be treated as "success"?
>In some sense that's true, we cannot prevent suicide behaviour in vm.
>Real hardware dies easily after shooting randomly into ports/io ranges.

Certainly not. We usually create QEMU Issues for assertion failures
found by the fuzzer, but the one you brought up slipped through the cracks.

> 
>  Small question below.
> 
> Signed-off-by: Konstantin Khlebnikov <[3]khlebni...@yandex-team.ru>
> ---
>  tests/qtest/fuzz/fork_fuzz.c | 26 ++
>  tests/qtest/fuzz/fork_fuzz.h | 1 +
>  tests/qtest/fuzz/generic_fuzz.c | 3 +--
>  tests/qtest/fuzz/i440fx_fuzz.c | 3 +--
>  tests/qtest/fuzz/virtio_blk_fuzz.c | 6 ++
>  tests/qtest/fuzz/virtio_net_fuzz.c | 6 ++
>  tests/qtest/fuzz/virtio_scsi_fuzz.c | 6 ++
>  7 files changed, 35 insertions(+), 16 deletions(-)
> 
> diff --git a/tests/qtest/fuzz/fork_fuzz.c
>b/tests/qtest/fuzz/fork_fuzz.c
> index 6ffb2a7937..6e3a3867bf 100644
> --- a/tests/qtest/fuzz/fork_fuzz.c
> +++ b/tests/qtest/fuzz/fork_fuzz.c
> @@ -38,4 +38,30 @@ void counter_shm_init(void)
>  free(copy);
>  }
>  
> +/* Returns true in child process */
> +bool fork_fuzzer_and_wait(void)
> +{
> + pid_t pid;
> + int wstatus;
> +
> + pid = fork();
> + if (pid < 0) {
> + perror("fork");
> + abort();
> + }
> +
> + if (pid == 0) {
> + return true;
> + }
>  
> + if (waitpid(pid, &wstatus, 0) < 0) {
> + perror("waitpid");
> + abort();
> + }
> +
> + if (!WIFEXITED(wstatus) || WEXITSTATUS(wstatus) != 0) {
> + abort();
> + }
> 
>  Maybe instead of these aborts, we return "true" so the fork-server tries
>  to run the input, itself and (hopefully) crashes. That way we would have
>  an accurate stack trace, instead of abort, which is probably important
>  for the

Re: [PATCH v7 1/7] net/vmnet: add vmnet dependency and customizable option

2021-12-07 Thread Vladislav Yaroshchuk

If you meant patch series cover letter, it exists, see
https://patchew.org/QEMU/20211207101828.22033-1-yaroshchuk2...@gmail.com/

вт, 7 дек. 2021 г. в 17:12, Markus Armbruster :

> No cover letter?
>
>

-- 
Best Regards,

Vladislav Yaroshchuk

[PULL 1/1] gicv3: fix ICH_MISR's LRENP computation

From: Damien Hedde 

According to the "Arm Generic Interrupt Controller Architecture
Specification GIC architecture version 3 and 4" (version G: page 345
for aarch64 or 509 for aarch32):
LRENP bit of ICH_MISR is set when ICH_HCR.LRENPIE==1 and
ICH_HCR.EOIcount is non-zero.

When only LRENPIE was set (and EOI count was zero), the LRENP bit was
wrongly set and MISR value was wrong.

As an additional consequence, if a hypervisor set ICH_HCR.LRENPIE,
the maintenance interrupt fired constantly. This has happened since commit
9cee1efe92 ("hw/intc: Set GIC maintenance interrupt level to only 0 or 1"),
which fixed another bug in the maintenance interrupt (the most significant
bits of misr, including this one, were ignored in the interrupt trigger).

Fixes: 83f036fe3d ("hw/intc/arm_gicv3: Add accessors for ICH_ system registers")
Signed-off-by: Damien Hedde 
Reviewed-by: Peter Maydell 
Message-id: 20211207094427.3473-1-damien.he...@greensocs.com
Signed-off-by: Peter Maydell 
---
 hw/intc/arm_gicv3_cpuif.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/intc/arm_gicv3_cpuif.c b/hw/intc/arm_gicv3_cpuif.c
index 7fba9314508..85fc369e550 100644
--- a/hw/intc/arm_gicv3_cpuif.c
+++ b/hw/intc/arm_gicv3_cpuif.c
@@ -351,7 +351,8 @@ static uint32_t maintenance_interrupt_state(GICv3CPUState 
*cs)
 /* Scan list registers and fill in the U, NP and EOI bits */
 eoi_maintenance_interrupt_state(cs, &value);
 
-if (cs->ich_hcr_el2 & (ICH_HCR_EL2_LRENPIE | ICH_HCR_EL2_EOICOUNT_MASK)) {
+if ((cs->ich_hcr_el2 & ICH_HCR_EL2_LRENPIE) &&
+(cs->ich_hcr_el2 & ICH_HCR_EL2_EOICOUNT_MASK)) {
 value |= ICH_MISR_EL2_LRENP;
 }
 
-- 
2.25.1

[PATCH v7 2/4] tests/qtest: add some tests for virtio-net failover

Add test cases to test several error cases that must be
generated by invalid failover configuration.

Add a combination of coldplug and hotplug test cases to be
sure the primary is correctly managed according to the
presence or absence of the STANDBY feature.

Signed-off-by: Laurent Vivier 
---
 tests/qtest/meson.build   |   4 +
 tests/qtest/virtio-net-failover.c | 771 ++
 2 files changed, 775 insertions(+)
 create mode 100644 tests/qtest/virtio-net-failover.c

diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index c9d8458062ff..975a0f2f5f25 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -68,6 +68,10 @@ qtests_i386 = \
   (config_all_devices.has_key('CONFIG_RTL8139_PCI') ? ['rtl8139-test'] : []) + 
 \
   (config_all_devices.has_key('CONFIG_E1000E_PCI_EXPRESS') ? 
['fuzz-e1000e-test'] : []) +   \
   (config_all_devices.has_key('CONFIG_ESP_PCI') ? ['am53c974-test'] : []) +
 \
+  (config_all_devices.has_key('CONFIG_VIRTIO_NET') and 
 \
+   config_all_devices.has_key('CONFIG_Q35') and
 \
+   config_all_devices.has_key('CONFIG_VIRTIO_PCI') and 
 \
+   slirp.found() ? ['virtio-net-failover'] : []) + 
 \
   (unpack_edk2_blobs ? ['bios-tables-test'] : []) +
 \
   qtests_pci + 
 \
   ['fdc-test',
diff --git a/tests/qtest/virtio-net-failover.c 
b/tests/qtest/virtio-net-failover.c
new file mode 100644
index ..7444d30d2900
--- /dev/null
+++ b/tests/qtest/virtio-net-failover.c
@@ -0,0 +1,771 @@
+/*
+ * QTest testcase for virtio-net failover
+ *
+ * See docs/system/virtio-net-failover.rst
+ *
+ * Copyright (c) 2021 Red Hat, Inc.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+#include "qemu/osdep.h"
+#include "libqos/libqtest.h"
+#include "libqos/pci.h"
+#include "libqos/pci-pc.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qlist.h"
+#include "qapi/qmp/qjson.h"
+#include "libqos/malloc-pc.h"
+#include "libqos/virtio-pci.h"
+#include "hw/pci/pci.h"
+
+#define ACPI_PCIHP_ADDR_ICH9 0x0cc0
+#define PCI_EJ_BASE 0x0008
+
+#define BASE_MACHINE "-M q35 -nodefaults " \
+"-device pcie-root-port,id=root0,addr=0x1,bus=pcie.0,chassis=1 " \
+"-device pcie-root-port,id=root1,addr=0x2,bus=pcie.0,chassis=2 "
+
+#define MAC_PRIMARY0 "52:54:00:11:11:11"
+#define MAC_STANDBY0 "52:54:00:22:22:22"
+
+static QGuestAllocator guest_malloc;
+static QPCIBus *pcibus;
+
+static QTestState *machine_start(const char *args, int numbus)
+{
+QTestState *qts;
+QPCIDevice *dev;
+int bus;
+
+qts = qtest_init(args);
+
+pc_alloc_init(&guest_malloc, qts, 0);
+pcibus = qpci_new_pc(qts, &guest_malloc);
+g_assert(qpci_secondary_buses_init(pcibus) == numbus);
+
+for (bus = 1; bus <= numbus; bus++) {
+dev = qpci_device_find(pcibus, QPCI_DEVFN(bus, 0));
+g_assert_nonnull(dev);
+
+qpci_device_enable(dev);
+qpci_iomap(dev, 4, NULL);
+
+g_free(dev);
+}
+
+return qts;
+}
+
+static void machine_stop(QTestState *qts)
+{
+qpci_free_pc(pcibus);
+alloc_destroy(&guest_malloc);
+qtest_quit(qts);
+}
+
+static void test_error_id(void)
+{
+QTestState *qts;
+QDict *resp;
+QDict *err;
+
+qts = machine_start(BASE_MACHINE
+"-device virtio-net,bus=root0,id=standby0,failover=on",
+2);
+
+resp = qtest_qmp(qts, "{'execute': 'device_add',"
+  "'arguments': {"
+  "'driver': 'virtio-net',"
+  "'bus': 'root1',"
+  "'failover_pair_id': 'standby0'"
+  "} }");
+g_assert(qdict_haskey(resp, "error"));
+
+err = qdict_get_qdict(resp, "error");
+g_assert(qdict_haskey(err, "desc"));
+
+g_assert_cmpstr(qdict_get_str(err, "desc"), ==,
+"Device with failover_pair_id needs to have id");
+
+qobject_unref(resp);
+
+machine_stop(qts);
+}
+
+static void test_error_pcie(void)
+{
+QTestState *qts;
+QDict *resp;
+QDict *err;
+
+qts = machine_start(BASE_MACHINE
+"-device virtio-net,bus=root0,id=standby0,failover=on",
+2);
+
+resp = qtest_qmp(qts, "{'execute': 'device_add',"
+  "'arguments': {"
+  "'driver': 'virtio-net',"
+  "'id': 'primary0',"
+  "'bus': 'pcie.0',"
+  "'failover_pair_id': 'standby0'"
+  "} }");
+g_assert(qdict_haskey(resp, "error"));
+
+err = qdict_get_qdict(resp, "error");
+g_assert(qdict_haskey(err, "desc"));
+
+g_assert_cmpstr(qdict_get_str(err, "desc"), ==,
+

[PULL for-6.2 0/1] target-arm queue

Last minute pullreq with one patch, fixing the GICv3 ICH_MISR_EL2.LRENP
calculation. I went back-and-forth on whether to put this in, but:
 * it's an effective regression from 6.1 (the bug itself has been
   present since before then, but it was previously masked by the
   other bug which we fixed in 9cee1efe92)
 * I just realized it could cause a screaming maintenance interrupt
   even for hypervisors like KVM that don't set LRENPIE

On the other hand this is very late and we haven't seen it be a
problem with any guest except Qualcomm's hypervisor. So if you want
to decide it's better not going in that's OK too.

Tested on the gitlab CI and with a local test of nested KVM.

-- PMM

The following changes since commit 7635eff97104242d618400e4b6746d0a5c97af82:

  Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into 
staging (2021-12-06 11:18:06 -0800)

are available in the Git repository at:

  https://git.linaro.org/people/pmaydell/qemu-arm.git 
tags/pull-target-arm-20211207

for you to fetch changes up to 2958e5150dfa297dd5a51fe57a29156b8744f07f:

  gicv3: fix ICH_MISR's LRENP computation (2021-12-07 15:30:08 +)


target-arm queue:
 * Fix calculation of ICH_MISR_EL2.LRENP to avoid incorrect generation
   of maintenance interrupts


Damien Hedde (1):
  gicv3: fix ICH_MISR's LRENP computation

 hw/intc/arm_gicv3_cpuif.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

[PATCH v7 1/4] qtest/libqos: add a function to initialize secondary PCI buses

Scan the PCI devices to find bridges and set PCI_SECONDARY_BUS and
PCI_SUBORDINATE_BUS (algorithm from SeaBIOS)
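
As a rough illustration of the numbering order such a scan produces, here is a toy depth-first sketch. It is not the libqos code: the struct and function names are hypothetical, it works on an in-memory tree rather than PCI config space, and it leaves out the resource-reserve capability handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

/* Toy bridge: only the fields needed to show the numbering order. */
struct toy_bridge {
    uint8_t primary, secondary, subordinate;
    struct toy_bridge *child[MAX_CHILDREN]; /* bridges behind this one */
    size_t nchild;
};

/*
 * Number every bridge found on `bus`, depth first.
 * *last_bus holds the highest bus number handed out so far.
 */
static void toy_assign_buses(struct toy_bridge **bridges, size_t n,
                             uint8_t bus, uint8_t *last_bus)
{
    size_t i;

    for (i = 0; i < n; i++) {
        struct toy_bridge *b = bridges[i];

        assert(*last_bus < 255);
        b->primary = bus;
        b->secondary = ++(*last_bus);
        b->subordinate = 255; /* keep the window open while scanning below */
        toy_assign_buses(b->child, b->nchild, b->secondary, last_bus);
        b->subordinate = *last_bus; /* close it at the highest bus found */
    }
}
```

With two root ports on bus 0, the first of which has one bridge behind it, this yields secondary/subordinate pairs of 1/2 for the first port, 2/2 for the nested bridge and 3/3 for the second port.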

Signed-off-by: Laurent Vivier 
Acked-by: Thomas Huth 
---
 include/hw/pci/pci_bridge.h |   8 +++
 tests/qtest/libqos/pci.c| 118 
 tests/qtest/libqos/pci.h|   1 +
 3 files changed, 127 insertions(+)

diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
index a94d350034bf..30691a6e5728 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -138,6 +138,7 @@ typedef struct PCIBridgeQemuCap {
 uint64_t mem_pref_64; /* Prefetchable memory to reserve (64-bit MMIO) */
 } PCIBridgeQemuCap;
 
+#define REDHAT_PCI_CAP_TYPE_OFFSET  3
 #define REDHAT_PCI_CAP_RESOURCE_RESERVE 1
 
 /*
@@ -152,6 +153,13 @@ typedef struct PCIResReserve {
 uint64_t mem_pref_64;
 } PCIResReserve;
 
+#define REDHAT_PCI_CAP_RES_RESERVE_BUS_RES     4
+#define REDHAT_PCI_CAP_RES_RESERVE_IO          8
+#define REDHAT_PCI_CAP_RES_RESERVE_MEM         16
+#define REDHAT_PCI_CAP_RES_RESERVE_PREF_MEM_32 20
+#define REDHAT_PCI_CAP_RES_RESERVE_PREF_MEM_64 24
+#define REDHAT_PCI_CAP_RES_RESERVE_CAP_SIZE    32
+
 int pci_bridge_qemu_reserve_cap_init(PCIDevice *dev, int cap_offset,
PCIResReserve res_reserve, Error **errp);
 
diff --git a/tests/qtest/libqos/pci.c b/tests/qtest/libqos/pci.c
index e1e96189c821..a0b606e25c71 100644
--- a/tests/qtest/libqos/pci.c
+++ b/tests/qtest/libqos/pci.c
@@ -13,6 +13,8 @@
 #include "qemu/osdep.h"
 #include "pci.h"
 
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
 #include "hw/pci/pci_regs.h"
 #include "qemu/host-utils.h"
 #include "qgraph.h"
@@ -99,6 +101,122 @@ void qpci_device_init(QPCIDevice *dev, QPCIBus *bus, 
QPCIAddress *addr)
 g_assert(!addr->device_id || device_id == addr->device_id);
 }
 
+static uint8_t qpci_find_resource_reserve_capability(QPCIDevice *dev)
+{
+uint16_t device_id;
+uint8_t cap = 0;
+
+if (qpci_config_readw(dev, PCI_VENDOR_ID) != PCI_VENDOR_ID_REDHAT) {
+return 0;
+}
+
+device_id = qpci_config_readw(dev, PCI_DEVICE_ID);
+
+if (device_id != PCI_DEVICE_ID_REDHAT_PCIE_RP &&
+device_id != PCI_DEVICE_ID_REDHAT_BRIDGE) {
+return 0;
+}
+
+do {
+cap = qpci_find_capability(dev, PCI_CAP_ID_VNDR, cap);
+} while (cap &&
+ qpci_config_readb(dev, cap + REDHAT_PCI_CAP_TYPE_OFFSET) !=
+ REDHAT_PCI_CAP_RESOURCE_RESERVE);
+if (cap) {
+uint8_t cap_len = qpci_config_readb(dev, cap + PCI_CAP_FLAGS);
+if (cap_len < REDHAT_PCI_CAP_RES_RESERVE_CAP_SIZE) {
+return 0;
+}
+}
+return cap;
+}
+
+static void qpci_secondary_buses_rec(QPCIBus *qbus, int bus, int *pci_bus)
+{
+QPCIDevice *dev;
+uint16_t class;
+uint8_t pribus, secbus, subbus;
+int index;
+
+for (index = 0; index < 32; index++) {
+dev = qpci_device_find(qbus, QPCI_DEVFN(bus + index, 0));
+if (dev == NULL) {
+continue;
+}
+class = qpci_config_readw(dev, PCI_CLASS_DEVICE);
+if (class == PCI_CLASS_BRIDGE_PCI) {
+qpci_config_writeb(dev, PCI_SECONDARY_BUS, 255);
+qpci_config_writeb(dev, PCI_SUBORDINATE_BUS, 0);
+}
+g_free(dev);
+}
+
+for (index = 0; index < 32; index++) {
+dev = qpci_device_find(qbus, QPCI_DEVFN(bus + index, 0));
+if (dev == NULL) {
+continue;
+}
+class = qpci_config_readw(dev, PCI_CLASS_DEVICE);
+if (class != PCI_CLASS_BRIDGE_PCI) {
+continue;
+}
+
+pribus = qpci_config_readb(dev, PCI_PRIMARY_BUS);
+if (pribus != bus) {
+qpci_config_writeb(dev, PCI_PRIMARY_BUS, bus);
+}
+
+secbus = qpci_config_readb(dev, PCI_SECONDARY_BUS);
+(*pci_bus)++;
+if (*pci_bus != secbus) {
+secbus = *pci_bus;
+qpci_config_writeb(dev, PCI_SECONDARY_BUS, secbus);
+}
+
+subbus = qpci_config_readb(dev, PCI_SUBORDINATE_BUS);
+qpci_config_writeb(dev, PCI_SUBORDINATE_BUS, 255);
+
+qpci_secondary_buses_rec(qbus, secbus << 5, pci_bus);
+
+if (subbus != *pci_bus) {
+uint8_t res_bus = *pci_bus;
+uint8_t cap = qpci_find_resource_reserve_capability(dev);
+
+if (cap) {
+uint32_t tmp_res_bus;
+
+tmp_res_bus = qpci_config_readl(dev, cap +
+
REDHAT_PCI_CAP_RES_RESERVE_BUS_RES);
+if (tmp_res_bus != (uint32_t)-1) {
+res_bus = tmp_res_bus & 0xFF;
+if ((uint8_t)(res_bus + secbus) < secbus ||
+(uint8_t)(res_bus + secbus) < res_bus) {
+res_bus = 0;
+}
+if (secbus + res_bus > *pci_bus) {
+res_bus = secbus + res_bus;
+

[PATCH v7 0/4] tests/qtest: add some tests for virtio-net failover

This series adds a qtest entry to test the virtio-net failover feature.

We check the following error cases:

- check missing id on device with failover_pair_id triggers an error
- check a primary device plugged on a bus that doesn't support hotplug
  triggers an error

We check the status of the machine before and after hotplugging cards and
feature negotiation:

- check we don't see the primary device at boot if failover is on
- check we see the primary device at boot if failover is off
- check we don't see the primary device if failover is on
  but failover_pair_id is not the one with on (I think this should be changed)
- check the primary device is plugged after the feature negotiation
- check the result if the primary device is plugged before the standby device
  and vice-versa
- check the result if the primary device is coldplugged and the standby device
  hotplugged and vice-versa
- check the migration triggers the unplug and the hotplug

There is one preliminary patch in the series:

- PATCH 1 introduces a function to enable PCI bridges.
  The failover devices need to be plugged on a pcie-root-port, and as long
  as the root port is not configured the cards behind it are not
  available
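
To illustrate why the root port has to be configured first: a root port is a
PCI-to-PCI bridge, and nothing behind it can be addressed until its primary,
secondary and subordinate bus numbers are programmed. The following is only a
minimal sketch of the usual depth-first numbering, not the libqos code from
PATCH 1. The Bridge type and assign_buses() are hypothetical names made up for
this example, and the resource-reserve capability handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model of a PCI-to-PCI bridge (not a libqos type). */
typedef struct Bridge {
    uint8_t primary;          /* bus the bridge sits on */
    uint8_t secondary;        /* bus directly behind the bridge */
    uint8_t subordinate;      /* highest bus number reachable behind it */
    struct Bridge *children;  /* bridges plugged on the secondary bus */
    size_t nr_children;
} Bridge;

/*
 * Number the buses behind 'bridges' depth first.
 * 'bus' is the bus the bridges sit on; '*last_bus' is the highest bus
 * number handed out so far and is updated in place.
 */
static void assign_buses(Bridge *bridges, size_t n, uint8_t bus,
                         uint8_t *last_bus)
{
    for (size_t i = 0; i < n; i++) {
        Bridge *b = &bridges[i];

        assert(*last_bus < UINT8_MAX);  /* only 256 bus numbers exist */
        b->primary = bus;
        b->secondary = ++(*last_bus);
        /* open the window fully while the children are being scanned */
        b->subordinate = UINT8_MAX;
        assign_buses(b->children, b->nr_children, b->secondary, last_bus);
        /* close the window on the highest bus found below this bridge */
        b->subordinate = *last_bus;
    }
}
```

With two root ports on bus 0 and one extra bridge behind the first port, this
hands out buses 1 and 2 to the first branch and bus 3 to the second port.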

v7:
- merge patch 3 and 4 as the fix for ACPI unplug has been merged
- address Thomas' comments
- add a dependency on slirp in meson.build
- check FAILOVER_NEGOTIATED device-id and MIGRATION status
  on destination, update UNPLUG_PRIMARY event checking
- fix an object_unref() in test_migrate_abort_active()
- fix typo s/whan/when/

v6:
- manage more than 2 root ports
- add a function to check if a card is available or not
- check migration state
- add cancelled migration test cases
- rename tests

v5:
- re-add the wait-unplug test that has been removed from v4 by mistake.

v4:
- rely on query-migrate status to know the migration state rather than
  waiting for the STOP event.
- remove the patch to add time out to qtest_qmp_eventwait()

v3:
- fix a bug with ACPI unplug and add the related test

v2:
- remove PATCH 1 that introduced a function that can be replaced by
  qobject_to_json_pretty() (Markus)
- Add migration to a file and from the file to check the card is
  correctly unplugged on the source, and hotplugged on the dest
- Add an ACPI call to eject the card as the kernel would do

Laurent Vivier (4):
  qtest/libqos: add a function to initialize secondary PCI buses
  tests/qtest: add some tests for virtio-net failover
  test/libqtest: add some virtio-net failover migration cancelling tests
  tests/libqtest: add a migration test with two couples of failover
devices

 include/hw/pci/pci_bridge.h       |    8 +
 tests/qtest/libqos/pci.c          |  118 +++
 tests/qtest/libqos/pci.h          |    1 +
 tests/qtest/meson.build           |    4 +
 tests/qtest/virtio-net-failover.c | 1324 +
 5 files changed, 1455 insertions(+)
 create mode 100644 tests/qtest/virtio-net-failover.c

-- 
2.33.1

Re: [PULL 0/1] tcg patch queue for 6.2


On 12/7/21 6:39 AM, Richard Henderson wrote:

The following changes since commit 7635eff97104242d618400e4b6746d0a5c97af82:

   Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into staging (2021-12-06 11:18:06 -0800)

are available in the Git repository at:

   https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20211207

for you to fetch changes up to b9537d5904f6e3df896264a6144883ab07db9608:

   tcg/arm: Reduce vector alignment requirement for NEON (2021-12-07 06:32:09 -0800)


Fix stack spills for arm neon.


Richard Henderson (1):
   tcg/arm: Reduce vector alignment requirement for NEON

  tcg/tcg.c                |  8 +++-
  tcg/arm/tcg-target.c.inc | 13 +
  2 files changed, 16 insertions(+), 5 deletions(-)


Applied, thanks.

r~

[PATCH v7 3/4] test/libqtest: add some virtio-net failover migration cancelling tests

Add some tests to check the state of the machine if the migration
is cancelled while we are using virtio-net failover.

Signed-off-by: Laurent Vivier 
Acked-by: Thomas Huth 
---
 tests/qtest/virtio-net-failover.c | 276 ++
 1 file changed, 276 insertions(+)

diff --git a/tests/qtest/virtio-net-failover.c b/tests/qtest/virtio-net-failover.c
index 7444d30d2900..f3d4ba69f51b 100644
--- a/tests/qtest/virtio-net-failover.c
+++ b/tests/qtest/virtio-net-failover.c
@@ -735,6 +735,274 @@ static void test_migrate_in(gconstpointer opaque)
 machine_stop(qts);
 }
 
+static void test_migrate_abort_wait_unplug(gconstpointer opaque)
+{
+QTestState *qts;
+QDict *resp, *args, *ret;
+g_autofree gchar *uri = g_strdup_printf("exec: cat > %s", (gchar *)opaque);
+const gchar *status;
+
+qts = machine_start(BASE_MACHINE
+ "-netdev user,id=hs0 "
+ "-netdev user,id=hs1 ",
+ 2);
+
+check_one_card(qts, false, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+
+qtest_qmp_device_add(qts, "virtio-net", "standby0",
+ "{'bus': 'root0',"
+ "'failover': 'on',"
+ "'netdev': 'hs0',"
+ "'mac': '"MAC_STANDBY0"'}");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+
+start_virtio_net(qts, 1, 0, "standby0");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+
+qtest_qmp_device_add(qts, "virtio-net", "primary0",
+ "{'bus': 'root1',"
+ "'failover_pair_id': 'standby0',"
+ "'netdev': 'hs1',"
+ "'rombar': 0,"
+ "'romfile': '',"
+ "'mac': '"MAC_PRIMARY0"'}");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, true, "primary0", MAC_PRIMARY0);
+
+args = qdict_from_jsonf_nofail("{}");
+g_assert_nonnull(args);
+qdict_put_str(args, "uri", uri);
+
+resp = qtest_qmp(qts, "{ 'execute': 'migrate', 'arguments': %p}", args);
+g_assert(qdict_haskey(resp, "return"));
+qobject_unref(resp);
+
+/* the event is sent when QEMU asks the OS to unplug the card */
+resp = get_unplug_primary_event(qts);
+g_assert_cmpstr(qdict_get_str(resp, "device-id"), ==, "primary0");
+qobject_unref(resp);
+
+resp = qtest_qmp(qts, "{ 'execute': 'migrate_cancel' }");
+g_assert(qdict_haskey(resp, "return"));
+qobject_unref(resp);
+
+/* migration has been cancelled while the unplug was in progress */
+
+/* while the card is not ejected, we must be in "cancelling" state */
+ret = migrate_status(qts);
+
+status = qdict_get_str(ret, "status");
+g_assert_cmpstr(status, ==, "cancelling");
+qobject_unref(ret);
+
+/* OS unplugs the cards, QEMU can move from wait-unplug state */
+qtest_outl(qts, ACPI_PCIHP_ADDR_ICH9 + PCI_EJ_BASE, 1);
+
+while (true) {
+ret = migrate_status(qts);
+
+status = qdict_get_str(ret, "status");
+if (strcmp(status, "cancelled") == 0) {
+break;
+}
+g_assert_cmpstr(status, !=, "failed");
+g_assert_cmpstr(status, !=, "active");
+qobject_unref(ret);
+}
+qobject_unref(ret);
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, true, "primary0", MAC_PRIMARY0);
+
+machine_stop(qts);
+}
+
+static void test_migrate_abort_active(gconstpointer opaque)
+{
+QTestState *qts;
+QDict *resp, *args, *ret;
+g_autofree gchar *uri = g_strdup_printf("exec: cat > %s", (gchar *)opaque);
+const gchar *status;
+
+qts = machine_start(BASE_MACHINE
+ "-netdev user,id=hs0 "
+ "-netdev user,id=hs1 ",
+ 2);
+
+check_one_card(qts, false, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+
+qtest_qmp_device_add(qts, "virtio-net", "standby0",
+ "{'bus': 'root0',"
+ "'failover': 'on',"
+ "'netdev': 'hs0',"
+ "'mac': '"MAC_STANDBY0"'}");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+
+start_virtio_net(qts, 1, 0, "standby0");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+
+qtest_qmp_device_add(qts, "virtio-net", "primary0",
+ "{'bus': 'root1',"
+ "'failover_pair_id': 'standby0',"
+ "'netdev': 'hs1',"
+ "'rombar': 0,"
+ "'romfile': '',"
+

[PATCH v7 4/4] tests/libqtest: add a migration test with two couples of failover devices

Signed-off-by: Laurent Vivier 
Acked-by: Thomas Huth 
---
 tests/qtest/virtio-net-failover.c | 277 ++
 1 file changed, 277 insertions(+)

diff --git a/tests/qtest/virtio-net-failover.c b/tests/qtest/virtio-net-failover.c
index f3d4ba69f51b..43d32ce17ba4 100644
--- a/tests/qtest/virtio-net-failover.c
+++ b/tests/qtest/virtio-net-failover.c
@@ -20,6 +20,7 @@
 
 #define ACPI_PCIHP_ADDR_ICH9 0x0cc0
 #define PCI_EJ_BASE 0x0008
+#define PCI_SEL_BASE 0x0010
 
 #define BASE_MACHINE "-M q35 -nodefaults " \
 "-device pcie-root-port,id=root0,addr=0x1,bus=pcie.0,chassis=1 " \
@@ -27,6 +28,8 @@
 
 #define MAC_PRIMARY0 "52:54:00:11:11:11"
 #define MAC_STANDBY0 "52:54:00:22:22:22"
+#define MAC_PRIMARY1 "52:54:00:33:33:33"
+#define MAC_STANDBY1 "52:54:00:44:44:44"
 
 static QGuestAllocator guest_malloc;
 static QPCIBus *pcibus;
@@ -1003,6 +1006,276 @@ static void test_migrate_abort_timeout(gconstpointer opaque)
 machine_stop(qts);
 }
 
+static void test_multi_out(gconstpointer opaque)
+{
+QTestState *qts;
+QDict *resp, *args, *ret;
+g_autofree gchar *uri = g_strdup_printf("exec: cat > %s", (gchar *)opaque);
+const gchar *status, *expected;
+
+qts = machine_start(BASE_MACHINE
+"-device pcie-root-port,id=root2,addr=0x3,bus=pcie.0,chassis=3 "
+"-device pcie-root-port,id=root3,addr=0x4,bus=pcie.0,chassis=4 "
+"-netdev user,id=hs0 "
+"-netdev user,id=hs1 "
+"-netdev user,id=hs2 "
+"-netdev user,id=hs3 ",
+4);
+
+check_one_card(qts, false, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+check_one_card(qts, false, "standby1", MAC_STANDBY1);
+check_one_card(qts, false, "primary1", MAC_PRIMARY1);
+
+qtest_qmp_device_add(qts, "virtio-net", "standby0",
+ "{'bus': 'root0',"
+ "'failover': 'on',"
+ "'netdev': 'hs0',"
+ "'mac': '"MAC_STANDBY0"'}");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+check_one_card(qts, false, "standby1", MAC_STANDBY1);
+check_one_card(qts, false, "primary1", MAC_PRIMARY1);
+
+qtest_qmp_device_add(qts, "virtio-net", "primary0",
+ "{'bus': 'root1',"
+ "'failover_pair_id': 'standby0',"
+ "'netdev': 'hs1',"
+ "'rombar': 0,"
+ "'romfile': '',"
+ "'mac': '"MAC_PRIMARY0"'}");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, false, "primary0", MAC_PRIMARY0);
+check_one_card(qts, false, "standby1", MAC_STANDBY1);
+check_one_card(qts, false, "primary1", MAC_PRIMARY1);
+
+start_virtio_net(qts, 1, 0, "standby0");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, true, "primary0", MAC_PRIMARY0);
+check_one_card(qts, false, "standby1", MAC_STANDBY1);
+check_one_card(qts, false, "primary1", MAC_PRIMARY1);
+
+qtest_qmp_device_add(qts, "virtio-net", "standby1",
+ "{'bus': 'root2',"
+ "'failover': 'on',"
+ "'netdev': 'hs2',"
+ "'mac': '"MAC_STANDBY1"'}");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, true, "primary0", MAC_PRIMARY0);
+check_one_card(qts, true, "standby1", MAC_STANDBY1);
+check_one_card(qts, false, "primary1", MAC_PRIMARY1);
+
+qtest_qmp_device_add(qts, "virtio-net", "primary1",
+ "{'bus': 'root3',"
+ "'failover_pair_id': 'standby1',"
+ "'netdev': 'hs3',"
+ "'rombar': 0,"
+ "'romfile': '',"
+ "'mac': '"MAC_PRIMARY1"'}");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, true, "primary0", MAC_PRIMARY0);
+check_one_card(qts, true, "standby1", MAC_STANDBY1);
+check_one_card(qts, false, "primary1", MAC_PRIMARY1);
+
+start_virtio_net(qts, 3, 0, "standby1");
+
+check_one_card(qts, true, "standby0", MAC_STANDBY0);
+check_one_card(qts, true, "primary0", MAC_PRIMARY0);
+check_one_card(qts, true, "standby1", MAC_STANDBY1);
+check_one_card(qts, true, "primary1", MAC_PRIMARY1);
+
+args = qdict_from_jsonf_nofail("{}");
+g_assert_nonnull(args);
+qdict_put_str(args, "uri", uri);
+
+resp = qtest_qmp(qts, "{ 'execute': 'migrate', 'arguments': %p}", args);
+g_assert(qdict_haskey(resp, "return"));
+qobject_unref(resp);
+
+/* the event is sent when QEMU asks the OS to unplug the card */
+resp = get_unplug_primary_event(qts);
+if (strcmp(qdict_get_str(resp, "device-id"), "primary0") ==

RE: [PATCH for-7.0 2/4] target/hexagon/cpu.h: don't include qemu-common.h

2021-12-07 Thread Taylor Simpson




> -----Original Message-----
> From: Peter Maydell 
> Sent: Monday, November 29, 2021 2:05 PM
> To: qemu-...@nongnu.org; qemu-devel@nongnu.org
> Cc: Paolo Bonzini ; Sergio Lopez ;
> Taylor Simpson ; Yoshinori Sato
> ; Marcel Apfelbaum
> 
> Subject: [PATCH for-7.0 2/4] target/hexagon/cpu.h: don't include qemu-
> common.h
> 
> The qemu-common.h header is not supposed to be included from any other
> header files, only from .c files (as documented in a comment at the start of
> it).
> 
> Move the include to linux-user/hexagon/cpu_loop.c, which needs it for the
> declaration of cpu_exec_step_atomic().
> 
> Signed-off-by: Peter Maydell 
> ---
>  target/hexagon/cpu.h  | 1 -
>  linux-user/hexagon/cpu_loop.c | 1 +
>  2 files changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Taylor Simpson

Re: [PATCH RFC 00/11] vl: Explore redesign of startup


Hi Markus,

It looks promising. I did not think we could so "easily" have a new 
working startup. But I'm not so sure that I understand how we should 
progress from here.


I see 3 main parts in this:
A. introducing new binary (meson, ...)
B. startup api: phase related stuff (maybe more)
C. cli to qmp parser

I think if we want to add a new binary (instead of replacing the existing
one), there will be some common api that every startup will have to
support/implement. Probably some part of vl.c will have to go into some
common code. In practice, we probably should introduce/extract this before
introducing the new binary.


One central part of this api is the phase mechanism (even if legacy 
startup can only support it partially or not-at-all).


I think we have 2 choices:
+ we have to use until_phase explicitly
+ we make qmp commands implicitly advance phases when needed.

I think it's better to go the implicit way as much as possible: it means 
we focus on commands and not on some artificial phases we set up because 
of legacy.


Either way, we probably should put the phase info in qapi so that we 
don't have to hardcode that in every command in order to have common 
error handling. One thing we could do is replace "allow-preconfig" in 
qapi by some phase requirement entry(entries?) and make qmp call 
qemu_until_phase() or some qemu_phase_check() function.
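
To make the idea concrete, here is a minimal sketch of a phase requirement
that is checked in one common place before a command handler runs. It is only
an illustration under assumptions: the Phase values, CommandInfo,
until_phase() and phase_check() are hypothetical names for this example, not
what the series or the QAPI generator provide.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical phase list; the real set of phases is defined by the series. */
typedef enum Phase {
    PHASE_NO_MACHINE,
    PHASE_MACHINE_CREATED,
    PHASE_MACHINE_INITIALIZED,
    PHASE_MACHINE_READY,
} Phase;

/* Per-command phase requirement, as it could be generated from the schema. */
typedef struct CommandInfo {
    const char *name;
    Phase min_phase;   /* earliest phase in which the command may run */
    Phase max_phase;   /* latest phase in which the command may run */
} CommandInfo;

static Phase current_phase = PHASE_NO_MACHINE;

/* Stand-in for qemu_until_phase(): here it only records the new phase. */
static void until_phase(Phase target)
{
    if (target > current_phase) {
        current_phase = target;
    }
}

/*
 * Common check run by the dispatcher before any command handler.
 * With 'implicit' set, the phase is advanced when the command needs a
 * later one; otherwise the command is rejected.
 * Returns true if the command may run.
 */
static bool phase_check(const CommandInfo *cmd, bool implicit)
{
    assert(cmd->min_phase <= cmd->max_phase);

    if (current_phase > cmd->max_phase) {
        return false;           /* too late, phases never go backwards */
    }
    if (current_phase < cmd->min_phase) {
        if (!implicit) {
            return false;       /* caller must issue until-phase first */
        }
        until_phase(cmd->min_phase);
    }
    return true;
}
```

The 'implicit' flag is there only to show both choices side by side; with the
implicit way, a command that needs an initialized machine would simply pull
the startup forward instead of failing.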


We may also need to sort out whether we want to merge the phases into the
runstate.


Thanks for making the effort to do this rfc,
--
Damien

On 12/2/21 08:04, Markus Armbruster wrote:

These patches are meant to back the memo "Redesign of QEMU startup &
initial configuration" I just posted.  Read that first, please.

My running example for initial configuration via QMP is cold plug.  It
works at the end of the series.

I'm taking a number of shortcuts:

* I hack up qemu-system-FOO instead of creating an alternate program.
   Just so I don't have to mess with Meson.

* Instead of creating QAPI/CLI infrastructure, I use QMP as CLI: each
   argument is interpreted as QMP command.  This is certainly bad CLI,
   but should suffice to demonstrate things.

* Instead of feeding the CLI's QMP commands to the main loop via a
   quasi-monitor, I send them directly to the QMP dispatcher.  Simpler,
   but I'm not sure that's going to work for all QMP commands we want.

* Phase advance is by explicit command @until-phase only.  Carelessly
   named.  We may want some other commands to advance the phase
   automatically.

* There are no safeguards.  You *can* run QMP commands in phases where
   they crash.  Data corruption is left as an exercise for the reader.

* Possibly more I can't remember right now :)

Markus Armbruster (11):
   vl: Cut off the CLI with an axe
   vl: Drop x-exit-preconfig
   vl: Hardcode a QMP monitor on stdio for now
   vl: Hardcode a VGA device for now
   vl: Demonstrate (bad) CLI wrapped around QMP
   vl: Factor qemu_until_phase() out of qemu_init()
   vl: Implement qemu_until_phase() running from arbitrary phase
   vl: Implement qemu_until_phase() running to arbitrary phase
   vl: New QMP command until-phase
   vl: Disregard lack of 'allow-preconfig': true
   vl: Enter main loop in phase @machine-initialized

  qapi/misc.json             |   27 -
  qapi/phase.json            |   31 +
  qapi/qapi-schema.json      |    1 +
  include/hw/qdev-core.h     |   31 -
  hw/core/machine-qmp-cmds.c |    1 +
  hw/core/machine.c          |    1 +
  hw/core/qdev.c             |    7 +
  hw/pci/pci.c               |    1 +
  hw/usb/core.c              |    1 +
  hw/virtio/virtio-iommu.c   |    1 +
  monitor/hmp-cmds.c         |    8 -
  monitor/hmp.c              |    1 +
  softmmu/qdev-monitor.c     |    3 +
  softmmu/vl.c               | 2833 ++--
  ui/console.c               |    1 +
  MAINTAINERS                |    1 +
  hmp-commands.hx            |   18 -
  qapi/meson.build           |    1 +
  18 files changed, 180 insertions(+), 2788 deletions(-)
  create mode 100644 qapi/phase.json

Re: [RFC v3 0/4] tls: add macros for coroutine-safe TLS variables

On Tue, Dec 07, 2021 at 01:55:34PM +, Peter Maydell wrote:
> On Tue, 7 Dec 2021 at 13:53, Stefan Hajnoczi  wrote:
> >
> > On Mon, Dec 06, 2021 at 02:34:45PM +, Peter Maydell wrote:
> > > On Mon, 6 Dec 2021 at 14:33, Stefan Hajnoczi  wrote:
> > > >
> > > > v3:
> > > > - Added __attribute__((weak)) to get_ptr_*() [Florian]
> > >
> > > Do we really need it *only* on get_ptr_*() ? If we need to
> > > noinline the other two we probably also should use the same
> > > attribute weak to force no optimizations at all.
> >
> > The weak attribute can't be used on static functions, so I think we need
> > a different approach:
> >
> > In file included from ../util/async.c:35:
> > /builds/stefanha/qemu/include/qemu/coroutine-tls.h:201:11: error: weak declaration of 'get_ptr_my_aiocontext' must be public
> >  type *get_ptr_##var(void) \
> >        ^~~~
> > ../util/async.c:673:1: note: in expansion of macro 'QEMU_DEFINE_STATIC_CO_TLS'
> >  QEMU_DEFINE_STATIC_CO_TLS(AioContext *, my_aiocontext)
> >  ^
> >
> > Adding asm volatile("") seems to work though:
> > https://godbolt.org/z/3hn8Gh41d
> 
> You can see in the clang disassembly there that this isn't
> sufficient. The compiler puts in both calls, but it ignores
> the return results and always returns "true" from the function.

You're right! I missed that the return value of the call isn't used >_<.

Stefan



Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-07 Thread Wouter Verhelst

On Mon, Dec 06, 2021 at 05:00:47PM -0600, Eric Blake wrote:
> On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > >    Simple reply message
> > > 
> > >   The simple reply message MUST be sent by the server in response to all
> > >   requests if structured replies have not been negotiated using
> > > > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a simple
> > > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > > -but only if the reply has no data payload.  The message looks as
> > > -follows:
> > > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > > +negotiated, a simple reply MAY be used as a reply to any request other
> > > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > +the message looks as follows:
> > > 
> > >   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> > >  `NBD_REPLY_MAGIC`)
> > > @@ -369,6 +398,16 @@ S: 64 bits, handle
> > >   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > >   *error* is zero)
> > > 
> > > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > +the message looks like:
> > > +
> > > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> > > +S: 32 bits, error (MAY be zero)
> > > +S: 64 bits, handle
> > > +S: 128 bits, padding (MUST be zero)
> > > +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > > +*error* is zero)
> > > +
> > 
> > If we go this way, let's put payload length into padding: it will help to 
> > make the protocol context-independent and less error-prone.

Agreed.

> Easy enough to do (the payload length will be 0 except for
> NBD_CMD_READ).

Indeed.
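
For reference, the layout quoted above (before any change to the padding) can
be serialized as in the following sketch. The magic value and the field sizes
are taken from the proposal; the helper names are made up for this example,
and the handle is treated as a plain 64-bit integer for simplicity. The
suggestion to carry the payload length in the padding is not modeled, since
its exact position has not been specified in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBD_SIMPLE_REPLY_EXT_MAGIC 0x60d12fd6u  /* value from the proposal */
#define NBD_EXT_SIMPLE_REPLY_SIZE  32           /* 4 + 4 + 8 + 16 bytes */

/* Store 'v' big-endian (network byte order), as NBD does on the wire. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static void put_be64(uint8_t *p, uint64_t v)
{
    put_be32(p, (uint32_t)(v >> 32));
    put_be32(p + 4, (uint32_t)v);
}

/*
 * Fill 'buf' with an extended simple reply header.
 * Layout: magic (32), error (32), handle (64), padding (128, all zero).
 */
static void build_ext_simple_reply(uint8_t *buf, uint32_t error,
                                   uint64_t handle)
{
    assert(buf != NULL);
    memset(buf, 0, NBD_EXT_SIMPLE_REPLY_SIZE);   /* zeroes the padding */
    put_be32(buf, NBD_SIMPLE_REPLY_EXT_MAGIC);
    put_be32(buf + 4, error);
    put_be64(buf + 8, handle);
}
```

Any read payload would follow these 32 bytes on the wire.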

> > Or, the other way, maybe just forbid the payload for simple-64bit? What's
> > the reason to allow 64bit requests without structured reply negotiation?
> 
> The two happened to be orthogonal enough in my implementation.  It was
> easy to demonstrate either one without the other, and it IS easier to
> write a client using non-structured replies (structured reads ARE
> tougher than simple reads, even if it is less efficient when it comes
> to reading zeros).  But you are also right that we could require
> structured reads prior to allowing 64-bit operations, and then have
> only one supported reply type on the wire when negotiated.  Wouter,
> which way do you prefer?

Given that I *still* haven't gotten around to implementing structured
replies for nbd-server, I'd prefer not to require it, but that's not
really a decent argument IMO :-)

[... I haven't read this in much detail yet, intend to do that later...]

-- 
 w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}

Re: [PATCH] tests/docker: add libfuse3 development headers

2021-12-07 Thread Richard W.M. Jones

On Tue, Dec 07, 2021 at 04:00:25PM +, Stefan Hajnoczi wrote:
...
> diff --git a/tests/docker/dockerfiles/centos8.docker b/tests/docker/dockerfiles/centos8.docker
> index 7f135f8e8c..a2dae4be29 100644
> --- a/tests/docker/dockerfiles/centos8.docker
> +++ b/tests/docker/dockerfiles/centos8.docker
> @@ -19,6 +19,7 @@ ENV PACKAGES \
>  device-mapper-multipath-devel \
>  diffutils \
>  findutils \
> +fuse3-devel \
>  gcc \
>  gcc-c++ \
>  genisoimage \

Just for my own notes, it took me a while to work out that CentOS 8
does have fuse3.  It didn't appear in EPEL 8 etc:

https://src.fedoraproject.org/rpms/fuse3
https://ci.centos.org/search/?q=fuse3

However it turns out it is built from a source package called "fuse"
(version 2.9.7!)  Also I am able to install fuse3 on RHEL 8.  So I
guess that's OK in the end.

The rest of the changes look good too, so:

Acked-by: Richard W.M. Jones 

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

[PATCH] tests/docker: add libfuse3 development headers

The FUSE exports feature is not built because most container images do
not have libfuse3 development headers installed. Add the necessary
packages to the Dockerfiles.

Cc: Hanna Reitz 
Cc: Richard W.M. Jones 
Signed-off-by: Stefan Hajnoczi 
---
 tests/docker/dockerfiles/alpine.docker| 1 +
 tests/docker/dockerfiles/centos8.docker   | 1 +
 tests/docker/dockerfiles/fedora.docker| 1 +
 tests/docker/dockerfiles/opensuse-leap.docker | 1 +
 tests/docker/dockerfiles/ubuntu.docker| 1 +
 tests/docker/dockerfiles/ubuntu2004.docker| 1 +
 6 files changed, 6 insertions(+)

diff --git a/tests/docker/dockerfiles/alpine.docker b/tests/docker/dockerfiles/alpine.docker
index 7e6997e301..9ddb3c2ebc 100644
--- a/tests/docker/dockerfiles/alpine.docker
+++ b/tests/docker/dockerfiles/alpine.docker
@@ -12,6 +12,7 @@ ENV PACKAGES \
ccache \
coreutils \
curl-dev \
+   fuse3-dev \
g++ \
gcc \
git \
diff --git a/tests/docker/dockerfiles/centos8.docker b/tests/docker/dockerfiles/centos8.docker
index 7f135f8e8c..a2dae4be29 100644
--- a/tests/docker/dockerfiles/centos8.docker
+++ b/tests/docker/dockerfiles/centos8.docker
@@ -19,6 +19,7 @@ ENV PACKAGES \
 device-mapper-multipath-devel \
 diffutils \
 findutils \
+fuse3-devel \
 gcc \
 gcc-c++ \
 genisoimage \
diff --git a/tests/docker/dockerfiles/fedora.docker b/tests/docker/dockerfiles/fedora.docker
index c6fd7e1113..a3a712c87b 100644
--- a/tests/docker/dockerfiles/fedora.docker
+++ b/tests/docker/dockerfiles/fedora.docker
@@ -20,6 +20,7 @@ ENV PACKAGES \
 device-mapper-multipath-devel \
 diffutils \
 findutils \
+fuse3-devel \
 gcc \
 gcc-c++ \
 gcovr \
diff --git a/tests/docker/dockerfiles/opensuse-leap.docker b/tests/docker/dockerfiles/opensuse-leap.docker
index 3bbdb67f4f..2beb61bd7e 100644
--- a/tests/docker/dockerfiles/opensuse-leap.docker
+++ b/tests/docker/dockerfiles/opensuse-leap.docker
@@ -15,6 +15,7 @@ ENV PACKAGES \
 dbus-1 \
 diffutils \
 findutils \
+fuse3-devel \
 gcc \
 gcc-c++ \
 gcovr \
diff --git a/tests/docker/dockerfiles/ubuntu.docker b/tests/docker/dockerfiles/ubuntu.docker
index f0e0180d21..0c694a2bf0 100644
--- a/tests/docker/dockerfiles/ubuntu.docker
+++ b/tests/docker/dockerfiles/ubuntu.docker
@@ -29,6 +29,7 @@ ENV PACKAGES \
 libepoxy-dev \
 libfdt-dev \
 libffi-dev \
+libfuse3-dev \
 libgbm-dev \
 libgnutls28-dev \
 libgtk-3-dev \
diff --git a/tests/docker/dockerfiles/ubuntu2004.docker b/tests/docker/dockerfiles/ubuntu2004.docker
index 15a026be09..a46feaecdd 100644
--- a/tests/docker/dockerfiles/ubuntu2004.docker
+++ b/tests/docker/dockerfiles/ubuntu2004.docker
@@ -34,6 +34,7 @@ ENV PACKAGES \
 libepoxy-dev \
 libfdt-dev \
 libffi-dev \
+libfuse3-dev \
 libgbm-dev \
 libgcrypt20-dev \
 libglib2.0-dev \
-- 
2.33.1

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation





On 12/7/21 16:45, Peter Maydell wrote:

On Tue, 7 Dec 2021 at 15:24, Peter Maydell  wrote:

The bug is a bug in any case and we'll fix it, it's just a
question of whether it meets the bar to go into 6.2, which is
hopefully going to have its final RC tagged today. If this
patch had arrived a week ago then the bar would have been
lower and it would definitely have gone in. As it is I have
to weigh up the chances of this change causing a regression
for eg KVM running on emulated QEMU.


Looking at the KVM source it doesn't ever set the LRENPIE
bit (it doesn't even have a #define for it), which both
explains why we didn't notice this bug before and also
means we can be pretty certain we're not going to cause a
regression for KVM at least if we fix it...

-- PMM



We are perfectly fine with this not going into 6.2.

--
Damien

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

On Tue, 7 Dec 2021 at 15:24, Peter Maydell  wrote:
> The bug is a bug in any case and we'll fix it, it's just a
> question of whether it meets the bar to go into 6.2, which is
> hopefully going to have its final RC tagged today. If this
> patch had arrived a week ago then the bar would have been
> lower and it would definitely have gone in. As it is I have
> to weigh up the chances of this change causing a regression
> for eg KVM running on emulated QEMU.

Looking at the KVM source it doesn't ever set the LRENPIE
bit (it doesn't even have a #define for it), which both
explains why we didn't notice this bug before and also
means we can be pretty certain we're not going to cause a
regression for KVM at least if we fix it...

-- PMM

Re: [PATCH v2 0/1] migration: multifd live migration improvement




On 12/7/21 3:16 PM, Daniel P. Berrangé wrote:

On Tue, Dec 07, 2021 at 02:45:10PM +0100, Li Zhang wrote:

On 12/6/21 8:54 PM, Dr. David Alan Gilbert wrote:

* Li Zhang (lizh...@suse.de) wrote:

When testing live migration with multifd channels (8, 16, or a bigger number)
and using qemu -incoming (without "defer"), if a network error occurs
(for example, triggering the kernel SYN flooding detection),
the migration fails and the guest hangs forever.

The test environment and the command lines are as follows:

QEMU version: QEMU emulator version 6.2.91 (v6.2.0-rc1-47-gc5fbdd60cf)
Host OS: SLE 15  with kernel: 5.14.5-1-default
Network Card: mlx5 100Gbps
Network card: Intel Corporation I350 Gigabit (1Gbps)

Source:
qemu-system-x86_64 -M q35 -smp 32 -nographic \
  -serial telnet:10.156.208.153:4321,server,nowait \
  -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
  -monitor stdio
Dest:
qemu-system-x86_64 -M q35 -smp 32 -nographic \
  -serial telnet:10.156.208.154:4321,server,nowait \
  -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
  -monitor stdio \
  -incoming tcp:1.0.8.154:4000

(qemu) migrate_set_parameter max-bandwidth 100G
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 16

The guest hangs when executing the command: migrate -d tcp:1.0.8.154:4000.

If a network problem happens, TCP ACK is not received by destination
and the destination resets the connection with RST.

No.  Time      Source     Destination  Protocol  Length  Info
119  1.021169  1.0.8.153  1.0.8.154    TCP       1410    60166 → 4000 [PSH, ACK] Seq=65 Ack=1 Win=62720 Len=1344 TSval=1338662881 TSecr=1399531897
125  1.021181  1.0.8.154  1.0.8.153    TCP       54      4000 → 60166 [RST] Seq=1 Win=0 Len=0

kernel log:
[334520.229445] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[334562.994919] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[334695.519927] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[334734.689511] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[335687.740415] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.
[335730.013598] TCP: request_sock_TCP: Possible SYN flooding on port 4000. Sending cookies.  Check SNMP counters.

Should we document somewhere how to avoid that?  Is there something we
should be doing in the connection code to avoid it?

We should use -incoming defer on the QEMU command line instead of
-incoming ip:port.

The backlog of the socket is then set to the number of multifd channels,
and the problem doesn't happen as far as I have tested.

If we use -incoming ip:port on the QEMU command line, the backlog of the
socket is always 1, which causes the SYN flooding.

Do we send migration parameters from the src to the dst QEMU ?


No, I don't think we send migration parameters from the src to the dest 
QEMU.


I set the migration parameters on both sides separately from the qemu monitor.


There are a bunch of things that we need to set to the same
value on the src and dst. If we sent any relevant MigrationParameters
fields to the dst, when the first/main migration channel is opened, it
could validate that it is configured in a way that is compatible with
the src. If it isn't, it can drop the main channel immediately. This
would trigger the src to fail the migration and we couldn't get stuck
setting up the secondary data channels for multifd.


OK. Currently, both sides have the same parameters only if we set them to
the same values ourselves.


If we use -incoming tcp:ip:port, the backlog is 1 when the socket is
created, because multifd is disabled by default.


Here is the function which sets the backlog:

static void
socket_start_incoming_migration_internal(SocketAddress *saddr,
                                         Error **errp)
{
    QIONetListener *listener = qio_net_listener_new();
    MigrationIncomingState *mis = migration_incoming_get_current();
    size_t i;
    int num = 1;

    qio_net_listener_set_name(listener, "migration-socket-listener");

    if (migrate_use_multifd()) {
        num = migrate_multifd_channels();
    }
    ...

}

The process with -incoming tcp:ip:port is as follows:

1.   Create the QEMU process with the command line option -incoming tcp:ip:port

2.   socket_start_incoming_migration_internal() is called and the backlog is
num=1: multifd is disabled, so num = migrate_multifd_channels() is not executed


3.   Enable multifd and set the multifd parameters; the backlog is still
1, because it cannot be changed anymore.


4.   Run migration
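
The ordering problem in the steps above can be modeled with a few lines of C.
This is only a sketch to show why the backlog stays at 1: IncomingConfig and
the three functions are hypothetical names for this example, and only the
"1 unless multifd is on" decision is taken from the function quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the destination's migration settings. */
typedef struct IncomingConfig {
    bool multifd;          /* multifd capability, off by default */
    int multifd_channels;
} IncomingConfig;

/* Same decision as in the quoted function: 1 unless multifd is on. */
static int listen_backlog(const IncomingConfig *cfg)
{
    int num = 1;

    if (cfg->multifd) {
        num = cfg->multifd_channels;
    }
    return num;
}

/* -incoming tcp:ip:port: the socket is created before the monitor is used. */
static int backlog_incoming_uri(IncomingConfig *cfg, int channels)
{
    int backlog;

    assert(channels > 0);
    backlog = listen_backlog(cfg);       /* listener created at startup */
    cfg->multifd = true;                 /* set later from the monitor... */
    cfg->multifd_channels = channels;
    return backlog;                      /* ...too late for the backlog */
}

/* -incoming defer: capabilities are set first, then the listener is made. */
static int backlog_incoming_defer(IncomingConfig *cfg, int channels)
{
    assert(channels > 0);
    cfg->multifd = true;
    cfg->multifd_channels = channels;
    return listen_backlog(cfg);          /* listener created afterwards */
}
```

With 16 channels, the first path ends up with a backlog of 1 and the second
with a backlog of 16.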

The process with -incoming defer is as follows:

1.

RE: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

2021-12-07 Thread Brian Cain



> -----Original Message-----
> From: Peter Maydell 
...
> On Tue, 7 Dec 2021 at 15:18, Brian Cain  wrote:
> > Peter Maydell wrote:
> > > I won't try to put this into 6.2 unless you have a common guest
> > > that runs into this bug.
> 
> > I know that Qualcomm encounters this issue with its hypervisor
> (https://github.com/quic/gunyah-hypervisor).  Apologies for not being familiar
> -- "common guest" means multiple guest systems/OSs that encounter the
> issue?  Does that mean that it would not suffice to demonstrate the issue for
> the one known case?
> 
> It means "if you see this with a Linux, BSD etc guest that's
> more important than if you see this with some oddball thing
> nobody else is using and whose users aren't as likely to be
> using release versions of QEMU rather than mainline".

Ok, gotcha, thanks for the clarification :)

> The bug is a bug in any case and we'll fix it, it's just a
> question of whether it meets the bar to go into 6.2, which is
> hopefully going to have its final RC tagged today. If this
> patch had arrived a week ago then the bar would have been
> lower and it would definitely have gone in. As it is I have
> to weigh up the chances of this change causing a regression
> for eg KVM running on emulated QEMU.

I understand, and it sounds like the right call.

Thanks!
-Brian

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

On Tue, 7 Dec 2021 at 15:18, Brian Cain  wrote:
> Peter Maydell wrote:
> > I won't try to put this into 6.2 unless you have a common guest
> > that runs into this bug.

> I know that Qualcomm encounters this issue with its hypervisor 
> (https://github.com/quic/gunyah-hypervisor).  Apologies for not being 
> familiar -- "common guest" means multiple guest systems/OSs that encounter 
> the issue?  Does that mean that it would not suffice to demonstrate the issue 
> for the one known case?

It means "if you see this with a Linux, BSD etc guest that's
more important than if you see this with some oddball thing
nobody else is using and whose users aren't as likely to be
using release versions of QEMU rather than mainline".

The bug is a bug in any case and we'll fix it, it's just a
question of whether it meets the bar to go into 6.2, which is
hopefully going to have its final RC tagged today. If this
patch had arrived a week ago then the bar would have been
lower and it would definitely have gone in. As it is I have
to weigh up the chances of this change causing a regression
for eg KVM running on emulated QEMU.

thanks
-- PMM

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

On 12/7/21 15:21, Peter Maydell wrote:

On Tue, 7 Dec 2021 at 09:44, Damien Hedde  wrote:

According to the "Arm Generic Interrupt Controller Architecture
Specification GIC architecture version 3 and 4" (version G: page 345
for aarch64 or 509 for aarch32):
LRENP bit of ICH_MISR is set when ICH_HCR.LRENPIE==1 and
ICH_HCR.EOIcount is non-zero.

When only LRENPIE was set (and EOI count was zero), the LRENP bit was
wrongly set and MISR value was wrong.
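
As an illustration of the rule stated above, here is a small standalone
sketch (not the QEMU implementation). The bit positions are my reading of
the specification and should be treated as assumptions: LRENPIE at bit 2 and
EOIcount in bits [31:27] of ICH_HCR, LRENP at bit 2 of ICH_MISR. The helper
names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed field positions; check them against the GICv3 specification. */
#define ICH_HCR_LRENPIE         (1u << 2)
#define ICH_HCR_EOICOUNT_SHIFT  27
#define ICH_HCR_EOICOUNT_MASK   0x1fu
#define ICH_MISR_LRENP          (1u << 2)

/* LRENP is reported only if LRENPIE is set AND EOIcount is non-zero. */
static uint32_t misr_lrenp(uint32_t ich_hcr)
{
    uint32_t eoicount = (ich_hcr >> ICH_HCR_EOICOUNT_SHIFT) &
                        ICH_HCR_EOICOUNT_MASK;
    bool lrenpie = (ich_hcr & ICH_HCR_LRENPIE) != 0;

    return (lrenpie && eoicount != 0) ? ICH_MISR_LRENP : 0;
}

/* The case from the commit message: LRENPIE set but EOIcount still zero. */
static void misr_lrenp_examples(void)
{
    assert(misr_lrenp(ICH_HCR_LRENPIE) == 0);
    assert(misr_lrenp(ICH_HCR_LRENPIE | (1u << ICH_HCR_EOICOUNT_SHIFT))
           == ICH_MISR_LRENP);
}
```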

As an additional consequence, if an hypervisor set ICH_HCR.LRENPIE,
the maintenance interrupt was constantly fired. It happens since patch
9cee1efe92 ("hw/intc: Set GIC maintenance interrupt level to only 0 or 1")
which fixed another bug about maintenance interrupt (most significant
bits of misr, including this one, were ignored in the interrupt trigger).

Fixes: 83f036fe3d ("hw/intc/arm_gicv3: Add accessors for ICH_ system registers")
Signed-off-by: Damien Hedde 
---
The gic doc is available here:
https://developer.arm.com/documentation/ihi0069/g

v2: identical resend because subject screw-up (sorry)

Reviewed-by: Peter Maydell 

I won't try to put this into 6.2 unless you have a common guest
that runs into this bug.

thanks
-- PMM

I don't know if this fits into "common guest" but my use case is:

> ./build/qemu-system-aarch64 \
> -machine virt,virtualization=on,gic-version=3,highmem=off  \
> -cpu max -m size=4G -smp cpus=8 -nographic  \
> -kernel hypvm.elf   \
> -device loader,file=Image,addr=0x4108  \
> -device loader,file=virt_512M.dtb,addr=0x4420

where Image is a buildroot-compiled kernel and hypvm.elf is a
hypervisor from Qualcomm (https://github.com/quic/gunyah-hypervisor).

It boots fine on v6.0 or v6.1 but hangs on master.

It's the same hypervisor Brian is talking about.

Thanks,
Damien

RE: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

2021-12-07 Thread Brian Cain

> -Original Message-
> From: Qemu-devel 
> On Behalf Of Peter Maydell
...
> On Tue, 7 Dec 2021 at 09:44, Damien Hedde 
> wrote:
> >
> > According to the "Arm Generic Interrupt Controller Architecture
> > Specification GIC architecture version 3 and 4" (version G: page 345
> > for aarch64 or 509 for aarch32):
> > LRENP bit of ICH_MISR is set when ICH_HCR.LRENPIE==1 and
> > ICH_HCR.EOIcount is non-zero.
> >
> > When only LRENPIE was set (and EOI count was zero), the LRENP bit was
> > wrongly set and MISR value was wrong.
> >
> > As an additional consequence, if an hypervisor set ICH_HCR.LRENPIE,
> > the maintenance interrupt was constantly fired. It happens since patch
> > 9cee1efe92 ("hw/intc: Set GIC maintenance interrupt level to only 0 or 1")
> > which fixed another bug about maintenance interrupt (most significant
> > bits of misr, including this one, were ignored in the interrupt trigger).
> >
> > Fixes: 83f036fe3d ("hw/intc/arm_gicv3: Add accessors for ICH_ system
> registers")
> > Signed-off-by: Damien Hedde 
> > ---
> > The gic doc is available here:
> > https://developer.arm.com/documentation/ihi0069/g
> >
> > v2: identical resend because subject screw-up (sorry)
> 
> Reviewed-by: Peter Maydell 
> 
> I won't try to put this into 6.2 unless you have a common guest
> that runs into this bug.

Peter,

I know that Qualcomm encounters this issue with its hypervisor 
(https://github.com/quic/gunyah-hypervisor).  Apologies for not being familiar 
-- "common guest" means multiple guest systems/OSs that encounter the issue?  
Does that mean that it would not suffice to demonstrate the issue for the one 
known case?

-Brian

Re: [PATCH v1 2/2] osdep: support mempolicy for preallocation in os_mem_prealloc

2021-12-07 Thread David Hildenbrand

On 07.12.21 14:58, Daniil Tatianin wrote:
> I believe you're right. Looking at the implementation of
> shmem_alloc_page, it uses the inode policy, which is set via
> vma->set_policy (from the mbind() call in this case). set_mempolicy is
> both useless and redundant here, as thread's
> policy is only ever used in case vma->get_policy returns NULL (which it
> doesn't in our case).
> Sorry for the confusion.

Hi Daniil,

not an issue, the man page is really confusing ... so I was similarly
confused a few months ago until I actually started to dig :)

-- 
Thanks,

David / dhildenb

Re: [PATCH v1 1/1] osdep: asynchronous teardown for shutdown on Linux

2021-12-07 Thread Halil Pasic

On Mon, 6 Dec 2021 11:47:55 +
Daniel P. Berrangé  wrote:

> On Mon, Dec 06, 2021 at 12:43:12PM +0100, Claudio Imbrenda wrote:
> > On Mon, 6 Dec 2021 11:21:10 +
> > Daniel P. Berrangé  wrote:
> >   
> > > On Mon, Dec 06, 2021 at 12:06:11PM +0100, Claudio Imbrenda wrote:  
> > > > This patch adds support for asynchronously tearing down a VM on Linux.
> > > > 
> > > > When qemu terminates, either naturally or because of a fatal signal,
> > > > the VM is torn down. If the VM is huge, it can take a considerable
> > > > amount of time for it to be cleaned up. In case of a protected VM, it
> > > > might take even longer than a non-protected VM (this is the case on
> > > > s390x, for example).
> > > > 
> > > > Some users might want to shut down a VM and restart it immediately,
> > > > without having to wait. This is especially true if management
> > > > infrastructure like libvirt is used.
> > > > 
> > > > This patch implements a simple trick on Linux to allow qemu to return
> > > > immediately, with the teardown of the VM being performed
> > > > asynchronously.
> > > > 
> > > > If the new commandline option -async-teardown is used, a new process is
> > > > spawned from qemu using the clone syscall, so that it will share its
> > > > address space with qemu.
> > > > 
> > > > The new process will then wait until qemu terminates, and then it will
> > > > exit itself.
> > > > 
> > > > This allows qemu to terminate quickly, without having to wait for the
> > > > whole address space to be torn down. The teardown process will exit
> > > > after qemu, so it will be the last user of the address space, and
> > > > therefore it will take care of the actual teardown.
> > > > 
> > > > The teardown process will share the same cgroups as qemu, so both
> > > > memory usage and cpu time will be accounted properly.
> > > 
> > > If this suggested workaround has any benefit to the shutdown of a VM
> > > with libvirt, then it is a bug in libvirt IMHO.
> > > 
> > > When libvirt tears down a QEMU VM, it should be waiting for *every*
> > > process in the VM's cgroup to be terminated before it reports that
> > > the VM is shutoff. IOW, the fact that this workaround lets the main
> > > QEMU process exit quickly should not matter. libvirt should still
> > > be blocked in exactly the same place in its code, waiting for the
> > > "async" cleanup process to exit. IOW, this should not be async at
> > > all from libvirt's POV.  
> > 
> > interesting, I did not know that about libvirt.
> > 
> > maybe libvirt could be fixed/improved to allow this patch to work?  
> 
> That would not be desirable. When libvirt reports a VM as shutoff,
> it is expected that all resources associated with the VM have been
> fully released, such that they are available for launching a new
> VM.  We can't allow resources to be asynchronously released as that
> violates app's expectation that the resources are released once the
> VM is shutoff.

I do see your point. But I believe, a part of the problem is that
currently 'can start VM again' and 'all resources associated with
the previous run of the VM were cleaned up' are tied together. And
intuitively it makes a ton of sense. It is just that due to certain
reasons complete shutdown with complete cleanup takes way too long. So
we are looking for a solution to decouple the two. I believe complete
cleanup is inherently hard, so we should not hope to solve that. Do you
agree?

Under the assumption that we won't be able to make the cleanup (making
all the resources available again) fast enough, I believe the only way
forward is to come up with a solution in which, if the user explicitly
says so, the assumption you just laid out no longer has to hold.
Maybe something like enlightening the management software about this
'non-interchangeable resources are released, interchangeable resources
not yet fully released' state and adding something like a 'force-start'
operation, where the user explicitly opts into potentially consuming more
resources (because of the overlap) for less downtime.

What do you think?

> 
> > surely without this patch an asynchronous teardown will not be possible
> > at all  
> 
> I appreciate that the current slow teardown is a pain, but async
> teardown does not sound like an appealing alternative given that
> the app can't use the resources again until the teardown is
> complete.

I don't fully agree with this. I think this statement disregards that
some resources are non-interchangeable in a sense that we need the exact
same resource free, while other resources are interchangeable in a sense
that we don't care which instance we get as long as we get enough
instances from a certain class. When we stop a VM and then start the same
VM again, we don't expect to get the same chunks of memory we had before,
but we just allocate new memory.

Yes we may run into trouble, but we may not. As long as we don't just
change this under the hood, but make it an option that somebody has
to consciously choose

[PULL 1/1] tcg/arm: Reduce vector alignment requirement for NEON

With arm32, the ABI gives us 8-byte alignment for the stack.
While it's possible to realign the stack to provide 16-byte alignment,
it's far easier to simply not encode 16-byte alignment in the
VLD1 and VST1 instructions that we emit.

Remove the assertion in temp_allocate_frame, limit natural alignment
to the provided stack alignment, and add a comment.
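
A minimal standalone sketch of the clamping described above (not the tcg.c
code itself): STACK_ALIGN is an assumed stand-in for TCG_TARGET_STACK_ALIGN
on arm32, and frame_slot_offset() is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_ALIGN 8   /* assumption: 8-byte stack alignment from the ABI */

/* Round n up to a multiple of d, where d is a power of two. */
static intptr_t round_up_pow2(intptr_t n, intptr_t d)
{
    return (n + d - 1) & -d;
}

/*
 * Pick the offset for a new spill slot: the natural alignment of the
 * type is limited to what the stack actually guarantees.
 */
static intptr_t frame_slot_offset(intptr_t current_offset,
                                  intptr_t natural_align)
{
    intptr_t align;

    assert(natural_align > 0 && (natural_align & (natural_align - 1)) == 0);
    align = natural_align < STACK_ALIGN ? natural_align : STACK_ALIGN;
    return round_up_pow2(current_offset, align);
}
```

For a 16-byte vector and a current offset of 20, the slot lands at 24 rather
than 32, which is only safe because the emitted load/store no longer asks for
16-byte alignment.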

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1999878
Reported-by: Richard W.M. Jones 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Richard Henderson 
Message-Id: <20210912174925.200132-1-richard.hender...@linaro.org>
Message-Id: <20211206191335.230683-2-richard.hender...@linaro.org>
---
 tcg/tcg.c                |  8 +++++++-
 tcg/arm/tcg-target.c.inc | 13 +++++++++----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 57f17a4649..934aa8510b 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -3061,7 +3061,13 @@ static void temp_allocate_frame(TCGContext *s, TCGTemp *ts)
         g_assert_not_reached();
     }
 
-    assert(align <= TCG_TARGET_STACK_ALIGN);
+    /*
+     * Assume the stack is sufficiently aligned.
+     * This affects e.g. ARM NEON, where we have 8 byte stack alignment
+     * and do not require 16 byte vector alignment.  This seems slightly
+     * easier than fully parameterizing the above switch statement.
+     */
+    align = MIN(TCG_TARGET_STACK_ALIGN, align);
     off = ROUND_UP(s->current_frame_offset, align);
 
     /* If we've exhausted the stack frame, restart with a smaller TB. */
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index 633b8a37ba..9d322cdba6 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -2523,8 +2523,13 @@ static void tcg_out_ld(TCGContext *s, TCGType type, TCGReg arg,
         tcg_out_vldst(s, INSN_VLD1 | 0x7d0, arg, arg1, arg2);
         return;
     case TCG_TYPE_V128:
-        /* regs 2; size 8; align 16 */
-        tcg_out_vldst(s, INSN_VLD1 | 0xae0, arg, arg1, arg2);
+        /*
+         * We have only 8-byte alignment for the stack per the ABI.
+         * Rather than dynamically re-align the stack, it's easier
+         * to simply not request alignment beyond that.  So:
+         * regs 2; size 8; align 8
+         */
+        tcg_out_vldst(s, INSN_VLD1 | 0xad0, arg, arg1, arg2);
         return;
     default:
         g_assert_not_reached();
@@ -2543,8 +2548,8 @@ static void tcg_out_st(TCGContext *s, TCGType type, TCGReg arg,
         tcg_out_vldst(s, INSN_VST1 | 0x7d0, arg, arg1, arg2);
         return;
     case TCG_TYPE_V128:
-        /* regs 2; size 8; align 16 */
-        tcg_out_vldst(s, INSN_VST1 | 0xae0, arg, arg1, arg2);
+        /* See tcg_out_ld re alignment: regs 2; size 8; align 8 */
+        tcg_out_vldst(s, INSN_VST1 | 0xad0, arg, arg1, arg2);
         return;
     default:
         g_assert_not_reached();
-- 
2.25.1
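
For readers wondering what distinguishes 0xae0 from 0xad0 in the patch above,
here is a small decoding sketch. The field layout is my reading of the A32
VLD1/VST1 (multiple single elements) encoding and is an assumption to verify
against the Arm ARM: size in bits [7:6], align in bits [5:4]. The helper names
are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Alignment requested by the align field, in bytes; 0 means "none". */
static unsigned vldst_align_bytes(uint32_t insn_bits)
{
    unsigned field = (insn_bits >> 4) & 0x3;

    return field ? 4u << field : 0;     /* 01 -> 8, 10 -> 16, 11 -> 32 */
}

/* Element size selected by the size field, in bytes. */
static unsigned vldst_elem_bytes(uint32_t insn_bits)
{
    return 1u << ((insn_bits >> 6) & 0x3);
}

/* The two constants from the patch differ only in the align field. */
static void vldst_examples(void)
{
    assert(vldst_align_bytes(0xae0) == 16 && vldst_elem_bytes(0xae0) == 8);
    assert(vldst_align_bytes(0xad0) == 8 && vldst_elem_bytes(0xad0) == 8);
}
```

Under that reading, the old constant asked for 16-byte alignment and the new
one for 8 bytes, which matches the "align 16" and "align 8" comments in the
diff.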

[PULL 0/1] tcg patch queue for 6.2

The following changes since commit 7635eff97104242d618400e4b6746d0a5c97af82:

  Merge tag 'block-pull-request' of https://gitlab.com/stefanha/qemu into 
staging (2021-12-06 11:18:06 -0800)

are available in the Git repository at:

  https://gitlab.com/rth7680/qemu.git tags/pull-tcg-20211207

for you to fetch changes up to b9537d5904f6e3df896264a6144883ab07db9608:

  tcg/arm: Reduce vector alignment requirement for NEON (2021-12-07 06:32:09 
-0800)


Fix stack spills for arm neon.


Richard Henderson (1):
  tcg/arm: Reduce vector alignment requirement for NEON

 tcg/tcg.c                |  8 +++++++-
 tcg/arm/tcg-target.c.inc | 13 +++++++++----
 2 files changed, 16 insertions(+), 5 deletions(-)

Re: [PATCH 07/14] ppc/pnv: Introduce a num_pecs class attribute for PHB4 PEC devices

2021-12-07 Thread Cédric Le Goater


On 12/7/21 15:03, Frederic Barrat wrote:



On 07/12/2021 11:45, Cédric Le Goater wrote:

On 12/7/21 11:00, Frederic Barrat wrote:



On 02/12/2021 15:42, Cédric Le Goater wrote:

POWER9 processor comes with 3 PHB4 PECs (PCI Express Controller) and
each PEC can have several PHBs :

   * PEC0 provides 1 PHB  (PHB0)
   * PEC1 provides 2 PHBs (PHB1 and PHB2)
   * PEC2 provides 3 PHBs (PHB3, PHB4 and PHB5)

A num_pecs class attribute represents better the logic units of the
POWER9 chip. Use that instead of num_phbs which fits POWER8 chips.
This will ease adding support for user created devices.
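
To illustrate the PEC/PHB layout listed above, here is a small standalone
sketch (not taken from the patch; the table and helper names are made up)
that derives the total PHB count and maps a PHB index back to its PEC.

```c
#include <assert.h>
#include <stddef.h>

/* PHBs provided by PEC0, PEC1 and PEC2 on POWER9, as listed above. */
static const int phbs_per_pec[] = { 1, 2, 3 };
#define NUM_PECS (sizeof(phbs_per_pec) / sizeof(phbs_per_pec[0]))

static int total_phbs(void)
{
    int total = 0;

    for (size_t i = 0; i < NUM_PECS; i++) {
        total += phbs_per_pec[i];
    }
    return total;
}

/* Return the PEC that provides the given PHB, or -1 if there is none. */
static int pec_of_phb(int phb)
{
    int first = 0;

    assert(phb >= 0);
    for (size_t i = 0; i < NUM_PECS; i++) {
        if (phb < first + phbs_per_pec[i]) {
            return (int)i;
        }
        first += phbs_per_pec[i];
    }
    return -1;
}
```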

Signed-off-by: Cédric Le Goater 
---


With this patch, chip->num_phbs is only defined and used on P8. We may want to 
add a comment to make it clear.


Yes.

With the latest changes, I think we can now move num_phbs under PnvChip8
and num_pecs under PnvChip9 since they are only used in these routines :

P8:
 static void pnv_chip_power8_instance_init(Object *obj)
 chip->num_phbs = pcc->num_phbs;
 for (i = 0; i < chip->num_phbs; i++) {

 static void pnv_chip_power8_realize(DeviceState *dev, Error **errp)
 for (i = 0; i < chip->num_phbs; i++) {
P9:
 static void pnv_chip_power9_instance_init(Object *obj)
 chip->num_pecs = pcc->num_pecs;
 for (i = 0; i < chip->num_pecs; i++) {

 static void pnv_chip_power9_phb_realize(PnvChip *chip, Error **errp)
 for (i = 0; i < chip->num_pecs; i++) {


As I review this series, something is bugging me though: the difference of 
handling between P8 and P9.
On P9, we seem to have a more logical hierarchy:
phb <- PCI controller (PEC) <- chip


Yes. It's cleaner than P8 in terms of logic. P8 initial support was
done hastily for skiboot bringup in 2014.


With P8, we don't have an explicit PEC, but we have a PBCQ object, which is 
somewhat similar. The hierarchy seems also more convoluted.


But we don't have stacks on P8. Do we ?



Stacks were introduced on P9 because all the lanes handled by a PEC could be 
grouped differently, each group being called a stack. And each stack is 
associated to a PHB.
On P8, there's no such split, so the doc didn't mention stacks. But each PEC 
handles exactly one PHB. So we could still keep the same abstractions.
On all chips, a PEC would really be equal to a pbcq interface to the power bus. 
The pbcq is servicing one (on P8) or more (on P9/P10) PHBs.




I don't see why it's treated differently. It seems both chips could be treated 
the same, which would make the code easier to follow.


I agree. Daniel certainly would also :)

That's outside of the scope of this series though. 


Well, this patchset enables libvirt support for the PowerNV machines.
Once this is pushed, we need to keep the API, the object model names
being part of it.

7.0 is a good time for a change, really. After that, we won't be able
to change the QOM hierarchy that much.


So maybe for a future patch? Who knows, I might volunteer...


You would introduce a phb3-pec on top of the phb3s ?


Or rename pnv_phb3_pbcq.c to pnv_phb3_pec.c and start from there. 
Conceptually, the TYPE_PNV_PBCQ and TYPE_PNV_PHB4_PEC_STACK objects seem close. 
But that's easy to say in an email...


It's a start.

Here is the PHB3 QOM tree:

  /pnv-phb3[0] (pnv-phb3)
    /lsi (ics)
    /msi (phb3-msi)
    /msi32[0] (memory-region)
    /msi64[0] (memory-region)
    /pbcq (pnv-pbcq)
      /pbcq-mmio0[0] (memory-region)
      /pbcq-mmio1[0] (memory-region)
      /pbcq-phb[0] (memory-region)
      /xscom-pbcq-nest-0.0[0] (memory-region)
      /xscom-pbcq-pci-0.0[0] (memory-region)
      /xscom-pbcq-spci-0.0[0] (memory-region)
    /pci-io[0] (memory-region)
    /pci-mmio[0] (memory-region)
    /pcie-mmcfg-mmio[0] (memory-region)
    /phb3-m32[0] (memory-region)
    /phb3-regs[0] (memory-region)
    /phb3_iommu[0] (pnv-phb3-iommu-memory-region)
    /root (pnv-phb3-root-port)
      /bus master container[0] (memory-region)
      /bus master[0] (memory-region)
      /pci_bridge_io[0] (memory-region)
      /pci_bridge_io[1] (memory-region)
      /pci_bridge_mem[0] (memory-region)
      /pci_bridge_pci[0] (memory-region)
      /pci_bridge_pref_mem[0] (memory-region)
      /pci_bridge_vga_io_hi[0] (memory-region)
      /pci_bridge_vga_io_lo[0] (memory-region)
      /pci_bridge_vga_mem[0] (memory-region)
      /pcie.0 (PCIE)
    /root-bus (pnv-phb3-root-bus)

We would swap 'pnv-phb3' and 'pnv-pbcq' and rename it to 'pnv-phb3-pec'.
Looks good to me. This should clarify the relationship between objects.

I never liked the back pointer to the phb under pbcq:

(qemu) qom-list /machine/chip[0]/pnv-phb3[0]/pbcq
type (string)
parent_bus (link)
realized (bool)
hotplugged (bool)
hotpluggable (bool)
pbcq-mmio0[0] (child)
xscom-pbcq-spci-0.0[0] (child)
xscom-pbcq-nest-0.0[0] (child)
pbcq-mmio1[0] (child)
phb (link)
pbcq-phb[0] (child)

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

On Tue, 7 Dec 2021 at 09:44, Damien Hedde  wrote:
>
> According to the "Arm Generic Interrupt Controller Architecture
> Specification GIC architecture version 3 and 4" (version G: page 345
> for aarch64 or 509 for aarch32):
> LRENP bit of ICH_MISR is set when ICH_HCR.LRENPIE==1 and
> ICH_HCR.EOIcount is non-zero.
>
> When only LRENPIE was set (and EOI count was zero), the LRENP bit was
> wrongly set and MISR value was wrong.
>
> As an additional consequence, if an hypervisor set ICH_HCR.LRENPIE,
> the maintenance interrupt was constantly fired. It happens since patch
> 9cee1efe92 ("hw/intc: Set GIC maintenance interrupt level to only 0 or 1")
> which fixed another bug about maintenance interrupt (most significant
> bits of misr, including this one, were ignored in the interrupt trigger).
>
> Fixes: 83f036fe3d ("hw/intc/arm_gicv3: Add accessors for ICH_ system 
> registers")
> Signed-off-by: Damien Hedde 
> ---
> The gic doc is available here:
> https://developer.arm.com/documentation/ihi0069/g
>
> v2: identical resend because subject screw-up (sorry)

Reviewed-by: Peter Maydell 

I won't try to put this into 6.2 unless you have a common guest
that runs into this bug.

thanks
-- PMM

Re: [PATCH v2 0/1] migration: multifd live migration improvement

2021-12-07 Thread Daniel P . Berrangé

On Tue, Dec 07, 2021 at 02:45:10PM +0100, Li Zhang wrote:
> 
> On 12/6/21 8:54 PM, Dr. David Alan Gilbert wrote:
> > * Li Zhang (lizh...@suse.de) wrote:
> > > When testing live migration with multifd channels (8, 16, or a bigger 
> > > number)
> > > and using qemu -incoming (without "defer"), if a network error occurs
> > > (for example, triggering the kernel SYN flooding detection),
> > > the migration fails and the guest hangs forever.
> > > 
> > > > The test environment and the command lines are as follows:
> > > > 
> > > > QEMU version: QEMU emulator version 6.2.91 (v6.2.0-rc1-47-gc5fbdd60cf)
> > > Host OS: SLE 15  with kernel: 5.14.5-1-default
> > > Network Card: mlx5 100Gbps
> > > Network card: Intel Corporation I350 Gigabit (1Gbps)
> > > 
> > > Source:
> > > qemu-system-x86_64 -M q35 -smp 32 -nographic \
> > >  -serial telnet:10.156.208.153:4321,server,nowait \
> > >  -m 4096 -enable-kvm -hda 
> > > /var/lib/libvirt/images/openSUSE-15.3.img \
> > >  -monitor stdio
> > > Dest:
> > > qemu-system-x86_64 -M q35 -smp 32 -nographic \
> > >  -serial telnet:10.156.208.154:4321,server,nowait \
> > >  -m 4096 -enable-kvm -hda 
> > > /var/lib/libvirt/images/openSUSE-15.3.img \
> > >  -monitor stdio \
> > >  -incoming tcp:1.0.8.154:4000
> > > 
> > > (qemu) migrate_set_parameter max-bandwidth 100G
> > > (qemu) migrate_set_capability multifd on
> > > (qemu) migrate_set_parameter multifd-channels 16
> > > 
> > > The guest hangs when executing the command: migrate -d tcp:1.0.8.154:4000.
> > > 
> > > If a network problem happens, TCP ACK is not received by destination
> > > and the destination resets the connection with RST.
> > > 
> > > > No.  Time      Source     Destination  Protocol  Length  Info
> > > > 119  1.021169  1.0.8.153  1.0.8.154    TCP       1410    60166 → 4000 [PSH, ACK] Seq=65 Ack=1 Win=62720 Len=1344 TSval=1338662881 TSecr=1399531897
> > > > 125  1.021181  1.0.8.154  1.0.8.153    TCP       54      4000 → 60166 [RST] Seq=1 Win=0 Len=0
> > > 
> > > kernel log:
> > > [334520.229445] TCP: request_sock_TCP: Possible SYN flooding on port 
> > > 4000. Sending cookies.  Check SNMP counters.
> > > [334562.994919] TCP: request_sock_TCP: Possible SYN flooding on port 
> > > 4000. Sending cookies.  Check SNMP counters.
> > > [334695.519927] TCP: request_sock_TCP: Possible SYN flooding on port 
> > > 4000. Sending cookies.  Check SNMP counters.
> > > [334734.689511] TCP: request_sock_TCP: Possible SYN flooding on port 
> > > 4000. Sending cookies.  Check SNMP counters.
> > > [335687.740415] TCP: request_sock_TCP: Possible SYN flooding on port 
> > > 4000. Sending cookies.  Check SNMP counters.
> > > [335730.013598] TCP: request_sock_TCP: Possible SYN flooding on port 
> > > 4000. Sending cookies.  Check SNMP counters.
> > Should we document somewhere how to avoid that?  Is there something we
> > should be doing in the connection code to avoid it?
> 
> We should use -incoming defer on the QEMU command line instead of
> -incoming ip:port.
> 
> The backlog of the socket is then set to the number of multifd channels,
> and the problem does not happen as far as I have tested.
> 
> If we use -incoming ip:port on the QEMU command line, the backlog of the
> socket is always 1, which causes the SYN flooding.

Do we send migration parameters from the src to the dst QEMU ?

There are a bunch of things that we need to set to the same
value on the src and dst. If we sent any relevant MigrationParameters
fields to the dst when the first/main migration channel is opened, it
could validate that it is configured in a way that is compatible with
the src. If it isn't, it can drop the main channel immediately. This
would trigger the src to fail the migration and we wouldn't get stuck
setting up the secondary data channels for multifd.
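
A rough sketch of the kind of check meant here. Nothing below exists in QEMU
today: struct mig_settings and settings_compatible() are hypothetical, and the
fields that would really need comparing are an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the settings the src would send on the main channel. */
struct mig_settings {
    bool multifd;
    int multifd_channels;
};

/*
 * Run on the dst when the main channel opens. A false result would
 * mean "drop the main channel now" instead of waiting for secondary
 * channels that can never be set up consistently.
 */
static bool settings_compatible(const struct mig_settings *src,
                                const struct mig_settings *dst)
{
    assert(src != NULL && dst != NULL);
    if (src->multifd != dst->multifd) {
        return false;
    }
    if (src->multifd && src->multifd_channels != dst->multifd_channels) {
        return false;
    }
    return true;
}
```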

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [PATCH v7 1/7] net/vmnet: add vmnet dependency and customizable option

2021-12-07 Thread Markus Armbruster

No cover letter?

Re: [PATCH 07/14] ppc/pnv: Introduce a num_pecs class attribute for PHB4 PEC devices

2021-12-07 Thread Frederic Barrat





On 07/12/2021 11:45, Cédric Le Goater wrote:

On 12/7/21 11:00, Frederic Barrat wrote:



On 02/12/2021 15:42, Cédric Le Goater wrote:

POWER9 processor comes with 3 PHB4 PECs (PCI Express Controller) and
each PEC can have several PHBs :

   * PEC0 provides 1 PHB  (PHB0)
   * PEC1 provides 2 PHBs (PHB1 and PHB2)
   * PEC2 provides 3 PHBs (PHB3, PHB4 and PHB5)

A num_pecs class attribute represents better the logic units of the
POWER9 chip. Use that instead of num_phbs which fits POWER8 chips.
This will ease adding support for user created devices.

Signed-off-by: Cédric Le Goater 
---


With this patch, chip->num_phbs is only defined and used on P8. We may 
want to add a comment to make it clear.


Yes.

With the latest changes, I think we can now move num_phbs under PnvChip8
and num_pecs under PnvChip9 since they are only used in these routines :

P8:
     static void pnv_chip_power8_instance_init(Object *obj)
     chip->num_phbs = pcc->num_phbs;
     for (i = 0; i < chip->num_phbs; i++) {

     static void pnv_chip_power8_realize(DeviceState *dev, Error **errp)
     for (i = 0; i < chip->num_phbs; i++) {
P9:
     static void pnv_chip_power9_instance_init(Object *obj)
     chip->num_pecs = pcc->num_pecs;
     for (i = 0; i < chip->num_pecs; i++) {

     static void pnv_chip_power9_phb_realize(PnvChip *chip, Error **errp)
     for (i = 0; i < chip->num_pecs; i++) {

As I review this series, something is bugging me though: the 
difference of handling between P8 and P9.

On P9, we seem to have a more logical hierarchy:
phb <- PCI controller (PEC) <- chip


Yes. It's cleaner than P8 in terms of logic. P8 initial support was
done hastily for skiboot bringup in 2014.

With P8, we don't have an explicit PEC, but we have a PBCQ object, 
which is somewhat similar. The hierarchy seems also more convoluted.


But we don't have stacks on P8. Do we ?



Stacks were introduced on P9 because all the lanes handled by a PEC 
could be grouped differently, each group being called a stack. And each 
stack is associated to a PHB.
On P8, there's no such split, so the doc didn't mention stacks. But each 
PEC handles exactly one PHB. So we could still keep the same abstractions.
On all chips, a PEC would really be equal to a pbcq interface to the 
power bus. The pbcq is servicing one (on P8) or more (on P9/P10) PHBs.




I don't see why it's treated differently. It seems both chips could be 
treated the same, which would make the code easier to follow.


I agree. Daniel certainly would also :)

That's outside of the scope of this series though. 


Well, this patchset enables libvirt support for the PowerNV machines.
Once this is pushed, we need to keep the API, the object model names
being part of it.

7.0 is a good time for a change, really. After that, we won't be able
to change the QOM hierarchy that much.


So maybe for a future patch? Who knows, I might volunteer...


You would introduce a phb3-pec on top of the phb3s ?



Or rename pnv_phb3_pbcq.c to pnv_phb3_pec.c and start from there. 
Conceptually, the TYPE_PNV_PBCQ and TYPE_PNV_PHB4_PEC_STACK objects seem 
close. But that's easy to say in an email...


  Fred



Let me send a v2 first and maybe we could rework the object hierarchy
in the 7.0 time frame. We don't have to merge this ASAP.

Thanks,

C.

Re: [PATCH v1 2/2] osdep: support mempolicy for preallocation in os_mem_prealloc

2021-12-07 Thread Daniil Tatianin

I believe you're right. Looking at the implementation of shmem_alloc_page,
it uses the inode policy, which is set via vma->set_policy (from the mbind()
call in this case). set_mempolicy is both useless and redundant here, as
thread's policy is only ever used in case vma->get_policy returns NULL
(which it doesn't in our case).

Sorry for the confusion.

Thanks,
Daniil

07.12.2021, 11:13, "David Hildenbrand" :
> On 07.12.21 08:06, Daniil Tatianin wrote:
> > This is needed for cases where we want to make sure that a shared memory
> > region gets allocated from a specific NUMA node. This is impossible to do
> > with mbind(2) because it ignores the policy for memory mapped with
> > MAP_SHARED. We work around this by calling set_mempolicy from prealloc
> > threads instead.
>
> That's not quite true I think, how were you able to observe this? Do you
> have a reproducer?
>
> While the man page says:
>
> "The specified policy will be ignored for any MAP_SHARED mappings in
> the specified memory range. Rather the pages will be allocated
> according to the memory policy of the thread that caused the page to be
> allocated. Again, this may not be the thread that called mbind()."
>
> What it really means is that as long as we access that memory via the
> *VMA* for which we called mbind(), which is the case when *not* using
> fallocate() to preallocate memory, we end up using the correct policy.
>
> I did experiments a while ago with hugetlbfs shared memory and it
> properly allocated from the right node. So I'd be curious how you
> trigger this.
>
> --
> Thanks,
>
> David / dhildenb

Re: [RFC v3 0/4] tls: add macros for coroutine-safe TLS variables

On Tue, 7 Dec 2021 at 13:53, Stefan Hajnoczi  wrote:
>
> On Mon, Dec 06, 2021 at 02:34:45PM +, Peter Maydell wrote:
> > On Mon, 6 Dec 2021 at 14:33, Stefan Hajnoczi  wrote:
> > >
> > > v3:
> > > - Added __attribute__((weak)) to get_ptr_*() [Florian]
> >
> > Do we really need it *only* on get_ptr_*() ? If we need to
> > noinline the other two we probably also should use the same
> > attribute weak to force no optimizations at all.
>
> The weak attribute can't be used on static functions, so I think we need
> a different approach:
>
> In file included from ../util/async.c:35:
> /builds/stefanha/qemu/include/qemu/coroutine-tls.h:201:11: error: weak 
> declaration of 'get_ptr_my_aiocontext' must be public
>  type *get_ptr_##var(void)
> \
>^~~~
> ../util/async.c:673:1: note: in expansion of macro 'QEMU_DEFINE_STATIC_CO_TLS'
>  QEMU_DEFINE_STATIC_CO_TLS(AioContext *, my_aiocontext)
>  ^
>
> Adding asm volatile("") seems to work though:
> https://godbolt.org/z/3hn8Gh41d

You can see in the clang disassembly there that this isn't
sufficient. The compiler puts in both calls, but it ignores
the return results and always returns "true" from the function.

-- PMM

Re: [RFC v3 0/4] tls: add macros for coroutine-safe TLS variables

On Mon, Dec 06, 2021 at 02:34:45PM +, Peter Maydell wrote:
> On Mon, 6 Dec 2021 at 14:33, Stefan Hajnoczi  wrote:
> >
> > v3:
> > - Added __attribute__((weak)) to get_ptr_*() [Florian]
> 
> Do we really need it *only* on get_ptr_*() ? If we need to
> noinline the other two we probably also should use the same
> attribute weak to force no optimizations at all.

The weak attribute can't be used on static functions, so I think we need
a different approach:

In file included from ../util/async.c:35:
/builds/stefanha/qemu/include/qemu/coroutine-tls.h:201:11: error: weak 
declaration of 'get_ptr_my_aiocontext' must be public
 type *get_ptr_##var(void)\
   ^~~~
../util/async.c:673:1: note: in expansion of macro 'QEMU_DEFINE_STATIC_CO_TLS'
 QEMU_DEFINE_STATIC_CO_TLS(AioContext *, my_aiocontext)
 ^

Adding asm volatile("") seems to work though:
https://godbolt.org/z/3hn8Gh41d

The GCC documentation mentions combining noinline with asm(""):
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-noinline-function-attribute

Stefan



Re: [PATCH v2 1/1] multifd: Shut down the QIO channels to avoid blocking the send threads when they are terminated.




On 12/6/21 8:50 PM, Dr. David Alan Gilbert wrote:

* Li Zhang (lizh...@suse.de) wrote:

Thanks for Daniel's review.

Hi David and Juan,

Any comments for this patch?


Yeh I think that's OK, so

Reviewed-by: Dr. David Alan Gilbert 

I'd have a slight preference for it being before the post I think.


Thanks.



Dave


Thanks

Li

On 12/3/21 12:55 PM, Li Zhang wrote:

When doing live migration with 8, 16, or more multifd channels,
the guest hangs in the presence of network errors such as missing TCP ACKs.

At the sender's side:
The main thread is blocked in qemu_thread_join(). migrate_fd_cleanup()
is called because one thread fails in qio_channel_write_all() when
the network problem happens, while the other send threads are blocked in
sendmsg() and cannot be terminated. So the main thread stays blocked in
qemu_thread_join(), waiting for those threads to terminate.

(gdb) bt
0  0x7f30c8dcffc0 in __pthread_clockjoin_ex () at /lib64/libpthread.so.0
1  0x55cbb716084b in qemu_thread_join (thread=0x55cbb881f418) at 
../util/qemu-thread-posix.c:627
2  0x55cbb6b54e40 in multifd_save_cleanup () at ../migration/multifd.c:542
3  0x55cbb6b4de06 in migrate_fd_cleanup (s=0x55cbb8024000) at 
../migration/migration.c:1808
4  0x55cbb6b4dfb4 in migrate_fd_cleanup_bh (opaque=0x55cbb8024000) at 
../migration/migration.c:1850
5  0x55cbb7173ac1 in aio_bh_call (bh=0x55cbb7eb98e0) at ../util/async.c:141
6  0x55cbb7173bcb in aio_bh_poll (ctx=0x55cbb7ebba80) at ../util/async.c:169
7  0x55cbb715ba4b in aio_dispatch (ctx=0x55cbb7ebba80) at 
../util/aio-posix.c:381
8  0x55cbb7173ffe in aio_ctx_dispatch (source=0x55cbb7ebba80, callback=0x0, 
user_data=0x0) at ../util/async.c:311
9  0x7f30c9c8cdf4 in g_main_context_dispatch () at 
/usr/lib64/libglib-2.0.so.0
10 0x55cbb71851a2 in glib_pollfds_poll () at ../util/main-loop.c:232
11 0x55cbb718521c in os_host_main_loop_wait (timeout=42251070366) at 
../util/main-loop.c:255
12 0x55cbb7185321 in main_loop_wait (nonblocking=0) at 
../util/main-loop.c:531
13 0x55cbb6e6ba27 in qemu_main_loop () at ../softmmu/runstate.c:726
14 0x55cbb6ad6fd7 in main (argc=68, argv=0x7ffc0c57, 
envp=0x7ffc0c578ab0) at ../softmmu/main.c:50

To make sure that the send threads can be terminated, the IO channels should
be shut down so that the threads do not keep waiting on IO.

Signed-off-by: Li Zhang 
---
   migration/multifd.c | 3 +++
   1 file changed, 3 insertions(+)

diff --git a/migration/multifd.c b/migration/multifd.c
index 7c9deb1921..33f8287969 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -523,6 +523,9 @@ static void multifd_send_terminate_threads(Error *err)
   qemu_mutex_lock(&p->mutex);
   p->quit = true;
   qemu_sem_post(&p->sem);
+if (p->c) {
+qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+}
   qemu_mutex_unlock(&p->mutex);
   }
   }
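
The effect the patch relies on can be sketched outside QEMU with plain POSIX sockets: once a socket has been shut down, a send on it fails immediately with EPIPE instead of waiting. This is only an illustration on a Linux-style AF_UNIX socketpair; the helper name send_errno_after_shutdown() is hypothetical, and it is my assumption that qio_channel_shutdown() ends up in shutdown(2) for socket channels. On Linux, shutdown(2) also wakes threads that are already blocked on the socket, which is the case that matters for the hung send threads.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: shut down one end of a socketpair and report the
 * errno that a subsequent send() on that end fails with (0 if it succeeds,
 * -1 if the socketpair could not be created).
 */
static int send_errno_after_shutdown(void)
{
    int sv[2];
    char byte = 'x';
    int err = 0;
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }

    /* Same idea as shutting down both directions of a channel. */
    rc = shutdown(sv[0], SHUT_RDWR);
    assert(rc == 0);

    /* MSG_NOSIGNAL: report EPIPE through errno instead of raising SIGPIPE. */
    if (send(sv[0], &byte, 1, MSG_NOSIGNAL) < 0) {
        err = errno;
    }

    close(sv[0]);
    close(sv[1]);
    return err;
}
```

A sender that gets EPIPE back can notice that it was asked to quit and exit, so a later thread join does not wait forever.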

Re: [PATCH v2 0/1] migration: multifd live migration improvement




On 12/6/21 8:54 PM, Dr. David Alan Gilbert wrote:

* Li Zhang (lizh...@suse.de) wrote:

When testing live migration with multifd channels (8, 16, or a bigger number)
and using qemu -incoming (without "defer"), if a network error occurs
(for example, triggering the kernel SYN flooding detection),
the migration fails and the guest hangs forever.

The test environment and the command lines are as follows:

QEMU version: QEMU emulator version 6.2.91 (v6.2.0-rc1-47-gc5fbdd60cf)
Host OS: SLE 15  with kernel: 5.14.5-1-default
Network Card: mlx5 100Gbps
Network card: Intel Corporation I350 Gigabit (1Gbps)

Source:
qemu-system-x86_64 -M q35 -smp 32 -nographic \
 -serial telnet:10.156.208.153:4321,server,nowait \
 -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
 -monitor stdio
Dest:
qemu-system-x86_64 -M q35 -smp 32 -nographic \
 -serial telnet:10.156.208.154:4321,server,nowait \
 -m 4096 -enable-kvm -hda /var/lib/libvirt/images/openSUSE-15.3.img \
 -monitor stdio \
 -incoming tcp:1.0.8.154:4000

(qemu) migrate_set_parameter max-bandwidth 100G
(qemu) migrate_set_capability multifd on
(qemu) migrate_set_parameter multifd-channels 16

The guest hangs when executing the command: migrate -d tcp:1.0.8.154:4000.

If a network problem happens, TCP ACK is not received by destination
and the destination resets the connection with RST.

No.  Time      Source     Destination  Protocol  Length  Info
119  1.021169  1.0.8.153  1.0.8.154    TCP       1410    60166 → 4000 [PSH, ACK] Seq=65 Ack=1 Win=62720 Len=1344 TSval=1338662881 TSecr=1399531897
125  1.021181  1.0.8.154  1.0.8.153    TCP       54      4000 → 60166 [RST] Seq=1 Win=0 Len=0

kernel log:
[334520.229445] TCP: request_sock_TCP: Possible SYN flooding on port 4000. 
Sending cookies.  Check SNMP counters.
[334562.994919] TCP: request_sock_TCP: Possible SYN flooding on port 4000. 
Sending cookies.  Check SNMP counters.
[334695.519927] TCP: request_sock_TCP: Possible SYN flooding on port 4000. 
Sending cookies.  Check SNMP counters.
[334734.689511] TCP: request_sock_TCP: Possible SYN flooding on port 4000. 
Sending cookies.  Check SNMP counters.
[335687.740415] TCP: request_sock_TCP: Possible SYN flooding on port 4000. 
Sending cookies.  Check SNMP counters.
[335730.013598] TCP: request_sock_TCP: Possible SYN flooding on port 4000. 
Sending cookies.  Check SNMP counters.

Should we document somewhere how to avoid that?  Is there something we
should be doing in the connection code to avoid it?


We should use -incoming defer on the QEMU command line instead of
-incoming ip:port.


With -incoming defer, the backlog of the socket is set to the number of
multifd channels, and the problem does not happen as far as I have tested.


If we use -incoming ip:port on the QEMU command line, the backlog of
the socket is always 1, which causes the SYN flooding.
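
To make the role of the backlog concrete, here is a small sketch of a listening socket whose backlog is sized to the number of expected connections. It is not QEMU's socket code; listen_with_backlog() is a hypothetical helper, and the value 16 in the usage note simply mirrors the multifd-channels setting from the test above.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a TCP listening socket on an ephemeral port
 * with the given backlog. Returns the listening fd, or -1 on failure.
 */
static int listen_with_backlog(int backlog)
{
    struct sockaddr_in addr;
    int fd;

    assert(backlog > 0);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* The backlog bounds how many pending connections the kernel queues. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With listen_with_backlog(16), sixteen channels connecting at nearly the same time fit in the queue; with a backlog of 1 most of them overflow it, which is consistent with the "Possible SYN flooding" messages in the kernel log.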





Dave


There are two problems here:
1. On the send side, the main thread is blocked in qemu_thread_join() and the
send threads are blocked in sendmsg().
2. On the receive side, the receive threads are blocked in qemu_sem_wait(),
waiting for a semaphore.

This patch fixes the first problem, and the guest no longer hangs.
There is not yet a good solution for the second problem.

Li Zhang (1):
   multifd: Shut down the QIO channels to avoid blocking the send threads
 when they are terminated.

  migration/multifd.c | 3 +++
  1 file changed, 3 insertions(+)

--
2.31.1

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation

On Tue, 7 Dec 2021 at 13:05, Damien Hedde  wrote:
> On 12/7/21 13:45, Philippe Mathieu-Daudé wrote:
> > On 12/7/21 10:44, Damien Hedde wrote:
> >> According to the "Arm Generic Interrupt Controller Architecture
> >> Specification GIC architecture version 3 and 4" (version G: page 345
> >> for aarch64 or 509 for aarch32):
> >> LRENP bit of ICH_MISR is set when ICH_HCR.LRENPIE==1 and
> >> ICH_HCR.EOIcount is non-zero.
> >>
> >> When only LRENPIE was set (and EOI count was zero), the LRENP bit was
> >> wrongly set and MISR value was wrong.
> >>
> >> As an additional consequence, if a hypervisor sets ICH_HCR.LRENPIE,
> >> the maintenance interrupt was constantly fired. It happens since patch
> >> 9cee1efe92 ("hw/intc: Set GIC maintenance interrupt level to only 0 or 1")
> >> which fixed another bug about maintenance interrupt (most significant
> >> bits of misr, including this one, were ignored in the interrupt trigger).
> >>
> >> Fixes: 83f036fe3d ("hw/intc/arm_gicv3: Add accessors for ICH_ system 
> >> registers")
> >
> > This commit predates 6.1 release, so technically this is not
> > a regression for 6.2.
>
> Do you mean "Fixes:" is meant only for regression or simply that this
> patch should not go for 6.2 ?

Fixes: is fine in all situations where the commit is fixing
a bug that was introduced in the commit hash it mentions.

Separately, given where we are in the release cycle, a patch has
to hit a very high bar to go into 6.2: at least "this breaks
a real world use case that worked fine in 6.1", and probably also
"a use case that we expect a fair number of users to be using".

-- PMM
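
The LRENP rule quoted from the GIC specification earlier in this message (LRENP is set only when ICH_HCR.LRENPIE is 1 and ICH_HCR.EOIcount is non-zero) can be sketched as follows. The macro names and bit positions are assumptions based on my reading of the architecture documentation (LRENPIE as bit 2 and EOIcount as bits [31:27] of ICH_HCR, LRENP as bit 2 of ICH_MISR). They are not QEMU's definitions and should be checked against the specification.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout; verify against the GICv3 specification. */
#define ICH_HCR_LRENPIE        (1U << 2)
#define ICH_HCR_EOICOUNT_SHIFT 27
#define ICH_HCR_EOICOUNT_MASK  (0x1fU << ICH_HCR_EOICOUNT_SHIFT)
#define ICH_MISR_LRENP         (1U << 2)

/* Return the LRENP contribution to ICH_MISR for a given ICH_HCR value. */
static uint32_t misr_lrenp(uint32_t ich_hcr)
{
    if ((ich_hcr & ICH_HCR_LRENPIE) &&
        (ich_hcr & ICH_HCR_EOICOUNT_MASK) != 0) {
        return ICH_MISR_LRENP;
    }
    return 0;
}

/* The three interesting cases: enable only, count only, and both. */
static void misr_lrenp_selfcheck(void)
{
    uint32_t one_eoi = 1U << ICH_HCR_EOICOUNT_SHIFT;

    assert(misr_lrenp(ICH_HCR_LRENPIE) == 0);
    assert(misr_lrenp(one_eoi) == 0);
    assert(misr_lrenp(ICH_HCR_LRENPIE | one_eoi) == ICH_MISR_LRENP);
}
```

The first case is the one the patch description calls out: with only LRENPIE set and EOIcount at zero, LRENP must stay clear, otherwise the maintenance interrupt fires constantly.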

Re: [RFC v3 0/4] tls: add macros for coroutine-safe TLS variables

On Mon, Dec 06, 2021 at 02:34:45PM +, Peter Maydell wrote:
> On Mon, 6 Dec 2021 at 14:33, Stefan Hajnoczi  wrote:
> >
> > v3:
> > - Added __attribute__((weak)) to get_ptr_*() [Florian]
> 
> Do we really need it *only* on get_ptr_*() ? If we need to
> noinline the other two we probably also should use the same
> attribute weak to force no optimizations at all.

I don't know but it does seem safer to use weak in all cases.

Florian and others?

Stefan



[PATCH v3 6/6] virtio: unify dataplane and non-dataplane ->handle_output()

Now that virtio-blk and virtio-scsi are ready, get rid of
the handle_aio_output() callback. It's no longer needed.

Signed-off-by: Stefan Hajnoczi 
---
 include/hw/virtio/virtio.h  |  4 +--
 hw/block/dataplane/virtio-blk.c | 16 ++
 hw/scsi/virtio-scsi-dataplane.c | 54 -
 hw/virtio/virtio.c  | 32 +--
 4 files changed, 26 insertions(+), 80 deletions(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b90095628f..f095637058 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -316,8 +316,8 @@ bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev);
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled);
 void virtio_queue_host_notifier_read(EventNotifier *n);
-void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
-VirtIOHandleOutput handle_output);
+void virtio_queue_aio_attach_host_notifier(VirtQueue *vq, AioContext *ctx);
+void virtio_queue_aio_detach_host_notifier(VirtQueue *vq, AioContext *ctx);
 VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector);
 VirtQueue *virtio_vector_next_queue(VirtQueue *vq);
 
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index a2fa407b98..49276e46f2 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -154,17 +154,6 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
 g_free(s);
 }
 
-static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev,
-VirtQueue *vq)
-{
-VirtIOBlock *s = (VirtIOBlock *)vdev;
-
-assert(s->dataplane);
-assert(s->dataplane_started);
-
-virtio_blk_handle_vq(s, vq);
-}
-
 /* Context: QEMU global mutex held */
 int virtio_blk_data_plane_start(VirtIODevice *vdev)
 {
@@ -258,8 +247,7 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 for (i = 0; i < nvqs; i++) {
 VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
-virtio_queue_aio_set_host_notifier_handler(vq, s->ctx,
-virtio_blk_data_plane_handle_output);
+virtio_queue_aio_attach_host_notifier(vq, s->ctx);
 }
 aio_context_release(s->ctx);
 return 0;
@@ -302,7 +290,7 @@ static void virtio_blk_data_plane_stop_bh(void *opaque)
 for (i = 0; i < s->conf->num_queues; i++) {
 VirtQueue *vq = virtio_get_queue(s->vdev, i);
 
-virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, NULL);
+virtio_queue_aio_detach_host_notifier(vq, s->ctx);
 }
 }
 
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 76137de67f..29575cbaf6 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -49,45 +49,6 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp)
 }
 }
 
-static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev,
-  VirtQueue *vq)
-{
-VirtIOSCSI *s = VIRTIO_SCSI(vdev);
-
-virtio_scsi_acquire(s);
-if (!s->dataplane_fenced) {
-assert(s->ctx && s->dataplane_started);
-virtio_scsi_handle_cmd_vq(s, vq);
-}
-virtio_scsi_release(s);
-}
-
-static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev,
-   VirtQueue *vq)
-{
-VirtIOSCSI *s = VIRTIO_SCSI(vdev);
-
-virtio_scsi_acquire(s);
-if (!s->dataplane_fenced) {
-assert(s->ctx && s->dataplane_started);
-virtio_scsi_handle_ctrl_vq(s, vq);
-}
-virtio_scsi_release(s);
-}
-
-static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev,
-VirtQueue *vq)
-{
-VirtIOSCSI *s = VIRTIO_SCSI(vdev);
-
-virtio_scsi_acquire(s);
-if (!s->dataplane_fenced) {
-assert(s->ctx && s->dataplane_started);
-virtio_scsi_handle_event_vq(s, vq);
-}
-virtio_scsi_release(s);
-}
-
 static int virtio_scsi_set_host_notifier(VirtIOSCSI *s, VirtQueue *vq, int n)
 {
 BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s)));
@@ -112,10 +73,10 @@ static void virtio_scsi_dataplane_stop_bh(void *opaque)
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
 int i;
 
-virtio_queue_aio_set_host_notifier_handler(vs->ctrl_vq, s->ctx, NULL);
-virtio_queue_aio_set_host_notifier_handler(vs->event_vq, s->ctx, NULL);
+virtio_queue_aio_detach_host_notifier(vs->ctrl_vq, s->ctx);
+virtio_queue_aio_detach_host_notifier(vs->event_vq, s->ctx);
 for (i = 0; i < vs->conf.num_queues; i++) {
-virtio_queue_aio_set_host_notifier_handler(vs->cmd_vqs[i], s->ctx, NULL);
+virtio_queue_aio_detach_host_notifier(vs->cmd_vqs[i], s->ctx);
 }
 }
 
@@ -176,14 +137,11 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 memory_region_transaction_commit();

[PATCH v3 1/6] aio-posix: split poll check from ready handler

Adaptive polling measures the execution time of the polling check plus
handlers called when a polled event becomes ready. Handlers can take a
significant amount of time, making it look like polling was running for
a long time when in fact the event handler was running for a long time.

For example, on Linux the io_submit(2) syscall invoked when a virtio-blk
device's virtqueue becomes ready can take tens of microseconds. This
can exceed the default polling interval (32 microseconds) and cause
adaptive polling to stop polling.

By excluding the handler's execution time from the polling check we make
the adaptive polling calculation more accurate. As a result, the event
loop now stays in polling mode where previously it would have fallen
back to file descriptor monitoring.
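
The effect of the change on the measurement can be sketched with a toy model. This is not the AioContext algorithm, which also grows and shrinks the polling time; the function names are hypothetical, the 2 microsecond poll check is an assumed figure, and the 35 microsecond handler and 32 microsecond limit are taken from the numbers in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old-style measurement: the poll check and the ready handler are timed together. */
static int64_t measured_with_handler(int64_t poll_check_ns, int64_t handler_ns)
{
    return poll_check_ns + handler_ns;
}

/* New-style measurement: only the poll check is charged to polling. */
static int64_t measured_poll_only(int64_t poll_check_ns, int64_t handler_ns)
{
    (void)handler_ns; /* handler time is deliberately excluded */
    return poll_check_ns;
}

/* Hypothetical policy: keep polling only while the measured time fits the limit. */
static bool stays_in_poll_mode(int64_t measured_ns, int64_t poll_max_ns)
{
    assert(poll_max_ns > 0);
    return measured_ns <= poll_max_ns;
}
```

With a 2000 ns poll check, a 35000 ns io_submit(2) handler and a 32000 ns limit, the combined measurement exceeds the limit while the poll-only measurement stays well below it.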

The following data was collected with virtio-blk num-queues=2
event_idx=off using an IOThread. Before:

168k IOPS, IOThread syscalls:

  9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, 
nr: 16, iocbpp: 0x7fcb9f937db0)= 16
  9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, 
count: 8) = 8
  9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, 
count: 8) = 8
  9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 
4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3
  9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, 
count: 512)= 8
  9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, 
count: 512)= 8
  9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, 
count: 512)= 8
  9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, 
nr: 32, iocbpp: 0x7fca7d0cebe0)= 32

After: 174k IOPS (+3.6%), IOThread syscalls:

  9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, 
nr: 32, iocbpp: 0x7fd0cdd62be0)= 32
  9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, 
count: 8) = 8
  9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, 
count: 8) = 8
  9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, 
nr: 32, iocbpp: 0x7fd0d0388b50)= 32

Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because
the IOThread stays in polling mode instead of falling back to file
descriptor monitoring.

As usual, polling is not implemented on Windows so this patch ignores
the new io_poll_ready() callback in aio-win32.c.

Signed-off-by: Stefan Hajnoczi 
---
 include/block/aio.h  |  4 +-
 util/aio-posix.h |  1 +
 block/curl.c | 11 ++---
 block/export/fuse.c  |  4 +-
 block/io_uring.c | 19 +
 block/iscsi.c|  4 +-
 block/linux-aio.c| 16 +---
 block/nfs.c  |  6 +--
 block/nvme.c | 51 +++
 block/ssh.c  |  4 +-
 block/win32-aio.c|  4 +-
 hw/virtio/virtio.c   | 16 +---
 hw/xen/xen-bus.c |  6 +--
 io/channel-command.c |  6 ++-
 io/channel-file.c|  3 +-
 io/channel-socket.c  |  3 +-
 migration/rdma.c |  8 ++--
 tests/unit/test-aio.c|  4 +-
 util/aio-posix.c | 89 ++--
 util/aio-win32.c |  4 +-
 util/async.c | 10 -
 util/main-loop.c |  4 +-
 util/qemu-coroutine-io.c |  5 ++-
 util/vhost-user-server.c | 11 ++---
 24 files changed, 191 insertions(+), 102 deletions(-)

diff --git a/include/block/aio.h b/include/block/aio.h
index 47fbe9d81f..5634173b12 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -469,6 +469,7 @@ void aio_set_fd_handler(AioContext *ctx,
 IOHandler *io_read,
 IOHandler *io_write,
 AioPollFn *io_poll,
+IOHandler *io_poll_ready,
 void *opaque);
 
 /* Set polling begin/end callbacks for a file descriptor that has already been
@@ -490,7 +491,8 @@ void aio_set_event_notifier(AioContext *ctx,
 EventNotifier *notifier,
 bool is_external,
 EventNotifierHandler *io_read,
-AioPollFn *io_poll);
+AioPollFn *io_poll,
+EventNotifierHandler *io_poll_ready);
 
 /* Set polling begin/end callbacks for an event notifier that has already been
  * registered with aio_set_event_notifier.  Do nothing if the event notifier is
diff --git a/util/aio-posix.h b/util/aio-posix.h
index c80c04506a..7f2c37a684 100644
--- a/util/aio-posix.h
+++ b/util/aio-posix.h
@@ -24,6 +24,7 @@ struct AioHandler {
 IOHandler *io_read;
 IOHandler *io_write;
 AioPollFn *io_poll;
+

[PATCH v3 5/6] virtio: use ->handle_output() instead of ->handle_aio_output()

The difference between ->handle_output() and ->handle_aio_output() was
that ->handle_aio_output() returned a bool return value indicating
progress. This was needed by the old polling API but now that the bool
return value is gone, the two functions can be unified.

Signed-off-by: Stefan Hajnoczi 
---
 hw/virtio/virtio.c | 33 +++--
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index c042be3935..a97a406d3c 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -125,7 +125,6 @@ struct VirtQueue
 
 uint16_t vector;
 VirtIOHandleOutput handle_output;
-VirtIOHandleOutput handle_aio_output;
 VirtIODevice *vdev;
 EventNotifier guest_notifier;
 EventNotifier host_notifier;
@@ -2300,20 +2299,6 @@ void virtio_queue_set_align(VirtIODevice *vdev, int n, int align)
 }
 }
 
-static void virtio_queue_notify_aio_vq(VirtQueue *vq)
-{
-if (vq->vring.desc && vq->handle_aio_output) {
-VirtIODevice *vdev = vq->vdev;
-
-trace_virtio_queue_notify(vdev, vq - vdev->vq, vq);
-vq->handle_aio_output(vdev, vq);
-
-if (unlikely(vdev->start_on_kick)) {
-virtio_set_started(vdev, true);
-}
-}
-}
-
 static void virtio_queue_notify_vq(VirtQueue *vq)
 {
 if (vq->vring.desc && vq->handle_output) {
@@ -2392,7 +2377,6 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 vdev->vq[i].vring.num_default = queue_size;
 vdev->vq[i].vring.align = VIRTIO_PCI_VRING_ALIGN;
 vdev->vq[i].handle_output = handle_output;
-vdev->vq[i].handle_aio_output = NULL;
 vdev->vq[i].used_elems = g_malloc0(sizeof(VirtQueueElement) *
queue_size);
 
@@ -2404,7 +2388,6 @@ void virtio_delete_queue(VirtQueue *vq)
 vq->vring.num = 0;
 vq->vring.num_default = 0;
 vq->handle_output = NULL;
-vq->handle_aio_output = NULL;
 g_free(vq->used_elems);
 vq->used_elems = NULL;
 virtio_virtqueue_reset_region_cache(vq);
@@ -3509,14 +3492,6 @@ EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq)
 return &vq->guest_notifier;
 }
 
-static void virtio_queue_host_notifier_aio_read(EventNotifier *n)
-{
-VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
-if (event_notifier_test_and_clear(n)) {
-virtio_queue_notify_aio_vq(vq);
-}
-}
-
 static void virtio_queue_host_notifier_aio_poll_begin(EventNotifier *n)
 {
 VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
@@ -3536,7 +3511,7 @@ static void virtio_queue_host_notifier_aio_poll_ready(EventNotifier *n)
 {
 VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
 
-virtio_queue_notify_aio_vq(vq);
+virtio_queue_notify_vq(vq);
 }
 
 static void virtio_queue_host_notifier_aio_poll_end(EventNotifier *n)
@@ -3551,9 +3526,8 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
 VirtIOHandleOutput handle_output)
 {
 if (handle_output) {
-vq->handle_aio_output = handle_output;
 aio_set_event_notifier(ctx, &vq->host_notifier, true,
-   virtio_queue_host_notifier_aio_read,
+   virtio_queue_host_notifier_read,
 virtio_queue_host_notifier_aio_poll,
 virtio_queue_host_notifier_aio_poll_ready);
 aio_set_event_notifier_poll(ctx, &vq->host_notifier,
@@ -3563,8 +3537,7 @@ void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
 aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL, NULL, NULL);
 /* Test and clear notifier before after disabling event,
  * in case poll callback didn't have time to run. */
-virtio_queue_host_notifier_aio_read(&vq->host_notifier);
-vq->handle_aio_output = NULL;
+virtio_queue_host_notifier_read(&vq->host_notifier);
 }
 }
 
-- 
2.33.1

[PATCH v3 2/6] virtio: get rid of VirtIOHandleAIOOutput

The virtqueue host notifier API
virtio_queue_aio_set_host_notifier_handler() polls the virtqueue for new
buffers. AioContext previously required a bool progress return value
indicating whether an event was handled or not. This is no longer
necessary because the AioContext polling API has been split into a poll
check function and an event handler function. The event handler is only
run when we know there is work to do, so it doesn't return bool.
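
The split described above can be sketched as a tiny polling pass with two callbacks per handler: a check that returns bool and a ready handler that returns nothing. This only illustrates the shape of the API; Handler, poll_once() and the FakeQueue helpers are hypothetical and far simpler than the AioContext code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool PollCheckFn(void *opaque); /* plays the role of io_poll */
typedef void ReadyFn(void *opaque);     /* plays the role of io_poll_ready */

typedef struct {
    PollCheckFn *io_poll;
    ReadyFn *io_poll_ready;
    void *opaque;
} Handler;

#define MAX_HANDLERS 8

/* Run every poll check first, then call only the handlers that reported work. */
static size_t poll_once(Handler *handlers, size_t n)
{
    bool ready[MAX_HANDLERS] = { false };
    size_t i, ran = 0;

    assert(n <= MAX_HANDLERS);
    for (i = 0; i < n; i++) {
        ready[i] = handlers[i].io_poll(handlers[i].opaque);
    }
    for (i = 0; i < n; i++) {
        if (ready[i]) {
            handlers[i].io_poll_ready(handlers[i].opaque);
            ran++;
        }
    }
    return ran;
}

/* Hypothetical stand-in for a virtqueue with pending buffers. */
typedef struct {
    int pending;
    int handled;
} FakeQueue;

static bool fake_queue_poll(void *opaque)
{
    FakeQueue *q = opaque;
    return q->pending > 0;
}

static void fake_queue_ready(void *opaque)
{
    FakeQueue *q = opaque;
    q->handled += q->pending;
    q->pending = 0;
}

/* One busy queue and one idle queue: only the busy one gets its handler run. */
static int demo_handled_after_poll(void)
{
    FakeQueue busy = { 3, 0 }, idle = { 0, 0 };
    Handler handlers[] = {
        { fake_queue_poll, fake_queue_ready, &busy },
        { fake_queue_poll, fake_queue_ready, &idle },
    };
    size_t ran = poll_once(handlers, 2);

    assert(ran == 1);
    return busy.handled + idle.handled;
}
```

Because the ready handler is only reached after its check returned true, it has no progress value to report, which is why the bool return type can go away.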

The VirtIOHandleAIOOutput function signature is now the same as
VirtIOHandleOutput. Get rid of the bool return value.

Further simplifications will be made for virtio-blk and virtio-scsi in
the next patch.

Signed-off-by: Stefan Hajnoczi 
---
 include/hw/virtio/virtio.h  |  3 +--
 hw/block/dataplane/virtio-blk.c |  4 ++--
 hw/scsi/virtio-scsi-dataplane.c | 18 ++
 hw/virtio/virtio.c  | 12 
 4 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..b90095628f 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -175,7 +175,6 @@ void virtio_error(VirtIODevice *vdev, const char *fmt, ...) GCC_FMT_ATTR(2, 3);
 void virtio_device_set_child_bus_name(VirtIODevice *vdev, char *bus_name);
 
 typedef void (*VirtIOHandleOutput)(VirtIODevice *, VirtQueue *);
-typedef bool (*VirtIOHandleAIOOutput)(VirtIODevice *, VirtQueue *);
 
 VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 VirtIOHandleOutput handle_output);
@@ -318,7 +317,7 @@ EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled);
 void virtio_queue_host_notifier_read(EventNotifier *n);
 void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
-VirtIOHandleAIOOutput handle_output);
+VirtIOHandleOutput handle_output);
 VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector);
 VirtQueue *virtio_vector_next_queue(VirtQueue *vq);
 
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index ee5a5352dc..a2fa407b98 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -154,7 +154,7 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s)
 g_free(s);
 }
 
-static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev,
+static void virtio_blk_data_plane_handle_output(VirtIODevice *vdev,
 VirtQueue *vq)
 {
 VirtIOBlock *s = (VirtIOBlock *)vdev;
@@ -162,7 +162,7 @@ static bool virtio_blk_data_plane_handle_output(VirtIODevice *vdev,
 assert(s->dataplane);
 assert(s->dataplane_started);
 
-return virtio_blk_handle_vq(s, vq);
+virtio_blk_handle_vq(s, vq);
 }
 
 /* Context: QEMU global mutex held */
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 18eb824c97..76137de67f 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -49,49 +49,43 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp)
 }
 }
 
-static bool virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev,
+static void virtio_scsi_data_plane_handle_cmd(VirtIODevice *vdev,
   VirtQueue *vq)
 {
-bool progress = false;
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
 virtio_scsi_acquire(s);
 if (!s->dataplane_fenced) {
 assert(s->ctx && s->dataplane_started);
-progress = virtio_scsi_handle_cmd_vq(s, vq);
+virtio_scsi_handle_cmd_vq(s, vq);
 }
 virtio_scsi_release(s);
-return progress;
 }
 
-static bool virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev,
+static void virtio_scsi_data_plane_handle_ctrl(VirtIODevice *vdev,
VirtQueue *vq)
 {
-bool progress = false;
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
 virtio_scsi_acquire(s);
 if (!s->dataplane_fenced) {
 assert(s->ctx && s->dataplane_started);
-progress = virtio_scsi_handle_ctrl_vq(s, vq);
+virtio_scsi_handle_ctrl_vq(s, vq);
 }
 virtio_scsi_release(s);
-return progress;
 }
 
-static bool virtio_scsi_data_plane_handle_event(VirtIODevice *vdev,
+static void virtio_scsi_data_plane_handle_event(VirtIODevice *vdev,
 VirtQueue *vq)
 {
-bool progress = false;
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
 virtio_scsi_acquire(s);
 if (!s->dataplane_fenced) {
 assert(s->ctx && s->dataplane_started);
-progress = virtio_scsi_handle_event_vq(s, vq);
+virtio_scsi_handle_event_vq(s, vq);
 }
 virtio_scsi_release(s);
-return progress;
 }
 
 static int virtio_scsi_set_host_notifier(VirtIOSCSI *s, VirtQueue *vq, int n)
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index

[PATCH v3 3/6] virtio-blk: drop unused virtio_blk_handle_vq() return value

The return value of virtio_blk_handle_vq() is no longer used. Get rid of
it. This is a step towards unifying the dataplane and non-dataplane
virtqueue handler functions.

Prepare virtio_blk_handle_output() to be used by both dataplane and
non-dataplane by making the condition for starting ioeventfd more
specific. This way it won't trigger when dataplane has already been
started.

Signed-off-by: Stefan Hajnoczi 
---
 include/hw/virtio/virtio-blk.h |  2 +-
 hw/block/virtio-blk.c  | 14 +++---
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 29655a406d..d311c57cca 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -90,7 +90,7 @@ typedef struct MultiReqBuffer {
 bool is_write;
 } MultiReqBuffer;
 
-bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq);
+void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq);
 void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh);
 
 #endif
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index f139cd7cc9..82676cdd01 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -767,12 +767,11 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
 return 0;
 }
 
-bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
+void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
 {
 VirtIOBlockReq *req;
 MultiReqBuffer mrb = {};
 bool suppress_notifications = virtio_queue_get_notification(vq);
-bool progress = false;
 
 aio_context_acquire(blk_get_aio_context(s->blk));
 blk_io_plug(s->blk);
@@ -783,7 +782,6 @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
 }
 
 while ((req = virtio_blk_get_request(s, vq))) {
-progress = true;
 if (virtio_blk_handle_request(req, &mrb)) {
 virtqueue_detach_element(req->vq, &req->elem, 0);
 virtio_blk_free_request(req);
@@ -802,19 +800,13 @@ bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
 
 blk_io_unplug(s->blk);
 aio_context_release(blk_get_aio_context(s->blk));
-return progress;
-}
-
-static void virtio_blk_handle_output_do(VirtIOBlock *s, VirtQueue *vq)
-{
-virtio_blk_handle_vq(s, vq);
 }
 
 static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
 VirtIOBlock *s = (VirtIOBlock *)vdev;
 
-if (s->dataplane) {
+if (s->dataplane && !s->dataplane_started) {
 /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
  * dataplane here instead of waiting for .set_status().
  */
@@ -823,7 +815,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 return;
 }
 }
-virtio_blk_handle_output_do(s, vq);
+virtio_blk_handle_vq(s, vq);
 }
 
 void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh)
-- 
2.33.1

[PATCH v3 4/6] virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane

Prepare virtio_scsi_handle_cmd() to be used by both dataplane and
non-dataplane by making the condition for starting ioeventfd more
specific. This way it won't trigger when dataplane has already been
started.

Signed-off-by: Stefan Hajnoczi 
---
 hw/scsi/virtio-scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 51fd09522a..34a968ecfb 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -720,7 +720,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
 /* use non-QOM casts in the data path */
 VirtIOSCSI *s = (VirtIOSCSI *)vdev;
 
-if (s->ctx) {
+if (s->ctx && !s->dataplane_started) {
 virtio_device_start_ioeventfd(vdev);
 if (!s->dataplane_fenced) {
 return;
-- 
2.33.1

[PATCH v3 0/6] aio-posix: split poll check from ready handler

v3:
- Fixed FUSE export aio_set_fd_handler() call that I missed and double-checked
  for any other missing call sites using Coccinelle [Rich]
v2:
- Cleaned up unused return values in nvme and virtio-blk [Stefano]
- Documented try_poll_mode() ready_list argument [Stefano]
- Unified virtio-blk/scsi dataplane and non-dataplane virtqueue handlers
  [Stefano]

The first patch improves AioContext's adaptive polling execution time
measurement. This can result in better performance because the algorithm makes
better decisions about when to poll versus when to fall back to file descriptor
monitoring.

The remaining patches unify the virtio-blk and virtio-scsi dataplane and
non-dataplane virtqueue handlers. This became possible because the dataplane
handler function now has the same function signature as the non-dataplane
handler function. Stefano Garzarella prompted me to make this refactoring.

Stefan Hajnoczi (6):
  aio-posix: split poll check from ready handler
  virtio: get rid of VirtIOHandleAIOOutput
  virtio-blk: drop unused virtio_blk_handle_vq() return value
  virtio-scsi: prepare virtio_scsi_handle_cmd for dataplane
  virtio: use ->handle_output() instead of ->handle_aio_output()
  virtio: unify dataplane and non-dataplane ->handle_output()

 include/block/aio.h |  4 +-
 include/hw/virtio/virtio-blk.h  |  2 +-
 include/hw/virtio/virtio.h  |  5 +-
 util/aio-posix.h|  1 +
 block/curl.c| 11 ++--
 block/export/fuse.c |  4 +-
 block/io_uring.c| 19 ---
 block/iscsi.c   |  4 +-
 block/linux-aio.c   | 16 +++---
 block/nfs.c |  6 +--
 block/nvme.c| 51 ---
 block/ssh.c |  4 +-
 block/win32-aio.c   |  4 +-
 hw/block/dataplane/virtio-blk.c | 16 +-
 hw/block/virtio-blk.c   | 14 ++
 hw/scsi/virtio-scsi-dataplane.c | 60 +++---
 hw/scsi/virtio-scsi.c   |  2 +-
 hw/virtio/virtio.c  | 73 +--
 hw/xen/xen-bus.c|  6 +--
 io/channel-command.c|  6 ++-
 io/channel-file.c   |  3 +-
 io/channel-socket.c |  3 +-
 migration/rdma.c|  8 +--
 tests/unit/test-aio.c   |  4 +-
 util/aio-posix.c| 89 +
 util/aio-win32.c|  4 +-
 util/async.c| 10 +++-
 util/main-loop.c|  4 +-
 util/qemu-coroutine-io.c|  5 +-
 util/vhost-user-server.c| 11 ++--
 30 files changed, 219 insertions(+), 230 deletions(-)

-- 
2.33.1

Re: [PATCH v2 for 6.2?] gicv3: fix ICH_MISR's LRENP computation