Re: [PATCH 5/6] pcie: fast unplug when slot power is off
On Mon, Oct 11, 2021 at 02:05:03PM +0200, Gerd Hoffmann wrote: > In case the slot is powered off (and the power indicator turned off too) > we can unplug right away, without round-trip to the guest. > > Also clear pending attention button press, there is nothing to care > about any more. > > Signed-off-by: Gerd Hoffmann > --- > hw/pci/pcie.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c > index 70fc11ba4c7d..f3ac04399969 100644 > --- a/hw/pci/pcie.c > +++ b/hw/pci/pcie.c > @@ -561,6 +561,16 @@ void pcie_cap_slot_unplug_request_cb(HotplugHandler > *hotplug_dev, > return; > } > > +if (((sltctl & PCI_EXP_SLTCTL_PIC) == PCI_EXP_SLTCTL_PWR_IND_OFF) && > +((sltctl & PCI_EXP_SLTCTL_PCC) == PCI_EXP_SLTCTL_PWR_OFF)) { > +/* slot is powered off -> unplug without round-trip to the guest */ > +pcie_cap_slot_do_unplug(hotplug_pdev); > +hotplug_event_notify(hotplug_pdev); > +pci_word_test_and_clear_mask(exp_cap + PCI_EXP_SLTSTA, > + PCI_EXP_SLTSTA_ABP); Does this handle all the things including link status etc btw? I don't remember off-hand. > +return; > +} > + > pcie_cap_slot_push_attention_button(hotplug_pdev); > } > > -- > 2.31.1
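The fast-unplug condition in the hunk above is a pair of masked comparisons on the Slot Control register. A minimal standalone sketch in Python (the register constants mirror the PCIe SLTCTL layout as found in Linux's pci_regs.h; verify against your tree before relying on them):

```python
# Slot Control register fields (values as in Linux's pci_regs.h):
PCI_EXP_SLTCTL_PIC = 0x0300          # Power Indicator Control mask (bits 9:8)
PCI_EXP_SLTCTL_PWR_IND_OFF = 0x0300  # both bits set -> indicator off
PCI_EXP_SLTCTL_PCC = 0x0400          # Power Controller Control (bit 10)
PCI_EXP_SLTCTL_PWR_OFF = 0x0400      # bit set -> slot power off

def slot_is_powered_off(sltctl: int) -> bool:
    """Mirror of the patch's condition: indicator off AND power off."""
    return ((sltctl & PCI_EXP_SLTCTL_PIC) == PCI_EXP_SLTCTL_PWR_IND_OFF and
            (sltctl & PCI_EXP_SLTCTL_PCC) == PCI_EXP_SLTCTL_PWR_OFF)

# Power off and indicator off -> the fast-unplug path would be taken:
print(slot_is_powered_off(0x0700))
# Power off but indicator blinking (PIC = 0x0200) -> normal round-trip:
print(slot_is_powered_off(0x0600))
```

Both fields must match: a blinking indicator means the guest is mid-operation, so only the fully-off state skips the round-trip.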
Re: [PATCH v4 3/3] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)
Leonardo Bras Soares Passos writes: > Hello Eric, > > On Mon, Oct 11, 2021 at 4:32 PM Eric Blake wrote: >> >> On Sat, Oct 09, 2021 at 04:56:13AM -0300, Leonardo Bras wrote: >> > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel >> > zerocopy interface. >> > >> > Change multifd_send_sync_main() so it can distinguish the last sync from >> > the setup and per-iteration ones, so a flush_zerocopy() can be called >> > at the last sync in order to make sure all RAM is sent before finishing >> > the migration. >> > >> > Also make it return -1 if flush_zerocopy() fails, in order to cancel >> > the migration process, and avoid resuming the guest in the target host >> > without receiving all current RAM. >> > >> > This will work fine on RAM migration because the RAM pages are not usually >> > freed, >> > and there is no problem on changing the pages content between async_send() >> > and >> > the actual sending of the buffer, because this change will dirty the page >> > and >> > cause it to be re-sent on a next iteration anyway. >> > >> > Given a lot of locked memory may be needed in order to use multid migration >> > with zerocopy enabled, make it optional by creating a new parameter >> > multifd-zerocopy on qapi, so low-privileged users can still perform multifd >> > migrations. >> > >> > Signed-off-by: Leonardo Bras >> > --- >> > qapi/migration.json | 18 ++ >> > migration/migration.h | 1 + >> > migration/multifd.h | 2 +- >> > migration/migration.c | 20 >> > migration/multifd.c | 33 - >> > migration/ram.c | 20 +--- >> > monitor/hmp-cmds.c| 4 >> > 7 files changed, 85 insertions(+), 13 deletions(-) >> > >> > diff --git a/qapi/migration.json b/qapi/migration.json >> > index 88f07baedd..c4890cbb54 100644 >> > --- a/qapi/migration.json >> > +++ b/qapi/migration.json >> > @@ -724,6 +724,11 @@ >> > # will consume more CPU. >> > # Defaults to 1. (Since 5.0) >> > # >> > +# @multifd-zerocopy: Controls behavior on sending memory pages on multifd >> > migration. 
>> > +#When true, enables a zerocopy mechanism for sending >> > memory >> > +#pages, if host does support it. >> >> s/does support/supports/ (several instances this patch) > > I will make sure to fix that in v5. > >> >> > +#Defaults to false. (Since 6.2) >> > +# >> > # @block-bitmap-mapping: Maps block nodes and bitmaps on them to >> > #aliases for the purpose of dirty bitmap >> > migration. Such >> > #aliases may for example be the corresponding >> > names on the >> > @@ -758,6 +763,7 @@ >> > 'xbzrle-cache-size', 'max-postcopy-bandwidth', >> > 'max-cpu-throttle', 'multifd-compression', >> > 'multifd-zlib-level' ,'multifd-zstd-level', >> > +'multifd-zerocopy', >> > 'block-bitmap-mapping' ] } >> >> Should this feature be guarded with 'if':'CONFIG_LINUX', since that's >> the only platform where you even compile the code (even then, it can >> still fail if the kernel doesn't support it). > > I think it makes sense for the feature to always be available, even > though it's not supported > outside linux > v4.14. > > IMHO it makes more sense for the user to get an error when it starts > migration, due to host > not supporting zerocopy, than the error happening in the config step, > which may cause the user > to question if it was the right parameter. > > The config message error here could also be ignored, and users can > think zerocopy is working, while it's not. > > For automated migrations, this feature should never be enabled for > non-linux / older linux hosts anyway. > > Is there a good argument I am missing for this feature being disabled > on non-linux? The general argument for having QAPI schema 'if' mirror the C implementation's #if is introspection. Let me explain why that matters. Consider a management application that supports a range of QEMU versions, say 5.0 to 6.2. Say it wants to use an QMP command that is new in QEMU 6.2. The sane way to do that is to probe for the command with query-qmp-schema. Same for command arguments, and anything else QMP. 
If you doubt "sane", check out Part II of "QEMU interface
introspection: From hacks to solutions"[*].

The same technique works when a QMP command / argument / whatever is
compile-time conditional ('if' in the schema). The code the management
application needs anyway to deal with older QEMU now also deals with
"compiled out". Nice.

Of course, a command or argument present in QEMU can still fail, and
the management application still needs to handle failure.
Distinguishing different failure modes can be bothersome and/or
fragile.

By making the QAPI schema conditional mirror the C conditional, you
squash the failure mode "this version of QEMU
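The probing described above might look roughly like the sketch below. The schema fragment is hand-written and illustrative only (a real query-qmp-schema reply is far larger and uses internal type names), and the helper is an invented name, not a real API:

```python
import json

# Illustrative fragment of a query-qmp-schema reply; the shape loosely
# follows the QMP introspection format, heavily simplified.
schema_reply = json.loads("""
[{"meta-type": "object", "name": "MigrationParameters",
  "members": [{"name": "multifd-compression", "type": "MultiFDCompression"},
              {"name": "multifd-zerocopy", "type": "bool"}]}]
""")

def has_member(schema, type_name, member):
    """True if the named object type exposes the member in this build."""
    for entity in schema:
        if entity.get("meta-type") == "object" and entity["name"] == type_name:
            return any(m["name"] == member for m in entity["members"])
    return False

# A management app probes once, before ever attempting to set the flag:
print(has_member(schema_reply, "MigrationParameters", "multifd-zerocopy"))
```

With an 'if' in the schema, a build with the feature compiled out simply omits the member, and the probe returns False with no error-message parsing needed.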
Re: [PATCH 6/6] pcie: expire pending delete
On Tue, Oct 12, 2021 at 07:30:34AM +0200, Gerd Hoffmann wrote: > > > index f3ac04399969..477c8776aa27 100644 > > > --- a/hw/pci/pcie.c > > > +++ b/hw/pci/pcie.c > > > @@ -549,6 +549,8 @@ void pcie_cap_slot_unplug_request_cb(HotplugHandler > > > *hotplug_dev, > > > } > > > > > > dev->pending_deleted_event = true; > > > +dev->pending_deleted_expires_ms = > > > +qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 5000; /* 5 secs */ > > > > > > /* In case user cancel the operation of multi-function hot-add, > > > * remove the function that is unexposed to guest individually, > > > > > > Well this will be barely enough, right? > > > > Once the Power > > Indicator begins blinking, a 5-second abort interval exists during > > which a second depression of the > > Attention Button cancels the operation. > > Well, canceling the hot-plug is not supported in qemu right now (there > is no qmp command for that). I'm also not sure it makes sense in the > first place for virtual machines. Yes. However if you resend an attention button press within the 5 second window, guest will think you cancelled hot-plug and act accordingly. It's a fundamentally racy algorithm :( > > So I guess it needs to be more. Problem is of course if guest is > > busy because of interrupts and whatnot, it might not get to > > handling that in time ... > > See patch #3, that one should take care of a busy guest ... > > take care, > Gerd
Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
On Wed, Nov 25, 2020 at 12:44:07PM -0800, Elena Afanasova wrote:
> Hello,

Hi

Sorry for top-posting, just wanted to provide a quick update. We are
currently working on the support for ioregionfd in QEMU and will be
posting the patches soon. The KVM patches will also be posted, based on
the RFC v3 with some fixes, if there are no objections from Elena's
side, who originally posted the KVM RFC patchset.

Thanks!
Elena Ufimtseva

> I'm an Outreachy intern with QEMU and I'm working on implementing the
> ioregionfd API in KVM. So I'd like to resume the ioregionfd design
> discussion. The latest version of the ioregionfd API document is
> provided below.
>
> Overview
>
> ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses
> over a file descriptor without returning from ioctl(KVM_RUN). This
> allows device emulation to run in another task separate from the vCPU
> task.
>
> This is achieved through KVM ioctls for registering MMIO/PIO regions
> and a wire protocol that KVM uses to communicate with a task handling
> an MMIO/PIO access.
>
> The traditional ioctl(KVM_RUN) dispatch mechanism with device
> emulation in a separate task looks like this:
>
>    kvm.ko <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task
>
> ioregionfd improves performance by eliminating the need for the vCPU
> task to forward MMIO/PIO exits to device emulation tasks:
>
>    kvm.ko <---ioctl(KVM_RUN)---> VMM vCPU task
>        ^
>        `---ioregionfd---> device task
>
> Both multi-threaded and multi-process VMMs can take advantage of
> ioregionfd to run device emulation in dedicated threads and processes,
> respectively.
>
> This mechanism is similar to ioeventfd except it supports all read and
> write accesses, whereas ioeventfd only supports posted doorbell writes.
>
> Traditional ioctl(KVM_RUN) dispatch and ioeventfd continue to work
> alongside the new mechanism, but only one mechanism handles an
> MMIO/PIO access.
> > KVM_CREATE_IOREGIONFD > - > :Capability: KVM_CAP_IOREGIONFD > :Architectures: all > :Type: system ioctl > :Parameters: none > :Returns: an ioregionfd file descriptor, -1 on error > > This ioctl creates a new ioregionfd and returns the file descriptor. The fd > can > be used to handle MMIO/PIO accesses instead of returning from ioctl(KVM_RUN) > with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO regions must be > registered with KVM_SET_IOREGION in order to receive MMIO/PIO accesses on the > fd. An ioregionfd can be used with multiple VMs and its lifecycle is not tied > to a specific VM. > > When the last file descriptor for an ioregionfd is closed, all regions > registered with KVM_SET_IOREGION are dropped and guest accesses to those > regions cause ioctl(KVM_RUN) to return again. > > KVM_SET_IOREGION > > :Capability: KVM_CAP_IOREGIONFD > :Architectures: all > :Type: vm ioctl > :Parameters: struct kvm_ioregion (in) > :Returns: 0 on success, -1 on error > > This ioctl adds, modifies, or removes an ioregionfd MMIO or PIO region. Guest > read and write accesses are dispatched through the given ioregionfd instead of > returning from ioctl(KVM_RUN). > > :: > > struct kvm_ioregion { > __u64 guest_paddr; /* guest physical address */ > __u64 memory_size; /* bytes */ > __u64 user_data; > __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */ > __u32 flags; > __u8 pad[32]; > }; > > /* for kvm_ioregion::flags */ > #define KVM_IOREGION_PIO (1u << 0) > #define KVM_IOREGION_POSTED_WRITES (1u << 1) > > If a new region would split an existing region -1 is returned and errno is > EINVAL. > > Regions can be deleted by setting fd to -1. If no existing region matches > guest_paddr and memory_size then -1 is returned and errno is ENOENT. > > Existing regions can be modified as long as guest_paddr and memory_size > match an existing region. > > MMIO is the default. The KVM_IOREGION_PIO flag selects PIO instead. 
> > The user_data value is included in messages KVM writes to the ioregionfd upon > guest access. KVM does not interpret user_data. > > Both read and write guest accesses wait for a response before entering the > guest again. The KVM_IOREGION_POSTED_WRITES flag does not wait for a response > and immediately enters the guest again. This is suitable for accesses that do > not require synchronous emulation, such as posted doorbell register writes. > Note that guest writes may block the vCPU despite KVM_IOREGION_POSTED_WRITES > if > the device is too slow in reading from the ioregionfd. > > Wire protocol > - > The protocol spoken over the file descriptor is as follows. The device reads > commands from the file descriptor with the following layout:: > > struct ioregionfd_cmd { > __u32 info; > __u32 padding; > __u64 user_data; > __u64 offset; > __u64 data; > }; > > The info field layout is as follows:: > > bits: | 31 ... 8
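The command layout quoted above (the info bit layout is truncated in the quote) can be exercised with a quick sketch. The struct packing assumes a little-endian host, and parse_cmd is a hypothetical helper standing in for what a device task would do after read()ing the fd:

```python
import struct

# struct ioregionfd_cmd from the proposal:
#   __u32 info; __u32 padding; __u64 user_data; __u64 offset; __u64 data;
# "<" pins little-endian for the sketch; a real consumer would use the
# host's native layout.
CMD_FMT = struct.Struct("<IIQQQ")

def parse_cmd(buf: bytes) -> dict:
    """Unpack one wire command into its named fields."""
    info, _pad, user_data, offset, data = CMD_FMT.unpack(buf)
    return {"info": info, "user_data": user_data,
            "offset": offset, "data": data}

# Round-trip a made-up command the way a device task would see it:
raw = CMD_FMT.pack(0x1, 0, 0xDEADBEEF, 0x10, 0xAB)
cmd = parse_cmd(raw)
print(CMD_FMT.size, hex(cmd["offset"]), hex(cmd["data"]))
```

The fixed 32-byte size means the device task can read commands off the fd in whole-struct units without framing logic.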
Re: [PATCH 6/6] pcie: expire pending delete
> > index f3ac04399969..477c8776aa27 100644 > > --- a/hw/pci/pcie.c > > +++ b/hw/pci/pcie.c > > @@ -549,6 +549,8 @@ void pcie_cap_slot_unplug_request_cb(HotplugHandler > > *hotplug_dev, > > } > > > > dev->pending_deleted_event = true; > > +dev->pending_deleted_expires_ms = > > +qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + 5000; /* 5 secs */ > > > > /* In case user cancel the operation of multi-function hot-add, > > * remove the function that is unexposed to guest individually, > > > Well this will be barely enough, right? > > Once the Power > Indicator begins blinking, a 5-second abort interval exists during > which a second depression of the > Attention Button cancels the operation. Well, canceling the hot-plug is not supported in qemu right now (there is no qmp command for that). I'm also not sure it makes sense in the first place for virtual machines. > So I guess it needs to be more. Problem is of course if guest is > busy because of interrupts and whatnot, it might not get to > handling that in time ... See patch #3, that one should take care of a busy guest ... take care, Gerd
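The expiry logic the patch adds amounts to stamping the unplug request with a deadline and consulting it later. A hedged sketch with a fake millisecond clock (the two dict keys follow the field names in the patch; the helper functions are invented for illustration):

```python
EXPIRE_MS = 5000  # the 5-second window from the patch

def unplug_request(dev: dict, now_ms: int) -> None:
    """Record a pending delete that expires EXPIRE_MS after the request."""
    dev["pending_deleted_event"] = True
    dev["pending_deleted_expires_ms"] = now_ms + EXPIRE_MS

def is_pending(dev: dict, now_ms: int) -> bool:
    """Still inside the window? If not, a repeated request can retry."""
    return (dev.get("pending_deleted_event", False) and
            now_ms < dev["pending_deleted_expires_ms"])

dev = {}
unplug_request(dev, now_ms=1000)
print(is_pending(dev, 3000))   # still within the window
print(is_pending(dev, 6001))   # window expired
```

As the thread notes, the 5-second value interacts badly with the spec's 5-second attention-button abort interval, so the constant is a judgment call rather than a hard requirement.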
Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
Eugenio Pérez writes:

> Initial version of shadow virtqueue that actually forwards buffers.
> There is no IOMMU support at the moment, and that will be addressed
> in future patches of this series. Since all vhost-vdpa devices use a
> forced IOMMU, this means that SVQ is not usable at this point of the
> series on any device.
>
> For simplicity it only supports modern devices, which expect the
> vring in little endian, with a split ring and no event idx or
> indirect descriptors. Support for them will not be added in this
> series.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review. Later commits add simpler
> ones.
>
> SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> retrieve its status (next available idx the device was going to
> consume) race-free. It can later reset the device to replace vring
> addresses etc. When SVQ starts, qemu can resume consuming the guest's
> driver ring from that state, without notice from the latter.
>
> This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> commits.
>
> Removal of the _S_DEVICE_STOPPED bit (in other words, resuming the
> device) can be done in the future if a use case arises. At this
> moment we can just rely on resetting the full device.
> > Signed-off-by: Eugenio Pérez > --- > qapi/net.json | 2 +- > hw/virtio/vhost-shadow-virtqueue.c | 237 - > hw/virtio/vhost-vdpa.c | 109 - > 3 files changed, 337 insertions(+), 11 deletions(-) > > diff --git a/qapi/net.json b/qapi/net.json > index fe546b0e7c..1f4a55f2c5 100644 > --- a/qapi/net.json > +++ b/qapi/net.json > @@ -86,7 +86,7 @@ > # > # @name: the device name of the VirtIO device > # > -# @enable: true to use the alternate shadow VQ notifications > +# @enable: true to use the alternate shadow VQ buffers fowarding path Uh, why does the flag change meaning half-way through this series? > # > # Returns: Error if failure, or 'no error' for success. > # [...]
Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
Eugenio Pérez writes: > Shadow virtqueue notifications forwarding is disabled when vhost_dev > stops, so code flow follows usual cleanup. > > Also, host notifiers must be disabled at SVQ start, and they will not > start if SVQ has been enabled when device is stopped. This is trivial > to address, but it is left out for simplicity at this moment. > > Signed-off-by: Eugenio Pérez > --- > qapi/net.json | 2 +- > hw/virtio/vhost-shadow-virtqueue.h | 8 ++ > include/hw/virtio/vhost-vdpa.h | 4 + > hw/virtio/vhost-shadow-virtqueue.c | 138 - > hw/virtio/vhost-vdpa.c | 116 +++- > 5 files changed, 264 insertions(+), 4 deletions(-) > > diff --git a/qapi/net.json b/qapi/net.json > index a2c30fd455..fe546b0e7c 100644 > --- a/qapi/net.json > +++ b/qapi/net.json > @@ -88,7 +88,7 @@ > # > # @enable: true to use the alternate shadow VQ notifications > # > -# Returns: Always error, since SVQ is not implemented at the moment. > +# Returns: Error if failure, or 'no error' for success. Delete the whole line, please. > # > # Since: 6.2 > # [...]
Re: [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp
Eugenio Pérez writes: > Command to enable shadow virtqueue. > > Signed-off-by: Eugenio Pérez > --- > qapi/net.json | 23 +++ > hw/virtio/vhost-vdpa.c | 8 > 2 files changed, 31 insertions(+) > > diff --git a/qapi/net.json b/qapi/net.json > index 7fab2e7cd8..a2c30fd455 100644 > --- a/qapi/net.json > +++ b/qapi/net.json > @@ -79,6 +79,29 @@ > { 'command': 'netdev_del', 'data': {'id': 'str'}, >'allow-preconfig': true } > > +## > +# @x-vhost-enable-shadow-vq: > +# > +# Use vhost shadow virtqueue. > +# > +# @name: the device name of the VirtIO device Is this a qdev ID? A network client name? > +# > +# @enable: true to use the alternate shadow VQ notifications > +# > +# Returns: Always error, since SVQ is not implemented at the moment. > +# > +# Since: 6.2 > +# > +# Example: > +# > +# -> { "execute": "x-vhost-enable-shadow-vq", > +# "arguments": { "name": "virtio-net", "enable": false } } > +# > +## > +{ 'command': 'x-vhost-enable-shadow-vq', > + 'data': {'name': 'str', 'enable': 'bool'}, > + 'if': 'defined(CONFIG_VHOST_KERNEL)' } > + Adding an command just for controlling a flag in some object is fine for quick experiments. As a permanent interface, it's problematic: one command per flag would result in way too many commands. Better: one command to control a set of related properties. I hesitate to suggest qom-set, because qom-set is not introspectable. Recurring complaint about QOM: poor integration with QAPI/QMP. Naming nitpick: since the command can both enable and disable, I'd call it -set-vq instead of -enable-vq. 
> ## > # @NetLegacyNicOptions: > # > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index 4fa414feea..c63e311d7c 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -23,6 +23,8 @@ > #include "cpu.h" > #include "trace.h" > #include "qemu-common.h" > +#include "qapi/qapi-commands-net.h" > +#include "qapi/error.h" > > static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > { > @@ -656,6 +658,12 @@ static bool vhost_vdpa_force_iommu(struct vhost_dev > *dev) > return true; > } > > + > +void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error > **errp) > +{ > +error_setg(errp, "Shadow virtqueue still not implemented"); > +} > + > const VhostOps vdpa_ops = { > .backend_type = VHOST_BACKEND_TYPE_VDPA, > .vhost_backend_init = vhost_vdpa_init,
Re: [PATCH] Trim some trailing space from human-readable output
Philippe Mathieu-Daudé writes: > On 10/9/21 17:24, Markus Armbruster wrote: >> I noticed -cpu help printing enough trailing spaces to make the output >> at least 84 characters wide. Looks ugly unless the terminal is wider. >> Ugly or not, trailing spaces are stupid. >> >> The culprit is this line in x86_cpu_list_entry(): >> >> qemu_printf("x86 %-20s %-58s\n", name, desc); >> >> This prints a string with minimum field left-justified right before a >> newline. Change it to >> >> qemu_printf("x86 %-20s %s\n", name, desc); >> >> which avoids the trailing spaces and is simpler to boot. >> >> A search for the pattern with "git-grep -E '%-[0-9]+s\\n'" found a few >> more instances. Change them similarly. >> >> Signed-off-by: Markus Armbruster >> --- >> monitor/hmp-cmds.c | 2 +- >> target/i386/cpu-dump.c | 4 ++-- >> target/i386/cpu.c | 2 +- >> target/ppc/cpu_init.c | 2 +- >> target/s390x/cpu_models.c | 4 ++-- >> target/xtensa/mmu_helper.c | 2 +- >> 6 files changed, 8 insertions(+), 8 deletions(-) > > Nitpicking, do you mind prefixing the patch subject with 'monitor:'? You're right, the code I patch is called from HMP commands, and probably not from anywhere else. > Otherwise: > Reviewed-by: Philippe Mathieu-Daudé Thanks!
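For illustration, the effect of dropping the field width before the newline can be reproduced in a few lines (Python %-formatting standing in for qemu_printf; the CPU name and description are made-up sample data):

```python
name, desc = "qemu64", "QEMU Virtual CPU version 2.5+"

before = "x86 %-20s %-58s" % (name, desc)  # pads desc out to 58 columns
after  = "x86 %-20s %s"   % (name, desc)   # no trailing padding

print(repr(before[-5:]))       # the trailing spaces the patch removes
print(len(before), len(after))
```

Since the padded field sits right before a newline, the only difference between the two is invisible trailing whitespace, which is exactly why the simpler format is preferable.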
[PATCH] tests/tcg/xtensa: allow testing big-endian cores
Don't disable all big-endian tests, instead check whether $(CORE) is supported by the configured $(QEMU) and enable tests if it is. Signed-off-by: Max Filippov --- MAINTAINERS| 1 + tests/tcg/xtensa/Makefile.softmmu-target | 4 ++-- tests/tcg/xtensaeb/Makefile.softmmu-target | 5 + 3 files changed, 8 insertions(+), 2 deletions(-) create mode 100644 tests/tcg/xtensaeb/Makefile.softmmu-target diff --git a/MAINTAINERS b/MAINTAINERS index 50435b8d2f50..8b5ed46a5f1c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -343,6 +343,7 @@ S: Maintained F: target/xtensa/ F: hw/xtensa/ F: tests/tcg/xtensa/ +F: tests/tcg/xtensaeb/ F: disas/xtensa.c F: include/hw/xtensa/xtensa-isa.h F: configs/devices/xtensa*/default.mak diff --git a/tests/tcg/xtensa/Makefile.softmmu-target b/tests/tcg/xtensa/Makefile.softmmu-target index 9530cac2ad95..f1cf2a6496d2 100644 --- a/tests/tcg/xtensa/Makefile.softmmu-target +++ b/tests/tcg/xtensa/Makefile.softmmu-target @@ -2,7 +2,8 @@ # Xtensa softmmu tests # -ifneq ($(TARGET_WORDS_BIGENDIAN),y) +CORE=dc232b +ifneq ($(shell $(QEMU) -cpu help | grep -w $(CORE)),) XTENSA_SRC = $(SRC_PATH)/tests/tcg/xtensa XTENSA_ALL = $(filter-out $(XTENSA_SRC)/linker.ld.S,$(wildcard $(XTENSA_SRC)/*.S)) @@ -15,7 +16,6 @@ XTENSA_USABLE_TESTS = $(filter-out $(XTENSA_BROKEN_TESTS), $(XTENSA_TESTS)) TESTS += $(XTENSA_USABLE_TESTS) VPATH += $(XTENSA_SRC) -CORE=dc232b QEMU_OPTS+=-M sim -cpu $(CORE) -nographic -semihosting -icount 6 $(EXTFLAGS) -kernel INCLUDE_DIRS = $(SRC_PATH)/target/xtensa/core-$(CORE) diff --git a/tests/tcg/xtensaeb/Makefile.softmmu-target b/tests/tcg/xtensaeb/Makefile.softmmu-target new file mode 100644 index ..4204a96d53c0 --- /dev/null +++ b/tests/tcg/xtensaeb/Makefile.softmmu-target @@ -0,0 +1,5 @@ +# +# Xtensa softmmu tests +# + +include $(SRC_PATH)/tests/tcg/xtensa/Makefile.softmmu-target -- 2.20.1
Re: [RFC PATCH v4 00/20] vDPA shadow virtqueue
On Tue, Oct 12, 2021 at 11:59 AM Jason Wang wrote: > > > 在 2021/10/1 下午3:05, Eugenio Pérez 写道: > > This series enable shadow virtqueue (SVQ) for vhost-vdpa devices. This > > is intended as a new method of tracking the memory the devices touch > > during a migration process: Instead of relay on vhost device's dirty > > logging capability, SVQ intercepts the VQ dataplane forwarding the > > descriptors between VM and device. This way qemu is the effective > > writer of guests memory, like in qemu's virtio device operation. > > > > When SVQ is enabled qemu offers a new vring to the device to read > > and write into, and also intercepts kicks and calls between the device > > and the guest. Used buffers relay would cause dirty memory being > > tracked, but at this RFC SVQ is not enabled on migration automatically. > > > > It is based on the ideas of DPDK SW assisted LM, in the series of > > DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does > > not map the shadow vq in guest's VA, but in qemu's. > > > > For qemu to use shadow virtqueues the guest virtio driver must not use > > features like event_idx or indirect descriptors. These limitations will > > be addressed in later series, but they are left out for simplicity at > > the moment. > > > > SVQ needs to be enabled with QMP command: > > > > { "execute": "x-vhost-enable-shadow-vq", > >"arguments": { "name": "dev0", "enable": true } } > > > > This series includes some patches to delete in the final version that > > helps with its testing. The first two of the series freely implements > > the feature to stop the device and be able to retrieve its status. It's > > intended to be used with vp_vpda driver in a nested environment. This > > driver also need modifications to forward the new status bit. > > > > Patches 2-8 prepares the SVQ and QMP command to support guest to host > > notifications forwarding. 
If the SVQ is enabled with these ones > > applied and the device supports it, that part can be tested in > > isolation (for example, with networking), hopping through SVQ. > > > > Same thing is true with patches 9-13, but with device to guest > > notifications. > > > > The rest of the patches implements the actual buffer forwarding. > > > > Comments are welcome. > > > Hi Eugenio: > > > It would be helpful to have a public git repo for us to ease the review. > > Thanks > Btw, we also need to measure the performance impact of the shadow virtqueue. Thanks
Re: [RFC PATCH v4 00/20] vDPA shadow virtqueue
On 2021/10/1 3:05 PM, Eugenio Pérez wrote:

This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. This
is intended as a new method of tracking the memory the devices touch
during a migration process: instead of relying on the vhost device's
dirty logging capability, SVQ intercepts the VQ dataplane, forwarding
the descriptors between VM and device. This way qemu is the effective
writer of the guest's memory, like in qemu's virtio device operation.

When SVQ is enabled qemu offers a new vring to the device to read and
write into, and also intercepts kicks and calls between the device and
the guest. Used buffer relay would cause dirty memory to be tracked,
but at this RFC stage SVQ is not enabled on migration automatically.

It is based on the ideas of DPDK SW assisted LM, in the series of
DPDK's https://patchwork.dpdk.org/cover/48370/ . However, this does
not map the shadow vq in the guest's VA, but in qemu's.

For qemu to use shadow virtqueues the guest virtio driver must not use
features like event_idx or indirect descriptors. These limitations
will be addressed in later series, but they are left out for
simplicity at the moment.

SVQ needs to be enabled with the QMP command:

{ "execute": "x-vhost-enable-shadow-vq",
  "arguments": { "name": "dev0", "enable": true } }

This series includes some patches, to be deleted in the final version,
that help with its testing. The first two of the series freely
implement the feature to stop the device and be able to retrieve its
status. It's intended to be used with the vp_vdpa driver in a nested
environment. This driver also needs modifications to forward the new
status bit.

Patches 2-8 prepare the SVQ and QMP command to support guest-to-host
notifications forwarding. If the SVQ is enabled with these ones
applied and the device supports it, that part can be tested in
isolation (for example, with networking), hopping through SVQ.

The same is true of patches 9-13, but with device-to-guest
notifications.
The rest of the patches implement the actual buffer forwarding.

Comments are welcome.

Hi Eugenio:

It would be helpful to have a public git repo for us to ease the review.

Thanks

TODO:
* Event, indirect, packed, and other features of virtio - waiting for
  confirmation of the big picture.
* Use the already available iova tree to track mappings.
* Separate buffer forwarding into its own AIO context, so we can throw
  more threads at that task and we don't need to stop the main event
  loop.
* Unmap iommu memory. Now the tree can only grow from SVQ enable, but
  it should be fine as long as not a lot of memory is added to the
  guest.
* Rebase on top of latest qemu (and, hopefully, on top of multiqueue
  vdpa).
* Some assertions need to become appropriate error handling paths.
* Proper documentation.

Changes from v3 RFC:
* Move everything to the vhost-vdpa backend. A big change; this
  allowed some cleanup, but more code has been added in other places.
* More use of glib utilities, especially to manage memory.
v3 link: https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html

Changes from v2 RFC:
* Adding vhost-vdpa devices support
* Fixed some memory leaks pointed out by different comments
v2 link: https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html

Changes from v1 RFC:
* Use QMP instead of migration to start SVQ mode.
* Only accepting IOMMU devices, closer behavior with target devices
  (vDPA)
* Fix invalid masking/unmasking of vhost call fd.
* Use of proper methods for synchronization.
* No need to modify VirtIO device code; all of the changes are
  contained in vhost code.
* Delete superfluous code.
* An intermediate RFC was sent with only the notifications forwarding
  changes.
It can be seen in
https://patchew.org/QEMU/20210129205415.876290-1-epere...@redhat.com/

v1 link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html

Eugenio Pérez (20):
  virtio: Add VIRTIO_F_QUEUE_STATE
  virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  virtio: Add virtio_queue_is_host_notifier_enabled
  vhost: Make vhost_virtqueue_{start,stop} public
  vhost: Add x-vhost-enable-shadow-vq qmp
  vhost: Add VhostShadowVirtqueue
  vdpa: Register vdpa devices in a list
  vhost: Route guest->host notification through shadow virtqueue
  Add vhost_svq_get_svq_call_notifier
  Add vhost_svq_set_guest_call_notifier
  vdpa: Save call_fd in vhost-vdpa
  vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
  vhost: Route host->guest notification through shadow virtqueue
  virtio: Add vhost_shadow_vq_get_vring_addr
  vdpa: Save host and guest features
  vhost: Add vhost_svq_valid_device_features to shadow vq
  vhost: Shadow virtqueue buffers forwarding
  vhost: Add VhostIOVATree
  vhost: Use a tree to store memory mappings
  vdpa: Add custom IOTLB translations to SVQ
Re: [PATCH v2 3/3] vdpa: Check for iova range at mappings changes
On Tue, Oct 5, 2021 at 9:49 PM Eugenio Pérez wrote: > > Check vdpa device range before updating memory regions so we don't add > any outside of it, and report the invalid change if any. > > Signed-off-by: Eugenio Pérez > --- > include/hw/virtio/vhost-vdpa.h | 2 + > hw/virtio/vhost-vdpa.c | 68 ++ > hw/virtio/trace-events | 1 + > 3 files changed, 55 insertions(+), 16 deletions(-) > > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h > index a8963da2d9..c288cf7ecb 100644 > --- a/include/hw/virtio/vhost-vdpa.h > +++ b/include/hw/virtio/vhost-vdpa.h > @@ -13,6 +13,7 @@ > #define HW_VIRTIO_VHOST_VDPA_H > > #include "hw/virtio/virtio.h" > +#include "standard-headers/linux/vhost_types.h" > > typedef struct VhostVDPAHostNotifier { > MemoryRegion mr; > @@ -24,6 +25,7 @@ typedef struct vhost_vdpa { > uint32_t msg_type; > bool iotlb_batch_begin_sent; > MemoryListener listener; > +struct vhost_vdpa_iova_range iova_range; > struct vhost_dev *dev; > VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX]; > } VhostVDPA; > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index be7c63b4ba..6654287050 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -37,20 +37,34 @@ static Int128 vhost_vdpa_section_end(const > MemoryRegionSection *section) > return llend; > } > > -static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > -{ > -return (!memory_region_is_ram(section->mr) && > -!memory_region_is_iommu(section->mr)) || > -memory_region_is_protected(section->mr) || > - /* vhost-vDPA doesn't allow MMIO to be mapped */ > -memory_region_is_ram_device(section->mr) || > - /* > -* Sizing an enabled 64-bit BAR can cause spurious mappings to > -* addresses in the upper part of the 64-bit address space. These > -* are never accessed by the CPU and beyond the address width of > -* some IOMMU hardware. TODO: VDPA should tell us the IOMMU > width. 
> -*/ > - section->offset_within_address_space & (1ULL << 63); > +static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section, > +uint64_t iova_min, > +uint64_t iova_max) > +{ > +Int128 llend; > + > +if ((!memory_region_is_ram(section->mr) && > + !memory_region_is_iommu(section->mr)) || > +memory_region_is_protected(section->mr) || > +/* vhost-vDPA doesn't allow MMIO to be mapped */ > +memory_region_is_ram_device(section->mr)) { > +return true; > +} > + > +if (section->offset_within_address_space < iova_min) { > +error_report("RAM section out of device range (min=%lu, addr=%lu)", > + iova_min, section->offset_within_address_space); > +return true; > +} > + > +llend = vhost_vdpa_section_end(section); > +if (int128_gt(llend, int128_make64(iova_max))) { > +error_report("RAM section out of device range (max=%lu, end > addr=%lu)", > + iova_max, int128_get64(llend)); > +return true; > +} > + > +return false; > } > > static int vhost_vdpa_dma_map(struct vhost_vdpa *v, hwaddr iova, hwaddr size, > @@ -162,7 +176,8 @@ static void vhost_vdpa_listener_region_add(MemoryListener > *listener, > void *vaddr; > int ret; > > -if (vhost_vdpa_listener_skipped_section(section)) { > +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first, > +v->iova_range.last)) { > return; > } > > @@ -220,7 +235,8 @@ static void vhost_vdpa_listener_region_del(MemoryListener > *listener, > Int128 llend, llsize; > int ret; > > -if (vhost_vdpa_listener_skipped_section(section)) { > +if (vhost_vdpa_listener_skipped_section(section, v->iova_range.first, > +v->iova_range.last)) { > return; > } > > @@ -288,9 +304,24 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, > uint8_t status) > vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, ); > } > > +static int vhost_vdpa_get_iova_range(struct vhost_vdpa *v) > +{ > +int ret; > + > +ret = vhost_vdpa_call(v->dev, VHOST_VDPA_GET_IOVA_RANGE, >iova_range); > +if (ret != 0) { > +return ret; > +} I think we need a fallback for the 
kernel that does not support VHOST_VDPA_GET_IOVA_RANGE? Thanks > + > +trace_vhost_vdpa_get_iova_range(v->dev, v->iova_range.first, > +v->iova_range.last); > +return ret; > +} > + > static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp) > { > struct vhost_vdpa *v; > +int r; > assert(dev->vhost_ops->backend_type ==
Re: [PATCH v2 2/3] vdpa: Add vhost_vdpa_section_end
On Tue, Oct 5, 2021 at 9:49 PM Eugenio Pérez wrote: > > Abstract this operation, that will be reused when validating the region > against the iova range that the device supports. > > Signed-off-by: Eugenio Pérez Acked-by: Jason Wang > --- > hw/virtio/vhost-vdpa.c | 22 +++--- > 1 file changed, 15 insertions(+), 7 deletions(-) > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index ea1aa71ad8..be7c63b4ba 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -24,6 +24,19 @@ > #include "trace.h" > #include "qemu-common.h" > > +/* > + * Return one past the end of the end of section. Be careful with uint64_t > + * conversions! > + */ > +static Int128 vhost_vdpa_section_end(const MemoryRegionSection *section) > +{ > +Int128 llend = int128_make64(section->offset_within_address_space); > +llend = int128_add(llend, section->size); > +llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK)); > + > +return llend; > +} > + > static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > { > return (!memory_region_is_ram(section->mr) && > @@ -160,10 +173,7 @@ static void > vhost_vdpa_listener_region_add(MemoryListener *listener, > } > > iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); > -llend = int128_make64(section->offset_within_address_space); > -llend = int128_add(llend, section->size); > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK)); > - > +llend = vhost_vdpa_section_end(section); > if (int128_ge(int128_make64(iova), llend)) { > return; > } > @@ -221,9 +231,7 @@ static void vhost_vdpa_listener_region_del(MemoryListener > *listener, > } > > iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); > -llend = int128_make64(section->offset_within_address_space); > -llend = int128_add(llend, section->size); > -llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK)); > +llend = vhost_vdpa_section_end(section); > > trace_vhost_vdpa_listener_region_del(v, iova, int128_get64(llend)); > > 
-- > 2.27.0 >
Re: [PATCH v2 1/3] vdpa: Skip protected ram IOMMU mappings
On Tue, Oct 5, 2021 at 9:49 PM Eugenio Pérez wrote: > > Following the logic of commit 56918a126ae ("memory: Add RAM_PROTECTED > flag to skip IOMMU mappings") with VFIO, skip memory sections > inaccessible via normal mechanisms, including DMA. > > Signed-off-by: Eugenio Pérez Acked-by: Jason Wang > --- > hw/virtio/vhost-vdpa.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c > index 47d7a5a23d..ea1aa71ad8 100644 > --- a/hw/virtio/vhost-vdpa.c > +++ b/hw/virtio/vhost-vdpa.c > @@ -28,6 +28,7 @@ static bool > vhost_vdpa_listener_skipped_section(MemoryRegionSection *section) > { > return (!memory_region_is_ram(section->mr) && > !memory_region_is_iommu(section->mr)) || > +memory_region_is_protected(section->mr) || > /* vhost-vDPA doesn't allow MMIO to be mapped */ > memory_region_is_ram_device(section->mr) || > /* > -- > 2.27.0 >
[PATCH v2] hw/riscv: virt: bugfix the memory-backend-file command is invalid
From: Mingwang Li When I start the VM with the following command: $ ./qemu-system-riscv64 -M virt,accel=kvm -m 4096M -cpu host -nographic \ -name guest=riscv-guset \ -smp 2 \ -bios none \ -kernel ./Image \ -drive file=./guest.img,format=raw,id=hd0 \ -device virtio-blk-device,drive=hd0 \ -append "root=/dev/vda rw console=ttyS0 earlycon=sbi" \ -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on \ -numa node,memdev=mem -mem-prealloc \ -chardev socket,id=char0,path=/mnt/vhost-net0 \ -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ -device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,mrg_rxbuf=on,csum=on,guest_csum=on,guest_ecn=on \ Then, QEMU displays the following error information: qemu-system-riscv64: Failed initializing vhost-user memory map, consider using -object memory-backend-file share=on qemu-system-riscv64: vhost_set_mem_table failed: Interrupted system call (4) qemu-system-riscv64: unable to start vhost net: 4: falling back on userspace virtio Note that, before starting the KVM-accelerated QEMU VM, the following temporarily unaccepted QEMU patches should be used: https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg02516.html This error occurs because only the default main_mem is registered as the system memory, so other memory cannot be initialized. Therefore, the system memory should be initialized to the machine->ram, which consists of the default main_mem and other possible memory required by applications, such as shared hugepage memory in DPDK. Also, the mc->default_ram_id should be set to the default main_mem, such as "riscv_virt_board.ram" for the virt machine.
Signed-off-by: Mingwang Li Signed-off-by: Yifei Jiang Reviewed-by: Alistair Francis --- hw/riscv/virt.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/hw/riscv/virt.c b/hw/riscv/virt.c index ec0cb69b8c..b3b431c847 100644 --- a/hw/riscv/virt.c +++ b/hw/riscv/virt.c @@ -771,7 +771,6 @@ static void virt_machine_init(MachineState *machine) const MemMapEntry *memmap = virt_memmap; RISCVVirtState *s = RISCV_VIRT_MACHINE(machine); MemoryRegion *system_memory = get_system_memory(); -MemoryRegion *main_mem = g_new(MemoryRegion, 1); MemoryRegion *mask_rom = g_new(MemoryRegion, 1); char *plic_hart_config, *soc_name; target_ulong start_addr = memmap[VIRT_DRAM].base; @@ -890,10 +889,8 @@ static void virt_machine_init(MachineState *machine) } /* register system main memory (actual RAM) */ -memory_region_init_ram(main_mem, NULL, "riscv_virt_board.ram", - machine->ram_size, _fatal); memory_region_add_subregion(system_memory, memmap[VIRT_DRAM].base, -main_mem); +machine->ram); /* create device tree */ create_fdt(s, memmap, machine->ram_size, machine->kernel_cmdline, @@ -1032,6 +1029,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data) mc->cpu_index_to_instance_props = riscv_numa_cpu_index_to_props; mc->get_default_cpu_node_id = riscv_numa_get_default_cpu_node_id; mc->numa_mem_supported = true; +mc->default_ram_id = "riscv_virt_board.ram"; machine_class_allow_dynamic_sysbus_dev(mc, TYPE_RAMFB_DEVICE); -- 2.19.1
Re: [PATCH 02/13] target/riscv: Create RISCVMXL enumeration
On Fri, Oct 8, 2021 at 3:47 AM Richard Henderson wrote: > > Move the MXL_RV* defines to enumerators. > > Signed-off-by: Richard Henderson Reviewed-by: Alistair Francis Alistair > --- > target/riscv/cpu_bits.h | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/target/riscv/cpu_bits.h b/target/riscv/cpu_bits.h > index 999187a9ee..e248c6bf6d 100644 > --- a/target/riscv/cpu_bits.h > +++ b/target/riscv/cpu_bits.h > @@ -364,9 +364,11 @@ > #define MISA32_MXL 0xC000 > #define MISA64_MXL 0xC000ULL > > -#define MXL_RV321 > -#define MXL_RV642 > -#define MXL_RV128 3 > +typedef enum { > +MXL_RV32 = 1, > +MXL_RV64 = 2, > +MXL_RV128 = 3, > +} RISCVMXL; > > /* sstatus CSR bits */ > #define SSTATUS_UIE 0x0001 > -- > 2.25.1 > >
Re: [PATCH] hw/riscv: virt: bugfix the memory-backend-file command is invalid
On Thu, Sep 30, 2021 at 12:03 AM MingWang Li wrote: > > From: Mingwang Li > > If only the default main_mem is registered as the system memory, > other memory cannot be initialized. Therefore, the system memory > should be initialized to the machine->ram, which consists of the > default main_mem and other possible memory required by applications, > such as shared hugepage memory in DPDK. > Also, the mc->default_ram_id should be set to the default main_mem, > which is named "riscv_virt_board.ram". > > Signed-off-by: Mingwang Li > Signed-off-by: Yifei Jiang Reviewed-by: Alistair Francis Alistair
Re: [PATCH] target/riscv: line up all of the registers in the info register dump
On Sat, Oct 9, 2021 at 3:51 PM Travis Geiselbrecht wrote: > > Ensure the columns for all of the register names and values line up. > No functional change, just a minor tweak to the output. > > Signed-off-by: Travis Geiselbrecht Reviewed-by: Alistair Francis Alistair > --- > target/riscv/cpu.c | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/target/riscv/cpu.c b/target/riscv/cpu.c > index 1d69d1887e..660f9ce131 100644 > --- a/target/riscv/cpu.c > +++ b/target/riscv/cpu.c > @@ -258,7 +258,7 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) > } > if (riscv_has_ext(env, RVH)) { > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "hstatus ", env->hstatus); > -qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "vsstatus ", > +qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "vsstatus", > (target_ulong)env->vsstatus); > } > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mip ", env->mip); > @@ -289,8 +289,8 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mtval ", env->mtval); > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "stval ", env->stval); > if (riscv_has_ext(env, RVH)) { > -qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "htval ", env->htval); > -qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mtval2 ", env->mtval2); > +qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "htval ", env->htval); > +qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mtval2 ", env->mtval2); > } > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "mscratch", env->mscratch); > qemu_fprintf(f, " %s " TARGET_FMT_lx "\n", "sscratch", env->sscratch); > @@ -298,7 +298,7 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) > #endif > > for (i = 0; i < 32; i++) { > -qemu_fprintf(f, " %s " TARGET_FMT_lx, > +qemu_fprintf(f, " %-8s " TARGET_FMT_lx, > riscv_int_regnames[i], env->gpr[i]); > if ((i & 3) == 3) { > qemu_fprintf(f, "\n"); > @@ -306,7 +306,7 @@ static void riscv_cpu_dump_state(CPUState *cs, FILE *f, > int flags) 
> } > if (flags & CPU_DUMP_FPU) { > for (i = 0; i < 32; i++) { > -qemu_fprintf(f, " %s %016" PRIx64, > +qemu_fprintf(f, " %-8s %016" PRIx64, > riscv_fpr_regnames[i], env->fpr[i]); > if ((i & 3) == 3) { > qemu_fprintf(f, "\n"); > -- > 2.25.1 > >
Re: [PATCH 2/8] accel/tcg: Split out g2h_tlbe
On Mon, Oct 11, 2021 at 3:44 AM Richard Henderson wrote: > > Create a new function to combine a CPUTLBEntry addend > with the guest address to form a host address. > > Signed-off-by: Richard Henderson Reviewed-by: Alistair Francis Alistair > --- > accel/tcg/cputlb.c | 24 ++-- > 1 file changed, 14 insertions(+), 10 deletions(-) > > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c > index 46140ccff3..761f726722 100644 > --- a/accel/tcg/cputlb.c > +++ b/accel/tcg/cputlb.c > @@ -90,6 +90,11 @@ static inline size_t sizeof_tlb(CPUTLBDescFast *fast) > return fast->mask + (1 << CPU_TLB_ENTRY_BITS); > } > > +static inline uintptr_t g2h_tlbe(const CPUTLBEntry *tlb, target_ulong gaddr) > +{ > +return tlb->addend + (uintptr_t)gaddr; > +} > + > static void tlb_window_reset(CPUTLBDesc *desc, int64_t ns, > size_t max_entries) > { > @@ -976,8 +981,7 @@ static void tlb_reset_dirty_range_locked(CPUTLBEntry > *tlb_entry, > > if ((addr & (TLB_INVALID_MASK | TLB_MMIO | > TLB_DISCARD_WRITE | TLB_NOTDIRTY)) == 0) { > -addr &= TARGET_PAGE_MASK; > -addr += tlb_entry->addend; > +addr = g2h_tlbe(tlb_entry, addr & TARGET_PAGE_MASK); > if ((addr - start) < length) { > #if TCG_OVERSIZED_GUEST > tlb_entry->addr_write |= TLB_NOTDIRTY; > @@ -1527,7 +1531,7 @@ tb_page_addr_t get_page_addr_code_hostp(CPUArchState > *env, target_ulong addr, > return -1; > } > > -p = (void *)((uintptr_t)addr + entry->addend); > +p = (void *)g2h_tlbe(entry, addr); > if (hostp) { > *hostp = p; > } > @@ -1619,7 +1623,7 @@ static int probe_access_internal(CPUArchState *env, > target_ulong addr, > } > > /* Everything else is RAM. 
*/ > -*phost = (void *)((uintptr_t)addr + entry->addend); > +*phost = (void *)g2h_tlbe(entry, addr); > return flags; > } > > @@ -1727,7 +1731,7 @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong > addr, int mmu_idx, > data->v.io.offset = (iotlbentry->addr & TARGET_PAGE_MASK) + addr; > } else { > data->is_io = false; > -data->v.ram.hostaddr = (void *)((uintptr_t)addr + tlbe->addend); > +data->v.ram.hostaddr = (void *)g2h_tlbe(tlbe, addr); > } > return true; > } else { > @@ -1826,7 +1830,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, > target_ulong addr, > goto stop_the_world; > } > > -hostaddr = (void *)((uintptr_t)addr + tlbe->addend); > +hostaddr = (void *)g2h_tlbe(tlbe, addr); > > if (unlikely(tlb_addr & TLB_NOTDIRTY)) { > notdirty_write(env_cpu(env), addr, size, > @@ -1938,7 +1942,7 @@ load_helper(CPUArchState *env, target_ulong addr, > MemOpIdx oi, > access_type, op ^ (need_swap * MO_BSWAP)); > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > > /* > * Keep these two load_memop separate to ensure that the compiler > @@ -1975,7 +1979,7 @@ load_helper(CPUArchState *env, target_ulong addr, > MemOpIdx oi, > return res & MAKE_64BIT_MASK(0, size * 8); > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > return load_memop(haddr, op); > } > > @@ -2467,7 +2471,7 @@ store_helper(CPUArchState *env, target_ulong addr, > uint64_t val, > notdirty_write(env_cpu(env), addr, size, iotlbentry, retaddr); > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > > /* > * Keep these two store_memop separate to ensure that the compiler > @@ -2492,7 +2496,7 @@ store_helper(CPUArchState *env, target_ulong addr, > uint64_t val, > return; > } > > -haddr = (void *)((uintptr_t)addr + entry->addend); > +haddr = (void *)g2h_tlbe(entry, addr); > store_memop(haddr, val, op); > } > > -- > 2.25.1 > >
Re: [PATCH 1/8] tcg: Add TCG_TARGET_SIGNED_ADDR32
On Mon, Oct 11, 2021 at 3:49 AM Richard Henderson wrote: > > Define as 0 for all tcg hosts. Put this in a separate header, > because we'll want this in places that do not ordinarily have > access to all of tcg/tcg.h. > > Signed-off-by: Richard Henderson Reviewed-by: Alistair Francis Alistair > --- > tcg/aarch64/tcg-target-sa32.h | 1 + > tcg/arm/tcg-target-sa32.h | 1 + > tcg/i386/tcg-target-sa32.h| 1 + > tcg/mips/tcg-target-sa32.h| 1 + > tcg/ppc/tcg-target-sa32.h | 1 + > tcg/riscv/tcg-target-sa32.h | 1 + > tcg/s390x/tcg-target-sa32.h | 1 + > tcg/sparc/tcg-target-sa32.h | 1 + > tcg/tci/tcg-target-sa32.h | 1 + > 9 files changed, 9 insertions(+) > create mode 100644 tcg/aarch64/tcg-target-sa32.h > create mode 100644 tcg/arm/tcg-target-sa32.h > create mode 100644 tcg/i386/tcg-target-sa32.h > create mode 100644 tcg/mips/tcg-target-sa32.h > create mode 100644 tcg/ppc/tcg-target-sa32.h > create mode 100644 tcg/riscv/tcg-target-sa32.h > create mode 100644 tcg/s390x/tcg-target-sa32.h > create mode 100644 tcg/sparc/tcg-target-sa32.h > create mode 100644 tcg/tci/tcg-target-sa32.h > > diff --git a/tcg/aarch64/tcg-target-sa32.h b/tcg/aarch64/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/aarch64/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/arm/tcg-target-sa32.h b/tcg/arm/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/arm/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/i386/tcg-target-sa32.h b/tcg/i386/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/i386/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/mips/tcg-target-sa32.h b/tcg/mips/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/mips/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git 
a/tcg/ppc/tcg-target-sa32.h b/tcg/ppc/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/ppc/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/riscv/tcg-target-sa32.h b/tcg/riscv/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/riscv/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/s390x/tcg-target-sa32.h b/tcg/s390x/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/s390x/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/sparc/tcg-target-sa32.h b/tcg/sparc/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/sparc/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > diff --git a/tcg/tci/tcg-target-sa32.h b/tcg/tci/tcg-target-sa32.h > new file mode 100644 > index 00..cb185b1526 > --- /dev/null > +++ b/tcg/tci/tcg-target-sa32.h > @@ -0,0 +1 @@ > +#define TCG_TARGET_SIGNED_ADDR32 0 > -- > 2.25.1 > >
Re: [PATCH] Trim some trailing space from human-readable output
On Sat, Oct 9, 2021 at 8:24 AM Markus Armbruster wrote: > > I noticed -cpu help printing enough trailing spaces to make the output > at least 84 characters wide. Looks ugly unless the terminal is wider. > Ugly or not, trailing spaces are stupid. > > The culprit is this line in x86_cpu_list_entry(): > > qemu_printf("x86 %-20s %-58s\n", name, desc); > > This prints a string with minimum field left-justified right before a > newline. Change it to > > qemu_printf("x86 %-20s %s\n", name, desc); > > which avoids the trailing spaces and is simpler to boot. > > A search for the pattern with "git-grep -E '%-[0-9]+s\\n'" found a few > more instances. Change them similarly. > > Signed-off-by: Markus Armbruster > --- > monitor/hmp-cmds.c | 2 +- > target/i386/cpu-dump.c | 4 ++-- > target/i386/cpu.c | 2 +- > target/ppc/cpu_init.c | 2 +- > target/s390x/cpu_models.c | 4 ++-- > target/xtensa/mmu_helper.c | 2 +- > 6 files changed, 8 insertions(+), 8 deletions(-) For target/xtensa: Acked-by: Max Filippov -- Thanks. -- Max
Re: [PATCH] Trim some trailing space from human-readable output
On 10/9/21 17:24, Markus Armbruster wrote: > I noticed -cpu help printing enough trailing spaces to make the output > at least 84 characters wide. Looks ugly unless the terminal is wider. > Ugly or not, trailing spaces are stupid. > > The culprit is this line in x86_cpu_list_entry(): > > qemu_printf("x86 %-20s %-58s\n", name, desc); > > This prints a string with minimum field left-justified right before a > newline. Change it to > > qemu_printf("x86 %-20s %s\n", name, desc); > > which avoids the trailing spaces and is simpler to boot. > > A search for the pattern with "git-grep -E '%-[0-9]+s\\n'" found a few > more instances. Change them similarly. > > Signed-off-by: Markus Armbruster > --- > monitor/hmp-cmds.c | 2 +- > target/i386/cpu-dump.c | 4 ++-- > target/i386/cpu.c | 2 +- > target/ppc/cpu_init.c | 2 +- > target/s390x/cpu_models.c | 4 ++-- > target/xtensa/mmu_helper.c | 2 +- > 6 files changed, 8 insertions(+), 8 deletions(-) Nitpicking, do you mind prefixing the patch subject with 'monitor:'? Otherwise: Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH v3 12/12] vfio-user: acceptance test
On 10/11/21 07:31, Jagannathan Raman wrote: > Acceptance test for libvfio-user in QEMU > > Signed-off-by: Elena Ufimtseva > Signed-off-by: John G Johnson > Signed-off-by: Jagannathan Raman > --- > MAINTAINERS | 1 + > tests/acceptance/vfio-user.py | 96 +++ > 2 files changed, 97 insertions(+) > create mode 100644 tests/acceptance/vfio-user.py > +class VfioUser(Test): > +""" > +:avocado: tags=vfiouser > +""" > +KERNEL_COMMON_COMMAND_LINE = 'printk.time=0 ' > + > +def do_test(self, kernel_url, initrd_url, kernel_command_line, > +machine_type): > +"""Main test method""" > +self.require_accelerator('kvm') > + > +kernel_path = self.fetch_asset(kernel_url) > +initrd_path = self.fetch_asset(initrd_url) > + > +socket = os.path.join('/tmp', str(uuid.uuid4())) > +if os.path.exists(socket): > +os.remove(socket) > + > +# Create remote process > +remote_vm = self.get_vm() > +remote_vm.add_args('-machine', 'x-remote') > +remote_vm.add_args('-nodefaults') > +remote_vm.add_args('-device', 'lsi53c895a,id=lsi1') > +remote_vm.add_args('-object', 'vfio-user,id=vfioobj1,' > + 'devid=lsi1,socket='+socket) Nitpicking for style: spaces around '+' here, > +remote_vm.launch() > + > +# Create proxy process > +self.vm.set_console() > +self.vm.add_args('-machine', machine_type) > +self.vm.add_args('-accel', 'kvm') > +self.vm.add_args('-cpu', 'host') > +self.vm.add_args('-object', > + 'memory-backend-memfd,id=sysmem-file,size=2G') > +self.vm.add_args('--numa', 'node,memdev=sysmem-file') > +self.vm.add_args('-m', '2048') > +self.vm.add_args('-kernel', kernel_path, > + '-initrd', initrd_path, > + '-append', kernel_command_line) > +self.vm.add_args('-device', > + 'vfio-user-pci,' > + 'socket='+socket) and here. Otherwise: Reviewed-by: Philippe Mathieu-Daudé Thanks for adding this test :)
Re: [PATCH 0/8] target/mips: Use tcg_constant_*
On 10/3/21 19:57, Philippe Mathieu-Daudé wrote: > Replace temporary TCG registers by tcg_constant_*() when possible. > > Philippe Mathieu-Daudé (8): > target/mips: Remove unused register from MSA 2R/2RF instruction format > target/mips: Use tcg_constant_i32() in gen_msa_elm_df() > target/mips: Use tcg_constant_i32() in gen_msa_2rf() > target/mips: Use tcg_constant_i32() in gen_msa_2r() > target/mips: Use tcg_constant_i32() in gen_msa_3rf() > target/mips: Use explicit extract32() calls in gen_msa_i5() > target/mips: Use tcg_constant_i32() in gen_msa_i5() > target/mips: Use tcg_constant_tl() in gen_compute_compact_branch() > > target/mips/tcg/msa_translate.c | 87 + > target/mips/tcg/translate.c | 4 +- > 2 files changed, 45 insertions(+), 46 deletions(-) Thanks, series applied to mips-next tree.
Re: [PATCH 2/4] MAINTAINERS: Add entries to cover MIPS CPS / GIC hardware
Hi Paul, You are the maintainer of the Boston machine which is the only one using these peripherals; would you agree to be (co-)maintainer of these files? Regards, Phil. On 10/4/21 11:25, Philippe Mathieu-Daudé wrote: > MIPS CPS and GIC models are unrelated to the TCG frontend. > Move them as new sections under the 'Devices' group. > > Cc: Paul Burton > Signed-off-by: Philippe Mathieu-Daudé > --- > MAINTAINERS | 22 -- > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index cfee52a3046..a5268ad0651 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -239,14 +239,8 @@ F: target/mips/ > F: configs/devices/mips*/* > F: disas/mips.c > F: docs/system/cpu-models-mips.rst.inc > -F: hw/intc/mips_gic.c > F: hw/mips/ > -F: hw/misc/mips_* > -F: hw/timer/mips_gictimer.c > -F: include/hw/intc/mips_gic.h > F: include/hw/mips/ > -F: include/hw/misc/mips_* > -F: include/hw/timer/mips_gictimer.h > F: tests/tcg/mips/ > > MIPS TCG CPUs (nanoMIPS ISA) > @@ -2271,6 +2265,22 @@ S: Odd Fixes > F: hw/intc/openpic.c > F: include/hw/ppc/openpic.h > > +MIPS CPS > +M: Paul Burton > +M: Philippe Mathieu-Daudé > +S: Odd Fixes > +F: hw/misc/mips_* > +F: include/hw/misc/mips_* > + > +MIPS GIC > +M: Paul Burton > +M: Philippe Mathieu-Daudé > +S: Odd Fixes > +F: hw/intc/mips_gic.c > +F: hw/timer/mips_gictimer.c > +F: include/hw/intc/mips_gic.h > +F: include/hw/timer/mips_gictimer.h > + > Subsystems > -- > Overall Audio backends >
Re: [PATCH v1 2/2] memory: Update description of memory_region_is_mapped()
On 10/11/21 23:21, Richard Henderson wrote: > On 10/11/21 10:45 AM, David Hildenbrand wrote: >> /** >> * memory_region_is_mapped: returns true if #MemoryRegion is mapped >> - * into any address space. >> + * into another #MemoryRegion directly. Will return false if the >> + * #MemoryRegion is mapped indirectly via an alias. > > Hmm. I guess. It kinda sorta sounds like a bug, but I don't know the > interface well enough to tell. I tend to agree there is a generic issue with aliases, see: https://www.mail-archive.com/qemu-devel@nongnu.org/msg732527.html then https://www.mail-archive.com/qemu-devel@nongnu.org/msg799622.html "memory: Directly dispatch alias accesses on origin memory region" The API description looks OK to me, I'd rather change the implementation... Maybe we need a MR_ALIAS_FOREACH() macro?
Re: [PATCH 1/8] tcg: Add TCG_TARGET_SIGNED_ADDR32
On 10/10/21 19:43, Richard Henderson wrote: > Define as 0 for all tcg hosts. Put this in a separate header, > because we'll want this in places that do not ordinarily have > access to all of tcg/tcg.h. > > Signed-off-by: Richard Henderson > --- > tcg/aarch64/tcg-target-sa32.h | 1 + > tcg/arm/tcg-target-sa32.h | 1 + > tcg/i386/tcg-target-sa32.h| 1 + > tcg/mips/tcg-target-sa32.h| 1 + > tcg/ppc/tcg-target-sa32.h | 1 + > tcg/riscv/tcg-target-sa32.h | 1 + > tcg/s390x/tcg-target-sa32.h | 1 + > tcg/sparc/tcg-target-sa32.h | 1 + > tcg/tci/tcg-target-sa32.h | 1 + > 9 files changed, 9 insertions(+) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH 4/8] accel/tcg: Add guest_base_signed_addr32 for user-only
On 10/10/21 19:43, Richard Henderson wrote: > While the host may prefer to treat 32-bit addresses as signed, > there are edge cases of guests that cannot be implemented with > addresses 0x7fff_ and 0x8000_ being non-consecutive. > > Therefore, default to guest_base_signed_addr32 false, and allow > probe_guest_base to determine whether it is possible to set it > to true. A tcg backend which sets TCG_TARGET_SIGNED_ADDR32 will > have to cope with either setting for user-only. > > Signed-off-by: Richard Henderson > --- > include/exec/cpu-all.h | 16 > include/exec/cpu_ldst.h | 3 ++- > bsd-user/main.c | 4 > linux-user/main.c | 3 +++ > 4 files changed, 25 insertions(+), 1 deletion(-) > > diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h > index 32cfb634c6..80b5e17329 100644 > --- a/include/exec/cpu-all.h > +++ b/include/exec/cpu-all.h > @@ -146,6 +146,7 @@ static inline void tswap64s(uint64_t *s) > > #if defined(CONFIG_USER_ONLY) > #include "exec/user/abitypes.h" > +#include "tcg-target-sa32.h" Unrelated but this header could be simplified by moving this block to a new header such "exec/user/address.h". > > /* On some host systems the guest address space is reserved on the host. > * This allows the guest address space to be offset to a convenient location. > @@ -154,6 +155,21 @@ extern uintptr_t guest_base; > extern bool have_guest_base; > extern unsigned long reserved_va; > > +#if TCG_TARGET_SIGNED_ADDR32 && TARGET_LONG_BITS == 32 > +extern bool guest_base_signed_addr32; > +#else > +#define guest_base_signed_addr32 false > +#endif > + > +static inline void set_guest_base_signed_addr32(void) > +{ > +#ifdef guest_base_signed_addr32 > +qemu_build_not_reached(); > +#else > +guest_base_signed_addr32 = true; > +#endif > +} > + > /* > * Limit the guest addresses as best we can. > * Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH 8/8] target/riscv: Support TCG_TARGET_SIGNED_ADDR32
On 10/10/21 19:44, Richard Henderson wrote: > All RV64 32-bit operations sign-extend the output, so we are easily > able to keep TCG_TYPE_I32 values sign-extended in host registers. > > Signed-off-by: Richard Henderson > --- > tcg/riscv/tcg-target-sa32.h | 6 +- > tcg/riscv/tcg-target.c.inc | 8 ++-- > 2 files changed, 7 insertions(+), 7 deletions(-) Nice. Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH] memory: Log access direction for invalid accesses
On 10/11/21 10:32 AM, BALATON Zoltan wrote: In memory_region_access_valid() invalid accesses are logged to help debugging but the log message does not say if it was a read or write. Log that too to better identify the access causing the problem. Signed-off-by: BALATON Zoltan --- softmmu/memory.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) Thanks, queued for tcg-next. r~
Re: [PATCH 2/8] accel/tcg: Split out g2h_tlbe
On 10/10/21 19:43, Richard Henderson wrote: > Create a new function to combine a CPUTLBEntry addend > with the guest address to form a host address. > > Signed-off-by: Richard Henderson > --- > accel/tcg/cputlb.c | 24 ++-- > 1 file changed, 14 insertions(+), 10 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH v3 6/9] bsd-user/mmap.c: Convert to qemu_log logging for mmap debugging
On 10/8/21 23:23, Warner Losh wrote: > Convert DEBUG_MMAP to qemu_log CPU_LOG_PAGE. > > Signed-off-by: Warner Losh > --- > bsd-user/mmap.c | 53 + > 1 file changed, 23 insertions(+), 30 deletions(-) Reviewed-by: Philippe Mathieu-Daudé
Re: [PATCH] s390x/ipl: check kernel command line size
On 10/11/21 15:38, Thomas Huth wrote: > On 06/10/2021 11.26, Marc Hartmayer wrote: >> Check if the provided kernel command line exceeds the maximum size of >> the s390x >> Linux kernel command line size, which is 896 bytes. >> >> Reported-by: Sven Schnelle >> Signed-off-by: Marc Hartmayer >> --- >> hw/s390x/ipl.c | 12 +++- >> 1 file changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/hw/s390x/ipl.c b/hw/s390x/ipl.c >> index 1821c6faeef3..a58ea58cc736 100644 >> --- a/hw/s390x/ipl.c >> +++ b/hw/s390x/ipl.c >> @@ -38,6 +38,7 @@ >> #define KERN_IMAGE_START 0x01UL >> #define LINUX_MAGIC_ADDR 0x010008UL >> #define KERN_PARM_AREA 0x010480UL >> +#define KERN_PARM_AREA_SIZE 0x000380UL >> #define INITRD_START 0x80UL >> #define INITRD_PARM_START 0x010408UL >> #define PARMFILE_START 0x001000UL >> @@ -190,10 +191,19 @@ static void s390_ipl_realize(DeviceState *dev, >> Error **errp) >> * loader) and it won't work. For this case we force it to >> 0x1, too. >> */ >> if (pentry == KERN_IMAGE_START || pentry == 0x800) { >> - char *parm_area = rom_ptr(KERN_PARM_AREA, >> strlen(ipl->cmdline) + 1); >> + size_t cmdline_size = strlen(ipl->cmdline) + 1; >> + char *parm_area = rom_ptr(KERN_PARM_AREA, cmdline_size); >> + >> ipl->start_addr = KERN_IMAGE_START; >> /* Overwrite parameters in the kernel image, which are >> "rom" */ >> if (parm_area) { >> + if (cmdline_size > KERN_PARM_AREA_SIZE) { >> + error_setg(errp, >> + "kernel command line exceeds maximum >> size: %lu > %lu", > > I think the first %lu should be %zd instead? %zu ;) Reviewed-by: Philippe Mathieu-Daudé > > Apart from that, the patch looks fine to me... so if you agree, I can > fix that up when picking up the patch. > > Thomas
Re: [RFC PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
Richard Henderson writes: > On 10/11/21 8:57 AM, Alex Bennée wrote: >> We use INDEX_op_insn_start to make the start of instruction >> boundaries. If we don't do it in the .insn_start hook things get >> confused especially now plugins want to use that marking to identify >> the start of instructions and will bomb out if it sees instrumented >> ops before the first instruction boundary. >> Signed-off-by: Alex Bennée >> --- >> target/s390x/tcg/translate.c | 25 ++--- >> 1 file changed, 22 insertions(+), 3 deletions(-) >> diff --git a/target/s390x/tcg/translate.c >> b/target/s390x/tcg/translate.c >> index f284870cd2..fe145ff2eb 100644 >> --- a/target/s390x/tcg/translate.c >> +++ b/target/s390x/tcg/translate.c >> @@ -6380,9 +6380,6 @@ static DisasJumpType translate_one(CPUS390XState *env, >> DisasContext *s) >> /* Search for the insn in the table. */ >> insn = extract_insn(env, s); >> -/* Emit insn_start now that we know the ILEN. */ >> -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); >> - >> /* Not found means unimplemented/illegal opcode. */ >> if (insn == NULL) { >> qemu_log_mask(LOG_UNIMP, "unimplemented opcode 0x%02x%02x\n", >> @@ -6550,8 +6547,30 @@ static void s390x_tr_tb_start(DisasContextBase *db, >> CPUState *cs) >> { >> } >> +/* >> + * We just enough partial instruction decoding here to calculate the >> + * length of the instruction so we can drop the INDEX_op_insn_start >> + * before anything else is emitted in the TCGOp stream. >> + * >> + * See extract_insn for the full decode. 
>> + */ >> static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) >> { >> +CPUS390XState *env = cs->env_ptr; >> +DisasContext *s = container_of(dcbase, DisasContext, base); >> +uint64_t insn, pc = s->base.pc_next; >> +int op, ilen; >> + >> +if (unlikely(s->ex_value)) { >> +ilen = s->ex_value & 0xf; >> +} else { >> +insn = ld_code2(env, s, pc); /* FIXME: don't reload same pc twice >> */ > > I think we might as well delay the set of ilen until after the normal > load, rather than introduce a fixme. I'd got as far as this before thinking it was getting messy: --8<---cut here---start->8--- squash! target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start 1 file changed, 16 insertions(+), 19 deletions(-) target/s390x/tcg/translate.c | 35 --- modified target/s390x/tcg/translate.c @@ -147,6 +147,7 @@ struct DisasContext { */ uint64_t pc_tmp; uint32_t ilen; +uint16_t start_of_insn; /* collected at s390x_tr_insn_start */ enum cc_op cc_op; bool do_debug; }; @@ -388,10 +389,10 @@ static void update_cc_op(DisasContext *s) } } -static inline uint64_t ld_code2(CPUS390XState *env, DisasContext *s, +static inline uint16_t ld_code2(CPUS390XState *env, DisasContext *s, uint64_t pc) { -return (uint64_t)translator_lduw(env, >base, pc); +return translator_lduw(env, >base, pc); } static inline uint64_t ld_code4(CPUS390XState *env, DisasContext *s, @@ -6261,7 +6262,7 @@ static void extract_field(DisasFields *o, const DisasField *f, uint64_t insn) static const DisasInsn *extract_insn(CPUS390XState *env, DisasContext *s) { uint64_t insn, pc = s->base.pc_next; -int op, op2, ilen; +int op, op2; const DisasInsn *info; if (unlikely(s->ex_value)) { @@ -6272,28 +6273,25 @@ static const DisasInsn *extract_insn(CPUS390XState *env, DisasContext *s) /* Extract the values saved by EXECUTE. 
 */ insn = s->ex_value & 0xffffffffffff0000ull; -ilen = s->ex_value & 0xf; +s->ilen = s->ex_value & 0xf; op = insn >> 56; } else { -insn = ld_code2(env, s, pc); -op = (insn >> 8) & 0xff; -ilen = get_ilen(op); -switch (ilen) { +op = extract32(s->start_of_insn, 8, 8); +insn = deposit64(0, 48, 16, s->start_of_insn); +switch (s->ilen) { case 2: -insn = insn << 48; break; case 4: -insn = ld_code4(env, s, pc) << 32; +insn = deposit64(insn, 32, 16, ld_code2(env, s, pc + 2)); break; case 6: -insn = (insn << 48) | (ld_code4(env, s, pc + 2) << 16); +insn = deposit64(insn, 16, 32, ld_code4(env, s, pc + 2)); break; default: g_assert_not_reached(); } } -s->pc_tmp = s->base.pc_next + ilen; -s->ilen = ilen; +s->pc_tmp = s->base.pc_next + s->ilen; /* We can't actually determine the insn format until we've looked up the full insn opcode. Which we can't do without locating the @@ -6558,18 +6556,17 @@ static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { CPUS390XState *env = cs->env_ptr; DisasContext *s = container_of(dcbase,
Re: [PATCH v1 2/2] memory: Update description of memory_region_is_mapped()
On 10/11/21 10:45 AM, David Hildenbrand wrote: /** * memory_region_is_mapped: returns true if #MemoryRegion is mapped - * into any address space. + * into another #MemoryRegion directly. Will return false if the + * #MemoryRegion is mapped indirectly via an alias. Hmm. I guess. It kinda sorta sounds like a bug, but I don't know the interface well enough to tell. r~
Re: [PATCH v1 1/2] machine: Use host_memory_backend_is_mapped() in machine_consume_memdev()
On 10/11/21 10:45 AM, David Hildenbrand wrote: memory_region_is_mapped() is the wrong check, we actually want to check whether the backend is already marked mapped. For example, memory regions mapped via an alias, such as NVDIMMs, currently don't make memory_region_is_mapped() return "true". As the machine is initialized before any memory devices (and thereby before NVDIMMs are initialized), this isn't a fix but merely a cleanup. Signed-off-by: David Hildenbrand --- hw/core/machine.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Richard Henderson r~
Re: [PATCH v2 09/15] softmmu/qdev-monitor: add error handling in qdev_set_id
On Fri, Oct 08, 2021 at 03:34:36PM +0200, Kevin Wolf wrote: > From: Damien Hedde > > qdev_set_id() is mostly used when the user adds a device (using > -device cli option or device_add qmp command). This commit adds > an error parameter to handle the case where the given id is > already taken. > > Also document the function and add a return value in order to > be able to capture success/failure: the function now returns the > id in case of success, or NULL in case of failure. > > The commit modifies the 2 calling places (qdev-monitor and > xen-legacy-backend) to add the error object parameter. > > Note that the id is, right now, guaranteed to be unique because > all ids came from the "device" QemuOptsList where the id is used > as key. This addition is a preparation for a future commit which > will relax the uniqueness. > > Signed-off-by: Damien Hedde > Signed-off-by: Kevin Wolf > --- > +++ b/softmmu/qdev-monitor.c > @@ -593,22 +593,34 @@ static BusState *qbus_find(const char *path, Error > **errp) > } > > /* Takes ownership of @id, will be freed when deleting the device */ > -void qdev_set_id(DeviceState *dev, char *id) > +const char *qdev_set_id(DeviceState *dev, char *id, Error **errp) > { > -if (id) { > -dev->id = id; > -} > +ObjectProperty *prop; > > -if (dev->id) { > -object_property_add_child(qdev_get_peripheral(), dev->id, > - OBJECT(dev)); > +assert(!dev->id && !dev->realized); > + > +/* > + * object_property_[try_]add_child() below will assert the device > + * has no parent > + */ > +if (id) { > +prop = object_property_try_add_child(qdev_get_peripheral(), id, > + OBJECT(dev), NULL); > +if (prop) { > +dev->id = id; > +} else { > +error_setg(errp, "Duplicate device ID '%s'", id); > +return NULL; id is not freed on this error path... 
> +} > } else { > static int anon_count; > gchar *name = g_strdup_printf("device[%d]", anon_count++); > -object_property_add_child(qdev_get_peripheral_anon(), name, > - OBJECT(dev)); > +prop = object_property_add_child(qdev_get_peripheral_anon(), name, > + OBJECT(dev)); > g_free(name); > } > + > +return prop->name; > } > > DeviceState *qdev_device_add(QemuOpts *opts, Error **errp) > @@ -691,7 +703,13 @@ DeviceState *qdev_device_add(QemuOpts *opts, Error > **errp) > } > } > > -qdev_set_id(dev, g_strdup(qemu_opts_id(opts))); > +/* > + * set dev's parent and register its id. > + * If it fails it means the id is already taken. > + */ > +if (!qdev_set_id(dev, g_strdup(qemu_opts_id(opts)), errp)) { > +goto err_del_dev; ...nor on this, which means the g_strdup() leaks on failure. > +} > > /* set properties */ > if (qemu_opt_foreach(opts, set_property, dev, errp)) { > -- > 2.31.1 > -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
On Mon, Oct 11, 2021 at 5:45 PM Eric Blake wrote: > > On Mon, Oct 11, 2021 at 04:38:23PM -0300, Leonardo Bras Soares Passos wrote: > > > > /** > > > > - * qio_channel_writev_full: > > > > + * qio_channel_writev_full_flags: > > > > * @ioc: the channel object > > > > * @iov: the array of memory regions to write data from > > > > * @niov: the length of the @iov array > > > > * @fds: an array of file handles to send > > > > * @nfds: number of file handles in @fds > > > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > > > * @errp: pointer to a NULL-initialized error object > > > > * > > > > * Write data to the IO channel, reading it from the > > > > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > > * guaranteed. If the channel is non-blocking and no > > > > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > > > > * > > > > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > > > > + * function will return once each buffer was queued for > > > > + * sending. > > > > > > This would be a good place to document the requirement to keep the > > > buffer unchanged until the zerocopy sequence completes. > > > > That makes sense, even though that may be true for just some > > implementations, > > it makes sense to document it here. > > > > > > > Ok, > > Is it enough to document it in a single one of the places suggested, or > > would you recommend documenting it in all suggested places? > > Ah, the curse of maintaining copy-and-paste. If you can find a way to > say "see this other type for limitations" that sounds fine, it avoids > the risk of later edits touching one but not all identical copies. > But our current process for generating sphinx documentation from the > qapi generator does not have cross-referencing abilities that I'm > aware of. Markus or John, any thoughts? 
> > > > > > > > +int qio_channel_flush_zerocopy(QIOChannel *ioc, > > > > + Error **errp) > > > > +{ > > > > +QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc); > > > > + > > > > +if (!klass->io_flush_zerocopy || > > > > +!qio_channel_has_feature(ioc, > > > > QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY)) { > > > > +return 0; > > > > > > Matches your documentation, but an ideal app should not be trying to > > > flush if the write failed in the first place. So wouldn't it be > > > better to return -1 or even abort() on a coding error? > > > > The point here is that any valid user of zerocopy_flush would have > > already used zerocopy_writev > > at some point, and failed if not supported / enabled. > > > > Having this not returning error can help the user keep a simpler > > approach when using > > a setup in which the writev can happen in both zerocopy or default behavior. > > > > I mean, the user will not need to check if zerocopy was or was not > > enabled, and just flush anyway. > > > > But if it's not good behavior, or you guys think it's a better > > approach to fail here, I can also do that. > > Either flush is supposed to be a no-op when zerocopy fails (so > returning 0 is okay), or should not be attempted unless zerocopy > succeeded (in which case abort()ing seems like the best way to point > out the programmer's error). But I wasn't clear from your > documentation which of the two behaviors you had in mind. Oh, sorry about that. Yeah, I intend to use it as a no-op. If it's fine I will update the docs for v5. > > -- > Eric Blake, Principal Software Engineer > Red Hat, Inc. +1-919-301-3266 > Virtualization: qemu.org | libvirt.org > Thanks!
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
On Mon, Oct 11, 2021 at 04:38:23PM -0300, Leonardo Bras Soares Passos wrote: > > > /** > > > - * qio_channel_writev_full: > > > + * qio_channel_writev_full_flags: > > > * @ioc: the channel object > > > * @iov: the array of memory regions to write data from > > > * @niov: the length of the @iov array > > > * @fds: an array of file handles to send > > > * @nfds: number of file handles in @fds > > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > > * @errp: pointer to a NULL-initialized error object > > > * > > > * Write data to the IO channel, reading it from the > > > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > * guaranteed. If the channel is non-blocking and no > > > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > > > * > > > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > > > + * function will return once each buffer was queued for > > > + * sending. > > > > This would be a good place to document the requirement to keep the > > buffer unchanged until the zerocopy sequence completes. > > That makes sense, even though that may be true for just some implementations, > it makes sense to document it here. > > > Ok, > Is it enough to document it in a single one of the places suggested, or > would you recommend documenting it in all suggested places? Ah, the curse of maintaining copy-and-paste. If you can find a way to say "see this other type for limitations" that sounds fine, it avoids the risk of later edits touching one but not all identical copies. But our current process for generating sphinx documentation from the qapi generator does not have cross-referencing abilities that I'm aware of. Markus or John, any thoughts? 
> > > > > +int qio_channel_flush_zerocopy(QIOChannel *ioc, > > > + Error **errp) > > > +{ > > > +QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc); > > > + > > > +if (!klass->io_flush_zerocopy || > > > +!qio_channel_has_feature(ioc, > > > QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY)) { > > > +return 0; > > > > Matches your documentation, but an ideal app should not be trying to > > flush if the write failed in the first place. So wouldn't it be > > better to return -1 or even abort() on a coding error? > > The point here is that any valid user of zerocopy_flush would have > already used zerocopy_writev > at some point, and failed if not supported / enabled. > > Having this not returning error can help the user keep a simpler > approach when using > a setup in which the writev can happen in both zerocopy or default behavior. > > I mean, the user will not need to check if zerocopy was or was not > enabled, and just flush anyway. > > But if it's not good behavior, or you guys think it's a better > approach to fail here, I can also do that. Either flush is supposed to be a no-op when zerocopy fails (so returning 0 is okay), or should not be attempted unless zerocopy succeeded (in which case abort()ing seems like the best way to point out the programmer's error). But I wasn't clear from your documentation which of the two behaviors you had in mind. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
[PATCH v1] libvhost-user: fix VHOST_USER_REM_MEM_REG skipping mmap_addr
We end up not copying the mmap_addr of all existing regions, resulting in a SEGFAULT once we actually try to map/access anything within our memory regions. Fixes: 875b9fd97b34 ("Support individual region unmap in libvhost-user") Cc: qemu-sta...@nongnu.org Cc: Michael S. Tsirkin Cc: Raphael Norwitz Cc: "Marc-André Lureau" Cc: Stefan Hajnoczi Cc: Paolo Bonzini Cc: Coiby Xu Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 1 + 1 file changed, 1 insertion(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index bf09693255..787f4d2d4f 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -816,6 +816,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { shadow_regions[j].gpa = dev->regions[i].gpa; shadow_regions[j].size = dev->regions[i].size; shadow_regions[j].qva = dev->regions[i].qva; +shadow_regions[j].mmap_addr = dev->regions[i].mmap_addr; shadow_regions[j].mmap_offset = dev->regions[i].mmap_offset; j++; } else { -- 2.31.1
Re: [PATCH v4 3/3] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)
Hello Eric, On Mon, Oct 11, 2021 at 4:32 PM Eric Blake wrote: > > On Sat, Oct 09, 2021 at 04:56:13AM -0300, Leonardo Bras wrote: > > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel > > zerocopy interface. > > > > Change multifd_send_sync_main() so it can distinguish the last sync from > > the setup and per-iteration ones, so a flush_zerocopy() can be called > > at the last sync in order to make sure all RAM is sent before finishing > > the migration. > > > > Also make it return -1 if flush_zerocopy() fails, in order to cancel > > the migration process, and avoid resuming the guest in the target host > > without receiving all current RAM. > > > > This will work fine on RAM migration because the RAM pages are not usually > > freed, > > and there is no problem on changing the pages content between async_send() > > and > > the actual sending of the buffer, because this change will dirty the page > > and > > cause it to be re-sent on a next iteration anyway. > > > > Given a lot of locked memory may be needed in order to use multid migration > > with zerocopy enabled, make it optional by creating a new parameter > > multifd-zerocopy on qapi, so low-privileged users can still perform multifd > > migrations. > > > > Signed-off-by: Leonardo Bras > > --- > > qapi/migration.json | 18 ++ > > migration/migration.h | 1 + > > migration/multifd.h | 2 +- > > migration/migration.c | 20 > > migration/multifd.c | 33 - > > migration/ram.c | 20 +--- > > monitor/hmp-cmds.c| 4 > > 7 files changed, 85 insertions(+), 13 deletions(-) > > > > diff --git a/qapi/migration.json b/qapi/migration.json > > index 88f07baedd..c4890cbb54 100644 > > --- a/qapi/migration.json > > +++ b/qapi/migration.json > > @@ -724,6 +724,11 @@ > > # will consume more CPU. > > # Defaults to 1. (Since 5.0) > > # > > +# @multifd-zerocopy: Controls behavior on sending memory pages on multifd > > migration. 
> > +#When true, enables a zerocopy mechanism for sending > > memory > > +#pages, if host does support it. > > s/does support/supports/ (several instances this patch) I will make sure to fix that in v5. > > > +#Defaults to false. (Since 6.2) > > +# > > # @block-bitmap-mapping: Maps block nodes and bitmaps on them to > > #aliases for the purpose of dirty bitmap > > migration. Such > > #aliases may for example be the corresponding > > names on the > > @@ -758,6 +763,7 @@ > > 'xbzrle-cache-size', 'max-postcopy-bandwidth', > > 'max-cpu-throttle', 'multifd-compression', > > 'multifd-zlib-level' ,'multifd-zstd-level', > > +'multifd-zerocopy', > > 'block-bitmap-mapping' ] } > > Should this feature be guarded with 'if':'CONFIG_LINUX', since that's > the only platform where you even compile the code (even then, it can > still fail if the kernel doesn't support it). I think it makes sense for the feature to always be available, even though it's not supported outside linux > v4.14. IMHO it makes more sense for the user to get an error when it starts migration, due to host not supporting zerocopy, than the error happening in the config step, which may cause the user to question if it was the right parameter. The config message error here could also be ignored, and users can think zerocopy is working, while it's not. For automated migrations, this feature should never be enabled for non-linux / older linux hosts anyway. Is there a good argument I am missing for this feature being disabled on non-linux? > > -- > Eric Blake, Principal Software Engineer > Red Hat, Inc. +1-919-301-3266 > Virtualization: qemu.org | libvirt.org > Best regards, Leo
Re: [PATCH v4 2/3] QIOChannelSocket: Implement io_writev_zerocopy & io_flush_zerocopy for CONFIG_LINUX
Hello Eric, On Mon, Oct 11, 2021 at 4:28 PM Eric Blake wrote: > > On Sat, Oct 09, 2021 at 04:56:12AM -0300, Leonardo Bras wrote: > > For CONFIG_LINUX, implement the new optional callbacks io_write_zerocopy and > > io_flush_zerocopy on QIOChannelSocket, but enables it only when MSG_ZEROCOPY > > feature is available in the host kernel, which is checked on > > qio_channel_socket_connect_sync() > > > > qio_channel_socket_writev() contents were moved to a helper function > > qio_channel_socket_writev_flags() which accepts an extra argument for flags. > > (This argument is passed directly to sendmsg(). > > > > The above helper function is used to implement qio_channel_socket_writev(), > > with flags = 0, keeping it's behavior unchanged, and > > its (remember, "it's" is shorthand for "it is", which does not fit here) > > > qio_channel_socket_writev_zerocopy() with flags = MSG_ZEROCOPY. > > > > qio_channel_socket_flush_zerocopy() was implemented by counting how many > > times > > sendmsg(...,MSG_ZEROCOPY) was sucessfully called, and then reading the > > socket's error queue, in order to find how many of them finished sending. > > Flush will loop until those counters are the same, or until some error > > ocurs. > > occurs Thanks! (I will check if my codespell is enabled in this setup) > > > > > A new function qio_channel_socket_poll() was also created in order to avoid > > busy-looping recvmsg() in qio_channel_socket_flush_zerocopy() while waiting > > for > > updates in socket's error queue. > > > > Notes on using writev_zerocopy(): > > 1: Buffer > > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid > > copying, > > some caution is necessary to avoid overwriting any buffer before it's sent. > > If something like this happen, a newer version of the buffer may be sent > > instead. > > - If this is a problem, it's recommended to call flush_zerocopy() before > > freeing > > or re-using the buffer. 
> > > > 2: Locked memory > > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, > > and > > unlocked after it's sent. > > - Depending on the size of each buffer, and how often it's sent, it may > > require > > a larger amount of locked memory than usually available to non-root user. > > - If the required amount of locked memory is not available, writev_zerocopy > > will return an error, which can abort an operation like migration, > > - Because of this, when an user code wants to add zerocopy as a feature, it > > requires a mechanism to disable it, so it can still be acessible to less > > privileged users. > > > > Signed-off-by: Leonardo Bras > > --- > > include/io/channel-socket.h | 2 + > > include/io/channel.h| 1 + > > io/channel-socket.c | 180 ++-- > > 3 files changed, 173 insertions(+), 10 deletions(-) > > > > +static int qio_channel_socket_flush_zerocopy(QIOChannel *ioc, > > + Error **errp) > > +{ > > > + > > +/* No errors, count sucessfully finished sendmsg()*/ > > Space before */ Thanks! Also, typo on 'successfully'. > > > +sioc->zerocopy_sent += serr->ee_data - serr->ee_info + 1; > > +} > > +return 0; > > +} > > + > > +#endif /* CONFIG_LINUX */ > > + > > static int > > qio_channel_socket_set_blocking(QIOChannel *ioc, > > bool enabled, > > @@ -787,6 +943,10 @@ static void qio_channel_socket_class_init(ObjectClass > > *klass, > > ioc_klass->io_set_delay = qio_channel_socket_set_delay; > > ioc_klass->io_create_watch = qio_channel_socket_create_watch; > > ioc_klass->io_set_aio_fd_handler = > > qio_channel_socket_set_aio_fd_handler; > > +#ifdef CONFIG_LINUX > > +ioc_klass->io_writev_zerocopy = qio_channel_socket_writev_zerocopy; > > +ioc_klass->io_flush_zerocopy = qio_channel_socket_flush_zerocopy; > > +#endif > > } > > I did a high-level look at the code, rather than an in-depth review of > whether zero-copy was being used correctly. It's so far been very helpful. Thanks! Best regards, Leo
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
Hello Eric, thank you for the feedback! On Mon, Oct 11, 2021 at 4:17 PM Eric Blake wrote: > > On Sat, Oct 09, 2021 at 04:56:11AM -0300, Leonardo Bras wrote: > > Adds io_async_writev and io_async_flush as optional callback to > > QIOChannelClass, > > Are these names accurate? I am sorry, I think I missed some renaming before generating the patchset. As you suggested those names are incorrect (as they reflect a previous naming used in v3). I will replace them for io_{writev,flush}_zerocopy in v5. > > > allowing the implementation of asynchronous writes by subclasses. > > > > How to use them: > > - Write data using qio_channel_writev_zerocopu(), > > s/copu/copy/ Thanks, I will fix that. > > > - Wait write completion with qio_channel_flush_zerocopy(). > > > > Notes: > > As some zerocopy implementations work asynchronously, it's > > recommended to keep the write buffer untouched until the return of > > qio_channel_flush_zerocopy(), by the risk of sending an updated buffer > > s/by/to avoid/ > > > instead of the one at the write. > > > > As the new callbacks are optional, if a subclass does not implement them, > > then: > > - io_async_writev will return -1, > > - io_async_flush will return 0 without changing anything. > > Are these names accurate? They are not, I will replace them for io_{writev,flush}_zerocopy in v5. > > > > > Also, some functions like qio_channel_writev_full_all() were adapted to > > receive a flag parameter. That allows shared code between zerocopy and > > non-zerocopy writev. 
> > > > Signed-off-by: Leonardo Bras > > --- > > include/io/channel.h | 103 +++ > > io/channel.c | 74 +++ > > 2 files changed, 141 insertions(+), 36 deletions(-) > > > > diff --git a/include/io/channel.h b/include/io/channel.h > > index 88988979f8..e7d4e1521f 100644 > > --- a/include/io/channel.h > > +++ b/include/io/channel.h > > @@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass, > > > > #define QIO_CHANNEL_ERR_BLOCK -2 > > > > +#define QIO_CHANNEL_WRITE_FLAG_ZEROCOPY 0x1 > > + > > typedef enum QIOChannelFeature QIOChannelFeature; > > > > enum QIOChannelFeature { > > QIO_CHANNEL_FEATURE_FD_PASS, > > QIO_CHANNEL_FEATURE_SHUTDOWN, > > QIO_CHANNEL_FEATURE_LISTEN, > > +QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY, > > }; > > > > > > @@ -136,6 +139,12 @@ struct QIOChannelClass { > >IOHandler *io_read, > >IOHandler *io_write, > >void *opaque); > > +ssize_t (*io_writev_zerocopy)(QIOChannel *ioc, > > + const struct iovec *iov, > > + size_t niov, > > + Error **errp); > > +int (*io_flush_zerocopy)(QIOChannel *ioc, > > + Error **errp); > > Indentation is off by one. Thanks, I will fix that for v5. > > > }; > > > > /* General I/O handling functions */ > > @@ -222,12 +231,13 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > > > > /** > > - * qio_channel_writev_full: > > + * qio_channel_writev_full_flags: > > * @ioc: the channel object > > * @iov: the array of memory regions to write data from > > * @niov: the length of the @iov array > > * @fds: an array of file handles to send > > * @nfds: number of file handles in @fds > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > * @errp: pointer to a NULL-initialized error object > > * > > * Write data to the IO channel, reading it from the > > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > * guaranteed. 
If the channel is non-blocking and no > > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > > * > > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > > + * function will return once each buffer was queued for > > + * sending. > > This would be a good place to document the requirement to keep the > buffer unchanged until the zerocopy sequence completes. That makes sense, even though that may be true for just some implementations, it makes sense to document it here. > > > Error **errp); > > > > /** > > - * qio_channel_writev_full_all: > > + * qio_channel_writev_full_all_flags: > > * @ioc: the channel object > > * @iov: the array of memory regions to write data from > > * @niov: the length of the @iov array > > * @fds: an array of file handles to send > > * @nfds: number of file handles in @fds > > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > > * @errp: pointer to a NULL-initialized error object > > * > > * > > @@ -846,13 +868,58 @@ int qio_channel_readv_full_all(QIOChannel *ioc, > > * to be written, yielding from the current coroutine > > * if required. > > * > > + * If QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed in flags, > > + * instead of waiting for all requested data to be written,
Re: [PATCH v4 3/3] multifd: Implement zerocopy write in multifd migration (multifd-zerocopy)
On Sat, Oct 09, 2021 at 04:56:13AM -0300, Leonardo Bras wrote: > Implement zerocopy on nocomp_send_write(), by making use of QIOChannel > zerocopy interface. > > Change multifd_send_sync_main() so it can distinguish the last sync from > the setup and per-iteration ones, so a flush_zerocopy() can be called > at the last sync in order to make sure all RAM is sent before finishing > the migration. > > Also make it return -1 if flush_zerocopy() fails, in order to cancel > the migration process, and avoid resuming the guest in the target host > without receiving all current RAM. > > This will work fine on RAM migration because the RAM pages are not usually > freed, > and there is no problem on changing the pages content between async_send() and > the actual sending of the buffer, because this change will dirty the page and > cause it to be re-sent on a next iteration anyway. > > Given a lot of locked memory may be needed in order to use multid migration > with zerocopy enabled, make it optional by creating a new parameter > multifd-zerocopy on qapi, so low-privileged users can still perform multifd > migrations. > > Signed-off-by: Leonardo Bras > --- > qapi/migration.json | 18 ++ > migration/migration.h | 1 + > migration/multifd.h | 2 +- > migration/migration.c | 20 > migration/multifd.c | 33 - > migration/ram.c | 20 +--- > monitor/hmp-cmds.c| 4 > 7 files changed, 85 insertions(+), 13 deletions(-) > > diff --git a/qapi/migration.json b/qapi/migration.json > index 88f07baedd..c4890cbb54 100644 > --- a/qapi/migration.json > +++ b/qapi/migration.json > @@ -724,6 +724,11 @@ > # will consume more CPU. > # Defaults to 1. (Since 5.0) > # > +# @multifd-zerocopy: Controls behavior on sending memory pages on multifd > migration. > +#When true, enables a zerocopy mechanism for sending > memory > +#pages, if host does support it. s/does support/supports/ (several instances this patch) > +#Defaults to false. 
(Since 6.2) > +# > # @block-bitmap-mapping: Maps block nodes and bitmaps on them to > #aliases for the purpose of dirty bitmap migration. > Such > #aliases may for example be the corresponding names > on the > @@ -758,6 +763,7 @@ > 'xbzrle-cache-size', 'max-postcopy-bandwidth', > 'max-cpu-throttle', 'multifd-compression', > 'multifd-zlib-level' ,'multifd-zstd-level', > +'multifd-zerocopy', > 'block-bitmap-mapping' ] } Should this feature be guarded with 'if':'CONFIG_LINUX', since that's the only platform where you even compile the code (even then, it can still fail if the kernel doesn't support it). -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
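[Editor's note] If the CONFIG_LINUX guard Eric suggests were adopted, the enum member would become a conditional entry, roughly like the fragment below. This is a sketch, not the merged schema; in particular the exact 'if' spelling (bare 'CONFIG_LINUX' vs. 'defined(CONFIG_LINUX)') depends on the QAPI generator version in the tree at the time.

```
{ 'enum': 'MigrationParameter',
  'data': [ ...,
            'multifd-zlib-level', 'multifd-zstd-level',
            { 'name': 'multifd-zerocopy', 'if': 'CONFIG_LINUX' },
            'block-bitmap-mapping' ] }
```

The trade-off discussed in the thread remains: a compile-time guard turns "unsupported host" into a schema-level error at set-parameter time, while an unconditional member defers the failure to migration start.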
Re: [PATCH v4 2/3] QIOChannelSocket: Implement io_writev_zerocopy & io_flush_zerocopy for CONFIG_LINUX
On Sat, Oct 09, 2021 at 04:56:12AM -0300, Leonardo Bras wrote: > For CONFIG_LINUX, implement the new optional callbacks io_write_zerocopy and > io_flush_zerocopy on QIOChannelSocket, but enables it only when MSG_ZEROCOPY > feature is available in the host kernel, which is checked on > qio_channel_socket_connect_sync() > > qio_channel_socket_writev() contents were moved to a helper function > qio_channel_socket_writev_flags() which accepts an extra argument for flags. > (This argument is passed directly to sendmsg(). > > The above helper function is used to implement qio_channel_socket_writev(), > with flags = 0, keeping it's behavior unchanged, and its (remember, "it's" is shorthand for "it is", which does not fit here) > qio_channel_socket_writev_zerocopy() with flags = MSG_ZEROCOPY. > > qio_channel_socket_flush_zerocopy() was implemented by counting how many times > sendmsg(...,MSG_ZEROCOPY) was sucessfully called, and then reading the > socket's error queue, in order to find how many of them finished sending. > Flush will loop until those counters are the same, or until some error ocurs. occurs > > A new function qio_channel_socket_poll() was also created in order to avoid > busy-looping recvmsg() in qio_channel_socket_flush_zerocopy() while waiting > for > updates in socket's error queue. > > Notes on using writev_zerocopy(): > 1: Buffer > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid > copying, > some caution is necessary to avoid overwriting any buffer before it's sent. > If something like this happen, a newer version of the buffer may be sent > instead. > - If this is a problem, it's recommended to call flush_zerocopy() before > freeing > or re-using the buffer. > > 2: Locked memory > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, and > unlocked after it's sent. 
> - Depending on the size of each buffer, and how often it's sent, it may > require > a larger amount of locked memory than usually available to non-root user. > - If the required amount of locked memory is not available, writev_zerocopy > will return an error, which can abort an operation like migration, > - Because of this, when an user code wants to add zerocopy as a feature, it > requires a mechanism to disable it, so it can still be acessible to less > privileged users. > > Signed-off-by: Leonardo Bras > --- > include/io/channel-socket.h | 2 + > include/io/channel.h| 1 + > io/channel-socket.c | 180 ++-- > 3 files changed, 173 insertions(+), 10 deletions(-) > > +static int qio_channel_socket_flush_zerocopy(QIOChannel *ioc, > + Error **errp) > +{ > + > +/* No errors, count sucessfully finished sendmsg()*/ Space before */ > +sioc->zerocopy_sent += serr->ee_data - serr->ee_info + 1; > +} > +return 0; > +} > + > +#endif /* CONFIG_LINUX */ > + > static int > qio_channel_socket_set_blocking(QIOChannel *ioc, > bool enabled, > @@ -787,6 +943,10 @@ static void qio_channel_socket_class_init(ObjectClass > *klass, > ioc_klass->io_set_delay = qio_channel_socket_set_delay; > ioc_klass->io_create_watch = qio_channel_socket_create_watch; > ioc_klass->io_set_aio_fd_handler = qio_channel_socket_set_aio_fd_handler; > +#ifdef CONFIG_LINUX > +ioc_klass->io_writev_zerocopy = qio_channel_socket_writev_zerocopy; > +ioc_klass->io_flush_zerocopy = qio_channel_socket_flush_zerocopy; > +#endif > } I did a high-level look at the code, rather than an in-depth review of whether zero-copy was being used correctly. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 1/3] QIOChannel: Add io_writev_zerocopy & io_flush_zerocopy callbacks
On Sat, Oct 09, 2021 at 04:56:11AM -0300, Leonardo Bras wrote: > Adds io_async_writev and io_async_flush as optional callback to > QIOChannelClass, Are these names accurate? > allowing the implementation of asynchronous writes by subclasses. > > How to use them: > - Write data using qio_channel_writev_zerocopu(), s/copu/copy/ > - Wait write completion with qio_channel_flush_zerocopy(). > > Notes: > As some zerocopy implementations work asynchronously, it's > recommended to keep the write buffer untouched until the return of > qio_channel_flush_zerocopy(), by the risk of sending an updated buffer s/by/to avoid/ > instead of the one at the write. > > As the new callbacks are optional, if a subclass does not implement them, > then: > - io_async_writev will return -1, > - io_async_flush will return 0 without changing anything. Are these names accurate? > > Also, some functions like qio_channel_writev_full_all() were adapted to > receive a flag parameter. That allows shared code between zerocopy and > non-zerocopy writev. 
> > Signed-off-by: Leonardo Bras > --- > include/io/channel.h | 103 +++ > io/channel.c | 74 +++ > 2 files changed, 141 insertions(+), 36 deletions(-) > > diff --git a/include/io/channel.h b/include/io/channel.h > index 88988979f8..e7d4e1521f 100644 > --- a/include/io/channel.h > +++ b/include/io/channel.h > @@ -32,12 +32,15 @@ OBJECT_DECLARE_TYPE(QIOChannel, QIOChannelClass, > > #define QIO_CHANNEL_ERR_BLOCK -2 > > +#define QIO_CHANNEL_WRITE_FLAG_ZEROCOPY 0x1 > + > typedef enum QIOChannelFeature QIOChannelFeature; > > enum QIOChannelFeature { > QIO_CHANNEL_FEATURE_FD_PASS, > QIO_CHANNEL_FEATURE_SHUTDOWN, > QIO_CHANNEL_FEATURE_LISTEN, > +QIO_CHANNEL_FEATURE_WRITE_ZEROCOPY, > }; > > > @@ -136,6 +139,12 @@ struct QIOChannelClass { >IOHandler *io_read, >IOHandler *io_write, >void *opaque); > +ssize_t (*io_writev_zerocopy)(QIOChannel *ioc, > + const struct iovec *iov, > + size_t niov, > + Error **errp); > +int (*io_flush_zerocopy)(QIOChannel *ioc, > + Error **errp); Indentation is off by one. > }; > > /* General I/O handling functions */ > @@ -222,12 +231,13 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > > > /** > - * qio_channel_writev_full: > + * qio_channel_writev_full_flags: > * @ioc: the channel object > * @iov: the array of memory regions to write data from > * @niov: the length of the @iov array > * @fds: an array of file handles to send > * @nfds: number of file handles in @fds > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > * @errp: pointer to a NULL-initialized error object > * > * Write data to the IO channel, reading it from the > @@ -242,6 +252,10 @@ ssize_t qio_channel_readv_full(QIOChannel *ioc, > * guaranteed. If the channel is non-blocking and no > * data can be sent, it will return QIO_CHANNEL_ERR_BLOCK > * > + * If flag QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed, > + * function will return once each buffer was queued for > + * sending. 
This would be a good place to document the requirement to keep the buffer unchanged until the zerocopy sequence completes. > Error **errp); > > /** > - * qio_channel_writev_full_all: > + * qio_channel_writev_full_all_flags: > * @ioc: the channel object > * @iov: the array of memory regions to write data from > * @niov: the length of the @iov array > * @fds: an array of file handles to send > * @nfds: number of file handles in @fds > + * @flags: write flags (QIO_CHANNEL_WRITE_FLAG_*) > * @errp: pointer to a NULL-initialized error object > * > * > @@ -846,13 +868,58 @@ int qio_channel_readv_full_all(QIOChannel *ioc, > * to be written, yielding from the current coroutine > * if required. > * > + * If QIO_CHANNEL_WRITE_FLAG_ZEROCOPY is passed in flags, > + * instead of waiting for all requested data to be written, > + * this function will wait until it's all queued for writing. Another good place to document restrictions on buffer stability. > + * > * Returns: 0 if all bytes were written, or -1 on error > */ > > -int qio_channel_writev_full_all(QIOChannel *ioc, > -const struct iovec *iov, > -size_t niov, > -int *fds, size_t nfds, > -Error **errp); > +int qio_channel_writev_full_all_flags(QIOChannel *ioc, > + const struct iovec *iov, > + size_t niov, > + int *fds, size_t nfds, > + int flags, Error
Re: [PATCH 0/9] nios2: Enable cross compile and fix signals
Ping. On 10/1/21 8:33 AM, Richard Henderson wrote: Patches 2, 3, and 5 have appeared before. The patch for the kuser page has been updated to use the commpage infrastructure, which needed expanding just a bit to handle the page being at the beginning of the address space. Getting the toolchain built allowed the code to actually be tested, which showed up a few more problems in the testsuite. I have already pushed the debian-nios2-cross image to gitlab, much like we did for hexagon and its locally built toolchain. r~ Richard Henderson (9): tests/docker: Add debian-nios2-cross image linux-user/nios2: Properly emulate EXCP_TRAP linux-user/nios2: Fixes for signal frame setup linux-user/elfload: Rename ARM_COMMPAGE to HI_COMMPAGE linux-user/nios2: Map a real kuser page linux-user/nios2: Fix EA vs PC confusion linux-user/nios2: Fix sigmask in setup_rt_frame linux-user/nios2: Use set_sigmask in do_rt_sigreturn tests/tcg: Enable container_cross_cc for nios2 target/nios2/cpu.h| 2 +- linux-user/elfload.c | 66 +++-- linux-user/nios2/cpu_loop.c | 93 +- linux-user/nios2/signal.c | 56 +-- linux-user/signal.c | 2 - target/nios2/translate.c | 26 +++-- tests/docker/Makefile.include | 19 .../dockerfiles/debian-nios2-cross.docker | 34 +++ .../build-toolchain.sh| 97 +++ tests/tcg/configure.sh| 6 ++ 10 files changed, 302 insertions(+), 99 deletions(-) create mode 100644 tests/docker/dockerfiles/debian-nios2-cross.docker create mode 100755 tests/docker/dockerfiles/debian-nios2-cross.docker.d/build-toolchain.sh
Re: [PATCH RFC v2 5/5] block: Deprecate transaction type drive-backup
On Sat, Oct 09, 2021 at 02:09:44PM +0200, Markus Armbruster wrote: > Several moons ago, Vladimir posted > > Subject: [PATCH v2 3/3] qapi: deprecate drive-backup > Date: Wed, 5 May 2021 16:58:03 +0300 > Message-Id: <20210505135803.67896-4-vsement...@virtuozzo.com> > https://lists.gnu.org/archive/html/qemu-devel/2021-05/msg01394.html > > with this > > TODO: We also need to deprecate drive-backup transaction action.. > But union members in QAPI doesn't support 'deprecated' feature. I tried > to dig a bit, but failed :/ Markus, could you please help with it? At > least by advice? > > This is one way to resolve it. Sorry it took so long. > > John explored another way, namely adding feature flags to union > branches. Could also be useful, say to add different features to > branches in multiple unions sharing the same tag enum. > > Signed-off-by: Markus Armbruster > --- > qapi/transaction.json | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/qapi/transaction.json b/qapi/transaction.json > index d175b5f863..0564a893b3 100644 > --- a/qapi/transaction.json > +++ b/qapi/transaction.json > @@ -54,6 +54,9 @@ > # @blockdev-snapshot-sync: since 1.1 > # @drive-backup: Since 1.6 > # > +# Features: > +# @deprecated: Member @drive-backup is deprecated. Use FIXME instead. Obviously, we'd need to flesh this out ("'blockdev-backup' with proper node names"? something else?) before dropping RFC on this patch. And we'd want to edit docs/about/deprecated.rst to match. > +# > # Since: 1.1 > ## > { 'enum': 'TransactionActionKind', > @@ -62,7 +65,7 @@ > 'block-dirty-bitmap-disable', 'block-dirty-bitmap-merge', > 'blockdev-backup', 'blockdev-snapshot', > 'blockdev-snapshot-internal-sync', 'blockdev-snapshot-sync', > -'drive-backup' ] } > +{ 'name': 'drive-backup', 'features': [ 'deprecated' ] } ] } > > ## > # @AbortWrapper: > -- > 2.31.1 > But the idea is reasonable, and I'm not sure if we're any closer to John's idea of feature flags on union branches. 
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v2 4/5] qapi: Implement deprecated-input={reject,crash} for enum values
On Sat, Oct 09, 2021 at 02:09:43PM +0200, Markus Armbruster wrote: > This copies the code implementing the policy from qapi/qmp-dispatch.c > to qapi/qobject-input-visitor.c. Tolerable, but if we acquire more > copes, we should look into factoring them out. copies > > Signed-off-by: Markus Armbruster > --- > docs/devel/qapi-code-gen.rst | 6 -- > qapi/compat.json | 3 ++- > include/qapi/util.h | 6 +- > qapi/qapi-visit-core.c | 18 +++--- > scripts/qapi/types.py| 17 - > 5 files changed, 42 insertions(+), 8 deletions(-) > > diff --git a/docs/devel/qapi-code-gen.rst b/docs/devel/qapi-code-gen.rst > index 00334e9fb8..006a6f4a9a 100644 > --- a/docs/devel/qapi-code-gen.rst > +++ b/docs/devel/qapi-code-gen.rst > @@ -708,8 +708,10 @@ QEMU shows a certain behaviour. > Special features > > > -Feature "deprecated" marks a command, event, or struct member as > -deprecated. It is not supported elsewhere so far. > +Feature "deprecated" marks a command, event, struct or enum member as Do we want the comma before the conjunction? (I've seen style guides that recommend it, and style guides that discourage it; while I tend to write by the former style, I usually don't care about the latter. Rather, switching styles mid-patch caught my attention). > +deprecated. It is not supported elsewhere so far. Interfaces so > +marked may be withdrawn in future releases in accordance with QEMU's > +deprecation policy. > > > +++ b/qapi/qapi-visit-core.c > @@ -393,7 +393,7 @@ static bool input_type_enum(Visitor *v, const char *name, > int *obj, > const QEnumLookup *lookup, Error **errp) > { > int64_t value; > -char *enum_str; > +g_autofree char *enum_str = NULL; Nice change while touching the code. Is it worth mentioning in the commit message? 
> > if (!visit_type_str(v, name, _str, errp)) { > return false; > @@ -402,11 +402,23 @@ static bool input_type_enum(Visitor *v, const char > *name, int *obj, > value = qapi_enum_parse(lookup, enum_str, -1, NULL); > if (value < 0) { > error_setg(errp, QERR_INVALID_PARAMETER, enum_str); > -g_free(enum_str); > return false; > } > > -g_free(enum_str); > +if (lookup->flags && (lookup->flags[value] & QAPI_ENUM_DEPRECATED)) { > +switch (v->compat_policy.deprecated_input) { > +case COMPAT_POLICY_INPUT_ACCEPT: > +break; > +case COMPAT_POLICY_INPUT_REJECT: > +error_setg(errp, "Deprecated value '%s' disabled by policy", > + enum_str); > +return false; > +case COMPAT_POLICY_INPUT_CRASH: > +default: > +abort(); > +} > +} > + > *obj = value; > return true; > } Grammar fixes are minor, so: Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
[PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
We use INDEX_op_insn_start to mark the start of instruction boundaries. If we don't do it in the .insn_start hook, things get confused, especially now that plugins want to use that marking to identify the start of instructions and will bomb out if they see instrumented ops before the first instruction boundary. Signed-off-by: Richard Henderson --- target/s390x/tcg/translate.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c index f284870cd2..a2d6fa5cca 100644 --- a/target/s390x/tcg/translate.c +++ b/target/s390x/tcg/translate.c @@ -138,6 +138,7 @@ struct DisasFields { struct DisasContext { DisasContextBase base; const DisasInsn *insn; +TCGOp *insn_start; DisasFields fields; uint64_t ex_value; /* @@ -6380,8 +6381,8 @@ static DisasJumpType translate_one(CPUS390XState *env, DisasContext *s) /* Search for the insn in the table. */ insn = extract_insn(env, s); -/* Emit insn_start now that we know the ILEN. */ -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); +/* Update insn_start now that we know the ILEN. */ +tcg_set_insn_start_param(s->insn_start, 2, s->ilen); /* Not found means unimplemented/illegal opcode. */ if (insn == NULL) { @@ -6552,6 +6553,11 @@ static void s390x_tr_tb_start(DisasContextBase *db, CPUState *cs) static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { +DisasContext *dc = container_of(dcbase, DisasContext, base); + +/* Delay the set of ilen until we've read the insn. */ +tcg_gen_insn_start(dc->base.pc_next, dc->cc_op, 0); +dc->insn_start = tcg_last_op(); } static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs) -- 2.25.1
Re: [PATCH v2 1/6] net/vmnet: dependencies setup, initial preparations
On Tue, Aug 31, 2021 at 10:27:15PM +0300, Vladislav Yaroshchuk wrote: > Add 'vmnet' customizable option and 'vmnet.framework' probe into > configure; > > Create separate netdev per each vmnet operating mode > because they use quite different settings. Especially since > macOS 11.0 (vmnet.framework API gets lots of updates) > Create source files for network client driver, update meson.build; > > Three new netdevs are added: > - vmnet-host > - vmnet-shared > - vmnet-bridged > > Signed-off-by: Vladislav Yaroshchuk > --- While I'm not the best for reviewing the overall series, I can at least comment on the proposed QAPI changes: > +++ b/qapi/net.json > @@ -452,6 +452,89 @@ > '*vhostdev': 'str', > '*queues': 'int' } } > > +## > +# @NetdevVmnetHostOptions: > +# > +# vmnet (host mode) network backend. > +# > +# Allows the vmnet interface to communicate with > +# other vmnet interfaces that are in host mode and also with the native host. > +# > +# @dhcpstart: The starting IPv4 address to use for the interface. Must be in > the > +# private IP range (RFC 1918). Must be specified along > +# with @dhcpend and @subnetmask. > +# This address is used as the gateway address. The subsequent > address > +# up to and including dhcpend are placed in the DHCP pool. Long lines. Most of the qapi docs wrap around 70 or so columns rather than pushing right up to the 80 column limit. > +# > +# @dhcpend: The DHCP IPv4 range end address to use for the interface. Must > be in > +# the private IP range (RFC 1918). Must be specified along > +# with @dhcpstart and @subnetmask. > +# > +# @subnetmask: The IPv4 subnet mask to use on the interface. Must be > specified > +# along with @dhcpstart and @subnetmask. 
> +# > +# > +# Since: 6.2, Drop the trailing comma > +## > +{ 'struct': 'NetdevVmnetHostOptions', > + 'data': { > +'*dhcpstart': 'str', > +'*dhcpend': 'str', > +'*subnetmask': 'str' > + }, > + 'if': 'CONFIG_VMNET' } > + > +## > +# @NetdevVmnetSharedOptions: > +# > +# vmnet (shared mode) network backend. > +# > +# Allows traffic originating from the vmnet interface to reach the > +# Internet through a network address translator (NAT). The vmnet interface > +# can also communicate with the native host. By default, the vmnet interface > +# is able to communicate with other shared mode interfaces. If a subnet range > +# is specified, the vmnet interface can communicate with other shared mode > +# interfaces on the same subnet. > +# > +# @dhcpstart: The starting IPv4 address to use for the interface. Must be in > the > +# private IP range (RFC 1918). Must be specified along > +# with @dhcpend and @subnetmask. > +# This address is used as the gateway address. The subsequent > address > +# up to and including dhcpend are placed in the DHCP pool. extra space > +# > +# @dhcpend: The DHCP IPv4 range end address to use for the interface. Must > be in > +# the private IP range (RFC 1918). Must be specified along > +# with @dhcpstart and @subnetmask. > +# > +# @subnetmask: The IPv4 subnet mask to use on the interface. Must be > specified > +# along with @dhcpstart and @subnetmask. > +# > +# > +# Since: 6.2, Another odd comma > +## > +{ 'struct': 'NetdevVmnetSharedOptions', > + 'data': { > +'*dhcpstart':'str', > +'*dhcpend': 'str', > +'*subnetmask': 'str' > + }, > + 'if': 'CONFIG_VMNET' } The NetdevVmnetHostOptions and NetdevVmnetSharedOptions types are identical; why do you need two types? > + > +## > +# @NetdevVmnetBridgedOptions: > +# > +# vmnet (bridged mode) network backend. > +# > +# Bridges the vmnet interface with a physical network interface. > +# > +# @ifname: The name of the physical interface to be bridged. 
> +# > +# Since: 6.2 > +## > +{ 'struct': 'NetdevVmnetBridgedOptions', > + 'data': { 'ifname': 'str' }, > + 'if': 'CONFIG_VMNET' } > + > ## > # @NetClientDriver: > # > @@ -460,10 +543,16 @@ > # Since: 2.7 > # > #@vhost-vdpa since 5.1 > +#@vmnet-host since 6.2 > +#@vmnet-shared since 6.2 > +#@vmnet-bridged since 6.2 > ## > { 'enum': 'NetClientDriver', >'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'vde', > -'bridge', 'hubport', 'netmap', 'vhost-user', 'vhost-vdpa' ] } > +'bridge', 'hubport', 'netmap', 'vhost-user', 'vhost-vdpa', > +{ 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' }, > +{ 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' }, > +{ 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] } > > ## > # @Netdev: > @@ -477,6 +566,9 @@ > # Since: 1.2 > # > #'l2tpv3' - since 2.1 > +#'vmnet-host' - since 6.2 > +#'vmnet-shared' - since 6.2 > +#'vmnet-bridged' - since 6.2 > ## > { 'union': 'Netdev', >'base': { 'id': 'str', 'type':
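Since the reviewer notes that NetdevVmnetHostOptions and NetdevVmnetSharedOptions are identical, one possible follow-up would be a single shared struct referenced by both union branches. A purely illustrative QAPI fragment (the struct name is made up, not part of the series):

```
{ 'struct': 'NetdevVmnetCommonOptions',
  'data': {
    '*dhcpstart': 'str',
    '*dhcpend': 'str',
    '*subnetmask': 'str' },
  'if': 'CONFIG_VMNET' }
```

Both the 'vmnet-host' and 'vmnet-shared' branches of the Netdev union could then reference this one type. Whether the two modes are expected to diverge later (and therefore deserve separate types) is a design question for the series author.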
Re: [PATCH v2 1/5] qapi: Enable enum member introspection to show more than name
On Sat, Oct 09, 2021 at 02:09:40PM +0200, Markus Armbruster wrote: > The next commit will add feature flags to enum members. There's a > problem, though: query-qmp-schema shows an enum type's members as an > array of member names (SchemaInfoEnum member @values). If it showed > an array of objects with a name member, we could simply add more > members to these objects. Since it's just strings, we can't. > > I can see three ways to correct this design mistake: > > 1. Do it the way we should have done it, plus compatibility goo. ... > 2. Like 1, but omit "boring" elements of @member, and empty @member. > 3. Versioned query-qmp-schema. > This commit implements 1. Libvirt developers prefer it. > > Signed-off-by: Markus Armbruster > --- > qapi/introspect.json | 21 +++-- > scripts/qapi/introspect.py | 18 ++ > 2 files changed, 33 insertions(+), 6 deletions(-) Reviewed-by: Eric Blake > > diff --git a/qapi/introspect.json b/qapi/introspect.json > index 39bd303778..f806bd7281 100644 > --- a/qapi/introspect.json > +++ b/qapi/introspect.json > @@ -142,14 +142,31 @@ > # > # Additional SchemaInfo members for meta-type 'enum'. > # > -# @values: the enumeration type's values, in no particular order. > +# @members: the enum type's members, in no particular order > +# (since 6.2). > +# > +# @values: the enumeration type's member names, in no particular order. > +# Redundant with @members. Just for backward compatibility. > # > # Values of this type are JSON string on the wire. > # > # Since: 2.5 > ## > { 'struct': 'SchemaInfoEnum', > - 'data': { 'values': ['str'] } } > + 'data': { 'members': [ 'SchemaInfoEnumMember' ], > +'values': ['str'] } } Not deprecated at this time, I agree with your choice to make the actual deprecation of 'values' to be in a later patch (if at all). > + > +## > +# @SchemaInfoEnumMember: > +# > +# An object member. > +# > +# @name: the member's name, as defined in the QAPI schema. 
> +# > +# Since: 6.2 > +## > +{ 'struct': 'SchemaInfoEnumMember', > + 'data': { 'name': 'str' } } Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
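For context, with this change a query-qmp-schema reply for an enum type would carry both the new @members array and the redundant @values array, roughly like the following (hand-written illustration, not actual QEMU output):

```
{ "meta-type": "enum",
  "name": "ExampleEnum",
  "members": [ { "name": "value-a" },
               { "name": "value-b" } ],
  "values": [ "value-a", "value-b" ] }
```

Later additions (such as the feature flags the next commit introduces) can then extend the per-member objects without breaking clients that still read @values.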
Re: [RFC PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
On 10/11/21 8:57 AM, Alex Bennée wrote: We use INDEX_op_insn_start to make the start of instruction boundaries. If we don't do it in the .insn_start hook things get confused especially now plugins want to use that marking to identify the start of instructions and will bomb out if it sees instrumented ops before the first instruction boundary. Signed-off-by: Alex Bennée --- target/s390x/tcg/translate.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c index f284870cd2..fe145ff2eb 100644 --- a/target/s390x/tcg/translate.c +++ b/target/s390x/tcg/translate.c @@ -6380,9 +6380,6 @@ static DisasJumpType translate_one(CPUS390XState *env, DisasContext *s) /* Search for the insn in the table. */ insn = extract_insn(env, s); -/* Emit insn_start now that we know the ILEN. */ -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); - /* Not found means unimplemented/illegal opcode. */ if (insn == NULL) { qemu_log_mask(LOG_UNIMP, "unimplemented opcode 0x%02x%02x\n", @@ -6550,8 +6547,30 @@ static void s390x_tr_tb_start(DisasContextBase *db, CPUState *cs) { } +/* + * We just enough partial instruction decoding here to calculate the + * length of the instruction so we can drop the INDEX_op_insn_start + * before anything else is emitted in the TCGOp stream. + * + * See extract_insn for the full decode. + */ static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { +CPUS390XState *env = cs->env_ptr; +DisasContext *s = container_of(dcbase, DisasContext, base); +uint64_t insn, pc = s->base.pc_next; +int op, ilen; + +if (unlikely(s->ex_value)) { +ilen = s->ex_value & 0xf; +} else { +insn = ld_code2(env, s, pc); /* FIXME: don't reload same pc twice */ I think we might as well delay the set of ilen until after the normal load, rather than introduce a fixme. r~ +op = (insn >> 8) & 0xff; +ilen = get_ilen(op); +} + +/* Emit insn_start now that we know the ILEN. 
*/ +tcg_gen_insn_start(s->base.pc_next, s->cc_op, ilen); } static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs)
Re: [PULL 0/2] Block patches
On 10/11/21 5:40 AM, Stefan Hajnoczi wrote: The following changes since commit ca61fa4b803e5d0abaf6f1ceb690f23bb78a4def: Merge remote-tracking branch 'remotes/quic/tags/pull-hex-20211006' into staging (2021-10-06 12:11:14 -0700) are available in the Git repository at: https://gitlab.com/stefanha/qemu.git tags/block-pull-request for you to fetch changes up to 1cc7eada97914f090125e588497986f6f7900514: iothread: use IOThreadParamInfo in iothread_[set|get]_param() (2021-10-07 15:29:50 +0100) Pull request Stefano Garzarella (2): iothread: rename PollParamInfo to IOThreadParamInfo iothread: use IOThreadParamInfo in iothread_[set|get]_param() iothread.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) Applied, thanks. r~
Re: [RFC PATCH v1 2/2] s390x/kvm: Pass SIGP Stop flags
On Mon, 2021-10-11 at 11:21 +0200, David Hildenbrand wrote: > On 11.10.21 10:40, Christian Borntraeger wrote: > > > > Am 11.10.21 um 09:09 schrieb David Hildenbrand: > > > On 08.10.21 22:38, Eric Farman wrote: > > > > When building a Stop IRQ to pass to KVM, we should incorporate > > > > the flags if handling the SIGP Stop and Store Status order. > > > > With that, KVM can reject other orders that are submitted for > > > > the same CPU while the operation is fully processed. > > > > > > > > Signed-off-by: Eric Farman > > > > Acked-by: Janosch Frank > > > > --- > > > >target/s390x/kvm/kvm.c | 4 > > > >1 file changed, 4 insertions(+) > > > > > > > > diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c > > > > index 5b1fdb55c4..701b9ddc88 100644 > > > > --- a/target/s390x/kvm/kvm.c > > > > +++ b/target/s390x/kvm/kvm.c > > > > @@ -2555,6 +2555,10 @@ void kvm_s390_stop_interrupt(S390CPU > > > > *cpu) > > > >.type = KVM_S390_SIGP_STOP, > > > >}; > > > > +if (cpu->env.sigp_order == SIGP_STOP_STORE_STATUS) { > > > > +irq.u.stop.flags = KVM_S390_STOP_FLAG_STORE_STATUS; > > > > +} > > > > + > > > > > > KVM_S390_STOP_FLAG_STORE_STATUS tells KVM to perform the store > > > status as well ... is that really what we want? > > At least it should not hurt I guess. QEMU then does it again? > > The thing is, that before we officially completed the action in user > space (and let other SIGP actions actually succeed in user space on > the > CPU), the target CPU will be reported as !busy in the kernel > already. > And before we even inject the stop interrupt, the CPU will be > detected > as !busy in the kernel. I guess it will fix some cases where we poll > via > SENSE if the stop and store happened, because the store *did* happen > in > the kernel and we'll simply store again in user space. > > However, I wonder if we want to handle it more generically: Properly > flag a CPU as busy for SIGP when we start processing the order until > we > completed processing the order. 
That would allow to handle other > SIGP > operations in user space cleanly, without any chance for races with > SENSE code running in the kernel. > I think a generic solution would be ideal, but I'm wrestling with the race with the kernel's SENSE code. Today, handle_sigp_single_dst already checks to see if a CPU is currently processing an order and returns a CC2 when it does, but of course the kernel's SENSE code doesn't know that. We could flag the CPU as busy in the kernel when sending a SIGP to userspace, so that the SENSE code indicates BUSY, but then how do we know when userspace is finished and the CPU is no longer BUSY? Eric
[PATCH v1 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
Let's factor out prefaulting/populating to make further changes easier to review and add a comment what we are actually expecting to happen. While at it, use the actual page size of the ramblock, which defaults to qemu_real_host_page_size for anonymous memory. Further, rename ram_block_populate_pages() to ram_block_populate_read() as well, to make it clearer what we are doing. In the future, we might want to use MADV_POPULATE_READ to speed up population. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/ram.c | 35 ++- 1 file changed, 22 insertions(+), 13 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index b225ec7507..c212081f85 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1639,26 +1639,35 @@ out: return ret; } +static inline void populate_read_range(RAMBlock *block, ram_addr_t offset, + ram_addr_t size) +{ +/* + * We read one byte of each page; this will preallocate page tables if + * required and populate the shared zeropage on MAP_PRIVATE anonymous memory + * where no page was populated yet. This might require adaption when + * supporting other mappings, like shmem. + */ +for (; offset < size; offset += block->page_size) { +char tmp = *((char *)block->host + offset); + +/* Don't optimize the read out */ +asm volatile("" : "+r" (tmp)); +} +} + /* - * ram_block_populate_pages: populate memory in the RAM block by reading - * an integer from the beginning of each page. + * ram_block_populate_read: preallocate page tables and populate pages in the + * RAM block by reading a byte of each page. * * Since it's solely used for userfault_fd WP feature, here we just * hardcode page size to qemu_real_host_page_size. 
* * @block: RAM block to populate */ -static void ram_block_populate_pages(RAMBlock *block) +static void ram_block_populate_read(RAMBlock *block) { -char *ptr = (char *) block->host; - -for (ram_addr_t offset = 0; offset < block->used_length; -offset += qemu_real_host_page_size) { -char tmp = *(ptr + offset); - -/* Don't optimize the read out */ -asm volatile("" : "+r" (tmp)); -} +populate_read_range(block, 0, block->used_length); } /* @@ -1684,7 +1693,7 @@ void ram_write_tracking_prepare(void) * UFFDIO_WRITEPROTECT_MODE_WP mode setting would silently skip * pages with pte_none() entries in page table. */ -ram_block_populate_pages(block); +ram_block_populate_read(block); } } -- 2.31.1
Re: [PATCH 2/2] target/i386/sev: Use local variable for kvm_sev_launch_measure
* Dov Murik (dovmu...@linux.ibm.com) wrote: > The struct kvm_sev_launch_measure has a constant and small size, and > therefore we can use a regular local variable for it instead of > allocating and freeing heap memory for it. > > Signed-off-by: Dov Murik Reviewed-by: Dr. David Alan Gilbert > --- > target/i386/sev.c | 16 +++- > 1 file changed, 7 insertions(+), 9 deletions(-) > > diff --git a/target/i386/sev.c b/target/i386/sev.c > index 0062566c71..eede07f11d 100644 > --- a/target/i386/sev.c > +++ b/target/i386/sev.c > @@ -729,7 +729,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) > SevGuestState *sev = sev_guest; > int ret, error; > g_autofree guchar *data = NULL; > -g_autofree struct kvm_sev_launch_measure *measurement = NULL; > +struct kvm_sev_launch_measure measurement = {}; > > if (!sev_check_state(sev, SEV_STATE_LAUNCH_UPDATE)) { > return; > @@ -743,23 +743,21 @@ sev_launch_get_measure(Notifier *notifier, void *unused) > } > } > > -measurement = g_new0(struct kvm_sev_launch_measure, 1); > - > /* query the measurement blob length */ > ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, > -measurement, ); > -if (!measurement->len) { > +, ); > +if (!measurement.len) { > error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", > __func__, ret, error, fw_error_to_str(errno)); > return; > } > > -data = g_new0(guchar, measurement->len); > -measurement->uaddr = (unsigned long)data; > +data = g_new0(guchar, measurement.len); > +measurement.uaddr = (unsigned long)data; > > /* get the measurement blob */ > ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, > -measurement, ); > +, ); > if (ret) { > error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", > __func__, ret, error, fw_error_to_str(errno)); > @@ -769,7 +767,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) > sev_set_guest_state(sev, SEV_STATE_LAUNCH_SECRET); > > /* encode the measurement value and emit the event */ > -sev->measurement = g_base64_encode(data, measurement->len); 
> +sev->measurement = g_base64_encode(data, measurement.len); > trace_kvm_sev_launch_measurement(sev->measurement); > } > > -- > 2.25.1 > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [RFC PATCH v1 2/2] s390x/kvm: Pass SIGP Stop flags
On 11.10.21 19:58, Eric Farman wrote: On Mon, 2021-10-11 at 11:21 +0200, David Hildenbrand wrote: On 11.10.21 10:40, Christian Borntraeger wrote: Am 11.10.21 um 09:09 schrieb David Hildenbrand: On 08.10.21 22:38, Eric Farman wrote: When building a Stop IRQ to pass to KVM, we should incorporate the flags if handling the SIGP Stop and Store Status order. With that, KVM can reject other orders that are submitted for the same CPU while the operation is fully processed. Signed-off-by: Eric Farman Acked-by: Janosch Frank --- target/s390x/kvm/kvm.c | 4 1 file changed, 4 insertions(+) diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c index 5b1fdb55c4..701b9ddc88 100644 --- a/target/s390x/kvm/kvm.c +++ b/target/s390x/kvm/kvm.c @@ -2555,6 +2555,10 @@ void kvm_s390_stop_interrupt(S390CPU *cpu) .type = KVM_S390_SIGP_STOP, }; +if (cpu->env.sigp_order == SIGP_STOP_STORE_STATUS) { +irq.u.stop.flags = KVM_S390_STOP_FLAG_STORE_STATUS; +} + KVM_S390_STOP_FLAG_STORE_STATUS tells KVM to perform the store status as well ... is that really what we want? At least it should not hurt I guess. QEMU then does it again? The thing is, that before we officially completed the action in user space (and let other SIGP actions actually succeed in user space on the CPU), the target CPU will be reported as !busy in the kernel already. And before we even inject the stop interrupt, the CPU will be detected as !busy in the kernel. I guess it will fix some cases where we poll via SENSE if the stop and store happened, because the store *did* happen in the kernel and we'll simply store again in user space. However, I wonder if we want to handle it more generically: Properly flag a CPU as busy for SIGP when we start processing the order until we completed processing the order. That would allow to handle other SIGP operations in user space cleanly, without any chance for races with SENSE code running in the kernel. 
I think a generic solution would be ideal, but I'm wrestling with the race with the kernel's SENSE code. Today, handle_sigp_single_dst already checks to see if a CPU is currently processing an order and returns a CC2 when it does, but of course the kernel's SENSE code doesn't know that. We could flag the CPU as busy in the kernel when sending a SIGP to userspace, so that the SENSE code indicates BUSY, but then how do we know when userspace is finished and the CPU is no longer BUSY? I'd just add a new IOCTL for marking a CPU busy/!busy for SIGP from user space. You can then either let user space perform both actions (set+unset), or let the kernel automatically set "busy" and user space only clear "busy". You can define a new capability to enable the "automatically set busy when going to user space on sigp" -- might require some thoughts on some corner cases. Maybe there might be other scenarios in the future where we might want to set a CPU busy for sigp without that CPU triggering a sigp action itself (e.g., externally triggered reset of a CPU? Simulation of check-stop? store status?), so at least having a way to set/reset a CPU busy for SIGP might be valuable. Once we go to user space to process a SIGP, we usually don't care too much about some additional overhead due to 1 or 2 ioctls IMHO. -- Thanks, David / dhildenb
[PATCH v1 5/9] virtio-mem: Drop precopy notifier
Migration code now properly handles RAMBlocks which are indirectly managed by a RamDiscardManager. No need for manual handling via the free page optimization interface, let's get rid of it. Acked-by: Michael S. Tsirkin Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 34 -- include/hw/virtio/virtio-mem.h | 3 --- 2 files changed, 37 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index 284096ec5f..d5a578142b 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -776,7 +776,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) host_memory_backend_set_mapped(vmem->memdev, true); vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem)); qemu_register_reset(virtio_mem_system_reset, vmem); -precopy_add_notifier(&vmem->precopy_notifier); /* * Set ourselves as RamDiscardManager before the plug handler maps the @@ -796,7 +795,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev) * found via an address space anymore. Unset ourselves. 
*/ memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL); -precopy_remove_notifier(&vmem->precopy_notifier); qemu_unregister_reset(virtio_mem_system_reset, vmem); vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem)); host_memory_backend_set_mapped(vmem->memdev, false); @@ -1089,43 +1087,11 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name, vmem->block_size = value; } -static int virtio_mem_precopy_exclude_range_cb(const VirtIOMEM *vmem, void *arg, - uint64_t offset, uint64_t size) -{ -void * const host = qemu_ram_get_host_addr(vmem->memdev->mr.ram_block); - -qemu_guest_free_page_hint(host + offset, size); -return 0; -} - -static void virtio_mem_precopy_exclude_unplugged(VirtIOMEM *vmem) -{ -virtio_mem_for_each_unplugged_range(vmem, NULL, -virtio_mem_precopy_exclude_range_cb); -} - -static int virtio_mem_precopy_notify(NotifierWithReturn *n, void *data) -{ -VirtIOMEM *vmem = container_of(n, VirtIOMEM, precopy_notifier); -PrecopyNotifyData *pnd = data; - -switch (pnd->reason) { -case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC: -virtio_mem_precopy_exclude_unplugged(vmem); -break; -default: -break; -} - -return 0; -} - static void virtio_mem_instance_init(Object *obj) { VirtIOMEM *vmem = VIRTIO_MEM(obj); notifier_list_init(&vmem->size_change_notifiers); -vmem->precopy_notifier.notify = virtio_mem_precopy_notify; QLIST_INIT(&vmem->rdl_list); object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size, diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h index 9a6e348fa2..a5dd6a493b 100644 --- a/include/hw/virtio/virtio-mem.h +++ b/include/hw/virtio/virtio-mem.h @@ -65,9 +65,6 @@ struct VirtIOMEM { /* notifiers to notify when "size" changes */ NotifierList size_change_notifiers; -/* don't migrate unplugged memory */ -NotifierWithReturn precopy_notifier; - /* listeners to notify on plug/unplug activity. */ QLIST_HEAD(, RamDiscardListener) rdl_list; }; -- 2.31.1
[PATCH v1 3/9] migration/ram: Don't pass RAMState to migration_clear_memory_region_dirty_bitmap_*()
The parameter is unused, let's drop it. Reviewed-by: Peter Xu Reviewed-by: Philippe Mathieu-Daudé Reviewed-by: Juan Quintela Signed-off-by: David Hildenbrand --- migration/ram.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index 7a43bfd7af..bb908822d5 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -789,8 +789,7 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb, return find_next_bit(bitmap, size, start); } -static void migration_clear_memory_region_dirty_bitmap(RAMState *rs, - RAMBlock *rb, +static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb, unsigned long page) { uint8_t shift; @@ -818,8 +817,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMState *rs, } static void -migration_clear_memory_region_dirty_bitmap_range(RAMState *rs, - RAMBlock *rb, +migration_clear_memory_region_dirty_bitmap_range(RAMBlock *rb, unsigned long start, unsigned long npages) { @@ -832,7 +830,7 @@ migration_clear_memory_region_dirty_bitmap_range(RAMState *rs, * exclusive. */ for (i = chunk_start; i < chunk_end; i += chunk_pages) { -migration_clear_memory_region_dirty_bitmap(rs, rb, i); +migration_clear_memory_region_dirty_bitmap(rb, i); } } @@ -850,7 +848,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs, * the page in the chunk we clear the remote dirty bitmap for all. * Clearing it earlier won't be a problem, but too late will. */ -migration_clear_memory_region_dirty_bitmap(rs, rb, page); +migration_clear_memory_region_dirty_bitmap(rb, page); ret = test_and_clear_bit(page, rb->bmap); if (ret) { @@ -2777,8 +2775,7 @@ void qemu_guest_free_page_hint(void *addr, size_t len) * are initially set. Otherwise those skipped pages will be sent in * the next round after syncing from the memory region bitmap. 
*/ -migration_clear_memory_region_dirty_bitmap_range(ram_state, block, - start, npages); +migration_clear_memory_region_dirty_bitmap_range(block, start, npages); ram_state->migration_dirty_pages -= bitmap_count_one_with_offset(block->bmap, start, npages); bitmap_clear(block->bmap, start, npages); -- 2.31.1
[PATCH v1 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots
We already don't ever migrate memory that corresponds to discarded ranges as managed by a RamDiscardManager responsible for the mapped memory region of the RAMBlock. virtio-mem uses this mechanism to logically unplug parts of a RAMBlock. Right now, we still populate zeropages for the whole usable part of the RAMBlock, which is undesired because: 1. Even populating the shared zeropage will result in memory getting consumed for page tables. 2. Memory backends without a shared zeropage (like hugetlbfs and shmem) will populate an actual, fresh page, resulting in an unintended memory consumption. Discarded ("logically unplugged") parts have to remain discarded. As these pages are never part of the migration stream, there is no need to track modifications via userfaultfd WP reliably for these parts. Further, any writes to these ranges by the VM are invalid and the behavior is undefined. Note that Linux only supports userfaultfd WP on private anonymous memory for now, which usually results in the shared zeropage getting populated. The issue will become more relevant once userfaultfd WP supports shmem and hugetlb. Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/ram.c | 38 -- 1 file changed, 36 insertions(+), 2 deletions(-) diff --git a/migration/ram.c b/migration/ram.c index c212081f85..dbbb1e6712 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -1656,6 +1656,17 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset, } } +static inline int populate_read_section(MemoryRegionSection *section, +void *opaque) +{ +const hwaddr size = int128_get64(section->size); +hwaddr offset = section->offset_within_region; +RAMBlock *block = section->mr->ram_block; + +populate_read_range(block, offset, size); +return 0; +} + /* * ram_block_populate_read: preallocate page tables and populate pages in the * RAM block by reading a byte of each page. 
@@ -1665,9 +1676,32 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset, * * @block: RAM block to populate */ -static void ram_block_populate_read(RAMBlock *block) +static void ram_block_populate_read(RAMBlock *rb) { -populate_read_range(block, 0, block->used_length); +/* + * Skip populating all pages that fall into a discarded range as managed by + * a RamDiscardManager responsible for the mapped memory region of the + * RAMBlock. Such discarded ("logically unplugged") parts of a RAMBlock + * must not get populated automatically. We don't have to track + * modifications via userfaultfd WP reliably, because these pages will + * not be part of the migration stream either way -- see + * ramblock_dirty_bitmap_exclude_discarded_pages(). + * + * Note: The result is only stable while migrating (precopy/postcopy). + */ +if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) { +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr); +MemoryRegionSection section = { +.mr = rb->mr, +.offset_within_region = 0, +.size = rb->mr->size, +}; + +ram_discard_manager_replay_populated(rdm, &section, + populate_read_section, NULL); +} else { +populate_read_range(rb, 0, rb->used_length); +} } /* -- 2.31.1
[PATCH v1 7/9] migration: Simplify alignment and alignment checks
Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to read. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/migration.c| 6 +++--- migration/postcopy-ram.c | 9 - migration/ram.c | 2 +- 3 files changed, 8 insertions(+), 9 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 6ac807ef3d..aa89ebac0c 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis, int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb, ram_addr_t start, uint64_t haddr) { -void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb))); +void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb)); bool received = false; WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) { @@ -2619,8 +2619,8 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname, * Since we currently insist on matching page sizes, just sanity check * we're being asked for whole host pages. 
*/ -if (start & (our_host_ps - 1) || - (len & (our_host_ps - 1))) { +if (!QEMU_IS_ALIGNED(start, our_host_ps) || +!QEMU_IS_ALIGNED(len, our_host_ps)) { error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT " len: %zd", __func__, start, len); mark_source_rp_bad(ms); diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index 3609ce7e52..e721f69d0f 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -402,7 +402,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis) strerror(errno)); goto out; } -g_assert(((size_t)testarea & (pagesize - 1)) == 0); +g_assert(QEMU_PTR_IS_ALIGNED(testarea, pagesize)); reg_struct.range.start = (uintptr_t)testarea; reg_struct.range.len = pagesize; @@ -660,7 +660,7 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, struct uffdio_range range; int ret; trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb)); -range.start = client_addr & ~(pagesize - 1); +range.start = ROUND_DOWN(client_addr, pagesize); range.len = pagesize; ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range); if (ret) { @@ -702,8 +702,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb, int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb, uint64_t client_addr, uint64_t rb_offset) { -size_t pagesize = qemu_ram_pagesize(rb); -uint64_t aligned_rbo = rb_offset & ~(pagesize - 1); +uint64_t aligned_rbo = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb)); MigrationIncomingState *mis = migration_incoming_get_current(); trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb), @@ -993,7 +992,7 @@ static void *postcopy_ram_fault_thread(void *opaque) break; } -rb_offset &= ~(qemu_ram_pagesize(rb) - 1); +rb_offset = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb)); trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address, qemu_ram_get_idstr(rb), rb_offset, diff --git a/migration/ram.c b/migration/ram.c index 56240f0f17..b225ec7507 100644 --- a/migration/ram.c +++ 
b/migration/ram.c @@ -811,7 +811,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb, assert(shift >= 6); size = 1ULL << (TARGET_PAGE_BITS + shift); -start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size); +start = QEMU_ALIGN_DOWN((ram_addr_t)page << TARGET_PAGE_BITS, size); trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page); memory_region_clear_dirty_bitmap(rb->mr, start, size); } -- 2.31.1
[PATCH v1 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination
Currently, when someone (i.e., the VM) accesses discarded parts inside a RAMBlock with a RamDiscardManager managing the corresponding mapped memory region, postcopy will request migration of the corresponding page from the source. The source, however, will never answer, because it refuses to migrate such pages with undefined content ("logically unplugged"): the pages are never dirty, and get_queued_page() will consequently skip processing these postcopy requests. Especially reading discarded ("logically unplugged") ranges is supposed to work in some setups (for example with current virtio-mem), although it barely ever happens: still, not placing a page would currently stall the VM, as it cannot make forward progress. Let's check the state via the RamDiscardManager (the state e.g., of virtio-mem is migrated during precopy) and avoid sending a request that will never get answered. Place a fresh zero page instead to keep the VM working. This is the same behavior that would happen automatically without userfaultfd being active, when accessing virtual memory regions without populated pages -- "populate on demand". For now, there are valid cases (as documented in the virtio-mem spec) where a VM might read discarded memory; in the future, we will disallow that. Then, we might want to handle that case differently, e.g., warning the user that the VM seems to be mis-behaving. 
Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/postcopy-ram.c | 31 +++ migration/ram.c | 21 + migration/ram.h | 1 + 3 files changed, 49 insertions(+), 4 deletions(-) diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c index 2e9697bdd2..3609ce7e52 100644 --- a/migration/postcopy-ram.c +++ b/migration/postcopy-ram.c @@ -671,6 +671,29 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd, return ret; } +static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb, + ram_addr_t start, uint64_t haddr) +{ +void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb)); + +/* + * Discarded pages (via RamDiscardManager) are never migrated. On unlikely + * access, place a zeropage, which will also set the relevant bits in the + * recv_bitmap accordingly, so we won't try placing a zeropage twice. + * + * Checking a single bit is sufficient to handle pagesize > TPS as either + * all relevant bits are set or not. + */ +assert(QEMU_IS_ALIGNED(start, qemu_ram_pagesize(rb))); +if (ramblock_page_is_discarded(rb, start)) { +bool received = ramblock_recv_bitmap_test_byte_offset(rb, start); + +return received ? 
0 : postcopy_place_page_zero(mis, aligned, rb); +} + +return migrate_send_rp_req_pages(mis, rb, start, haddr); +} + /* * Callback from shared fault handlers to ask for a page, * the page must be specified by a RAMBlock and an offset in that rb @@ -690,7 +713,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb, qemu_ram_get_idstr(rb), rb_offset); return postcopy_wake_shared(pcfd, client_addr, rb); } -migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr); +postcopy_request_page(mis, rb, aligned_rbo, client_addr); return 0; } @@ -984,8 +1007,8 @@ retry: * Send the request to the source - we want to request one * of our host page sizes (which is >= TPS) */ -ret = migrate_send_rp_req_pages(mis, rb, rb_offset, -msg.arg.pagefault.address); +ret = postcopy_request_page(mis, rb, rb_offset, +msg.arg.pagefault.address); if (ret) { /* May be network failure, try to wait for recovery */ if (ret == -EIO && postcopy_pause_fault_thread(mis)) { @@ -993,7 +1016,7 @@ retry: goto retry; } else { /* This is a unavoidable fault */ -error_report("%s: migrate_send_rp_req_pages() get %d", +error_report("%s: postcopy_request_page() get %d", __func__, ret); break; } diff --git a/migration/ram.c b/migration/ram.c index 3be969f749..56240f0f17 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -912,6 +912,27 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb) return cleared_bits; } +/* + * Check if a host-page aligned page falls into a discarded range as managed by + * a RamDiscardManager responsible for the mapped memory region of the RAMBlock. + * + * Note: The result is only stable while migrating (precopy/postcopy). + */ +bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start) +{ +if (rb->mr &&
[PATCH v1 1/9] memory: Introduce replay_discarded callback for RamDiscardManager
Introduce replay_discarded callback similar to our existing replay_populated callback, to be used by migration code to never migrate discarded memory. Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- include/exec/memory.h | 21 + softmmu/memory.c | 11 +++ 2 files changed, 32 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index a185b6dcb8..75b4f600e3 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -540,6 +540,7 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl, } typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque); +typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque); /* * RamDiscardManagerClass: @@ -628,6 +629,21 @@ struct RamDiscardManagerClass { MemoryRegionSection *section, ReplayRamPopulate replay_fn, void *opaque); +/** + * @replay_discarded: + * + * Call the #ReplayRamDiscard callback for all discarded parts within the + * #MemoryRegionSection via the #RamDiscardManager. 
+ * + * @rdm: the #RamDiscardManager + * @section: the #MemoryRegionSection + * @replay_fn: the #ReplayRamDiscard callback + * @opaque: pointer to forward to the callback + */ +void (*replay_discarded)(const RamDiscardManager *rdm, + MemoryRegionSection *section, + ReplayRamDiscard replay_fn, void *opaque); + /** * @register_listener: * @@ -672,6 +688,11 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm, ReplayRamPopulate replay_fn, void *opaque); +void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm, + MemoryRegionSection *section, + ReplayRamDiscard replay_fn, + void *opaque); + void ram_discard_manager_register_listener(RamDiscardManager *rdm, RamDiscardListener *rdl, MemoryRegionSection *section); diff --git a/softmmu/memory.c b/softmmu/memory.c index db182e5d3d..3bcfc3899b 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -2081,6 +2081,17 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm, return rdmc->replay_populated(rdm, section, replay_fn, opaque); } +void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm, + MemoryRegionSection *section, + ReplayRamDiscard replay_fn, + void *opaque) +{ +RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm); + +g_assert(rdmc->replay_discarded); +rdmc->replay_discarded(rdm, section, replay_fn, opaque); +} + void ram_discard_manager_register_listener(RamDiscardManager *rdm, RamDiscardListener *rdl, MemoryRegionSection *section) -- 2.31.1
[PATCH v1 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source
We don't want to migrate memory that corresponds to discarded ranges as managed by a RamDiscardManager responsible for the mapped memory region of the RAMBlock. The content of these pages is essentially stale and without any guarantees for the VM ("logically unplugged"). Depending on the underlying memory type, even reading memory might populate memory on the source, resulting in an undesired memory consumption. Of course, on the destination, even writing a zeropage consumes memory, which we also want to avoid (similar to free page hinting). Currently, virtio-mem tries achieving that goal (not migrating "unplugged" memory that was discarded) by going via qemu_guest_free_page_hint() - but it's hackish and incomplete. For example, background snapshots still end up reading all memory, as they don't do bitmap syncs. Postcopy recovery code will re-add previously cleared bits to the dirty bitmap and migrate them. Let's consult the RamDiscardManager after setting up our dirty bitmap initially and when postcopy recovery code reinitializes it: clear corresponding bits in the dirty bitmaps (e.g., of the RAMBlock and inside KVM). It's important to fixup the dirty bitmap *after* our initial bitmap sync, such that the corresponding dirty bits in KVM are actually cleared. As colo is incompatible with discarding of RAM and inhibits it, we don't have to bother. Note: if a misbehaving guest would use discarded ranges after migration started we would still migrate that memory: however, then we already populated that memory on the migration source. 
Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- migration/ram.c | 77 + 1 file changed, 77 insertions(+) diff --git a/migration/ram.c b/migration/ram.c index bb908822d5..3be969f749 100644 --- a/migration/ram.c +++ b/migration/ram.c @@ -858,6 +858,60 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs, return ret; } +static void dirty_bitmap_clear_section(MemoryRegionSection *section, + void *opaque) +{ +const hwaddr offset = section->offset_within_region; +const hwaddr size = int128_get64(section->size); +const unsigned long start = offset >> TARGET_PAGE_BITS; +const unsigned long npages = size >> TARGET_PAGE_BITS; +RAMBlock *rb = section->mr->ram_block; +uint64_t *cleared_bits = opaque; + +/* + * We don't grab ram_state->bitmap_mutex because we expect to run + * only when starting migration or during postcopy recovery where + * we don't have concurrent access. + */ +if (!migration_in_postcopy() && !migrate_background_snapshot()) { +migration_clear_memory_region_dirty_bitmap_range(rb, start, npages); +} +*cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages); +bitmap_clear(rb->bmap, start, npages); +} + +/* + * Exclude all dirty pages from migration that fall into a discarded range as + * managed by a RamDiscardManager responsible for the mapped memory region of + * the RAMBlock. Clear the corresponding bits in the dirty bitmaps. + * + * Discarded pages ("logically unplugged") have undefined content and must + * not get migrated, because even reading these pages for migration might + * result in undesired behavior. + * + * Returns the number of cleared bits in the RAMBlock dirty bitmap. + * + * Note: The result is only stable while migrating (precopy/postcopy). 
+ */ +static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb) +{ +uint64_t cleared_bits = 0; + +if (rb->mr && rb->bmap && memory_region_has_ram_discard_manager(rb->mr)) { +RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr); +MemoryRegionSection section = { +.mr = rb->mr, +.offset_within_region = 0, +.size = int128_make64(qemu_ram_get_used_length(rb)), +}; + +ram_discard_manager_replay_discarded(rdm, &section, + dirty_bitmap_clear_section, + &cleared_bits); +} +return cleared_bits; +} + /* Called with RCU critical section */ static void ramblock_sync_dirty_bitmap(RAMState *rs, RAMBlock *rb) { @@ -2668,6 +2722,19 @@ static void ram_list_init_bitmaps(void) } } +static void migration_bitmap_clear_discarded_pages(RAMState *rs) +{ +unsigned long pages; +RAMBlock *rb; + +RCU_READ_LOCK_GUARD(); + +RAMBLOCK_FOREACH_NOT_IGNORED(rb) { +pages = ramblock_dirty_bitmap_clear_discarded_pages(rb); +rs->migration_dirty_pages -= pages; +} +} + static void ram_init_bitmaps(RAMState *rs) { /* For memory_global_dirty_log_start below. */ @@ -2684,6 +2751,12 @@ static void ram_init_bitmaps(RAMState *rs) } qemu_mutex_unlock_ramlist(); qemu_mutex_unlock_iothread(); + +/* + * After an eventual first bitmap sync, fixup the initial bitmap + * containing all 1s to
[PATCH v1 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback
Implement it similar to the replay_populated callback. Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 58 ++ 1 file changed, 58 insertions(+) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index df91e454b2..284096ec5f 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -228,6 +228,38 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem, return ret; } +static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem, + MemoryRegionSection *s, + void *arg, + virtio_mem_section_cb cb) +{ +unsigned long first_bit, last_bit; +uint64_t offset, size; +int ret = 0; + +first_bit = s->offset_within_region / vmem->block_size; +first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, first_bit); +while (first_bit < vmem->bitmap_size) { +MemoryRegionSection tmp = *s; + +offset = first_bit * vmem->block_size; +last_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size, + first_bit + 1) - 1; +size = (last_bit - first_bit + 1) * vmem->block_size; + +if (!virito_mem_intersect_memory_section(&tmp, offset, size)) { +break; +} +ret = cb(&tmp, arg); +if (ret) { +break; +} +first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, + last_bit + 2); +} +return ret; +} + static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg) { RamDiscardListener *rdl = arg; @@ -1170,6 +1202,31 @@ static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm, virtio_mem_rdm_replay_populated_cb); } +static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s, + void *arg) +{ +struct VirtIOMEMReplayData *data = arg; + +((ReplayRamDiscard)data->fn)(s, data->opaque); +return 0; +} + +static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm, +MemoryRegionSection *s, +ReplayRamDiscard replay_fn, +void *opaque) +{ +const VirtIOMEM *vmem = VIRTIO_MEM(rdm); +struct VirtIOMEMReplayData data = { +.fn = replay_fn, +.opaque = opaque, +}; + +g_assert(s->mr == &vmem->memdev->mr); 
+virtio_mem_for_each_unplugged_section(vmem, s, &data, + virtio_mem_rdm_replay_discarded_cb); +} + static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm, RamDiscardListener *rdl, MemoryRegionSection *s) @@ -1234,6 +1291,7 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data) rdmc->get_min_granularity = virtio_mem_rdm_get_min_granularity; rdmc->is_populated = virtio_mem_rdm_is_populated; rdmc->replay_populated = virtio_mem_rdm_replay_populated; +rdmc->replay_discarded = virtio_mem_rdm_replay_discarded; rdmc->register_listener = virtio_mem_rdm_register_listener; rdmc->unregister_listener = virtio_mem_rdm_unregister_listener; } -- 2.31.1
Re: [PATCH 1/2] target/i386/sev: Use local variable for kvm_sev_launch_start
* Dov Murik (dovmu...@linux.ibm.com) wrote: > The struct kvm_sev_launch_start has a constant and small size, and > therefore we can use a regular local variable for it instead of > allocating and freeing heap memory for it. > > Signed-off-by: Dov Murik Reviewed-by: Dr. David Alan Gilbert > --- > target/i386/sev.c | 23 ++- > 1 file changed, 10 insertions(+), 13 deletions(-) > > diff --git a/target/i386/sev.c b/target/i386/sev.c > index 4c64c68244..0062566c71 100644 > --- a/target/i386/sev.c > +++ b/target/i386/sev.c > @@ -647,31 +647,29 @@ sev_launch_start(SevGuestState *sev) > gsize sz; > int ret = 1; > int fw_error, rc; > -struct kvm_sev_launch_start *start; > +struct kvm_sev_launch_start start = { > +.handle = sev->handle, .policy = sev->policy > +}; > guchar *session = NULL, *dh_cert = NULL; > > -start = g_new0(struct kvm_sev_launch_start, 1); > - > -start->handle = sev->handle; > -start->policy = sev->policy; > if (sev->session_file) { > if (sev_read_file_base64(sev->session_file, &session, &sz) < 0) { > goto out; > } > -start->session_uaddr = (unsigned long)session; > -start->session_len = sz; > +start.session_uaddr = (unsigned long)session; > +start.session_len = sz; > } > > if (sev->dh_cert_file) { > if (sev_read_file_base64(sev->dh_cert_file, &dh_cert, &sz) < 0) { > goto out; > } > -start->dh_uaddr = (unsigned long)dh_cert; > -start->dh_len = sz; > +start.dh_uaddr = (unsigned long)dh_cert; > +start.dh_len = sz; > } > > -trace_kvm_sev_launch_start(start->policy, session, dh_cert); > -rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, start, &fw_error); > +trace_kvm_sev_launch_start(start.policy, session, dh_cert); > +rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, &start, &fw_error); > if (rc < 0) { > error_report("%s: LAUNCH_START ret=%d fw_error=%d '%s'", > __func__, ret, fw_error, fw_error_to_str(fw_error)); > @@ -679,11 +677,10 @@ sev_launch_start(SevGuestState *sev) > } > > sev_set_guest_state(sev, SEV_STATE_LAUNCH_UPDATE); > -sev->handle = 
start.handle; > ret = 0; > > out: > -g_free(start); > g_free(session); > g_free(dh_cert); > return ret; > -- > 2.25.1 > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
[PATCH v1 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager
This series is fully reviewed by Peter and I hope we can get either more review feedback or get it merged via the migration tree soonish. Thanks. --- virtio-mem exposes a dynamic amount of memory within RAMBlocks by coordinating with the VM. Memory within a RAMBlock can either get plugged and consequently used by the VM, or unplugged and consequently no longer used by the VM. Logical unplug is realized by discarding the physical memory backing for virtual memory ranges, similar to memory ballooning. However, important differences to virtio-balloon are: a) A virtio-mem device only operates on its assigned memory region / RAMBlock ("device memory") b) Initially, all device memory is logically unplugged c) Virtual machines will never accidentally reuse memory that is currently logically unplugged. The spec defines most accesses to unplugged memory as "undefined behavior" -- except reading unplugged memory, which is currently expected to work, but that will change in the future. d) The (un)plug granularity is in the range of megabytes -- "memory blocks" e) The state (plugged/unplugged) of a memory block is always known and properly tracked. Whenever memory blocks within the RAMBlock get (un)plugged, changes are communicated via the RamDiscardManager to other QEMU subsystems, most prominently vfio which updates the DMA mapping accordingly. "Unplugging" corresponds to "discarding" and "plugging" corresponds to "populating". While migrating (precopy/postcopy) the state of such memory blocks cannot change, as virtio-mem will reject any guest requests that would change the state of blocks with "busy". We don't want to migrate such logically unplugged memory, because it can result in an unintended memory consumption both on the source (when reading memory from some memory backends) and on the destination (when writing memory). Further, migration time can be heavily reduced when skipping logically unplugged blocks and we avoid populating unnecessary page tables in Linux. 
Right now, virtio-mem reuses the free page hinting infrastructure during precopy to exclude all logically unplugged ("discarded") parts from the migration stream. However, there are some scenarios that are not handled properly and need fixing. Further, there are some ugly corner cases in postcopy code and background snapshotting code that similarly have to handle such special RAMBlocks. Let's reuse the RamDiscardManager infrastructure to essentially handle precopy, postcopy and background snapshots cleanly, which means: a) In precopy code, fixing up the initial dirty bitmaps (in the RAMBlock and e.g., KVM) to exclude discarded ranges. b) In postcopy code, placing a zeropage when requested to handle a page falling into a discarded range -- because the source will never send it. Further, fix up the dirty bitmap when overwriting it in recovery mode. c) In background snapshot code, never populating discarded ranges, not even with the shared zeropage, to avoid unintended memory consumption, especially in the future with hugetlb and shmem. Detail: When realizing a virtio-mem devices, it will register the RAM for migration via vmstate_register_ram(). Further, it will set itself as the RamDiscardManager for the corresponding memory region of the RAMBlock via memory_region_set_ram_discard_manager(). Last but not least, memory device code will actually map the memory region into guest physical address space. So migration code can always properly identify such RAMBlocks. Tested with precopy/postcopy on shmem, where even reading unpopulated memory ranges will populate actual memory and not the shared zeropage. Tested with background snapshots on anonymous memory, because other backends are not supported yet with upstream Linux. 
v5 -> v6: - Rebased and added ACKs v4 -> v5: - "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination" -- Use ROUND_DOWN and fix compile warning on 32 bit -- Use int128_make64() instead of wrongly int128_get64() - "migration: Simplify alignment and alignment checks" -- Use ROUND_DOWN where possible instead of QEMU_ALIGN_DOWN and fix compilation warning on 32 bit - "migration/ram: Factor out populating pages readable in ram_block_populate_pages()" -- Rename functions, add a comment. - "migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots" -- Adjust to changed function names v3 -> v4: - Added ACKs - "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination" -- Use QEMU_ALIGN_DOWN() to align to ram pagesize - "migration: Simplify alignment and alignment checks" -- Added - "migration/ram: Factor out populating pages readable in ram_block_populate_pages()" -- Added - "migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots" -- Simplified due to factored out code v2 -> v3: - "migration/ram: Don't passs RAMState to
Re: [PATCH] memory: Log access direction for invalid accesses
On 11.10.21 19:32, BALATON Zoltan wrote: In memory_region_access_valid() invalid accesses are logged to help debugging but the log message does not say if it was a read or write. Log that too to better identify the access causing the problem. Signed-off-by: BALATON Zoltan --- softmmu/memory.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index db182e5d3d..e5826faa0c 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1378,17 +1378,17 @@ bool memory_region_access_valid(MemoryRegion *mr, { if (mr->ops->valid.accepts && !mr->ops->valid.accepts(mr->opaque, addr, size, is_write, attrs)) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: rejected\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: rejected\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } if (!mr->ops->valid.unaligned && (addr & (size - 1))) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: unaligned\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: unaligned\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } @@ -1400,10 +1400,10 @@ bool memory_region_access_valid(MemoryRegion *mr, if (size > mr->ops->valid.max_access_size || size < mr->ops->valid.min_access_size) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: invalid size " - "(min:%u max:%u)\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: invalid size " + "(min:%u max:%u)\n", + is_write ? 
"write" : "read", addr, size, memory_region_name(mr), mr->ops->valid.min_access_size, mr->ops->valid.max_access_size); Reviewed-by: David Hildenbrand -- Thanks, David / dhildenb
Re: ACPI endianness
On Mon, 11 Oct 2021, Michael S. Tsirkin wrote: On Mon, Oct 11, 2021 at 03:51:01PM +0200, BALATON Zoltan wrote: ... but given we did not previously do the read, maybe we should keep it that way at least for the time being. How do you know there was no read before this write? Did you check it? I've only added a printf in the write method and saw the value was swapped but did not check if there was a read before that. There are no traces in these methods so maybe I would not see it unless adding a printf there too. All I am saying is that qemu did not convert a write into a read+write. OK confirmed that by disabling pm_io_space_update() in hw/isa/vt82c686.c so the via-pm region is never mapped and then modifying the log messages for invalid accesses (patch sent separately) I get: ~ # poweroff Sent SIGKILL to all processes Requesting system poweroff pci_cfg_write vt82c686b-usb-uhci 12:3 @0xc0 <- 0x8f00 pci_cfg_write vt82c686b-usb-uhci 12:2 @0xc0 <- 0x8f00 [ 16.498465] reboot: Power down Invalid write at addr 0xFE000F05, size 1, region '(null)', reason: rejected Invalid write at addr 0xF05, size 1, region '(null)', reason: rejected So the guest only tries to write one byte but not sure if the read generated within memory.c would show up in these logs. I suspect if you fixed that you'd get all sorts of problems with other device models so the less likely to break anything fix would be to go back to native endian for acpi but I wait for what you come up with. I'd like this to be fixed one way or another for 6.2 and fixing endianness would probably be enough for that. Regards, BALATON Zoltan
[PATCH v1 2/2] memory: Update description of memory_region_is_mapped()
memory_region_is_mapped() only indicates if the memory region is mapped into a different memory region, and only if it is mapped directly (->container), not via an alias. Update the documentation to make this clearer. Signed-off-by: David Hildenbrand --- include/exec/memory.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/exec/memory.h b/include/exec/memory.h index a185b6dcb8..abc17fc3c0 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -2265,7 +2265,8 @@ bool memory_region_present(MemoryRegion *container, hwaddr addr); /** * memory_region_is_mapped: returns true if #MemoryRegion is mapped - * into any address space. + * into another #MemoryRegion directly. Will return false if the + * #MemoryRegion is mapped indirectly via an alias. * * @mr: a #MemoryRegion which should be checked if it's mapped */ -- 2.31.1
[PATCH v1 1/2] machine: Use host_memory_backend_is_mapped() in machine_consume_memdev()
memory_region_is_mapped() is the wrong check, we actually want to check whether the backend is already marked mapped. For example, memory regions mapped via an alias, such as NVDIMMs, currently don't make memory_region_is_mapped() return "true". As the machine is initialized before any memory devices (and thereby before NVDIMMs are initialized), this isn't a fix but merely a cleanup. Signed-off-by: David Hildenbrand --- hw/core/machine.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index b8d95eec32..a1db865939 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -1260,7 +1260,7 @@ MemoryRegion *machine_consume_memdev(MachineState *machine, { MemoryRegion *ret = host_memory_backend_get_memory(backend); -if (memory_region_is_mapped(ret)) { +if (host_memory_backend_is_mapped(backend)) { error_report("memory backend %s can't be used multiple times.", object_get_canonical_path_component(OBJECT(backend))); exit(EXIT_FAILURE); -- 2.31.1
[PATCH v1 0/2] memory: Update description of memory_region_is_mapped()
Playing with memory_region_is_mapped(), I realized that memory regions mapped via an alias behave a little bit "differently", as they don't have their ->container set. * memory_region_is_mapped() will never succeed for memory regions mapped via an alias * memory_region_to_address_space(), memory_region_find(), memory_region_find_rcu(), memory_region_present() won't work, which seems okay, because we don't expect such memory regions getting passed to these functions. * memory_region_to_absolute_addr() will result in a wrong address. As the result is only used for tracing, that is tolerable. Let's clarify the documentation of memory_region_is_mapped() and change one user that really should be checking something else. Cc: Eduardo Habkost Cc: Marcel Apfelbaum Cc: Paolo Bonzini Cc: Peter Xu Cc: Igor Mammedov Cc: Richard Henderson Cc: Philippe Mathieu-Daudé David Hildenbrand (2): machine: Use host_memory_backend_is_mapped() in machine_consume_memdev() memory: Update description of memory_region_is_mapped() hw/core/machine.c | 2 +- include/exec/memory.h | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) -- 2.31.1
[PATCH] memory: Log access direction for invalid accesses
In memory_region_access_valid() invalid accesses are logged to help debugging but the log message does not say if it was a read or write. Log that too to better identify the access causing the problem. Signed-off-by: BALATON Zoltan --- softmmu/memory.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/softmmu/memory.c b/softmmu/memory.c index db182e5d3d..e5826faa0c 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1378,17 +1378,17 @@ bool memory_region_access_valid(MemoryRegion *mr, { if (mr->ops->valid.accepts && !mr->ops->valid.accepts(mr->opaque, addr, size, is_write, attrs)) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: rejected\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: rejected\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } if (!mr->ops->valid.unaligned && (addr & (size - 1))) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: unaligned\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: unaligned\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr)); return false; } @@ -1400,10 +1400,10 @@ bool memory_region_access_valid(MemoryRegion *mr, if (size > mr->ops->valid.max_access_size || size < mr->ops->valid.min_access_size) { -qemu_log_mask(LOG_GUEST_ERROR, "Invalid access at addr " - "0x%" HWADDR_PRIX ", size %u, " - "region '%s', reason: invalid size " - "(min:%u max:%u)\n", +qemu_log_mask(LOG_GUEST_ERROR, "Invalid %s at addr 0x%" HWADDR_PRIX + ", size %u, region '%s', reason: invalid size " + "(min:%u max:%u)\n", + is_write ? "write" : "read", addr, size, memory_region_name(mr), mr->ops->valid.min_access_size, mr->ops->valid.max_access_size); -- 2.21.4
Re: [PATCH v4 00/11] virtio-iommu: Add ACPI support
Hi Haiwei, On Mon, Oct 11, 2021 at 06:10:07PM +0800, Haiwei Li wrote: [...] > Gave up waiting for root file system device. Common problems: > - Boot args (cat /proc/cmdline) >- Check rootdelay= (did the system wait long enough?) > - Missing modules (cat /proc/modules; ls /dev) > ALERT! UUID=3caf26b5-4d08-43e0-8634-7573269c4f70 does not exist. > Dropping to a shell! > > Any suggestions? Thanks. It's possible that the rootfs is on a disk behind the IOMMU, and the IOMMU driver doesn't get loaded. That could happen, for example, if the virtio-iommu module is not present in the initramfs. Since IOMMU drivers are typically built into the kernel rather than modules, distro tools that build the initramfs might not pick up IOMMU modules. I'm guessing this could be the issue here because of the hints and "Dropping to a shell" line. The clean solution will be to patch the initramfs tools to learn about IOMMU drivers (I'm somewhat working on that). In the meantime, if this is indeed the problem, you could try explicitly adding the virtio-iommu module to the initramfs, or building the kernel with CONFIG_VIRTIO_IOMMU=y rather than =m, though that requires VIRTIO and VIRTIO_PCI to be built-in as well. Thanks, Jean
[PATCH v1] virtio-mem: Don't skip alignment checks when warning about block size
If we warn about the block size being smaller than the default, we skip some alignment checks. This can currently only fail on x86-64, when specifying a block size of 1 MiB, however, we detect the THP size of 2 MiB. Fixes: 228957fea3a9 ("virtio-mem: Probe THP size to determine default block size") Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index df91e454b2..7ce9901791 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -701,7 +701,8 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp) warn_report("'%s' property is smaller than the default block size (%" PRIx64 " MiB)", VIRTIO_MEM_BLOCK_SIZE_PROP, virtio_mem_default_block_size(rb) / MiB); -} else if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) { +} +if (!QEMU_IS_ALIGNED(vmem->requested_size, vmem->block_size)) { error_setg(errp, "'%s' property has to be multiples of '%s' (0x%" PRIx64 ")", VIRTIO_MEM_REQUESTED_SIZE_PROP, VIRTIO_MEM_BLOCK_SIZE_PROP, vmem->block_size); -- 2.31.1
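The bug fixed here is purely control flow: with `else if`, the alignment check was silently skipped whenever the size warning fired. A reduced, QEMU-independent illustration (function names invented; `is_aligned` mimics QEMU's `QEMU_IS_ALIGNED` semantics for this example):

```c
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024)

/* Same effect as QEMU_IS_ALIGNED() for the purposes of this sketch. */
static bool is_aligned(uint64_t n, uint64_t m)
{
    return n % m == 0;
}

/* Returns true if the configuration is accepted. */
static bool check_config(uint64_t block_size, uint64_t requested_size,
                         uint64_t default_block_size, bool *warned)
{
    *warned = (block_size < default_block_size); /* warn_report() in QEMU */
    /* Independent 'if', not 'else if': warning must not skip validation. */
    if (!is_aligned(requested_size, block_size)) {
        return false;                            /* error_setg() in QEMU */
    }
    return true;
}
```

With the independent `if`, a 1 MiB block size on a host with a 2 MiB THP default both warns and still rejects an unaligned requested size.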
[PATCH 1/2] target/i386/sev: Use local variable for kvm_sev_launch_start
The struct kvm_sev_launch_start has a constant and small size, and therefore we can use a regular local variable for it instead of allocating and freeing heap memory for it. Signed-off-by: Dov Murik --- target/i386/sev.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/target/i386/sev.c b/target/i386/sev.c index 4c64c68244..0062566c71 100644 --- a/target/i386/sev.c +++ b/target/i386/sev.c @@ -647,31 +647,29 @@ sev_launch_start(SevGuestState *sev) gsize sz; int ret = 1; int fw_error, rc; -struct kvm_sev_launch_start *start; +struct kvm_sev_launch_start start = { +.handle = sev->handle, .policy = sev->policy +}; guchar *session = NULL, *dh_cert = NULL; -start = g_new0(struct kvm_sev_launch_start, 1); - -start->handle = sev->handle; -start->policy = sev->policy; if (sev->session_file) { if (sev_read_file_base64(sev->session_file, , ) < 0) { goto out; } -start->session_uaddr = (unsigned long)session; -start->session_len = sz; +start.session_uaddr = (unsigned long)session; +start.session_len = sz; } if (sev->dh_cert_file) { if (sev_read_file_base64(sev->dh_cert_file, _cert, ) < 0) { goto out; } -start->dh_uaddr = (unsigned long)dh_cert; -start->dh_len = sz; +start.dh_uaddr = (unsigned long)dh_cert; +start.dh_len = sz; } -trace_kvm_sev_launch_start(start->policy, session, dh_cert); -rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, start, _error); +trace_kvm_sev_launch_start(start.policy, session, dh_cert); +rc = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_START, , _error); if (rc < 0) { error_report("%s: LAUNCH_START ret=%d fw_error=%d '%s'", __func__, ret, fw_error, fw_error_to_str(fw_error)); @@ -679,11 +677,10 @@ sev_launch_start(SevGuestState *sev) } sev_set_guest_state(sev, SEV_STATE_LAUNCH_UPDATE); -sev->handle = start->handle; +sev->handle = start.handle; ret = 0; out: -g_free(start); g_free(session); g_free(dh_cert); return ret; -- 2.25.1
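The transformation in this patch is a plain-C idiom: a small, fixed-size ioctl argument can live on the stack, with designated initializers replacing `g_new0()`'s zeroing, so no exit path has to free anything. A generic sketch with an invented struct (not the actual KVM ABI layout):

```c
#include <stdint.h>

struct launch_params {       /* stand-in for a small, fixed-size ioctl arg */
    uint32_t handle;
    uint32_t policy;
    uint64_t session_uaddr;
    uint32_t session_len;
};

static struct launch_params make_start(uint32_t handle, uint32_t policy)
{
    /* Designated initializers zero every unnamed field, as g_new0() did. */
    struct launch_params start = {
        .handle = handle,
        .policy = policy,
    };
    return start;
}
```

The C standard guarantees that fields not named in the initializer list are zero-initialized, so the ioctl still sees a fully defined struct.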
[PATCH 0/2] target/i386/sev: Replace malloc with local variables
In two places in sev.c we use malloc+free to manage memory for small constant struct variables. Modify this to use local variables. This small series can be applied on top of master or on top of Phil's Housekeeping SEV series [1]. [1] https://lore.kernel.org/qemu-devel/20211007161716.453984-1-phi...@redhat.com/ Dov Murik (2): target/i386/sev: Use local variable for kvm_sev_launch_start target/i386/sev: Use local variable for kvm_sev_launch_measure target/i386/sev.c | 39 +-- 1 file changed, 17 insertions(+), 22 deletions(-) -- 2.25.1
[PATCH 2/2] target/i386/sev: Use local variable for kvm_sev_launch_measure
The struct kvm_sev_launch_measure has a constant and small size, and therefore we can use a regular local variable for it instead of allocating and freeing heap memory for it. Signed-off-by: Dov Murik --- target/i386/sev.c | 16 +++- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/target/i386/sev.c b/target/i386/sev.c index 0062566c71..eede07f11d 100644 --- a/target/i386/sev.c +++ b/target/i386/sev.c @@ -729,7 +729,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) SevGuestState *sev = sev_guest; int ret, error; g_autofree guchar *data = NULL; -g_autofree struct kvm_sev_launch_measure *measurement = NULL; +struct kvm_sev_launch_measure measurement = {}; if (!sev_check_state(sev, SEV_STATE_LAUNCH_UPDATE)) { return; @@ -743,23 +743,21 @@ sev_launch_get_measure(Notifier *notifier, void *unused) } } -measurement = g_new0(struct kvm_sev_launch_measure, 1); - /* query the measurement blob length */ ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, -measurement, ); -if (!measurement->len) { +, ); +if (!measurement.len) { error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", __func__, ret, error, fw_error_to_str(errno)); return; } -data = g_new0(guchar, measurement->len); -measurement->uaddr = (unsigned long)data; +data = g_new0(guchar, measurement.len); +measurement.uaddr = (unsigned long)data; /* get the measurement blob */ ret = sev_ioctl(sev->sev_fd, KVM_SEV_LAUNCH_MEASURE, -measurement, ); +, ); if (ret) { error_report("%s: LAUNCH_MEASURE ret=%d fw_error=%d '%s'", __func__, ret, error, fw_error_to_str(errno)); @@ -769,7 +767,7 @@ sev_launch_get_measure(Notifier *notifier, void *unused) sev_set_guest_state(sev, SEV_STATE_LAUNCH_SECRET); /* encode the measurement value and emit the event */ -sev->measurement = g_base64_encode(data, measurement->len); +sev->measurement = g_base64_encode(data, measurement.len); trace_kvm_sev_launch_measurement(sev->measurement); } -- 2.25.1
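`sev_launch_get_measure()` relies on a common ioctl convention: call once with a zero-length buffer to learn the required size, allocate, then call again to fetch the data. A generic sketch of that convention, where `fake_measure` is an invented stand-in for the kernel side of `KVM_SEV_LAUNCH_MEASURE`:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct measure_arg {
    uint64_t uaddr;  /* userspace buffer address, 0 on the query call */
    uint32_t len;    /* in: buffer size; out: required/used size */
};

/* Invented stand-in for the kernel side of the measurement ioctl. */
static int fake_measure(struct measure_arg *m)
{
    static const uint8_t blob[] = {0xde, 0xad, 0xbe, 0xef};
    if (m->len < sizeof(blob)) {
        m->len = sizeof(blob);   /* report the required length */
        return -1;               /* the kernel would set fw_error here */
    }
    memcpy((void *)(uintptr_t)m->uaddr, blob, sizeof(blob));
    m->len = sizeof(blob);
    return 0;
}

/* Two-call pattern: query length, allocate, fetch. */
static uint8_t *get_measurement(uint32_t *len_out)
{
    struct measure_arg m = {0};
    fake_measure(&m);                    /* 1st call: m.len now set */
    if (m.len == 0) {
        return NULL;
    }
    uint8_t *data = calloc(m.len, 1);
    m.uaddr = (uintptr_t)data;
    if (fake_measure(&m) != 0) {         /* 2nd call: fills the buffer */
        free(data);
        return NULL;
    }
    *len_out = m.len;
    return data;
}
```

Note that only the variable-length blob needs heap allocation; the argument struct itself, as the patch shows, can stay on the stack.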
Re: [PATCH 3/6] numa: Add SGXEPCSection list for multiple sections
On Mon, Oct 11, 2021 at 07:15:51PM +0800, Yang Zhong wrote: > The SGXEPCSection list added into SGXInfo to show the multiple > SGX EPC sections detailed info, not the total size like before. > > Signed-off-by: Yang Zhong > --- > qapi/misc-target.json | 19 +-- > 1 file changed, 17 insertions(+), 2 deletions(-) > > diff --git a/qapi/misc-target.json b/qapi/misc-target.json > index 594fbd1577..89a5a4250a 100644 > --- a/qapi/misc-target.json > +++ b/qapi/misc-target.json > @@ -334,6 +334,21 @@ >'returns': 'SevAttestationReport', >'if': 'TARGET_I386' } > > +## > +# @SGXEPCSection: > +# > +# Information about intel SGX EPC section info > +# > +# @index: the SGX epc section index > +# > +# @size: the size of epc section > +# > +# Since: 6.2 > +## > +{ 'struct': 'SGXEPCSection', > + 'data': { 'index': 'uint64', > +'size': 'uint64'}} > + > ## > # @SGXInfo: > # > @@ -347,7 +362,7 @@ > # > # @flc: true if FLC is supported > # > -# @section-size: The EPC section size for guest > +# @sections: The EPC sections info for guest > # > # Since: 6.2 Given this has not yet been in a stable release, we can make this change... > ## > @@ -356,7 +371,7 @@ > 'sgx1': 'bool', > 'sgx2': 'bool', > 'flc': 'bool', > -'section-size': 'uint64'}, > +'sections': ['SGXEPCSection']}, > 'if': 'TARGET_I386' } ...but are we sure we have the best interface possible if we are still expressing uncertainty about the QAPI used to represent it? -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 0/7] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
On 04.10.21 14:02, David Hildenbrand wrote: #1 is a preparation for improved error reporting, #2 adds support for MADV_POPULATE_WRITE, #3 cleans up the code to avoid global variables and prepare for concurrency, #4 and #5 optimize thread handling, #6 makes os_mem_prealloc() safe to be called from multiple threads concurrently and #7 makes the SIGBUS handler coexist cleanly with the MCE SIGBUS handler under Linux. Details regarding MADV_POPULATE_WRITE can be found in introducing upstream Linux commits 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") and eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE)"), and in the man page update [1]. Paolo, are you planning on taking this via your tree (POSIX)? Thanks! -- Thanks, David / dhildenb
Re: [PATCH 1/6] numa: Enable numa for SGX EPC sections
On Mon, Oct 11, 2021 at 07:15:49PM +0800, Yang Zhong wrote: > The basic SGX did not enable numa for SGX EPC sections, which > result in all EPC sections located in numa node 0. This patch > enable SGX numa function in the guest and the EPC section can > work with RAM as one numa node. > > The Guest kernel related log: > [0.009981] ACPI: SRAT: Node 0 PXM 0 [mem 0x18000-0x183ff] > [0.009982] ACPI: SRAT: Node 1 PXM 1 [mem 0x18400-0x185bf] > The SRAT table can normally show SGX EPC sections menory info in different > numa nodes. > > The SGX EPC numa related command: > .. > -m 4G,maxmem=20G \ > -smp sockets=2,cores=2 \ > -cpu host,+sgx-provisionkey \ > -object memory-backend-ram,size=2G,host-nodes=0,policy=bind,id=node0 \ > -object > memory-backend-epc,id=mem0,size=64M,prealloc=on,host-nodes=0,policy=bind \ > -numa node,nodeid=0,cpus=0-1,memdev=node0 \ > -object memory-backend-ram,size=2G,host-nodes=1,policy=bind,id=node1 \ > -object > memory-backend-epc,id=mem1,size=28M,prealloc=on,host-nodes=1,policy=bind \ > -numa node,nodeid=1,cpus=2-3,memdev=node1 \ > -M > sgx-epc.0.memdev=mem0,sgx-epc.0.node=0,sgx-epc.1.memdev=mem1,sgx-epc.1.node=1 > \ > .. > > Signed-off-by: Yang Zhong > --- > qapi/machine.json | 6 +- > include/hw/i386/sgx-epc.h | 3 +++ > hw/i386/acpi-build.c | 4 > hw/i386/sgx-epc.c | 3 +++ > hw/i386/sgx.c | 44 +++ > monitor/hmp-cmds.c| 1 + > qemu-options.hx | 4 ++-- > 7 files changed, 62 insertions(+), 3 deletions(-) > > diff --git a/qapi/machine.json b/qapi/machine.json > index 5db54df298..09b6188e6f 100644 > --- a/qapi/machine.json > +++ b/qapi/machine.json > @@ -1213,6 +1213,7 @@ >'data': { '*id': 'str', > 'memaddr': 'size', > 'size': 'size', > +'node': 'int', > 'memdev': 'str' >} > } > @@ -1288,7 +1289,10 @@ > # Since: 6.2 > ## > { 'struct': 'SgxEPC', > - 'data': { 'memdev': 'str' } } > + 'data': { 'memdev': 'str', > +'node': 'int' > + } > +} Missing documentation of the new member. -- Eric Blake, Principal Software Engineer Red Hat, Inc. 
+1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: Moving QEMU downloads to GitLab Releases?
Hi, > > So while I'm all for making things a little more independent, > > let's not do it in a way that makes life difficult for downstreams. > > There are more there than just a couple of big distro builders. > > I think this is fine. Firmware blobs aren't needed to build QEMU, only > to run the system emulator. Yep. But doing firmware builds outside gitlab should be easy, so we should probably keep the core logic in a script / Makefile / whatever (similar to today's roms/Makefile), then simply invoke that from gitlab ci yaml files. take care, Gerd
Re: [PATCH] qcow2: Silence clang -m32 compiler warning
On Mon, Oct 11, 2021 at 05:50:31PM +0200, Hanna Reitz wrote: > With -m32, size_t is generally only a uint32_t. That makes clang > complain that in the assertion > > assert(qiov->size <= INT64_MAX); > > the range of the type of qiov->size (size_t) is too small for any of its > values to ever exceed INT64_MAX. Yep, I'm not surprised that we hit that. > > Cast qiov->size to uint64_t to silence clang. > > Fixes: f7ef38dd1310d7d9db76d0aa16899cbc5744f36d >("block: use int64_t instead of uint64_t in driver read >handlers") > Signed-off-by: Hanna Reitz > --- > I don't know whether this is the best possible solution, or whether we > should care about this at all (I personally think it's basically just > wrong for clang to warn about always-true conditions in assertions), but > I thought I might as well just send this patch as the basis for a > discussion. I agree that the compiler is overly noisy, but the fix is simple enough that I'm fine with it as written. Reviewed-by: Eric Blake Since the original went through my tree, I'm happy to take this one through my NBD tree as well. > --- > block/qcow2-cluster.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c > index 5727f92dcb..21884a1ab9 100644 > --- a/block/qcow2-cluster.c > +++ b/block/qcow2-cluster.c > @@ -513,7 +513,8 @@ static int coroutine_fn > do_perform_cow_read(BlockDriverState *bs, > */ > assert(src_cluster_offset <= INT64_MAX); > assert(src_cluster_offset + offset_in_cluster <= INT64_MAX); > -assert(qiov->size <= INT64_MAX); > +/* Cast qiov->size to uint64_t to silence a compiler warning on -m32 */ > +assert((uint64_t)qiov->size <= INT64_MAX); > bdrv_check_qiov_request(src_cluster_offset + offset_in_cluster, > qiov->size, > qiov, 0, _abort); > /* > -- > 2.31.1 > -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [PATCH v4 2/2] monitor: refactor set/expire_password and allow VNC display id
On Mon, Oct 11, 2021 at 04:03:01PM +0100, Dr. David Alan Gilbert wrote: > * Stefan Reiter (s.rei...@proxmox.com) wrote: > > It is possible to specify more than one VNC server on the command line, > > either with an explicit ID or the auto-generated ones à la "default", > > "vnc2", "vnc3", ... > > > > +++ b/monitor/hmp-cmds.c > > @@ -1451,10 +1451,41 @@ void hmp_set_password(Monitor *mon, const QDict > > *qdict) > > { > > const char *protocol = qdict_get_str(qdict, "protocol"); > > const char *password = qdict_get_str(qdict, "password"); > > +const char *display = qdict_get_try_str(qdict, "display"); > > const char *connected = qdict_get_try_str(qdict, "connected"); > > Error *err = NULL; > > +DisplayProtocol proto; > > > > -qmp_set_password(protocol, password, !!connected, connected, ); > > +SetPasswordOptions opts = { > > +.password = g_strdup(password), > > You're leaking that strdup on the error returns; you probably want to > add an error: exit and goto it to do all the cleanup. Or maybe there's a way to use g_autofree to let the compiler clean it up automatically. > > > +.u.vnc.display = NULL, > > +}; > > + > > +proto = qapi_enum_parse(_lookup, protocol, > > +DISPLAY_PROTOCOL_VNC, ); > > +if (err) { > > +hmp_handle_error(mon, err); > > +return; > > +} > > +opts.protocol = proto; > > + > > +if (proto == DISPLAY_PROTOCOL_VNC) { > > +opts.u.vnc.has_display = !!display; > > +opts.u.vnc.display = g_strdup(display); > > +} else if (proto == DISPLAY_PROTOCOL_SPICE) { > > +opts.u.spice.has_connected = !!connected; > > +opts.u.spice.connected = > > +qapi_enum_parse(_lookup, connected, > > +SET_PASSWORD_ACTION_KEEP, ); > > +if (err) { > > +hmp_handle_error(mon, err); > > +return; > > +} > > +} > > + > > +qmp_set_password(, ); > > +g_free(opts.password); > > +g_free(opts.u.vnc.display); Hmm. Why are we hand-cleaning only portions of the QAPI type instead of using the generated qapi_free_SetPasswordOptions() do to things? 
> > hmp_handle_error(mon, err); > > } > > > > @@ -1462,9 +1493,31 @@ void hmp_expire_password(Monitor *mon, const QDict > > *qdict) > > { > > const char *protocol = qdict_get_str(qdict, "protocol"); > > const char *whenstr = qdict_get_str(qdict, "time"); > > +const char *display = qdict_get_try_str(qdict, "display"); > > Error *err = NULL; > > +DisplayProtocol proto; > > > > -qmp_expire_password(protocol, whenstr, ); > > +ExpirePasswordOptions opts = { > > +.time = g_strdup(whenstr), > > +.u.vnc.display = NULL, > > +}; > > Same here; that 'whenstr' gets leaked on errors. > > > +proto = qapi_enum_parse(_lookup, protocol, > > +DISPLAY_PROTOCOL_VNC, ); > > +if (err) { > > +hmp_handle_error(mon, err); > > +return; > > +} > > +opts.protocol = proto; > > + > > +if (proto == DISPLAY_PROTOCOL_VNC) { > > +opts.u.vnc.has_display = !!display; > > +opts.u.vnc.display = g_strdup(display); > > +} > > + > > +qmp_expire_password(, ); > > +g_free(opts.time); > > +g_free(opts.u.vnc.display); > > hmp_handle_error(mon, err); Same here - using the generated qapi_free_ function rather than doing things by hand, and/or smart use of g_autofree, may make this easier to maintain. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
[RFC PATCH] target/s390x: move tcg_gen_insn_start to s390x_tr_insn_start
We use INDEX_op_insn_start to mark the start of instruction boundaries. If we don't do it in the .insn_start hook, things get confused, especially now that plugins want to use that marking to identify the start of instructions and will bomb out if they see instrumented ops before the first instruction boundary. Signed-off-by: Alex Bennée --- target/s390x/tcg/translate.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/target/s390x/tcg/translate.c b/target/s390x/tcg/translate.c index f284870cd2..fe145ff2eb 100644 --- a/target/s390x/tcg/translate.c +++ b/target/s390x/tcg/translate.c @@ -6380,9 +6380,6 @@ static DisasJumpType translate_one(CPUS390XState *env, DisasContext *s) /* Search for the insn in the table. */ insn = extract_insn(env, s); -/* Emit insn_start now that we know the ILEN. */ -tcg_gen_insn_start(s->base.pc_next, s->cc_op, s->ilen); - /* Not found means unimplemented/illegal opcode. */ if (insn == NULL) { qemu_log_mask(LOG_UNIMP, "unimplemented opcode 0x%02x%02x\n", @@ -6550,8 +6547,30 @@ static void s390x_tr_tb_start(DisasContextBase *db, CPUState *cs) { } +/* + * We do just enough partial instruction decoding here to calculate the + * length of the instruction so we can drop the INDEX_op_insn_start + * before anything else is emitted in the TCGOp stream. + * + * See extract_insn for the full decode. + */ static void s390x_tr_insn_start(DisasContextBase *dcbase, CPUState *cs) { +CPUS390XState *env = cs->env_ptr; +DisasContext *s = container_of(dcbase, DisasContext, base); +uint64_t insn, pc = s->base.pc_next; +int op, ilen; + +if (unlikely(s->ex_value)) { +ilen = s->ex_value & 0xf; +} else { +insn = ld_code2(env, s, pc); /* FIXME: don't reload same pc twice */ +op = (insn >> 8) & 0xff; +ilen = get_ilen(op); +} + +/* Emit insn_start now that we know the ILEN. */ +tcg_gen_insn_start(s->base.pc_next, s->cc_op, ilen); } static void s390x_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs) -- 2.30.2
Re: [PATCH v4 08/11] tests/acpi: allow updates of VIOT expected data files
On Fri, 8 Oct 2021 16:26:22 +0100 Jean-Philippe Brucker wrote: > On Wed, Oct 06, 2021 at 10:12:15AM +0200, Igor Mammedov wrote: > > On Fri, 1 Oct 2021 18:33:56 +0100 > > Jean-Philippe Brucker wrote: > > > > > Create empty data files and allow updates for the upcoming VIOT tests. > > > > > > Signed-off-by: Jean-Philippe Brucker > > > --- > > > tests/qtest/bios-tables-test-allowed-diff.h | 3 +++ > > > tests/data/acpi/q35/DSDT.viot | 0 > > > > does default tests/data/acpi/q35/DSDT differs from > > DSDT.viot? > > Yes the VIOT test has one more PCI device (virtio-iommu) and PXB devices, > so there are additional descriptors in the DSDT also see tests/qtest/bios-tables-test.c step 6 (---include diff--- part) > > Thanks, > Jean > > > > > > tests/data/acpi/q35/VIOT.viot | 0 > > > tests/data/acpi/virt/VIOT | 0 > > > 4 files changed, 3 insertions(+) > > > create mode 100644 tests/data/acpi/q35/DSDT.viot > > > create mode 100644 tests/data/acpi/q35/VIOT.viot > > > create mode 100644 tests/data/acpi/virt/VIOT > > > > > > diff --git a/tests/qtest/bios-tables-test-allowed-diff.h > > > b/tests/qtest/bios-tables-test-allowed-diff.h > > > index dfb8523c8b..29b5b1eabc 100644 > > > --- a/tests/qtest/bios-tables-test-allowed-diff.h > > > +++ b/tests/qtest/bios-tables-test-allowed-diff.h > > > @@ -1 +1,4 @@ > > > /* List of comma-separated changed AML files to ignore */ > > > +"tests/data/acpi/virt/VIOT", > > > +"tests/data/acpi/q35/DSDT.viot", > > > +"tests/data/acpi/q35/VIOT.viot", > > > diff --git a/tests/data/acpi/q35/DSDT.viot b/tests/data/acpi/q35/DSDT.viot > > > new file mode 100644 > > > index 00..e69de29bb2 > > > diff --git a/tests/data/acpi/q35/VIOT.viot b/tests/data/acpi/q35/VIOT.viot > > > new file mode 100644 > > > index 00..e69de29bb2 > > > diff --git a/tests/data/acpi/virt/VIOT b/tests/data/acpi/virt/VIOT > > > new file mode 100644 > > > index 00..e69de29bb2 > > >
Re: ACPI endianness
On Mon, Oct 11, 2021 at 03:51:01PM +0200, BALATON Zoltan wrote: > > ... but given we did not previously do the read, maybe we should keep > > it that way at least for the time being. > > How do you know there was no read before this write? Did you check it? I've > only added a printf in the write method and saw the value was swapped but > did not check if there was a read before that. There are no traces in these > methods so maybe I would not see it unless adding a printf there too. All I am saying is that qemu did not convert a write into a read+write.
[PATCH] qcow2: Silence clang -m32 compiler warning
With -m32, size_t is generally only a uint32_t. That makes clang complain that in the assertion assert(qiov->size <= INT64_MAX); the range of the type of qiov->size (size_t) is too small for any of its values to ever exceed INT64_MAX. Cast qiov->size to uint64_t to silence clang. Fixes: f7ef38dd1310d7d9db76d0aa16899cbc5744f36d ("block: use int64_t instead of uint64_t in driver read handlers") Signed-off-by: Hanna Reitz --- I don't know whether this is the best possible solution, or whether we should care about this at all (I personally think it's basically just wrong for clang to warn about always-true conditions in assertions), but I thought I might as well just send this patch as the basis for a discussion. --- block/qcow2-cluster.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 5727f92dcb..21884a1ab9 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -513,7 +513,8 @@ static int coroutine_fn do_perform_cow_read(BlockDriverState *bs, */ assert(src_cluster_offset <= INT64_MAX); assert(src_cluster_offset + offset_in_cluster <= INT64_MAX); -assert(qiov->size <= INT64_MAX); +/* Cast qiov->size to uint64_t to silence a compiler warning on -m32 */ +assert((uint64_t)qiov->size <= INT64_MAX); bdrv_check_qiov_request(src_cluster_offset + offset_in_cluster, qiov->size, qiov, 0, _abort); /* -- 2.31.1
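For reference, the pattern under discussion is easy to reproduce outside QEMU. A minimal stand-alone sketch (the helper name is made up, it is not the qcow2 code) of the always-true assertion and the widening cast that silences clang:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On an ILP32 target (e.g. clang -m32), size_t is 32 bits wide, so a
 * plain "sz <= INT64_MAX" is trivially true and clang may flag the
 * comparison as tautological.  Widening the operand to uint64_t first
 * keeps the assertion meaningful on LP64 hosts while silencing the
 * warning on 32-bit ones. */
static void check_request_size(size_t sz)
{
    assert((uint64_t)sz <= INT64_MAX);
}
```

The cast is a no-op on 64-bit hosts, where size_t is already 64 bits wide, so the check loses nothing there.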
Re: Moving QEMU downloads to GitLab Releases?
On Mon, Oct 11, 2021 at 08:28:34AM -0600, Warner Losh wrote: > On Mon, Oct 11, 2021 at 4:59 AM Stefan Hajnoczi wrote: > > > On Mon, Oct 11, 2021 at 09:21:30AM +0200, Gerd Hoffmann wrote: > > > Hi, > > > > > > > > I guess the main question is who is using the ROM/BIOS sources in the > > > > > tarballs, and would this 2-step process work for those users? If > > there > > > > > are distros relying on them then maybe there are some more logistics > > to > > > > > consider since the make-release downloads are both time-consuming and > > > > > prone to network errors/stalls since it relies on the availability > > of a > > > > > good number of different hosts. > > > > > > > > Great, maybe Gerd can comment on splitting out firmware. > > > > > > I think the binaries are mostly a convenience feature for developers. > > > Distros typically build from source anyway, and typically they build > > > from upstream source instead of qemu submodule. > > > > > > The idea to split them out to a separate repo is exists for quite a > > > while. I have an old qemu-firmware repo laying around on my disk, which > > > basically moves roms/ + submodules and the binaries built from it over. > > > > > > I think with the switch to gitlab it doesn't make sense any more to > > > commit pre-built firmware binaries to a git repo. Instead we should > > > build the firmware in gitlab ci, i.e. effectively move the build rules > > > we have right now in roms/Makefile to .gitlab-ci.d/, then publish the > > > firmware binaries as build artifacts or gitlab pages. > > > > > > When done just remove the pre-build binaries from git and add a helper > > > script to fetch firmware binaries from gitlab instead. > > > > > > > QEMU mirrors firmware sources on GitLab too. We're able to provide the > > > > same level of download availability on our mirror firmware repos as for > > > > the main qemu.git repo. 
> > > > > > I think enabling CI for the firmware mirrors should work given that it > > > is possible to specify a custom CI/CD configuration file, i.e. use > > > something like > > > > > > /qemu-project/qemu/.gitlab-ci.d/firmware/seabios.yml > > > > > > or > > > > > > /qemu-project/qemu-firmware/seabios.yml > > > > > > as config file for the > > > > > > /qemu-project/seabios/ > > > > > > mirror. Then we can publish binaries using gitlab pages at > > > > > > https://qemu-project.gitlab.io/seabios/ > > > > > > That way we also don't need the roms/ submodules any more, i.e. we > > > can remove both sources and binaries for the firmware from the > > > release tarballs. > > > > Thanks! > > > > For developer convenience it would be nice to avoid manual steps after > > git clone qemu.git. Maybe ./configure should check for firmware blobs > > and automatically download them. There could be a ./configure > > --disable-firmware-download option that distros can use to skip the > > download when building everything from source. > > > > One thing to keep in mind about the downstream consumers: Many > of them have two phases to their build process that run asynchronously > to each other. There is a fetch phase that grabs everything, and a build > phase that builds the binaries. The second phase may not have access > to the internet for a variety of reasons (these days being a security > measure, for good or ill). Please make sure that any such plans > allow for this model. > > Today, that's all dealt with by grabbing tarballs from github which > is how the submodules are dealt with. So long as the images > had well known names, and could be fetched with the normal > wget/cgit/fetch programs, that would suffice. Requiring use of > some weird bespoke script would cause friction for the downstreams > using qemu. > > So while I'm all for making things a little more independent, > let's not do it in a way that makes life difficult for downstreams. 
> There are more there than just a couple of big distro builders. I think this is fine. Firmware blobs aren't needed to build QEMU, only to run the system emulator. Stefan
Re: [PATCH 5/8] linux-user: Support TCG_TARGET_SIGNED_ADDR32
On 10/11/21 3:22 AM, Alex Bennée wrote: Richard Henderson writes: When using reserved_va, which is the default for a 64-bit host and a 32-bit guest, set guest_base_signed_addr32 if requested by TCG_TARGET_SIGNED_ADDR32, and the executable layout allows. Signed-off-by: Richard Henderson --- include/exec/cpu-all.h | 4 --- linux-user/elfload.c | 62 ++ 2 files changed, 50 insertions(+), 16 deletions(-) diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h index 80b5e17329..71d8e1de7a 100644 --- a/include/exec/cpu-all.h +++ b/include/exec/cpu-all.h @@ -278,11 +278,7 @@ extern intptr_t qemu_host_page_mask; #define PAGE_RESET 0x0040 /* For linux-user, indicates that the page is MAP_ANON. */ #define PAGE_ANON 0x0080 - -#if defined(CONFIG_BSD) && defined(CONFIG_USER_ONLY) -/* FIXME: Code that sets/uses this is broken and needs to go away. */ #define PAGE_RESERVED 0x0100 -#endif Can we reference why this FIXME is being dropped in the commit message? I'm not sure to what pbrook was referring with "... and is already broken" there. I need something here to reserve a page, PAGE_RESERVED seems like a good name, so I took it out of the cupboard. I'll do some archaeology. r~ Looking at the current tree state I can see several uses of it due to moves in 5b6dd8683d (exec: move TB handling to translate-all.c) which post-date 2e9a5713f0 (Remove PAGE_RESERVED). Otherwise looks reasonable: Reviewed-by: Alex Bennée
Re: [PATCH 3/8] accel/tcg: Support TCG_TARGET_SIGNED_ADDR32 for softmmu
On 10/10/21 9:30 PM, WANG Xuerui wrote: @@ -92,6 +93,9 @@ static inline size_t sizeof_tlb(CPUTLBDescFast *fast) static inline uintptr_t g2h_tlbe(const CPUTLBEntry *tlb, target_ulong gaddr) { +if (TCG_TARGET_SIGNED_ADDR32 && TARGET_LONG_BITS == 32) { It seems this branch's direction should always match that of the branch added below, so if TARGET_LONG_BITS == TARGET_LONG_BITS == 32 this invariant is broken? Or is this expected behavior? The conditions should match, yes. In revising the patch set I decided that the tcg backend should simply not set this flag for a 32-bit host. r~
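For readers following along, the difference between the two treatments of a 32-bit guest address can be shown with a small stand-alone sketch (the helper is illustrative only, not the actual g2h_tlbe() code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: with TCG_TARGET_SIGNED_ADDR32, a 32-bit guest
 * address is sign-extended to the host width before the guest_base
 * offset is applied; otherwise it is zero-extended.  The fast path and
 * the slow path must pick the same extension, which is the invariant
 * questioned above. */
static uint64_t extend_guest_addr(uint32_t gaddr, int signed_addr32)
{
    return signed_addr32 ? (uint64_t)(int64_t)(int32_t)gaddr
                         : (uint64_t)gaddr;
}
```

Addresses below 2 GiB extend identically either way; the two schemes only diverge once bit 31 of the guest address is set.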
Re: [PATCH 6/8] tcg/aarch64: Support TCG_TARGET_SIGNED_ADDR32
On 10/11/21 3:28 AM, Alex Bennée wrote: Richard Henderson writes: AArch64 has both sign and zero-extending addressing modes, which means that either treatment of guest addresses is equally efficient. Enabling this for AArch64 gives us testing of the feature in CI. So which guests front ends will exercise this backend? All 32-bit guests. Is this something we can exercise in 32 bit user mode tests? Yes. Which is why I enabled this for aarch64, so that we'd have a major platform testing it daily. r~
Re: [RFC v2 1/2] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
On Tue, 5 Oct 2021 10:53:12 +0200 Eric Auger wrote: > Add a 'preserve_config' field in struct GPEXConfig and > if set generate the DSM #5 for preserving PCI boot configurations. > The DSM presence is needed to expose RMRs. here should be pointers to spec and location within it where it says preserving PCI boot configuration is necessary or in absence of that a bit more detailed explanation why it's necessary. > > At the moment the DSM generation is not yet enabled. > > Signed-off-by: Eric Auger > --- > include/hw/pci-host/gpex.h | 1 + > hw/pci-host/gpex-acpi.c| 12 > 2 files changed, 13 insertions(+) > > diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h > index fcf8b63820..3f8f8ec38d 100644 > --- a/include/hw/pci-host/gpex.h > +++ b/include/hw/pci-host/gpex.h > @@ -64,6 +64,7 @@ struct GPEXConfig { > MemMapEntry pio; > int irq; > PCIBus *bus; > +boolpreserve_config; s/^^^/preserve_fw_config/ > }; > > int gpex_set_irq_num(GPEXHost *s, int index, int gsi); > diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c > index e7e162a00a..7dab259379 100644 > --- a/hw/pci-host/gpex-acpi.c > +++ b/hw/pci-host/gpex-acpi.c > @@ -164,6 +164,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig > *cfg) > aml_append(dev, aml_name_decl("_PXM", aml_int(numa_node))); > } > > +if (cfg->preserve_config) { > +method = aml_method("_DSM", 5, AML_SERIALIZED); > +aml_append(method, aml_return(aml_int(0))); > +aml_append(dev, method); > +} > + > acpi_dsdt_add_pci_route_table(dev, cfg->irq); > > /* > @@ -191,6 +197,12 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig > *cfg) > aml_append(dev, aml_name_decl("_STR", aml_unicode("PCIe 0 Device"))); > aml_append(dev, aml_name_decl("_CCA", aml_int(1))); > > +if (cfg->preserve_config) { > +method = aml_method("_DSM", 5, AML_SERIALIZED); > +aml_append(method, aml_return(aml_int(0))); > +aml_append(dev, method); > +} > + these ones seem to wrong , it adds duplicate _DSM methods with wrong ARGs number. 
virt board already has _DSM defined, see acpi_dsdt_add_pci_osc() E5C937D0-3553-4D7A-9117-EA4D19C3434D you need to modify that one (and possibly move out DSM into a separate function), also preserving config might regress what commit 40c3472a29c9a was fixing. > acpi_dsdt_add_pci_route_table(dev, cfg->irq); > > method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
Re: [PATCH v2 04/15] qom: Reduce use of error_propagate()
Kevin Wolf writes: > ERRP_GUARD() makes debugging easier by making sure that _abort > still fails at the real origin of the error instead of > error_propagate(). > > Signed-off-by: Kevin Wolf Yes, please! Reviewed-by: Markus Armbruster
Re: [PATCH v4 2/2] monitor: refactor set/expire_password and allow VNC display id
* Stefan Reiter (s.rei...@proxmox.com) wrote: > It is possible to specify more than one VNC server on the command line, > either with an explicit ID or the auto-generated ones à la "default", > "vnc2", "vnc3", ... > > It is not possible to change the password on one of these extra VNC > displays though. Fix this by adding a "display" parameter to the > "set_password" and "expire_password" QMP and HMP commands. > > For HMP, the display is specified using the "-d" value flag. > > For QMP, the schema is updated to explicitly express the supported > variants of the commands with protocol-discriminated unions. > > Suggested-by: Eric Blake > Suggested-by: Markus Armbruster > Signed-off-by: Stefan Reiter > --- > hmp-commands.hx| 24 --- > monitor/hmp-cmds.c | 57 ++- > monitor/qmp-cmds.c | 62 ++-- > qapi/ui.json | 173 + > 4 files changed, 235 insertions(+), 81 deletions(-) > > diff --git a/hmp-commands.hx b/hmp-commands.hx > index cf723c69ac..d78e4cfc47 100644 > --- a/hmp-commands.hx > +++ b/hmp-commands.hx > @@ -1514,33 +1514,35 @@ ERST > > { > .name = "set_password", > -.args_type = "protocol:s,password:s,connected:s?", > -.params = "protocol password action-if-connected", > +.args_type = "protocol:s,password:s,display:-dS,connected:s?", > +.params = "protocol password [-d display] [action-if-connected]", > .help = "set spice/vnc password", > .cmd= hmp_set_password, > }, > > SRST > -``set_password [ vnc | spice ] password [ action-if-connected ]`` > - Change spice/vnc password. *action-if-connected* specifies what > - should happen in case a connection is established: *fail* makes the > - password change fail. *disconnect* changes the password and > +``set_password [ vnc | spice ] password [ -d display ] [ action-if-connected > ]`` > + Change spice/vnc password. *display* can be used with 'vnc' to specify > + which display to set the password on. 
*action-if-connected* specifies > + what should happen in case a connection is established: *fail* makes > + the password change fail. *disconnect* changes the password and >disconnects the client. *keep* changes the password and keeps the >connection up. *keep* is the default. > ERST > > { > .name = "expire_password", > -.args_type = "protocol:s,time:s", > -.params = "protocol time", > +.args_type = "protocol:s,time:s,display:-dS", > +.params = "protocol time [-d display]", > .help = "set spice/vnc password expire-time", > .cmd= hmp_expire_password, > }, > > SRST > -``expire_password [ vnc | spice ]`` *expire-time* > - Specify when a password for spice/vnc becomes > - invalid. *expire-time* accepts: > +``expire_password [ vnc | spice ] expire-time [ -d display ]`` > + Specify when a password for spice/vnc becomes invalid. > + *display* behaves the same as in ``set_password``. > + *expire-time* accepts: > >``now`` > Invalidate password instantly. > diff --git a/monitor/hmp-cmds.c b/monitor/hmp-cmds.c > index b5e71d9e6f..48b3fe4519 100644 > --- a/monitor/hmp-cmds.c > +++ b/monitor/hmp-cmds.c > @@ -1451,10 +1451,41 @@ void hmp_set_password(Monitor *mon, const QDict > *qdict) > { > const char *protocol = qdict_get_str(qdict, "protocol"); > const char *password = qdict_get_str(qdict, "password"); > +const char *display = qdict_get_try_str(qdict, "display"); > const char *connected = qdict_get_try_str(qdict, "connected"); > Error *err = NULL; > +DisplayProtocol proto; > > -qmp_set_password(protocol, password, !!connected, connected, ); > +SetPasswordOptions opts = { > +.password = g_strdup(password), You're leaking that strdup on the error returns; you probably want to add an error: exit and goto it to do all the cleanup. 
> +.u.vnc.display = NULL, > +}; > + > +proto = qapi_enum_parse(_lookup, protocol, > +DISPLAY_PROTOCOL_VNC, ); > +if (err) { > +hmp_handle_error(mon, err); > +return; > +} > +opts.protocol = proto; > + > +if (proto == DISPLAY_PROTOCOL_VNC) { > +opts.u.vnc.has_display = !!display; > +opts.u.vnc.display = g_strdup(display); > +} else if (proto == DISPLAY_PROTOCOL_SPICE) { > +opts.u.spice.has_connected = !!connected; > +opts.u.spice.connected = > +qapi_enum_parse(_lookup, connected, > +SET_PASSWORD_ACTION_KEEP, ); > +if (err) { > +hmp_handle_error(mon, err); > +return; > +} > +} > + > +qmp_set_password(, ); > +g_free(opts.password); > +g_free(opts.u.vnc.display); > hmp_handle_error(mon, err); > } > > @@ -1462,9 +1493,31 @@ void hmp_expire_password(Monitor
Re: Moving QEMU downloads to GitLab Releases?
On Fri, Oct 1, 2021 at 12:20 AM Stefan Hajnoczi wrote: > On Thu, Sep 30, 2021 at 03:57:49PM +, Eldon Stegall wrote: > > Hello! > > I'd be happy to help with this. I'm mostly a consumer of QEMU, but > > greatly appreciate all the work this community has done, and was able > > to contribute a little by helping with QEMU advent this past year. I > > would be happy to help streamline some of this activities if that would > > be welcome, and would gratefully contribute time and resources. Hosting > > and serving data like this has been core to my recent experience. > > > > I would be happy to suggest and build out a distribution strategy for > > these packages, and believe I could cut some costs, and even convince a > > small consultancy I am a part of here that uses QEMU to foot a > > reasonable bill. > > > > A brief introduction, since I haven't had the pleasure of attending > > FOSDEM or any other QEMU meetups: I am a startup-oriented Cloud Security > > Architect, based out of Atlanta, previously with companies like > > DataStax, but now working on AWS video pipelines for a startup here. > > Thanks for joining the discussion and for running last year's QEMU > Advent Calendar, Eldon. > > Any ideas for moving download.qemu.org to a hosted service would be > appreciated! We haven't compared CDN and cloud providers closely yet. If > you have experience in this area or time to check them out, then that > would be valuable. > > QEMU has funds if there is a cost for file hosting (probably less than > $100/month). Some providers may be willing to support an open source > project for free. Possible providers include CloudFlare, Akamai, Fastly, > Microsoft Azure, Google Cloud Storage, etc. > https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/ Let me know if ya'll apply and I'm happy to push it through. Regards, Anthony Liguori
RE: [PATCH v2] vmdk: allow specification of tools version
Hi Hanna, > -Original Message- > From: Hanna Reitz > Sent: Monday, October 11, 2021 4:09 PM > Subject: Re: [PATCH v2] vmdk: allow specification of tools version > > On 13.09.21 15:04, Thomas Weißschuh wrote: > > VMDK files support an attribute that represents the version of the > > guest tools that are installed on the disk. > > This attribute is used by vSphere before a machine has been started to > > determine if the VM has the guest tools installed. > > This is important when configuring "Operating system customizations" > > in vSphere, as it checks for the presence of the guest tools before > > allowing those customizations. > > Thus when the VM has not yet booted normally it would be impossible to > > customize it, therefore preventing a customized first-boot. > > > > The attribute should not hurt on disks that do not have the guest > > tools installed and indeed the VMware tools also unconditionally add > > this attribute. > > (Defaulting to the value "2147483647", as is done in this patch) > > > > Signed-off-by: Thomas Weißschuh > > --- > > > > v1: > > https://lore.kernel.org/qemu-devel/20210908174250.12946-1-thomas.weissschuh.ext@zeiss.com/ > > > > v1 -> v2: > > * Expand QAPI docs (Eric Blake) > > > > block/vmdk.c | 24 > > qapi/block-core.json | 3 +++ > > 2 files changed, 23 insertions(+), 4 deletions(-) > > [...] > > > diff --git a/qapi/block-core.json b/qapi/block-core.json index > > c8ce1d9d5d..42431f52d0 100644 > > --- a/qapi/block-core.json > > +++ b/qapi/block-core.json > > @@ -4597,6 +4597,8 @@ > > # @adapter-type: The adapter type used to fill in the descriptor. Default: ide. 
> > # @hwversion: Hardware version. The meaningful options are "4" or "6". > > # Default: "4". > > +# @toolsversion: VMware guest tools version. > > + Default: "2147483647" (Since 6.2) > > There's a # missing at the start of the line, and so this doesn't compile. > > I've added it (hope that's OK for you) and taken this patch to my block > branch: Absolutely, thanks for noticing and fixing. > https://gitlab.com/hreitz/qemu/-/commits/block/ Thomas
Re: [PATCH] s390x/ipl: check kernel command line size
Thomas Huth writes: > On 06/10/2021 11.26, Marc Hartmayer wrote: >> Check if the provided kernel command line exceeds the maximum size of the >> s390x >> Linux kernel command line size, which is 896 bytes. >> >> Reported-by: Sven Schnelle >> Signed-off-by: Marc Hartmayer >> --- >> hw/s390x/ipl.c | 12 +++- >> 1 file changed, 11 insertions(+), 1 deletion(-) >> >> diff --git a/hw/s390x/ipl.c b/hw/s390x/ipl.c >> index 1821c6faeef3..a58ea58cc736 100644 >> --- a/hw/s390x/ipl.c >> +++ b/hw/s390x/ipl.c >> @@ -38,6 +38,7 @@ >> #define KERN_IMAGE_START0x01UL >> #define LINUX_MAGIC_ADDR0x010008UL >> #define KERN_PARM_AREA 0x010480UL >> +#define KERN_PARM_AREA_SIZE 0x000380UL >> #define INITRD_START0x80UL >> #define INITRD_PARM_START 0x010408UL >> #define PARMFILE_START 0x001000UL >> @@ -190,10 +191,19 @@ static void s390_ipl_realize(DeviceState *dev, Error >> **errp) >>* loader) and it won't work. For this case we force it to >> 0x1, too. >>*/ >> if (pentry == KERN_IMAGE_START || pentry == 0x800) { >> -char *parm_area = rom_ptr(KERN_PARM_AREA, strlen(ipl->cmdline) >> + 1); >> +size_t cmdline_size = strlen(ipl->cmdline) + 1; >> +char *parm_area = rom_ptr(KERN_PARM_AREA, cmdline_size); >> + >> ipl->start_addr = KERN_IMAGE_START; >> /* Overwrite parameters in the kernel image, which are "rom" */ >> if (parm_area) { >> +if (cmdline_size > KERN_PARM_AREA_SIZE) { >> +error_setg(errp, >> + "kernel command line exceeds maximum size: >> %lu > %lu", > > I think the first %lu should be %zd instead? Yep, makes sense - thanks! > > Apart from that, the patch looks fine to me... so if you agree, I can fix > that up when picking up the patch. Thanks. 
> > Thomas > > >> + cmdline_size, KERN_PARM_AREA_SIZE); >> +return; >> +} >> + >> strcpy(parm_area, ipl->cmdline); >> } >> } else { >> > -- Kind regards / Beste Grüße Marc Hartmayer IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Gregor Pillen Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294
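To round off the format-specifier point raised above: size_t wants %zu (and ssize_t wants %zd); %lu only happens to match on LP64 hosts. A stand-alone sketch of the bounds check with the portable specifier (illustrative names, not the ipl.c code):

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define PARM_AREA_SIZE ((size_t)0x380)   /* 896 bytes, as in the patch */

/* Returns -1 and formats a diagnostic if the command line would overflow
 * the kernel parameter area; %zu is the portable specifier for size_t. */
static int check_cmdline_size(size_t cmdline_size, char *msg, size_t msglen)
{
    if (cmdline_size > PARM_AREA_SIZE) {
        snprintf(msg, msglen,
                 "kernel command line exceeds maximum size: %zu > %zu",
                 cmdline_size, PARM_AREA_SIZE);
        return -1;
    }
    return 0;
}
```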