On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update.  These save
> >>>> and restore VM state, with minimal guest pause time, so that qemu may
> >>>> be updated to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It
> >>>> supports two modes: restart and reboot.  For restart, cprsave exec's
> >>>> the qemu binary (or /usr/bin/qemu-exec if it exists) with the same
> >>>> argv.  qemu restarts in a paused state and waits for the cprload
> >>>> command.
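> >>>>
> >>>> For illustration only (the exact monitor syntax is defined by the
> >>>> patches; this invocation is assumed from the description above):
> >>>>
> >>>>     (qemu) cprsave /tmp/vm.cpr restart
> >>>>     ... the updated qemu binary exec's and waits, paused ...
> >>>>     (qemu) cprload /tmp/vm.cpr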
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>> already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked but it may be possible to drop some patches by reusing
> >>> QEMU's monitor file descriptor passing since the code already knows how
> >>> to open from 'getfd' fds.
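> >>>
> >>> As a minimal sketch (the fd name is made up for illustration; it must
> >>> match the fdname given to the QMP 'getfd' command), the new process
> >>> could resolve a stashed fd with the existing helper:
> >>>
> >>>     Error *err = NULL;
> >>>     /* look up the fd that management passed earlier via 'getfd' */
> >>>     int fd = monitor_fd_param(monitor_cur(), "vfio-cont-fd", &err);
> >>>     if (fd < 0) {
> >>>         error_report_err(err);
> >>>     }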
> >>>
> >>> The reason why using QMP is interesting is because it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how to
> >>> get fds via the QMP monitor in the new process before processing the
> >>> entire command-line.
> >>
> >> Early on I experimented with a similar approach.  Old qemu passed
> >> descriptors to an escrow process and exited; new qemu started and
> >> retrieved the descriptors from escrow.  vfio mostly worked after I
> >> hacked the kernel to suppress the original-pid owner check.  I suspect
> >> my recent vfio extensions would smooth the rough edges.
> > 
> > I wonder about the reason for VFIO's pid limitation, maybe because it
> > pins pages from the original process?
> 
> The dma unmap code verifies that the requesting task is the same as the
> task that mapped the pages.  We could add an ioctl that passes ownership
> to a new task.  We would also need to fix locked memory accounting, which
> is associated with the mm of the original task.
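>
> A purely hypothetical sketch of what that ioctl's uapi could look like
> (nothing like this exists in <linux/vfio.h> today; the name, fields, and
> ioctl number are all invented for illustration):
>
>     struct vfio_iommu_type1_change_owner {
>         __u32 argsz;
>         __u32 flags;
>         __s32 pidfd;    /* task that takes over the mappings and the
>                          * locked-memory accounting for pinned pages */
>     };
>     #define VFIO_IOMMU_CHANGE_OWNER _IO(VFIO_TYPE, VFIO_BASE + 42)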
> 
> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
> 
> That is one.  Plus, re-attaching to named shared memory for pc.ram causes
> the vfio conflict errors I mentioned in the previous email.  We would need
> to suppress redundant dma map calls, but allow legitimate dma maps and
> unmaps in response to the ongoing address space changes and diff callbacks
> caused by some drivers.  It would be messy and fragile.  In general, it
> felt like I was working against vfio rather than with it.
> 
> Another big reason is a requirement to preserve anonymous memory for
> legacy qemu updates (via code injection, which I briefly mentioned at KVM
> Forum).  If we extend cpr to allow updates without exec, I still need the
> exec option.
> 
> >> However, the main issue is that guest ram must be backed by named
> >> shared memory, and we would need to add code to support shared memory
> >> for all the secondary memory objects.  That makes it less interesting
> >> for us at this time; we care about updating legacy qemu instances with
> >> anonymous guest memory.
> > 
> > Thanks for explaining this more in the other sub-thread. The secondary
> > memory objects you mentioned are relatively small so I don't think
> > saving them in the traditional way is a problem.
> > 
> > Two approaches for zero-copy memory migration fit into QEMU's existing
> > migration infrastructure:
> > 
> > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >   etc) so they are not saved into the savevm file. The existing --object
> >   memory-backend-file syntax can be used.
> > 
> > - Extending the live migration protocol to detect when file descriptor
> >   passing is available (i.e. UNIX domain socket migration) and using
> >   that for memory-backend-* objects that have fds.
> > 
> > Either of these approaches would handle RAM with existing savevm/migrate
> > commands.
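> >
> > A rough sketch of the first approach (the predicate below is
> > illustrative, not an existing migration hook):
> >
> >     /* Blocks whose contents live in a named file (tmpfs, hugetlbfs,
> >      * etc.) survive across restart via the file or fd, so only
> >      * anonymous memory needs to be copied into the savevm file. */
> >     static bool ram_block_needs_save(RAMBlock *block)
> >     {
> >         return block->fd < 0 || !qemu_ram_is_shared(block);
> >     }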
> 
> Yes, but the vfio issues would still need to be solved, and we would need
> new command line options to back existing and future secondary memory
> objects with named shared memory.
> 
> > The remaining issue is how to migrate VFIO and other file descriptors
> > that cannot be reopened by the new process. As mentioned, QEMU already
> > has file descriptor passing support in the QMP monitor and support for
> > opening passed file descriptors (see qemu_open_internal(),
> > monitor_fd_param(), and socket_get_fd()).
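> >
> > For example (the fdset id is illustrative), a path like /dev/fdset/1
> > makes qemu_open() reuse an fd previously added with the QMP 'add-fd'
> > command instead of opening a new file:
> >
> >     Error *err = NULL;
> >     int fd = qemu_open("/dev/fdset/1", O_RDWR, &err);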
> > 
> > The advantage of integrating live update functionality into the existing
> > savevm/migrate commands is that it will work in more use cases, with
> > less code duplication, maintenance burden, and bitrot than the
> > special-case cprsave command in this patch series.
> > 
> > Maybe there is a fundamental technical reason why live update needs to
> > be different from QEMU's existing migration commands but I haven't
> > figured it out yet.
> 
> vfio and anonymous memory.
> 
> Regarding code duplication, I did consider whether to extend the
> migration syntax and implementation versus creating something new.  Those
> functions handle stuff like bdrv snapshot, aio, and migration, which are
> n/a for the cpr use case, and the cpr functions handle state that is n/a
> for the migration case.  I judged that handling both in the same
> functions would be less readable and maintainable.  After feedback during
> the V1 review, I simplified the cprsave code by calling
> qemu_save_device_state, as Xen does, thus eliminating any interaction
> with the migration code.
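>
> The rough shape of that simplification, as a sketch (the helper name and
> error handling here are illustrative, not the code from the series):
>
>     static int cpr_save_state(const char *path, Error **errp)
>     {
>         QIOChannelFile *fioc;
>         QEMUFile *f;
>         int ret;
>
>         fioc = qio_channel_file_new_path(path, O_WRONLY | O_CREAT |
>                                          O_TRUNC, 0600, errp);
>         if (!fioc) {
>             return -1;
>         }
>         f = qemu_fopen_channel_output(QIO_CHANNEL(fioc));
>         object_unref(OBJECT(fioc));
>         ret = qemu_save_device_state(f);   /* device state only, no RAM */
>         qemu_fclose(f);
>         return ret;
>     }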
> 
> Regarding bit rot, I still need to add a cpr test to the test suite, when
> the review is more complete and folks agree on the final form of the
> functionality.
> 
> I do like the idea of supporting update without exec, but as a future
> project, and not at the expense of dropping update with exec.

Alex: We're discussing how to live update QEMU while VFIO devices are
running. This patch series introduces monitor commands that call
execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
file descriptors. This way live update is transparent to VFIO but it
won't work if a sandboxed QEMU process is forbidden to call execve(2).
What are your thoughts on 1) the execve(2) approach and 2) extending
VFIO to allow running devices to be attached to a different process so
that execve(2) is not necessary?

Steven: Do you know if cpr will work with Intel's upcoming Shared
Virtual Addressing? I'm worried that execve(2) may be a short-term
solution that works around VFIO's current limitations but even execve(2)
may stop working in the future as IOMMUs and DMA approaches change.

Stefan
