On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update. These
> >>>> save and restore VM state, with minimal guest pause time, so that
> >>>> qemu may be updated to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file. It
> >>>> supports two modes: restart and reboot. For restart, cprsave
> >>>> exec's the qemu binary (or /usr/bin/qemu-exec if it exists) with
> >>>> the same argv. qemu restarts in a paused state and waits for the
> >>>> cprload command.
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash
> >>> the file descriptors. The 'getfd' QMP command already exists and
> >>> QEMU code already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked, but it may be possible to drop some patches by
> >>> reusing QEMU's monitor file descriptor passing, since the code
> >>> already knows how to open fds received via 'getfd'.
> >>>
> >>> The reason why using QMP is interesting is that it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how
> >>> to get fds via the QMP monitor in the new process before processing
> >>> the entire command line.
> >>
> >> Early on, I experimented with a similar approach. Old qemu passed
> >> descriptors to an escrow process and exited; new qemu started and
> >> retrieved the descriptors from escrow. vfio mostly worked after I
> >> hacked the kernel to suppress the original-pid owner check. I
> >> suspect my recent vfio extensions would smooth the rough edges.
> >
> > I wonder about the reason for VFIO's pid limitation. Maybe it is
> > because it pins pages from the original process?
>
> The dma unmap code verifies that the requesting task is the same as
> the task that mapped the pages. We could add an ioctl that passes
> ownership to a new task. We would also need to fix locked memory
> accounting, which is associated with the mm of the original task.
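As an aside, for readers unfamiliar with the interface: the type1
map/unmap path under discussion looks roughly like this from userspace.
This is a minimal sketch only; map_guest_ram/unmap_guest_ram are
illustrative helper names, not QEMU code, and error handling is
omitted.

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Map a range of guest RAM for DMA. The kernel pins the pages and
     * accounts them against the calling task's mm. */
    static int map_guest_ram(int container, void *vaddr, uint64_t iova,
                             uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)vaddr,
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }

    /* Per the check described above, unmap fails if the caller is not
     * the task that established the mapping. */
    static int unmap_guest_ram(int container, uint64_t iova,
                               uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .iova  = iova,
            .size  = size,
        };
        return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }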
> > Is this VFIO pid limitation the main reason why you chose to make
> > QEMU execve(2) the new binary?
>
> That is one. Plus, re-attaching to named shared memory for pc.ram
> causes the vfio conflict errors I mentioned in the previous email. We
> would need to suppress redundant dma map calls, but allow legitimate
> dma maps and unmaps in response to the ongoing address space changes
> and diff callbacks caused by some drivers. It would be messy and
> fragile. In general, it felt like I was working against vfio rather
> than with it.
>
> Another big reason is a requirement to preserve anonymous memory for
> legacy qemu updates (via code injection, which I briefly mentioned at
> KVM Forum). If we extend cpr to allow updates without exec, I still
> need the exec option.
>
> >> However, the main issue is that guest ram must be backed by named
> >> shared memory, and we would need to add code to support shared
> >> memory for all the secondary memory objects. That makes it less
> >> interesting for us at this time; we care about updating legacy qemu
> >> instances with anonymous guest memory.
> >
> > Thanks for explaining this more in the other sub-thread. The
> > secondary memory objects you mentioned are relatively small, so I
> > don't think saving them in the traditional way is a problem.
> >
> > Two approaches for zero-copy memory migration fit into QEMU's
> > existing migration infrastructure:
> >
> > - Marking RAM blocks that are backed by named memory (tmpfs,
> >   hugetlbfs, etc.) so they are not saved into the savevm file. The
> >   existing --object memory-backend-file syntax can be used.
> >
> > - Extending the live migration protocol to detect when file
> >   descriptor passing is available (i.e. UNIX domain socket
> >   migration) and using that for memory-backend-* objects that have
> >   fds.
> >
> > Either of these approaches would handle RAM with the existing
> > savevm/migrate commands.
>
> Yes, but the vfio issues would still need to be solved, and we would
> need new command line options to back existing and future secondary
> memory objects with named shared memory.
>
> > The remaining issue is how to migrate VFIO and other file
> > descriptors that cannot be reopened by the new process. As
> > mentioned, QEMU already has file descriptor passing support in the
> > QMP monitor and support for opening passed file descriptors (see
> > qemu_open_internal(), monitor_fd_param(), and socket_get_fd()).
> >
> > The advantage of integrating live update functionality into the
> > existing savevm/migrate commands is that it would work in more use
> > cases, with less code duplication/maintenance/bitrot, than the
> > special-case cprsave command in this patch series.
> >
> > Maybe there is a fundamental technical reason why live update needs
> > to be different from QEMU's existing migration commands, but I
> > haven't figured it out yet.
>
> vfio and anonymous memory.
>
> Regarding code duplication, I did consider whether to extend the
> migration syntax and implementation versus creating something new.
> Those functions handle stuff like bdrv snapshot, aio, and migration,
> which are n/a for the cpr use case, and the cpr functions handle state
> that is n/a for the migration case. I judged that handling both in the
> same functions would be less readable and maintainable. After feedback
> during the V1 review, I simplified the cprsave code by calling
> qemu_save_device_state, as Xen does, thus eliminating any interaction
> with the migration code.
>
> Regarding bit rot, I still need to add a cpr test to the test suite,
> when the review is more complete and folks agree on the final form of
> the functionality.
>
> I do like the idea of supporting update without exec, but as a future
> project, and not at the expense of dropping update with exec.
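To make the monitor fd-passing path concrete: a management process
hands an already-open descriptor to QEMU by sending the 'getfd' command
with the descriptor attached as SCM_RIGHTS ancillary data on the QMP
UNIX socket. A minimal sketch, assuming the QMP greeting and
qmp_capabilities negotiation have already happened on 'sock';
qmp_send_fd and the fdname "ram0" are illustrative, not part of the
series:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int qmp_send_fd(int sock, int fd)
    {
        const char *cmd =
            "{\"execute\":\"getfd\",\"arguments\":{\"fdname\":\"ram0\"}}\n";
        struct iovec iov = {
            .iov_base = (void *)cmd,
            .iov_len  = strlen(cmd),
        };
        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        memset(&u, 0, sizeof(u));
        msg.msg_iov        = &iov;
        msg.msg_iovlen     = 1;
        msg.msg_control    = u.buf;
        msg.msg_controllen = sizeof(u.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        /* QEMU stores the received fd under the name "ram0"; code that
         * calls monitor_fd_param()/qemu_open_internal() can open it
         * later. */
        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }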
Alex: We're discussing how to live update QEMU while VFIO devices are
running. This patch series introduces monitor commands that call
execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
file descriptors. This way, live update is transparent to VFIO, but it
won't work if a sandboxed QEMU process is forbidden to call execve(2).
What are your thoughts on 1) the execve(2) approach and 2) extending
VFIO to allow running devices to be attached to a different process, so
that execve(2) is not necessary?

Steven: Do you know if cpr will work with Intel's upcoming Shared
Virtual Addressing? I'm worried that execve(2) may be a short-term
solution that works around VFIO's current limitations, but that even
execve(2) may stop working in the future as IOMMUs and DMA approaches
change.

Stefan
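P.S. For reference, the restart mode's exec step boils down to keeping
the inherited descriptors alive across execve(2) and re-running the
same argv. A minimal sketch of the general mechanism, not the series'
actual code; 'preserved' and its bookkeeping are hypothetical:

    #include <fcntl.h>
    #include <unistd.h>

    /* Clear FD_CLOEXEC on the fds that must survive, then re-exec the
     * (possibly updated) binary with the original argv. */
    static void reexec_qemu(const char *path, char *const argv[],
                            const int *preserved, int n)
    {
        for (int i = 0; i < n; i++) {
            int flags = fcntl(preserved[i], F_GETFD);
            if (flags >= 0) {
                fcntl(preserved[i], F_SETFD, flags & ~FD_CLOEXEC);
            }
        }
        execv(path, argv);   /* returns only on failure */
    }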