On 5/18/2021 9:39 AM, Stefan Hajnoczi wrote:
> On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote:
>> On Mon, 17 May 2021 12:40:43 +0100
>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>
>>> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
>>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>>>> Provide the cprsave and cprload commands for live update. These
>>>>>>>> save and restore VM state, with minimal guest pause time, so that
>>>>>>>> qemu may be updated to a new version in between.
>>>>>>>>
>>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file. It
>>>>>>>> supports two modes: restart and reboot. For restart, cprsave
>>>>>>>> exec's the qemu binary (or /usr/bin/qemu-exec if it exists) with
>>>>>>>> the same argv. qemu restarts in a paused state and waits for the
>>>>>>>> cprload command.
>>>>>>>
>>>>>>> I think cprsave/cprload could be generalized by using QMP to stash
>>>>>>> the file descriptors. The 'getfd' QMP command already exists and
>>>>>>> QEMU code already opens fds passed using this mechanism.
>>>>>>>
>>>>>>> I haven't checked but it may be possible to drop some patches by
>>>>>>> reusing QEMU's monitor file descriptor passing since the code
>>>>>>> already knows how to open from 'getfd' fds.
>>>>>>>
>>>>>>> The reason why using QMP is interesting is because it eliminates
>>>>>>> the need for execve(2). QEMU may be unable to execute a program
>>>>>>> due to chroot, seccomp, etc.
>>>>>>>
>>>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>>>> execve(2).
>>>>>>>
>>>>>>> One tricky thing with this approach might be startup ordering: how
>>>>>>> to get fds via the QMP monitor in the new process before
>>>>>>> processing the entire command-line.
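An aside for readers of the archive: 'getfd' receives the descriptor as
SCM_RIGHTS ancillary data on the QMP unix socket, after the usual
qmp_capabilities handshake. A rough, untested sketch of the sending
side (the function name is made up):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Attach fd to a 'getfd' command as SCM_RIGHTS ancillary data and
     * send both to the QMP monitor socket.  QEMU then stores the fd
     * under 'fdname' for later lookup by name. */
    static int qmp_send_getfd(int mon_sock, int fd, const char *fdname)
    {
        char buf[256];
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct iovec iov;
        struct msghdr msg;
        struct cmsghdr *cmsg;
        int n;

        n = snprintf(buf, sizeof(buf),
                     "{\"execute\": \"getfd\", "
                     "\"arguments\": {\"fdname\": \"%s\"}}\n", fdname);

        memset(&msg, 0, sizeof(msg));
        memset(cbuf, 0, sizeof(cbuf));
        iov.iov_base = buf;
        iov.iov_len  = n;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(mon_sock, &msg, 0) < 0 ? -1 : 0;
    }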
>>>>>> Early on I experimented with a similar approach. Old qemu passed
>>>>>> descriptors to an escrow process and exited; new qemu started and
>>>>>> retrieved the descriptors from escrow. vfio mostly worked after I
>>>>>> hacked the kernel to suppress the original-pid owner check. I
>>>>>> suspect my recent vfio extensions would smooth the rough edges.
>>>>>
>>>>> I wonder about the reason for VFIO's pid limitation, maybe because
>>>>> it pins pages from the original process?
>>>>
>>>> The dma unmap code verifies that the requesting task is the same as
>>>> the task that mapped the pages. We could add an ioctl that passes
>>>> ownership to a new task. We would also need to fix locked memory
>>>> accounting, which is associated with the mm of the original task.
>>>>
>>>>> Is this VFIO pid limitation the main reason why you chose to make
>>>>> QEMU execve(2) the new binary?
>>>>
>>>> That is one. Plus, re-attaching to named shared memory for pc.ram
>>>> causes the vfio conflict errors I mentioned in the previous email.
>>>> We would need to suppress redundant dma map calls, but allow
>>>> legitimate dma maps and unmaps in response to the ongoing address
>>>> space changes and diff callbacks caused by some drivers. It would
>>>> be messy and fragile. In general, it felt like I was working against
>>>> vfio rather than with it.
>>>>
>>>> Another big reason is a requirement to preserve anonymous memory for
>>>> legacy qemu updates (via code injection, which I briefly mentioned
>>>> at KVM Forum). If we extend cpr to allow updates without exec, I
>>>> still need the exec option.
>>>>
>>>>>> However, the main issue is that guest ram must be backed by named
>>>>>> shared memory, and we would need to add code to support shared
>>>>>> memory for all the secondary memory objects. That makes it less
>>>>>> interesting for us at this time; we care about updating legacy
>>>>>> qemu instances with anonymous guest memory.
>>>>>
>>>>> Thanks for explaining this more in the other sub-thread. The
>>>>> secondary memory objects you mentioned are relatively small so I
>>>>> don't think saving them in the traditional way is a problem.
>>>>>
>>>>> Two approaches for zero-copy memory migration fit into QEMU's
>>>>> existing migration infrastructure:
>>>>>
>>>>> - Marking RAM blocks that are backed by named memory (tmpfs,
>>>>>   hugetlbfs, etc) so they are not saved into the savevm file. The
>>>>>   existing --object memory-backend-file syntax can be used.
>>>>>
>>>>> - Extending the live migration protocol to detect when file
>>>>>   descriptor passing is available (i.e. UNIX domain socket
>>>>>   migration) and using that for memory-backend-* objects that
>>>>>   have fds.
>>>>>
>>>>> Either of these approaches would handle RAM with existing
>>>>> savevm/migrate commands.
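Detecting the second case is cheap. An untested sketch (the function
name is made up): only a unix socket can carry SCM_RIGHTS, so check the
address family of the migration channel:

    #include <sys/socket.h>

    /* Returns 1 if the channel is a UNIX domain socket and can carry
     * SCM_RIGHTS ancillary data, i.e. fd passing will work. */
    static int channel_supports_fd_passing(int sock)
    {
        struct sockaddr_storage ss;
        socklen_t len = sizeof(ss);

        if (getsockname(sock, (struct sockaddr *)&ss, &len) < 0)
            return 0;   /* not a socket at all (exec:, a pipe, ...) */
        return ss.ss_family == AF_UNIX;
    }

Everything else (tcp:, exec:, a pipe) would fall back to streaming the
memory pages through the channel as today.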
>>>> Yes, but the vfio issues would still need to be solved, and we
>>>> would need new command line options to back existing and future
>>>> secondary memory objects with named shared memory.
>>>>
>>>>> The remaining issue is how to migrate VFIO and other file
>>>>> descriptors that cannot be reopened by the new process. As
>>>>> mentioned, QEMU already has file descriptor passing support in the
>>>>> QMP monitor and support for opening passed file descriptors (see
>>>>> qemu_open_internal(), monitor_fd_param(), and socket_get_fd()).
>>>>>
>>>>> The advantage of integrating live update functionality into the
>>>>> existing savevm/migrate commands is that it will work in more use
>>>>> cases, with less code duplication, maintenance, and bitrot than
>>>>> the special-case cprsave command in this patch series.
>>>>>
>>>>> Maybe there is a fundamental technical reason why live update
>>>>> needs to be different from QEMU's existing migration commands, but
>>>>> I haven't figured it out yet.
>>>>
>>>> vfio and anonymous memory.
>>>>
>>>> Regarding code duplication, I did consider whether to extend the
>>>> migration syntax and implementation versus creating something new.
>>>> Those functions handle stuff like bdrv snapshot, aio, and
>>>> migration, which are n/a for the cpr use case, and the cpr
>>>> functions handle state that is n/a for the migration case. I judged
>>>> that handling both in the same functions would be less readable and
>>>> maintainable. After feedback during the V1 review, I simplified the
>>>> cprsave code by calling qemu_save_device_state, as Xen does, thus
>>>> eliminating any interaction with the migration code.
>>>>
>>>> Regarding bit rot, I still need to add a cpr test to the test
>>>> suite, when the review is more complete and folks agree on the
>>>> final form of the functionality.
>>>>
>>>> I do like the idea of supporting update without exec, but as a
>>>> future project, and not at the expense of dropping update with exec.
>>>
>>> Alex: We're discussing how to live update QEMU while VFIO devices
>>> are running. This patch series introduces monitor commands that call
>>> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
>>> file descriptors. This way live update is transparent to VFIO, but
>>> it won't work if a sandboxed QEMU process is forbidden to call
>>> execve(2). What are your thoughts on 1) the execve(2) approach and
>>> 2) extending VFIO to allow running devices to be attached to a
>>> different process so that execve(2) is not necessary?
>>
>> Tracking processes is largely to support page pinning; we need to
>> support asynchronous page pinning to handle requests from mdev
>> drivers, and we need to make sure pinned page accounting is tracked
>> to the same process. If userspace can "pay" for locked pages from one
>> process on mapping, then "credit" them to another process on unmap,
>> that seems fairly exploitable. We'd need some way to transfer the
>> locked memory accounting or handle it outside of vfio. Thanks,
>
> Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to
> associate the in-kernel vhost device with a userspace process and its
> mm.
>
> Would it be possible to add a VFIO_SET_OWNER ioctl that associates the
> current process with the vfio_device? Only one process would be the
> owner at any given time.

It is possible, but the implementation would need to invent new hooks in
mm to transfer the locked memory accounting.
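For comparison, the vhost pattern Stefan mentions is tiny; the ioctl
takes no argument and just latches the calling task and its mm. An
untested sketch (the function name is made up; the VFIO analogue in the
comment is Stefan's hypothetical, not a real ioctl):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /* Open a vhost device (e.g. /dev/vhost-net) and become its owner.
     * VHOST_SET_OWNER binds the device to the caller's task and mm.
     * A hypothetical VFIO_SET_OWNER could have the same shape, but
     * would also have to move the locked-page accounting that vfio
     * charged against the old owner's mm, which vhost never needs. */
    static int vhost_open_and_own(const char *path)
    {
        int fd = open(path, O_RDWR);

        if (fd < 0)
            return -1;
        if (ioctl(fd, VHOST_SET_OWNER) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }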
> I'm not sure how existing DMA mappings would behave, but this patch
> series seems to rely on DMA continuing to work even though there is a
> window of time when the execve(2) process doesn't have guest RAM
> mmapped. So I guess existing DMA mappings continue to work because the
> pages were previously pinned?

Correct. And changes to mappings are blocked between the pre-exec
process issuing VFIO_DMA_UNMAP_FLAG_VADDR and the post-exec process
issuing VFIO_DMA_MAP_FLAG_VADDR.

- Steve
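P.S. For the archive, the vaddr handoff around the exec looks roughly
like this. Untested sketch; the helper names are made up, and it
assumes a kernel with the vfio vaddr-update interfaces:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Pre-exec: invalidate the vaddr of an existing mapping.  The
     * pages stay pinned and device DMA keeps working; map/unmap
     * requests for the range block until a new vaddr is supplied. */
    static int cpr_suspend_vaddr(int container, uint64_t iova,
                                 uint64_t size)
    {
        struct vfio_iommu_type1_dma_unmap unmap;

        memset(&unmap, 0, sizeof(unmap));
        unmap.argsz = sizeof(unmap);
        unmap.flags = VFIO_DMA_UNMAP_FLAG_VADDR;
        unmap.iova  = iova;
        unmap.size  = size;
        return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
    }

    /* Post-exec: register the new process's vaddr for the same iova
     * range.  Nothing is re-pinned; only the vaddr is recorded. */
    static int cpr_resume_vaddr(int container, uint64_t iova,
                                uint64_t size, void *new_vaddr)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_VADDR;
        map.vaddr = (uintptr_t)new_vaddr;
        map.iova  = iova;
        map.size  = size;
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }

In-flight DMA keeps working between the two calls because the pages
stay pinned; only operations that need the vaddr block until it is
restored.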