On 5/18/2021 9:39 AM, Stefan Hajnoczi wrote:
> On Mon, May 17, 2021 at 01:10:01PM -0600, Alex Williamson wrote:
>> On Mon, 17 May 2021 12:40:43 +0100
>> Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>
>>> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
>>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
>>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
>>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
>>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
>>>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be
>>>>>>>> updated to a new version in between.
>>>>>>>>
>>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports
>>>>>>>> two modes: restart and reboot.  For restart, cprsave exec's the qemu
>>>>>>>> binary (or /usr/bin/qemu-exec if it exists) with the same argv.  qemu
>>>>>>>> restarts in a paused state and waits for the cprload command.
>>>>>>>
>>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>>>> already opens fds passed using this mechanism.
>>>>>>>
>>>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>>>> to open from 'getfd' fds.
>>>>>>>
>>>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>>>> chroot, seccomp, etc.
>>>>>>>
>>>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>>>> execve(2).
>>>>>>>
>>>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>>>> get fds via the QMP monitor in the new process before processing the
>>>>>>> entire command-line.  
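[For reference, the 'getfd' mechanism is plain SCM_RIGHTS ancillary data on
the QMP UNIX socket.  A minimal sketch of the sending side, assuming an
already-connected socket; the fdname "ram0" is illustrative and error
handling is omitted:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static void qmp_send_getfd(int qmp_sock, int fd)
    {
        const char *cmd =
            "{\"execute\": \"getfd\", \"arguments\": {\"fdname\": \"ram0\"}}";
        struct iovec iov = { .iov_base = (void *)cmd, .iov_len = strlen(cmd) };
        char ctrl[CMSG_SPACE(sizeof(fd))];
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;            /* attach the descriptor */
        cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

        sendmsg(qmp_sock, &msg, 0);              /* command and fd travel together */
    }
]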
>>>>>>
>>>>>> Early on I experimented with a similar approach.  Old qemu passed
>>>>>> descriptors to an escrow process and exited; new qemu started and
>>>>>> retrieved the descriptors from escrow.  vfio mostly worked after I hacked
>>>>>> the kernel to suppress the original-pid owner check.  I suspect my recent
>>>>>> vfio extensions would smooth the rough edges.
>>>>>
>>>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>>>> pins pages from the original process?  
>>>>
>>>> The dma unmap code verifies that the requesting task is the same as the
>>>> task that mapped the pages.  We could add an ioctl that passes ownership
>>>> to a new task.  We would also need to fix locked memory accounting, which
>>>> is associated with the mm of the original task.
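[Paraphrasing the check described above: each mapping records the task that
created it, and requests from any other task are refused.  A sketch only,
not the exact vfio_iommu_type1 code:

    /* each vfio_dma remembers who created it */
    if (dma->task != current->group_leader)
            return -EPERM;   /* unmap must come from the mapping task */
]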
>>>>   
>>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>>>> execve(2) the new binary?  
>>>>
>>>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes
>>>> the vfio conflict errors I mentioned in the previous email.  We would need
>>>> to suppress redundant dma map calls, but allow legitimate dma maps and
>>>> unmaps in response to the ongoing address space changes and diff callbacks
>>>> caused by some drivers.  It would be messy and fragile.  In general, it
>>>> felt like I was working against vfio rather than with it.
>>>>
>>>> Another big reason is a requirement to preserve anonymous memory for
>>>> legacy qemu updates (via code injection, which I briefly mentioned at KVM
>>>> Forum).  If we extend cpr to allow updates without exec, I still need the
>>>> exec option.
>>>>   
>>>>>> However, the main issue is that guest ram must be backed by named shared
>>>>>> memory, and we would need to add code to support shared memory for all
>>>>>> the secondary memory objects.  That makes it less interesting for us at
>>>>>> this time; we care about updating legacy qemu instances with anonymous
>>>>>> guest memory.
>>>>>
>>>>> Thanks for explaining this more in the other sub-thread. The secondary
>>>>> memory objects you mentioned are relatively small so I don't think
>>>>> saving them in the traditional way is a problem.
>>>>>
>>>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>>>> migration infrastructure:
>>>>>
>>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>>>   etc) so they are not saved into the savevm file. The existing --object
>>>>>   memory-backend-file syntax can be used.
>>>>>
>>>>> - Extending the live migration protocol to detect when file descriptor
>>>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>>>   that for memory-backend-* objects that have fds.
>>>>>
>>>>> Either of these approaches would handle RAM with existing savevm/migrate
>>>>> commands.  
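[For the first option, the existing syntax Stefan refers to would look
roughly like the following; the id, path, and size are illustrative only:

    qemu-system-x86_64 \
        -object memory-backend-file,id=pc.ram,size=4G,mem-path=/dev/shm/pc.ram,share=on \
        -machine memory-backend=pc.ram \
        ...
]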
>>>>
>>>> Yes, but the vfio issues would still need to be solved, and we would need
>>>> new command line options to back existing and future secondary memory
>>>> objects with named shared memory.
>>>>   
>>>>> The remaining issue is how to migrate VFIO and other file descriptors
>>>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>>>> has file descriptor passing support in the QMP monitor and support for
>>>>> opening passed file descriptors (see qemu_open_internal(),
>>>>> monitor_fd_param(), and socket_get_fd()).
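[A concrete sketch of that existing support, assuming QEMU's current
internal APIs; the usage and the fd name "vfio0" are made up for
illustration:

    /* resolve an fd previously stashed with the QMP 'getfd' command */
    Error *err = NULL;
    int fd = monitor_fd_param(monitor_cur(), "vfio0", &err);
    if (fd < 0) {
        error_report_err(err);   /* no such fd name, or no monitor */
    }
]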
>>>>>
>>>>> The advantage of integrating live update functionality into the existing
>>>>> savevm/migrate commands is that it will work in more use cases, with less
>>>>> code duplication/maintenance and less bitrot, than the special-case
>>>>> cprsave command in this patch series.
>>>>>
>>>>> Maybe there is a fundamental technical reason why live update needs to
>>>>> be different from QEMU's existing migration commands but I haven't
>>>>> figured it out yet.  
>>>>
>>>> vfio and anonymous memory.
>>>>
>>>> Regarding code duplication, I did consider whether to extend the migration
>>>> syntax and implementation versus creating something new.  Those functions
>>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the
>>>> cpr use case, and the cpr functions handle state that is n/a for the
>>>> migration case.  I judged that handling both in the same functions would
>>>> be less readable and maintainable.  After feedback during the V1 review, I
>>>> simplified the cprsave code by calling qemu_save_device_state, as Xen
>>>> does, thus eliminating any interaction with the migration code.
>>>>
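[Condensed, that flow is roughly the following.  A sketch simplified from
the series, not the exact code; 'f' is assumed to be a QEMUFile already open
on the vmstate file, and 'restart_mode' stands in for the mode check:

    vm_stop(RUN_STATE_SAVE_VM);          /* pause the guest */
    ret = qemu_save_device_state(f);     /* device state only, as the Xen path does */
    qemu_fclose(f);
    if (restart_mode) {
        execvp(argv[0], argv);           /* exec the new binary with the same argv */
    }
]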
>>>> Regarding bit rot, I still need to add a cpr test to the test suite, when
>>>> the review is more complete and folks agree on the final form of the
>>>> functionality.
>>>>
>>>> I do like the idea of supporting update without exec, but as a future
>>>> project, and not at the expense of dropping update with exec.
>>>
>>> Alex: We're discussing how to live update QEMU while VFIO devices are
>>> running. This patch series introduces monitor commands that call
>>> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
>>> file descriptors. This way live update is transparent to VFIO but it
>>> won't work if a sandboxed QEMU process is forbidden to call execve(2).
>>> What are your thoughts on 1) the execve(2) approach and 2) extending
>>> VFIO to allow running devices to be attached to a different process so
>>> that execve(2) is not necessary?
>>
>> Tracking processes is largely to support page pinning; we need to support
>> asynchronous page pinning to handle requests from mdev drivers, and we need
>> to make sure pinned page accounting is tracked to the same process.  If
>> userspace can "pay" for locked pages from one process on mapping, then
>> "credit" them to another process on unmap, that seems fairly exploitable.
>> We'd need some way to transfer the locked memory accounting or handle it
>> outside of vfio.  Thanks,
> 
> Vhost's VHOST_SET_OWNER ioctl is somewhat similar. It's used to
> associate the in-kernel vhost device with a userspace process and its
> mm.
> 
> Would it be possible to add a VFIO_SET_OWNER ioctl that associates the
> current process with the vfio_device? Only one process would be the
> owner at any given time.

It is possible, but the implementation would need to invent new hooks in mm
to transfer the locked memory accounting.
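If it were added, the userspace side might look like the sketch below.  This
is entirely hypothetical: no such ioctl exists today, and the request number
is made up.

    #include <linux/vfio.h>
    #include <sys/ioctl.h>

    /* hypothetical: the new process claims ownership of the container, and
     * the kernel moves locked-memory accounting from the old owner's mm to
     * the caller's mm */
    #define VFIO_SET_OWNER  _IO(VFIO_TYPE, VFIO_BASE + 21)   /* made-up number */

    if (ioctl(container_fd, VFIO_SET_OWNER) < 0)
        perror("VFIO_SET_OWNER");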

> I'm not sure how existing DMA mappings would behave, but this patch
> series seems to rely on DMA continuing to work even though there is a
> window of time when the execve(2) process doesn't have guest RAM
> mmapped. So I guess existing DMA mappings continue to work because the
> pages were previously pinned?

Correct.  And changes to mappings are blocked between the pre-exec process
issuing VFIO_DMA_UNMAP_FLAG_VADDR and the post-exec process issuing
VFIO_DMA_MAP_FLAG_VADDR.  The sequence is sketched below.
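Using the flags from the kernel series (linux/vfio.h), the two steps look
roughly like this.  Fragments, not a complete program: container_fd, iova,
size, and new_vaddr are placeholders, and error handling is omitted.

    #include <linux/vfio.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* pre-exec: invalidate the vaddr; iova, pinning, and accounting are kept */
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
        .iova  = iova,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);

    /* ... exec the new qemu and remap guest memory here ... */

    /* post-exec: supply the new vaddr for the same iova range */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_VADDR,
        .vaddr = (uintptr_t)new_vaddr,
        .iova  = iova,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);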

- Steve
