Re: [Qemu-devel] Async savevm using userfaultfd(2)
On Thu, Oct 13, 2016 at 04:27:19PM +0200, Andrea Arcangeli wrote:
> I would suggest not to implement mprotect+sigsegv because maintaining
> both APIs would be messy but mostly because mprotect cannot really
> work for all cases and it would risk to fail at any time with
> -ENOMEM. postcopy live migration had similar issues and this is why it
> wasn't possible to achieve it reliably without userfaultfd.

Yes, thanks for explaining the issues. I agree that the mprotect approach isn't worthwhile. We need to use userfaultfd.

Stefan
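One way the -ENOMEM failure quoted above can arise, as Andrea's message in this thread explains, is running out of vmas: an mprotect() call that changes only part of a mapping splits it. The following is a toy model in Python, not kernel code. It counts one vma per run of adjacent pages with the same protection and ignores every other attribute the kernel considers when merging. The function names are made up for illustration, and 65530 is the usual default of the vm.max_map_count sysctl.

```python
# Toy model: one "vma" per maximal run of adjacent pages sharing a protection.
# Real kernels compare more attributes than this when merging vmas.
DEFAULT_MAX_MAP_COUNT = 65530  # usual default of the vm.max_map_count sysctl


def count_vmas(page_is_writable):
    """Count runs of equal protection in a list of per-page booleans."""
    runs = 0
    previous = None
    for writable in page_is_writable:
        if writable != previous:
            runs += 1
            previous = writable
    return runs


def worst_case_vmas(guest_bytes, page_size=4096):
    """Alternating protected/unprotected pages: one vma per page."""
    return guest_bytes // page_size


# A fully write-protected region is a single vma ...
pages = [False] * 8
assert count_vmas(pages) == 1
# ... and unprotecting one page in the middle splits it into three.
pages[3] = True
assert count_vmas(pages) == 3

# In the worst case even a 1 GiB guest exceeds the default limit.
print(worst_case_vmas(1 << 30), ">", DEFAULT_MAX_MAP_COUNT)
```

With 4 KiB pages, a 1 GiB guest has 262144 pages, so a scattered write pattern can ask for several times more vmas than the default limit allows.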
Re: [Qemu-devel] Async savevm using userfaultfd(2)
Hello,

On Thu, Oct 13, 2016 at 09:30:49AM +0100, Dr. David Alan Gilbert wrote:
> I think it should, or at least I think all other kernel things end up being
> caught by userfaultfd during postcopy.

Yes indeed, it will work. vhost blocks in its own task context inside the kernel, and the vmsave/postcopy live snapshotting thread will be woken up, will copy the page off to a private snapshot buffer, and will then mark the memory writable again and wake up the vhost thread at the same time.

The other showstopper limitation of mprotect is that you'd run out of vmas, so mprotect will eventually fail on large virtual machines. That problem doesn't exist with userfaultfd WP.

Unfortunately it seems there's some problem with userfaultfd WP and KVM, but it is reported to work for regular userland memory. I haven't got around to solving that yet, but it's work in progress...

I thought of finishing making userfaultfd WP fully accurate, with special user bits in the pagetables, so there are no false positives. The problem is that when we swap out a WP page, we would be forced to mark it readonly during swapin if we don't store the WP information in the swap entry, so I'm now saving the WP information in the swap entry. This will also prevent false positive WP userfaults after fork() runs (fork would mark the pagetable readonly, so without a user bit marking which pagetables are write protected we wouldn't know whether we have to fault or not).

The other advantage is that you can snapshot at 4k granularity by selectively splitting THPs. The granularity of the snapshot process is decided by userland. You decide whether to copy 4k or 2m, and depending on that you will unwrprotect 4k or 2m; the kernel will split the THP if it's a THP but you unprotect only 4k of it. With userfaultfd it's always userland (not the kernel) deciding the granularity of the fault. Live snapshotting, async vmsave and redis snapshotting all need the same thing, and they're all doing the same thing with uffd WP.
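To put the granularity choice in numbers, here is a small Python sketch. It assumes that the first write to a block copies and unprotects the whole block, which is a simplification of the real fault handling. The workload addresses and the function name are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def bytes_copied(write_addresses, granule):
    """Bytes copied to the snapshot buffer if the first write to each
    granule-sized block copies (and then unprotects) that whole block."""
    touched_blocks = {addr // granule for addr in write_addresses}
    return len(touched_blocks) * granule


# Hypothetical workload: three scattered writes, each in a different 2 MiB region.
writes = [0x1000, 5 * MIB + 0x2000, 9 * MIB]
print(bytes_copied(writes, 4 * KIB))   # 12288 bytes at 4 KiB granularity
print(bytes_copied(writes, 2 * MIB))   # 6291456 bytes at 2 MiB granularity
```

For scattered writes like these the 2 MiB choice copies 512 times more data, which is the ratio between the two page sizes.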
And you most certainly want to do faults at 4k granularity and let khugepaged rebuild the split THP later on. Otherwise you'd run into the same corner case redis ran into with THP, because THP COWs with 2M granularity by default. It's faster and takes less memory to copy and unwrprotect only 4k.

Another positive aspect of uffd WP is that you decide the max amount of "buffer" memory you are ok to use. If you set that "buffer" to the size of the guest it'll behave like fork(), so you risk 100% higher memory utilization in the worst case. With fork() you are forced to have 100% of the VM size free for the snapshotting/vmsaving to succeed. With userfaultfd you can decide and configure it. Once the limit is hit, vmsave will simply behave synchronously: you will have to wait for the write() to disk to complete and free up one buffer page before you can copy off new data from guest to buffer and then wake up the tasks stuck in the kernel page fault waiting for a wakeup from uffd.

I would suggest not to implement mprotect+sigsegv because maintaining both APIs would be messy but mostly because mprotect cannot really work for all cases and it would risk to fail at any time with -ENOMEM. postcopy live migration had similar issues and this is why it wasn't possible to achieve it reliably without userfaultfd.

In addition to this, userfaultfd is much faster too: no signals, no userland interprocess communication through pipe/unixsocket, no return to userland for the task that hits the fault, schedule-in-kernel to block (which is cheaper and won't force mm tlbflushes), direct in-kernel communication between the task that hits the fault and the vmsave async thread (which can wait on epoll or anything), etc...

I'll look into fixing userfaultfd for the already implemented postcopy live snapshotting ASAP. I've got a bug report pending, but until the WP full accuracy is completed (to obsolete soft-dirty for good) the current status of userfaultfd WP is not complete anyway.
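The bounded "buffer" behaviour described above can be sketched as a toy model in Python. This is not QEMU code: the class and method names are hypothetical, and flush_one() merely stands in for a completed write() to the snapshot file.

```python
from collections import deque


class BoundedSnapshotBuffer:
    """Toy model of the copy-off buffer with a configurable page limit."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self.pending = deque()   # copied pages not yet written to disk
        self.on_disk = []        # pages whose write() has completed
        self.stalls = 0          # faults that had to wait for a write()

    def flush_one(self):
        """Stand-in for one completed write() to the snapshot file."""
        self.on_disk.append(self.pending.popleft())

    def handle_write_fault(self, page_number, original_contents):
        if len(self.pending) >= self.max_pages:
            # Buffer full: the faulting task waits until one page is on disk.
            self.stalls += 1
            self.flush_one()
        self.pending.append((page_number, original_contents))


buf = BoundedSnapshotBuffer(max_pages=2)
for page in range(5):
    buf.handle_write_fault(page, b"old-%d" % page)
print(buf.stalls, len(buf.pending), len(buf.on_disk))   # 3 2 3
```

With a limit of two pages, the third and later faults each wait for one write, so the buffer never grows past the configured size. Setting max_pages to the number of guest pages would remove the stalls at the cost of fork()-like memory use.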
And about future uffd features for other usages: once it works reliably and with full accuracy, one more possible feature would be to add a userfaultfd async queue model too, where the task won't need to block and uffd msgs will be allocated and delivered to userland asynchronously. That would obsolete soft dirty and reduce the computational complexity as well, because you wouldn't need to scan a worst case of 4TiB/4KiB pagetable entries to know which pages have been modified during checkpoint/restore. Instead it'd provide the info with the same efficiency that PML does in HW on Intel for guests (uffd async WP would work for the host and without HW support).

Of course at the next pass you'd then also need to wrprotect only those regions that have been modified, not the whole range, and only userland will know which ranges need to be wrprotected again. A vectored API would be quite nice for such selective wrprotection, to reduce the number of uffd ioctls to issue too; at the moment it's not vectored but adding a vectored API will be
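For a sense of scale, the scan cost mentioned above and the idea of a queue of write events can be sketched in Python. This is only an illustration: AsyncWriteLog is a hypothetical stand-in, not an existing userfaultfd interface, and it assumes that only the first write to a protected page produces an event.

```python
TIB = 1 << 40
KIB = 1 << 10


def entries_to_scan(memory_bytes, page_size=4 * KIB):
    """Page-table entries a full scan of the range has to visit."""
    return memory_bytes // page_size


class AsyncWriteLog:
    """Toy stand-in for an asynchronous write-fault queue (hypothetical)."""

    def __init__(self):
        self.dirty_pages = []
        self._seen = set()

    def record_write(self, page_number):
        # Only the first write to a protected page would raise an event.
        if page_number not in self._seen:
            self._seen.add(page_number)
            self.dirty_pages.append(page_number)

    def drain(self):
        """Return the pages to save and write-protect again for the next pass."""
        drained, self.dirty_pages = self.dirty_pages, []
        self._seen.clear()
        return drained


print(entries_to_scan(4 * TIB))   # 1073741824 entries in the worst case
log = AsyncWriteLog()
for page in (7, 7, 42, 7, 1000):
    log.record_write(page)
print(log.drain())                # [7, 42, 1000]
```

The scan visits 2^30 entries regardless of how little was written, while the queue hands back only the three pages that actually changed, which are also the only ones that need to be write-protected again.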
Re: [Qemu-devel] Async savevm using userfaultfd(2)
* Stefan Hajnoczi (stefa...@gmail.com) wrote:
> On Wed, Oct 12, 2016 at 4:04 PM, Stefan Hajnoczi wrote:
> > Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> > handler if anyone wants to get async savevm going. I don't know if
> > there are any disadvantages to mprotecting guest RAM that the kvm kernel
> > module is using. Hopefully in-kernel devices and vhost will continue to
> > work.
>
> I woke up this morning with a strong feeling that a SIGSEGV handler
> won't work with vhost.

YKYBHTLW you wake up with strong feelings about SIGSEGV handlers.

> The problem is that the QEMU process' SIGSEGV
> handler won't be called when the vhost kernel thread faults. Now I'm
> wondering whether userfaultfd will work together with vhost.

I think it should, or at least I think all other kernel things end up being caught by userfaultfd during postcopy.

Dave

> Stefan

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] Async savevm using userfaultfd(2)
On Wed, Oct 12, 2016 at 4:04 PM, Stefan Hajnoczi wrote:
> Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> handler if anyone wants to get async savevm going. I don't know if
> there are any disadvantages to mprotecting guest RAM that the kvm kernel
> module is using. Hopefully in-kernel devices and vhost will continue to
> work.

I woke up this morning with a strong feeling that a SIGSEGV handler won't work with vhost. The problem is that the QEMU process' SIGSEGV handler won't be called when the vhost kernel thread faults. Now I'm wondering whether userfaultfd will work together with vhost.

Stefan
Re: [Qemu-devel] Async savevm using userfaultfd(2)
On 2016/10/12 22:21, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefa...@gmail.com) wrote:
> > John and I recently discussed asynchronous savevm and I wanted to post
> > the ideas so they aren't forgotten. (We're not actively working on this
> > feature.)
> >
> > Asynchronous savevm has the same effect as the 'savevm' monitor command:
> > it saves RAM, device state, and a snapshot of all disks at the point in
> > time the command was issued.
> >
> > The current 'savevm' monitor command is synchronous so the guest and
> > QEMU monitor are blocked while the operation runs (it can take a
> > while!). Asynchronous savevm has the advantage of allowing the guest
> > and QEMU monitor to continue while the operation is running.
> >
> > This sounds similar to live migration to file but remember that live
> > migration's consistency point is when the guest is paused at the end of
> > the iteration phase. The user has no control over *when* live migration
> > captures the guest state. Therefore it's not a useful command for
> > taking snapshots of guest state at a specific point in time - we need
> > asynchronous savevm for that.
> >
> > Async savevm must copy-on-write guest RAM so the guest can continue
> > writing to memory while the snapshot is being saved. Rik van Riel
> > suggested using userfaultfd(2) to do this on Linux.
> >
> > Unlike post-copy live migration, we want to track memory writes (instead
> > of missing page faults). The userfaultfd(2) flag
> > UFFDIO_REGISTER_MODE_WP provides these semantics. Unfortunately I think
> > UFFDIO_REGISTER_MODE_WP is not yet implemented?
>
> A prototype of this has already been written by Hailiang Zhang; see
> https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03441.html

Yes, I have updated it to a second version privately, but unfortunately there are still some problems with the UFFDIO_REGISTER_MODE_WP API in the kernel. It still can't support KVM (it only supports TCG mode). I have given feedback to Andrea, but got no response... :(

http://www.mail-archive.com/qemu-devel@nongnu.org/msg394897.html

> > Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest
> > RAM and copy the original pages to a buffer. If memory is dirtied too
> > quickly then it's necessary to throttle the guest or fail the savevm
> > operation.
>
> The only limit there is the size of the buffer, waiting for space will do
> the throttling.

Yes, we can optimize it by extending the size of the buffer and using multiple fds to handle the user faults.

Hailiang

> Dave
>
> > Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> > handler if anyone wants to get async savevm going. I don't know if
> > there are any disadvantages to mprotecting guest RAM that the kvm kernel
> > module is using. Hopefully in-kernel devices and vhost will continue to
> > work.
> >
> > Stefan
>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] Async savevm using userfaultfd(2)
On 10/12/2016 05:04 PM, Stefan Hajnoczi wrote:
> John and I recently discussed asynchronous savevm and I wanted to post
> the ideas so they aren't forgotten. (We're not actively working on this
> feature.)
>
> Asynchronous savevm has the same effect as the 'savevm' monitor command:
> it saves RAM, device state, and a snapshot of all disks at the point in
> time the command was issued.
>
> The current 'savevm' monitor command is synchronous so the guest and
> QEMU monitor are blocked while the operation runs (it can take a
> while!). Asynchronous savevm has the advantage of allowing the guest
> and QEMU monitor to continue while the operation is running.
>
> This sounds similar to live migration to file but remember that live
> migration's consistency point is when the guest is paused at the end of
> the iteration phase. The user has no control over *when* live migration
> captures the guest state. Therefore it's not a useful command for
> taking snapshots of guest state at a specific point in time - we need
> asynchronous savevm for that.
>
> Async savevm must copy-on-write guest RAM so the guest can continue
> writing to memory while the snapshot is being saved. Rik van Riel
> suggested using userfaultfd(2) to do this on Linux.
>
> Unlike post-copy live migration, we want to track memory writes (instead
> of missing page faults). The userfaultfd(2) flag
> UFFDIO_REGISTER_MODE_WP provides these semantics. Unfortunately I think
> UFFDIO_REGISTER_MODE_WP is not yet implemented?
>
> Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest
> RAM and copy the original pages to a buffer. If memory is dirtied too
> quickly then it's necessary to throttle the guest or fail the savevm
> operation.
>
> Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> handler if anyone wants to get async savevm going. I don't know if
> there are any disadvantages to mprotecting guest RAM that the kvm kernel
> module is using. Hopefully in-kernel devices and vhost will continue to
> work.
>
> Stefan

good idea!
Re: [Qemu-devel] Async savevm using userfaultfd(2)
* Stefan Hajnoczi (stefa...@gmail.com) wrote:
> John and I recently discussed asynchronous savevm and I wanted to post
> the ideas so they aren't forgotten. (We're not actively working on this
> feature.)
>
> Asynchronous savevm has the same effect as the 'savevm' monitor command:
> it saves RAM, device state, and a snapshot of all disks at the point in
> time the command was issued.
>
> The current 'savevm' monitor command is synchronous so the guest and
> QEMU monitor are blocked while the operation runs (it can take a
> while!). Asynchronous savevm has the advantage of allowing the guest
> and QEMU monitor to continue while the operation is running.
>
> This sounds similar to live migration to file but remember that live
> migration's consistency point is when the guest is paused at the end of
> the iteration phase. The user has no control over *when* live migration
> captures the guest state. Therefore it's not a useful command for
> taking snapshots of guest state at a specific point in time - we need
> asynchronous savevm for that.
>
> Async savevm must copy-on-write guest RAM so the guest can continue
> writing to memory while the snapshot is being saved. Rik van Riel
> suggested using userfaultfd(2) to do this on Linux.
>
> Unlike post-copy live migration, we want to track memory writes (instead
> of missing page faults). The userfaultfd(2) flag
> UFFDIO_REGISTER_MODE_WP provides these semantics. Unfortunately I think
> UFFDIO_REGISTER_MODE_WP is not yet implemented?

A prototype of this has already been written by Hailiang Zhang; see
https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03441.html

> Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest
> RAM and copy the original pages to a buffer. If memory is dirtied too
> quickly then it's necessary to throttle the guest or fail the savevm
> operation.

The only limit there is the size of the buffer, waiting for space will do the throttling.

Dave

>
> Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> handler if anyone wants to get async savevm going. I don't know if
> there are any disadvantages to mprotecting guest RAM that the kvm kernel
> module is using. Hopefully in-kernel devices and vhost will continue to
> work.
>
> Stefan

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] Async savevm using userfaultfd(2)
On 10/12/2016 09:04 AM, Stefan Hajnoczi wrote:
> John and I recently discussed asynchronous savevm and I wanted to post
> the ideas so they aren't forgotten. (We're not actively working on this
> feature.)
>
> Asynchronous savevm has the same effect as the 'savevm' monitor command:
> it saves RAM, device state, and a snapshot of all disks at the point in
> time the command was issued.
>

Interesting idea. I suspect this would have benefits over using fork()'s copy-on-write semantics, even if we could come up with a way to safely fork where the child permits no state modification, but merely starts scraping off the memory state of the guest at the time of the fork.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
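For comparison, the fork() copy-on-write behaviour mentioned above can be observed with a short POSIX-only Python sketch. It shows only that the child keeps the memory contents from the time of the fork while the parent continues to write. It does not address the harder problem of forking a running QEMU safely, and the function name and buffer contents are made up for illustration.

```python
import os


def snapshot_via_fork(memory):
    """Fork, mutate `memory` in the parent, and return what the child saw."""
    go_read, go_write = os.pipe()
    result_read, result_write = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: wait until the parent has finished writing, then report
        # its own (copy-on-write) view of the memory and exit immediately.
        try:
            os.close(go_write)
            os.close(result_read)
            os.read(go_read, 1)
            os.write(result_write, bytes(memory))
        finally:
            os._exit(0)
    # Parent: keep running and dirty the memory after the fork.
    os.close(go_read)
    os.close(result_write)
    memory[0:5] = b"AFTER"
    os.write(go_write, b"x")
    child_view = os.read(result_read, len(memory))
    os.close(go_write)
    os.close(result_read)
    os.waitpid(pid, 0)
    return child_view


guest_ram = bytearray(b"START of guest RAM")
saved = snapshot_via_fork(guest_ram)
print(saved)              # the child's view, taken at fork time
print(bytes(guest_ram))   # the parent's view, already modified
```

The first pipe makes the ordering deterministic: the child reads its copy only after the parent has modified the buffer, so any difference between the two views comes from copy-on-write and not from timing.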
[Qemu-devel] Async savevm using userfaultfd(2)
John and I recently discussed asynchronous savevm and I wanted to post the ideas so they aren't forgotten. (We're not actively working on this feature.)

Asynchronous savevm has the same effect as the 'savevm' monitor command: it saves RAM, device state, and a snapshot of all disks at the point in time the command was issued.

The current 'savevm' monitor command is synchronous so the guest and QEMU monitor are blocked while the operation runs (it can take a while!). Asynchronous savevm has the advantage of allowing the guest and QEMU monitor to continue while the operation is running.

This sounds similar to live migration to file but remember that live migration's consistency point is when the guest is paused at the end of the iteration phase. The user has no control over *when* live migration captures the guest state. Therefore it's not a useful command for taking snapshots of guest state at a specific point in time - we need asynchronous savevm for that.

Async savevm must copy-on-write guest RAM so the guest can continue writing to memory while the snapshot is being saved. Rik van Riel suggested using userfaultfd(2) to do this on Linux.

Unlike post-copy live migration, we want to track memory writes (instead of missing page faults). The userfaultfd(2) flag UFFDIO_REGISTER_MODE_WP provides these semantics. Unfortunately I think UFFDIO_REGISTER_MODE_WP is not yet implemented?

Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest RAM and copy the original pages to a buffer. If memory is dirtied too quickly then it's necessary to throttle the guest or fail the savevm operation.

Perhaps this approach can be prototyped with mprotect and a SIGSEGV handler if anyone wants to get async savevm going. I don't know if there are any disadvantages to mprotecting guest RAM that the kvm kernel module is using. Hopefully in-kernel devices and vhost will continue to work.

Stefan
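The copy-on-write scheme proposed above can be modelled in a few lines. This is a toy sketch in Python, not QEMU code and not the userfaultfd API: GuestRam and its methods are hypothetical names, and a Python set stands in for the kernel's write protection.

```python
class GuestRam:
    """Toy model of write-protect based snapshotting; names are illustrative."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.write_protected = set()
        self.copied = {}   # page number -> contents at snapshot time

    def begin_snapshot(self):
        # Stand-in for write-protecting all guest RAM at the snapshot point.
        self.write_protected = set(range(len(self.pages)))
        self.copied = {}

    def guest_write(self, page_number, data):
        if page_number in self.write_protected:
            # "Fault": preserve the original page, then allow the write.
            self.copied[page_number] = self.pages[page_number]
            self.write_protected.discard(page_number)
        self.pages[page_number] = data

    def save_page(self, page_number):
        """Background saver: emit the page as it was at begin_snapshot()."""
        if page_number in self.copied:
            return self.copied.pop(page_number)
        # Untouched so far: save it directly and stop tracking it.
        self.write_protected.discard(page_number)
        return self.pages[page_number]


ram = GuestRam([b"a0", b"b0", b"c0"])
ram.begin_snapshot()
ram.guest_write(1, b"b1")   # the guest keeps running during the save
snapshot = [ram.save_page(n) for n in range(3)]
ram.guest_write(0, b"a1")   # page 0 is already saved: no copy needed
print(snapshot)    # [b'a0', b'b0', b'c0']
print(ram.pages)   # [b'a1', b'b1', b'c0']
```

The saved image reflects guest RAM at the moment begin_snapshot() was called, even though the guest wrote to page 1 before the saver reached it. That is the point-in-time property that live migration to file does not provide.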