Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-14 Thread Stefan Hajnoczi
On Thu, Oct 13, 2016 at 04:27:19PM +0200, Andrea Arcangeli wrote:
> I would suggest not implementing mprotect+sigsegv, because maintaining
> both APIs would be messy but mostly because mprotect cannot really
> work for all cases and would risk failing at any time with
> -ENOMEM. Postcopy live migration had similar issues, and this is why it
> wasn't possible to achieve it reliably without userfaultfd.

Yes, thanks for explaining the issues.  I agree that the mprotect
approach isn't worthwhile.  We need to use userfaultfd.

Stefan




Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-13 Thread Andrea Arcangeli
Hello,

On Thu, Oct 13, 2016 at 09:30:49AM +0100, Dr. David Alan Gilbert wrote:
> I think it should, or at least I think all other kernel things end up being
> caught by userfaultfd during postcopy.

Yes indeed, it will work. vhost blocks in its own task context inside
the kernel, the vmsave/postcopy live snapshotting thread gets woken
up, copies the page off to a private snapshot buffer, then marks the
memory writable again and wakes up the vhost thread at the same
time.
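
To make it concrete, here is a rough sketch of what that snapshot-thread
side could look like in C. It assumes the UFFDIO_WRITEPROTECT ioctl from
the userfaultfd WP patches (not merged at the time of writing), and
snapshot_save_page() is a hypothetical placeholder for "copy into the
private buffer":

/* Sketch: wait for a write-protect fault, copy the still-unmodified page
 * into the snapshot buffer, then clear WP and wake the faulting context
 * (vCPU thread or vhost kernel task). */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <poll.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/* Hypothetical helper: store the page in the snapshot buffer. */
extern void snapshot_save_page(unsigned long addr, const void *data, size_t len);

static void handle_one_wp_fault(int uffd)
{
    struct uffd_msg msg;
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };

    poll(&pfd, 1, -1);                       /* wait for a fault notification */
    if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
        return;
    if (msg.event != UFFD_EVENT_PAGEFAULT ||
        !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
        return;

    unsigned long addr = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);

    /* Copy the not-yet-modified page off to the snapshot buffer. */
    snapshot_save_page(addr, (void *)addr, PAGE_SIZE);

    /* Remove write protection for this page and wake the faulting task. */
    struct uffdio_writeprotect wp = {
        .range = { .start = addr, .len = PAGE_SIZE },
        .mode  = 0,                          /* clear WP, implicit wakeup */
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}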

The other showstopper limitation of mprotect is that you'd run out of
VMAs, so mprotect would eventually fail on large virtual machines. That
problem doesn't exist with userfaultfd WP.

Unfortunately it seems there's some problem with userfaultfd WP and
KVM, but it is reported to work for regular userland memory. I didn't
get around to solving that yet but it's work in progress... I thought of
finishing making userfaultfd WP fully accurate with special user bits
in the pagetables so there are no false positives. The problem is that
when we swap out a WP page, we would be forced to mark it readonly
during swapin if we didn't store the WP information in the swap entry,
so I'm now saving the WP information in the swap entry. This will also
prevent false positive WP userfaults after fork() runs (fork would
mark the pagetable readonly, so without a user bit marking which
pagetable entries are write protected we wouldn't know whether we have
to fault or not).

The other advantage is that you can snapshot at 4k granularity by
selectively splitting THPs. The granularity of the snapshot process is
decided by userland: you decide whether to copy 4k or 2M, and depending
on that you un-write-protect 4k or 2M; the kernel will split the THP
if the page is a THP but you unprotect only 4k of it. With userfaultfd
it's always userland (not the kernel) deciding the granularity of the
fault.
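
As a sketch (helper name is mine, and again this assumes the
UFFDIO_WRITEPROTECT ioctl from the WP patches), the userland side of that
decision is just the length passed to the write-protect ioctl:

/* Sketch: userland picks the granularity when clearing write protection.
 * Clearing only 4KiB of a THP-backed range makes the kernel split the THP;
 * clearing a full 2MiB keeps it whole. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

static int uffd_unprotect(int uffd, unsigned long addr, unsigned long len)
{
    struct uffdio_writeprotect wp = {
        .range = { .start = addr, .len = len },  /* 4096 or 2UL << 20 */
        .mode  = 0,                              /* clear WP and wake */
    };
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}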

Live snapshotting/async vmsave/redis snapshotting all need the same
thing and they're doing the same thing with uffd WP. And you most
certainly want to do faults at 4k granularity and let khugepaged
rebuild the split THP later on, or you'd run into the same corner
case redis ran into with THP, because THP COWs at 2M granularity by
default. It's faster and takes less memory to copy and un-write-protect
only 4k.

Another positive aspect of uffd WP is that you decide the maximum
amount of "buffer" memory you are willing to use. If you set that
"buffer" to the size of the guest it'll behave like fork(), so you risk
100% higher memory utilization in the worst case. With fork() you are
forced to have 100% of the VM size free for the snapshotting/vmsaving
to succeed. With userfaultfd you can decide and configure it. Once the
limit is hit, vmsave simply behaves synchronously: you have to wait for
the write() to disk to complete to free up one buffer page, before you
can copy off new data from the guest to the buffer and then wake up the
tasks stuck in the kernel page fault waiting for a wakeup from uffd.
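
A minimal sketch of that throttling policy (all names and sizes are
hypothetical, error handling omitted):

/* Sketch: bounded snapshot buffer. When it is full the snapshot thread
 * degrades to synchronous behaviour: it must write() one buffered page
 * to the save file before it can service the next write-protect fault. */
#include <unistd.h>
#include <string.h>

#define PAGE_SIZE          4096
#define MAX_BUFFERED_PAGES 8192          /* ~32MiB cap, configurable */

struct snap_buffer {
    char pages[MAX_BUFFERED_PAGES][PAGE_SIZE];
    unsigned long addrs[MAX_BUFFERED_PAGES];
    int head, tail, count;
    int save_fd;                          /* destination of the vmsave */
};

static void snap_buffer_add(struct snap_buffer *b, unsigned long addr)
{
    if (b->count == MAX_BUFFERED_PAGES) {
        /* Buffer full: behave synchronously, flush the oldest page first. */
        write(b->save_fd, b->pages[b->tail], PAGE_SIZE);
        b->tail = (b->tail + 1) % MAX_BUFFERED_PAGES;
        b->count--;
    }
    memcpy(b->pages[b->head], (void *)addr, PAGE_SIZE);
    b->addrs[b->head] = addr;
    b->head = (b->head + 1) % MAX_BUFFERED_PAGES;
    b->count++;
    /* Caller then clears WP on 'addr' and wakes the faulting task. */
}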

I would suggest not implementing mprotect+sigsegv, because maintaining
both APIs would be messy but mostly because mprotect cannot really
work for all cases and would risk failing at any time with
-ENOMEM. Postcopy live migration had similar issues, and this is why it
wasn't possible to achieve it reliably without userfaultfd.

In addition to this, userfaultfd is much faster too: no signals, no
userland interprocess communication through a pipe/unix socket, no
return to userland for the task that hits the fault, an in-kernel
schedule to block (which is cheaper and won't force mm TLB flushes),
direct in-kernel communication between the task that hits the fault and
the vmsave-async thread (which can wait on epoll or anything), etc.
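
On the epoll side, the vmsave-async thread needs nothing special; a
sketch:

/* Sketch: the vmsave-async thread adds the uffd to its epoll set and
 * receives fault notifications like any other readable fd - no signal
 * handler, no pipe back to the faulting thread. */
#include <sys/epoll.h>

static int watch_uffd(int epfd, int uffd)
{
    struct epoll_event ev = {
        .events  = EPOLLIN,
        .data.fd = uffd,
    };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, uffd, &ev);
}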

I'll look into fixing userfaultfd for the already implemented postcopy
live snapshotting ASAP; I've got a bug report pending, but until full
WP accuracy is completed (to obsolete soft-dirty for good) the current
status of userfaultfd WP is not complete anyway.

And about future uffd features for other usages: once it works
reliably and with full accuracy, one more possible feature would be to
add a userfaultfd async queue model too, where the task won't need to
block and uffd msgs will be allocated and delivered to userland
asynchronously. That would obsolete soft-dirty and reduce the
computational complexity as well, because you wouldn't need to scan a
worst case of 4TiB/4KiB pagetable entries to know which pages have
been modified during checkpoint/restore. Instead it'd provide the info
with the same efficiency that PML does in HW on Intel for guests (uffd
async WP would work for the host and without HW support). Of course at
the next pass you'd then also need to wrprotect only those regions
that have been modified and not the whole range, and only userland
knows which ranges need to be wrprotected again. A vectored API would
be quite nice for such selective wrprotection, to reduce the number of
uffd ioctls to issue too; at the moment it's not vectored but adding a
vectored API will be 

Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-13 Thread Dr. David Alan Gilbert
* Stefan Hajnoczi (stefa...@gmail.com) wrote:
> On Wed, Oct 12, 2016 at 4:04 PM, Stefan Hajnoczi  wrote:
> > Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> > handler if anyone wants to get async savevm going.  I don't know if
> > there are any disadvantages to mprotecting guest RAM that the kvm kernel
> > module is using.  Hopefully in-kernel devices and vhost will continue to
> > work.
> 
> I woke up this morning with a strong feeling that a SIGSEGV handler
> won't work with vhost.

YKYBHTLW you wake up with strong feelings about SIGSEGV handlers.

>  The problem is that the QEMU process' SIGSEGV
> handler won't be called when the vhost kernel thread faults.  Now I'm
> wondering whether userfaultfd will work together with vhost.

I think it should, or at least I think all other kernel things end up being
caught by userfaultfd during postcopy.

Dave

> Stefan
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-12 Thread Stefan Hajnoczi
On Wed, Oct 12, 2016 at 4:04 PM, Stefan Hajnoczi  wrote:
> Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> handler if anyone wants to get async savevm going.  I don't know if
> there are any disadvantages to mprotecting guest RAM that the kvm kernel
> module is using.  Hopefully in-kernel devices and vhost will continue to
> work.

I woke up this morning with a strong feeling that a SIGSEGV handler
won't work with vhost.  The problem is that the QEMU process' SIGSEGV
handler won't be called when the vhost kernel thread faults.  Now I'm
wondering whether userfaultfd will work together with vhost.

Stefan



Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-12 Thread Hailiang Zhang

On 2016/10/12 22:21, Dr. David Alan Gilbert wrote:

> * Stefan Hajnoczi (stefa...@gmail.com) wrote:
> > John and I recently discussed asynchronous savevm and I wanted to post
> > the ideas so they aren't forgotten.  (We're not actively working on this
> > feature.)
> > 
> > Asynchronous savevm has the same effect as the 'savevm' monitor command:
> > it saves RAM, device state, and a snapshot of all disks at the point in
> > time the command was issued.
> > 
> > The current 'savevm' monitor command is synchronous so the guest and
> > QEMU monitor are blocked while the operation runs (it can take a
> > while!).  Asynchronous savevm has the advantage of allowing the guest
> > and QEMU monitor to continue while the operation is running.
> > 
> > This sounds similar to live migration to file but remember that live
> > migration's consistency point is when the guest is paused at the end of
> > the iteration phase.  The user has no control over *when* live migration
> > captures the guest state.  Therefore it's not a useful command for
> > taking snapshots of guest state at a specific point in time - we need
> > asynchronous savevm for that.
> > 
> > Async savevm must copy-on-write guest RAM so the guest can continue
> > writing to memory while the snapshot is being saved.  Rik van Riel
> > suggested using userfaultfd(2) to do this on Linux.
> > 
> > Unlike post-copy live migration, we want to track memory writes (instead
> > of missing page faults).  The userfaultfd(2) flag
> > UFFDIO_REGISTER_MODE_WP provides these semantics.  Unfortunately I think
> > UFFDIO_REGISTER_MODE_WP is not yet implemented?
> 
> A prototype of this has already been written by Hailiang Zhang;
> see https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03441.html



Yes, I have updated it to a second version privately, but unfortunately
there are still some problems with the UFFDIO_REGISTER_MODE_WP API in
the kernel: it still can't support KVM (it only supports TCG mode).
I have given feedback to Andrea, but got no response ... :(
http://www.mail-archive.com/qemu-devel@nongnu.org/msg394897.html


> > Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest
> > RAM and copy the original pages to a buffer.  If memory is dirtied too
> > quickly then it's necessary to throttle the guest or fail the savevm
> > operation.
> 
> The only limit there is the size of the buffer, waiting for space will
> do the throttling.



Yes, we can optimize it by extending the size of the buffer and using
multiple fds to handle the userfaults.

Hailiang


> Dave



> > Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> > handler if anyone wants to get async savevm going.  I don't know if
> > there are any disadvantages to mprotecting guest RAM that the kvm kernel
> > module is using.  Hopefully in-kernel devices and vhost will continue to
> > work.
> > 
> > Stefan
> 
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK








Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-12 Thread Denis V. Lunev
On 10/12/2016 05:04 PM, Stefan Hajnoczi wrote:
> John and I recently discussed asynchronous savevm and I wanted to post
> the ideas so they aren't forgotten.  (We're not actively working on this
> feature.)
>
> Asynchronous savevm has the same effect as the 'savevm' monitor command:
> it saves RAM, device state, and a snapshot of all disks at the point in
> time the command was issued.
>
> The current 'savevm' monitor command is synchronous so the guest and
> QEMU monitor are blocked while the operation runs (it can take a
> while!).  Asynchronous savevm has the advantage of allowing the guest
> and QEMU monitor to continue while the operation is running.
>
> This sounds similar to live migration to file but remember that live
> migration's consistency point is when the guest is paused at the end of
> the iteration phase.  The user has no control over *when* live migration
> captures the guest state.  Therefore it's not a useful command for
> taking snapshots of guest state at a specific point in time - we need
> asynchronous savevm for that.
>
> Async savevm must copy-on-write guest RAM so the guest can continue
> writing to memory while the snapshot is being saved.  Rik van Riel
> suggested using userfaultfd(2) to do this on Linux.
>
> Unlike post-copy live migration, we want to track memory writes (instead
> of missing page faults).  The userfaultfd(2) flag
> UFFDIO_REGISTER_MODE_WP provides these semantics.  Unfortunately I think
> UFFDIO_REGISTER_MODE_WP is not yet implemented?
>
> Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest
> RAM and copy the original pages to a buffer.  If memory is dirtied too
> quickly then it's necessary to throttle the guest or fail the savevm
> operation.
>
> Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> handler if anyone wants to get async savevm going.  I don't know if
> there are any disadvantages to mprotecting guest RAM that the kvm kernel
> module is using.  Hopefully in-kernel devices and vhost will continue to
> work.
>
> Stefan
good idea!




Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-12 Thread Dr. David Alan Gilbert
* Stefan Hajnoczi (stefa...@gmail.com) wrote:
> John and I recently discussed asynchronous savevm and I wanted to post
> the ideas so they aren't forgotten.  (We're not actively working on this
> feature.)
> 
> Asynchronous savevm has the same effect as the 'savevm' monitor command:
> it saves RAM, device state, and a snapshot of all disks at the point in
> time the command was issued.
> 
> The current 'savevm' monitor command is synchronous so the guest and
> QEMU monitor are blocked while the operation runs (it can take a
> while!).  Asynchronous savevm has the advantage of allowing the guest
> and QEMU monitor to continue while the operation is running.
> 
> This sounds similar to live migration to file but remember that live
> migration's consistency point is when the guest is paused at the end of
> the iteration phase.  The user has no control over *when* live migration
> captures the guest state.  Therefore it's not a useful command for
> taking snapshots of guest state at a specific point in time - we need
> asynchronous savevm for that.
> 
> Async savevm must copy-on-write guest RAM so the guest can continue
> writing to memory while the snapshot is being saved.  Rik van Riel
> suggested using userfaultfd(2) to do this on Linux.
> 
> Unlike post-copy live migration, we want to track memory writes (instead
> of missing page faults).  The userfaultfd(2) flag
> UFFDIO_REGISTER_MODE_WP provides these semantics.  Unfortunately I think
> UFFDIO_REGISTER_MODE_WP is not yet implemented?

A prototype of this has already been written by Hailiang Zhang;
see https://lists.gnu.org/archive/html/qemu-devel/2016-08/msg03441.html

> Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest
> RAM and copy the original pages to a buffer.  If memory is dirtied too
> quickly then it's necessary to throttle the guest or fail the savevm
> operation.

The only limit there is the size of the buffer, waiting for space will
do the throttling.

Dave

> 
> Perhaps this approach can be prototyped with mprotect and a SIGSEGV
> handler if anyone wants to get async savevm going.  I don't know if
> there are any disadvantages to mprotecting guest RAM that the kvm kernel
> module is using.  Hopefully in-kernel devices and vhost will continue to
> work.
> 
> Stefan


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] Async savevm using userfaultfd(2)

2016-10-12 Thread Eric Blake
On 10/12/2016 09:04 AM, Stefan Hajnoczi wrote:
> John and I recently discussed asynchronous savevm and I wanted to post
> the ideas so they aren't forgotten.  (We're not actively working on this
> feature.)
> 
> Asynchronous savevm has the same effect as the 'savevm' monitor command:
> it saves RAM, device state, and a snapshot of all disks at the point in
> time the command was issued.
> 

Interesting idea.

I suspect this would have benefits over using fork()'s copy-on-write
semantics, even if we could come up with a way to safely fork where the
child permits no state modification, but merely starts scraping off the
memory state of the guest at the time of the fork.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





[Qemu-devel] Async savevm using userfaultfd(2)

2016-10-12 Thread Stefan Hajnoczi
John and I recently discussed asynchronous savevm and I wanted to post
the ideas so they aren't forgotten.  (We're not actively working on this
feature.)

Asynchronous savevm has the same effect as the 'savevm' monitor command:
it saves RAM, device state, and a snapshot of all disks at the point in
time the command was issued.

The current 'savevm' monitor command is synchronous so the guest and
QEMU monitor are blocked while the operation runs (it can take a
while!).  Asynchronous savevm has the advantage of allowing the guest
and QEMU monitor to continue while the operation is running.

This sounds similar to live migration to file but remember that live
migration's consistency point is when the guest is paused at the end of
the iteration phase.  The user has no control over *when* live migration
captures the guest state.  Therefore it's not a useful command for
taking snapshots of guest state at a specific point in time - we need
asynchronous savevm for that.

Async savevm must copy-on-write guest RAM so the guest can continue
writing to memory while the snapshot is being saved.  Rik van Riel
suggested using userfaultfd(2) to do this on Linux.

Unlike post-copy live migration, we want to track memory writes (instead
of missing page faults).  The userfaultfd(2) flag
UFFDIO_REGISTER_MODE_WP provides these semantics.  Unfortunately I think
UFFDIO_REGISTER_MODE_WP is not yet implemented?

Once UFFDIO_REGISTER_MODE_WP is available QEMU can catch writes to guest
RAM and copy the original pages to a buffer.  If memory is dirtied too
quickly then it's necessary to throttle the guest or fail the savevm
operation.
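
A rough sketch of how the tracking could be armed, assuming
UFFDIO_REGISTER_MODE_WP and a write-protect ioctl along the lines of the
userfaultfd WP patches (the guest_ram parameters are placeholders):

/* Sketch: open a userfaultfd, register guest RAM in write-protect mode,
 * then write-protect the whole range so the first write to each page
 * faults into userland. */
#include <linux/userfaultfd.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>

static int start_wp_tracking(void *guest_ram, unsigned long guest_ram_size)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -1;

    struct uffdio_api api = { .api = UFFD_API };
    if (ioctl(uffd, UFFDIO_API, &api) < 0)
        return -1;

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)guest_ram, .len = guest_ram_size },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        return -1;

    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)guest_ram, .len = guest_ram_size },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,   /* arm write protection */
    };
    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
        return -1;

    return uffd;   /* hand to the thread that reads fault events */
}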

Perhaps this approach can be prototyped with mprotect and a SIGSEGV
handler if anyone wants to get async savevm going.  I don't know if
there are any disadvantages to mprotecting guest RAM that the kvm kernel
module is using.  Hopefully in-kernel devices and vhost will continue to
work.
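
For reference, a minimal sketch of that mprotect + SIGSEGV prototype idea
(snapshot_save_page() is a hypothetical helper; signal-handler safety and
error handling glossed over):

/* Sketch of the mprotect + SIGSEGV copy-on-write idea: guest RAM is made
 * read-only, the handler copies the faulting page and restores write
 * access to it. */
#include <signal.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL

extern void snapshot_save_page(unsigned long addr, const void *data, size_t len);

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    unsigned long addr = (unsigned long)si->si_addr & ~(PAGE_SIZE - 1);
    snapshot_save_page(addr, (void *)addr, PAGE_SIZE);
    mprotect((void *)addr, PAGE_SIZE, PROT_READ | PROT_WRITE);
}

static void arm_mprotect_prototype(void *guest_ram, size_t guest_ram_size)
{
    struct sigaction sa = { .sa_sigaction = segv_handler,
                            .sa_flags = SA_SIGINFO };
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    mprotect(guest_ram, guest_ram_size, PROT_READ);
}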

Stefan

