RE: [PATCH v2] migration/postcopy: ensure preempt channel is ready before loading states

2024-04-06 Thread Wang, Wei W
On Saturday, April 6, 2024 5:53 AM, Peter Xu wrote:
> On Fri, Apr 05, 2024 at 11:40:56AM +0800, Wei Wang wrote:
> > Before loading the guest states, ensure that the preempt channel has
> > been ready to use, as some of the states (e.g. via virtio_load) might
> > trigger page faults that will be handled through the preempt channel.
> > So yield to the main thread in the case that the channel create event
> > hasn't been dispatched.
> >
> > Originally-by: Lei Wang 
> > Link: https://lore.kernel.org/all/9aa5d1be-7801-40dd-83fd-f7e041ced249@intel.com/T/
> > Suggested-by: Peter Xu 
> 
> The current version doesn't have any of my credits. :) Thanks, but I'll just drop
> it to reflect reality, so we keep the credit to the right ones.
> 
> > Signed-off-by: Lei Wang 
> > Signed-off-by: Wei Wang 
> > ---
> >  migration/savevm.c | 17 +
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index 388d7af7cd..63f9991a8a 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -2342,6 +2342,23 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> >
> >      QEMUFile *packf = qemu_file_new_input(QIO_CHANNEL(bioc));
> >
> > +    /*
> > +     * Before loading the guest states, ensure that the preempt channel has
> > +     * been ready to use, as some of the states (e.g. via virtio_load) might
> > +     * trigger page faults that will be handled through the preempt channel.
> > +     * So yield to the main thread in the case that the channel create event
> > +     * hasn't been dispatched.
> 
> I'll squash below into it.  If any of you disagree please shoot: valid until this
> Sunday.
> 
> + * TODO: if we can move migration loadvm out of main thread, then we
> + * won't block main thread from polling the accept() fds.  We can drop
> + * this as a whole when that is done.
> 
Looks good to me.




RE: [PATCH v2] migration/postcopy: ensure preempt channel is ready before loading states

2024-04-04 Thread Wang, Wei W
On Friday, April 5, 2024 11:41 AM, Wang, Wei W wrote:
> 
> Before loading the guest states, ensure that the preempt channel has been
> ready to use, as some of the states (e.g. via virtio_load) might trigger page
> faults that will be handled through the preempt channel. So yield to the main
> thread in the case that the channel create event hasn't been dispatched.
> 
> Originally-by: Lei Wang 
> Link: https://lore.kernel.org/all/9aa5d1be-7801-40dd-83fd-
> f7e041ced...@intel.com/T/
> Suggested-by: Peter Xu 
> Signed-off-by: Lei Wang 
> Signed-off-by: Wei Wang 
> ---
>  migration/savevm.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 388d7af7cd..63f9991a8a 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2342,6 +2342,23 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> 
>      QEMUFile *packf = qemu_file_new_input(QIO_CHANNEL(bioc));
> 
> +    /*
> +     * Before loading the guest states, ensure that the preempt channel has
> +     * been ready to use, as some of the states (e.g. via virtio_load) might
> +     * trigger page faults that will be handled through the preempt channel.
> +     * So yield to the main thread in the case that the channel create event
> +     * hasn't been dispatched.
> +     */
> +    do {
> +        if (!migrate_postcopy_preempt() || !qemu_in_coroutine() ||
> +            mis->postcopy_qemufile_dst) {
> +            break;
> +        }
> +
> +        aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
> +        qemu_coroutine_yield();
> +    } while (1);
> +
>      ret = qemu_loadvm_state_main(packf, mis);
>      trace_loadvm_handle_cmd_packaged_main(ret);
>      qemu_fclose(packf);
> --
> 2.27.0

The main change from v1 is dropping the wait on the sem.
The change is still applied to loadvm_handle_cmd_packaged, as the sem issue
(possibly waiting twice) isn't there now.



RE: [PATCH v1] migration/postcopy: ensure preempt channel is ready before loading states

2024-04-04 Thread Wang, Wei W
On Friday, April 5, 2024 10:33 AM, Peter Xu wrote:
> On Fri, Apr 05, 2024 at 01:38:31AM +0000, Wang, Wei W wrote:
> > On Friday, April 5, 2024 4:57 AM, Peter Xu wrote:
> > > On Fri, Apr 05, 2024 at 12:48:15AM +0800, Wang, Lei wrote:
> > > > On 4/5/2024 0:25, Wang, Wei W wrote:> On Thursday, April 4, 2024
> > > > 10:12 PM, Peter Xu wrote:
> > > > >> On Thu, Apr 04, 2024 at 06:05:50PM +0800, Wei Wang wrote:
> > > > >>> Before loading the guest states, ensure that the preempt
> > > > >>> channel has been ready to use, as some of the states (e.g. via
> > > > >>> virtio_load) might trigger page faults that will be handled
> > > > >>> through the
> > > preempt channel.
> > > > >>> So yield to the main thread in the case that the channel
> > > > >>> create event has been dispatched.
> > > > >>>
> > > > >>> Originally-by: Lei Wang 
> > > > >>> Link:
> > > > >>> https://lore.kernel.org/all/9aa5d1be-7801-40dd-83fd-f7e041ced2
> > > > >>> 49@i
> > > > >>> ntel
> > > > >>> .com/T/
> > > > >>> Suggested-by: Peter Xu 
> > > > >>> Signed-off-by: Lei Wang 
> > > > >>> Signed-off-by: Wei Wang 
> > > > >>> ---
> > > > >>>  migration/savevm.c | 17 +
> > > > >>>  1 file changed, 17 insertions(+)
> > > > >>>
> > > > >>> diff --git a/migration/savevm.c b/migration/savevm.c
> > > > >>> index 388d7af7cd..fbc9f2bdd4 100644
> > > > >>> --- a/migration/savevm.c
> > > > >>> +++ b/migration/savevm.c
> > > > >>> @@ -2342,6 +2342,23 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> > > > >>>
> > > > >>>      QEMUFile *packf = qemu_file_new_input(QIO_CHANNEL(bioc));
> > > > >>>
> > > > >>> +    /*
> > > > >>> +     * Before loading the guest states, ensure that the preempt channel has
> > > > >>> +     * been ready to use, as some of the states (e.g. via virtio_load) might
> > > > >>> +     * trigger page faults that will be handled through the preempt channel.
> > > > >>> +     * So yield to the main thread in the case that the channel create event
> > > > >>> +     * has been dispatched.
> > > > >>> +     */
> > > > >>> +    do {
> > > > >>> +        if (!migrate_postcopy_preempt() || !qemu_in_coroutine() ||
> > > > >>> +            mis->postcopy_qemufile_dst) {
> > > > >>> +            break;
> > > > >>> +        }
> > > > >>> +
> > > > >>> +        aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
> > > > >>> +        qemu_coroutine_yield();
> > > > >>> +    } while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 1));
> > > > >>
> > > > >> I think we need s/!// here, so the same mistake I made?  I think we
> > > > >> need to rework the retval of qemu_sem_timedwait() at some point later..
> > > > >
> > > > > No. qemu_sem_timedwait returns false when timeout, which means sem
> > > > > isn’t posted yet.
> > > > > So it needs to go back to the loop. (the patch was tested)
> > > >
> > > > When timeout, qemu_sem_timedwait() will return -1. I think the
> > > > patch test passed may because you will always have at least one
> > > > yield (the first yield in the do ...while ...) when loadvm_handle_cmd_packaged()?
> > >
> > > My guess is that here the kick will work and qemu_sem_timedwait() later
> > > will ETIMEOUT -> qemu_sem_timedwait() returns -1, then the loop just broke.
> > > That aio schedule should make sure anyway that the file is ready;
> > > the preempt thread must run before this to not hang that thread.
> >
> > Yes, misr

RE: [PATCH v1] migration/postcopy: ensure preempt channel is ready before loading states

2024-04-04 Thread Wang, Wei W
On Friday, April 5, 2024 4:57 AM, Peter Xu wrote:
> On Fri, Apr 05, 2024 at 12:48:15AM +0800, Wang, Lei wrote:
> > On 4/5/2024 0:25, Wang, Wei W wrote:> On Thursday, April 4, 2024 10:12
> > PM, Peter Xu wrote:
> > >> On Thu, Apr 04, 2024 at 06:05:50PM +0800, Wei Wang wrote:
> > >>> Before loading the guest states, ensure that the preempt channel
> > >>> has been ready to use, as some of the states (e.g. via
> > >>> virtio_load) might trigger page faults that will be handled through the
> preempt channel.
> > >>> So yield to the main thread in the case that the channel create
> > >>> event has been dispatched.
> > >>>
> > >>> Originally-by: Lei Wang 
> > >>> Link:
> > >>> https://lore.kernel.org/all/9aa5d1be-7801-40dd-83fd-f7e041ced249@i
> > >>> ntel
> > >>> .com/T/
> > >>> Suggested-by: Peter Xu 
> > >>> Signed-off-by: Lei Wang 
> > >>> Signed-off-by: Wei Wang 
> > >>> ---
> > >>>  migration/savevm.c | 17 +
> > >>>  1 file changed, 17 insertions(+)
> > >>>
> > >>> diff --git a/migration/savevm.c b/migration/savevm.c
> > >>> index 388d7af7cd..fbc9f2bdd4 100644
> > >>> --- a/migration/savevm.c
> > >>> +++ b/migration/savevm.c
> > >>> @@ -2342,6 +2342,23 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> > >>>
> > >>>      QEMUFile *packf = qemu_file_new_input(QIO_CHANNEL(bioc));
> > >>>
> > >>> +    /*
> > >>> +     * Before loading the guest states, ensure that the preempt channel has
> > >>> +     * been ready to use, as some of the states (e.g. via virtio_load) might
> > >>> +     * trigger page faults that will be handled through the preempt channel.
> > >>> +     * So yield to the main thread in the case that the channel create event
> > >>> +     * has been dispatched.
> > >>> +     */
> > >>> +    do {
> > >>> +        if (!migrate_postcopy_preempt() || !qemu_in_coroutine() ||
> > >>> +            mis->postcopy_qemufile_dst) {
> > >>> +            break;
> > >>> +        }
> > >>> +
> > >>> +        aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
> > >>> +        qemu_coroutine_yield();
> > >>> +    } while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 1));
> > >>
> > >> I think we need s/!// here, so the same mistake I made?  I think we
> > >> need to rework the retval of qemu_sem_timedwait() at some point later..
> > >
> > > No. qemu_sem_timedwait returns false when timeout, which means sem
> > > isn’t posted yet.
> > > So it needs to go back to the loop. (the patch was tested)
> >
> > When timeout, qemu_sem_timedwait() will return -1. I think the patch
> > test passed may because you will always have at least one yield (the
> > first yield in the do ...while ...) when loadvm_handle_cmd_packaged()?
> 
> My guess is that here the kick will work and qemu_sem_timedwait() later will
> ETIMEOUT -> qemu_sem_timedwait() returns -1, then the loop just broke.
> That aio schedule should make sure anyway that the file is ready; the preempt
> thread must run before this to not hang that thread.

Yes, misread of the return value. It still worked because the loop broke at
the "if (mis->postcopy_qemufile_dst)" check.

Even below will work:
do {
    if (mis->postcopy_qemufile_dst) {
        break;
    }
    ...
} while (1);

I still don't see the value of using the postcopy_qemufile_dst_done sem in the
code, though. It simply blocks the main thread from creating the preempt channel
for 1ms, regardless of whether the sem has been posted or not. (We add it for the
case that it is not posted and we need to go back to the loop.)


RE: [PATCH v1] migration/postcopy: ensure preempt channel is ready before loading states

2024-04-04 Thread Wang, Wei W
On Thursday, April 4, 2024 10:12 PM, Peter Xu wrote:
> On Thu, Apr 04, 2024 at 06:05:50PM +0800, Wei Wang wrote:
> > Before loading the guest states, ensure that the preempt channel has
> > been ready to use, as some of the states (e.g. via virtio_load) might
> > trigger page faults that will be handled through the preempt channel.
> > So yield to the main thread in the case that the channel create event
> > has been dispatched.
> >
> > Originally-by: Lei Wang 
> > Link:
> > https://lore.kernel.org/all/9aa5d1be-7801-40dd-83fd-f7e041ced249@intel
> > .com/T/
> > Suggested-by: Peter Xu 
> > Signed-off-by: Lei Wang 
> > Signed-off-by: Wei Wang 
> > ---
> >  migration/savevm.c | 17 +
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index 388d7af7cd..fbc9f2bdd4 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -2342,6 +2342,23 @@ static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis)
> >
> >      QEMUFile *packf = qemu_file_new_input(QIO_CHANNEL(bioc));
> >
> > +    /*
> > +     * Before loading the guest states, ensure that the preempt channel has
> > +     * been ready to use, as some of the states (e.g. via virtio_load) might
> > +     * trigger page faults that will be handled through the preempt channel.
> > +     * So yield to the main thread in the case that the channel create event
> > +     * has been dispatched.
> > +     */
> > +    do {
> > +        if (!migrate_postcopy_preempt() || !qemu_in_coroutine() ||
> > +            mis->postcopy_qemufile_dst) {
> > +            break;
> > +        }
> > +
> > +        aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
> > +        qemu_coroutine_yield();
> > +    } while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 1));
> 
> I think we need s/!// here, so the same mistake I made?  I think we need to
> rework the retval of qemu_sem_timedwait() at some point later..

No. qemu_sem_timedwait returns false when timeout, which means sem isn’t posted yet.
So it needs to go back to the loop. (the patch was tested)

> 
> Besides, this patch kept the sem_wait() in postcopy_preempt_thread() so it
> will wait() on this sem again.  If this qemu_sem_timedwait() accidentally
> consumed the sem count then I think the other thread can hang forever?

I get the issue you mentioned, and it seems better to place this before the
creation of the preempt thread. Then we probably don't need to wait on the sem
in the preempt thread, as the channel is guaranteed to be ready when it runs?

Update will be:

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index eccff499cb..5a70ce4f23 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1254,6 +1254,15 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
     }
 
     if (migrate_postcopy_preempt()) {
+        do {
+            if (!migrate_postcopy_preempt() || !qemu_in_coroutine() ||
+                mis->postcopy_qemufile_dst) {
+                break;
+            }
+            aio_co_schedule(qemu_get_current_aio_context(), qemu_coroutine_self());
+            qemu_coroutine_yield();
+        } while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 1));
+
         /*
          * This thread needs to be created after the temp pages because
          * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
@@ -1743,12 +1752,6 @@ void *postcopy_preempt_thread(void *opaque)
 
     qemu_sem_post(&mis->thread_sync_sem);
 
-    /*
-     * The preempt channel is established in asynchronous way.  Wait
-     * for its completion.
-     */
-    qemu_sem_wait(&mis->postcopy_qemufile_dst_done);









RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-04 Thread Wang, Wei W
On Thursday, April 4, 2024 12:34 AM, Peter Xu wrote:
> On Wed, Apr 03, 2024 at 04:04:21PM +0000, Wang, Wei W wrote:
> > On Wednesday, April 3, 2024 10:42 PM, Peter Xu wrote:
> > > On Wed, Apr 03, 2024 at 04:35:35PM +0800, Wang, Lei wrote:
> > > > We should change the following line from
> > > >
> > > >   while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
> > > >
> > > > to
> > > >
> > > >   while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
> > >
> > > Stupid me.. :(  Thanks for figuring this out.
> > >
> > > > After that fix, test passed and no segfault.
> > > >
> > > > Given that the test shows a yield to the main loop won't introduce
> > > > much overhead (<1ms), how about first yield unconditionally, then
> > > > we enter the while loop to wait for several ms and yield periodically?
> > >
> > > Shouldn't the expectation be that this should return immediately
> > > without a wait?  We're already processing LISTEN command, and on the
> > > source as you said it was much after the connect().  It won't
> > > guarantee the ordering but IIUC the majority should still have a direct hit?
> > >
> > > What we can do though is reducing the 100ms timeout if you see
> > > that's perhaps a risk of having too large a downtime when by
> > > accident.  We can even do it in a tight loop here considering
> > > downtime is important, but to provide an intermediate ground: how about
> > > 100ms -> 1ms poll?
> >
> > Would it be better to use busy wait here, instead of blocking for even 1ms here?
> > It's likely that the preempt channel is waiting for the main thread to
> > dispatch for accept(), but we are calling qemu_sem_timedwait here to block
> > the main thread for 1 more ms.
>
> I think it's about the expectation of whether we should already received that
> sem post.  My understanding is in most cases we should directly return and
> avoid such wait.

The assumption of "should already received" might not be true.
On my machine, it seems always "not received".

>
> Per my previous experience, 1ms is not a major issue to be added on top of
> downtime in corner cases like this.

Yeah, in most cases, probably yes. But there are some usages requiring relatively
low downtime. Remember that NFV usages in the early days expected the downtime to
be close to 10ms. In that case, adding 1ms would mean a ~10% regression. ☹

>
> We do have a lot of other potential optimizations to reduce downtime, or I
> should say in the other way, that..  there can be a lot of cases where we can hit
> much larger downtime than expected. Consider when we don't even account
> downtime for device states for now, either load_state or save_state, we only
> count RAM but that's far from accurate.. and we do have more chances to
> optimize.  Some are listed here, but some may not:
>
> https://wiki.qemu.org/ToDo/LiveMigration#Optimizations
>
> If you agree with my above "expectation" statement, I'd say we should avoid
> using a busy loop whenever possible in QEMU unless extremely necessary.

I just posted a version with a few changes from your suggestion:

It still blocks on the sem in the loop, but before that it checks whether the
channel has been created. If not, it yields to the main loop. Then, when reaching
the sem, it is likely that no wait is needed.

Another change is that I moved it to the place before we load the states (in
loadvm_handle_cmd_packaged), because I think that's logically the place where we
should ensure the channel is ready (I don't have a good reason to add it when
handling the LISTEN cmd).

Please let me know if you would disagree anywhere.


RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-03 Thread Wang, Wei W
On Wednesday, April 3, 2024 10:42 PM, Peter Xu wrote:
> On Wed, Apr 03, 2024 at 04:35:35PM +0800, Wang, Lei wrote:
> > We should change the following line from
> >
> > while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
> >
> > to
> >
> > while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
> 
> Stupid me.. :(  Thanks for figuring this out.
> 
> >
> > After that fix, test passed and no segfault.
> >
> > Given that the test shows a yield to the main loop won't introduce
> > much overhead (<1ms), how about first yield unconditionally, then we
> > enter the while loop to wait for several ms and yield periodically?
> 
> Shouldn't the expectation be that this should return immediately without a
> wait?  We're already processing LISTEN command, and on the source as you
> said it was much after the connect().  It won't guarantee the ordering but 
> IIUC
> the majority should still have a direct hit?
> 
> What we can do though is reducing the 100ms timeout if you see that's
> perhaps a risk of having too large a downtime when by accident.  We can even
> do it in a tight loop here considering downtime is important, but to provide an
> intermediate ground: how about 100ms -> 1ms poll?

Would it be better to use busy wait here, instead of blocking for even 1ms here?
It's likely that the preempt channel is waiting for the main thread to dispatch
for accept(), but we are calling qemu_sem_timedwait here to block the main thread
for 1 more ms.


> 
> If you agree (and also to Wei; please review this and comment if there's any!),
> would you write up the commit log, fully test it in whatever way you could,
> and resend as a formal patch (please do this before Friday if possible)?  You
> can keep a "Suggested-by:" for me.  I want to queue it for rc3 if it can catch
> it. It seems important if Wei can always reproduce it.

Not sure if Lei will be able to be online, as the following two days are a Chinese
holiday. If not, I could help take over and send it late tomorrow. Let's see.


RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-02 Thread Wang, Wei W
On Tuesday, April 2, 2024 2:56 PM, Wang, Lei4 wrote:
> On 4/2/2024 0:13, Peter Xu wrote:
> > On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote:
> >> On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> >>> When using the post-copy preemption feature to perform post-copy
> >>> live migration, the below scenario could lead to a deadlock and the
> >>> migration will never finish:
> >>>
> >>>  - Source connect() the preemption channel in postcopy_start().
> >>>  - Source and the destination side TCP stack finished the 3-way handshake
> >>>    thus the connection is successful.
> >>>  - The destination side main thread is handling the loading of the bulk RAM
> >>>    pages thus it doesn't start to handle the pending connection event in the
> >>>    event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for
> >>>    the preemption thread.
> >>>  - The source side sends non-iterative device states, such as the virtio
> >>>    states.
> >>>  - The destination main thread starts to receive the virtio states, this
> >>>    process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
> >>>    may trigger a page fault since the avail ring page may not be received
> >>>    yet).
> >
> > Ouch.  Yeah I think this part got overlooked when working on the
> > preempt channel.
> >
> >>>  - The page request is sent back to the source side. Source sends the page
> >>>content to the destination side preemption thread.
> >>>  - Since the event is not arrived and the semaphore
> >>>postcopy_qemufile_dst_done is not posted, the preemption thread in
> >>>destination side is blocked, and cannot handle receiving the page.
> >>>  - The QEMU main load thread on the destination side is stuck at the page
> >>>fault, and cannot yield and handle the connect() event for the
> >>>preemption channel to unblock the preemption thread.
> >>>  - The postcopy will stuck there forever since this is a deadlock.
> >>>
> >>> The key point to reproduce this bug is that the source side is
> >>> sending pages at a rate faster than the destination handling,
> >>> otherwise, the qemu_get_be64() in
> >>> ram_load_precopy() will have a chance to yield since at that time
> >>> there are no pending data in the buffer to get. This will make this
> >>> bug harder to be reproduced.
> >
> > How hard would this reproduce?
> 
> We can manually make this easier to reproduce by adding the following code
> to make the destination busier to load the pages:
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 0ad9fbba48..0b42877e1f 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -4232,6 +4232,7 @@ static int ram_load_precopy(QEMUFile *f)
>  {
>      MigrationIncomingState *mis = migration_incoming_get_current();
>      int flags = 0, ret = 0, invalid_flags = 0, len = 0, i = 0;
> +    volatile unsigned long long a;
> 
>      if (!migrate_compress()) {
>          invalid_flags |= RAM_SAVE_FLAG_COMPRESS_PAGE;
> @@ -4347,6 +4348,7 @@ static int ram_load_precopy(QEMUFile *f)
>              break;
> 
>          case RAM_SAVE_FLAG_PAGE:
> +            for (a = 0; a < 1; a++);
>              qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
>              break;
> 

Which version of QEMU are you using?
I tried with the latest upstream QEMU (e.g. v8.2.0 release, 1600b9f46b1bd), it's
always reproducible without any changes (with local migration tests).


> >
> > I'm thinking whether this should be 9.0 material or 9.1.  It's pretty
> > late for 9.0 though, but we can still discuss.
> >
> >>>
> >>> Fix this by yielding the load coroutine when receiving
> >>> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> >>> connection event before loading the non-iterative devices state to
> >>> avoid the deadlock condition.
> >>>
> >>> Signed-off-by: Lei Wang 
> >>
> >> This seems to be a regression issue caused by this commit:
> >> 737840e2c6ea (migration: Use the number of transferred bytes
> >> directly)
> >>
> >> Adding qemu_fflush back to migration_rate_exceeded() or
> >> ram_save_iterate seems to work (might not be a good fix though).
> >>
> >>> ---
> >>>  migration/savevm.c | 5 +
> >>>  1 file changed, 5 insertions(+)

RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-04-02 Thread Wang, Wei W
On Tuesday, April 2, 2024 12:13 AM, Peter Xu wrote:
> On Fri, Mar 29, 2024 at 08:54:07AM +0000, Wang, Wei W wrote:
> > On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> > > When using the post-copy preemption feature to perform post-copy
> > > live migration, the below scenario could lead to a deadlock and the
> > > migration will never finish:
> > >
> > >  - Source connect() the preemption channel in postcopy_start().
> > >  - Source and the destination side TCP stack finished the 3-way handshake
> > >    thus the connection is successful.
> > >  - The destination side main thread is handling the loading of the bulk RAM
> > >    pages thus it doesn't start to handle the pending connection event in the
> > >    event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for
> > >    the preemption thread.
> > >  - The source side sends non-iterative device states, such as the virtio
> > >    states.
> > >  - The destination main thread starts to receive the virtio states, this
> > >    process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
> > >    may trigger a page fault since the avail ring page may not be received
> > >    yet).
> 
> Ouch.  Yeah I think this part got overlooked when working on the preempt
> channel.
> 
> > >  - The page request is sent back to the source side. Source sends the page
> > >content to the destination side preemption thread.
> > >  - Since the event is not arrived and the semaphore
> > >postcopy_qemufile_dst_done is not posted, the preemption thread in
> > >destination side is blocked, and cannot handle receiving the page.
> > >  - The QEMU main load thread on the destination side is stuck at the page
> > >fault, and cannot yield and handle the connect() event for the
> > >preemption channel to unblock the preemption thread.
> > >  - The postcopy will stuck there forever since this is a deadlock.
> > >
> > > The key point to reproduce this bug is that the source side is
> > > sending pages at a rate faster than the destination handling,
> > > otherwise, the qemu_get_be64() in
> > > ram_load_precopy() will have a chance to yield since at that time
> > > there are no pending data in the buffer to get. This will make this
> > > bug harder to be reproduced.
> 
> How hard would this reproduce?

It seems 100% reproducible on my machine (with migration src and dst
on the same physical machine).

> 
> I'm thinking whether this should be 9.0 material or 9.1.  It's pretty late 
> for 9.0
> though, but we can still discuss.
> 
> > >
> > > Fix this by yielding the load coroutine when receiving
> > > MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> > > connection event before loading the non-iterative devices state to
> > > avoid the deadlock condition.
> > >
> > > Signed-off-by: Lei Wang 
> >
> > This seems to be a regression issue caused by this commit:
> > 737840e2c6ea (migration: Use the number of transferred bytes directly)
> >
> > Adding qemu_fflush back to migration_rate_exceeded() or
> > ram_save_iterate seems to work (might not be a good fix though).
> >
> > > ---
> > >  migration/savevm.c | 5 +
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index e386c5267f..8fd4dc92f2 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
> > >          return loadvm_postcopy_handle_advise(mis, len);
> > >
> > >      case MIG_CMD_POSTCOPY_LISTEN:
> > > +        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> > > +            aio_co_schedule(qemu_get_current_aio_context(),
> > > +                            qemu_coroutine_self());
> > > +            qemu_coroutine_yield();
> > > +        }
> >
> > The above could be moved to loadvm_postcopy_handle_listen().
> 
> I'm not 100% sure such thing (no matter here or moved into it, which does look
> cleaner) would work for us.
> 
> The problem is I still don't yet see an ordering restricted on top of (1)
> accept() happens, and (2) receive LISTEN cmd here.  What happens if the
> accept() request is not yet received when reaching LISTEN?  Or is it always
> guaranteed the accept(fd) will always be polled here?
> 
> For exampl

RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN

2024-03-29 Thread Wang, Wei W
On Friday, March 29, 2024 11:32 AM, Wang, Lei4 wrote:
> When using the post-copy preemption feature to perform post-copy live
> migration, the below scenario could lead to a deadlock and the migration will
> never finish:
> 
>  - Source connect() the preemption channel in postcopy_start().
>  - Source and the destination side TCP stack finished the 3-way handshake
>thus the connection is successful.
>  - The destination side main thread is handling the loading of the bulk RAM
>pages thus it doesn't start to handle the pending connection event in the
>event loop. and doesn't post the semaphore postcopy_qemufile_dst_done for
>the preemption thread.
>  - The source side sends non-iterative device states, such as the virtio
>states.
>  - The destination main thread starts to receive the virtio states, this
>process may lead to a page fault (e.g., virtio_load()->vring_avail_idx()
>may trigger a page fault since the avail ring page may not be received
>yet).
>  - The page request is sent back to the source side. Source sends the page
>content to the destination side preemption thread.
>  - Since the event is not arrived and the semaphore
>postcopy_qemufile_dst_done is not posted, the preemption thread in
>destination side is blocked, and cannot handle receiving the page.
>  - The QEMU main load thread on the destination side is stuck at the page
>fault, and cannot yield and handle the connect() event for the
>preemption channel to unblock the preemption thread.
>  - The postcopy will stuck there forever since this is a deadlock.
> 
> The key point to reproduce this bug is that the source side is sending pages 
> at a
> rate faster than the destination handling, otherwise, the qemu_get_be64() in
> ram_load_precopy() will have a chance to yield since at that time there are no
> pending data in the buffer to get. This will make this bug harder to be
> reproduced.
> 
> Fix this by yielding the load coroutine when receiving
> MIG_CMD_POSTCOPY_LISTEN so the main event loop can handle the
> connection event before loading the non-iterative devices state to avoid the
> deadlock condition.
> 
> Signed-off-by: Lei Wang 

This seems to be a regression issue caused by this commit:
737840e2c6ea (migration: Use the number of transferred bytes directly)

Adding qemu_fflush back to migration_rate_exceeded() or ram_save_iterate
seems to work (might not be a good fix though).

> ---
>  migration/savevm.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e386c5267f..8fd4dc92f2 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2445,6 +2445,11 @@ static int loadvm_process_command(QEMUFile *f)
>          return loadvm_postcopy_handle_advise(mis, len);
> 
>      case MIG_CMD_POSTCOPY_LISTEN:
> +        if (migrate_postcopy_preempt() && qemu_in_coroutine()) {
> +            aio_co_schedule(qemu_get_current_aio_context(),
> +                            qemu_coroutine_self());
> +            qemu_coroutine_yield();
> +        }

The above could be moved to loadvm_postcopy_handle_listen().

Another option is to follow the old way (i.e. pre_7_2) and do postcopy_preempt_setup
in migrate_fd_connect. This can save the above overhead of switching to the
main thread during the downtime. It seems Peter's previous patch already solved the
channel disordering issue. Let's see Peter's and others' opinions.

>  return loadvm_postcopy_handle_listen(mis);
> 

>  case MIG_CMD_POSTCOPY_RUN:
> --
> 2.39.3




RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes

2024-01-09 Thread Wang, Wei W
On Wednesday, January 10, 2024 12:32 AM, Li, Xiaoyao wrote:
> On 1/9/2024 10:53 PM, Wang, Wei W wrote:
> > On Tuesday, January 9, 2024 1:47 PM, Li, Xiaoyao wrote:
> >> On 12/21/2023 9:47 PM, Wang, Wei W wrote:
> >>> On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
> >>>> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
> >>>>> No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE
> >> there.
> >>>>> I'm suggesting below:
> >>>>>
> >>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> >>>>> index 2d9a2455de..63ba74b221 100644
> >>>>> --- a/accel/kvm/kvm-all.c
> >>>>> +++ b/accel/kvm/kvm-all.c
> >>>>> @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> >>>>>     struct kvm_memory_attributes attrs;
> >>>>>     int r;
> >>>>>
> >>>>> +    if ((attr & kvm_supported_memory_attributes) != attr) {
> >>>>> +        error_report("KVM doesn't support memory attr %lx\n", attr);
> >>>>> +        return -EINVAL;
> >>>>> +    }
> >>>>
> >>>> In the case of setting a range of memory to shared while KVM
> >>>> doesn't support private memory. Above check doesn't work. and
> >>>> following IOCTL fails.
> >>>
> >>> SHARED attribute uses the value 0, which indicates it's always supported, no?
> >>> For the implementation, can you find in the KVM side where the ioctl
> >>> would get failed in that case?
> >>
> >> I'm worrying about the future case, that KVM supports other memory
> >> attribute than shared/private. For example, KVM supports RWX bits
> >> (bit 0
> >> - 2) but not shared/private bit.
> >
> > What's the exact issue?
> > +#define KVM_MEMORY_ATTRIBUTE_READ   (1ULL << 2)
> > +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
> > +#define KVM_MEMORY_ATTRIBUTE_EXE  (1ULL << 0)
> >
> > They are checked via
> > "if ((attr & kvm_supported_memory_attributes) != attr)" shared above
> > in kvm_set_memory_attributes.
> > In the case you described, kvm_supported_memory_attributes will be 0x7.
> > Anything unexpected?
> 
> Sorry that I thought for wrong case.
> 
> It doesn't work on the case that KVM doesn't support memory_attribute, e.g.,
> an old KVM. In this case, 'kvm_supported_memory_attributes' is 0, and 'attr' is
> 0 as well.

How is this different in your existing implementation?

The usual way of defining a feature is to take a bit (as was done for the first
feature, *_PRIVATE). Using 0 as an attr is a bit magic, and it passes all the
"&" based checks. But using it for *_SHARED looks fine to me, as semantically
memory can always be shared, and the ioctl will return -ENOTTY anyway in the
case you mentioned.


RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes

2024-01-09 Thread Wang, Wei W
On Tuesday, January 9, 2024 1:47 PM, Li, Xiaoyao wrote:
> On 12/21/2023 9:47 PM, Wang, Wei W wrote:
> > On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
> >> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
> >>> No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE
> there.
> >>> I'm suggesting below:
> >>>
> >>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> >>> index 2d9a2455de..63ba74b221 100644
> >>> --- a/accel/kvm/kvm-all.c
> >>> +++ b/accel/kvm/kvm-all.c
> >>> @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> >>>    struct kvm_memory_attributes attrs;
> >>>    int r;
> >>>
> >>> +    if ((attr & kvm_supported_memory_attributes) != attr) {
> >>> +        error_report("KVM doesn't support memory attr %lx\n", attr);
> >>> +        return -EINVAL;
> >>> +    }
> >>
> >> In the case of setting a range of memory to shared while KVM doesn't
> >> support private memory. Above check doesn't work. and following IOCTL fails.
> >
> > SHARED attribute uses the value 0, which indicates it's always supported, no?
> > For the implementation, can you find in the KVM side where the ioctl
> > would get failed in that case?
> 
> I'm worrying about the future case, that KVM supports other memory attribute
> than shared/private. For example, KVM supports RWX bits (bit 0
> - 2) but not shared/private bit.

What's the exact issue?
+#define KVM_MEMORY_ATTRIBUTE_READ   (1ULL << 2)
+#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1)
+#define KVM_MEMORY_ATTRIBUTE_EXE  (1ULL << 0)

They are checked via
"if ((attr & kvm_supported_memory_attributes) != attr)" shared above in
kvm_set_memory_attributes.
In the case you described, kvm_supported_memory_attributes will be 0x7.
Anything unexpected?
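
As a side note, here is a tiny standalone sketch (my own illustration, not kernel
or QEMU code) of how that "(attr & supported) != attr" check behaves in the cases
being debated; the bit positions below just mirror the values quoted above, and
the PRIVATE bit is an assumption made only for this example:

#include <stdint.h>
#include <stdio.h>

/* Illustrative bit assignments only (assumed for this sketch). */
#define ATTR_EXE     (1ULL << 0)
#define ATTR_WRITE   (1ULL << 1)
#define ATTR_READ    (1ULL << 2)
#define ATTR_PRIVATE (1ULL << 3)

/* Mirrors the "(attr & supported) != attr" check discussed above. */
static int attr_rejected(uint64_t supported, uint64_t attr)
{
    return (attr & supported) != attr;
}

int main(void)
{
    uint64_t rwx_only = ATTR_READ | ATTR_WRITE | ATTR_EXE;   /* 0x7 */

    /* KVM reports RWX but not PRIVATE: setting PRIVATE is rejected up front. */
    printf("RWX-only KVM, set PRIVATE: rejected=%d\n",
           attr_rejected(rwx_only, ATTR_PRIVATE));

    /* Old KVM with no attribute support at all: attr 0 (shared) still passes
     * the mask check, so only the ioctl itself can fail (e.g. with -ENOTTY). */
    printf("old KVM, set shared (attr 0): rejected=%d\n",
           attr_rejected(0, 0));
    return 0;
}

The second case is exactly the "attr is 0 and supported is 0" situation raised
above: the mask check lets it through, and the failure (if any) comes from the
ioctl itself.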


RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes

2023-12-21 Thread Wang, Wei W
On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
> > No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE there.
> > I'm suggesting below:
> >
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 2d9a2455de..63ba74b221 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> >   struct kvm_memory_attributes attrs;
> >   int r;
> >
> > +    if ((attr & kvm_supported_memory_attributes) != attr) {
> > +        error_report("KVM doesn't support memory attr %lx\n", attr);
> > +        return -EINVAL;
> > +    }
> 
> In the case of setting a range of memory to shared while KVM doesn't support
> private memory. Above check doesn't work. and following IOCTL fails.

SHARED attribute uses the value 0, which indicates it's always supported, no?
For the implementation, can you find in the KVM side where the ioctl
would get failed in that case?

static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
                                           struct kvm_memory_attributes *attrs)
{
        gfn_t start, end;

        /* flags is currently not used. */
        if (attrs->flags)
                return -EINVAL;
        if (attrs->attributes & ~kvm_supported_mem_attributes(kvm))   ==> 0 here
                return -EINVAL;
        if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
                return -EINVAL;
        if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
                return -EINVAL;


RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes

2023-12-21 Thread Wang, Wei W
On Thursday, December 21, 2023 2:11 PM, Li, Xiaoyao wrote:
> On 12/12/2023 9:56 PM, Wang, Wei W wrote:
> > On Wednesday, November 15, 2023 3:14 PM, Xiaoyao Li wrote:
> >> Introduce the helper functions to set the attributes of a range of
> >> memory to private or shared.
> >>
> >> This is necessary to notify KVM the private/shared attribute of each gpa range.
> >> KVM needs the information to decide the GPA needs to be mapped at
> >> hva-based shared memory or guest_memfd based private memory.
> >>
> >> Signed-off-by: Xiaoyao Li 
> >> ---
> >>  accel/kvm/kvm-all.c  | 42 ++
> >>  include/sysemu/kvm.h |  3 +++
> >>  2 files changed, 45 insertions(+)
> >>
> >> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> >> index 69afeb47c9c0..76e2404d54d2 100644
> >> --- a/accel/kvm/kvm-all.c
> >> +++ b/accel/kvm/kvm-all.c
> >> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;
> >>  static int kvm_sstep_flags;
> >>  static bool kvm_immediate_exit;
> >>  static bool kvm_guest_memfd_supported;
> >> +static uint64_t kvm_supported_memory_attributes;
> >>  static hwaddr kvm_max_slot_size = ~0;
> >>
> >>  static const KVMCapabilityInfo kvm_required_capabilites[] = {
> >> @@ -1305,6 +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
> >>      kvm_max_slot_size = max_slot_size;
> >>  }
> >>
> >> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> >> +{
> >> +    struct kvm_memory_attributes attrs;
> >> +    int r;
> >> +
> >> +    attrs.attributes = attr;
> >> +    attrs.address = start;
> >> +    attrs.size = size;
> >> +    attrs.flags = 0;
> >> +
> >> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
> >> +    if (r) {
> >> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'",
> >> +                     __func__, start, size, attr, strerror(errno));
> >> +    }
> >> +    return r;
> >> +}
> >> +
> >> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
> >> +{
> >> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> >> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> >> +}
> >> +
> >> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
> >> +{
> >> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> >> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> >> +        return -EINVAL;
> >> +    }
> >
> > Duplicate code in kvm_set_memory_attributes_shared/private.
> > Why not move the check into kvm_set_memory_attributes?
> 
> Because it's not easy to put the check into there.
> 
> Both setting and clearing one bit require the capability check. If moving the
> check into kvm_set_memory_attributes(), the check of
> KVM_MEMORY_ATTRIBUTE_PRIVATE will have to become unconditionally,
> which is not aligned to the function name because the name is not restricted to
> shared/private attribute only.

No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE there.
I'm suggesting below:

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2d9a2455de..63ba74b221 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
     struct kvm_memory_attributes attrs;
     int r;
 
+    if ((attr & kvm_supported_memory_attributes) != attr) {
+        error_report("KVM doesn't support memory attr %lx\n", attr);
+        return -EINVAL;
+    }
+
     attrs.attributes = attr;
     attrs.address = start;
     attrs.size = size;
@@ -1390,21 +1395,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
 
 int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
 {
-    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
-        error_report("KVM doesn't support PRIVATE memory attribute\n");
-        return -EINVAL;
-    }
-
     return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
 }
 
 int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
 {
-    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
-        error_report("KVM doesn't support PRIVATE memory attribute\n");
-        return -EINVAL;
-    }
-
     return kvm_set_memory_attributes(start, size, 0);
 }

Maybe you don't even need the kvm_set_memory_attributes_shared/private wrappers.


RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes

2023-12-12 Thread Wang, Wei W
On Wednesday, November 15, 2023 3:14 PM, Xiaoyao Li wrote:
> Introduce the helper functions to set the attributes of a range of memory to
> private or shared.
> 
> This is necessary to notify KVM the private/shared attribute of each gpa 
> range.
> KVM needs the information to decide the GPA needs to be mapped at hva-
> based shared memory or guest_memfd based private memory.
> 
> Signed-off-by: Xiaoyao Li 
> ---
>  accel/kvm/kvm-all.c  | 42 ++
>  include/sysemu/kvm.h |  3 +++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 69afeb47c9c0..76e2404d54d2 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;
>  static int kvm_sstep_flags;
>  static bool kvm_immediate_exit;
>  static bool kvm_guest_memfd_supported;
> +static uint64_t kvm_supported_memory_attributes;
>  static hwaddr kvm_max_slot_size = ~0;
> 
>  static const KVMCapabilityInfo kvm_required_capabilites[] = {
> @@ -1305,6 +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
>      kvm_max_slot_size = max_slot_size;
>  }
> 
> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> +{
> +    struct kvm_memory_attributes attrs;
> +    int r;
> +
> +    attrs.attributes = attr;
> +    attrs.address = start;
> +    attrs.size = size;
> +    attrs.flags = 0;
> +
> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
> +    if (r) {
> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'",
> +                     __func__, start, size, attr, strerror(errno));
> +    }
> +    return r;
> +}
> +
> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
> +{
> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> +        return -EINVAL;
> +    }
> +
> +    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
> +{
> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> +        return -EINVAL;
> +    }

Duplicate code in kvm_set_memory_attributes_shared/private.
Why not move the check into kvm_set_memory_attributes?



RE: [PATCH v2] migration: refactor migration_completion

2023-10-12 Thread Wang, Wei W
On Wednesday, October 11, 2023 8:41 PM, Juan Quintela wrote:
> Wei Wang  wrote:
> > Current migration_completion function is a bit long. Refactor the long
> > implementation into different subfunctions:
> > - migration_completion_precopy: completion code related to precopy
> > - migration_completion_postcopy: completion code related to postcopy
> > - close_return_path_on_source: rp thread related cleanup on migration
> > completion. It is named to match with open_return_path_on_source.

Btw, just a reminder that the above line related to close_return_path_on_source
in the commit log needs to be removed, as it's not added by the patch after
solving the conflicts.



RE: [PATCH v2] migration: refactor migration_completion

2023-10-12 Thread Wang, Wei W
On Thursday, October 12, 2023 4:32 AM, Juan Quintela wrote:
> > Yeah, this generates a nicer diff, thanks.
> > I'll rebase and resend it.
> 
> Already on the pull request.
> 
> I have to fix the conflict, but it has the same changes that yours as far as I
> can see.

Yes, just need to remove the conflict part, other changes seem to be the same as before.



RE: [PATCH v2] migration: refactor migration_completion

2023-10-11 Thread Wang, Wei W
On Wednesday, October 11, 2023 8:41 PM, Juan Quintela wrote:
> Wei Wang  wrote:
> > Current migration_completion function is a bit long. Refactor the long
> > implementation into different subfunctions:
> > - migration_completion_precopy: completion code related to precopy
> > - migration_completion_postcopy: completion code related to postcopy
> > - close_return_path_on_source: rp thread related cleanup on migration
> > completion. It is named to match with open_return_path_on_source.
> >
> > This improves readability and is easier for future updates (e.g. add
> > new subfunctions when completion code related to new features are
> > needed). No functional changes intended.
> >
> > Signed-off-by: Wei Wang 
> 
> There was some conflict with:
> 
> commit d50f5dc075cbb891bfe4a9378600a4871264468a
> Author: Fabiano Rosas 
> Date:   Mon Sep 18 14:28:20 2023 -0300
> 
> migration: Consolidate return path closing code
> 
> (basically the traces and the rp_thread_created check were already on the
> tree).
> 
> BTW, the diff is uglier than it needs to be.
> 
> You can add to your global .gitconfig:
> 
> [diff]
> algorithm = patience
> renames = true

Yeah, this generates a nicer diff, thanks.
I'll rebase and resend it.



RE: [PATCH v2] migration: refactor migration_completion

2023-10-11 Thread Wang, Wei W
On Friday, August 4, 2023 9:37 PM, Peter Xu wrote:
> On Fri, Aug 04, 2023 at 05:30:53PM +0800, Wei Wang wrote:
> > Current migration_completion function is a bit long. Refactor the long
> > implementation into different subfunctions:
> > - migration_completion_precopy: completion code related to precopy
> > - migration_completion_postcopy: completion code related to postcopy
> > - close_return_path_on_source: rp thread related cleanup on migration
> > completion. It is named to match with open_return_path_on_source.
> >
> > This improves readability and is easier for future updates (e.g. add
> > new subfunctions when completion code related to new features are
> > needed). No functional changes intended.
> >
> > Signed-off-by: Wei Wang 
> 
> Reviewed-by: Peter Xu 
> 

Hi Juan,
Do you think this refactoring would be good to merge, or do you have any more comments?

Thanks,
Wei


RE: [PATCH v1] migration: refactor migration_completion

2023-07-27 Thread Wang, Wei W
On Thursday, July 27, 2023 1:10 AM, Peter Xu wrote:
> On Fri, Jul 21, 2023 at 11:14:55AM +0000, Wang, Wei W wrote:
> > On Friday, July 21, 2023 4:38 AM, Peter Xu wrote:
> > > Looks good to me, after addressing Isaku's comments.
> > >
> > > The current_active_state is very unfortunate, along with most of the
> > > calls to
> > > migrate_set_state() - I bet most of the code will definitely go
> > > wrong if that cmpxchg didn't succeed inside of migrate_set_state(),
> > > IOW in most cases we simply always want:
> >
> > Can you share examples where it could be wrong?
> > (If it has bugs, we need to fix)
> 
> Nop.  What I meant is most of the cases we want to set the state without
> caring much about the old state, so at least we can have a helper like below
> and simply call migrate_set_state(s, STATE) where we don't care old state.
> 
> >
> > >
> > >   migrate_set_state(&s->state, s->state, XXX);
> > >
> > > Not sure whether one pre-requisite patch is good to have so we can
> > > rename
> > > migrate_set_state() to something like __migrate_set_state(), then:
> > >
> > >   migrate_set_state(s, XXX) {
> > > __migrate_set_state(&s->state, s->state, XXX);
> > >   }
> > >
> > > I don't even know whether there's any call site that will need
> > > __migrate_set_state() for real..
> > >
> >
> > Seems this would break the use of "MIGRATION_STATUS_CANCELLING".
> > For example,
> > - In migration_maybe_pause:
> >   migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER, new_state);
> >   If the current s->state isn't MIGRATION_STATUS_PRE_SWITCHOVER (could be
> >   MIGRATION_STATUS_CANCELLING), then s->state won’t be updated to new_state.
> > - Then, in migration_completion, the following update to s->state won't succeed:
> >   migrate_set_state(&s->state, current_active_state, MIGRATION_STATUS_COMPLETED);
> >
> > - Finally, when reaching migration_iteration_finish(), s->state is
> >   MIGRATION_STATUS_CANCELLING, instead of MIGRATION_STATUS_COMPLETED.
> 
> 
> The whole state changes are just flaky to me in general, even with the help of
> old_state cmpxchg.

Yes, the design/implementation of the migration state transition can be
improved (it looks fragile to me). I think this should be done in a separate
patchset, though. For this patch, we could keep it with no functional changes.

> 
> E.g., I'm wondering whether below race can happen, assuming we're starting
> with ACTIVE state and just about to complete migration:
> 
>   main thread                          migration thread
>   -----------                          ----------------
> 
>                                        migration_maybe_pause(current_active_state==ACTIVE)
>                                          if (s->state != MIGRATION_STATUS_CANCELLING)
>                                            --> true, keep setting state
>                                          qemu_mutex_unlock_iothread();
> qemu_mutex_lock_iothread();
> migrate_fd_cancel()
>   if (old_state == MIGRATION_STATUS_PRE_SWITCHOVER)
>     --> false, not posting to pause_sem
>   set state to MIGRATION_STATUS_CANCELLING
>                                          migrate_set_state(&s->state, *current_active_state,
>                                                            MIGRATION_STATUS_PRE_SWITCHOVER);
>                                            --> false, cmpxchg fail
>                                          qemu_sem_wait(&s->pause_sem);
>                                            --> hang death?

We still need "migrate continue" to unblock the migration thread.
Probably we should document that PAUSE_BEFORE_SWITCHOVER always requires an
explicit "migrate continue" to be issued by the user (even after the migration
is cancelled).


RE: [PATCH v1] migration: refactor migration_completion

2023-07-21 Thread Wang, Wei W
On Friday, July 21, 2023 4:38 AM, Peter Xu wrote:
> Looks good to me, after addressing Isaku's comments.
> 
> The current_active_state is very unfortunate, along with most of the calls to
> migrate_set_state() - I bet most of the code will definitely go wrong if that
> cmpxchg didn't succeed inside of migrate_set_state(), IOW in most cases we
> simply always want:

Can you share examples where it could be wrong?
(If it has bugs, we need to fix)

> 
>   migrate_set_state(&s->state, s->state, XXX);
> 
> Not sure whether one pre-requisite patch is good to have so we can rename
> migrate_set_state() to something like __migrate_set_state(), then:
> 
>   migrate_set_state(s, XXX) {
> __migrate_set_state(&s->state, s->state, XXX);
>   }
> 
> I don't even know whether there's any call site that will need
> __migrate_set_state() for real..
> 

Seems this would break the use of "MIGRATION_STATUS_CANCELLING".
For example,
- In migration_maybe_pause:
  migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER, new_state);
  If the current s->state isn't MIGRATION_STATUS_PRE_SWITCHOVER (could
  be MIGRATION_STATUS_CANCELLING), then s->state won’t be updated to new_state.
- Then, in migration_completion, the following update to s->state won't succeed:
  migrate_set_state(&s->state, current_active_state, MIGRATION_STATUS_COMPLETED);
- Finally, when reaching migration_iteration_finish(), s->state is
  MIGRATION_STATUS_CANCELLING, instead of MIGRATION_STATUS_COMPLETED.
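
To make the failure mode above concrete, here is a minimal standalone model of a
cmpxchg-style state setter (my own sketch; the names and states are simplified
stand-ins, not the real QEMU code):

#include <stdatomic.h>
#include <stdio.h>

enum mig_state { ACTIVE, PRE_SWITCHOVER, DEVICE, CANCELLING, COMPLETED };

static _Atomic int state = ACTIVE;

/* Only transitions if the current state still equals old_state,
 * mirroring the cmpxchg inside migrate_set_state(). */
static void set_state(int old_state, int new_state)
{
    int expected = old_state;
    atomic_compare_exchange_strong(&state, &expected, new_state);
}

int main(void)
{
    /* Another thread (think migrate_fd_cancel) already moved us here. */
    atomic_store(&state, CANCELLING);

    /* migration_maybe_pause(): expects to leave PRE_SWITCHOVER... */
    set_state(PRE_SWITCHOVER, DEVICE);
    /* migration_completion(): expects to complete from the active state... */
    set_state(DEVICE, COMPLETED);

    /* Both transitions were silently skipped, so the cancel "wins" and
     * migration_iteration_finish() still sees CANCELLING. */
    printf("final state: %d (CANCELLING is %d)\n",
           atomic_load(&state), (int)CANCELLING);
    return 0;
}

With the proposed unconditional migrate_set_state(s, XXX), both stores would go
through and the final state would read COMPLETED even though the migration was
cancelled, which is the breakage described above.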


RE: [PATCH v1] migration: refactor migration_completion

2023-07-18 Thread Wang, Wei W
On Tuesday, July 18, 2023 1:44 PM, Isaku Yamahata wrote:
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2058,6 +2058,21 @@ static int await_return_path_close_on_source(MigrationState *ms)
> >      return ms->rp_state.error;
> >  }
> >
> > +static int close_return_path_on_source(MigrationState *ms)
> > +{
> > +    int ret;
> > +
> > +    if (!ms->rp_state.rp_thread_created) {
> > +        return 0;
> > +    }
> > +
> > +    trace_migration_return_path_end_before();
> > +    ret = await_return_path_close_on_source(ms);
> > +    trace_migration_return_path_end_after(ret);
> > +
> > +    return ret;
> > +}
> > +
> 
> There is only one caller, migration_completion().  We can consolidate two
> functions, await_return_path_close_on_source() and
> close_return_path_on_source(), into single function.

Sounds good, thanks.
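
A minimal sketch of what that consolidation could look like (assumed shape only:
the body of the original await_return_path_close_on_source() isn't quoted in this
thread, so the thread-join and error pickup below are illustrative, and it relies
on the same QEMU internals as the quoted hunk):

static int close_return_path_on_source(MigrationState *ms)
{
    int ret;

    if (!ms->rp_state.rp_thread_created) {
        return 0;
    }

    trace_migration_return_path_end_before();

    /* Formerly await_return_path_close_on_source(): wait for the return-path
     * thread to exit and pick up any error it recorded. */
    qemu_thread_join(&ms->rp_state.rp_thread);
    ms->rp_state.rp_thread_created = false;
    ret = ms->rp_state.error;

    trace_migration_return_path_end_after(ret);
    return ret;
}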

> > +static int migration_completion_postcopy(MigrationState *s)
> > +{
> > +    trace_migration_completion_postcopy_end();
> > +
> > +    qemu_mutex_lock_iothread();
> > +    qemu_savevm_state_complete_postcopy(s->to_dst_file);
> > +    qemu_mutex_unlock_iothread();
> > +
> > +    /*
> > +     * Shutdown the postcopy fast path thread.  This is only needed when dest
> > +     * QEMU binary is old (7.1/7.2).  QEMU 8.0+ doesn't need this.
> > +     */
> > +    if (migrate_postcopy_preempt() && s->preempt_pre_7_2) {
> > +        postcopy_preempt_shutdown_file(s);
> > +    }
> > +
> > +    trace_migration_completion_postcopy_end_after_complete();
> > +
> > +    return 0;
> 
> Always return 0?  Make it void.

OK.

> > +static void migration_completion(MigrationState *s)
> > +{
> > +    int ret = -1;
> > +    int current_active_state = s->state;
> > +
> > +    if (s->state == MIGRATION_STATUS_ACTIVE) {
> > +        ret = migration_completion_precopy(s, &current_active_state);
> > +    } else if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
> > +        ret = migration_completion_postcopy(s);
> 
> Here we can set ret = 0.

Yes, after migration_completion_postcopy is made "void".
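
For clarity, the "void" variant being agreed on here would look roughly like the
following; this is a sketch derived from the quoted hunk (return value dropped),
not necessarily the final merged code:

static void migration_completion_postcopy(MigrationState *s)
{
    trace_migration_completion_postcopy_end();

    qemu_mutex_lock_iothread();
    qemu_savevm_state_complete_postcopy(s->to_dst_file);
    qemu_mutex_unlock_iothread();

    /*
     * Shutdown the postcopy fast path thread.  This is only needed when dest
     * QEMU binary is old (7.1/7.2).  QEMU 8.0+ doesn't need this.
     */
    if (migrate_postcopy_preempt() && s->preempt_pre_7_2) {
        postcopy_preempt_shutdown_file(s);
    }

    trace_migration_completion_postcopy_end_after_complete();
}

The caller in migration_completion() can then set ret = 0 directly in the
postcopy branch, as suggested above.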


RE: [PATCH v2 2/2] qtest/migration-tests.c: use "-incoming defer" for postcopy tests

2023-06-01 Thread Wang, Wei W
On Wednesday, May 31, 2023 8:58 PM, Peter Xu wrote:
> > > Hmm.. so we used to do socket_start_incoming_migration_internal()
> > > before setting the right num for the preempt test, then I'm curious
> > > why it wasn't failing before this patch when trying to connect with the
> preempt channel..
> > >
> > > Wei, do you know?
> >
> > I think there are two reasons:
> > #1 "backlog" specifies the number of pending connections. As long as
> > the server accepts the connections faster than the clients side
> > connecting, connection will succeed. For the preempt test, it uses
> > only 2 channels, so very likely to not have pending connections. (This
> > is easier to trigger for the multifd case, e.g. use a much larger number like 16).
> > #2 per my tests (on kernel 6.2), the number of pending connections
> > allowed is actually "backlog + 1", which is 2 in this case. Adding
> > more pending connections will then be dropped. I'm not sure if "backlog + 1"
> > is true for older versions of kernel, though.
> 
> Interesting to know, thanks.
> 
> If there'll be a new version, please consider adding some of those into the
> commit message.

OK, I will resend with the commit message updated.
I plan to wait a bit in case there is other feedback.

> 
> Reviewed-by: Peter Xu 
> 
> --
> Peter Xu



RE: [PATCH v2 2/2] qtest/migration-tests.c: use "-incoming defer" for postcopy tests

2023-05-31 Thread Wang, Wei W
On Tuesday, May 30, 2023 10:41 PM, Peter Xu wrote:
> On Tue, May 30, 2023 at 05:02:59PM +0800, Wei Wang wrote:
> > The Postcopy preempt capability requires to be set before incoming
> > starts, so change the postcopy tests to start with deferred incoming
> > and call migrate-incoming after the cap has been set.
> >
> > Signed-off-by: Wei Wang 
> 
> Hmm.. so we used to do socket_start_incoming_migration_internal() before
> setting the right num for the preempt test, then I'm curious why it wasn't
> failing before this patch when trying to connect with the preempt channel..
> 
> Wei, do you know?

I think there are two reasons:
#1 "backlog" specifies the number of pending connections. As long as the server
accepts the connections faster than the clients side connecting, connection
will succeed. For the preempt test, it uses only 2 channels, so very likely to 
not have
pending connections. (This is easier to trigger for the multifd case, e.g. use 
a much
larger number like 16).
#2 per my tests (on kernel 6.2), the number of pending connections allowed is 
actually
"backlog + 1", which is 2 in this case. Adding more pending connections will 
then be
dropped. I'm not sure if " backlog + 1" is true for older versions of kernel, 
though.
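
To illustrate the listen() semantics I'm referring to, here is a standalone
sketch (not QEMU code; error handling omitted for brevity):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * With listen(fd, 1) the kernel only queues a small number of completed,
 * not-yet-accept()ed connections (observed as backlog + 1 on my 6.2 kernel);
 * further clients are dropped until the server calls accept().
 */
static int start_listener(uint16_t port, int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, backlog);    /* backlog == 1 in the non-deferred preempt case */
    return fd;
}

With backlog == 1 and only the main loop calling accept(), a third simultaneous
connect attempt is the first one at risk of being dropped.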




RE: [PATCH v1] migration: fix migrate_params_test_apply to set the dest param correctly

2023-05-30 Thread Wang, Wei W
On Monday, May 29, 2023 10:58 PM, Peter Xu wrote:
> >
> > #1    migrate_params_test_apply(params, &tmp);
> >
> >  #2   if (!migrate_params_check(&tmp, errp)) {
> > /* Invalid parameter */
> > return;
> > }
> >  #3  migrate_params_apply(params, errp);
> >
> > #2 tries to do params check using tmp, which is expected to be set up
> > by #1, but #1 didn't use "&tmp",
> 
> #1 initialized "" with current parameters, here:
> 
> *dest = migrate_get_current()->parameters;
> 
> ?

Yes. Sorry, I had a misunderstanding of this one. All the has_* fields of
the current params have been initialized to true at the beginning.
(I once dumped tmp after migrate_params_test_apply and they were all 0,
which drove me to make the changes, but I couldn't reproduce it now
- things appear to be correct without this patch.)


RE: [PATCH v1] migration: fix migrate_params_test_apply to set the dest param correctly

2023-05-29 Thread Wang, Wei W
On Saturday, May 27, 2023 5:49 AM, Peter Xu wrote:
> On Wed, May 24, 2023 at 04:01:57PM +0800, Wei Wang wrote:
> > qmp_migrate_set_parameters expects to use tmp for parameters check, so
> > migrate_params_test_apply is expected to copy the related fields from
> > params to tmp. So fix migrate_params_test_apply to use the function
> > parameter, *dest, rather than the global one. The dest->has_xxx (xxx
> > is the feature name) related fields need to be set, as they will be
> > checked by migrate_params_check.
> 
> I think it's fine to do as what you suggested, but I don't see much benefit
> either.. the old code IIUC will check all params even if 1 param changed,
> while after your change it only checks the modified ones.
> 
> There's slight benefits but not so much, especially "22+, 2-" LOCs, because
> we don't really do this a lot; some more sanity check also makes sense to me
> even if everything is always checked, so we'll hit errors if anything
> accidentally goes wrong too.
> 
> Is there a real bug somewhere?

Yes. Please see qmp_migrate_set_parameters:

#1    migrate_params_test_apply(params, &tmp);

 #2   if (!migrate_params_check(&tmp, errp)) {
/* Invalid parameter */
return;
}
 #3  migrate_params_apply(params, errp);

#2 tries to do the params check using tmp, which is expected to be set up
by #1, but #1 didn't use "&tmp", so "tmp" doesn't seem to store the valid
values as expected for the check (that is, #2 above isn't effectively doing
any check of the user input params).

The alternative fix would be to remove the intermediate "tmp" params,
but this might break the usage from commit 1bda8b3c6950, so we need thoughts
from Markus if we want to go for this approach.


RE: [PATCH v1] migration: fail the cap check if it requires the use of deferred incoming

2023-05-23 Thread Wang, Wei W
On Tuesday, May 23, 2023 10:50 PM, Peter Xu wrote:
> On Tue, May 23, 2023 at 02:30:25PM +0000, Wang, Wei W wrote:
> > > It's about whether we want to protect e.g. below steps:
> > >
> > > 1. start dest qemu with -incoming defer 2.
> > > "migrate-set-capabilities" to enable multifd 3. "migrate-incoming
> > > xxx" to setup the sockets
> > > 4. "migrate-set-parameters" to setup the num of multifd   <--- will be
> invalid here
> >
> > Yes, step 4 is invalid, but I think nobody cares about that (i.e. no
> > place uses the invalid value) as step2 already fails the cap setting (with
> error messages).
> 
> Since only until step 3 it setups the transport_data, so step 2 should be fine
> and not fail?  That's the whole point of my example or I missd something
> here..

OK, I misread something before.  Good point, thanks.


RE: [PATCH v1] migration: fail the cap check if it requires the use of deferred incoming

2023-05-23 Thread Wang, Wei W

On Tuesday, May 23, 2023 9:41 PM, Peter Xu wrote:
> On Tue, May 23, 2023 at 01:44:03AM +0000, Wang, Wei W wrote:
> > On Tuesday, May 23, 2023 7:36 AM, Peter Xu wrote:
> > > > > We may also want to trap the channel setups on num:
> > > > >
> > > > > migrate_params_test_apply():
> > > > >
> > > > > if (params->has_multifd_channels) {
> > > > > dest->multifd_channels = params->multifd_channels;
> > > > > }
> > > >
> > > > Didn’t get this one. What do you want to add to above?
> > >
> > > I meant after listen() is called with an explicit number in this
> > > case, should we disallow changing of multifd number of channels?
> >
> > Got you, thanks. That seems unnecessary to me, as the cap setting is
> > required for the use of multifd and patching there already achieves below
> what we want:
> > - users get the error message when deferred -incoming isn’t used;
> > - fail the cap setting for multifd, meaning that multifd won't be used (i.e.
> > no place that will care about multifd_channels).
> 
> It's about whether we want to protect e.g. below steps:
> 
> 1. start dest qemu with -incoming defer
> 2. "migrate-set-capabilities" to enable multifd 3. "migrate-incoming xxx" to
> setup the sockets
> 4. "migrate-set-parameters" to setup the num of multifd   <--- will be invalid
> here

Yes, step 4 is invalid, but I think nobody cares about that (i.e. no place uses
the invalid value), as step 2 already fails the cap setting (with error
messages).

> 
> Would that still be a problem that falls into the same category of what this
> patch wants to protect qemu from?

My intention was to notify the user explicitly that deferred -incoming must be
used for multifd (and, if that's not the case, to stop the use of multifd). I
think the patch already achieves this.
Adding such a check to migrate-set-parameters doesn't cause problems, but it
seems a bit redundant to me. Not sure if others have strong preferences to do
so for any reason.

Thanks,
Wei


RE: [PATCH v1] migration: fail the cap check if it requires the use of deferred incoming

2023-05-22 Thread Wang, Wei W
On Tuesday, May 23, 2023 7:36 AM, Peter Xu wrote:
> > > We may also want to trap the channel setups on num:
> > >
> > > migrate_params_test_apply():
> > >
> > > if (params->has_multifd_channels) {
> > > dest->multifd_channels = params->multifd_channels;
> > > }
> >
> > Didn’t get this one. What do you want to add to above?
> 
> I meant after listen() is called with an explicit number in this case, should 
> we
> disallow changing of multifd number of channels?

Got you, thanks. That seems unnecessary to me, as the cap setting is required
for the use of multifd, and patching there already achieves what we want below:
- users get the error message when deferred -incoming isn't used;
- the cap setting for multifd fails, meaning that multifd won't be used (i.e.
no place will care about multifd_channels).


RE: [PATCH v1] migration: fail the cap check if it requires the use of deferred incoming

2023-05-19 Thread Wang, Wei W
On Friday, May 19, 2023 11:34 PM, Peter Xu wrote:
> > Ah yes indeed it keeps working, because we apply -global bits before
> > setup sockets. Then it's fine by me since that's the only thing I
> > would still like to keep it working. :)
> >
> > If so, can we reword the error message a bit?  Obviously as you said
> > we're not really checking against -defer, but established channels.
> > The problem is if something is established without knowing multifd
> > being there it may not work for multifd or preempt, not strictly about 
> > defer.
> >
> > How about:
> >
> >   "Multifd/Preempt-Mode cannot be modified if incoming channel has
> setup"
> >
> > ?

Yes, I'll reword it a bit.

> 
> We may also want to trap the channel setups on num:
> 
> migrate_params_test_apply():
> 
> if (params->has_multifd_channels) {
> dest->multifd_channels = params->multifd_channels;
> }

Didn't get this one. What do you want to add to the above?


RE: [PATCH] multifd: Set a higher "backlog" default value for listen()

2023-05-18 Thread Wang, Wei W
On Friday, May 19, 2023 10:52 AM, Wang, Lei4 wrote:
> > We can change it to uint16_t or uint32_t, but need to see if listening
> > on a larger value is OK to everyone.
> 
> Is there any use case to use >256 migration channels? If not, then I suppose
> it's no need to increase it.

People can choose to use more than 256 channels to boost performance.
If it is determined that using more than 256 channels doesn't increase
performance on all the existing platforms, then we need to have that reflected
in the code explicitly, e.g. fail with an error message when the user does:
migrate_set_parameter multifd-channels 512

> 
> >
> > Man page of listen mentions that the  maximum length of the queue for
> > incomplete sockets can be set using
> > /proc/sys/net/ipv4/tcp_max_syn_backlog,
> > and it is 4096 by default on my machine


RE: [PATCH] multifd: Set a higher "backlog" default value for listen()

2023-05-18 Thread Wang, Wei W
On Friday, May 19, 2023 9:31 AM, Wang, Lei4 wrote:
> On 5/18/2023 17:16, Juan Quintela wrote:
> > Lei Wang  wrote:
> >> When destination VM is launched, the "backlog" parameter for listen()
> >> is set to 1 as default in socket_start_incoming_migration_internal(),
> >> which will lead to socket connection error (the queue of pending
> >> connections is full) when "multifd" and "multifd-channels" are set
> >> later on and a high number of channels are used. Set it to a
> >> hard-coded higher default value 512 to fix this issue.
> >>
> >> Reported-by: Wei Wang 
> >> Signed-off-by: Lei Wang 
> >
> > [cc'd daiel who is the maintainer of qio]
> >
> > My understanding of that value is that 230 or something like that
> > would be more than enough.  The maxiimum number of multifd channels is
> 256.
> 
> You are right, the "multifd-channels" expects uint8_t, so 256 is enough.
> 

We can change it to uint16_t or uint32_t, but we need to see whether listening
with a larger value is OK with everyone.

The man page of listen() mentions that the maximum length of the queue for
incomplete sockets can be set using /proc/sys/net/ipv4/tcp_max_syn_backlog,
and it is 4096 by default on my machine.


RE: [PATCH v1] migration: fail the cap check if it requires the use of deferred incoming

2023-05-18 Thread Wang, Wei W
On Friday, May 19, 2023 3:20 AM, Peter Xu wrote:
> On Fri, May 19, 2023 at 12:00:26AM +0800, Wei Wang wrote:
> > qemu_start_incoming_migration needs to check the number of multifd
> > channels or postcopy ram channels to configure the backlog parameter (i.e.
> > the maximum length to which the queue of pending connections for
> > sockfd may grow) of listen(). So multifd and postcopy-preempt caps
> > require the use of deferred incoming, that is, calling
> > qemu_start_incoming_migration should be deferred via qmp or hmp
> > commands after the cap of multifd and postcopy-preempt are configured.
> >
> > Check if deferred incoming is used when enabling multifd or
> > postcopy-preempt, and fail the check with error messages if not.
> >
> > Signed-off-by: Wei Wang 
> 
> IIUC this will unfortunately break things like:
> 
>   -global migration.x-postcopy-preempt=on
> 
> where the cap is actually applied before incoming starts even with !defer so
> it should still work.

Actually the patch doesn't check "!defer". It just checks if incoming has been
started or not. It allows the 2 caps to be set only before incoming starts.
So I think the above should work.

> 
> Can we just make socket_start_incoming_migration_internal() listen on a
> static but larger value?

Yes, agreed, and that was our initial change.
This needs listen() to create a longer queue for pending connections (seems OK
to me). We need to see Daniel's and Juan's opinion on this.


RE: [PATCH] multifd: Set a higher "backlog" default value for listen()

2023-05-18 Thread Wang, Wei W
On Thursday, May 18, 2023 8:43 PM, Juan Quintela wrote:
> 
> 
> Are you using -incoming defer?
> 
> No? right.
> 
> With multifd, you should use -incoming defer. 

Yes, just confirmed that it works well with deferred incoming.
I think we should enforce this kind of requirement in the code.
I'll make a patch for this.

Thanks,
Wei


RE: [PATCH] multifd: Set a higher "backlog" default value for listen()

2023-05-18 Thread Wang, Wei W
On Thursday, May 18, 2023 4:52 PM, Wang, Lei4 wrote:
> When destination VM is launched, the "backlog" parameter for listen() is set
> to 1 as default in socket_start_incoming_migration_internal(), which will
> lead to socket connection error (the queue of pending connections is full)
> when "multifd" and "multifd-channels" are set later on and a high number of
> channels are used. Set it to a hard-coded higher default value 512 to fix this
> issue.
> 
> Reported-by: Wei Wang 
> Signed-off-by: Lei Wang 
> ---
>  migration/socket.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/socket.c b/migration/socket.c index
> 1b6f5baefb..b43a66ef7e 100644
> --- a/migration/socket.c
> +++ b/migration/socket.c
> @@ -179,7 +179,7 @@
> socket_start_incoming_migration_internal(SocketAddress *saddr,
>  QIONetListener *listener = qio_net_listener_new();
>  MigrationIncomingState *mis = migration_incoming_get_current();
>  size_t i;
> -int num = 1;
> +int num = 512;
> 

Probably we need a macro for it, e.g.
#define MIGRATION_CHANNEL_MAX  512

Also, I think the lines below could be removed, as using a larger value of num
(i.e. 512) doesn't seem to consume more resources anywhere:
-if (migrate_use_multifd()) {
-num = migrate_multifd_channels();
-} else if (migrate_postcopy_preempt()) {
-num = RAM_CHANNEL_MAX;
-}
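
For illustration only, the simplified version could then look roughly like this
(the surrounding code and names are from memory and may not match the tree
exactly):

#define MIGRATION_CHANNEL_MAX  512

static void
socket_start_incoming_migration_internal(SocketAddress *saddr, Error **errp)
{
    QIONetListener *listener = qio_net_listener_new();
    /* no need to derive the backlog from the multifd/preempt settings */
    int num = MIGRATION_CHANNEL_MAX;

    qio_net_listener_set_name(listener, "migration-socket-listener");
    if (qio_net_listener_open_sync(listener, saddr, num, errp) < 0) {
        object_unref(OBJECT(listener));
        return;
    }
    /* install the accept handler as before */
}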



RE: Future of icount discussion for next KVM call?

2023-03-03 Thread Wang, Wei W
On Thursday, February 16, 2023 10:36 PM, Wang, Wei W wrote:
> > On Thursday, February 16, 2023 9:57 PM, Juan Quintela wrote:
> > > Just to see what we are having now:
> > >
> > > - single qemu binary moved to next slot (moved to next week?)
> > >   Phillipe proposal
> > > - TDX migration: we have the slides, but no code
> > >   So I guess we can move it to the following slot, when we have a chance
> > >   to look at the code, Wei?
> >
> > It's ok to me to continue the discussion on either Feb 21st or March
> > 7th, and I plan to finish some update and share the code before end of
> next week.
> 
> KVM code can be read here: https://github.com/intel/tdx/  tdx-mig-wip
> QEMU code will be shared soon.

QEMU code can be read here: https://github.com/intel/qemu-tdx/   tdx-mig-wip


RE: Future of icount discussion for next KVM call?

2023-02-24 Thread Wang, Wei W


> -Original Message-
> From: Wang, Wei W
> Sent: Thursday, February 16, 2023 10:36 PM
> To: quint...@redhat.com; Alex Bennée 
> Cc: Paolo Bonzini ; Pavel Dovgalyuk
> ; qemu-devel@nongnu.org; Richard Henderson
> ; Mark Burton
> ; Bill Mills ; Marco
> Liebel ; Alexandre Iooss ;
> Mahmoud Mandour ; Emilio Cota
> ; kvm-devel ; Philippe Mathieu-
> Daudé 
> Subject: RE: Future of icount discussion for next KVM call?
> 
> On Thursday, February 16, 2023 9:57 PM, Juan Quintela wrote:
> > Just to see what we are having now:
> >
> > - single qemu binary moved to next slot (moved to next week?)
> >   Phillipe proposal
> > - TDX migration: we have the slides, but no code
> >   So I guess we can move it to the following slot, when we have a chance
> >   to look at the code, Wei?
> 
> It's ok to me to continue the discussion on either Feb 21st or March 7th, and 
> I
> plan to finish some update and share the code before end of next week.

KVM code can be read here: https://github.com/intel/tdx/  tdx-mig-wip
QEMU code will be shared soon.


RE: Future of icount discussion for next KVM call?

2023-02-16 Thread Wang, Wei W
On Thursday, February 16, 2023 9:57 PM, Juan Quintela wrote:
> Just to see what we are having now:
> 
> - single qemu binary moved to next slot (moved to next week?)
>   Phillipe proposal
> - TDX migration: we have the slides, but no code
>   So I guess we can move it to the following slot, when we have a chance
>   to look at the code, Wei?

It's OK with me to continue the discussion on either Feb 21st or March 7th, and
I plan to finish some updates and share the code before the end of next week.

Thanks,
Wei


RE: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory

2023-01-29 Thread Wang, Wei W
On Monday, January 30, 2023 1:26 PM, Ackerley Tng wrote:
> 
> > +static int restrictedmem_getattr(struct user_namespace *mnt_userns,
> > +const struct path *path, struct kstat *stat,
> > +u32 request_mask, unsigned int query_flags)
> {
> > +   struct inode *inode = d_inode(path->dentry);
> > +   struct restrictedmem_data *data = inode->i_mapping->private_data;
> > +   struct file *memfd = data->memfd;
> > +
> > +   return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> > +request_mask, query_flags);
> 
> Instead of calling shmem's getattr() with path, we should be using the the
> memfd's path.
> 
> Otherwise, shmem's getattr() will use restrictedmem's inode instead of
> shmem's inode. The private fields will be of the wrong type, and the host will
> crash when shmem_is_huge() does SHMEM_SB(inode->i_sb)->huge), since
> inode->i_sb->s_fs_info is NULL for the restrictedmem's superblock.
> 
> Here's the patch:
> 
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c index
> 37191cd9eed1..06b72d593bd8 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -84,7 +84,7 @@ static int restrictedmem_getattr(struct user_namespace
> *mnt_userns,
>   struct restrictedmem *rm = inode->i_mapping->private_data;
>   struct file *memfd = rm->memfd;
> 
> - return memfd->f_inode->i_op->getattr(mnt_userns, path, stat,
> + return memfd->f_inode->i_op->getattr(mnt_userns, &memfd->f_path, stat,
>request_mask, query_flags);
>   }
> 

Nice catch. I also encountered this issue during my work.
The fix can be further enforced on the shmem side:

index c301487be5fb..d850c0190359 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -472,8 +472,9 @@ bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
   pgoff_t index, bool shmem_huge_force)
 {
loff_t i_size;
+   struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

-   if (!S_ISREG(inode->i_mode))
+   if (!sbinfo || !S_ISREG(inode->i_mode))
return false;
if (vma && ((vma->vm_flags & VM_NOHUGEPAGE) ||
            test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)))
@@ -485,7 +486,7 @@ bool shmem_is_huge(struct vm_area_struct *vma, struct inode *inode,
if (shmem_huge == SHMEM_HUGE_DENY)
return false;

-   switch (SHMEM_SB(inode->i_sb)->huge) {
+   switch (sbinfo->huge) {
case SHMEM_HUGE_ALWAYS:
return true;
case SHMEM_HUGE_WITHIN_SIZE:


RE: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

2023-01-02 Thread Wang, Wei W
On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > Because guest memory defaults to private, and now this patch stores
> > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> _SHARED,
> > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > mem attr in advance.
> 
> KVM defaults to 'shared' because this ioctl can also be potentially used by
> normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> confidential VMs. 

Do you mean a normal VM could have pages marked private? What's the use case?
(If all the pages are just marked shared for normal VMs, then why do we need it?)

> As for more KVM_EXIT_MEMORY_FAULT exits during the
> booting time, yes, setting all memory to 'private' for confidential VMs 
> through
> this ioctl in userspace before guest launch is an approach for KVM userspace 
> to
> 'override' the KVM default and reduce the number of implicit conversions.

Most pages of a confidential VM are likely to be private pages. It seems more
efficient (and not difficult to check vm_type) to have KVM default to "private"
for confidential VMs and default to "shared" for normal VMs.



RE: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Wang, Wei W
On Thursday, September 15, 2022 10:29 PM, Chao Peng wrote:
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +  int *order)

Better to remove "order" from this interface?
Some callers only need to get pfn, and no need to bother with
defining and inputting something unused. For callers who need the "order",
can easily get it via thp_order(pfn_to_page(pfn)) on their own.



RE: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd

2022-09-22 Thread Wang, Wei W
On Thursday, September 22, 2022 5:11 AM, Andy Lutomirski wrote:
> To: Christopherson,, Sean ; David Hildenbrand
> 
> Cc: Chao Peng ; kvm list
> ; Linux Kernel Mailing List
> ; linux...@kvack.org;
> linux-fsde...@vger.kernel.org; Linux API ;
> linux-...@vger.kernel.org; qemu-devel@nongnu.org; Paolo Bonzini
> ; Jonathan Corbet ; Vitaly
> Kuznetsov ; Wanpeng Li ;
> Jim Mattson ; Joerg Roedel ;
> Thomas Gleixner ; Ingo Molnar ;
> Borislav Petkov ; the arch/x86 maintainers ;
> H. Peter Anvin ; Hugh Dickins ; Jeff
> Layton ; J . Bruce Fields ; Andrew
> Morton ; Shuah Khan ;
> Mike Rapoport ; Steven Price ;
> Maciej S . Szmigiero ; Vlastimil Babka
> ; Vishal Annapurve ; Yu Zhang
> ; Kirill A. Shutemov
> ; Nakajima, Jun ;
> Hansen, Dave ; Andi Kleen ;
> aarca...@redhat.com; ddut...@redhat.com; dhild...@redhat.com; Quentin
> Perret ; Michael Roth ;
> Hocko, Michal ; Muchun Song
> ; Wang, Wei W ;
> Will Deacon ; Marc Zyngier ; Fuad Tabba
> 
> Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible
> memfd
> 
> (please excuse any formatting disasters.  my internet went out as I was
> composing this, and i did my best to rescue it.)
> 
> On Mon, Sep 19, 2022, at 12:10 PM, Sean Christopherson wrote:
> > +Will, Marc and Fuad (apologies if I missed other pKVM folks)
> >
> > On Mon, Sep 19, 2022, David Hildenbrand wrote:
> >> On 15.09.22 16:29, Chao Peng wrote:
> >> > From: "Kirill A. Shutemov" 
> >> >
> >> > KVM can use memfd-provided memory for guest memory. For normal
> >> > userspace accessible memory, KVM userspace (e.g. QEMU) mmaps the
> >> > memfd into its virtual address space and then tells KVM to use the
> >> > virtual address to setup the mapping in the secondary page table (e.g.
> EPT).
> >> >
> >> > With confidential computing technologies like Intel TDX, the
> >> > memfd-provided memory may be encrypted with special key for special
> >> > software domain (e.g. KVM guest) and is not expected to be directly
> >> > accessed by userspace. Precisely, userspace access to such
> >> > encrypted memory may lead to host crash so it should be prevented.
> >>
> >> Initially my thaught was that this whole inaccessible thing is TDX
> >> specific and there is no need to force that on other mechanisms.
> >> That's why I suggested to not expose this to user space but handle
> >> the notifier requirements internally.
> >>
> >> IIUC now, protected KVM has similar demands. Either access
> >> (read/write) of guest RAM would result in a fault and possibly crash
> >> the hypervisor (at least not the whole machine IIUC).
> >
> > Yep.  The missing piece for pKVM is the ability to convert from shared
> > to private while preserving the contents, e.g. to hand off a large
> > buffer (hundreds of MiB) for processing in the protected VM.  Thoughts
> > on this at the bottom.
> >
> >> > This patch introduces userspace inaccessible memfd (created with
> >> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace
> >> > through ordinary MMU access (e.g. read/write/mmap) but can be
> >> > accessed via in-kernel interface so KVM can directly interact with
> >> > core-mm without the need to map the memory into KVM userspace.
> >>
> >> With secretmem we decided to not add such "concept switch" flags and
> >> instead use a dedicated syscall.
> >>
> >
> > I have no personal preference whatsoever between a flag and a
> > dedicated syscall, but a dedicated syscall does seem like it would
> > give the kernel a bit more flexibility.
> 
> The third option is a device node, e.g. /dev/kvm_secretmem or
> /dev/kvm_tdxmem or similar.  But if we need flags or other details in the
> future, maybe this isn't ideal.
> 
> >
> >> What about memfd_inaccessible()? Especially, sealing and hugetlb are
> >> not even supported and it might take a while to support either.
> >
> > Don't know about sealing, but hugetlb support for "inaccessible"
> > memory needs to come sooner than later.  "inaccessible" in quotes
> > because we might want to choose a less binary name, e.g.
> > "restricted"?.
> >
> > Regarding pKVM's use case, with the shim approach I believe this can
> > be done by allowing userspace mmap() the "hidden" memfd, but with a
> > ton of restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly tight

RE: [RFC PATCH 6/7] x86: Use new XSAVE ioctls handling

2022-01-11 Thread Wang, Wei W
On Wednesday, January 12, 2022 10:51 AM, Zeng, Guang wrote:
> To: Tian, Kevin ; Zhong, Yang ;
> qemu-devel@nongnu.org
> Cc: pbonz...@redhat.com; Christopherson,, Sean ;
> jing2@linux.intel.com; Wang, Wei W 
> Subject: Re: [RFC PATCH 6/7] x86: Use new XSAVE ioctls handling
> 
> On 1/11/2022 10:30 AM, Tian, Kevin wrote:
> >> From: Zeng, Guang 
> >> Sent: Monday, January 10, 2022 5:47 PM
> >>
> >> On 1/10/2022 4:40 PM, Tian, Kevin wrote:
> >>>> From: Zhong, Yang 
> >>>> Sent: Friday, January 7, 2022 5:32 PM
> >>>>
> >>>> From: Jing Liu 
> >>>>
> >>>> Extended feature has large state while current kvm_xsave only
> >>>> allows 4KB. Use new XSAVE ioctls if the xstate size is large than
> >>>> kvm_xsave.
> >>> shouldn't we always use the new xsave ioctls as long as
> >>> CAP_XSAVE2 is available?
> >>
> >> CAP_XSAVE2 may return legacy xsave size or 0 working with old kvm
> >> version in which it's not available.
> >> QEMU just use the new xsave ioctls only when the return value of
> >> CAP_XSAVE2 is larger than legacy xsave size.
> > CAP_XSAVE2  is the superset of CAP_XSAVE. If available it can support
> > both legacy 4K size or bigger.
> 
> Got your point now. We can use new ioctl once CAP_XSAVE2 is available.
> As your suggestion, I'd like to change commit log as follows:
> 
> "x86: Use new XSAVE ioctls handling
> 
>    Extended feature has large state while current
>    kvm_xsave only allows 4KB. Use new XSAVE ioctls
>    if check extension of CAP_XSAVE2 is available."
> 
> And introduce has_xsave2 to indicate the valid of CAP_XSAVE2 with following
> change:
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index
> 97520e9dff..c8dae88ced 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -116,6 +116,7 @@ static bool has_msr_ucode_rev;
>   static bool has_msr_vmx_procbased_ctls2;
>   static bool has_msr_perf_capabs;
>   static bool has_msr_pkrs;
> +static bool has_xsave2 = false;

It's 0-initialized, so I think there is no need for the "false" assignment.
Probably better to use "int" (like has_xsave). I also improved it a bit:

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 3fb3ddbe2b..dee40ad0ad 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -122,6 +122,7 @@ static uint32_t num_architectural_pmu_gp_counters;
 static uint32_t num_architectural_pmu_fixed_counters;

 static int has_xsave;
+static int has_xsave2;
 static int has_xcrs;
 static int has_pit_state2;
 static int has_exception_payload;
@@ -1564,6 +1565,26 @@ static Error *invtsc_mig_blocker;

 #define KVM_MAX_CPUID_ENTRIES  100

+static void kvm_init_xsave(CPUX86State *env)
+{
+if (has_xsave2) {
+env->xsave_buf_len = QEMU_ALIGN_UP(has_xsave2, 4096);
+} else if (has_xsave) {
+env->xsave_buf_len = sizeof(struct kvm_xsave);
+} else {
+return;
+}
+
+env->xsave_buf = qemu_memalign(4096, env->xsave_buf_len);
+memset(env->xsave_buf, 0, env->xsave_buf_len);
+ /*
+  * The allocated storage must be large enough for all of the
+  * possible XSAVE state components.
+  */
+assert(kvm_arch_get_supported_cpuid(kvm_state, 0xd, 0, R_ECX) <=
+   env->xsave_buf_len);
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
 struct {
@@ -1982,18 +2003,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
 goto fail;
 }

-if (has_xsave) {
-env->xsave_buf_len = sizeof(struct kvm_xsave);
-env->xsave_buf = qemu_memalign(4096, env->xsave_buf_len);
-memset(env->xsave_buf, 0, env->xsave_buf_len);
-
-/*
- * The allocated storage must be large enough for all of the
- * possible XSAVE state components.
- */
-assert(kvm_arch_get_supported_cpuid(kvm_state, 0xd, 0, R_ECX)
-   <= env->xsave_buf_len);
-}
+kvm_init_xsave(env);

 max_nested_state_len = kvm_max_nested_state_length();
 if (max_nested_state_len > 0) {
@@ -2323,6 +2333,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 }

 has_xsave = kvm_check_extension(s, KVM_CAP_XSAVE);
+has_xsave2 = kvm_check_extension(s, KVM_CAP_XSAVE2);
 has_xcrs = kvm_check_extension(s, KVM_CAP_XCRS);
 has_pit_state2 = kvm_check_extension(s, KVM_CAP_PIT_STATE2);

@@ -3241,13 +3252,14 @@ static int kvm_get_xsave(X86CPU *cpu)
 {
 CPUX86State *env = &cpu->env;
 void *xsave = env->xsave_buf;
-int ret;
+int type, ret;

 if (!has_xsave) {
 return kvm_get_fpu(cpu);
 }

-ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_XSAVE, xsave);
+type = has_xsave2 ? KVM_GET_XSAVE2: KVM_GET_XSAVE;
+ret = kvm_vcpu_ioctl(CPU(cpu), type, xsave);
 if (ret < 0) {
 return ret;
 }


RE: [PATCH 1/2] virito-balloon: process all in sgs for free_page_vq

2021-11-25 Thread Wang, Wei W
On Friday, November 26, 2021 10:31 AM, Jason Wang wrote:
> 
> I've tested the code with migration before sending the patches, I see the hint
> works fine.
> 

That's great (I assume you saw a great reduction in the migration time as well).
Reviewed-by: Wei Wang 

Thanks,
Wei


RE: [PATCH 1/2] virito-balloon: process all in sgs for free_page_vq

2021-11-25 Thread Wang, Wei W
On Friday, November 26, 2021 12:11 AM, David Hildenbrand wrote:
> On 25.11.21 17:09, Michael S. Tsirkin wrote:
> > On Thu, Nov 25, 2021 at 09:28:59AM +0100, David Hildenbrand wrote:
> >> On 25.11.21 03:20, Jason Wang wrote:
> >>> We only process the first in sg which may lead to the bitmap of the
> >>> pages belongs to following sgs were not cleared. This may result
> >>> more pages to be migrated. Fixing this by process all in sgs for
> >>> free_page_vq.
> >>>
> >>> Signed-off-by: Jason Wang 
> >>> ---
> >>>  hw/virtio/virtio-balloon.c | 7 +--
> >>>  1 file changed, 5 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> >>> index c6962fcbfe..17de2558cb 100644
> >>> --- a/hw/virtio/virtio-balloon.c
> >>> +++ b/hw/virtio/virtio-balloon.c
> >>> @@ -510,6 +510,7 @@ static bool get_free_page_hints(VirtIOBalloon
> *dev)
> >>>  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> >>>  VirtQueue *vq = dev->free_page_vq;
> >>>  bool ret = true;
> >>> +int i;
> >>>
> >>>  while (dev->block_iothread) {
> >>>  qemu_cond_wait(&dev->free_page_cond, &dev->free_page_lock);
> >>> @@ -544,8 +545,10 @@ static bool get_free_page_hints(VirtIOBalloon
> *dev)
> >>>  }
> >>>
> >>>  if (elem->in_num && dev->free_page_hint_status ==
> FREE_PAGE_HINT_S_START) {
> >>> -qemu_guest_free_page_hint(elem->in_sg[0].iov_base,
> >>> -  elem->in_sg[0].iov_len);
> >>> +for (i = 0; i < elem->in_num; i++) {
> >>> +qemu_guest_free_page_hint(elem->in_sg[i].iov_base,
> >>> +  elem->in_sg[i].iov_len);
> >>> +}
> >>>  }
> >>>
> >>>  out:
> >>>
> >>
> >> Yes, but:
> >>
> >> 1. Linux never used more than one
> >> 2. QEMU never consumed more than one

Yes, it works based on the fact that Linux only sends one hint each time.

> >>
> >> The spec states:
> >>
> >> "(b) The driver maps a series of pages and adds them to the
> >> free_page_vq as individual scatter-gather input buffer entries."
> >>
> >> However, the spec was written by someone else (Alex) as the code was
> >> (Wei). The code was there first.
> >>
> >> I don't particularly care what to adjust (code or spec). However, to
> >> me it feels more like the spec is slightly wrong and it was intended
> >> like the code is by the original code author.
> >>
> >> But then again, I don't particularly care :)
> >
> > Original QEMU side code had several bugs so, that's another one.
> > Given nothing too bad happens if guest submits too many S/Gs, and
> > given the spec also has a general chapter suggesting devices are
> > flexible in accepting a single buffer split to multiple S/Gs, I'm
> > inclined to accept the patch.
> 
> Yeah, as I said, I don't particularly care. It's certainly an "easy change".
> 
> Acked-by: David Hildenbrand 
> 

I don't object to the change.
Just in case of something unexpected, it would be better if someone could help
do a test.

Thanks,
Wei


RE: [PATCH v4 14/14] kvm: Add support for userspace MSR filtering and handling of MSR_KVM_MIGRATION_CONTROL.

2021-09-10 Thread Wang, Wei W
On Friday, September 10, 2021 5:14 PM, Ashish Kalra wrote:
> > It seems this is enabled/disabled by the guest, which means that the guest
> can always refuse to be migrated?
> >
> 
> Yes.
> 
> Are there any specific concerns/issues with that ?

It's kind of weird if everybody refuses migration from inside the guest
even though they ticked the option "agree to be migrated" when they bought
the guest (e.g. at a discounted price).

In general, I think the migration decision should be made by the management
s/w, not by the migratable guest.

Best,
Wei



RE: [PATCH v4 10/14] migration: add support to migrate shared regions list

2021-09-10 Thread Wang, Wei W
On Friday, September 10, 2021 4:48 PM, Ashish Kalra wrote:
> On Fri, Sep 10, 2021 at 07:54:10AM +0000, Wang, Wei W wrote:
> There has been a long discussion on this implementation on KVM mailing list.
> Tracking shared memory via a list of ranges instead of using bitmap is more
> optimal. Most of the guest memory will be private and the unencrypted/shared
> regions are basically ranges/intervals, so easy to implement and maintain 
> using
> lists.

OK. At which version did you discuss this, or do you have a link? (I didn't find
it in the v9 KVM patches.)

> A list will consume much less memory than a bitmap.
> 
> The bitmap will consume more memory as it will need to be sized as per guest
> RAM size and will remain sparsely populated due to limited amount of
> shared/unencrypted guest memory regions.

I also thought about this. It depends on the guest.
I think "a list will consume much less memory" is true when we assume most of
the guest pages are private pages.
From a design perspective, what if the guest chooses to have most of its pages
shared? Lists might consume much more memory than bitmaps in some cases, I
think. (Probably I can check your previous discussions first.)
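
To put rough numbers on it (my own back-of-the-envelope estimate, just for
illustration): for a 64 GiB guest with 4 KiB pages there are 16M pages, so a
one-bit-per-page bitmap is a fixed 2 MiB, while a list of a few hundred shared
ranges at tens of bytes per node stays in the KiB range. But if a guest ends up
with millions of small shared ranges, the list could grow well beyond the
bitmap.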

Best,
Wei



RE: [PATCH v4 14/14] kvm: Add support for userspace MSR filtering and handling of MSR_KVM_MIGRATION_CONTROL.

2021-09-10 Thread Wang, Wei W
On Wednesday, August 4, 2021 8:00 PM, Ashish Kalra wrote:
> +/*
> + * Currently this exit is only used by SEV guests for
> + * MSR_KVM_MIGRATION_CONTROL to indicate if the guest
> + * is ready for migration.
> + */
> +static int kvm_handle_x86_msr(X86CPU *cpu, struct kvm_run *run) {
> +static uint64_t msr_kvm_migration_control;
> +
> +if (run->msr.index != MSR_KVM_MIGRATION_CONTROL) {
> +run->msr.error = -EINVAL;
> +return -1;
> +}
> +
> +switch (run->exit_reason) {
> +case KVM_EXIT_X86_RDMSR:
> +run->msr.error = 0;
> +run->msr.data = msr_kvm_migration_control;
> +break;
> +case KVM_EXIT_X86_WRMSR:
> +msr_kvm_migration_control = run->msr.data;
> +if (run->msr.data == KVM_MIGRATION_READY) {
> +sev_del_migrate_blocker();

It seems this is enabled/disabled by the guest, which means that the guest can 
always refuse to be migrated?

Best,
Wei



RE: [PATCH v4 10/14] migration: add support to migrate shared regions list

2021-09-10 Thread Wang, Wei W
> From: Brijesh Singh 
> 
> When memory encryption is enabled, the hypervisor maintains a shared
> regions list which is referred by hypervisor during migration to check if 
> page is
> private or shared. This list is built during the VM bootup and must be 
> migrated
> to the target host so that hypervisor on target host can use it for future
> migration.
> 
> Signed-off-by: Brijesh Singh 
> Co-developed-by: Ashish Kalra 
> Signed-off-by: Ashish Kalra 
> ---
>  include/sysemu/sev.h |  2 ++
>  target/i386/sev.c| 43
> +++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/include/sysemu/sev.h b/include/sysemu/sev.h index
> 3b913518c0..118ee66406 100644
> --- a/include/sysemu/sev.h
> +++ b/include/sysemu/sev.h
> @@ -32,5 +32,7 @@ void sev_es_set_reset_vector(CPUState *cpu);  int
> sev_remove_shared_regions_list(unsigned long gfn_start,
> unsigned long gfn_end);  int
> sev_add_shared_regions_list(unsigned long gfn_start, unsigned long gfn_end);
> +int sev_save_outgoing_shared_regions_list(QEMUFile *f); int
> +sev_load_incoming_shared_regions_list(QEMUFile *f);
> 
>  #endif
> diff --git a/target/i386/sev.c b/target/i386/sev.c index
> 6d44b7ad21..789051f7b4 100644
> --- a/target/i386/sev.c
> +++ b/target/i386/sev.c
> @@ -135,10 +135,15 @@ static const char *const sev_fw_errlist[] = {
> 
>  #define SEV_FW_BLOB_MAX_SIZE0x4000  /* 16KB
> */
> 
> +#define SHARED_REGION_LIST_CONT 0x1
> +#define SHARED_REGION_LIST_END  0x2
> +
>  static struct ConfidentialGuestMemoryEncryptionOps
> sev_memory_encryption_ops = {
>  .save_setup = sev_save_setup,
>  .save_outgoing_page = sev_save_outgoing_page,
>  .load_incoming_page = sev_load_incoming_page,
> +.save_outgoing_shared_regions_list =
> sev_save_outgoing_shared_regions_list,
> +.load_incoming_shared_regions_list =
> + sev_load_incoming_shared_regions_list,

Hi Ashish,
I have some questions about the callbacks:

1) Why use a list of shared regions, instead of bitmaps, to record the
private/shared state?
I saw that the KVM side implementation used bitmaps in the first place and
changed to shared regions since v10, but I couldn't find the reason.

2) Why is the save/load of the shared region list (or bitmap) made vendor
specific?
I think it can be a common interface and data structure, e.g. KVM maintains a
per-memory-slot bitmap to be obtained by QEMU.

Best,
Wei



RE: [PATCH v3] migration: clear the memory region dirty bitmap when skipping free pages

2021-07-23 Thread Wang, Wei W
On Friday, July 23, 2021 4:17 PM, David Hildenbrand wrote:
> > On Friday, July 23, 2021 3:50 PM, David Hildenbrand wrote:
> >>
> >> Migration of a 8 GiB VM
> >> * within the same host
> >> * after Linux is up and idle
> >> * free page hinting enabled
> >> * after dirtying most VM memory using memhog
> >
> > Thanks for the tests!
> >
> > I think it would be better to test using idle guests (no memhog).
> > With memhog eating most of the guest free pages, it's likely no or very few
> free pages are reported during the test.
> 
> *After dirtying*. memhog is no longer running.
> 
> ... and also look again at the numbers how much memory we actually migrate :)
> 

OK, I was looking at the duplicate-page number, which was big in the idle guest
case you posted before.
This looks good to me: the dirtied pages became free and got skipped as well.
Thanks!

Best,
Wei


RE: [PATCH v3] migration: clear the memory region dirty bitmap when skipping free pages

2021-07-23 Thread Wang, Wei W
On Friday, July 23, 2021 3:50 PM, David Hildenbrand wrote:
> 
> Migration of a 8 GiB VM
> * within the same host
> * after Linux is up and idle
> * free page hinting enabled
> * after dirtying most VM memory using memhog

Thanks for the tests!

I think it would be better to test using idle guests (no memhog).
With memhog eating most of the guest's free pages, it's likely that no or very
few free pages are reported during the test.

Best,
Wei


RE: [PATCH v3] migration: clear the memory region dirty bitmap when skipping free pages

2021-07-22 Thread Wang, Wei W
On Thursday, July 22, 2021 5:48 PM, David Hildenbrand wrote:
> On 22.07.21 10:30, Wei Wang wrote:
> > When skipping free pages to send, their corresponding dirty bits in
> > the memory region dirty bitmap need to be cleared. Otherwise the
> > skipped pages will be sent in the next round after the migration
> > thread syncs dirty bits from the memory region dirty bitmap.
> >
> > Cc: David Hildenbrand 
> > Cc: Peter Xu 
> > Cc: Michael S. Tsirkin 
> > Reported-by: David Hildenbrand 
> > Signed-off-by: Wei Wang 
> > ---
> >   migration/ram.c | 74
> +
> >   1 file changed, 56 insertions(+), 18 deletions(-)
> >
> 
> LGTM, thanks
> 
> Reviewed-by: David Hildenbrand 
> 

Thanks. Please remember to run a regression test together with Peter Xu's patch
when you get a chance.

Best,
Wei


RE: [PATCH v2] migration: clear the memory region dirty bitmap when skipping free pages

2021-07-18 Thread Wang, Wei W
On Friday, July 16, 2021 4:26 PM, David Hildenbrand wrote:
> >>> +/*
> >>> + * CLEAR_BITMAP_SHIFT_MIN should always guarantee this... this
> >>> + * can make things easier sometimes since then start address
> >>> + * of the small chunk will always be 64 pages aligned so the
> >>> + * bitmap will always be aligned to unsigned long. We should
> >>> + * even be able to remove this restriction but I'm simply
> >>> + * keeping it.
> >>> + */
> >>> +assert(shift >= 6);
> >>> +
> >>> +size = 1ULL << (TARGET_PAGE_BITS + shift);
> >>> +start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size);
> >>
> >> these as well as.
> >
> > Is there any coding style requirement for this?
> 
> Don't think so. It simply results in less LOC and less occurrences of 
> variables.
> 
> > My thought was that those operations could mostly be avoided if they
> > don't pass the above if condition (e.g. just once per 1GB chunk).
> 
> Usually the compiler will reshuffle as possible to optimize. But in this 
> case, due
> to clear_bmap_test_and_clear(), it might not be able to move the
> computations behind that call. So the final code might actually differ.
> 
> Not that we really care about this micro-optimization, though.

OK, looks like that's just a personal preference. I'm inclined to keep the
micro-optimization.

> 
> >
> >>
> >>> +trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
> >>> +memory_region_clear_dirty_bitmap(rb->mr, start, size); }
> >>> +
> >>> +static void
> >>> +migration_clear_memory_region_dirty_bitmap_range(RAMState *rs,
> >>> + RAMBlock
> *rb,
> >>> + unsigned
> long
> >> start,
> >>> + unsigned
> long
> >>> +npages) {
> >>> +unsigned long page_to_clear, i, nchunks;
> >>> +unsigned long chunk_pages = 1UL << rb->clear_bmap_shift;
> >>> +
> >>> +nchunks = (start + npages) / chunk_pages - start / chunk_pages
> >>> + + 1;
> >>
> >> Wouldn't you have to align the start and the end range up/down to
> >> properly calculate the number of chunks?
> >
> > No, divide will round it to the integer (beginning of the chunk to clear).
> 
> 
> nchunks = (start + npages) / chunk_pages - start / chunk_pages + 1;

I made a mistake on the right boundary: it should be [start, start + npages)
instead of [start, start + npages], i.e.
nchunks = (start + npages - 1) / chunk_pages - start / chunk_pages + 1
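For example (illustrative numbers only): with clear_bmap_shift = 18,
chunk_pages = 262144 (1 GiB worth of 4 KiB pages); start = 100 and
npages = 262144 covers pages 100 .. 262243, so
nchunks = 262243/262144 - 100/262144 + 1 = 1 - 0 + 1 = 2, i.e. both the first
and the second chunk get cleared.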

But I can take your approach here, thanks.

Best,
Wei



RE: [PATCH v2] migration: clear the memory region dirty bitmap when skipping free pages

2021-07-16 Thread Wang, Wei W
On Thursday, July 15, 2021 5:29 PM, David Hildenbrand wrote:
> On 15.07.21 09:53, Wei Wang wrote:
> > When skipping free pages to send, their corresponding dirty bits in
> > the memory region dirty bitmap need to be cleared. Otherwise the
> > skipped pages will be sent in the next round after the migration
> > thread syncs dirty bits from the memory region dirty bitmap.
> >
> > Cc: David Hildenbrand 
> > Cc: Peter Xu 
> > Cc: Michael S. Tsirkin 
> > Reported-by: David Hildenbrand 
> > Signed-off-by: Wei Wang 
> > ---
> >   migration/ram.c | 72
> -
> >   1 file changed, 54 insertions(+), 18 deletions(-)
> >
> > v1->v2 changelog:
> > - move migration_clear_memory_region_dirty_bitmap under bitmap_mutex
> as
> >we lack confidence to have it outside the lock for now.
> > - clean the unnecessary subproject commit.
> >
> > diff --git a/migration/ram.c b/migration/ram.c index
> > b5fc454b2f..69e06b55ec 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -789,6 +789,51 @@ unsigned long
> migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
> >   return find_next_bit(bitmap, size, start);
> >   }
> >
> > +static void migration_clear_memory_region_dirty_bitmap(RAMState *rs,
> > +
> RAMBlock *rb,
> > +
> unsigned long
> > +page) {
> > +uint8_t shift;
> > +hwaddr size, start;
> > +
> > +if (!rb->clear_bmap || !clear_bmap_test_and_clear(rb, page)) {
> > +return;
> > +}
> > +
> > +shift = rb->clear_bmap_shift;
> 
> You could initialize this right at the beginning of the function without 
> doing any
> harm.
> 
> > +/*
> > + * CLEAR_BITMAP_SHIFT_MIN should always guarantee this... this
> > + * can make things easier sometimes since then start address
> > + * of the small chunk will always be 64 pages aligned so the
> > + * bitmap will always be aligned to unsigned long. We should
> > + * even be able to remove this restriction but I'm simply
> > + * keeping it.
> > + */
> > +assert(shift >= 6);
> > +
> > +size = 1ULL << (TARGET_PAGE_BITS + shift);
> > +start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size);
> 
> these as well as.

Is there any coding style requirement for this?
My thought was that those operations could mostly be avoided if they don't pass
the above if condition (e.g. just once per 1GB chunk).

> 
> > +trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
> > +memory_region_clear_dirty_bitmap(rb->mr, start, size); }
> > +
> > +static void
> > +migration_clear_memory_region_dirty_bitmap_range(RAMState *rs,
> > + RAMBlock *rb,
> > + unsigned long
> start,
> > + unsigned long
> > +npages) {
> > +unsigned long page_to_clear, i, nchunks;
> > +unsigned long chunk_pages = 1UL << rb->clear_bmap_shift;
> > +
> > +nchunks = (start + npages) / chunk_pages - start / chunk_pages +
> > + 1;
> 
> Wouldn't you have to align the start and the end range up/down to properly
> calculate the number of chunks?

No, the division rounds down to an integer (the beginning of the chunk to clear).

> 
> The following might be better and a little easier to grasp:
> 
> unsigned long chunk_pages = 1ULL << rb->clear_bmap_shift; unsigned long
> aligned_start = QEMU_ALIGN_DOWN(start, chunk_pages); unsigned long
> aligned_end = QEMU_ALIGN_UP(start + npages, chunk_pages)
> 
> /*
>   * Clear the clar_bmap of all covered chunks. It's sufficient to call it for
>   * one page within a chunk.
>   */
> for (start = aligned_start, start != aligned_end, start += chunk_pages) {

What if "aligned_end == start + npages"?
i.e. the above start + npages is aligned by itself, without QEMU_ALIGN_UP().
For example, the chunk size is 1GB and start + npages = 2GB, which is right at
the beginning of the [2GB, 3GB) chunk.
Then aligned_end is also 2GB, but we need to clear the [2GB, 3GB) chunk, right?

Best,
Wei


RE: [PATCH v1] migration: clear the memory region dirty bitmap when skipping free pages

2021-07-14 Thread Wang, Wei W
On Wednesday, July 14, 2021 6:30 PM, David Hildenbrand wrote:
> 
> On 14.07.21 12:27, Michael S. Tsirkin wrote:
> > On Wed, Jul 14, 2021 at 03:51:04AM -0400, Wei Wang wrote:
> >> When skipping free pages, their corresponding dirty bits in the
> >> memory region dirty bitmap need to be cleared. Otherwise the skipped
> >> pages will be sent in the next round after the migration thread syncs
> >> dirty bits from the memory region dirty bitmap.
> >>
> >> migration_clear_memory_region_dirty_bitmap_range is put outside the
> >> bitmap_mutex, becasue
> >
> > because?
> >
> >> memory_region_clear_dirty_bitmap is possible to block on the kvm slot
> >> mutex (don't want holding bitmap_mutex while blocked on another
> >> mutex), and clear_bmap_test_and_clear uses atomic operation.
> 
> How is that different from our existing caller?
> 
> Please either clean everything up, completely avoiding the lock (separate
> patch), or move it under the lock.
> 
> Or am I missing something important?

Having it outside the lock seems OK to me and Peter. Not sure if Dave or Juan
knows the reason why clear_bmap needs to be under the mutex, given that it is
an atomic operation.

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-13 Thread Wang, Wei W
On Tuesday, July 13, 2021 11:59 PM, Peter Xu wrote:
> On Tue, Jul 13, 2021 at 08:40:21AM +0000, Wang, Wei W wrote:
> 
> Didn't get a chance to document it as it's in a pull now; but as long as 
> you're okay
> with no-per-page lock (which I still don't agree with), I can follow this up.
> 
> Are below parameters enough for me to enable free-page-hint?
> 
>  -object iothread,id=io1 \
>  -device virtio-balloon,free-page-hint=on,iothread=io1 \
> 
> I tried to verify it's running by tracing inside guest with kprobe
> report_free_page_func() but it didn't really trigger.  Guest has kernel
> 5.11.12-300.fc34.x86_64, should be fairly new to have that supported.  Do you
> know what I'm missing?

Please check if you have the virtio-balloon driver loaded in the guest.

> 
> P.S. This also reminded me that, maybe we want an entry in qemu-options.hx for
> balloon device, as it has lots of options, some of them may not be easy to be
> setup right.
> 

Sounds good to me.

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-13 Thread Wang, Wei W
On Tuesday, July 13, 2021 6:22 PM, David Hildenbrand wrote:
> Can you send an official patch for the free page hinting clean_bmap handling I
> reported?
> 
> I can then give both tests in combination a quick test (before/after this 
> patch
> here).
> 

Yes, I'll send, thanks!

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-13 Thread Wang, Wei W
On Thursday, July 1, 2021 4:08 AM, Peter Xu wrote:
> Taking the mutex every time for each dirty bit to clear is too slow, 
> especially we'll
> take/release even if the dirty bit is cleared.  So far it's only used to sync 
> with
> special cases with qemu_guest_free_page_hint() against migration thread,
> nothing really that serious yet.  Let's move the lock to be upper.
> 
> There're two callers of migration_bitmap_clear_dirty().
> 
> For migration, move it into ram_save_iterate().  With the help of MAX_WAIT
> logic, we'll only run ram_save_iterate() for no more than 50ms-ish time, so 
> taking
> the lock once there at the entry.  It also means any call sites to
> qemu_guest_free_page_hint() can be delayed; but it should be very rare, only
> during migration, and I don't see a problem with it.
> 
> For COLO, move it up to colo_flush_ram_cache().  I think COLO forgot to take
> that lock even when calling ramblock_sync_dirty_bitmap(), where another
> example is migration_bitmap_sync() who took it right.  So let the mutex cover
> both the
> ramblock_sync_dirty_bitmap() and migration_bitmap_clear_dirty() calls.
> 
> It's even possible to drop the lock so we use atomic operations upon rb->bmap
> and the variable migration_dirty_pages.  I didn't do it just to still be 
> safe, also
> not predictable whether the frequent atomic ops could bring overhead too e.g.
> on huge vms when it happens very often.  When that really comes, we can
> keep a local counter and periodically call atomic ops.  Keep it simple for 
> now.
> 
> Cc: Wei Wang 
> Cc: David Hildenbrand 
> Cc: Hailiang Zhang 
> Cc: Dr. David Alan Gilbert 
> Cc: Juan Quintela 
> Cc: Leonardo Bras Soares Passos 
> Signed-off-by: Peter Xu 

FWIW
Reviewed-by: Wei Wang 

If no one can help do a regression test of free page hinting, please document
something like this in the patch:
Locking at the coarser granularity may reduce the improvement brought by free
page hints, but it doesn't seem to cause critical issues.
We will let users of free page hints report back any requirements and come up
with a better solution later.

Best,
Wei





RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-13 Thread Wang, Wei W
On Friday, July 9, 2021 10:48 PM, Peter Xu wrote:
> On Fri, Jul 09, 2021 at 08:58:08AM +0000, Wang, Wei W wrote:
> > On Friday, July 9, 2021 2:31 AM, Peter Xu wrote:
> > > > > Yes I think this is the place I didn't make myself clear.  It's
> > > > > not about sleeping, it's about the cmpxchg being expensive
> > > > > already when the vm
> > > is huge.
> > > >
> > > > OK.
> > > > How did you root cause that it's caused by cmpxchg, instead of
> > > > lock contention (i.e. syscall and sleep) or some other code inside
> > > pthread_mutex_lock(). Do you have cycles about cmpxchg v.s. cycles
> > > of pthread_mutex_lock()?
> > >
> > > We've got "perf top -g" showing a huge amount of stacks lying in
> > > pthread_mutex_lock().
> >
> > This only explains pthread_mutex_lock is the cause, not root caused to
> cmpxchg.
> 
> I think that's enough already to prove we can move the lock elsewhere.
> 
> It's not really a heavy race between threads; it's the pure overhead we 
> called it
> too many times.  So it's not really a problem yet about "what type of lock we
> should use" (mutex or spin lock) or "how this lock is implemented" (say, 
> whether
> using cmpxchg only or optimize using test + test-and-set, as that sounds like 
> a
> good optimization of pure test-and-set spinlocks) because the lock is not 
> busy at
> all.

Just FYI:
there is a big while(1) {} inside pthread_mutex_lock; not sure if the hotspot
is in that loop.


> > What if the guest gets stopped and then the migration thread goes to sleep?
> 
> Isn't the balloon code run in a standalone iothread?  Why guest stopped can
> stop migration thread?

Yes, it is async as you know. The guest puts hints into the vq, then gets
paused, then the device takes the mutex, and then the migration thread gets
blocked. In general, when we use a mutex, we need to consider the case that it
could be blocked.

> From what I learned in the past few years, funnily "speed of migration" is
> normally not what people care the most.  Issues are mostly with convergence
> and being transparent to users using the VMs so they aren't even aware.

Yes, migration time isn't that critically important, but shorter is better than
longer. Skipping those free pages saves network bandwidth, which is also good;
otherwise, the 0-page optimization in migration would also be meaningless.
In theory, free pages in the last round could also be skipped to reduce downtime
(I just haven't got a good test case to show it).

> >
> > Seems we lack resources for those tests right now. If you are urgent for a
> decision to have it work first, I'm also OK with you to merge it.
> 
> No I can't merge it myself as I'm not the maintainer. :) I haven't received 
> any ack
> yet, so at least I'll need to see how Dave and Juan think.  It's just that I 
> don't
> think qemuspin could help much in this case, and I don't want to mess up the
> issue too.
> 

Yes, I'm also OK if they want to merge it.
If anyone from your testing team (QA?) could help do a regression test of free
page hinting, checking the difference (e.g. migration time of an idle guest)
after applying this patch, that would be great.
Thanks!

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-09 Thread Wang, Wei W
On Friday, July 9, 2021 2:31 AM, Peter Xu wrote:
> > > Yes I think this is the place I didn't make myself clear.  It's not
> > > about sleeping, it's about the cmpxchg being expensive already when the vm
> is huge.
> >
> > OK.
> > How did you root cause that it's caused by cmpxchg, instead of lock
> > contention (i.e. syscall and sleep) or some other code inside
> pthread_mutex_lock(). Do you have cycles about cmpxchg v.s. cycles of
> pthread_mutex_lock()?
> 
> We've got "perf top -g" showing a huge amount of stacks lying in
> pthread_mutex_lock().

This only shows that pthread_mutex_lock is the bottleneck; it doesn't root-cause
it to cmpxchg.

> 
> >
> > I check the implementation of pthread_mutex_lock(). The code path for lock
> acquire is long. QemuSpin looks more efficient.
> > (probably we also don’t want migration thread to sleep in any case)
> 
> I didn't check it, but I really hoped it should be very close to a spinlock 
> version
> when it's not contended.  We should be careful on using spin locks in 
> userspace,
> e.g., with that moving clear log into critical section will be too much and 
> actuall
> close to "wrong", imho, because the kvm memslot lock inside is sleepable.

OK, that might be a good reason to keep the clear-log call out of the lock.
We also don't want the lock holder to have more chances to go to sleep, even
though it is a mutex.

> 
> I think it's very fine to have migration thread sleep.  IMHO we need explicit
> justification for each mutex to be converted to a spinlock in userspace.  So 
> far I
> don't see it yet for this bitmap lock.

What if the guest gets stopped and then the migration thread goes to sleep?

> 
> Frankly I also don't know how spinlock would work reliably without being able 
> to
> disable scheduling, then could the thread got scheduled out duing taking the
> spinlock?  Would another thread trying to take this lock spinning on this 
> sleeping
> task?

Yes, and it needs good justification:
If it's a per-page spinlock, the granularity is very small, so there is very
little chance that a lock holder gets scheduled out as its time slice runs out.
Even if that happens once, it doesn't seem like a big issue: the waiter just
wastes some CPU cycles, which is better than having the migration thread
scheduled out by the mutex, I think.

> 
> >
> > I think it's also better to see the comparison of migration throughput data 
> > (i.e.
> pages per second) in the following cases, before we make a decision:
> > - per-page mutex
> > - per-page spinlock
> > - 50-ms mutex
> 
> I don't have the machines, so I can't do this.  I also see this as separate 
> issues to
> solve to use spinlock, as I said before I would prefer some justification 
> first to use
> it rather than blindly running tests and pick a patch with higher number.
> 
> I also hope we can reach a consensus that we fix one issue at a time.  This 
> patch
> already proves itself with some real workloads, I hope we can merge it first,
> either with 50ms period or less.
> 
> Per-page locking is already against at least my instinction of how this 
> should be
> done; I just regreted I didn't see that an issue when reviewing the balloon 
> patch
> as I offered the r-b myself.  However I want to make it work as before then we
> figure out a better approach on how the lock is taken, and which lock we 
> should
> use.  I don't see it a must to do all things in one patch.

Sorry, it wasn't my intention to be a barrier to merging your patch; I was just
trying to come up with better solutions if possible.
- Option 1: a QemuSpin lock would reduce the lock-acquisition overhead, but we
need to test whether migration could still converge;
- Option 2: a conditional lock. We haven't seen test results showing whether
turning on the free page optimization with that per-page lock could still make
your 2TB guest migration converge.

It seems we lack resources for those tests right now. If you urgently need a
decision to have it work first, I'm also OK with you merging it.

Best,
Wei



RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-07 Thread Wang, Wei W
On Thursday, July 8, 2021 12:55 AM, Peter Xu wrote:
> On Wed, Jul 07, 2021 at 08:34:50AM +0000, Wang, Wei W wrote:
> > On Wednesday, July 7, 2021 1:47 AM, Peter Xu wrote:
> > > On Sat, Jul 03, 2021 at 02:53:27AM +, Wang, Wei W wrote:
> > > > +   do {
> > > > +page_to_clear = start + (i++ <<
> > > > + block->clear_bmap_shift);
> > >
> > > Why "i" needs to be shifted?
> >
> > Just move to the next clear chunk, no?
> > For example, (1 << 18) pages chunk (i.e. 1GB).
> 
> But migration_clear_memory_region_dirty_bitmap() has done the shifting?
> 

Please see this example: start = 0, npages = 2 * (1 << 18), i.e. we have 2
chunks of pages to clear, starting from 0.
First chunk: from 0 to (1 << 18);
Second chunk: from (1 << 18) to 2 * (1 << 18).

To clear the second chunk, we need to pass (start + (1 << 18)) to
migration_clear_memory_region_dirty_bitmap(), and clear_bmap_test_and_clear()
there will do ">> 18" to transform it into the index of the clear bitmap, which
is 1.
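
For illustration, a minimal sketch of that chunk walk (not the exact patch code;
shift stands for block->clear_bmap_shift, start/npages/ram_state/block are
assumed to be in scope as in the patch, and npages > 0):

    /*
     * Clear every clear-bitmap chunk touched by [start, start + npages).
     * With shift = 18, chunk k covers pages [k << 18, (k + 1) << 18), so
     * the chunk index of a page is simply page >> shift.
     */
    unsigned long shift = block->clear_bmap_shift;
    unsigned long first_chunk = start >> shift;
    unsigned long last_chunk = (start + npages - 1) >> shift;
    unsigned long chunk;

    for (chunk = first_chunk; chunk <= last_chunk; chunk++) {
        /* Any page inside the chunk works; the callee shifts it back down. */
        migration_clear_memory_region_dirty_bitmap(ram_state, block,
                                                   chunk << shift);
    }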

Best,
Wei
 


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-07 Thread Wang, Wei W
On Thursday, July 8, 2021 12:44 AM, Peter Xu wrote:
> > > Not to mention the hard migration issues are mostly with non-idle
> > > guest, in that case having the balloon in the guest will be
> > > disastrous from this pov since it'll start to take mutex for each
> > > page, while balloon would hardly report anything valid since most guest 
> > > pages
> are being used.
> >
> > If no pages are reported, migration thread wouldn't wait on the lock then.
> 
> Yes I think this is the place I didn't make myself clear.  It's not about 
> sleeping, it's
> about the cmpxchg being expensive already when the vm is huge.

OK.
How did you root-cause it to cmpxchg, instead of lock contention (i.e. syscall
and sleep) or some other code inside pthread_mutex_lock()? Do you have cycle
counts for cmpxchg vs. pthread_mutex_lock()?

I checked the implementation of pthread_mutex_lock(). The code path for lock
acquisition is long; QemuSpin looks more efficient.
(We probably also don't want the migration thread to sleep in any case.)

I think it's also better to see the comparison of migration throughput data 
(i.e. pages per second) in the following cases, before we make a decision:
- per-page mutex
- per-page spinlock
- 50-ms mutex

Best,
Wei



RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-07 Thread Wang, Wei W
On Thursday, July 8, 2021 12:45 AM, Peter Xu wrote:
> On Wed, Jul 07, 2021 at 12:45:32PM +0000, Wang, Wei W wrote:
> > Btw, what would you think if we change mutex to QemuSpin? It will also 
> > reduce
> the overhead, I think.
> 
> As I replied at the other place, the bottleneck we encountered is the lock 
> taking
> not sleeping, so I'm afraid spinlock will have the same issue. Thanks,

I suspect the overhead you observed might be caused by the syscalls for the
mutex; a per-page syscall might be too much.
If possible, you could run another test of the 3GB guest migration using
QemuSpin.

Best,
Wei



RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-07 Thread Wang, Wei W
On Wednesday, July 7, 2021 1:40 AM, Peter Xu wrote:
> On Tue, Jul 06, 2021 at 12:05:49PM +0200, David Hildenbrand wrote:
> > On 06.07.21 11:41, Wang, Wei W wrote:
> > > On Monday, July 5, 2021 9:42 PM, David Hildenbrand wrote:
> > > > On 03.07.21 04:53, Wang, Wei W wrote:
> > > > > On Friday, July 2, 2021 3:07 PM, David Hildenbrand wrote:
> > > > > > On 02.07.21 04:48, Wang, Wei W wrote:
> > > > > > > On Thursday, July 1, 2021 10:22 PM, David Hildenbrand wrote:
> > > > > > > > On 01.07.21 14:51, Peter Xu wrote:
> > > > > >
> > > > > > I think that clearly shows the issue.
> > > > > >
> > > > > > My theory I did not verify yet: Assume we have 1GB chunks in the 
> > > > > > clear
> bmap.
> > > > > > Assume the VM reports all pages within a 1GB chunk as free
> > > > > > (easy with a fresh VM). While processing hints, we will clear
> > > > > > the bits from the dirty bmap in the RAMBlock. As we will never
> > > > > > migrate any page of that 1GB chunk, we will not actually clear
> > > > > > the dirty bitmap of the memory region. When re-syncing, we
> > > > > > will set all bits bits in the dirty bmap again from the dirty
> > > > > > bitmap in the memory region. Thus, many of our hints end up
> > > > > > being mostly ignored. The smaller the clear bmap chunk, the
> > > > more extreme the issue.
> > > > >
> > > > > OK, that looks possible. We need to clear the related bits from
> > > > > the memory region bitmap before skipping the free pages. Could
> > > > > you try with
> > > > below patch:
> > > >
> > > > I did a quick test (with the memhog example) and it seems like it mostly
> works.
> > > > However, we're now doing the bitmap clearing from another, racing
> > > > with the migration thread. Especially:
> > > >
> > > > 1. Racing with clear_bmap_set() via
> > > > cpu_physical_memory_sync_dirty_bitmap()
> > > > 2. Racing with migration_bitmap_clear_dirty()
> > > >
> > > > So that might need some thought, if I'm not wrong.
> > >
> > > I think this is similar to the non-clear_bmap case, where the
> > > rb->bmap is set or cleared by the migration thread and
> > > qemu_guest_free_page_hint. For example, the migration thread could find a
> bit is set from rb->bmap before qemu_guest_free_page_hint gets that bit
> cleared in time.
> > > The result is that the free page is transferred, which isn't necessary, 
> > > but won't
> cause any issue.
> > > This is very rare in practice.
> >
> > Here are my concerns regarding races:
> >
> > a) We now end up calling migration_clear_memory_region_dirty_bitmap()
> > without holding the bitmap_mutex. We have to clarify if that is ok. At
> > least
> > migration_bitmap_clear_dirty() holds it *while* clearing the log and
> > migration_bitmap_sync() while setting bits in the clear_bmap. I think
> > we also have to hold the mutex on the new path.
> 
> Makes sense; I think we can let bitmap_mutex to protect both dirty/clear
> bitmaps, and also the dirty pages counter.  I'll comment in Wei's patch too 
> later.

Btw, what would you think if we changed the mutex to QemuSpin? That would also
reduce the overhead, I think.

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-07 Thread Wang, Wei W
On Wednesday, July 7, 2021 2:00 AM, Peter Xu wrote:
> On Fri, Jul 02, 2021 at 02:29:41AM +0000, Wang, Wei W wrote:
> > With that, if free page opt is off, the mutex is skipped, isn't it?
> 
> Yes, but when free page is on, it'll check once per page.  As I mentioned I 
> still
> don't think it's the right thing to do.

With free page opt on, if the migration thread waits to acquire the lock for a
page, it actually means that it is about to skip the transfer of that page.
For example, if waiting for the lock takes 100ns and skipping the send of a page
saves 1000ns, then overall we save 900ns per page (i.e. pay less and earn more).

> 
> We encountered this problem when migrating a 3tb vm and the mutex spins and
> eats tons of cpu resources.  It shouldn't happen with/without balloon, imho.

I think we should compare the overall migration time.

> 
> Not to mention the hard migration issues are mostly with non-idle guest, in 
> that
> case having the balloon in the guest will be disastrous from this pov since 
> it'll start
> to take mutex for each page, while balloon would hardly report anything valid
> since most guest pages are being used.

If no pages are reported, the migration thread wouldn't wait on the lock then.

To conclude: to decide whether the per-page lock hurts performance, considering
that the lock in some sense actually prevents the migration thread from sending
free pages it shouldn't send, we need to compare the overall migration time.
(Previous data can be found
here:https://patchwork.kernel.org/project/kvm/cover/1535333539-32420-1-git-send-email-wei.w.w...@intel.com/;
I think the situation should be the same for either an 8GB or a 3TB guest, in
terms of the overall migration time comparison.)

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-07 Thread Wang, Wei W
On Wednesday, July 7, 2021 1:47 AM, Peter Xu wrote:
> On Sat, Jul 03, 2021 at 02:53:27AM +0000, Wang, Wei W wrote:
> > +   do {
> > +page_to_clear = start + (i++ << block->clear_bmap_shift);
> 
> Why "i" needs to be shifted?

It just moves to the next clear chunk, no?
For example, a chunk of (1 << 18) pages (i.e. 1GB).

> 
> > +migration_clear_memory_region_dirty_bitmap(ram_state,
> > +   block,
> > +
> page_to_clear);
> > +   } while (i <= npages >> block->clear_bmap_shift);
> 
> I agree with David that this should be better put into the mutex section, if 
> so
> we'd also touch up comment for bitmap_mutex.  Or is it a reason to explicitly
> not do so?

clear_bmap_test_and_clear() already uses atomic operations on clear_bmap.
But it's also OK with me if you guys feel safer moving it under the lock.

Best,
Wei



RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-06 Thread Wang, Wei W
On Monday, July 5, 2021 9:42 PM, David Hildenbrand wrote:
> On 03.07.21 04:53, Wang, Wei W wrote:
> > On Friday, July 2, 2021 3:07 PM, David Hildenbrand wrote:
> >> On 02.07.21 04:48, Wang, Wei W wrote:
> >>> On Thursday, July 1, 2021 10:22 PM, David Hildenbrand wrote:
> >>>> On 01.07.21 14:51, Peter Xu wrote:
> >>
> >> I think that clearly shows the issue.
> >>
> >> My theory I did not verify yet: Assume we have 1GB chunks in the clear 
> >> bmap.
> >> Assume the VM reports all pages within a 1GB chunk as free (easy with
> >> a fresh VM). While processing hints, we will clear the bits from the
> >> dirty bmap in the RAMBlock. As we will never migrate any page of that
> >> 1GB chunk, we will not actually clear the dirty bitmap of the memory
> >> region. When re-syncing, we will set all bits bits in the dirty bmap
> >> again from the dirty bitmap in the memory region. Thus, many of our
> >> hints end up being mostly ignored. The smaller the clear bmap chunk, the
> more extreme the issue.
> >
> > OK, that looks possible. We need to clear the related bits from the
> > memory region bitmap before skipping the free pages. Could you try with
> below patch:
> 
> I did a quick test (with the memhog example) and it seems like it mostly 
> works.
> However, we're now doing the bitmap clearing from another, racing with the
> migration thread. Especially:
> 
> 1. Racing with clear_bmap_set() via cpu_physical_memory_sync_dirty_bitmap()
> 2. Racing with migration_bitmap_clear_dirty()
> 
> So that might need some thought, if I'm not wrong.

I think this is similar to the non-clear_bmap case, where rb->bmap is set or
cleared by both the migration thread and qemu_guest_free_page_hint. For example,
the migration thread could find that a bit is set in rb->bmap before
qemu_guest_free_page_hint gets that bit cleared in time.
The result is that the free page is transferred, which isn't necessary, but it
won't cause any issue.
This is very rare in practice.

> 
> The simplest approach would be removing/freeing the clear_bmap via
> PRECOPY_NOTIFY_SETUP(), similar to
> precopy_enable_free_page_optimization() we had before. Of course, this will
> skip the clear_bmap optimization.

Not necessarily, I think. The two optimizations are essentially not mutually
exclusive; at least, they work well together now. If other implementation issues
are reported, we can look into them then.

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-02 Thread Wang, Wei W
On Friday, July 2, 2021 3:07 PM, David Hildenbrand wrote:
> On 02.07.21 04:48, Wang, Wei W wrote:
> > On Thursday, July 1, 2021 10:22 PM, David Hildenbrand wrote:
> >> On 01.07.21 14:51, Peter Xu wrote:
> 
> I think that clearly shows the issue.
> 
> My theory I did not verify yet: Assume we have 1GB chunks in the clear bmap.
> Assume the VM reports all pages within a 1GB chunk as free (easy with a fresh
> VM). While processing hints, we will clear the bits from the dirty bmap in the
> RAMBlock. As we will never migrate any page of that 1GB chunk, we will not
> actually clear the dirty bitmap of the memory region. When re-syncing, we will
> set all bits bits in the dirty bmap again from the dirty bitmap in the memory
> region. Thus, many of our hints end up being mostly ignored. The smaller the
> clear bmap chunk, the more extreme the issue.

OK, that looks possible. We need to clear the related bits from the memory
region bitmap before skipping the free pages. Could you try the patch below:

diff --git a/migration/ram.c b/migration/ram.c
index ace8ad431c..a1f6df3e6c 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -811,6 +811,26 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, 
RAMBlock *rb,
 return next;
 }

+
+static void migration_clear_memory_region_dirty_bitmap(RAMState *rs,
+   RAMBlock *rb,
+  unsigned long page)
+{
+uint8_t shift;
+hwaddr size, start;
+
+if (!rb->clear_bmap || !clear_bmap_test_and_clear(rb, page))
+return;
+
+shift = rb->clear_bmap_shift;
+assert(shift >= 6);
+
+size = 1ULL << (TARGET_PAGE_BITS + shift);
+start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size);
+trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
+memory_region_clear_dirty_bitmap(rb->mr, start, size);
+}
+
 static inline bool migration_bitmap_clear_dirty(RAMState *rs,
 RAMBlock *rb,
 unsigned long page)
@@ -827,26 +847,9 @@ static inline bool migration_bitmap_clear_dirty(RAMState 
*rs,
  * the page in the chunk we clear the remote dirty bitmap for all.
  * Clearing it earlier won't be a problem, but too late will.
  */
-if (rb->clear_bmap && clear_bmap_test_and_clear(rb, page)) {
-uint8_t shift = rb->clear_bmap_shift;
-hwaddr size = 1ULL << (TARGET_PAGE_BITS + shift);
-hwaddr start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size);
-
-/*
- * CLEAR_BITMAP_SHIFT_MIN should always guarantee this... this
- * can make things easier sometimes since then start address
- * of the small chunk will always be 64 pages aligned so the
- * bitmap will always be aligned to unsigned long.  We should
- * even be able to remove this restriction but I'm simply
- * keeping it.
- */
-assert(shift >= 6);
-trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
-memory_region_clear_dirty_bitmap(rb->mr, start, size);
-}
+migration_clear_memory_region_dirty_bitmap(rs, rb, page);

 ret = test_and_clear_bit(page, rb->bmap);
-
 if (ret) {
 rs->migration_dirty_pages--;
 }
@@ -2746,7 +2749,7 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
 {
 RAMBlock *block;
 ram_addr_t offset;
-size_t used_len, start, npages;
+size_t used_len, start, npages, page_to_clear, i = 0;
 MigrationState *s = migrate_get_current();

 /* This function is currently expected to be used during live migration */
@@ -2775,6 +2778,19 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
 start = offset >> TARGET_PAGE_BITS;
 npages = used_len >> TARGET_PAGE_BITS;

+/*
+ * The skipped free pages are equivalent to having been sent, from clear_bmap's
+ * perspective, so clear the bits from the memory region bitmap which
+ * are initially set. Otherwise those skipped pages will be sent in
+ * the next round after syncing from the memory region bitmap.
+ */
+   do {
+page_to_clear = start + (i++ << block->clear_bmap_shift);
+migration_clear_memory_region_dirty_bitmap(ram_state,
+   block,
+   page_to_clear);
+

RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-01 Thread Wang, Wei W
On Thursday, July 1, 2021 10:22 PM, David Hildenbrand wrote:
> On 01.07.21 14:51, Peter Xu wrote:
> Spoiler alert: the introduction of clean bitmaps partially broke free page 
> hinting
> already (as clearing happens deferred -- and might never happen if we don't
> migrate *any* page within a clean bitmap chunk, so pages actually remain
> dirty ...). "broke" here means that pages still get migrated even though they
> were reported by the guest. We'd actually not want to use clean bmaps with 
> free
> page hinting ... long story short, free page hinting is a very fragile beast 
> already
> and some of the hints are basically ignored and pure overhead ...

Not really. Both clear_bmap and the free page hint are used to "clear" the
bitmap.
No matter which one comes first to clear it, it won't cause more (unexpected)
pages to be sent.

Best,
Wei


RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-07-01 Thread Wang, Wei W
On Thursday, July 1, 2021 8:51 PM, Peter Xu wrote:
> On Thu, Jul 01, 2021 at 04:42:38AM +0000, Wang, Wei W wrote:
> > On Thursday, July 1, 2021 4:08 AM, Peter Xu wrote:
> > > Taking the mutex every time for each dirty bit to clear is too slow,
> > > especially we'll take/release even if the dirty bit is cleared.  So
> > > far it's only used to sync with special cases with
> > > qemu_guest_free_page_hint() against migration thread, nothing really that
> serious yet.  Let's move the lock to be upper.
> > >
> > > There're two callers of migration_bitmap_clear_dirty().
> > >
> > > For migration, move it into ram_save_iterate().  With the help of
> > > MAX_WAIT logic, we'll only run ram_save_iterate() for no more than
> > > 50ms-ish time, so taking the lock once there at the entry.  It also
> > > means any call sites to
> > > qemu_guest_free_page_hint() can be delayed; but it should be very
> > > rare, only during migration, and I don't see a problem with it.
> > >
> > > For COLO, move it up to colo_flush_ram_cache().  I think COLO forgot
> > > to take that lock even when calling ramblock_sync_dirty_bitmap(),
> > > where another example is migration_bitmap_sync() who took it right.
> > > So let the mutex cover both the
> > > ramblock_sync_dirty_bitmap() and migration_bitmap_clear_dirty() calls.
> > >
> > > It's even possible to drop the lock so we use atomic operations upon
> > > rb->bmap and the variable migration_dirty_pages.  I didn't do it
> > > just to still be safe, also not predictable whether the frequent atomic 
> > > ops
> could bring overhead too e.g.
> > > on huge vms when it happens very often.  When that really comes, we
> > > can keep a local counter and periodically call atomic ops.  Keep it 
> > > simple for
> now.
> > >
> >
> > If free page opt is enabled, 50ms waiting time might be too long for 
> > handling
> just one hint (via qemu_guest_free_page_hint)?
> > How about making the lock conditionally?
> > e.g.
> > #define QEMU_LOCK_GUARD_COND (lock, cond) {
> > if (cond)
> > QEMU_LOCK_GUARD(lock);
> > }
> > Then in migration_bitmap_clear_dirty:
> > QEMU_LOCK_GUARD_COND(&rs->bitmap_mutex, rs->fpo_enabled);
> 
> Yeah that's indeed some kind of comment I'd like to get from either you or 
> David
> when I add the cc list.. :)
> 
> I was curious how that would affect the guest when the free page hint helper 
> can
> stuck for a while.  Per my understanding it's fully async as the blocked 
> thread
> here is asynchronously with the guest since both virtio-balloon and virtio-mem
> are fully async. If so, would it really affect the guest a lot?  Is it still 
> tolerable if it
> only happens during migration?

Yes, it is async and won't block the guest. But it will make the optimization
not run as expected.
The intention is to have the migration thread skip the transfer of the free
pages, but now the migration thread is essentially holding the 50ms lock that
prevents the clearing of free pages, while it is likely just sending those free
pages inside the lock.
(The reported free pages are better cleared from the bitmap promptly, before
they end up being sent.)

> 
> Taking that mutex for each dirty bit is still an overkill to me, irrelevant 
> of whether
> it's "conditional" or not.  

With that, if free page opt is off, the mutex is skipped, isn't it?

> If I'm the cloud admin, I would more prefer migration
> finishes earlier, imho, rather than freeing some more pages on the host (after
> migration all pages will be gone!).  If it still blocks the guest in some 
> unhealthy
> way I still prefer to take the lock here, however maybe make it shorter than
> 50ms.
> 

Yes, with the optimization, migration will finish earlier.
Why would it need to free pages on the host?
(We just skip sending the page.)

Best,
Wei





RE: [PATCH] migration: Move bitmap_mutex out of migration_bitmap_clear_dirty()

2021-06-30 Thread Wang, Wei W
On Thursday, July 1, 2021 4:08 AM, Peter Xu wrote:
> Taking the mutex every time for each dirty bit to clear is too slow, 
> especially we'll
> take/release even if the dirty bit is cleared.  So far it's only used to sync 
> with
> special cases with qemu_guest_free_page_hint() against migration thread,
> nothing really that serious yet.  Let's move the lock to be upper.
> 
> There're two callers of migration_bitmap_clear_dirty().
> 
> For migration, move it into ram_save_iterate().  With the help of MAX_WAIT
> logic, we'll only run ram_save_iterate() for no more than 50ms-ish time, so 
> taking
> the lock once there at the entry.  It also means any call sites to
> qemu_guest_free_page_hint() can be delayed; but it should be very rare, only
> during migration, and I don't see a problem with it.
> 
> For COLO, move it up to colo_flush_ram_cache().  I think COLO forgot to take
> that lock even when calling ramblock_sync_dirty_bitmap(), where another
> example is migration_bitmap_sync() who took it right.  So let the mutex cover
> both the
> ramblock_sync_dirty_bitmap() and migration_bitmap_clear_dirty() calls.
> 
> It's even possible to drop the lock so we use atomic operations upon rb->bmap
> and the variable migration_dirty_pages.  I didn't do it just to still be 
> safe, also
> not predictable whether the frequent atomic ops could bring overhead too e.g.
> on huge vms when it happens very often.  When that really comes, we can
> keep a local counter and periodically call atomic ops.  Keep it simple for 
> now.
> 

If free page opt is enabled, a 50ms waiting time might be too long for handling
just one hint (via qemu_guest_free_page_hint)?
How about making the lock conditional?
e.g.
#define QEMU_LOCK_GUARD_COND(lock, cond) {  \
    if (cond)                               \
        QEMU_LOCK_GUARD(lock);              \
}
Then in migration_bitmap_clear_dirty:
QEMU_LOCK_GUARD_COND(&rs->bitmap_mutex, rs->fpo_enabled);
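
For reference, the same conditional locking could also be written out explicitly
inside migration_bitmap_clear_dirty(): a simplified sketch that leaves out the
clear_bmap handling, where rs->fpo_enabled is the hypothetical "free page
optimization enabled" flag used above:

    static inline bool migration_bitmap_clear_dirty(RAMState *rs, RAMBlock *rb,
                                                    unsigned long page)
    {
        bool ret;
        /* Only pay the per-page locking cost when free page hinting is on. */
        bool need_lock = rs->fpo_enabled;

        if (need_lock) {
            qemu_mutex_lock(&rs->bitmap_mutex);
        }

        ret = test_and_clear_bit(page, rb->bmap);
        if (ret) {
            rs->migration_dirty_pages--;
        }

        if (need_lock) {
            qemu_mutex_unlock(&rs->bitmap_mutex);
        }
        return ret;
    }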


Best,
Wei



RE: Question on Compression for Raw Image

2020-10-20 Thread Wang, Wei W
On Tuesday, October 20, 2020 4:01 PM, Kevin Wolf wrote:
> Am 20.10.2020 um 03:31 hat Wang, Wei W geschrieben:
> > Hi,
> >
> > Does anyone know the reason why raw-format.c doesn't have
> compression
> > support (but qcow has the supported added)?  For example, raw image
> > backup with compression, "qemu-img convert -c -O raw origin.img
> > dist.img", doesn't work.
> 
> A raw image is by definition a file that contains the exact same sequence of
> bytes as the guest sees, without any additional information or encoding. If
> you compress a raw file, the guest will see compressed data on its hard disk
> instead of the real data.

Ok, thanks. I'm thinking QEMU could do decompression of the compressed data in
raw.img when the guest reads data.

> 
> Anything you could do to add transparent compression to it would mean that
> it's not a raw image any more, but a new image format.
> 
Yes, decompression makes it transparent to the guest. Do you think it would be
good to reuse the raw image implementation and just add a compress/decompress
option?

Thanks,
Wei



Question on Compression for Raw Image

2020-10-19 Thread Wang, Wei W
Hi,

Does anyone know the reason why raw-format.c doesn't have compression support
(while qcow has it supported)?
For example, raw image backup with compression, "qemu-img convert -c -O raw
origin.img dist.img", doesn't work.

Thanks,
Wei


Re: [Qemu-devel] [PATCH v11 7/7] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2019-02-19 Thread Wang, Wei W
On Friday, December 14, 2018 7:17 PM, Dr. David Alan Gilbert wrote:
> > On 12/14/2018 05:56 PM, Dr. David Alan Gilbert wrote:
> > > * Wei Wang (wei.w.w...@intel.com) wrote:
> > > > On 12/13/2018 11:45 PM, Dr. David Alan Gilbert wrote:
> > > > > * Wei Wang (wei.w.w...@intel.com) wrote:
> > > > > > The new feature enables the virtio-balloon device to receive
> > > > > > hints of guest free pages from the free page vq.
> > > > > >
> > > > > > A notifier is registered to the migration precopy notifier
> > > > > > chain. The notifier calls free_page_start after the migration
> > > > > > thread syncs the dirty bitmap, so that the free page
> > > > > > optimization starts to clear bits of free pages from the
> > > > > > bitmap. It calls the free_page_stop before the migration
> > > > > > thread syncs the bitmap, which is the end of the current round
> > > > > > of ram save. The free_page_stop is also called to stop the
> optimization in the case when there is an error occurred in the process of
> ram saving.
> > > > > >
> > > > > > Note: balloon will report pages which were free at the time of this
> call.
> > > > > > As the reporting happens asynchronously, dirty bit logging
> > > > > > must be enabled before this free_page_start call is made.
> > > > > > Guest reporting must be disabled before the migration dirty bitmap
> is synchronized.
> > > > > >
> > > > > > Signed-off-by: Wei Wang 
> > > > > > CC: Michael S. Tsirkin 
> > > > > > CC: Dr. David Alan Gilbert 
> > > > > > CC: Juan Quintela 
> > > > > > CC: Peter Xu 
> > > > > I think I'm OK for this from the migration side, I'd appreciate
> > > > > someone checking the virtio and aio bits.
> > > > >
> > > > > I'm not too sure how it gets switched on and off - i.e. if we
> > > > > get a nice new qemu on a new kernel, what happens when I try and
> > > > > migrate to the same qemu on an older kernel without these hints?
> > > > >
> > > > This feature doesn't rely on the host kernel. Those hints are
> > > > reported from the guest kernel.
> > > > So migration across different hosts wouldn't affect the use of this
> feature.
> > > > Please correct me if I didn't get your point.
> > > Ah OK, yes;  now what about migrating from new->old qemu with a new
> > > guest but old machine type?
> > >
> >
> > I think normally, the source QEMU and destination QEMU should have the
> > same QEMU booting parameter. If the destination QEMU doesn't support
> > "--device virtio-balloon,free-page-hint=true", which the source QEMU
> > has, the destination side QEMU will fail to boot, and migration will
> > not happen then.
> 
> Ah that's OK; as long as free-page-hint is false by default that will work 
> fine.
> 
> Dave
> 

Hi Dave,

Could we have this feature in QEMU 4.0 (freeze on Mar 12)?

Best,
Wei



Re: [Qemu-devel] [PATCH] migration: savevm: consult migration blockers

2018-11-16 Thread Wang, Wei W
On Saturday, November 17, 2018 12:48 AM, Paolo Bonzini wrote:
> To: qemu-devel@nongnu.org
> Subject: [Qemu-devel] [PATCH] migration: savevm: consult migration
> blockers
> 
> There is really no difference between live migration and savevm, except that
> savevm does not require bdrv_invalidate_cache to be implemented by all
> disks.  However, it is unlikely that savevm is used with anything except qcow2
> disks, so the penalty is small and worth the improvement in catching bad
> usage of savevm.
> 
> Only one place was taking care of savevm when adding a migration blocker,
> and it can be removed.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  migration/savevm.c | 4 
>  target/i386/kvm.c  | 3 ---
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c index
> ef707b8c43..1c49776a91 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2455,6 +2455,10 @@ int save_snapshot(const char *name, Error
> **errp)
>  struct tm tm;
>  AioContext *aio_context;
> 
> +if (migration_is_blocked(errp)) {
> +return false;

Just curious: why return false here when the return type is "int"?

Best,
Wei



Re: [Qemu-devel] [PATCH 1/8] migration: stop compressing page in migration thread

2018-03-27 Thread Wang, Wei W
On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote:
> 
> As compression is a heavy work, do not do it in migration thread, instead, we
> post it out as a normal page
> 
> Signed-off-by: Xiao Guangrong 


Hi Guangrong,

Dave asked me to help review your patch, so I will just drop my 2 cents wherever
possible and hope they are helpful for your work.


> ---
>  migration/ram.c | 32 
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c index
> 7266351fd0..615693f180 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState
> *rs, PageSearchStatus *pss,
>  int pages = -1;
>  uint64_t bytes_xmit = 0;
>  uint8_t *p;
> -int ret, blen;
> +int ret;
>  RAMBlock *block = pss->block;
>  ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
> 
> @@ -1162,23 +1162,23 @@ static int
> ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
>  if (block != rs->last_sent_block) {
>  flush_compressed_data(rs);
>  pages = save_zero_page(rs, block, offset);
> -if (pages == -1) {
> -/* Make sure the first page is sent out before other pages */
> -bytes_xmit = save_page_header(rs, rs->f, block, offset |
> -  RAM_SAVE_FLAG_COMPRESS_PAGE);
> -blen = qemu_put_compression_data(rs->f, p, TARGET_PAGE_SIZE,
> - migrate_compress_level());
> -if (blen > 0) {
> -ram_counters.transferred += bytes_xmit + blen;
> -ram_counters.normal++;
> -pages = 1;
> -} else {
> -qemu_file_set_error(rs->f, blen);
> -error_report("compressed data failed!");
> -}
> -}
>  if (pages > 0) {
>  ram_release_pages(block->idstr, offset, pages);
> +} else {
> +/*
> + * Make sure the first page is sent out before other pages.
> + *
> + * we post it as normal page as compression will take much
> + * CPU resource.
> + */
> +ram_counters.transferred += save_page_header(rs, rs->f, 
> block,
> +offset | RAM_SAVE_FLAG_PAGE);
> +qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
> +  migrate_release_ram() &
> +  migration_in_postcopy());
> +ram_counters.transferred += TARGET_PAGE_SIZE;
> +ram_counters.normal++;
> +pages = 1;
>  }
>  } else {
>  pages = save_zero_page(rs, block, offset);
> --

I agree that this patch is an improvement for the current implementation. So 
just pile up mine here:
Reviewed-by: Wei Wang 


If you are interested in something more aggressive, I can share an alternative 
approach, which I think would be better. Please see below.

Actually, we can use the multi-threaded compression for the first page as well,
which will not block the migration thread's progress. The advantage is that we
can enjoy the compression benefit for the first page and meanwhile not block the
migration thread - the page is given to a compression thread and compressed
asynchronously to the migration thread's execution.

The main barrier to achieving the above is that we need to make sure the first
page of each block is sent first in the multi-threaded environment. We can twist
the current implementation to achieve that, which is not hard (a rough sketch
follows below):

For example, we can add a new flag to RAMBlock - bool first_page_added. In each
compression thread, we need to
1) check if this is the first page of the block;
2) if it is the first page, set block->first_page_added after sending the page;
3) if it is not the first page, wait to send the page until
block->first_page_added is set.
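
A minimal sketch of the idea (first_page_added, first_page_lock and
first_page_cond are hypothetical new RAMBlock fields, and compress_page() /
send_compressed_page() are placeholder helpers, not the actual QEMU functions):

    static void compress_and_send_page(RAMBlock *block, CompressParam *param,
                                       bool first_page_of_block)
    {
        compress_page(param);               /* runs concurrently in each thread */

        qemu_mutex_lock(&block->first_page_lock);
        if (first_page_of_block) {
            send_compressed_page(param);
            block->first_page_added = true; /* unblock the other threads */
            qemu_cond_broadcast(&block->first_page_cond);
        } else {
            /* Wait until the block's first page has been sent out. */
            while (!block->first_page_added) {
                qemu_cond_wait(&block->first_page_cond, &block->first_page_lock);
            }
            send_compressed_page(param);
        }
        qemu_mutex_unlock(&block->first_page_lock);
    }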

Best,
Wei






Re: [Qemu-devel] [virtio-dev] Re: [PATCH v5 4/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-26 Thread Wang, Wei W
On Monday, March 26, 2018 11:04 PM, Daniel P. Berrangé wrote:
> On Mon, Mar 26, 2018 at 02:54:45PM +0000, Wang, Wei W wrote:
> > On Monday, March 26, 2018 7:09 PM, Daniel P. Berrangé wrote:
> > >
> > > As far as libvirt is concerned there are three sets of threads it
> > > provides control over
> > >
> > >  - vCPUs - each VCPU in KVM has a thread. Libvirt provides per-thread
> > >tunable control
> > >
> > >  - IOThreads - each named I/O thread can be associated with one or more
> > >devices. Libvirt provides per-thread tunable control.
> > >
> > >  - Emulator - any other QEMU thread which isn't an vCPU thread or IO
> thread
> > >gets called an emulator thread by libvirt. There is no-per thread
> > >tunable control - we can set tunables for entire set of emulator 
> > > threads
> > >at once.
> > >
> >
> >
> > Hi Daniel,
> > Thanks for sharing the details, they are very helpful. I still have a 
> > question:
> >
> > There is no fundamental difference between iothread and our
> > optimization thread (it is similar to the migration thread, which is
> > created when migration begins and terminated when migration is done) -
> > both of them are pthreads and each has a name. Could we also add the
> > similar per-thread tunable control in libvirt for such threads?
> >
> > For example, in QEMU we can add a new migration qmp command,
> > migrate_enable_free_page_optimization (just like other commands
> > migrate_set_speed 10G), this command will create the optimization thread.
> > In this way, creation of the thread is in libvirt's control, and
> > libvirt can then support tuning the thread (e.g. pin it to any pCPU), right?
> 
> Anything is possible from a technical level, but from a design POV we would
> rather not start exposing tunables for arbitrary device type specific threads
> as they are inherantly non-portable and may not even exist in QEMU long
> term as it is just an artifact of the current QEMU implementation.
> 

OK. My actual concern with iothread is that it exists during the whole QEMU
lifecycle. A block device using it is fine, since block device accesses happen
frequently during the QEMU lifecycle. For usages like live migration, which may
happen once per day, keeping this iothread running would be a waste of CPU
cycles.

Best,
Wei
 


Re: [Qemu-devel] [virtio-dev] Re: [PATCH v5 4/5] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-03-26 Thread Wang, Wei W
On Monday, March 26, 2018 7:09 PM, Daniel P. Berrangé wrote:
> 
> As far as libvirt is concerned there are three sets of threads it provides
> control over
> 
>  - vCPUs - each VCPU in KVM has a thread. Libvirt provides per-thread
>tunable control
> 
>  - IOThreads - each named I/O thread can be associated with one or more
>devices. Libvirt provides per-thread tunable control.
> 
>  - Emulator - any other QEMU thread which isn't an vCPU thread or IO thread
>gets called an emulator thread by libvirt. There is no-per thread
>tunable control - we can set tunables for entire set of emulator threads
>at once.
> 


Hi Daniel,
Thanks for sharing the details, they are very helpful. I still have a question:

There is no fundamental difference between iothread and our optimization thread 
(it is similar to the migration thread, which is created when migration begins 
and terminated when migration is done) - both of them are pthreads and each has 
a name. Could we also add similar per-thread tunable control in libvirt for
such threads?

For example, in QEMU we can add a new migration qmp command, 
migrate_enable_free_page_optimization (just like other commands 
migrate_set_speed 10G), this command will create the optimization thread. In 
this way, creation of the thread is in libvirt's control, and libvirt can then 
support tuning the thread (e.g. pin it to any pCPU), right?


> So, if this balloon driver thread needs to support tuning controls separately
> from other general purpose QEMU threads, then it would ideally use
> iothread infrastructure.
> 
> I don't particularly understand what this code is doing, but please consider
> whether NUMA has any impact on the work done in this thread. Specifically
> when the guest has multiple virtual NUMA nodes, each associated with a
> specific host NUMA node. If there is any memory intensive work being done
> here, then it might need to be executed on the correct host NUMA node
> according to the memory region being touched.
> 
 
I think it would not be significantly impacted by NUMA, because this
optimization thread doesn't access the guest memory much except for the
virtqueue (even with an iothread, we may still not know which pCPU to pin to
match the virtqueue in the vNUMA case). Essentially, it gets the free page
address and length, then clears bits from the migration dirty bitmap, which is
allocated by QEMU itself.
So, I think adding the tunable support would be nicer, but I'm not sure if it
is required.

Best,
Wei


Re: [Qemu-devel] [PATCH v2 2/3] migration: use the free page reporting feature from balloon

2018-02-26 Thread Wang, Wei W
On Monday, February 26, 2018 1:07 PM, Wei Wang wrote:
> On 02/09/2018 07:50 PM, Dr. David Alan Gilbert wrote:
> > * Wei Wang (wei.w.w...@intel.com) wrote:
> >> Use the free page reporting feature from the balloon device to clear
> >> the bits corresponding to guest free pages from the dirty bitmap, so
> >> that the free memory are not sent.
> >>
> >> Signed-off-by: Wei Wang 
> >> CC: Michael S. Tsirkin 
> >> CC: Juan Quintela 
> >> ---
> >>   migration/ram.c | 24 
> >>   1 file changed, 20 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/migration/ram.c b/migration/ram.c index d6f462c..4fe16d2
> >> 100644
> >> --- a/migration/ram.c
> >> +++ b/migration/ram.c
> >> @@ -49,6 +49,7 @@
> >>   #include "qemu/rcu_queue.h"
> >>   #include "migration/colo.h"
> >>   #include "migration/block.h"
> >> +#include "sysemu/balloon.h"
> >>
> >>   /***/
> >>   /* ram save/restore */
> >> @@ -206,6 +207,10 @@ struct RAMState {
> >>   uint32_t last_version;
> >>   /* We are in the first round */
> >>   bool ram_bulk_stage;
> >> +/* The feature, skipping the transfer of free pages, is supported */
> >> +bool free_page_support;
> >> +/* Skip the transfer of free pages in the bulk stage */
> >> +bool free_page_done;
> >>   /* How many times we have dirty too many pages */
> >>   int dirty_rate_high_cnt;
> >>   /* these variables are used for bitmap sync */ @@ -773,7 +778,7
> >> @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock
> *rb,
> >>   unsigned long *bitmap = rb->bmap;
> >>   unsigned long next;
> >>
> >> -if (rs->ram_bulk_stage && start > 0) {
> >> +if (rs->ram_bulk_stage && start > 0 && !rs->free_page_support) {
> >>   next = start + 1;
> >>   } else {
> >>   next = find_next_bit(bitmap, size, start); @@ -1653,6
> >> +1658,8 @@ static void ram_state_reset(RAMState *rs)
> >>   rs->last_page = 0;
> >>   rs->last_version = ram_list.version;
> >>   rs->ram_bulk_stage = true;
> >> +rs->free_page_support = balloon_free_page_support();
> >> +rs->free_page_done = false;
> >>   }
> >>
> >>   #define MAX_WAIT 50 /* ms, half buffered_file limit */ @@ -2135,7
> >> +2142,7 @@ static int ram_state_init(RAMState **rsp)
> >>   return 0;
> >>   }
> >>
> >> -static void ram_list_init_bitmaps(void)
> >> +static void ram_list_init_bitmaps(RAMState *rs)
> >>   {
> >>   RAMBlock *block;
> >>   unsigned long pages;
> >> @@ -2145,7 +2152,11 @@ static void ram_list_init_bitmaps(void)
> >>   QLIST_FOREACH_RCU(block, _list.blocks, next) {
> >>   pages = block->max_length >> TARGET_PAGE_BITS;
> >>   block->bmap = bitmap_new(pages);
> >> -bitmap_set(block->bmap, 0, pages);
> >> +if (rs->free_page_support) {
> >> +bitmap_set(block->bmap, 1, pages);
> > I don't understand how it makes sense to do that here; ignoring
> > anything ese it means that migration_dirty_pages is wrong which could
> > end up with migration finishing before all real pages are sent.
> >
> 
> The bulk stage treats all the pages as dirty pages, so we set all the bits to 
> "1",
> this is needed by this optimization feature, because the free pages reported
> from the guest can then be directly cleared from the bitmap (we don't need
> any more bitmaps to record free pages).
> 

Sorry, there was a misunderstanding of the bitmap_set API on my side (I thought
it was used to set all the bits to 1 or 0). So the above change isn't needed
actually. Btw, this doesn't affect the results I reported.

Best,
Wei
 




Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device

2018-02-06 Thread Wang, Wei W
On Tuesday, February 6, 2018 5:32 PM, Stefan Hajnoczi wrote:
> On Tue, Feb 06, 2018 at 01:28:25AM +0000, Wang, Wei W wrote:
> > On Tuesday, February 6, 2018 12:26 AM, Stefan Hajnoczi wrote:
> > > On Fri, Feb 02, 2018 at 09:08:44PM +0800, Wei Wang wrote:
> > > > On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> > > > > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > > > > Issues:
> > > > > > Suppose we have both the vhost and virtio-net set up, and 
> > > > > > vhost pmd <-> virtio-net pmd communication works well. Now, 
> > > > > > vhost pmd exits (virtio-net pmd is still there). Some time 
> > > > > > later, we re-run vhost pmd, the vhost pmd doesn't know the 
> > > > > > virtqueue addresses of the virtio-net pmd, unless the 
> > > > > > virtio-net pmd reloads to start the 2nd phase of the 
> > > > > > vhost-user protocol. So the second run of the vhost
> > > pmd won't work.
> > > > > >
> > > > > > Any thoughts?
> > > > > >
> > > > > > Best,
> > > > > > Wei
> > > > > So vhost in qemu must resend all configuration on reconnect.
> > > > > Does this address the issues?
> > > > >
> > > >
> > > > Yes, but the issues are
> > > > 1) there is no reconnecting when a pmd exits (the socket 
> > > > connection seems still on at the device layer);
> > >
> > > This is how real hardware works too.  If the driver suddenly stops 
> > > running then the device remains operational.  When the driver is 
> > > started again it resets the device and initializes it.
> > >
> > > > 2) If we find a way to break the QEMU layer socket connection 
> > > > when pmd exits and get it reconnect, virtio-net device still 
> > > > won't send all the configure when reconnecting, because socket 
> > > > connecting only triggers phase 1 of vhost-user negotiation (i.e.
> > > > vhost_user_init). Phase 2 is triggered after the driver loads 
> > > > (i.e. vhost_net_start). If the virtio-net pmd doesn't reload, 
> > > > there are no phase 2 messages (like virtqueue addresses which 
> > > > are allocated by the pmd). I think we need to think more about 
> > > > this before
> moving forward.
> > >
> > > Marc-André: How does vhost-user reconnect work when the master 
> > > goes away and a new master comes online?  Wei found that the QEMU 
> > > slave implementation only does partial vhost-user initialization 
> > > upon reconnect, so the new master doesn't get the virtqueue 
> > > address and
> related information.
> > > Is this a QEMU bug?
> >
> > Actually we are discussing the slave (vhost is the slave, right?) going 
> > away.
> When a slave exits and some moment later a new slave runs, the master
> (virtio-net) won't send the virtqueue addresses to the new vhost slave.
> 
> Yes, apologies for the typo.  s/QEMU slave/QEMU master/
> 
> Yesterday I asked Marc-André for help on IRC and we found the code 
> path where the QEMU master performs phase 2 negotiation upon 
> reconnect.  It's not obvious but the qmp_set_link() calls in 
> net_vhost_user_event() will do it.
> 
> I'm going to try to reproduce the issue you're seeing now.  Will let 
> you know what I find.
> 

OK. Thanks. I observed no messages after re-running the virtio-vhost-user pmd,
and found that no re-connection event happens on the device side.

I also tried to switch the client/server roles - virtio-net running a server
socket, and virtio-vhost-user running the client - and it seems the current code
fails to run that way. The reason is that the virtio-net side
vhost_user_get_features() doesn't return. On the vhost side, I don't see
virtio_vhost_user_deliver_m2s being invoked to deliver the GET_FEATURES
message. I'll come back to continue later.

Best,
Wei



Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device

2018-02-05 Thread Wang, Wei W
On Tuesday, February 6, 2018 12:26 AM, Stefan Hajnoczi wrote:
> On Fri, Feb 02, 2018 at 09:08:44PM +0800, Wei Wang wrote:
> > On 02/02/2018 01:08 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > > > Issues:
> > > > Suppose we have both the vhost and virtio-net set up, and vhost
> > > > pmd <-> virtio-net pmd communication works well. Now, vhost pmd
> > > > exits (virtio-net pmd is still there). Some time later, we re-run
> > > > vhost pmd, the vhost pmd doesn't know the virtqueue addresses of
> > > > the virtio-net pmd, unless the virtio-net pmd reloads to start the
> > > > 2nd phase of the vhost-user protocol. So the second run of the vhost
> pmd won't work.
> > > >
> > > > Any thoughts?
> > > >
> > > > Best,
> > > > Wei
> > > So vhost in qemu must resend all configuration on reconnect.
> > > Does this address the issues?
> > >
> >
> > Yes, but the issues are
> > 1) there is no reconnecting when a pmd exits (the socket connection
> > seems still on at the device layer);
> 
> This is how real hardware works too.  If the driver suddenly stops running
> then the device remains operational.  When the driver is started again it
> resets the device and initializes it.
> 
> > 2) If we find a way to break the QEMU layer socket connection when pmd
> > exits and get it reconnect, virtio-net device still won't send all the
> > configure when reconnecting, because socket connecting only triggers
> > phase 1 of vhost-user negotiation (i.e. vhost_user_init). Phase 2 is
> > triggered after the driver loads (i.e. vhost_net_start). If the
> > virtio-net pmd doesn't reload, there are no phase 2 messages (like
> > virtqueue addresses which are allocated by the pmd). I think we need
> > to think more about this before moving forward.
> 
> Marc-André: How does vhost-user reconnect work when the master goes
> away and a new master comes online?  Wei found that the QEMU slave
> implementation only does partial vhost-user initialization upon reconnect, so
> the new master doesn't get the virtqueue address and related information.
> Is this a QEMU bug?

Actually we are discussing the slave (vhost is the slave, right?) going away. 
When a slave exits and some moment later a new slave runs, the master 
(virtio-net) won't send the virtqueue addresses to the new vhost slave.

Best,
Wei



Re: [Qemu-devel] [RFC 0/2] virtio-vhost-user: add virtio-vhost-user device

2018-02-05 Thread Wang, Wei W
On Friday, February 2, 2018 11:26 PM, Stefan Hajnoczi wrote:
> On Tue, Jan 30, 2018 at 08:09:19PM +0800, Wei Wang wrote:
> > Background:
> > The vhost-user negotiation is split into 2 phases currently. The 1st
> > phase happens when the connection is established, and we can find
> > what's done in the 1st phase in vhost_user_init(). The 2nd phase
> > happens when the master driver is loaded (e.g. run of virtio-net pmd)
> > and set status to the device, and we can find what's done in the 2nd
> > phase in vhost_dev_start(), which includes sending the memory info and
> > virtqueue info. The socket is connected, till one of the QEMU devices
> > exits, so pmd exiting won't end the QEMU side socket connection.
> >
> > Issues:
> > Suppose we have both the vhost and virtio-net set up, and vhost pmd
> > <-> virtio-net pmd communication works well. Now, vhost pmd exits
> > (virtio-net pmd is still there). Some time later, we re-run vhost pmd,
> > the vhost pmd doesn't know the virtqueue addresses of the virtio-net
> > pmd, unless the virtio-net pmd reloads to start the 2nd phase of the
> > vhost-user protocol. So the second run of the vhost pmd won't work.
> 
> This isn't a problem for virtio-vhost-user since the vhost-pmd resets the
> virtio-vhost-user device when it restarts.  The vhost-user AF_UNIX socket
> reconnects and negotiation restarts.

I'm not sure if you've agreed that vhost-user negotiation is split into two
phases as described above. If not, it's also not difficult to check, thanks to
the RTE_LOG call in vhost_user_msg_handler:
RTE_LOG(INFO, VHOST_CONFIG, "read message %s\n",
vhost_message_str[msg.request.master]);
It tells us which messages are received and when they are received.

Before trying the virtio-vhost-user setup, please make sure the virtio-net side 
VM doesn't have the virtio-net kernel driver loaded (blacklist the module or 
disable it in .config).
VM1: the VM with virtio-vhost-user
VM2: the VM with virtio-net

1) After we boot VM1 and the virtio-vhost-user pmd, we boot VM2, and we will 
see the following log in VM1:
VHOST_CONFIG: new device, handle is 0
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
VHOST_CONFIG: read message VHOST_USER_GET_PROTOCOL_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_PROTOCOL_FEATURES
VHOST_CONFIG: read message VHOST_USER_GET_QUEUE_NUM
VHOST_CONFIG: read message VHOST_USER_SET_OWNER
VHOST_CONFIG: read message VHOST_USER_GET_FEATURES
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:0 file:-1
VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
VHOST_CONFIG: vring call idx:1 file:-1

Those messages are what I called phase1 negotiation. They are negotiated when 
the AF_UNIX socket connects.

2) Then in VM2 we load the virtio-net pmd, and we will see phase 2 messages
showing up in VM1, like:
VHOST_CONFIG: read message VHOST_USER_SET_VRING_NUM
VHOST_CONFIG: read message VHOST_USER_SET_VRING_BASE
VHOST_CONFIG: read message VHOST_USER_SET_VRING_ADDR
...
Those messages are sent to virtio-vhost-user only when the VM2's virtio-net pmd 
loads.

AF_UNIX socket reconnection only triggers phase 1 negotiation. If the virtio-net
pmd in VM2 doesn't reload, the virtio-vhost-user pmd won't get the above phase 2
messages. Do you agree that this is an issue?

Best,
Wei 



Re: [Qemu-devel] [PATCH v1 1/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

2018-01-17 Thread Wang, Wei W
On Wednesday, January 17, 2018 7:41 PM, Juan Quintela wrote:
> Wei Wang  wrote:
> > +void skip_free_pages_from_dirty_bitmap(RAMBlock *block, ram_addr_t
> offset,
> > +   size_t len) {
> > +long start = offset >> TARGET_PAGE_BITS,
> > + nr = len >> TARGET_PAGE_BITS;
> > +
> > +bitmap_clear(block->bmap, start, nr);
> 
> But what assures us that all the nr pages are dirty?

Actually the free page optimization is used for the bulk stage, where all the 
pages are treated as dirty. In patch 2, we have:

+if (rs->free_page_support) {
+bitmap_set(block->bmap, 1, pages);
+} else {
+bitmap_set(block->bmap, 0, pages);
+}


> 
> > +ram_state->migration_dirty_pages -= nr;
> 
> This should be
> ram_state->migration_dirty_pages -=
>  count_ones(block->bmap, start, nr);
> 
> For a count_ones function, no?

Not really. "nr" is the number of bits to clear from the bitmap, so
"migration_dirty_pages -= nr" keeps track of how many dirty bits are left in the
bitmap.

One corner case I'm thinking about is when we do
bitmap_clear(block->bmap, start, nr);
would it be possible that "start + nr" exceeds the bitmap size? That is, could
the free page block (e.g. 2MB) reported by the guest cross a QEMU ram block
boundary?
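
If that can happen, here is a small sketch of how the hint could be clamped so
that bitmap_clear() never runs past the block (illustrative only, assuming
start < block_pages):

    unsigned long block_pages = block->used_length >> TARGET_PAGE_BITS;

    /* Clamp the hint so [start, start + nr) stays inside this RAMBlock. */
    if (start + nr > block_pages) {
        nr = block_pages - start;
    }
    bitmap_clear(block->bmap, start, nr);
    ram_state->migration_dirty_pages -= nr;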


> Furthermore, we have one "optimization" and this only works for the second
> stages onward:
> 
> static inline
> unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>   unsigned long start) {
> unsigned long size = rb->used_length >> TARGET_PAGE_BITS;
> unsigned long *bitmap = rb->bmap;
> unsigned long next;
> 
> if (rs->ram_bulk_stage && start > 0) {
> next = start + 1;
> } else {
> next = find_next_bit(bitmap, size, start);
> }
> 
> return next;
> }
> 
> So, for making this really work, we have more work to do.
> 
> Actually, what I think we should do was to _ask_ the guest which pages are
> used at the beggining, instead of just setting all pages as dirty, but that
> requires kernel changes and lot of search of corner cases.
> 

This series optimizes the bulk stage memory transfer. Do you mean you have 
another idea to optimize the second round and onward? How would you let the 
guest track pages that are written?

Best,
Wei





Re: [Qemu-devel] vhost-pci and virtio-vhost-user

2018-01-13 Thread Wang, Wei W
On Friday, January 12, 2018 6:38 PM, Stefan Hajnoczi wrote:
> On Fri, Jan 12, 2018 at 02:44:00PM +0800, Wei Wang wrote:
> > On 01/11/2018 05:56 PM, Stefan Hajnoczi wrote:
> > > On Thu, Jan 11, 2018 at 6:31 AM, Wei Wang 
> wrote:
> > > > On 01/11/2018 12:14 AM, Stefan Hajnoczi wrote:
> >
> I expect vhost-pci to require fewer code changes.  If you judge "simpler" just
> by the patch count or size, then vhost-pci will win.
> 
> The reason for that is virtio-vhost-user integrates with librte_vhost.
> This requires refactoring librte_vhost to support multiple transports.
> 

I think the driver having more code is fine. The device part is what will go 
into the spec, and people have the freedom to implement their own driver 
(either a DPDK or a kernel driver) in the future if they don't want the 
librte_vhost-based one.

So, for the first version, I think it would be better not to make the device 
full-featured (e.g. pass through as few messages as possible); this would give 
users a device and driver with the fewest lines of code.


> I think the virtio-vhost-user end result is worth it though: vhost devices 
> like
> examples/vhost/ and examples/vhost/scsi/ will work with both AF_UNIX and
> virtio-vhost-user.  This makes it simpler for users and vhost device
> developers - you only have one implementation of net, scsi, blk, etc devices.

For the driver part, won't net, scsi, and blk each have their own separate 
implementation on top of the common librte_vhost?

Best,
Wei



Re: [Qemu-devel] [PATCH v20 0/7] Virtio-balloon Enhancement

2017-12-20 Thread Wang, Wei W
On Wednesday, December 20, 2017 8:26 PM, Matthew Wilcox wrote:
> On Wed, Dec 20, 2017 at 06:34:36PM +0800, Wei Wang wrote:
> > On 12/19/2017 10:05 PM, Tetsuo Handa wrote:
> > > I think xb_find_set() has a bug in !node path.
> >
> > I think we can probably remove the "!node" path for now. It would be
> > good to get the fundamental part in first, and leave optimization to
> > come as separate patches with corresponding test cases in the future.
> 
> You can't remove the !node path.  You'll see !node when the highest set bit
> is less than 1024.  So do something like this:
> 
>   unsigned long bit;
>   xb_preload(GFP_KERNEL);
>   xb_set_bit(xb, 700);
>   xb_preload_end();
>   bit = xb_find_set(xb, ULONG_MAX, 0);
>   assert(bit == 700);

The above test results in "!node" with bitmap != NULL, so it goes to the 
regular "if (bitmap)" path, which finds 700.

A better test would be
...
xb_set_bit(xb, 700);
assert(xb_find_set(xb, ULONG_MAX, 800) == ULONG_MAX);
...

The first try, via the "if (bitmap)" path, doesn't find a set bit, and the 
remaining tries will always result in "!node && !bitmap", which implies there 
is no set bit left and no need to keep trying.

So, I think we can change it to

    if (!node && !bitmap)
            return size;


Best,
Wei





Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-17 Thread Wang, Wei W
On Saturday, December 16, 2017 3:22 AM, Matthew Wilcox wrote:
> On Fri, Dec 15, 2017 at 10:49:15AM -0800, Matthew Wilcox wrote:
> > Here's the API I'm looking at right now.  The user need take no lock;
> > the locking (spinlock) is handled internally to the implementation.

In another place I saw your comment "The xb_ API requires you to handle your 
own locking", which seems to conflict with the above "the user need take no 
lock". Doesn't the caller need a lock to avoid concurrent accesses to the ida 
bitmap?


> I looked at the API some more and found some flaws:
>  - how does xbit_alloc communicate back which bit it allocated?
>  - What if xbit_find_set() is called on a completely empty array with
>a range of 0, ULONG_MAX -- there's no invalid number to return.

We'll change it to "bool xb_find_set(.., unsigned long *result)", where 
returning false indicates that no set bit was found.


>  - xbit_clear() can't return an error.  Neither can xbit_zero().

I found that the current xbit_clear implementation only returns 0, and there 
is no error to be returned from this function. In that case, is it better to 
make the function "void"?


>  - Need to add __must_check to various return values to discourage sloppy
>programming
> 
> So I modify the proposed API we compete with thusly:
> 
> bool xbit_test(struct xbitmap *, unsigned long bit); int __must_check
> xbit_set(struct xbitmap *, unsigned long bit, gfp_t); void xbit_clear(struct
> xbitmap *, unsigned long bit); int __must_check xbit_alloc(struct xbitmap *,
> unsigned long *bit, gfp_t);
> 
> int __must_check xbit_fill(struct xbitmap *, unsigned long start,
> unsigned long nbits, gfp_t); void xbit_zero(struct 
> xbitmap *,
> unsigned long start, unsigned long nbits); int __must_check
> xbit_alloc_range(struct xbitmap *, unsigned long *bit,
> unsigned long nbits, gfp_t);
> 
> bool xbit_find_clear(struct xbitmap *, unsigned long *start, unsigned long
> max); bool xbit_find_set(struct xbitmap *, unsigned long *start, unsigned
> long max);
> 
> (I'm a little sceptical about the API accepting 'max' for the find functions 
> and
> 'nbits' in the fill/zero/alloc_range functions, but I think that matches how
> people want to use it, and it matches how bitmap.h works)

Are you suggesting renaming the current xb_ APIs to the above xbit_ names 
(with parameter changes)?

Why would we need xbit_alloc, which looks like ida_get_new? I think set/clear 
should be adequate for the current use cases.

Best,
Wei






Re: [Qemu-devel] [PATCH v19 3/7] xbitmap: add more operations

2017-12-17 Thread Wang, Wei W


On Sunday, December 17, 2017 6:22 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > > But passing GFP_NOWAIT means that we can handle allocation failure.
> > > There is no need to use preload approach when we can handle allocation
> failure.
> >
> > I think the reason we need xb_preload is because radix tree insertion
> > needs the memory being preallocated already (it couldn't suffer from
> > memory failure during the process of inserting, probably because
> > handling the failure there isn't easy, Matthew may know the backstory
> > of
> > this)
> 
> According to https://lwn.net/Articles/175432/ , I think that preloading is
> needed only when failure to insert an item into a radix tree is a significant
> problem.
> That is, when failure to insert an item into a radix tree is not a problem, I
> think that we don't need to use preloading.

It also mentions that the preload attempts to allocate sufficient memory to 
*guarantee* that the next radix tree insertion cannot fail.

If we check radix_tree_node_alloc(), the comment there says "this assumes that 
the caller has performed appropriate preallocation".

So, I think we would risk triggering issues if we call the radix tree APIs 
without preload().
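
To be concrete, the ordering the series relies on is roughly the shape of the 
xb_preload_and_set_bit() helper used in the balloon patch (a sketch from 
memory; the exact xb_preload() signature may differ):

    if (!xb_preload(gfp))          /* preallocate the radix tree node */
            return -ENOMEM;        /* handle failure here, outside the tree code */
    ret = xb_set_bit(xb, bit);     /* the insertion won't allocate a node now */
    xb_preload_end();
    return ret;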

> >
> > So, I think we can handle the memory failure with xb_preload, which
> > stops going into the radix tree APIs, but shouldn't call radix tree
> > APIs without the related memory preallocated.
> 
> It seems to me that virtio-ballon case has no problem without using
> preloading.

Why is that?

Best,
Wei




Re: [Qemu-devel] [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-11 Thread Wang, Wei W
On Monday, December 11, 2017 7:12 PM, Stefan Hajnoczi wrote:
> On Sat, Dec 09, 2017 at 04:23:17PM +0000, Wang, Wei W wrote:
> > On Friday, December 8, 2017 4:34 PM, Stefan Hajnoczi wrote:
> > > On Fri, Dec 8, 2017 at 6:43 AM, Wei Wang <wei.w.w...@intel.com>
> wrote:
> > > > On 12/08/2017 07:54 AM, Michael S. Tsirkin wrote:
> > > >>
> > > >> On Thu, Dec 07, 2017 at 06:28:19PM +, Stefan Hajnoczi wrote:
> > > >>>
> > > >>> On Thu, Dec 7, 2017 at 5:38 PM, Michael S. Tsirkin 
> > > >>> <m...@redhat.com>
> > > > Thanks Stefan and Michael for the sharing and discussion. I 
> > > > think above 3 and 4 are debatable (e.g. whether it is simpler 
> > > > really depends). 1 and 2 are implementations, I think both 
> > > > approaches could implement the device that way. We originally 
> > > > thought about one device and driver to support all types (called 
> > > > it transformer sometimes :-) ), that would look interesting from 
> > > > research point of view, but from real usage point of view, I 
> > > > think it would be better to have them separated,
> > > because:
> > > > - different device types have different driver logic, mixing 
> > > > them together would cause the driver to look messy. Imagine that 
> > > > a networking driver developer has to go over the block related 
> > > > code to debug, that also increases the difficulty.
> > >
> > > I'm not sure I understand where things get messy because:
> > > 1. The vhost-pci device implementation in QEMU relays messages but 
> > > has no device logic, so device-specific messages like 
> > > VHOST_USER_NET_SET_MTU are trivial at this layer.
> > > 2. vhost-user slaves only handle certain vhost-user protocol messages.
> > > They handle device-specific messages for their device type only.
> > > This is like vhost drivers today where the ioctl() function 
> > > returns an error if the ioctl is not supported by the device.  It's not 
> > > messy.
> > >
> > > Where are you worried about messy driver logic?
> >
> > Probably I didn’t explain well, please let me summarize my thought a 
> > little
> bit, from the perspective of the control path and data path.
> >
> > Control path: the vhost-user messages - I would prefer just have the 
> > interaction between QEMUs, instead of relaying to the GuestSlave, 
> > because
> > 1) I think the claimed advantage (easier to debug and develop) 
> > doesn’t seem very convincing
> 
> You are defining a mapping from the vhost-user protocol to a custom 
> virtio device interface.  Every time the vhost-user protocol (feature 
> bits, messages,
> etc) is extended it will be necessary to map this new extension to the 
> virtio device interface.
> 
> That's non-trivial.  Mistakes are possible when designing the mapping.
> Using the vhost-user protocol as the device interface minimizes the 
> effort and risk of mistakes because most messages are relayed 1:1.
> 
> > 2) some messages can be directly answered by QemuSlave , and some
> messages are not useful to give to the GuestSlave (inside the VM), 
> e.g. fds, VhostUserMemoryRegion from SET_MEM_TABLE msg (the device 
> first maps the master memory and gives the offset (in terms of the 
> bar, i.e., where does it sit in the bar) of the mapped gpa to the 
> guest. if we give the raw VhostUserMemoryRegion to the guest, that wouldn’t 
> be usable).
> 
> I agree that QEMU has to handle some of messages, but it should still 
> relay all (possibly modified) messages to the guest.
> 
> The point of using the vhost-user protocol is not just to use a 
> familiar binary encoding, it's to match the semantics of vhost-user 
> 100%.  That way the vhost-user software stack can work either in host 
> userspace or with vhost-pci without significant changes.
> 
> Using the vhost-user protocol as the device interface doesn't seem any 
> harder than defining a completely new virtio device interface.  It has 
> the advantages that I've pointed out:
> 
> 1. Simple 1:1 mapping for most that is easy to maintain as the
>vhost-user protocol grows.
> 
> 2. Compatible with vhost-user so slaves can run in host userspace
>or the guest.
> 
> I don't see why it makes sense to define new device interfaces for 
> each device type and create a software stack that is incompatible with 
> vhost-user.


I think this 1:1 mapping wouldn't be easy:

1) We will have 2 Qemu side slaves to achieve this bidirectional relaying, that 
is, the working model will b

Re: [Qemu-devel] [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-09 Thread Wang, Wei W
On Friday, December 8, 2017 4:34 PM, Stefan Hajnoczi wrote:
> On Fri, Dec 8, 2017 at 6:43 AM, Wei Wang  wrote:
> > On 12/08/2017 07:54 AM, Michael S. Tsirkin wrote:
> >>
> >> On Thu, Dec 07, 2017 at 06:28:19PM +, Stefan Hajnoczi wrote:
> >>>
> >>> On Thu, Dec 7, 2017 at 5:38 PM, Michael S. Tsirkin 
> > Thanks Stefan and Michael for the sharing and discussion. I think
> > above 3 and 4 are debatable (e.g. whether it is simpler really
> > depends). 1 and 2 are implementations, I think both approaches could
> > implement the device that way. We originally thought about one device
> > and driver to support all types (called it transformer sometimes :-)
> > ), that would look interesting from research point of view, but from
> > real usage point of view, I think it would be better to have them separated,
> because:
> > - different device types have different driver logic, mixing them
> > together would cause the driver to look messy. Imagine that a
> > networking driver developer has to go over the block related code to
> > debug, that also increases the difficulty.
> 
> I'm not sure I understand where things get messy because:
> 1. The vhost-pci device implementation in QEMU relays messages but has no
> device logic, so device-specific messages like VHOST_USER_NET_SET_MTU are
> trivial at this layer.
> 2. vhost-user slaves only handle certain vhost-user protocol messages.
> They handle device-specific messages for their device type only.  This is like
> vhost drivers today where the ioctl() function returns an error if the ioctl 
> is
> not supported by the device.  It's not messy.
> 
> Where are you worried about messy driver logic?

Probably I didn't explain this well. Let me summarize my thoughts a bit, from 
the perspective of the control path and the data path.

Control path: the vhost-user messages. I would prefer to just have the 
interaction between the QEMUs, instead of relaying to the GuestSlave, because
1) I think the claimed advantage (easier to debug and develop) doesn't seem 
very convincing;
2) some messages can be directly answered by the QemuSlave, and some messages 
are not useful to the GuestSlave (inside the VM), e.g. fds, or the 
VhostUserMemoryRegion from the SET_MEM_TABLE msg (the device first maps the 
master memory and gives the guest the offset of the mapped gpa in terms of the 
BAR, i.e., where it sits in the BAR; if we gave the raw VhostUserMemoryRegion 
to the guest, that wouldn't be usable).


Data path: that's the discussion we had about one driver versus separate 
drivers for different device types, and it is not related to the control path. 
I meant that if we have one driver for all the types, that driver would look 
messy, because each type has its own data sending/receiving logic. For example, 
the net type deals with a pair of tx and rx queues and its transmission is skb 
based (e.g. xmit_skb), while the block type deals with a request queue. If we 
have one driver, then it has to include all of these things together.


The last part is whether to make it a virtio device or a regular PCI device.
I don't have a strong preference. I think a virtio device works fine (e.g. use 
some BAR area to create ioeventfds to solve the "no virtqueue, no fds" issue, 
if you and Michael think that's acceptable), and we can reuse some other things 
from virtio, like feature negotiation. But if you and Michael decide to make it 
a regular PCI device, I think that would also work.

Best,
Wei


Re: [Qemu-devel] [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-09 Thread Wang, Wei W
On Friday, December 8, 2017 10:28 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 08, 2017 at 06:08:05AM +, Stefan Hajnoczi wrote:
> > On Thu, Dec 7, 2017 at 11:54 PM, Michael S. Tsirkin 
> wrote:
> > > On Thu, Dec 07, 2017 at 06:28:19PM +, Stefan Hajnoczi wrote:
> > >> On Thu, Dec 7, 2017 at 5:38 PM, Michael S. Tsirkin 
> wrote:
> > >> > On Thu, Dec 07, 2017 at 05:29:14PM +, Stefan Hajnoczi wrote:
> > >> >> On Thu, Dec 7, 2017 at 4:47 PM, Michael S. Tsirkin 
> wrote:
> > >> >> > On Thu, Dec 07, 2017 at 04:29:45PM +, Stefan Hajnoczi wrote:
> > >> >> >> On Thu, Dec 7, 2017 at 2:02 PM, Michael S. Tsirkin
>  wrote:
> > >> >> >> > On Thu, Dec 07, 2017 at 01:08:04PM +, Stefan Hajnoczi
> wrote:
> > >> >> >> >> Instead of responding individually to these points, I hope
> > >> >> >> >> this will explain my perspective.  Let me know if you do
> > >> >> >> >> want individual responses, I'm happy to talk more about
> > >> >> >> >> the points above but I think the biggest difference is our
> perspective on this:
> > >> >> >> >>
> > >> >> >> >> Existing vhost-user slave code should be able to run on
> > >> >> >> >> top of vhost-pci.  For example, QEMU's
> > >> >> >> >> contrib/vhost-user-scsi/vhost-user-scsi.c should work
> > >> >> >> >> inside the guest with only minimal changes to the source
> > >> >> >> >> file (i.e. today it explicitly opens a UNIX domain socket
> > >> >> >> >> and that should be done by libvhost-user instead).  It
> > >> >> >> >> shouldn't be hard to add vhost-pci vfio support to
> contrib/libvhost-user/ alongside the existing UNIX domain socket code.
> > >> >> >> >>
> > >> >> >> >> This seems pretty easy to achieve with the vhost-pci PCI
> > >> >> >> >> adapter that I've described but I'm not sure how to
> > >> >> >> >> implement libvhost-user on top of vhost-pci vfio if the
> > >> >> >> >> device doesn't expose the vhost-user protocol.
> > >> >> >> >>
> > >> >> >> >> I think this is a really important goal.  Let's use a
> > >> >> >> >> single vhost-user software stack instead of creating a
> > >> >> >> >> separate one for guest code only.
> > >> >> >> >>
> > >> >> >> >> Do you agree that the vhost-user software stack should be
> > >> >> >> >> shared between host userspace and guest code as much as
> possible?
> > >> >> >> >
> > >> >> >> >
> > >> >> >> >
> > >> >> >> > The sharing you propose is not necessarily practical
> > >> >> >> > because the security goals of the two are different.
> > >> >> >> >
> > >> >> >> > It seems that the best motivation presentation is still the
> > >> >> >> > original rfc
> > >> >> >> >
> > >> >> >> > http://virtualization.linux-foundation.narkive.com/A7FkzAgp
> > >> >> >> > /rfc-vhost-user-enhancements-for-vm2vm-communication
> > >> >> >> >
> > >> >> >> > So comparing with vhost-user iotlb handling is different:
> > >> >> >> >
> > >> >> >> > With vhost-user guest trusts the vhost-user backend on the host.
> > >> >> >> >
> > >> >> >> > With vhost-pci we can strive to limit the trust to qemu only.
> > >> >> >> > The switch running within a VM does not have to be trusted.
> > >> >> >>
> > >> >> >> Can you give a concrete example?
> > >> >> >>
> > >> >> >> I have an idea about what you're saying but it may be wrong:
> > >> >> >>
> > >> >> >> Today the iotlb mechanism in vhost-user does not actually
> > >> >> >> enforce memory permissions.  The vhost-user slave has full
> > >> >> >> access to mmapped memory regions even when iotlb is enabled.
> > >> >> >> Currently the iotlb just adds an indirection layer but no
> > >> >> >> real security.  (Is this correct?)
> > >> >> >
> > >> >> > Not exactly. iotlb protects against malicious drivers within guest.
> > >> >> > But yes, not against a vhost-user driver on the host.
> > >> >> >
> > >> >> >> Are you saying the vhost-pci device code in QEMU should
> > >> >> >> enforce iotlb permissions so the vhost-user slave guest only
> > >> >> >> has access to memory regions that are allowed by the iotlb?
> > >> >> >
> > >> >> > Yes.
> > >> >>
> > >> >> Okay, thanks for confirming.
> > >> >>
> > >> >> This can be supported by the approach I've described.  The
> > >> >> vhost-pci QEMU code has control over the BAR memory so it can
> > >> >> prevent the guest from accessing regions that are not allowed by the
> iotlb.
> > >> >>
> > >> >> Inside the guest the vhost-user slave still has the memory
> > >> >> region descriptions and sends iotlb messages.  This is
> > >> >> completely compatible with the libvirt-user APIs and existing
> > >> >> vhost-user slave code can run fine.  The only unique thing is
> > >> >> that guest accesses to memory regions not allowed by the iotlb do
> not work because QEMU has prevented it.
> > >> >
> > >> > I don't think this can work since suddenly you need to map full
> > >> > IOMMU address space into BAR.
> > >>
> > >> The BAR covers all guest RAM
> > >> but QEMU can set up MemoryRegions that hide parts from the guest
> > >> 

Re: [Qemu-devel] [virtio-dev] [PATCH v3 0/7] Vhost-pci for inter-VM communication

2017-12-06 Thread Wang, Wei W
On Wednesday, December 6, 2017 9:50 PM, Stefan Hajnoczi wrote:
> On Tue, Dec 05, 2017 at 11:33:09AM +0800, Wei Wang wrote:
> > Vhost-pci is a point-to-point based inter-VM communication solution.
> > This patch series implements the vhost-pci-net device setup and
> > emulation. The device is implemented as a virtio device, and it is set
> > up via the vhost-user protocol to get the neessary info (e.g the
> > memory info of the remote VM, vring info).
> >
> > Currently, only the fundamental functions are implemented. More
> > features, such as MQ and live migration, will be updated in the future.
> >
> > The DPDK PMD of vhost-pci has been posted to the dpdk mailinglist here:
> > http://dpdk.org/ml/archives/dev/2017-November/082615.html
> 
> I have asked questions about the scope of this feature.  In particular, I 
> think
> it's best to support all device types rather than just virtio-net.  Here is a
> design document that shows how this can be achieved.
> 
> What I'm proposing is different from the current approach:
> 1. It's a PCI adapter (see below for justification) 2. The vhost-user 
> protocol is
> exposed by the device (not handled 100% in
>QEMU).  Ultimately I think your approach would also need to do this.
> 
> I'm not implementing this and not asking you to implement it.  Let's just use
> this for discussion so we can figure out what the final vhost-pci will look 
> like.
> 
> Please let me know what you think, Wei, Michael, and others.
> 

Thanks for sharing the thoughts. If I understand it correctly, the key 
difference is that this approach tries to relay every vhost-user msg to the 
guest. I'm not sure about the benefits of doing this.
To make the data plane (i.e. the driver sending/receiving packets) work, I 
think the memory info and vring info are mostly enough. Other things like 
callfd and kickfd don't need to be sent to the guest; they are needed only by 
QEMU for the eventfd and irqfd setup.



> ---
> vhost-pci device specification
> ---
> The vhost-pci device allows guests to act as vhost-user slaves.  This enables
> appliance VMs like network switches or storage targets to back devices in
> other VMs.  VM-to-VM communication is possible without vmexits using
> polling mode drivers.
> 
> The vhost-user protocol has been used to implement virtio devices in
> userspace processes on the host.  vhost-pci maps the vhost-user protocol to
> a PCI adapter so guest software can perform virtio device emulation.
> This is useful in environments where high-performance VM-to-VM
> communication is necessary or where it is preferrable to deploy device
> emulation as VMs instead of host userspace processes.
> 
> The vhost-user protocol involves file descriptor passing and shared memory.
> This precludes vhost-user slave implementations over virtio-vsock, virtio-
> serial, or TCP/IP.  Therefore a new device type is needed to expose the
> vhost-user protocol to guests.
> 
> The vhost-pci PCI adapter has the following resources:
> 
> Queues (used for vhost-user protocol communication):
> 1. Master-to-slave messages
> 2. Slave-to-master messages
> 
> Doorbells (used for slave->guest/master events):
> 1. Vring call (one doorbell per virtqueue) 2. Vring err (one doorbell per
> virtqueue) 3. Log changed
> 
> Interrupts (used for guest->slave events):
> 1. Vring kick (one MSI per virtqueue)
> 
> Shared Memory BARs:
> 1. Guest memory
> 2. Log
> 
> Master-to-slave queue:
> The following vhost-user protocol messages are relayed from the vhost-user
> master.  Each message follows the vhost-user protocol VhostUserMsg layout.
> 
> Messages that include file descriptor passing are relayed but do not carry 
> file
> descriptors.  The relevant resources (doorbells, interrupts, or shared memory
> BARs) are initialized from the file descriptors prior to the message becoming
> available on the Master-to-Slave queue.
> 
> Resources must only be used after the corresponding vhost-user message
> has been received.  For example, the Vring call doorbell can only be used
> after VHOST_USER_SET_VRING_CALL becomes available on the Master-to-
> Slave queue.
> 
> Messages must be processed in order.
> 
> The following vhost-user protocol messages are relayed:
>  * VHOST_USER_GET_FEATURES
>  * VHOST_USER_SET_FEATURES
>  * VHOST_USER_GET_PROTOCOL_FEATURES
>  * VHOST_USER_SET_PROTOCOL_FEATURES
>  * VHOST_USER_SET_OWNER
>  * VHOST_USER_SET_MEM_TABLE
>The shared memory is available in the corresponding BAR.
>  * VHOST_USER_SET_LOG_BASE
>The shared memory is available in the corresponding BAR.
>  * VHOST_USER_SET_LOG_FD
>The logging file descriptor can be signalled through the logging
>virtqueue.
>  * VHOST_USER_SET_VRING_NUM
>  * VHOST_USER_SET_VRING_ADDR
>  * VHOST_USER_SET_VRING_BASE
>  * VHOST_USER_GET_VRING_BASE
>  * VHOST_USER_SET_VRING_KICK
>This message is still needed because it may indicate only polling
>mode is supported.
>  * VHOST_USER_SET_VRING_CALL
>This message is still 

Re: [Qemu-devel] [PATCH v18 05/10] xbitmap: add more operations

2017-12-01 Thread Wang, Wei W
On Friday, December 1, 2017 9:02 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > On 11/30/2017 06:34 PM, Tetsuo Handa wrote:
> > > Wei Wang wrote:
> > >> + * @start: the start of the bit range, inclusive
> > >> + * @end: the end of the bit range, inclusive
> > >> + *
> > >> + * This function is used to clear a bit in the xbitmap. If all the
> > >> +bits of the
> > >> + * bitmap are 0, the bitmap will be freed.
> > >> + */
> > >> +void xb_clear_bit_range(struct xb *xb, unsigned long start,
> > >> +unsigned long end) {
> > >> +struct radix_tree_root *root = >xbrt;
> > >> +struct radix_tree_node *node;
> > >> +void **slot;
> > >> +struct ida_bitmap *bitmap;
> > >> +unsigned int nbits;
> > >> +
> > >> +for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 
> > >> 1) {
> > >> +unsigned long index = start / IDA_BITMAP_BITS;
> > >> +unsigned long bit = start % IDA_BITMAP_BITS;
> > >> +
> > >> +bitmap = __radix_tree_lookup(root, index, , );
> > >> +if (radix_tree_exception(bitmap)) {
> > >> +unsigned long ebit = bit + 2;
> > >> +unsigned long tmp = (unsigned long)bitmap;
> > >> +
> > >> +nbits = min(end - start + 1, BITS_PER_LONG - 
> > >> ebit);
> > > "nbits = min(end - start + 1," seems to expect that start == end is
> > > legal for clearing only 1 bit. But this function is no-op if start == end.
> > > Please clarify what "inclusive" intended.
> >
> > If xb_clear_bit_range(xb,10,10), then it is effectively the same as
> > xb_clear_bit(10). Why would it be illegal?
> >
> > "@start inclusive" means that the @start will also be included to be
> > cleared.
> 
> If start == end is legal,
> 
>for (; start < end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {
> 
> makes this loop do nothing because 10 < 10 is false.


How about "start <= end"?
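
That is, the loop head would become (just changing the comparison in the 
quoted hunk):

    for (; start <= end; start = (start | (IDA_BITMAP_BITS - 1)) + 1) {

so xb_clear_bit_range(xb, 10, 10) does one iteration and clears exactly bit 10, 
while larger ranges still advance chunk by chunk as before.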

Best,
Wei






Re: [Qemu-devel] [PATCH v18 07/10] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-11-30 Thread Wang, Wei W
On Thursday, November 30, 2017 6:36 PM, Tetsuo Handa wrote:
> Wei Wang wrote:
> > +static inline int xb_set_page(struct virtio_balloon *vb,
> > +  struct page *page,
> > +  unsigned long *pfn_min,
> > +  unsigned long *pfn_max)
> > +{
> > +   unsigned long pfn = page_to_pfn(page);
> > +   int ret;
> > +
> > +   *pfn_min = min(pfn, *pfn_min);
> > +   *pfn_max = max(pfn, *pfn_max);
> > +
> > +   do {
> > +   ret = xb_preload_and_set_bit(>page_xb, pfn,
> > +GFP_NOWAIT | __GFP_NOWARN);
> 
> It is a bit of pity that __GFP_NOWARN here is applied to only xb_preload().
> Memory allocation by xb_set_bit() will after all emit warnings. Maybe
> 
>   xb_init(>page_xb);
>   vb->page_xb.gfp_mask |= __GFP_NOWARN;
> 
> is tolerable? Or, unconditionally apply __GFP_NOWARN at xb_init()?
> 



Please have a look at radix_tree_node_alloc().

In our case, I think the code path goes to:

if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
...
ret = kmem_cache_alloc(radix_tree_node_cachep,
   gfp_mask | __GFP_NOWARN);
...
goto out;
}

So I think the __GFP_NOWARN is already there.



>   static inline void xb_init(struct xb *xb)
>   {
>   INIT_RADIX_TREE(>xbrt, IDR_RT_MARKER | GFP_NOWAIT);
>   }
> 
> > +   } while (unlikely(ret == -EAGAIN));
> > +
> > +   return ret;
> > +}
> > +
> 
> 
> 
> > @@ -172,11 +283,18 @@ static unsigned fill_balloon(struct virtio_balloon
> *vb, size_t num)
> > vb->num_pfns = 0;
> >
> > while ((page = balloon_page_pop())) {
> > +   if (use_sg) {
> > +   if (xb_set_page(vb, page, _min, _max) < 0) {
> > +   __free_page(page);
> > +   break;
> 
> You cannot "break;" without consuming all pages in "pages".


Right, I think it should be "continue" here. Thanks. 
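
Concretely, the error path in the hunk above would then read:

                if (xb_set_page(vb, page, &pfn_min, &pfn_max) < 0) {
                        __free_page(page);
                        continue;   /* keep consuming the pages popped off the list */
                }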

> 
> > +   }
> > +   } else {
> > +   set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > +   }
> > +
> > balloon_page_enqueue(>vb_dev_info, page);
> >
> > vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > -
> > -   set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
> > vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
> > if (!virtio_has_feature(vb->vdev,
> >
>   VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> 
> 
> 
> > @@ -212,9 +334,12 @@ static unsigned leak_balloon(struct virtio_balloon
> *vb, size_t num)
> > struct page *page;
> > struct balloon_dev_info *vb_dev_info = >vb_dev_info;
> > LIST_HEAD(pages);
> > +   bool use_sg = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_SG);
> 
> You can pass use_sg as an argument to leak_balloon(). Then, you won't need
> to define leak_balloon_sg_oom(). Since xbitmap allocation does not use
> __GFP_DIRECT_RECLAIM, it is safe to reuse leak_balloon() for OOM path.
> Just be sure to pass use_sg == false because memory allocation for use_sg ==
> true likely fails when called from OOM path. (But trying use_sg == true for
> OOM path and then fallback to use_sg == false is not bad?)
> 


But once the SG mechanism is in use, we cannot use the non-SG mechanism, 
because the interface between the guest and host is not the same for SG and 
non-SG. Methods to make the host support both mechanisms at the same time would 
complicate the interface and implementation. 

I also think the old non-SG mechanism should be deprecated, since its 
implementation isn't perfect in some sense; e.g. it balloons pages using an 
outbuf, which implies that no changes are allowed to the balloon pages, and 
that isn't true for ballooning. The new mechanism (SG) has changed this to use 
an inbuf.

So I think using leak_balloon_sg_oom() would be better. Is there any reason 
that we couldn't use it?

Best,
Wei




Re: [Qemu-devel] [virtio-dev] Re: [PATCH v17 6/6] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ

2017-11-17 Thread Wang, Wei W
On Friday, November 17, 2017 8:45 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 17, 2017 at 07:35:03PM +0800, Wei Wang wrote:
> > On 11/16/2017 09:27 PM, Wei Wang wrote:
> > > On 11/16/2017 04:32 AM, Michael S. Tsirkin wrote:
> > > > On Fri, Nov 03, 2017 at 04:13:06PM +0800, Wei Wang wrote:
> > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_VQ feature
> > > > > indicates the support of reporting hints of guest free pages to
> > > > > the host via virtio-balloon. The host requests the guest to
> > > > > report the free pages by sending commands via the virtio-balloon
> configuration registers.
> > > > >
> > > > > When the guest starts to report, the first element added to the
> > > > > free page vq is a sequence id of the start reporting command.
> > > > > The id is given by the host, and it indicates whether the
> > > > > following free pages correspond to the command. For example, the
> > > > > host may stop the report and start again with a new command id.
> > > > > The obsolete pages for the previous start command can be
> > > > > detected by the id dismatching on the host. The id is added to
> > > > > the vq using an output buffer, and the free pages are added to
> > > > > the vq using input buffer.
> > > > >
> > > > > Here are some explainations about the added configuration registers:
> > > > > - host2guest_cmd: a register used by the host to send commands
> > > > > to the guest.
> > > > > - guest2host_cmd: written by the guest to ACK to the host about
> > > > > the commands that have been received. The host will clear the
> > > > > corresponding bits on the host2guest_cmd register. The guest
> > > > > also uses this register to send commands to the host (e.g. when finish
> free page reporting).
> > > > > - free_page_cmd_id: the sequence id of the free page report
> > > > > command given by the host.
> > > > >
> > > > > Signed-off-by: Wei Wang 
> > > > > Signed-off-by: Liang Li 
> > > > > Cc: Michael S. Tsirkin 
> > > > > Cc: Michal Hocko 
> > > > > ---
> > > > >
> > > > > +
> > > > > +static void report_free_page(struct work_struct *work) {
> > > > > +struct virtio_balloon *vb;
> > > > > +
> > > > > +vb = container_of(work, struct virtio_balloon,
> > > > > report_free_page_work);
> > > > > +report_free_page_cmd_id(vb);
> > > > > +walk_free_mem_block(vb, 0, _balloon_send_free_pages);
> > > > > +/*
> > > > > + * The last few free page blocks that were added may not reach 
> > > > > the
> > > > > + * batch size, but need a kick to notify the device to
> > > > > handle them.
> > > > > + */
> > > > > +virtqueue_kick(vb->free_page_vq);
> > > > > +report_free_page_end(vb);
> > > > > +}
> > > > > +
> > > > I think there's an issue here: if pages are poisoned and
> > > > hypervisor subsequently drops them, testing them after allocation
> > > > will trigger a false positive.
> > > >
> > > > The specific configuration:
> > > >
> > > > PAGE_POISONING on
> > > > PAGE_POISONING_NO_SANITY off
> > > > PAGE_POISONING_ZERO off
> > > >
> > > >
> > > > Solutions:
> > > > 1. disable the feature in that configuration
> > > > suggested as an initial step
> > >
> > > Thanks for the finding.
> > > Similar to this option: I'm thinking could we make
> > > walk_free_mem_block() simply return if that option is on?
> > > That is, at the beginning of the function:
> > > if (!page_poisoning_enabled())
> > > return;
> > >
> >
> >
> > Thought about it more, I think it would be better to put this logic to
> > virtio_balloon:
> >
> > send_free_page_cmd_id(vb, >start_cmd_id);
> > if (page_poisoning_enabled() &&
> > !IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY))
> > walk_free_mem_block(vb, 0, _balloon_send_free_pages);
> > send_free_page_cmd_id(vb, >stop_cmd_id);
> >
> >
> > walk_free_mem_block() should be a more generic API, and this potential
> > page poisoning issue is specific to live migration which is only one
> > use case of this function, so I think it is better to handle it in the
> > special use case itself.
> >
> > Best,
> > Wei
> >
> 
> It's a quick work-around but it doesn't make me very happy.
> 
> AFAIK e.g. RHEL has a debug kernel with poisoning enabled.
> If this never uses free page hinting at all, it will be much less useful for
> debugging guests.
> 

I understand your concern. I think people who use debugging guests don't regard 
performance as the first priority, and most vendors usually wouldn't use 
debugging guests for their products.

How about taking it as the initial solution? We can explore more solutions 
after this series is done.

Best,
Wei





Re: [Qemu-devel] [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ

2017-09-05 Thread Wang, Wei W
Ping for comments if possible. Thanks.

On Monday, August 28, 2017 6:09 PM, Wang, Wei W wrote:
> [PATCH v15 5/5] virtio-balloon: VIRTIO_BALLOON_F_CTRL_VQ
> 
> Add a new vq, ctrl_vq, to handle commands between the host and guest.
> With this feature, we will be able to have the control plane and data plane
> separated. In other words, the control related data of each feature will be 
> sent
> via the ctrl_vq cmds, meanwhile each feature may have its own data plane vq.
> 
> Free page report is the the first new feature controlled via ctrl_vq, and a 
> new
> cmd class, VIRTIO_BALLOON_CTRLQ_CLASS_FREE_PAGE, is added.
> Currently, this feature has two cmds:
> VIRTIO_BALLOON_FREE_PAGE_F_START: This cmd is sent from host to guest to
> start the free page reporting work.
> VIRTIO_BALLOON_FREE_PAGE_F_STOP: This cmd is used bidirectionally. The
> guest would send the cmd to the host to indicate the reporting work is done.
> The host would send the cmd to the guest to actively request the stop of the
> reporting work.
> 
> The free_page_vq is used to transmit the guest free page blocks to the host.
> 
> Signed-off-by: Wei Wang <wei.w.w...@intel.com>
> Signed-off-by: Liang Li <liang.z...@intel.com>
> ---
>  drivers/virtio/virtio_balloon.c | 247 +-
> --
>  include/uapi/linux/virtio_balloon.h |  15 +++
>  2 files changed, 242 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_balloon.c 
> b/drivers/virtio/virtio_balloon.c index
> 8ecc1d4..1d384a4 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -55,7 +55,13 @@ static struct vfsmount *balloon_mnt;
> 
>  struct virtio_balloon {
>   struct virtio_device *vdev;
> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
> + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *ctrl_vq,
> +  *free_page_vq;
> +
> + /* Balloon's own wq for cpu-intensive work items */
> + struct workqueue_struct *balloon_wq;
> + /* The work items submitted to the balloon wq are listed here */
> + struct work_struct report_free_page_work;
> 
>   /* The balloon servicing is delegated to a freezable workqueue. */
>   struct work_struct update_balloon_stats_work; @@ -65,6 +71,9 @@
> struct virtio_balloon {
>   spinlock_t stop_update_lock;
>   bool stop_update;
> 
> + /* Stop reporting free pages */
> + bool report_free_page_stop;
> +
>   /* Waiting for host to ack the pages we released. */
>   wait_queue_head_t acked;
> 
> @@ -93,6 +102,11 @@ struct virtio_balloon {
> 
>   /* To register callback in oom notifier call chain */
>   struct notifier_block nb;
> +
> + /* Host to guest ctrlq cmd buf for free page report */
> + struct virtio_balloon_ctrlq_cmd free_page_cmd_in;
> + /* Guest to Host ctrlq cmd buf for free page report */
> + struct virtio_balloon_ctrlq_cmd free_page_cmd_out;
>  };
> 
>  static struct virtio_device_id id_table[] = { @@ -177,6 +191,26 @@ static 
> void
> send_balloon_page_sg(struct virtio_balloon *vb,
>   }
>  }
> 
> +static void send_free_page_sg(struct virtqueue *vq, void *addr,
> +uint32_t size) {
> + unsigned int len;
> + int err = -ENOSPC;
> +
> + do {
> + if (vq->num_free) {
> + err = add_one_sg(vq, addr, size);
> + /* Sanity check: this can't really happen */
> + WARN_ON(err);
> + if (!err)
> + virtqueue_kick(vq);
> + }
> +
> + /* Release entries if there are */
> + while (virtqueue_get_buf(vq, ))
> + ;
> + } while (err == -ENOSPC && vq->num_free); }
> +
>  /*
>   * Send balloon pages in sgs to host. The balloon pages are recorded in the
>   * page xbitmap. Each bit in the bitmap corresponds to a page of PAGE_SIZE.
> @@ -525,42 +559,206 @@ static void update_balloon_size_func(struct
> work_struct *work)
>   queue_work(system_freezable_wq, work);  }
> 
> -static int init_vqs(struct virtio_balloon *vb)
> +static bool virtio_balloon_send_free_pages(void *opaque, unsigned long pfn,
> +unsigned long nr_pages)
> +{
> + struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> + void *addr = (void *)pfn_to_kaddr(pfn);
> + uint32_t len = nr_pages << PAGE_SHIFT;
> +
> + if (vb->report_free_page_stop)
> + return 1;
> +
> + send_free_page_sg(vb->free_page_vq, addr, len);
> +
> + return 0;
> +}
> +
> +static voi

Re: [Qemu-devel] [PATCH v12 5/8] virtio-balloon: VIRTIO_BALLOON_F_SG

2017-07-29 Thread Wang, Wei W
On Sunday, July 30, 2017 12:23 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 29, 2017 at 08:47:08PM +0800, Wei Wang wrote:
> > On 07/29/2017 07:08 AM, Michael S. Tsirkin wrote:
> > > On Thu, Jul 27, 2017 at 10:50:11AM +0800, Wei Wang wrote:
> > > > > > > OK I thought this over. While we might need these new APIs
> > > > > > > in the future, I think that at the moment, there's a way to
> > > > > > > implement this feature that is significantly simpler. Just
> > > > > > > add each s/g as a separate input buffer.
> > > > > > Should it be an output buffer?
> > > > > Hypervisor overwrites these pages with zeroes. Therefore it is
> > > > > writeable by device: DMA_FROM_DEVICE.
> > > > Why would the hypervisor need to zero the buffer?
> > > The page is supplied to hypervisor and can lose the value that is
> > > there.  That is the definition of writeable by device.
> >
> > I think for the free pages, it should be clear that they will be added
> > as output buffer to the device, because (as we discussed) they are
> > just hints, and some of them may be used by the guest after the report_ API 
> > is
> invoked.
> > The device/hypervisor should not use or discard them.
> 
> Discarding contents is exactly what you propose doing if migration is going 
> on,
> isn't it?

That's actually a different concept. Please let me explain it with this example:

The hypervisor receives the hint saying the guest PageX is a free page, but as 
we know, after the report_ API exits, the guest kernel may take PageX back into 
use, so PageX is no longer a free page. At that point, if the hypervisor writes 
to the page, that would crash the guest.
So, I think the cornerstone of this work is that the hypervisor must not touch 
the reported pages.

Best,
Wei



Re: [Qemu-devel] [PATCH v12 6/8] mm: support reporting free page blocks

2017-07-26 Thread Wang, Wei W
On Wednesday, July 26, 2017 7:55 PM, Michal Hocko wrote:
> On Wed 26-07-17 19:44:23, Wei Wang wrote:
> [...]
> > I thought about it more. Probably we can use the callback function
> > with a little change like this:
> >
> > void walk_free_mem(void *opaque1, void (*visit)(void *opaque2,
> > unsigned long pfn,
> >unsigned long nr_pages))
> > {
> > ...
> > for_each_populated_zone(zone) {
> >for_each_migratetype_order(order, type) {
> > report_unused_page_block(zone, order, type,
> > ); // from patch 6
> > pfn = page_to_pfn(page);
> > visit(opaque1, pfn, 1 << order);
> > }
> > }
> > }
> >
> > The above function scans all the free list and directly sends each
> > free page block to the hypervisor via the virtio_balloon callback
> > below. No need to implement a bitmap.
> >
> > In virtio-balloon, we have the callback:
> > void *virtio_balloon_report_unused_pages(void *opaque,  unsigned long
> > pfn, unsigned long nr_pages) {
> > struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> > ...put the free page block to the the ring of vb; }
> >
> >
> > What do you think?
> 
> I do not mind conveying a context to the callback. I would still prefer
> to keep the original min_order to check semantic though. Why? Well,
> it doesn't make much sense to scan low order free blocks all the time
> because they are simply too volatile. Larger blocks tend to surivive for
> longer. So I assume you would only care about larger free blocks. This
> will also make the call cheaper.
> --

OK, I will keep min_order there in the next version.

Best,
Wei



Re: [Qemu-devel] [PATCH v12 6/8] mm: support reporting free page blocks

2017-07-25 Thread Wang, Wei W
On Tuesday, July 25, 2017 8:42 PM, hal Hocko wrote:
> On Tue 25-07-17 19:56:24, Wei Wang wrote:
> > On 07/25/2017 07:25 PM, Michal Hocko wrote:
> > >On Tue 25-07-17 17:32:00, Wei Wang wrote:
> > >>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> > On 07/19/2017 04:13 PM, Michal Hocko wrote:
> > >>>[...
> > We don't need to do the pfn walk in the guest kernel. When the API
> > reports, for example, a 2MB free page block, the API caller offers to
> > the hypervisor the base address of the page block, and size=2MB, to
> > the hypervisor.
> 
> So you want to skip pfn walks by regularly calling into the page allocator to
> update your bitmap. If that is the case then would an API that would allow you
> to update your bitmap via a callback be s sufficient? Something like
>   void walk_free_mem(int node, int min_order,
>   void (*visit)(unsigned long pfn, unsigned long 
> nr_pages))
> 
> The function will call the given callback for each free memory block on the 
> given
> node starting from the given min_order. The callback will be strictly an 
> atomic
> and very light context. You can update your bitmap from there.

I need to give a bit more background here:
the hypervisor and the guest live in their own address spaces. The hypervisor's 
bitmap isn't visible to the guest, and I don't think we could pass a callback 
function from the hypervisor to the guest in this case either.

> 
> This would address my main concern that the allocator internals would get
> outside of the allocator proper. 

What issue would there be with exposing the internal for_each_zone()?
I think new code that calls it will also be strictly reviewed when it is pushed 
upstream.

Best,
Wei






Re: [Qemu-devel] [PATCH v4] virtio-net: enable configurable tx queue size

2017-07-06 Thread Wang, Wei W
On Thursday, July 6, 2017 9:49 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 28, 2017 at 10:37:59AM +0800, Wei Wang wrote:
> > This patch enables the virtio-net tx queue size to be configurable
> > between 256 (the default queue size) and 1024 by the user when the
> > vhost-user backend is used.
> >
> > Currently, the maximum tx queue size for other backends is 512 due to
> > the following limitations:
> > - QEMU backend: the QEMU backend implementation in some cases may send
> > 1024+1 iovs to writev.
> > - Vhost_net backend: there are possibilities that the guest sends a
> > vring_desc of memory which crosses a MemoryRegion thereby generating
> > more than 1024 iovs after translation from guest-physical address in
> > the backend.
> >
> > Signed-off-by: Wei Wang 
> 
> Could you pls add a bit info about how this was tested?
> Was any special setup for dpdk necessary?

Yes, I used the vhost-user implementation in DPDK. So, on the host, I have the 
legacy OVS-DPDK setup ready (I'm using dpdk-stable-16.11.1 and 
openvswitch-2.6.1; the setup steps can be found inside the source code 
directories).

When booting the guest, I have the following QEMU commands:
-chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1
-netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
-device virtio-net-pci,mac=52:54:00:00:00:01,netdev=mynet1,tx_queue_size=1024

To check the guest tx queue size, I simply added a printk() at the end of 
virtnet_probe() to print out vi->sq->vq->num_free, which initially equals the 
queue size.
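
For reference, the debug line was essentially the following (testing only, not 
part of the patch):

    pr_info("virtio-net: tx vq num_free = %u\n", vi->sq->vq->num_free);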

Then, I did Ping and netperf tests to transmit packets between VMs, which 
worked fine.

Once the related configuration support in libvirt is ready, I think we can get 
the customer to try it in their test environment, too.

Best,
Wei








Re: [Qemu-devel] [PATCH v3 1/2] virtio-net: enable configurable tx queue size

2017-06-26 Thread Wang, Wei W
On Tuesday, June 27, 2017 9:51 AM, Eric Blake wrote:
> On 06/22/2017 09:32 PM, Wei Wang wrote:
> > This patch enables the virtio-net tx queue size to be configurable 
> > between 256 (the default queue size) and 1024 by the user when the 
> > vhost-user backend is used.
> 
> When sending a multi-patch series, don't forget the 0/2 cover letter.
> 

Thanks for the reminder, and I'll do it in the next version.

Best,
Wei 


  1   2   >