Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-12-03 Thread Fei Li

Hi Juan,

Kindly ping again. :)

Have a nice day, thanks
Fei

On 11/12/2018 12:43 PM, Fei Li wrote:
> Hi Juan,
>
> Kindly ping, as this multifd migration topic needs your suggestions. :)
>
> Have a nice day, thanks
> Fei

Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-11 Thread Fei Li

Hi Juan,

Kindly ping, as this multifd migration topic needs your suggestions. :)

Have a nice day, thanks
Fei

Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-02 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> On Fri, Nov 02, 2018 at 11:00:24AM +0800, Fei Li wrote:
> > 
> > 
> > On 11/02/2018 10:37 AM, Peter Xu wrote:
> > > On Thu, Nov 01, 2018 at 06:17:11PM +0800, Fei Li wrote:
> > > > Set the migration state to "failed" instead of "setup" when failing
> > > > to send packet via some channel.
> > > Could you please provide more information in the commit message?
> > > E.g., what will happen if without this patch?  Will it crash the
> > > source or stall the source migration or others?  Otherwise it's a bit
> > > hard for me to understand what's this patch for.
> > Sorry for the inadequate description , I was intended to say that when
> > failing
> > to do the live migration using multifd, e.g. sending less channels, the src
> > status displays "setup" when running `info migrate`. I assume we should tell
> > users that the "Migration status" is "failed" now (and along with the
> > failure reason).
> > 
> > The current src status when failed in multifd_new_send_channel_async():
> > 
> > 
> > (qemu) migrate_set_capability x-multifd on
> > (qemu) migrate_set_parameter x-multifd-channels 4
> > (qemu) migrate -d tcp:192.168.190.98:
> > (qemu) qemu-system-x86_64: failed in multifd_new_send_channel_async due to
> > ...
> > (qemu) info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > decompress-error-check: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks:
> > off compress: off events: off postcopy-ram: off x-colo: off release-ram: off
> > block: off return-path: off pause-before-switchover: off x-multifd: on
> > dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
> > Migration status: setup
> > total time: 0 milliseconds
> 
> Thanks for the information.
> 
> I had a quick look.  For now we do this:
> 
> multifd_save_setup (without waiting for channels to be ready)
> create thread migration_thread
> (in thread)
> ram_save_setup
> multifd_send_sync_main (wait for the channels)
> 
> The thing is that we didn't get the notification when one of the
> multifd channel is failed.  IMHO instead of setting the global
> migration state in a per-channel function, we should just report the
> error upwards, then the main thread should decide how to change the
> state machine of the migration.

Best to wait for Juan on that; I've got vague memories that reporting
errors among the threads was a bit tricky.

Dave

> And we have set it in migrate_set_error() after all so the main thread
> should be able to know somehow (though IMHO I'll even prefer to have a
> per-channel variable to keep the state of the channel, then the
> per-channel functions won't touch any globals which offers better
> isolation).
> 
> I'm not sure how Juan thinks about it, but I'd prefer some work to
> provide such isolation and also some mechanism to allow the main
> thread to detect the per-channel errors not only during setup phase
> but also during the migration (e.g., when network is suddenly down).
> Then we don't touch any globals (e.g., we shouldn't call
> migrate_get_current in any per-channel function like
> multifd_new_send_channel_async).
> 
> > 
> > > 
> > > Normally I would prefer to not touch global states in feature specific
> > > code path, but I'd like to know the problem more first...
> > > 
> > > Thanks,
> > > 
> > > > Cc: Peter Xu 
> > > > Signed-off-by: Fei Li 
> > > > ---
> > > >   migration/ram.c | 2 ++
> > > >   1 file changed, 2 insertions(+)
> > > > 
> > > > diff --git a/migration/ram.c b/migration/ram.c
> > > > index 4db3b3e8f4..c84d164fc8 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -1072,6 +1072,7 @@ out:
> > > >   static void multifd_new_send_channel_async(QIOTask *task, gpointer 
> > > > opaque)
> > > >   {
> > > >   MultiFDSendParams *p = opaque;
> > > > +MigrationState *s = migrate_get_current();
> > > >   QIOChannel *sioc = QIO_CHANNEL(qio_task_get_source(task));
> > > >   Error *local_err = NULL;
> > > > @@ -1083,6 +1084,7 @@ static void 
> > > > multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
> > > >   if (multifd_save_cleanup(&local_err) != 0) {
> > > >   migrate_set_error(migrate_get_current(), local_err);
> > > >   }
> > > > +migrate_set_state(&s->state, s->state,
> > > > MIGRATION_STATUS_FAILED);
> > > >   } else {
> > > >   p->c = QIO_CHANNEL(sioc);
> > > >   qio_channel_set_delay(p->c, false);
> > > > -- 
> > > > 2.13.7
> > > > 
> > > Regards,
> > > 
> > 
> 
> Regards,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-02 Thread Peter Xu
On Fri, Nov 02, 2018 at 03:13:05PM +0800, Fei Li wrote:
> 
> 
> On 11/02/2018 11:32 AM, Peter Xu wrote:
> > On Fri, Nov 02, 2018 at 11:00:24AM +0800, Fei Li wrote:
> > > 
> > > On 11/02/2018 10:37 AM, Peter Xu wrote:
> > > > On Thu, Nov 01, 2018 at 06:17:11PM +0800, Fei Li wrote:
> > > > > Set the migration state to "failed" instead of "setup" when failing
> > > > > to send packet via some channel.
> > > > Could you please provide more information in the commit message?
> > > > E.g., what will happen if without this patch?  Will it crash the
> > > > source or stall the source migration or others?  Otherwise it's a bit
> > > > hard for me to understand what's this patch for.
> > > Sorry for the inadequate description , I was intended to say that when
> > > failing
> > > to do the live migration using multifd, e.g. sending less channels, the 
> > > src
> > > status displays "setup" when running `info migrate`. I assume we should 
> > > tell
> > > users that the "Migration status" is "failed" now (and along with the
> > > failure reason).
> > > 
> > > The current src status when failed in multifd_new_send_channel_async():
> > > 
> > > 
> > > (qemu) migrate_set_capability x-multifd on
> > > (qemu) migrate_set_parameter x-multifd-channels 4
> > > (qemu) migrate -d tcp:192.168.190.98:
> > > (qemu) qemu-system-x86_64: failed in multifd_new_send_channel_async due to
> > > ...
> > > (qemu) info migrate
> > > globals:
> > > store-global-state: on
> > > only-migratable: off
> > > send-configuration: on
> > > send-section-footer: on
> > > decompress-error-check: on
> > > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off 
> > > zero-blocks:
> > > off compress: off events: off postcopy-ram: off x-colo: off release-ram: 
> > > off
> > > block: off return-path: off pause-before-switchover: off x-multifd: on
> > > dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
> > > Migration status: setup
> > > total time: 0 milliseconds
> > Thanks for the information.
> > 
> > I had a quick look.  For now we do this:
> > 
> >  multifd_save_setup (without waiting for channels to be ready)
> >  create thread migration_thread
> >  (in thread)
> >  ram_save_setup
> >  multifd_send_sync_main (wait for the channels)

[1]

> > 
> > The thing is that we didn't get the notification when one of the
> > multifd channel is failed.  IMHO instead of setting the global
> > migration state in a per-channel function, we should just report the
> > error upwards, then the main thread should decide how to change the
> > state machine of the migration.
> Thanks for the detail explanation, do agree with reporting and letting
> the main thread handle this. :)
> But one thing to note is that during my previous debugging, I remember
> sometimes the main thread: migration_thread() is called earlier than
> the first channel is ready in multifd_new_send_channel_async(). Thus
> we should be careful about where/when to check the state of the channel.

Yeah, I guess that's exactly the stack I described at [1] above.

So my preference here would be that in multifd_save_setup() we don't
continue until we know that the sockets are ready.  After all, AFAIU
we currently depend on all the channels when migrating, so we can't
really do anything without all the channels being ready.  That'll
simplify the error handling of the case you've encountered during
SETUP.

> > And we have set it in migrate_set_error() after all so the main thread
> > should be able to know somehow
> But in our current code, the main thread has not utilized the s->error
> to know whether the migration state, right? As I checked the code,
> the s->error is only used
> - in qmp query: copy s->error to info->error_desc when detecting the migrate
> status is failed;
> - in migrate_fd_cleanup() when migrate_fd_connect() fails: print the error
> Or the s->error is just used in this way?

Hmm, _maybe_ we can introduce MultiFDSendParams.err then we can put
per-thread error there.
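
A minimal sketch of what a per-channel error slot could look like. This
is an assumption-laden illustration, not the actual MultiFDSendParams
layout: SendParams, params_init, channel_set_error and
collect_first_error are hypothetical names, and the error is a plain
string here rather than QEMU's Error object. Locking is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define ERR_MAX 128

/* Hypothetical stand-in for a per-channel parameter block. */
typedef struct {
    int id;
    char err[ERR_MAX];  /* empty string means "no error" */
} SendParams;

static void params_init(SendParams *p, int id)
{
    memset(p, 0, sizeof(*p));
    p->id = id;
}

/* Per-channel code: record the failure locally, leave globals alone. */
static void channel_set_error(SendParams *p, const char *msg)
{
    assert(p != NULL && msg != NULL);
    if (p->err[0] == '\0') {  /* keep only the first error */
        snprintf(p->err, sizeof(p->err), "channel %d: %s", p->id, msg);
    }
}

/*
 * Main thread: copy the first per-channel error into `out`.
 * Returns the failing channel id, or -1 if no channel reported an error.
 */
static int collect_first_error(const SendParams *params, int n,
                               char *out, size_t out_len)
{
    for (int i = 0; i < n; i++) {
        if (params[i].err[0] != '\0') {
            snprintf(out, out_len, "%s", params[i].err);
            return params[i].id;
        }
    }
    if (out_len > 0) {
        out[0] = '\0';
    }
    return -1;
}
```

The main thread would call collect_first_error() at the points where it
already synchronizes with the channels, and only then update the global
migration error and state.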

> >   (though IMHO I'll even prefer to have a
> > per-channel variable to keep the state of the channel, then the
> > per-channel functions won't touch any globals which offers better
> > isolation).
> > 
> > I'm not sure how Juan thinks about it, but I'd prefer some work to
> > provide such isolation and also some mechanism to allow the main
> > thread to detect the per-channel errors not only during setup phase
> > but also during the migration (e.g., when network is suddenly down).
> > Then we don't touch any globals (e.g., we shouldn't call
> > migrate_get_current in any per-channel function like
> > multifd_new_send_channel_async).
> Ok, wait for Juan's comment. :)

Yes.

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-02 Thread Fei Li




On 11/02/2018 11:32 AM, Peter Xu wrote:

> Thanks for the information.
>
> I had a quick look.  For now we do this:
>
>  multifd_save_setup (without waiting for channels to be ready)
>  create thread migration_thread
>  (in thread)
>  ram_save_setup
>  multifd_send_sync_main (wait for the channels)
>
> The thing is that we didn't get the notification when one of the
> multifd channel is failed.  IMHO instead of setting the global
> migration state in a per-channel function, we should just report the
> error upwards, then the main thread should decide how to change the
> state machine of the migration.

Thanks for the detailed explanation. I agree with reporting the error and
letting the main thread handle it. :)
But one thing to note: during my previous debugging, I remember that
sometimes the main thread, migration_thread(), is called before the
first channel is ready in multifd_new_send_channel_async(). Thus
we should be careful about where/when to check the state of the channel.

> And we have set it in migrate_set_error() after all so the main thread
> should be able to know somehow

But in our current code, the main thread does not use s->error to learn
the migration state, right? As far as I can see from the code, s->error
is only used
- in the qmp query: s->error is copied to info->error_desc when the
  migration status is detected as failed;
- in migrate_fd_cleanup() when migrate_fd_connect() fails: the error is
  printed.
Or is s->error meant to be used only in this way?

>   (though IMHO I'll even prefer to have a
> per-channel variable to keep the state of the channel, then the
> per-channel functions won't touch any globals which offers better
> isolation).
>
> I'm not sure how Juan thinks about it, but I'd prefer some work to
> provide such isolation and also some mechanism to allow the main
> thread to detect the per-channel errors not only during setup phase
> but also during the migration (e.g., when network is suddenly down).
> Then we don't touch any globals (e.g., we shouldn't call
> migrate_get_current in any per-channel function like
> multifd_new_send_channel_async).

Ok, let's wait for Juan's comment. :)

Have a nice day, and thanks again for the detailed explanation.
Fei

Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-01 Thread Peter Xu
On Fri, Nov 02, 2018 at 11:00:24AM +0800, Fei Li wrote:
> 
> 
> On 11/02/2018 10:37 AM, Peter Xu wrote:
> > On Thu, Nov 01, 2018 at 06:17:11PM +0800, Fei Li wrote:
> > > Set the migration state to "failed" instead of "setup" when failing
> > > to send packet via some channel.
> > Could you please provide more information in the commit message?
> > E.g., what will happen if without this patch?  Will it crash the
> > source or stall the source migration or others?  Otherwise it's a bit
> > hard for me to understand what's this patch for.
> Sorry for the inadequate description , I was intended to say that when
> failing
> to do the live migration using multifd, e.g. sending less channels, the src
> status displays "setup" when running `info migrate`. I assume we should tell
> users that the "Migration status" is "failed" now (and along with the
> failure reason).
> 
> The current src status when failed in multifd_new_send_channel_async():
> 
> 
> (qemu) migrate_set_capability x-multifd on
> (qemu) migrate_set_parameter x-multifd-channels 4
> (qemu) migrate -d tcp:192.168.190.98:
> (qemu) qemu-system-x86_64: failed in multifd_new_send_channel_async due to
> ...
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks:
> off compress: off events: off postcopy-ram: off x-colo: off release-ram: off
> block: off return-path: off pause-before-switchover: off x-multifd: on
> dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
> Migration status: setup
> total time: 0 milliseconds

Thanks for the information.

I had a quick look.  For now we do this:

multifd_save_setup (without waiting for channels to be ready)
create thread migration_thread
(in thread)
ram_save_setup
multifd_send_sync_main (wait for the channels)

The thing is that we don't get a notification when one of the
multifd channels has failed.  IMHO instead of setting the global
migration state in a per-channel function, we should just report the
error upwards, then the main thread should decide how to change the
state machine of the migration.

And we have set it in migrate_set_error() after all so the main thread
should be able to know somehow (though IMHO I'll even prefer to have a
per-channel variable to keep the state of the channel, then the
per-channel functions won't touch any globals which offers better
isolation).

I'm not sure how Juan thinks about it, but I'd prefer some work to
provide such isolation and also some mechanism to allow the main
thread to detect the per-channel errors not only during setup phase
but also during the migration (e.g., when network is suddenly down).
Then we don't touch any globals (e.g., we shouldn't call
migrate_get_current in any per-channel function like
multifd_new_send_channel_async).
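
To illustrate the isolation described above, here is a small sketch in
which each channel only records its own state and the main thread alone
derives the overall migration state. Everything in it is hypothetical
(ChannelState, OverallState, Channel, channel_report,
decide_overall_state); it does not reflect existing QEMU structures, and
the synchronization between threads is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-channel state; not taken from QEMU. */
typedef enum {
    CHANNEL_CONNECTING,
    CHANNEL_READY,
    CHANNEL_FAILED,
} ChannelState;

/* Hypothetical overall state, decided by the main thread only. */
typedef enum {
    OVERALL_SETUP,
    OVERALL_ACTIVE,
    OVERALL_FAILED,
} OverallState;

typedef struct {
    int id;
    ChannelState state;  /* written only by this channel's own code */
} Channel;

/* Per-channel code records its own outcome and touches no globals. */
static void channel_report(Channel *c, int ok)
{
    assert(c != NULL);
    c->state = ok ? CHANNEL_READY : CHANNEL_FAILED;
}

/* Main thread: derive the overall state from the per-channel states. */
static OverallState decide_overall_state(const Channel *chans, size_t n)
{
    size_t ready = 0;

    for (size_t i = 0; i < n; i++) {
        if (chans[i].state == CHANNEL_FAILED) {
            return OVERALL_FAILED;  /* any failed channel fails the whole */
        }
        if (chans[i].state == CHANNEL_READY) {
            ready++;
        }
    }
    return ready == n ? OVERALL_ACTIVE : OVERALL_SETUP;
}
```

The same check works both during setup and later during the migration,
for example when a channel flips to CHANNEL_FAILED because the network
went down.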

> 
> > 
> > Normally I would prefer to not touch global states in feature specific
> > code path, but I'd like to know the problem more first...
> > 
> > Thanks,
> > 
> > > Cc: Peter Xu 
> > > Signed-off-by: Fei Li 
> > > ---
> > >   migration/ram.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index 4db3b3e8f4..c84d164fc8 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -1072,6 +1072,7 @@ out:
> > >   static void multifd_new_send_channel_async(QIOTask *task, gpointer 
> > > opaque)
> > >   {
> > >   MultiFDSendParams *p = opaque;
> > > +MigrationState *s = migrate_get_current();
> > >   QIOChannel *sioc = QIO_CHANNEL(qio_task_get_source(task));
> > >   Error *local_err = NULL;
> > > @@ -1083,6 +1084,7 @@ static void multifd_new_send_channel_async(QIOTask 
> > > *task, gpointer opaque)
> > >   if (multifd_save_cleanup(&local_err) != 0) {
> > >   migrate_set_error(migrate_get_current(), local_err);
> > >   }
> > > +migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
> > >   } else {
> > >   p->c = QIO_CHANNEL(sioc);
> > >   qio_channel_set_delay(p->c, false);
> > > -- 
> > > 2.13.7
> > > 
> > Regards,
> > 
> 

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-01 Thread Fei Li




On 11/02/2018 10:37 AM, Peter Xu wrote:
> On Thu, Nov 01, 2018 at 06:17:11PM +0800, Fei Li wrote:
> > Set the migration state to "failed" instead of "setup" when failing
> > to send packet via some channel.
> Could you please provide more information in the commit message?
> E.g., what will happen if without this patch?  Will it crash the
> source or stall the source migration or others?  Otherwise it's a bit
> hard for me to understand what's this patch for.

Sorry for the inadequate description. I intended to say that when live
migration using multifd fails, e.g. when sending fewer channels, the src
status displays "setup" when running `info migrate`. I assume we should tell
users that the "Migration status" is "failed" now (along with the
failure reason).

The current src status after a failure in multifd_new_send_channel_async():

(qemu) migrate_set_capability x-multifd on
(qemu) migrate_set_parameter x-multifd-channels 4
(qemu) migrate -d tcp:192.168.190.98:
(qemu) qemu-system-x86_64: failed in multifd_new_send_channel_async due to ...
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off
compress: off events: off postcopy-ram: off x-colo: off release-ram: off
block: off return-path: off pause-before-switchover: off x-multifd: on
dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: setup
total time: 0 milliseconds



Re: [Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-01 Thread Peter Xu
On Thu, Nov 01, 2018 at 06:17:11PM +0800, Fei Li wrote:
> Set the migration state to "failed" instead of "setup" when failing
> to send packet via some channel.

Could you please provide more information in the commit message?
E.g., what will happen without this patch?  Will it crash the
source, stall the source migration, or something else?  Otherwise it's
a bit hard for me to understand what this patch is for.

Normally I would prefer to not touch global states in feature specific
code path, but I'd like to know the problem more first...

Thanks,

> 
> Cc: Peter Xu 
> Signed-off-by: Fei Li 
> ---
>  migration/ram.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 4db3b3e8f4..c84d164fc8 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1072,6 +1072,7 @@ out:
>  static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
>  {
>  MultiFDSendParams *p = opaque;
> +MigrationState *s = migrate_get_current();
>  QIOChannel *sioc = QIO_CHANNEL(qio_task_get_source(task));
>  Error *local_err = NULL;
>  
> @@ -1083,6 +1084,7 @@ static void multifd_new_send_channel_async(QIOTask 
> *task, gpointer opaque)
>  if (multifd_save_cleanup(&local_err) != 0) {
>  migrate_set_error(migrate_get_current(), local_err);
>  }
> +migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
>  } else {
>  p->c = QIO_CHANNEL(sioc);
>  qio_channel_set_delay(p->c, false);
> -- 
> 2.13.7
> 

Regards,

-- 
Peter Xu



[Qemu-devel] [PATCH RFC v7 5/9] migration: fix the multifd code when sending less channels

2018-11-01 Thread Fei Li
Set the migration state to "failed" instead of "setup" when failing
to send packet via some channel.

Cc: Peter Xu 
Signed-off-by: Fei Li 
---
 migration/ram.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index 4db3b3e8f4..c84d164fc8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1072,6 +1072,7 @@ out:
 static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 {
     MultiFDSendParams *p = opaque;
+    MigrationState *s = migrate_get_current();
     QIOChannel *sioc = QIO_CHANNEL(qio_task_get_source(task));
     Error *local_err = NULL;
 
@@ -1083,6 +1084,7 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
         if (multifd_save_cleanup(&local_err) != 0) {
             migrate_set_error(migrate_get_current(), local_err);
         }
+        migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED);
     } else {
         p->c = QIO_CHANNEL(sioc);
         qio_channel_set_delay(p->c, false);
-- 
2.13.7