Re: [Qemu-devel] [PATCH v6 11/11] migration: create a dedicated thread to release rdma resource

2018-08-20 Thread 858585 jemmy
On Fri, Aug 17, 2018 at 10:59 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> ibv_dereg_mr wait for a long time for big memory size virtual server.
>>
>> The test result is:
>>   10GB  326ms
>>   20GB  699ms
>>   30GB  1021ms
>>   40GB  1387ms
>>   50GB  1712ms
>>   60GB  2034ms
>>   70GB  2457ms
>>   80GB  2807ms
>>   90GB  3107ms
>>   100GB 3474ms
>>   110GB 3735ms
>>   120GB 4064ms
>>   130GB 4567ms
>>   140GB 4886ms
>>
>> this will cause the guest os hang for a while when migration finished.
>> So create a dedicated thread to release rdma resource.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/migration.c |  6 ++
>>  migration/migration.h |  3 +++
>>  migration/rdma.c  | 47 +++
>>  3 files changed, 40 insertions(+), 16 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index f7d6e26..25d9009 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1499,6 +1499,7 @@ void migrate_init(MigrationState *s)
>>  s->vm_was_running = false;
>>  s->iteration_initial_bytes = 0;
>>  s->threshold_size = 0;
>> +s->rdma_cleanup_thread_quit = true;
>>  }
>>
>>  static GSList *migration_blockers;
>> @@ -1660,6 +1661,10 @@ static bool migrate_prepare(MigrationState *s, bool 
>> blk, bool blk_inc,
>>  return false;
>>  }
>>
>> +if (s->rdma_cleanup_thread_quit != true) {
>> +return false;
>> +}
>
> That's not good! We error out without saying anything to the user
> about why;  if we error we must always say why, otherwise we'll get
> bug reports from people without anyway to debug them.

Yes, it should report an error message.

>
> However, we also don't need to; all we need to do is turn this into
> a semaphore or similar, and then wait for it at the start of
> 'rdma_start_outgoing_migration' - that way there's no error exit, it
> just waits a few seconds and then carries on correctly.
> (Maybe we need to wait in incoming as well to cope with
> postcopy-recovery).

Yes, this should also consider the postcopy-recovery case. I will fix it.
But rdma_start_outgoing_migration is invoked in the main thread, so it is a
little complex to wait there.
And if it waits for a long time, it is not easy to troubleshoot.

Maybe reporting an error message is an easier way to fix it?
Thanks.
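
For example, a minimal sketch of the error-report variant (the flag is the one
added by this patch; the message wording is just illustrative):

/* in migrate_prepare(), next to the other early checks */
if (!s->rdma_cleanup_thread_quit) {
    error_setg(errp, "RDMA cleanup of the previous migration is still "
               "in progress, please retry later");
    return false;
}

The semaphore approach would avoid the user-visible failure, but it means
blocking the main thread in rdma_start_outgoing_migration until the cleanup
thread finishes.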

>
> Dave
>
>>  if (runstate_check(RUN_STATE_INMIGRATE)) {
>>  error_setg(errp, "Guest is waiting for an incoming migration");
>>  return false;
>> @@ -3213,6 +3218,7 @@ static void migration_instance_init(Object *obj)
>>
>>  ms->state = MIGRATION_STATUS_NONE;
>>  ms->mbps = -1;
>> +ms->rdma_cleanup_thread_quit = true;
>>  qemu_sem_init(&ms->pause_sem, 0);
>>  qemu_mutex_init(&ms->error_mutex);
>>
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 64a7b33..60138dd 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -224,6 +224,9 @@ struct MigrationState
>>   * do not trigger spurious decompression errors.
>>   */
>>  bool decompress_error_check;
>> +
>> +/* Set this when rdma resource have released */
>> +bool rdma_cleanup_thread_quit;
>>  };
>>
>>  void migrate_set_state(int *state, int old_state, int new_state);
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index e1498f2..3282f35 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2995,35 +2995,50 @@ static void 
>> qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>  }
>>  }
>>
>> -static int qio_channel_rdma_close(QIOChannel *ioc,
>> -  Error **errp)
>> +static void *qio_channel_rdma_close_thread(void *arg)
>>  {
>> -QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> -RDMAContext *rdmain, *rdmaout;
>> -trace_qemu_rdma_close();
>> -
>> -rdmain = rioc->rdmain;
>> -if (rdmain) {
>> -atomic_rcu_set(&rioc->rdmain, NULL);
>> -}
>> +RDMAContext **rdma = arg;
>> +RDMAContext *rdmain = rdma[0];
>> +RDMAContext *rdmaout = rdma[1];
>> +MigrationState *s = migrate_get_current();
>>
>> -rdmaout = rioc->rdmaout;
>> -if (rdmaout) {
>> -atomic_rcu_set(&rioc->rdmaout, NULL);
>> -}
>> +rcu_register_thread();
>>
>>  synchronize_rcu();
>> -
>>  if (rdmain) {
>>  qemu_rdma_cleanup(rdmain);
>>  }
>> -
>>  if (rdmaout) {
>>  qemu_rdma_cleanup(rdmaout);
>>  }
>>
>>  g_free(rdmain);
>>  g_free(rdmaout);
>> +g_free(rdma);
>> +
>> +rcu_unregister_thread();
>> +s->rdma_cleanup_thread_quit = true;
>> +return NULL;
>> +}
>> +
>> +static int qio_channel_rdma_close(QIOChannel *ioc,
>> +  Error **errp)
>> +{
>> +QemuThread t;
>> +QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> +RDMAContext **rdma = g_new0(RDMAContext*, 2);
>> +MigrationState *s = migrate_get_current();
>> +
>> +trace_qemu_rdma_close();
>> +   

Re: [Qemu-devel] [PATCH v6 10/11] migration: remove the unnecessary RDMA_CONTROL_ERROR message

2018-08-20 Thread 858585 jemmy
On Fri, Aug 17, 2018 at 10:04 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> It's not necessary to send RDMA_CONTROL_ERROR when clean up rdma resource.
>> If rdma->error_state is ture, the message may not send successfully.
>> and the cm event can also notify the peer qemu.
>>
>> Signed-off-by: Lidong Chen 
>
> How does this keep 'cancel' working; I added 32bce196344 last year to
> make that code also send the RDMA_CONTROL_ERROR in 'cancelling'.

I guess sending RDMA_CONTROL_ERROR is meant to notify the peer qemu to close
the rdma connection, and to make sure it receives the rdma_disconnect event.

But the two qemu sides should clean up rdma independently; maybe the
destination qemu is hung.

1. The current qemu version already does not wait for the
RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect.

2. The peer qemu already polls rdma->channel->fd. Compared to sending
RDMA_CONTROL_ERROR, using a cm event to notify the peer qemu to quit is
better: maybe the rdma is already in error_state, and RDMA_CONTROL_ERROR
cannot be sent successfully. When cancelling migration, the destination will
still receive RDMA_CM_EVENT_DISCONNECTED.

3. I prefer to use RDMA_CONTROL_ERROR to report code logic errors,
not rdma connection errors.

>
> Dave
>
>> ---
>>  migration/rdma.c | 11 ---
>>  1 file changed, 11 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index ae07515..e1498f2 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2305,17 +2305,6 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>>  int idx;
>>
>>  if (rdma->cm_id && rdma->connected) {
>> -if ((rdma->error_state ||
>> - migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) &&
>> -!rdma->received_error) {
>> -RDMAControlHeader head = { .len = 0,
>> -   .type = RDMA_CONTROL_ERROR,
>> -   .repeat = 1,
>> - };
>> -error_report("Early error. Sending error.");
>> -qemu_rdma_post_send_control(rdma, NULL, &head);
>> -}
>> -
>>  rdma_disconnect(rdma->cm_id);
>>  trace_qemu_rdma_cleanup_disconnect();
>>  rdma->connected = false;
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v6 09/11] migration: poll the cm event for destination qemu

2018-08-20 Thread 858585 jemmy
On Fri, Aug 17, 2018 at 10:01 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> The destination qemu only poll the comp_channel->fd in
>> qemu_rdma_wait_comp_channel. But when source qemu disconnnect
>> the rdma connection, the destination qemu should be notified.
>>
>> Signed-off-by: Lidong Chen 
>
> OK, this could do with an update to the migration_incoming_co comment in
> migration.h, since previously it was only used by colo; if we merge this
> first please post a patch to update the comment.

How about this?

/* The coroutine we should enter back for incoming migration */
Coroutine *migration_incoming_co;
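
Or, spelling out both users, something like this (just a wording suggestion):

/*
 * The coroutine we should enter back for incoming migration; entered on
 * COLO failover, and now also from rdma_cm_poll_handler to wake the
 * incoming side up when a cm error event arrives.
 */
Coroutine *migration_incoming_co;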

>
> Other than that, I think I'm OK:
>
> Reviewed-by: Dr. David Alan Gilbert 
>
>> ---
>>  migration/migration.c |  3 ++-
>>  migration/rdma.c  | 32 +++-
>>  2 files changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index df0c2cf..f7d6e26 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -389,6 +389,7 @@ static void process_incoming_migration_co(void *opaque)
>>  int ret;
>>
>>  assert(mis->from_src_file);
>> +mis->migration_incoming_co = qemu_coroutine_self();
>>  mis->largest_page_size = qemu_ram_pagesize_largest();
>>  postcopy_state_set(POSTCOPY_INCOMING_NONE);
>>  migrate_set_state(&mis->state, MIGRATION_STATUS_NONE,
>> @@ -418,7 +419,6 @@ static void process_incoming_migration_co(void *opaque)
>>
>>  /* we get COLO info, and know if we are in COLO mode */
>>  if (!ret && migration_incoming_enable_colo()) {
>> -mis->migration_incoming_co = qemu_coroutine_self();
>>  qemu_thread_create(&mis->colo_incoming_thread, "COLO incoming",
>>   colo_process_incoming_thread, mis, QEMU_THREAD_JOINABLE);
>>  mis->have_colo_incoming_thread = true;
>> @@ -442,6 +442,7 @@ static void process_incoming_migration_co(void *opaque)
>>  }
>>  mis->bh = qemu_bh_new(process_incoming_migration_bh, mis);
>>  qemu_bh_schedule(mis->bh);
>> +mis->migration_incoming_co = NULL;
>>  }
>>
>>  static void migration_incoming_setup(QEMUFile *f)
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 1affc46..ae07515 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -3226,6 +3226,35 @@ err:
>>
>>  static void rdma_accept_incoming_migration(void *opaque);
>>
>> +static void rdma_cm_poll_handler(void *opaque)
>> +{
>> +RDMAContext *rdma = opaque;
>> +int ret;
>> +struct rdma_cm_event *cm_event;
>> +MigrationIncomingState *mis = migration_incoming_get_current();
>> +
>> +ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> +if (ret) {
>> +error_report("get_cm_event failed %d", errno);
>> +return;
>> +}
>> +rdma_ack_cm_event(cm_event);
>> +
>> +if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
>> +cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
>> +error_report("receive cm event, cm event is %d", cm_event->event);
>> +rdma->error_state = -EPIPE;
>> +if (rdma->return_path) {
>> +rdma->return_path->error_state = -EPIPE;
>> +}
>> +
>> +if (mis->migration_incoming_co) {
>> +qemu_coroutine_enter(mis->migration_incoming_co);
>> +}
>> +return;
>> +}
>> +}
>> +
>>  static int qemu_rdma_accept(RDMAContext *rdma)
>>  {
>>  RDMACapabilities cap;
>> @@ -3326,7 +3355,8 @@ static int qemu_rdma_accept(RDMAContext *rdma)
>>  NULL,
>>  (void *)(intptr_t)rdma->return_path);
>>  } else {
>> -qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
>> +qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
>> +NULL, rdma);
>>  }
>>
>>  ret = rdma_accept(rdma->cm_id, &conn_param);
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v6 03/12] migration: avoid concurrent invoke channel_close by different threads

2018-08-06 Thread 858585 jemmy
This patch causes a compile error when running make check.

  LINK    tests/test-qdist
migration/qemu-file.o: In function `qemu_fclose':
/tmp/qemu-test/src/migration/qemu-file.c:331: undefined reference to
`migrate_get_current'
/tmp/qemu-test/src/migration/qemu-file.c:333: undefined reference to
`migrate_get_current'
collect2: error: ld returned 1 exit status
make: *** [tests/test-vmstate] Error 1

But I don't find an efficient way to fix it,
so I prefer to remove this patch from this series.
Maybe protecting this with migrate_get_current()->qemu_file_close_lock
is not the best approach anyway.
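
One alternative I am considering (a rough sketch only, not a tested patch):
keep the lock inside the RDMA channel itself, so qemu-file.c does not need
any migration/ symbols. The close_lock field name is illustrative.

struct QIOChannelRDMA {
    QIOChannel parent;
    RDMAContext *rdmain;
    RDMAContext *rdmaout;
    QEMUFile *file;
    bool blocking; /* XXX we don't actually honour this yet */
    QemuMutex close_lock; /* init when the channel is created, destroy on finalize */
};

static int qio_channel_rdma_close(QIOChannel *ioc, Error **errp)
{
    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);

    qemu_mutex_lock(&rioc->close_lock);
    /* existing close body: NULL out rioc->rdmain/rdmaout and clean up;
     * a second caller then sees NULL pointers and does nothing */
    qemu_mutex_unlock(&rioc->close_lock);
    return 0;
}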


On Fri, Aug 3, 2018 at 5:13 PM, Lidong Chen  wrote:
> From: Lidong Chen 
>
> The channel_close maybe invoked by different threads. For example, source
> qemu invokes qemu_fclose in main thread, migration thread and return path
> thread. Destination qemu invokes qemu_fclose in main thread, listen thread
> and COLO incoming thread.
>
> Signed-off-by: Lidong Chen 
> Reviewed-by: Daniel P. Berrangé 
> ---
>  migration/migration.c | 2 ++
>  migration/migration.h | 7 +++
>  migration/qemu-file.c | 6 --
>  3 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index b7d9854..a3a0756 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3200,6 +3200,7 @@ static void migration_instance_finalize(Object *obj)
>  qemu_sem_destroy(&ms->postcopy_pause_sem);
>  qemu_sem_destroy(&ms->postcopy_pause_rp_sem);
>  qemu_sem_destroy(&ms->rp_state.rp_sem);
> +qemu_mutex_destroy(&ms->qemu_file_close_lock);
>  error_free(ms->error);
>  }
>
> @@ -3236,6 +3237,7 @@ static void migration_instance_init(Object *obj)
>  qemu_sem_init(&ms->rp_state.rp_sem, 0);
>  qemu_sem_init(&ms->rate_limit_sem, 0);
>  qemu_mutex_init(&ms->qemu_file_lock);
> +qemu_mutex_init(&ms->qemu_file_close_lock);
>  }
>
>  /*
> diff --git a/migration/migration.h b/migration/migration.h
> index 64a7b33..a50c2de 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -122,6 +122,13 @@ struct MigrationState
>  QemuMutex qemu_file_lock;
>
>  /*
> + * The to_src_file and from_dst_file point to one QIOChannelRDMA,
> + * And qemu_fclose maybe invoked by different threads. use this lock
> + * to avoid concurrent invoke channel_close by different threads.
> + */
> +QemuMutex qemu_file_close_lock;
> +
> +/*
>   * Used to allow urgent requests to override rate limiting.
>   */
>  QemuSemaphore rate_limit_sem;
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 977b9ae..74c48e0 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -323,12 +323,14 @@ void qemu_update_position(QEMUFile *f, size_t size)
>   */
>  int qemu_fclose(QEMUFile *f)
>  {
> -int ret;
> +int ret, ret2;
>  qemu_fflush(f);
>  ret = qemu_file_get_error(f);
>
>  if (f->ops->close) {
> -int ret2 = f->ops->close(f->opaque);
> +qemu_mutex_lock(&migrate_get_current()->qemu_file_close_lock);
> +ret2 = f->ops->close(f->opaque);
> +qemu_mutex_unlock(&migrate_get_current()->qemu_file_close_lock);
>  if (ret >= 0) {
>  ret = ret2;
>  }
> --
> 1.8.3.1
>



Re: [Qemu-devel] [PATCH v6 00/12] Enable postcopy RDMA live migration

2018-08-05 Thread 858585 jemmy
There is one compile error. Please ignore these patches; I will send a
new version.

On Fri, Aug 3, 2018 at 5:13 PM, Lidong Chen  wrote:
> The RDMA QIOChannel does not support bi-directional communication, so when 
> RDMA
> live migration with postcopy enabled, the source qemu return path get qemu 
> file
> error.
>
> These patches implement bi-directional communication for RDMA QIOChannel and
> disable the RDMA WRITE during the postcopy phase.
>
> This patch just make postcopy works, and will improve performance later.
>
> [v6]
>  - rebase
>  - add the check whether release rdma resource has finished(David)
>  - remove unnecessary RDMA_CONTROL_ERROR when cleanup(David)
>  - poll the cm event for destination qemu
>
> [v5]
>  - rebase
>  - fix bug for create a dedicated thread to release rdma resource(David)
>  - fix bug for poll the cm event while wait RDMA work request 
> completion(David,Gal)
>
> [v4]
>  - not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect
>  - implement io_set_aio_fd_handler function for RDMA QIOChannel (Juan 
> Quintela)
>  - invoke qio_channel_yield only when qemu_in_coroutine() (Juan Quintela)
>  - create a dedicated thread to release rdma resource
>  - poll the cm event while wait RDMA work request completion
>  - implement the shutdown function for RDMA QIOChannel
>
> [v3]
>  - add a mutex in QEMUFile struct to avoid concurrent channel close (Daniel)
>  - destroy the mutex before free QEMUFile (David)
>  - use rdmain and rmdaout instead of rdma->return_path (Daniel)
>
> [v2]
>  - does not update bytes_xfer when disable RDMA WRITE (David)
>  - implement bi-directional communication for RDMA QIOChannel (Daniel)
>
> Lidong Chen (12):
>   migration: disable RDMA WRITE after postcopy started
>   migration: create a dedicated connection for rdma return path
>   migration: avoid concurrent invoke channel_close by different threads
>   migration: implement bi-directional RDMA QIOChannel
>   migration: Stop rdma yielding during incoming postcopy
>   migration: implement io_set_aio_fd_handler function for RDMA
> QIOChannel
>   migration: invoke qio_channel_yield only when qemu_in_coroutine()
>   migration: poll the cm event while wait RDMA work request completion
>   migration: implement the shutdown for RDMA QIOChannel
>   migration: poll the cm event for destination qemu
>   migration: remove the unnecessary RDMA_CONTROL_ERROR message
>   migration: create a dedicated thread to release rdma resource
>
>  migration/colo.c  |   2 +
>  migration/migration.c |  13 +-
>  migration/migration.h |  10 +
>  migration/postcopy-ram.c  |   2 +
>  migration/qemu-file-channel.c |  12 +-
>  migration/qemu-file.c |  14 +-
>  migration/ram.c   |   4 +
>  migration/rdma.c  | 448 
> ++
>  migration/savevm.c|   3 +
>  9 files changed, 458 insertions(+), 50 deletions(-)
>
> --
> 1.8.3.1
>



Re: [Qemu-devel] [PATCH v5 08/10] migration: create a dedicated thread to release rdma resource

2018-07-26 Thread 858585 jemmy
On Mon, Jul 23, 2018 at 10:54 PM, Gal Shachaf  wrote:
> On Thu, Jul 5, 2018 at 10:26 PM, 858585 jemmy  wrote:
>> On Thu, Jun 28, 2018 at 2:59 AM, Dr. David Alan Gilbert
>>  wrote:
>>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>>>> ibv_dereg_mr wait for a long time for big memory size virtual server.
>>>>
>>>> The test result is:
>>>>   10GB  326ms
>>>>   20GB  699ms
>>>>   30GB  1021ms
>>>>   40GB  1387ms
>>>>   50GB  1712ms
>>>>   60GB  2034ms
>>>>   70GB  2457ms
>>>>   80GB  2807ms
>>>>   90GB  3107ms
>>>>   100GB 3474ms
>>>>   110GB 3735ms
>>>>   120GB 4064ms
>>>>   130GB 4567ms
>>>>   140GB 4886ms
>>>>
>>>> this will cause the guest os hang for a while when migration finished.
>>>> So create a dedicated thread to release rdma resource.
>>>>
>>>> Signed-off-by: Lidong Chen 
>>>> ---
>>>>  migration/rdma.c | 43 +++
>>>>  1 file changed, 27 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/migration/rdma.c b/migration/rdma.c index
>>>> dfa4f77..f12e8d5 100644
>>>> --- a/migration/rdma.c
>>>> +++ b/migration/rdma.c
>>>> @@ -2979,35 +2979,46 @@ static void 
>>>> qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>>>  }
>>>>  }
>>>>
>>>> -static int qio_channel_rdma_close(QIOChannel *ioc,
>>>> -  Error **errp)
>>>> +static void *qio_channel_rdma_close_thread(void *arg)
>>>>  {
>>>> -QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>>> -RDMAContext *rdmain, *rdmaout;
>>>> -trace_qemu_rdma_close();
>>>> +RDMAContext **rdma = arg;
>>>> +RDMAContext *rdmain = rdma[0];
>>>> +RDMAContext *rdmaout = rdma[1];
>>>>
>>>> -rdmain = rioc->rdmain;
>>>> -if (rdmain) {
>>>> -atomic_rcu_set(&rioc->rdmain, NULL);
>>>> -}
>>>> -
>>>> -rdmaout = rioc->rdmaout;
>>>> -if (rdmaout) {
>>>> -atomic_rcu_set(&rioc->rdmaout, NULL);
>>>> -}
>>>> +rcu_register_thread();
>>>>
>>>>  synchronize_rcu();
>>>
>>> * see below
>>>
>>>> -
>>>>  if (rdmain) {
>>>>  qemu_rdma_cleanup(rdmain);
>>>>  }
>>>> -
>>>>  if (rdmaout) {
>>>>  qemu_rdma_cleanup(rdmaout);
>>>>  }
>>>>
>>>>  g_free(rdmain);
>>>>  g_free(rdmaout);
>>>> +g_free(rdma);
>>>> +
>>>> +rcu_unregister_thread();
>>>> +return NULL;
>>>> +}
>>>> +
>>>> +static int qio_channel_rdma_close(QIOChannel *ioc,
>>>> +  Error **errp) {
>>>> +QemuThread t;
>>>> +QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>>> +RDMAContext **rdma = g_new0(RDMAContext*, 2);
>>>> +
>>>> +trace_qemu_rdma_close();
>>>> +if (rioc->rdmain || rioc->rdmaout) {
>>>> +rdma[0] =  rioc->rdmain;
>>>> +rdma[1] =  rioc->rdmaout;
>>>> +qemu_thread_create(&t, "rdma cleanup", 
>>>> qio_channel_rdma_close_thread,
>>>> +   rdma, QEMU_THREAD_DETACHED);
>>>> +atomic_rcu_set(&rioc->rdmain, NULL);
>>>> +atomic_rcu_set(&rioc->rdmaout, NULL);
>>>
>>> I'm not sure this pair is ordered with the synchronise_rcu above;
>>> Doesn't that mean, on a bad day, that you could get:
>>>
>>>
>>> main-thread  rdma_cleanup another-thread
>>> qmu_thread_create
>>>   synchronise_rcu
>>> reads rioc->rdmain
>>> starts doing something with rdmain
>>> atomic_rcu_set
>>>   rdma_cleanup
>>>
>>>
>>> so the another-thread is using it during the cleanup?
>>> Would just moving the atomic_rcu_sets before the qemu_thread_create
>>> fix that?
>> yes, I will fix it.

Re: [Qemu-devel] [PATCH v5 08/10] migration: create a dedicated thread to release rdma resource

2018-07-18 Thread 858585 jemmy
On Thu, Jul 5, 2018 at 10:26 PM, 858585 jemmy  wrote:
> On Thu, Jun 28, 2018 at 2:59 AM, Dr. David Alan Gilbert
>  wrote:
>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>>> ibv_dereg_mr wait for a long time for big memory size virtual server.
>>>
>>> The test result is:
>>>   10GB  326ms
>>>   20GB  699ms
>>>   30GB  1021ms
>>>   40GB  1387ms
>>>   50GB  1712ms
>>>   60GB  2034ms
>>>   70GB  2457ms
>>>   80GB  2807ms
>>>   90GB  3107ms
>>>   100GB 3474ms
>>>   110GB 3735ms
>>>   120GB 4064ms
>>>   130GB 4567ms
>>>   140GB 4886ms
>>>
>>> this will cause the guest os hang for a while when migration finished.
>>> So create a dedicated thread to release rdma resource.
>>>
>>> Signed-off-by: Lidong Chen 
>>> ---
>>>  migration/rdma.c | 43 +++
>>>  1 file changed, 27 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>> index dfa4f77..f12e8d5 100644
>>> --- a/migration/rdma.c
>>> +++ b/migration/rdma.c
>>> @@ -2979,35 +2979,46 @@ static void 
>>> qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>>  }
>>>  }
>>>
>>> -static int qio_channel_rdma_close(QIOChannel *ioc,
>>> -  Error **errp)
>>> +static void *qio_channel_rdma_close_thread(void *arg)
>>>  {
>>> -QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>> -RDMAContext *rdmain, *rdmaout;
>>> -trace_qemu_rdma_close();
>>> +RDMAContext **rdma = arg;
>>> +RDMAContext *rdmain = rdma[0];
>>> +RDMAContext *rdmaout = rdma[1];
>>>
>>> -rdmain = rioc->rdmain;
>>> -if (rdmain) {
>>> -atomic_rcu_set(&rioc->rdmain, NULL);
>>> -}
>>> -
>>> -rdmaout = rioc->rdmaout;
>>> -if (rdmaout) {
>>> -atomic_rcu_set(&rioc->rdmaout, NULL);
>>> -}
>>> +rcu_register_thread();
>>>
>>>  synchronize_rcu();
>>
>> * see below
>>
>>> -
>>>  if (rdmain) {
>>>  qemu_rdma_cleanup(rdmain);
>>>  }
>>> -
>>>  if (rdmaout) {
>>>  qemu_rdma_cleanup(rdmaout);
>>>  }
>>>
>>>  g_free(rdmain);
>>>  g_free(rdmaout);
>>> +g_free(rdma);
>>> +
>>> +rcu_unregister_thread();
>>> +return NULL;
>>> +}
>>> +
>>> +static int qio_channel_rdma_close(QIOChannel *ioc,
>>> +  Error **errp)
>>> +{
>>> +QemuThread t;
>>> +QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>> +RDMAContext **rdma = g_new0(RDMAContext*, 2);
>>> +
>>> +trace_qemu_rdma_close();
>>> +if (rioc->rdmain || rioc->rdmaout) {
>>> +rdma[0] =  rioc->rdmain;
>>> +rdma[1] =  rioc->rdmaout;
>>> +qemu_thread_create(&t, "rdma cleanup", 
>>> qio_channel_rdma_close_thread,
>>> +   rdma, QEMU_THREAD_DETACHED);
>>> +atomic_rcu_set(&rioc->rdmain, NULL);
>>> +atomic_rcu_set(&rioc->rdmaout, NULL);
>>
>> I'm not sure this pair is ordered with the synchronise_rcu above;
>> Doesn't that mean, on a bad day, that you could get:
>>
>>
>> main-thread  rdma_cleanup another-thread
>> qmu_thread_create
>>   synchronise_rcu
>> reads rioc->rdmain
>> starts doing something with rdmain
>> atomic_rcu_set
>>   rdma_cleanup
>>
>>
>> so the another-thread is using it during the cleanup?
>> Would just moving the atomic_rcu_sets before the qemu_thread_create
>> fix that?
> yes, I will fix it.
>
>>
>> However, I've got other worries as well:
>>a) qemu_rdma_cleanup does:
>>migrate_get_current()->state == MIGRATION_STATUS_CANCELLING
>>
>>   which worries me a little if someone immediately tries to restart
>>   the migration.

Because the current qemu version doesn't wait for the
RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect,
I think it's not necessary to send RDMA_CONTROL_ERROR.

Compared to sending RDMA_CONTROL_ERROR, I think using a cm event to notify
the peer qemu is better: maybe the rdma is already in error_state, and
RDMA_CONTROL_ERROR cannot be sent successfully.

For the peer qemu, the qemu_rdma_wait_comp_channel function should not only
poll rdma->comp_channel->fd, it should also poll rdma->channel->fd.

For the source qemu, this is fixed by the patch
"migration: poll the cm event while wait RDMA work request completion",
and for the destination qemu a new patch is needed to fix it.

So I prefer to remove the MIGRATION_STATUS_CANCELLING check from the
qemu_rdma_cleanup function.

>>
>>b) I don't understand what happens if someone does try and restart
>>   the migration after that, but in the ~5s it takes the ibv cleanup
>>   to happen.

I prefer to add a new variable in current_migration: if the rdma
cleanup thread has not exited, a new migration should not be started.

>
> yes, I will try to fix it.
>
>>
>> Dave
>>
>>
>>> +}
>>>
>>>  return 0;
>>>  }
>>> --
>>> 1.8.3.1
>>>
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH] migration: release MigrationIncomingState in migration_object_finalize

2018-07-17 Thread 858585 jemmy
On Thu, Jul 12, 2018 at 12:08 PM, 858585 jemmy  wrote:
> On Fri, Jul 6, 2018 at 6:41 PM, Dr. David Alan Gilbert
>  wrote:
>> * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
>>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>>> > Qemu initialize the MigrationIncomingState structure in 
>>> > migration_object_init,
>>> > but not release it. this patch release it in migration_object_finalize.
>>> >
>>> > Signed-off-by: Lidong Chen 
>>>
>>> Queued
>>
>> I've had to unqueue this, see below:
>>
>>>
>>> > ---
>>> >  migration/migration.c | 7 +++
>>> >  1 file changed, 7 insertions(+)
>>> >
>>> > diff --git a/migration/migration.c b/migration/migration.c
>>> > index 05aec2c..e009a05 100644
>>> > --- a/migration/migration.c
>>> > +++ b/migration/migration.c
>>> > @@ -156,6 +156,13 @@ void migration_object_init(void)
>>> >  void migration_object_finalize(void)
>>> >  {
>>> >  object_unref(OBJECT(current_migration));
>>> > +
>>> > +qemu_sem_destroy(¤t_incoming->postcopy_pause_sem_fault);
>>> > +qemu_sem_destroy(¤t_incoming->postcopy_pause_sem_dst);
>>> > +qemu_event_destroy(¤t_incoming->main_thread_load_event);
>>> > +qemu_mutex_destroy(¤t_incoming->rp_mutex);
>>> > +g_array_free(current_incoming->postcopy_remote_fds, true);
>>
>> That array is already free'd in migration_incoming_state_destroy,
>> so I see reliable glib assert's from this array free.
>
> The migration_incoming_state_destroy only invoked in destination qemu.
> The source qemu will not free this memory.
> So I think free current_incoming->postcopy_remote_fds is not good way.
>
> and migration_object_init and migration_object_finalize should not be
> invoked in main
> function. It's better to  alloc memory when start migration and
> release it when migration finished.
>
> I will submit a new version patch to fix it.

I find many functions use current_incoming and current_migration;
if we allocate these when migration starts and release them when
migration finishes, many functions would need to change.
So I will just remove
g_array_free(current_incoming->postcopy_remote_fds, true) from the
patch.

Thanks

>
>>
>> Dave
>>
>>> > +g_free(current_incoming);
>>> >  }
>>> >
>>> >  /* For outgoing */
>>> > --
>>> > 1.8.3.1
>>> >
>>> --
>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH] migration: release MigrationIncomingState in migration_object_finalize

2018-07-11 Thread 858585 jemmy
On Fri, Jul 6, 2018 at 6:41 PM, Dr. David Alan Gilbert
 wrote:
> * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> > Qemu initialize the MigrationIncomingState structure in 
>> > migration_object_init,
>> > but not release it. this patch release it in migration_object_finalize.
>> >
>> > Signed-off-by: Lidong Chen 
>>
>> Queued
>
> I've had to unqueue this, see below:
>
>>
>> > ---
>> >  migration/migration.c | 7 +++
>> >  1 file changed, 7 insertions(+)
>> >
>> > diff --git a/migration/migration.c b/migration/migration.c
>> > index 05aec2c..e009a05 100644
>> > --- a/migration/migration.c
>> > +++ b/migration/migration.c
>> > @@ -156,6 +156,13 @@ void migration_object_init(void)
>> >  void migration_object_finalize(void)
>> >  {
>> >  object_unref(OBJECT(current_migration));
>> > +
>> > +qemu_sem_destroy(¤t_incoming->postcopy_pause_sem_fault);
>> > +qemu_sem_destroy(¤t_incoming->postcopy_pause_sem_dst);
>> > +qemu_event_destroy(¤t_incoming->main_thread_load_event);
>> > +qemu_mutex_destroy(¤t_incoming->rp_mutex);
>> > +g_array_free(current_incoming->postcopy_remote_fds, true);
>
> That array is already free'd in migration_incoming_state_destroy,
> so I see reliable glib assert's from this array free.

migration_incoming_state_destroy is only invoked in the destination qemu.
The source qemu will not free this memory.
So I think freeing current_incoming->postcopy_remote_fds there is not a good way.

And migration_object_init and migration_object_finalize should not be
invoked in the main function. It's better to allocate the memory when
migration starts and release it when migration finishes.

I will submit a new version patch to fix it.

>
> Dave
>
>> > +g_free(current_incoming);
>> >  }
>> >
>> >  /* For outgoing */
>> > --
>> > 1.8.3.1
>> >
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v5 08/10] migration: create a dedicated thread to release rdma resource

2018-07-05 Thread 858585 jemmy
On Thu, Jun 28, 2018 at 2:59 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> ibv_dereg_mr wait for a long time for big memory size virtual server.
>>
>> The test result is:
>>   10GB  326ms
>>   20GB  699ms
>>   30GB  1021ms
>>   40GB  1387ms
>>   50GB  1712ms
>>   60GB  2034ms
>>   70GB  2457ms
>>   80GB  2807ms
>>   90GB  3107ms
>>   100GB 3474ms
>>   110GB 3735ms
>>   120GB 4064ms
>>   130GB 4567ms
>>   140GB 4886ms
>>
>> this will cause the guest os hang for a while when migration finished.
>> So create a dedicated thread to release rdma resource.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/rdma.c | 43 +++
>>  1 file changed, 27 insertions(+), 16 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index dfa4f77..f12e8d5 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2979,35 +2979,46 @@ static void 
>> qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>  }
>>  }
>>
>> -static int qio_channel_rdma_close(QIOChannel *ioc,
>> -  Error **errp)
>> +static void *qio_channel_rdma_close_thread(void *arg)
>>  {
>> -QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> -RDMAContext *rdmain, *rdmaout;
>> -trace_qemu_rdma_close();
>> +RDMAContext **rdma = arg;
>> +RDMAContext *rdmain = rdma[0];
>> +RDMAContext *rdmaout = rdma[1];
>>
>> -rdmain = rioc->rdmain;
>> -if (rdmain) {
>> -atomic_rcu_set(&rioc->rdmain, NULL);
>> -}
>> -
>> -rdmaout = rioc->rdmaout;
>> -if (rdmaout) {
>> -atomic_rcu_set(&rioc->rdmaout, NULL);
>> -}
>> +rcu_register_thread();
>>
>>  synchronize_rcu();
>
> * see below
>
>> -
>>  if (rdmain) {
>>  qemu_rdma_cleanup(rdmain);
>>  }
>> -
>>  if (rdmaout) {
>>  qemu_rdma_cleanup(rdmaout);
>>  }
>>
>>  g_free(rdmain);
>>  g_free(rdmaout);
>> +g_free(rdma);
>> +
>> +rcu_unregister_thread();
>> +return NULL;
>> +}
>> +
>> +static int qio_channel_rdma_close(QIOChannel *ioc,
>> +  Error **errp)
>> +{
>> +QemuThread t;
>> +QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> +RDMAContext **rdma = g_new0(RDMAContext*, 2);
>> +
>> +trace_qemu_rdma_close();
>> +if (rioc->rdmain || rioc->rdmaout) {
>> +rdma[0] =  rioc->rdmain;
>> +rdma[1] =  rioc->rdmaout;
>> +qemu_thread_create(&t, "rdma cleanup", 
>> qio_channel_rdma_close_thread,
>> +   rdma, QEMU_THREAD_DETACHED);
>> +atomic_rcu_set(&rioc->rdmain, NULL);
>> +atomic_rcu_set(&rioc->rdmaout, NULL);
>
> I'm not sure this pair is ordered with the synchronise_rcu above;
> Doesn't that mean, on a bad day, that you could get:
>
>
> main-thread  rdma_cleanup another-thread
> qmu_thread_create
>   synchronise_rcu
> reads rioc->rdmain
> starts doing something with rdmain
> atomic_rcu_set
>   rdma_cleanup
>
>
> so the another-thread is using it during the cleanup?
> Would just moving the atomic_rcu_sets before the qemu_thread_create
> fix that?
yes, I will fix it.

>
> However, I've got other worries as well:
>a) qemu_rdma_cleanup does:
>migrate_get_current()->state == MIGRATION_STATUS_CANCELLING
>
>   which worries me a little if someone immediately tries to restart
>   the migration.
>
>b) I don't understand what happens if someone does try and restart
>   the migration after that, but in the ~5s it takes the ibv cleanup
>   to happen.

yes, I will try to fix it.

>
> Dave
>
>
>> +}
>>
>>  return 0;
>>  }
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v5 04/10] migration: implement bi-directional RDMA QIOChannel

2018-06-27 Thread 858585 jemmy
On Wed, Jun 13, 2018 at 10:21 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> From: Lidong Chen 
>>
>> This patch implements bi-directional RDMA QIOChannel. Because different
>> threads may access RDMAQIOChannel currently, this patch use RCU to protect 
>> it.
>>
>> Signed-off-by: Lidong Chen 
>
> Paolo: Does it make sense the way RCU is used here  Holding the
> read-lock for so long in multifd_rdma_[read|write]v is what worries me
> most.
>
> Dave
>

Hi Paolo:
 Could you review this patch?
 Thanks.

>> ---
>>  migration/colo.c |   2 +
>>  migration/migration.c|   2 +
>>  migration/postcopy-ram.c |   2 +
>>  migration/ram.c  |   4 +
>>  migration/rdma.c | 196 
>> ---
>>  migration/savevm.c   |   3 +
>>  6 files changed, 183 insertions(+), 26 deletions(-)
>>
>> diff --git a/migration/colo.c b/migration/colo.c
>> index 4381067..88936f5 100644
>> --- a/migration/colo.c
>> +++ b/migration/colo.c
>> @@ -534,6 +534,7 @@ void *colo_process_incoming_thread(void *opaque)
>>  uint64_t value;
>>  Error *local_err = NULL;
>>
>> +rcu_register_thread();
>>  qemu_sem_init(&mis->colo_incoming_sem, 0);
>>
>>  migrate_set_state(&mis->state, MIGRATION_STATUS_ACTIVE,
>> @@ -666,5 +667,6 @@ out:
>>  }
>>  migration_incoming_exit_colo();
>>
>> +rcu_unregister_thread();
>>  return NULL;
>>  }
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 1d0aaec..4253d9f 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2028,6 +2028,7 @@ static void *source_return_path_thread(void *opaque)
>>  int res;
>>
>>  trace_source_return_path_thread_entry();
>> +rcu_register_thread();
>>
>>  retry:
>>  while (!ms->rp_state.error && !qemu_file_get_error(rp) &&
>> @@ -2167,6 +2168,7 @@ out:
>>  trace_source_return_path_thread_end();
>>  ms->rp_state.from_dst_file = NULL;
>>  qemu_fclose(rp);
>> +rcu_unregister_thread();
>>  return NULL;
>>  }
>>
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 48e5155..98613eb 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -853,6 +853,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
>>  RAMBlock *rb = NULL;
>>
>>  trace_postcopy_ram_fault_thread_entry();
>> +rcu_register_thread();
>>  mis->last_rb = NULL; /* last RAMBlock we sent part of */
>>  qemu_sem_post(&mis->fault_thread_sem);
>>
>> @@ -1059,6 +1060,7 @@ retry:
>>  }
>>  }
>>  }
>> +rcu_unregister_thread();
>>  trace_postcopy_ram_fault_thread_exit();
>>  g_free(pfd);
>>  return NULL;
>> diff --git a/migration/ram.c b/migration/ram.c
>> index a500015..a674fb5 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -683,6 +683,7 @@ static void *multifd_send_thread(void *opaque)
>>  MultiFDSendParams *p = opaque;
>>  Error *local_err = NULL;
>>
>> +rcu_register_thread();
>>  if (multifd_send_initial_packet(p, &local_err) < 0) {
>>  goto out;
>>  }
>> @@ -706,6 +707,7 @@ out:
>>  p->running = false;
>>  qemu_mutex_unlock(&p->mutex);
>>
>> +rcu_unregister_thread();
>>  return NULL;
>>  }
>>
>> @@ -819,6 +821,7 @@ static void *multifd_recv_thread(void *opaque)
>>  {
>>  MultiFDRecvParams *p = opaque;
>>
>> +rcu_register_thread();
>>  while (true) {
>>  qemu_mutex_lock(&p->mutex);
>>  if (p->quit) {
>> @@ -833,6 +836,7 @@ static void *multifd_recv_thread(void *opaque)
>>  p->running = false;
>>  qemu_mutex_unlock(&p->mutex);
>>
>> +rcu_unregister_thread();
>>  return NULL;
>>  }
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index f6705a3..769f443 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -86,6 +86,7 @@ static uint32_t known_capabilities = 
>> RDMA_CAPABILITY_PIN_ALL;
>>  " to abort!"); \
>>  rdma->error_reported = 1; \
>>  } \
>> +rcu_read_unlock(); \
>>  return rdma->error_state; \
>>  } \
>>  } while (0)
>> @@ -402,7 +403,8 @@ typedef struct QIOChannelRDMA QIOChannelRDMA;
>>
>>  struct QIOChannelRDMA {
>>  QIOChannel parent;
>> -RDMAContext *rdma;
>> +RDMAContext *rdmain;
>> +RDMAContext *rdmaout;
>>  QEMUFile *file;
>>  bool blocking; /* XXX we don't actually honour this yet */
>>  };
>> @@ -2630,12 +2632,20 @@ static ssize_t qio_channel_rdma_writev(QIOChannel 
>> *ioc,
>>  {
>>  QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>  QEMUFile *f = rioc->file;
>> -RDMAContext *rdma = rioc->rdma;
>> +RDMAContext *rdma;
>>  int ret;
>>  ssize_t done = 0;
>>  size_t i;
>>  size_t len = 0;
>>
>> +rcu_read_lock();
>> +rdma = atomic_rcu_read(&rioc->rdmaout);
>> +
>> +if (!rdma) {
>> +rcu_read_unlock();
>> +retu

Re: [Qemu-devel] [PATCH v5 09/10] migration: poll the cm event while wait RDMA work request completion

2018-06-13 Thread 858585 jemmy
On Wed, Jun 13, 2018 at 10:24 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> If the peer qemu is crashed, the qemu_rdma_wait_comp_channel function
>> maybe loop forever. so we should also poll the cm event fd, and when
>> receive RDMA_CM_EVENT_DISCONNECTED and RDMA_CM_EVENT_DEVICE_REMOVAL,
>> we consider some error happened.
>>
>> Signed-off-by: Lidong Chen 
>
> Was there a reply which explained/pointed to docs for cm_event?

https://linux.die.net/man/3/rdma_get_cm_event

> Or a Review-by from one of the Infiniband people would be fine.

Yes, I should add Gal Shachaf and Aviad Yehezkel;

we are working together on RDMA live migration.

Thanks.

>
> Dave
>
>> ---
>>  migration/rdma.c | 33 ++---
>>  1 file changed, 30 insertions(+), 3 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index f12e8d5..bb6989e 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1489,6 +1489,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, 
>> uint64_t *wr_id_out,
>>   */
>>  static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>  {
>> +struct rdma_cm_event *cm_event;
>> +int ret = -1;
>> +
>>  /*
>>   * Coroutine doesn't start until migration_fd_process_incoming()
>>   * so don't yield unless we know we're running inside of a coroutine.
>> @@ -1505,13 +1508,37 @@ static int qemu_rdma_wait_comp_channel(RDMAContext 
>> *rdma)
>>   * without hanging forever.
>>   */
>>  while (!rdma->error_state  && !rdma->received_error) {
>> -GPollFD pfds[1];
>> +GPollFD pfds[2];
>>  pfds[0].fd = rdma->comp_channel->fd;
>>  pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>> +pfds[0].revents = 0;
>> +
>> +pfds[1].fd = rdma->channel->fd;
>> +pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>> +pfds[1].revents = 0;
>> +
>>  /* 0.1s timeout, should be fine for a 'cancel' */
>> -switch (qemu_poll_ns(pfds, 1, 100 * 1000 * 1000)) {
>> +switch (qemu_poll_ns(pfds, 2, 100 * 1000 * 1000)) {
>> +case 2:
>>  case 1: /* fd active */
>> -return 0;
>> +if (pfds[0].revents) {
>> +return 0;
>> +}
>> +
>> +if (pfds[1].revents) {
>> +ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> +if (!ret) {
>> +rdma_ack_cm_event(cm_event);
>> +}
>> +
>> +error_report("receive cm event while wait comp channel,"
>> + "cm event is %d", cm_event->event);
>> +if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
>> +cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
>> +return -EPIPE;
>> +}
>> +}
>> +break;
>>
>>  case 0: /* Timeout, go around again */
>>  break;
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v4 11/12] migration: poll the cm event while wait RDMA work request completion

2018-06-05 Thread 858585 jemmy
On Sun, Jun 3, 2018 at 11:04 PM, Aviad Yehezkel
 wrote:
> +Gal
>
> Gal, please comment with our findings.

Some suggestions from Gal:
1. Regarding the GIOConditions for the GPollFD.events/revents:
G_IO_IN is enough for cm_channel – if it is not empty, it will return
POLLIN | POLLRDNORM, but you can also use G_IO_HUP | G_IO_ERR just to
be safe.
2. Please note that you are not currently checking the error return
values of qemu_poll_ns, rdma_get_cm_event and rdma_ack_cm_event.
3. You should consider checking for specific RDMA_CM_EVENT types:
RDMA_CM_EVENT_DISCONNECTED, RDMA_CM_EVENT_DEVICE_REMOVAL. For example,
receiving RDMA_CM_EVENT_ADDR_CHANGE should not result in an error.
4. It is better to first poll the CQ, and only if it has no new CQEs
poll the events queue. This way, you will not go into error even if
you've got a CQE.

I will send a new version patch; a rough sketch of the revised wait loop
is below.
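
Something like the following, folding in points 2 and 3 (point 4 would
additionally reorder the CQ poll around this; it is only a sketch of the
body of qemu_rdma_wait_comp_channel's loop, not the final patch):

GPollFD pfds[2];
int npfds;
struct rdma_cm_event *cm_event;
enum rdma_cm_event_type event_type;

pfds[0].fd = rdma->comp_channel->fd;
pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
pfds[0].revents = 0;

pfds[1].fd = rdma->channel->fd;
pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
pfds[1].revents = 0;

/* 0.1s timeout, should be fine for a 'cancel' */
npfds = qemu_poll_ns(pfds, 2, 100 * 1000 * 1000);
if (npfds < 0) {
    error_report("%s: poll failed", __func__);
    return -EPIPE;
}

if (pfds[1].revents) {
    if (rdma_get_cm_event(rdma->channel, &cm_event)) {
        error_report("failed to get cm event: %d", errno);
        return -EPIPE;
    }
    event_type = cm_event->event;
    rdma_ack_cm_event(cm_event);

    if (event_type == RDMA_CM_EVENT_DISCONNECTED ||
        event_type == RDMA_CM_EVENT_DEVICE_REMOVAL) {
        return -EPIPE;
    }
    /* other events (e.g. RDMA_CM_EVENT_ADDR_CHANGE) are not fatal */
}

if (pfds[0].revents) {
    return 0; /* the completion channel has something for us */
}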

>
> Thanks!
>
>
> On 5/31/2018 10:36 AM, 858585 jemmy wrote:
>>
>> On Thu, May 31, 2018 at 1:33 AM, Dr. David Alan Gilbert
>>  wrote:
>>>
>>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>>>>
>>>> If the peer qemu is crashed, the qemu_rdma_wait_comp_channel function
>>>> maybe loop forever. so we should also poll the cm event fd, and when
>>>> receive any cm event, we consider some error happened.
>>>>
>>>> Signed-off-by: Lidong Chen 
>>>
>>> I don't understand enough about the way the infiniband fd's work to
>>> fully review this; so I'd appreciate if some one who does could
>>> comment/add their review.
>>
>> Hi Avaid:
>>  we need your help. I also not find any document about the cq
>> channel event fd and
>> cm channel event f.
>>  Should we set the events to G_IO_IN | G_IO_HUP | G_IO_ERR? or
>> G_IO_IN is enough?
>>  pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>  pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>  Thanks.
>>
>>>> ---
>>>>   migration/rdma.c | 35 ---
>>>>   1 file changed, 24 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>>> index 1b9e261..d611a06 100644
>>>> --- a/migration/rdma.c
>>>> +++ b/migration/rdma.c
>>>> @@ -1489,6 +1489,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma,
>>>> uint64_t *wr_id_out,
>>>>*/
>>>>   static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>>>   {
>>>> +struct rdma_cm_event *cm_event;
>>>> +int ret = -1;
>>>> +
>>>>   /*
>>>>* Coroutine doesn't start until migration_fd_process_incoming()
>>>>* so don't yield unless we know we're running inside of a
>>>> coroutine.
>>>> @@ -1504,25 +1507,35 @@ static int
>>>> qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>>>* But we need to be able to handle 'cancel' or an error
>>>>* without hanging forever.
>>>>*/
>>>> -while (!rdma->error_state  && !rdma->received_error) {
>>>> -GPollFD pfds[1];
>>>> +while (!rdma->error_state && !rdma->received_error) {
>>>> +GPollFD pfds[2];
>>>>   pfds[0].fd = rdma->comp_channel->fd;
>>>>   pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>>> +pfds[0].revents = 0;
>>>> +
>>>> +pfds[1].fd = rdma->channel->fd;
>>>> +pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>>>> +pfds[1].revents = 0;
>>>> +
>>>>   /* 0.1s timeout, should be fine for a 'cancel' */
>>>> -switch (qemu_poll_ns(pfds, 1, 100 * 1000 * 1000)) {
>>>> -case 1: /* fd active */
>>>> -return 0;
>>>> +qemu_poll_ns(pfds, 2, 100 * 1000 * 1000);
>>>
>>> Shouldn't we still check the return value of this; if it's negative
>>> something has gone wrong.
>>
>> I will fix this.
>> Thanks.
>>
>>> Dave
>>>
>>>> -case 0: /* Timeout, go around again */
>>>> -break;
>>>> +if (pfds[1].revents) {
>>>> +ret = rdma_get_cm_event(rdma->channel, &cm_event);
>>>> +if (!ret) {
>>>> +rdma_ack_cm_event(cm_event);
>>>> +}
>>>> +error_report("receive cm event while wait comp
>>>> channel,"
>>>> + "cm event is %d", cm_event->event);
>>>>
>>>> -default: /* Error of some type -
>>>> -  * I don't trust errno from qemu_poll_ns
>>>> - */
>>>> -error_report("%s: poll failed", __func__);
>>>> +/* consider any rdma communication event as an error */
>>>>   return -EPIPE;
>>>>   }
>>>>
>>>> +if (pfds[0].revents) {
>>>> +return 0;
>>>> +}
>>>> +
>>>>   if (migrate_get_current()->state ==
>>>> MIGRATION_STATUS_CANCELLING) {
>>>>   /* Bail out and let the cancellation happen */
>>>>   return -EPIPE;
>>>> --
>>>> 1.8.3.1
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>
>



Re: [Qemu-devel] [PATCH v4 04/12] migration: avoid concurrent invoke channel_close by different threads

2018-06-03 Thread 858585 jemmy
On Sun, Jun 3, 2018 at 9:50 PM, 858585 jemmy  wrote:
> On Thu, May 31, 2018 at 6:52 PM, Dr. David Alan Gilbert
>  wrote:
>> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>>> On Wed, May 30, 2018 at 10:45 PM, Dr. David Alan Gilbert
>>>  wrote:
>>> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>>> >> From: Lidong Chen 
>>> >>
>>> >> The channel_close maybe invoked by different threads. For example, source
>>> >> qemu invokes qemu_fclose in main thread, migration thread and return path
>>> >> thread. Destination qemu invokes qemu_fclose in main thread, listen 
>>> >> thread
>>> >> and COLO incoming thread.
>>> >>
>>> >> Add a mutex in QEMUFile struct to avoid concurrent invoke channel_close.
>>> >>
>>> >> Signed-off-by: Lidong Chen 
>>> >> ---
>>> >>  migration/qemu-file.c | 5 +
>>> >>  1 file changed, 5 insertions(+)
>>> >>
>>> >> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>>> >> index 977b9ae..87d0f05 100644
>>> >> --- a/migration/qemu-file.c
>>> >> +++ b/migration/qemu-file.c
>>> >> @@ -52,6 +52,7 @@ struct QEMUFile {
>>> >>  unsigned int iovcnt;
>>> >>
>>> >>  int last_error;
>>> >> +QemuMutex lock;
>>> >
>>> > That could do with a comment saying what you're protecting
>>> >
>>> >>  };
>>> >>
>>> >>  /*
>>> >> @@ -96,6 +97,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const 
>>> >> QEMUFileOps *ops)
>>> >>
>>> >>  f = g_new0(QEMUFile, 1);
>>> >>
>>> >> +qemu_mutex_init(&f->lock);
>>> >>  f->opaque = opaque;
>>> >>  f->ops = ops;
>>> >>  return f;
>>> >> @@ -328,7 +330,9 @@ int qemu_fclose(QEMUFile *f)
>>> >>  ret = qemu_file_get_error(f);
>>> >>
>>> >>  if (f->ops->close) {
>>> >> +qemu_mutex_lock(&f->lock);
>>> >>  int ret2 = f->ops->close(f->opaque);
>>> >> +qemu_mutex_unlock(&f->lock);
>>> >
>>> > OK, and at least for the RDMA code, if it calls
>>> > close a 2nd time, rioc->rdma is checked so it wont actually free stuff a
>>> > 2nd time.
>>> >
>>> >>  if (ret >= 0) {
>>> >>  ret = ret2;
>>> >>  }
>>> >> @@ -339,6 +343,7 @@ int qemu_fclose(QEMUFile *f)
>>> >>  if (f->last_error) {
>>> >>  ret = f->last_error;
>>> >>  }
>>> >> +qemu_mutex_destroy(&f->lock);
>>> >>  g_free(f);
>>> >
>>> > Hmm but that's not safe; if two things really do call qemu_fclose()
>>> > on the same structure they race here and can end up destroying the lock
>>> > twice, or doing f->lock  after the 1st one has already g_free(f).
>>>
>>> >
>>> >
>>> > So lets go back a step.
>>> > I think:
>>> >   a) There should always be a separate QEMUFile* for
>>> >  to_src_file and from_src_file - I don't see where you open
>>> >  the 2nd one; I don't see your implementation of
>>> >  f->ops->get_return_path.
>>>
>>> yes, current qemu version use a separate QEMUFile* for to_src_file and
>>> from_src_file.
>>> and the two QEMUFile point to one QIOChannelRDMA.
>>>
>>> the f->ops->get_return_path is implemented by channel_output_ops or
>>> channel_input_ops.
>>
>> Ah OK, yes that makes sense.
>>
>>> >   b) I *think* that while the different threads might all call
>>> >  fclose(), I think there should only ever be one qemu_fclose
>>> >  call for each direction on the QEMUFile.
>>> >
>>> > But now we have two problems:
>>> >   If (a) is true then f->lock  is separate on each one so
>>> >doesn't really protect if the two directions are closed
>>> >at once. (Assuming (b) is true)
>>>
>>> yes, you are right.  so I should add a QemuMutex in QIOChannel structure, 
>>> not
>>> QEMUFile structure. and qemu_mutex_destroy the QemuMutex in
>>> qio_channel_finalize.
>>
>> OK, that sounds better.
>>
>> Dave
>>
>
> Hi Dave:
> Another way is protect channel_close in migration part, like
> QemuMutex rp_mutex.
> As Daniel mentioned, QIOChannel impls are only intended to a single 
> thread.
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg530100.html
>
> which way is better? Does QIOChannel have the plan to support multi 
> thread?
> Not only channel_close need lock between different threads,
> writev_buffer write also
> need.
>
> thanks.
>
>

I find that qemu does not call qemu_mutex_destroy to release rp_mutex in
migration_instance_finalize :(
Although qemu_mutex_destroy is not strictly necessary, it is good practice,
so it's better we fix it.

>>> Thank you.
>>>
>>> >
>>> >   If (a) is false and we actually share a single QEMUFile then
>>> >  that race at the end happens.
>>> >
>>> > Dave
>>> >
>>> >
>>> >>  trace_qemu_file_fclose();
>>> >>  return ret;
>>> >> --
>>> >> 1.8.3.1
>>> >>
>>> > --
>>> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v4 04/12] migration: avoid concurrent invoke channel_close by different threads

2018-06-03 Thread 858585 jemmy
On Thu, May 31, 2018 at 6:52 PM, Dr. David Alan Gilbert
 wrote:
> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> On Wed, May 30, 2018 at 10:45 PM, Dr. David Alan Gilbert
>>  wrote:
>> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>> >> From: Lidong Chen 
>> >>
>> >> The channel_close maybe invoked by different threads. For example, source
>> >> qemu invokes qemu_fclose in main thread, migration thread and return path
>> >> thread. Destination qemu invokes qemu_fclose in main thread, listen thread
>> >> and COLO incoming thread.
>> >>
>> >> Add a mutex in QEMUFile struct to avoid concurrent invoke channel_close.
>> >>
>> >> Signed-off-by: Lidong Chen 
>> >> ---
>> >>  migration/qemu-file.c | 5 +
>> >>  1 file changed, 5 insertions(+)
>> >>
>> >> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> >> index 977b9ae..87d0f05 100644
>> >> --- a/migration/qemu-file.c
>> >> +++ b/migration/qemu-file.c
>> >> @@ -52,6 +52,7 @@ struct QEMUFile {
>> >>  unsigned int iovcnt;
>> >>
>> >>  int last_error;
>> >> +QemuMutex lock;
>> >
>> > That could do with a comment saying what you're protecting
>> >
>> >>  };
>> >>
>> >>  /*
>> >> @@ -96,6 +97,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const 
>> >> QEMUFileOps *ops)
>> >>
>> >>  f = g_new0(QEMUFile, 1);
>> >>
>> >> +qemu_mutex_init(&f->lock);
>> >>  f->opaque = opaque;
>> >>  f->ops = ops;
>> >>  return f;
>> >> @@ -328,7 +330,9 @@ int qemu_fclose(QEMUFile *f)
>> >>  ret = qemu_file_get_error(f);
>> >>
>> >>  if (f->ops->close) {
>> >> +qemu_mutex_lock(&f->lock);
>> >>  int ret2 = f->ops->close(f->opaque);
>> >> +qemu_mutex_unlock(&f->lock);
>> >
>> > OK, and at least for the RDMA code, if it calls
>> > close a 2nd time, rioc->rdma is checked so it wont actually free stuff a
>> > 2nd time.
>> >
>> >>  if (ret >= 0) {
>> >>  ret = ret2;
>> >>  }
>> >> @@ -339,6 +343,7 @@ int qemu_fclose(QEMUFile *f)
>> >>  if (f->last_error) {
>> >>  ret = f->last_error;
>> >>  }
>> >> +qemu_mutex_destroy(&f->lock);
>> >>  g_free(f);
>> >
>> > Hmm but that's not safe; if two things really do call qemu_fclose()
>> > on the same structure they race here and can end up destroying the lock
>> > twice, or doing f->lock  after the 1st one has already g_free(f).
>>
>> >
>> >
>> > So lets go back a step.
>> > I think:
>> >   a) There should always be a separate QEMUFile* for
>> >  to_src_file and from_src_file - I don't see where you open
>> >  the 2nd one; I don't see your implementation of
>> >  f->ops->get_return_path.
>>
>> yes, current qemu version use a separate QEMUFile* for to_src_file and
>> from_src_file.
>> and the two QEMUFile point to one QIOChannelRDMA.
>>
>> the f->ops->get_return_path is implemented by channel_output_ops or
>> channel_input_ops.
>
> Ah OK, yes that makes sense.
>
>> >   b) I *think* that while the different threads might all call
>> >  fclose(), I think there should only ever be one qemu_fclose
>> >  call for each direction on the QEMUFile.
>> >
>> > But now we have two problems:
>> >   If (a) is true then f->lock  is separate on each one so
>> >doesn't really protect if the two directions are closed
>> >at once. (Assuming (b) is true)
>>
>> yes, you are right.  so I should add a QemuMutex in QIOChannel structure, not
>> QEMUFile structure. and qemu_mutex_destroy the QemuMutex in
>> qio_channel_finalize.
>
> OK, that sounds better.
>
> Dave
>

Hi Dave:
Another way is to protect channel_close in the migration code, like
QemuMutex rp_mutex does.
As Daniel mentioned, QIOChannel impls are only intended for a single thread.
https://www.mail-archive.com/qemu-devel@nongnu.org/msg530100.html

Which way is better? Does QIOChannel plan to support multiple threads?
Not only channel_close needs a lock between different threads;
writev_buffer also needs one.

thanks.


>> Thank you.
>>
>> >
>> >   If (a) is false and we actually share a single QEMUFile then
>> >  that race at the end happens.
>> >
>> > Dave
>> >
>> >
>> >>  trace_qemu_file_fclose();
>> >>  return ret;
>> >> --
>> >> 1.8.3.1
>> >>
>> > --
>> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v4 10/12] migration: create a dedicated thread to release rdma resource

2018-05-31 Thread 858585 jemmy
On Thu, May 31, 2018 at 6:55 PM, Dr. David Alan Gilbert
 wrote:
> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> On Thu, May 31, 2018 at 12:50 AM, Dr. David Alan Gilbert
>>  wrote:
>> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>> >> ibv_dereg_mr wait for a long time for big memory size virtual server.
>> >>
>> >> The test result is:
>> >>   10GB  326ms
>> >>   20GB  699ms
>> >>   30GB  1021ms
>> >>   40GB  1387ms
>> >>   50GB  1712ms
>> >>   60GB  2034ms
>> >>   70GB  2457ms
>> >>   80GB  2807ms
>> >>   90GB  3107ms
>> >>   100GB 3474ms
>> >>   110GB 3735ms
>> >>   120GB 4064ms
>> >>   130GB 4567ms
>> >>   140GB 4886ms
>> >>
>> >> this will cause the guest os hang for a while when migration finished.
>> >> So create a dedicated thread to release rdma resource.
>> >>
>> >> Signed-off-by: Lidong Chen 
>> >> ---
>> >>  migration/rdma.c | 21 +
>> >>  1 file changed, 17 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/migration/rdma.c b/migration/rdma.c
>> >> index dfa4f77..1b9e261 100644
>> >> --- a/migration/rdma.c
>> >> +++ b/migration/rdma.c
>> >> @@ -2979,12 +2979,12 @@ static void 
>> >> qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>> >>  }
>> >>  }
>> >>
>> >> -static int qio_channel_rdma_close(QIOChannel *ioc,
>> >> -  Error **errp)
>> >> +static void *qio_channel_rdma_close_thread(void *arg)
>> >>  {
>> >> -QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> >> +QIOChannelRDMA *rioc = arg;
>> >>  RDMAContext *rdmain, *rdmaout;
>> >> -trace_qemu_rdma_close();
>> >> +
>> >> +rcu_register_thread();
>> >>
>> >>  rdmain = rioc->rdmain;
>> >>  if (rdmain) {
>> >> @@ -3009,6 +3009,19 @@ static int qio_channel_rdma_close(QIOChannel *ioc,
>> >>  g_free(rdmain);
>> >>  g_free(rdmaout);
>> >>
>> >> +rcu_unregister_thread();
>> >> +return NULL;
>> >> +}
>> >> +
>> >> +static int qio_channel_rdma_close(QIOChannel *ioc,
>> >> +  Error **errp)
>> >> +{
>> >> +QemuThread t;
>> >> +QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> >> +trace_qemu_rdma_close();
>> >> +
>> >> +qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
>> >> +   rioc, QEMU_THREAD_DETACHED);
>> >
>> > I don't think this can be this simple; consider the lock in patch 4;
>> > now that lock means qui_channel_rdma_close() can't be called in
>> > parallel; but with this change it means:
>> >
>> >
>> >  f->lock
>> >qemu_thread_create  (1)
>> > !f->lock
>> >  f->lock
>> >qemu_thread_create
>> > !f->lock
>> >
>> > so we don't really protect the thing you were trying to lock
>>
>> yes, I should not use rioc as the thread arg.
>>
>> static int qio_channel_rdma_close(QIOChannel *ioc,
>>   Error **errp)
>> {
>> QemuThread t;
>> RDMAContext *rdma[2];
>> QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>
>> trace_qemu_rdma_close();
>> if (rioc->rdmain || rioc->rdmaout) {
>> rdma[0] =  rioc->rdmain;
>> rdma[1] =  rioc->rdmaout;
>> qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
>>rdma, QEMU_THREAD_DETACHED);
>> rioc->rdmain = NULL;
>> rioc->rdmaout = NULL;
>
> Is it safe to close both directions at once?
> For example, if you get the close from the return path thread, might the
> main thread be still using it's QEMUFile in the opposite direction;
> it'll call close a little bit later?

I use RCU to protect this.  qio_channel_rdma_close_thread() calls synchronize_rcu(),
which waits until no other thread is still accessing rdmain or rdmaout.

And if the return path thread closes the QEMUFile, the migration thread's QEMUFile
will be set to an error state soon after, because the QIOChannel is closed.
QIOChannelSocket also works this way.
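To spell out the pattern (a simplified sketch of the idea, not the exact patch):

    /* reader side, e.g. qio_channel_rdma_writev(), simplified */
    rcu_read_lock();
    rdma = atomic_rcu_read(&rioc->rdmaout);
    if (!rdma) {
        rcu_read_unlock();
        return -EIO;
    }
    /* ... use rdma ... */
    rcu_read_unlock();

    /* cleanup side: qio_channel_rdma_close() publishes NULL, then the
     * detached cleanup thread waits for readers before freeing */
    atomic_rcu_set(&rioc->rdmain, NULL);
    atomic_rcu_set(&rioc->rdmaout, NULL);
    /* ... later, in qio_channel_rdma_close_thread() ... */
    synchronize_rcu();          /* no reader can still hold the old pointers */
    qemu_rdma_cleanup(rdmain);
    g_free(rdmain);
    qemu_rdma_cleanup(rdmaout);
    g_free(rdmaout);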



>
> Dave
>
>> }
>> return 0;
>> }
>>
>> >
>> > Dave
>> >
>> >>  return 0;
>> >>  }
>> >>
>> >> --
>> >> 1.8.3.1
>> >>
>> > --
>> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v4 11/12] migration: poll the cm event while wait RDMA work request completion

2018-05-31 Thread 858585 jemmy
On Thu, May 31, 2018 at 1:33 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> If the peer qemu is crashed, the qemu_rdma_wait_comp_channel function
>> maybe loop forever. so we should also poll the cm event fd, and when
>> receive any cm event, we consider some error happened.
>>
>> Signed-off-by: Lidong Chen 
>
> I don't understand enough about the way the infiniband fd's work to
> fully review this; so I'd appreciate if some one who does could
> comment/add their review.

Hi Aviad:
We need your help. I also could not find any documentation about the cq
channel event fd and the cm channel event fd.
Should we set the events to G_IO_IN | G_IO_HUP | G_IO_ERR, or is
G_IO_IN enough?

    pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
    pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;

Thanks.

>
>> ---
>>  migration/rdma.c | 35 ---
>>  1 file changed, 24 insertions(+), 11 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 1b9e261..d611a06 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1489,6 +1489,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, 
>> uint64_t *wr_id_out,
>>   */
>>  static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>>  {
>> +struct rdma_cm_event *cm_event;
>> +int ret = -1;
>> +
>>  /*
>>   * Coroutine doesn't start until migration_fd_process_incoming()
>>   * so don't yield unless we know we're running inside of a coroutine.
>> @@ -1504,25 +1507,35 @@ static int qemu_rdma_wait_comp_channel(RDMAContext 
>> *rdma)
>>   * But we need to be able to handle 'cancel' or an error
>>   * without hanging forever.
>>   */
>> -while (!rdma->error_state  && !rdma->received_error) {
>> -GPollFD pfds[1];
>> +while (!rdma->error_state && !rdma->received_error) {
>> +GPollFD pfds[2];
>>  pfds[0].fd = rdma->comp_channel->fd;
>>  pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>> +pfds[0].revents = 0;
>> +
>> +pfds[1].fd = rdma->channel->fd;
>> +pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
>> +pfds[1].revents = 0;
>> +
>>  /* 0.1s timeout, should be fine for a 'cancel' */
>> -switch (qemu_poll_ns(pfds, 1, 100 * 1000 * 1000)) {
>> -case 1: /* fd active */
>> -return 0;
>> +qemu_poll_ns(pfds, 2, 100 * 1000 * 1000);
>
> Shouldn't we still check the return value of this; if it's negative
> something has gone wrong.

I will fix this.
Thanks.
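i.e. something like (just a sketch):

    ret = qemu_poll_ns(pfds, 2, 100 * 1000 * 1000);
    if (ret < 0) {
        /* don't trust errno from qemu_poll_ns */
        error_report("%s: poll failed", __func__);
        return -EPIPE;
    }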

>
> Dave
>
>> -case 0: /* Timeout, go around again */
>> -break;
>> +if (pfds[1].revents) {
>> +ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> +if (!ret) {
>> +rdma_ack_cm_event(cm_event);
>> +}
>> +error_report("receive cm event while wait comp channel,"
>> + "cm event is %d", cm_event->event);
>>
>> -default: /* Error of some type -
>> -  * I don't trust errno from qemu_poll_ns
>> - */
>> -error_report("%s: poll failed", __func__);
>> +/* consider any rdma communication event as an error */
>>  return -EPIPE;
>>  }
>>
>> +if (pfds[0].revents) {
>> +return 0;
>> +}
>> +
>>  if (migrate_get_current()->state == 
>> MIGRATION_STATUS_CANCELLING) {
>>  /* Bail out and let the cancellation happen */
>>  return -EPIPE;
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v4 10/12] migration: create a dedicated thread to release rdma resource

2018-05-31 Thread 858585 jemmy
On Thu, May 31, 2018 at 12:50 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> ibv_dereg_mr wait for a long time for big memory size virtual server.
>>
>> The test result is:
>>   10GB  326ms
>>   20GB  699ms
>>   30GB  1021ms
>>   40GB  1387ms
>>   50GB  1712ms
>>   60GB  2034ms
>>   70GB  2457ms
>>   80GB  2807ms
>>   90GB  3107ms
>>   100GB 3474ms
>>   110GB 3735ms
>>   120GB 4064ms
>>   130GB 4567ms
>>   140GB 4886ms
>>
>> this will cause the guest os hang for a while when migration finished.
>> So create a dedicated thread to release rdma resource.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/rdma.c | 21 +
>>  1 file changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index dfa4f77..1b9e261 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2979,12 +2979,12 @@ static void 
>> qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>  }
>>  }
>>
>> -static int qio_channel_rdma_close(QIOChannel *ioc,
>> -  Error **errp)
>> +static void *qio_channel_rdma_close_thread(void *arg)
>>  {
>> -QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> +QIOChannelRDMA *rioc = arg;
>>  RDMAContext *rdmain, *rdmaout;
>> -trace_qemu_rdma_close();
>> +
>> +rcu_register_thread();
>>
>>  rdmain = rioc->rdmain;
>>  if (rdmain) {
>> @@ -3009,6 +3009,19 @@ static int qio_channel_rdma_close(QIOChannel *ioc,
>>  g_free(rdmain);
>>  g_free(rdmaout);
>>
>> +rcu_unregister_thread();
>> +return NULL;
>> +}
>> +
>> +static int qio_channel_rdma_close(QIOChannel *ioc,
>> +  Error **errp)
>> +{
>> +QemuThread t;
>> +QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> +trace_qemu_rdma_close();
>> +
>> +qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
>> +   rioc, QEMU_THREAD_DETACHED);
>
> I don't think this can be this simple; consider the lock in patch 4;
> now that lock means qui_channel_rdma_close() can't be called in
> parallel; but with this change it means:
>
>
>  f->lock
>qemu_thread_create  (1)
> !f->lock
>  f->lock
>qemu_thread_create
> !f->lock
>
> so we don't really protect the thing you were trying to lock

yes, I should not use rioc as the thread arg.

static int qio_channel_rdma_close(QIOChannel *ioc,
                                  Error **errp)
{
    QemuThread t;
    RDMAContext *rdma[2];
    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);

    trace_qemu_rdma_close();
    if (rioc->rdmain || rioc->rdmaout) {
        rdma[0] = rioc->rdmain;
        rdma[1] = rioc->rdmaout;
        qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
                           rdma, QEMU_THREAD_DETACHED);
        rioc->rdmain = NULL;
        rioc->rdmaout = NULL;
    }
    return 0;
}
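One detail to be careful about in the sketch above: rdma[] is a local array, and
qio_channel_rdma_close() returns while the detached cleanup thread may still be
reading it. A heap-allocated argument that the cleanup thread frees would avoid
the dangling pointer, e.g. (illustrative only):

    RDMAContext **rdma = g_new0(RDMAContext *, 2); /* freed by the cleanup thread */

    rdma[0] = rioc->rdmain;
    rdma[1] = rioc->rdmaout;
    qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
                       rdma, QEMU_THREAD_DETACHED);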

>
> Dave
>
>>  return 0;
>>  }
>>
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v4 04/12] migration: avoid concurrent invoke channel_close by different threads

2018-05-31 Thread 858585 jemmy
On Wed, May 30, 2018 at 10:45 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> From: Lidong Chen 
>>
>> The channel_close maybe invoked by different threads. For example, source
>> qemu invokes qemu_fclose in main thread, migration thread and return path
>> thread. Destination qemu invokes qemu_fclose in main thread, listen thread
>> and COLO incoming thread.
>>
>> Add a mutex in QEMUFile struct to avoid concurrent invoke channel_close.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/qemu-file.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index 977b9ae..87d0f05 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -52,6 +52,7 @@ struct QEMUFile {
>>  unsigned int iovcnt;
>>
>>  int last_error;
>> +QemuMutex lock;
>
> That could do with a comment saying what you're protecting
>
>>  };
>>
>>  /*
>> @@ -96,6 +97,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps 
>> *ops)
>>
>>  f = g_new0(QEMUFile, 1);
>>
>> +qemu_mutex_init(&f->lock);
>>  f->opaque = opaque;
>>  f->ops = ops;
>>  return f;
>> @@ -328,7 +330,9 @@ int qemu_fclose(QEMUFile *f)
>>  ret = qemu_file_get_error(f);
>>
>>  if (f->ops->close) {
>> +qemu_mutex_lock(&f->lock);
>>  int ret2 = f->ops->close(f->opaque);
>> +qemu_mutex_unlock(&f->lock);
>
> OK, and at least for the RDMA code, if it calls
> close a 2nd time, rioc->rdma is checked so it wont actually free stuff a
> 2nd time.
>
>>  if (ret >= 0) {
>>  ret = ret2;
>>  }
>> @@ -339,6 +343,7 @@ int qemu_fclose(QEMUFile *f)
>>  if (f->last_error) {
>>  ret = f->last_error;
>>  }
>> +qemu_mutex_destroy(&f->lock);
>>  g_free(f);
>
> Hmm but that's not safe; if two things really do call qemu_fclose()
> on the same structure they race here and can end up destroying the lock
> twice, or doing f->lock  after the 1st one has already g_free(f).

>
>
> So lets go back a step.
> I think:
>   a) There should always be a separate QEMUFile* for
>  to_src_file and from_src_file - I don't see where you open
>  the 2nd one; I don't see your implementation of
>  f->ops->get_return_path.

Yes, the current qemu version uses a separate QEMUFile* for to_src_file and
from_src_file, and the two QEMUFiles point to one QIOChannelRDMA.

The f->ops->get_return_path is implemented by channel_output_ops or
channel_input_ops.

>   b) I *think* that while the different threads might all call
>  fclose(), I think there should only ever be one qemu_fclose
>  call for each direction on the QEMUFile.
>
> But now we have two problems:
>   If (a) is true then f->lock  is separate on each one so
>doesn't really protect if the two directions are closed
>at once. (Assuming (b) is true)

Yes, you are right.  So I should add a QemuMutex to the QIOChannel structure, not
the QEMUFile structure, and qemu_mutex_destroy() that QemuMutex in
qio_channel_finalize().

Thank you.
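For reference, the idea would be roughly (only a sketch; the exact field name and
the init hook are my assumptions, not a tested patch):

    /* include/io/channel.h */
    struct QIOChannel {
        Object parent;
        ...
        QemuMutex close_lock;    /* serialise concurrent qio_channel_close() */
    };

    /* io/channel.c */
    int qio_channel_close(QIOChannel *ioc, Error **errp)
    {
        QIOChannelClass *klass = QIO_CHANNEL_GET_CLASS(ioc);
        int ret;

        qemu_mutex_lock(&ioc->close_lock);
        ret = klass->io_close(ioc, errp);
        qemu_mutex_unlock(&ioc->close_lock);
        return ret;
    }

    /* init the mutex when the channel object is created, and destroy it in
     * qio_channel_finalize(): */
    static void qio_channel_finalize(Object *obj)
    {
        QIOChannel *ioc = QIO_CHANNEL(obj);
        ...
        qemu_mutex_destroy(&ioc->close_lock);
    }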

>
>   If (a) is false and we actually share a single QEMUFile then
>  that race at the end happens.
>
> Dave
>
>
>>  trace_qemu_file_fclose();
>>  return ret;
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v3 5/6] migration: implement bi-directional RDMA QIOChannel

2018-05-22 Thread 858585 jemmy
ping.

On Mon, May 21, 2018 at 7:49 PM, 858585 jemmy  wrote:
> On Wed, May 16, 2018 at 5:36 PM, 858585 jemmy  wrote:
>> On Tue, May 15, 2018 at 10:54 PM, Paolo Bonzini  wrote:
>>> On 05/05/2018 16:35, Lidong Chen wrote:
>>>> @@ -2635,12 +2637,20 @@ static ssize_t qio_channel_rdma_writev(QIOChannel 
>>>> *ioc,
>>>>  {
>>>>  QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>>>  QEMUFile *f = rioc->file;
>>>> -RDMAContext *rdma = rioc->rdma;
>>>> +RDMAContext *rdma;
>>>>  int ret;
>>>>  ssize_t done = 0;
>>>>  size_t i;
>>>>  size_t len = 0;
>>>>
>>>> +rcu_read_lock();
>>>> +rdma = atomic_rcu_read(&rioc->rdmaout);
>>>> +
>>>> +if (!rdma) {
>>>> +rcu_read_unlock();
>>>> +return -EIO;
>>>> +}
>>>> +
>>>>  CHECK_ERROR_STATE();
>>>>
>>>>  /*
>>>
>>> I am not sure I understand this.  It would probably be wrong to use the
>>> output side from two threads at the same time, so why not use two mutexes?
>>
>> Two thread will not invoke qio_channel_rdma_writev at the same time.
>> The source qemu, migration thread only use writev, and the return path
>> thread only
>> use readv.
>> The destination qemu already have a mutex mis->rp_mutex to make sure
>> not use writev
>> at the same time.
>>
>> The rcu_read_lock is used to protect not use RDMAContext when another
>> thread closes it.
>
> Any suggestion?
>
>>
>>>
>>> Also, who is calling qio_channel_rdma_close in such a way that another
>>> thread is still using it?  Would it be possible to synchronize with the
>>> other thread *before*, for example with qemu_thread_join?
>>
>> The MigrationState structure includes to_dst_file and from_dst_file
>> QEMUFile, the two QEMUFile use the same QIOChannel.
>> For example, if the return path thread call
>> qemu_fclose(ms->rp_state.from_dst_file),
>> It will also close the RDMAContext for ms->to_dst_file.
>>
>> For live migration, the source qemu invokes qemu_fclose in different
>> threads include main thread, migration thread, return path thread.
>>
>> The destination qemu invokes qemu_fclose in main thread, listen thread and
>> COLO incoming thread.
>>
>> I do not find an effective way to synchronize these threads.
>>
>> Thanks.
>>
>>>
>>> Thanks,
>>>
>>> Paolo



Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

2018-05-22 Thread 858585 jemmy
On Wed, May 16, 2018 at 9:13 PM, Dr. David Alan Gilbert
 wrote:
> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>
> 
>
>> >> >> > I wonder why dereg_mr takes so long - I could understand if reg_mr
>> >> >> > took a long time, but why for dereg, that sounds like the easy side.
>> >> >>
>> >> >> I use perf collect the information when ibv_dereg_mr is invoked.
>> >> >>
>> >> >> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>> >> >>   `
>> >> >>- put_compound_page
>> >> >>   - 98.45% put_page
>> >> >>__ib_umem_release
>> >> >>ib_umem_release
>> >> >>dereg_mr
>> >> >>mlx5_ib_dereg_mr
>> >> >>ib_dereg_mr
>> >> >>uverbs_free_mr
>> >> >>remove_commit_idr_uobject
>> >> >>_rdma_remove_commit_uobject
>> >> >>rdma_remove_commit_uobject
>> >> >>ib_uverbs_dereg_mr
>> >> >>ib_uverbs_write
>> >> >>vfs_write
>> >> >>sys_write
>> >> >>system_call_fastpath
>> >> >>__GI___libc_write
>> >> >>0
>> >> >>   + 1.55% __ib_umem_release
>> >> >> +   8.31%  client2  [kernel.kallsyms]  [k] compound_unlock_irqrestore
>> >> >> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>> >> >> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>> >> >> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>> >> >> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>> >> >> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>> >> >> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>> >> >> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>> >> >> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>> >> >> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>> >> >> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>> >> >> +   3.13%  client2  [kernel.kallsyms]  [k] get_pageblock_flags_group
>> >> >> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>> >> >>
>> >> >> the reason is __ib_umem_release will loop many times for each page.
>> >> >>
>> >> >> static void __ib_umem_release(struct ib_device *dev, struct ib_umem
>> >> >> *umem, int dirty)
>> >> >> {
>> >> >> struct scatterlist *sg;
>> >> >> struct page *page;
>> >> >> int i;
>> >> >>
>> >> >> if (umem->nmap > 0)
>> >> >>  ib_dma_unmap_sg(dev, umem->sg_head.sgl,
>> >> >> umem->npages,
>> >> >> DMA_BIDIRECTIONAL);
>> >> >>
>> >> >>  for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {  <<
>> >> >> loop a lot of times for each page.here
>> >> >
>> >> > Why 'lot of times for each page'?  I don't know this code at all, but
>> >> > I'd expected once per page?
>> >>
>> >> sorry, once per page, but a lot of page for a big size virtual machine.
>> >
>> > Ah OK; so yes it seems best if you can find a way to do the release in
>> > the migration thread then;  still maybe this is something some
>> > of the kernel people could look at speeding up?
>>
>> The kernel code seem is not complex, and I have no idea how to speed up.
>
> Me neither; but I'll ask around.
>
>> >> >
>> >> > With your other kernel fix, does the problem of the missing
>> >> > RDMA_CM_EVENT_DISCONNECTED events go away?
>> >>
>> >> Yes, after kernel and qemu fixed, this issue never happens again.
>> >
>> > I'm confused; which qemu fix; my question was whether the kernel fix by
>> > itself fixed the problem of the missing event.
>>
>> this qemu fix:
>> migration: update index

Re: [Qemu-devel] [PATCH v3 5/6] migration: implement bi-directional RDMA QIOChannel

2018-05-21 Thread 858585 jemmy
On Wed, May 16, 2018 at 5:36 PM, 858585 jemmy  wrote:
> On Tue, May 15, 2018 at 10:54 PM, Paolo Bonzini  wrote:
>> On 05/05/2018 16:35, Lidong Chen wrote:
>>> @@ -2635,12 +2637,20 @@ static ssize_t qio_channel_rdma_writev(QIOChannel 
>>> *ioc,
>>>  {
>>>  QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>>  QEMUFile *f = rioc->file;
>>> -RDMAContext *rdma = rioc->rdma;
>>> +RDMAContext *rdma;
>>>  int ret;
>>>  ssize_t done = 0;
>>>  size_t i;
>>>  size_t len = 0;
>>>
>>> +rcu_read_lock();
>>> +rdma = atomic_rcu_read(&rioc->rdmaout);
>>> +
>>> +if (!rdma) {
>>> +rcu_read_unlock();
>>> +return -EIO;
>>> +}
>>> +
>>>  CHECK_ERROR_STATE();
>>>
>>>  /*
>>
>> I am not sure I understand this.  It would probably be wrong to use the
>> output side from two threads at the same time, so why not use two mutexes?
>
> Two thread will not invoke qio_channel_rdma_writev at the same time.
> The source qemu, migration thread only use writev, and the return path
> thread only
> use readv.
> The destination qemu already have a mutex mis->rp_mutex to make sure
> not use writev
> at the same time.
>
> The rcu_read_lock is used to protect not use RDMAContext when another
> thread closes it.

Any suggestion?

>
>>
>> Also, who is calling qio_channel_rdma_close in such a way that another
>> thread is still using it?  Would it be possible to synchronize with the
>> other thread *before*, for example with qemu_thread_join?
>
> The MigrationState structure includes to_dst_file and from_dst_file
> QEMUFile, the two QEMUFile use the same QIOChannel.
> For example, if the return path thread call
> qemu_fclose(ms->rp_state.from_dst_file),
> It will also close the RDMAContext for ms->to_dst_file.
>
> For live migration, the source qemu invokes qemu_fclose in different
> threads include main thread, migration thread, return path thread.
>
> The destination qemu invokes qemu_fclose in main thread, listen thread and
> COLO incoming thread.
>
> I do not find an effective way to synchronize these threads.
>
> Thanks.
>
>>
>> Thanks,
>>
>> Paolo



Re: [Qemu-devel] FW: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

2018-05-17 Thread 858585 jemmy
On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel
 wrote:
>
>
> On 5/17/2018 5:42 AM, 858585 jemmy wrote:
>>
>> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel
>>  wrote:
>>>
>>> Hi Lidong and David,
>>> Sorry for the late response, I had to ramp up on migration code and build
>>> a
>>> setup on my side.
>>>
>>> PSB my comments for this patch below.
>>> For the RDMA post-copy patches I will comment next week after testing on
>>> Mellanox side too.
>>>
>>> Thanks!
>>>
>>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote:
>>>>
>>>>
>>>> -Original Message-
>>>> From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
>>>> Sent: Wednesday, May 16, 2018 4:13 PM
>>>> To: 858585 jemmy 
>>>> Cc: Aviad Yehezkel ; Juan Quintela
>>>> ; qemu-devel ; Gal Shachaf
>>>> ; Adi Dotan ; Lidong Chen
>>>> 
>>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED
>>>> event after rdma_disconnect
>>>>
>>>> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>>>>
>>>> 
>>>>
>>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if
>>>>>>>>>> reg_mr took a long time, but why for dereg, that sounds like the
>>>>>>>>>> easy side.
>>>>>>>>>
>>>>>>>>> I use perf collect the information when ibv_dereg_mr is invoked.
>>>>>>>>>
>>>>>>>>> -   9.95%  client2  [kernel.kallsyms]  [k] put_compound_page
>>>>>>>>> `
>>>>>>>>>  - put_compound_page
>>>>>>>>> - 98.45% put_page
>>>>>>>>>  __ib_umem_release
>>>>>>>>>  ib_umem_release
>>>>>>>>>  dereg_mr
>>>>>>>>>  mlx5_ib_dereg_mr
>>>>>>>>>  ib_dereg_mr
>>>>>>>>>  uverbs_free_mr
>>>>>>>>>  remove_commit_idr_uobject
>>>>>>>>>  _rdma_remove_commit_uobject
>>>>>>>>>  rdma_remove_commit_uobject
>>>>>>>>>  ib_uverbs_dereg_mr
>>>>>>>>>  ib_uverbs_write
>>>>>>>>>  vfs_write
>>>>>>>>>  sys_write
>>>>>>>>>  system_call_fastpath
>>>>>>>>>  __GI___libc_write
>>>>>>>>>  0
>>>>>>>>> + 1.55% __ib_umem_release
>>>>>>>>> +   8.31%  client2  [kernel.kallsyms]  [k]
>>>>>>>>> compound_unlock_irqrestore
>>>>>>>>> +   7.01%  client2  [kernel.kallsyms]  [k] page_waitqueue
>>>>>>>>> +   7.00%  client2  [kernel.kallsyms]  [k] set_page_dirty
>>>>>>>>> +   6.61%  client2  [kernel.kallsyms]  [k] unlock_page
>>>>>>>>> +   6.33%  client2  [kernel.kallsyms]  [k] put_page_testzero
>>>>>>>>> +   5.68%  client2  [kernel.kallsyms]  [k] set_page_dirty_lock
>>>>>>>>> +   4.30%  client2  [kernel.kallsyms]  [k] __wake_up_bit
>>>>>>>>> +   4.04%  client2  [kernel.kallsyms]  [k] free_pages_prepare
>>>>>>>>> +   3.65%  client2  [kernel.kallsyms]  [k] release_pages
>>>>>>>>> +   3.62%  client2  [kernel.kallsyms]  [k] arch_local_irq_save
>>>>>>>>> +   3.35%  client2  [kernel.kallsyms]  [k] page_mapping
>>>>>>>>> +   3.13%  client2  [kernel.kallsyms]  [k]
>>>>>>>>> get_pageblock_flags_group
>>>>>>>>> +   3.09%  client2  [kernel.kallsyms]  [k] put_page
>>>>>>>>>
>>>>>>>>> the reason is __ib_umem_release will loop many times for each page.
>>>>>>>>>
>>>>>>>>> static void __ib_umem_release(struct ib_device *dev, struct
>>>>>>>>> ib_umem *umem, int dirty) {
>>>>>>>>>   struct scatterlist *sg;
>>>>>>>>

Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

2018-05-16 Thread 858585 jemmy
On Wed, May 16, 2018 at 5:53 PM, Dr. David Alan Gilbert
 wrote:
> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> On Wed, May 16, 2018 at 5:39 PM, Dr. David Alan Gilbert
>>  wrote:
>> > * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> >> On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
>> >>  wrote:
>> >> > * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> >> >> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
>> >> >>  wrote:
>> >> >> > * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> >> >> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>> >> >> >>  wrote:
>> >> >> >> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>> >> >> >> >> When cancel migration during RDMA precopy, the source qemu main 
>> >> >> >> >> thread hangs sometime.
>> >> >> >> >>
>> >> >> >> >> The backtrace is:
>> >> >> >> >> (gdb) bt
>> >> >> >> >> #0  0x7f249eabd43d in write () from 
>> >> >> >> >> /lib64/libpthread.so.0
>> >> >> >> >> #1  0x7f24a1ce98e4 in rdma_get_cm_event 
>> >> >> >> >> (channel=0x4675d10, event=0x7ffe2f643dd0) at src/cma.c:2189
>> >> >> >> >> #2  0x007b6166 in qemu_rdma_cleanup (rdma=0x6784000) 
>> >> >> >> >> at migration/rdma.c:2296
>> >> >> >> >> #3  0x007b7cae in qio_channel_rdma_close 
>> >> >> >> >> (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>> >> >> >> >> #4  0x008db60e in qio_channel_close (ioc=0x3bfcc30, 
>> >> >> >> >> errp=0x0) at io/channel.c:273
>> >> >> >> >> #5  0x007a8765 in channel_close (opaque=0x3bfcc30) 
>> >> >> >> >> at migration/qemu-file-channel.c:98
>> >> >> >> >> #6  0x007a71f9 in qemu_fclose (f=0x527c000) at 
>> >> >> >> >> migration/qemu-file.c:334
>> >> >> >> >> #7  0x00795b96 in migrate_fd_cleanup 
>> >> >> >> >> (opaque=0x3b46280) at migration/migration.c:1162
>> >> >> >> >> #8  0x0093a71b in aio_bh_call (bh=0x3db7a20) at 
>> >> >> >> >> util/async.c:90
>> >> >> >> >> #9  0x0093a7b2 in aio_bh_poll (ctx=0x3b121c0) at 
>> >> >> >> >> util/async.c:118
>> >> >> >> >> #10 0x0093f2ad in aio_dispatch (ctx=0x3b121c0) at 
>> >> >> >> >> util/aio-posix.c:436
>> >> >> >> >> #11 0x0093ab41 in aio_ctx_dispatch 
>> >> >> >> >> (source=0x3b121c0, callback=0x0, user_data=0x0)
>> >> >> >> >> at util/async.c:261
>> >> >> >> >> #12 0x7f249f73c7aa in g_main_context_dispatch () from 
>> >> >> >> >> /lib64/libglib-2.0.so.0
>> >> >> >> >> #13 0x0093dc5e in glib_pollfds_poll () at 
>> >> >> >> >> util/main-loop.c:215
>> >> >> >> >> #14 0x0093dd4e in os_host_main_loop_wait 
>> >> >> >> >> (timeout=2800) at util/main-loop.c:263
>> >> >> >> >> #15 0x0093de05 in main_loop_wait (nonblocking=0) at 
>> >> >> >> >> util/main-loop.c:522
>> >> >> >> >> #16 0x005bc6a5 in main_loop () at vl.c:1944
>> >> >> >> >> #17 0x005c39b5 in main (argc=56, 
>> >> >> >> >> argv=0x7ffe2f6443f8, envp=0x3ad0030) at vl.c:4752
>> >> >> >> >>
>> >> >> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after 
>> >> >> >> >> rdma_disconnect sometime.
>> >> >> >> >> I do not find out the root cause why not get 
>> >> >> >> >> RDMA_CM_EVENT_DISCONNECTED event, but
>> >> >> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all 
>> >> >> >> >> ram blocks which fixed

Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

2018-05-16 Thread 858585 jemmy
On Wed, May 16, 2018 at 5:39 PM, Dr. David Alan Gilbert
 wrote:
> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
>>  wrote:
>> > * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> >> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
>> >>  wrote:
>> >> > * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> >> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>> >> >>  wrote:
>> >> >> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>> >> >> >> When cancel migration during RDMA precopy, the source qemu main 
>> >> >> >> thread hangs sometime.
>> >> >> >>
>> >> >> >> The backtrace is:
>> >> >> >> (gdb) bt
>> >> >> >> #0  0x7f249eabd43d in write () from /lib64/libpthread.so.0
>> >> >> >> #1  0x7f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, 
>> >> >> >> event=0x7ffe2f643dd0) at src/cma.c:2189
>> >> >> >> #2  0x007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at 
>> >> >> >> migration/rdma.c:2296
>> >> >> >> #3  0x007b7cae in qio_channel_rdma_close 
>> >> >> >> (ioc=0x3bfcc30, errp=0x0) at migration/rdma.c:2999
>> >> >> >> #4  0x008db60e in qio_channel_close (ioc=0x3bfcc30, 
>> >> >> >> errp=0x0) at io/channel.c:273
>> >> >> >> #5  0x007a8765 in channel_close (opaque=0x3bfcc30) at 
>> >> >> >> migration/qemu-file-channel.c:98
>> >> >> >> #6  0x007a71f9 in qemu_fclose (f=0x527c000) at 
>> >> >> >> migration/qemu-file.c:334
>> >> >> >> #7  0x00795b96 in migrate_fd_cleanup (opaque=0x3b46280) 
>> >> >> >> at migration/migration.c:1162
>> >> >> >> #8  0x0093a71b in aio_bh_call (bh=0x3db7a20) at 
>> >> >> >> util/async.c:90
>> >> >> >> #9  0x0093a7b2 in aio_bh_poll (ctx=0x3b121c0) at 
>> >> >> >> util/async.c:118
>> >> >> >> #10 0x0093f2ad in aio_dispatch (ctx=0x3b121c0) at 
>> >> >> >> util/aio-posix.c:436
>> >> >> >> #11 0x0093ab41 in aio_ctx_dispatch (source=0x3b121c0, 
>> >> >> >> callback=0x0, user_data=0x0)
>> >> >> >> at util/async.c:261
>> >> >> >> #12 0x7f249f73c7aa in g_main_context_dispatch () from 
>> >> >> >> /lib64/libglib-2.0.so.0
>> >> >> >> #13 0x0093dc5e in glib_pollfds_poll () at 
>> >> >> >> util/main-loop.c:215
>> >> >> >> #14 0x0093dd4e in os_host_main_loop_wait 
>> >> >> >> (timeout=2800) at util/main-loop.c:263
>> >> >> >> #15 0x0093de05 in main_loop_wait (nonblocking=0) at 
>> >> >> >> util/main-loop.c:522
>> >> >> >> #16 0x005bc6a5 in main_loop () at vl.c:1944
>> >> >> >> #17 0x005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, 
>> >> >> >> envp=0x3ad0030) at vl.c:4752
>> >> >> >>
>> >> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after 
>> >> >> >> rdma_disconnect sometime.
>> >> >> >> I do not find out the root cause why not get 
>> >> >> >> RDMA_CM_EVENT_DISCONNECTED event, but
>> >> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram 
>> >> >> >> blocks which fixed
>> >> >> >> in previous patch.
>> >> >> >
>> >> >> > Does this happen without your other changes?
>> >> >>
>> >> >> Yes, this issue also happen on v2.12.0. base on
>> >> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
>> >> >>
>> >> >> > Can you give me instructions to repeat it and also say which
>> >> >> > cards you wereusing?
>> >> >>
>> >> >> This issue can be reproduced by start and cancel migration.
>> >> >> less than 10 times, this issue will b

Re: [Qemu-devel] [PATCH v3 5/6] migration: implement bi-directional RDMA QIOChannel

2018-05-16 Thread 858585 jemmy
On Tue, May 15, 2018 at 10:54 PM, Paolo Bonzini  wrote:
> On 05/05/2018 16:35, Lidong Chen wrote:
>> @@ -2635,12 +2637,20 @@ static ssize_t qio_channel_rdma_writev(QIOChannel 
>> *ioc,
>>  {
>>  QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>  QEMUFile *f = rioc->file;
>> -RDMAContext *rdma = rioc->rdma;
>> +RDMAContext *rdma;
>>  int ret;
>>  ssize_t done = 0;
>>  size_t i;
>>  size_t len = 0;
>>
>> +rcu_read_lock();
>> +rdma = atomic_rcu_read(&rioc->rdmaout);
>> +
>> +if (!rdma) {
>> +rcu_read_unlock();
>> +return -EIO;
>> +}
>> +
>>  CHECK_ERROR_STATE();
>>
>>  /*
>
> I am not sure I understand this.  It would probably be wrong to use the
> output side from two threads at the same time, so why not use two mutexes?

Two threads will not invoke qio_channel_rdma_writev at the same time.
In the source qemu, the migration thread only uses writev, and the return path
thread only uses readv.
The destination qemu already has a mutex, mis->rp_mutex, to make sure writev
is not used concurrently.

The rcu_read_lock is used to make sure the RDMAContext is not used after
another thread closes it.

>
> Also, who is calling qio_channel_rdma_close in such a way that another
> thread is still using it?  Would it be possible to synchronize with the
> other thread *before*, for example with qemu_thread_join?

The MigrationState structure includes the to_dst_file and from_dst_file
QEMUFiles, and the two QEMUFiles use the same QIOChannel.
For example, if the return path thread calls
qemu_fclose(ms->rp_state.from_dst_file),
it will also close the RDMAContext for ms->to_dst_file.

For live migration, the source qemu invokes qemu_fclose from different
threads, including the main thread, the migration thread and the return path thread.

The destination qemu invokes qemu_fclose from the main thread, the listen thread
and the COLO incoming thread.

I have not found an effective way to synchronize these threads.

Thanks.

>
> Thanks,
>
> Paolo



Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

2018-05-16 Thread 858585 jemmy
On Tue, May 15, 2018 at 3:27 AM, Dr. David Alan Gilbert
 wrote:
> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
>>  wrote:
>> > * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> >> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>> >>  wrote:
>> >> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>> >> >> When cancel migration during RDMA precopy, the source qemu main thread 
>> >> >> hangs sometime.
>> >> >>
>> >> >> The backtrace is:
>> >> >> (gdb) bt
>> >> >> #0  0x7f249eabd43d in write () from /lib64/libpthread.so.0
>> >> >> #1  0x7f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, 
>> >> >> event=0x7ffe2f643dd0) at src/cma.c:2189
>> >> >> #2  0x007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at 
>> >> >> migration/rdma.c:2296
>> >> >> #3  0x007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, 
>> >> >> errp=0x0) at migration/rdma.c:2999
>> >> >> #4  0x008db60e in qio_channel_close (ioc=0x3bfcc30, 
>> >> >> errp=0x0) at io/channel.c:273
>> >> >> #5  0x007a8765 in channel_close (opaque=0x3bfcc30) at 
>> >> >> migration/qemu-file-channel.c:98
>> >> >> #6  0x007a71f9 in qemu_fclose (f=0x527c000) at 
>> >> >> migration/qemu-file.c:334
>> >> >> #7  0x00795b96 in migrate_fd_cleanup (opaque=0x3b46280) at 
>> >> >> migration/migration.c:1162
>> >> >> #8  0x0093a71b in aio_bh_call (bh=0x3db7a20) at 
>> >> >> util/async.c:90
>> >> >> #9  0x0093a7b2 in aio_bh_poll (ctx=0x3b121c0) at 
>> >> >> util/async.c:118
>> >> >> #10 0x0093f2ad in aio_dispatch (ctx=0x3b121c0) at 
>> >> >> util/aio-posix.c:436
>> >> >> #11 0x0093ab41 in aio_ctx_dispatch (source=0x3b121c0, 
>> >> >> callback=0x0, user_data=0x0)
>> >> >> at util/async.c:261
>> >> >> #12 0x7f249f73c7aa in g_main_context_dispatch () from 
>> >> >> /lib64/libglib-2.0.so.0
>> >> >> #13 0x0093dc5e in glib_pollfds_poll () at 
>> >> >> util/main-loop.c:215
>> >> >> #14 0x0093dd4e in os_host_main_loop_wait 
>> >> >> (timeout=2800) at util/main-loop.c:263
>> >> >> #15 0x0093de05 in main_loop_wait (nonblocking=0) at 
>> >> >> util/main-loop.c:522
>> >> >> #16 0x005bc6a5 in main_loop () at vl.c:1944
>> >> >> #17 0x005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, 
>> >> >> envp=0x3ad0030) at vl.c:4752
>> >> >>
>> >> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after 
>> >> >> rdma_disconnect sometime.
>> >> >> I do not find out the root cause why not get 
>> >> >> RDMA_CM_EVENT_DISCONNECTED event, but
>> >> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram 
>> >> >> blocks which fixed
>> >> >> in previous patch.
>> >> >
>> >> > Does this happen without your other changes?
>> >>
>> >> Yes, this issue also happen on v2.12.0. base on
>> >> commit 4743c23509a51bd4ee85cc272287a41917d1be35
>> >>
>> >> > Can you give me instructions to repeat it and also say which
>> >> > cards you wereusing?
>> >>
>> >> This issue can be reproduced by start and cancel migration.
>> >> less than 10 times, this issue will be reproduced.
>> >>
>> >> The command line is:
>> >> virsh migrate --live --copy-storage-all  --undefinesource --persistent
>> >> --timeout 10800 \
>> >>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
>> >> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
>> >>
>> >> The net card i use is :
>> >> :3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> [ConnectX-4 Lx]
>> >> :3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
>> >> [ConnectX-4 Lx]
>> >>
>> >> This issue is related to ibv_dereg_mr, if not invoke ibv

Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

2018-05-14 Thread 858585 jemmy
On Sat, May 12, 2018 at 2:03 AM, Dr. David Alan Gilbert
 wrote:
> * 858585 jemmy (jemmy858...@gmail.com) wrote:
>> On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
>>  wrote:
>> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>> >> When cancel migration during RDMA precopy, the source qemu main thread 
>> >> hangs sometime.
>> >>
>> >> The backtrace is:
>> >> (gdb) bt
>> >> #0  0x7f249eabd43d in write () from /lib64/libpthread.so.0
>> >> #1  0x7f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, 
>> >> event=0x7ffe2f643dd0) at src/cma.c:2189
>> >> #2  0x007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at 
>> >> migration/rdma.c:2296
>> >> #3  0x007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, 
>> >> errp=0x0) at migration/rdma.c:2999
>> >> #4  0x008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) 
>> >> at io/channel.c:273
>> >> #5  0x007a8765 in channel_close (opaque=0x3bfcc30) at 
>> >> migration/qemu-file-channel.c:98
>> >> #6  0x007a71f9 in qemu_fclose (f=0x527c000) at 
>> >> migration/qemu-file.c:334
>> >> #7  0x00795b96 in migrate_fd_cleanup (opaque=0x3b46280) at 
>> >> migration/migration.c:1162
>> >> #8  0x0093a71b in aio_bh_call (bh=0x3db7a20) at 
>> >> util/async.c:90
>> >> #9  0x0093a7b2 in aio_bh_poll (ctx=0x3b121c0) at 
>> >> util/async.c:118
>> >> #10 0x0093f2ad in aio_dispatch (ctx=0x3b121c0) at 
>> >> util/aio-posix.c:436
>> >> #11 0x0093ab41 in aio_ctx_dispatch (source=0x3b121c0, 
>> >> callback=0x0, user_data=0x0)
>> >> at util/async.c:261
>> >> #12 0x7f249f73c7aa in g_main_context_dispatch () from 
>> >> /lib64/libglib-2.0.so.0
>> >> #13 0x0093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>> >> #14 0x0093dd4e in os_host_main_loop_wait (timeout=2800) 
>> >> at util/main-loop.c:263
>> >> #15 0x0093de05 in main_loop_wait (nonblocking=0) at 
>> >> util/main-loop.c:522
>> >> #16 0x005bc6a5 in main_loop () at vl.c:1944
>> >> #17 0x005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, 
>> >> envp=0x3ad0030) at vl.c:4752
>> >>
>> >> It does not get the RDMA_CM_EVENT_DISCONNECTED event after 
>> >> rdma_disconnect sometime.
>> >> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED 
>> >> event, but
>> >> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks 
>> >> which fixed
>> >> in previous patch.
>> >
>> > Does this happen without your other changes?
>>
>> Yes, this issue also happen on v2.12.0. base on
>> commit 4743c23509a51bd4ee85cc272287a41917d1be35
>>
>> > Can you give me instructions to repeat it and also say which
>> > cards you wereusing?
>>
>> This issue can be reproduced by start and cancel migration.
>> less than 10 times, this issue will be reproduced.
>>
>> The command line is:
>> virsh migrate --live --copy-storage-all  --undefinesource --persistent
>> --timeout 10800 \
>>  --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
>> qemu+ssh://9.16.46.142/system rdma://9.16.46.142
>>
>> The net card i use is :
>> :3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
>> [ConnectX-4 Lx]
>> :3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
>> [ConnectX-4 Lx]
>>
>> This issue is related to ibv_dereg_mr, if not invoke ibv_dereg_mr for
>> all ram block, this issue can be reproduced.
>> If we fixed the bugs and use ibv_dereg_mr to release all ram block,
>> this issue never happens.
>
> Maybe that is the right fix; I can imagine that the RDMA code doesn't
> like closing down if there are still ramblocks registered that
> potentially could have incoming DMA?
>
>> And for the kernel part, there is a bug also cause not release ram
>> block when canceling live migration.
>> https://patchwork.kernel.org/patch/10385781/
>
> OK, that's a pain; which threads are doing the dereg - is some stuff
> in the migration thread and some stuff in the main thread on cleanup?

Yes, the migration thread invokes ibv_reg_mr, and the main thread bh
will invoke ibv_dereg_mr.
and when the main thread schedu

Re: [Qemu-devel] [PATCH 1/2] migration: implement io_set_aio_fd_handler function for RDMA QIOChannel

2018-05-08 Thread 858585 jemmy
On Wed, May 9, 2018 at 1:10 AM, Dr. David Alan Gilbert
 wrote:
> * Juan Quintela (quint...@redhat.com) wrote:
>> Lidong Chen  wrote:
>> > if qio_channel_rdma_readv return QIO_CHANNEL_ERR_BLOCK, the destination 
>> > qemu
>> > crash.
>> >
>> > The backtrace is:
>> > (gdb) bt
>> > #0  0x in ?? ()
>> > #1  0x008db50e in qio_channel_set_aio_fd_handler 
>> > (ioc=0x38111e0, ctx=0x3726080,
>> > io_read=0x8db841 , io_write=0x0, 
>> > opaque=0x38111e0) at io/channel.c:
>> > #2  0x008db952 in qio_channel_set_aio_fd_handlers 
>> > (ioc=0x38111e0) at io/channel.c:438
>> > #3  0x008dbab4 in qio_channel_yield (ioc=0x38111e0, 
>> > condition=G_IO_IN) at io/channel.c:47
>> > #4  0x007a870b in channel_get_buffer (opaque=0x38111e0, 
>> > buf=0x440c038 "", pos=0, size=327
>> > at migration/qemu-file-channel.c:83
>> > #5  0x007a70f6 in qemu_fill_buffer (f=0x440c000) at 
>> > migration/qemu-file.c:299
>> > #6  0x007a79d0 in qemu_peek_byte (f=0x440c000, offset=0) at 
>> > migration/qemu-file.c:562
>> > #7  0x007a7a22 in qemu_get_byte (f=0x440c000) at 
>> > migration/qemu-file.c:575
>> > #8  0x007a7c78 in qemu_get_be32 (f=0x440c000) at 
>> > migration/qemu-file.c:655
>> > #9  0x007a0508 in qemu_loadvm_state (f=0x440c000) at 
>> > migration/savevm.c:2126
>> > #10 0x00794141 in process_incoming_migration_co (opaque=0x0) 
>> > at migration/migration.c:366
>> > #11 0x0095c598 in coroutine_trampoline (i0=84033984, i1=0) at 
>> > util/coroutine-ucontext.c:1
>> > #12 0x7f9c0db56d40 in ?? () from /lib64/libc.so.6
>> > #13 0x7f96fe858760 in ?? ()
>> > #14 0x in ?? ()
>> >
>> > RDMA QIOChannel not implement io_set_aio_fd_handler. so
>> > qio_channel_set_aio_fd_handler will access NULL pointer.
>> >
>> > Signed-off-by: Lidong Chen 
>> > ---
>>
>>
>> Hi
>>
>> could you resend, it don't compile for me :-(
>
> This really sits after the other set of rdma changes.
> I doubt this path is reachable without the previous set.
>
> Dave

Hi Juan:
I should not have separated the patchset.  Sorry for this mistake. This
patch is based on another patch:
http://patchwork.ozlabs.org/patch/909156/
After Daniel has reviewed this patch, I will send the v4 version,
which will include all the patches for RDMA live migration.
Thanks.

>>
>> /mnt/kvm/qemu/cleanup/migration/rdma.c: In function 
>> ‘qio_channel_rdma_set_aio_fd_handler’:
>> /mnt/kvm/qemu/cleanup/migration/rdma.c:2877:39: error: ‘QIOChannelRDMA’ {aka 
>> ‘struct QIOChannelRDMA’} has no member named ‘rdmain’; did you mean ‘rdma’?
>>  aio_set_fd_handler(ctx, rioc->rdmain->comp_channel->fd,
>>^~
>>rdma
>> /mnt/kvm/qemu/cleanup/migration/rdma.c:2880:39: error: ‘QIOChannelRDMA’ {aka 
>> ‘struct QIOChannelRDMA’} has no member named ‘rdmaout’; did you mean ‘rdma’?
>>  aio_set_fd_handler(ctx, rioc->rdmaout->comp_channel->fd,
>>^~~
>>rdma
>> make: *** [/mnt/kvm/qemu/cleanup/rules.mak:66: migration/rdma.o] Error 1
>>   CC  migration/block.o
>>   CC  ui/vnc.o
>>
>> It seems like
>>
>> > diff --git a/migration/rdma.c b/migration/rdma.c
>> > index 92e4d30..dfa4f77 100644
>> > --- a/migration/rdma.c
>> > +++ b/migration/rdma.c
>> > @@ -2963,6 +2963,21 @@ static GSource 
>> > *qio_channel_rdma_create_watch(QIOChannel *ioc,
>> >  return source;
>> >  }
>> >
>> > +static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>> > +  AioContext *ctx,
>> > +  IOHandler *io_read,
>> > +  IOHandler *io_write,
>> > +  void *opaque)
>> > +{
>> > +QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> > +if (io_read) {
>> > +aio_set_fd_handler(ctx, rioc->rdmain->comp_channel->fd,
>>
>> this should be rioc->rdam->comp_channel
>>
>> > +   false, io_read, io_write, NULL, opaque);
>> > +} else {
>> > +aio_set_fd_handler(ctx, rioc->rdmaout->comp_channel->fd,
>>
>> and this rioc-rdma->comp_channel
>>
>> But will preffer if you confirm.
>>
>> Thanks.
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect

2018-05-08 Thread 858585 jemmy
On Wed, May 9, 2018 at 2:40 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> When cancel migration during RDMA precopy, the source qemu main thread hangs 
>> sometime.
>>
>> The backtrace is:
>> (gdb) bt
>> #0  0x7f249eabd43d in write () from /lib64/libpthread.so.0
>> #1  0x7f24a1ce98e4 in rdma_get_cm_event (channel=0x4675d10, 
>> event=0x7ffe2f643dd0) at src/cma.c:2189
>> #2  0x007b6166 in qemu_rdma_cleanup (rdma=0x6784000) at 
>> migration/rdma.c:2296
>> #3  0x007b7cae in qio_channel_rdma_close (ioc=0x3bfcc30, 
>> errp=0x0) at migration/rdma.c:2999
>> #4  0x008db60e in qio_channel_close (ioc=0x3bfcc30, errp=0x0) at 
>> io/channel.c:273
>> #5  0x007a8765 in channel_close (opaque=0x3bfcc30) at 
>> migration/qemu-file-channel.c:98
>> #6  0x007a71f9 in qemu_fclose (f=0x527c000) at 
>> migration/qemu-file.c:334
>> #7  0x00795b96 in migrate_fd_cleanup (opaque=0x3b46280) at 
>> migration/migration.c:1162
>> #8  0x0093a71b in aio_bh_call (bh=0x3db7a20) at util/async.c:90
>> #9  0x0093a7b2 in aio_bh_poll (ctx=0x3b121c0) at util/async.c:118
>> #10 0x0093f2ad in aio_dispatch (ctx=0x3b121c0) at 
>> util/aio-posix.c:436
>> #11 0x0093ab41 in aio_ctx_dispatch (source=0x3b121c0, 
>> callback=0x0, user_data=0x0)
>> at util/async.c:261
>> #12 0x7f249f73c7aa in g_main_context_dispatch () from 
>> /lib64/libglib-2.0.so.0
>> #13 0x0093dc5e in glib_pollfds_poll () at util/main-loop.c:215
>> #14 0x0093dd4e in os_host_main_loop_wait (timeout=2800) at 
>> util/main-loop.c:263
>> #15 0x0093de05 in main_loop_wait (nonblocking=0) at 
>> util/main-loop.c:522
>> #16 0x005bc6a5 in main_loop () at vl.c:1944
>> #17 0x005c39b5 in main (argc=56, argv=0x7ffe2f6443f8, 
>> envp=0x3ad0030) at vl.c:4752
>>
>> It does not get the RDMA_CM_EVENT_DISCONNECTED event after rdma_disconnect 
>> sometime.
>> I do not find out the root cause why not get RDMA_CM_EVENT_DISCONNECTED 
>> event, but
>> it can be reproduced if not invoke ibv_dereg_mr to release all ram blocks 
>> which fixed
>> in previous patch.
>
> Does this happen without your other changes?

Yes, this issue also happens on v2.12.0, based on
commit 4743c23509a51bd4ee85cc272287a41917d1be35.

> Can you give me instructions to repeat it and also say which
> cards you wereusing?

This issue can be reproduced by starting and cancelling migration;
within 10 attempts it will be reproduced.

The command line is:
virsh migrate --live --copy-storage-all  --undefinesource --persistent
--timeout 10800 \
 --verbose 83e0049e-1325-4f31-baf9-25231509ada1  \
qemu+ssh://9.16.46.142/system rdma://9.16.46.142

The network card I use is:
:3b:00.0 Ethernet controller: Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]
:3b:00.1 Ethernet controller: Mellanox Technologies MT27710 Family
[ConnectX-4 Lx]

This issue is related to ibv_dereg_mr: if ibv_dereg_mr is not invoked for
all ram blocks, this issue can be reproduced.
If we fix the bugs and use ibv_dereg_mr to release all ram blocks,
this issue never happens.

And on the kernel side, there is also a bug that causes ram blocks not to be
released when cancelling live migration.
https://patchwork.kernel.org/patch/10385781/

>
>> Anyway, it should not invoke rdma_get_cm_event in main thread, and the event 
>> channel
>> is also destroyed in qemu_rdma_cleanup.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/rdma.c   | 12 ++--
>>  migration/trace-events |  1 -
>>  2 files changed, 2 insertions(+), 11 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 0dd4033..92e4d30 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2275,8 +2275,7 @@ static int qemu_rdma_write(QEMUFile *f, RDMAContext 
>> *rdma,
>>
>>  static void qemu_rdma_cleanup(RDMAContext *rdma)
>>  {
>> -struct rdma_cm_event *cm_event;
>> -int ret, idx;
>> +int idx;
>>
>>  if (rdma->cm_id && rdma->connected) {
>>  if ((rdma->error_state ||
>> @@ -2290,14 +2289,7 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
>>  qemu_rdma_post_send_control(rdma, NULL, &head);
>>  }
>>
>> -ret = rdma_disconnect(rdma->cm_id);
>> -if (!ret) {
>> -trace_qemu_rdma_cleanup_waiting_for_disconnect();
>> -ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> -if (!ret) {
>> -rdma_ack_cm_event(cm_event);
>> -}
>> -}
>> +rdma_disconnect(rdma->cm_id);
>
> I'm worried whether this change could break stuff:
> The docs say for rdma_disconnect that it flushes any posted work
> requests to the completion queue;  so unless we wait for the event
> do we know the stuff has been flushed?   In the normal non-cancel case
> I'm worried that means we could lose something.
> (But I don't know the rdma/infiniband

Re: [Qemu-devel] [PATCH 1/2] migration: update index field when delete or qsort RDMALocalBlock

2018-05-08 Thread 858585 jemmy
On Wed, May 9, 2018 at 1:19 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> rdma_delete_block function deletes RDMALocalBlock base on index field,
>> but not update the index field. So when next time invoke rdma_delete_block,
>> it will not work correctly.
>>
>> If start and cancel migration repeatedly, some RDMALocalBlock not invoke
>> ibv_dereg_mr to decrease kernel mm_struct vmpin. When vmpin is large than
>> max locked memory limitation, ibv_reg_mr will failed, and migration can not
>> start successfully again.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/rdma.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index ed9cfb1..0dd4033 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -713,6 +713,9 @@ static int rdma_delete_block(RDMAContext *rdma, 
>> RDMALocalBlock *block)
>>  memcpy(local->block + block->index, old + (block->index + 1),
>>  sizeof(RDMALocalBlock) *
>>  (local->nb_blocks - (block->index + 1)));
>> +for (x = block->index; x < local->nb_blocks - 1; x++) {
>> +local->block[x].index--;
>> +}
>
> Yes; is that equivalent to   local->blocks[x].index = x;   ?

yes, it's equivalent.
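i.e. the decrement loop could equally be written as (sketch):

    for (x = block->index; x < local->nb_blocks - 1; x++) {
        local->block[x].index = x;
    }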

>
>>  }
>>  } else {
>>  assert(block == local->block);
>> @@ -3398,6 +3401,10 @@ static int qemu_rdma_registration_handle(QEMUFile *f, 
>> void *opaque)
>>  qsort(rdma->local_ram_blocks.block,
>>rdma->local_ram_blocks.nb_blocks,
>>sizeof(RDMALocalBlock), dest_ram_sort_func);
>> +for (i = 0; i < local->nb_blocks; i++) {
>> +local->block[i].index = i;
>> +}
>> +
>
> Which is basically the way that one does it;
>
> OK, it's a while since I looked at this but I think it fixes my 3 year
> old 03fcab38617 patch, so
>
>
>
> Reviewed-by: Dr. David Alan Gilbert 
>
>>  if (rdma->pin_all) {
>>  ret = qemu_rdma_reg_whole_ram_blocks(rdma);
>>  if (ret) {
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v3 3/6] migration: remove unnecessary variables len in QIOChannelRDMA

2018-05-08 Thread 858585 jemmy
On Tue, May 8, 2018 at 10:19 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> Because qio_channel_rdma_writev and qio_channel_rdma_readv maybe invoked
>> by different threads concurrently, this patch removes unnecessary variables
>> len in QIOChannelRDMA and use local variable instead.
>>
>> Signed-off-by: Lidong Chen 
>> Reviewed-by: Dr. David Alan Gilbert 
>> Reviewed-by: Daniel P. Berrangéberra...@redhat.com>
>
> Note there's a ' <' missing somehow; minor fix up during commit
> hopefully.
>
> Dave

Sorry for this mistake, I will check more carefully.

>
>> ---
>>  migration/rdma.c | 15 +++
>>  1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index c745427..f5c1d02 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -404,7 +404,6 @@ struct QIOChannelRDMA {
>>  QIOChannel parent;
>>  RDMAContext *rdma;
>>  QEMUFile *file;
>> -size_t len;
>>  bool blocking; /* XXX we don't actually honour this yet */
>>  };
>>
>> @@ -2640,6 +2639,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
>>  int ret;
>>  ssize_t done = 0;
>>  size_t i;
>> +size_t len = 0;
>>
>>  CHECK_ERROR_STATE();
>>
>> @@ -2659,10 +2659,10 @@ static ssize_t qio_channel_rdma_writev(QIOChannel 
>> *ioc,
>>  while (remaining) {
>>  RDMAControlHeader head;
>>
>> -rioc->len = MIN(remaining, RDMA_SEND_INCREMENT);
>> -remaining -= rioc->len;
>> +len = MIN(remaining, RDMA_SEND_INCREMENT);
>> +remaining -= len;
>>
>> -head.len = rioc->len;
>> +head.len = len;
>>  head.type = RDMA_CONTROL_QEMU_FILE;
>>
>>  ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL, 
>> NULL);
>> @@ -2672,8 +2672,8 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
>>  return ret;
>>  }
>>
>> -data += rioc->len;
>> -done += rioc->len;
>> +data += len;
>> +done += len;
>>  }
>>  }
>>
>> @@ -2768,8 +2768,7 @@ static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
>>  }
>>  }
>>  }
>> -rioc->len = done;
>> -return rioc->len;
>> +return done;
>>  }
>>
>>  /*
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH v2 4/5] migration: implement bi-directional RDMA QIOChannel

2018-04-27 Thread 858585 jemmy
On Fri, Apr 27, 2018 at 5:16 PM, Daniel P. Berrangé  wrote:
> On Fri, Apr 27, 2018 at 03:56:38PM +0800, 858585 jemmy wrote:
>> On Fri, Apr 27, 2018 at 1:36 AM, Dr. David Alan Gilbert
>>  wrote:
>> > * Lidong Chen (jemmy858...@gmail.com) wrote:
>> >> This patch implements bi-directional RDMA QIOChannel. Because different
>> >> threads may access RDMAQIOChannel concurrently, this patch use RCU to 
>> >> protect it.
>> >>
>> >> Signed-off-by: Lidong Chen 
>> >
>> > I'm a bit confused by this.
>> >
>> > I can see it's adding RCU to protect the rdma structures against
>> > deletion from multiple threads; that I'm OK with in principal; is that
>> > the only locking we need? (I guess the two directions are actually
>> > separate RDMAContext's so maybe).
>>
>> The qio_channel_rdma_close maybe invoked by migration thread and
>> return path thread
>> concurrently, so I use a mutex to protect it.
>
> Hmm, that is not good - concurrent threads calling close must not be
> allowed to happen even with non-RDMA I/O chanels.
>
> For example, with the QIOChannelSocket, one thread can call close
> which sets the fd = -1, another thread can race with this and either
> end up calling close again on the same FD or calling close on -1.
> Either way the second thread will get an error from close() when
> it should have skipped the close() and returned success. Perhaps
> migration gets lucky and this doesn't result in it being marked
> as failed, but it is still not good.
>
> So only one thread should be calling close().

For live migration, the source qemu invokes qemu_fclose from different
threads, including the main thread, the migration thread and the return
path thread.
The destination qemu invokes qemu_fclose from the main thread, the listen
thread and the COLO incoming thread.

So I prefer to add a lock to the QEMUFile struct, like this:

int qemu_fclose(QEMUFile *f)
{
    int ret;
    qemu_fflush(f);
    ret = qemu_file_get_error(f);

    if (f->ops->close) {
+       qemu_mutex_lock(&f->lock);
        int ret2 = f->ops->close(f->opaque);
+       qemu_mutex_unlock(&f->lock);
        if (ret >= 0) {
            ret = ret2;
        }
    }
    /* If any error was spotted before closing, we should report it
     * instead of the close() return value.
     */
    if (f->last_error) {
        ret = f->last_error;
    }
    g_free(f);
    trace_qemu_file_fclose();
    return ret;
}
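The lock would also need initialising where the QEMUFile is allocated, e.g. in
qemu_fopen_ops() (sketch):

    f = g_new0(QEMUFile, 1);
    qemu_mutex_init(&f->lock);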

Any suggestion?

>
>> If one thread invoke qio_channel_rdma_writev, another thread invokes
>> qio_channel_rdma_readv,
>> two threads will use separate RDMAContext, so it does not need a lock.
>>
>> If two threads invoke qio_channel_rdma_writev concurrently, it will
>> need a lock to protect.
>> but I find source qemu migration thread only invoke
>> qio_channel_rdma_writev, the return path
>> thread only invokes qio_channel_rdma_readv.
>
> QIOChannel impls are only intended to cope with a single thread doing
> I/O in each direction. If you have two threads needing to read, or
> two threads needing to write, the layer above should provide locking
> to ensure correct ordering  of I/O oprations.

Yes, so I think RCU is enough; we do not need more locks.

>
>> The destination qemu only invoked qio_channel_rdma_readv by main
>> thread before postcopy and or
>> listen thread after postcopy.
>>
>> The destination qemu have already protected it by using
>> qemu_mutex_lock(&mis->rp_mutex) when writing data to
>> source qemu.
>>
>> But should we use qemu_mutex_lock to protect qio_channel_rdma_writev
>> and qio_channel_rdma_readv?
>> to avoid some change in future invoke qio_channel_rdma_writev or
>> qio_channel_rdma_readv concurrently?
>
>>
>> >
>> > But is there nothing else to make the QIOChannel bidirectional?
>> >
>> > Also, a lot seems dependent on listen_id, can you explain how that's
>> > being used.
>>
>> The destination qemu is server side, so listen_id is not zero. the
>> source qemu is client side,
>> the listen_id is zero.
>> I use listen_id to determine whether qemu is destination or source.
>>
>> for the destination qemu, if write data to source, it need use the
>> return_path rdma, like this:
>> if (rdma->listen_id) {
>> rdma = rdma->return_path;
>> }
>>
>> for the source qemu, if read data from destination, it also need use
>> the return_path rdma.
>> if (!rdma->listen_id) {
>> rdma = rdma->return_path;
>> }
>
> This feels uncessarily complex to me. W

Re: [Qemu-devel] [PATCH v2 4/5] migration: implement bi-directional RDMA QIOChannel

2018-04-27 Thread 858585 jemmy
On Fri, Apr 27, 2018 at 1:36 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> This patch implements bi-directional RDMA QIOChannel. Because different
>> threads may access RDMAQIOChannel concurrently, this patch use RCU to 
>> protect it.
>>
>> Signed-off-by: Lidong Chen 
>
> I'm a bit confused by this.
>
> I can see it's adding RCU to protect the rdma structures against
> deletion from multiple threads; that I'm OK with in principal; is that
> the only locking we need? (I guess the two directions are actually
> separate RDMAContext's so maybe).

qio_channel_rdma_close may be invoked by the migration thread and the
return path thread concurrently, so I use a mutex to protect it.

If one thread invokes qio_channel_rdma_writev and another thread invokes
qio_channel_rdma_readv, the two threads use separate RDMAContexts, so no
lock is needed.

If two threads invoked qio_channel_rdma_writev concurrently, a lock would be
needed; but I find that in the source qemu the migration thread only invokes
qio_channel_rdma_writev, and the return path thread only invokes
qio_channel_rdma_readv.

The destination qemu only invokes qio_channel_rdma_readv from the main
thread before postcopy, or from the listen thread after postcopy.

The destination qemu already protects it by taking
qemu_mutex_lock(&mis->rp_mutex) when writing data to the source qemu.
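Roughly, the destination already serialises its return-path writes like this
(simplified sketch of migrate_send_rp_message() in migration/migration.c):

    qemu_mutex_lock(&mis->rp_mutex);
    qemu_put_be16(mis->to_src_file, (unsigned int)message_type);
    qemu_put_be16(mis->to_src_file, len);
    qemu_put_buffer(mis->to_src_file, data, len);
    qemu_fflush(mis->to_src_file);
    qemu_mutex_unlock(&mis->rp_mutex);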

But should we use qemu_mutex_lock to protect qio_channel_rdma_writev
and qio_channel_rdma_readv anyway, to guard against some future change
invoking them concurrently?

>
> But is there nothing else to make the QIOChannel bidirectional?
>
> Also, a lot seems dependent on listen_id, can you explain how that's
> being used.

The destination qemu is the server side, so its listen_id is not zero; the
source qemu is the client side, so its listen_id is zero.
I use listen_id to determine whether qemu is the destination or the source.

For the destination qemu, writing data to the source needs to use the
return_path rdma, like this:
if (rdma->listen_id) {
    rdma = rdma->return_path;
}

For the source qemu, reading data from the destination also needs to use the
return_path rdma:
if (!rdma->listen_id) {
    rdma = rdma->return_path;
}
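
(Just to illustrate the idea, the two checks could be folded into one helper;
the name below is made up and this is only a sketch, not what the patch
currently does:)

static RDMAContext *rdma_pick_context(RDMAContext *rdma, bool writing)
{
    /* destination (listen_id != 0) writes via the return path;
     * source (listen_id == 0) reads via the return path */
    if (writing == (rdma->listen_id != 0)) {
        return rdma->return_path;   /* may be NULL, caller must check */
    }
    return rdma;
}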

>
> Finally, I don't think you have anywhere that destroys the new mutex you
> added.
I will fix this in the next version.
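Roughly, something like this in the channel finalize path (sketch only; the
exact function name in rdma.c may differ):

static void qio_channel_rdma_finalize(Object *obj)
{
    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(obj);

    /* pairs with the qemu_mutex_init() done when the channel is created */
    qemu_mutex_destroy(&rioc->lock);
}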

>
> Dave
> P.S. Please cc Daniel Berrange on this series, since it's so much
> IOChannel stuff.
>
>> ---
>>  migration/rdma.c | 162 
>> +--
>>  1 file changed, 146 insertions(+), 16 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index f5c1d02..0652224 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -86,6 +86,7 @@ static uint32_t known_capabilities = 
>> RDMA_CAPABILITY_PIN_ALL;
>>  " to abort!"); \
>>  rdma->error_reported = 1; \
>>  } \
>> +rcu_read_unlock(); \
>>  return rdma->error_state; \
>>  } \
>>  } while (0)
>> @@ -405,6 +406,7 @@ struct QIOChannelRDMA {
>>  RDMAContext *rdma;
>>  QEMUFile *file;
>>  bool blocking; /* XXX we don't actually honour this yet */
>> +QemuMutex lock;
>>  };
>>
>>  /*
>> @@ -2635,12 +2637,29 @@ static ssize_t qio_channel_rdma_writev(QIOChannel 
>> *ioc,
>>  {
>>  QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>  QEMUFile *f = rioc->file;
>> -RDMAContext *rdma = rioc->rdma;
>> +RDMAContext *rdma;
>>  int ret;
>>  ssize_t done = 0;
>>  size_t i;
>>  size_t len = 0;
>>
>> +rcu_read_lock();
>> +rdma = atomic_rcu_read(&rioc->rdma);
>> +
>> +if (!rdma) {
>> +rcu_read_unlock();
>> +return -EIO;
>> +}
>> +
>> +if (rdma->listen_id) {
>> +rdma = rdma->return_path;
>> +}
>> +
>> +if (!rdma) {
>> +rcu_read_unlock();
>> +return -EIO;
>> +}
>> +
>>  CHECK_ERROR_STATE();
>>
>>  /*
>> @@ -2650,6 +2669,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
>>  ret = qemu_rdma_write_flush(f, rdma);
>>  if (ret < 0) {
>>  rdma->error_state = ret;
>> +rcu_read_unlock();
>>  return ret;
>>  }
>>
>> @@ -2669,6 +2689,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
>>
>>  if (ret < 0) {
>>  rdma->error_state = ret;
>> +rcu_read_unlock();
>>  return ret;
>>  }
>>
>> @@ -2677,6 +2698,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
>>  }
>>  }
>>
>> +rcu_read_unlock();
>>  return done;
>>  }
>>
>> @@ -2710,12 +2732,29 @@ static ssize_t qio_channel_rdma_readv(QIOChannel 
>> *ioc,
>>Error **errp)
>>  {
>>  QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>> -RDMAContext *rdma = rioc->rdma;
>> +RDMAContext *rdma;
>>  RDMAControlHeader head;
>>  int ret = 0;
>>  ssize_t i

Re: [Qemu-devel] [PATCH v2 3/5] migration: remove unnecessary variables len in QIOChannelRDMA

2018-04-26 Thread 858585 jemmy
On Fri, Apr 27, 2018 at 12:40 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> Because qio_channel_rdma_writev and qio_channel_rdma_readv maybe invoked
>> by different threads concurrently, this patch removes unnecessary variables
>> len in QIOChannelRDMA and use local variable instead.
>>
>> Signed-off-by: Lidong Chen 
>
> I'm OK with this patch as is; but now you're making me worried that I
> don't quite understand what thrads are accessing it at the same time; we
> need to document/comment what's accessed concurrently..

For the source qemu, the migration thread invokes qio_channel_rdma_writev and
the return path thread invokes qio_channel_rdma_readv.

For the destination qemu, before postcopy the main thread invokes both
qio_channel_rdma_readv and qio_channel_rdma_writev; after postcopy the listen
thread invokes qio_channel_rdma_readv and the ram fault thread invokes
qio_channel_rdma_writev.
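
Maybe a comment like this next to the QIOChannelRDMA definition would
document it (wording is only a suggestion):

/*
 * Threads doing I/O on this channel, one per direction:
 *   source:      migration thread    -> qio_channel_rdma_writev
 *                return path thread  -> qio_channel_rdma_readv
 *   destination: main thread         -> readv/writev (before postcopy)
 *                listen thread       -> qio_channel_rdma_readv (after postcopy)
 *                ram fault thread    -> qio_channel_rdma_writev (after postcopy)
 */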

>
> Reviewed-by: Dr. David Alan Gilbert 
>
>> ---
>>  migration/rdma.c | 15 +++
>>  1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index c745427..f5c1d02 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -404,7 +404,6 @@ struct QIOChannelRDMA {
>>  QIOChannel parent;
>>  RDMAContext *rdma;
>>  QEMUFile *file;
>> -size_t len;
>>  bool blocking; /* XXX we don't actually honour this yet */
>>  };
>>
>> @@ -2640,6 +2639,7 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
>>  int ret;
>>  ssize_t done = 0;
>>  size_t i;
>> +size_t len = 0;
>>
>>  CHECK_ERROR_STATE();
>>
>> @@ -2659,10 +2659,10 @@ static ssize_t qio_channel_rdma_writev(QIOChannel 
>> *ioc,
>>  while (remaining) {
>>  RDMAControlHeader head;
>>
>> -rioc->len = MIN(remaining, RDMA_SEND_INCREMENT);
>> -remaining -= rioc->len;
>> +len = MIN(remaining, RDMA_SEND_INCREMENT);
>> +remaining -= len;
>>
>> -head.len = rioc->len;
>> +head.len = len;
>>  head.type = RDMA_CONTROL_QEMU_FILE;
>>
>>  ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL, 
>> NULL);
>> @@ -2672,8 +2672,8 @@ static ssize_t qio_channel_rdma_writev(QIOChannel *ioc,
>>  return ret;
>>  }
>>
>> -data += rioc->len;
>> -done += rioc->len;
>> +data += len;
>> +done += len;
>>  }
>>  }
>>
>> @@ -2768,8 +2768,7 @@ static ssize_t qio_channel_rdma_readv(QIOChannel *ioc,
>>  }
>>  }
>>  }
>> -rioc->len = done;
>> -return rioc->len;
>> +return done;
>>  }
>>
>>  /*
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 2/5] migration: add the interface to set get_return_path

2018-04-12 Thread 858585 jemmy
On Thu, Apr 12, 2018 at 4:28 PM, Daniel P. Berrangé  wrote:
> On Wed, Apr 11, 2018 at 06:18:18PM +0100, Dr. David Alan Gilbert wrote:
>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> > The default get_return_path function of iochannel does not work for
>> > RDMA live migration. So add the interface to set get_return_path.
>> >
>> > Signed-off-by: Lidong Chen 
>>
>> Lets see how Dan wants this done, he knows the channel/file stuff;
>> to me this feels like it should be adding a member to QIOChannelClass
>> that gets used by QEMUFile's get_return_path.
>
> No that doesn't really fit the model. IMHO the entire concept of a separate
> return path object is really wrong. The QIOChannel implementations are
> (almost) all capable of bi-directional I/O, which is why the the 
> get_retun_path
> function just creates a second QEMUFile pointing to the same QIOChannel
> object we already had. Migration only needs the second QEMUFile, because that
> struct re-uses the same struct fields for tracking different bits of info
> depending on which direction you're doing I/O in. A real fix would be to
> stop overloading the same fields for multiple purposes in the QEMUFile, so
> that we only needed a single QEMUFile instance.
>
> Ignoring that though, the particular problem we're facing here is that the
> QIOChannelRDMA impl that is used is not written in a way that allows
> bi-directional I/O, despite the RDMA code it uses being capable of it.
>
> So rather than changing this get_return_path code, IMHO, the right fix to
> simply improve the QIOChannelRDMA impl so that it fully supports 
> bi-directional
> I/O like all the other channels do.

Hi Daniel:
Thanks for your suggestion. I will give it a try.

>
> Regards,
> Daniel
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-devel] [PATCH 4/5] migration: fix qemu carsh when RDMA live migration

2018-04-12 Thread 858585 jemmy
On Thu, Apr 12, 2018 at 12:43 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> After postcopy, the destination qemu work in the dedicated
>> thread, so only invoke yield_until_fd_readable before postcopy
>> migration.
>
> The subject line needs to be more discriptive:
>migration: Stop rdma yielding during incoming postcopy
>
> I think.
> (Also please check the subject spellings)
>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/rdma.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 53773c7..81be482 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1489,11 +1489,13 @@ static int qemu_rdma_wait_comp_channel(RDMAContext 
>> *rdma)
>>   * Coroutine doesn't start until migration_fd_process_incoming()
>>   * so don't yield unless we know we're running inside of a coroutine.
>>   */
>> -if (rdma->migration_started_on_destination) {
>> +if (rdma->migration_started_on_destination &&
>> +migration_incoming_get_current()->state == MIGRATION_STATUS_ACTIVE) 
>> {
>
> OK, that's a bit delicate; watch if it ever gets called in a failure
> case or similar - and also wathc out if we make more use of the status
> on the destination, but otherwise, and with a fix for the subject;

How about using migration_incoming_get_current()->have_listen_thread?

if (rdma->migration_started_on_destination &&
    migration_incoming_get_current()->have_listen_thread == false) {
    yield_until_fd_readable(rdma->comp_channel->fd);
}

>
>
> Reviewed-by: Dr. David Alan Gilbert 
>
>>  yield_until_fd_readable(rdma->comp_channel->fd);
>>  } else {
>>  /* This is the source side, we're in a separate thread
>>   * or destination prior to migration_fd_process_incoming()
>> + * after postcopy, the destination also in a seprate thread.
>>   * we can't yield; so we have to poll the fd.
>>   * But we need to be able to handle 'cancel' or an error
>>   * without hanging forever.
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 5/5] migration: disable RDMA WRITR after postcopy started.

2018-04-11 Thread 858585 jemmy
On Wed, Apr 11, 2018 at 11:56 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> RDMA write operations are performed with no notification to the destination
>> qemu, then the destination qemu can not wakeup. So disable RDMA WRITE after
>> postcopy started.
>>
>> Signed-off-by: Lidong Chen 
>
> This patch needs to be near the beginning of the series; at the moment a
> bisect would lead you to the middle of the series which had return
> paths, but then would fail to work properly because it would try and use
> the RDMA code.

I will fix this problem in the next version.

>
>> ---
>>  migration/qemu-file.c |  3 ++-
>>  migration/rdma.c  | 12 
>>  2 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index 8acb574..a64ac3a 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -260,7 +260,8 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t 
>> block_offset,
>>  int ret = f->hooks->save_page(f, f->opaque, block_offset,
>>offset, size, bytes_sent);
>>  f->bytes_xfer += size;
>> -if (ret != RAM_SAVE_CONTROL_DELAYED) {
>> +if (ret != RAM_SAVE_CONTROL_DELAYED &&
>> +ret != RAM_SAVE_CONTROL_NOT_SUPP) {
>
> What about f->bytes_xfer in this case?

f->bytes_xfer should not be updated when the hook returns
RAM_SAVE_CONTROL_NOT_SUPP. I will fix this in the next version.
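In other words, only account for data the hook actually handled, something
like:

if (ret != RAM_SAVE_CONTROL_NOT_SUPP) {
    f->bytes_xfer += size;
}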

>
> Is there anything we have to do at the switchover into postcopy to make
> sure that all pages have been received?

ram_save_iterate invokes ram_control_after_iterate(f, RAM_CONTROL_ROUND), so
before the next iteration, which switches over into postcopy, all the pages
sent by the previous iteration have been received.
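
For reference, ram_control_after_iterate(f, RAM_CONTROL_ROUND) ends up in
qemu_rdma_registration_stop (see the hunk below), which already flushes the
stream and drains the completion queue, roughly:

qemu_fflush(f);
ret = qemu_rdma_drain_cq(f, rdma);   /* waits for all outstanding RDMA writes */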

>
> Dave
>
>>  if (bytes_sent && *bytes_sent > 0) {
>>  qemu_update_position(f, *bytes_sent);
>>  } else if (ret < 0) {
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 81be482..8529ddd 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2964,6 +2964,10 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void 
>> *opaque,
>>
>>  CHECK_ERROR_STATE();
>>
>> +if (migrate_get_current()->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>> +return RAM_SAVE_CONTROL_NOT_SUPP;
>> +}
>> +
>>  qemu_fflush(f);
>>
>>  if (size > 0) {
>> @@ -3528,6 +3532,10 @@ static int qemu_rdma_registration_start(QEMUFile *f, 
>> void *opaque,
>>
>>  CHECK_ERROR_STATE();
>>
>> +if (migrate_get_current()->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>> +return 0;
>> +}
>> +
>>  trace_qemu_rdma_registration_start(flags);
>>  qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
>>  qemu_fflush(f);
>> @@ -3550,6 +3558,10 @@ static int qemu_rdma_registration_stop(QEMUFile *f, 
>> void *opaque,
>>
>>  CHECK_ERROR_STATE();
>>
>> +if (migrate_get_current()->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
>> +return 0;
>> +}
>> +
>>  qemu_fflush(f);
>>  ret = qemu_rdma_drain_cq(f, rdma);
>>
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 0/5] Enable postcopy RDMA live migration

2018-04-11 Thread 858585 jemmy
On Wed, Apr 11, 2018 at 8:29 PM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> Current Qemu RDMA communication does not support send and receive
>> data at the same time, so when RDMA live migration with postcopy
>> enabled, the source qemu return path thread get qemu file error.
>>
>> Those patch add the postcopy support for RDMA live migration.
>
> This description is a little misleading; it doesn't really
> do RDMA during the postcopy phase - what it really does is disable
> the RDMA page sending during the postcopy phase, relying on the
> RDMA codes stream emulation to send the page.

Hi Dave:
I will modify the description in the next version of the patch.

>
> That's not necessarily a bad fix; you get the nice performance of RDMA
> during the precopy phase, but how bad are you finding the performance
> during the postcopy phase - the RDMA code we have was only really
> designed for sending small commands over the stream?

I have not finished the performance test. There are three choices for RDMA
migration during the postcopy phase.

1. RDMA SEND operation from the source qemu
2. RDMA Write with Immediate from the source qemu
3. RDMA READ from the destination qemu

In theory, RDMA READ from the destination qemu is the best way.
But I think it is better to make the choice based on the performance results.
I will send the performance results later.

Using another method during the postcopy phase would be a big change to the
code. This patch just makes postcopy work; I will send another patch to
improve the performance.

Thanks.

>
> Dave
>
>> Lidong Chen (5):
>>   migration: create a dedicated connection for rdma return path
>>   migration: add the interface to set get_return_path
>>   migration: implement the get_return_path for RDMA iochannel
>>   migration: fix qemu carsh when RDMA live migration
>>   migration: disable RDMA WRITR after postcopy started.
>>
>>  migration/qemu-file-channel.c |  12 ++--
>>  migration/qemu-file.c |  13 +++-
>>  migration/qemu-file.h |   2 +-
>>  migration/rdma.c  | 148 
>> --
>>  4 files changed, 163 insertions(+), 12 deletions(-)
>>
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 0/5] Enable postcopy RDMA live migration

2018-04-08 Thread 858585 jemmy
ping.

On Sat, Apr 7, 2018 at 4:26 PM, Lidong Chen  wrote:
> Current Qemu RDMA communication does not support send and receive
> data at the same time, so when RDMA live migration with postcopy
> enabled, the source qemu return path thread get qemu file error.
>
> Those patch add the postcopy support for RDMA live migration.
>
> Lidong Chen (5):
>   migration: create a dedicated connection for rdma return path
>   migration: add the interface to set get_return_path
>   migration: implement the get_return_path for RDMA iochannel
>   migration: fix qemu carsh when RDMA live migration
>   migration: disable RDMA WRITR after postcopy started.
>
>  migration/qemu-file-channel.c |  12 ++--
>  migration/qemu-file.c |  13 +++-
>  migration/qemu-file.h |   2 +-
>  migration/rdma.c  | 148 
> --
>  4 files changed, 163 insertions(+), 12 deletions(-)
>
> --
> 1.8.3.1
>



Re: [Qemu-devel] [PATCH] migration: Fix rate limiting issue on RDMA migration

2018-03-22 Thread 858585 jemmy
On Wed, Mar 21, 2018 at 2:19 AM, Juan Quintela  wrote:
> Lidong Chen  wrote:
>> RDMA migration implement save_page function for QEMUFile, but
>> ram_control_save_page do not increase bytes_xfer. So when doing
>> RDMA migration, it will use whole bandwidth.
>>
>> Signed-off-by: Lidong Chen 
>
> Reviewed-by: Juan Quintela 
>
> This part of the code is a mess.
>
> To answer David:
> - pos: Where we need to write that bit of stuff
> - bytex_xfer: how much have we written
>
> WHen we are doing snapshots on qcow2, we store memory in a contiguous
> piece of memory, so we can "overwrite" that "page" if a new verion
> cames. Nothing else (except the block) uses te "pos" parameter, so we
> can't not trust on it.
>
> And that  has been for a fast look at the code, that I got really
> confused (again).

Hi Juan:
What is the problem?
Thanks.

>
>
>
>> ---
>>  migration/qemu-file.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index 2ab2bf3..217609d 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -253,7 +253,7 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t 
>> block_offset,
>>  if (f->hooks && f->hooks->save_page) {
>>  int ret = f->hooks->save_page(f, f->opaque, block_offset,
>>offset, size, bytes_sent);
>> -
>> +f->bytes_xfer += size;
>>  if (ret != RAM_SAVE_CONTROL_DELAYED) {
>>  if (bytes_sent && *bytes_sent > 0) {
>>  qemu_update_position(f, *bytes_sent);



Re: [Qemu-devel] [PATCH] migration: Fix rate limiting issue on RDMA migration

2018-03-19 Thread 858585 jemmy
ping.

On Thu, Mar 15, 2018 at 1:33 PM, 858585 jemmy  wrote:
> On Thu, Mar 15, 2018 at 4:19 AM, Dr. David Alan Gilbert
>  wrote:
>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>>> RDMA migration implement save_page function for QEMUFile, but
>>> ram_control_save_page do not increase bytes_xfer. So when doing
>>> RDMA migration, it will use whole bandwidth.
>>
>> Hi,
>>   Thanks for this,
>>
>>> Signed-off-by: Lidong Chen 
>>> ---
>>>  migration/qemu-file.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>>> index 2ab2bf3..217609d 100644
>>> --- a/migration/qemu-file.c
>>> +++ b/migration/qemu-file.c
>>> @@ -253,7 +253,7 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t 
>>> block_offset,
>>>  if (f->hooks && f->hooks->save_page) {
>>>  int ret = f->hooks->save_page(f, f->opaque, block_offset,
>>>offset, size, bytes_sent);
>>> -
>>> +f->bytes_xfer += size;
>>
>> I'm a bit confused, because I know rdma.c calls acct_update_position()
>> and I'd always thought that was enough.
>> That calls qemu_update_position(...) which increases f->pos but not
>> f->bytes_xfer.
>>
>> f_pos is used to calculate the 'transferred' value in
>> migration_update_counters and thus the current bandwidth and downtime -
>> but as you say, not the rate_limit.
>>
>> So really, should this f->bytes_xfer += size   go in
>> qemu_update_position ?
>
> For tcp migration, bytes_xfer is updated before qemu_fflush(f) which
> actually send data.
> but qemu_update_position is invoked by qemu_rdma_write_one, which
> after call ibv_post_send.
> and qemu_rdma_save_page is asynchronous, it may merge the page.
> I think it's more safe to limiting rate before send data
>
>>
>> Juan: I'm not sure I know why we have both bytes_xfer and pos.
>
> Maybe the reasion is bytes_xfer is updated before send data,
> and bytes_xfer will be reset by migration_update_counters.
>
>>
>> Dave
>>
>>>  if (ret != RAM_SAVE_CONTROL_DELAYED) {
>>>  if (bytes_sent && *bytes_sent > 0) {
>>>  qemu_update_position(f, *bytes_sent);
>>> --
>>> 1.8.3.1
>>>
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH] migration: Fix rate limiting issue on RDMA migration

2018-03-14 Thread 858585 jemmy
On Thu, Mar 15, 2018 at 4:19 AM, Dr. David Alan Gilbert
 wrote:
> * Lidong Chen (jemmy858...@gmail.com) wrote:
>> RDMA migration implement save_page function for QEMUFile, but
>> ram_control_save_page do not increase bytes_xfer. So when doing
>> RDMA migration, it will use whole bandwidth.
>
> Hi,
>   Thanks for this,
>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/qemu-file.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
>> index 2ab2bf3..217609d 100644
>> --- a/migration/qemu-file.c
>> +++ b/migration/qemu-file.c
>> @@ -253,7 +253,7 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t 
>> block_offset,
>>  if (f->hooks && f->hooks->save_page) {
>>  int ret = f->hooks->save_page(f, f->opaque, block_offset,
>>offset, size, bytes_sent);
>> -
>> +f->bytes_xfer += size;
>
> I'm a bit confused, because I know rdma.c calls acct_update_position()
> and I'd always thought that was enough.
> That calls qemu_update_position(...) which increases f->pos but not
> f->bytes_xfer.
>
> f_pos is used to calculate the 'transferred' value in
> migration_update_counters and thus the current bandwidth and downtime -
> but as you say, not the rate_limit.
>
> So really, should this f->bytes_xfer += size   go in
> qemu_update_position ?

For TCP migration, bytes_xfer is updated before qemu_fflush(f), which
actually sends the data. But qemu_update_position is invoked by
qemu_rdma_write_one, which runs after ibv_post_send, and qemu_rdma_save_page
is asynchronous and may merge pages. I think it is safer to apply the rate
limit before sending the data.

>
> Juan: I'm not sure I know why we have both bytes_xfer and pos.

Maybe the reason is that bytes_xfer is updated before sending data, and
bytes_xfer is reset by migration_update_counters.
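
For reference, the limiter only looks at bytes_xfer; roughly like this
(simplified sketch of qemu_file_rate_limit from qemu-file.c, field names may
differ slightly):

int qemu_file_rate_limit(QEMUFile *f)
{
    if (qemu_file_get_error(f)) {
        return 1;
    }
    if (f->xfer_limit > 0 && f->bytes_xfer > f->xfer_limit) {
        return 1;
    }
    return 0;
}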

>
> Dave
>
>>  if (ret != RAM_SAVE_CONTROL_DELAYED) {
>>  if (bytes_sent && *bytes_sent > 0) {
>>  qemu_update_position(f, *bytes_sent);
>> --
>> 1.8.3.1
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH] migration: Fix rate limiting issue on RDMA migration

2018-03-13 Thread 858585 jemmy
Ping.

On Sat, Mar 10, 2018 at 10:32 PM, Lidong Chen  wrote:
> RDMA migration implement save_page function for QEMUFile, but
> ram_control_save_page do not increase bytes_xfer. So when doing
> RDMA migration, it will use whole bandwidth.
>
> Signed-off-by: Lidong Chen 
> ---
>  migration/qemu-file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index 2ab2bf3..217609d 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -253,7 +253,7 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t 
> block_offset,
>  if (f->hooks && f->hooks->save_page) {
>  int ret = f->hooks->save_page(f, f->opaque, block_offset,
>offset, size, bytes_sent);
> -
> +f->bytes_xfer += size;
>  if (ret != RAM_SAVE_CONTROL_DELAYED) {
>  if (bytes_sent && *bytes_sent > 0) {
>  qemu_update_position(f, *bytes_sent);
> --
> 1.8.3.1
>



Re: [Qemu-devel] [Qemu-block] [PATCH v4] migration/block: move bdrv_is_allocated() into bb's AioContext

2017-05-09 Thread 858585 jemmy
On Tue, May 9, 2017 at 4:54 AM, Stefan Hajnoczi  wrote:
> On Fri, May 05, 2017 at 04:03:49PM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> when block migration with high-speed, mig_save_device_bulk hold the
>> BQL and invoke bdrv_is_allocated frequently. This patch moves
>> bdrv_is_allocated() into bb's AioContext. It will execute without
>> blocking other I/O activity.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  v4 changelog:
>>   Use the prototype code written by Stefan and fix some bug.
>>   moves bdrv_is_allocated() into bb's AioContext.
>> ---
>>  migration/block.c | 48 +++-
>>  1 file changed, 39 insertions(+), 9 deletions(-)
>
> Added Paolo because he's been reworking AioContext and locking.
>
> The goal of this patch is to avoid waiting for bdrv_is_allocated() to
> complete while holding locks.  Do bdrv_is_allocated() in the AioContext
> so event processing continues after yield.

Hi Paolo:
Some information about the problem.
https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg01423.html

Why bdrv_inc_in_flight() is needed:
blk_set_aio_context may be invoked by a vcpu thread, via a call chain like:
  blk_set_aio_context
    virtio_blk_data_plane_stop
      virtio_pci_stop_ioeventfd
        virtio_pci_common_write
so bs->aio_context may change before mig_next_allocated_cluster() runs.

To verify the patch, I run this command in the guest OS:
  while [ 1 ]; do rmmod virtio_blk; modprobe virtio_blk; done

Thanks.

>
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 060087f..c871361 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -263,6 +263,30 @@ static void blk_mig_read_cb(void *opaque, int ret)
>>  blk_mig_unlock();
>>  }
>>
>> +typedef struct {
>> +int64_t *total_sectors;
>> +int64_t *cur_sector;
>> +BlockBackend *bb;
>> +QemuEvent event;
>> +} MigNextAllocatedClusterData;
>> +
>> +static void coroutine_fn mig_next_allocated_cluster(void *opaque)
>> +{
>> +MigNextAllocatedClusterData *data = opaque;
>> +int nr_sectors;
>> +
>> +/* Skip unallocated sectors; intentionally treats failure as
>> + * an allocated sector */
>> +while (*data->cur_sector < *data->total_sectors &&
>> +   !bdrv_is_allocated(blk_bs(data->bb), *data->cur_sector,
>> +  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>> +*data->cur_sector += nr_sectors;
>> +}
>> +
>> +bdrv_dec_in_flight(blk_bs(data->bb));
>> +qemu_event_set(&data->event);
>> +}
>> +
>>  /* Called with no lock taken.  */
>>
>>  static int mig_save_device_bulk(QEMUFile *f, BlkMigDevState *bmds)
>> @@ -274,17 +298,23 @@ static int mig_save_device_bulk(QEMUFile *f, 
>> BlkMigDevState *bmds)
>>  int nr_sectors;
>>
>>  if (bmds->shared_base) {
>> +AioContext *bb_ctx;
>> +Coroutine *co;
>> +MigNextAllocatedClusterData data = {
>> +.cur_sector = &cur_sector,
>> +.total_sectors = &total_sectors,
>> +.bb = bb,
>> +};
>> +qemu_event_init(&data.event, false);
>> +
>>  qemu_mutex_lock_iothread();
>> -aio_context_acquire(blk_get_aio_context(bb));
>> -/* Skip unallocated sectors; intentionally treats failure as
>> - * an allocated sector */
>> -while (cur_sector < total_sectors &&
>> -   !bdrv_is_allocated(blk_bs(bb), cur_sector,
>> -  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>> -cur_sector += nr_sectors;
>> -}
>> -aio_context_release(blk_get_aio_context(bb));
>> +bdrv_inc_in_flight(blk_bs(bb));
>
> Please add a comment explaining why bdrv_inc_in_flight() is invoked.
>
>> +bb_ctx = blk_get_aio_context(bb);
>> +co = qemu_coroutine_create(mig_next_allocated_cluster, &data);
>> +aio_co_schedule(bb_ctx, co);
>>  qemu_mutex_unlock_iothread();
>> +
>> +qemu_event_wait(&data.event);
>>  }
>>
>>  if (cur_sector >= total_sectors) {
>> --
>> 1.8.3.1
>>
>>



Re: [Qemu-devel] [PATCH] migration/block: optimize the performance by coalescing the same write type

2017-05-07 Thread 858585 jemmy
Hi Stefan&Fam:
Could you help me review this patch?
Thanks a lot.

On Mon, Apr 24, 2017 at 10:03 PM, 858585 jemmy  wrote:
> the reason of MIN_CLUSTER_SIZE is 8192 is base on the performance
> test result. the performance is only reduce obviously when cluster_size is
> less than 8192.
>
> I write this code, run in guest os. to create the worst condition.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int main()
> {
> char *zero;
> char *nonzero;
> FILE* fp = fopen("./test.dat", "ab");
>
> zero = malloc(sizeof(char)*512*8);
> nonzero = malloc(sizeof(char)*512*8);
>
> memset(zero, 0, sizeof(char)*512*8);
> memset(nonzero, 1, sizeof(char)*512*8);
>
> while (1) {
> fwrite(zero, sizeof(char)*512*8, 1, fp);
> fwrite(nonzero, sizeof(char)*512*8, 1, fp);
> }
> fclose(fp);
> }
>
>
> On Mon, Apr 24, 2017 at 9:55 PM,   wrote:
>> From: Lidong Chen 
>>
>> This patch optimizes the performance by coalescing the same write type.
>> When the zero/non-zero state changes, perform the write for the accumulated
>> cluster count.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>> Thanks Fam Zheng and Stefan's advice.
>> ---
>>  migration/block.c | 66 
>> +--
>>  1 file changed, 49 insertions(+), 17 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 060087f..e9c5e21 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -40,6 +40,8 @@
>>
>>  #define MAX_INFLIGHT_IO 512
>>
>> +#define MIN_CLUSTER_SIZE 8192
>> +
>>  //#define DEBUG_BLK_MIGRATION
>>
>>  #ifdef DEBUG_BLK_MIGRATION
>> @@ -923,10 +925,11 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  }
>>
>>  ret = bdrv_get_info(blk_bs(blk), &bdi);
>> -if (ret == 0 && bdi.cluster_size > 0 &&
>> -bdi.cluster_size <= BLOCK_SIZE &&
>> -BLOCK_SIZE % bdi.cluster_size == 0) {
>> +if (ret == 0 && bdi.cluster_size > 0) {
>>  cluster_size = bdi.cluster_size;
>> +while (cluster_size < MIN_CLUSTER_SIZE) {
>> +cluster_size *= 2;
>> +}
>>  } else {
>>  cluster_size = BLOCK_SIZE;
>>  }
>> @@ -943,29 +946,58 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  nr_sectors * BDRV_SECTOR_SIZE,
>>  BDRV_REQ_MAY_UNMAP);
>>  } else {
>> -int i;
>> -int64_t cur_addr;
>> -uint8_t *cur_buf;
>> +int64_t cur_addr = addr * BDRV_SECTOR_SIZE;
>> +uint8_t *cur_buf = NULL;
>> +int64_t last_addr = addr * BDRV_SECTOR_SIZE;
>> +uint8_t *last_buf = NULL;
>> +int64_t end_addr = addr * BDRV_SECTOR_SIZE + BLOCK_SIZE;
>>
>>  buf = g_malloc(BLOCK_SIZE);
>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> -for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>> -cur_addr = addr * BDRV_SECTOR_SIZE + i * cluster_size;
>> -cur_buf = buf + i * cluster_size;
>> -
>> -if ((!block_mig_state.zero_blocks ||
>> -cluster_size < BLOCK_SIZE) &&
>> -buffer_is_zero(cur_buf, cluster_size)) {
>> -ret = blk_pwrite_zeroes(blk, cur_addr,
>> -cluster_size,
>> +cur_buf = buf;
>> +last_buf = buf;
>> +
>> +while (last_addr < end_addr) {
>> +int is_zero = 0;
>> +int buf_size = MIN(end_addr - cur_addr, cluster_size);
>> +
>> +/* If the "zero blocks" migration capability is enabled
>> + * and the buf_size == BLOCK_SIZE, then the source QEMU
>> + * process has already scanned for zeroes. CPU is wasted
>> + * scanning for zeroes in the destination QEMU process.
>> + */
>> +if (block_mig_state.zero_blocks &&

Re: [Qemu-devel] [Qemu-block] [PATCH v3] migration/block:limit the time used for block migration

2017-05-03 Thread 858585 jemmy
On Wed, May 3, 2017 at 11:44 AM, 858585 jemmy  wrote:
> On Mon, Apr 10, 2017 at 9:52 PM, Stefan Hajnoczi  wrote:
>> On Sat, Apr 08, 2017 at 09:17:58PM +0800, 858585 jemmy wrote:
>>> On Fri, Apr 7, 2017 at 7:34 PM, Stefan Hajnoczi  wrote:
>>> > On Fri, Apr 07, 2017 at 09:30:33AM +0800, 858585 jemmy wrote:
>>> >> On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  
>>> >> wrote:
>>> >> > On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>>> >> >
>>> >> > A proper solution is to refactor the synchronous code to make it
>>> >> > asynchronous.  This might require invoking the system call from a
>>> >> > thread pool worker.
>>> >> >
>>> >>
>>> >> yes, i agree with you, but this is a big change.
>>> >> I will try to find how to optimize this code, maybe need a long time.
>>> >>
>>> >> this patch is not a perfect solution, but can alleviate the problem.
>>> >
>>> > Let's try to understand the problem fully first.
>>> >
>>>
>>> when migrate the vm with high speed, i find vnc response slowly sometime.
>>> not only vnc response slowly, virsh console aslo response slowly sometime.
>>> and the guest os block io performance is also reduce.
>>>
>>> the bug can be reproduce by this command:
>>> virsh migrate-setspeed 165cf436-312f-47e7-90f2-f8aa63f34893 900
>>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>>> --copy-storage-inc qemu+ssh://10.59.163.38/system
>>>
>>> and --copy-storage-all have no problem.
>>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>>> --copy-storage-all qemu+ssh://10.59.163.38/system
>>>
>>> compare the difference between --copy-storage-inc and
>>> --copy-storage-all. i find out the reason is
>>> mig_save_device_bulk invoke bdrv_is_allocated, but bdrv_is_allocated
>>> is synchronous and maybe wait
>>> for a long time.
>>>
>>> i write this code to measure the time used by  brdrv_is_allocated()
>>>
>>>  279 static int max_time = 0;
>>>  280 int tmp;
>>>
>>>  288 clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
>>>  289 ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
>>>  290 MAX_IS_ALLOCATED_SEARCH, 
>>> &nr_sectors);
>>>  291 clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
>>>  292
>>>  293
>>>  294 tmp =  (ts2.tv_sec - ts1.tv_sec)*10L
>>>  295+ (ts2.tv_nsec - ts1.tv_nsec);
>>>  296 if (tmp > max_time) {
>>>  297max_time=tmp;
>>>  298fprintf(stderr, "max_time is %d\n", max_time);
>>>  299 }
>>>
>>> the test result is below:
>>>
>>>  max_time is 37014
>>>  max_time is 1075534
>>>  max_time is 17180913
>>>  max_time is 28586762
>>>  max_time is 49563584
>>>  max_time is 103085447
>>>  max_time is 110836833
>>>  max_time is 120331438
>>>
>>> bdrv_is_allocated is called after qemu_mutex_lock_iothread.
>>> and the main thread is also call qemu_mutex_lock_iothread.
>>> so cause the main thread maybe wait for a long time.
>>>
>>>if (bmds->shared_base) {
>>> qemu_mutex_lock_iothread();
>>> aio_context_acquire(blk_get_aio_context(bb));
>>> /* Skip unallocated sectors; intentionally treats failure as
>>>  * an allocated sector */
>>> while (cur_sector < total_sectors &&
>>>!bdrv_is_allocated(blk_bs(bb), cur_sector,
>>>   MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>>> cur_sector += nr_sectors;
>>> }
>>> aio_context_release(blk_get_aio_context(bb));
>>> qemu_mutex_unlock_iothread();
>>> }
>>>
>>> #0  0x7f107322f264 in __lll_lock_wait () from /lib64/libpthread.so.0
>>> #1  0x7f107322a508 in _L_lock_854 () from /lib64/libpthread.so.0
>>> #2  0x7f107322a3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
>>> #3  0x00949ecb in qemu_mutex_lock (mutex=0xfc51a0) at
>>> util/qemu-thread-posix.c:60
>>> #4  0x00459e58 in qemu_mutex_lock_iothread () a

Re: [Qemu-devel] [Qemu-block] [PATCH v3] migration/block:limit the time used for block migration

2017-05-02 Thread 858585 jemmy
On Mon, Apr 10, 2017 at 9:52 PM, Stefan Hajnoczi  wrote:
> On Sat, Apr 08, 2017 at 09:17:58PM +0800, 858585 jemmy wrote:
>> On Fri, Apr 7, 2017 at 7:34 PM, Stefan Hajnoczi  wrote:
>> > On Fri, Apr 07, 2017 at 09:30:33AM +0800, 858585 jemmy wrote:
>> >> On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  
>> >> wrote:
>> >> > On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> >> >
>> >> > A proper solution is to refactor the synchronous code to make it
>> >> > asynchronous.  This might require invoking the system call from a
>> >> > thread pool worker.
>> >> >
>> >>
>> >> yes, i agree with you, but this is a big change.
>> >> I will try to find how to optimize this code, maybe need a long time.
>> >>
>> >> this patch is not a perfect solution, but can alleviate the problem.
>> >
>> > Let's try to understand the problem fully first.
>> >
>>
>> when migrate the vm with high speed, i find vnc response slowly sometime.
>> not only vnc response slowly, virsh console aslo response slowly sometime.
>> and the guest os block io performance is also reduce.
>>
>> the bug can be reproduce by this command:
>> virsh migrate-setspeed 165cf436-312f-47e7-90f2-f8aa63f34893 900
>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>> --copy-storage-inc qemu+ssh://10.59.163.38/system
>>
>> and --copy-storage-all have no problem.
>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>> --copy-storage-all qemu+ssh://10.59.163.38/system
>>
>> compare the difference between --copy-storage-inc and
>> --copy-storage-all. i find out the reason is
>> mig_save_device_bulk invoke bdrv_is_allocated, but bdrv_is_allocated
>> is synchronous and maybe wait
>> for a long time.
>>
>> i write this code to measure the time used by  brdrv_is_allocated()
>>
>>  279 static int max_time = 0;
>>  280 int tmp;
>>
>>  288 clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
>>  289 ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
>>  290 MAX_IS_ALLOCATED_SEARCH, 
>> &nr_sectors);
>>  291 clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
>>  292
>>  293
>>  294 tmp =  (ts2.tv_sec - ts1.tv_sec)*10L
>>  295+ (ts2.tv_nsec - ts1.tv_nsec);
>>  296 if (tmp > max_time) {
>>  297max_time=tmp;
>>  298fprintf(stderr, "max_time is %d\n", max_time);
>>  299 }
>>
>> the test result is below:
>>
>>  max_time is 37014
>>  max_time is 1075534
>>  max_time is 17180913
>>  max_time is 28586762
>>  max_time is 49563584
>>  max_time is 103085447
>>  max_time is 110836833
>>  max_time is 120331438
>>
>> bdrv_is_allocated is called after qemu_mutex_lock_iothread.
>> and the main thread is also call qemu_mutex_lock_iothread.
>> so cause the main thread maybe wait for a long time.
>>
>>if (bmds->shared_base) {
>> qemu_mutex_lock_iothread();
>> aio_context_acquire(blk_get_aio_context(bb));
>> /* Skip unallocated sectors; intentionally treats failure as
>>  * an allocated sector */
>> while (cur_sector < total_sectors &&
>>!bdrv_is_allocated(blk_bs(bb), cur_sector,
>>   MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>> cur_sector += nr_sectors;
>> }
>> aio_context_release(blk_get_aio_context(bb));
>> qemu_mutex_unlock_iothread();
>> }
>>
>> #0  0x7f107322f264 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x7f107322a508 in _L_lock_854 () from /lib64/libpthread.so.0
>> #2  0x7f107322a3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
>> #3  0x00949ecb in qemu_mutex_lock (mutex=0xfc51a0) at
>> util/qemu-thread-posix.c:60
>> #4  0x00459e58 in qemu_mutex_lock_iothread () at 
>> /root/qemu/cpus.c:1516
>> #5  0x00945322 in os_host_main_loop_wait (timeout=28911939) at
>> util/main-loop.c:258
>> #6  0x009453f2 in main_loop_wait (nonblocking=0) at 
>> util/main-loop.c:517
>> #7  0x005c76b4 in main_loop () at vl.c:1898
>> #8  0x005ceb77 in main (argc=49, argv=0x7fff921182b8,
>> envp=0x7fff92118448) at vl.c:4709
>
> The following patch moves b

Re: [Qemu-devel] [PATCH v2] qemu-img: use blk_co_pwrite_zeroes for zero sectors when compressed

2017-04-26 Thread 858585 jemmy
On Thu, Apr 27, 2017 at 4:25 AM, Max Reitz  wrote:
> On 21.04.2017 11:57, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> when the buffer is zero, blk_co_pwrite_zeroes is more effectively than
>
> s/when/When/, s/effectively/effective/
>
>> blk_co_pwritev with BDRV_REQ_WRITE_COMPRESSED. this patch can reduces
>
> s/this/This/, s/reduces/reduce/
>
>> the time when converts the qcow2 image with lots of zero.
>
> s/when converts the qcow2 image/for converting qcow2 images/,
> s/zero/zero data/
>
>>
>> Signed-off-by: Lidong Chen 
>> ---
>> v2 changelog:
>> unify the compressed and non-compressed code paths
>> ---
>>  qemu-img.c | 41 +++--
>>  1 file changed, 11 insertions(+), 30 deletions(-)
>
> Functionally, looks good to me. Just some stylistic nit picks:
>
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index b220cf7..60c9adf 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -1661,6 +1661,8 @@ static int coroutine_fn 
>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>
>>  while (nb_sectors > 0) {
>>  int n = nb_sectors;
>> +BdrvRequestFlags flags = s->compressed ? BDRV_REQ_WRITE_COMPRESSED 
>> : 0;
>> +
>>  switch (status) {
>>  case BLK_BACKING_FILE:
>>  /* If we have a backing file, leave clusters unallocated that 
>> are
>> @@ -1670,43 +1672,21 @@ static int coroutine_fn 
>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>  break;
>>
>>  case BLK_DATA:
>> -/* We must always write compressed clusters as a whole, so don't
>> - * try to find zeroed parts in the buffer. We can only save the
>> - * write if the buffer is completely zeroed and we're allowed to
>> - * keep the target sparse. */
>> -if (s->compressed) {
>> -if (s->has_zero_init && s->min_sparse &&
>> -buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
>> -{
>> -assert(!s->target_has_backing);
>> -break;
>> -}
>> -
>> -iov.iov_base = buf;
>> -iov.iov_len = n << BDRV_SECTOR_BITS;
>> -qemu_iovec_init_external(&qiov, &iov, 1);
>> -
>> -ret = blk_co_pwritev(s->target, sector_num << 
>> BDRV_SECTOR_BITS,
>> - n << BDRV_SECTOR_BITS, &qiov,
>> - BDRV_REQ_WRITE_COMPRESSED);
>> -if (ret < 0) {
>> -return ret;
>> -}
>> -break;
>> -}
>> -
>> -/* If there is real non-zero data or we're told to keep the 
>> target
>> - * fully allocated (-S 0), we must write it. Otherwise we can 
>> treat
>> +/* If we're told to keep the target fully allocated (-S 0) or 
>> there
>> + * is real non-zero data, we must write it. Otherwise we can 
>> treat
>>   * it as zero sectors. */
>
> I think we should still mention why there is a difference depending on
> s->compressed. Maybe like this:
>
> /* If we're told to keep the target fully allocated (-S 0) or there
>  * is real non-zero data, we must write it. Otherwise we can treat
>  * it as zero sectors.
>  * Compressed clusters need to be written as a whole, so in that
>  * case we can only save the write if the buffer is completely
>  * zeroed. */
>
>>  if (!s->min_sparse ||
>> -is_allocated_sectors_min(buf, n, &n, s->min_sparse))
>> -{
>> +(!s->compressed &&
>> + is_allocated_sectors_min(buf, n, &n, s->min_sparse)) ||
>> +(s->compressed &&
>> + !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) {
>> +
>
> This newline is a bit weird. Normally we don't have newlines at the
> start of a block.
>
> If you (like me) think there should be a visual separation between the
> if condition and the block, I'd suggest keeping the opening brace { on
> its own line (as it is now).
>
Thanks for your review.

> Max
>
>>  iov.iov_base = buf;
>>  iov.iov_len = n << BDRV_SECTOR_BITS;
>>  qemu_iovec_init_external(&qiov, &iov, 1);
>>
>>  ret = blk_co_pwritev(s->target, sector_num << 
>> BDRV_SECTOR_BITS,
>> - n << BDRV_SECTOR_BITS, &qiov, 0);
>> + n << BDRV_SECTOR_BITS, &qiov, flags);
>>  if (ret < 0) {
>>  return ret;
>>  }
>> @@ -1716,6 +1696,7 @@ static int coroutine_fn 
>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>
>>  case BLK_ZERO:
>>  if (s->has_zero_init) {
>> +assert(!s->target_has_backing);
>>  break;
>>  }
>>  ret = blk_co_pwrite_zeroes(s->target,
>>
>
>



Re: [Qemu-devel] [PATCH 2/2] qemu-img: fix some spelling errors

2017-04-26 Thread 858585 jemmy
On Wed, Apr 26, 2017 at 3:11 AM, Max Reitz  wrote:
> On 24.04.2017 17:53, Eric Blake wrote:
>> On 04/24/2017 10:47 AM, Eric Blake wrote:
>>> On 04/24/2017 10:37 AM, Philippe Mathieu-Daudé wrote:
>>>
>>  /*
>> - * Returns true iff the first sector pointed to by 'buf' contains at
>> least
>> - * a non-NUL byte.
>> + * Returns true if the first sector pointed to by 'buf' contains at
>> least
>> + * a non-NULL byte.
>
> NACK to both changes.  'iff' is an English word that is shorthand for
> "if and only if".  "NUL" means the one-byte character, while "NULL"
> means the 8-byte (or 4-byte, on 32-bit platform) pointer value.

 I agree with Lidong shorthands are not obvious from non-native speaker.

 What about this?

  * Returns true if (and only if) the first sector pointed to by 'buf'
 contains
>>>
>>> That might be okay.
>
> Might, yes, but we have it all over the code. I'm not particularly avid
> to change this, because I am in fact one of the culprits (and I'm a
> non-native speaker, but I do like to use LaTeX so I know my \iff).
>
> (By the way, judging from the author's name of this line of code (which
> is Thiemo Seufer), I'd wager he's not a native speaker either.)
>
  * at least a non-null character.
>>>
>>> But that still doesn't make sense.  The character name is NUL, and
>>> non-NULL refers to something that is a pointer, not a character.
>>
>> What's more, the NUL character can actually occupy more than one byte
>> (think UTF-16, where it is the two-byte 0 value).  Referring to NUL byte
>> rather than NUL character (or even the 'zero byte') makes it obvious
>> that this function is NOT encoding-sensitive, and doesn't start
>> mis-behaving just because the data picks a multi-byte character encoding.
>
> Furthermore, this doesn't have anything to do with being a native
> speaker or not: NUL is just the commonly used and probably standardized
> abbreviation of a certain ASCII character (in any language). It's OK not
> to know this, but I don't think it's OK to change the comment.
Thanks for your explanation.
>
> Max
>



Re: [Qemu-devel] [PATCH 2/2] qemu-img: fix some spelling errors

2017-04-24 Thread 858585 jemmy
On Mon, Apr 24, 2017 at 11:53 PM, Eric Blake  wrote:
> On 04/24/2017 10:47 AM, Eric Blake wrote:
>> On 04/24/2017 10:37 AM, Philippe Mathieu-Daudé wrote:
>>
>  /*
> - * Returns true iff the first sector pointed to by 'buf' contains at
> least
> - * a non-NUL byte.
> + * Returns true if the first sector pointed to by 'buf' contains at
> least
> + * a non-NULL byte.

 NACK to both changes.  'iff' is an English word that is shorthand for
 "if and only if".  "NUL" means the one-byte character, while "NULL"
 means the 8-byte (or 4-byte, on 32-bit platform) pointer value.
>>>
>>> I agree with Lidong shorthands are not obvious from non-native speaker.
>>>
>>> What about this?
>>>
>>>  * Returns true if (and only if) the first sector pointed to by 'buf'
>>> contains
>>
>> That might be okay.
>>
>>>  * at least a non-null character.
>>
>> But that still doesn't make sense.  The character name is NUL, and
>> non-NULL refers to something that is a pointer, not a character.
>
> What's more, the NUL character can actually occupy more than one byte
> (think UTF-16, where it is the two-byte 0 value).  Referring to NUL byte
> rather than NUL character (or even the 'zero byte') makes it obvious
> that this function is NOT encoding-sensitive, and doesn't start
> mis-behaving just because the data picks a multi-byte character encoding.

How about this?

 * Returns true if (and only if) the first sector pointed to by 'buf'
 * contains at least a non-zero byte.

Thanks.

>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>



Re: [Qemu-devel] [PATCH 1/2] qemu-img: make sure contain the consecutive number of zero bytes

2017-04-24 Thread 858585 jemmy
On Mon, Apr 24, 2017 at 10:43 PM, Eric Blake  wrote:
> On 04/23/2017 09:33 AM, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> is_allocated_sectors_min don't guarantee to contain the
>> consecutive number of zero bytes. this patch fixes this bug.
>
> This message was sent without an 'In-Reply-To' header pointing to a 0/2
> cover letter.  When sending a series, please always thread things to a
> cover letter; you may find 'git config format.coverletter auto' to be
> helpful.

Thanks for your kind advice.

>
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  qemu-img.c | 11 ++-
>>  1 file changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index b220cf7..df6d165 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -1060,9 +1060,9 @@ static int is_allocated_sectors(const uint8_t *buf, 
>> int n, int *pnum)
>>  }
>>
>>  /*
>> - * Like is_allocated_sectors, but if the buffer starts with a used sector,
>> - * up to 'min' consecutive sectors containing zeros are ignored. This avoids
>> - * breaking up write requests for only small sparse areas.
>> + * Like is_allocated_sectors, but up to 'min' consecutive sectors
>> + * containing zeros are ignored. This avoids breaking up write requests
>> + * for only small sparse areas.
>>   */
>>  static int is_allocated_sectors_min(const uint8_t *buf, int n, int *pnum,
>>  int min)
>> @@ -1071,11 +1071,12 @@ static int is_allocated_sectors_min(const uint8_t 
>> *buf, int n, int *pnum,
>>  int num_checked, num_used;
>>
>>  if (n < min) {
>> -min = n;
>> +*pnum = n;
>> +return 1;
>>  }
>>
>>  ret = is_allocated_sectors(buf, n, pnum);
>> -if (!ret) {
>> +if (!ret && *pnum >= min) {
>
> I seem to recall past attempts to try and patch this function, which
> were then turned down, although I haven't scrubbed the archives for a
> quick URL to point to. I'm worried that there are more subtleties here
> than what you realize.

Hi Eric:
Do you mean this URL?
https://lists.gnu.org/archive/html/qemu-block/2017-01/msg00306.html

But I think the code is not consistent with qemu-img --help.
qemu-img --help
  '-S' indicates the consecutive number of bytes (defaults to 4k) that must
   contain only zeros for qemu-img to create a sparse image during
   conversion. If the number of bytes is 0, the source will not be scanned
   for unallocated or zero sectors, and the destination image will always
   be fully allocated.

Another reason: if s->has_zero_init is 1 (a qcow2 image that has a
backing_file), the empty space at the beginning of the buffer still needs to
be written, which means invoking blk_co_pwrite_zeroes, and a single write
operation is split into two just because there is a small empty space at the
beginning.
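
To illustrate with made-up numbers: with -S 4k, a 64k buffer that begins with
a 2k run of zeroes currently becomes two requests instead of one:

/* hypothetical 64k buffer, -S 4k (min_sparse = 8 sectors):
 *   bytes 0..2k  : zero -> treated as a zero range (skip or pwrite_zeroes)
 *   bytes 2k..64k: data -> blk_co_pwritev()
 * one logical write turns into two operations because of the small
 * leading zero run, even though 2k is below the -S threshold. */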

Thanks.

>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3266
> Virtualization:  qemu.org | libvirt.org
>



Re: [Qemu-devel] [PATCH] migration/block: optimize the performance by coalescing the same write type

2017-04-24 Thread 858585 jemmy
The reason MIN_CLUSTER_SIZE is 8192 is based on the performance test results:
the performance only drops noticeably when cluster_size is less than 8192.

I wrote this code, run in the guest OS, to create the worst-case condition.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
char *zero;
char *nonzero;
FILE* fp = fopen("./test.dat", "ab");

zero = malloc(sizeof(char)*512*8);
nonzero = malloc(sizeof(char)*512*8);

memset(zero, 0, sizeof(char)*512*8);
memset(nonzero, 1, sizeof(char)*512*8);

while (1) {
fwrite(zero, sizeof(char)*512*8, 1, fp);
fwrite(nonzero, sizeof(char)*512*8, 1, fp);
}
fclose(fp);
}


On Mon, Apr 24, 2017 at 9:55 PM,   wrote:
> From: Lidong Chen 
>
> This patch optimizes the performance by coalescing the same write type.
> When the zero/non-zero state changes, perform the write for the accumulated
> cluster count.
>
> Signed-off-by: Lidong Chen 
> ---
> Thanks Fam Zheng and Stefan's advice.
> ---
>  migration/block.c | 66 
> +--
>  1 file changed, 49 insertions(+), 17 deletions(-)
>
> diff --git a/migration/block.c b/migration/block.c
> index 060087f..e9c5e21 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -40,6 +40,8 @@
>
>  #define MAX_INFLIGHT_IO 512
>
> +#define MIN_CLUSTER_SIZE 8192
> +
>  //#define DEBUG_BLK_MIGRATION
>
>  #ifdef DEBUG_BLK_MIGRATION
> @@ -923,10 +925,11 @@ static int block_load(QEMUFile *f, void *opaque, int 
> version_id)
>  }
>
>  ret = bdrv_get_info(blk_bs(blk), &bdi);
> -if (ret == 0 && bdi.cluster_size > 0 &&
> -bdi.cluster_size <= BLOCK_SIZE &&
> -BLOCK_SIZE % bdi.cluster_size == 0) {
> +if (ret == 0 && bdi.cluster_size > 0) {
>  cluster_size = bdi.cluster_size;
> +while (cluster_size < MIN_CLUSTER_SIZE) {
> +cluster_size *= 2;
> +}
>  } else {
>  cluster_size = BLOCK_SIZE;
>  }
> @@ -943,29 +946,58 @@ static int block_load(QEMUFile *f, void *opaque, int 
> version_id)
>  nr_sectors * BDRV_SECTOR_SIZE,
>  BDRV_REQ_MAY_UNMAP);
>  } else {
> -int i;
> -int64_t cur_addr;
> -uint8_t *cur_buf;
> +int64_t cur_addr = addr * BDRV_SECTOR_SIZE;
> +uint8_t *cur_buf = NULL;
> +int64_t last_addr = addr * BDRV_SECTOR_SIZE;
> +uint8_t *last_buf = NULL;
> +int64_t end_addr = addr * BDRV_SECTOR_SIZE + BLOCK_SIZE;
>
>  buf = g_malloc(BLOCK_SIZE);
>  qemu_get_buffer(f, buf, BLOCK_SIZE);
> -for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
> -cur_addr = addr * BDRV_SECTOR_SIZE + i * cluster_size;
> -cur_buf = buf + i * cluster_size;
> -
> -if ((!block_mig_state.zero_blocks ||
> -cluster_size < BLOCK_SIZE) &&
> -buffer_is_zero(cur_buf, cluster_size)) {
> -ret = blk_pwrite_zeroes(blk, cur_addr,
> -cluster_size,
> +cur_buf = buf;
> +last_buf = buf;
> +
> +while (last_addr < end_addr) {
> +int is_zero = 0;
> +int buf_size = MIN(end_addr - cur_addr, cluster_size);
> +
> +/* If the "zero blocks" migration capability is enabled
> + * and the buf_size == BLOCK_SIZE, then the source QEMU
> + * process has already scanned for zeroes. CPU is wasted
> + * scanning for zeroes in the destination QEMU process.
> + */
> +if (block_mig_state.zero_blocks &&
> +buf_size == BLOCK_SIZE) {
> +is_zero = 0;
> +} else {
> +is_zero = buffer_is_zero(cur_buf, buf_size);
> +}
> +
> +cur_addr += buf_size;
> +cur_buf += buf_size;
> +while (cur_addr < end_addr) {
> +buf_size = MIN(end_addr - cur_addr, cluster_size);
> +if (is_zero != buffer_is_zero(cur_buf, buf_size)) {
> +break;
> +}
> +cur_addr += buf_size;
> +cur_buf += buf_size;
> +}
> +
> +if (is_zero) {
> +ret = blk_pwrite_zeroes(blk, last_addr,
> +cur_addr - last_addr,
>  BDRV_REQ_MAY_U

Re: [Qemu-devel] [PATCH v2] qemu-img: use blk_co_pwrite_zeroes for zero sectors when compressed

2017-04-24 Thread 858585 jemmy
Hi Everyone:
Any suggestions about this patch?
Thanks.

On Sun, Apr 23, 2017 at 5:53 PM, 858585 jemmy  wrote:
> I test four test case for this patch.
> 1.qcow2 image with lots of zero cluster, and convert with compress
> 2.qcow2 image with lots of zero cluster, and convert with no compress
> 3.qcow2 image with lots of non-zero clusters, and convert with compress
> 4.qcow2 image with lots of non-zero clusters, and convert with no compress
> all test case pass.
>
> the qcow2 image with lots of zero clusters, the time reduce obviously.
> and have no bad effort in other cases.
>
>
> On Fri, Apr 21, 2017 at 5:57 PM,   wrote:
>> From: Lidong Chen 
>>
>> when the buffer is zero, blk_co_pwrite_zeroes is more effectively than
>> blk_co_pwritev with BDRV_REQ_WRITE_COMPRESSED. this patch can reduces
>> the time when converts the qcow2 image with lots of zero.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>> v2 changelog:
>> unify the compressed and non-compressed code paths
>> ---
>>  qemu-img.c | 41 +++--
>>  1 file changed, 11 insertions(+), 30 deletions(-)
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index b220cf7..60c9adf 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -1661,6 +1661,8 @@ static int coroutine_fn 
>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>
>>  while (nb_sectors > 0) {
>>  int n = nb_sectors;
>> +BdrvRequestFlags flags = s->compressed ? BDRV_REQ_WRITE_COMPRESSED 
>> : 0;
>> +
>>  switch (status) {
>>  case BLK_BACKING_FILE:
>>  /* If we have a backing file, leave clusters unallocated that 
>> are
>> @@ -1670,43 +1672,21 @@ static int coroutine_fn 
>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>  break;
>>
>>  case BLK_DATA:
>> -/* We must always write compressed clusters as a whole, so don't
>> - * try to find zeroed parts in the buffer. We can only save the
>> - * write if the buffer is completely zeroed and we're allowed to
>> - * keep the target sparse. */
>> -if (s->compressed) {
>> -if (s->has_zero_init && s->min_sparse &&
>> -buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
>> -{
>> -assert(!s->target_has_backing);
>> -break;
>> -}
>> -
>> -iov.iov_base = buf;
>> -iov.iov_len = n << BDRV_SECTOR_BITS;
>> -qemu_iovec_init_external(&qiov, &iov, 1);
>> -
>> -ret = blk_co_pwritev(s->target, sector_num << 
>> BDRV_SECTOR_BITS,
>> - n << BDRV_SECTOR_BITS, &qiov,
>> - BDRV_REQ_WRITE_COMPRESSED);
>> -if (ret < 0) {
>> -return ret;
>> -}
>> -break;
>> -}
>> -
>> -/* If there is real non-zero data or we're told to keep the 
>> target
>> - * fully allocated (-S 0), we must write it. Otherwise we can 
>> treat
>> +/* If we're told to keep the target fully allocated (-S 0) or 
>> there
>> + * is real non-zero data, we must write it. Otherwise we can 
>> treat
>>   * it as zero sectors. */
>>  if (!s->min_sparse ||
>> -is_allocated_sectors_min(buf, n, &n, s->min_sparse))
>> -{
>> +(!s->compressed &&
>> + is_allocated_sectors_min(buf, n, &n, s->min_sparse)) ||
>> +(s->compressed &&
>> + !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) {
>> +
>>  iov.iov_base = buf;
>>  iov.iov_len = n << BDRV_SECTOR_BITS;
>>  qemu_iovec_init_external(&qiov, &iov, 1);
>>
>>  ret = blk_co_pwritev(s->target, sector_num << 
>> BDRV_SECTOR_BITS,
>> - n << BDRV_SECTOR_BITS, &qiov, 0);
>> + n << BDRV_SECTOR_BITS, &qiov, flags);
>>  if (ret < 0) {
>>  return ret;
>>  }
>> @@ -1716,6 +1696,7 @@ static int coroutine_fn 
>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>
>>  case BLK_ZERO:
>>  if (s->has_zero_init) {
>> +assert(!s->target_has_backing);
>>  break;
>>  }
>>  ret = blk_co_pwrite_zeroes(s->target,
>> --
>> 1.8.3.1
>>



Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-24 Thread 858585 jemmy
On Mon, Apr 24, 2017 at 8:36 PM, Fam Zheng  wrote:
> On Mon, 04/24 20:26, 858585 jemmy wrote:
>> > 2) qcow2 with cluster_size = 512 is probably too uncommon to be optimized 
>> > for.
>> if culster_size is very small, should disable metadata check default?
>>
>
> No, I don't think it's worth the inconsistent behavior. People who want
> performance shouldn't use 512 bytes anyway.
OK, thanks.

>
> Fam



Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-24 Thread 858585 jemmy
On Mon, Apr 24, 2017 at 8:19 PM, Fam Zheng  wrote:
> On Mon, 04/24 20:09, Fam Zheng wrote:
>> It's a separate problem.
>
> To be specific:
>
> 1) there is an option "overlap-check" that one can use to
> disable the costly metadata check;
Yes, I will disable the metadata check and test the performance again.
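(As an illustration only -- the exact syntax should be checked against the
QEMU version in use -- the check can be turned off per drive with the qcow2
overlap-check option, e.g.:

    -drive file=test.qcow2,format=qcow2,overlap-check=none

where test.qcow2 is a placeholder image name.)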

>
> 2) qcow2 with cluster_size = 512 is probably too uncommon to be optimized for.
If cluster_size is very small, should the metadata check be disabled by default?

>
> Both are irrelevant to why and how this patch can be improved, IMO.
>
> Fam



Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-24 Thread 858585 jemmy
On Mon, Apr 24, 2017 at 8:09 PM, Fam Zheng  wrote:
> On Mon, 04/24 19:54, 858585 jemmy wrote:
>> On Mon, Apr 24, 2017 at 3:40 PM, 858585 jemmy  wrote:
>> > On Mon, Apr 17, 2017 at 12:00 PM, 858585 jemmy  
>> > wrote:
>> >> On Mon, Apr 17, 2017 at 11:49 AM, Fam Zheng  wrote:
>> >>> On Fri, 04/14 14:30, 858585 jemmy wrote:
>> >>>> Do you know some other format which have very small cluster size?
>> >>>
>> >>> 64k is the default cluster size for qcow2 but it can be configured at 
>> >>> image
>> >>> creation time, as 512 bytes, for example:
>> >>>
>> >>> $ qemu-img create -f qcow2 test.qcow2 -o cluster_size=512 1G
>> >>
>> >> Thanks, i will test the performance again.
>> >
>> > I find the performance reduce when cluster size is 512.
>> > I will optimize the performance and submit a patch later.
>> > Thanks.
>>
>> after optimize the code, i find the destination qemu process still have very
>> bad performance when cluster_size is 512. the reason is cause by
>> qcow2_check_metadata_overlap.
>>
>> if cluster_size is 512, the destination qemu process reach 100% cpu usage.
>> and the perf top result is below:
>>
>> Samples: 32K of event 'cycles', Event count (approx.): 20105269445
>>  91.68%  qemu-system-x86_64   [.] qcow2_check_metadata_overlap
>>   3.33%  qemu-system-x86_64   [.] range_get_last
>>   2.76%  qemu-system-x86_64   [.] ranges_overlap
>>   0.61%  qemu-system-x86_64   [.] qcow2_cache_do_get
>>
>> very large l1_size.
>> (gdb) p s->l1_size
>> $3 = 1310720
>>
>> (gdb) p s->max_refcount_table_index
>> $5 = 21905
>>
>> the backtrace:
>>
>> Breakpoint 1, qcow2_check_metadata_overlap (bs=0x16feb00, ign=0,
>> offset=440329728, size=4096) at block/qcow2-refcount.c:2344
>> 2344{
>> (gdb) bt
>> #0  qcow2_check_metadata_overlap (bs=0x16feb00, ign=0,
>> offset=440329728, size=4096) at block/qcow2-refcount.c:2344
>> #1  0x00878d9f in qcow2_pre_write_overlap_check (bs=0x16feb00,
>> ign=0, offset=440329728, size=4096) at block/qcow2-refcount.c:2473
>> #2  0x0086e382 in qcow2_co_pwritev (bs=0x16feb00,
>> offset=771047424, bytes=704512, qiov=0x7fd026bfdb90, flags=0) at
>> block/qcow2.c:1653
>> #3  0x008aeace in bdrv_driver_pwritev (bs=0x16feb00,
>> offset=770703360, bytes=1048576, qiov=0x7fd026bfdb90, flags=0) at
>> block/io.c:871
>> #4  0x008b015c in bdrv_aligned_pwritev (child=0x171b630,
>> req=0x7fd026bfd980, offset=770703360, bytes=1048576, align=1,
>> qiov=0x7fd026bfdb90, flags=0) at block/io.c:1371
>> #5  0x008b0d77 in bdrv_co_pwritev (child=0x171b630,
>> offset=770703360, bytes=1048576, qiov=0x7fd026bfdb90, flags=0) at
>> block/io.c:1622
>> #6  0x0089a76d in blk_co_pwritev (blk=0x16fe920,
>> offset=770703360, bytes=1048576, qiov=0x7fd026bfdb90, flags=0) at
>> block/block-backend.c:992
>> #7  0x0089a878 in blk_write_entry (opaque=0x7fd026bfdb70) at
>> block/block-backend.c:1017
>> #8  0x0089a95d in blk_prw (blk=0x16fe920, offset=770703360,
>> buf=0x362b050 "", bytes=1048576, co_entry=0x89a81a ,
>> flags=0) at block/block-backend.c:1045
>> #9  0x0089b222 in blk_pwrite (blk=0x16fe920, offset=770703360,
>> buf=0x362b050, count=1048576, flags=0) at block/block-backend.c:1208
>> #10 0x007d480d in block_load (f=0x1784fa0, opaque=0xfd46a0,
>> version_id=1) at migration/block.c:992
>> #11 0x0049dc58 in vmstate_load (f=0x1784fa0, se=0x16fbdc0,
>> version_id=1) at /data/qemu/migration/savevm.c:730
>> #12 0x004a0752 in qemu_loadvm_section_part_end (f=0x1784fa0,
>> mis=0xfd4160) at /data/qemu/migration/savevm.c:1923
>> #13 0x004a0842 in qemu_loadvm_state_main (f=0x1784fa0,
>> mis=0xfd4160) at /data/qemu/migration/savevm.c:1954
>> #14 0x004a0a33 in qemu_loadvm_state (f=0x1784fa0) at
>> /data/qemu/migration/savevm.c:2020
>> #15 0x007c2d33 in process_incoming_migration_co
>> (opaque=0x1784fa0) at migration/migration.c:404
>> #16 0x00966593 in coroutine_trampoline (i0=27108400, i1=0) at
>> util/coroutine-ucontext.c:79
>> #17 0x7fd03946b8f0 in ?? () from /lib64/libc.so.6
>> #18 0x7fff869c87e0 in ?? ()
>> #19 0x in ?? ()
>>
>> when the cluster_size is too small, the write performance is very bad.
>> How to solve this problem? Any suggestion?
>> 1. when the cluster_size is too smal

Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-24 Thread 858585 jemmy
On Mon, Apr 24, 2017 at 3:40 PM, 858585 jemmy  wrote:
> On Mon, Apr 17, 2017 at 12:00 PM, 858585 jemmy  wrote:
>> On Mon, Apr 17, 2017 at 11:49 AM, Fam Zheng  wrote:
>>> On Fri, 04/14 14:30, 858585 jemmy wrote:
>>>> Do you know some other format which have very small cluster size?
>>>
>>> 64k is the default cluster size for qcow2 but it can be configured at image
>>> creation time, as 512 bytes, for example:
>>>
>>> $ qemu-img create -f qcow2 test.qcow2 -o cluster_size=512 1G
>>
>> Thanks, i will test the performance again.
>
> I find the performance reduce when cluster size is 512.
> I will optimize the performance and submit a patch later.
> Thanks.

After optimizing the code, I find the destination QEMU process still has very
bad performance when cluster_size is 512. The reason is
qcow2_check_metadata_overlap.

If cluster_size is 512, the destination QEMU process reaches 100% CPU usage,
and the perf top result is below:

Samples: 32K of event 'cycles', Event count (approx.): 20105269445
 91.68%  qemu-system-x86_64   [.] qcow2_check_metadata_overlap
  3.33%  qemu-system-x86_64   [.] range_get_last
  2.76%  qemu-system-x86_64   [.] ranges_overlap
  0.61%  qemu-system-x86_64   [.] qcow2_cache_do_get

The l1_size is very large:
(gdb) p s->l1_size
$3 = 1310720

(gdb) p s->max_refcount_table_index
$5 = 21905

the backtrace:

Breakpoint 1, qcow2_check_metadata_overlap (bs=0x16feb00, ign=0,
offset=440329728, size=4096) at block/qcow2-refcount.c:2344
2344{
(gdb) bt
#0  qcow2_check_metadata_overlap (bs=0x16feb00, ign=0,
offset=440329728, size=4096) at block/qcow2-refcount.c:2344
#1  0x00878d9f in qcow2_pre_write_overlap_check (bs=0x16feb00,
ign=0, offset=440329728, size=4096) at block/qcow2-refcount.c:2473
#2  0x0086e382 in qcow2_co_pwritev (bs=0x16feb00,
offset=771047424, bytes=704512, qiov=0x7fd026bfdb90, flags=0) at
block/qcow2.c:1653
#3  0x008aeace in bdrv_driver_pwritev (bs=0x16feb00,
offset=770703360, bytes=1048576, qiov=0x7fd026bfdb90, flags=0) at
block/io.c:871
#4  0x008b015c in bdrv_aligned_pwritev (child=0x171b630,
req=0x7fd026bfd980, offset=770703360, bytes=1048576, align=1,
qiov=0x7fd026bfdb90, flags=0) at block/io.c:1371
#5  0x008b0d77 in bdrv_co_pwritev (child=0x171b630,
offset=770703360, bytes=1048576, qiov=0x7fd026bfdb90, flags=0) at
block/io.c:1622
#6  0x0089a76d in blk_co_pwritev (blk=0x16fe920,
offset=770703360, bytes=1048576, qiov=0x7fd026bfdb90, flags=0) at
block/block-backend.c:992
#7  0x0089a878 in blk_write_entry (opaque=0x7fd026bfdb70) at
block/block-backend.c:1017
#8  0x0089a95d in blk_prw (blk=0x16fe920, offset=770703360,
buf=0x362b050 "", bytes=1048576, co_entry=0x89a81a ,
flags=0) at block/block-backend.c:1045
#9  0x0089b222 in blk_pwrite (blk=0x16fe920, offset=770703360,
buf=0x362b050, count=1048576, flags=0) at block/block-backend.c:1208
#10 0x007d480d in block_load (f=0x1784fa0, opaque=0xfd46a0,
version_id=1) at migration/block.c:992
#11 0x0049dc58 in vmstate_load (f=0x1784fa0, se=0x16fbdc0,
version_id=1) at /data/qemu/migration/savevm.c:730
#12 0x004a0752 in qemu_loadvm_section_part_end (f=0x1784fa0,
mis=0xfd4160) at /data/qemu/migration/savevm.c:1923
#13 0x004a0842 in qemu_loadvm_state_main (f=0x1784fa0,
mis=0xfd4160) at /data/qemu/migration/savevm.c:1954
#14 0x004a0a33 in qemu_loadvm_state (f=0x1784fa0) at
/data/qemu/migration/savevm.c:2020
#15 0x007c2d33 in process_incoming_migration_co
(opaque=0x1784fa0) at migration/migration.c:404
#16 0x00966593 in coroutine_trampoline (i0=27108400, i1=0) at
util/coroutine-ucontext.c:79
#17 0x7fd03946b8f0 in ?? () from /lib64/libc.so.6
#18 0x7fff869c87e0 in ?? ()
#19 0x in ?? ()

When the cluster_size is too small, the write performance is very bad.
How do we solve this problem? Any suggestions?
1. When the cluster_size is too small, do not invoke qcow2_pre_write_overlap_check.
2. Limit the qcow2 cluster_size range, and don't allow setting the cluster_size
too small.
Which way is better?

>
>>>
>>> Fam



Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-24 Thread 858585 jemmy
On Mon, Apr 17, 2017 at 12:00 PM, 858585 jemmy  wrote:
> On Mon, Apr 17, 2017 at 11:49 AM, Fam Zheng  wrote:
>> On Fri, 04/14 14:30, 858585 jemmy wrote:
>>> Do you know some other format which have very small cluster size?
>>
>> 64k is the default cluster size for qcow2 but it can be configured at image
>> creation time, as 512 bytes, for example:
>>
>> $ qemu-img create -f qcow2 test.qcow2 -o cluster_size=512 1G
>
> Thanks, i will test the performance again.

I find the performance drops when the cluster size is 512.
I will optimize the performance and submit a patch later.
Thanks.

>>
>> Fam



Re: [Qemu-devel] [PATCH] qemu-img: use blk_co_pwrite_zeroes for zero sectors when compressed

2017-04-23 Thread 858585 jemmy
On Fri, Apr 21, 2017 at 1:37 PM, 858585 jemmy  wrote:
> On Fri, Apr 21, 2017 at 10:58 AM, 858585 jemmy  wrote:
>> On Thu, Apr 20, 2017 at 6:00 PM, Kevin Wolf  wrote:
>>> Am 20.04.2017 um 10:38 hat jemmy858...@gmail.com geschrieben:
>>>> From: Lidong Chen 
>>>>
>>>> when the buffer is zero, blk_co_pwrite_zeroes is more effectively than
>>>> blk_co_pwritev with BDRV_REQ_WRITE_COMPRESSED. this patch can reduces
>>>> the time when converts the qcow2 image with lots of zero.
>>>>
>>>> Signed-off-by: Lidong Chen 
>>>
>>> Good catch, using blk_co_pwrite_zeroes() makes sense even for compressed
>>> images.
>>>
>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>> index b220cf7..0256539 100644
>>>> --- a/qemu-img.c
>>>> +++ b/qemu-img.c
>>>> @@ -1675,13 +1675,20 @@ static int coroutine_fn 
>>>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>>>   * write if the buffer is completely zeroed and we're allowed 
>>>> to
>>>>   * keep the target sparse. */
>>>>  if (s->compressed) {
>>>> -if (s->has_zero_init && s->min_sparse &&
>>>> -buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
>>>> -{
>>>> -assert(!s->target_has_backing);
>>>> -break;
>>>> +if (buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
>>>> +if (s->has_zero_init && s->min_sparse) {
>>>> +assert(!s->target_has_backing);
>>>> +break;
>>>> +} else {
>>>> +ret = blk_co_pwrite_zeroes(s->target,
>>>> +   sector_num << BDRV_SECTOR_BITS,
>>>> +   n << BDRV_SECTOR_BITS, 0);
>>>> +if (ret < 0) {
>>>> +return ret;
>>>> +}
>>>> +break;
>>>> +}
>>>>  }
>>>
>>> If s->min_sparse == 0, we may neither skip the write not use
>>> blk_co_pwrite_zeroes(), because this requires actual full allocation
>>> with explicit zero sectors.
>>>
>>> Of course, if you fix this, what you end up with here is a duplicate of
>>> the code path for non-compressed images. The remaining difference seems
>>> to be the BDRV_REQ_WRITE_COMPRESSED flag and buffer_is_zero() vs.
>>> is_allocated_sectors_min() (because uncompressed clusters can be written
>>> partially, but compressed clusters can't).
>>
>> I have a try to unify the code.
>>
>> I don't understand why use  is_allocated_sectors_min when don't compressed.
>> the s->min_sparse is 8 default, which is smaller than cluster_sectors.
>>
>> if a cluster which data is  8 sector zero and 8 sector non-zero
>> repeated, it will call
>> blk_co_pwritev and blk_co_pwrite_zeroes many times for a cluster.
>>
>> why not compare the zero by cluster_sectors size?
>
> I write this code, run in guest os.
>
> #include 
> #include 
> #include 
>
> int main()
> {
> char *zero;
> char *nonzero;
> FILE* fp = fopen("./test.dat", "ab");
>
> zero = malloc(sizeof(char)*512*8);
> nonzero = malloc(sizeof(char)*512*8);
>
> memset(zero, 0, sizeof(char)*512*8);
> memset(nonzero, 1, sizeof(char)*512*8);
>
> while (1) {
> fwrite(zero, sizeof(char)*512*8, 1, fp);
> fwrite(nonzero, sizeof(char)*512*8, 1, fp);
> }
> fclose(fp);
> }
>
> qemu-img info /mnt/img2016111016860868.qcow2
> image: /mnt/img2016111016860868.qcow2
> file format: qcow2
> virtual size: 20G (21474836480 bytes)
> disk size: 19G (20061552640 bytes)
> cluster_size: 65536
> backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2
>
> use -S 65536 option.
>
> time /root/kvm/bin/qemu-img convert -p -B
> /baseimage/img2016042213665396/img2016042213665396.qcow2 -O qcow2
> /mnt/img2016111016860868.qcow2 /mnt/img2017041611162809_zip_new.qcow2
> -S 65536
> (100.00/100%)
>
> real0m32.203s
> user0m5.165s
> sys 0m27.887s
>
> time /root/kvm/bin/qemu-img convert -p -B
> /baseimage/img2016042213665396/img2016042213665396.qcow2 -O qcow2
> /mnt/img2016111016860868.qcow2 /mnt/img2017041611162809_zip_new.qcow2
> (100.00/100%)
>
> real1m38.665s
> user0m45.418s
> sys 1m7.518s
>
> should we set cluster_sectors as the default value of s->min_sparse?

Changing the default value of s->min_sparse would break the API;
qemu-img --help describes the default value as 4k:

  '-S' indicates the consecutive number of bytes (defaults to 4k) that must
   contain only zeros for qemu-img to create a sparse image during
   conversion. If the number of bytes is 0, the source will not be scanned for
   unallocated or zero sectors, and the destination image will always be
   fully allocated

>
>>
>>>
>>> So I suppose that instead of just fixing the above bug, we could actually
>>> mostly unify the two code paths, if you want to have a try at it.
>>>
>>> Kevin



Re: [Qemu-devel] [PATCH v2] qemu-img: use blk_co_pwrite_zeroes for zero sectors when compressed

2017-04-23 Thread 858585 jemmy
I ran four test cases for this patch:
1. qcow2 image with lots of zero clusters, converted with compression
2. qcow2 image with lots of zero clusters, converted without compression
3. qcow2 image with lots of non-zero clusters, converted with compression
4. qcow2 image with lots of non-zero clusters, converted without compression
All test cases pass.

For the qcow2 image with lots of zero clusters, the time is reduced noticeably,
and there is no negative effect in the other cases.
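(The exact images and command lines are not listed here; an illustrative way
to run the compressed and non-compressed conversions would be:

    qemu-img convert -c -p -O qcow2 src.qcow2 dst_compressed.qcow2   # cases 1 and 3
    qemu-img convert -p -O qcow2 src.qcow2 dst_plain.qcow2           # cases 2 and 4

with src.qcow2 standing in for the zero-cluster or non-zero-cluster test image.)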


On Fri, Apr 21, 2017 at 5:57 PM,   wrote:
> From: Lidong Chen 
>
> when the buffer is zero, blk_co_pwrite_zeroes is more effectively than
> blk_co_pwritev with BDRV_REQ_WRITE_COMPRESSED. this patch can reduces
> the time when converts the qcow2 image with lots of zero.
>
> Signed-off-by: Lidong Chen 
> ---
> v2 changelog:
> unify the compressed and non-compressed code paths
> ---
>  qemu-img.c | 41 +++--
>  1 file changed, 11 insertions(+), 30 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index b220cf7..60c9adf 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1661,6 +1661,8 @@ static int coroutine_fn 
> convert_co_write(ImgConvertState *s, int64_t sector_num,
>
>  while (nb_sectors > 0) {
>  int n = nb_sectors;
> +BdrvRequestFlags flags = s->compressed ? BDRV_REQ_WRITE_COMPRESSED : 
> 0;
> +
>  switch (status) {
>  case BLK_BACKING_FILE:
>  /* If we have a backing file, leave clusters unallocated that are
> @@ -1670,43 +1672,21 @@ static int coroutine_fn 
> convert_co_write(ImgConvertState *s, int64_t sector_num,
>  break;
>
>  case BLK_DATA:
> -/* We must always write compressed clusters as a whole, so don't
> - * try to find zeroed parts in the buffer. We can only save the
> - * write if the buffer is completely zeroed and we're allowed to
> - * keep the target sparse. */
> -if (s->compressed) {
> -if (s->has_zero_init && s->min_sparse &&
> -buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> -{
> -assert(!s->target_has_backing);
> -break;
> -}
> -
> -iov.iov_base = buf;
> -iov.iov_len = n << BDRV_SECTOR_BITS;
> -qemu_iovec_init_external(&qiov, &iov, 1);
> -
> -ret = blk_co_pwritev(s->target, sector_num << 
> BDRV_SECTOR_BITS,
> - n << BDRV_SECTOR_BITS, &qiov,
> - BDRV_REQ_WRITE_COMPRESSED);
> -if (ret < 0) {
> -return ret;
> -}
> -break;
> -}
> -
> -/* If there is real non-zero data or we're told to keep the 
> target
> - * fully allocated (-S 0), we must write it. Otherwise we can 
> treat
> +/* If we're told to keep the target fully allocated (-S 0) or 
> there
> + * is real non-zero data, we must write it. Otherwise we can 
> treat
>   * it as zero sectors. */
>  if (!s->min_sparse ||
> -is_allocated_sectors_min(buf, n, &n, s->min_sparse))
> -{
> +(!s->compressed &&
> + is_allocated_sectors_min(buf, n, &n, s->min_sparse)) ||
> +(s->compressed &&
> + !buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))) {
> +
>  iov.iov_base = buf;
>  iov.iov_len = n << BDRV_SECTOR_BITS;
>  qemu_iovec_init_external(&qiov, &iov, 1);
>
>  ret = blk_co_pwritev(s->target, sector_num << 
> BDRV_SECTOR_BITS,
> - n << BDRV_SECTOR_BITS, &qiov, 0);
> + n << BDRV_SECTOR_BITS, &qiov, flags);
>  if (ret < 0) {
>  return ret;
>  }
> @@ -1716,6 +1696,7 @@ static int coroutine_fn 
> convert_co_write(ImgConvertState *s, int64_t sector_num,
>
>  case BLK_ZERO:
>  if (s->has_zero_init) {
> +assert(!s->target_has_backing);
>  break;
>  }
>  ret = blk_co_pwrite_zeroes(s->target,
> --
> 1.8.3.1
>



Re: [Qemu-devel] [PATCH] qemu-img: use blk_co_pwrite_zeroes for zero sectors when compressed

2017-04-20 Thread 858585 jemmy
On Fri, Apr 21, 2017 at 10:58 AM, 858585 jemmy  wrote:
> On Thu, Apr 20, 2017 at 6:00 PM, Kevin Wolf  wrote:
>> Am 20.04.2017 um 10:38 hat jemmy858...@gmail.com geschrieben:
>>> From: Lidong Chen 
>>>
>>> when the buffer is zero, blk_co_pwrite_zeroes is more effectively than
>>> blk_co_pwritev with BDRV_REQ_WRITE_COMPRESSED. this patch can reduces
>>> the time when converts the qcow2 image with lots of zero.
>>>
>>> Signed-off-by: Lidong Chen 
>>
>> Good catch, using blk_co_pwrite_zeroes() makes sense even for compressed
>> images.
>>
>>> diff --git a/qemu-img.c b/qemu-img.c
>>> index b220cf7..0256539 100644
>>> --- a/qemu-img.c
>>> +++ b/qemu-img.c
>>> @@ -1675,13 +1675,20 @@ static int coroutine_fn 
>>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>>   * write if the buffer is completely zeroed and we're allowed 
>>> to
>>>   * keep the target sparse. */
>>>  if (s->compressed) {
>>> -if (s->has_zero_init && s->min_sparse &&
>>> -buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
>>> -{
>>> -assert(!s->target_has_backing);
>>> -break;
>>> +if (buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
>>> +if (s->has_zero_init && s->min_sparse) {
>>> +assert(!s->target_has_backing);
>>> +break;
>>> +} else {
>>> +ret = blk_co_pwrite_zeroes(s->target,
>>> +   sector_num << BDRV_SECTOR_BITS,
>>> +   n << BDRV_SECTOR_BITS, 0);
>>> +if (ret < 0) {
>>> +return ret;
>>> +}
>>> +break;
>>> +}
>>>  }
>>
>> If s->min_sparse == 0, we may neither skip the write not use
>> blk_co_pwrite_zeroes(), because this requires actual full allocation
>> with explicit zero sectors.
>>
>> Of course, if you fix this, what you end up with here is a duplicate of
>> the code path for non-compressed images. The remaining difference seems
>> to be the BDRV_REQ_WRITE_COMPRESSED flag and buffer_is_zero() vs.
>> is_allocated_sectors_min() (because uncompressed clusters can be written
>> partially, but compressed clusters can't).
>
> I have a try to unify the code.
>
> I don't understand why use  is_allocated_sectors_min when don't compressed.
> the s->min_sparse is 8 default, which is smaller than cluster_sectors.
>
> if a cluster which data is  8 sector zero and 8 sector non-zero
> repeated, it will call
> blk_co_pwritev and blk_co_pwrite_zeroes many times for a cluster.
>
> why not compare the zero by cluster_sectors size?

I wrote this test program and ran it in the guest OS:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char *zero;
    char *nonzero;
    FILE *fp = fopen("./test.dat", "ab");

    zero = malloc(sizeof(char) * 512 * 8);
    nonzero = malloc(sizeof(char) * 512 * 8);

    memset(zero, 0, sizeof(char) * 512 * 8);
    memset(nonzero, 1, sizeof(char) * 512 * 8);

    /* alternate 4 KiB of zeros and 4 KiB of non-zeros until the write
     * fails (e.g. the filesystem is full) or the program is interrupted */
    while (1) {
        if (fwrite(zero, sizeof(char) * 512 * 8, 1, fp) != 1 ||
            fwrite(nonzero, sizeof(char) * 512 * 8, 1, fp) != 1) {
            break;
        }
    }
    fclose(fp);
    return 0;
}
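(To reproduce in the guest, assuming the file is saved as fill.c -- the file
name is mine:

    gcc -o fill fill.c
    ./fill        # interrupt with Ctrl-C once test.dat is large enough
)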

qemu-img info /mnt/img2016111016860868.qcow2
image: /mnt/img2016111016860868.qcow2
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 19G (20061552640 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

Using the -S 65536 option:

time /root/kvm/bin/qemu-img convert -p -B
/baseimage/img2016042213665396/img2016042213665396.qcow2 -O qcow2
/mnt/img2016111016860868.qcow2 /mnt/img2017041611162809_zip_new.qcow2
-S 65536
(100.00/100%)

real    0m32.203s
user    0m5.165s
sys     0m27.887s

time /root/kvm/bin/qemu-img convert -p -B
/baseimage/img2016042213665396/img2016042213665396.qcow2 -O qcow2
/mnt/img2016111016860868.qcow2 /mnt/img2017041611162809_zip_new.qcow2
(100.00/100%)

real    1m38.665s
user    0m45.418s
sys     1m7.518s

Should we set cluster_sectors as the default value of s->min_sparse?

>
>>
>> So I suppose that instead of just fixing the above bug, we could actually
>> mostly unify the two code paths, if you want to have a try at it.
>>
>> Kevin



Re: [Qemu-devel] [PATCH] qemu-img: use blk_co_pwrite_zeroes for zero sectors when compressed

2017-04-20 Thread 858585 jemmy
On Thu, Apr 20, 2017 at 6:00 PM, Kevin Wolf  wrote:
> Am 20.04.2017 um 10:38 hat jemmy858...@gmail.com geschrieben:
>> From: Lidong Chen 
>>
>> when the buffer is zero, blk_co_pwrite_zeroes is more effectively than
>> blk_co_pwritev with BDRV_REQ_WRITE_COMPRESSED. this patch can reduces
>> the time when converts the qcow2 image with lots of zero.
>>
>> Signed-off-by: Lidong Chen 
>
> Good catch, using blk_co_pwrite_zeroes() makes sense even for compressed
> images.
>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index b220cf7..0256539 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -1675,13 +1675,20 @@ static int coroutine_fn 
>> convert_co_write(ImgConvertState *s, int64_t sector_num,
>>   * write if the buffer is completely zeroed and we're allowed to
>>   * keep the target sparse. */
>>  if (s->compressed) {
>> -if (s->has_zero_init && s->min_sparse &&
>> -buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
>> -{
>> -assert(!s->target_has_backing);
>> -break;
>> +if (buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
>> +if (s->has_zero_init && s->min_sparse) {
>> +assert(!s->target_has_backing);
>> +break;
>> +} else {
>> +ret = blk_co_pwrite_zeroes(s->target,
>> +   sector_num << BDRV_SECTOR_BITS,
>> +   n << BDRV_SECTOR_BITS, 0);
>> +if (ret < 0) {
>> +return ret;
>> +}
>> +break;
>> +}
>>  }
>
> If s->min_sparse == 0, we may neither skip the write not use
> blk_co_pwrite_zeroes(), because this requires actual full allocation
> with explicit zero sectors.
>
> Of course, if you fix this, what you end up with here is a duplicate of
> the code path for non-compressed images. The remaining difference seems
> to be the BDRV_REQ_WRITE_COMPRESSED flag and buffer_is_zero() vs.
> is_allocated_sectors_min() (because uncompressed clusters can be written
> partially, but compressed clusters can't).

I had a try at unifying the code.

I don't understand why is_allocated_sectors_min is used in the non-compressed case.
s->min_sparse is 8 by default, which is smaller than cluster_sectors.

If a cluster's data is 8 zero sectors and 8 non-zero sectors repeated, it will
call blk_co_pwritev and blk_co_pwrite_zeroes many times for a single cluster.

Why not compare against zero at cluster_sectors granularity?
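To make the concern concrete, here is a minimal standalone sketch (my
illustration, not qemu-img code -- the exact behaviour of
is_allocated_sectors_min may differ) that counts how many separate requests a
worst-case 64 KiB cluster splits into when zero runs are detected at
min_sparse granularity (8 sectors = 4 KiB) and the data alternates 4 KiB of
zeros with 4 KiB of non-zeros:

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define SECTOR_SIZE   512
#define CLUSTER_BYTES 65536
#define MIN_SPARSE    8                  /* sectors, the qemu-img default */

static bool chunk_is_zero(const unsigned char *p, size_t len)
{
    while (len--) {
        if (*p++) {
            return false;
        }
    }
    return true;
}

int main(void)
{
    unsigned char cluster[CLUSTER_BYTES];
    size_t chunk = MIN_SPARSE * SECTOR_SIZE;     /* 4 KiB */
    size_t off;
    int requests = 0;
    bool prev_zero = false;

    /* alternate 4 KiB of zeros and 4 KiB of 0x01 bytes */
    for (off = 0; off < sizeof(cluster); off += chunk) {
        memset(cluster + off, (off / chunk) % 2 ? 0x01 : 0x00, chunk);
    }

    /* every change between a zero run and a data run starts a new request */
    for (off = 0; off < sizeof(cluster); off += chunk) {
        bool zero = chunk_is_zero(cluster + off, chunk);
        if (off == 0 || zero != prev_zero) {
            requests++;
        }
        prev_zero = zero;
    }

    printf("requests per 64 KiB cluster: %d\n", requests);  /* prints 16 */
    return 0;
}

With a whole-cluster comparison the same cluster would be handled as a single
request.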

>
> So I suppose that instead of just fixing the above bug, we could actually
> mostly unify the two code paths, if you want to have a try at it.
>
> Kevin



Re: [Qemu-devel] [PATCH v2] qemu-img: check bs_n when use old style option

2017-04-20 Thread 858585 jemmy
On Thu, Apr 20, 2017 at 5:33 PM, Kevin Wolf  wrote:
> Am 20.04.2017 um 11:19 hat jemmy858...@gmail.com geschrieben:
>> From: Lidong Chen 
>>
>> When use old style option like -o backing_file, img_convert
>> continue run when bs_n > 1, this patch fix this bug.
>>
>> Signed-off-by: Lidong Chen 
>
> I think this is a duplicate of Max' "[PATCH for-2.10 3/3]
> qemu-img/convert: Move bs_n > 1 && -B check down".
ok, thanks.

>
> Also, -B is the old-style option, not -o backing_file.
>
> Kevin



Re: [Qemu-devel] [PATCH v2] qemu-img: check bs_n when use old style option

2017-04-20 Thread 858585 jemmy
test result for this patch:
qemu-img convert -c -p -o
backing_file=/baseimage/img2015122818606660/img2015122818606660.qcow2
-O qcow2 /data/img2017041611162809.qcow2
/data/img2017041611162809.qcow2 /mnt/img2017041611162809_zip_new.qcow2
qemu-img: Specifying backing image makes no sense when concatenating
multiple input images

qemu-img convert -c -p -B
baseimage/img2015122818606660/img2015122818606660.qcow2 -O qcow2
/data/img2017041611162809.qcow2 /data/img2017041611162809.qcow2
/mnt/img2017041611162809_zip_new.qcow2
qemu-img: Specifying backing image makes no sense when concatenating
multiple input images


On Thu, Apr 20, 2017 at 5:19 PM,   wrote:
> From: Lidong Chen 
>
> When use old style option like -o backing_file, img_convert
> continue run when bs_n > 1, this patch fix this bug.
>
> Signed-off-by: Lidong Chen 
> ---
> v2 changelog:
> avoid duplicating code.
> ---
>  qemu-img.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index b220cf7..b4d9255 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -2108,14 +2108,6 @@ static int img_convert(int argc, char **argv)
>  error_exit("Must specify image file name");
>  }
>
> -
> -if (bs_n > 1 && out_baseimg) {
> -error_report("-B makes no sense when concatenating multiple input "
> - "images");
> -ret = -1;
> -goto out;
> -}
> -
>  src_flags = 0;
>  ret = bdrv_parse_cache_mode(src_cache, &src_flags, &src_writethrough);
>  if (ret < 0) {
> @@ -2225,6 +2217,13 @@ static int img_convert(int argc, char **argv)
>  out_baseimg = out_baseimg_param;
>  }
>
> +if (bs_n > 1 && out_baseimg) {
> +error_report("Specifying backing image makes no sense when "
> + "concatenating multiple input images");
> +ret = -1;
> +goto out;
> +}
> +
>  /* Check if compression is supported */
>  if (compress) {
>  bool encryption =
> --
> 1.8.3.1
>



Re: [Qemu-devel] [PATCH] qemu-img: use blk_co_pwrite_zeroes for zero sectors when compressed

2017-04-20 Thread 858585 jemmy
On Thu, Apr 20, 2017 at 4:38 PM,   wrote:
> From: Lidong Chen 
>
> when the buffer is zero, blk_co_pwrite_zeroes is more effectively than
> blk_co_pwritev with BDRV_REQ_WRITE_COMPRESSED. this patch can reduces
> the time when converts the qcow2 image with lots of zero.
>

The original qcow2 file, which has lots of unallocated clusters:
[root]# qemu-img info /mnt/img2016111016860868_old.qcow2
image: /mnt/img2016111016860868_old.qcow2
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 214M (224460800 bytes)
cluster_size: 65536
backing file: /baseimage/img2015122818606660/img2015122818606660.qcow2

the time used for qemu-img convert:
[root ~]# time /root/kvm/bin/qemu-img convert -c -p -o
backing_file=/baseimage/img2015122818606660/img2015122818606660.qcow2
-O
qcow2 /mnt/img2016111016860868.qcow2 /mnt/img2016111016860868_old.qcow2
(100.00/100%)
real    0m29.456s
user    0m29.345s
sys     0m0.481s

Run dd if=/dev/zero of=./test bs=65536 in the guest OS,
then convert again.

Before applying this patch:
[root~]# time /root/kvm/bin/qemu-img convert -c -p -o
backing_file=/baseimage/img2015122818606660/img2015122818606660.qcow2
-O qcow2 /mnt/img2016111016860868.qcow2 /mnt/img2016111016860868_new.qcow2
(100.00/100%)

real    5m35.617s
user    5m33.417s
sys     0m10.699s

After applying this patch:
[root~]# time /root/kvm/bin/qemu-img convert -c -p -o
backing_file=/baseimage/img2015122818606660/img2015122818606660.qcow2
-O
qcow2 /mnt/img2016111016860868.qcow2 /mnt/img2016111016860868_new1.qcow2
(100.00/100%)

real    0m51.189s
user    0m35.239s
sys     0m14.251s

The time is reduced from 5m35.617s to 0m51.189s.

[root ]# ll /mnt/img2016111016860868* -h
-rw-r--r-- 1 root root 254M Apr 20 14:50 /mnt/img2016111016860868_new.qcow2
-rw-r--r-- 1 root root 232M Apr 20 15:27 /mnt/img2016111016860868_new1.qcow2

The size is reduced from 254M to 232M.
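(Not shown above, but a quick way to double-check that the two converted
images still have identical content is qemu-img compare, e.g.:

    qemu-img compare /mnt/img2016111016860868_new.qcow2 /mnt/img2016111016860868_new1.qcow2

using the file names from the listing above.)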

> Signed-off-by: Lidong Chen 
> ---
>  qemu-img.c | 19 +--
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/qemu-img.c b/qemu-img.c
> index b220cf7..0256539 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1675,13 +1675,20 @@ static int coroutine_fn 
> convert_co_write(ImgConvertState *s, int64_t sector_num,
>   * write if the buffer is completely zeroed and we're allowed to
>   * keep the target sparse. */
>  if (s->compressed) {
> -if (s->has_zero_init && s->min_sparse &&
> -buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> -{
> -assert(!s->target_has_backing);
> -break;
> +if (buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
> +if (s->has_zero_init && s->min_sparse) {
> +assert(!s->target_has_backing);
> +break;
> +} else {
> +ret = blk_co_pwrite_zeroes(s->target,
> +   sector_num << BDRV_SECTOR_BITS,
> +   n << BDRV_SECTOR_BITS, 0);
> +if (ret < 0) {
> +return ret;
> +}
> +break;
> +}
>  }
> -
>  iov.iov_base = buf;
>  iov.iov_len = n << BDRV_SECTOR_BITS;
>  qemu_iovec_init_external(&qiov, &iov, 1);
> --
> 1.8.3.1
>



Re: [Qemu-devel] [PATCH] qemu-img: check bs_n when use old style option

2017-04-20 Thread 858585 jemmy
On Thu, Apr 20, 2017 at 4:11 PM, Fam Zheng  wrote:
> On Thu, 04/20 15:59, 858585 jemmy wrote:
>> On Thu, Apr 20, 2017 at 3:51 PM, Fam Zheng  wrote:
>> > On Thu, 04/20 12:04, jemmy858...@gmail.com wrote:
>> >> From: Lidong Chen 
>> >>
>> >> When use old style option like -o backing_file, img_convert
>> >> continue run when bs_n > 1, this patch fix this bug.
>> >>
>> >> Signed-off-by: Lidong Chen 
>> >> ---
>> >>  qemu-img.c | 7 +++
>> >>  1 file changed, 7 insertions(+)
>> >>
>> >> diff --git a/qemu-img.c b/qemu-img.c
>> >> index b220cf7..c673aef 100644
>> >> --- a/qemu-img.c
>> >> +++ b/qemu-img.c
>> >> @@ -2225,6 +2225,13 @@ static int img_convert(int argc, char **argv)
>> >>  out_baseimg = out_baseimg_param;
>> >>  }
>> >>
>> >> +if (bs_n > 1 && out_baseimg) {
>> >> +error_report("-B makes no sense when concatenating multiple 
>> >> input "
>> >> + "images");
>> >> +ret = -1;
>> >> +goto out;
>> >> +}
>> >> +
>> >>  /* Check if compression is supported */
>> >>  if (compress) {
>> >>  bool encryption =
>> >> --
>> >> 1.8.3.1
>> >>
>> >>
>> >
>> > Is this essentially the same as the check a few lines above:
>> >
>> > ...
>> > if (bs_n < 1) {
>> > error_exit("Must specify image file name");
>> > }
>> >
>> >
>> > if (bs_n > 1 && out_baseimg) {
>> > error_report("-B makes no sense when concatenating multiple input "
>> >  "images");
>> > ret = -1;
>> > goto out;
>> > }
>> >
>> > src_flags = 0;
>> > ret = bdrv_parse_cache_mode(src_cache, &src_flags, &src_writethrough);
>> > if (ret < 0) {
>> > error_report("Invalid source cache option: %s", src_cache);
>> > goto out;
>> > }
>> > ...
>> >
>> > How about moving that down?
>> moving that down is ok.
>> but will exit later if use -B option.
>> which way do you think better?
>
> Exiting later is not a problem, I assume? And it's better to avoid duplicating
> code if possible.
>
> BTW if you do that way, it's better to "s/-B/Specifying backing image/" in the
> error message (to be compatible with -o backing_file syntax).
Thanks. i will submit this patch again.


>
> Fam



Re: [Qemu-devel] [PATCH] qemu-img: check bs_n when use old style option

2017-04-20 Thread 858585 jemmy
On Thu, Apr 20, 2017 at 3:51 PM, Fam Zheng  wrote:
> On Thu, 04/20 12:04, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> When use old style option like -o backing_file, img_convert
>> continue run when bs_n > 1, this patch fix this bug.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  qemu-img.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/qemu-img.c b/qemu-img.c
>> index b220cf7..c673aef 100644
>> --- a/qemu-img.c
>> +++ b/qemu-img.c
>> @@ -2225,6 +2225,13 @@ static int img_convert(int argc, char **argv)
>>  out_baseimg = out_baseimg_param;
>>  }
>>
>> +if (bs_n > 1 && out_baseimg) {
>> +error_report("-B makes no sense when concatenating multiple input "
>> + "images");
>> +ret = -1;
>> +goto out;
>> +}
>> +
>>  /* Check if compression is supported */
>>  if (compress) {
>>  bool encryption =
>> --
>> 1.8.3.1
>>
>>
>
> Is this essentially the same as the check a few lines above:
>
> ...
> if (bs_n < 1) {
> error_exit("Must specify image file name");
> }
>
>
> if (bs_n > 1 && out_baseimg) {
> error_report("-B makes no sense when concatenating multiple input "
>  "images");
> ret = -1;
> goto out;
> }
>
> src_flags = 0;
> ret = bdrv_parse_cache_mode(src_cache, &src_flags, &src_writethrough);
> if (ret < 0) {
> error_report("Invalid source cache option: %s", src_cache);
> goto out;
> }
> ...
>
> How about moving that down?
Moving that down is OK, but then it will exit later when the -B option is used.
Which way do you think is better?


>
> Fam



Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-16 Thread 858585 jemmy
On Mon, Apr 17, 2017 at 11:49 AM, Fam Zheng  wrote:
> On Fri, 04/14 14:30, 858585 jemmy wrote:
>> Do you know some other format which have very small cluster size?
>
> 64k is the default cluster size for qcow2 but it can be configured at image
> creation time, as 512 bytes, for example:
>
> $ qemu-img create -f qcow2 test.qcow2 -o cluster_size=512 1G

Thanks, i will test the performance again.
>
> Fam



Re: [Qemu-devel] [Qemu-block] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-13 Thread 858585 jemmy
On Fri, Apr 14, 2017 at 2:38 PM, Stefan Hajnoczi  wrote:
> On Fri, Apr 14, 2017 at 7:00 AM, Fam Zheng  wrote:
>> On Thu, 04/13 10:34, jemmy858...@gmail.com wrote:
>>> From: Lidong Chen 
>>>
>>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>>> this may cause the qcow2 file size to be bigger after migration.
>>> This patch checks each cluster, using blk_pwrite_zeroes for each
>>> zero cluster.
>>>
>>> Reviewed-by: Stefan Hajnoczi 
>>> Signed-off-by: Lidong Chen 
>>> ---
>>> v6 changelog:
>>> Fix up some grammar in the comment.
>>> ---
>>>  migration/block.c | 35 +--
>>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/migration/block.c b/migration/block.c
>>> index 7734ff7..41c7a55 100644
>>> --- a/migration/block.c
>>> +++ b/migration/block.c
>>> @@ -885,6 +885,8 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  int64_t total_sectors = 0;
>>>  int nr_sectors;
>>>  int ret;
>>> +BlockDriverInfo bdi;
>>> +int cluster_size;
>>>
>>>  do {
>>>  addr = qemu_get_be64(f);
>>> @@ -919,6 +921,15 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  error_report_err(local_err);
>>>  return -EINVAL;
>>>  }
>>> +
>>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>>> +if (ret == 0 && bdi.cluster_size > 0 &&
>>> +bdi.cluster_size <= BLOCK_SIZE &&
>>> +BLOCK_SIZE % bdi.cluster_size == 0) {
>>> +cluster_size = bdi.cluster_size;
>>> +} else {
>>> +cluster_size = BLOCK_SIZE;
>>> +}
>>>  }
>>>
>>>  if (total_sectors - addr < BDRV_SECTORS_PER_DIRTY_CHUNK) {
>>> @@ -932,10 +943,30 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  nr_sectors * BDRV_SECTOR_SIZE,
>>>  BDRV_REQ_MAY_UNMAP);
>>>  } else {
>>> +int i;
>>> +int64_t cur_addr;
>>> +uint8_t *cur_buf;
>>> +
>>>  buf = g_malloc(BLOCK_SIZE);
>>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>>> +cur_addr = addr * BDRV_SECTOR_SIZE + i * cluster_size;
>>> +cur_buf = buf + i * cluster_size;
>>> +
>>> +if ((!block_mig_state.zero_blocks ||
>>> +cluster_size < BLOCK_SIZE) &&
>>> +buffer_is_zero(cur_buf, cluster_size)) {
>>> +ret = blk_pwrite_zeroes(blk, cur_addr,
>>> +cluster_size,
>>> +BDRV_REQ_MAY_UNMAP);
>>> +} else {
>>> +ret = blk_pwrite(blk, cur_addr, cur_buf,
>>> + cluster_size, 0);
>>> +}
>>> +if (ret < 0) {
>>> +break;
>>> +}
>>> +}
>>>  g_free(buf);
>>>  }
>>
>> Sorry for asking this question so late, but, before it gets too late: did you
>> evaluate the performance impact of this change under real world workload?
>>
>> Effectively, if no cluster is zero, this patch still splits a big write into
>> small ones, which is the opposition of usual performance optimizations (i.e.
>> trying to coalesce requests).
>
> Good point!
>
> Another patch can modify the loop to perform the largest writes
> possible.  In other words, do not perform the write immediately and
> keep a cluster counter instead.  When the zero/non-zero state changes,
> perform the write for the accumulated cluster count.

If the zero/non-zero state changes very frequently, that will not help.

I also considered this approach before I submitted this patch,
but I find the performance is almost the same for qcow2 with a
cluster_size of 65536.

I worry about some other format which has a very small cluster_size, for
example 512, but I haven't found one. Please tell me if you know of one,
and I will test it.

Do you think it's necessary that, if the cluster_size is too small, we
use cluster_size*N instead?
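To make the suggested approach concrete, here is a minimal standalone sketch
of it (my illustration, not QEMU code -- the buffer layout, sizes and the
emit_write() helper are made up). It walks a 1 MiB chunk in cluster_size
steps, coalesces consecutive clusters that are in the same zero/non-zero
state, and issues one write when the state changes:

#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define CHUNK_BYTES   (1 << 20)       /* like BLOCK_SIZE in migration/block.c */
#define CLUSTER_BYTES 65536

static bool cluster_is_zero(const unsigned char *p)
{
    static const unsigned char zero[CLUSTER_BYTES];
    return memcmp(p, zero, CLUSTER_BYTES) == 0;
}

static void emit_write(size_t offset, size_t len, bool zero)
{
    /* stand-in for blk_pwrite_zeroes() / blk_pwrite() */
    printf("%s offset=%zu len=%zu\n",
           zero ? "pwrite_zeroes" : "pwrite", offset, len);
}

int main(void)
{
    static unsigned char buf[CHUNK_BYTES];   /* first half zero, second not */
    size_t run_start = 0, off;
    bool run_zero = false;

    memset(buf + CHUNK_BYTES / 2, 0xab, CHUNK_BYTES / 2);

    for (off = 0; off < CHUNK_BYTES; off += CLUSTER_BYTES) {
        bool zero = cluster_is_zero(buf + off);
        if (off == 0) {
            run_zero = zero;
        } else if (zero != run_zero) {
            emit_write(run_start, off - run_start, run_zero);
            run_start = off;
            run_zero = zero;
        }
    }
    emit_write(run_start, CHUNK_BYTES - run_start, run_zero);
    return 0;
}

If the zero/non-zero state flips on every cluster, this degenerates back to
one request per cluster, which is the case mentioned above.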

Thanks.


>
> Stefan



Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-13 Thread 858585 jemmy
On Fri, Apr 14, 2017 at 2:00 PM, Fam Zheng  wrote:
> On Thu, 04/13 10:34, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this may cause the qcow2 file size to be bigger after migration.
>> This patch checks each cluster, using blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Reviewed-by: Stefan Hajnoczi 
>> Signed-off-by: Lidong Chen 
>> ---
>> v6 changelog:
>> Fix up some grammar in the comment.
>> ---
>>  migration/block.c | 35 +--
>>  1 file changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 7734ff7..41c7a55 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -885,6 +885,8 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  int64_t total_sectors = 0;
>>  int nr_sectors;
>>  int ret;
>> +BlockDriverInfo bdi;
>> +int cluster_size;
>>
>>  do {
>>  addr = qemu_get_be64(f);
>> @@ -919,6 +921,15 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  error_report_err(local_err);
>>  return -EINVAL;
>>  }
>> +
>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>> +if (ret == 0 && bdi.cluster_size > 0 &&
>> +bdi.cluster_size <= BLOCK_SIZE &&
>> +BLOCK_SIZE % bdi.cluster_size == 0) {
>> +cluster_size = bdi.cluster_size;
>> +} else {
>> +cluster_size = BLOCK_SIZE;
>> +}
>>  }
>>
>>  if (total_sectors - addr < BDRV_SECTORS_PER_DIRTY_CHUNK) {
>> @@ -932,10 +943,30 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  nr_sectors * BDRV_SECTOR_SIZE,
>>  BDRV_REQ_MAY_UNMAP);
>>  } else {
>> +int i;
>> +int64_t cur_addr;
>> +uint8_t *cur_buf;
>> +
>>  buf = g_malloc(BLOCK_SIZE);
>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>> +cur_addr = addr * BDRV_SECTOR_SIZE + i * cluster_size;
>> +cur_buf = buf + i * cluster_size;
>> +
>> +if ((!block_mig_state.zero_blocks ||
>> +cluster_size < BLOCK_SIZE) &&
>> +buffer_is_zero(cur_buf, cluster_size)) {
>> +ret = blk_pwrite_zeroes(blk, cur_addr,
>> +cluster_size,
>> +BDRV_REQ_MAY_UNMAP);
>> +} else {
>> +ret = blk_pwrite(blk, cur_addr, cur_buf,
>> + cluster_size, 0);
>> +}
>> +if (ret < 0) {
>> +break;
>> +}
>> +}
>>  g_free(buf);
>>  }
>
> Sorry for asking this question so late, but, before it gets too late: did you
> evaluate the performance impact of this change under real world workload?
>
> Effectively, if no cluster is zero, this patch still splits a big write into
> small ones, which is the opposition of usual performance optimizations (i.e.
> trying to coalesce requests).

I tested this patch for qcow2; the migration speed is the same as before
applying this patch.

Do you know some other format which has a very small cluster size?

>
> Fam



Re: [Qemu-devel] [PATCH v6] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-13 Thread 858585 jemmy
On Thu, Apr 13, 2017 at 10:16 PM, Stefan Hajnoczi  wrote:
> On Thu, Apr 13, 2017 at 10:34:28AM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this may cause the qcow2 file size to be bigger after migration.
>> This patch checks each cluster, using blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Reviewed-by: Stefan Hajnoczi 
>> Signed-off-by: Lidong Chen 
>> ---
>> v6 changelog:
>> Fix up some grammar in the comment.
>> ---
>>  migration/block.c | 35 +--
>>  1 file changed, 33 insertions(+), 2 deletions(-)
>
> I fixed the following gcc warning when merging the patch:
>
>   migration/block.c:958:25: error: ‘cluster_size’ may be used uninitialized 
> in this function [-Werror=maybe-uninitialized]
> buffer_is_zero(cur_buf, cluster_size)) {

Thanks, I will check for gcc warnings next time.
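(One possible way to silence that warning -- the actual fix applied in
Stefan's tree is not shown in this thread -- is to give cluster_size a
defined fallback when it is declared in block_load(), e.g.
int cluster_size = BLOCK_SIZE;)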

>
> Thanks, applied to my block-next tree:
> https://github.com/stefanha/qemu/commits/block-next
>
> Stefan



Re: [Qemu-devel] migrate -b problems

2017-04-12 Thread 858585 jemmy
Is it this bug?
https://bugs.launchpad.net/qemu/+bug/1681688

On Wed, Apr 12, 2017 at 5:18 PM, Kevin Wolf  wrote:
> Hi all,
>
> after getting assertion failure reports for block migration in the last
> minute, we just hacked around it by commenting out op blocker assertions
> for the 2.9 release, but now we need to see how to fix things properly.
> Luckily, get_maintainer.pl doesn't report me, but only you. :-)
>
> The main problem I see with the block migration code (on the
> destination) is that it abuses the BlockBackend that belongs to the
> guest device to make its own writes to the image file. If the guest
> isn't allowed to write to the image (which it now isn't during incoming
> migration since it would conflict with the newer style of block
> migration using an NBD server), writing to this BlockBackend doesn't
> work any more.
>
> So what should really happen is that incoming block migration creates
> its own BlockBackend for writing to the image. Now we don't want to do
> this anew for every incoming block, but ideally we'd just create all
> necessary BlockBackends upfront and then keep using them throughout the
> whole migration. Is there a way to get some setup/teardown callbacks
> at the start/end of the migration that could initialise and free such
> global data?
>
> The other problem with block migration is that is uses a BlockBackend
> name to identify which device is migrated. However, there can be images
> that are not attached to any BlockBackend, or if it is, the BlockBackend
> might be anonymous, so this doesn't work. I suppose changing the field
> to "device name if available, node-name otherwise" would solve this.
>
> Kevin
>



Re: [Qemu-devel] [PATCH v4] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-11 Thread 858585 jemmy
On Wed, Apr 12, 2017 at 9:27 AM, 858585 jemmy  wrote:
> On Tue, Apr 11, 2017 at 11:59 PM, Stefan Hajnoczi  wrote:
>> On Tue, Apr 11, 2017 at 08:05:12PM +0800, jemmy858...@gmail.com wrote:
>>> From: Lidong Chen 
>>>
>>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>>> this maybe cause the qcow2 file size is bigger after migration.
>>> This patch check each cluster, use blk_pwrite_zeroes for each
>>> zero cluster.
>>>
>>> Signed-off-by: Lidong Chen 
>>> ---
>>>  migration/block.c | 33 +++--
>>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/migration/block.c b/migration/block.c
>>> index 7734ff7..5d0635a 100644
>>> --- a/migration/block.c
>>> +++ b/migration/block.c
>>> @@ -885,6 +885,8 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  int64_t total_sectors = 0;
>>>  int nr_sectors;
>>>  int ret;
>>> +BlockDriverInfo bdi;
>>> +int cluster_size;
>>>
>>>  do {
>>>  addr = qemu_get_be64(f);
>>> @@ -919,6 +921,15 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  error_report_err(local_err);
>>>  return -EINVAL;
>>>  }
>>> +
>>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>>> +if (ret == 0 && bdi.cluster_size > 0 &&
>>> +bdi.cluster_size <= BLOCK_SIZE &&
>>> +BLOCK_SIZE % bdi.cluster_size == 0) {
>>> +cluster_size = bdi.cluster_size;
>>> +} else {
>>> +cluster_size = BLOCK_SIZE;
>>
>> This is a nice trick to unify code paths.  It has a disadvantage though:
>>
>> If the "zero blocks" migration capability is enabled and the drive has
>> no cluster_size (e.g. raw files), then the source QEMU process has
>> already scanned for zeroes.  CPU is wasted scanning for zeroes in the
>> destination QEMU process.
>>
>> Given that disk images can be large we should probably avoid unnecessary
>> scanning.  This is especially true because there's no other reason
>> (besides zero detection) to pollute the CPU cache with data from the
>> disk image.
>>
>> In other words, we should only scan for zeroes when
>> !block_mig_state.zero_blocks || cluster_size < BLOCK_SIZE.
>
> This case, the source qemu process will add BLK_MIG_FLAG_ZERO_BLOCK flag.
> It will call blk_pwrite_zeroes already before apply this patch.
> so destination QEMU process will not scanning for zero cluster.
>
> There are two reason cause the destination QEMU process receive the
> block which don't have
> BLK_MIG_FLAG_ZERO_BLOCK flag.
> 1.the source QEMU process is old version, or !block_mig_state.zero_blocks.
> 2.the content of BLOCK_SIZE is not zero.
>
> So if the destination QEMU process receive the block which don't have
> BLK_MIG_FLAG_ZERO_BLOCK flag, it already mee the condition
> !block_mig_state.zero_blocks || cluster_size < BLOCK_SIZE.
>
> so i think it's unnecessary to check this condition again.
>
> Thanks.

Sorry, you are right.
It will cause the destination QEMU process to scan for zeroes again.

How about this?

for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
    cur_addr = addr * BDRV_SECTOR_SIZE + i * cluster_size;
    cur_buf = buf + i * cluster_size;

    if ((!block_mig_state.zero_blocks ||
         cluster_size < BLOCK_SIZE) &&
        buffer_is_zero(cur_buf, cluster_size)) {
        ret = blk_pwrite_zeroes(blk, cur_addr,
                                cluster_size,
                                BDRV_REQ_MAY_UNMAP);
    } else {
        ret = blk_pwrite(blk, cur_addr, cur_buf,
                         cluster_size, 0);
    }
    if (ret < 0) {
        break;
    }
}


>
>>
>>> +}
>>>  }
>>>
>>>  if (total_sectors - addr < BDRV_SECTORS_PER_DIRTY_CHUNK) {
>>> @@ -932,10 +943,28 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  nr_sectors * BDRV_SECTOR_SIZE,
>>>

Re: [Qemu-devel] [PATCH v4] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-11 Thread 858585 jemmy
On Tue, Apr 11, 2017 at 11:59 PM, Stefan Hajnoczi  wrote:
> On Tue, Apr 11, 2017 at 08:05:12PM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this maybe cause the qcow2 file size is bigger after migration.
>> This patch check each cluster, use blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/block.c | 33 +++--
>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 7734ff7..5d0635a 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -885,6 +885,8 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  int64_t total_sectors = 0;
>>  int nr_sectors;
>>  int ret;
>> +BlockDriverInfo bdi;
>> +int cluster_size;
>>
>>  do {
>>  addr = qemu_get_be64(f);
>> @@ -919,6 +921,15 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  error_report_err(local_err);
>>  return -EINVAL;
>>  }
>> +
>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>> +if (ret == 0 && bdi.cluster_size > 0 &&
>> +bdi.cluster_size <= BLOCK_SIZE &&
>> +BLOCK_SIZE % bdi.cluster_size == 0) {
>> +cluster_size = bdi.cluster_size;
>> +} else {
>> +cluster_size = BLOCK_SIZE;
>
> This is a nice trick to unify code paths.  It has a disadvantage though:
>
> If the "zero blocks" migration capability is enabled and the drive has
> no cluster_size (e.g. raw files), then the source QEMU process has
> already scanned for zeroes.  CPU is wasted scanning for zeroes in the
> destination QEMU process.
>
> Given that disk images can be large we should probably avoid unnecessary
> scanning.  This is especially true because there's no other reason
> (besides zero detection) to pollute the CPU cache with data from the
> disk image.
>
> In other words, we should only scan for zeroes when
> !block_mig_state.zero_blocks || cluster_size < BLOCK_SIZE.

In this case, the source QEMU process will add the BLK_MIG_FLAG_ZERO_BLOCK flag
and will already call blk_pwrite_zeroes before this patch is applied,
so the destination QEMU process will not scan for zero clusters.

There are two reasons why the destination QEMU process can receive a block
which does not have the BLK_MIG_FLAG_ZERO_BLOCK flag:
1. the source QEMU process is an old version, or !block_mig_state.zero_blocks;
2. the content of the BLOCK_SIZE chunk is not zero.

So if the destination QEMU process receives a block which does not have the
BLK_MIG_FLAG_ZERO_BLOCK flag, it already meets the condition
!block_mig_state.zero_blocks || cluster_size < BLOCK_SIZE.

So I think it's unnecessary to check this condition again.

Thanks.

>
>> +}
>>  }
>>
>>  if (total_sectors - addr < BDRV_SECTORS_PER_DIRTY_CHUNK) {
>> @@ -932,10 +943,28 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  nr_sectors * BDRV_SECTOR_SIZE,
>>  BDRV_REQ_MAY_UNMAP);
>>  } else {
>> +int i;
>> +int64_t cur_addr;
>> +uint8_t *cur_buf;
>> +
>>  buf = g_malloc(BLOCK_SIZE);
>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>> +cur_addr = addr * BDRV_SECTOR_SIZE + i * cluster_size;
>> +cur_buf = buf + i * cluster_size;
>> +
>> +if (buffer_is_zero(cur_buf, cluster_size)) {
>> +ret = blk_pwrite_zeroes(blk, cur_addr,
>> +cluster_size,
>> +BDRV_REQ_MAY_UNMAP);
>> +} else {
>> +ret = blk_pwrite(blk, cur_addr, cur_buf,
>> + cluster_size, 0);
>> +}
>> +if (ret < 0) {
>> +break;
>> +}
>> +}
>>  g_free(buf);
>>  }



Re: [Qemu-devel] [Bug 1681688] Re: qemu live migration failed

2017-04-11 Thread 858585 jemmy
Hi Kevin:
   Can you provide some information about the original bug which you wanted to fix?

   The original comment was:
   Usually guest devices don't like other writers to the same image, so
   they use blk_set_perm() to prevent this from happening.

   I don't find where the destination QEMU will use blk_set_perm() during
   migration, but after applying this patch, blkconf_apply_backend_options()
   doesn't update blk->root->perm.
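(For reference, the Breakpoint 1 backtrace quoted below can be reproduced
with a gdb session along these lines -- illustrative commands, not from the
original report:

    gdb --args qemu-system-x86_64 <usual destination command line>
    (gdb) break blk_set_perm
    (gdb) run
    (gdb) bt
)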

Thanks.

On Tue, Apr 11, 2017 at 4:21 PM, Lidong Chen  wrote:
> blk->root->perm is 1 when blk_new_open.
>
> the blk->root->perm is update to 3 during virtio_blk_device_realize.
>
> but after this commit, the blk->root->perm is still 1. and cause
> bdrv_aligned_pwritev failed.
>
> Breakpoint 1, blk_set_perm (blk=0x14c32b0, perm=3, shared_perm=29, 
> errp=0x7fffd380) at block/block-backend.c:579
> 579 {
> (gdb) bt
> #0  blk_set_perm (blk=0x14c32b0, perm=3, shared_perm=29, errp=0x7fffd380) 
> at block/block-backend.c:579
> #1  0x0063484b in blkconf_apply_backend_options (conf=0x2de7fd0, 
> readonly=false, resizable=true, errp=0x7fffd380) at hw/block/block.c:77
> #2  0x004a57bd in virtio_blk_device_realize (dev=0x2de7e30, 
> errp=0x7fffd3e0) at /data/qemu/hw/block/virtio-blk.c:931
> #3  0x004f688e in virtio_device_realize (dev=0x2de7e30, 
> errp=0x7fffd468) at /data/qemu/hw/virtio/virtio.c:2485
> #4  0x0065806f in device_set_realized (obj=0x2de7e30, value=true, 
> errp=0x7fffd6d8) at hw/core/qdev.c:939
> #5  0x0083aaf5 in property_set_bool (obj=0x2de7e30, v=0x2e67a90, 
> name=0xaf4b53 "realized", opaque=0x2de9660, errp=0x7fffd6d8) at 
> qom/object.c:1860
> #6  0x00838c46 in object_property_set (obj=0x2de7e30, v=0x2e67a90, 
> name=0xaf4b53 "realized", errp=0x7fffd6d8) at qom/object.c:1094
> #7  0x0083c23f in object_property_set_qobject (obj=0x2de7e30, 
> value=0x2e679e0, name=0xaf4b53 "realized", errp=0x7fffd6d8) at 
> qom/qom-qobject.c:27
> #8  0x00838f9a in object_property_set_bool (obj=0x2de7e30, 
> value=true, name=0xaf4b53 "realized", errp=0x7fffd6d8) at 
> qom/object.c:1163
> #9  0x007bafac in virtio_blk_pci_realize (vpci_dev=0x2ddf920, 
> errp=0x7fffd6d8) at hw/virtio/virtio-pci.c:1975
> #10 0x007ba966 in virtio_pci_realize (pci_dev=0x2ddf920, 
> errp=0x7fffd6d8) at hw/virtio/virtio-pci.c:1853
> #11 0x0071e439 in pci_qdev_realize (qdev=0x2ddf920, 
> errp=0x7fffd7b8) at hw/pci/pci.c:2001
> #12 0x007badaa in virtio_pci_dc_realize (qdev=0x2ddf920, 
> errp=0x7fffd7b8) at hw/virtio/virtio-pci.c:1930
> #13 0x0065806f in device_set_realized (obj=0x2ddf920, value=true, 
> errp=0x7fffd9a8) at hw/core/qdev.c:939
> #14 0x0083aaf5 in property_set_bool (obj=0x2ddf920, v=0x2decfd0, 
> name=0x9b2c0e "realized", opaque=0x2ddf5d0, errp=0x7fffd9a8) at 
> qom/object.c:1860
> #15 0x00838c46 in object_property_set (obj=0x2ddf920, v=0x2decfd0, 
> name=0x9b2c0e "realized", errp=0x7fffd9a8) at qom/object.c:1094
> #16 0x0083c23f in object_property_set_qobject (obj=0x2ddf920, 
> value=0x2dece90, name=0x9b2c0e "realized", errp=0x7fffd9a8) at 
> qom/qom-qobject.c:27
> #17 0x00838f9a in object_property_set_bool (obj=0x2ddf920, 
> value=true, name=0x9b2c0e "realized", errp=0x7fffd9a8) at 
> qom/object.c:1163
> #18 0x005bfcea in qdev_device_add (opts=0x1451320, 
> errp=0x7fffda30) at qdev-monitor.c:624
> #19 0x005c9662 in device_init_func (opaque=0x0, opts=0x1451320, 
> errp=0x0) at vl.c:2305
> #20 0x0095f491 in qemu_opts_foreach (list=0xe5bd80, func=0x5c9624 
> , opaque=0x0, errp=0x0) at util/qemu-option.c:1114
> #21 0x005ce9be in main (argc=46, argv=0x7fffdeb8, 
> envp=0x7fffe030) at vl.c:4583
>
> --
> You received this bug notification because you are a member of qemu-
> devel-ml, which is subscribed to QEMU.
> https://bugs.launchpad.net/bugs/1681688
>
> Title:
>   qemu live migration failed
>
> Status in QEMU:
>   New
>
> Bug description:
>   qemu live migration failed
>
>   the dest qemu report this error.
>
>   Receiving block device images
>   Completed 0 %^Mqemu-system-x86_64: block/io.c:1348: bdrv_aligned_pwritev: 
> Assertion `child->perm & BLK_PERM_WRITE' failed.
>
>   this bug is caused by this patch:
>   
> http://git.qemu-project.org/?p=qemu.git;a=commit;h=d35ff5e6b3aa3a706b0aa3bcf11400fac945b67a
>
>   rollback this commit, the problem solved.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/qemu/+bug/1681688/+subscriptions
>



Re: [Qemu-devel] [Qemu-block] [PATCH v3] migration/block:limit the time used for block migration

2017-04-11 Thread 858585 jemmy
On Tue, Apr 11, 2017 at 8:19 PM, 858585 jemmy  wrote:
> On Mon, Apr 10, 2017 at 9:52 PM, Stefan Hajnoczi  wrote:
>> On Sat, Apr 08, 2017 at 09:17:58PM +0800, 858585 jemmy wrote:
>>> On Fri, Apr 7, 2017 at 7:34 PM, Stefan Hajnoczi  wrote:
>>> > On Fri, Apr 07, 2017 at 09:30:33AM +0800, 858585 jemmy wrote:
>>> >> On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  
>>> >> wrote:
>>> >> > On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>>> >> >
>>> >> > A proper solution is to refactor the synchronous code to make it
>>> >> > asynchronous.  This might require invoking the system call from a
>>> >> > thread pool worker.
>>> >> >
>>> >>
>>> >> yes, i agree with you, but this is a big change.
>>> >> I will try to find how to optimize this code, maybe need a long time.
>>> >>
>>> >> this patch is not a perfect solution, but can alleviate the problem.
>>> >
>>> > Let's try to understand the problem fully first.
>>> >
>>>
>>> when migrate the vm with high speed, i find vnc response slowly sometime.
>>> not only vnc response slowly, virsh console aslo response slowly sometime.
>>> and the guest os block io performance is also reduce.
>>>
>>> the bug can be reproduce by this command:
>>> virsh migrate-setspeed 165cf436-312f-47e7-90f2-f8aa63f34893 900
>>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>>> --copy-storage-inc qemu+ssh://10.59.163.38/system
>>>
>>> and --copy-storage-all have no problem.
>>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>>> --copy-storage-all qemu+ssh://10.59.163.38/system
>>>
>>> compare the difference between --copy-storage-inc and
>>> --copy-storage-all. i find out the reason is
>>> mig_save_device_bulk invoke bdrv_is_allocated, but bdrv_is_allocated
>>> is synchronous and maybe wait
>>> for a long time.
>>>
>>> i write this code to measure the time used by  brdrv_is_allocated()
>>>
>>>  279 static int max_time = 0;
>>>  280 int tmp;
>>>
>>>  288 clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
>>>  289 ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
>>>  290 MAX_IS_ALLOCATED_SEARCH, 
>>> &nr_sectors);
>>>  291 clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
>>>  292
>>>  293
>>>  294 tmp =  (ts2.tv_sec - ts1.tv_sec)*10L
>>>  295+ (ts2.tv_nsec - ts1.tv_nsec);
>>>  296 if (tmp > max_time) {
>>>  297max_time=tmp;
>>>  298fprintf(stderr, "max_time is %d\n", max_time);
>>>  299 }
>>>
>>> the test result is below:
>>>
>>>  max_time is 37014
>>>  max_time is 1075534
>>>  max_time is 17180913
>>>  max_time is 28586762
>>>  max_time is 49563584
>>>  max_time is 103085447
>>>  max_time is 110836833
>>>  max_time is 120331438
>>>
>>> bdrv_is_allocated is called after qemu_mutex_lock_iothread.
>>> and the main thread is also call qemu_mutex_lock_iothread.
>>> so cause the main thread maybe wait for a long time.
>>>
>>>if (bmds->shared_base) {
>>> qemu_mutex_lock_iothread();
>>> aio_context_acquire(blk_get_aio_context(bb));
>>> /* Skip unallocated sectors; intentionally treats failure as
>>>  * an allocated sector */
>>> while (cur_sector < total_sectors &&
>>>!bdrv_is_allocated(blk_bs(bb), cur_sector,
>>>   MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>>> cur_sector += nr_sectors;
>>> }
>>> aio_context_release(blk_get_aio_context(bb));
>>> qemu_mutex_unlock_iothread();
>>> }
>>>
>>> #0  0x7f107322f264 in __lll_lock_wait () from /lib64/libpthread.so.0
>>> #1  0x7f107322a508 in _L_lock_854 () from /lib64/libpthread.so.0
>>> #2  0x7f107322a3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
>>> #3  0x00949ecb in qemu_mutex_lock (mutex=0xfc51a0) at
>>> util/qemu-thread-posix.c:60
>>> #4  0x00459e58 in qemu_mutex_lock_iothread () at 
>>> /r

Re: [Qemu-devel] [Qemu-block] [PATCH v3] migration/block:limit the time used for block migration

2017-04-11 Thread 858585 jemmy
On Mon, Apr 10, 2017 at 9:52 PM, Stefan Hajnoczi  wrote:
> On Sat, Apr 08, 2017 at 09:17:58PM +0800, 858585 jemmy wrote:
>> On Fri, Apr 7, 2017 at 7:34 PM, Stefan Hajnoczi  wrote:
>> > On Fri, Apr 07, 2017 at 09:30:33AM +0800, 858585 jemmy wrote:
>> >> On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  
>> >> wrote:
>> >> > On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> >> >
>> >> > A proper solution is to refactor the synchronous code to make it
>> >> > asynchronous.  This might require invoking the system call from a
>> >> > thread pool worker.
>> >> >
>> >>
>> >> yes, i agree with you, but this is a big change.
>> >> I will try to find how to optimize this code, maybe need a long time.
>> >>
>> >> this patch is not a perfect solution, but can alleviate the problem.
>> >
>> > Let's try to understand the problem fully first.
>> >
>>
>> when migrate the vm with high speed, i find vnc response slowly sometime.
>> not only vnc response slowly, virsh console aslo response slowly sometime.
>> and the guest os block io performance is also reduce.
>>
>> the bug can be reproduce by this command:
>> virsh migrate-setspeed 165cf436-312f-47e7-90f2-f8aa63f34893 900
>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>> --copy-storage-inc qemu+ssh://10.59.163.38/system
>>
>> and --copy-storage-all have no problem.
>> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
>> --copy-storage-all qemu+ssh://10.59.163.38/system
>>
>> compare the difference between --copy-storage-inc and
>> --copy-storage-all. i find out the reason is
>> mig_save_device_bulk invoke bdrv_is_allocated, but bdrv_is_allocated
>> is synchronous and maybe wait
>> for a long time.
>>
>> i write this code to measure the time used by  brdrv_is_allocated()
>>
>>  279 static int max_time = 0;
>>  280 int tmp;
>>
>>  288 clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
>>  289 ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
>>  290 MAX_IS_ALLOCATED_SEARCH, 
>> &nr_sectors);
>>  291 clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
>>  292
>>  293
>>  294 tmp =  (ts2.tv_sec - ts1.tv_sec)*10L
>>  295+ (ts2.tv_nsec - ts1.tv_nsec);
>>  296 if (tmp > max_time) {
>>  297max_time=tmp;
>>  298fprintf(stderr, "max_time is %d\n", max_time);
>>  299 }
>>
>> the test result is below:
>>
>>  max_time is 37014
>>  max_time is 1075534
>>  max_time is 17180913
>>  max_time is 28586762
>>  max_time is 49563584
>>  max_time is 103085447
>>  max_time is 110836833
>>  max_time is 120331438
>>
>> bdrv_is_allocated is called after qemu_mutex_lock_iothread.
>> and the main thread is also call qemu_mutex_lock_iothread.
>> so cause the main thread maybe wait for a long time.
>>
>>if (bmds->shared_base) {
>> qemu_mutex_lock_iothread();
>> aio_context_acquire(blk_get_aio_context(bb));
>> /* Skip unallocated sectors; intentionally treats failure as
>>  * an allocated sector */
>> while (cur_sector < total_sectors &&
>>!bdrv_is_allocated(blk_bs(bb), cur_sector,
>>   MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>> cur_sector += nr_sectors;
>> }
>> aio_context_release(blk_get_aio_context(bb));
>> qemu_mutex_unlock_iothread();
>> }
>>
>> #0  0x7f107322f264 in __lll_lock_wait () from /lib64/libpthread.so.0
>> #1  0x7f107322a508 in _L_lock_854 () from /lib64/libpthread.so.0
>> #2  0x7f107322a3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
>> #3  0x00949ecb in qemu_mutex_lock (mutex=0xfc51a0) at
>> util/qemu-thread-posix.c:60
>> #4  0x00459e58 in qemu_mutex_lock_iothread () at 
>> /root/qemu/cpus.c:1516
>> #5  0x00945322 in os_host_main_loop_wait (timeout=28911939) at
>> util/main-loop.c:258
>> #6  0x009453f2 in main_loop_wait (nonblocking=0) at 
>> util/main-loop.c:517
>> #7  0x005c76b4 in main_loop () at vl.c:1898
>> #8  0x005ceb77 in main (argc=49, argv=0x7fff921182b8,
>> envp=0x7fff92118448) at vl.c:4709
>
> The following patch moves b

Re: [Qemu-devel] [PATCH v3] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-10 Thread 858585 jemmy
On Tue, Apr 11, 2017 at 12:00 AM, Stefan Hajnoczi  wrote:
> On Sun, Apr 09, 2017 at 08:37:40PM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this maybe cause the qcow2 file size is bigger after migration.
>> This patch check each cluster, use blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/block.c | 38 --
>>  1 file changed, 36 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 7734ff7..fe613db 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -885,6 +885,8 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  int64_t total_sectors = 0;
>>  int nr_sectors;
>>  int ret;
>> +BlockDriverInfo bdi;
>> +int cluster_size;
>>
>>  do {
>>  addr = qemu_get_be64(f);
>> @@ -934,8 +936,40 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  } else {
>>  buf = g_malloc(BLOCK_SIZE);
>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +
>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>> +cluster_size = bdi.cluster_size;
>> +
>> +if (ret == 0 && cluster_size > 0 &&
>> +cluster_size <= BLOCK_SIZE &&
>> +BLOCK_SIZE % cluster_size == 0) {
>
> How about:
>
>   if (blk != blk_prev) {
>   blk_prev = blk;
>   total_sectors = blk_nb_sectors(blk);
>   if (total_sectors <= 0) {
>   error_report("Error getting length of block device %s",
>device_name);
>   return -EINVAL;
>   }
>
>   blk_invalidate_cache(blk, &local_err);
>   if (local_err) {
>   error_report_err(local_err);
>   return -EINVAL;
>   }
>
> + ret = bdrv_get_info(blk_bs(blk), &bdi);
> + if (ret == 0 && cluster_size > 0 && cluster_size <= BLOCK_SIZE &&
> + BLOCK_SIZE % cluster_size == 0) {
> + zero_cluster_size = bdi.cluster_size;
> + } else {
> + zero_cluster_size = 0;
> + }
>   }
>
> That way we only fetch the cluster size once per device.
>
> When processing a block we do without repeatedly fetching the cluster
> size:
>
>   if (zero_cluster_size) {
>  ...detect zeroes...
>   }

Good idea, I will test this patch again.
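
For reference, here is a rough, untested sketch of how I read the suggestion
(one detail: the check has to look at bdi.cluster_size, since cluster_size is
no longer a separate variable at that point):

    if (blk != blk_prev) {
        BlockDriverInfo bdi;

        blk_prev = blk;
        /* ... existing total_sectors / blk_invalidate_cache handling ... */

        ret = bdrv_get_info(blk_bs(blk), &bdi);
        if (ret == 0 && bdi.cluster_size > 0 &&
            bdi.cluster_size <= BLOCK_SIZE &&
            BLOCK_SIZE % bdi.cluster_size == 0) {
            zero_cluster_size = bdi.cluster_size;
        } else {
            zero_cluster_size = 0;   /* fall back to one big blk_pwrite() */
        }
    }

Then the per-block path only needs to test zero_cluster_size, as you wrote.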

>
>> +int i;
>> +int64_t cur_addr;
>> +uint8_t *cur_buf;
>> +
>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>> +cur_addr = addr * BDRV_SECTOR_SIZE
>> ++ i * cluster_size;
>> +cur_buf = buf + i * cluster_size;
>> +
>> +if (buffer_is_zero(cur_buf, cluster_size)) {
>> +ret = blk_pwrite_zeroes(blk, cur_addr,
>> +cluster_size,
>> +BDRV_REQ_MAY_UNMAP);
>> +} else {
>> + ret = blk_pwrite(blk, cur_addr, cur_buf,
>> +  cluster_size, 0);
>
> Indentation is off here.
>
>> +}
>> +
>> +if (ret < 0) {
>> +g_free(buf);
>> +return ret;
>> +}
>> +}
>> +} else {
>> +ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> + nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +}
>>  g_free(buf);
>>  }
>>
>> --
>> 1.8.3.1
>>



Re: [Qemu-devel] [Bug 1662050] Re: qemu-img convert a overlay qcow2 image into a entire image

2017-04-09 Thread 858585 jemmy
On Mon, Apr 10, 2017 at 1:07 PM, wayen <1662...@bugs.launchpad.net> wrote:
> Hi Lidong Chen:
> I used QEMU 2.0.0 on Ubuntu 14.04.
> Do you mean your patch can make qemu-img convert qcow2 overlay into a new 
> overlay?

Yes, but I find it has already been fixed in 2.0.0.
Did you add the -o backing_file= option to the command?
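
For example (assuming the backing file is base.qcow2), something like:

    qemu-img convert -O qcow2 -o backing_file=base.qcow2 delta.qcow2 new-delta.qcow2

(or the equivalent -B base.qcow2) should produce new-delta.qcow2 as an overlay
on top of base.qcow2 instead of a flattened image.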

>
> --
> You received this bug notification because you are a member of qemu-
> devel-ml, which is subscribed to QEMU.
> https://bugs.launchpad.net/bugs/1662050
>
> Title:
>   qemu-img convert a overlay qcow2 image into a entire image
>
> Status in QEMU:
>   Incomplete
>
> Bug description:
>   I have a base image file "base.qcow2" and a delta qcow2 image file
>   "delta.qcow2" whose backing file is "base.qcow2".
>
>   Now I use qemu-img to convert "delta.qcow2" and will get a new image
>   file "new.qcow2" which is entire and equivalent to combination of
>   "base.qcow2" and "delta.qcow2".
>
>   In fact,I don't want to get a complete image.I just want to convert
>   delta qcow2 image file "A" to a New delta overlay qcow2 image "B"
>   which is equivalent to "A". So the "new.qcow2" is not what i want. I
>   have to admit that this is not bug. Could you please take this as a
>   new feature and enable qemu-img to convert qcow2 overlay itself?
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/qemu/+bug/1662050/+subscriptions
>



Re: [Qemu-devel] [Bug 1662050] Re: qemu-img convert a overlay qcow2 image into a entire image

2017-04-09 Thread 858585 jemmy
Hi wayen:
Which version are you using?
I also hit this problem on an old version of qemu, and I wrote a patch
for it. It works.
I'm not sure whether the mainline version has solved this problem.
What command are you using?

On Mon, Apr 10, 2017 at 10:14 AM, wayen <1662...@bugs.launchpad.net> wrote:
> Is there any way to remove holes from qcow2 overlay images? It's very
> important to me. I am looking forward to your reply.
>
> --
> You received this bug notification because you are a member of qemu-
> devel-ml, which is subscribed to QEMU.
> https://bugs.launchpad.net/bugs/1662050
>
> Title:
>   qemu-img convert a overlay qcow2 image into a entire image
>
> Status in QEMU:
>   Incomplete
>
> Bug description:
>   I have a base image file "base.qcow2" and a delta qcow2 image file
>   "delta.qcow2" whose backing file is "base.qcow2".
>
>   Now I use qemu-img to convert "delta.qcow2" and will get a new image
>   file "new.qcow2" which is entire and equivalent to combination of
>   "base.qcow2" and "delta.qcow2".
>
>   In fact,I don't want to get a complete image.I just want to convert
>   delta qcow2 image file "A" to a New delta overlay qcow2 image "B"
>   which is equivalent to "A". So the "new.qcow2" is not what i want. I
>   have to admit that this is not bug. Could you please take this as a
>   new feature and enable qemu-img to convert qcow2 overlay itself?
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/qemu/+bug/1662050/+subscriptions
>



Re: [Qemu-devel] [PATCH v2] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-09 Thread 858585 jemmy
On Mon, Apr 10, 2017 at 9:47 AM, Fam Zheng  wrote:
> On Sat, 04/08 21:29, 858585 jemmy wrote:
>> On Sat, Apr 8, 2017 at 12:52 PM, 858585 jemmy  wrote:
>> > On Fri, Apr 7, 2017 at 6:10 PM, Fam Zheng  wrote:
>> >> On Fri, 04/07 16:44, jemmy858...@gmail.com wrote:
>> >>> From: Lidong Chen 
>> >>>
>> >>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> >>> this maybe cause the qcow2 file size is bigger after migration.
>> >>> This patch check each cluster, use blk_pwrite_zeroes for each
>> >>> zero cluster.
>> >>>
>> >>> Signed-off-by: Lidong Chen 
>> >>> ---
>> >>>  migration/block.c | 37 +++--
>> >>>  1 file changed, 35 insertions(+), 2 deletions(-)
>> >>>
>> >>> diff --git a/migration/block.c b/migration/block.c
>> >>> index 7734ff7..c32e046 100644
>> >>> --- a/migration/block.c
>> >>> +++ b/migration/block.c
>> >>> @@ -885,6 +885,11 @@ static int block_load(QEMUFile *f, void *opaque, 
>> >>> int version_id)
>> >>>  int64_t total_sectors = 0;
>> >>>  int nr_sectors;
>> >>>  int ret;
>> >>> +int i;
>> >>> +int64_t addr_offset;
>> >>> +uint8_t *buf_offset;
>> >>
>> >> Poor variable names, they are not offset, maybe "cur_addr" and "cur_buf"? 
>> >> And
>> >> they can be moved to the loop block below.
>> > ok, i will change.
>> >
>> >>
>> >>> +BlockDriverInfo bdi;
>> >>> +int cluster_size;
>> >>>
>> >>>  do {
>> >>>  addr = qemu_get_be64(f);
>> >>> @@ -934,8 +939,36 @@ static int block_load(QEMUFile *f, void *opaque, 
>> >>> int version_id)
>> >>>  } else {
>> >>>  buf = g_malloc(BLOCK_SIZE);
>> >>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> >>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> >>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>> >>> +
>> >>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>> >>> +cluster_size = bdi.cluster_size;
>> >>> +
>> >>> +if (ret == 0 && cluster_size > 0 &&
>> >>> +cluster_size < BLOCK_SIZE &&
>> >>
>> >> I think cluster_size == BLOCK_SIZE should work too.
>> > This case the (flags & BLK_MIG_FLAG_ZERO_BLOCK) should be true,
>> > and will invoke blk_pwrite_zeroes before apply this patch.
>> > but maybe the source qemu maybe not enabled zero flag.
>> > so i think cluster_size <= BLOCK_SIZE is ok.
>> >
>> >>
>> >>> +BLOCK_SIZE % cluster_size == 0) {
>> >>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>> >>> +addr_offset = addr * BDRV_SECTOR_SIZE
>> >>> ++ i * cluster_size;
>> >>> +buf_offset = buf + i * cluster_size;
>> >>> +
>> >>> +if (buffer_is_zero(buf_offset, cluster_size)) {
>> >>> +ret = blk_pwrite_zeroes(blk, addr_offset,
>> >>> +cluster_size,
>> >>> +BDRV_REQ_MAY_UNMAP);
>> >>> +} else {
>> >>> + ret = blk_pwrite(blk, addr_offset, 
>> >>> buf_offset,
>> >>> +  cluster_size, 0);
>> >>> +}
>> >>> +
>> >>> +if (ret < 0) {
>> >>> +g_free(buf);
>> >>> +return ret;
>> >>> +}
>> >>> +}
>> >>> +} else {
>> >>> +ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> >>> + nr_sectors * BDRV_SECTOR_SIZE, 0);
&

Re: [Qemu-devel] [PATCH v3] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-09 Thread 858585 jemmy
On Mon, Apr 10, 2017 at 8:51 AM, Fam Zheng  wrote:
> On Sun, 04/09 20:37, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this maybe cause the qcow2 file size is bigger after migration.
>> This patch check each cluster, use blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/block.c | 38 --
>>  1 file changed, 36 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 7734ff7..fe613db 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -885,6 +885,8 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  int64_t total_sectors = 0;
>>  int nr_sectors;
>>  int ret;
>> +BlockDriverInfo bdi;
>> +int cluster_size;
>>
>>  do {
>>  addr = qemu_get_be64(f);
>> @@ -934,8 +936,40 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  } else {
>>  buf = g_malloc(BLOCK_SIZE);
>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +
>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>> +cluster_size = bdi.cluster_size;
>> +
>> +if (ret == 0 && cluster_size > 0 &&
>> +cluster_size <= BLOCK_SIZE &&
>> +BLOCK_SIZE % cluster_size == 0) {
>> +int i;
>> +int64_t cur_addr;
>> +uint8_t *cur_buf;
>> +
>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>> +cur_addr = addr * BDRV_SECTOR_SIZE
>> ++ i * cluster_size;
>> +cur_buf = buf + i * cluster_size;
>> +
>> +if (buffer_is_zero(cur_buf, cluster_size)) {
>> +ret = blk_pwrite_zeroes(blk, cur_addr,
>> +cluster_size,
>> +BDRV_REQ_MAY_UNMAP);
>> +} else {
>> + ret = blk_pwrite(blk, cur_addr, cur_buf,
>> +  cluster_size, 0);
>> +}
>> +
>> +if (ret < 0) {
>> +g_free(buf);
>> +return ret;
>> +}
>
> This if block is not necessary because...

Hi Fam:
  It's necessary to check that each cluster is written successfully.
  If we remove this if block, some errors may be ignored, because only
  the last cluster's return value would be checked.
  Thanks.

>
>> +}
>> +} else {
>> +ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> + nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +}
>>  g_free(buf);
> ...
>
>
> if (ret < 0) {
> return ret;
> }
>>  }
>>
>> --
>> 1.8.3.1
>>
>>
>
> If you remove that:
>
> Reviewed-by: Fam Zheng 



Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration

2017-04-09 Thread 858585 jemmy
On Fri, Apr 7, 2017 at 7:33 PM, Stefan Hajnoczi  wrote:
> On Fri, Apr 07, 2017 at 09:30:33AM +0800, 858585 jemmy wrote:
>> On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  wrote:
>> > On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> >> From: Lidong Chen 
>> >>
>> >> when migration with high speed, mig_save_device_bulk invoke
>> >> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
>> >> this patch limit the time used for bdrv_is_allocated.
>> >
>> > bdrv_is_allocated() is supposed to yield back to the event loop if it
>> > needs to block.  If your VNC session is experiencing jitter then it's
>> > probably because a system call in the bdrv_is_allocated() code path is
>> > synchronous when it should be asynchronous.
>> >
>> > You could try to identify the system call using strace -f -T.  In the
>> > output you'll see the duration of each system call.  I guess there is a
>> > file I/O system call that is taking noticable amounts of time.
>>
>> yes, i find the reason where bdrv_is_allocated needs to block.
>>
>> the mainly reason is caused by qemu_co_mutex_lock invoked by
>> qcow2_co_get_block_status.
>> qemu_co_mutex_lock(&s->lock);
>> ret = qcow2_get_cluster_offset(bs, sector_num << 9, &bytes,
>>&cluster_offset);
>> qemu_co_mutex_unlock(&s->lock);
>>
>> other reason is caused by l2_load invoked by
>> qcow2_get_cluster_offset.
>>
>> /* load the l2 table in memory */
>>
>> ret = l2_load(bs, l2_offset, &l2_table);
>> if (ret < 0) {
>> return ret;
>> }
>
> The migration thread is holding the QEMU global mutex, the AioContext,
> and the qcow2 s->lock while the L2 table is read from disk.
>
> The QEMU global mutex is needed for block layer operations that touch
> the global drives list.  bdrv_is_allocated() can be called without the
> global mutex.
>
> The VNC server's file descriptor is not in the BDS AioContext.
> Therefore it can be processed while the migration thread holds the
> AioContext and qcow2 s->lock.
>
> Does the following patch solve the problem?
>
> diff --git a/migration/block.c b/migration/block.c
> index 7734ff7..072fc20 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -276,6 +276,7 @@ static int mig_save_device_bulk(QEMUFile *f, 
> BlkMigDevState *bmds)
>  if (bmds->shared_base) {
>  qemu_mutex_lock_iothread();
>  aio_context_acquire(blk_get_aio_context(bb));
> +qemu_mutex_unlock_iothread();
>  /* Skip unallocated sectors; intentionally treats failure as
>   * an allocated sector */
>  while (cur_sector < total_sectors &&
> @@ -283,6 +284,7 @@ static int mig_save_device_bulk(QEMUFile *f, 
> BlkMigDevState *bmds)
>MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>  cur_sector += nr_sectors;
>  }
> +qemu_mutex_lock_iothread();
>  aio_context_release(blk_get_aio_context(bb));
>  qemu_mutex_unlock_iothread();
>  }
>

This patch doesn't work; qemu locks up.
The stack of the main thread:
(gdb) bt
#0  0x7f4256c89264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f4256c84523 in _L_lock_892 () from /lib64/libpthread.so.0
#2  0x7f4256c84407 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00949f47 in qemu_mutex_lock (mutex=0x1b04a60) at
util/qemu-thread-posix.c:60
#4  0x009424cf in aio_context_acquire (ctx=0x1b04a00) at
util/async.c:484
#5  0x00942b86 in thread_pool_completion_bh (opaque=0x1b25a10)
at util/thread-pool.c:168
#6  0x00941610 in aio_bh_call (bh=0x1b1d570) at util/async.c:90
#7  0x009416bb in aio_bh_poll (ctx=0x1b04a00) at util/async.c:118
#8  0x00946baa in aio_dispatch (ctx=0x1b04a00) at util/aio-posix.c:429
#9  0x00941b30 in aio_ctx_dispatch (source=0x1b04a00,
callback=0, user_data=0x0)
at util/async.c:261
#10 0x7f4257670f0e in g_main_context_dispatch () from
/lib64/libglib-2.0.so.0
#11 0x00945282 in glib_pollfds_poll () at util/main-loop.c:213
#12 0x009453a3 in os_host_main_loop_wait (timeout=754229747)
at util/main-loop.c:261
#13 0x0094546e in main_loop_wait (nonblocking=0) at util/main-loop.c:517
#14 0x005c7664 in main_loop () at vl.c:1898
#15 0x005ceb27 in main (argc=49, argv=0x7fff7907ab28,
envp=0x7fff7907acb8) at vl.c:4709

The stack of the migration thread:
(gdb) bt
#0  0x7f4256c89264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f4256c

Re: [Qemu-devel] [PATCH v2] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-08 Thread 858585 jemmy
On Sat, Apr 8, 2017 at 12:52 PM, 858585 jemmy  wrote:
> On Fri, Apr 7, 2017 at 6:10 PM, Fam Zheng  wrote:
>> On Fri, 04/07 16:44, jemmy858...@gmail.com wrote:
>>> From: Lidong Chen 
>>>
>>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>>> this maybe cause the qcow2 file size is bigger after migration.
>>> This patch check each cluster, use blk_pwrite_zeroes for each
>>> zero cluster.
>>>
>>> Signed-off-by: Lidong Chen 
>>> ---
>>>  migration/block.c | 37 +++--
>>>  1 file changed, 35 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/migration/block.c b/migration/block.c
>>> index 7734ff7..c32e046 100644
>>> --- a/migration/block.c
>>> +++ b/migration/block.c
>>> @@ -885,6 +885,11 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  int64_t total_sectors = 0;
>>>  int nr_sectors;
>>>  int ret;
>>> +int i;
>>> +int64_t addr_offset;
>>> +uint8_t *buf_offset;
>>
>> Poor variable names, they are not offset, maybe "cur_addr" and "cur_buf"? And
>> they can be moved to the loop block below.
> ok, i will change.
>
>>
>>> +BlockDriverInfo bdi;
>>> +int cluster_size;
>>>
>>>  do {
>>>  addr = qemu_get_be64(f);
>>> @@ -934,8 +939,36 @@ static int block_load(QEMUFile *f, void *opaque, int 
>>> version_id)
>>>  } else {
>>>  buf = g_malloc(BLOCK_SIZE);
>>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>>> +
>>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>>> +cluster_size = bdi.cluster_size;
>>> +
>>> +if (ret == 0 && cluster_size > 0 &&
>>> +cluster_size < BLOCK_SIZE &&
>>
>> I think cluster_size == BLOCK_SIZE should work too.
> This case the (flags & BLK_MIG_FLAG_ZERO_BLOCK) should be true,
> and will invoke blk_pwrite_zeroes before apply this patch.
> but maybe the source qemu maybe not enabled zero flag.
> so i think cluster_size <= BLOCK_SIZE is ok.
>
>>
>>> +BLOCK_SIZE % cluster_size == 0) {
>>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>>> +addr_offset = addr * BDRV_SECTOR_SIZE
>>> ++ i * cluster_size;
>>> +buf_offset = buf + i * cluster_size;
>>> +
>>> +if (buffer_is_zero(buf_offset, cluster_size)) {
>>> +ret = blk_pwrite_zeroes(blk, addr_offset,
>>> +cluster_size,
>>> +BDRV_REQ_MAY_UNMAP);
>>> +} else {
>>> + ret = blk_pwrite(blk, addr_offset, buf_offset,
>>> +  cluster_size, 0);
>>> +}
>>> +
>>> +if (ret < 0) {
>>> +g_free(buf);
>>> +return ret;
>>> +}
>>> +}
>>> +} else {
>>> +ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>>> + nr_sectors * BDRV_SECTOR_SIZE, 0);
>>> +}
>>>  g_free(buf);
>>>  }
>>>
>>> --
>>> 1.8.3.1
>>>
>>
>> Is it possible use (source) cluster size as the transfer chunk size, instead 
>> of
>> BDRV_SECTORS_PER_DIRTY_CHUNK? Then the existing BLK_MIG_FLAG_ZERO_BLOCK logic
>> can help and you don't need to send zero bytes on the wire. This may still 
>> not
>> be optimal if dest has larger cluster, but it should cover the common use 
>> case
>> well.
>
> yes, i also think BDRV_SECTORS_PER_DIRTY_CHUNK is too large.
> This have two disadvantage:
> 1. it will cause the dest qcow2 file size is bigger after migration.
> 2. it will cause transfer not necessary data, and maybe cause the
> migration can't be successful.
>
> in my production environment, some vm only write 2MB/s, the dirty
> block migrate speed is 70MB/s.
> but it still migration timeout.
>
> but if we change the size of BDRV_SECTORS_PER_DIRTY_CHUNK, it will
> break the protocol.
> the old version qemu will not be able to migrate to new version qemu.
> there are not information about the length about the migration buffer.
>
> so i think we should add new flags to indicate that there are an
> additional byte about the length
> of migration buffer. i will send another patch later, and test the result.

Hi Fam:
Do we need to consider the case of migrating from a new qemu version
to an old qemu version?

>
> this patch is also valuable, there are many old version qemu in my
> production environment.
> and will be benefit with this patch.
>
>>
>> Fam



Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration

2017-04-08 Thread 858585 jemmy
On Fri, Apr 7, 2017 at 7:34 PM, Stefan Hajnoczi  wrote:
> On Fri, Apr 07, 2017 at 09:30:33AM +0800, 858585 jemmy wrote:
>> On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  wrote:
>> > On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> >
>> > A proper solution is to refactor the synchronous code to make it
>> > asynchronous.  This might require invoking the system call from a
>> > thread pool worker.
>> >
>>
>> yes, i agree with you, but this is a big change.
>> I will try to find how to optimize this code, maybe need a long time.
>>
>> this patch is not a perfect solution, but can alleviate the problem.
>
> Let's try to understand the problem fully first.
>

When migrating the vm at high speed, I find the vnc response is sometimes slow.
Not only is the vnc response slow, the virsh console also responds slowly sometimes,
and the guest os block io performance is also reduced.

The bug can be reproduced with these commands:
virsh migrate-setspeed 165cf436-312f-47e7-90f2-f8aa63f34893 900
virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
--copy-storage-inc qemu+ssh://10.59.163.38/system

and --copy-storage-all has no problem:
virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
--copy-storage-all qemu+ssh://10.59.163.38/system

Comparing the difference between --copy-storage-inc and
--copy-storage-all, I find out the reason is that
mig_save_device_bulk invokes bdrv_is_allocated, but bdrv_is_allocated
is synchronous and may wait
for a long time.

I wrote this code to measure the time used by bdrv_is_allocated():

 279 static int max_time = 0;
 280 int tmp;

 288 clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
 289 ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
 290 MAX_IS_ALLOCATED_SEARCH, &nr_sectors);
 291 clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
 292
 293
 294 tmp = (ts2.tv_sec - ts1.tv_sec) * 1000000000L
 295       + (ts2.tv_nsec - ts1.tv_nsec);
 296 if (tmp > max_time) {
 297max_time=tmp;
 298fprintf(stderr, "max_time is %d\n", max_time);
 299 }

the test result is below:

 max_time is 37014
 max_time is 1075534
 max_time is 17180913
 max_time is 28586762
 max_time is 49563584
 max_time is 103085447
 max_time is 110836833
 max_time is 120331438

bdrv_is_allocated is called after qemu_mutex_lock_iothread,
and the main thread also calls qemu_mutex_lock_iothread,
so the main thread may have to wait for a long time.

   if (bmds->shared_base) {
qemu_mutex_lock_iothread();
aio_context_acquire(blk_get_aio_context(bb));
/* Skip unallocated sectors; intentionally treats failure as
 * an allocated sector */
while (cur_sector < total_sectors &&
   !bdrv_is_allocated(blk_bs(bb), cur_sector,
  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
cur_sector += nr_sectors;
}
aio_context_release(blk_get_aio_context(bb));
qemu_mutex_unlock_iothread();
}

#0  0x7f107322f264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f107322a508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x7f107322a3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00949ecb in qemu_mutex_lock (mutex=0xfc51a0) at
util/qemu-thread-posix.c:60
#4  0x00459e58 in qemu_mutex_lock_iothread () at /root/qemu/cpus.c:1516
#5  0x00945322 in os_host_main_loop_wait (timeout=28911939) at
util/main-loop.c:258
#6  0x009453f2 in main_loop_wait (nonblocking=0) at util/main-loop.c:517
#7  0x005c76b4 in main_loop () at vl.c:1898
#8  0x005ceb77 in main (argc=49, argv=0x7fff921182b8,
envp=0x7fff92118448) at vl.c:4709



> Stefan



Re: [Qemu-devel] [PATCH v2] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-07 Thread 858585 jemmy
On Fri, Apr 7, 2017 at 6:10 PM, Fam Zheng  wrote:
> On Fri, 04/07 16:44, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this maybe cause the qcow2 file size is bigger after migration.
>> This patch check each cluster, use blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/block.c | 37 +++--
>>  1 file changed, 35 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 7734ff7..c32e046 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -885,6 +885,11 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  int64_t total_sectors = 0;
>>  int nr_sectors;
>>  int ret;
>> +int i;
>> +int64_t addr_offset;
>> +uint8_t *buf_offset;
>
> Poor variable names, they are not offset, maybe "cur_addr" and "cur_buf"? And
> they can be moved to the loop block below.
OK, I will change it.

>
>> +BlockDriverInfo bdi;
>> +int cluster_size;
>>
>>  do {
>>  addr = qemu_get_be64(f);
>> @@ -934,8 +939,36 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  } else {
>>  buf = g_malloc(BLOCK_SIZE);
>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +
>> +ret = bdrv_get_info(blk_bs(blk), &bdi);
>> +cluster_size = bdi.cluster_size;
>> +
>> +if (ret == 0 && cluster_size > 0 &&
>> +cluster_size < BLOCK_SIZE &&
>
> I think cluster_size == BLOCK_SIZE should work too.
In this case (flags & BLK_MIG_FLAG_ZERO_BLOCK) should be true,
and blk_pwrite_zeroes is already invoked even before this patch.
But the source qemu may not have enabled the zero flag,
so I think cluster_size <= BLOCK_SIZE is ok.

>
>> +BLOCK_SIZE % cluster_size == 0) {
>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>> +addr_offset = addr * BDRV_SECTOR_SIZE
>> ++ i * cluster_size;
>> +buf_offset = buf + i * cluster_size;
>> +
>> +if (buffer_is_zero(buf_offset, cluster_size)) {
>> +ret = blk_pwrite_zeroes(blk, addr_offset,
>> +cluster_size,
>> +BDRV_REQ_MAY_UNMAP);
>> +} else {
>> + ret = blk_pwrite(blk, addr_offset, buf_offset,
>> +  cluster_size, 0);
>> +}
>> +
>> +if (ret < 0) {
>> +g_free(buf);
>> +return ret;
>> +}
>> +}
>> +} else {
>> +ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> + nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +}
>>  g_free(buf);
>>  }
>>
>> --
>> 1.8.3.1
>>
>
> Is it possible use (source) cluster size as the transfer chunk size, instead 
> of
> BDRV_SECTORS_PER_DIRTY_CHUNK? Then the existing BLK_MIG_FLAG_ZERO_BLOCK logic
> can help and you don't need to send zero bytes on the wire. This may still not
> be optimal if dest has larger cluster, but it should cover the common use case
> well.

Yes, I also think BDRV_SECTORS_PER_DIRTY_CHUNK is too large.
This has two disadvantages:
1. it causes the dest qcow2 file size to be bigger after migration.
2. it transfers unnecessary data, and may cause the migration to fail.

In my production environment, some vms only write 2MB/s while the dirty
block migration speed is 70MB/s,
yet the migration still times out.

But if we change the size of BDRV_SECTORS_PER_DIRTY_CHUNK, it will
break the protocol:
the old version qemu will not be able to migrate to the new version qemu,
because there is no information about the length of the migration buffer.

So I think we should add a new flag to indicate that an additional field
describing the length of the migration buffer follows. I will send another
patch later, and test the result (a rough sketch of the idea is below).

This patch is also valuable: there are many old version qemus in my
production environment, and they will benefit from this patch.
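
To make the new-flag idea concrete, the send side could look roughly like
this (sketch only; BLK_MIG_FLAG_EXPLICIT_LEN is a hypothetical flag name
that does not exist in the protocol today):

    /* hypothetical: a be32 payload length follows the usual header */
    qemu_put_be64(f, (blk->sector << BDRV_SECTOR_BITS)
                     | BLK_MIG_FLAG_EXPLICIT_LEN);
    /* ... device name is sent here exactly as today ... */
    qemu_put_be32(f, len);               /* e.g. one cluster instead of 1 MB */
    qemu_put_buffer(f, blk->buf, len);

The receive side would read the length only when the flag is set, so an old
destination that does not know the flag would still reject it cleanly.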

>
> Fam



Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration

2017-04-07 Thread 858585 jemmy
On Fri, Apr 7, 2017 at 9:30 AM, 858585 jemmy  wrote:
> On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  wrote:
>> On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>>> From: Lidong Chen 
>>>
>>> when migration with high speed, mig_save_device_bulk invoke
>>> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
>>> this patch limit the time used for bdrv_is_allocated.
>>
>> bdrv_is_allocated() is supposed to yield back to the event loop if it
>> needs to block.  If your VNC session is experiencing jitter then it's
>> probably because a system call in the bdrv_is_allocated() code path is
>> synchronous when it should be asynchronous.
>>
>> You could try to identify the system call using strace -f -T.  In the
>> output you'll see the duration of each system call.  I guess there is a
>> file I/O system call that is taking noticable amounts of time.
>
> yes, i find the reason where bdrv_is_allocated needs to block.
>
> the mainly reason is caused by qemu_co_mutex_lock invoked by
> qcow2_co_get_block_status.
> qemu_co_mutex_lock(&s->lock);
> ret = qcow2_get_cluster_offset(bs, sector_num << 9, &bytes,
>&cluster_offset);
> qemu_co_mutex_unlock(&s->lock);
>
> other reason is caused by l2_load invoked by
> qcow2_get_cluster_offset.
>
> /* load the l2 table in memory */
>
> ret = l2_load(bs, l2_offset, &l2_table);
> if (ret < 0) {
> return ret;
> }
>
>>
>> A proper solution is to refactor the synchronous code to make it
>> asynchronous.  This might require invoking the system call from a
>> thread pool worker.
>>
>
> yes, i agree with you, but this is a big change.
> I will try to find how to optimize this code, maybe need a long time.
>
> this patch is not a perfect solution, but can alleviate the problem.

Hi everyone:
Do you think we should use this patch for now, and optimize this
code later?
Thanks.

>
>> Stefan



Re: [Qemu-devel] [PATCH 2/2] migration/block: use blk_pwrite_zeroes for each zero cluster

2017-04-07 Thread 858585 jemmy
On Fri, Apr 7, 2017 at 3:08 PM, Fam Zheng  wrote:
> On Thu, 04/06 21:15, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this maybe cause the qcow2 file size is bigger after migration.
>> This patch check each cluster, use blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/block.c | 34 --
>>  1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 7734ff7..1fce9b9 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -885,6 +885,10 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  int64_t total_sectors = 0;
>>  int nr_sectors;
>>  int ret;
>> +int i;
>> +int cluster_size;
>> +int64_t addr_offset;
>> +uint8_t *buf_offset;
>>
>>  do {
>>  addr = qemu_get_be64(f);
>> @@ -934,8 +938,34 @@ static int block_load(QEMUFile *f, void *opaque, int 
>> version_id)
>>  } else {
>>  buf = g_malloc(BLOCK_SIZE);
>>  qemu_get_buffer(f, buf, BLOCK_SIZE);
>> -ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> - nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +
>> +cluster_size = bdrv_get_cluster_size(blk_bs(blk));
>> +
>> +if (cluster_size > 0) {
>> +for (i = 0; i < BLOCK_SIZE / cluster_size; i++) {
>
> Should we check that cluster_size < BLOCK_SIZE and (BLOCK_SIZE % cluster_size 
> ==
> 0)?
I think this is necessary.
Thanks.

>
> Fam
>
>> +addr_offset = addr * BDRV_SECTOR_SIZE
>> ++ i * cluster_size;
>> +buf_offset = buf + i * cluster_size;
>> +
>> +if (buffer_is_zero(buf_offset, cluster_size)) {
>> +ret = blk_pwrite_zeroes(blk, addr_offset,
>> +cluster_size,
>> +BDRV_REQ_MAY_UNMAP);
>> +} else {
>> +ret = blk_pwrite(blk, addr_offset,
>> + buf_offset, cluster_size, 0);
>> +}
>> +
>> +if (ret < 0) {
>> +g_free(buf);
>> +return ret;
>> +}
>> +}
>> +} else {
>> +ret = blk_pwrite(blk, addr * BDRV_SECTOR_SIZE, buf,
>> + nr_sectors * BDRV_SECTOR_SIZE, 0);
>> +}
>> +
>>  g_free(buf);
>>  }
>>
>> --
>> 1.8.3.1
>>



Re: [Qemu-devel] [PATCH 0/2] use blk_pwrite_zeroes for each zero cluster

2017-04-06 Thread 858585 jemmy
On Fri, Apr 7, 2017 at 1:22 PM, 858585 jemmy  wrote:
> the test result for this patch:
>
> the migration command :
> virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
> --copy-storage-all qemu+ssh://10.59.163.38/system
>
> the qemu-img info on source host:
> qemu-img info 
> /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
> image: 
> /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
> file format: qcow2
> virtual size: 1.0T (1095216660480 bytes)
> disk size: 1.5G (1638989824 bytes)
> cluster_size: 65536
> backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2
>
> the qemu-img info on dest host(before apply patch):
> qemu-img info 
> /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
> image: 
> /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
> file format: qcow2
> virtual size: 40G (42949672960 bytes)
> disk size: 4.1G (4423286784 bytes)
> cluster_size: 65536
> backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2
>
> the qemu-img info on dest host(after apply patch):
> qemu-img info 
> /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
> image: 
> /instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
> file format: qcow2
> virtual size: 40G (42949672960 bytes)
> disk size: 2.3G (2496200704 bytes)
> cluster_size: 65536
> backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2
>
> the disk size reduce from 4.1G to 2.3G.
>

I found a bug in my patch.
Unfortunately, when the raw format is used, bdrv_get_cluster_size returns 1:
the raw format cluster size is 1,
and this will reduce the migration speed.

bdrv_get_cluster_size returns bs->bl.request_alignment when cluster_size is zero,
and bs->bl.request_alignment is 1 for the raw format.

static int bdrv_get_cluster_size(BlockDriverState *bs)
{
BlockDriverInfo bdi;
int ret;

ret = bdrv_get_info(bs, &bdi);
if (ret < 0 || bdi.cluster_size == 0) {
return bs->bl.request_alignment;
} else {
return bdi.cluster_size;
}
}

So I will change the patch to use the bdrv_get_info function to get the cluster size,
and bdrv_get_cluster_size does not need to be made public.
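
I.e. roughly (untested sketch):

    ret = bdrv_get_info(blk_bs(blk), &bdi);
    if (ret == 0 && bdi.cluster_size > 0 &&
        bdi.cluster_size <= BLOCK_SIZE &&
        BLOCK_SIZE % bdi.cluster_size == 0) {
        cluster_size = bdi.cluster_size;   /* qcow2 and similar formats */
    } else {
        /* raw reports cluster_size == 0: keep the single blk_pwrite() path */
        cluster_size = 0;
    }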

>
> On Thu, Apr 6, 2017 at 9:15 PM,   wrote:
>> From: Lidong Chen 
>>
>> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
>> this maybe cause the qcow2 file size is bigger after migration.
>> This patch check each cluster, use blk_pwrite_zeroes for each
>> zero cluster.
>>
>> Lidong Chen (2):
>>   block: make bdrv_get_cluster_size public
>>   migration/block: use blk_pwrite_zeroes for each zero cluster
>>
>>  block/io.c|  2 +-
>>  include/block/block.h |  1 +
>>  migration/block.c | 34 --
>>  3 files changed, 34 insertions(+), 3 deletions(-)
>>
>> --
>> 1.8.3.1
>>



Re: [Qemu-devel] [PATCH 0/2] use blk_pwrite_zeroes for each zero cluster

2017-04-06 Thread 858585 jemmy
the test result for this patch:

the migration command :
virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
--copy-storage-all qemu+ssh://10.59.163.38/system

the qemu-img info on source host:
qemu-img info 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 1.0T (1095216660480 bytes)
disk size: 1.5G (1638989824 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

the qemu-img info on dest host(before apply patch):
qemu-img info 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 4.1G (4423286784 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

the qemu-img info on dest host(after apply patch):
qemu-img info 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
image: 
/instanceimage/165cf436-312f-47e7-90f2-f8aa63f34893/165cf436-312f-47e7-90f2-f8aa63f34893_vda.qcow2
file format: qcow2
virtual size: 40G (42949672960 bytes)
disk size: 2.3G (2496200704 bytes)
cluster_size: 65536
backing file: /baseimage/img2016042213665396/img2016042213665396.qcow2

The disk size is reduced from 4.1G to 2.3G.


On Thu, Apr 6, 2017 at 9:15 PM,   wrote:
> From: Lidong Chen 
>
> BLOCK_SIZE is (1 << 20), qcow2 cluster size is 65536 by default,
> this maybe cause the qcow2 file size is bigger after migration.
> This patch check each cluster, use blk_pwrite_zeroes for each
> zero cluster.
>
> Lidong Chen (2):
>   block: make bdrv_get_cluster_size public
>   migration/block: use blk_pwrite_zeroes for each zero cluster
>
>  block/io.c|  2 +-
>  include/block/block.h |  1 +
>  migration/block.c | 34 --
>  3 files changed, 34 insertions(+), 3 deletions(-)
>
> --
> 1.8.3.1
>



Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration

2017-04-06 Thread 858585 jemmy
On Thu, Apr 6, 2017 at 10:02 PM, Stefan Hajnoczi  wrote:
> On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> when migration with high speed, mig_save_device_bulk invoke
>> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
>> this patch limit the time used for bdrv_is_allocated.
>
> bdrv_is_allocated() is supposed to yield back to the event loop if it
> needs to block.  If your VNC session is experiencing jitter then it's
> probably because a system call in the bdrv_is_allocated() code path is
> synchronous when it should be asynchronous.
>
> You could try to identify the system call using strace -f -T.  In the
> output you'll see the duration of each system call.  I guess there is a
> file I/O system call that is taking noticable amounts of time.

Yes, I found where bdrv_is_allocated needs to block.

The main reason is the qemu_co_mutex_lock invoked by
qcow2_co_get_block_status:
qemu_co_mutex_lock(&s->lock);
ret = qcow2_get_cluster_offset(bs, sector_num << 9, &bytes,
                               &cluster_offset);
qemu_co_mutex_unlock(&s->lock);

The other reason is the l2_load invoked by
qcow2_get_cluster_offset.

/* load the l2 table in memory */

ret = l2_load(bs, l2_offset, &l2_table);
if (ret < 0) {
return ret;
}

>
> A proper solution is to refactor the synchronous code to make it
> asynchronous.  This might require invoking the system call from a
> thread pool worker.
>

Yes, I agree with you, but this is a big change.
I will try to work out how to optimize this code; it may take a long time.

This patch is not a perfect solution, but it can alleviate the problem.

> Stefan



Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration

2017-04-05 Thread 858585 jemmy
On Wed, Apr 5, 2017 at 6:44 PM, 858585 jemmy  wrote:
> On Wed, Apr 5, 2017 at 5:34 PM, Daniel P. Berrange  
> wrote:
>> On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>>> From: Lidong Chen 
>>>
>>> when migration with high speed, mig_save_device_bulk invoke
>>> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
>>> this patch limit the time used for bdrv_is_allocated.
>>
>> Can you explain why calling bdrv_is_allocated is impacting VNC performance ?
>>
>
> bdrv_is_allocated is called after qemu_mutex_lock_iothread.
>
> if (bmds->shared_base) {
> qemu_mutex_lock_iothread();
> aio_context_acquire(blk_get_aio_context(bb));
> /* Skip unallocated sectors; intentionally treats failure as
>  * an allocated sector */
> while (cur_sector < total_sectors &&
>!bdrv_is_allocated(blk_bs(bb), cur_sector,
>   MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
> cur_sector += nr_sectors;
> }
> aio_context_release(blk_get_aio_context(bb));
> qemu_mutex_unlock_iothread();
> }
>
> and the main thread is also call qemu_mutex_lock_iothread.
>
> #0  0x7f107322f264 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x7f107322a508 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x7f107322a3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x00949ecb in qemu_mutex_lock (mutex=0xfc51a0) at
> util/qemu-thread-posix.c:60
> #4  0x00459e58 in qemu_mutex_lock_iothread () at 
> /root/qemu/cpus.c:1516
> #5  0x00945322 in os_host_main_loop_wait (timeout=28911939) at
> util/main-loop.c:258
> #6  0x009453f2 in main_loop_wait (nonblocking=0) at 
> util/main-loop.c:517
> #7  0x005c76b4 in main_loop () at vl.c:1898
> #8  0x005ceb77 in main (argc=49, argv=0x7fff921182b8,
> envp=0x7fff92118448) at vl.c:4709
>
>> Migration is running in a background thread, so shouldn't be impacting the
>> main thread which handles VNC, unless the block layer is perhaps acquiring
>> the global qemu lock ? I wouldn't expect such a lock to be held for just
>> the bdrv_is_allocated call though.
>>

I'm not sure it's safe to remove qemu_mutex_lock_iothread; I will
analyze and test it later.
This patch is simple, and it can solve the problem now.

>> Regards,
>> Daniel
>> --
>> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
>> |: http://libvirt.org  -o- http://virt-manager.org :|
>> |: http://entangle-photo.org   -o-http://search.cpan.org/~danberr/ :|



Re: [Qemu-devel] [PATCH v3] migration/block:limit the time used for block migration

2017-04-05 Thread 858585 jemmy
On Wed, Apr 5, 2017 at 5:34 PM, Daniel P. Berrange  wrote:
> On Wed, Apr 05, 2017 at 05:27:58PM +0800, jemmy858...@gmail.com wrote:
>> From: Lidong Chen 
>>
>> when migration with high speed, mig_save_device_bulk invoke
>> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
>> this patch limit the time used for bdrv_is_allocated.
>
> Can you explain why calling bdrv_is_allocated is impacting VNC performance ?
>

bdrv_is_allocated is called after qemu_mutex_lock_iothread.

if (bmds->shared_base) {
qemu_mutex_lock_iothread();
aio_context_acquire(blk_get_aio_context(bb));
/* Skip unallocated sectors; intentionally treats failure as
 * an allocated sector */
while (cur_sector < total_sectors &&
   !bdrv_is_allocated(blk_bs(bb), cur_sector,
  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
cur_sector += nr_sectors;
}
aio_context_release(blk_get_aio_context(bb));
qemu_mutex_unlock_iothread();
}

and the main thread also calls qemu_mutex_lock_iothread.

#0  0x7f107322f264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x7f107322a508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x7f107322a3d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00949ecb in qemu_mutex_lock (mutex=0xfc51a0) at
util/qemu-thread-posix.c:60
#4  0x00459e58 in qemu_mutex_lock_iothread () at /root/qemu/cpus.c:1516
#5  0x00945322 in os_host_main_loop_wait (timeout=28911939) at
util/main-loop.c:258
#6  0x009453f2 in main_loop_wait (nonblocking=0) at util/main-loop.c:517
#7  0x005c76b4 in main_loop () at vl.c:1898
#8  0x005ceb77 in main (argc=49, argv=0x7fff921182b8,
envp=0x7fff92118448) at vl.c:4709

> Migration is running in a background thread, so shouldn't be impacting the
> main thread which handles VNC, unless the block layer is perhaps acquiring
> the global qemu lock ? I wouldn't expect such a lock to be held for just
> the bdrv_is_allocated call though.
>
> Regards,
> Daniel
> --
> |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org  -o- http://virt-manager.org :|
> |: http://entangle-photo.org   -o-http://search.cpan.org/~danberr/ :|



Re: [Qemu-devel] [PATCH v2] migration/block:limit the time used for block migration

2017-04-05 Thread 858585 jemmy
Sorry, I made a mistake; please ignore this patch.

On Wed, Apr 5, 2017 at 4:58 PM,   wrote:
> From: Lidong Chen 
>
> when migration with quick speed, mig_save_device_bulk invoke
> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
> this patch limit the time used for bdrv_is_allocated.
>
> Signed-off-by: Lidong Chen 
> ---
>  migration/block.c | 38 ++
>  1 file changed, 30 insertions(+), 8 deletions(-)
>
> diff --git a/migration/block.c b/migration/block.c
> index 7734ff7..9d7a8ee 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -39,6 +39,7 @@
>  #define MAX_IS_ALLOCATED_SEARCH 65536
>
>  #define MAX_INFLIGHT_IO 512
> +#define BIG_DELAY 50
>
>  //#define DEBUG_BLK_MIGRATION
>
> @@ -110,6 +111,7 @@ typedef struct BlkMigState {
>  int transferred;
>  int prev_progress;
>  int bulk_completed;
> +int64_t time_ns_used;
>
>  /* Lock must be taken _inside_ the iothread lock and any AioContexts.  */
>  QemuMutex lock;
> @@ -272,16 +274,32 @@ static int mig_save_device_bulk(QEMUFile *f, 
> BlkMigDevState *bmds)
>  BlockBackend *bb = bmds->blk;
>  BlkMigBlock *blk;
>  int nr_sectors;
> +uint64_t ts1, ts2;
> +int ret = 0;
> +bool timeout_flag = false;
>
>  if (bmds->shared_base) {
>  qemu_mutex_lock_iothread();
>  aio_context_acquire(blk_get_aio_context(bb));
>  /* Skip unallocated sectors; intentionally treats failure as
>   * an allocated sector */
> -while (cur_sector < total_sectors &&
> -   !bdrv_is_allocated(blk_bs(bb), cur_sector,
> -  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
> -cur_sector += nr_sectors;
> +while (cur_sector < total_sectors) {
> +ts1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
> +MAX_IS_ALLOCATED_SEARCH, &nr_sectors);
> +ts2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +
> +block_mig_state.time_ns_used += ts2 - ts1;
> +
> +if (!ret) {
> +cur_sector += nr_sectors;
> +if (block_mig_state.time_ns_used > BIG_DELAY) {
> +timeout_flag = true;
> +break;
> +}
> +} else {
> +break;
> +}
>  }
>  aio_context_release(blk_get_aio_context(bb));
>  qemu_mutex_unlock_iothread();
> @@ -292,6 +310,11 @@ static int mig_save_device_bulk(QEMUFile *f, 
> BlkMigDevState *bmds)
>  return 1;
>  }
>
> +if (timeout_flag) {
> +bmds->cur_sector = bmds->completed_sectors = cur_sector;
> +return 0;
> +}
> +
>  bmds->completed_sectors = cur_sector;
>
>  cur_sector &= ~((int64_t)BDRV_SECTORS_PER_DIRTY_CHUNK - 1);
> @@ -576,9 +599,6 @@ static int mig_save_device_dirty(QEMUFile *f, 
> BlkMigDevState *bmds,
>  }
>
>  bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, sector, nr_sectors);
> -sector += nr_sectors;
> -bmds->cur_dirty = sector;
> -
>  break;
>  }
>  sector += BDRV_SECTORS_PER_DIRTY_CHUNK;
> @@ -756,6 +776,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
>  }
>
>  blk_mig_reset_dirty_cursor();
> +block_mig_state.time_ns_used = 0;
>
>  /* control the rate of transfer */
>  blk_mig_lock();
> @@ -764,7 +785,8 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
> qemu_file_get_rate_limit(f) &&
> (block_mig_state.submitted +
>  block_mig_state.read_done) <
> -   MAX_INFLIGHT_IO) {
> +   MAX_INFLIGHT_IO &&
> +   block_mig_state.time_ns_used <= BIG_DELAY) {
>  blk_mig_unlock();
>  if (block_mig_state.bulk_completed == 0) {
>  /* first finish the bulk phase */
> --
> 1.8.3.1
>



Re: [Qemu-devel] [RFC] migration/block:limit the time used for block migration

2017-04-05 Thread 858585 jemmy
On Wed, Mar 29, 2017 at 9:21 PM, 858585 jemmy  wrote:
> On Tue, Mar 28, 2017 at 5:47 PM, Juan Quintela  wrote:
>> Lidong Chen  wrote:
>>> when migration with quick speed, mig_save_device_bulk invoke
>>> bdrv_is_allocated too frequently, and cause vnc reponse slowly.
>>> this patch limit the time used for bdrv_is_allocated.
>>>
>>> Signed-off-by: Lidong Chen 
>>> ---
>>>  migration/block.c | 39 +++
>>>  1 file changed, 31 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/migration/block.c b/migration/block.c
>>> index 7734ff7..d3e81ca 100644
>>> --- a/migration/block.c
>>> +++ b/migration/block.c
>>> @@ -110,6 +110,7 @@ typedef struct BlkMigState {
>>>  int transferred;
>>>  int prev_progress;
>>>  int bulk_completed;
>>> +int time_ns_used;
>>
>> An int that can only take values 0/1 is called a bool O:-)
> time_ns_used is used to store how many ns used by bdrv_is_allocated.
>
>>
>>
>>>  if (bmds->shared_base) {
>>>  qemu_mutex_lock_iothread();
>>>  aio_context_acquire(blk_get_aio_context(bb));
>>>  /* Skip unallocated sectors; intentionally treats failure as
>>>   * an allocated sector */
>>> -while (cur_sector < total_sectors &&
>>> -   !bdrv_is_allocated(blk_bs(bb), cur_sector,
>>> -  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>>> -cur_sector += nr_sectors;
>>> +while (cur_sector < total_sectors) {
>>> +clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
>>> +ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
>>> +MAX_IS_ALLOCATED_SEARCH, &nr_sectors);
>>> +clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
>>
>> Do we really want to call clock_gettime each time that
>> bdrv_is_allocated() is called?  My understanding is that clock_gettime
>> is expensive, but I don't know how expensive is brdrv_is_allocated()
>
> i write this code to measure the time used by  brdrv_is_allocated()
>
>  279 static int max_time = 0;
>  280 int tmp;
>
>  288 clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
>  289 ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
>  290 MAX_IS_ALLOCATED_SEARCH, 
> &nr_sectors);
>  291 clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
>  292
>  293
>  294 tmp =  (ts2.tv_sec - ts1.tv_sec)*10L
>  295+ (ts2.tv_nsec - ts1.tv_nsec);
>  296 if (tmp > max_time) {
>  297max_time=tmp;
>  298fprintf(stderr, "max_time is %d\n", max_time);
>  299 }
>
> the test result is below:
>
>  max_time is 37014
>  max_time is 1075534
>  max_time is 17180913
>  max_time is 28586762
>  max_time is 49563584
>  max_time is 103085447
>  max_time is 110836833
>  max_time is 120331438
>
> so i think it's necessary to clock_gettime each time.
> but clock_gettime only available on linux. maybe clock() is better.
>
>>
>> And while we are at it,  shouldn't we check since before the while?
> i also check it in block_save_iterate.
> +   MAX_INFLIGHT_IO &&
> +   block_mig_state.time_ns_used <= 10) {
>
>>
>>
>>> +
>>> +block_mig_state.time_ns_used += (ts2.tv_sec - ts1.tv_sec) * 
>>> BILLION
>>> +  + (ts2.tv_nsec - ts1.tv_nsec);
>>> +
>>> +if (!ret) {
>>> +cur_sector += nr_sectors;
>>> +if (block_mig_state.time_ns_used > 10) {
>>> +timeout_flag = 1;
>>> +break;
>>> +}
>>> +} else {
>>> +break;
>>> +}
>>>  }
>>>  aio_context_release(blk_get_aio_context(bb));
>>>  qemu_mutex_unlock_iothread();
>>> @@ -292,6 +311,11 @@ static int mig_save_device_bulk(QEMUFile *f, 
>>> BlkMigDevState *bmds)
>>>  return 1;
>>>  }
>>>
>>> +if (timeout_flag == 1) {
>>> +bmds->cur_sector = bmds->completed_sectors = cur_sector;
>>> +return 0;
>>> +}
>>> +
>>>  bmds->completed_sectors = cur_secto

Re: [Qemu-devel] [RFC] migration/block:limit the time used for block migration

2017-04-04 Thread 858585 jemmy
On Wed, Mar 29, 2017 at 11:57 PM, Juan Quintela  wrote:
>
> 858585 jemmy  wrote:
> > On Tue, Mar 28, 2017 at 5:47 PM, Juan Quintela  wrote:
> >> Lidong Chen  wrote:
> >>> When migrating at high speed, mig_save_device_bulk invokes
> >>> bdrv_is_allocated too frequently, which causes VNC to respond slowly.
> >>> This patch limits the time used for bdrv_is_allocated.
> >>>
> >>> Signed-off-by: Lidong Chen 
> >>> ---
> >>>  migration/block.c | 39 +++
> >>>  1 file changed, 31 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/migration/block.c b/migration/block.c
> >>> index 7734ff7..d3e81ca 100644
> >>> --- a/migration/block.c
> >>> +++ b/migration/block.c
> >>> @@ -110,6 +110,7 @@ typedef struct BlkMigState {
> >>>  int transferred;
> >>>  int prev_progress;
> >>>  int bulk_completed;
> >>> +int time_ns_used;
> >>
> >> An int that can only take values 0/1 is called a bool O:-)
> > time_ns_used is used to store how many ns used by bdrv_is_allocated.
>
> Oops, I really mean timeout_flag, sorry :-(
>
> >> Do we really want to call clock_gettime each time that
> >> bdrv_is_allocated() is called?  My understanding is that clock_gettime
> >> is expensive, but I don't know how expensive is brdrv_is_allocated()
> >
> > i write this code to measure the time used by  brdrv_is_allocated()
> >
> >  279 static int max_time = 0;
> >  280 int tmp;
> >
> >  288 clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
> >  289 ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
> >  290 MAX_IS_ALLOCATED_SEARCH, 
> > &nr_sectors);
> >  291 clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
> >  292
> >  293
> >  294 tmp =  (ts2.tv_sec - ts1.tv_sec) * 1000000000L
> >  295+ (ts2.tv_nsec - ts1.tv_nsec);
> >  296 if (tmp > max_time) {
> >  297max_time=tmp;
> >  298fprintf(stderr, "max_time is %d\n", max_time);
> >  299 }
> >
> > the test result is below:
> >
> >  max_time is 37014
> >  max_time is 1075534
> >  max_time is 17180913
> >  max_time is 28586762
> >  max_time is 49563584
> >  max_time is 103085447
> >  max_time is 110836833
> >  max_time is 120331438
>
> this is around 120ms, no?  It is quite a lot, really :-(

I find the delay is mainly caused by the qemu_co_mutex_lock invoked by
qcow2_co_get_block_status:

qemu_co_mutex_lock(&s->lock);
ret = qcow2_get_cluster_offset(bs, sector_num << 9, &bytes,
                               &cluster_offset);
qemu_co_mutex_unlock(&s->lock);

>
>
> > so i think it's necessary to clock_gettime each time.
> > but clock_gettime only available on linux. maybe clock() is better.
>
qemu_clock_get_ms(QEMU_CLOCK_REALTIME) is a better option.

> Thanks, Juan.
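
A minimal sketch of how the qemu_clock_get_ms() suggestion above could look
inside the mig_save_device_bulk loop from the RFC patch further below;
SCAN_BUDGET_MS and its value are illustrative assumptions, not part of any
posted patch:

/* Sketch only: same loop as the RFC below, but bounded with QEMU's clock
 * helper instead of raw clock_gettime(); cur_sector, total_sectors, bb,
 * nr_sectors, ret and timeout_flag come from mig_save_device_bulk(). */
#define SCAN_BUDGET_MS 100   /* assumed budget, not from the patch */

int64_t start_ms = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

while (cur_sector < total_sectors) {
    ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
                            MAX_IS_ALLOCATED_SEARCH, &nr_sectors);
    if (ret) {
        break;      /* allocated sector (or error, treated the same) found */
    }
    cur_sector += nr_sectors;
    if (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_ms > SCAN_BUDGET_MS) {
        timeout_flag = 1;   /* give the main loop a chance to run */
        break;
    }
}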



Re: [Qemu-devel] [RFC] migration/block:limit the time used for block migration

2017-03-29 Thread 858585 jemmy
On Tue, Mar 28, 2017 at 5:47 PM, Juan Quintela  wrote:
> Lidong Chen  wrote:
>> When migrating at high speed, mig_save_device_bulk invokes
>> bdrv_is_allocated too frequently, which causes VNC to respond slowly.
>> This patch limits the time used for bdrv_is_allocated.
>>
>> Signed-off-by: Lidong Chen 
>> ---
>>  migration/block.c | 39 +++
>>  1 file changed, 31 insertions(+), 8 deletions(-)
>>
>> diff --git a/migration/block.c b/migration/block.c
>> index 7734ff7..d3e81ca 100644
>> --- a/migration/block.c
>> +++ b/migration/block.c
>> @@ -110,6 +110,7 @@ typedef struct BlkMigState {
>>  int transferred;
>>  int prev_progress;
>>  int bulk_completed;
>> +int time_ns_used;
>
> An int that can only take values 0/1 is called a bool O:-)
time_ns_used stores how many nanoseconds have been used by bdrv_is_allocated.

>
>
>>  if (bmds->shared_base) {
>>  qemu_mutex_lock_iothread();
>>  aio_context_acquire(blk_get_aio_context(bb));
>>  /* Skip unallocated sectors; intentionally treats failure as
>>   * an allocated sector */
>> -while (cur_sector < total_sectors &&
>> -   !bdrv_is_allocated(blk_bs(bb), cur_sector,
>> -  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
>> -cur_sector += nr_sectors;
>> +while (cur_sector < total_sectors) {
>> +clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
>> +ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
>> +MAX_IS_ALLOCATED_SEARCH, &nr_sectors);
>> +clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
>
> Do we really want to call clock_gettime each time that
> bdrv_is_allocated() is called?  My understanding is that clock_gettime
> is expensive, but I don't know how expensive is brdrv_is_allocated()

I wrote this code to measure the time used by bdrv_is_allocated():

static int max_time = 0;
int tmp;

clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
                        MAX_IS_ALLOCATED_SEARCH, &nr_sectors);
clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);

tmp = (ts2.tv_sec - ts1.tv_sec) * 1000000000L
      + (ts2.tv_nsec - ts1.tv_nsec);
if (tmp > max_time) {
    max_time = tmp;
    fprintf(stderr, "max_time is %d\n", max_time);
}

the test result is below:

 max_time is 37014
 max_time is 1075534
 max_time is 17180913
 max_time is 28586762
 max_time is 49563584
 max_time is 103085447
 max_time is 110836833
 max_time is 120331438

So I think it's necessary to call clock_gettime each time.
But clock_gettime is only available on Linux; maybe clock() is better.

>
> And while we are at it,  shouldn't we check since before the while?
I also check it in block_save_iterate:
+   MAX_INFLIGHT_IO &&
+   block_mig_state.time_ns_used <= 10) {

>
>
>> +
>> +block_mig_state.time_ns_used += (ts2.tv_sec - ts1.tv_sec) * 
>> BILLION
>> +  + (ts2.tv_nsec - ts1.tv_nsec);
>> +
>> +if (!ret) {
>> +cur_sector += nr_sectors;
>> +if (block_mig_state.time_ns_used > 10) {
>> +timeout_flag = 1;
>> +break;
>> +}
>> +} else {
>> +break;
>> +}
>>  }
>>  aio_context_release(blk_get_aio_context(bb));
>>  qemu_mutex_unlock_iothread();
>> @@ -292,6 +311,11 @@ static int mig_save_device_bulk(QEMUFile *f, 
>> BlkMigDevState *bmds)
>>  return 1;
>>  }
>>
>> +if (timeout_flag == 1) {
>> +bmds->cur_sector = bmds->completed_sectors = cur_sector;
>> +return 0;
>> +}
>> +
>>  bmds->completed_sectors = cur_sector;
>>
>>  cur_sector &= ~((int64_t)BDRV_SECTORS_PER_DIRTY_CHUNK - 1);
>> @@ -576,9 +600,6 @@ static int mig_save_device_dirty(QEMUFile *f, 
>> BlkMigDevState *bmds,
>>  }
>>
>>  bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, sector, nr_sectors);
>> -sector += nr_sectors;
>> -bmds->cur_dirty = sector;
>> -
>>  break;
>>  }
>>  sector += BDRV_SECTORS_PER_DIRTY_CHUNK;
>> @@ -756,6 +777,7 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
>>  }
>>
>>  blk_mig_reset_dirty_cursor();
>> +block_mig_state.time_ns_used = 0;
>>
>>  /* control the rate of transfer */
>>  blk_mig_lock();
>> @@ -764,7 +786,8 @@ static int block_save_iterate(QEMUFile *f, void *opaque)
>> qemu_file_get_rate_limit(f) &&
>> (block_mig_state.submitted +
>>  block_mig_state.read_done) <
>> -   MAX_INFLIGHT_IO) {
>> +   MAX_INFLIGHT_IO &&
>> +   block_mig_state.time_ns_used <= 10) {
>
> changed this 10.

Re: [Qemu-devel] [RFC] migration/block:limit the time used for block migration

2017-03-28 Thread 858585 jemmy
When migrating the VM at high speed, I find that VNC responds slowly.

The bug can be reproduced with these commands:
virsh migrate-setspeed 165cf436-312f-47e7-90f2-f8aa63f34893 900
virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
--copy-storage-inc qemu+ssh://10.59.163.38/system

With --copy-storage-all there is no problem:
virsh migrate --live 165cf436-312f-47e7-90f2-f8aa63f34893
--copy-storage-all qemu+ssh://10.59.163.38/system

mig_save_device_bulk invokes bdrv_is_allocated, but bdrv_is_allocated may
wait for a long time, which makes the main thread wait for a long time.

This patch limits the time spent waiting for bdrv_is_allocated.

I have not found a better way to solve this bug. Any suggestions?

Thanks.

On Tue, Mar 28, 2017 at 5:23 PM, Lidong Chen  wrote:

> When migrating at high speed, mig_save_device_bulk invokes
> bdrv_is_allocated too frequently, which causes VNC to respond slowly.
> This patch limits the time used for bdrv_is_allocated.
>
> Signed-off-by: Lidong Chen 
> ---
>  migration/block.c | 39 +++
>  1 file changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/migration/block.c b/migration/block.c
> index 7734ff7..d3e81ca 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -110,6 +110,7 @@ typedef struct BlkMigState {
>  int transferred;
>  int prev_progress;
>  int bulk_completed;
> +int time_ns_used;
>
>  /* Lock must be taken _inside_ the iothread lock and any
> AioContexts.  */
>  QemuMutex lock;
> @@ -263,6 +264,7 @@ static void blk_mig_read_cb(void *opaque, int ret)
>  blk_mig_unlock();
>  }
>
> +#define BILLION 1000000000L
>  /* Called with no lock taken.  */
>
>  static int mig_save_device_bulk(QEMUFile *f, BlkMigDevState *bmds)
> @@ -272,16 +274,33 @@ static int mig_save_device_bulk(QEMUFile *f,
> BlkMigDevState *bmds)
>  BlockBackend *bb = bmds->blk;
>  BlkMigBlock *blk;
>  int nr_sectors;
> +struct timespec ts1, ts2;
> +int ret = 0;
> +int timeout_flag = 0;
>
>  if (bmds->shared_base) {
>  qemu_mutex_lock_iothread();
>  aio_context_acquire(blk_get_aio_context(bb));
>  /* Skip unallocated sectors; intentionally treats failure as
>   * an allocated sector */
> -while (cur_sector < total_sectors &&
> -   !bdrv_is_allocated(blk_bs(bb), cur_sector,
> -  MAX_IS_ALLOCATED_SEARCH, &nr_sectors)) {
> -cur_sector += nr_sectors;
> +while (cur_sector < total_sectors) {
> +clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
> +ret = bdrv_is_allocated(blk_bs(bb), cur_sector,
> +MAX_IS_ALLOCATED_SEARCH, &nr_sectors);
> +clock_gettime(CLOCK_MONOTONIC_RAW, &ts2);
> +
> +block_mig_state.time_ns_used += (ts2.tv_sec - ts1.tv_sec) *
> BILLION
> +  + (ts2.tv_nsec - ts1.tv_nsec);
> +
> +if (!ret) {
> +cur_sector += nr_sectors;
> +if (block_mig_state.time_ns_used > 10) {
> +timeout_flag = 1;
> +break;
> +}
> +} else {
> +break;
> +}
>  }
>  aio_context_release(blk_get_aio_context(bb));
>  qemu_mutex_unlock_iothread();
> @@ -292,6 +311,11 @@ static int mig_save_device_bulk(QEMUFile *f,
> BlkMigDevState *bmds)
>  return 1;
>  }
>
> +if (timeout_flag == 1) {
> +bmds->cur_sector = bmds->completed_sectors = cur_sector;
> +return 0;
> +}
> +
>  bmds->completed_sectors = cur_sector;
>
>  cur_sector &= ~((int64_t)BDRV_SECTORS_PER_DIRTY_CHUNK - 1);
> @@ -576,9 +600,6 @@ static int mig_save_device_dirty(QEMUFile *f,
> BlkMigDevState *bmds,
>  }
>
>  bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, sector,
> nr_sectors);
> -sector += nr_sectors;
> -bmds->cur_dirty = sector;
> -
>  break;
>  }
>  sector += BDRV_SECTORS_PER_DIRTY_CHUNK;
> @@ -756,6 +777,7 @@ static int block_save_iterate(QEMUFile *f, void
> *opaque)
>  }
>
>  blk_mig_reset_dirty_cursor();
> +block_mig_state.time_ns_used = 0;
>
>  /* control the rate of transfer */
>  blk_mig_lock();
> @@ -764,7 +786,8 @@ static int block_save_iterate(QEMUFile *f, void
> *opaque)
> qemu_file_get_rate_limit(f) &&
> (block_mig_state.submitted +
>  block_mig_state.read_done) <
> -   MAX_INFLIGHT_IO) {
> +   MAX_INFLIGHT_IO &&
> +   block_mig_state.time_ns_used <= 10) {
>  blk_mig_unlock();
>  if (block_mig_state.bulk_completed == 0) {
>  /* first finish the bulk phase */
> --
> 1.8.3.1
>
>


Re: [Qemu-devel] [PATCH] migration/block: Avoid involve into blk_drain too frequently

2017-03-14 Thread 858585 jemmy
On Wed, Mar 15, 2017 at 10:57 AM, Fam Zheng  wrote:
> On Wed, 03/15 10:28, 858585 jemmy wrote:
>> On Tue, Mar 14, 2017 at 11:12 PM, Eric Blake  wrote:
>> > On 03/14/2017 02:57 AM, jemmy858...@gmail.com wrote:
>> >> From: Lidong Chen 
>> >>
>> >> Increase bmds->cur_dirty after submit io, so reduce the frequency involve 
>> >> into blk_drain, and improve the performance obviously when block 
>> >> migration.
>> >
>> > Long line; please wrap your commit messages, preferably around 70 bytes
>> > since 'git log' displays them indented, and it is still nice to read
>> > them in an 80-column window.
>> >
>> > Do you have benchmark numbers to prove the impact of this patch, or even
>> > a formula for reproducing the benchmark testing?
>> >
>>
>> The test result is based on the current git master version.
>>
>> The disk XML of the guest OS (the element tags were stripped by the mail
>> archive; what survives shows the system disk backed by
>> /instanceimage/ab3ba978-c7a3-463d-a1d0-48649fb7df00/ab3ba978-c7a3-463d-a1d0-48649fb7df00_vda.qcow2
>> and a second data disk, presumably the /dev/vdb device fio targets below):
>>
>> I used fio running in the guest OS; the fio job configuration is below:
>> [randwrite]
>> ioengine=libaio
>> iodepth=128
>> bs=512
>> filename=/dev/vdb
>> rw=randwrite
>> direct=1
>>
>> When the VM is not being migrated, the IOPS is about 10.7K.
>>
>> Then I used these commands to start migrating the virtual machine:
>>
>> virsh migrate-setspeed ab3ba978-c7a3-463d-a1d0-48649fb7df00 1000
>> virsh migrate --live ab3ba978-c7a3-463d-a1d0-48649fb7df00
>> --copy-storage-inc qemu+ssh://10.59.163.38/system
>>
>> Before applying this patch, during the block dirty save phase, the IOPS
>> in the guest OS is only 4.0K and the migration speed is about 505856 rsec/s.
>> After applying this patch, during the block dirty save phase, the IOPS in
>> the guest OS is 9.5K and the migration speed is about 855756 rsec/s.
>
> Thanks, please include these numbers in the commit message too.
OK, I will.
>
> Fam
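
The patch body is not quoted in this thread; going by the commit message
above, the change being benchmarked is roughly of the following shape in
mig_save_device_dirty() — a sketch, not the exact diff:

/* Sketch: once the asynchronous read for a dirty chunk has been submitted,
 * advance the dirty cursor past it, so the next call does not restart at a
 * sector whose I/O is still in flight and end up calling blk_drain() over
 * and over. */
blk->aiocb = blk_aio_preadv(bmds->blk, sector * BDRV_SECTOR_SIZE,
                            &blk->qiov, 0, blk_mig_read_cb, blk);

sector += nr_sectors;
bmds->cur_dirty = sector;   /* the essential change per the commit message */
break;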


