Re: [PATCH v3 0/9] block-backend: Introduce I/O hang

2020-11-03 Thread cenjiahui


On 2020/10/30 21:21, Stefan Hajnoczi wrote:
> On Thu, Oct 29, 2020 at 05:42:42PM +0800, cenjiahui wrote:
>>
>> On 2020/10/27 0:53, Stefan Hajnoczi wrote:
>>> On Thu, Oct 22, 2020 at 09:02:54PM +0800, Jiahui Cen wrote:
>>>> A VM in the cloud environment may use a virtual disk as the backend
>>>> storage, and there are usually filesystems on the virtual block device.
>>>> When the backend storage is temporarily down, any I/O issued to the
>>>> virtual block device will cause an error. For example, an error in the
>>>> ext4 filesystem can make the filesystem read-only. However, cloud backend
>>>> storage can often be recovered soon; for example, an IP-SAN may go down
>>>> due to a network failure and come back online soon after the network
>>>> recovers. The error in the filesystem, though, may not be recovered
>>>> without a device reattach or a system restart. So an I/O rehandle
>>>> mechanism is needed to implement self-healing.
>>>>
>>>> This patch series proposes a feature called I/O hang. It can rehandle
>>>> AIOs that fail with EIO without sending the error back to the guest. From
>>>> the guest's point of view, it just looks like an I/O request is hanging
>>>> and has not returned. With this feature enabled, the guest can resume
>>>> running smoothly once the I/O path recovers.
>>>
>>> Hi,
>>> This feature seems like an extension of the existing -drive
>>> rerror=/werror= parameters:
>>>
>>>   werror=action,rerror=action
>>>   Specify which action to take on write and read errors. Valid
>>>   actions are: "ignore" (ignore the error and try to continue),
>>>   "stop" (pause QEMU), "report" (report the error to the guest),
>>>   "enospc" (pause QEMU only if the host disk is full; report the
>>>   error to the guest otherwise).  The default setting is
>>>   werror=enospc and rerror=report.
>>>
>>> That mechanism already has a list of requests to retry and live
>>> migration integration. Using the werror=/rerror= mechanism would avoid
>>> code duplication between these features. You could add a
>>> werror/rerror=retry error action for this feature.
>>>
>>> Does that sound good?
>>>
>>> Stefan
>>>
>>
>> Hi Stefan,
>>
>> Thanks for your reply. Extending the rerror=/werror= mechanism is a feasible
>> way for the retry feature.
>>
>> However, AFAIK, the rerror=/werror= mechanism in the block-backend layer
>> only provides the ACTION, and the real error handler has to be implemented
>> separately in the device layer for each type of device, while our I/O Hang
>> mechanism handles AIO errors directly regardless of the device type.
>> Wouldn't it be more generic to implement the feature in the block-backend
>> layer? In particular, we could keep the retry timeout in the common
>> BlockBackend structure.
>>
>> Besides, is there any reason why QEMU implements the rerror=/werror=
>> mechanism in the device layer rather than in the block-backend layer?
> 
> Yes, it's because failed requests can be live-migrated and retried on
> the destination host. In other words, live migration still works even
> when there are failed requests.
> 
> There may be things that can be refactored so there is less duplication
> in devices, but the basic design goal is that the block layer doesn't
> keep track of failed requests because they are live migrated together
> with the device state.
> 
> Maybe Kevin Wolf has more thoughts to share about rerror=/werror=.
> 
> Stefan
> 

Hi Kevin,

What do you think about extending rerror=/werror= for the retry feature?

And which place is better for setting the retry timeout: the common
BlockBackend in the block layer, or the per-device structure in the device
layer?
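
To make the question concrete, the extension Stefan suggested could look
roughly like the sketch below, modelled on the existing blk_get_error_action()
in block/block-backend.c. BLOCKDEV_ON_ERROR_RETRY and BLOCK_ERROR_ACTION_RETRY
are hypothetical names used only for illustration, not code from this series:

/* Sketch only: the RETRY values are hypothetical additions to the existing
 * BlockdevOnError (QAPI) and BlockErrorAction enums. */
BlockErrorAction blk_get_error_action(BlockBackend *blk, bool is_read,
                                      int error)
{
    BlockdevOnError on_err = is_read ? blk->on_read_error :
                                       blk->on_write_error;

    switch (on_err) {
    case BLOCKDEV_ON_ERROR_ENOSPC:
        return (error == ENOSPC) ?
               BLOCK_ERROR_ACTION_STOP : BLOCK_ERROR_ACTION_REPORT;
    case BLOCKDEV_ON_ERROR_STOP:
        return BLOCK_ERROR_ACTION_STOP;
    case BLOCKDEV_ON_ERROR_REPORT:
        return BLOCK_ERROR_ACTION_REPORT;
    case BLOCKDEV_ON_ERROR_IGNORE:
        return BLOCK_ERROR_ACTION_IGNORE;
    case BLOCKDEV_ON_ERROR_RETRY:
        /* New action: the device requeues the request and retries it later
         * instead of completing it with an error. */
        return BLOCK_ERROR_ACTION_RETRY;
    default:
        abort();
    }
}

The devices would then handle BLOCK_ERROR_ACTION_RETRY in their completion
paths, much like they handle BLOCK_ERROR_ACTION_STOP today.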

Jiahui



Re: [PATCH v3 0/9] block-backend: Introduce I/O hang

2020-10-29 Thread cenjiahui


On 2020/10/27 0:53, Stefan Hajnoczi wrote:
> On Thu, Oct 22, 2020 at 09:02:54PM +0800, Jiahui Cen wrote:
>> A VM in the cloud environment may use a virtual disk as the backend storage,
>> and there are usually filesystems on the virtual block device. When the
>> backend storage is temporarily down, any I/O issued to the virtual block
>> device will cause an error. For example, an error in the ext4 filesystem can
>> make the filesystem read-only. However, cloud backend storage can often be
>> recovered soon; for example, an IP-SAN may go down due to a network failure
>> and come back online soon after the network recovers. The error in the
>> filesystem, though, may not be recovered without a device reattach or a
>> system restart. So an I/O rehandle mechanism is needed to implement
>> self-healing.
>>
>> This patch series proposes a feature called I/O hang. It can rehandle AIOs
>> that fail with EIO without sending the error back to the guest. From the
>> guest's point of view, it just looks like an I/O request is hanging and has
>> not returned. With this feature enabled, the guest can resume running
>> smoothly once the I/O path recovers.
> 
> Hi,
> This feature seems like an extension of the existing -drive
> rerror=/werror= parameters:
> 
>   werror=action,rerror=action
>   Specify which action to take on write and read errors. Valid
>   actions are: "ignore" (ignore the error and try to continue),
>   "stop" (pause QEMU), "report" (report the error to the guest),
>   "enospc" (pause QEMU only if the host disk is full; report the
>   error to the guest otherwise).  The default setting is
>   werror=enospc and rerror=report.
> 
> That mechanism already has a list of requests to retry and live
> migration integration. Using the werror=/rerror= mechanism would avoid
> code duplication between these features. You could add a
> werror/rerror=retry error action for this feature.
> 
> Does that sound good?
> 
> Stefan
> 

Hi Stefan,

Thanks for your reply. Extending the rerror=/werror= mechanism is a feasible
way for the retry feature.

However, AFAIK, the rerror=/werror= mechanism in the block-backend layer only
provides the ACTION, and the real error handler has to be implemented
separately in the device layer for each type of device, while our I/O Hang
mechanism handles AIO errors directly regardless of the device type. Wouldn't
it be more generic to implement the feature in the block-backend layer? In
particular, we could keep the retry timeout in the common BlockBackend
structure.
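
For illustration, the per-backend state this would need could be sketched
roughly as below. The field names are illustrative and only follow the shape of
the series' first patches ("introduce I/O rehandle info", "add I/O hang
timeout"); they are not the exact code:

/* Sketch only: rehandle state kept per BlockBackend; all names here are
 * illustrative. BlkAioEmAIOCB is block-backend.c's internal AIOCB type. */
typedef struct BlockBackendRehandleInfo {
    bool enable;                    /* set when an I/O hang timeout is configured */
    QEMUTimer *ts;                  /* periodically resubmits the queued AIOs */
    unsigned int timer_interval_ms; /* retry interval */

    QTAILQ_HEAD(, BlkAioEmAIOCB) re_aios;  /* AIOs that completed with -EIO */

    int64_t timeout_ms;             /* give up and report EIO after this long */
} BlockBackendRehandleInfo;

On completion, an AIO that failed with -EIO would be inserted into re_aios and
the timer armed, instead of invoking the device's completion callback; once a
retry finally succeeds, or timeout_ms expires, the normal completion path runs.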

Besides, is there any reason why QEMU implements the rerror=/werror= mechanism
in the device layer rather than in the block-backend layer?

Jiahui



Re: [RFC PATCH v2 0/8] block-backend: Introduce I/O hang

2020-10-09 Thread cenjiahui
Hi Kevin,

Could you please spend some time reviewing and commenting on this patch series.

Thanks,
Jiahui Cen

On 2020/9/30 17:45, Jiahui Cen wrote:
> A VM in the cloud environment may use a virtual disk as the backend storage,
> and there are usually filesystems on the virtual block device. When the
> backend storage is temporarily down, any I/O issued to the virtual block
> device will cause an error. For example, an error in the ext4 filesystem can
> make the filesystem read-only. However, cloud backend storage can often be
> recovered soon; for example, an IP-SAN may go down due to a network failure
> and come back online soon after the network recovers. The error in the
> filesystem, though, may not be recovered without a device reattach or a
> system restart. So an I/O rehandle mechanism is needed to implement
> self-healing.
> 
> This patch series proposes a feature called I/O hang. It can rehandle AIOs
> that fail with EIO without sending the error back to the guest. From the
> guest's point of view, it just looks like an I/O request is hanging and has
> not returned. With this feature enabled, the guest can resume running
> smoothly once the I/O path recovers.
> 
> v1->v2:
> * Rebase to fix compile problems.
> * Fix incorrect remove of rehandle list.
> * Provide rehandle pause interface.
> 
> Jiahui Cen (8):
>   block-backend: introduce I/O rehandle info
>   block-backend: rehandle block aios when EIO
>   block-backend: add I/O hang timeout
>   block-backend: add I/O rehandle pause/unpause
>   block-backend: enable I/O hang when timeout is set
>   virtio-blk: pause I/O hang when resetting
>   qemu-option: add I/O hang timeout option
>   qapi: add I/O hang and I/O hang timeout qapi event
> 
>  block/block-backend.c  | 300 +
>  blockdev.c |  11 ++
>  hw/block/virtio-blk.c  |   8 +
>  include/sysemu/block-backend.h |   5 +
>  qapi/block-core.json   |  26 +++
>  5 files changed, 350 insertions(+)
> 



Re: [RFC PATCH 0/7] block-backend: Introduce I/O hang

2020-09-29 Thread cenjiahui


On 2020/9/28 18:57, Kevin Wolf wrote:
> Am 27.09.2020 um 15:04 hat Ying Fang geschrieben:
>> A VM in the cloud environment may use a virtual disk as the backend storage,
>> and there are usually filesystems on the virtual block device. When the
>> backend storage is temporarily down, any I/O issued to the virtual block
>> device will cause an error. For example, an error in the ext4 filesystem can
>> make the filesystem read-only. However, cloud backend storage can often be
>> recovered soon; for example, an IP-SAN may go down due to a network failure
>> and come back online soon after the network recovers. The error in the
>> filesystem, though, may not be recovered without a device reattach or a
>> system restart. So an I/O rehandle mechanism is needed to implement
>> self-healing.
>>
>> This patch series proposes a feature called I/O hang. It can rehandle AIOs
>> that fail with EIO without sending the error back to the guest. From the
>> guest's point of view, it just looks like an I/O request is hanging and has
>> not returned. With this feature enabled, the guest can resume running
>> smoothly once the I/O path recovers.
> 
> What is the problem with setting werror=stop and rerror=stop for the
> device? Is it that QEMU won't automatically retry, but management tool
> interaction is required to resume the guest?

When an I/O error occurs, simply setting werror=stop and rerror=stop pauses
the whole VM and makes it unavailable, and the VM is not resumed until the
management tool manually resumes it after the backend storage recovers.

With the I/O Hang mechanism, we can temporarily hang only the affected I/Os,
while any other services unrelated to the hung virtual block device, such as
networking, keep working. Besides, once the backend storage is recovered, our
I/O rehandle mechanism automatically completes the hung I/Os and the VM
continues its work.
> 
> I haven't checked your patches in detail yet, but implementing this
> functionality in the backend means that blk_drain() will hang (or if it
> doesn't hang, it doesn't do what it's supposed to do), making the whole
> QEMU process unresponsive until the I/O succeeds again. Amongst others,
> this would make it impossible to migrate away from a host with storage
> problems.

What if we disable rehandle before blk_drain()?

As for migration: if the storage is recovered during the migration iteration
phase, the migration can succeed; but if the storage is still not recovered at
the migration completion phase, the migration should fail and be cancelled.
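
Concretely, the idea would be something like the snippet below around drain
points. The function names are taken from the series' patch titles ("add I/O
rehandle pause/unpause") and are only assumptions about its interface:

/* Sketch only: pause rehandling so pending requests can complete and
 * blk_drain() can finish, then resume rehandling afterwards. */
blk_rehandle_pause(blk);     /* hypothetical name */
blk_drain(blk);
blk_rehandle_unpause(blk);   /* hypothetical name */

This mirrors what the series' "virtio-blk: pause I/O hang when resetting" patch
appears to do for device reset.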

Thanks,
Jiahui Cen
> 
> Kevin
> 



Re: [PATCH 3/3] migration/multifd: fix potential wrong acceptance order of IOChannel

2019-10-24 Thread cenjiahui
On 2019/10/24 17:52, Daniel P. Berrangé wrote:
> On Wed, Oct 23, 2019 at 11:32:14AM +0800, cenjiahui wrote:
>> From: Jiahui Cen 
>>
>> Multifd assumes the migration thread IOChannel is always established before
>> the multifd IOChannels, but this assumption can be broken in many situations,
>> such as network packet loss.
>>
>> For example:
>> Step1: Source (migration thread IOChannel)  --SYN-->  Destination
>> Step2: Source (migration thread IOChannel)  <--SYNACK  Destination
>> Step3: Source (migration thread IOChannel, lost) --ACK-->X  Destination
>> Step4: Source (multifd IOChannel) --SYN-->Destination
>> Step5: Source (multifd IOChannel) <--SYNACK   Destination
>> Step6: Source (multifd IOChannel, ESTABLISHED) --ACK-->  Destination
>> Step7: Destination accepts multifd IOChannel
>> Step8: Source (migration thread IOChannel, ESTABLISHED) -ACK,DATA->  Destination
>> Step9: Destination accepts migration thread IOChannel
>>
>> The above situation can be reproduced by creating a weak network environment,
>> such as "tc qdisc add dev eth0 root netem loss 50%". The wrong acceptance
>> order causes a magic check failure and thus leads to migration failure.
>>
>> This patch fixes the issue by sending a migration IOChannel initial packet
>> with a unique id when using multifd migration. Since the multifd IOChannels
>> also send initial packets, the destination can judge whether the IOChannel
>> being processed belongs to multifd by checking the id in the initial packet.
>> This mechanism ensures that the different IOChannels go to the correct
>> branches in our tests.
> 
> Isn't this going to break backward compatibility when new QEMU talks to old
> QEMU with multifd enabled? New QEMU will be sending a packet that old
> QEMU isn't expecting IIUC.

Yes, it actually breaks backward compatibility. But since the old QEMU has this
bug with multifd, in my opinion it may not be suitable to use multifd to
migrate from a new QEMU to an old QEMU.

Hi, Quintela, what do you think about this?

> 
>> Signed-off-by: Jiahui Cen 
>> Signed-off-by: Ying Fang 

Regards,
Jiahui Cen




[PATCH 1/3] migration/multifd: fix nullptr access in terminating multifd threads

2019-10-22 Thread cenjiahui
From: Jiahui Cen 

When one multifd channel fails to receive its IOChannel, it shuts down all the
other multifd channels' IOChannels. In this scenario, if some multifd channels
have not received their IOChannels yet, we would try to shut down a NULL
IOChannel, which causes a null pointer access in qio_channel_shutdown.

Here is the coredump stack:
#0  object_get_class (obj=obj@entry=0x0) at qom/object.c:908
#1  0x5563fdbb8f4a in qio_channel_shutdown (ioc=0x0, how=QIO_CHANNEL_SHUTDOWN_BOTH, errp=0x0) at io/channel.c:355
#2  0x5563fd7b4c5f in multifd_recv_terminate_threads (err=<optimized out>) at migration/ram.c:1280
#3  0x5563fd7bc019 in multifd_recv_new_channel (ioc=ioc@entry=0x556400255610, errp=errp@entry=0x7ffec07dce00) at migration/ram.c:1478
#4  0x5563fda82177 in migration_ioc_process_incoming (ioc=ioc@entry=0x556400255610, errp=errp@entry=0x7ffec07dce30) at migration/migration.c:605
#5  0x5563fda8567d in migration_channel_process_incoming (ioc=0x556400255610) at migration/channel.c:44
#6  0x5563fda83ee0 in socket_accept_incoming_migration (listener=0x5563fff6b920, cioc=0x556400255610, opaque=<optimized out>) at migration/socket.c:166
#7  0x5563fdbc25cd in qio_net_listener_channel_func (ioc=<optimized out>, condition=<optimized out>, opaque=<optimized out>) at io/net-listener.c:54
#8  0x7f895b6fe9a9 in g_main_context_dispatch () from /usr/lib64/libglib-2.0.so.0
#9  0x5563fdc18136 in glib_pollfds_poll () at util/main-loop.c:218
#10 0x5563fdc181b5 in os_host_main_loop_wait (timeout=10) at util/main-loop.c:241
#11 0x5563fdc183a2 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:517
#12 0x5563fd8edb37 in main_loop () at vl.c:1791
#13 0x5563fd74fd45 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4473

To fix it up, let's check p->c before calling qio_channel_shutdown.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 migration/ram.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/migration/ram.c b/migration/ram.c
index 5078f94..dc63692 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1280,7 +1280,9 @@ static void multifd_recv_terminate_threads(Error *err)
            - normal quit, i.e. everything went fine, just finished
            - error quit: We close the channels so the channel threads
              finish the qio_channel_read_all_eof() */
-        qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+        if (p->c) {
+            qio_channel_shutdown(p->c, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+        }
         qemu_mutex_unlock(&p->mutex);
     }
 }
-- 
1.8.3.1





[PATCH 2/3] migration/multifd: fix destroyed mutex access in terminating multifd threads

2019-10-22 Thread cenjiahui
From: Jiahui Cen 

One multifd thread will lock all the other multifd threads' IOChannel mutexes
to inform them to quit, by setting p->quit or shutting down p->c. In this
scenario, if some multifd threads have already been terminated and
multifd_load_cleanup/multifd_save_cleanup has destroyed their mutexes, trying
to lock those mutexes accesses already-destroyed mutexes.

Here is the coredump stack:
#0  0x7f81a2794437 in raise () from /usr/lib64/libc.so.6
#1  0x7f81a2795b28 in abort () from /usr/lib64/libc.so.6
#2  0x7f81a278d1b6 in __assert_fail_base () from /usr/lib64/libc.so.6
#3  0x7f81a278d262 in __assert_fail () from /usr/lib64/libc.so.6
#4  0x55eb1bfadbd3 in qemu_mutex_lock_impl (mutex=0x55eb1e2d1988, file=<optimized out>, line=<optimized out>) at util/qemu-thread-posix.c:64
#5  0x55eb1bb4564a in multifd_send_terminate_threads (err=<optimized out>) at migration/ram.c:1015
#6  0x55eb1bb4bb7f in multifd_send_thread (opaque=0x55eb1e2d19f8) at migration/ram.c:1171
#7  0x55eb1bfad628 in qemu_thread_start (args=0x55eb1e170450) at util/qemu-thread-posix.c:502
#8  0x7f81a2b36df5 in start_thread () from /usr/lib64/libpthread.so.0
#9  0x7f81a286048d in clone () from /usr/lib64/libc.so.6

To fix it up, let's destroy the mutexes only after all the other multifd
threads have been terminated.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 migration/ram.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index dc63692..24a8906 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1032,6 +1032,10 @@ void multifd_save_cleanup(void)
         if (p->running) {
             qemu_thread_join(&p->thread);
         }
+    }
+    for (i = 0; i < migrate_multifd_channels(); i++) {
+        MultiFDSendParams *p = &multifd_send_state->params[i];
+
         socket_send_channel_destroy(p->c);
         p->c = NULL;
         qemu_mutex_destroy(&p->mutex);
@@ -1308,6 +1312,10 @@ int multifd_load_cleanup(Error **errp)
             qemu_sem_post(&p->sem_sync);
             qemu_thread_join(&p->thread);
         }
+    }
+    for (i = 0; i < migrate_multifd_channels(); i++) {
+        MultiFDRecvParams *p = &multifd_recv_state->params[i];
+
         object_unref(OBJECT(p->c));
         p->c = NULL;
         qemu_mutex_destroy(&p->mutex);
-- 
1.8.3.1





[PATCH 3/3] migration/multifd: fix potential wrong acceptance order of IOChannel

2019-10-22 Thread cenjiahui
From: Jiahui Cen 

Multifd assumes the migration thread IOChannel is always established before
the multifd IOChannels, but this assumption can be broken in many situations,
such as network packet loss.

For example:
Step1: Source (migration thread IOChannel)  --SYN-->  Destination
Step2: Source (migration thread IOChannel)  <--SYNACK  Destination
Step3: Source (migration thread IOChannel, lost) --ACK-->X  Destination
Step4: Source (multifd IOChannel) --SYN-->Destination
Step5: Source (multifd IOChannel) <--SYNACK   Destination
Step6: Source (multifd IOChannel, ESTABLISHED) --ACK-->  Destination
Step7: Destination accepts multifd IOChannel
Step8: Source (migration thread IOChannel, ESTABLISHED) -ACK,DATA->  Destination
Step9: Destination accepts migration thread IOChannel

The above situation can be reproduced by creating a weak network environment,
such as "tc qdisc add dev eth0 root netem loss 50%". The wrong acceptance order
causes a magic check failure and thus leads to migration failure.

This patch fixes the issue by sending a migration IOChannel initial packet with
a unique id when using multifd migration. Since the multifd IOChannels also
send initial packets, the destination can judge whether the IOChannel being
processed belongs to multifd by checking the id in the initial packet. This
mechanism ensures that the different IOChannels go to the correct branches in
our tests.

Signed-off-by: Jiahui Cen 
Signed-off-by: Ying Fang 
---
 migration/channel.c   |  9 +
 migration/migration.c | 49 -
 migration/migration.h |  3 +++
 migration/ram.c   | 38 +-
 migration/ram.h   |  3 ++-
 migration/socket.c|  7 +++
 6 files changed, 58 insertions(+), 51 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index 20e4c8e..7462181 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -82,6 +82,15 @@ void migration_channel_connect(MigrationState *s,
                 return;
             }
         } else {
+            if (migrate_use_multifd()) {
+                /* multifd migration cannot distinguish migration IOChannel
+                 * from multifd IOChannels, so we need to send an initial packet
+                 * to show it is migration IOChannel
+                 */
+                migration_send_initial_packet(ioc,
+                                              migrate_multifd_channels(),
+                                              &error);
+            }
             QEMUFile *f = qemu_fopen_channel_output(ioc);
 
             qemu_mutex_lock(&s->qemu_file_lock);
diff --git a/migration/migration.c b/migration/migration.c
index 3febd0f..3da2baf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -521,12 +521,6 @@ static void migration_incoming_setup(QEMUFile *f)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
 
-    if (multifd_load_setup() != 0) {
-        /* We haven't been able to create multifd threads
-           nothing better to do */
-        exit(EXIT_FAILURE);
-    }
-
     if (!mis->from_src_file) {
         mis->from_src_file = f;
     }
@@ -584,36 +578,41 @@ void migration_fd_process_incoming(QEMUFile *f)
 void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
-    bool start_migration;
-
-    if (!mis->from_src_file) {
-        /* The first connection (multifd may have multiple) */
-        QEMUFile *f = qemu_fopen_channel_input(ioc);
+    Error *local_err = NULL;
+    int id = 0;
 
-        /* If it's a recovery, we're done */
-        if (postcopy_try_recover(f)) {
-            return;
-        }
+    if (migrate_use_multifd()) {
+        id = migration_recv_initial_packet(ioc, &local_err);
+    }
+    if (!migrate_use_multifd() || id == migrate_multifd_channels()) {
+        if (!mis->from_src_file) {
+            /* The migration connection (multifd may have multiple) */
+            QEMUFile *f = qemu_fopen_channel_input(ioc);
 
-        migration_incoming_setup(f);
+            /* If it's a recovery, we're done */
+            if (postcopy_try_recover(f)) {
+                return;
+            }
 
-        /*
-         * Common migration only needs one channel, so we can start
-         * right now.  Multifd needs more than one channel, we wait.
-         */
-        start_migration = !migrate_use_multifd();
-    } else {
-        Error *local_err = NULL;
+            migration_incoming_setup(f);
+        }
+    } else if (id >= 0) {
         /* Multiple connections */
         assert(migrate_use_multifd());
-        start_migration = multifd_recv_new_channel(ioc, &local_err);
+        multifd_recv_new_channel(ioc, id, &local_err);
         if (local_err) {
             error_propagate(errp, local_err);
             return;
         }
+    } else {
+        /* Bad connections */
+        multifd_recv_terminate_threads(local_err);
+