RE: About migration/colo issue

2020-05-15 Thread Zhanghailiang
Hi,

I can't reproduce this issue with upstream QEMU either;
it works well.

Did you use an old version?

Thanks,
Hailiang


> -----Original Message-----
> From: Lukas Straub [mailto:lukasstra...@web.de]
> Sent: Friday, May 15, 2020 3:12 PM
> To: Zhang, Chen 
> Cc: Zhanghailiang; Dr. David Alan Gilbert; qemu-devel; Li Zhijian; Jason Wang
> Subject: Re: About migration/colo issue
> 
> On Fri, 15 May 2020 03:16:18 +
> "Zhang, Chen"  wrote:
> 
> > Hi Hailiang/Dave.
> >
> > I found an urgent problem in the current upstream code: COLO gets stuck
> > at the secondary checkpoint and afterwards.
> > The guest gets stuck because of this issue.
> > I bisected the upstream code; this issue is caused by Hailiang's
> > optimization patch:
> 
> Hmm, I'm on v5.0.0 (which contains that commit) and I don't see this issue
> in my testing.
> 
> Regards,
> Lukas Straub
> 
> > From 0393031a16735835a441b6d6e0495a1bd14adb90 Mon Sep 17 00:00:00 2001
> > From: zhanghailiang 
> > Date: Mon, 24 Feb 2020 14:54:10 +0800
> > Subject: [PATCH] COLO: Optimize memory back-up process
> >
> > This patch reduces the VM downtime in the initial process. Previously,
> > we copied all of this memory in the COLO preparing stage, during which
> > the VM had to be stopped, which is a time-consuming process.
> > Here we optimize it with a trick: back up every page during the
> > migration process while COLO is enabled. Though this affects the
> > migration speed, it clearly reduces the downtime of backing up all of
> > the SVM's memory in the COLO preparing stage.
> >
> > Signed-off-by: zhanghailiang 
> > Message-Id: <20200224065414.36524-5-zhang.zhanghaili...@huawei.com>
> > Signed-off-by: Dr. David Alan Gilbert 
> >   minor typo fixes
> >
> > Hailiang, do you have time to look into it?
> >
> > ...



Re: About migration/colo issue

2020-05-15 Thread Lukas Straub
On Fri, 15 May 2020 03:16:18 +
"Zhang, Chen"  wrote:

> Hi Hailiang/Dave.
> 
> I found an urgent problem in the current upstream code: COLO gets stuck
> at the secondary checkpoint and afterwards.
> The guest gets stuck because of this issue.
> I bisected the upstream code; this issue is caused by Hailiang's
> optimization patch:

Hmm, I'm on v5.0.0 (which contains that commit) and I don't see this issue
in my testing.

Regards,
Lukas Straub

> From 0393031a16735835a441b6d6e0495a1bd14adb90 Mon Sep 17 00:00:00 2001
> From: zhanghailiang 
> Date: Mon, 24 Feb 2020 14:54:10 +0800
> Subject: [PATCH] COLO: Optimize memory back-up process
> 
> This patch reduces the VM downtime in the initial process. Previously,
> we copied all of this memory in the COLO preparing stage, during which
> the VM had to be stopped, which is a time-consuming process.
> Here we optimize it with a trick: back up every page during the
> migration process while COLO is enabled. Though this affects the
> migration speed, it clearly reduces the downtime of backing up all of
> the SVM's memory in the COLO preparing stage.
> 
> Signed-off-by: zhanghailiang 
> Message-Id: <20200224065414.36524-5-zhang.zhanghaili...@huawei.com>
> Signed-off-by: Dr. David Alan Gilbert 
>   minor typo fixes
> 
> Hailiang, do you have time to look into it?
> 
> ...




RE: About migration/colo issue

2020-05-14 Thread Zhang, Chen


From: Zhanghailiang 
Sent: Friday, May 15, 2020 11:29 AM
To: Zhang, Chen; Dr. David Alan Gilbert; qemu-devel; Li Zhijian
Cc: Jason Wang; Lukas Straub
Subject: RE: About migration/colo issue

Hi Zhang Chen,

From your tracing log, it seems to be hung in colo_flush_ram_cache()?
Does it run into a dead loop there?

Maybe, I haven't looked in depth.

I'll test it with the new QEMU.

Thanks

Thanks,
Hailiang

From: Zhang, Chen [mailto:chen.zh...@intel.com]
Sent: Friday, May 15, 2020 11:16 AM
To: Zhanghailiang <zhang.zhanghaili...@huawei.com>; Dr. David Alan Gilbert <dgilb...@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Li Zhijian <lizhij...@cn.fujitsu.com>
Cc: Jason Wang <jasow...@redhat.com>; Lukas Straub <lukasstra...@web.de>
Subject: About migration/colo issue

Hi Hailiang/Dave.

I found an urgent problem in the current upstream code: COLO gets stuck at the
secondary checkpoint and afterwards.
The guest gets stuck because of this issue.
I bisected the upstream code; this issue is caused by Hailiang's optimization
patch:

From 0393031a16735835a441b6d6e0495a1bd14adb90 Mon Sep 17 00:00:00 2001
From: zhanghailiang <zhang.zhanghaili...@huawei.com>
Date: Mon, 24 Feb 2020 14:54:10 +0800
Subject: [PATCH] COLO: Optimize memory back-up process

This patch reduces the VM downtime in the initial process. Previously,
we copied all of this memory in the COLO preparing stage, during which
the VM had to be stopped, which is a time-consuming process.
Here we optimize it with a trick: back up every page during the
migration process while COLO is enabled. Though this affects the
migration speed, it clearly reduces the downtime of backing up all of
the SVM's memory in the COLO preparing stage.

Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Message-Id: <20200224065414.36524-5-zhang.zhanghaili...@huawei.com>
Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
  minor typo fixes

Hailiang, do you have time to look into it?
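
For context, the trick described in the commit message can be sketched as
follows. This is a minimal, self-contained illustration with hypothetical
names (ColoRamBlock, colo_backup_*); the real code lives in QEMU's
migration/ram.c and is more involved:

#include <string.h>
#include <stddef.h>

#define PAGE_SIZE 4096

typedef struct ColoRamBlock {
    unsigned char *host;        /* guest RAM on the secondary (SVM) */
    unsigned char *colo_cache;  /* backup copy used at checkpoints */
    size_t used_length;
    struct ColoRamBlock *next;
} ColoRamBlock;

/* Old behaviour: bulk-copy every block in the COLO prepare stage while
 * the VM is stopped, so downtime grows with guest RAM size. */
static void colo_backup_all_ram(ColoRamBlock *blocks)
{
    for (ColoRamBlock *b = blocks; b; b = b->next) {
        memcpy(b->colo_cache, b->host, b->used_length);
    }
}

/* New behaviour: back up one page at a time as it is received during the
 * live migration phase, so the stop-the-VM prepare stage no longer has to
 * copy all of the SVM's memory in one go. */
static void colo_backup_one_page(ColoRamBlock *b, size_t offset)
{
    memcpy(b->colo_cache + offset, b->host + offset, PAGE_SIZE);
}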


The detailed log:
Primary node:
13322@1589511271.917346:colo_receive_message Receive 'checkpoint-ready' message
{"timestamp": {"seconds": 1589511271, "microseconds": 917406}, "event": "RESUME"}
13322@1589511271.917842:colo_vm_state_change Change 'stop' => 'run'
13322@1589511291.243346:colo_send_message Send 'checkpoint-request' message
13322@1589511291.243978:colo_receive_message Receive 'checkpoint-reply' message
{"timestamp": {"seconds": 1589511291, "microseconds": 244096}, "event": "STOP"}
13322@1589511291.24:colo_vm_state_change Change 'run' => 'stop'
13322@1589511291.244561:colo_send_message Send 'vmstate-send' message
13322@1589511291.258594:colo_send_message Send 'vmstate-size' message
13322@1589511305.412479:colo_receive_message Receive 'vmstate-received' message
13322@1589511309.031826:colo_receive_message Receive 'vmstate-loaded' message
{"timestamp": {"seconds": 1589511309, "microseconds": 31862}, "event": "RESUME"}
13322@1589511309.033075:colo_vm_state_change Change 'stop' => 'run'
{"timestamp": {"seconds": 1589511311, "microseconds": 111617}, "event": "VNC_CONNECTED", "data": {"server": {"auth": "none", "family": "ipv4", "service": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": "ipv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}}
{"timestamp": {"seconds": 1589511311, "microseconds": 116197}, "event": "VNC_INITIALIZED", "data": {"server": {"auth": "none", "family": "ipv4", "service": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": "ipv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}}
13

RE: About migration/colo issue

2020-05-14 Thread Zhanghailiang
Hi Zhang Chen,

From your tracing log, it seems to be hung in colo_flush_ram_cache()?
Does it run into a dead loop there?
I'll test it with the new QEMU.

Thanks,
Hailiang
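
For readers following along: colo_flush_ram_cache() is the secondary-side
routine that walks a dirty bitmap and copies pages received from the primary
out of the COLO cache into guest RAM at each checkpoint. Below is a minimal,
self-contained sketch of that kind of flush loop (hypothetical names and
types, not the actual QEMU migration/ram.c code), showing how a dirty bit
that is never cleared would make it spin forever:

#include <string.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096

typedef struct {
    unsigned char *host;        /* SVM guest RAM */
    unsigned char *colo_cache;  /* pages received from the PVM */
    bool *dirty;                /* one flag per page */
    size_t num_pages;
} ColoRamBlockSketch;

static size_t find_next_dirty(const ColoRamBlockSketch *b, size_t start)
{
    for (size_t i = start; i < b->num_pages; i++) {
        if (b->dirty[i]) {
            return i;
        }
    }
    return b->num_pages;  /* no dirty page left */
}

static void colo_flush_ram_cache_sketch(ColoRamBlockSketch *b)
{
    size_t page = 0;
    while ((page = find_next_dirty(b, page)) < b->num_pages) {
        /* If this clear is ever skipped, or the bitmap bookkeeping
         * disagrees with the cache about which pages are dirty,
         * find_next_dirty() returns the same page forever and the
         * flush never terminates. */
        b->dirty[page] = false;
        memcpy(b->host + page * PAGE_SIZE,
               b->colo_cache + page * PAGE_SIZE,
               PAGE_SIZE);
    }
}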

From: Zhang, Chen [mailto:chen.zh...@intel.com]
Sent: Friday, May 15, 2020 11:16 AM
To: Zhanghailiang; Dr. David Alan Gilbert; qemu-devel; Li Zhijian
Cc: Jason Wang; Lukas Straub
Subject: About migration/colo issue

Hi Hailiang/Dave.

I found an urgent problem in the current upstream code: COLO gets stuck at the
secondary checkpoint and afterwards.
The guest gets stuck because of this issue.
I bisected the upstream code; this issue is caused by Hailiang's optimization
patch:

From 0393031a16735835a441b6d6e0495a1bd14adb90 Mon Sep 17 00:00:00 2001
From: zhanghailiang <zhang.zhanghaili...@huawei.com>
Date: Mon, 24 Feb 2020 14:54:10 +0800
Subject: [PATCH] COLO: Optimize memory back-up process

This patch reduces the VM downtime in the initial process. Previously,
we copied all of this memory in the COLO preparing stage, during which
the VM had to be stopped, which is a time-consuming process.
Here we optimize it with a trick: back up every page during the
migration process while COLO is enabled. Though this affects the
migration speed, it clearly reduces the downtime of backing up all of
the SVM's memory in the COLO preparing stage.

Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Message-Id: <20200224065414.36524-5-zhang.zhanghaili...@huawei.com>
Signed-off-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
  minor typo fixes

Hailiang, do you have time to look into it?


The detailed log:
Primary node:
13322@1589511271.917346:colo_receive_message Receive 'checkpoint-ready' message
{"timestamp": {"seconds": 1589511271, "microseconds": 917406}, "event": "RESUME"}
13322@1589511271.917842:colo_vm_state_change Change 'stop' => 'run'
13322@1589511291.243346:colo_send_message Send 'checkpoint-request' message
13322@1589511291.243978:colo_receive_message Receive 'checkpoint-reply' message
{"timestamp": {"seconds": 1589511291, "microseconds": 244096}, "event": "STOP"}
13322@1589511291.24:colo_vm_state_change Change 'run' => 'stop'
13322@1589511291.244561:colo_send_message Send 'vmstate-send' message
13322@1589511291.258594:colo_send_message Send 'vmstate-size' message
13322@1589511305.412479:colo_receive_message Receive 'vmstate-received' message
13322@1589511309.031826:colo_receive_message Receive 'vmstate-loaded' message
{"timestamp": {"seconds": 1589511309, "microseconds": 31862}, "event": "RESUME"}
13322@1589511309.033075:colo_vm_state_change Change 'stop' => 'run'
{"timestamp": {"seconds": 1589511311, "microseconds": 111617}, "event": "VNC_CONNECTED", "data": {"server": {"auth": "none", "family": "ipv4", "service": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": "ipv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}}
{"timestamp": {"seconds": 1589511311, "microseconds": 116197}, "event": "VNC_INITIALIZED", "data": {"server": {"auth": "none", "family": "ipv4", "service": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": "ipv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}}
13322@1589511311.243271:colo_send_message Send 'checkpoint-request' message
13322@1589511311.351361:colo_receive_message Receive 'checkpoint-reply' message
{"timestamp": {"seconds": 1589511311, "microseconds": 351439}, "event": "STOP"}
13322@1589511311.415779:colo_vm_state_change Change 'run' => 'stop'
13322@1589511311.416001:colo_send_message Send 'vmstate-send' message
13322@1589511311.418620:colo_send_message Send 'vmstate-size' message
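
As a reading aid, the primary-node trace above walks through the checkpoint
handshake; the message names come directly from the log, while the enum below
is only an illustrative summary (the real definitions live in QEMU's
migration/colo code). Note that the second cycle above ends at Send
'vmstate-size' with no 'vmstate-received' following, which is consistent with
the reported hang on the secondary:

typedef enum {
    COLO_MSG_CHECKPOINT_READY,   /* SVM -> PVM: secondary is up and running */
    COLO_MSG_CHECKPOINT_REQUEST, /* PVM -> SVM: start a new checkpoint */
    COLO_MSG_CHECKPOINT_REPLY,   /* SVM -> PVM: SVM paused, ready for state */
    COLO_MSG_VMSTATE_SEND,       /* PVM -> SVM: device/RAM state follows */
    COLO_MSG_VMSTATE_SIZE,       /* PVM -> SVM: size of the sent state */
    COLO_MSG_VMSTATE_RECEIVED,   /* SVM -> PVM: state fully received */
    COLO_MSG_VMSTATE_LOADED,     /* SVM -> PVM: state loaded, both resume */
} ColoMsgSketch;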

Secondary node:
{"timestamp": {"seconds": 1589510920, "microseconds": 778207}, "event": 
"RESUME"}
23619@1589510920.778835:colo_vm_state_change
 Change 'stop' => 'run'
23619@1589510920.778891:colo_send_message
 Send 'checkpoint-ready' message
23619@1589510940.105539:colo_receive_message
 Receive 'checkpoint-request' message
{"timestamp": {"seconds":