Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Valentijn Sessink

Hi,

On 19-04-2022 16:07, Peter Krempa wrote:

So at this point I suspect that something in the network broke and the
migration was aborted in the storage copy phase, but it could have been
aborted in any other phase.


Hmm, thank you. My problem is much clearer now - and probably not 
getting easier:


192.168.112.31.39324 > 192.168.112.12.22: Flags [P.], cksum 0x61e7 
(incorrect -> 0x3220), seq 9618:9686, ack 5038, win 501, options 
[nop,nop,TS val 3380045136 ecr 1586940949], length 68
(Many more of these, then a timeout. And mind you: this is not related
to checksum offloading on any virtual NIC or anything like that; this is
the physical machine.)


Anyway, thanks for your help.

Best regards,

Valentijn
--



Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Peter Krempa
On Tue, Apr 19, 2022 at 15:51:32 +0200, Valentijn Sessink wrote:
> Hi Peter,
> 
> Thanks.
> 
> On 19-04-2022 13:22, Peter Krempa wrote:
> > It would be helpful if you provide the VM XML file to see how your disks
> > are configured and the debug log file when the bug reproduces:
> 
> I created a random VM to show the effect. XML file attached.
> 
> > Without that my only hunch would be that you ran out of disk space on
> > the destination which caused the I/O error.
> 
> ... it's an LVM2 volume with exactly the same size as on the source machine,
> so that would be rather odd ;-)

Oh, you are using raw disks backed by block volumes. That was not
obvious before ;)

> 
> I'm guessing that it's this weird message at the destination machine:
> 
> 2022-04-19 13:31:09.394+: 1412559: error : virKeepAliveTimerInternal:137
> : internal error: connection closed due to keepalive timeout

That certainly could be a hint ...

> 
> Source machine says:
> 2022-04-19 13:31:09.432+: 2641309: debug :
> qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds":
> 1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data":
> {"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
> 2022-04-19 13:31:09.432+: 2641309: debug : virJSONValueFromString:1822 :
> string={"timestamp": {"seconds": 1650375069, "microseconds": 432613},
> "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-virtio-disk2",
> "operation": "write", "action": "report"}}

The migration of non-shared storage works as follows:

1) libvirt sets up everything
2) libvirt asks destination qemu to open an NBD server exporting the
   disk backends
3) source libvirt instructs qemu to copy the disks to the NBD server via
   a block-copy job
4) when the block jobs converge, source qemu is instructed to migrate
   memory
5) when memory migrates, source qemu is killed and destination is
   resumed
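
For illustration, the same flow can also be driven from the libvirt API
instead of virsh; a minimal libvirt-python sketch (host and domain names
taken from this thread, not a drop-in script):

import libvirt

# Source and destination connections; "duikboot" is the destination host
# mentioned in this thread, the domain name matches the quoted command.
src = libvirt.open("qemu:///system")
dst = libvirt.open("qemu+ssh://duikboot/system")
dom = src.lookupByName("ubuntu20.04")

flags = (libvirt.VIR_MIGRATE_LIVE
         | libvirt.VIR_MIGRATE_PERSIST_DEST
         | libvirt.VIR_MIGRATE_UNDEFINE_SOURCE
         | libvirt.VIR_MIGRATE_NON_SHARED_DISK)   # copy the non-shared storage too

# Managed migration driven by the client, roughly what
# "virsh migrate --live --persistent --undefinesource --copy-storage-all" does.
dom.migrate3(dst, {}, flags)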

Now, from the keepalive failure on the destination it seems that the
network connection broke, at least between the migration controller and
the destination libvirt. That might also cause the NBD connection to
break, in which case the block job gets an I/O error.

So the I/O error is actually caused by the network connection and not by
any storage issue.
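
The keepalive that timed out is configurable: libvirtd has the
keepalive_interval/keepalive_count settings in libvirtd.conf, and a
client can tune its side via the API. A minimal libvirt-python sketch,
values purely illustrative:

import libvirt

# Keepalive needs a serviced event loop on the client side
# (e.g. virEventRunDefaultImpl() called from a loop or a thread).
libvirt.virEventRegisterDefaultImpl()

conn = libvirt.open("qemu:///system")
# Probe the peer every 10 seconds and drop the connection only after
# 6 unanswered probes (both values are illustrative).
conn.setKeepAlive(10, 6)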

So at this point I suspect that something in the network broke and the
migration was aborted in the storage copy phase, but it could have been
aborted in any other phase.



Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Valentijn Sessink

Hi Peter,

Thanks.

On 19-04-2022 13:22, Peter Krempa wrote:

It would be helpful if you provide the VM XML file to see how your disks
are configured and the debug log file when the bug reproduces:


I created a random VM to show the effect. XML file attached.


Without that my only hunch would be that you ran out of disk space on
the destination which caused the I/O error.


... it's an LVM2 volume with exactly the same size as on the source
machine, so that would be rather odd ;-)


I'm guessing that it's this weird message at the destination machine:

2022-04-19 13:31:09.394+: 1412559: error : 
virKeepAliveTimerInternal:137 : internal error: connection closed due to 
keepalive timeout


Source machine says:
2022-04-19 13:31:09.432+: 2641309: debug : 
qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 
1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.432+: 2641309: debug : 
virJSONValueFromString:1822 : string={"timestamp": {"seconds": 
1650375069, "microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.432+: 2641309: info : 
qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: 
mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, 
"microseconds": 432613}, "event": "BLOCK_JOB_ERROR", "data": {"device": 
"drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.432+: 2641309: debug : qemuMonitorEmitEvent:1198 
: mon=0x7f70080028a0 event=BLOCK_JOB_ERROR
2022-04-19 13:31:09.432+: 2641309: debug : 
qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 
1650375069, "microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.432+: 2641309: debug : 
virJSONValueFromString:1822 : string={"timestamp": {"seconds": 
1650375069, "microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: info : 
qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: 
mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, 
"microseconds": 432668}, "event": "BLOCK_JOB_ERROR", "data": {"device": 
"drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: debug : qemuMonitorEmitEvent:1198 
: mon=0x7f70080028a0 event=BLOCK_JOB_ERROR
2022-04-19 13:31:09.433+: 2641309: debug : 
qemuMonitorJSONIOProcessLine:220 : Line [{"timestamp": {"seconds": 
1650375069, "microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}]
2022-04-19 13:31:09.433+: 2641309: debug : 
virJSONValueFromString:1822 : string={"timestamp": {"seconds": 
1650375069, "microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": 
{"device": "drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: info : 
qemuMonitorJSONIOProcessLine:234 : QEMU_MONITOR_RECV_EVENT: 
mon=0x7f70080028a0 event={"timestamp": {"seconds": 1650375069, 
"microseconds": 432688}, "event": "BLOCK_JOB_ERROR", "data": {"device": 
"drive-virtio-disk2", "operation": "write", "action": "report"}}
2022-04-19 13:31:09.433+: 2641309: debug : qemuMonitorEmitEvent:1198 
: mon=0x7f70080028a0 event=BLOCK_JOB_ERROR


... and more of these. XML file attached.

Does that show anything? Please note that there is no real "block error"
anywhere: there is a matching LVM volume on the other side. I'm actually
using a script to extract the name of the volume on the source, read the
source volume size, and create a destination volume of exactly that size
before I start the migration. The disks are RAID volumes and there are
no read or write errors.
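
For illustration only (this is not the actual script, and the volume
names are made up), such a pre-creation step could look roughly like
this, using the standard lvs/lvcreate tools:

import subprocess

SRC_LV = "/dev/vg0/water-disk2"          # hypothetical source volume
DST_VG, DST_LV = "vg0", "water-disk2"    # hypothetical destination VG/LV

# Read the exact source size in bytes...
size = subprocess.run(
    ["lvs", "--noheadings", "--nosuffix", "--units", "b", "-o", "lv_size", SRC_LV],
    capture_output=True, text=True, check=True).stdout.strip()

# ...and create a destination volume of exactly that size (in practice this
# command would be run on the destination host, e.g. over ssh).
subprocess.run(["lvcreate", "-L", size + "b", "-n", DST_LV, DST_VG], check=True)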


Best regards,

Valentijn
--
Durgerdamstraat 29, 1507 JL Zaandam; telefoon 075-7100071
[Attachment: domain XML for the test VM, mangled by the list archive (the
XML tags were stripped). What is still recoverable: domain name "water",
UUID 959c1a50-5784-e3f4-1006-1bac01d513e5, libosinfo metadata pointing at
Ubuntu 20.04, 4194304 KiB of memory (1740804 KiB current), 1 vCPU, hvm
machine type, Westmere CPU model, emulator /usr/bin/kvm, and the
seclabel/DAC labels (+111:+111). The disk definitions did not survive the
extraction.]




Re: race condition? virsh migrate --copy-storage-all

2022-04-19 Thread Peter Krempa
On Fri, Apr 15, 2022 at 16:58:08 +0200, Valentijn Sessink wrote:
> Hi list,
> 
> I'm trying to migrate a few qemu virtual machines between two hosts
> connected by 1G Ethernet, with local storage only. I got endless "error:
> operation failed: migration of disk vda failed: Input/output error"
> errors and thought: something must be wrong with my settings.
> 
> However, then, suddenly: I succeeded without changing anything. And, hey:
>  while ! time virsh migrate --live --persistent --undefinesource
> --copy-storage-all ubuntu20.04 qemu+ssh://duikboot/system; do a=$(( $a + 1
> )); echo $a; done
> 
> ... retried 8 times, but then: success. This smells like a race condition,
> doesn't it? Oddly, the migration seems to succeed every time when copying
> from spinning disks to SSD, but the other way around gives this
> Input/output error.
> 
> There are some messages in /var/log/syslog, but not at the time of the
> failure, and no disk errors. These disks are LVM2 volumes and they live on
> RAID arrays, so there is no real (as in physical) I/O error.
> 
> Source system has SSD's, target system has regular disks.
> 
> 1) Is this the right mailing list? I'm not 100% sure.
> 2) How can I research this further? Spending hours in a "while" loop
> retrying live migration looks like a dull job for my poor computers
> ;-)

It would be helpful if you provide the VM XML file to see how your disks
are configured and the debug log file when the bug reproduces:

https://www.libvirt.org/kbase/debuglogs.html#less-verbose-logging-for-qemu-vms
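
In short, that page boils down to setting log filters and outputs in
/etc/libvirt/libvirtd.conf (or at runtime via virt-admin) and restarting
the daemon; the filter string below is only illustrative, use the exact
set recommended on the linked page:

log_filters="3:remote 4:event 3:util.json 3:rpc 1:*"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"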

Without that my only hunch would be that you ran out of disk space on
the destination which caused the I/O error.



Re: Virtio-scsi and block mirroring

2022-04-19 Thread Peter Krempa
On Thu, Apr 14, 2022 at 16:36:38 +, Bjoern Teipel wrote:
> Hello everyone,

Hi,

> 
> I'm looking at an issue where I see guest processes freezing (ps state Dl,
> i.e. uninterruptible sleep) during a block disk mirror from one storage to
> another (NFS), where the network stack of the guest can freeze for up to
> 10 seconds.
> Looking at the storage and I/O, I noticed good throughput and low latency
> (<3 ms), and I am having trouble tracking down the source of the issue, as
> neither storage nor networking shows issues. Interestingly, when I do the
> same test with virtio-blk I do not really see the process freezes at the
> frequency or duration I see with virtio-scsi, which seems to indicate a
> client-side rather than a storage-side problem.

Hmm, this is really weird if the difference is in the guest-facing
device frontend.

Since libvirt is merely setting up the block job for the copy and the
copy itself is handled by qemu I suggest you contact the
qemu-bl...@nongnu.org mailing list.

Unfortunately you didn't provide any information on the disk
configuration (the VM XML) or how you start the blockjob, which I could
translate for you into qemu specifics. If you provide such information I
can do that to ensure that the qemu folks have all the relevant
information.
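
For reference, a copy like that is typically started with virsh blockcopy
or the virDomainBlockCopy API; a minimal libvirt-python sketch with a
made-up domain name, disk target and destination path:

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("guest1")            # hypothetical domain name

# Copy destination described as a <disk> element (here a file on the NFS mount).
dest_xml = """
<disk type='file'>
  <driver name='qemu' type='qcow2'/>
  <source file='/mnt/nfs/guest1-sda.qcow2'/>
</disk>
"""

# Start copying guest disk 'sda' to the new location.  Progress can be polled
# with dom.blockJobInfo('sda'); once the job reaches the mirroring phase it is
# finished with dom.blockJobAbort('sda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT).
dom.blockCopy("sda", dest_xml)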



Re: Help with libvirt

2022-04-19 Thread Francesc Guasch

On 11/4/22 at 15:06, Eduardo Kiassucumuca wrote:
Good morning, I'm Eduardo, a computer science student, and I'm doing a
final course project focused on virtualization. The work consists of
creating virtual machines on a server and allowing SSH access to the
virtual machines on that server, which runs qemu/kvm/libvirt. The problem
is that I can't access the virtual machines from an external network,
although I can access them from inside the server. Since we want to have
a single public IP and be able to use a reverse proxy to reach the
virtual machines, I would like to know, from your experience, what you
recommend.


Hello Eduardo. I am afraid I am promoting our own project now. :)

We have a feature in Ravada VDI to easily expose ports. It was a feature
created for students' virtual machines in a classroom, but it can be
applied anywhere.

https://ravada.readthedocs.io/en/latest/docs/expose_ports.html

Hope this helps.



Re: When does the balloon-change or control-error event occur

2022-04-19 Thread Daniel P. Berrangé
On Thu, Apr 07, 2022 at 05:16:45PM +0800, Han Han wrote:
> Hi developers,
> I have questions about balloon-change or control-error event:
> 1. What's the meaning of these events
> 2. When do the events occur?
> 
> The comments of their callbacks don't mention that (see
> https://gitlab.com/libvirt/libvirt/-/blob/master/include/libvirt/libvirt-domain.h#L4130 and
> https://gitlab.com/libvirt/libvirt/-/blob/master/include/libvirt/libvirt-domain.h#L3736).

'balloon-change' is emitted any time the guest OS changes the balloon
inflation level.

E.g. if the host admin sets the balloon target to 1 GB and the guest is
currently using 2 GB, it might not be able to immediately drop down to
the 1 GB mark. The balloon-change events will be emitted as it makes
progress towards the 1 GB mark.

control-error is emitted when libvirt has some kind of problem controlling
the VM. The VM is still running, but libvirt may not be able to make changes
to its config. This can happen if libvirt has problems parsing JSON from QMP.
In practice it is highly unlikely for this to ever happen.
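
For anyone who wants to watch these, both are ordinary domain events; a
minimal libvirt-python sketch registering for balloon-change
(control-error works the same way with VIR_DOMAIN_EVENT_ID_CONTROL_ERROR,
whose callback takes no extra arguments):

import libvirt

def balloon_changed(conn, dom, actual, opaque):
    # 'actual' is the new balloon level in KiB.
    print("%s: balloon now at %d KiB" % (dom.name(), actual))

libvirt.virEventRegisterDefaultImpl()            # callbacks need an event loop
conn = libvirt.openReadOnly("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_BALLOON_CHANGE,
                            balloon_changed, None)

while True:                                      # dispatch events forever
    libvirt.virEventRunDefaultImpl()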

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|



Re: Libvirt vs QEMU benefits

2022-04-19 Thread Daniel P. Berrangé
On Wed, Apr 06, 2022 at 12:44:37PM +, M, Shivakumar wrote:
> Hello,
> 
> For one of our validation cases, we were previously using direct QEMU
> commands for VM creation, as that made it easier to configure the VMs.
> Inside the VM we run a real-time latency test.
> Recently we switched to libvirt for VM creation and deletion.
> Surprisingly, we see a significant improvement in real-time latency
> performance for the VMs launched through libvirt.
> 
> Configuration-wise both VMs are the same; we just converted the existing
> QEMU commands into libvirt XMLs.

It would be useful to share the QEMU command line args seen both when
running QEMU directly and via libvirt. I'd be really surprised if your
direct config was exactly the same as libvirt's.
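
One easy way to get libvirt's half of that comparison: libvirt writes the
QEMU command line it generates into the per-domain log under
/var/log/libvirt/qemu/. A small sketch, assuming the default log location
and a hypothetical domain name:

from pathlib import Path

def libvirt_qemu_cmdline(domain, logdir="/var/log/libvirt/qemu"):
    """Return the most recent QEMU command line libvirt logged for a domain."""
    text = Path(logdir, domain + ".log").read_text()
    # The command line is logged as one long line containing the emulator binary.
    lines = [l for l in text.splitlines()
             if "qemu-system" in l or "/usr/bin/kvm" in l]
    return lines[-1] if lines else None

print(libvirt_qemu_cmdline("rt-guest"))          # hypothetical domain name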

> I am wondering which of libvirt's features might be improving this
> performance.

If it isn't related to the QEMU configuration, then the most likely
candidate is the use of cgroups for placing VMs.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|