Re: [ovirt-users] upgrade from 3.5 to 3.6 causing problems with migration

Jason Keltz Mon, 09 Nov 2015 19:43:14 -0800

On 11/9/2015 6:20 PM, Martin Polednik wrote:

On 09/11/15 14:00 -0500, Jason Keltz wrote:
Hi Shmuel,
Thanks very much for looking into my problem!

I installed 3.6 on the engine.  I rebooted the engine.
The 3 hosts were still running vdsm from 3.5. I checked back in the yum log, and it was 4.16.26-0.el7.

On the first host upgrade (virt1), I made a mistake. After bringing in the 3.6 repo, I upgraded the packages with just "yum update". However, I know that I should have put the host into maintenance mode first. After the updates installed, I put the host into maintenance mode, and it migrated the VMs off, during which I saw more than one failed VM migration.

I'm willing to accept the failures there because I should have put the host into maintenance mode first. Live and learn!

I had two other hosts to do this right. For virt2 and virt3, I put the hosts into maintenance mode first. However, the same problem occurred with failed migrations. I proceeded anyway, brought the failed VMs back up elsewhere, applied the updates, and rebooted the hosts.

So now, 3.6 is installed on the engine and the 3 hosts, and they are all rebooted.

I tried another migration, and again, there were failures, so this isn't specifically related to just 3.6.

By the way, I'm using ovirtmgmt for migrations. virt1, virt2, and virt3 have a dedicated 10G link via Intel X540 to a 10G switch. engine is on that network as well, but it's a 1G link.

I was able to run iperf tests between the nodes, and saw nearly 10G speed. During the failed migrations, I also don't have any problem with ovirtmgmt, so I don't think the network is an issue...

I found this bug in bugzilla over the weekend:

https://bugzilla.redhat.com/show_bug.cgi?id=1142776
I was nearly positive that this had something to do with the failed migrations. As a final test, I decided to migrate the VMs from one host to another, one at a time. I was nearly done migrating all the VMs from virt3 to virt1. I had migrated 5 VMs all successfully, one at a time, without any failures. When I migrated the 6th, boom - it didn't migrate, and the VM was down. It was a pretty basic VM as well, with very little traffic.
I included on the bug report above an additional link with the engine, virt1, virt2, and virt3 logs for Saturday where I was doing this experimentation because there's a couple more failures recorded. I'll include that link here:
http://www.eecs.yorku.ca/~jas/ovirt-debug/11072015
The last VM that I attempted to transfer one at a time was "webapp". It was transferred from virt3 to virt1.
I'm really puzzled that more people haven't experienced this issue. I've disabled the load balancing feature because I'm really concerned that if it load balances my VMs, then they might not come back up! I don't *think* this was happening when I was all purely 3.5, but I can't remember doing big migrations. I most certainly was able to put a host into maintenance mode without having VMs go down!
In another email, Dan Kenigsberg says that "It seems that 3.6's vdsm-4.17.10.1 cannot consume a Random Number Generator device that was created on 3.5.". Thanks also to Dan for looking into that as well! I'm still waiting for more details though before opening additional bug reports because this puzzles me... if this were the case, then ALL of the VMs were created on 3.5, and ALL with random number generator device, and all would fail migration, but they don't. I have a feeling that there are a few issues at play here.

Hello and sorry for dropping in so late.

The issue is that the 3.5 engine created the RNG device without sending
the device key (which should've been 'rng', but it wasn't properly
documented in the API, as fixed in [1]). This caused the
getUnderlyingRngDevice method to fail matching the device (fixed in
[2]) and it would therefore be treated as unknown device (where the
notion of 'source' isn't known). 3.6 engine should handle it correctly
[3].

The implication is that when a VM is created in a 3.5 environment and
moved to a 3.6 environment, the matching will work but there will be 2
RNG devices for the single one. The same goes for migration.

I'm not sure about the fix yet, to rescue the 3.6 VM we would have to
remove the duplicate device without specParams (meaning that address
would be lost) or remove the original device while adding its
specParams to the new device. A temporary fix would be creating a hook
that does this.

[1] https://gerrit.ovirt.org/#/c/43166/
[2] https://gerrit.ovirt.org/#/c/40095/
[3] https://gerrit.ovirt.org/#/c/43165/
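
For readers following the thread, the second option Martin describes (drop the original device but carry its specParams over to the new one) might look roughly like the sketch below. This is only an illustration under assumptions: the device list is a toy list of dicts, the keys "type", "specParams" and "address" are simplified stand-ins rather than vdsm's actual structures, and merge_duplicate_rng is a hypothetical helper name, not an existing vdsm or hook API.

```python
from copy import deepcopy


def merge_duplicate_rng(devices):
    """Collapse a duplicated RNG entry into a single device (toy sketch).

    Assumed (hypothetical) shape: each device is a dict with a "type" key,
    and the duplicated RNG shows up as exactly two "rng" entries, one that
    carries "specParams" and one that does not (but may carry "address").
    Anything that does not match this shape is returned unchanged.
    """
    unchanged = [deepcopy(d) for d in devices]
    rng = [d for d in devices if d.get("type") == "rng"]
    with_params = [d for d in rng if d.get("specParams")]
    without_params = [d for d in rng if not d.get("specParams")]
    if len(rng) != 2 or len(with_params) != 1 or len(without_params) != 1:
        return unchanged

    # Keep the entry without specParams (it may hold the address) and copy
    # the specParams of the other entry onto it.
    merged = deepcopy(without_params[0])
    merged["specParams"] = deepcopy(with_params[0]["specParams"])

    others = [deepcopy(d) for d in devices if d.get("type") != "rng"]
    return others + [merged]


if __name__ == "__main__":
    example = [
        {"type": "rng", "specParams": {"source": "random"}},
        {"type": "rng", "address": {"slot": "0x07"}},
        {"type": "disk"},
    ]
    print(merge_duplicate_rng(example))
```

The merged entry keeps the address and gains the specParams, which avoids the address loss of the first option. An actual hook would have to operate on whatever representation vdsm passes to it, which this sketch does not attempt to model.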


Martin,

Thanks for your message and for looking at the debug logs.

What I don't understand is why, in my last test case, I was able to transition 5 VMs from one host to another, completely successfully, and only on the 6th did the problem occur. Why would this RNG issue not have come up with every single transition? What is it that made it happen on the 6th? I still have a feeling that there is something else at play here as well. All of my VMs were created on 3.5, and all of them have RNG devices.

Assuming that you've created a bug entry, can you please give me the bugzilla ID so I can add myself to it? I'm anxious for when you've come up with a patch to fix the existing issue. All of my VMs were created on 3.5, and I don't want to have to hold my breath every time I transition VMs from one host to another.


Jason.

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
