Hi Sven,

Can you attach the full logs from the second (problematic) host? I guess it's "deovn-a01".
2012-10-15 11:13:38,197 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

----- Original Message -----
> From: "Omer Frenkel" <ofren...@redhat.com>
> To: "Itamar Heim" <ih...@redhat.com>, "Sven Knohsalla" <s.knohsa...@netbiscuits.com>
> Cc: users@ovirt.org
> Sent: Tuesday, October 16, 2012 2:02:50 PM
> Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
>
> ----- Original Message -----
> > From: "Itamar Heim" <ih...@redhat.com>
> > To: "Sven Knohsalla" <s.knohsa...@netbiscuits.com>
> > Cc: users@ovirt.org
> > Sent: Monday, October 15, 2012 8:36:07 PM
> > Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non operational" STORAGE_DOMAIN_UNREACHABLE
> >
> > On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
> > > Hi,
> > >
> > > sometimes one hypervisors status turns to „Non-operational“ with error
> > > “STORAGE_DOMAIN_UNREACHABLE” and the live-migration (activated for all
> > > VMs) is starting.
> > >
> > > I don’t currently know why the ovirt-node turns to this status, because
> > > the connected iSCSI SAN is available all the time (checked via iscsi
> > > session and lsblk), I’m also able to r/w on the SAN during that time.
> > >
> > > We can simply activate this ovirt-node and it turns up again. The
> > > migration process is running from scratch and hitting the some error
> > > -> Reboot of ovirt-node necessary!
> > >
> > > When a hypervisor turns to “non-operational” status, the live migration
> > > is starting and tries to migrate ~25 VMs (~ 100 GB RAM to migrate).
> > >
> > > During that process the network workload goes 100%, some VMs will be
> > > migrated, then the destination host also turns to “non-operational”
> > > status with error “STORAGE_DOMAIN_UNREACHABLE”.
> > >
> > > Many VMs are still running on their origin host, some are paused, some
> > > are showing “migration from” status.
> > >
> > > After a reboot of the origin host, the VMs turns of course into unknown
> > > state.
> > >
> > > So the whole cluster is down :/
> > >
> > > For this problem I have some questions:
> > >
> > > -Does ovirt engine just use the ovirt-mgmt network for migration/HA?
> >
> > yes.
> >
> > > -If so, is there any possibility to *add*/switch a network for
> > > migration/HA?
> >
> > you can bond, not yet add another one.
> >
> > > -Is the kind of way we are using the live-migration not recommended?
> > >
> > > -Which engine module checks the availability of the storage domain for
> > > the ovirt-nodes?
> >
> > the engine.
> >
> > > -Is there any timeout/cache option we can set/increase to avoid this
> > > problem?
> >
> > well, not clear what the problem is.
> > also, vdsm is supposed to throttle live migration to 3 vm's in parallel
> > iirc.
> > also, you can at cluster level configure to not live migrate VMs on
> > non-operational status.
> >
> > > -Is there any known problem with the versions we are using? (Migration
> > > to ovirt-engine 3.1 is not possible atm)
> >
> > oh, the cluster level migration policy on non operational may be a 3.1
> > feature, not sure.
> >
>
> AFAIR, it's in 3.0
>
> > > -Is it possible to modify the migration queue to just migrate a max. of
> > > 4 VMs at the same time for example?
> >
> > yes, there is a vdsm config for that. i am pretty sure 3 is the default
> > though?
> >
> > > _ovirt-engine:_
> > >
> > > FC 16: 3.3.6-3.fc16.x86_64
> > > Engine: 3.0.0_0001-1.6.fc16
> > > KVM based VM: 2 vCPU, 4 GB RAM
> > > 1 NIC for ssh/https access
> > > 1 NIC for ovirtmgmt network access
> > > engine source: dreyou repo
> > >
> > > _ovirt-node:_
> > > Node: 2.3.0
> > > 2 bonded NICs -> Frontend Network
> > > 4 Multipath NICs -> SAN connection
> > >
> > > Attached some relevant logfiles.
> > >
> > > Thanks in advance, I really appreciate your help!
> > >
> > > Best,
> > >
> > > Sven Knohsalla | System Administration
> > >
> > > Office +49 631 68036 433 | Fax +49 631 68036 111
> > > | e-mail: s.knohsa...@netbiscuits.com
> > > Skype: Netbiscuits.admin
> > >
> > > Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY
> > >
> > > _______________________________________________
> > > Users mailing list
> > > Users@ovirt.org
> > > http://lists.ovirt.org/mailman/listinfo/users
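
P.S. In case it helps to see which hosts the engine flags and how often, here is a minimal sketch that counts "domain ... in problem" warnings per host and storage domain. The regular expression is modelled only on the single WARN line quoted at the top of this mail, so it is an assumption that other lines in your engine log follow the same layout; the function name `count_domain_problems` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one WARN line quoted in this thread (assumption:
# other engine versions may word or lay out the message differently).
PROBLEM_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\s+WARN\s+"
    r"\[org\.ovirt\.engine\.core\.vdsbroker\.irsbroker\.IrsBrokerCommand\]"
    r".*?domain (?P<domain>[0-9a-f-]{36}) in problem\. vds: (?P<host>\S+)"
)


def count_domain_problems(lines):
    """Count matching WARN lines, keyed by (host, storage domain UUID)."""
    counts = Counter()
    for line in lines:
        match = PROBLEM_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("domain"))] += 1
    return counts


# Example with the line from this thread; for a real log, pass an open file
# object instead of a list (any iterable of lines works).
sample = [
    "2012-10-15 11:13:38,197 WARN "
    "[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] "
    "(QuartzScheduler_Worker-33) domain "
    "ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01",
]
for (host, domain), hits in count_domain_problems(sample).most_common():
    print(host, domain, hits)
```

If both hosts show up against the same domain UUID around the time the migrations start, that would at least tell us the engine sees the same storage domain as unreachable from both sides.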