On 10/19/2012 06:43 PM, Sven Knohsalla wrote:
Hi Haim,

I wanted to wait with this mail until the problem occurred again.
I first disabled live migration for the cluster, to make sure the second node
wouldn't run into the same problem once a migration starts.

It seems the problem isn't caused by migration, as I ran into the same error
again today.

Log snippet Webgui:
2012-Oct-19,04:28:13 "Host deovn-a01 cannot access one of the Storage Domains 
attached to it, or the Data Center object. Setting Host state to Non-Operational."

--> All VMs are running properly, although the engine reports otherwise.
        Even the VM status in the engine GUI is wrong: it shows "<vmname> reboot
in progress", but no reboot was initiated (SSH/RDP connections and file operations
work fine).

Engine log says for this period:
grep '04:2' /var/log/ovirt-engine/engine.log
2012-10-19 04:23:13,773 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain 
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-1) vds deovn-a01 reported domain 
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds 
to status NonOperational
2012-10-19 04:28:13,882 INFO  
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] 
(QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand 
internal: true. Entities affected :  ID: 66b546c2-ae62-11e1-b734-5254005cbe44 
Type: VDS
2012-10-19 04:28:13,884 INFO  
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] 
(QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 
66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, 
nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO  
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] 
(QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01

I think the first output is important:
2012-10-19 04:23:13,773 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01
--> Which problem? There's no debug info during that time period to figure out
where the problem could come from :/
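
One thing I'm considering, to get more detail from the node side (just a sketch; I'd
have to double-check that these vdsClient verbs exist in our vdsm version): ask VDSM
directly for its domain monitoring results and grep its log for the domain UUID
around that timestamp:

# VDSM's own view of the storage domains it monitors (run on deovn-a01)
vdsClient -s 0 repoStats
vdsClient -s 0 getStorageDomainInfo ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5

# anything VDSM logged about that domain around 04:2x
grep ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 /var/log/vdsm/vdsm.log | grep '04:2'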

Look at the lines above:
2012-10-19 04:28:13,799 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(QuartzScheduler_Worker-1) vds deovn-a01 reported domain
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds
to status NonOperational
2012-10-19 04:28:13,882 INFO  [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
(QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand
internal: true. Entities affected :  ID: 66b546c2-ae62-11e1-b734-5254005cbe44
Type: VDS

The problem was with the storage domain.



On the affected node I grepped the VDSM logs under /var/log/vdsm for ERROR:
Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) 
vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats
There are 20 more of the same type with the same vmId; I'm sure this is an aftereffect,
as the engine can't tell the status of the VMs.
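
For reference, the check was roughly of this form (a sketch, not the exact invocation):

grep ERROR /var/log/vdsm/vdsm.log
# narrowed to the VM that keeps failing its stats fetch:
grep 537eea7c-d12c-461f-adfb-6a1f2ebff4fb /var/log/vdsm/vdsm.log | grep ERROR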

Can you give me advice on where I can find more information to solve this issue?
Or perhaps suggest a scenario I can try?

I have another question I wanted to ask in a new mail, but perhaps it has
something to do with my issue:
The elected SPM is not part of this cluster and has only 2 storage paths
(multipath) to the SAN.
The problematic cluster has 4 storage paths (bigger hypervisors), and all
storage paths are connected successfully.

Does the SPM detect this difference, or is that irrelevant because the executing
host detects the possible paths on its own (which is what I assume)?
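
For context, this is roughly how I verify the paths on each host (plain Linux
tooling, nothing oVirt-specific, so treat it as a sketch):

# active iSCSI sessions on the host
iscsiadm -m session
# multipath topology and per-path states for the SAN LUNs
multipath -ll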

Currently in use:
oVirt Engine 3.0
oVirt Node 2.3.0
--> Is there any problem with mixing node versions with regard to the
ovirt-engine version?

Sorry for the number of questions; I really want to understand the
oVirt mechanisms completely,
so I can build a fail-safe virtual environment :)

Thanks in advance.

Best,
Sven.

-----Original Message-----
From: Haim Ateya [mailto:hat...@redhat.com]
Sent: Tuesday, 16 October 2012 14:38
To: Sven Knohsalla
Cc: users@ovirt.org; Itamar Heim; Omer Frenkel
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non
operational" STORAGE_DOMAIN_UNREACHABLE

Hi Sven,

Can you attach the full logs from the second host (the problematic one)? I guess it's
"deovn-a01".

2012-10-15 11:13:38,197 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01


----- Original Message -----
From: "Omer Frenkel" <ofren...@redhat.com>
To: "Itamar Heim" <ih...@redhat.com>, "Sven Knohsalla" 
<s.knohsa...@netbiscuits.com>
Cc: users@ovirt.org
Sent: Tuesday, October 16, 2012 2:02:50 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to "non 
operational" STORAGE_DOMAIN_UNREACHABLE



----- Original Message -----
From: "Itamar Heim" <ih...@redhat.com>
To: "Sven Knohsalla" <s.knohsa...@netbiscuits.com>
Cc: users@ovirt.org
Sent: Monday, October 15, 2012 8:36:07 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to
"non operational" STORAGE_DOMAIN_UNREACHABLE

On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
Hi,

sometimes one hypervisor's status changes to "Non-Operational" with the
error "STORAGE_DOMAIN_UNREACHABLE", and live migration (activated for all
VMs) starts.

I currently don't know why the oVirt node changes to this status, because
the connected iSCSI SAN is available the whole time (checked via the iSCSI
sessions and lsblk); I'm also able to read/write on the SAN during that
time.

We can simply activate this oVirt node and it comes up again. The
migration process then starts from scratch and hits the same error
-> a reboot of the oVirt node is necessary!

When a hypervisor changes to "Non-Operational" status, live migration
starts and tries to migrate ~25 VMs (~100 GB of RAM to migrate).

During that process the network load goes to 100%; some VMs get migrated,
then the destination host also changes to "Non-Operational" status with
the error "STORAGE_DOMAIN_UNREACHABLE".

Many VMs are still running on their origin host, some are paused, and
some show "migration from" status.

After a reboot of the origin host, the VMs of course end up in an unknown
state.

So the whole cluster is down :/

For this problem I have some questions:

- Does the oVirt engine use only the ovirtmgmt network for migration/HA?

Yes.


- If so, is there any possibility to *add*/switch a network for migration/HA?

You can bond, but not yet add another one.


- Is the way we are using live migration not recommended?

- Which engine module checks the availability of the storage domains for
the oVirt nodes?

The engine.


- Is there any timeout/cache option we can set/increase to avoid this
problem?

Well, it's not clear what the problem is.
Also, VDSM is supposed to throttle live migration to 3 VMs in parallel,
IIRC.
Also, at the cluster level you can configure not to live migrate VMs on
Non-Operational status.


- Is there any known problem with the versions we are using? (Migration
to ovirt-engine 3.1 is not possible at the moment.)

Oh, the cluster-level migration policy on Non-Operational may be a 3.1
feature, not sure.


AFAIR, it's in 3.0.


- Is it possible to modify the migration queue to only migrate a maximum
of, for example, 4 VMs at the same time?

Yes, there is a VDSM config option for that. I am pretty sure 3 is the
default, though.
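
If memory serves, the knob lives in /etc/vdsm/vdsm.conf on each node; something
along these lines (option name from memory, please verify against your vdsm
version before relying on it):

[vars]
# maximum number of outgoing live migrations running in parallel on this host
max_outgoing_migrations = 4

followed by a restart of vdsmd on that node (service vdsmd restart) to pick it up.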


_ovirt-engine:_

FC 16:  3.3.6-3.fc16.x86_64

Engine: 3.0.0_0001-1.6.fc16

KVM based VM: 2 vCPU, 4 GB RAM

1 NIC for ssh/https access
1 NIC for ovirtmgmt network access
engine source: dreyou repo

_ovirt-node:_
Node: 2.3.0
2 bonded NICs -> Frontend Network
4 Multipath NICs -> SAN connection

Attached some relevant logfiles.

Thanks in advance, I really appreciate your help!

Best,

Sven Knohsalla |System Administration

Office +49 631 68036 433 | Fax +49 631 68036 111
| e-mail s.knohsa...@netbiscuits.com |
Skype: Netbiscuits.admin

Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY



_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
