Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

2012-10-21 Thread Itamar Heim

On 10/19/2012 06:43 PM, Sven Knohsalla wrote:

Hi Haim,

I wanted to wait to send this mail until the problem occurred again.
I disabled live migration for the cluster first, to make sure the second node 
wouldn't have the same problem when migration is started.

It seems the problem isn't caused by migration, as I ran into the same error 
again today.

Log snippet Webgui:
2012-Oct-19,04:28:13 Host deovn-a01 cannot access one of the Storage Domains 
attached to it, or the Data Center object. Setting Host state to Non-Operational.

-- all VMs are running properly, although the engine says otherwise.
Even the VM status in the engine GUI is wrong, as it's showing vmname reboot 
in progress, but no reboot was initiated (ssh/rdp connections and file operations are 
working fine)

Engine log says for this period:
cat /var/log/ovirt-engine/engine.log | grep 04:2*
2012-10-19 04:23:13,773 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain 
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-1) vds deovn-a01 reported domain 
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds 
to status NonOperational
2012-10-19 04:28:13,882 INFO  
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] 
(QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand 
internal: true. Entities affected :  ID: 66b546c2-ae62-11e1-b734-5254005cbe44 
Type: VDS
2012-10-19 04:28:13,884 INFO  
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] 
(QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 
66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, 
nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO  
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] 
(QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01

I think the first output is important:
2012-10-19 04:23:13,773 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01
-- which problem? There's no debug info during that time period to figure out 
where the problem could come from :/


look at the lines above:
 2012-10-19 04:28:13,799 WARN 
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-1) vds deovn-a01 reported domain 
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving 
the vds to status NonOperational
 2012-10-19 04:28:13,882 INFO 
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] 
(QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand 
internal: true. Entities affected :  ID: 
66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS


the problem was with the storage domain.




On affected node side I did grep /var/log/vdsm for ERROR:
Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) 
vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats
And 20 more of the same type with the same vmId; I'm sure this is an aftereffect, as 
the engine can't tell the status of the VMs.

Can you give me advice on where I can find more information to solve this issue?
Or perhaps a scenario I can try?

I have another question I wanted to ask in a new mail, but perhaps this 
has something to do with my issue:
The elected SPM is not part of this cluster; it just has 2 storage paths 
(multipath) to the SAN.
The problematic cluster has 4 storage paths (bigger hypervisors), and all 
storage paths are connected successfully.

Does the SPM detect this difference, or is it unnecessary because the executing 
command detects the possible paths on its own (which is what I assume)?

Currently in use:
oVirt-engine 3.0
oVirt-node 2.3.0
-- is there any problem mixing node versions with regard to the ovirt-engine 
version?

Sorry for the number of questions; I really want to understand the 
oVirt mechanisms completely, 
to build a fail-safe virtual environment :)

Thanks in advance.

Best,
Sven.

-Original Message-
From: Haim Ateya [mailto:hat...@redhat.com]
Sent: Tuesday, October 16, 2012 14:38
To: Sven Knohsalla
Cc: users@ovirt.org; Itamar Heim; Omer Frenkel
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non 
operational STORAGE_DOMAIN_UNREACHABLE

Hi Sven,

can you attach full logs from the second host (the problematic one)? I guess it's 
deovn-a01.

2012-10-15 11:13:38,197 WARN

Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

2012-10-21 Thread Haim Ateya


- Original Message -
 From: Itamar Heim ih...@redhat.com
 To: Sven Knohsalla s.knohsa...@netbiscuits.com
 Cc: Haim Ateya hat...@redhat.com, users@ovirt.org, Omer Frenkel 
 ofren...@redhat.com
 Sent: Sunday, October 21, 2012 11:05:56 AM
 Subject: Re: AW: [Users] ITA-2967 URGENT: ovirt Node turns status to non 
 operational STORAGE_DOMAIN_UNREACHABLE
 
 On 10/19/2012 06:43 PM, Sven Knohsalla wrote:
  Hi Haim,
 
  I wanted to wait to send this mail until the problem occurred again.
  I disabled live migration for the cluster first, to make sure the
  second node wouldn't have the same problem when migration is
  started.

  It seems the problem isn't caused by migration, as I ran into the
  same error again today.
 
  Log snippet Webgui:
  2012-Oct-19,04:28:13 Host deovn-a01 cannot access one of the
  Storage Domains attached to it, or the Data Center object. Setting
  Host state to Non-Operational.
 
  -- all VMs are running properly, although the engine says
  otherwise.
  Even the VM status in the engine GUI is wrong, as it's showing
  vmname reboot in progress, but no reboot was
  initiated (ssh/rdp connections and file operations are
  working fine)
 
  Engine log says for this period:
  cat /var/log/ovirt-engine/engine.log | grep 04:2*
  2012-10-19 04:23:13,773 WARN
   [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
  (QuartzScheduler_Worker-94) domain
  ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
  2012-10-19 04:28:13,775 INFO
   [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
  (QuartzScheduler_Worker-1) starting ProcessDomainRecovery for
  domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
  2012-10-19 04:28:13,799 WARN
   [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
  (QuartzScheduler_Worker-1) vds deovn-a01 reported domain
  ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem,
  moving the vds to status NonOperational
  2012-10-19 04:28:13,882 INFO
   [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
  (QuartzScheduler_Worker-1) Running command:
  SetNonOperationalVdsCommand internal: true. Entities affected :
   ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
  2012-10-19 04:28:13,884 INFO
   [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
  (QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId =
  66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational,
  nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
  2012-10-19 04:28:13,888 INFO
   [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand]
  (QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id:
  daad8bd
  2012-10-19 04:28:19,690 WARN
   [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
  (QuartzScheduler_Worker-38) domain
  ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
 
  I think the first output is important:
  2012-10-19 04:23:13,773 WARN
   [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
  (QuartzScheduler_Worker-94) domain
  ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
  -- which problem? There's no debug info during that time period to
  figure out where the problem could come from :/
 
 look at the lines above:
   2012-10-19 04:28:13,799 WARN
 [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
 (QuartzScheduler_Worker-1) vds deovn-a01 reported domain
 ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem,
 moving
 the vds to status NonOperational
   2012-10-19 04:28:13,882 INFO
 [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand]
 (QuartzScheduler_Worker-1) Running command:
 SetNonOperationalVdsCommand
 internal: true. Entities affected :  ID:
 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
 
 the problem was with the storage domain.
 
 
 
  On affected node side I did grep /var/log/vdsm for ERROR:
  Thread-254302::ERROR::2012-10-12
  16:01:11,359::vm::950::vm.Vm::(getStats)
  vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm
  stats
  And 20 more of the same type with the same vmId; I'm sure this is an
  aftereffect, as the engine can't tell the status of the VMs.
 
  Can you give me advice on where I can find more information to
  solve this issue?
  Or perhaps a scenario I can try?
what's the status of the VMs right now? can you please provide the output of 
the following commands:

virsh -r list
vdsClient -s 0 list table

please attach full engine, vdsm and libvirt logs (and if possible, qemu log 
file under /var/log/libvirt/qemu/).
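for example, something like this on the node should bundle most of it (just a 
rough sketch; paths are the usual defaults, adjust to your setup, and libvirtd 
only writes /var/log/libvirt/libvirtd.log when file logging is enabled):

tar czf /tmp/deovn-a01-logs.tar.gz \
    /var/log/vdsm/vdsm.log* \
    /var/log/libvirt/libvirtd.log \
    /var/log/libvirt/qemu/*.log

engine.log itself lives on the engine machine under /var/log/ovirt-engine/.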
 
  I have another question I wanted to ask in a new mail, but
  perhaps this has something to do with my issue:
  The elected SPM is not part of this cluster; it just has 2 storage
  paths (multipath) to the SAN.
  The problematic cluster has 4 storage paths (bigger hypervisors),
  and all storage paths are connected successfully.
I would like to see repoStats reports within the node logs (vdsm.log).
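something like this should pull them out of the log (rough sketch; the exact 
format of those lines differs a bit between vdsm versions):

grep -A 3 repoStats /var/log/vdsm/vdsm.log | tail -n 100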
 
  Does the SPM detect

Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

2012-10-19 Thread Sven Knohsalla
Hi Haim,

I wanted to wait to send this mail until the problem occurred again.
I disabled live migration for the cluster first, to make sure the second node 
wouldn't have the same problem when migration is started.

It seems the problem isn't caused by migration, as I ran into the same error 
again today.

Log snippet Webgui:
2012-Oct-19,04:28:13 Host deovn-a01 cannot access one of the Storage Domains 
attached to it, or the Data Center object. Setting Host state to 
Non-Operational.

-- all VMs are running properly, although the engine says otherwise.
   Even the VM status in the engine GUI is wrong, as it's showing vmname 
reboot in progress, but no reboot was initiated (ssh/rdp connections, 
file operations are working fine)

Engine log says for this period:
cat /var/log/ovirt-engine/engine.log | grep 04:2*
2012-10-19 04:23:13,773 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain 
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-1) vds deovn-a01 reported domain 
ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds 
to status NonOperational
2012-10-19 04:28:13,882 INFO  
[org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] 
(QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand 
internal: true. Entities affected :  ID: 66b546c2-ae62-11e1-b734-5254005cbe44 
Type: VDS
2012-10-19 04:28:13,884 INFO  
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] 
(QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 
66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, 
nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO  
[org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] 
(QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01

I think the first output is important:
2012-10-19 04:23:13,773 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01
-- which problem? There's no debug info during that time period to figure out 
where the problem could come from :/
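(As an aside, my grep pattern above is a bit loose; a tighter filter for just 
that window would be something like

grep '^2012-10-19 04:2' /var/log/ovirt-engine/engine.log

but as far as I can see it doesn't turn up anything beyond the lines above.)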

On affected node side I did grep /var/log/vdsm for ERROR:
Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) 
vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats
And 20 more of the same type with the same vmId; I'm sure this is an aftereffect, as 
the engine can't tell the status of the VMs.
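(Easy to count/inspect with, e.g.:

grep -c 'Error fetching vm stats' /var/log/vdsm/vdsm.log
grep 537eea7c-d12c-461f-adfb-6a1f2ebff4fb /var/log/vdsm/vdsm.log | tail -n 5

in case more context around those errors helps.)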

Can you give me advice on where I can find more information to solve this issue?
Or perhaps a scenario I can try?

I have another question I wanted to ask in a new mail, but perhaps this 
has something to do with my issue:
The elected SPM is not part of this cluster; it just has 2 storage paths 
(multipath) to the SAN.
The problematic cluster has 4 storage paths (bigger hypervisors), and all 
storage paths are connected successfully.
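(For reference, the path count can be double-checked on the node with, for 
example:

multipath -ll
iscsiadm -m session

and here both currently show all paths up.)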

Does the SPM detect this difference, or is it unnecessary because the executing 
command detects the possible paths on its own (which is what I assume)?

Currently in use:
oVirt-engine 3.0
oVirt-node 2.3.0
-- is there any problem mixing node versions with regard to the ovirt-engine 
version?

Sorry for the number of questions; I really want to understand the 
oVirt mechanisms completely, 
to build a fail-safe virtual environment :)

Thanks in advance.

Best,
Sven.

-Original Message-
From: Haim Ateya [mailto:hat...@redhat.com] 
Sent: Tuesday, October 16, 2012 14:38
To: Sven Knohsalla
Cc: users@ovirt.org; Itamar Heim; Omer Frenkel
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non 
operational STORAGE_DOMAIN_UNREACHABLE

Hi Sven,

can you attach full logs from the second host (the problematic one)? I guess it's 
deovn-a01.

2012-10-15 11:13:38,197 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01


- Original Message -
 From: Omer Frenkel ofren...@redhat.com
 To: Itamar Heim ih...@redhat.com, Sven Knohsalla 
 s.knohsa...@netbiscuits.com
 Cc: users@ovirt.org
 Sent: Tuesday, October 16, 2012 2:02:50 PM
 Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non 
 operational STORAGE_DOMAIN_UNREACHABLE
 
 
 
 - Original Message -
  From: Itamar Heim ih...@redhat.com
  To: Sven Knohsalla s.knohsa...@netbiscuits.com
  Cc: users

Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

2012-10-16 Thread Omer Frenkel


- Original Message -
 From: Itamar Heim ih...@redhat.com
 To: Sven Knohsalla s.knohsa...@netbiscuits.com
 Cc: users@ovirt.org
 Sent: Monday, October 15, 2012 8:36:07 PM
 Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non 
 operational STORAGE_DOMAIN_UNREACHABLE
 
 On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
  Hi,
 
  sometimes one hypervisor's status turns to „Non-operational“ with
  error
  “STORAGE_DOMAIN_UNREACHABLE” and the live-migration (activated for
  all
  VMs) is starting.
 
  I don’t currently know why the ovirt-node turns to this status,
  because
  the connected iSCSI SAN is available all the time (checked via iscsi
  session and lsblk), I’m also able to r/w on the SAN during that
  time.
 
  We can simply activate this ovirt-node and it comes up again. The
  migration process then runs from scratch and hits the same
  error
  -> reboot of ovirt-node necessary!
 
  When a hypervisor turns to “non-operational” status, live
  migration
  starts and tries to migrate ~25 VMs (~100 GB RAM to migrate).
 
  During that process the network workload goes to 100%, some VMs will
  be
  migrated, then the destination host also turns to “non-operational”
  status with error “STORAGE_DOMAIN_UNREACHABLE”.
 
  Many VMs are still running on their origin host, some are paused,
  some
  are showing “migration from” status.
 
  After a reboot of the origin host, the VMs of course turn into an
  unknown
  state.
 
  So the whole cluster is down :/
 
  For this problem I have some questions:
 
  -Does ovirt engine just use the ovirt-mgmt network for
  migration/HA?
 
 yes.
 
 
  -If so, is there any possibility to *add*/switch a network for
  migration/HA?
 
 you can bond, not yet add another one.
 
 
  -Is the kind of way we are using the live-migration not
  recommended?
 
  -Which engine module checks the availability of the storage domain
  for
  the ovirt-nodes?
 
 the engine.
 
 
  -Is there any timeout/cache option we can set/increase to avoid
  this
  problem?
 
 well, not clear what the problem is.
 also, vdsm is supposed to throttle live migration to 3 VMs in
 parallel
 iirc.
 also, you can configure at the cluster level not to live-migrate VMs
 on non-operational status.
 
 
  -Is there any known problem with the versions we are using?
  (Migration
  to ovirt-engine 3.1 is not possible atm)
 
 oh, the cluster level migration policy on non operational may be a
 3.1
 feature, not sure.
 

AFAIR, it's in 3.0

 
  -Is it possible to modify the migration queue to just migrate a
  max. of
  4 VMs at the same time for example?
 
 yes, there is a vdsm config for that. i am pretty sure 3 is the
 default
 though?
 
 
  _ovirt-engine: _
 
  FC 16:  3.3.6-3.fc16.x86_64
 
  Engine: 3.0.0_0001-1.6.fc16
 
  KVM based VM: 2 vCPU, 4 GB RAM
 
  1 NIC for ssh/https access
  1 NIC for ovirtmgmt network access
  engine source: dreyou repo
 
  _ovirt-node:_
  Node: 2.3.0
  2 bonded NICs - Frontend Network
  4 Multipath NICs - SAN connection
 
  Attached some relevant logfiles.
 
  Thanks in advance, I really appreciate your help!
 
  Best,
 
  Sven Knohsalla |System Administration
 
  Office +49 631 68036 433 | Fax +49 631 68036 111
  | e-mail s.knohsa...@netbiscuits.com |
  Skype: Netbiscuits.admin
 
  Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY
 
 
 
  ___
  Users mailing list
  Users@ovirt.org
  http://lists.ovirt.org/mailman/listinfo/users
 
 
 
 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users
 
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

2012-10-16 Thread Haim Ateya
Hi Sven,

can you attach full logs from the second host (the problematic one)? I guess it's 
deovn-a01.

2012-10-15 11:13:38,197 WARN  
[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] 
(QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in 
problem. vds: deovn-a01


- Original Message -
 From: Omer Frenkel ofren...@redhat.com
 To: Itamar Heim ih...@redhat.com, Sven Knohsalla 
 s.knohsa...@netbiscuits.com
 Cc: users@ovirt.org
 Sent: Tuesday, October 16, 2012 2:02:50 PM
 Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non 
 operational STORAGE_DOMAIN_UNREACHABLE
 
 
 
 - Original Message -
  From: Itamar Heim ih...@redhat.com
  To: Sven Knohsalla s.knohsa...@netbiscuits.com
  Cc: users@ovirt.org
  Sent: Monday, October 15, 2012 8:36:07 PM
  Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to
  non operational STORAGE_DOMAIN_UNREACHABLE
  
  On 10/15/2012 03:56 PM, Sven Knohsalla wrote:
   Hi,
  
   sometimes one hypervisor's status turns to „Non-operational“ with
   error
   “STORAGE_DOMAIN_UNREACHABLE” and the live-migration (activated
   for
   all
   VMs) is starting.
  
   I don’t currently know why the ovirt-node turns to this status,
   because
   the connected iSCSI SAN is available all the time (checked via
   iscsi
   session and lsblk), I’m also able to r/w on the SAN during that
   time.
  
   We can simply activate this ovirt-node and it comes up again. The
   migration process then runs from scratch and hits the same
   error
   -> reboot of ovirt-node necessary!
  
   When a hypervisor turns to “non-operational” status, live
   migration
   starts and tries to migrate ~25 VMs (~100 GB RAM to
   migrate).
  
   During that process the network workload goes to 100%, some VMs will
   be
   migrated, then the destination host also turns to
   “non-operational”
   status with error “STORAGE_DOMAIN_UNREACHABLE”.
  
   Many VMs are still running on their origin host, some are
   paused,
   some
   are showing “migration from” status.
  
   After a reboot of the origin host, the VMs of course turn into an
   unknown
   state.
  
   So the whole cluster is down :/
  
   For this problem I have some questions:
  
   -Does ovirt engine just use the ovirt-mgmt network for
   migration/HA?
  
  yes.
  
  
   -If so, is there any possibility to *add*/switch a network for
   migration/HA?
  
  you can bond, not yet add another one.
  
  
   -Is the kind of way we are using the live-migration not
   recommended?
  
   -Which engine module checks the availability of the storage
   domain
   for
   the ovirt-nodes?
  
  the engine.
  
  
   -Is there any timeout/cache option we can set/increase to avoid
   this
   problem?
  
   well, not clear what the problem is.
   also, vdsm is supposed to throttle live migration to 3 VMs in
   parallel
   iirc.
   also, you can configure at the cluster level not to live-migrate VMs
   on non-operational status.
  
  
   -Is there any known problem with the versions we are using?
   (Migration
   to ovirt-engine 3.1 is not possible atm)
  
  oh, the cluster level migration policy on non operational may be a
  3.1
  feature, not sure.
  
 
 AFAIR, it's in 3.0
 
  
   -Is it possible to modify the migration queue to just migrate a
   max. of
   4 VMs at the same time for example?
  
  yes, there is a vdsm config for that. i am pretty sure 3 is the
  default
  though?
  
  
   _ovirt-engine: _
  
   FC 16:  3.3.6-3.fc16.x86_64
  
   Engine: 3.0.0_0001-1.6.fc16
  
   KVM based VM: 2 vCPU, 4 GB RAM
  
   1 NIC for ssh/https access
   1 NIC for ovirtmgmt network access
   engine source: dreyou repo
  
   _ovirt-node:_
   Node: 2.3.0
   2 bonded NICs - Frontend Network
   4 Multipath NICs - SAN connection
  
   Attached some relevant logfiles.
  
   Thanks in advance, I really appreciate your help!
  
   Best,
  
   Sven Knohsalla |System Administration
  
   Office +49 631 68036 433 | Fax +49 631 68036 111
   | e-mail s.knohsa...@netbiscuits.com |
   Skype: Netbiscuits.admin
  
   Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY
  
  
  
   ___
   Users mailing list
   Users@ovirt.org
   http://lists.ovirt.org/mailman/listinfo/users
  
  
  
  ___
  Users mailing list
  Users@ovirt.org
  http://lists.ovirt.org/mailman/listinfo/users
  
 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

2012-10-15 Thread Itamar Heim

On 10/15/2012 03:56 PM, Sven Knohsalla wrote:

Hi,

sometimes one hypervisor's status turns to „Non-operational“ with error
“STORAGE_DOMAIN_UNREACHABLE” and the live-migration (activated for all
VMs) is starting.

I don’t currently know why the ovirt-node turns to this status, because
the connected iSCSI SAN is available all the time (checked via iscsi
session and lsblk), I’m also able to r/w on the SAN during that time.
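(i.e. roughly:

iscsiadm -m session
lsblk

both look normal during that time, and a read/write test against the LUN works.)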

We can simply activate this ovirt-node and it comes up again. The
migration process then runs from scratch and hits the same error
-> reboot of ovirt-node necessary!

When a hypervisor turns to “non-operational” status, live migration
starts and tries to migrate ~25 VMs (~100 GB RAM to migrate).

During that process the network workload goes to 100%, some VMs will be
migrated, then the destination host also turns to “non-operational”
status with error “STORAGE_DOMAIN_UNREACHABLE”.

Many VMs are still running on their origin host, some are paused, some
are showing “migration from” status.

After a reboot of the origin host, the VMs of course turn into an unknown
state.

So the whole cluster is down :/

For this problem I have some questions:

-Does ovirt engine just use the ovirt-mgmt network for migration/HA?


yes.



-If so, is there any possibility to *add*/switch a network for migration/HA?


you can bond, not yet add another one.



-Is the kind of way we are using the live-migration not recommended?

-Which engine module checks the availability of the storage domain for
the ovirt-nodes?


the engine.



-Is there any timeout/cache option we can set/increase to avoid this
problem?


well, not clear what the problem is.
also, vdsm is supposed to throttle live migration to 3 VMs in parallel 
iirc.
also, you can configure at the cluster level not to live-migrate VMs on 
non-operational status.




-Is there any known problem with the versions we are using? (Migration
to ovirt-engine 3.1 is not possible atm)


oh, the cluster level migration policy on non operational may be a 3.1 
feature, not sure.




-Is it possible to modify the migration queue to just migrate a max. of
4 VMs at the same time for example?


yes, there is a vdsm config for that. i am pretty sure 3 is the default 
though?
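if you want to check or change it, it's a vdsm config option on the node, 
something like this (option name from memory, please verify against the 
config shipped with your vdsm version):

# /etc/vdsm/vdsm.conf
[vars]
max_outgoing_migrations = 3

then restart vdsmd for it to take effect.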




_ovirt-engine: _

FC 16:  3.3.6-3.fc16.x86_64

Engine: 3.0.0_0001-1.6.fc16

KVM based VM: 2 vCPU, 4 GB RAM

1 NIC for ssh/https access
1 NIC for ovirtmgmt network access
engine source: dreyou repo

_ovirt-node:_
Node: 2.3.0
2 bonded NICs - Frontend Network
4 Multipath NICs - SAN connection

Attached some relevant logfiles.

Thanks in advance, I really appreciate your help!

Best,

Sven Knohsalla |System Administration

Office +49 631 68036 433 | Fax +49 631 68036 111
| e-mail s.knohsa...@netbiscuits.com |
Skype: Netbiscuits.admin

Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users




___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users