Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE
On 10/19/2012 06:43 PM, Sven Knohsalla wrote:

Hi Haim,

I wanted to wait with this mail until the problem occurred again. I disabled live migration for the cluster first, to make sure the second node wouldn't run into the same problem once a migration starts. It seems the problem isn't caused by migration, as I ran into the same error again today.

Log snippet from the web GUI:

2012-Oct-19,04:28:13 Host deovn-a01 cannot access one of the Storage Domains attached to it, or the Data Center object. Setting Host state to Non-Operational.

-- All VMs are running properly, although the engine says otherwise. Even the VM status in the engine GUI is wrong: it shows "vmname reboot in progress", but no reboot was initiated (ssh/rdp connections and file operations work fine).

The engine log says for this period:

cat /var/log/ovirt-engine/engine.log | grep 04:2*

2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
2012-10-19 04:28:13,884 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

I think the first output is important:

2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

-- Which problem? There's no debug info for that time period to work out where the problem could come from :/

Look at the lines above:

2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS

The problem was with the storage domain.
On the affected node I grepped /var/log/vdsm for ERROR:

Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats

...and about 20 more of the same type with the same vmId. I'm sure this is an aftereffect, as the engine can't tell the status of the VMs.

Can you give me advice on where I can find more information to solve this issue? Or perhaps a scenario I can try?

I have another curiosity I wanted to ask about in a new mail, but perhaps it has something to do with my issue: the elected SPM is not part of this cluster and only has 2 storage paths (multipath) to the SAN. The problematic cluster has 4 storage paths (bigger hypervisors), and all storage paths are connected successfully. Does the SPM detect this difference, or is that unnecessary because the executing command detects the possible paths on its own (which is what I assume)?

Currently in use: oVirt Engine 3.0, oVirt Node 2.3.0 -- is there any problem mixing node versions with regard to the ovirt-engine version?

Sorry for the amount of questions; I really want to understand the oVirt mechanics completely, to build up a fail-safe virtual environment :)

Thanks in advance.

Best,
Sven.

-Original Message-
From: Haim Ateya [mailto:hat...@redhat.com]
Sent: Tuesday, October 16, 2012 14:38
To: Sven Knohsalla
Cc: users@ovirt.org; Itamar Heim; Omer Frenkel
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

Hi Sven,

can you attach full logs from the second host (the problematic one)? I guess it's deovn-a01.

2012-10-15 11:13:38,197 WARN
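A note on the grep above: to grep, the pattern 04:2* means "04:" followed by zero or more "2" characters, so a date-anchored pattern is less ambiguous when pulling the 04:20-04:29 window from both sides. A minimal sketch, assuming the default log locations on the engine and on the node:

    grep '2012-10-19 04:2' /var/log/ovirt-engine/engine.log
    grep '2012-10-19 04:2' /var/log/vdsm/vdsm.log

Correlating the two should show whether the node's own storage monitoring reported anything at 04:23, or whether only the engine-side view went stale.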
Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE
- Original Message -
From: Itamar Heim ih...@redhat.com
To: Sven Knohsalla s.knohsa...@netbiscuits.com
Cc: Haim Ateya hat...@redhat.com, users@ovirt.org, Omer Frenkel ofren...@redhat.com
Sent: Sunday, October 21, 2012 11:05:56 AM
Subject: Re: AW: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

On 10/19/2012 06:43 PM, Sven Knohsalla wrote:

Hi Haim,

I wanted to wait with this mail until the problem occurred again. I disabled live migration for the cluster first, to make sure the second node wouldn't run into the same problem once a migration starts. It seems the problem isn't caused by migration, as I ran into the same error again today.

Log snippet from the web GUI:

2012-Oct-19,04:28:13 Host deovn-a01 cannot access one of the Storage Domains attached to it, or the Data Center object. Setting Host state to Non-Operational.

-- All VMs are running properly, although the engine says otherwise. Even the VM status in the engine GUI is wrong: it shows "vmname reboot in progress", but no reboot was initiated (ssh/rdp connections and file operations work fine).

The engine log says for this period:

cat /var/log/ovirt-engine/engine.log | grep 04:2*

2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
2012-10-19 04:28:13,884 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

I think the first output is important:

2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

-- Which problem? There's no debug info for that time period to work out where the problem could come from :/

Look at the lines above:

2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS

The problem was with the storage domain.

On the affected node I grepped /var/log/vdsm for ERROR:

Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats

...and about 20 more of the same type with the same vmId. I'm sure this is an aftereffect, as the engine can't tell the status of the VMs.

Can you give me advice on where I can find more information to solve this issue? Or perhaps a scenario I can try?

What's the status of the VMs right now? Can you please provide the output of the following commands:

virsh -r list
vdsClient -s 0 list table

Please attach full engine, vdsm and libvirt logs (and if possible, the qemu log files under /var/log/libvirt/qemu/).

I have another curiosity I wanted to ask about in a new mail, but perhaps it has something to do with my issue: the elected SPM is not part of this cluster and only has 2 storage paths (multipath) to the SAN. The problematic cluster has 4 storage paths (bigger hypervisors), and all storage paths are connected successfully.

I would like to see the repoStats reports within the node logs (vdsm.log).

Does the SPM detect
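For the repoStats request, a quick way to pull the relevant entries on the node, assuming the default vdsm log location (older segments may already be rotated away), would be something like:

    grep repoStats /var/log/vdsm/vdsm.log | tail -n 40
    grep ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 /var/log/vdsm/vdsm.log | tail -n 40

As far as I understand, the lastCheck/delay values in those reports are what the engine's domain monitoring evaluates, so the entries around 04:23-04:28 should show whether the node's monitor really timed out against DE-VM-SYSTEM or whether only the engine-side view was stale.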
Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE
Hi Haim,

I wanted to wait with this mail until the problem occurred again. I disabled live migration for the cluster first, to make sure the second node wouldn't run into the same problem once a migration starts. It seems the problem isn't caused by migration, as I ran into the same error again today.

Log snippet from the web GUI:

2012-Oct-19,04:28:13 Host deovn-a01 cannot access one of the Storage Domains attached to it, or the Data Center object. Setting Host state to Non-Operational.

-- All VMs are running properly, although the engine says otherwise. Even the VM status in the engine GUI is wrong: it shows "vmname reboot in progress", but no reboot was initiated (ssh/rdp connections and file operations work fine).

The engine log says for this period:

cat /var/log/ovirt-engine/engine.log | grep 04:2*

2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01
2012-10-19 04:28:13,775 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) starting ProcessDomainRecovery for domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5
2012-10-19 04:28:13,799 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-1) vds deovn-a01 reported domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5:DE-VM-SYSTEM as in problem, moving the vds to status NonOperational
2012-10-19 04:28:13,882 INFO [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (QuartzScheduler_Worker-1) Running command: SetNonOperationalVdsCommand internal: true. Entities affected : ID: 66b546c2-ae62-11e1-b734-5254005cbe44 Type: VDS
2012-10-19 04:28:13,884 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) START, SetVdsStatusVDSCommand(vdsId = 66b546c2-ae62-11e1-b734-5254005cbe44, status=NonOperational, nonOperationalReason=STORAGE_DOMAIN_UNREACHABLE), log id: daad8bd
2012-10-19 04:28:13,888 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (QuartzScheduler_Worker-1) FINISH, SetVdsStatusVDSCommand, log id: daad8bd
2012-10-19 04:28:19,690 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-38) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

I think the first output is important:

2012-10-19 04:23:13,773 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-94) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

-- Which problem? There's no debug info for that time period to work out where the problem could come from :/

On the affected node I grepped /var/log/vdsm for ERROR:

Thread-254302::ERROR::2012-10-12 16:01:11,359::vm::950::vm.Vm::(getStats) vmId=`537eea7c-d12c-461f-adfb-6a1f2ebff4fb`::Error fetching vm stats

...and about 20 more of the same type with the same vmId. I'm sure this is an aftereffect, as the engine can't tell the status of the VMs.

Can you give me advice on where I can find more information to solve this issue? Or perhaps a scenario I can try?

I have another curiosity I wanted to ask about in a new mail, but perhaps it has something to do with my issue: the elected SPM is not part of this cluster and only has 2 storage paths (multipath) to the SAN. The problematic cluster has 4 storage paths (bigger hypervisors), and all storage paths are connected successfully.
Does the SPM detect this difference, or is that unnecessary because the executing command detects the possible paths on its own (which is what I assume)?

Currently in use: oVirt Engine 3.0, oVirt Node 2.3.0 -- is there any problem mixing node versions with regard to the ovirt-engine version?

Sorry for the amount of questions; I really want to understand the oVirt mechanics completely, to build up a fail-safe virtual environment :)

Thanks in advance.

Best,
Sven.

-Original Message-
From: Haim Ateya [mailto:hat...@redhat.com]
Sent: Tuesday, October 16, 2012 14:38
To: Sven Knohsalla
Cc: users@ovirt.org; Itamar Heim; Omer Frenkel
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

Hi Sven,

can you attach full logs from the second host (the problematic one)? I guess it's deovn-a01.

2012-10-15 11:13:38,197 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

- Original Message -
From: Omer Frenkel ofren...@redhat.com
To: Itamar Heim ih...@redhat.com, Sven Knohsalla s.knohsa...@netbiscuits.com
Cc: users@ovirt.org
Sent: Tuesday, October 16, 2012 2:02:50 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

- Original Message -
From: Itamar Heim ih...@redhat.com
To: Sven Knohsalla s.knohsa...@netbiscuits.com
Cc: users
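On the 2-path vs. 4-path question: each host logs into the SAN on its own, and multipath assembles whatever paths it finds locally, so comparing the local view on the SPM and on deovn-a01 is usually the quickest check. The standard commands for that (output format varies by version) are:

    iscsiadm -m session -P 1
    multipath -ll

If one of the four paths on the bigger hypervisors flaps, multipath -ll will show it as failed/faulty even while the LUN as a whole stays readable, which would be consistent with the SAN looking "available all the time" from the host's point of view.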
Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE
- Original Message -
From: Itamar Heim ih...@redhat.com
To: Sven Knohsalla s.knohsa...@netbiscuits.com
Cc: users@ovirt.org
Sent: Monday, October 15, 2012 8:36:07 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

On 10/15/2012 03:56 PM, Sven Knohsalla wrote:

Hi,

sometimes one hypervisor's status turns to "Non-Operational" with error "STORAGE_DOMAIN_UNREACHABLE" and live migration (activated for all VMs) starts. I currently don't know why the ovirt-node turns to this status, because the connected iSCSI SAN is available the whole time (checked via iSCSI session and lsblk); I'm also able to read/write on the SAN during that time. We can simply activate this ovirt-node and it comes up again. The migration process then runs from scratch and hits the same error -> a reboot of the ovirt-node becomes necessary!

When a hypervisor turns to "Non-Operational" status, live migration starts and tries to migrate ~25 VMs (~100 GB of RAM to migrate). During that process the network load goes to 100%, some VMs get migrated, then the destination host also turns to "Non-Operational" with error "STORAGE_DOMAIN_UNREACHABLE". Many VMs are still running on their origin host, some are paused, some show "migration from" status. After a reboot of the origin host the VMs of course turn into an unknown state. So the whole cluster is down :/

For this problem I have some questions:

- Does oVirt Engine just use the ovirtmgmt network for migration/HA?
yes.

- If so, is there any possibility to *add*/switch a network for migration/HA?
you can bond, not yet add another one.

- Is the way we are using live migration not recommended?

- Which engine module checks the availability of the storage domain for the ovirt-nodes?
the engine.

- Is there any timeout/cache option we can set/increase to avoid this problem?
well, it's not clear what the problem is. Also, vdsm is supposed to throttle live migration to 3 VMs in parallel, IIRC. Also, at cluster level you can configure not to live migrate VMs on non-operational status.

- Is there any known problem with the versions we are using? (Migration to ovirt-engine 3.1 is not possible atm.)
oh, the cluster-level migration policy on non-operational may be a 3.1 feature, not sure.
AFAIR, it's in 3.0.

- Is it possible to modify the migration queue to only migrate a max. of 4 VMs at the same time, for example?
yes, there is a vdsm config option for that.
I am pretty sure 3 is the default though?

ovirt-engine:
FC 16: 3.3.6-3.fc16.x86_64
Engine: 3.0.0_0001-1.6.fc16
KVM-based VM: 2 vCPU, 4 GB RAM
1 NIC for ssh/https access
1 NIC for ovirtmgmt network access
engine source: dreyou repo

ovirt-node:
Node: 2.3.0
2 bonded NICs - frontend network
4 multipath NICs - SAN connection

Attached are some relevant logfiles.

Thanks in advance, I really appreciate your help!

Best,
Sven Knohsalla | System Administration
Office +49 631 68036 433 | Fax +49 631 68036 111
E-mail: s.knohsa...@netbiscuits.com | Skype: Netbiscuits.admin
Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
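On the vdsm config for the migration queue mentioned above: the knob lives in /etc/vdsm/vdsm.conf on the node; in the vdsm versions I have seen it is called max_outgoing_migrations, but the name and default may differ between releases, so treat this as a sketch rather than a guarantee:

    [vars]
    max_outgoing_migrations = 4

followed by a restart of vdsmd on the node (service vdsmd restart). On the image-based oVirt Node the file may also need to be persisted so the change survives a reboot.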
Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE
Hi Sven,

can you attach full logs from the second host (the problematic one)? I guess it's deovn-a01.

2012-10-15 11:13:38,197 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-33) domain ccaa4e7a-fa89-46a6-a6e0-07dfe78d1bd5 in problem. vds: deovn-a01

- Original Message -
From: Omer Frenkel ofren...@redhat.com
To: Itamar Heim ih...@redhat.com, Sven Knohsalla s.knohsa...@netbiscuits.com
Cc: users@ovirt.org
Sent: Tuesday, October 16, 2012 2:02:50 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

- Original Message -
From: Itamar Heim ih...@redhat.com
To: Sven Knohsalla s.knohsa...@netbiscuits.com
Cc: users@ovirt.org
Sent: Monday, October 15, 2012 8:36:07 PM
Subject: Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE

On 10/15/2012 03:56 PM, Sven Knohsalla wrote:

Hi,

sometimes one hypervisor's status turns to "Non-Operational" with error "STORAGE_DOMAIN_UNREACHABLE" and live migration (activated for all VMs) starts. I currently don't know why the ovirt-node turns to this status, because the connected iSCSI SAN is available the whole time (checked via iSCSI session and lsblk); I'm also able to read/write on the SAN during that time. We can simply activate this ovirt-node and it comes up again. The migration process then runs from scratch and hits the same error -> a reboot of the ovirt-node becomes necessary!

When a hypervisor turns to "Non-Operational" status, live migration starts and tries to migrate ~25 VMs (~100 GB of RAM to migrate). During that process the network load goes to 100%, some VMs get migrated, then the destination host also turns to "Non-Operational" with error "STORAGE_DOMAIN_UNREACHABLE". Many VMs are still running on their origin host, some are paused, some show "migration from" status. After a reboot of the origin host the VMs of course turn into an unknown state. So the whole cluster is down :/

For this problem I have some questions:

- Does oVirt Engine just use the ovirtmgmt network for migration/HA?
yes.

- If so, is there any possibility to *add*/switch a network for migration/HA?
you can bond, not yet add another one.

- Is the way we are using live migration not recommended?

- Which engine module checks the availability of the storage domain for the ovirt-nodes?
the engine.

- Is there any timeout/cache option we can set/increase to avoid this problem?
well, it's not clear what the problem is. Also, vdsm is supposed to throttle live migration to 3 VMs in parallel, IIRC. Also, at cluster level you can configure not to live migrate VMs on non-operational status.

- Is there any known problem with the versions we are using? (Migration to ovirt-engine 3.1 is not possible atm.)
oh, the cluster-level migration policy on non-operational may be a 3.1 feature, not sure.
AFAIR, it's in 3.0.

- Is it possible to modify the migration queue to only migrate a max. of 4 VMs at the same time, for example?
yes, there is a vdsm config option for that.
I am pretty sure 3 is the default though?

ovirt-engine:
FC 16: 3.3.6-3.fc16.x86_64
Engine: 3.0.0_0001-1.6.fc16
KVM-based VM: 2 vCPU, 4 GB RAM
1 NIC for ssh/https access
1 NIC for ovirtmgmt network access
engine source: dreyou repo

ovirt-node:
Node: 2.3.0
2 bonded NICs - frontend network
4 multipath NICs - SAN connection

Attached are some relevant logfiles.

Thanks in advance, I really appreciate your help!
Best,
Sven Knohsalla | System Administration
Office +49 631 68036 433 | Fax +49 631 68036 111
E-mail: s.knohsa...@netbiscuits.com | Skype: Netbiscuits.admin
Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
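On the observation that the network load goes to 100% during the migration storm: besides the parallel-migration throttle, vdsm also has a per-migration bandwidth cap; in the versions I have looked at it is migration_max_bandwidth (in MiB/s) in /etc/vdsm/vdsm.conf, but the name and default may differ between releases, so this is only a sketch:

    [vars]
    migration_max_bandwidth = 30

Lowering it trades migration time for headroom on the ovirtmgmt link, which matters here because that same link carries the engine's host monitoring traffic.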
Re: [Users] ITA-2967 URGENT: ovirt Node turns status to non operational STORAGE_DOMAIN_UNREACHABLE
On 10/15/2012 03:56 PM, Sven Knohsalla wrote:

Hi,

sometimes one hypervisor's status turns to "Non-Operational" with error "STORAGE_DOMAIN_UNREACHABLE" and live migration (activated for all VMs) starts. I currently don't know why the ovirt-node turns to this status, because the connected iSCSI SAN is available the whole time (checked via iSCSI session and lsblk); I'm also able to read/write on the SAN during that time. We can simply activate this ovirt-node and it comes up again. The migration process then runs from scratch and hits the same error -> a reboot of the ovirt-node becomes necessary!

When a hypervisor turns to "Non-Operational" status, live migration starts and tries to migrate ~25 VMs (~100 GB of RAM to migrate). During that process the network load goes to 100%, some VMs get migrated, then the destination host also turns to "Non-Operational" with error "STORAGE_DOMAIN_UNREACHABLE". Many VMs are still running on their origin host, some are paused, some show "migration from" status. After a reboot of the origin host the VMs of course turn into an unknown state. So the whole cluster is down :/

For this problem I have some questions:

- Does oVirt Engine just use the ovirtmgmt network for migration/HA?
yes.

- If so, is there any possibility to *add*/switch a network for migration/HA?
you can bond, not yet add another one.

- Is the way we are using live migration not recommended?

- Which engine module checks the availability of the storage domain for the ovirt-nodes?
the engine.

- Is there any timeout/cache option we can set/increase to avoid this problem?
well, it's not clear what the problem is. Also, vdsm is supposed to throttle live migration to 3 VMs in parallel, IIRC. Also, at cluster level you can configure not to live migrate VMs on non-operational status.

- Is there any known problem with the versions we are using? (Migration to ovirt-engine 3.1 is not possible atm.)
oh, the cluster-level migration policy on non-operational may be a 3.1 feature, not sure.

- Is it possible to modify the migration queue to only migrate a max. of 4 VMs at the same time, for example?
yes, there is a vdsm config option for that.

ovirt-engine:
FC 16: 3.3.6-3.fc16.x86_64
Engine: 3.0.0_0001-1.6.fc16
KVM-based VM: 2 vCPU, 4 GB RAM
1 NIC for ssh/https access
1 NIC for ovirtmgmt network access
engine source: dreyou repo

ovirt-node:
Node: 2.3.0
2 bonded NICs - frontend network
4 multipath NICs - SAN connection

Attached are some relevant logfiles.

Thanks in advance, I really appreciate your help!

Best,
Sven Knohsalla | System Administration
Office +49 631 68036 433 | Fax +49 631 68036 111
E-mail: s.knohsa...@netbiscuits.com | Skype: Netbiscuits.admin
Netbiscuits GmbH | Europaallee 10 | 67657 | GERMANY

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
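As a complement to the log files mentioned above, the state the node itself reports can be checked live while the engine flags the domain: virsh -r list shows what libvirt thinks is running, vdsClient -s 0 list table shows vdsm's view of the VMs, and vdsClient -s 0 repoStats (assuming the vdsClient shipped with Node 2.3.0 exposes that verb) prints the current lastCheck/delay per storage domain:

    virsh -r list
    vdsClient -s 0 list table
    vdsClient -s 0 repoStats

If repoStats on the node looks healthy while the engine still reports the domain as in problem, the issue is more likely in the engine/host communication than in the SAN itself.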