Re: [Users] Testing High Availability and Power outages

2013-01-14 Thread Alexandru Vladulescu

On 01/14/2013 10:13 AM, Doron Fediuck wrote:

From: "Alexandru Vladulescu"
To: "Doron Fediuck"
Cc: "users"
Sent: Sunday, January 13, 2013 9:49:25 PM
Subject: Re: [Users] Testing High Availability and Power outages


Dear Doron,


I have retested the case and am writing you the results.

Furthermore, in case this information is useful to you, my
network setup is the following: 2 Layer 2 switches (Zyxel ES2108-G &
ES2200-8) configured with 2 VLANs (1 inside backbone
network -- added as br0 to oVirt; 1 outside network -- running on the
ovirtmgmt interface for Internet traffic to the VMs). The backbone
switch is gigabit-capable, and each host runs with a jumbo frame
setup. There is one more firewall server that routes the subnets
through a trunk port and VLAN configuration. The oVirt software
has been set up on the backbone network subnet.

As you can guess, the network infrastructure is not the problem here.
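
For reference, one quick way to confirm jumbo frames end to end is a
do-not-fragment ping sized just under the 9000-byte MTU. A minimal
sketch (the backbone hostname is a placeholder from my setup):

    import subprocess

    # 9000-byte MTU minus 20 (IP header) and 8 (ICMP header) leaves an
    # 8972-byte payload; -M do sets the don't-fragment bit, so the ping
    # fails if any hop on the path lacks jumbo frame support.
    def check_jumbo(host, mtu=9000):
        payload = mtu - 28
        result = subprocess.run(
            ["ping", "-c", "3", "-M", "do", "-s", str(payload), host],
            capture_output=True, text=True)
        return result.returncode == 0

    print(check_jumbo("hyper01.backbone"))  # placeholder backbone name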

The test case was the same as described before:

1. VM running on Hyper01, none on Hyper02. The VM had the "High
Available" check box set.
2. Hard power-off of Hyper01 at the power network (no soft/manual
shutdown).
3. After a while, oVirt marks Hyper01 as Non Responsive.
4. I manually clicked "Confirm Host has been rebooted", and the VM
starts on Hyper02 after oVirt's manual fence of Hyper01.
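
For anyone reproducing step 3, the host state can be polled from the
engine's REST API. A rough sketch, assuming the 3.1-era XML API under
/api and an admin@internal login; the engine URL is a placeholder and
the string slicing stands in for real XML parsing:

    import time
    import requests  # third-party; pip install requests

    ENGINE = "https://engine.example.com"  # placeholder engine FQDN
    AUTH = ("admin@internal", "password")

    def host_state(name):
        # The XML answer carries <status><state>...</state></status>;
        # crude slicing keeps the sketch short.
        r = requests.get(ENGINE + "/api/hosts?search=" + name,
                         auth=AUTH, verify=False)
        r.raise_for_status()
        start = r.text.find("<state>") + len("<state>")
        return r.text[start:r.text.find("</state>", start)]

    while host_state("Hyper01") != "non_responsive":
        time.sleep(10)
    print("Hyper01 is now Non Responsive")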

I have attached the engine log. The "Confirm Host has been
rebooted" was done at precisely 21:31:45. In the cluster section in
oVirt, I did try changing the "Resilience Policy" attribute from
"Migrate Virtual Machines" to "Migrate only Highly Available Virtual
Machines", but with the same results.


As I gather from the engine log, the node controller sees the
Hyper01 node as having a "network fault" (no route to host),
although the host was powered down.

Is this supposed to be the default behavior in this case? The
scenario might overlap with a real network outage.
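
From the engine's point of view, "no route to host" and "powered off"
presumably look identical unless a power management agent can be asked.
A hedged sketch of querying a host's lights-out device with the stock
fence agents; fence_ipmilan is only an example, and the BMC address and
credentials are placeholders:

    import subprocess

    # fence_ipmilan ships with the fence-agents package; "-o status"
    # asks the BMC whether the machine is powered on, independent of
    # the host's own network stack.
    cmd = ["fence_ipmilan",
           "-a", "10.0.0.50",   # placeholder IPMI address of Hyper01
           "-l", "fenceuser",   # placeholder IPMI login
           "-p", "fencepass",   # placeholder IPMI password
           "-o", "status"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout or result.stderr)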


My Regards,
Alex.



Re: [Users] Testing High Availability and Power outages

2013-01-14 Thread Doron Fediuck
- Original Message -

> From: "Alexandru Vladulescu" 
> To: "Doron Fediuck" 
> Cc: "users" 
> Sent: Sunday, January 13, 2013 9:49:25 PM
> Subject: Re: [Users] Testing High Availability and Power outages

> Dear Doron,

> I have retested the case and am writing you the results.

> Furthermore, in case this information is useful to you, my network
> setup is the following: 2 Layer 2 switches (Zyxel ES2108-G & ES2200-8)
> configured with 2 VLANs (1 inside backbone network --
> added as br0 to oVirt; 1 outside network -- running on the ovirtmgmt
> interface for Internet traffic to the VMs). The backbone switch is
> gigabit-capable, and each host runs with a jumbo frame setup. There
> is one more firewall server that routes the subnets through a trunk
> port and VLAN configuration. The oVirt software has been set up on
> the backbone network subnet.

> As you can guess, the network infrastructure is not the problem
> here.

> The test case was the same as described before:

> 1. VM running on Hyper01, none on Hyper02. The VM had the "High
> Available" check box set.
> 2. Hard power-off of Hyper01 at the power network (no soft/manual
> shutdown).
> 3. After a while, oVirt marks Hyper01 as Non Responsive.
> 4. I manually clicked "Confirm Host has been rebooted", and the VM
> starts on Hyper02 after oVirt's manual fence of Hyper01.

> I have attached the engine log. The "Confirm Host has been rebooted"
> was done at precisely 21:31:45. In the cluster section in oVirt, I
> did try changing the "Resilience Policy" attribute from "Migrate
> Virtual Machines" to "Migrate only Highly Available Virtual Machines",
> but with the same results.

> As I gather from the engine log, the node controller sees the
> Hyper01 node as having a "network fault" (no route to host), although
> the host was powered down.

> Is this supposed to be the default behavior in this case? The
> scenario might overlap with a real network outage.

> My Regards,
> Alex.

Re: [Users] Testing High Availability and Power outages

2013-01-13 Thread Doron Fediuck
- Original Message -

> From: "Alexandru Vladulescu" 
> To: "Doron Fediuck" 
> Cc: "users" 
> Sent: Sunday, January 13, 2013 10:46:41 AM
> Subject: Re: [Users] Testing High Availability and Power outages

> Dear Doron,

> I haven't collected the logs from the tests, but I would gladly redo
> the case and get back to you ASAP.

> This feature is the main reason I chose oVirt in the first place,
> over other virtualization environments.

> Could you please tell me which logs I should focus on besides the
> engine log -- vdsm, or other relevant logs?

> Regards,
> Alex

> --
> Sent from phone.

> On 13.01.2013, at 09:56, Doron Fediuck <dfedi...@redhat.com> wrote:

Re: [Users] Testing High Availability and Power outages

2013-01-13 Thread Alexandru Vladulescu
Dear Doron,

I haven't collected the logs from the tests, but I would gladly redo the case
and get back to you ASAP.

This feature is the main reason I chose oVirt in the first place, over other
virtualization environments.

Could you please tell me which logs I should focus on besides the engine
log -- vdsm, or other relevant logs?
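
For what it's worth, the usual suspects live at fixed paths:
/var/log/ovirt-engine/engine.log on the engine machine and
/var/log/vdsm/vdsm.log on each hypervisor. A small sketch that bundles
whichever of them exists locally, ready to attach to a reply:

    import os
    import tarfile

    # Standard oVirt 3.x log locations; only files present on this
    # particular box end up in the archive.
    CANDIDATES = [
        "/var/log/ovirt-engine/engine.log",  # on the engine
        "/var/log/vdsm/vdsm.log",            # on each hypervisor
        "/var/log/libvirt/libvirtd.log",     # libvirt log, if present
    ]

    with tarfile.open("ha-test-logs.tar.gz", "w:gz") as tar:
        for path in CANDIDATES:
            if os.path.exists(path):
                tar.add(path)
                print("packed", path)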

Regards,
Alex


--
Sent from phone.

On 13.01.2013, at 09:56, Doron Fediuck  wrote:

> 
> 
> From: "Alexandru Vladulescu" 
> To: "users" 
> Sent: Friday, January 11, 2013 2:47:38 PM
> Subject: [Users] Testing High Availability and Power outages
> 
> 
> Hi,
> 
> 
> Today, I started testing the High Availability features and the fence
> mechanism on my oVirt 3.1 installation (from the dreyou repos) running on
> 3 x CentOS 6.3 hypervisors.
> 
> As I reported yesterday in a previous email thread, the migration
> priority queue cannot be increased (a bug) in the current version, so I
> decided to test what the official documentation says about the High
> Availability cases.
> 
> This is a disaster scenario to suffer from: one hypervisor has a power
> outage/hardware problem and the VMs running on it do not migrate to
> other spare resources.
> 
> 
> In the official documentation from ovirt.org, the following is quoted:
> High availability
> 
> Allows critical VMs to be restarted on another host in the event of hardware 
> failure with three levels of priority, taking into account resiliency policy.
> 
> Resiliency policy to control high availability VMs at the cluster level.
> Supports application-level high availability with supported fencing agents.
> 
> As well as in the Architecture description:
> 
> High Availability - restart guest VMs from failed hosts automatically on 
> other hosts
> 
> 
> 
> So the testing went like this -- one VM running a Linux box, with the
> check box "High Available" set and "Priority for Run/Migration queue:"
> set to Low. Under Host we have the check box set to "Any Host in
> Cluster", without "Allow VM migration only upon Admin specific request"
> checked.
> 
> 
> 
> My environment:
> 
> 
> Configuration: 2 x hypervisors (same cluster/hardware configuration); 1 x
> hypervisor also acting as a NAS (NFS) server (different cluster/hardware
> configuration)
> 
> Actions: I cut off the power to one of the hypervisors in the 2-node
> cluster while the VM was running on it. This translates to a power
> outage.
> 
> Results: the hypervisor node that suffered the outage shows in the
> Hosts tab with Status Non Responsive, and the VM has a question mark and
> cannot be powered off or anything else (so it's stuck).
> 
> In the log console in the GUI, I get:
> 
> Host Hyper01 is non-responsive.
> VM Web-Frontend01 was set to the Unknown status.
> 
> There is nothing I could do besides clicking the Hyper01 "Confirm
> Host has been rebooted", after which the VM starts on Hyper02 with a cold
> reboot of the VM.
> 
> The Log console changes to:
> 
> Vm Web-Frontend01 was shut down due to Hyper01 host reboot or manual fence
> All VMs' status on Non-Responsive Host Hyper01 were changed to 'Down' by 
> admin@internal
> Manual fencing for host Hyper01 was started.
> VM Web-Frontend01 was restarted on Host Hyper02
> 
> 
> I would like your take on this problem. Reading the documentation &
> features pages on the official website, I supposed this would have been
> an automatic mechanism working on some sort of vdsm & engine fencing
> action. Am I missing something here?
> 
> 
> Thank you for your patience reading this.
> 
> 
> Regards,
> Alex.
> 
> 
> 
> 
> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
> Hi Alex,
> Can you share with us the engine's log from the relevant time period?
> 
> Doron
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Testing High Availability and Power outages

2013-01-12 Thread Doron Fediuck
- Original Message -

> From: "Alexandru Vladulescu" 
> To: "users" 
> Sent: Friday, January 11, 2013 2:47:38 PM
> Subject: [Users] Testing High Availability and Power outages

> Hi,

> Today, I started testing the High Availability features and the fence
> mechanism on my oVirt 3.1 installation (from the dreyou repos) running
> on 3 x CentOS 6.3 hypervisors.

> As I reported yesterday in a previous email thread, the migration
> priority queue cannot be increased (a bug) in the current version, so
> I decided to test what the official documentation says about the High
> Availability cases.

> This is a disaster scenario to suffer from: one hypervisor has a power
> outage/hardware problem and the VMs running on it do not migrate to
> other spare resources.

> In the official documentation from ovirt.org, the following is
> quoted:
> High availability

> Allows critical VMs to be restarted on another host in the event of
> hardware failure with three levels of priority, taking into account
> resiliency policy.

> * Resiliency policy to control high availability VMs at the cluster
> level.
> * Supports application-level high availability with supported fencing
> agents.

> As well as in the Architecture description:

> High Availability - restart guest VMs from failed hosts automatically
> on other hosts

> So the testing went like this -- one VM running a Linux box, with the
> check box "High Available" set and "Priority for Run/Migration
> queue:" set to Low. Under Host we have the check box set to "Any Host
> in Cluster", without "Allow VM migration only upon Admin specific
> request" checked.

> My environment:

> Configuration: 2 x hypervisors (same cluster/hardware configuration);
> 1 x hypervisor also acting as a NAS (NFS) server (different
> cluster/hardware configuration)

> Actions: I cut off the power to one of the hypervisors in the 2-node
> cluster while the VM was running on it. This translates to a power
> outage.

> Results: the hypervisor node that suffered the outage shows in the
> Hosts tab with Status Non Responsive, and the VM has a question mark
> and cannot be powered off or anything else (so it's stuck).

> In the log console in the GUI, I get:

> Host Hyper01 is non-responsive.
> VM Web-Frontend01 was set to the Unknown status.

> There is nothing I could do besides clicking the Hyper01 "Confirm
> Host has been rebooted", after which the VM starts on Hyper02 with a
> cold reboot of the VM.

> The Log console changes to:

> Vm Web-Frontend01 was shut down due to Hyper01 host reboot or manual
> fence
> All VMs' status on Non-Responsive Host Hyper01 were changed to 'Down'
> by admin@internal
> Manual fencing for host Hyper01 was started.
> VM Web-Frontend01 was restarted on Host Hyper02

> I would like your take on this problem. Reading the documentation
> & features pages on the official website, I supposed this would have
> been an automatic mechanism working on some sort of vdsm & engine
> fencing action. Am I missing something here?

> Thank you for your patience reading this.

> Regards,
> Alex.

> ___
> Users mailing list
> Users@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users

Hi Alex, 
Can you share with us the engine's log from the relevant time period? 

Doron 
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[Users] Testing High Availability and Power outages

2013-01-11 Thread Alexandru Vladulescu


Hi,


Today, I started testing the High Availability features and the fence
mechanism on my oVirt 3.1 installation (from the dreyou repos) running on
3 x CentOS 6.3 hypervisors.


As I reported yesterday in a previous email thread, the migration
priority queue cannot be increased (a bug) in the current version, so I
decided to test what the official documentation says about the High
Availability cases.


This is a disaster scenario to suffer from: one hypervisor has a power
outage/hardware problem and the VMs running on it do not migrate to
other spare resources.



In the official documentation from ovirt.org, the following is quoted:


 High availability

Allows critical VMs to be restarted on another host in the event of
hardware failure with three levels of priority, taking into account
resiliency policy.

 * Resiliency policy to control high availability VMs at the cluster
   level.
 * Supports application-level high availability with supported fencing
   agents.


As well as in the Architecture description:

High Availability - restart guest VMs from failed hosts automatically
on other hosts




So the testing went like this -- one VM running a Linux box, with the
check box "High Available" set and "Priority for Run/Migration queue:" set
to Low. Under Host we have the check box set to "Any Host in Cluster",
without "Allow VM migration only upon Admin specific request" checked.




My environment:


Configuration: 2 x hypervisors (same cluster/hardware configuration);
1 x hypervisor also acting as a NAS (NFS) server (different
cluster/hardware configuration)


Actions: I cut off the power to one of the hypervisors in the 2-node
cluster while the VM was running on it. This translates to a power
outage.


Results: the hypervisor node that suffered the outage shows in the
Hosts tab with Status Non Responsive, and the VM has a question mark
and cannot be powered off or anything else (so it's stuck).


In the log console in the GUI, I get:

Host Hyper01 is non-responsive.
VM Web-Frontend01 was set to the Unknown status.

There is nothing I could do besides clicking the Hyper01
"Confirm Host has been rebooted", after which the VM starts on Hyper02
with a cold reboot of the VM.
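
While the VM sits in the Unknown state, what vdsm itself believes can
be checked directly on the surviving host. A hedged sketch wrapping the
stock vdsClient tool that ships with vdsm ("-s 0" targets the local
vdsm over SSL):

    import subprocess

    # "vdsClient -s 0 list table" prints one row per VM known to the
    # local vdsm: VM id, PID, name and state (Up, Down, ...).
    out = subprocess.run(["vdsClient", "-s", "0", "list", "table"],
                         capture_output=True, text=True)
    print(out.stdout or out.stderr)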


The Log console changes to:

Vm Web-Frontend01 was shut down due to Hyper01 host reboot or manual fence
All VMs' status on Non-Responsive Host Hyper01 were changed to 'Down' by 
admin@internal

Manual fencing for host Hyper01 was started.
VM Web-Frontend01 was restarted on Host Hyper02


I would like your take on this problem. Reading the documentation &
features pages on the official website, I supposed this would have
been an automatic mechanism working on some sort of vdsm & engine
fencing action. Am I missing something here?
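
As far as I understand, the engine will only fence a non-responsive
host automatically if that host has a power management agent configured
under its Power Management tab; without one it cannot tell a dead host
from a network outage and therefore waits for the manual "Confirm Host
has been rebooted". A hedged sketch of enabling such an agent over the
3.1-era REST API; the host id, agent type, address and credentials are
all placeholders:

    import requests  # third-party; pip install requests

    ENGINE = "https://engine.example.com"  # placeholder engine FQDN
    AUTH = ("admin@internal", "password")
    HOST_ID = "00000000-0000-0000-0000-000000000000"  # placeholder id

    # "ipmilan" is one of the stock agent types (drac5, ilo, apc, ...).
    body = """<host>
      <power_management type="ipmilan" enabled="true">
        <address>10.0.0.50</address>
        <username>fenceuser</username>
        <password>fencepass</password>
      </power_management>
    </host>"""

    r = requests.put(ENGINE + "/api/hosts/" + HOST_ID, data=body,
                     auth=AUTH,
                     headers={"Content-Type": "application/xml"},
                     verify=False)
    print(r.status_code, r.reason)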



Thank you for your patience reading this.


Regards,
Alex.



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users