Re: [Users] oVirt/RHEV fencing; a single point of failure

2012-04-05 Thread Andrew Beekhof

On 4/03/12 7:16 AM, Perry Myers wrote:

On 03/03/2012 11:52 AM, xrx wrote:

Hello,

I was worried about the high availability approach taken by RHEV/oVirt.
I had read the thread titled "Some thoughts on enhancing High
Availability in oVirt" but couldn't help but feel that oVirt is missing
basic HA while its developers are considering adding (and in my opinion
unneeded) complexity with service monitoring.


Service monitoring is a highly desirable feature, but for the most part
(today) people achieve it by running service monitoring in a layered
fashion.

For example, running the RHEL HA cluster stack on top of VMs on RHEV (or
Fedora Clustering on top of oVirt VMs).

So we could certainly skip providing service HA as an integral feature
of oVirt and continue to leverage Pacemaker-style service HA as a
layered option instead.

In the past I've gotten the impression that tighter integration and a
single UI/API for managing both VM and service HA was desirable.


It all comes down to fencing. Picture this: 3 HP hypervisors running
RHEV/oVirt with iLO fencing. Say hypervisor A runs 10 VMs, all of which
are set to be highly available. Now suppose that hypervisor A has a
power failure or an iLO failure (I've seen it happen more than once with
a batch of HP DL380 G6s). Because RHEV would not be able to fence the
hypervisor, as its iLO is unresponsive, those 10 HA VMs that were halted
are NOT moved to other hypervisors automatically.

I suggest that oVirt concentrate on supporting multiple fencing
devices as a development priority. SCSI persistent reservation based
fencing would be an ideal secondary, if not primary, fencing device; it
would be easy for users to set up, since SANs generally support it, and
it is proven to work well, as seen in Red Hat clusters.


Completely agree here.  The Pacemaker/rgmanager cluster stacks already
support an arbitrary number of fence devices per host, to support both
redundant power supplies and redundant fencing devices.  In order to
provide resilient service HA, fixing this would be a prerequisite
anyhow.  I've cc'd Andrew Beekhof from the Pacemaker/stonith_ng project,
since I think it might be useful to model the fencing for oVirt
similarly to how Pacemaker/stonith_ng does it.  Perhaps there's even
some code that could be reused for this as well.
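
To make the fallback model concrete: with multiple devices per host, a
fencing request amounts to walking the devices in priority order until
one of them confirms the host is down. A conceptual sketch in Python
(agent arguments and addresses are placeholders, not Pacemaker or oVirt
code):

    import subprocess

    # Devices are tried in order; fence_ilo and fence_scsi are real
    # fence-agents, but the arguments below are purely illustrative.
    FENCE_DEVICES = [
        ("iLO power fencing",
         ["fence_ilo", "-a", "ilo-hyp-a.example.com",
          "-l", "admin", "-p", "secret", "-o", "off"]),
        ("SCSI-3 PR fencing",
         ["fence_scsi", "-n", "hyp-a", "-o", "off"]),
    ]

    def fence_host():
        """Return True as soon as any configured device succeeds."""
        for name, cmd in FENCE_DEVICES:
            try:
                subprocess.run(cmd, check=True, timeout=60)
                return True  # host confirmed fenced via this device
            except (subprocess.CalledProcessError,
                    subprocess.TimeoutExpired):
                continue  # e.g. iLO unresponsive; try the next device
        return False  # all devices failed; needs manual intervention

With a second device configured, a dead iLO no longer blocks recovery
of the HA VMs; it merely causes a fallback to the next device.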


The idea is that fencing requests can be initiated from multiple sources
and that clients can be notified regardless of where the request
originates, even from non-local machines, provided the daemon (an
independent part of Pacemaker) is hooked up to corosync.


So the daemon takes care of the boring stuff: reading the configuration
file, performing periodic health checks, keeping a fencing history and
sending notifications.


If you're interested, we can make it a sub-package to avoid pulling in
all of pacemaker.  There is also the option of just using the library,
which knows how to invoke the existing agents.
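
For context on what invoking the agents involves: fence agents take
their parameters as name=value lines on standard input, so a caller
does not need the full cluster stack to drive one. A minimal sketch
(parameter names follow fence_ilo's conventions; the values are
placeholders):

    import subprocess

    def run_fence_agent(agent, **params):
        """Drive a fence agent via the standard stdin name=value
        protocol used by the fence-agents scripts."""
        stdin_args = "".join("%s=%s\n" % (k, v)
                             for k, v in sorted(params.items()))
        proc = subprocess.run([agent], input=stdin_args, text=True)
        return proc.returncode == 0  # 0 means the action succeeded

    # Placeholder usage:
    # run_fence_agent("fence_ilo", ipaddr="10.0.0.5", login="admin",
    #                 passwd="secret", action="off")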


Happy to answer any questions people might have.



As for SCSI-3 PR based fencing... the trouble here has been that the
fence_scsi script provided in fence-agents is Perl based, and we were
hesitant to drag Perl into the list of required things on oVirt Node
(and in general).

On the other hand, fence_scsi might not be the right level of
granularity for oVirt's SCSI-3 PR based fencing anyhow.  Perhaps it
would be better to have vdsm call sg_persist commands directly.
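
As a rough illustration of what a vdsm-level SCSI-3 PR fence could look
like (a sketch only; the device path and reservation keys below are
placeholders):

    import subprocess

    DEVICE = "/dev/mapper/shared-lun"  # placeholder shared LUN
    LOCAL_KEY = "0x1"    # this host's registration key (placeholder)
    VICTIM_KEY = "0x2"   # key of the host being fenced (placeholder)

    def sg(args):
        subprocess.run(["sg_persist"] + args + [DEVICE], check=True)

    # Each healthy host registers its key with the LUN at startup.
    sg(["--out", "--register", "--param-sark=%s" % LOCAL_KEY])

    # Fencing = preempt-and-abort the victim's key. Assuming a
    # write-exclusive, registrants-only reservation is held (as
    # fence_scsi sets up), the storage then rejects writes from the
    # fenced host even if its power and iLO paths are dead.
    sg(["--out", "--preempt-abort",
        "--param-rk=%s" % LOCAL_KEY,
        "--param-sark=%s" % VICTIM_KEY])

    # Verify by listing the keys still registered on the device.
    sg(["--in", "--read-keys"])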

I've cc'd Ryan O'Hara who wrote fence_scsi and knows a fair bit about
SCSI-3 PR.  If oVirt is interested in pursuing this, perhaps he can be
of assistance.


I have brought up this point about fencing being a single point of
failure in RHEV with a Red Hat employee (Mark Wagner) during the RHEV
virtual event, but he said that it is not. I don't see how it isn't; a
single loose iLO cable and the VMs are stuck until there is manual
intervention.


Agreed.  This should be easy to fix, and fixing it would provide
greater HA.

That being said, I still think more tightly integrated service HA is a
good idea as well.

Perry




[Users] oVirt/RHEV fencing; a single point of failure

2012-03-03 Thread xrx

Hello,

I was worried about the high availability approach taken by RHEV/oVirt.
I had read the thread titled "Some thoughts on enhancing High
Availability in oVirt" but couldn't help but feel that oVirt is missing
basic HA while its developers are considering adding (and in my opinion
unneeded) complexity with service monitoring.


It all comes down to fencing. Picture this: 3 HP hypervisors running
RHEV/oVirt with iLO fencing. Say hypervisor A runs 10 VMs, all of which
are set to be highly available. Now suppose that hypervisor A has a
power failure or an iLO failure (I've seen it happen more than once with
a batch of HP DL380 G6s). Because RHEV would not be able to fence the
hypervisor, as its iLO is unresponsive, those 10 HA VMs that were halted
are NOT moved to other hypervisors automatically.


I suggest that oVirt concentrate on supporting multiple fencing
devices as a development priority. SCSI persistent reservation based
fencing would be an ideal secondary, if not primary, fencing device; it
would be easy for users to set up, since SANs generally support it, and
it is proven to work well, as seen in Red Hat clusters.


I have brought up this point about fencing being a single point of
failure in RHEV with a Red Hat employee (Mark Wagner) during the RHEV
virtual event, but he said that it is not. I don't see how it isn't; a
single loose iLO cable and the VMs are stuck until there is manual
intervention.


Any thoughts?


-xrx




Re: [Users] oVirt/RHEV fencing; a single point of failure

2012-03-03 Thread Perry Myers
On 03/03/2012 11:52 AM, xrx wrote:
 Hello,
 
 I was worried about the high availability approach taken by RHEV/oVirt.
 I had read the thread titled "Some thoughts on enhancing High
 Availability in oVirt" but couldn't help but feel that oVirt is missing
 basic HA while its developers are considering adding (and in my opinion
 unneeded) complexity with service monitoring.

Service monitoring is a highly desirable feature, but for the most part
(today) people achieve it by running service monitoring in a layered
fashion.

For example, running the RHEL HA cluster stack on top of VMs on RHEV (or
Fedora Clustering on top of oVirt VMs).

So we could certainly skip providing service HA as an integral feature
of oVirt and continue to leverage Pacemaker-style service HA as a
layered option instead.

In the past I've gotten the impression that tighter integration and a
single UI/API for managing both VM and service HA was desirable.

 It all comes down to fencing. Picture this: 3 HP hypervisors running
 RHEV/oVirt with iLO fencing. Say hypervisor A runs 10 VMs, all of which
 are set to be highly available. Now suppose that hypervisor A has a
 power failure or an iLO failure (I've seen it happen more than once with
 a batch of HP DL380 G6s). Because RHEV would not be able to fence the
 hypervisor, as its iLO is unresponsive, those 10 HA VMs that were halted
 are NOT moved to other hypervisors automatically.
 
 I suggest that oVirt concentrate on supporting multiple fencing
 devices as a development priority. SCSI persistent reservation based
 fencing would be an ideal secondary, if not primary, fencing device; it
 would be easy for users to set up, since SANs generally support it, and
 it is proven to work well, as seen in Red Hat clusters.

Completely agree here.  The Pacemaker/rgmanager cluster stacks already
support an arbitrary number of fence devices per host, to support both
redundant power supplies and redundant fencing devices.  In order to
provide resilient service HA, fixing this would be a prerequisite
anyhow.  I've cc'd Andrew Beekhof from the Pacemaker/stonith_ng project,
since I think it might be useful to model the fencing for oVirt
similarly to how Pacemaker/stonith_ng does it.  Perhaps there's even
some code that could be reused for this as well.

As for SCSI-3 PR based fencing... the trouble here has been that the
fence_scsi script provided in fence-agents is Perl based, and we were
hesitant to drag Perl into the list of required things on oVirt Node
(and in general).

On the other hand, fence_scsi might not be the right level of
granularity for oVirt's SCSI-3 PR based fencing anyhow.  Perhaps it
would be better to have vdsm call sg_persist commands directly.

I've cc'd Ryan O'Hara who wrote fence_scsi and knows a fair bit about
SCSI-3 PR.  If oVirt is interested in pursuing this, perhaps he can be
of assistance.

 I have brought up this point about fencing being a single point of
 failure in RHEV with a Red Hat employee (Mark Wagner) during the RHEV
 virtual event, but he said that it is not. I don't see how it isn't; a
 single loose iLO cable and the VMs are stuck until there is manual
 intervention.

Agreed.  This should be easy to fix, and fixing it would provide
greater HA.

That being said, I still think more tightly integrated service HA is a
good idea as well.

Perry


Re: [Users] oVirt/RHEV fencing; a single point of failure

2012-03-03 Thread Andrew Cathrow


- Original Message -
 From: Perry Myers pmy...@redhat.com
 To: xrx xrx-ov...@xrx.me, Ryan O'Hara roh...@redhat.com, Andrew 
 Beekhof abeek...@redhat.com
 Cc: users@ovirt.org
 Sent: Saturday, March 3, 2012 3:16:02 PM
 Subject: Re: [Users] oVirt/RHEV fencing; a single point of failure
 
 On 03/03/2012 11:52 AM, xrx wrote:
  Hello,

  I was worried about the high availability approach taken by RHEV/oVirt.
  I had read the thread titled "Some thoughts on enhancing High
  Availability in oVirt" but couldn't help but feel that oVirt is missing
  basic HA while its developers are considering adding (and in my opinion
  unneeded) complexity with service monitoring.

 Service monitoring is a highly desirable feature, but for the most part
 (today) people achieve it by running service monitoring in a layered
 fashion.

 For example, running the RHEL HA cluster stack on top of VMs on RHEV (or
 Fedora Clustering on top of oVirt VMs).

 So we could certainly skip providing service HA as an integral feature
 of oVirt and continue to leverage Pacemaker-style service HA as a
 layered option instead.

 In the past I've gotten the impression that tighter integration and a
 single UI/API for managing both VM and service HA was desirable.

  It all comes down to fencing. Picture this: 3 HP hypervisors running
  RHEV/oVirt with iLO fencing. Say hypervisor A runs 10 VMs, all of which
  are set to be highly available. Now suppose that hypervisor A has a
  power failure or an iLO failure (I've seen it happen more than once with
  a batch of HP DL380 G6s). Because RHEV would not be able to fence the
  hypervisor, as its iLO is unresponsive, those 10 HA VMs that were halted
  are NOT moved to other hypervisors automatically.

  I suggest that oVirt concentrate on supporting multiple fencing
  devices as a development priority. SCSI persistent reservation based
  fencing would be an ideal secondary, if not primary, fencing device; it
  would be easy for users to set up, since SANs generally support it, and
  it is proven to work well, as seen in Red Hat clusters.

 Completely agree here.  The Pacemaker/rgmanager cluster stacks already
 support an arbitrary number of fence devices per host, to support both
 redundant power supplies and redundant fencing devices.  In order to
 provide resilient service HA, fixing this would be a prerequisite
 anyhow.  I've cc'd Andrew Beekhof from the Pacemaker/stonith_ng project,
 since I think it might be useful to model the fencing for oVirt
 similarly to how Pacemaker/stonith_ng does it.  Perhaps there's even
 some code that could be reused for this as well.

 As for SCSI-3 PR based fencing... the trouble here has been that the
 fence_scsi script provided in fence-agents is Perl based, and we were
 hesitant to drag Perl into the list of required things on oVirt Node
 (and in general).

 On the other hand, fence_scsi might not be the right level of
 granularity for oVirt's SCSI-3 PR based fencing anyhow.  Perhaps it
 would be better to have vdsm call sg_persist commands directly.

 I've cc'd Ryan O'Hara who wrote fence_scsi and knows a fair bit about
 SCSI-3 PR.  If oVirt is interested in pursuing this, perhaps he can be
 of assistance.

There's also sanlock, which plays a role here. In the past we required
some form of fencing action, but once sanlock is integrated it provides
another path.
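
The reason sanlock can substitute for a fencing action: each host keeps
a lease alive on the shared storage itself, and a host that stops
renewing loses the lease, so its resources can be recovered without
anyone reaching its power port. A toy model of that expiry logic
(illustrative only; this is not the sanlock API, whose real protocol
uses delta leases on the shared LUN):

    import time

    LEASE_TIMEOUT = 80  # placeholder expiry window, in seconds
    leases = {}         # host_id -> timestamp of last renewal

    def renew(host_id):
        leases[host_id] = time.time()  # live hosts renew periodically

    def is_safe_to_recover(host_id):
        """Once a host's lease has expired it can no longer hold its
        resource leases, so its HA VMs can be restarted elsewhere."""
        last = leases.get(host_id)
        return last is None or time.time() - last > LEASE_TIMEOUT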

 
  I have brought up this point about fencing being a single point of
  failure in RHEV with a Red Hat employee (Mark Wagner) during the RHEV
  virtual event, but he said that it is not. I don't see how it isn't; a
  single loose iLO cable and the VMs are stuck until there is manual
  intervention.

 Agreed.  This should be easy to fix, and fixing it would provide
 greater HA.

 That being said, I still think more tightly integrated service HA is a
 good idea as well.
 
 Perry


Re: [Users] oVirt/RHEV fencing; a single point of failure

2012-03-03 Thread Perry Myers
 I've cc'd Ryan O'Hara who wrote fence_scsi and knows a fair bit about
 SCSI-3 PR.  If oVirt is interested in pursuing this, perhaps he can be
 of assistance.

 There's also sanlock, which plays a role here. In the past we required
 some form of fencing action, but once sanlock is integrated it provides
 another path.

Agreed.  Should have mentioned that :)
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users