Bug#553166: [Debian-ha-maintainers] Bug#553166: closed by Guido Günther a...@sigxcpu.org (Re: Bug#553166: redhat-cluster: services are not relocated when a node fails)

2009-10-30 Thread Martin Waite
Hi Guido

On Thu, Oct 29, 2009 at 6:10 PM, Guido Günther a...@sigxcpu.org wrote:
 Hi Martin,
 On Thu, Oct 29, 2009 at 04:56:17PM +, Martin Waite wrote:
 Hi Guido,

 I have abandoned using RHCS on Debian Stable. RHCS 2 in Fedora and
 CentOS does not have the bug, and unfortunately in my production
 environment I will be tied to using RHCS 2.

 I do believe that the bug has gone away in version 3 of RHCS, but
 version 2 of RHCS as supplied by Lenny is virtually useless: it
 cannot handle failover.

 As a newcomer to using RHCS, it took a few days to figure out that the
 problems I was having were not caused by my configuration errors (of
 which I had many) but were actually caused by a bug.

 Is there some way of either removing the package or providing some
 warning to potential users that it doesn't work? Marking the bug as
 closed seems inappropriate because it implies that the bug has been
 resolved, which is not true. Would "wontfix" be better?
 The bug is fixed in RHCS 3.0.2, so closing it is appropriate. I wouldn't
 object to removing RHCS 2 from Lenny, though, since I never got anything
 to work with RHCS 2 either (neither rgmanager nor gfs). I'm cc'ing the
 maintainers of the RHCS 2 package in Lenny. Is there anybody really using
 RHCS 2 in Lenny in production? If not, we should remove it.
 Cheers,
  -- Guido


On further thought, it is probably best to just leave things as they
are. From exchanges on the linux-cluster mailing list, some people do
use the Lenny package, but patch and rebuild the source package
themselves.
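
For anyone stuck on Lenny, that rebuild route looks roughly like this.
This is only a sketch: the patch file name is a placeholder for the fix
attached to the upstream bugzilla report, and directory/package names may
differ slightly on your system.

# Sketch: rebuild the Lenny redhat-cluster packages with the upstream fix applied.
sudo apt-get build-dep cman                 # pull in the build dependencies
apt-get source redhat-cluster               # fetch the Lenny source package
cd redhat-cluster-2.*/
patch -p1 < ../fenced-node-name-fix.patch   # placeholder name for the bugzilla patch
dpkg-buildpackage -rfakeroot -us -uc -b     # build unsigned binary packages
# Install the rebuilt packages on every cluster node, e.g.:
sudo dpkg -i ../cman_*.deb ../rgmanager_*.deb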

Once squeeze is out, the cluster will work again.

Thanks for your help.

regards,
Martin






Bug#553166: redhat-cluster: services are not relocated when a node fails

2009-10-29 Thread Martin Waite
Package: redhat-cluster
Severity: important


An unresolved upstream bug (the report includes a patch) affects the
usability of the cluster suite on Debian Lenny:

https://bugzilla.redhat.com/show_bug.cgi?id=512512

This bug prevents the cluster from correctly relocating services that
were running on a failed node. fenced fences the failed node correctly,
but is unable to communicate this to the cman module, which in turn
fails to notify rgmanager that the node has been fenced and that any
services it was running are safe to relocate to other nodes.

This bug makes the cluster suite on Debian Lenny unusable for
providing HA services.

A symptom of the notification failure can be seen in syslog when a node fails:

Oct 28 16:15:51 clusternode27 clurgmgrd[8602]: <debug> Membership Change Event
Oct 28 16:15:51 clusternode27 fenced[2760]: clusternode30 not a cluster member after 0 sec post_fail_delay
Oct 28 16:15:51 clusternode27 fenced[2760]: fencing node clusternode30
Oct 28 16:15:51 clusternode27 fenced[2760]: can't get node number for node pC9@#001
Oct 28 16:15:51 clusternode27 fenced[2760]: fence clusternode30 success
Oct 28 16:15:56 clusternode27 clurgmgrd[8602]: <info> Waiting for node #30 to be fenced

On line 4, "Can't get node number ..." should report the name
clusternode30, but the bug incorrectly frees the memory holding this
name before it is used in the message. So while fenced is satisfied
that the node has been successfully fenced, the final line shows that
the resource manager clurgmgrd is still waiting for notification that
the fencing has taken place.
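
For anyone trying to reproduce this, the symptom is easy to spot by
grepping syslog on the surviving nodes (assuming the default Debian log
location):

sudo grep -E "can't get node number|Waiting for node" /var/log/syslog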

This bug does not affect the RHEL5 version of the package, which is
based on a different source tree.
However, all distributions based on the STABLE2 branch are likely to
be affected.

The cluster used to expose the bug consists of three VM guests with the
following cluster configuration (/etc/cluster/cluster.conf):

<?xml version="1.0"?>
<cluster name="testcluster" config_version="28">
  <cman port="6809">
    <multicast addr="224.0.0.1"/>
  </cman>

  <fencedevices>
    <fencedevice agent="fence_ack_null" name="fan01"/>
  </fencedevices>

  <clusternodes>
    <clusternode name="clusternode27" nodeid="27">
      <multicast addr="224.0.0.1" interface="eth0:1"/>
      <fence>
        <method name="1">
          <device name="fan01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="clusternode28" nodeid="28">
      <multicast addr="224.0.0.1" interface="eth0:1"/>
      <fence>
        <method name="1">
          <device name="fan01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="clusternode30" nodeid="30">
      <multicast addr="224.0.0.1" interface="eth0:1"/>
      <fence>
        <method name="1">
          <device name="fan01"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="new_cluster_failover" nofailback="1" ordered="0" restricted="1">
        <failoverdomainnode name="clusternode27" priority="1"/>
        <failoverdomainnode name="clusternode28" priority="1"/>
        <failoverdomainnode name="clusternode30" priority="1"/>
      </failoverdomain>
    </failoverdomains>

    <resources>
      <script name="sentinel" file="/bin/true"/>
    </resources>

    <service autostart="0" exclusive="0" name="SENTINEL" recovery="disable">
      <script ref="sentinel"/>
    </service>
  </rm>
</cluster>

Where clusternode27, clusternode28, and clusternode30 are the 3 node names.

The only non-standard component is a dummy fencing agent installed in
/usr/sbin/fence_ack_null:
#!/bin/bash

#
# Dummy fencing agent that always reports success
#

echo "Done"
exit 0
# eof
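
As far as I understand the fence agent API, fenced passes an agent its
options as name=value pairs on stdin, so the dummy agent can be installed
and smoke-tested by hand before starting the cluster. The nodename line
below is just for illustration; this agent ignores its input anyway.

sudo install -m 0755 fence_ack_null /usr/sbin/fence_ack_null
echo "nodename=clusternode30" | sudo /usr/sbin/fence_ack_null \
  && echo "agent reported success"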

The cluster can now be started.  On each node:

sudo /etc/init.d/cman start
sudo /etc/init.d/rgmanager start

Once all nodes are running, start the SENTINEL service on clusternode30:

sudo /usr/sbin/clusvcadm -e SENTINEL -m clusternode30

On clusternode27, view the cluster status. Service SENTINEL should be
in state "started" on clusternode30.

sudo /usr/sbin/clustat

At this point, tail syslog on clusternode27 and clusternode28.
Power-off clusternode30 (as rudely as possible).
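
For example, on each surviving node (assuming the default syslog
location), something like:

sudo tail -f /var/log/syslog | grep -E 'fenced|clurgmgrd'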

On clusternode27, view the status of the cluster again:

sudo /usr/sbin/clustat

This will show that clusternode30 is Offline, but that
service:SENTINEL is still in state "started" on clusternode30.

Again on clusternode27, view the node status:

sudo /usr/sbin/cman_tool -f nodes

clusternode30 will show status "X", with a note saying "Node has
not been fenced since it went down".
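
To capture the stuck state non-interactively, the same checks can be
scripted; this is just a sketch that greps the human-readable output of
the two tools used above:

sudo /usr/sbin/clustat | grep SENTINEL                     # still "started" on clusternode30
sudo /usr/sbin/cman_tool -f nodes | grep -A1 clusternode30 # status X, "not been fenced" note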


-- System Information:
Debian Release: 5.0.3
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.26-2-amd64 (SMP w/1 CPU core)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash






Bug#553166: closed by Guido Günther a...@sigxcpu.org (Re: [Debian-ha-maintainers] Bug#553166: redhat-cluster: services are not relocated when a node fails)

2009-10-29 Thread Martin Waite
Hi Guido,

I have abandoned using RHCS on Debian Stable. RHCS 2 in Fedora and
CentOS does not have the bug, and unfortunately in my production
environment I will be tied to using RHCS 2.

I do believe that the bug has gone away in version 3 of RHCS, but
version 2 of RHCS as supplied by Lenny is virtually useless: it
cannot handle failover.

As a newcomer to using RHCS, it took a few days to figure out that the
problems I was having were not caused by my configuration errors (of
which I had many) but were actually caused by a bug.

Is there some way of either removing the package or providing some
warning to potential users that it doesn't work? Marking the bug as
closed seems inappropriate because it implies that the bug has been
resolved, which is not true. Would "wontfix" be better?

regards,
Martin


