Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Andrew Beekhof
On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra
shravan.mis...@gmail.com wrote:
 Hi Guys,

 I'm running the following versions of pacemaker and corosync:
 corosync=1.1.1-1-2
 pacemaker=1.0.9-2-1

 Everything had been running fine for quite some time, but then I
 started seeing the following errors in the corosync logs:


 =
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 

 I can perform all the crm shell commands and what not but it's
 troubling that the above is happening.

 My crm_mon output looks good.


 I also checked the authkey and did an md5sum on both; it's the same.

 Then I stopped corosync and regenerated the authkey with
 corosync-keygen and copied it to the other machine, but I still get
 the above message in the corosync log.

Are you sure there's not a third node somewhere broadcasting on that
mcast address and port combination?
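One quick way to check (a suggestion only; the node addresses are guesses
based on the bindnetaddr below) is to sniff the totem port on the ring 0
interface and look for traffic from anything that isn't one of your two nodes:

  # tcpdump -ni eth0 udp port 5405 and not host 192.168.2.1 and not host 192.168.2.2

Anything that shows up there comes from outside the cluster and would explain
the invalid-digest messages.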


 Is there anything other than the authkey that I should look into?


 corosync.conf

 

 # Please read the corosync.conf.5 manual page
 compatibility: whitetank

 totem {
        version: 2
        token: 3000
        token_retransmits_before_loss_const: 10
        join: 60
        consensus: 1500
        vsftype: none
        max_messages: 20
        clear_node_high_bit: yes
        secauth: on
        threads: 0
        rrp_mode: passive

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.2.0
                #mcastaddr: 226.94.1.1
                broadcast: yes
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 172.20.20.0
                #mcastaddr: 226.94.1.1
                broadcast: yes
                mcastport: 5405
        }
 }


 logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        to_syslog: yes
        logfile: /tmp/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
 }

 service {
        name: pacemaker
        ver: 0
 }

 aisexec {
        user:root
        group: root
 }

 amf {
        mode: disabled
 }


 ===


 Thanks
 Shravan

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Andrew Beekhof
On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro
miki.shap...@coles.com.au wrote:
 Confused.



 I *am* running DRBD in dual-master mode

/me cringes... this sounds to me like an impossibly dangerous idea.
Can someone from linbit comment on this please?  Am I imagining this?

 (apologies, I should have mentioned
 that earlier), and there will be both WAN clients as well as
 local-to-datacenter-clients writing to both nodes on both ends. It’s safe to
 assume the clients will not know of the split.



 In a WAN split I need to ensure that the node whose idea of drbd volume will
 be kept once resync happens stays up, and node that’ll get blown away and
 re-synced/overwritten becomes dead asap.

Won't you _always_ lose some data in a WAN split though?
AFAICS, what you're doing here is preventing some from becoming lots.

Is master/master really a requirement?

 NodeX (successfully) taking on data from clients while in
 quorumless-freeze-still-providing-service, then discarding its hitherto
 collected client data when realizing the other node has quorum, isn’t good.

Agreed - freeze isn't an option if you're doing master/master.


 To recap what I understood so far:

 1.   CRM Availability on the multicast channel drives DC election, but
 DC election is irrelevant to us here.

 2.   CRM Availability on the multicast channel (rather than resource
 failure) drives who-is-in-quorum-and-who-is-not decisions [not sure here..
 correct?

correct

 Or does resource failure drive quorum? ]

quorum applies to node availability - resource failures have no impact
(unless they lead to fencing, which then leads to the node leaving the membership)


 3.   Steve to clarify what happens quorum-wise if one of the three nodes sees
 both others, but the other two only see the first (“broken triangle”), and
 whether this behaviour may differ based on whether the first node (which is
 different as it sees both others) happens to be the DC at the time or not.

Try in a cluster of 3 VMs?
Just use iptables rules to simulate the broken links
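For example (a rough sketch; the address is a placeholder), to break the A-C
link while A-B and B-C stay up, run on node A:

  # iptables -A INPUT  -s <node-C-ip> -j DROP
  # iptables -A OUTPUT -d <node-C-ip> -j DROP

plus the mirror-image rules on node C; iptables -F puts things back to normal.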


 Given that anyone who goes about building a production cluster would want to
 identify all likely failure modes and be able to predict how the cluster
 behaves in each one, is there any user-targeted doco/rtfm material one could
 read regarding how quorum establishment works in such scenarios?

I don't think corosync has such a doc at the moment.

 Setting up a 3-way with intermittent WAN links without getting a clear
 understanding in advance of how the software will behave is … scary :)

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pre-Announce: End of 0.6 support is near

2010-01-18 Thread Andrew Beekhof
On Tue, Jan 12, 2010 at 3:55 PM, Emmanuel Lesouef e.leso...@crbn.fr wrote:
 On Tue, 12 Jan 2010 14:56:31 +0100,
 Michael Schwartzkopff mi...@multinet.de wrote:

  On Tuesday, 12 January 2010 at 14:48:12, Emmanuel Lesouef wrote:
  Hi,
 
   We use a rather old (in fact, very old) combination:
 
  heartbeat 2.99 + openhpi 2.12
 
  What do you suggest in order to upgrade to the latest version of
  pacemaker ?
 
  Thanks.

 http://www.clusterlabs.org/wiki/Upgrade


 Thanks for your answer. I already saw :
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-upgrade.html

 In fact, my question wasn't about the upgrade process but more about
 polling this list for caveats, advice or best practices when dealing
 with a rather old & uncommon configuration.

Biggest caveat is the networking issue that makes pacemaker 1.0
wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x).
So rolling upgrades are out and you'd need to look at one of the other
upgrade strategies.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] DC election with downed node in 2-way cluster

2010-01-18 Thread Andrew Beekhof
On Thu, Jan 14, 2010 at 4:40 AM, Miki Shapiro miki.shap...@coles.com.au wrote:
 And the node really did power down?
 Yes. 100% certain and positive. OFF.

 But the other node didn't notice?!?
 Its resources (drbd master and the fence clone) did notice.
 Its dc-election-mechanism did NOT notice (and the survivor didn't re-elect)
 Its quorum-election mechanism did NOT notice (and the survivor still thinks 
 it has quorum).

 Logs attached.

Hmmm.
Not much to see there. crmd gets the membership event and then just
sort of stops.
Could you try again with debug turned on in openais.conf please?
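Something like this in the logging section of openais.conf should do it (a
sketch; exact option names vary a little between openais and corosync
versions, so check the man page):

  logging {
          debug: on
          timestamp: on
  }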


 Keep in mind I'm relatively new to this. PEBKAC not entirely outside the 
 realm of the possible ;)

Doesn't look like it, but you might want to try something a little
more recent than 1.0.3.

 Thanks!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Wednesday, 13 January 2010 7:26 PM
 To: pacemaker@oss.clusterlabs.org
 Subject: Re: [Pacemaker] DC election with downed node in 2-way cluster

 On Wed, Jan 13, 2010 at 9:12 AM, Miki Shapiro miki.shap...@coles.com.au 
 wrote:
 Halt = soft off - a natively issued poweroff command that shuts stuff down
 nicely, then powers the blade off.

 And the node really did power down?
 But the other node didn't notice?!? That is insanely bad - looking
 forward to those logs.

 Logs I'll send tomorrow (our timezone is just wrapping up for the day).

 Yep, I'm actually an Aussie too... just not living there at the moment :-)

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Florian Haas
On 2010-01-18 11:41, Colin wrote:
 Hi All,
 
 we are currently looking at nearly the same issue, in fact I just
 wanted to start a similarly titled thread when I stumbled over these
 messages…
 
 The setup we are evaluating is actually a 2*N-node-cluster, i.e. two
 slightly separated sites with N nodes each. The main difference to an
 N-node-cluster is that a failure of one of the two groups of nodes
 must be considered a single failure event [against which the cluster
 must protect, e.g. loss of power at one site].

Colin,

the current approach is to utilize 2 Pacemaker clusters, each highly
available in its own right, and to employ manual failover, as described
here:

http://www.drbd.org/users-guide/s-pacemaker-floating-peers.html#s-pacemaker-floating-peers-site-fail-over

This may be combined with DRBD resource stacking, obviously.

Given the fact that most organizations currently employ a non-automatic
policy to site failover (as in, must be authorized by J. Random Vice
President), this is a sane approach that works for most. Automatic
failover is a different matter, not just with regard to clustering
(where neither Corosync nor Pacemaker nor Heartbeat currently support
any concept of sites), but also in terms of IP address failover,
dynamic routing, etc.

Cheers,
Florian



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Pre-Announce: End of 0.6 support is near

2010-01-18 Thread Florian Haas
On 2010-01-18 11:18, Andrew Beekhof wrote:
 Biggest caveat is the networking issue that makes pacemaker 1.0
 wire-incompatible with pacemaker 0.6 (and heartbeat 2.1.x).
 So rolling upgrades are out and you'd need to look at one of the other
 upgrade strategies.

Even though I've bugged you about this repeatedly in the past, I'll
reiterate that I think this non-support of rolling upgrades is a bad
thing(tm).

Just so someone puts this on the record. :)

Cheers,
Florian



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] [Linux-HA] Announce: Hawk (HA Web Konsole)

2010-01-18 Thread Andrew Beekhof
I look forward to taking this for a spin!
Do we have a bugzilla component for it yet?

On Sat, Jan 16, 2010 at 2:14 PM, Tim Serong tser...@novell.com wrote:
 Greetings All,

 This is to announce the development of the Hawk project,
 a web-based GUI for Pacemaker HA clusters.

 So, why another management tool, given that we already have
 the crm shell, the Python GUI, and DRBD MC?  In order:

 1) We have the usual rationale for a GUI over (or in addition
   to) a CLI tool; it is (or should be) easier to use, for
   a wider audience.

 2) The Python GUI is not always easily installable/runnable
   (think: sysadmins with Windows desktops and/or people who
   don't want to, or can't, forward X).

 3) Believe it or not, there are a number of cases where,
   citing security reasons, site policy prohibits ssh access
   to servers (which is what DRBD MC uses internally).

 There are also some differing goals; Hawk is not intended
 to expose absolutely everything.  There will be a point somewhere
 where you have to say "and now you must learn to use a shell".

 Likewise, Hawk is not intended to install the base cluster
 stack for you (whereas DRBD MC does a good job of this).

 It's early days yet (no downloadable packages), but you can
 get the current source as follows:

  # hg clone http://hg.clusterlabs.org/pacemaker/hawk
  # cd hawk
  # hg update tip

 This will give you a web-based GUI with a display roughly
 analogous to crm_mon, in terms of status of cluster resources.
 It will show you running/dead/standby nodes, and the resources
 (clones, groups & primitives) running on those nodes.

 It does not yet provide information about failed resources or
 nodes, other than the fact that they are not running.

 Display of nodes & resources is collapsible (collapsed by
 default), but if something breaks while you are looking at it,
 the display will expand to show the broken nodes and/or
 resources.

 Hawk is intended to run on each node in your cluster.  You
 can then access it by pointing your web browser at the IP
 address of any cluster node, or the address of any IPaddr(2)
 resource you may have configured.
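 If you don't already have such an address handy, a cluster IP is just an
 IPaddr(2) primitive; a minimal crm shell sketch (the address and netmask
 here are made up, adjust to taste):
 
   # crm configure primitive hawk-ip ocf:heartbeat:IPaddr2 \
         params ip=192.168.2.100 cidr_netmask=24 op monitor interval=10s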

 Minimally, to see it in action, you will need the following
 packages and their dependencies (names per openSUSE/SLES):

  - ruby
  - rubygem-rails-2_3
  - rubygem-gettext_rails

 Once you've got those installed, run the following command:

  # hawk/script/server

 Then, point your browser at http://your-server:3000/ to see
 the status of your cluster.

 Ultimately, hawk is intended to be installed and run as a
 regular system service via /etc/init.d/hawk.  To do this,
 you will need the following additional packages:

  - lighttpd
  - lighttpd-mod_magnet
  - ruby-fcgi
  - rubygem-rake

 Then, try the following, but READ THE MAKEFILE FIRST!
 make install (and the rest of the build system for that
 matter) is frightfully primitive at the moment:

  # make
  # sudo make install
  # /etc/init.d/hawk start

 Then, point your browser at http://your-server:/ to see
 the status of your cluster.

 Assuming you've read this far, what next?

 - In the very near future (but probably not next week,
  because I'll be busy at linux.conf.au) you can expect to
  see further documentation and roadmap info up on the
  clusterlabs.org wiki.

 - Immediate goal is to obtain feature parity with crm_mon
  (completing status display, adding error/failure messages).

 - Various pieces of scaffolding need to be put in place (login
  page, access via HTTPS, clean up build/packaging, theming,
  etc.)

 - After status display, the following major areas of
   functionality are:
  - Basic operator tasks (stop/start/migrate resource,
    standby/online node, etc.)
  - Explore failure scenarios (shadow CIB magic to see
    what would happen if a node/resource failed).
  - Ability to actually configure resources and nodes.

 Please direct comments, feedback, questions, etc. to
 tser...@novell.com and/or the Pacemaker mailing list.

 Thank you for your attention.

 Regards,

 Tim


 --
 Tim Serong tser...@novell.com
 Senior Clustering Engineer, Novell Inc.


 ___
 Linux-HA mailing list
 linux...@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released

2010-01-18 Thread Andrew Beekhof
The latest installment of the Pacemaker 1.0 stable series is now ready for 
general consumption.

In this release, we’ve made a number of improvements to clone handling - 
particularly the way ordering constraints are processed - as well as some 
really nice improvements to the shell.

The next 1.0 release is anticipated to be in mid-March. We will be switching to 
a bi-monthly release schedule to begin focusing on development for the next 
stable series (more details soon). So, if you have feature requests, now is the 
time to voice them and/or provide patches :-)

Pre-built packages for Pacemaker and its immediate dependencies are currently 
building and will be available for openSUSE, SLES, Fedora, RHEL, CentOS from 
the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) shortly.

Read the full announcement at:
   http://theclusterguy.clusterlabs.org/post/340780359/pacemaker-1-0-7-released

General installation instructions are available from the ClusterLabs wiki:
   http://clusterlabs.org/wiki/Install

-- Andrew




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Split Site 2-way clusters

2010-01-18 Thread Colin
On Mon, Jan 18, 2010 at 11:52 AM, Florian Haas florian.h...@linbit.com wrote:

 the current approach is to utilize 2 Pacemaker clusters, each highly
 available in its own right, and employing manual failover. As described
 here:

Thanks for the pointer! Perhaps "site" is not quite the correct term
for our setup, where we still have (multiple) Gbit-or-faster Ethernet
links (think fire areas, at most in adjacent buildings).

For the next step up, two geographically different sites, I agree that
manual failover is more appropriate, but we feel that our case of the
fire areas should still be handled automatically…(?)

Can anybody judge how difficult it would be to integrate some kind of
quorum support into the cluster? (All cluster nodes attempt a quorum
reservation; the node that gets it has 1.5 or 2 votes towards the
quorum, rather than just one; this would ensure continued operation in
the case of a) a fire area losing power, b) the separate quorum-server
failing, or c) the cross-fire-area cluster-interconnects failing (but
not more than one failure at a time)…)

Regards, Colin

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released

2010-01-18 Thread Andreas Mock
 -Original Message-
 From: Andrew Beekhof and...@beekhof.net
 Sent: 18.01.10 12:43:30
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Subject: [Pacemaker] Announce: Pacemaker 1.0.7 (stable) Released


 The latest installment of the Pacemaker 1.0 stable series is now ready for 
 general consumption.

Great.

 Pre-built packages for Pacemaker and its immediate dependencies are 
 currently building and will be available for openSUSE, SLES, Fedora, RHEL, 
 CentOS from the ClusterLabs Build Area (http://www.clusterlabs.org/rpm) 
 shortly.

Please don't forget openSuSE 10.2. I'm waiting...  ;-)

Best regards + Thanks
Andreas



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Shravan Mishra
Hi,

Since the interfaces on the two nodes are connected via a crossover
cable, there is no chance of that happening; and since I'm using
rrp_mode: passive, the other ring, i.e. ring 1, will come into play
only when ring 0 fails, I assume.  I say this because the ring 1
interface is on the network.


One interesting thing I observed was that
libtomcrypt is being used for crypto because I have secauth: on.

But I couldn't find that library on my machine.

I'm wondering if it's because of that.

Basically we are using 3 interfaces eth0, eth1 and eth2.

eth0 and eth2 are for ring 0 and ring 1 respectively. eth1 is the
primary interface.

This is what my drbd.conf looks like:


==
# please have a look at the example configuration file in
# /usr/share/doc/drbd82/drbd.conf
#
global {
usage-count no;
}
common {
protocol C;
  startup {
wfc-timeout 120;
degr-wfc-timeout 120;
  }
}
resource var_nsm {
syncer {
rate 333M;
}
handlers {
fence-peer /usr/lib/drbd/crm-fence-peer.sh;
after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
}
net {
after-sb-1pri discard-secondary;
}
on node1.itactics.com {
device /dev/drbd1;
 disk /dev/sdb3;
 address 172.20.20.1:7791;
 meta-disk internal;
  }
on node2.itactics.com {
device /dev/drbd1;
 disk /dev/sdb3;
 address 172.20.20.2:7791;
 meta-disk internal;
}
}
=


The eth0s of the two nodes are connected via crossover as I mentioned,
and eth1 and eth2 are on the network.

I'm not a networking expert, but is it possible that a broadcast done by,
let's say, a node not in my cluster will still come to my nodes through
the other interfaces which are attached to the network?


We in dev and the QA guys are testing this in parallel.

Let's say there is a QA cluster of two nodes and a dev cluster of two nodes.

The interfaces for both of them are hooked up as I mentioned above, and the
corosync.conf for both clusters has bindnetaddr: 192.168.2.0.

Is there a possibility of bad messages for one cluster caused by the other?


We are in the final leg of the testing and this came up.

Thanks for the help.


Shravan






On Mon, Jan 18, 2010 at 2:58 AM, Andrew Beekhof and...@beekhof.net wrote:
 On Sat, Jan 16, 2010 at 9:20 PM, Shravan Mishra
 shravan.mis...@gmail.com wrote:
 Hi Guys,

 I'm running the following versions of pacemaker and corosync:
 corosync=1.1.1-1-2
 pacemaker=1.0.9-2-1

 Everything had been running fine for quite some time, but then I
 started seeing the following errors in the corosync logs:


 =
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 

 I can perform all the crm shell commands and what not but it's
 troubling that the above is happening.

 My crm_mon output looks good.


 I also checked the authkey and did an md5sum on both; it's the same.

 Then I stopped corosync and regenerated the authkey with
 corosync-keygen and copied it to the other machine, but I still get
 the above message in the corosync log.

 Are you sure there's not a third node somewhere broadcasting on that
 mcast address and port combination?


 Is there anything other than the authkey that I should look into?


 corosync.conf

 

 # Please read the corosync.conf.5 manual page
 compatibility: whitetank

 totem {
        version: 2
        token: 3000
        token_retransmits_before_loss_const: 10
        join: 60
        consensus: 1500
        vsftype: none
        max_messages: 20
        clear_node_high_bit: yes
        secauth: on
        threads: 0
        rrp_mode: passive

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.2.0
                #mcastaddr: 226.94.1.1
                broadcast: yes
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 172.20.20.0
                #mcastaddr: 226.94.1.1
                broadcast: yes
                mcastport: 5405
        }
 }


 logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        to_syslog: yes
        logfile: /tmp/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
 }

 service {
        name: pacemaker
        ver: 0
 }

 aisexec {
        user:root
        group: 

[Pacemaker] mcast vs broadcast

2010-01-18 Thread Shravan Mishra
Hi all,



Following is my corosync.conf.

Even though broadcast is enabled, I see mcasted messages like these
in corosync.log.

Is this OK, even when broadcast is on and not mcast?

==
Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
172 to pending delivery queue
Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
173 to pending delivery queue
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173


=

===

# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
version: 2
token: 3000
token_retransmits_before_loss_const: 10
join: 60
consensus: 1500
vsftype: none
max_messages: 20
clear_node_high_bit: yes
secauth: on
threads: 0
rrp_mode: passive

interface {
ringnumber: 0
bindnetaddr: 192.168.2.0
#   mcastaddr: 226.94.1.1
broadcast: yes
mcastport: 5405
}
interface {
ringnumber: 1
bindnetaddr: 172.20.20.0
#mcastaddr: 226.94.2.1
broadcast: yes
mcastport: 5405
}
}
logging {
fileline: off
to_stderr: yes
to_logfile: yes
to_syslog: yes
logfile: /tmp/corosync.log
debug: on
timestamp: on
logger_subsys {
subsys: AMF
debug: off
}
}

service {
name: pacemaker
ver: 0
}

aisexec {
user:root
group: root
}

amf {
mode: disabled
}
=



Thanks
Shravan

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] 1.0.7 upgraded, restarting resources problem

2010-01-18 Thread Martin Gombač

Hi,

I have one m/s drbd resource and one Xen instance on top. Both m/s instances
are primary.
When I restart the node that's _not_ hosting the Xen instance (ibm1),
pacemaker restarts the running Xen instance on the other node (ibm2). There
is no need to do that. I thought this had been fixed
(http://developerbugs.linux-foundation.org/show_bug.cgi?id=2153). Didn't it?


Here is my config once more. Please note the WARNING showed up only
after the upgrade.
(BTW, setting the drbd0predHosting score to 0 doesn't restart it, but it
doesn't help resource ordering either.)


[r...@ibm1 etc]# crm configure show
WARNING: notify: operation name not recognized
node $id=3d430f49-b915-4d52-a32b-b0799fa17ae7 ibm2
node $id=4b2047c8-f3a0-4935-84a2-967b548598c9 ibm1
primitive Hosting ocf:heartbeat:Xen \
   params xmfile=/etc/xen/Hosting.cfg shutdown_timeout=303 \
   meta target-role=Started allow-migrate=true is-managed=true \
   op monitor interval=120s timeout=506s start-delay=5s \
   op migrate_to interval=0s timeout=304s \
   op migrate_from interval=0s timeout=304s \
   op stop interval=0s timeout=304s \
   op start interval=0s timeout=202s
primitive drbd_r0 ocf:linbit:drbd \
   params drbd_resource=r0 \
   op monitor interval=15s role=Master timeout=30s \
   op monitor interval=30s role=Slave timeout=30s \
   op stop interval=0s timeout=501s \
   op notify interval=0s timeout=90s \
   op demote interval=0s timeout=90s \
   op promote interval=0s timeout=90s \
   op start interval=0s timeout=255s
ms ms_drbd_r0 drbd_r0 \
   meta notify=true master-max=2 inteleave=true is-managed=true 
target-role=Started

order drbd0predHosting inf: ms_drbd_r0:promote Hosting:start
property $id=cib-bootstrap-options \
   dc-version=1.0.7-b1191b11d4b56dcae8f34715d52532561b875cd5 \
   cluster-infrastructure=Heartbeat \
   stonith-enabled=false \
   no-quorum-policy=ignore \
   default-resource-stickiness=10 \
   last-lrm-refresh=1263845352

All I want is to have just the one Hosting resource started, after drbd has
been promoted (to primary) on the node on which it's starting.
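A sketch of the constraint pair commonly used for this pattern (assuming the
intent is that Hosting only runs where drbd_r0 is Master; note also that the
ms meta attribute above is spelled inteleave, while the attribute Pacemaker
recognizes is interleave):

ms ms_drbd_r0 drbd_r0 \
   meta notify=true master-max=2 interleave=true
colocation HostingOnDrbdMaster inf: Hosting ms_drbd_r0:Master
order drbd0predHosting inf: ms_drbd_r0:promote Hosting:start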

Please advise me if you can.

Thank you,
regards,
M.

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] errors in corosync.log

2010-01-18 Thread Steven Dake
One possibility is you have a different cluster in your network on the
same multicast address and port.
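If so, one way to keep the two clusters apart (a sketch, not something from
this thread) is to give each its own mcastport in the interface stanza, e.g.:

        interface {
                ringnumber: 0
                bindnetaddr: 192.168.2.0
                broadcast: yes
                # the other cluster keeps the default 5405
                mcastport: 5407
        }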

Regards
-steve

On Sat, 2010-01-16 at 15:20 -0500, Shravan Mishra wrote:
 Hi Guys,
 
 I'm running the following versions of pacemaker and corosync:
 corosync=1.1.1-1-2
 pacemaker=1.0.9-2-1
 
 Everything had been running fine for quite some time, but then I
 started seeing the following errors in the corosync logs:
 
 
 =
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 Jan 16 15:08:39 corosync [TOTEM ] Received message has invalid
 digest... ignoring.
 Jan 16 15:08:39 corosync [TOTEM ] Invalid packet data
 
 
 I can perform all the crm shell commands and what not but it's
 troubling that the above is happening.
 
 My crm_mon output looks good.
 
 
 I also checked the authkey and did an md5sum on both; it's the same.
 
 Then I stopped corosync and regenerated the authkey with
 corosync-keygen and copied it to the other machine, but I still get
 the above message in the corosync log.
 
 Is there anything other than the authkey that I should look into?
 
 
 corosync.conf
 
 
 
 # Please read the corosync.conf.5 manual page
 compatibility: whitetank
 
 totem {
 version: 2
 token: 3000
 token_retransmits_before_loss_const: 10
 join: 60
 consensus: 1500
 vsftype: none
 max_messages: 20
 clear_node_high_bit: yes
 secauth: on
 threads: 0
 rrp_mode: passive
 
 interface {
 ringnumber: 0
 bindnetaddr: 192.168.2.0
 #mcastaddr: 226.94.1.1
 broadcast: yes
 mcastport: 5405
 }
 interface {
 ringnumber: 1
 bindnetaddr: 172.20.20.0
 #mcastaddr: 226.94.1.1
 broadcast: yes
 mcastport: 5405
 }
 }
 
 
 logging {
 fileline: off
 to_stderr: yes
 to_logfile: yes
 to_syslog: yes
 logfile: /tmp/corosync.log
 debug: off
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 }
 }
 
 service {
 name: pacemaker
 ver: 0
 }
 
 aisexec {
 user:root
 group: root
 }
 
 amf {
 mode: disabled
 }
 
 
 ===
 
 
 Thanks
 Shravan
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] mcast vs broadcast

2010-01-18 Thread Steven Dake
On Mon, 2010-01-18 at 11:25 -0500, Shravan Mishra wrote:
 Hi all,
 
 
 
 Following is my corosync.conf.
 
 Even though broadcast is enabled, I see mcasted messages like these
 in corosync.log.
 
 Is this OK, even when broadcast is on and not mcast?
 

Yes you are using broadcast and the debug output doesn't print a special
case for broadcast (but it really is broadcasting).

This output is debug output meant for developer consumption.  It is
really not all that useful for end users.  
 ==
 Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
 Jan 18 09:50:40 corosync [TOTEM ] mcasted message added to pending queue
 Jan 18 09:50:40 corosync [TOTEM ] Delivering 171 to 173
 Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
 172 to pending delivery queue
 Jan 18 09:50:40 corosync [TOTEM ] Delivering MCAST message with seq
 173 to pending delivery queue
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 172
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
 Jan 18 09:50:40 corosync [TOTEM ] Received ringid(192.168.2.1:168) seq 173
 Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 172
 Jan 18 09:50:40 corosync [TOTEM ] releasing messages up to and including 173
 
 
 =
 
 ===
 
 # Please read the corosync.conf.5 manual page
 compatibility: whitetank
 
 totem {
 version: 2
 token: 3000
 token_retransmits_before_loss_const: 10
 join: 60
 consensus: 1500
 vsftype: none
 max_messages: 20
 clear_node_high_bit: yes
 secauth: on
 threads: 0
 rrp_mode: passive
 
 interface {
 ringnumber: 0
 bindnetaddr: 192.168.2.0
 #   mcastaddr: 226.94.1.1
 broadcast: yes
 mcastport: 5405
 }
 interface {
 ringnumber: 1
 bindnetaddr: 172.20.20.0
 #mcastaddr: 226.94.2.1
 broadcast: yes
 mcastport: 5405
 }
 }
 logging {
 fileline: off
 to_stderr: yes
 to_logfile: yes
 to_syslog: yes
 logfile: /tmp/corosync.log
 debug: on
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 }
 }
 
 service {
 name: pacemaker
 ver: 0
 }
 
 aisexec {
 user:root
 group: root
 }
 
 amf {
 mode: disabled
 }
 =
 
 
 
 Thanks
 Shravan
 
 ___
 Pacemaker mailing list
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker