Re: [Pacemaker] pacemaker-remote debian wheezy

2015-03-14 Thread Kostiantyn Ponomarenko
Hi Alexis,

Sorry, I haven't worked with DRBD.
Take a look here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration

2015-02-09 Thread Kostiantyn Ponomarenko
Hi Kristoffer,

Thank you for the help.
I will let you know if I see the same with the latest version.
On Feb 9, 2015 10:07 AM, Kristoffer Grönlund kgronl...@suse.com wrote:

 Hi,

 Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com writes:

  Hi guys,
 
  I saw this during applying the configuration using a script with crmsh
  commands:


 The CIB patching performed by crmsh has been a bit too sensitive to
 CIB version mismatches which can cause errors like the one you are
 seeing. This should be fixed in the latest released version of crmsh
 (2.1.2), and I would recommend upgrading to that version if you can.

 If this problem still occurs with 2.1.2, please let me know [1].

 Thanks!

 [1]: http://github.com/crmsh/crmsh/issues


 
  + crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
  + crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params
  delay=10
  + crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1
  -inf: node-1
  + crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0
  -inf: node-0
  Call cib_apply_diff failed (-205): Update was older than existing
  configuration
  ERROR: could not patch cib (rc=205)
  INFO: offending xml diff: <diff format="2">
    <version>
      <source admin_epoch="0" epoch="26" num_updates="3"/>
      <target admin_epoch="0" epoch="27" num_updates="3"/>
    </version>
    <change operation="modify" path="/cib">
      <change-list>
        <change-attr name="epoch" operation="set" value="27"/>
      </change-list>
      <change-result>
        <cib crm_feature_set="3.0.9" validate-with="pacemaker-2.0" epoch="27"
         num_updates="3" admin_epoch="0" cib-last-written="Thu Feb  5 14:56:09 2015"
         have-quorum="1" dc-uuid="1"/>
      </change-result>
    </change>
    <change operation="create" path="/cib/configuration/constraints" position="1">
      <rsc_location id="dont_run_STONITH_node-0_on_node-0"
       rsc="STONITH_node-0" score="-INFINITY" node="node-0"/>
    </change>
  </diff>
 
  After that pacemaker stopped on the node on which the script was run.
 
  Thank you,
  Kostya
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org

 --
 // Kristoffer Grönlund
 // kgronl...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] authentication in the cluster

2015-02-05 Thread Kostiantyn Ponomarenko
Hi Chrissie,

On Thu, Jan 29, 2015 at 11:44 AM, Christine Caulfield ccaul...@redhat.com
wrote:

 as corosync rejects the messages
 with the wrong authkey


And I suppose that it is enough to just use the crypto_hash option.
Am I right?
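For reference, the totem settings being discussed look roughly like this
(Corosync 2.x option names; sha256 is just an example):

totem {
    crypto_hash: sha256      # authenticate messages with /etc/corosync/authkey
    crypto_cipher: none      # authentication only, no payload encryption
}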

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] authentication in the cluster

2015-02-05 Thread Kostiantyn Ponomarenko
Chrissie, one more thing.

And I also suppose that the key is read only once, at Corosync start-up.
In other words, there is no check for the presence of the key or for its
update while Corosync is running.
Am I right?

Thank you,
Kostya

On Thu, Feb 5, 2015 at 2:42 PM, Kostiantyn Ponomarenko 
konstantin.ponomare...@gmail.com wrote:

 Hi Chrissie,

 On Thu, Jan 29, 2015 at 11:44 AM, Christine Caulfield ccaul...@redhat.com
  wrote:

 as corosync rejects the messages
 with the wrong authkey


 And I suppose that it is enough just to use crypto_hash option.
 I am right?

 Thank you,
 Kostya

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] authentication in the cluster

2015-02-05 Thread Kostiantyn Ponomarenko
Chrissie, thank you! =)
It helps!

Thank you,
Kostya

On Thu, Feb 5, 2015 at 3:59 PM, Christine Caulfield ccaul...@redhat.com
wrote:

 On 02/05/2015 01:47 PM, Kostiantyn Ponomarenko wrote:
  Chrissie, one more thing.
 
  And I also suppose that the key is read only one time at the start-up of
  the Corosync.
  In other words there is no any check for the presence of the key or its
  update, while Corosync is working.
  Am I right?
 

 That's correct, yes :)

 Chrissie


  Thank you,
  Kostya
 
  On Thu, Feb 5, 2015 at 2:42 PM, Kostiantyn Ponomarenko 
  konstantin.ponomare...@gmail.com wrote:
 
  Hi Chrissie,
 
  On Thu, Jan 29, 2015 at 11:44 AM, Christine Caulfield 
 ccaul...@redhat.com
  wrote:
 
  as corosync rejects the messages
  with the wrong authkey
 
 
  And I suppose that it is enough just to use crypto_hash option.
  I am right?
 
  Thank you,
  Kostya
 
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration

2015-02-05 Thread Kostiantyn Ponomarenko
Hi guys,

I saw this while applying the configuration using a script with crmsh
commands:

+ crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
+ crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params
delay=10
+ crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1
-inf: node-1
+ crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0
-inf: node-0
Call cib_apply_diff failed (-205): Update was older than existing
configuration
ERROR: could not patch cib (rc=205)
INFO: offending xml diff: <diff format="2">
  <version>
    <source admin_epoch="0" epoch="26" num_updates="3"/>
    <target admin_epoch="0" epoch="27" num_updates="3"/>
  </version>
  <change operation="modify" path="/cib">
    <change-list>
      <change-attr name="epoch" operation="set" value="27"/>
    </change-list>
    <change-result>
      <cib crm_feature_set="3.0.9" validate-with="pacemaker-2.0" epoch="27"
       num_updates="3" admin_epoch="0" cib-last-written="Thu Feb  5 14:56:09 2015"
       have-quorum="1" dc-uuid="1"/>
    </change-result>
  </change>
  <change operation="create" path="/cib/configuration/constraints" position="1">
    <rsc_location id="dont_run_STONITH_node-0_on_node-0"
     rsc="STONITH_node-0" score="-INFINITY" node="node-0"/>
  </change>
</diff>

After that pacemaker stopped on the node on which the script was run.
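One way to reduce the number of separate CIB updates (and so the chance of
hitting this) would be to batch the commands into a single configure
transaction; a rough sketch, assuming crmsh reads commands from standard input:

# sketch: apply all changes in one commit instead of four separate calls
crm configure <<'EOF'
primitive STONITH_node-1 stonith:fence_avid_sbb_hw
primitive STONITH_node-0 stonith:fence_avid_sbb_hw params delay=10
location dont_run_STONITH_node-1_on_node-1 STONITH_node-1 -inf: node-1
location dont_run_STONITH_node-0_on_node-0 STONITH_node-0 -inf: node-0
commit
EOF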

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Call cib_apply_diff failed (-205): Update was older than existing configuration

2015-02-05 Thread Kostiantyn Ponomarenko
Here is a part of my /var/log/cluster/corosync.log:


Feb 05 16:23:36 [3184] isis-seth943fcib: info:
cib_process_ping:Reporting our current digest to node-1:
71d36e21c5df77a4e8b10b29f2a349e2 for 0.70.16 (0x2474c50 0)

Feb 05 16:23:41 [3185] isis-seth943f   stonithd:   notice: remote_op_done:
 Operation reboot of node-0 by node-1 for crmd.3201@node-1.58027587: OK

Feb 05 16:23:41 [3189] isis-seth943f   crmd: crit:
tengine_stonith_notify:  We were alegedly just fenced by node-1 for
node-1!

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 Diff: --- 0.70.16 2

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 Diff: +++ 0.70.17 (null)

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 +  /cib:  @num_updates=17

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 +  /cib/status/node_state[@id='1']:
 @crm-debug-origin=send_stonith_update, @join=down, @expected=down

Feb 05 16:23:41 [3184] isis-seth943fcib: info:
cib_process_request: Completed cib_modify operation for section
status: OK (rc=0, origin=node-1/crmd/91, version=0.70.17)

Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted: info:
cancel_recurring_action:  Cancelling operation dmdh_monitor_3

Feb 05 16:23:41 [3182] isis-seth943f pacemakerd:error: pcmk_child_exit:
Child process crmd (3189) exited: Network is down (100)

Feb 05 16:23:41 [3182] isis-seth943f pacemakerd:  warning: pcmk_child_exit:
Pacemaker child process crmd no longer wishes to be respawned. Shutting
ourselves down.

Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted:  warning:
qb_ipcs_event_sendv:  new_event_notification (3186-3189-7): Bad file
descriptor (9)

Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted:  warning:
send_client_notify:   Notification of client
crmd/437207d3-a709-48d0-b544-94e2158ea191 failed

Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted: info:
cancel_recurring_action:  Cancelling operation sm0dh_monitor_3

Feb 05 16:23:41 [3186] isis-seth943f pacemaker_remoted:  warning:
send_client_notify:   Notification of client
crmd/437207d3-a709-48d0-b544-94e2158ea191 failed

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 Diff: --- 0.70.17 2

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 Diff: +++ 0.70.18 (null)

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 -- /cib/status/node_state[@id='1']/lrm[@id='1']

Feb 05 16:23:41 [3184] isis-seth943fcib: info: cib_perform_op:
 +  /cib:  @num_updates=18

Feb 05 16:23:41 [3184] isis-seth943fcib: info:
cib_process_request: Completed cib_delete operation for section
//node_state[@uname='node-0']/lrm: OK (rc=0, origin=node-1/crmd/92,
version=0.70.18)
Feb 05 16:23:41 [3182] isis-seth943f pacemakerd:   notice:
pcmk_shutdown_worker:Shuting down Pacemaker

Thank you,
Kostya

On Thu, Feb 5, 2015 at 6:15 PM, Kostiantyn Ponomarenko 
konstantin.ponomare...@gmail.com wrote:

 Hi guys,

 I saw this during applying the configuration using a script with crmsh
 commands:

 + crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
 + crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw params
 delay=10
 + crm configure location dont_run_STONITH_node-1_on_node-1 STONITH_node-1
 -inf: node-1
 + crm configure location dont_run_STONITH_node-0_on_node-0 STONITH_node-0
 -inf: node-0
 Call cib_apply_diff failed (-205): Update was older than existing
 configuration
 ERROR: could not patch cib (rc=205)
 INFO: offending xml diff: <diff format="2">
   <version>
     <source admin_epoch="0" epoch="26" num_updates="3"/>
     <target admin_epoch="0" epoch="27" num_updates="3"/>
   </version>
   <change operation="modify" path="/cib">
     <change-list>
       <change-attr name="epoch" operation="set" value="27"/>
     </change-list>
     <change-result>
       <cib crm_feature_set="3.0.9" validate-with="pacemaker-2.0" epoch="27"
        num_updates="3" admin_epoch="0" cib-last-written="Thu Feb  5 14:56:09 2015"
        have-quorum="1" dc-uuid="1"/>
     </change-result>
   </change>
   <change operation="create" path="/cib/configuration/constraints" position="1">
     <rsc_location id="dont_run_STONITH_node-0_on_node-0"
      rsc="STONITH_node-0" score="-INFINITY" node="node-0"/>
   </change>
 </diff>

 After that pacemaker stopped on the node on which the script was run.

 Thank you,
 Kostya

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] rrp_mode in corosync.conf

2015-01-27 Thread Kostiantyn Ponomarenko
Hi all,

I've been looking for a good answer to my question, but all information I
found is ambiguous.
I hope to get a good answer here =)

The only description of the active and passive modes I found is:
Active: both rings will be active, in use
Passive: only one of the 2 rings is in use; the second one will be used
only if the first one fails

There is no description of how it works or what the impact is.
So, my general question is: how are the rings used in active and
passive modes?
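For context, the redundant rings in question are declared roughly like this
(addresses are placeholders):

totem {
    rrp_mode: passive        # or: active
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 10.0.0.1     # first ring
        ring1_addr: 10.0.1.1     # second ring
    }
}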


Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] authentication in the cluster

2015-01-27 Thread Kostiantyn Ponomarenko
Hi all,

Here is the situation: there are two two-node clusters.
They have identical configurations.
The nodes in each cluster are connected directly, without any switches.

Here is part of the corosync.conf file:

totem {
version: 2

cluster_name: mycluster
transport: udpu

crypto_hash: sha256
crypto_cipher: none
rrp_mode: passive
}

nodelist {
node {
name: node-a
nodeid: 1
ring0_addr: 169.254.0.2
ring1_addr: 169.254.1.2
}

node {
name: node-b
nodeid: 2
ring0_addr: 169.254.0.3
ring1_addr: 169.254.1.3
}
}

The only difference between the two clusters is the authentication key
( /etc/corosync/authkey ) - it is different for each cluster.
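For reference, each cluster's key can be generated independently with
corosync-keygen and distributed only within that cluster (default key path
assumed):

# on one node of each cluster
corosync-keygen                                   # writes /etc/corosync/authkey
scp /etc/corosync/authkey node-b:/etc/corosync/   # copy to the other node of the same cluster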

QUESTION:
--
What will the behavior be if the following connection mix-up occurs:
ring1_addr of node-a (cluster-A) is connected to ring1_addr of node-b
(cluster-B)
ring1_addr of node-a (cluster-B) is connected to ring1_addr of node-b
(cluster-A)

I attached a pic which shows the connections.

My actual goal is to not let the clusters work in such a case.
To achieve it, I decided to use the authentication key mechanism.
But I don't know what the result would be in the situation I described ... .

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] authentication in the cluster

2015-01-27 Thread Kostiantyn Ponomarenko
Hi Chrissie,

I know that this setup is a crazy thing =)
First of all I need to say: think of each two-node cluster as one box
with two nodes.

 You can't connect clusters together like that.
I know that.

All nodes in the cluster have just 1 authkey file.
That is true. But in this example there are two clusters, each of which has
its own auth key.

What you have there is not a ring, it's err, a linked-cross?!
Yep, I showed the wrong way of connecting two clusters.

 Why do you need to connect the two clusters together - is it for failover?
No, it is not. I really don't (and won't) connect them that way. It is
wrong.
But in real life those two clusters will be standing (physically, in the
same room, in the same rack) pretty close to each other.
And my concern is: what if someone makes that connection by mistake? What
will happen in that situation?
What I would like to have in that situation is something which prevents
two nodes of one cluster from working simultaneously - because that would
cause data corruption.

The situation is pretty simple when there is only one ring_addr defined
per node.
In this case, if someone cross-links two separate clusters, it will
lead to 4 clusters, each of which is missing one node - because the two
connected nodes have different auth keys, and that is why they will not see
each other even when there is a connection.
STONITH always works within the same cluster.
So STONITH will keep rebooting the other node in the cluster.
That will prevent simultaneous access to the data.

I tried my best to describe the situation, the problem and the
question.
Looking forward to hearing any suggestions =)


Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-20 Thread Kostiantyn Ponomarenko
Got it. Thank you =)
I was just thinking about the possibility of a NIC burning out.

Thank you,
Kostya

On Tue, Jan 20, 2015 at 10:50 AM, Jan Friesse jfrie...@redhat.com wrote:

 Kostiantyn,


  One more thing to clarify.
  You said rebind can be avoided - what does it mean?

 By that I mean that as long as you don't shutdown interface everything
 will work as expected. Interface shutdown is administrator decision,
 system doesn't do it automagically :)

 Regards,
   Honza

 
  Thank you,
  Kostya
 
  On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko 
  konstantin.ponomare...@gmail.com wrote:
 
  Thank you. Now I am aware of it.
 
  Thank you,
  Kostya
 
  On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com
 wrote:
 
  Kostiantyn,
 
  Honza,
 
  Thank you for helping me.
  So, there is no defined behavior in case one of the interfaces is not
 in
  the system?
 
  You are right. There is no defined behavior.
 
  Regards,
Honza
 
 
 
 
  Thank you,
  Kostya
 
  On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com
  wrote:
 
  Kostiantyn,
 
 
  According to the https://access.redhat.com/solutions/638843 , the
  interface, that is defined in the corosync.conf, must be present in
  the
  system (see at the bottom of the article, section ROOT CAUSE).
  To confirm that I made a couple of tests.
 
  Here is a part of the corosync.conf file (in a free-write form)
 (also
  attached the origin config file):
  ===
  rrp_mode: passive
  ring0_addr is defined in corosync.conf
  ring1_addr is defined in corosync.conf
  ===
 
  ---
 
  Two-node cluster
 
  ---
 
  Test #1:
  --
  IP for ring0 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync fails to start.
  From the logs:
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error
 in
  config: No interfaces defined
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync
  Cluster
  Engine exiting with status 8 at main.c:1343.
  Result: Corosync and Pacemaker are not running.
 
  Test #2:
  --
  IP for ring1 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync starts.
  Start Pacemaker simultaneously on both nodes.
  Pacemaker fails to start.
  From the logs, the last writes from the corosync:
  Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking
 ringid
  0
  interface 169.254.1.3 FAULTY
  Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ]
  Automatically
  recovered ring 0
  Result: Corosync and Pacemaker are not running.
 
 
  Test #3:
 
  rrp_mode: active leads to the same result, except Corosync and
  Pacemaker
  init scripts return status running.
  But still vim /var/log/cluster/corosync.log shows a lot of errors
  like:
  Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch:
  Connection
  to the CPG API failed: Library error (2)
 
  Result: Corosync and Pacemaker show their statuses as running, but
  crm_mon cannot connect to the cluster database. And half of the
  Pacemaker's services are not running (including Cluster Information
  Base
  (CIB)).
 
 
  ---
 
  For a single node mode
 
  ---
 
  IP for ring0 is not defines in the system:
 
  Corosync fails to start.
 
  IP for ring1 is not defines in the system:
 
  Corosync and Pacemaker are started.
 
  It is possible that configuration will be applied successfully
 (50%),
 
  and it is possible that the cluster is not running any resources,
 
  and it is possible that the node cannot be put in a standby mode
  (shows:
  communication error),
 
  and it is possible that the cluster is running all resources, but
  applied
  configuration is not guaranteed to be fully loaded (some rules can
 be
  missed).
 
 
  ---
 
  Conclusions:
 
  ---
 
  It is possible that in some rare cases (see comments to the bug) the
  cluster will work, but in that case its working state is unstable
 and
  the
  cluster can stop working every moment.
 
 
  So, is it correct? Does my assumptions make any sense? I didn't any
  other
  explanation in the network ... .
 
  Corosync needs all interfaces during start and runtime. This doesn't
  mean they must be connected (this would make corosync unusable for
  physical NIC/Switch or cable failure), but they must be up and have
  correct ip.
 
  When this is not the case, corosync rebinds to localhost and weird
  things happens. Removal of this rebinding is long time TODO, but
 there
  are still more important bugs (especially because rebind can be
  avoided).
 
  Regards,
Honza
 
 
 
 
  Thank you,
  Kostya
 
  On Fri

Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-19 Thread Kostiantyn Ponomarenko
One more thing to clarify.
You said the rebind can be avoided - what does that mean?

Thank you,
Kostya

On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko 
konstantin.ponomare...@gmail.com wrote:

 Thank you. Now I am aware of it.

 Thank you,
 Kostya

 On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote:

 Kostiantyn,

  Honza,
 
  Thank you for helping me.
  So, there is no defined behavior in case one of the interfaces is not in
  the system?

 You are right. There is no defined behavior.

 Regards,
   Honza


 
 
  Thank you,
  Kostya
 
  On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com
 wrote:
 
  Kostiantyn,
 
 
  According to the https://access.redhat.com/solutions/638843 , the
  interface, that is defined in the corosync.conf, must be present in
 the
  system (see at the bottom of the article, section ROOT CAUSE).
  To confirm that I made a couple of tests.
 
  Here is a part of the corosync.conf file (in a free-write form) (also
  attached the origin config file):
  ===
  rrp_mode: passive
  ring0_addr is defined in corosync.conf
  ring1_addr is defined in corosync.conf
  ===
 
  ---
 
  Two-node cluster
 
  ---
 
  Test #1:
  --
  IP for ring0 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync fails to start.
  From the logs:
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
  config: No interfaces defined
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync
 Cluster
  Engine exiting with status 8 at main.c:1343.
  Result: Corosync and Pacemaker are not running.
 
  Test #2:
  --
  IP for ring1 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync starts.
  Start Pacemaker simultaneously on both nodes.
  Pacemaker fails to start.
  From the logs, the last writes from the corosync:
  Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid
 0
  interface 169.254.1.3 FAULTY
  Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ]
 Automatically
  recovered ring 0
  Result: Corosync and Pacemaker are not running.
 
 
  Test #3:
 
  rrp_mode: active leads to the same result, except Corosync and
  Pacemaker
  init scripts return status running.
  But still vim /var/log/cluster/corosync.log shows a lot of errors
 like:
  Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch:
 Connection
  to the CPG API failed: Library error (2)
 
  Result: Corosync and Pacemaker show their statuses as running, but
  crm_mon cannot connect to the cluster database. And half of the
  Pacemaker's services are not running (including Cluster Information
 Base
  (CIB)).
 
 
  ---
 
  For a single node mode
 
  ---
 
  IP for ring0 is not defines in the system:
 
  Corosync fails to start.
 
  IP for ring1 is not defines in the system:
 
  Corosync and Pacemaker are started.
 
  It is possible that configuration will be applied successfully (50%),
 
  and it is possible that the cluster is not running any resources,
 
  and it is possible that the node cannot be put in a standby mode
 (shows:
  communication error),
 
  and it is possible that the cluster is running all resources, but
 applied
  configuration is not guaranteed to be fully loaded (some rules can be
  missed).
 
 
  ---
 
  Conclusions:
 
  ---
 
  It is possible that in some rare cases (see comments to the bug) the
  cluster will work, but in that case its working state is unstable and
 the
  cluster can stop working every moment.
 
 
  So, is it correct? Does my assumptions make any sense? I didn't any
 other
  explanation in the network ... .
 
  Corosync needs all interfaces during start and runtime. This doesn't
  mean they must be connected (this would make corosync unusable for
  physical NIC/Switch or cable failure), but they must be up and have
  correct ip.
 
  When this is not the case, corosync rebinds to localhost and weird
  things happens. Removal of this rebinding is long time TODO, but there
  are still more important bugs (especially because rebind can be
 avoided).
 
  Regards,
Honza
 
 
 
 
  Thank you,
  Kostya
 
  On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
  konstantin.ponomare...@gmail.com wrote:
 
  Hi guys,
 
  Corosync fails to start if there is no such network interface
 configured
  in the system.
  Even with rrp_mode: passive the problem is the same when at least
 one
  network interface is not configured in the system.
 
  Is this the expected behavior?
  I thought that when you use redundant rings, it is enough to have at
  least

Re: [Pacemaker] pacemaker-remote debian wheezy

2015-01-15 Thread Kostiantyn Ponomarenko
Thomas,

There was a need for me to run the latest cluster stuff on Debian 7.
So I created a document for myself to use.
I don't claim this doc to be the best way to go, but it works for me.
I hope it will work for you as well.

Here is the doc's content:


Software


   - libqb 0.17.0
   - corosync 2.3.3
   - cluster-glue 1.0.12
   - resource-agents 3.9.5
   - pacemaker 1.1.12
   - crmsh 2.1.0


IMPORTANT: do this installation step by step as written here; the order is
significant.


Pre-Configuration

$ sudo apt-get install build-essential

$ sudo apt-get install automake autoconf

$ sudo apt-get install libtool

$ sudo apt-get install pkg-config


LIBQB (needed by Corosync)

https://github.com/ClusterLabs/libqb/releases

$ echo 0.17.0 > .tarball-version

$ ./autogen.sh

$ ./configure

$ make

$ sudo make install


COROSYNC

https://github.com/corosync/corosync/releases

$ sudo apt-get install libnss3-dev

$ echo 2.3.3 > .tarball-version

$ ./autogen.sh

$ ./configure

$ make

$ sudo make install




CLUSTER-GLUE (node fencing plugins, an error reporting utility, and other
reusable cluster components from the Linux HA project)

http://hg.linux-ha.org/glue/archive/glue-1.0.12.tar.bz2

$ sudo apt-get install libaio-dev

(!) install the dependencies for pacemaker (listed below) before proceeding

$ ./autogen.sh

$ ./configure --enable-fatal-warnings=no

$ make

$ sudo make install


RESOURCE-AGENTS (Combined repository of OCF agents from the RHCS and
Linux-HA projects)

https://github.com/ClusterLabs/resource-agents/releases

$ echo 3.9.5 > .tarball-version

$ ./autogen.sh

$ ./configure

$ make

$ sudo make install


PACEMAKER

https://github.com/ClusterLabs/pacemaker/releases

$ sudo apt-get install uuid-dev

$ sudo apt-get install libglib2.0-dev

$ sudo apt-get install libxml2-dev

$ sudo apt-get install libxslt1-dev

$ sudo apt-get install libbz2-dev

$ sudo apt-get install libncurses5-dev

$ sudo addgroup --system haclient

$ sudo adduser --system --no-create-home --ingroup haclient hacluster

$ ./autogen.sh

$ ./configure

$ make

$ sudo make install



CRMSH

https://github.com/crmsh/crmsh/releases

$ sudo apt-get install python-lxml

$ ./autogen.sh

$ ./configure

$ make
$ sudo make install
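
A quick sanity check after the installation might look like this (exact
flags can differ between versions):

$ corosync -v

$ pacemakerd --version

$ crm --version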


Thank you,
Kostya

On Thu, Jan 15, 2015 at 5:44 PM, Thomas Manninger dbgtmas...@gmx.at wrote:

 Hi,

 i also compiled the pacemaker_mgmt. I can start the hb_gui, but i have no
 server daemon?
 I used git://github.com/ClusterLabs/pacemaker-mgmt.git as source.

 Is the server in another repo??

 I used:
 ./ConfigureMe configure
 ./ConfigureMe make
 checkinstall --fstrans=no ./ConfigureMe install

 regards
 thomas

 *Gesendet:* Donnerstag, 15. Januar 2015 um 15:16 Uhr
 *Von:* Ken Gaillot kgail...@redhat.com
 *An:* pacemaker@oss.clusterlabs.org
 *Betreff:* Re: [Pacemaker] pacemaker-remote debian wheezy
 On 01/15/2015 08:18 AM, Kristoffer Grönlund wrote:
  Thomas Manninger dbgtmas...@gmx.at writes:
 
  Hi,
  I compiled the latest libqb, corosync and pacemaker from source.
  Now there is no crm command available? Is there another standard
  shell?
  Should i use crmadmin?
  Thanks!
  Regards
  Thomas
 
  You can get crmsh and build from source at crmsh.github.io, or try the
  .rpm packages for various distributions here:
 
 
 https://build.opensuse.org/package/show/network:ha-clustering:Stable/crmsh

 Congratulations on getting that far, that's probably the hardest part :-)

 The crm shell was part of the pacemaker packages in Debian squeeze. It
 was going to be separated into its own package for jessie, but that
 hasn't made it out of sid/unstable yet, so it might not make it into the
 final release.

 Since you've built everything else from source, that's probably easiest,
 but if you want to try ...

 For the rpm mentioned above, have a look at alien
 (https://wiki.debian.org/Alien). crmsh is a standalone package so
 hopefully it would work; I wouldn't try alien for something as
 complicated as all the rpm's that go into a pacemaker install.

 You could try backporting the sid package
 https://packages.debian.org/source/sid/crmsh but I suspect the
 dependencies would get you.

 In theory the crm binary from the squeeze packages should work with the
 newer pacemaker, if you can straighten out the library dependencies.

 Or you can use the crm*/cib* command-line tools that come with pacemaker
 if you don't mind the lower-level approach.

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 

Re: [Pacemaker] pacemaker-remote debian wheezy

2015-01-15 Thread Kostiantyn Ponomarenko
Hi Thomas,

I don't remember starting from which version of Pacemaker crmsh is no
longer included.
It now ships as an independent product.
You can get it back.
Here is the link: https://github.com/crmsh/crmsh/ .
Build and install =)
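The build follows the same autotools pattern as the other components; roughly:

$ git clone https://github.com/crmsh/crmsh.git
$ cd crmsh
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install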

Thank you,
Kostya

On Thu, Jan 15, 2015 at 3:18 PM, Kristoffer Grönlund kgronl...@suse.com
wrote:

 Thomas Manninger dbgtmas...@gmx.at writes:

  Hi,
  I compiled the latest libqb, corosync and pacemaker from source.
  Now there is no crm command available? Is there another standard
  shell?
  Should i use crmadmin?
  Thanks!
  Regards
  Thomas

 You can get crmsh and build from source at crmsh.github.io, or try the
 .rpm packages for various distributions here:

 https://build.opensuse.org/package/show/network:ha-clustering:Stable/crmsh

 Best regards,
 Kristoffer

 --
 // Kristoffer Grönlund
 // kgronl...@suse.com

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-14 Thread Kostiantyn Ponomarenko
Thank you. Now I am aware of it.

Thank you,
Kostya

On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote:

 Kostiantyn,

  Honza,
 
  Thank you for helping me.
  So, there is no defined behavior in case one of the interfaces is not in
  the system?

 You are right. There is no defined behavior.

 Regards,
   Honza


 
 
  Thank you,
  Kostya
 
  On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com
 wrote:
 
  Kostiantyn,
 
 
  According to the https://access.redhat.com/solutions/638843 , the
  interface, that is defined in the corosync.conf, must be present in the
  system (see at the bottom of the article, section ROOT CAUSE).
  To confirm that I made a couple of tests.
 
  Here is a part of the corosync.conf file (in a free-write form) (also
  attached the origin config file):
  ===
  rrp_mode: passive
  ring0_addr is defined in corosync.conf
  ring1_addr is defined in corosync.conf
  ===
 
  ---
 
  Two-node cluster
 
  ---
 
  Test #1:
  --
  IP for ring0 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync fails to start.
  From the logs:
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
  config: No interfaces defined
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
  Engine exiting with status 8 at main.c:1343.
  Result: Corosync and Pacemaker are not running.
 
  Test #2:
  --
  IP for ring1 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync starts.
  Start Pacemaker simultaneously on both nodes.
  Pacemaker fails to start.
  From the logs, the last writes from the corosync:
  Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
  interface 169.254.1.3 FAULTY
  Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
  recovered ring 0
  Result: Corosync and Pacemaker are not running.
 
 
  Test #3:
 
  rrp_mode: active leads to the same result, except Corosync and
  Pacemaker
  init scripts return status running.
  But still vim /var/log/cluster/corosync.log shows a lot of errors
 like:
  Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch:
 Connection
  to the CPG API failed: Library error (2)
 
  Result: Corosync and Pacemaker show their statuses as running, but
  crm_mon cannot connect to the cluster database. And half of the
  Pacemaker's services are not running (including Cluster Information
 Base
  (CIB)).
 
 
  ---
 
  For a single node mode
 
  ---
 
  IP for ring0 is not defines in the system:
 
  Corosync fails to start.
 
  IP for ring1 is not defines in the system:
 
  Corosync and Pacemaker are started.
 
  It is possible that configuration will be applied successfully (50%),
 
  and it is possible that the cluster is not running any resources,
 
  and it is possible that the node cannot be put in a standby mode
 (shows:
  communication error),
 
  and it is possible that the cluster is running all resources, but
 applied
  configuration is not guaranteed to be fully loaded (some rules can be
  missed).
 
 
  ---
 
  Conclusions:
 
  ---
 
  It is possible that in some rare cases (see comments to the bug) the
  cluster will work, but in that case its working state is unstable and
 the
  cluster can stop working every moment.
 
 
  So, is it correct? Does my assumptions make any sense? I didn't any
 other
  explanation in the network ... .
 
  Corosync needs all interfaces during start and runtime. This doesn't
  mean they must be connected (this would make corosync unusable for
  physical NIC/Switch or cable failure), but they must be up and have
  correct ip.
 
  When this is not the case, corosync rebinds to localhost and weird
  things happens. Removal of this rebinding is long time TODO, but there
  are still more important bugs (especially because rebind can be
 avoided).
 
  Regards,
Honza
 
 
 
 
  Thank you,
  Kostya
 
  On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
  konstantin.ponomare...@gmail.com wrote:
 
  Hi guys,
 
  Corosync fails to start if there is no such network interface
 configured
  in the system.
  Even with rrp_mode: passive the problem is the same when at least
 one
  network interface is not configured in the system.
 
  Is this the expected behavior?
  I thought that when you use redundant rings, it is enough to have at
  least
  one NIC configured in the system. Am I wrong?
 
  Thank you,
  Kostya
 
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http

Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-13 Thread Kostiantyn Ponomarenko
Honza,

Thank you for helping me.
So, there is no defined behavior in case one of the interfaces is not in
the system?


Thank you,
Kostya

On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote:

 Kostiantyn,


  According to the https://access.redhat.com/solutions/638843 , the
  interface, that is defined in the corosync.conf, must be present in the
  system (see at the bottom of the article, section ROOT CAUSE).
  To confirm that I made a couple of tests.
 
  Here is a part of the corosync.conf file (in a free-write form) (also
  attached the origin config file):
  ===
  rrp_mode: passive
  ring0_addr is defined in corosync.conf
  ring1_addr is defined in corosync.conf
  ===
 
  ---
 
  Two-node cluster
 
  ---
 
  Test #1:
  --
  IP for ring0 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync fails to start.
  From the logs:
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
  config: No interfaces defined
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
  Engine exiting with status 8 at main.c:1343.
  Result: Corosync and Pacemaker are not running.
 
  Test #2:
  --
  IP for ring1 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync starts.
  Start Pacemaker simultaneously on both nodes.
  Pacemaker fails to start.
  From the logs, the last writes from the corosync:
  Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
  interface 169.254.1.3 FAULTY
  Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
  recovered ring 0
  Result: Corosync and Pacemaker are not running.
 
 
  Test #3:
 
  rrp_mode: active leads to the same result, except Corosync and
 Pacemaker
  init scripts return status running.
  But still vim /var/log/cluster/corosync.log shows a lot of errors like:
  Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
  to the CPG API failed: Library error (2)
 
  Result: Corosync and Pacemaker show their statuses as running, but
  crm_mon cannot connect to the cluster database. And half of the
  Pacemaker's services are not running (including Cluster Information Base
  (CIB)).
 
 
  ---
 
  For a single node mode
 
  ---
 
  IP for ring0 is not defines in the system:
 
  Corosync fails to start.
 
  IP for ring1 is not defines in the system:
 
  Corosync and Pacemaker are started.
 
  It is possible that configuration will be applied successfully (50%),
 
  and it is possible that the cluster is not running any resources,
 
  and it is possible that the node cannot be put in a standby mode (shows:
  communication error),
 
  and it is possible that the cluster is running all resources, but applied
  configuration is not guaranteed to be fully loaded (some rules can be
  missed).
 
 
  ---
 
  Conclusions:
 
  ---
 
  It is possible that in some rare cases (see comments to the bug) the
  cluster will work, but in that case its working state is unstable and the
  cluster can stop working every moment.
 
 
  So, is it correct? Does my assumptions make any sense? I didn't any other
  explanation in the network ... .

 Corosync needs all interfaces during start and runtime. This doesn't
 mean they must be connected (this would make corosync unusable for
 physical NIC/Switch or cable failure), but they must be up and have
 correct ip.

 When this is not the case, corosync rebinds to localhost and weird
 things happens. Removal of this rebinding is long time TODO, but there
 are still more important bugs (especially because rebind can be avoided).

 Regards,
   Honza

 
 
 
  Thank you,
  Kostya
 
  On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
  konstantin.ponomare...@gmail.com wrote:
 
  Hi guys,
 
  Corosync fails to start if there is no such network interface configured
  in the system.
  Even with rrp_mode: passive the problem is the same when at least one
  network interface is not configured in the system.
 
  Is this the expected behavior?
  I thought that when you use redundant rings, it is enough to have at
 least
  one NIC configured in the system. Am I wrong?
 
  Thank you,
  Kostya
 
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 


 ___
 Pacemaker

Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-12 Thread Kostiantyn Ponomarenko
According to https://access.redhat.com/solutions/638843 , the
interface that is defined in corosync.conf must be present in the
system (see the bottom of the article, section ROOT CAUSE).
To confirm that, I made a couple of tests.

Here is part of the corosync.conf file (in free-form notation) (the
original config file is also attached):
===
rrp_mode: passive
ring0_addr is defined in corosync.conf
ring1_addr is defined in corosync.conf
===

---

Two-node cluster

---

Test #1:
--
IP for ring0 is not defined in the system:
--
Start Corosync simultaneously on both nodes.
Corosync fails to start.
From the logs:
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
config: No interfaces defined
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
Engine exiting with status 8 at main.c:1343.
Result: Corosync and Pacemaker are not running.

Test #2:
--
IP for ring1 is not defined in the system:
--
Start Corosync simultaneously on both nodes.
Corosync starts.
Start Pacemaker simultaneously on both nodes.
Pacemaker fails to start.
From the logs, the last writes from the corosync:
Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
interface 169.254.1.3 FAULTY
Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
recovered ring 0
Result: Corosync and Pacemaker are not running.


Test #3:

rrp_mode: active leads to the same result, except that the Corosync and Pacemaker
init scripts return the status running.
But /var/log/cluster/corosync.log still shows a lot of errors like:
Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
to the CPG API failed: Library error (2)

Result: Corosync and Pacemaker show their statuses as running, but
crm_mon cannot connect to the cluster database. And half of the
Pacemaker's services are not running (including Cluster Information Base
(CIB)).


---

For a single node mode

---

IP for ring0 is not defined in the system:

Corosync fails to start.

IP for ring1 is not defined in the system:

Corosync and Pacemaker are started.

It is possible that the configuration will be applied successfully (50%),

and it is possible that the cluster is not running any resources,

and it is possible that the node cannot be put into standby mode (shows:
communication error),

and it is possible that the cluster is running all resources, but the applied
configuration is not guaranteed to be fully loaded (some rules can be
missing).


---

Conclusions:

---

It is possible that in some rare cases (see the comments to the bug) the
cluster will work, but in that case its working state is unstable and the
cluster can stop working at any moment.


So, is this correct? Do my assumptions make any sense? I didn't find any other
explanation on the net ... .



Thank you,
Kostya

On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
konstantin.ponomare...@gmail.com wrote:

 Hi guys,

 Corosync fails to start if there is no such network interface configured
 in the system.
 Even with rrp_mode: passive the problem is the same when at least one
 network interface is not configured in the system.

 Is this the expected behavior?
 I thought that when you use redundant rings, it is enough to have at least
 one NIC configured in the system. Am I wrong?

 Thank you,
 Kostya



corosync.conf
Description: Binary data
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Corosync fails to start when NIC is absent

2015-01-09 Thread Kostiantyn Ponomarenko
Hi guys,

Corosync fails to start if a network interface referenced in corosync.conf is
not configured in the system.
Even with rrp_mode: passive, the problem is the same when at least one
network interface is not configured in the system.

Is this the expected behavior?
I thought that when you use redundant rings, it is enough to have at least
one NIC configured in the system. Am I wrong?
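A simple pre-flight check along these lines (the addresses are placeholders
for the ring0_addr/ring1_addr values from corosync.conf) can confirm that both
ring addresses actually exist before Corosync is started:

# substitute the ring addresses from corosync.conf
for ip in 169.254.0.3 169.254.1.3; do
    ip -o addr show | grep -qFw "$ip" || echo "WARNING: ring address $ip is not configured"
done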

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Only build crm_uuid when supporting heartbeat

2014-09-22 Thread Kostiantyn Ponomarenko
Hi guys,

Could you help me understand the reason for building crm_uuid only
when heartbeat support is enabled?

My situation is this: I am trying to build Pacemaker from source for
Debian. I built the sources from the Debian Wheezy repo (with
--with-heartbeat removed). Then I did the same using the sources from
Debian Sid, and I noticed that crm_uuid is missing.

What is the reason to build Pacemaker --with-heartbeat? I thought that
having Corosync is sufficient for Pacemaker.
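For a Corosync-only build the heartbeat layer can simply be left out at
configure time; a hedged sketch (assuming Pacemaker's usual autoconf-style
switch):

$ ./configure --without-heartbeat
$ make
$ sudo make install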

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Only build crm_uuid when supporting heartbeat

2014-09-22 Thread Kostiantyn Ponomarenko
Fair enough, thank you!

Thank you,
Kostya

On Mon, Sep 22, 2014 at 8:38 PM, Andrew Beekhof and...@beekhof.net wrote:


 On 22 Sep 2014, at 11:50 pm, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:

  Hi guys,
 
  Could you help me to understand the reason to build crm_uuid only when
 supporting heartbeat?

 Because only heartbeat cares about /var/lib/heartbeat/hb_uuid which is
 what crm_uuid is designed to read/write.

 
  My situation is the next, I am trying to build Pacemaker's sources for
 Debian. I built the sources from Debian Wheezy repo (removed
 --with-heartbeat from it). Then I did the same but using sources from
 Debian Sid. And I've noticed that crm_uuid is missing.
 
  What the reason to build Pacemaker --with-heartbeat? I thought that
 having Corosync is sufficient for Pacemaker.
 
  Thank you,
  Kostya
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] testquorum module

2014-07-01 Thread Kostiantyn Ponomarenko
Hi Chrissie,

You mentioned testquorum in your doc ("Whatever happened to cman") as
a good place to start if you are thinking about writing your own quorum module.
The only file I found in the corosync code is in the /test folder, and I think
it's not the module. Could you please point me to where I can find this module?

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the cman package for ubuntu 13.10

2014-06-25 Thread Kostiantyn Ponomarenko
Hi Vijay B,

I have 2 Debian machines with the latest Corosync and Pacemaker.
I wanted the latest versions of these packages, so I didn't use apt-get
install corosync pacemaker.
Instead, I downloaded the sources, built them and installed them.
I have a document with all the steps I took to get it working.
If you still need some help here, just write back.

Thank you,
Kostya


On Wed, Jun 25, 2014 at 3:33 AM, Digimer li...@alteeve.ca wrote:

 I can't speak to the installation bits, I don't use Ubuntu/Debian myself,
 but once installed, this guide should apply:

 http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-
 single/Clusters_from_Scratch/index.html

 Cheeers


 On 24/06/14 07:40 PM, Vijay B wrote:

 Hi Digimer,

 Thanks for the info - do you happen to have a cheat sheet or
 instructions for corosync/pacemaker setup on ubuntu 14.04? I can give it
 a try..

 Regards,
 Vijay


 On Tue, Jun 24, 2014 at 3:52 PM, Digimer li...@alteeve.ca
 mailto:li...@alteeve.ca wrote:

 I don't know about 13, but on Ubuntu 14.04 LTS, they're up to
 corosync 2 and pacemaker 1.1.10, which do not need cman at all.


 On 24/06/14 06:17 PM, Vijay B wrote:

 Hi Emmanuel!

 Thanks for getting back to me on this. Turns out that the corosync
 package is already installed on both nodes:

 user@pmk1:~$ sudo apt-get install corosync

 [sudo] password for user:

 Reading package lists... Done

 Building dependency tree

 Reading state information... Done

 corosync is already the newest version.

 corosync set to manually installed.

 0 upgraded, 0 newly installed, 0 to remove and 338 not upgraded.

 user@pmk1:~$



 So I think it's something else that needs to be done.. any ideas?


 Thanks,
 Regards,
 Vijay


 On Tue, Jun 24, 2014 at 3:02 PM, emmanuel segura
 emi2f...@gmail.com mailto:emi2f...@gmail.com
 mailto:emi2f...@gmail.com mailto:emi2f...@gmail.com wrote:

  did you try apt-get install corosync pacemaker

  2014-06-24 23:50 GMT+02:00 Vijay B os.v...@gmail.com
 mailto:os.v...@gmail.com
  mailto:os.v...@gmail.com mailto:os.v...@gmail.com:


Hi,
   
This is my first time using/trying to setup pacemaker with
  corosync and I'm
trying to set it up on 2 ubuntu 13.10 VMs. I'm following
 the
  instructions in
this link -
 http://clusterlabs.org/__quickstart-ubuntu.html

 http://clusterlabs.org/quickstart-ubuntu.html
   
But when I attempt to install the cman package, I see
 this error:
   
user@pmk2:~$ sudo apt-get install cman
   
Reading package lists... Done
   
Building dependency tree
   
Reading state information... Done
   
Package cman is not available, but is referred to by
 another package.
   
This may mean that the package is missing, has been
 obsoleted, or
   
is only available from another source
   
However the following packages replace it:
   
  fence-agents:i386 dlm:i386 fence-agents dlm
   
   
E: Package 'cman' has no installation candidate
   
user@pmk2:~$
   
   
   
So it says that the fence-agents replaces the obsolete
 cman
  package, but
then I don't know what the new equivalent service for
 cman is, or
  what the
config files are. I don't see an /etc/default/cman file,
 and even
  if I add
one manually with a single line in it:
   
QUORUM_TIMEOUT=0 (for a two node cluster)
   
it doesn't seem to be used anywhere. Also, since there
 is no cman
  service as
such, this is what happens:
   
user@pmk2:~$ sudo service cman status
   
cman: unrecognized service
   
user@pmk2:~$
   
   
   
If I try to ignore the above error and simply attempt to
 start up the
pacemaker service, I see this:
   
user@pmk2:~$ sudo crm status
   
Could not establish cib_ro connection: Connection
 refused (111)
   
Connection to cluster failed: Transport endpoint is not
 connected
   
user@pmk2:~$
   
   

Re: [Pacemaker] configuration variants for 2 node cluster

2014-06-24 Thread Kostiantyn Ponomarenko
Hi Chrissie,

But wait_for_all doesn't help when there is no connection between the nodes,
because if I need to reboot the remaining working node, I won't get a
working cluster after that - both nodes will be waiting for a connection
between them.
That's why I am looking for a solution which could help me get one
node working in this situation (after a reboot); the wait_for_all/two_node
settings in question are shown below.
I've been thinking about some kind of marker which could help a node
determine the state of the other node.
Something like an external disk and a SCSI reservation command. Maybe you
could suggest another kind of marker?
I am not sure whether we can use the presence of a file on an external SSD
as the marker. Something like: if there is a file, the other node is alive;
if not, the node is dead.
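For reference, the wait_for_all and two_node options mentioned above are the
votequorum settings in corosync.conf, roughly:

quorum {
    provider: corosync_votequorum
    two_node: 1          # two-node mode; enabling it also turns on wait_for_all by default
    wait_for_all: 1
}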

Digimer,

Thanks for the links and information.
Anyway, if I go this way, I will write my own daemon to determine the state of
the other node.
Also, the information about the fence loop is new to me, thanks =)

Thank you,
Kostya


On Tue, Jun 24, 2014 at 10:55 AM, Christine Caulfield ccaul...@redhat.com
wrote:

 On 23/06/14 15:49, Digimer wrote:

 Hi Kostya,

I'm having a little trouble understanding your question, sorry.

On boot, the node will not start anything, so after booting it, you
 log in, check that it can talk to the peer node (a simple ping is
 generally enough), then start the cluster. It will join the peer's
 existing cluster (even if it's a cluster on just itself).

If you booted both nodes, say after a power outage, you will check
 the connection (again, a simple ping is fine) and then start the cluster
 on both nodes at the same time.



 wait_for_all helps with most of these situations. If a node goes down then
 it won't start services until it's seen the non-failed node because
 wait_for_all prevents a newly rebooted node from doing anything on its own.
 This also takes care of the case where both nodes are rebooted together of
 course, because that's the same as a new start.

 Chrissie


 If one of the nodes needs to be shut down, say for repairs or
 upgrades, you migrate the services off of it and over to the peer node,
 then you stop the cluster (which tells the peer that the node is leaving
 the cluster). After that, the remaining node operates by itself. When
 you turn it back on, you rejoin the cluster and migrate the services back.

I think, maybe, you are looking at things more complicated than you
 need to. Pacemaker and corosync will handle most of this for you, once
 setup properly. What operating system do you plan to use, and what
 cluster stack? I suspect it will be corosync + pacemaker, which should
 work fine.

 digimer

 On 23/06/14 10:36 AM, Kostiantyn Ponomarenko wrote:

 Hi Digimer,

 Suppose I disabled to cluster on start up, but what about remaining
 node, if I need to reboot it?
 So, even in case of connection lost between these two nodes I need to
 have one node working and providing resources.
 How did you solve this situation?
 Should it be a separate daemon which checks somehow connection between
 the two nodes and decides to run corosync and pacemaker or to keep them
 down?

 Thank you,
 Kostya


 On Mon, Jun 23, 2014 at 4:34 PM, Digimer li...@alteeve.ca
 mailto:li...@alteeve.ca wrote:

 On 23/06/14 09:11 AM, Kostiantyn Ponomarenko wrote:

 Hi guys,
 I want to gather all possible configuration variants for 2-node
 cluster,
 because it has a lot of pitfalls and there are not a lot of
 information
 across the internet about it. And also I have some questions
 about
 configurations and their specific problems.
 VARIANT 1:
 -
 We can use two_node and wait_for_all option from Corosync's
 votequorum, and set up fencing agents with delay on one of them.
 Here is a workflow(diagram) of this configuration:
 1. Node start.
 2. Cluster start (Corosync and Pacemaker) at the boot time.
 3. Wait for all nodes. All nodes joined?
   No. Go to step 3.
   Yes. Go to step 4.
 4. Start resources.
 5. Split brain situation (something with connection between
 nodes).
 6. Fencing agent on the one of the nodes reboots the other node
 (there
 is a configured delay on one of the Fencing agents).
 7. Rebooted node go to step 1.
 There are two (or more?) important things in this configuration:
 1. Rebooted node remains waiting for all nodes to be visible
 (connection
 should be restored).
 2. Suppose connection problem still exists and the node which
 rebooted
 the other guy has to be rebooted also (for some reasons). After
 reboot
 he is also stuck on step 3 because of connection problem.
 QUESTION:
 -
 Is it possible somehow to assign to the guy who won the reboot
 race
 (rebooted other guy) a status like a primary and allow him

Re: [Pacemaker] configuration variants for 2 node cluster

2014-06-24 Thread Kostiantyn Ponomarenko
Chrissie,

I don't want to reinvent a quorum disk =)
I know about its complexity.
That's why I think the most reasonable decision for me is to wait until
Corosync 2 gets a quorum disk :)
But meanwhile I need to deal with my situation somehow.
So, the possible solution for me is to create a daemon which will start the
cluster stack based on certain conditions.

Here is how I see it (any improvements are appreciated):

The marker: a SCSI reservation of an SSD.
IMPORTANT: The daemon should be able to distinguish which node the marker belongs to.
QUESTION: What other markers would it be possible to use?

--
Main workflow:
--
1. Node start
2. Daemon start
2.1. Check the marker. Is the marker present?
NO:
2.1.1. Set the marker. Successful?
NO: Do nothing. (Go to 2.1 and repeat it a few times.)
YES: Start the cluster stack.
YES:
2.1.2. Ping the other node. Successful?
NO: Do nothing: the other node is probably (99%) on.
YES:
Remove the marker.
Start the cluster stack.[*]
P.S.: In case the cluster can't establish a connection with
the other node, the fencing agent on this node is triggered and will fence the
other node (this can become a fence loop, but we can minimize the possibility of it[1]).

--
Split brain situation:
--
1. The fencing agent tries to set the marker. Successful?
NO: Do nothing: this node is going to be fenced. Meanwhile this node can
be put into standby mode while waiting for fencing.
YES: STONITH (reboot) the other node. The marker is kept.

-
Benefits:
-
Even after a reboot, one of the nodes still starts the cluster stack - the one
that the marker belongs to.

--
Possible problems:
--
If the node that the marker belongs to is not working, we need to force the
cluster stack to run on the other node.
This requires human intervention.


=
* In case the ping is successful but the cluster doesn't see the other node (is it
even possible?) we can do the following:
a. The daemon starts Corosync.
b. It gets the list of nodes and makes sure the other node is present
in it. This guarantees that the nodes see each other in the
cluster.
c. It starts Pacemaker.
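
To make the intended logic concrete, here is a rough shell sketch of the
start-up decision (steps 2.1 - 2.1.2 above). marker_is_present / marker_set /
marker_release and PEER_IP are placeholders for whatever marker mechanism and
addressing end up being used - this is only an illustration, not a tested
implementation:

    #!/bin/sh
    # Start-up decision of the proposed daemon (see the workflow above).
    PEER_IP=192.168.1.2            # placeholder: address of the other node

    start_stack() {
        service corosync start && service pacemaker start
    }

    if ! marker_is_present; then
        # 2.1.1: no marker yet - try to claim it and start the stack
        if marker_set; then
            start_stack
        fi                         # otherwise do nothing and retry later
    else
        # 2.1.2: marker exists - start only if the other node is reachable,
        # in which case the marker is stale and can be released
        if ping -c 3 -W 2 "$PEER_IP" >/dev/null 2>&1; then
            marker_release
            start_stack
        fi                         # peer unreachable: it probably holds the marker
    fi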



Thank you,
Kostya


On Tue, Jun 24, 2014 at 11:44 AM, Christine Caulfield ccaul...@redhat.com
wrote:

 On 24/06/14 09:36, Kostiantyn Ponomarenko wrote:

 Hi Chrissie,

 But wait_for_all doesn't help when there is no connection between the
 nodes.
 Because in case I need to reboot the remaining working node I won't get
 working cluster after that - both nodes will be waiting connection
 between them.
 That's why I am looking for the solution which could help me to get one
 node working in this situation (after reboot).
 I've been thinking about some kind of marker which could help a node to
 determine a state of the other node.
 Like external disk and SCSI reservation command. Maybe you could suggest
 another kind of marker?
 I am not sure can we use a presents of a file on external SSD as the
 marker. Kind of: if there is a file - the other node is alive, if no -
 node is dead.


 More seriously, that solution is harder than it might seem - which is one
 reason qdiskd was as complex as it became, and why votequorum is as
 conservative as it is when it comes to declaring a workable cluster. If
 someone is there to manually reboot nodes then it might be as well for a
 human decision to be made about which one is capable of running services.

 Chrissie

  Digimer,

 Thanks for the links and information.
 Anyway if I go this way, I will write my own daemon to determine a state
 of the other node.
 Also the information about fence loop is new for me, thanks =)

 Thank you,
 Kostya


 On Tue, Jun 24, 2014 at 10:55 AM, Christine Caulfield
 ccaul...@redhat.com mailto:ccaul...@redhat.com wrote:

 On 23/06/14 15:49, Digimer wrote:

 Hi Kostya,

 I'm having a little trouble understanding your question,
 sorry.

 On boot, the node will not start anything, so after booting
 it, you
 log in, check that it can talk to the peer node (a simple ping is
 generally enough), then start the cluster. It will join the peer's
 existing cluster (even if it's a cluster on just itself).

 If you booted both nodes, say after a power outage, you will
 check
 the connection (again, a simple ping is fine) and then start the
 cluster
 on both nodes at the same time.



 wait_for_all helps with most of these situations. If a node goes
 down then it won't start services until it's seen the non-failed
 node because wait_for_all prevents a newly rebooted node from doing
 anything on its own. This also takes care of the case where both
 nodes are rebooted together of course, because that's the same as a
 new start.

 Chrissie

Re: [Pacemaker] configuration variants for 2 node cluster

2014-06-24 Thread Kostiantyn Ponomarenko
Digimer,

Yes, wait_for_all is a part of votequorum in Corosync v2.

Thank you,
Kostya


On Tue, Jun 24, 2014 at 6:47 PM, Digimer li...@alteeve.ca wrote:

 On 24/06/14 03:55 AM, Christine Caulfield wrote:

 On 23/06/14 15:49, Digimer wrote:

 Hi Kostya,

I'm having a little trouble understanding your question, sorry.

On boot, the node will not start anything, so after booting it, you
 log in, check that it can talk to the peer node (a simple ping is
 generally enough), then start the cluster. It will join the peer's
 existing cluster (even if it's a cluster on just itself).

If you booted both nodes, say after a power outage, you will check
 the connection (again, a simple ping is fine) and then start the cluster
 on both nodes at the same time.



 wait_for_all helps with most of these situations. If a node goes down
 then it won't start services until it's seen the non-failed node because
 wait_for_all prevents a newly rebooted node from doing anything on its
 own. This also takes care of the case where both nodes are rebooted
 together of course, because that's the same as a new start.

 Chrissie


 This isn't available on RHEL 6, is it? iirc, it's a Corosync v2 feature?


 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] configuration variants for 2 node cluster

2014-06-23 Thread Kostiantyn Ponomarenko
Hi guys,

I want to gather all possible configuration variants for a 2-node cluster,
because it has a lot of pitfalls and there is not a lot of information about it
across the internet. I also have some questions about the configurations and
their specific problems.

VARIANT 1:
-
We can use the two_node and wait_for_all options from Corosync's votequorum,
and set up fencing agents with a delay on one of them.

Here is a workflow(diagram) of this configuration:

1. Node start.
2. Cluster start (Corosync and Pacemaker) at boot time.
3. Wait for all nodes. Have all nodes joined?
No. Go to step 3.
Yes. Go to step 4.
4. Start resources.
5. Split-brain situation (something wrong with the connection between the nodes).
6. The fencing agent on one of the nodes reboots the other node (there is a
configured delay on one of the fencing agents).
7. The rebooted node goes to step 1.
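
For reference, the corosync.conf part of this variant would look roughly like
this (just a sketch; the totem and nodelist sections are omitted):

    quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
        wait_for_all: 1
    }

and the delay goes on exactly one of the two STONITH primitives (via the fence
agent's delay parameter, for agents that support it), so that only one node can
win the fencing race after a split.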

There are two (or more?) important things in this configuration:
1. The rebooted node keeps waiting for all nodes to become visible (the connection
has to be restored first).
2. Suppose the connection problem still exists and the node that rebooted the
other one has to be rebooted as well (for some reason). After the reboot it is
also stuck on step 3 because of the connection problem.

QUESTION:
-
Is it possible to somehow assign the node that won the reboot race
(i.e. rebooted the other one) a status like "primary" and allow it not to wait
for all nodes after a reboot, and then drop this status after the other node
joins it again?
So is it possible?

Right now that's the only configuration I know of for a 2-node cluster.
Other variants would be very much appreciated =)

VARIANT 2 (not implemented, just a suggestion):
-
I've been thinking about using an external SSD (or some other external drive).
For example, the fencing agent can reserve the SSD using a SCSI command and
after that reboot the other node.

The main idea is that the first node, as soon as the cluster starts on it,
reserves the SSD until the other node joins the cluster; after that the SCSI
reservation is removed.

1. Node start
2. Cluster start (Corosync and Pacemaker) at boot time.
3. Reserve the SSD. Did it manage to reserve it?
No. Don't start resources (wait for all).
Yes. Go to step 4.
4. Start resources.
5. Remove the SCSI reservation when the other node has joined.
6. Split-brain situation (something wrong with the connection between the nodes).
7. The fencing agent tries to reserve the SSD. Did it manage to reserve it?
No. Maybe put the node in standby mode ...
Yes. Reboot the other node.
8. Optional: a single node can keep the SSD reservation while it is alone in the
cluster or until it shuts down.
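
To illustrate steps 3, 5 and 7, the reservation itself could be done with
sg_persist from sg3_utils - strictly a sketch, with /dev/sdx and the 0xabc1 key
as placeholders:

    # register our key and try to take the reservation (steps 3 and 7)
    sg_persist --out --register --param-sark=0xabc1 /dev/sdx
    if sg_persist --out --reserve --param-rk=0xabc1 --prout-type=5 /dev/sdx; then
        echo "reservation taken - safe to start resources / fence the peer"
    else
        echo "reservation held by the other node - wait (or go standby)"
    fi

    # release the reservation once the other node has joined (step 5)
    sg_persist --out --release --param-rk=0xabc1 --prout-type=5 /dev/sdx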

I am really looking forward to finding the best solution (or a couple of them
=)).
I hope I am not the only person who is interested in this topic.


Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] configuration variants for 2 node cluster

2014-06-23 Thread Kostiantyn Ponomarenko
Hi Digimer,

Suppose I disable the cluster on start-up, but what about the remaining node
if I need to reboot it?
So even in the case of a lost connection between these two nodes, I need to have
one node working and providing resources.
How did you solve this situation?
Should it be a separate daemon that somehow checks the connection between the
two nodes and decides whether to run Corosync and Pacemaker or keep them down?

Thank you,
Kostya


On Mon, Jun 23, 2014 at 4:34 PM, Digimer li...@alteeve.ca wrote:

 On 23/06/14 09:11 AM, Kostiantyn Ponomarenko wrote:

 Hi guys,
 I want to gather all possible configuration variants for 2-node cluster,
 because it has a lot of pitfalls and there are not a lot of information
 across the internet about it. And also I have some questions about
 configurations and their specific problems.
 VARIANT 1:
 -
 We can use two_node and wait_for_all option from Corosync's
 votequorum, and set up fencing agents with delay on one of them.
 Here is a workflow(diagram) of this configuration:
 1. Node start.
 2. Cluster start (Corosync and Pacemaker) at the boot time.
 3. Wait for all nodes. All nodes joined?
  No. Go to step 3.
  Yes. Go to step 4.
 4. Start resources.
 5. Split brain situation (something with connection between nodes).
 6. Fencing agent on the one of the nodes reboots the other node (there
 is a configured delay on one of the Fencing agents).
 7. Rebooted node go to step 1.
 There are two (or more?) important things in this configuration:
 1. Rebooted node remains waiting for all nodes to be visible (connection
 should be restored).
 2. Suppose connection problem still exists and the node which rebooted
 the other guy has to be rebooted also (for some reasons). After reboot
 he is also stuck on step 3 because of connection problem.
 QUESTION:
 -
 Is it possible somehow to assign to the guy who won the reboot race
 (rebooted other guy) a status like a primary and allow him not to wait
 for all nodes after reboot. And neglect this status after other node
 joined this one.
 So is it possible?
 Right now that's the only configuration I know for 2 node cluster.
 Other variants are very appreciated =)
 VARIANT 2 (not implemented, just a suggestion):
 -
 I've been thinking about using external SSD drive (or other external
 drive). So for example fencing agent can reserve SSD using SCSI command
 and after that reboot the other node.
 The main idea of this is the first node, as soon as a cluster starts on
 it, reserves SSD till the other node joins the cluster, after that SCSI
 reservation is removed.
 1. Node start
 2. Cluster start (Corosync and Pacemaker) at the boot time.
 3. Reserve SSD. Did it manage to reserve?
  No. Don't start resources (Wait for all).
  Yes. Go to step 4.
 4. Start resources.
 5. Remove SCSI reservation when the other node has joined.
 5. Split brain situation (something with connection between nodes).
 6. Fencing agent tries to reserve SSD. Did it manage to reserve?
  No. Maybe puts node in standby mode ...
  Yes. Reboot the other node.
 7. Optional: a single node can keep SSD reservation till he is alone in
 the cluster or till his shut-down.
 I am really looking forward to find the best solution (or a couple of
 them =)).
 Hope I am not the only person ho is interested in this topic.


 Thank you,
 Kostya


 Hi Kostya,

   I only build 2-node clusters, and I've not had problems with this going
 back to 2009 over dozens of clusters. The tricks I found are:

 * Disable quorum (of course)
 * Set up good fencing, and add a delay to the node you prefer (or pick
 one at random, if they are of equal value) to avoid dual fences
 * Disable the cluster on start up, to prevent fence loops.

   That's it. With this, your 2-node cluster will be just fine.
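
  Concretely, on a corosync+pacemaker stack those three points boil down to
 something like this (only a sketch; the exact way to disable the init scripts
 depends on the distro - update-rc.d shown here is the Debian form):

    # keep the cluster from starting - and fence-looping - at boot
    update-rc.d corosync disable
    update-rc.d pacemaker disable

    # let a single surviving node keep running without quorum, with fencing on
    crm configure property no-quorum-policy=ignore
    crm configure property stonith-enabled=true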

   As for your question; Once a node is fenced successfully, the resource
 manager (pacemaker) will take over any services lost on the fenced node, if
 that is how you configured it. A node that either gracefully leaves or is
 fenced/dies should not interfere with the remaining node.

   The problem is when a node vanishes and fencing fails. Then, not knowing
 what the other node might be doing, the only safe option is to block,
 otherwise you risk a split-brain. This is why fencing is so important.

 Cheers

 --
 Digimer
 Papers and Projects: https://alteeve.ca/w/
 What if the cure for cancer is trapped in the mind of a person without
 access to education?

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org

Re: [Pacemaker] configuration variants for 2 node cluster

2014-06-23 Thread Kostiantyn Ponomarenko
Digimer,

I am using Debian as the OS and Corosync + Pacemaker as the cluster stack.
I understand your suggestion.
I don't have any questions about it.

My main question is how to do it automatically,
so that it can work without human intervention for a while (nodes could
be rebooted, but not repaired).
That is my question.

The only way I see to do it is:
write a daemon which will run Corosync and Pacemaker.
But before that, the daemon will check the connection between the two nodes,
then make a decision whether to start the cluster or not, based on that check.

Maybe you have some thoughts on how I could do it in another way?
Instead of doing a ping, the daemon could run Corosync, get the number of
nodes in the cluster, and based on that decide whether to run Pacemaker or
not.
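
A rough sketch of that second idea (only an illustration; the parsing depends
on the exact corosync-quorumtool output format):

    service corosync start
    sleep 10                                  # give totem time to form a membership
    members=$(corosync-quorumtool -l | grep -c '^ *[0-9]')   # count member rows
    if [ "$members" -ge 2 ]; then
        service pacemaker start
    else
        service corosync stop                 # alone - don't bring resources up
    fi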


Thank you,
Kostya


On Mon, Jun 23, 2014 at 5:49 PM, Digimer li...@alteeve.ca wrote:

 Hi Kostya,

   I'm having a little trouble understanding your question, sorry.

   On boot, the node will not start anything, so after booting it, you log
 in, check that it can talk to the peer node (a simple ping is generally
 enough), then start the cluster. It will join the peer's existing cluster
 (even if it's a cluster on just itself).

   If you booted both nodes, say after a power outage, you will check the
 connection (again, a simple ping is fine) and then start the cluster on
 both nodes at the same time.

   If one of the nodes needs to be shut down, say for repairs or upgrades,
 you migrate the services off of it and over to the peer node, then you stop
 the cluster (which tells the peer that the node is leaving the cluster).
 After that, the remaining node operates by itself. When you turn it back
 on, you rejoin the cluster and migrate the services back.

   I think, maybe, you are looking at things more complicated than you need
 to. Pacemaker and corosync will handle most of this for you, once setup
 properly. What operating system do you plan to use, and what cluster stack?
 I suspect it will be corosync + pacemaker, which should work fine.

 digimer

 On 23/06/14 10:36 AM, Kostiantyn Ponomarenko wrote:

 Hi Digimer,


 Suppose I disabled to cluster on start up, but what about remaining
 node, if I need to reboot it?
 So, even in case of connection lost between these two nodes I need to
 have one node working and providing resources.
 How did you solve this situation?
 Should it be a separate daemon which checks somehow connection between
 the two nodes and decides to run corosync and pacemaker or to keep them
 down?

 Thank you,
 Kostya


 On Mon, Jun 23, 2014 at 4:34 PM, Digimer li...@alteeve.ca
 mailto:li...@alteeve.ca wrote:

 On 23/06/14 09:11 AM, Kostiantyn Ponomarenko wrote:

 Hi guys,
 I want to gather all possible configuration variants for 2-node
 cluster,
 because it has a lot of pitfalls and there are not a lot of
 information
 across the internet about it. And also I have some questions about
 configurations and their specific problems.
 VARIANT 1:
 -
 We can use two_node and wait_for_all option from Corosync's
 votequorum, and set up fencing agents with delay on one of them.
 Here is a workflow(diagram) of this configuration:
 1. Node start.
 2. Cluster start (Corosync and Pacemaker) at the boot time.
 3. Wait for all nodes. All nodes joined?
   No. Go to step 3.
   Yes. Go to step 4.
 4. Start resources.
 5. Split brain situation (something with connection between
 nodes).
 6. Fencing agent on the one of the nodes reboots the other node
 (there
 is a configured delay on one of the Fencing agents).
 7. Rebooted node go to step 1.
 There are two (or more?) important things in this configuration:
 1. Rebooted node remains waiting for all nodes to be visible
 (connection
 should be restored).
 2. Suppose connection problem still exists and the node which
 rebooted
 the other guy has to be rebooted also (for some reasons). After
 reboot
 he is also stuck on step 3 because of connection problem.
 QUESTION:
 -
 Is it possible somehow to assign to the guy who won the reboot
 race
 (rebooted other guy) a status like a primary and allow him not
 to wait
 for all nodes after reboot. And neglect this status after other
 node
 joined this one.
 So is it possible?
 Right now that's the only configuration I know for 2 node cluster.
 Other variants are very appreciated =)
 VARIANT 2 (not implemented, just a suggestion):
 -
 I've been thinking about using external SSD drive (or other
 external
 drive). So for example fencing agent can reserve SSD using SCSI
 command
 and after that reboot the other node

[Pacemaker] API documentation

2014-06-17 Thread Kostiantyn Ponomarenko
I took a look at the include folders of Pacemaker and Corosync and I didn't
find any explanation of the functions there. Did I look in the wrong place?

My goal is to manage the cluster from my app, so I don't need to use crmsh or
pcs.

Any ideas are appreciated.
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] API documentation

2014-06-17 Thread Kostiantyn Ponomarenko
Andrew, many thanks!
On Jun 18, 2014 2:11 AM, Andrew Beekhof and...@beekhof.net wrote:


 On 17 Jun 2014, at 8:01 pm, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:

  I took a look at the include folders of pacemaker and corosync and I
 didn't find there any explanation to functions. Did I look at a wrong place?
 
  My goal is to manage cluster from my app, so I don't need to use crmsh
 or pcs.
 
  Any ideas are appreciated.

 For reading cluster state, try: crm_mon --as-xml
 Or for the raw config, cibadmin -Q and the relax-ng (.rng) schema files.
 If calling a binary isn't something you want to do, try looking at the
 source for those two tools to see how they do it.

 For making changes, including stop/start/move a resource, you also want
 cibadmin (or its C-API) and the relax-ng schema files.
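
 For example, the command-line forms of those calls would be something like
 this (my_rsc is just a hypothetical resource name):

    # read-only cluster state as XML
    crm_mon --as-xml

    # dump the raw CIB
    cibadmin --query > cib.xml

    # push back a modified resources section
    cibadmin --replace --scope resources --xml-file resources.xml

    # or move a single resource
    crm_resource --resource my_rsc --move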


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] votequorum for 2 node cluster

2014-06-13 Thread Kostiantyn Ponomarenko
Thank you for the explanation. I got the point.
But just to be sure - and maybe someone will find this info helpful - I want to
clarify the behavior of these two options.
From http://manpages.ubuntu.com/manpages/saucy/man5/votequorum.5.html,
about the last_man_standing option:

   NOTES: In order for the cluster to downgrade automatically from 2 nodes
   to a 1 node cluster, the auto_tie_breaker feature must also be enabled.
   If auto_tie_breaker is not enabled, and one more failure occurs, the
   remaining node will not be quorate.


But this is still roulette - the only node you can afford to lose is the one
that doesn't have the lowest nodeid?

Am I right?



Thank you,
Kostya


On Thu, Jun 12, 2014 at 12:37 PM, Christine Caulfield ccaul...@redhat.com
wrote:

 On 12/06/14 00:51, Andrew Beekhof wrote:

 Chrissy?  Can you shed some light here?

 On 11 Jun 2014, at 11:26 pm, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:

  Hi guys,

 I am trying to deal somehow with split brain situation in 2 node cluster
 using votequorum.
 Here is a quorum section in my corosync.conf:

 provider: corosync_votequorum
 expected_votes: 2
 wait_for_all: 1
 last_man_standing: 1
 auto_tie_breaker: 1

 My question is about behavior of the remaining node after I shout down
 node with the lowest nodeid.
 My expectation is that after a last_man_standing_window this node should
 be back working.

 Or in the case of two node cluster it is not a solution?



 If you want symmetric failure handling in a 2 node cluster then the
 two_node option might be more appropriate. auto_tie_breaker and
 last_man_standing are more useful for larger clusters where a network split
 leaves more than one node in a partition.

 Chrissie


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] votequorum for 2 node cluster

2014-06-11 Thread Kostiantyn Ponomarenko
Hi guys,

I am trying to deal somehow with a split-brain situation in a 2-node cluster
using votequorum.
Here is a quorum section in my corosync.conf:

provider: corosync_votequorum
expected_votes: 2
wait_for_all: 1
last_man_standing: 1
auto_tie_breaker: 1

My question is about the behavior of the remaining node after I shut down the
node with the lowest nodeid.
My expectation is that after the last_man_standing_window this node should be
back to working.

Or is this not a solution in the case of a two-node cluster?

Thank you,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] votequorum for 2 node cluster

2014-06-11 Thread Kostiantyn Ponomarenko
The two_node option is another question of mine. I think it's not for this thread.

 last_man_standing: 1
 auto_tie_breaker: 1

So, anyway, the only node that will remain working in a split-brain (or one
node shut down) situation is the one with the lowest id.
And that is like roulette: if we lose the node with the lowest nodeid, we lose everything.
So I can only afford to lose the node which doesn't have the lowest nodeid?
And that's not very useful in a 2-node cluster.
Am I correct?


Thank you,
Kostya


On Wed, Jun 11, 2014 at 4:33 PM, Vladislav Bogdanov bub...@hoster-ok.com
wrote:

 11.06.2014 16:26, Kostiantyn Ponomarenko wrote:
  Hi guys,
 
  I am trying to deal somehow with split brain situation in 2 node cluster
  using votequorum.
  Here is a quorum section in my corosync.conf:
 
  provider: corosync_votequorum
  expected_votes: 2

 Just a side note, not an answer to your question:
 you'd add 'two_node: 1' here as two-node clusters are very special in
 terms of quorum.

  wait_for_all: 1
  last_man_standing: 1
  auto_tie_breaker: 1
 
  My question is about behavior of the remaining node after I shout down
  node with the lowest nodeid.
  My expectation is that after a last_man_standing_window this node should
  be back working.
 
  Or in the case of two node cluster it is not a solution?
 
  Thank you,
  Kostya
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] auto_tie_breaker in two node cluster

2014-05-21 Thread Kostiantyn Ponomarenko
Honza,
Can you please explain what "network based" means?

Thank you,
Kostya


On Wed, May 21, 2014 at 10:54 AM, Jan Friesse jfrie...@redhat.com wrote:

  I am not quite understand how auto_tie_breaker works.
  Say we have a cluster with 2 nodes and enabled auto_tie_breaker feature.
  Each node has 2 NICs. One NIC is used for cluster communication and
 another
  one is used for providing some services from the cluster.
  So the question is how the nodes will distinguish between two possible
  situations:
  1) connection between the nodes are lost, but the both nodes remain
 working;
  2) power supply on the node 1 (has the lowest node-id) broke down and
 node
  2 remain working;
 
  In 1st case, according to the description of the auto_tie_breaker, the
 node
  with the lowest node-id in the cluster will remain working.
  And in that particular situation it is good result because the both nodes
  are in good state (the both can remain working).
  In 2nd case the only working node is #2 and the node-id of that node is
 not
  the lowest one. So what will be in this case? What logic will work,
 because
  we have lost the node with the lowest node id in 2-node cluster?
 
  there is no qdiskd for votequorum yet
  Is there plans to implement it?
 

 Kostya,
 yes there are plans to implement qdisk (network based one).

 Regards,
   Honza


  Many thanks,
  Kostya
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] auto_tie_breaker in two node cluster

2014-05-20 Thread Kostiantyn Ponomarenko
I do not quite understand how auto_tie_breaker works.
Say we have a cluster with 2 nodes and the auto_tie_breaker feature enabled.
Each node has 2 NICs. One NIC is used for cluster communication and the other
one is used for providing some services from the cluster.
So the question is how the nodes will distinguish between two possible
situations:
1) the connection between the nodes is lost, but both nodes keep working;
2) the power supply of node 1 (which has the lowest node-id) breaks down and node
2 keeps working.

In the 1st case, according to the description of auto_tie_breaker, the node
with the lowest node-id in the cluster will remain working.
In that particular situation this is a good result, because both nodes
are in a good state (both could keep working).
In the 2nd case the only working node is #2, and its node-id is not
the lowest one. So what happens in this case? What logic applies, given that
we have lost the node with the lowest node-id in a 2-node cluster?

 there is no qdiskd for votequorum yet
Are there plans to implement it?

Many thanks,
Kostya
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org