Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-20 Thread Jan Friesse
Kostiantyn,


 One more thing to clarify.
 You said rebind can be avoided - what does it mean?

By that I mean that as long as you don't shut down the interface, everything
will work as expected. Interface shutdown is an administrator's decision;
the system doesn't do it automagically :)

Regards,
  Honza

 

Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-20 Thread Andrei Borzenkov
On Tue, Jan 20, 2015 at 11:50 AM, Jan Friesse jfrie...@redhat.com wrote:
 Kostiantyn,


 One more thing to clarify.
 You said rebind can be avoided - what does it mean?

 By that I mean that as long as you don't shut down the interface, everything
 will work as expected. Interface shutdown is an administrator's decision;
 the system doesn't do it automagically :)


What is possible is that, e.g., during reboot an interface (hardware) fails
and is not detected. This would lead to a complete outage of the node that
could be avoided.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-20 Thread Kostiantyn Ponomarenko
Got it. Thank you =)
I was just thinking about the possibility of a NIC burning out.

Thank you,
Kostya

On Tue, Jan 20, 2015 at 10:50 AM, Jan Friesse jfrie...@redhat.com wrote:

 Kostiantyn,


  One more thing to clarify.
  You said rebind can be avoided - what does it mean?

 By that I mean that as long as you don't shut down the interface, everything
 will work as expected. Interface shutdown is an administrator's decision;
 the system doesn't do it automagically :)

 Regards,
   Honza

 

Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-19 Thread Kostiantyn Ponomarenko
One more thing to clarify.
You said rebind can be avoided - what does it mean?

Thank you,
Kostya


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-14 Thread Jan Friesse
Kostiantyn,

 Honza,
 
 Thank you for helping me.
 So, there is no defined behavior in case one of the interfaces is not in
 the system?

You are right. There is no defined behavior.

Regards,
  Honza


 
 

Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-14 Thread Kostiantyn Ponomarenko
Thank you. Now I am aware of it.

Thank you,
Kostya


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-13 Thread Jan Friesse
Kostiantyn,


 According to https://access.redhat.com/solutions/638843 , the
 interface that is defined in corosync.conf must be present in the
 system (see the ROOT CAUSE section at the bottom of the article).
 To confirm that, I ran a couple of tests.
 
 Here is the relevant part of the corosync.conf file (in free form; the
 original config file is also attached):
 ===
 rrp_mode: passive
 ring0_addr is defined in corosync.conf
 ring1_addr is defined in corosync.conf
 ===
 
 ---
 
 Two-node cluster
 
 ---
 
 Test #1:
 --
 IP for ring0 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync fails to start.
 From the logs:
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
 config: No interfaces defined
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
 Engine exiting with status 8 at main.c:1343.
 Result: Corosync and Pacemaker are not running.
 
 Test #2:
 --
 IP for ring1 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync starts.
 Start Pacemaker simultaneously on both nodes.
 Pacemaker fails to start.
 From the logs, the last messages from corosync:
 Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
 interface 169.254.1.3 FAULTY
 Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
 recovered ring 0
 Result: Corosync and Pacemaker are not running.
 
 
 Test #3:
 
 rrp_mode: active leads to the same result, except that the Corosync and
 Pacemaker init scripts report the status as running.
 But /var/log/cluster/corosync.log (opened in vim) still shows a lot of errors like:
 Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
 to the CPG API failed: Library error (2)
 
 Result: Corosync and Pacemaker report their status as running, but
 crm_mon cannot connect to the cluster database, and half of
 Pacemaker's services are not running (including the Cluster Information
 Base (CIB)).
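
The three failure modes above each leave a distinct message in the logs. A minimal sketch of a scanner for those messages follows; it is an illustration only, not a tool shipped with Corosync or Pacemaker. The `SIGNATURES` table and the `classify` helper are hypothetical names, and the sample lines are the log excerpts quoted in the tests with their wrapped halves joined.

```python
# Hypothetical helper: map log substrings seen in the tests above to a
# short description of the failure mode they indicate.
SIGNATURES = {
    "No interfaces defined": "corosync refused to start (Test #1)",
    "FAULTY": "a ring interface was marked faulty (Test #2)",
    "Connection to the CPG API failed":
        "a Pacemaker daemon lost its CPG connection to corosync (Test #3)",
}


def classify(lines):
    """Return (line_number, description) for every known signature found."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for needle, meaning in SIGNATURES.items():
            if needle in line:
                findings.append((number, meaning))
    return findings


# Log excerpts from the thread, with the wrapped lines joined.
SAMPLE_LOG = [
    "Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in "
    "config: No interfaces defined",
    "Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 "
    "interface 169.254.1.3 FAULTY",
    "Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: "
    "Connection to the CPG API failed: Library error (2)",
]

if __name__ == "__main__":
    for number, meaning in classify(SAMPLE_LOG):
        print(f"line {number}: {meaning}")
```

To scan a real log, the lines of /var/log/cluster/corosync.log could be passed to `classify` instead of `SAMPLE_LOG`; plain substring matching is used here only to keep the sketch short.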
 
 
 ---
 
 Single-node mode
 
 ---
 
 IP for ring0 is not defined in the system:
 
 Corosync fails to start.
 
 IP for ring1 is not defined in the system:
 
 Corosync and Pacemaker start.
 
 It is possible that the configuration is applied successfully (50%),
 
 and it is possible that the cluster does not run any resources,
 
 and it is possible that the node cannot be put into standby mode (it shows:
 communication error),
 
 and it is possible that the cluster runs all resources, but the applied
 configuration is not guaranteed to be fully loaded (some rules can be
 missing).
 
 
 ---
 
 Conclusions:
 
 ---
 
 It is possible that in some rare cases (see the comments on the bug) the
 cluster will work, but in that case its working state is unstable and the
 cluster can stop working at any moment.
 
 
 So, is this correct? Do my assumptions make any sense? I didn't find any other
 explanation on the net ... .

Corosync needs all interfaces during start and runtime. This doesn't
mean they must be connected (that would make corosync unusable in case of
a physical NIC/switch or cable failure), but they must be up and have
the correct IP.

When this is not the case, corosync rebinds to localhost and weird
things happen. Removal of this rebinding is a long-standing TODO, but there
are still more important bugs (especially because the rebind can be avoided).
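
The requirement above (every configured ring address must be up and assigned locally before corosync starts) can be checked ahead of time. The following is a minimal sketch of such a pre-start check, not part of corosync itself. It assumes the text it is given contains only the local node's `ringX_addr` lines and that those are numeric IP addresses; `SAMPLE_CONF`, `ring_addresses`, `is_assigned_locally` and `missing_ring_addresses` are hypothetical names, and the addresses in the sample are placeholders.

```python
import re
import socket

# Hypothetical sample: in practice the text would be read from
# corosync.conf and reduced to the local node's entries.
SAMPLE_CONF = """
totem {
    rrp_mode: passive
}
nodelist {
    node {
        ring0_addr: 127.0.0.1
        ring1_addr: 192.0.2.10
    }
}
"""

RING_ADDR = re.compile(r"^[ \t]*(ring\d+_addr)[ \t]*:[ \t]*(\S+)[ \t]*$",
                       re.MULTILINE)


def ring_addresses(conf_text):
    """Return (key, address) pairs for every ringX_addr line in the text."""
    return RING_ADDR.findall(conf_text)


def is_assigned_locally(address):
    """True if a UDP socket can be bound to the numeric address on this host.

    Binding normally fails when no local interface carries the address.
    """
    try:
        infos = socket.getaddrinfo(address, 0, type=socket.SOCK_DGRAM,
                                   flags=socket.AI_NUMERICHOST)
    except socket.gaierror:
        return False  # not a numeric IP address
    family, socktype, proto, _, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        try:
            sock.bind(sockaddr)
        except OSError:
            return False
    return True


def missing_ring_addresses(conf_text):
    """List the configured ring addresses that cannot be bound locally."""
    return [(key, addr) for key, addr in ring_addresses(conf_text)
            if not is_assigned_locally(addr)]


if __name__ == "__main__":
    for key, addr in missing_ring_addresses(SAMPLE_CONF):
        print(f"{key} {addr} is not assigned to any local interface")
```

An init script could run a check like this and refuse to start corosync when the list is not empty, which avoids the rebind to localhost instead of trying to recover from it. The bind test is only one way to detect a missing address, and it can give a false positive on hosts configured to allow non-local binds.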

Regards,
  Honza

 
 
 
 Thank you,
 Kostya
 
 On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:
 
 Hi guys,

 Corosync fails to start if there is no such network interface configured
 in the system.
 Even with rrp_mode: passive the problem is the same when at least one
 network interface is not configured in the system.

 Is this the expected behavior?
 I thought that when you use redundant rings, it is enough to have at least
 one NIC configured in the system. Am I wrong?

 Thank you,
 Kostya

 
 
 


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-13 Thread Kostiantyn Ponomarenko
Honza,

Thank you for helping me.
So, there is no defined behavior in case one of the interfaces is not in
the system?


Thank you,
Kostya

On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote:

 Kostiantyn,


  According to the https://access.redhat.com/solutions/638843 , the
  interface, that is defined in the corosync.conf, must be present in the
  system (see at the bottom of the article, section ROOT CAUSE).
  To confirm that I made a couple of tests.
 
  Here is a part of the corosync.conf file (in a free-write form) (also
  attached the origin config file):
  ===
  rrp_mode: passive
  ring0_addr is defined in corosync.conf
  ring1_addr is defined in corosync.conf
  ===
 
  ---
 
  Two-node cluster
 
  ---
 
  Test #1:
  --
  IP for ring0 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync fails to start.
  From the logs:
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
  config: No interfaces defined
  Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
  Engine exiting with status 8 at main.c:1343.
  Result: Corosync and Pacemaker are not running.
 
  Test #2:
  --
  IP for ring1 is not defines in the system:
  --
  Start Corosync simultaneously on both nodes.
  Corosync starts.
  Start Pacemaker simultaneously on both nodes.
  Pacemaker fails to start.
  From the logs, the last writes from the corosync:
  Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
  interface 169.254.1.3 FAULTY
  Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
  recovered ring 0
  Result: Corosync and Pacemaker are not running.
 
 
  Test #3:
 
  rrp_mode: active leads to the same result, except Corosync and
 Pacemaker
  init scripts return status running.
  But still vim /var/log/cluster/corosync.log shows a lot of errors like:
  Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
  to the CPG API failed: Library error (2)
 
  Result: Corosync and Pacemaker show their statuses as running, but
  crm_mon cannot connect to the cluster database. And half of the
  Pacemaker's services are not running (including Cluster Information Base
  (CIB)).
 
 
  ---
 
  For a single node mode
 
  ---
 
  IP for ring0 is not defines in the system:
 
  Corosync fails to start.
 
  IP for ring1 is not defines in the system:
 
  Corosync and Pacemaker are started.
 
  It is possible that configuration will be applied successfully (50%),
 
  and it is possible that the cluster is not running any resources,
 
  and it is possible that the node cannot be put in a standby mode (shows:
  communication error),
 
  and it is possible that the cluster is running all resources, but applied
  configuration is not guaranteed to be fully loaded (some rules can be
  missed).
 
 
  ---
 
  Conclusions:
 
  ---
 
  It is possible that in some rare cases (see comments to the bug) the
  cluster will work, but in that case its working state is unstable and the
  cluster can stop working every moment.
 
 
  So, is it correct? Does my assumptions make any sense? I didn't any other
  explanation in the network ... .

 Corosync needs all interfaces during start and runtime. This doesn't
 mean they must be connected (that would make corosync unusable after a
 physical NIC, switch, or cable failure), but they must be up and have the
 correct IP address.

 When this is not the case, corosync rebinds to localhost and weird
 things happen. Removing this rebinding is a long-standing TODO, but there
 are still more important bugs (especially because the rebind can be avoided).
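
A pre-start check along these lines might catch the situation described above before corosync falls back to localhost. This is only a sketch, not part of corosync: the function names are made up for illustration, and it assumes that binding a UDP socket to an address fails with EADDRNOTAVAIL when no local interface holds that address. It shows that an address is assigned locally; it does not prove that the link is up.

```python
import errno
import socket


def is_local_address(addr: str) -> bool:
    """Return True if addr can be bound on this host (hypothetical helper)."""
    family = socket.AF_INET6 if ":" in addr else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((addr, 0))  # port 0: let the kernel pick any free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False  # no local interface carries this address
            raise  # anything else is a different problem; do not hide it
    return True


def missing_ring_addresses(ring_addrs):
    """Return the ring addresses that are not assigned to this host."""
    return [addr for addr in ring_addrs if not is_local_address(addr)]


# Hypothetical addresses; substitute this node's own ring0/ring1 addresses.
print(missing_ring_addresses(["127.0.0.1", "192.0.2.1"]))
```

An init script could refuse to start corosync while the returned list is non-empty, which avoids the rebind rather than trying to recover from it.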

 Regards,
   Honza

 
 
 
  Thank you,
  Kostya
 
  On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
  konstantin.ponomare...@gmail.com wrote:
 
  Hi guys,
 
  Corosync fails to start if there is no such network interface configured
  in the system.
  Even with rrp_mode: passive the problem is the same when at least one
  network interface is not configured in the system.
 
  Is this the expected behavior?
  I thought that when you use redundant rings, it is enough to have at least
  one NIC configured in the system. Am I wrong?
 
  Thank you,
  Kostya
 
 
 
 
  ___
  Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
  Project Home: http://www.clusterlabs.org
  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
  Bugs: http://bugs.clusterlabs.org
 



Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-12 Thread Kostiantyn Ponomarenko
According to https://access.redhat.com/solutions/638843, the interface
that is defined in corosync.conf must be present in the system (see the
ROOT CAUSE section at the bottom of the article).
To confirm that, I ran a couple of tests.

Here is the relevant part of the corosync.conf file, in free form (the
original config file is attached):
===
rrp_mode: passive
ring0_addr is defined in corosync.conf
ring1_addr is defined in corosync.conf
===
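
To list which ring addresses a config actually names, something like the following could be used. It is a rough sketch under assumptions: the attached corosync.conf is not shown in this thread, so the sample text below is hypothetical (documentation-range addresses), and the helper name `ring_addresses` is invented for illustration. It only looks for `ringN_addr: value` lines and does not parse the config format in full.

```python
import re

# Matches lines such as "ring0_addr: 192.0.2.1"; commented-out lines are skipped
# because the pattern must start at the (indented) beginning of a line.
RING_ADDR = re.compile(r"^\s*ring(\d+)_addr:\s*(\S+)", re.MULTILINE)


def ring_addresses(conf_text):
    """Map ring number -> list of addresses named in the config text."""
    rings = {}
    for ring, addr in RING_ADDR.findall(conf_text):
        rings.setdefault(int(ring), []).append(addr)
    return rings


# Hypothetical two-node config, not the file attached to this message.
SAMPLE_CONF = """
totem {
    version: 2
    rrp_mode: passive
}
nodelist {
    node {
        ring0_addr: 192.0.2.1
        ring1_addr: 198.51.100.1
    }
    node {
        ring0_addr: 192.0.2.2
        # ring1_addr: 203.0.113.9
        ring1_addr: 198.51.100.2
    }
}
"""

print(ring_addresses(SAMPLE_CONF))
```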

---

Two-node cluster

---

Test #1:
--
IP for ring0 is not defined in the system:
--
Start Corosync simultaneously on both nodes.
Corosync fails to start.
From the logs:
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
config: No interfaces defined
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
Engine exiting with status 8 at main.c:1343.
Result: Corosync and Pacemaker are not running.

Test #2:
--
IP for ring1 is not defined in the system:
--
Start Corosync simultaneously on both nodes.
Corosync starts.
Start Pacemaker simultaneously on both nodes.
Pacemaker fails to start.
From the logs, the last entries written by corosync:
Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
interface 169.254.1.3 FAULTY
Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
recovered ring 0
Result: Corosync and Pacemaker are not running.
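
The ring state changes in log excerpts like the one above can be pulled out with a short script. This is a sketch only: the function name `ring_events` is invented, and the pattern is derived from the two TOTEM messages quoted in this message, so other corosync versions may word them differently.

```python
import re

# Built from the two message shapes quoted above.
TOTEM_RING_EVENT = re.compile(
    r"\[TOTEM\s*\]\s+(?:"
    r"Marking ringid (?P<faulty_ring>\d+) interface (?P<addr>\S+) FAULTY"
    r"|Automatically recovered ring (?P<recovered_ring>\d+))"
)


def ring_events(log_text):
    """Return (event, ring, address) tuples in the order they appear."""
    # The mail client wrapped long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    events = []
    for match in TOTEM_RING_EVENT.finditer(flat):
        if match.group("faulty_ring") is not None:
            events.append(("faulty", int(match.group("faulty_ring")), match.group("addr")))
        else:
            events.append(("recovered", int(match.group("recovered_ring")), None))
    return events


SAMPLE_LOG = """Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
interface 169.254.1.3 FAULTY
Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
recovered ring 0"""

for event in ring_events(SAMPLE_LOG):
    print(event)
```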


Test #3:

rrp_mode: active leads to the same result, except that the Corosync and
Pacemaker init scripts report the status as running.
But /var/log/cluster/corosync.log still shows a lot of errors like:
Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
to the CPG API failed: Library error (2)

Result: Corosync and Pacemaker show their statuses as running, but
crm_mon cannot connect to the cluster database, and half of
Pacemaker's services are not running (including the Cluster Information
Base (CIB)).


---

For a single node mode

---

IP for ring0 is not defined in the system:

Corosync fails to start.

IP for ring1 is not defined in the system:

Corosync and Pacemaker are started.

It is possible that configuration will be applied successfully (50%),

and it is possible that the cluster is not running any resources,

and it is possible that the node cannot be put in a standby mode (shows:
communication error),

and it is possible that the cluster is running all resources, but applied
configuration is not guaranteed to be fully loaded (some rules can be
missed).


---

Conclusions:

---

It is possible that in some rare cases (see comments to the bug) the
cluster will work, but in that case its working state is unstable and the
cluster can stop working at any moment.


So, is this correct? Do my assumptions make any sense? I didn't find any
other explanation on the net ... .



Thank you,
Kostya

On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
konstantin.ponomare...@gmail.com wrote:

 Hi guys,

 Corosync fails to start if there is no such network interface configured
 in the system.
 Even with rrp_mode: passive the problem is the same when at least one
 network interface is not configured in the system.

 Is this the expected behavior?
 I thought that when you use redundant rings, it is enough to have at least
 one NIC configured in the system. Am I wrong?

 Thank you,
 Kostya



corosync.conf
Description: Binary data


[Pacemaker] Corosync fails to start when NIC is absent

2015-01-09 Thread Kostiantyn Ponomarenko
Hi guys,

Corosync fails to start if there is no such network interface configured in
the system.
Even with rrp_mode: passive the problem is the same when at least one
network interface is not configured in the system.

Is this the expected behavior?
I thought that when you use redundant rings, it is enough to have at least
one NIC configured in the system. Am I wrong?

Thank you,
Kostya