Re: [Pacemaker] Corosync fails to start when NIC is absent
Kostiantyn,

> One more thing to clarify. You said rebind can be avoided - what does it
> mean?

By that I mean that as long as you don't shut down the interface, everything will work as expected. Shutting down an interface is an administrator's decision; the system doesn't do it automagically :)

Regards,
Honza
Re: [Pacemaker] Corosync fails to start when NIC is absent
On Tue, Jan 20, 2015 at 11:50 AM, Jan Friesse jfrie...@redhat.com wrote:

> Kostiantyn,
>
>> One more thing to clarify. You said rebind can be avoided - what does it
>> mean?
>
> By that I mean that as long as you don't shutdown interface everything
> will work as expected. Interface shutdown is administrator decision,
> system doesn't do it automagically :)

What is possible is that, e.g., an interface (hardware) fails during a reboot and is not detected. This would lead to a complete outage of the node, which could be avoided.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Corosync fails to start when NIC is absent
Got it. Thank you =)
I was just thinking about the possibility of a NIC burning out.

Thank you,
Kostya

On Tue, Jan 20, 2015 at 10:50 AM, Jan Friesse jfrie...@redhat.com wrote:

> By that I mean that as long as you don't shutdown interface everything
> will work as expected. Interface shutdown is administrator decision,
> system doesn't do it automagically :)
Re: [Pacemaker] Corosync fails to start when NIC is absent
One more thing to clarify.
You said rebind can be avoided - what does it mean?

Thank you,
Kostya

On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote:

> Thank you. Now I am aware of it.
Re: [Pacemaker] Corosync fails to start when NIC is absent
Kostiantyn,

> Honza,
>
> Thank you for helping me.
> So, there is no defined behavior in case one of the interfaces is not in
> the system?

You are right. There is no defined behavior.

Regards,
Honza
Re: [Pacemaker] Corosync fails to start when NIC is absent
Thank you. Now I am aware of it.

Thank you,
Kostya

On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote:

> You are right. There is no defined behavior.
Re: [Pacemaker] Corosync fails to start when NIC is absent
Kostiantyn,

> Conclusions:
> It is possible that in some rare cases (see comments to the bug) the
> cluster will work, but in that case its working state is unstable and
> the cluster can stop working at any moment.
>
> So, is it correct? Do my assumptions make any sense?

Corosync needs all interfaces during start and runtime. This doesn't mean they must be connected (that would make corosync unusable in case of a physical NIC/switch or cable failure), but they must be up and have the correct IP.

When this is not the case, corosync rebinds to localhost and weird things happen. Removing this rebinding is a long-standing TODO, but there are still more important bugs (especially because the rebind can be avoided).

Regards,
Honza
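The condition described above (every ring address must be configured on the host) can be checked before corosync is started. Below is a minimal sketch in Python, assuming the ring addresses are already known as strings. The names `address_is_local` and `unusable_ring_addresses` are hypothetical and not part of corosync. The check only asks the kernel whether a UDP socket can be bound to the address, which shows that the address is assigned locally; it does not look at link state, and a loopback address would pass as well.

```python
import errno
import socket


def address_is_local(addr):
    """Return True if a UDP socket can be bound to addr on this host."""
    family = socket.AF_INET6 if ":" in addr else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((addr, 0))  # port 0 lets the kernel pick any free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False  # address is not assigned to any local interface
            raise  # anything else (bad address string, permissions) is unexpected
    return True


def unusable_ring_addresses(ring_addrs, is_local=address_is_local):
    """Return the ring addresses that are not configured on this host."""
    return [addr for addr in ring_addrs if not is_local(addr)]
```

An init script wrapper could refuse to start corosync while `unusable_ring_addresses([...])` is non-empty, instead of letting it rebind to localhost. The `is_local` parameter exists only so the list logic can be exercised without touching real sockets.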
Re: [Pacemaker] Corosync fails to start when NIC is absent
Honza,

Thank you for helping me.
So, there is no defined behavior in case one of the interfaces is not in the system?

Thank you,
Kostya

On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote:

> Corosync needs all interfaces during start and runtime. This doesn't
> mean they must be connected (this would make corosync unusable for
> physical NIC/Switch or cable failure), but they must be up and have
> correct ip.
>
> When this is not the case, corosync rebinds to localhost and weird
> things happens. Removal of this rebinding is long time TODO, but there
> are still more important bugs (especially because rebind can be
> avoided).
Re: [Pacemaker] Corosync fails to start when NIC is absent
According to https://access.redhat.com/solutions/638843, the interface that is defined in corosync.conf must be present in the system (see the ROOT CAUSE section at the bottom of the article).
To confirm that, I ran a couple of tests.

Here is part of the corosync.conf file (in free-form notation; the original config file is attached):
===
rrp_mode: passive
ring0_addr is defined in corosync.conf
ring1_addr is defined in corosync.conf
===

--- Two-node cluster ---

Test #1:
-- IP for ring0 is not defined in the system: --
Start Corosync simultaneously on both nodes.
Corosync fails to start.
From the logs:
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined
Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343.
Result: Corosync and Pacemaker are not running.

Test #2:
-- IP for ring1 is not defined in the system: --
Start Corosync simultaneously on both nodes.
Corosync starts.
Start Pacemaker simultaneously on both nodes.
Pacemaker fails to start.
From the logs, the last lines written by corosync:
Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY
Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically recovered ring 0
Result: Corosync and Pacemaker are not running.

Test #3:
rrp_mode: active leads to the same result, except that the Corosync and Pacemaker init scripts return the status "running".
But /var/log/cluster/corosync.log still shows a lot of errors like:
Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Result: Corosync and Pacemaker report their statuses as running, but crm_mon cannot connect to the cluster database, and half of Pacemaker's services are not running (including the Cluster Information Base (CIB)).

--- Single-node mode ---

IP for ring0 is not defined in the system:
Corosync fails to start.

IP for ring1 is not defined in the system:
Corosync and Pacemaker are started.
It is possible that the configuration will be applied successfully (50%),
and it is possible that the cluster is not running any resources,
and it is possible that the node cannot be put into standby mode (shows: communication error),
and it is possible that the cluster is running all resources, but the applied configuration is not guaranteed to be fully loaded (some rules can be missing).

--- Conclusions: ---

It is possible that in some rare cases (see comments to the bug) the cluster will work, but in that case its working state is unstable and the cluster can stop working at any moment.

So, is it correct? Do my assumptions make any sense? I didn't find any other explanation on the net.

Thank you,
Kostya

On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko konstantin.ponomare...@gmail.com wrote:

> Hi guys,
>
> Corosync fails to start if there is no such network interface configured
> in the system.
> Even with rrp_mode: passive the problem is the same when at least one
> network interface is not configured in the system.
>
> Is this the expected behavior?
> I thought that when you use redundant rings, it is enough to have at
> least one NIC configured in the system. Am I wrong?

Attachment: corosync.conf (binary data)
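The three failure signatures quoted in the tests above can be picked out of a log mechanically. Below is a minimal sketch in Python. The pattern names and the `classify` helper are made up for illustration, and the regular expressions are based only on the log lines shown in this message, so other corosync or Pacemaker versions may word these messages differently.

```python
import re

# Signatures taken from the log excerpts in the tests; the keys are
# hypothetical labels, not names used by corosync or Pacemaker.
PATTERNS = {
    "no_interfaces": re.compile(r"parse error in config: No interfaces defined"),
    "ring_faulty": re.compile(
        r"Marking ringid (?P<ring>\d+) interface (?P<addr>\S+) FAULTY"
    ),
    "cpg_failed": re.compile(r"Connection to the CPG API failed"),
}


def classify(line):
    """Return (label, captured fields) for a known signature, else None."""
    for label, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return label, match.groupdict()
    return None


def scan(lines):
    """Collect the classified hits from an iterable of log lines."""
    return [hit for hit in map(classify, lines) if hit is not None]
```

Feeding the lines of /var/log/cluster/corosync.log to `scan` gives a quick view of which of the three cases a node has hit. It says nothing about failures that log different messages.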
[Pacemaker] Corosync fails to start when NIC is absent
Hi guys,

Corosync fails to start if a network interface named in its configuration is not configured in the system.
Even with rrp_mode: passive, the problem is the same when at least one of the network interfaces is not configured in the system.

Is this the expected behavior?
I thought that when you use redundant rings, it is enough to have at least one NIC configured in the system. Am I wrong?

Thank you,
Kostya