Hi,
We ran into some problems when we bring down the Ethernet interface using
`ifconfig eth0 down`.
Our cluster has the following configuration and resources (a rough crm
sketch follows the list):
* Two network interfaces: eth0 and lo (loopback)
* 3 nodes, with one node put in maintenance mode
* no-quorum-policy=stop
* stonith-enabled=false
* PostgreSQL master/slave resource
* VIP master and VIP replication IPs
* The VIPs run on the node where the PostgreSQL master is running
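For reference, a minimal sketch of how a setup like this might be expressed
with the crm shell. The resource names (vip-master, msPostgresql) and the IP
address are hypothetical placeholders; our actual configuration differs:

    # cluster-wide properties from the list above
    crm configure property no-quorum-policy=stop
    crm configure property stonith-enabled=false
    # hypothetical VIP, colocated with the PostgreSQL master role
    crm configure primitive vip-master ocf:heartbeat:IPaddr2 \
        params ip=192.168.0.100 nic=eth0 \
        op monitor interval=10s
    crm configure colocation vip-with-master inf: vip-master msPostgresql:Master
    crm configure order promote-then-vip inf: msPostgresql:promote vip-master:start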
Two test cases that we executed are as follows:
* Introduce delay on the Ethernet interface of the PostgreSQL PRIMARY node
(command: `tc qdisc add dev eth0 root netem delay 8000ms`)
* `ifconfig eth0 down` on the PostgreSQL PRIMARY node
* We expected both test cases to simulate network problems in the cluster
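For anyone who wants to reproduce this, the fault injection for each test and
its cleanup (the cleanup commands are the standard counterparts, assuming no
other qdisc is configured on eth0):

    # Test 1: add 8 seconds of latency on eth0, then remove it again
    tc qdisc add dev eth0 root netem delay 8000ms
    tc qdisc del dev eth0 root
    # Test 2: take the interface down, then bring it back up
    ifconfig eth0 down
    ifconfig eth0 up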
In the first case (Ethernet interface delay):
* The cluster is divided into a “partition WITH quorum” and a “partition
WITHOUT quorum”
* The partition WITHOUT quorum shuts down all its services
* The partition WITH quorum takes over the PostgreSQL PRIMARY and the VIPs
* Everything worked as expected. Wow!
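To confirm which partition had quorum during this test, a one-shot status
check on each node is enough; `crm_mon` prints the quorum state in its header:

    # the header line shows e.g. "Current DC: node1 - partition with quorum"
    crm_mon -1 -Afr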
In the second case (Ethernet interface down):
* We see lots of errors like the following on the node whose eth0 is down:
* Feb 12 14:09:48 corosync [MAIN ] Totem is unable to form a cluster
because of an operating system or network fault. The most common cause of this
message is that the local firewall is configured improperly.
* (the same message repeats every one to two seconds)
* But `crm_mon -Afr` (run from the node whose eth0 is down) always shows
the cluster as fully formed:
* It shows all the nodes as UP
* It shows itself as the one running the PostgreSQL PRIMARY (as was the
case before the Ethernet interface was brought down)
* `crm_mon -Afr` on the OTHER nodes tells a different story:
* They show the node with the downed interface as down
* One of the other two nodes takes over as PostgreSQL PRIMARY
* This leads to a split-brain situation, which was gracefully avoided in the
test case where only delay was introduced on the interface
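One way to cross-check what corosync itself believes during this test
(assuming the corosync-cfgtool utility is installed; the exact output format
varies by corosync version):

    # on the node with eth0 down: show the status of the totem ring(s)
    corosync-cfgtool -s
    # compare with the one-shot pacemaker view from each node
    crm_mon -1 -Afr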
Questions:
* Is this a known issue with Pacemaker when the Ethernet interface is pulled
down?
* Is this an incorrect way of testing the cluster? There is some information
on this in the following thread:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/59738
Regards,
Deba