I have a RHEL 6.7 cman + rgmanager cluster of a design I've built many times before. Oddly, I just hit this error:
====
[root@node2 ~]# /etc/init.d/clvmd start
Starting clvmd: clvmd could not connect to cluster manager
Consult syslog for more information
====

syslog:

====
Sep 24 23:00:30 node2 kernel: dlm: Using SCTP for communications
Sep 24 23:00:30 node2 clvmd: Unable to create DLM lockspace for CLVM: Address already in use
Sep 24 23:00:30 node2 kernel: dlm: Can't bind to port 21064 addr number 1
Sep 24 23:00:30 node2 kernel: dlm: cannot start dlm lowcomms -98
====

There are no iptables rules:

====
[root@node2 ~]# iptables-save
====

And there are no DLM lockspaces, either:

====
[root@node2 ~]# dlm_tool ls
[root@node2 ~]#
====

I tried withdrawing the node from the cluster entirely, then started cman alone and tried to start clvmd; same issue.
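The '-98' from dlm lowcomms should be -EADDRINUSE, which matches the "Address already in use" message, and the kernel says DLM is using SCTP rather than TCP (expected with RRP configured, as I understand it). Plain netstat may not show an SCTP conflict, so here is a rough sketch of what I plan to check next (nothing below has been run yet):

====
# -98 should be -EADDRINUSE; see if anything already holds the DLM port
netstat -anp | grep 21064

# netstat on EL6 doesn't list SCTP sockets, so check /proc directly
# (these files only exist once the sctp module is loaded)
cat /proc/net/sctp/eps
cat /proc/net/sctp/assocs

# confirm whether the sctp module is loaded at all
lsmod | grep sctp
====

Meanwhile, the network itself looks healthy.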
Pinging between the two nodes seems OK:

====
[root@node1 ~]# uname -n
node1.ccrs.bcn
[root@node1 ~]# ping -c 2 node1.ccrs.bcn
PING node1.bcn (10.20.10.1) 56(84) bytes of data.
64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.015 ms
64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.017 ms

--- node1.bcn ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.015/0.016/0.017/0.001 ms
====

====
[root@node2 ~]# uname -n
node2.ccrs.bcn
[root@node2 ~]# ping -c 2 node1.ccrs.bcn
PING node1.bcn (10.20.10.1) 56(84) bytes of data.
64 bytes from node1.bcn (10.20.10.1): icmp_seq=1 ttl=64 time=0.079 ms
64 bytes from node1.bcn (10.20.10.1): icmp_seq=2 ttl=64 time=0.076 ms

--- node1.bcn ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.076/0.077/0.079/0.008 ms
====

I have RRP configured and pings work on the second network, too:

====
[root@node1 ~]# corosync-objctl | grep ring -A 5
totem.interface.ringnumber=0
totem.interface.bindnetaddr=10.20.10.1
totem.interface.mcastaddr=239.192.100.163
totem.interface.mcastport=5405
totem.interface.member.memberaddr=node1.ccrs.bcn
totem.interface.member.memberaddr=node2.ccrs.bcn
totem.interface.ringnumber=1
totem.interface.bindnetaddr=10.10.10.1
totem.interface.mcastaddr=239.192.100.164
totem.interface.mcastport=5405
totem.interface.member.memberaddr=node1.sn
totem.interface.member.memberaddr=node2.sn
[root@node1 ~]# ping -c 2 node2.sn
PING node2.sn (10.10.10.2) 56(84) bytes of data.
64 bytes from node2.sn (10.10.10.2): icmp_seq=1 ttl=64 time=0.111 ms
64 bytes from node2.sn (10.10.10.2): icmp_seq=2 ttl=64 time=0.120 ms

--- node2.sn ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.111/0.115/0.120/0.011 ms
====

====
[root@node2 ~]# ping -c 2 node1.sn
PING node1.sn (10.10.10.1) 56(84) bytes of data.
64 bytes from node1.sn (10.10.10.1): icmp_seq=1 ttl=64 time=0.079 ms
64 bytes from node1.sn (10.10.10.1): icmp_seq=2 ttl=64 time=0.171 ms

--- node1.sn ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.079/0.125/0.171/0.046 ms
====

Here is the cluster.conf:

====
[root@node1 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="ccrs" config_version="1">
  <cman expected_votes="1" two_node="1" transport="udpu" />
  <clusternodes>
    <clusternode name="node1.ccrs.bcn" nodeid="1">
      <altname name="node1.sn" />
      <fence>
        <method name="ipmi">
          <device name="ipmi_n01" ipaddr="10.250.199.15" login="admin" passwd="secret" delay="15" action="reboot" />
        </method>
        <method name="pdu">
          <device name="pdu01" port="1" action="reboot" />
          <device name="pdu02" port="1" action="reboot" />
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2.ccrs.bcn" nodeid="2">
      <altname name="node2.sn" />
      <fence>
        <method name="ipmi">
          <device name="ipmi_n02" ipaddr="10.250.199.17" login="admin" passwd="secret" action="reboot" />
        </method>
        <method name="pdu">
          <device name="pdu01" port="2" action="reboot" />
          <device name="pdu02" port="2" action="reboot" />
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="ipmi_n01" agent="fence_ipmilan" />
    <fencedevice name="ipmi_n02" agent="fence_ipmilan" />
    <fencedevice name="pdu01" agent="fence_raritan_snmp" ipaddr="pdu1A" />
    <fencedevice name="pdu02" agent="fence_raritan_snmp" ipaddr="pdu1B" />
    <fencedevice name="pdu03" agent="fence_raritan_snmp" ipaddr="pdu2A" />
    <fencedevice name="pdu04" agent="fence_raritan_snmp" ipaddr="pdu2B" />
  </fencedevices>
  <fence_daemon post_join_delay="30" />
  <totem rrp_mode="passive" secauth="off"/>
  <rm log_level="5">
    <resources>
      <script file="/etc/init.d/drbd" name="drbd"/>
      <script file="/etc/init.d/wait-for-drbd" name="wait-for-drbd"/>
      <script file="/etc/init.d/clvmd" name="clvmd"/>
      <clusterfs device="/dev/node1_vg0/shared" force_unmount="1" fstype="gfs2" mountpoint="/shared" name="sharedfs" />
      <script file="/etc/init.d/libvirtd" name="libvirtd"/>
    </resources>
    <failoverdomains>
      <failoverdomain name="only_n01" nofailback="1" ordered="0" restricted="1">
        <failoverdomainnode name="node1.ccrs.bcn"/>
      </failoverdomain>
      <failoverdomain name="only_n02" nofailback="1" ordered="0" restricted="1">
        <failoverdomainnode name="node2.ccrs.bcn"/>
      </failoverdomain>
      <failoverdomain name="primary_n01" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="node1.ccrs.bcn" priority="1"/>
        <failoverdomainnode name="node2.ccrs.bcn" priority="2"/>
      </failoverdomain>
      <failoverdomain name="primary_n02" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="node1.ccrs.bcn" priority="2"/>
        <failoverdomainnode name="node2.ccrs.bcn" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <service name="storage_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
      <script ref="drbd">
        <script ref="wait-for-drbd">
          <script ref="clvmd">
            <clusterfs ref="sharedfs"/>
          </script>
        </script>
      </script>
    </service>
    <service name="storage_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
      <script ref="drbd">
        <script ref="wait-for-drbd">
          <script ref="clvmd">
            <clusterfs ref="sharedfs"/>
          </script>
        </script>
      </script>
    </service>
    <service name="libvirtd_n01" autostart="1" domain="only_n01" exclusive="0" recovery="restart">
      <script ref="libvirtd"/>
    </service>
    <service name="libvirtd_n02" autostart="1" domain="only_n02" exclusive="0" recovery="restart">
      <script ref="libvirtd"/>
    </service>
  </rm>
</cluster>
====
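One thing I notice while pasting this: there is no <dlm .../> element, so dlm_controld must be auto-detecting the protocol, and with RRP configured that apparently means SCTP (matching the "Using SCTP for communications" line in syslog). If I'm reading the schema right, forcing DLM back to TCP would be a one-line addition directly under <cluster> (untested here, and presumably at the cost of DLM using only one ring):

====
<dlm protocol="tcp"/>
====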
Otherwise, nothing special there at all. While writing this email, though, I saw this on the other node:

====
Sep 24 23:03:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 14e
Sep 24 23:03:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 14e
Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 158
Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 15a
Sep 24 23:03:49 node1 corosync[4770]: [TOTEM ] Retransmit List: 15a
Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161
Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 161 163
Sep 24 23:03:59 node1 corosync[4770]: [TOTEM ] Retransmit List: 163
Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 177
Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 177
Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 179
Sep 24 23:04:19 node1 corosync[4770]: [TOTEM ] Retransmit List: 179
Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 181
Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 183
Sep 24 23:04:29 node1 corosync[4770]: [TOTEM ] Retransmit List: 183
Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c
Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c
Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18c 18e
Sep 24 23:04:39 node1 corosync[4770]: [TOTEM ] Retransmit List: 18e
Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23c
Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23e
Sep 24 23:07:20 node1 corosync[4770]: [TOTEM ] Retransmit List: 23e
Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 247
Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 247
Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 249
Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 24b
Sep 24 23:07:30 node1 corosync[4770]: [TOTEM ] Retransmit List: 24b
Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 252
Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 252
Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 254
Sep 24 23:07:40 node1 corosync[4770]: [TOTEM ] Retransmit List: 254
====

Certainly *looks* like a network problem, but I can't see what's wrong... Any ideas?

Thanks!

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
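P.S. My next step is to watch the wire while the retransmits happen. Roughly this, assuming omping is installed; the interface name is a placeholder for whatever ring 0 actually sits on:

====
# totem traffic (udpu transport, so unicast UDP on port 5405)
tcpdump -nn -i eth0 udp port 5405

# DLM's SCTP traffic while clvmd tries to start (SCTP is IP protocol 132)
tcpdump -nn -i eth0 'ip proto 132'

# multicast/unicast sanity check between the nodes (run on both)
omping node1.ccrs.bcn node2.ccrs.bcn
====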