Re: [PVE-User] Adding a cluster node breaks whole cluster
> As I told you, we used a dumb Linksys switch for cluster communication and
> it all worked over the weekend and yesterday. And yesterday evening the
> whole cluster went red again.
>
> Yesterday's log: http://pastebin.com/sLpnGgeS
> Pure corosync log: http://pastebin.com/TzyUdYaJ

Maybe a problem with packet fragmentation. Do you use special MTU sizes?
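A quick way to rule fragmentation in or out is to ping between two nodes with the Don't Fragment bit set at full MTU size (a minimal sketch, assuming standard Linux iputils ping; the node address is a placeholder):

    # 1472 bytes payload + 8 bytes ICMP header + 20 bytes IP header = 1500.
    # -M do forbids fragmentation, so this fails visibly if the path MTU is smaller.
    ping -M do -s 1472 -c 3 <other-node-ip>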
Re: [PVE-User] Adding a cluster node breaks whole cluster
The 9000 MTU is used only on the storage VLANs, which run over 10G interfaces. All Proxmox cluster communication uses VLAN15 with MTU 1500.

Rebooted the Linksys switch, no good. Swapped to another dumb switch without VLANs - again no good.

cman restart on one node:

    service cman restart
    Stopping cluster:
       Stopping dlm_controld...                          [  OK  ]
       Stopping fenced...                                [  OK  ]
       Stopping cman...                                  [  OK  ]
       Unloading kernel modules...                       [  OK  ]
       Unmounting configfs...                            [  OK  ]
    Starting cluster:
       Checking if cluster has been disabled at boot...  [  OK  ]
       Checking Network Manager...                       [  OK  ]
       Global setup...                                   [  OK  ]
       Loading kernel modules... ERROR: could not insert 'configfs': Exec format error  [FAILED]

On 14.04.15 11:58, Dietmar Maurer wrote:
> Maybe a problem with packet fragmentation. Do you use special MTU sizes?
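An 'Exec format error' on module load usually means the module on disk was built for a different kernel than the one currently running. A few commands to compare the two (a sketch; note that on many kernels configfs is built in rather than a loadable module, in which case modinfo has nothing to show):

    uname -r                          # kernel actually running
    ls /lib/modules/                  # module trees installed on disk
    modinfo configfs | grep vermagic  # kernel the module was built against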
Re: [PVE-User] Adding a cluster node breaks whole cluster
Hi,

The error is not the LAN but the kernel modules - reboot the node and it's fixed.

Regards

From: pve-user [mailto:pve-user-boun...@pve.proxmox.com] On behalf of Sten Aus
Sent: Tuesday, 14 April 2015 11:02
To: Dietmar Maurer; m...@miras.org
CC: pve-user@pve.proxmox.com
Subject: Re: [PVE-User] Adding a cluster node breaks whole cluster

> Loading kernel modules... ERROR: could not insert 'configfs': Exec format error  [FAILED]
Re: [PVE-User] Adding a cluster node breaks whole cluster
I have recently rebooted 3 nodes, no luck there. As I said, there was a difference between the kernels; now all kernels are at the same version. Right now I cannot reboot a single node, because I cannot migrate the VMs..

On 14.04.15 12:05, Bart Lageweg | Bizway wrote:
> The error is not the LAN but the kernel modules - reboot the node and it's fixed.
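To double-check that every node really runs identical versions, something like this works (a sketch assuming root SSH between cluster nodes, as a PVE cluster normally has; the node names are placeholders):

    for n in node1 node2 node3; do
        echo "== $n =="
        ssh "$n" 'uname -r; pveversion'   # running kernel vs. installed PVE packages
    done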
Re: [PVE-User] Adding a cluster node breaks whole cluster
Okay, back to zero.

As I told you, we used a dumb Linksys switch for cluster communication and it all worked over the weekend and yesterday. And yesterday evening the whole cluster went red again.

Yesterday's log: http://pastebin.com/sLpnGgeS
Pure corosync log: http://pastebin.com/TzyUdYaJ
Bridge errors: http://pastebin.com/3ZgV7i5A

The thing is that I don't have a bridge with this name:

    bridge-id 8000.00:1e:e5:ac:db:fe.800e

Restarting the service does not change anything, as usual.

On 13.04.15 11:16, Sten Aus wrote:
> Token timeout did not make anything better nor worse. :) But now configured
> cluster VLAN directly to dumb Linksys switch (without VLANs), using 1G
> interfaces. And it's working - adding and deleting new and old nodes works
> like a charm.
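That bridge-id is not a name: in STP notation it is the bridge priority (8000) followed by the bridge's MAC address, and the trailing .800e looks like a port id. So one can look up which device owns that MAC (a sketch, assuming bridge-utils and iproute2 are installed):

    brctl show                                # lists local bridges with their ids
    ip -o link | grep -i '00:1e:e5:ac:db:fe'  # find the interface carrying that MAC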
Re: [PVE-User] Adding a cluster node breaks whole cluster
Hi

The token timeout did not make anything better nor worse. :)

But now we have configured the cluster VLAN directly on a dumb Linksys switch (without VLANs), using 1G interfaces. And it's working - adding and deleting new and old nodes works like a charm.

The network department has also been debugging this scenario for weeks now and no misconfiguration has been found. But time is running out, so I guess our solution for now is that all cluster communication goes through this dumb switch and all other networks stay on 10G via the switch mesh (NetExtreme and DELL switches).

Thanks for your time and effort; if we find a solution or a bug/mistake, I will let you know!

All the best
Sten
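For reference, the token timeout experiment mentioned above is done in /etc/pve/cluster.conf via a totem attribute (an illustrative fragment only - the value is in milliseconds and is a guess here, and config_version must be incremented whenever cluster.conf is edited):

    <totem token="10000"/>

A larger token timeout makes corosync more tolerant of slow or lossy networks, at the price of slower failure detection.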
Re: [PVE-User] Adding a cluster node breaks whole cluster
Everything works fine until I try to add a new node. As soon as I do that, the whole GUI breaks (KVM keeps working, luckily) and all hell breaks loose, as they say.

So, we have eliminated network card issues, as this problem occurs with different network cards. We have eliminated switch issues, because all switches work fine prior to this situation AND we have also tried a 10G-1G GBIC module to connect this new node to the 10G switch. Now we have eliminated this Fujitsu hardware entirely, because an HP machine also breaks the cluster.

IGMP snooping is disabled; multicast is working on both sides, tested with ssmping.

clustat shows that all nodes are online. pvecm nodes shows that everything is OK - all nodes have a join time and M in the Sts column. Inc differs, though.

tcpdump shows:

    12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
    12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
    12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
    12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
    12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
    12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
    12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
    12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
    12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown), length 42

Output from the log files:

    Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15
    Apr  9 11:30:27 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2960
    Apr  9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2970
    Apr  9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2980
    Apr  9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2990
    Apr  9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000
    Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010
    Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
    Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
    Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
    Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
    Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
    Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
    Apr  9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3020
    Apr  9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3030
    Apr  9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3040
    Apr  9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3050

I have read that Proxmox tests with 16 working nodes, but there are reports of people running more than 16. So I should still have room to grow? Of course we have had nodes which are no longer in the cluster (deleted), but I assume they don't count. :)

Any ideas where to look next?

Does it help if you restart the pve-cluster service on those nodes:

    # service pve-cluster restart
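Multicast between the exact set of cluster nodes can also be exercised with omping, which sends corosync-style traffic and reports loss per peer (a sketch; run the same command on every node at roughly the same time, and the node names are placeholders):

    omping -c 600 -i 1 -q node1 node2 node3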
Re: [PVE-User] Adding a cluster node breaks whole cluster
Okay, I have now updated all nodes to the same kernel version. Thanks to the bugfix in the latest update, IPv6 now works with the newer kernel as well. So all nodes are on the same versions and packages.

Anyway, now I cannot add any new node. I can't even re-add, as a new node, an old node which was in the cluster before.

So I added the debug flag to cluster.conf and tried to add a new node. Here's the output:

    Apr  9 13:03:06 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 10 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:07 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 20 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:08 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 30 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:09 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 40 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:10 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 50 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:11 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 60 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:12 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 70 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:13 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 80 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:14 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 90 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 100 (dfsm.c:215:dfsm_send_message_full)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retried 100 times (dfsm.c:221:dfsm_send_message_full)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [status] crit: cpg_send_message failed: 6 (dfsm.c:329:dfsm_send_message_sync)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:5, size:528 (server.c:160:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result -2 (server.c:310:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:5, size:528 (server.c:160:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:4, size:421 (server.c:160:s1_msg_process_fn)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_fuse_getattr / (pmxcfs.c:126:cfs_fuse_getattr)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug start (pmxcfs.c:102:find_plug)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: cfs_plug_base_lookup_plug (cfs-plug.c:52:cfs_plug_base_lookup_plug)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug end = 0xd9d280 () (pmxcfs.c:109:find_plug)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_plug_base_getattr (cfs-plug.c:84:cfs_plug_base_getattr)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: leave cfs_plug_base_getattr (cfs-plug.c:103:cfs_plug_base_getattr)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: leave cfs_fuse_getattr / (0) (pmxcfs.c:144:cfs_fuse_getattr)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_fuse_getattr /firewall (pmxcfs.c:126:cfs_fuse_getattr)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug start firewall (pmxcfs.c:102:find_plug)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: cfs_plug_base_lookup_plug firewall (cfs-plug.c:52:cfs_plug_base_lookup_plug)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: cfs_plug_base_lookup_plug name = firewall new path = (null) (cfs-plug.c:59:cfs_plug_base_lookup_plug)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug end firewall = 0xd9d280 (firewall) (pmxcfs.c:109:find_plug)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_plug_base_getattr firewall (cfs-plug.c:84:cfs_plug_base_getattr)
    Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: leave cfs_plug_base_getattr firewall
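For anyone reproducing this: the debug flag mentioned above goes, as far as I can tell, into /etc/pve/cluster.conf through the logging element (an illustrative fragment; remember to increment config_version when editing the file):

    <logging debug="on"/>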
Re: [PVE-User] Adding a cluster node breaks whole cluster
> Tried another thing - removed an existing node from the cluster. Still no
> good when adding new nodes. So that eliminates another factor - it's not
> about a node limit?

no