Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-14 Thread Dietmar Maurer
 As I told you, we used a dumb Linksys switch for cluster communication and
 it all worked over the weekend and yesterday. But yesterday evening the whole
 cluster went red again.
 
 Yesterday's log: http://pastebin.com/sLpnGgeS
 Pure corosync log: http://pastebin.com/TzyUdYaJ
 

Maybe a problem with packet fragmentation. Do you use special MTU sizes?
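
A quick way to test for that is a don't-fragment ping between two nodes (a
sketch; 1472 = 1500 bytes MTU minus 20 bytes IP and 8 bytes ICMP header, and
<other-node> is a placeholder):

# ping -M do -s 1472 <other-node>

If that fails while smaller sizes get through, something on the path drops or
fragments large packets. On a 9000-MTU path the equivalent payload is 8972.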



Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-14 Thread Sten Aus

9000 MTU is used only on storage VLANs, which are on 10G interfaces.

All Proxmox cluster communication uses VLAN15, MTU1500.

Rebooted the Linksys switch, no good. Swapped to another dumb switch without
VLANs - again no good.


cman restart on one node:
*service cman restart*
Stopping cluster:
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... ERROR: could not insert 'configfs': Exec format error
[FAILED]

On 14.04.15 11:58, Dietmar Maurer wrote:

As I told you, we used a dumb Linksys switch for cluster communication and
it all worked over the weekend and yesterday. But yesterday evening the whole
cluster went red again.

Yesterday's log: http://pastebin.com/sLpnGgeS
Pure corosync log: http://pastebin.com/TzyUdYaJ


Maybe a problem with packet fragmentation. Do you use special MTU sizes?







Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-14 Thread Bart Lageweg | Bizway
Hi,

The error is not the LAN but the kernel modules: reboot the node and it's fixed.
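
An 'Exec format error' when loading modules usually means the module binaries
on disk no longer match the running kernel, e.g. after an in-place kernel
upgrade without a reboot. A quick check (a sketch):

# uname -r
# ls /lib/modules/

If the version reported by uname -r has no matching module tree under
/lib/modules, rebooting into the freshly installed kernel is the usual fix.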

Grtz



Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-14 Thread Sten Aus
I have lately rebooted 3 nodes, no luck there. As I said, there was a
difference between kernels; now all nodes are on the same kernel version.


Right now I cannot reboot a single node, because I cannot migrate VMs.
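
(For reference, this is the kind of online migration that is failing at the
moment; the vmid and target node are placeholders:)

# qm migrate 100 <target-node> --online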

On 14.04.15 12:05, Bart Lageweg | Bizway wrote:


Hi,

The error is not the LAN but the kernel modules: reboot the node and it's fixed.

Grtz



Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-14 Thread Sten Aus

Okay, back to zero.

As I told you, we used a dumb Linksys switch for cluster communication and
it all worked over the weekend and yesterday. But yesterday evening the whole
cluster went red again.


Yesterday's log: http://pastebin.com/sLpnGgeS
Pure corosync log: http://pastebin.com/TzyUdYaJ

Bridge errors: http://pastebin.com/3ZgV7i5A
The thing is, I don't have a bridge with this bridge-id:
8000.00:1e:e5:ac:db:fe.800e
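
The bridges that do exist on this node, with their bridge ids, can be listed
with the standard bridge-utils command:

# brctl show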


As usual, restarting the service does not change anything.



Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-13 Thread Sten Aus

Hi

Changing the token timeout did not make anything better or worse. :)
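
(For the record, the token timeout is the totem setting in
/etc/pve/cluster.conf; the value below is only an example, not the one we
used:)

# grep totem /etc/pve/cluster.conf
<totem token="54000"/>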

But I have now wired the cluster VLAN directly to a dumb Linksys switch
(without VLANs), using 1G interfaces. And it's working: adding and deleting
new and old nodes works like a charm.


The network department has also been debugging this scenario for weeks now and
no misconfiguration has been found. But time is running out, so I guess our
solution for now will be that all cluster communication goes through this
dumb switch and all other networks come from 10G via the switch mesh
(NetExtreme and DELL switches).


Thanks for your time and effort. If we find a solution or a bug/mistake,
I will let you know!


All the best
Sten





Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-09 Thread Dietmar Maurer
 Everything works fine until I try to add a new node. As soon as I do
 that, the whole GUI breaks (KVM keeps running, luckily) and all hell
 breaks loose, as the saying goes.
 
 So we have ruled out network card issues, as this problem occurs with
 different network cards. We have ruled out switch issues, because all
 switches work fine prior to this situation AND we have tried to use a
 10G-1G GBIC module to connect this new node to the 10G switch as well.
 Now we have ruled out the Fujitsu hardware entirely, because an HP
 machine also breaks the cluster.
 
 IGMP snooping is disabled, multicast is working on both sides, tested 
 with ssmping.
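 
 For reference, that test works roughly like this (node names are
 placeholders; one side runs the responder, the other probes it):
 
  # ssmpingd                   (on an existing node)
  # asmping 224.0.2.1 node-a   (on the new node, ASM/IGMP test)
  # ssmping node-a             (on the new node, SSM test)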
 
 *clustat* shows that all nodes are online.
 
 *pvecm nodes* shows that everything is OK. All nodes have a join time
 and an M in the Sts column. Inc differs, though.
 
 *tcpdump* shows:
  12:15:57.535798 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
  12:15:57.535831 IP6 101:80a:30b:6e28:cd3:1d7f:2f00:0 > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
  12:15:57.540356 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
  12:15:57.540384 IP6 101:80a:21ee:154d:100:: > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
  12:15:57.580874 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
  12:15:57.580903 IP6 10::40:918f:a47f:0 > ff02::1: HBH ICMP6, multicast listener query max resp delay: 1000 addr: ::, length 24
  12:15:58.349706 IP valitseja.5404 > harija1.5405: UDP, length 107
  12:15:58.349783 IP harija1.5404 > ve-1.5405: UDP, length 617
  12:16:10.980002 ARP, Reply ve-1 is-at 90:e2:ba:3a:6e:d0 (oui Unknown), length 42
 
 Output from log files:
  Apr 09 11:25:26 corosync [QUORUM] Members[14]: 1 2 3 4 5 6 7 8 9 10 11 13 14 15
  Apr  9 11:30:27 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2960
  Apr  9 11:30:28 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2970
  Apr  9 11:30:29 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2980
  Apr  9 11:30:30 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 2990
  Apr  9 11:30:31 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3000
  Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3010
  Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
  Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
  Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
  Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
  Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
  Apr  9 11:30:32 zoperdaja pmxcfs[4273]: [status] crit: cpg_send_message failed: 9
  Apr  9 11:30:33 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3020
  Apr  9 11:30:34 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3030
  Apr  9 11:30:35 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3040
  Apr  9 11:30:36 zoperdaja pmxcfs[4273]: [dcdb] notice: cpg_join retry 3050
 
 I have read that Proxmox tests with 16 working nodes, but there are
 reports that some people run more than 16. So I should still have headroom?
 Of course we have had nodes which are not in the cluster anymore (deleted),
 but I assume that they don't count. :)
 
 Any ideas where to look next?

Does it help if you restart pve-cluster service on those nodes:

# service pve-cluster restart



Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-09 Thread Sten Aus
Okay, I have now updated all nodes to the same kernel version. Thanks to
the bugfix in the latest update, IPv6 now works with the newer kernel as well.
So all nodes are on the same versions and packages.
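
The per-node version comparison can be done with the standard Proxmox command:

# pveversion -v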


Anyway, now I cannot add any new node. I can't even re-add, as a new node,
an old node that was in the cluster before.


So I added the debug flag to cluster.conf and tried to add a new node. Here's
the output:
Apr  9 13:03:06 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 10 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:07 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 20 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:08 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 30 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:09 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 40 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:10 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 50 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:11 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 60 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:12 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 70 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:13 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 80 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:14 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 90 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retry 100 (dfsm.c:215:dfsm_send_message_full)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [status] notice: cpg_send_message retried 100 times (dfsm.c:221:dfsm_send_message_full)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [status] crit: cpg_send_message failed: 6 (dfsm.c:329:dfsm_send_message_sync)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:5, size:528 (server.c:160:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result -2 (server.c:310:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:1, size:16 (server.c:160:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:5, size:528 (server.c:160:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process result 0 (server.c:310:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [ipcs] debug: process msg:4, size:421 (server.c:160:s1_msg_process_fn)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_fuse_getattr / (pmxcfs.c:126:cfs_fuse_getattr)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug start (pmxcfs.c:102:find_plug)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: cfs_plug_base_lookup_plug (cfs-plug.c:52:cfs_plug_base_lookup_plug)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug end = 0xd9d280 () (pmxcfs.c:109:find_plug)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_plug_base_getattr (cfs-plug.c:84:cfs_plug_base_getattr)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: leave cfs_plug_base_getattr (cfs-plug.c:103:cfs_plug_base_getattr)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: leave cfs_fuse_getattr / (0) (pmxcfs.c:144:cfs_fuse_getattr)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_fuse_getattr /firewall (pmxcfs.c:126:cfs_fuse_getattr)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug start firewall (pmxcfs.c:102:find_plug)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: cfs_plug_base_lookup_plug firewall (cfs-plug.c:52:cfs_plug_base_lookup_plug)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: cfs_plug_base_lookup_plug name = firewall new path = (null) (cfs-plug.c:59:cfs_plug_base_lookup_plug)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: find_plug end firewall = 0xd9d280 (firewall) (pmxcfs.c:109:find_plug)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: enter cfs_plug_base_getattr firewall (cfs-plug.c:84:cfs_plug_base_getattr)
Apr  9 13:03:15 rabaja pmxcfs[673283]: [main] debug: leave cfs_plug_base_getattr firewall

Re: [PVE-User] Adding a cluster node breaks whole cluster

2015-04-09 Thread Dietmar Maurer

 Tried another thing: removed an existing node from the cluster. Still no good
 when adding new nodes. So that rules out one more factor: it's not about a limit?

no
