Re: [PVE-User] Cluster disaster

2016-11-22 Thread Dhaussy Alexandre
On 22/11/2016 at 18:48, Michael Rasmussen wrote:
>>> Have you tested your filter rules?
>> Yes, I set this filter at install:
>>
>> global_filter = [ "r|sd[b-z].*|", "r|disk|", "r|dm-.*|",
>> "r|vm.*disk.*|", "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "a|.*|" ]
>>
> Does vgscan and lvscan list
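
For readers wanting to check the effect of such a filter themselves, a minimal test on a node (standard LVM commands; nothing here is specific to this cluster):

    vgscan    # re-scan devices with the filter in place
    lvscan    # LVs sitting on rejected devices should no longer show up
    pvs       # same check for physical volumes behind the filter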

Re: [PVE-User] Cluster disaster

2016-11-22 Thread Michael Rasmussen
On Tue, 22 Nov 2016 18:04:39 + Dhaussy Alexandre wrote:
> On 22/11/2016 at 18:48, Michael Rasmussen wrote:
> > Have you tested your filter rules?
> Yes, I set this filter at install:
>
> global_filter = [ "r|sd[b-z].*|", "r|disk|", "r|dm-.*|",
>

Re: [PVE-User] Cluster disaster

2016-11-22 Thread Dhaussy Alexandre
On 22/11/2016 at 18:48, Michael Rasmussen wrote:
> Have you tested your filter rules?
Yes, I set this filter at install:

global_filter = [ "r|sd[b-z].*|", "r|disk|", "r|dm-.*|",
"r|vm.*disk.*|", "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|", "a|.*|" ]

> > On November 22, 2016 6:12:27 PM
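
For context, this is how that filter would sit in the LVM configuration; the regexes are taken verbatim from the message above, and the surrounding section layout is the standard /etc/lvm/lvm.conf one:

    # /etc/lvm/lvm.conf (devices section) -- sketch
    devices {
        # Reject SAN LUNs, device-mapper nodes, guest disks and zvols so LVM
        # does not scan every LUN at boot; the final "a|.*|" accepts the rest.
        global_filter = [ "r|sd[b-z].*|", "r|disk|", "r|dm-.*|",
                          "r|vm.*disk.*|", "r|/dev/zd.*|", "r|/dev/mapper/pve-.*|",
                          "a|.*|" ]
    }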

Re: [PVE-User] Cluster disaster

2016-11-22 Thread Michael Rasmussen
Have you tested your filter rules?

On November 22, 2016 6:12:27 PM GMT+01:00, Dhaussy Alexandre wrote:
>
> On 22/11/2016 at 17:56, Michael Rasmussen wrote:
>> On Tue, 22 Nov 2016 16:35:08 +
>> Dhaussy Alexandre wrote:
>>
>>> I don't

Re: [PVE-User] Cluster disaster

2016-11-22 Thread Dhaussy Alexandre
On 22/11/2016 at 17:56, Michael Rasmussen wrote:
> On Tue, 22 Nov 2016 16:35:08 +
> Dhaussy Alexandre wrote:
>
>> I don't know how, but I feel that every node I add to the cluster currently
>> slows down the LVM scan a little more... until it ends up interfering with

Re: [PVE-User] Cluster disaster

2016-11-22 Thread Michael Rasmussen
On Tue, 22 Nov 2016 16:35:08 + Dhaussy Alexandre wrote:
>
> I don't know how, but I feel that every node I add to the cluster currently
> slows down the LVM scan a little more... until it ends up interfering with cluster
> services at boot...

Maybe you need to tune

Re: [PVE-User] Cluster disaster

2016-11-22 Thread Dhaussy Alexandre
...sequel to those thrilling adventures... I _still_ have problems with nodes not joining the cluster properly after rebooting... Here's what we did last night:
- Stopped ALL VMs (just to ensure no corruption happens in case of unexpected reboots...)
- Patched qemu from 2.6.1 to 2.6.2 to

Re: [PVE-User] Cluster disaster

2016-11-14 Thread Dhaussy Alexandre
On 14/11/2016 at 12:33, Thomas Lamprecht wrote:
> Hope that helps a bit with understanding. :)
Sure, thank you for clearing things up. :) I wish I had done this before, but I learned a lot in the last few days...

Re: [PVE-User] Cluster disaster

2016-11-14 Thread Dhaussy Alexandre
On 14/11/2016 at 12:34, Dietmar Maurer wrote:
>> What I understand so far is that every state/service change from the LRM
>> must be acknowledged (cluster-wise) by the CRM master.
>> So if a multicast disruption occurs, and I assume the LRM wouldn't be able
>> to talk to the CRM MASTER, then it also couldn't

Re: [PVE-User] Cluster disaster

2016-11-14 Thread Thomas Lamprecht
On 14.11.2016 11:50, Dhaussy Alexandre wrote:
> On 11/11/2016 at 19:43, Dietmar Maurer wrote:
>> On November 11, 2016 at 6:41 PM Dhaussy Alexandre wrote:
>>> you lost quorum, and the watchdog expired - that is how the watchdog
>>> based fencing works.
>> I don't expect to

Re: [PVE-User] Cluster disaster

2016-11-14 Thread Dietmar Maurer
> What I understand so far is that every state/service change from the LRM
> must be acknowledged (cluster-wise) by the CRM master.
> So if a multicast disruption occurs, and I assume the LRM wouldn't be able
> to talk to the CRM MASTER, then it also couldn't reset the watchdog, am I
> right?

Nothing
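
For readers following along on a live node, the components being discussed here can be inspected directly (standard PVE 4.x service names; output varies per cluster):

    systemctl status pve-ha-crm pve-ha-lrm   # the two HA daemons discussed here
    systemctl status watchdog-mux            # the process that feeds the hardware/softdog watchdog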

Re: [PVE-User] Cluster disaster

2016-11-14 Thread Dhaussy Alexandre
On 11/11/2016 at 19:43, Dietmar Maurer wrote:
> On November 11, 2016 at 6:41 PM Dhaussy Alexandre
> wrote:
>>> you lost quorum, and the watchdog expired - that is how the watchdog
>>> based fencing works.
>> I don't expect to lose quorum when _one_ node joins or

Re: [PVE-User] Cluster disaster

2016-11-11 Thread Dietmar Maurer
> On November 11, 2016 at 6:41 PM Dhaussy Alexandre
> wrote:
>
>> you lost quorum, and the watchdog expired - that is how the watchdog
>> based fencing works.
>
> I don't expect to lose quorum when _one_ node joins or leaves the cluster.

This was probably a

Re: [PVE-User] Cluster disaster

2016-11-11 Thread Dhaussy Alexandre
> you lost quorum, and the watchdog expired - that is how the watchdog
> based fencing works.
I don't expect to lose quorum when _one_ node joins or leaves the cluster.

Nov 8 10:38:58 proxmoxt20 pmxcfs[22537]: [status] notice: update cluster info (cluster name pxmcluster, version = 14)
Nov 8
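
Side note for readers: whether the cluster actually held quorum at a given moment can be checked on any node with the standard tools (illustrative; output depends on the cluster):

    pvecm status              # expected/total votes and whether the node is quorate
    corosync-quorumtool -s    # the same view straight from corosync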

Re: [PVE-User] Cluster disaster

2016-11-11 Thread Dhaussy Alexandre
> A long shot. Do you have a hardware watchdog enabled in BIOS?
I didn't modify any BIOS parameters, except power management, so I believe it's enabled.
The hpwdt module (HP iLO watchdog) is not loaded. HP ASR is enabled (10 min timeout). ipmi_watchdog is blacklisted. nmi_watchdog is enabled => I
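
These checks can be reproduced on any node like so (module names as listed above; paths are the standard Linux ones):

    lsmod | grep -Ei 'wdt|watchdog'         # which watchdog drivers are actually loaded
    cat /proc/sys/kernel/nmi_watchdog       # 1 = NMI watchdog enabled
    grep -r ipmi_watchdog /etc/modprobe.d/  # confirm the blacklist entry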

Re: [PVE-User] Cluster disaster

2016-11-11 Thread Dietmar Maurer
> Responding to myself, I find this interesting:
>
> Nov 8 10:39:01 proxmoxt35 corosync[35250]: [TOTEM ] A new membership
> (10.xx.xx.11:684) was formed. Members joined: 13
> Nov 8 10:39:58 proxmoxt35 watchdog-mux[28239]: client watchdog expired -
> disable watchdog updates

you lost quorum,
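
To pull the same membership/watchdog correlation from the logs on other nodes, something like this works on a systemd-based PVE 4.x install (a sketch; the time window is the one from the excerpt above):

    journalctl -u corosync -u watchdog-mux \
        --since "2016-11-08 10:35" --until "2016-11-08 10:45"
    # or on the flat syslog:
    grep -E 'TOTEM|watchdog-mux' /var/log/syslog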

Re: [PVE-User] Cluster disaster

2016-11-11 Thread Michael Rasmussen
A long shot. Do you have a hardware watchdog enabled in BIOS?

On November 11, 2016 4:28:09 PM GMT+01:00, Dhaussy Alexandre wrote:
>> Do you have a hint why there are no messages in the logs when the watchdog
>> actually seems to trigger fencing?
>> Because when a node

Re: [PVE-User] Cluster disaster

2016-11-11 Thread Dhaussy Alexandre
> Do you have a hint why there are no messages in the logs when the watchdog
> actually seems to trigger fencing?
> Because when a node suddenly reboots, I can't be sure if it's the watchdog,
> a hardware bug, a kernel bug or whatever..
Responding to myself, I find this interesting:

Nov 8 10:39:01

Re: [PVE-User] Cluster disaster

2016-11-11 Thread Dhaussy Alexandre
I really hope to find an explanation for all this mess, because I'm not very confident right now.. So far, if I understand all this correctly.. I'm not very fond of how the watchdog behaves with CRM/LRM. To make a comparison with PVE 3 (RedHat cluster), fencing happened on the corosync/cluster

Re: [PVE-User] Cluster disaster

2016-11-09 Thread Dhaussy Alexandre
I had yet another outage... BUT now everything is back online! Yay! So I think I had (at least) two problems:
1 - When installing/upgrading a node: if the node sees all the SAN storage LUNs before install, the Debian partitioner tries to scan all LUNs.. This causes almost all nodes to reboot

Re: [PVE-User] Cluster disaster

2016-11-09 Thread Dhaussy Alexandre
I have done a cleanup of resources with:

    echo "" > /etc/pve/ha/resources.cfg

It seems to have resolved all problems with inconsistent LRM/CRM status in the GUI. A new master has been elected. The manager_status file has been cleaned up. All nodes are idle or active. I am re-starting
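
For anyone retracing this step, the same cleanup with a verification pass afterwards might look like this (standard PVE paths and commands; note that truncating resources.cfg removes every HA resource definition, so it only makes sense if you intend to re-add them all):

    echo "" > /etc/pve/ha/resources.cfg   # drop ALL HA resource definitions (destructive)
    cat /etc/pve/ha/manager_status        # should settle once a new CRM master is elected
    ha-manager status                     # per-service view as resources are re-added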

Re: [PVE-User] Cluster disaster

2016-11-09 Thread Dhaussy Alexandre
Sorry, my old message was too big... Thanks for the input!... I have attached the manager_status files. .old is the original file, and .new is the file I have modified and put in /etc/pve/ha. I know this is bad, but here's what I've done:
- delnode on known NON-working nodes.
- rm -Rf
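
The delnode step presumably maps to the standard cluster command (a sketch; the node name is a placeholder):

    pvecm delnode <nodename>   # remove a dead node from the cluster membership
    pvecm nodes                # verify the remaining membership afterwards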

Re: [PVE-User] Cluster disaster

2016-11-09 Thread Dhaussy Alexandre
Typo: - delnode on known NON-working nodes.

On 09/11/2016 at 17:32, Alexandre DHAUSSY wrote:
> - delnode on known now-working nodes.

Re: [PVE-User] Cluster disaster

2016-11-09 Thread Thomas Lamprecht
Hi,

On 09.11.2016 16:29, Dhaussy Alexandre wrote:
> I tried to remove them from HA in the GUI, but nothing happens. There are some
> services in "error" or "fence" state. Now I tried to remove the non-working
> nodes from the cluster... but I still see those nodes in
> /etc/pve/ha/manager_status.

Can you

Re: [PVE-User] Cluster disaster

2016-11-09 Thread Dhaussy Alexandre
I tried to remove them from HA in the GUI, but nothing happens. There are some services in "error" or "fence" state. Now I tried to remove the non-working nodes from the cluster... but I still see those nodes in /etc/pve/ha/manager_status.

On 09/11/2016 at 16:13, Dietmar Maurer wrote:
>> I wanted
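
To see the mismatch being described here, compare the HA manager's state file with the actual cluster membership (standard commands on any quorate node):

    cat /etc/pve/ha/manager_status   # raw JSON; removed nodes may linger here
    pvecm nodes                      # the real corosync membership to compare against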

[PVE-User] Cluster disaster

2016-11-09 Thread Dhaussy Alexandre
Hello, I have a big problem on my cluster (1500 HA VMs); storage is LVM + SAN (around 70 PVs, 2000 LVs). The problems began when adding a new node to the cluster… All nodes crashed and rebooted (happened yesterday). After some work I managed to get everything back online, but some nodes were down (hardware