Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-14 Thread Eneko Lacunza

Hi Hervé,

Glad to read this :)

Cheers


___
pve-user mailing list
pve-user@pve.proxmox.com

Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-14 Thread Herve Ballans

Hi Mark,

Thanks. Yes, we are investigating with the network engineers.

We upgraded the entire cluster to PVE 6.2 and the cluster is fully 
operational now.


But we do think that something in the network changed and caused 
the problem (switch upgrades?).


So, for example, could enabling or disabling IGMP have an impact on 
corosync (in PVE 6)?


Regards,
Hervé


___
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user


Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-14 Thread Herve Ballans

Hi Eneko,

Thanks again for trying to help me.

Now the problem is solved! We upgraded our entire cluster to PVE 6.2 
and now everything is healthy, including the HA status.
We only upgraded each node and didn't change anything else (I mean in 
terms of configuration files).


Here I'm just stating a fact; I'm not saying that it was the upgrade 
process that solved our problems...


Indeed, we are investigating with the network engineers who manage 
our datacenter's network equipment, to see whether something was 
happening at the moment our cluster crashed.


I will let you know if I have the answer to that mystery...

Cheers,
Hervé


Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-12 Thread Eneko Lacunza

Hi Hervé,

On 11/5/20 at 17:58, Herve Ballans wrote:
> Thanks for your answer. At first I also suspected a network issue, but
> the physical network equipment doesn't seem to show any specific
> problems... Here are more details on the cluster:
>
> 2x10Gb + 2x1Gb interfaces:
>
>  * a 10Gb interface for the Ceph cluster
>  * a 10Gb interface for the main cluster network
>  * the other two 1Gb interfaces are used for two other VLANs for the VMs


Can you post:

 * "pvecm status", to see the cluster network IPs
 * "ip a" for a node
 * "cat /etc/corosync/corosync.conf"

Do all network interfaces go to the same switch?

PVE 6.2 has been released and it supports multiple networks for the 
cluster. I suggest you look at it and configure a second network that 
uses another switch.
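
To see whether a second cluster network is already configured, the nodelist in corosync.conf can be inspected. Below is a minimal Python sketch, assuming each node entry is a flat "node { ... }" block that uses the ring0_addr / ring1_addr keys; the node names and addresses in the sample are hypothetical and not taken from this cluster.

```python
import re

# Hypothetical sample; a real file would be read from /etc/corosync/corosync.conf.
SAMPLE_CONF = """
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.0.0.1
    ring1_addr: 10.0.1.1
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.0.0.2
  }
}
"""


def nodes_missing_second_link(conf_text):
    """Return the names of nodes that define ring0_addr but no ring1_addr."""
    missing = []
    # Match only the innermost "node { ... }" blocks (no nested braces inside).
    for block in re.findall(r"\bnode\s*\{([^{}]*)\}", conf_text):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        if "ring0_addr" in fields and "ring1_addr" not in fields:
            missing.append(fields.get("name", "?"))
    return missing


print(nodes_missing_second_link(SAMPLE_CONF))
```

Any node listed by this check has only one corosync link, so a failure of the single switch behind it would cut that node off from the cluster.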


In the logs you sent, I can see that there are serious cluster problems: 
at 18:38:58 I can see only nodes 1, 3, 4 and 5 in quorum.
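
For reference, the quorum arithmetic behind that observation can be sketched as follows. This assumes one vote per node and a plain majority rule; the function names are illustrative only.

```python
def votes_needed(total_votes):
    """Strict majority: more than half of all expected votes."""
    return total_votes // 2 + 1


def has_quorum(visible_nodes, total_votes=5):
    """True if the set of nodes that can see each other holds a majority."""
    return len(set(visible_nodes)) >= votes_needed(total_votes)


# Five-node cluster: three votes are needed.
print(votes_needed(5))
# Nodes 1, 3, 4 and 5 together are still quorate, but one member is missing.
print(has_quorum([1, 3, 4, 5]))
# A partition of only two nodes would lose quorum.
print(has_quorum([1, 5]))
```

A partition can therefore stay quorate while a member is missing, which matches the unstable membership seen in the logs.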


Also, at 18:39:01 I can see ceph-mon complaining about slow ops and 
failed timeout for osd.5.


I really think there is a network issue. The Ceph and Proxmox clusters 
are completely separate, but they're both having issues.


I'd try changing the network switch; I'd even try a 1G switch, just 
to see whether that makes the Proxmox cluster and Ceph stable. Are the 
10G interfaces heavily loaded?


Cheers
Eneko








--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-11 Thread Mark Adams via pve-user

As Eneko already said, this really sounds like a network problem - if your
hosts lose connectivity to each other they will reboot themselves, and it
sounds like this is what happened to you.

Are you sure there were no changes to your network around the time this
happened? Have you checked that your switch config is still right (maybe it
reset)?

Maybe the switches have bugged out and need a reboot? Check the logs on
them for errors.


Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-11 Thread Herve Ballans

Hi again, (sorry for the spam!).

I just found logs from just before the crash of one of the nodes (time of 
crash: 18:36:36). They may be more useful than the logs sent 
previously (I removed the normal events here).


First, several messages like these (first one at 11:00 am):

May  6 18:33:25 inf-proxmox7 corosync[2648]:   [TOTEM ] Token has not been received in 2212 ms
May  6 18:33:26 inf-proxmox7 corosync[2648]:   [TOTEM ] A processor failed, forming new configuration.


Then:

May  6 18:34:14 inf-proxmox7 corosync[2648]:   [MAIN  ] Completed service synchronization, ready to provide service.
May  6 18:34:14 inf-proxmox7 pvesr[3342642]: error with cfs lock 'file-replication_cfg': got lock request timeout
May  6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
May  6 18:34:14 inf-proxmox7 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May  6 18:34:14 inf-proxmox7 systemd[1]: Failed to start Proxmox VE replication runner.
May  6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice: cpg_send_message retry 30
May  6 18:34:14 inf-proxmox7 pmxcfs[2602]: [status] notice: cpg_send_message retried 30 times


Then again a series of "processor failed" messages (147 in total before 
the crash):


May  6 18:35:03 inf-proxmox7 corosync[2648]:   [TOTEM ] Token has not been received in 2212 ms
May  6 18:35:04 inf-proxmox7 corosync[2648]:   [TOTEM ] A processor failed, forming new configuration.


Then:

May  6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] notice: start cluster connection
May  6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_join failed: 14
May  6 18:35:40 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: can't initialize service
May  6 18:35:40 inf-proxmox7 pve-ha-lrm[5528]: lost lock 'ha_agent_inf-proxmox7_lock - cfs lock update failed - Device or resource busy
May  6 18:35:40 inf-proxmox7 pve-ha-crm[5421]: status change slave => wait_for_quorum
May  6 18:35:41 inf-proxmox7 corosync[2648]:   [TOTEM ] A new membership (1.e60) was formed. Members joined: 1 3 4 5


Then:

May  6 18:35:41 inf-proxmox7 pmxcfs[2602]: [status] notice: node has quorum
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: cpg_send_message retried 1 times
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received sync request (epoch 1/2592/0031)
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received sync request (epoch 1/2592/0032)
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message failed: 9
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [dcdb] crit: cpg_send_message failed: 9
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: received all states
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: all data is up to date
May  6 18:35:42 inf-proxmox7 pmxcfs[2602]: [status] notice: dfsm_deliver_queue: queue length 144


Then:

May  6 18:35:57 inf-proxmox7 corosync[2648]:   [TOTEM ] A new membership (1.e64) was formed. Members left: 3 4
May  6 18:35:57 inf-proxmox7 corosync[2648]:   [TOTEM ] Failed to receive the leave message. failed: 3 4


And finally the crash, after these last logs:

May  6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: status change wait_for_quorum => slave
May  6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Main process exited, code=exited, status=17/n/a
May  6 18:36:36 inf-proxmox7 systemd[1]: pvesr.service: Failed with result 'exit-code'.
May  6 18:36:36 inf-proxmox7 systemd[1]: Failed to start Proxmox VE replication runner.
May  6 18:36:36 inf-proxmox7 pve-ha-crm[5421]: loop take too long (51 seconds)
May  6 18:36:36 inf-proxmox7 systemd[1]: watchdog-mux.service: Succeeded.
May  6 18:36:36 inf-proxmox7 kernel: [1292969.953131] watchdog: watchdog0: watchdog did not stop!
May  6 18:36:36 inf-proxmox7 pvestatd[2894]: status update time (5.201 seconds)
^@^@^@^@^@^@

followed by a binary part...
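
For anyone who wants to count such events over a whole syslog rather than by eye, here is a rough Python sketch. It assumes each log entry sits on a single line (mail wrapping undone) and only matches the three corosync TOTEM messages quoted in this mail; the event names are my own labels.

```python
import re
from collections import Counter

# Patterns for the corosync messages shown in the excerpts above.
PATTERNS = {
    "token_timeout": re.compile(r"\[TOTEM \] Token has not been received in \d+ ms"),
    "processor_failed": re.compile(r"\[TOTEM \] A processor failed, forming new configuration"),
    "membership_change": re.compile(r"\[TOTEM \] A new membership \([0-9a-f.]+\) was formed"),
}

SAMPLE_LINES = [
    "May  6 18:33:25 inf-proxmox7 corosync[2648]:   [TOTEM ] Token has not been received in 2212 ms",
    "May  6 18:33:26 inf-proxmox7 corosync[2648]:   [TOTEM ] A processor failed, forming new configuration.",
    "May  6 18:35:41 inf-proxmox7 corosync[2648]:   [TOTEM ] A new membership (1.e60) was formed. Members joined: 1 3 4 5",
    "May  6 18:35:57 inf-proxmox7 corosync[2648]:   [TOTEM ] A new membership (1.e64) was formed. Members left: 3 4",
]


def count_corosync_events(lines):
    """Count how often each known corosync event appears in the given lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


print(dict(count_corosync_events(SAMPLE_LINES)))
```

Running the same function over the syslog of every node for the same time window should show whether all nodes lost the token at the same moments, which would point at shared network equipment.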

Thank you again,
Hervé


Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-11 Thread Herve Ballans

Hi Eneko,

Thanks for your answer. At first I also suspected a network issue, but 
the physical network equipment doesn't seem to show any specific 
problems... Here are more details on the cluster:


2x10Gb + 2x1Gb interfaces:

 * a 10Gb interface for the Ceph cluster
 * a 10Gb interface for the main cluster network
 * the other two 1Gb interfaces are used for two other VLANs for the VMs





Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-11 Thread Eneko Lacunza

Hi Hervé,

This looks like a network issue. What is the network setup in this cluster? 
What do the syslog entries for corosync and pve-cluster show?


Don't enable HA until you have a stable cluster quorum.

Cheers
Eneko



Re: [PVE-User] critical HA problem on a PVE6 cluster

2020-05-11 Thread Herve Ballans

Hi everybody,

I would like to take the opportunity at the beginning of this new week 
to raise my issue again.


Does anyone have any idea why such a problem occurred, or is this problem 
really something new?


Thanks again,
Hervé

On 07/05/2020 18:28, Herve Ballans wrote:

Hi all,

*Cluster info:*

 * 5 nodes (version PVE 6.1-3 at the time the problem occurred)
 * Ceph rbd storage (Nautilus)
 * In production for many years with no major issues
 * No specific network problems at the time the problem occurred
 * Node clocks are synchronized (configured with the same NTP server)

*Symptoms:*

Suddenly, last night (around 7 PM), all nodes of our cluster seem to 
have rebooted at the same time for no apparent reason (I mean, we 
weren't doing anything on it)!
During the reboot, the services "Corosync Cluster Engine" and "Proxmox VE 
replication runner" failed. After the nodes rebooted, we had to 
start those services manually.


Once rebooted with all PVE services running, some nodes were in HA LRM 
status "old timestamp - dead?" while others were in "active" or 
"wait_for_agent_lock" status...
Nodes switch states regularly, and it loops back and forth as long as 
we don't change the configuration...


At the same time, the pve-ha-crm service got unexpected errors, for 
example: "Configuration file 
'nodes/inf-proxmox6/qemu-server/501.conf' does not exist", even though 
the file exists, but on another node!
Such a message is probably a consequence of the fencing between nodes 
due to the changes of status...


*What we have tried so far to stabilize the situation:*

After several investigations and several operations that failed 
to solve anything (in particular a complete upgrade to the latest PVE 
version, 6.1-11), we finally removed the HA configuration from all the VMs.
Since then, the state seems to have stabilized although, obviously, it 
is not nominal!


Now all the nodes are in HA LRM status "idle" and sometimes switch to 
the "old timestamp - dead?" state, then come back to the "idle" state.

None of them is in the "active" state.
Obviously, quorum status is "no quorum".
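
To follow these LRM states over time, the status text can be parsed with a few lines of Python. This is only a sketch: it assumes status lines of the form "lrm <node> (<state>, <timestamp>)", as printed by "ha-manager status", and the timestamps in the sample are invented.

```python
import re

# One LRM line per node: "lrm <node> (<state>, <timestamp>)" (assumed format).
LRM_LINE = re.compile(r"^lrm (\S+) \((.+), ([^,]+)\)$", re.MULTILINE)

SAMPLE_STATUS = """\
quorum OK
master inf-proxmox6 (active, Thu May  7 18:00:01 2020)
lrm inf-proxmox6 (idle, Thu May  7 18:00:03 2020)
lrm inf-proxmox7 (old timestamp - dead?, Thu May  7 17:52:40 2020)
"""


def lrm_states(status_text):
    """Map each node name to the LRM state reported for it."""
    return {m.group(1): m.group(2) for m in LRM_LINE.finditer(status_text)}


def nodes_not_active(status_text):
    """List nodes whose LRM is in any state other than 'active'."""
    states = lrm_states(status_text)
    return sorted(node for node, state in states.items() if state != "active")


print(lrm_states(SAMPLE_STATUS))
print(nodes_not_active(SAMPLE_STATUS))
```

Logging this output once a minute would show how often nodes flip between "idle" and "old timestamp - dead?".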

Note that as soon as we try to re-enable HA on the VMs, the problem 
occurs again (nodes reboot!) :(


*Question:*

Have you ever experienced such a problem, or do you know a way to 
restore a correct HA configuration in this case?

Note that the nodes are currently on version PVE 6.1-11.

I can post some specific logs if useful.

Thanks in advance for your help,
Hervé
