[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes

2021-05-11 Thread David White via Users
It appears there was a power issue at the datacenter.
One of my routers (I have two of them) and both of my switches have an uptime
that corresponds with the 2nd outage I experienced yesterday.

Both of my switches & both routers have single PSUs.
All of my servers have dual PSUs. None of my servers were affected, and neither
was my other router.

I just contacted the datacenter to inquire about the outage.

But the mystery has been solved!

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Tuesday, May 11, 2021 12:17 AM, Strahil Nikolov via Users wrote:

> ovirtmgmt is using a Linux bridge, and maybe STP kicked in?
> Do you know of any changes done in the network at that time?
> 

> Best Regards,
> Strahil Nikolov
> 

> > On Tue, May 11, 2021 at 2:27, David White via Users wrote:

___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/PSXO7CE74SS4JDC42ITPXKEW7D7CC52C/


[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes

2021-05-10 Thread Strahil Nikolov via Users
ovirtmgmt is using a Linux bridge, and maybe STP kicked in?
Do you know of any changes done in the network at that time?
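
If STP is the suspect, it should be visible on the bridge itself. A quick
sketch of how to check (bridge and port names taken from the logs earlier in
this thread):

# Is STP enabled on the bridge? (0 = disabled, 1 = kernel STP, 2 = user STP)
cat /sys/class/net/ovirtmgmt/bridge/stp_state

# Per-port bridge state (should normally be "forwarding")
bridge link show dev enp4s0f0

# Full bridge detail, including stp_state and forward_delay
ip -d link show ovirtmgmt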
Best Regards,
Strahil Nikolov
 
 
  On Tue, May 11, 2021 at 2:27, David White via Users wrote:   
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/IPQIDKOEH7ISHK6QBYCCSXYXGJY3CSRO/


[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes

2021-05-10 Thread David White via Users
Just a point of clarification: for all of these hosts, one of these interfaces
is connected to my 1Gbps switch, and the other interface is connected to my
10Gbps switch.

For Host 1 specifically,

enp4s0f0 is physically connected to 1 switch.
eno1 is physically connected to another.
But those interfaces are also bridged - and controlled - by oVirt itself.

Is it possible that oVirt took them down for some reason? I don't know what
that reason might be.
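
One way to rule oVirt in or out would be to check whether VDSM reconfigured
the networks at those timestamps. A rough sketch, using the standard oVirt log
locations (the time window matches the outage below):

# Any setupNetworks calls handled by VDSM around the outage?
grep -i setupnetworks /var/log/vdsm/vdsm.log /var/log/vdsm/supervdsm.log

# NetworkManager's carrier/link events in the same window
journalctl -u NetworkManager --since "2021-05-10 07:55" --until "2021-05-10 08:35"

# Current bridge membership and per-port state
bridge link show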

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Monday, May 10, 2021 7:14 PM, David White via Users  wrote:

> I'm not sure what to make of this, but looking at /var/log/messages on all 3
> of the hosts, it appears that the kernel disabled my oVirt networks at the
> exact same time on all 3 hosts.
> This occurred twice this morning, once around 8am and again around 8:30am:
> 

> ovirtmgmt is the storage network.
> Private is the front-end network.
> 

> I actually don't have *any* backup storage domains currently, and no backups
> to speak of, so that wouldn't have been the cause this morning.
> My goal for this week is to install a 4th physical server with some spinning 
> disks, and expose those as an NFS mount point so that I can build a backup 
> domain.
> I also hope to get the 10Gbps network cards installed on the remaining two 
> hosts, to get 10Gbps connectivity up and running between all 3 of the HCI 
> hosts.
> 

> Host 1
> May 10 08:00:23 cha1-storage kernel: tg3 0000:04:00.0 enp4s0f0: Link is down
> May 10 08:00:23 cha1-storage kernel: ovirtmgmt: port 1(enp4s0f0) entered disabled state
> May 10 08:00:23 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is down
> May 10 08:00:24 cha1-storage kernel: Private: port 1(eno1) entered disabled state
> {snip}
> May 10 08:01:10 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
> May 10 08:01:10 cha1-storage kernel: tg3 0000:01:00.0 eno1: Flow control is off for TX and off for RX
> May 10 08:01:10 cha1-storage kernel: tg3 0000:01:00.0 eno1: EEE is disabled
> May 10 08:01:10 cha1-storage kernel: Private: port 1(eno1) entered blocking state
> May 10 08:01:10 cha1-storage kernel: Private: port 1(eno1) entered forwarding state
> May 10 08:01:10 cha1-storage NetworkManager[1805]: <info>  [1620648070.6021] device (eno1): carrier: link connected
> May 10 08:30:01 cha1-storage kernel: tg3 0000:04:00.0 enp4s0f0: Link is down
> May 10 08:30:01 cha1-storage kernel: ovirtmgmt: port 1(enp4s0f0) entered disabled state
> May 10 08:30:01 cha1-storage systemd[1]: Starting system activity accounting tool...
> May 10 08:30:01 cha1-storage systemd[1]: sysstat-collect.service: Succeeded.
> May 10 08:30:01 cha1-storage systemd[1]: Started system activity accounting tool.
> May 10 08:30:01 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is down
> May 10 08:30:02 cha1-storage kernel: Private: port 1(eno1) entered disabled state
> {snip}
> May 10 08:30:47 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
> May 10 08:30:47 cha1-storage kernel: tg3 0000:01:00.0 eno1: Flow control is off for TX and off for RX
> May 10 08:30:47 cha1-storage kernel: tg3 0000:01:00.0 eno1: EEE is disabled
> May 10 08:30:47 cha1-storage kernel: Private: port 1(eno1) entered blocking state
> May 10 08:30:47 cha1-storage kernel: Private: port 1(eno1) entered forwarding state
> May 10 08:30:47 cha1-storage NetworkManager[1805]: <info>  [1620649847.8592] device (eno1): carrier: link connected
> May 10 08:30:47 cha1-storage NetworkManager[1805]: <info>  [1620649847.8602] device (Private): carrier: link connected
> 

> Host 2
> May 10 08:00:23 cha2-storage kernel: ixgbe 0000:01:00.1 eno2: NIC Link is Down
> May 10 08:00:23 cha2-storage kernel: ovirtmgmt: port 1(eno2) entered disabled state
> May 10 08:00:23 cha2-storage kernel: ixgbe 0000:01:00.0 eno1: NIC Link is Down
> May 10 08:00:24 cha2-storage kernel: Private: port 1(eno1) entered disabled state
> {snip}
> May 10 08:01:10 cha2-storage kernel: ixgbe 0000:01:00.0 eno1: NIC Link is Up 1 Gbps, Flow Control: None
> May 10 08:01:10 cha2-storage kernel: Private: port 1(eno1) entered blocking state
> May 10 08:01:10 cha2-storage kernel: Private: port 1(eno1) entered forwarding state
> May 10 08:01:10 cha2-storage NetworkManager[16957]: <info>  [1620648070.1303] device (eno1): carrier: link connected
> {snip}
> May 10 08:30:01 cha2-storage kernel: ixgbe 0000:01:00.1 eno2: NIC Link is Down
> May 10 08:30:01 cha2-storage kernel: ovirtmgmt: port 1(eno2) entered disabled state
> May 10 08:30:01 cha2-storage systemd[1]: Starting system activity accounting tool...
> May 10 08:30:01 cha2-storage systemd[1]: sysstat-collect.service: Succeeded.
> May 10 08:30:01 cha2-storage systemd[1]: Started system activity accounting tool.
> May 10 08:30:01 cha2-storage kernel: ixgbe 0000:01:00.0 eno1: NIC Link is Down
> May 10 08:30:02 cha2-storage kernel: Private: port 1(eno1) entered disabled state

[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes

2021-05-10 Thread David White via Users
I'm not sure what to make of this, but looking at /var/log/messages on all 3 of
the hosts, it appears that the kernel disabled my oVirt networks at the exact
same time on all 3 hosts.
This occurred twice this morning, once around 8am and again around 8:30am:
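
Since the events land in /var/log/messages on every host, one way to line them
up is to pull just the link and bridge-port transitions from each host and
compare timestamps. A sketch:

# Run on each of the 3 hosts, then compare the timestamps side by side
grep -Ei 'link is (down|up)|entered (disabled|forwarding) state' /var/log/messages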

ovirtmgmt is the storage network.
Private is the front-end network.

I actually don't have *any* backup storage domains currently, and no backups to
speak of, so that wouldn't have been the cause this morning.
My goal for this week is to install a 4th physical server with some spinning 
disks, and expose those as an NFS mount point so that I can build a backup 
domain.
I also hope to get the 10Gbps network cards installed on the remaining two 
hosts, to get 10Gbps connectivity up and running between all 3 of the HCI hosts.
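
For what it's worth, oVirt expects an NFS export to be owned by vdsm:kvm
(UID/GID 36:36). A minimal sketch of such an export, assuming the spinning
disks end up mounted at /exports/backup (a hypothetical path):

# On the new NFS server
mkdir -p /exports/backup
chown 36:36 /exports/backup

# /etc/exports entry squashing all access to vdsm:kvm (36:36):
# /exports/backup *(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)

# Re-export and verify
exportfs -ra
exportfs -v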

Host 1
May 10 08:00:23 cha1-storage kernel: tg3 0000:04:00.0 enp4s0f0: Link is down
May 10 08:00:23 cha1-storage kernel: ovirtmgmt: port 1(enp4s0f0) entered disabled state
May 10 08:00:23 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is down
May 10 08:00:24 cha1-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:01:10 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
May 10 08:01:10 cha1-storage kernel: tg3 0000:01:00.0 eno1: Flow control is off for TX and off for RX
May 10 08:01:10 cha1-storage kernel: tg3 0000:01:00.0 eno1: EEE is disabled
May 10 08:01:10 cha1-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:01:10 cha1-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:01:10 cha1-storage NetworkManager[1805]: <info>  [1620648070.6021] device (eno1): carrier: link connected
May 10 08:30:01 cha1-storage kernel: tg3 0000:04:00.0 enp4s0f0: Link is down
May 10 08:30:01 cha1-storage kernel: ovirtmgmt: port 1(enp4s0f0) entered disabled state
May 10 08:30:01 cha1-storage systemd[1]: Starting system activity accounting tool...
May 10 08:30:01 cha1-storage systemd[1]: sysstat-collect.service: Succeeded.
May 10 08:30:01 cha1-storage systemd[1]: Started system activity accounting tool.
May 10 08:30:01 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is down
May 10 08:30:02 cha1-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:30:47 cha1-storage kernel: tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
May 10 08:30:47 cha1-storage kernel: tg3 0000:01:00.0 eno1: Flow control is off for TX and off for RX
May 10 08:30:47 cha1-storage kernel: tg3 0000:01:00.0 eno1: EEE is disabled
May 10 08:30:47 cha1-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:30:47 cha1-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:30:47 cha1-storage NetworkManager[1805]: <info>  [1620649847.8592] device (eno1): carrier: link connected
May 10 08:30:47 cha1-storage NetworkManager[1805]: <info>  [1620649847.8602] device (Private): carrier: link connected

Host 2
May 10 08:00:23 cha2-storage kernel: ixgbe 0000:01:00.1 eno2: NIC Link is Down
May 10 08:00:23 cha2-storage kernel: ovirtmgmt: port 1(eno2) entered disabled state
May 10 08:00:23 cha2-storage kernel: ixgbe 0000:01:00.0 eno1: NIC Link is Down
May 10 08:00:24 cha2-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:01:10 cha2-storage kernel: ixgbe 0000:01:00.0 eno1: NIC Link is Up 1 Gbps, Flow Control: None
May 10 08:01:10 cha2-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:01:10 cha2-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:01:10 cha2-storage NetworkManager[16957]: <info>  [1620648070.1303] device (eno1): carrier: link connected
{snip}
May 10 08:30:01 cha2-storage kernel: ixgbe 0000:01:00.1 eno2: NIC Link is Down
May 10 08:30:01 cha2-storage kernel: ovirtmgmt: port 1(eno2) entered disabled state
May 10 08:30:01 cha2-storage systemd[1]: Starting system activity accounting tool...
May 10 08:30:01 cha2-storage systemd[1]: sysstat-collect.service: Succeeded.
May 10 08:30:01 cha2-storage systemd[1]: Started system activity accounting tool.
May 10 08:30:01 cha2-storage kernel: ixgbe 0000:01:00.0 eno1: NIC Link is Down
May 10 08:30:02 cha2-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:30:47 cha2-storage kernel: ixgbe 0000:01:00.0 eno1: NIC Link is Up 1 Gbps, Flow Control: None
May 10 08:30:47 cha2-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:30:47 cha2-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:30:47 cha2-storage NetworkManager[16957]: <info>  [1620649847.5041] device (eno1): carrier: link connected

Host 3
May 10 08:00:23 cha3-storage kernel: tg3 0000:01:00.0 eno1: Link is down
May 10 08:00:24 cha3-storage journal[2196]: Guest agent is not responding: Guest agent not available for now
May 10 08:00:24 cha3-storage kernel: Private: port 1(eno1) entered disabled state
May 10 08:00:30 cha3-storage journal[2196]: Guest agent is not responding: Guest agent not available for now
{snip}
May 10 08:01:10

[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes

2021-05-10 Thread Strahil Nikolov via Users
The symptoms are similar to a loss of quorum (like in a network
outage/disruption).
Check the gluster logs for any indication of the root cause. As you have only
one gigabit network, consider enabling the cluster.choose-local option, which
will make the FUSE client try to read from the local brick instead of a remote
one.
Theoretically, congestion on the storage network could be the root cause, but
this is usually a symptom and not the real problem. Maybe you had too many
backups running in parallel?
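
As a sketch of both checks ("data" below is a placeholder for the actual
volume name):

# Look for quorum events around the outage window
grep -i quorum /var/log/glusterfs/*.log

# Let the FUSE client prefer its local brick for reads
gluster volume set data cluster.choose-local on

# Confirm the setting took effect
gluster volume get data cluster.choose-local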

Best Regards,
Strahil Nikolov
 
 
  On Mon, May 10, 2021 at 19:13, David White via Users wrote:  
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/BQR6MXRF44ELQNKQJIUKLZUON4HAEHY7/