[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes
It appears there was a power issue at the datacenter.

One of my routers (I have two of them) and both of my switches have an uptime that corresponds with the 2nd outage I experienced yesterday. Both of my switches and both routers have single PSUs. All of my servers have dual PSUs. None of my servers were affected, and neither was my other router.

I just contacted the datacenter to inquire about the outage. But the mystery has been solved!

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Tuesday, May 11, 2021 12:17 AM, Strahil Nikolov via Users wrote:

> ovirtmgmt is using a Linux bridge, and maybe STP kicked in?
> Do you know of any changes done in the network at that time?
>
> Best Regards,
> Strahil Nikolov
>
> On Tue, May 11, 2021 at 2:27, David White via Users wrote:

___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/PSXO7CE74SS4JDC42ITPXKEW7D7CC52C/
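The uptime reasoning in the message above can be sketched as a small calculation: subtract each device's reported uptime from the time it was checked, and see whether the resulting boot time falls near the outage. This is only an illustration. The check time and both uptime values below are hypothetical placeholders, not figures from the thread; only the 08:30:01 timestamp comes from the posted logs.

```python
from datetime import datetime, timedelta


def boot_time(checked_at, uptime):
    """Estimate when a device last booted from its reported uptime."""
    return checked_at - uptime


def rebooted_near(checked_at, uptime, outage_start, tolerance=timedelta(minutes=5)):
    """True if the estimated boot time is within `tolerance` of the outage."""
    return abs(boot_time(checked_at, uptime) - outage_start) <= tolerance


# Start of the second outage, taken from the log timestamps in this thread.
outage = datetime(2021, 5, 10, 8, 30, 1)

# Hypothetical values: when the uptime was read, and what each device reported.
checked = datetime(2021, 5, 11, 9, 0, 0)
switch_uptime = timedelta(hours=24, minutes=29)  # hypothetical single-PSU switch
server_uptime = timedelta(days=40)               # hypothetical dual-PSU server

print(rebooted_near(checked, switch_uptime, outage))  # True
print(rebooted_near(checked, server_uptime, outage))  # False
```

A device whose estimated boot time lands inside the tolerance window most likely lost power during the outage; one with a much older boot time did not restart.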
[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes
ovirtmgmt is using a Linux bridge, and maybe STP kicked in?
Do you know of any changes done in the network at that time?

Best Regards,
Strahil Nikolov

On Tue, May 11, 2021 at 2:27, David White via Users wrote:

List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/IPQIDKOEH7ISHK6QBYCCSXYXGJY3CSRO/
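One way to check the STP question on a host is to read the bridge's `stp_state` attribute from sysfs, where the Linux bridge driver exposes it (0 means STP is off, non-zero means it is on). The following is a minimal sketch, assuming the bridge is named `ovirtmgmt` as in this thread; the `sysfs_root` parameter is only there so the function can be pointed at a test directory.

```python
from pathlib import Path


def bridge_stp_enabled(bridge, sysfs_root="/sys/class/net"):
    """Return True/False for the bridge's STP state, or None if it cannot be read.

    Reads <sysfs_root>/<bridge>/bridge/stp_state, which holds 0 when STP is
    disabled and a non-zero value when kernel or userspace STP is active.
    """
    stp_file = Path(sysfs_root) / bridge / "bridge" / "stp_state"
    try:
        return stp_file.read_text().strip() != "0"
    except OSError:
        # Interface missing, or it is not a Linux bridge.
        return None


# On an oVirt host this reports the state for the management bridge;
# on a machine without that bridge it prints None.
print(bridge_stp_enabled("ovirtmgmt"))
```

Note that in the logs posted in this thread the NIC drivers report "Link is down" before the bridge ports enter the disabled state, which suggests a loss of carrier on the physical link rather than an STP decision.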
[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes
Just a point of clarification: on all of these hosts, one interface is connected to my 1Gbps switch, and the other interface is connected to my 10Gbps switch.

For Host 1 specifically, enp4s0f0 is physically connected to one switch, and eno1 is physically connected to the other. But those interfaces are also bridged, and controlled, by oVirt itself.

Is it possible that oVirt took them down for some reason? I don't know what that reason might be.

Sent with ProtonMail Secure Email.

‐‐‐ Original Message ‐‐‐
On Monday, May 10, 2021 7:14 PM, David White via Users wrote:

> I'm not sure what to make of this, but looking at /var/log/messages on all 3
> of the hosts, it appears that the kernel disabled my oVirt networks at the
> same exact time on all 3 hosts.
> This occurred twice this morning, once around 8am and again around 8:30am:
>
> ovirtmgmt is the storage network.
> Private is the front-end network.
>
> I actually don't have *any* backup storage domains currently, and no backups
> to speak of, so that wouldn't have been a cause from this morning.
> My goal for this week is to install a 4th physical server with some spinning
> disks, and expose those as an NFS mount point so that I can build a backup
> domain.
> I also hope to get the 10Gbps network cards installed on the remaining two
> hosts, to get 10Gbps connectivity up and running between all 3 of the HCI
> hosts.
[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes
I'm not sure what to make of this, but looking at /var/log/messages on all 3 of the hosts, it appears that the kernel disabled my oVirt networks at the same exact time on all 3 hosts. This occurred twice this morning, once around 8am and again around 8:30am:

ovirtmgmt is the storage network.
Private is the front-end network.

I actually don't have *any* backup storage domains currently, and no backups to speak of, so that wouldn't have been a cause from this morning. My goal for this week is to install a 4th physical server with some spinning disks, and expose those as an NFS mount point so that I can build a backup domain. I also hope to get the 10Gbps network cards installed on the remaining two hosts, to get 10Gbps connectivity up and running between all 3 of the HCI hosts.

Host 1
May 10 08:00:23 cha1-storage kernel: tg3 :04:00.0 enp4s0f0: Link is down
May 10 08:00:23 cha1-storage kernel: ovirtmgmt: port 1(enp4s0f0) entered disabled state
May 10 08:00:23 cha1-storage kernel: tg3 :01:00.0 eno1: Link is down
May 10 08:00:24 cha1-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:01:10 cha1-storage kernel: tg3 :01:00.0 eno1: Link is up at 1000 Mbps, full duplex
May 10 08:01:10 cha1-storage kernel: tg3 :01:00.0 eno1: Flow control is off for TX and off for RX
May 10 08:01:10 cha1-storage kernel: tg3 :01:00.0 eno1: EEE is disabled
May 10 08:01:10 cha1-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:01:10 cha1-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:01:10 cha1-storage NetworkManager[1805]: [1620648070.6021] device (eno1): carrier: link connected
May 10 08:30:01 cha1-storage kernel: tg3 :04:00.0 enp4s0f0: Link is down
May 10 08:30:01 cha1-storage kernel: ovirtmgmt: port 1(enp4s0f0) entered disabled state
May 10 08:30:01 cha1-storage systemd[1]: Starting system activity accounting tool...
May 10 08:30:01 cha1-storage systemd[1]: sysstat-collect.service: Succeeded.
May 10 08:30:01 cha1-storage systemd[1]: Started system activity accounting tool.
May 10 08:30:01 cha1-storage kernel: tg3 :01:00.0 eno1: Link is down
May 10 08:30:02 cha1-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:30:47 cha1-storage kernel: tg3 :01:00.0 eno1: Link is up at 1000 Mbps, full duplex
May 10 08:30:47 cha1-storage kernel: tg3 :01:00.0 eno1: Flow control is off for TX and off for RX
May 10 08:30:47 cha1-storage kernel: tg3 :01:00.0 eno1: EEE is disabled
May 10 08:30:47 cha1-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:30:47 cha1-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:30:47 cha1-storage NetworkManager[1805]: [1620649847.8592] device (eno1): carrier: link connected
May 10 08:30:47 cha1-storage NetworkManager[1805]: [1620649847.8602] device (Private): carrier: link connected

Host 2
May 10 08:00:23 cha2-storage kernel: ixgbe :01:00.1 eno2: NIC Link is Down
May 10 08:00:23 cha2-storage kernel: ovirtmgmt: port 1(eno2) entered disabled state
May 10 08:00:23 cha2-storage kernel: ixgbe :01:00.0 eno1: NIC Link is Down
May 10 08:00:24 cha2-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:01:10 cha2-storage kernel: ixgbe :01:00.0 eno1: NIC Link is Up 1 Gbps, Flow Control: None
May 10 08:01:10 cha2-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:01:10 cha2-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:01:10 cha2-storage NetworkManager[16957]: [1620648070.1303] device (eno1): carrier: link connected
{snip}
May 10 08:30:01 cha2-storage kernel: ixgbe :01:00.1 eno2: NIC Link is Down
May 10 08:30:01 cha2-storage kernel: ovirtmgmt: port 1(eno2) entered disabled state
May 10 08:30:01 cha2-storage systemd[1]: Starting system activity accounting tool...
May 10 08:30:01 cha2-storage systemd[1]: sysstat-collect.service: Succeeded.
May 10 08:30:01 cha2-storage systemd[1]: Started system activity accounting tool.
May 10 08:30:01 cha2-storage kernel: ixgbe :01:00.0 eno1: NIC Link is Down
May 10 08:30:02 cha2-storage kernel: Private: port 1(eno1) entered disabled state
{snip}
May 10 08:30:47 cha2-storage kernel: ixgbe :01:00.0 eno1: NIC Link is Up 1 Gbps, Flow Control: None
May 10 08:30:47 cha2-storage kernel: Private: port 1(eno1) entered blocking state
May 10 08:30:47 cha2-storage kernel: Private: port 1(eno1) entered forwarding state
May 10 08:30:47 cha2-storage NetworkManager[16957]: [1620649847.5041] device (eno1): carrier: link connected

Host 3
May 10 08:00:23 cha3-storage kernel: tg3 :01:00.0 eno1: Link is down
May 10 08:00:24 cha3-storage journal[2196]: Guest agent is not responding: Guest agent not available for now
May 10 08:00:24 cha3-storage kernel: Private: port 1(eno1) entered disabled state
May 10 08:00:30 cha3-storage journal[2196]: Guest agent is not responding: Guest agent not available for now
{snip}
May 10 08:01:10
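To compare hosts more easily, the link-down and link-up kernel messages above can be paired up per interface to get the length of each outage. The following is a rough sketch, not a general syslog parser: the regular expression is written only against the tg3 and ixgbe message shapes quoted in this thread, the year is an assumption (syslog timestamps omit it), and the function name `link_outages` is made up for illustration.

```python
import re
from datetime import datetime

# Matches e.g. "... kernel: tg3 :01:00.0 eno1: Link is down" and
# "... kernel: ixgbe :01:00.0 eno1: NIC Link is Up 1 Gbps, ...".
LINK_EVENT = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) kernel: "
    r"\S+ \S+ (?P<iface>\S+): (?:NIC )?Link is (?P<state>up|down)",
    re.IGNORECASE,
)


def link_outages(lines, year=2021):
    """Pair each link-down event with the next link-up on the same interface.

    Returns a list of (host, iface, down_time, seconds_down) tuples.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_EVENT.match(line)
        if not m:
            continue
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["iface"])
        if m["state"].lower() == "down":
            down_since.setdefault(key, when)
        elif key in down_since:
            start = down_since.pop(key)
            outages.append((key[0], key[1], start, (when - start).total_seconds()))
    return outages


# Four lines copied from the logs in this message.
SAMPLE = [
    "May 10 08:00:23 cha1-storage kernel: tg3 :01:00.0 eno1: Link is down",
    "May 10 08:01:10 cha1-storage kernel: tg3 :01:00.0 eno1: Link is up at 1000 Mbps, full duplex",
    "May 10 08:30:01 cha2-storage kernel: ixgbe :01:00.0 eno1: NIC Link is Down",
    "May 10 08:30:47 cha2-storage kernel: ixgbe :01:00.0 eno1: NIC Link is Up 1 Gbps, Flow Control: None",
]

for host, iface, start, secs in link_outages(SAMPLE):
    print(f"{host} {iface} down at {start:%H:%M:%S} for {secs:.0f}s")
# cha1-storage eno1 down at 08:00:23 for 47s
# cha2-storage eno1 down at 08:30:01 for 46s
```

Interfaces on different hosts dropping in the same second and recovering together after a similar interval point to something upstream of the hosts, such as a shared switch, rather than to anything host-local.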
[ovirt-users] Re: Something broke & took down multiple VMs for ~20 minutes
The symptoms are similar to a loss of quorum (as in a network outage or disruption). Check the gluster logs for any indication of the root cause.

As you have only one gigabit network, consider enabling the cluster.choose-local option, which will make the FUSE client try to read from the local brick instead of a remote one.

Theoretically, congestion on the storage network could be the root cause, but this is usually a symptom and not the real problem. Maybe you have too many backups running in parallel?

Best Regards,
Strahil Nikolov

On Mon, May 10, 2021 at 19:13, David White via Users wrote:

List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/BQR6MXRF44ELQNKQJIUKLZUON4HAEHY7/
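For the choose-local suggestion, the option is changed per volume with `gluster volume set` and can be inspected with `gluster volume get`. Below is a minimal sketch that only builds and, by default, prints those two commands. The volume name `vmstore` is a placeholder, since the thread does not name the Gluster volumes, and the helper names are made up for illustration.

```python
import subprocess


def choose_local_commands(volume):
    """Build the gluster CLI invocations for inspecting and enabling the option."""
    return {
        "get": ["gluster", "volume", "get", volume, "cluster.choose-local"],
        "set": ["gluster", "volume", "set", volume, "cluster.choose-local", "on"],
    }


def enable_choose_local(volume, dry_run=True):
    """Enable cluster.choose-local on `volume`.

    With dry_run=True (the default) nothing is executed; the commands are
    returned as strings so they can be reviewed first.
    """
    cmds = choose_local_commands(volume)
    if dry_run:
        return [" ".join(cmd) for cmd in cmds.values()]
    subprocess.run(cmds["set"], check=True)
    shown = subprocess.run(cmds["get"], check=True, capture_output=True, text=True)
    return shown.stdout


# "vmstore" is a hypothetical volume name.
for command in enable_choose_local("vmstore"):
    print(command)
```

Running it with `dry_run=False` needs to happen on a node in the Gluster trusted pool with sufficient privileges; checking the current value with the `get` command first avoids changing a setting that is already on.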