[ovirt-users] Re: Random reboots
As the rest of the cluster didn't have issues (check dmesg on the hypervisors), in 99% of cases it's the network. Check the server NICs, enclosure network devices, switches, and whether a backup was running at the same time. I would start with any firmware upgrades for the server (if there are any).

Best Regards,
Strahil Nikolov

On Thu, Feb 17, 2022 at 18:22, Pablo Olivera wrote:

Hi Nir,

Thank you very much for all the help and information. We will continue to investigate the NFS server side to find what may be causing one of the hosts to lose access to storage. The strange thing is that it happens only on one of the NFS client hosts, and not on all of them at the same time.

Pablo.

On 17/02/2022 at 11:02, Nir Soffer wrote:
> On Thu, Feb 17, 2022 at 11:58 AM Nir Soffer wrote:
>> On Thu, Feb 17, 2022 at 11:20 AM Pablo Olivera wrote:
>>> Hi Nir,
>>>
>>> Thank you very much for your detailed explanations.
>>>
>>> The pid 6398 looks like it's HostedEngine:
>>>
>>> audit/audit.log:type=VIRT_CONTROL msg=audit(1644587639.935:7895): pid=3629
>>> uid=0 auid=4294967295 ses=4294967295
>>> subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm op=start
>>> reason=booted vm="HostedEngine" uuid=37a75c8e-50a2-4abd-a887-8a62a75814cc
>>> vm-pid=6398 exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=?
>>> res=success' UID="root" AUID="unset"
>>>
>>> So, I understand that sanlock has problems with the storage (it loses its
>>> connection with the NFS storage). The watchdog begins to check
>>> connectivity with the VM and, after the established time, issues the
>>> order to reboot the machine.
>>>
>>> I don't know if I can somehow increase these timeouts, or make sanlock
>>> force the reconnection or renewal with the storage, and in this way avoid
>>> host reboots for this reason.
>>
>> You can do one of these:
>> 1. Use lower timeouts on the NFS mount, so NFS requests time out at the
>>    same time the sanlock lease times out.
>> 2. Use a larger sanlock timeout, so the sanlock lease times out when the
>>    NFS server times out.
>> 3. Both 1 and 2
>>
>> The problem is that NFS timeouts are not predictable. In the past we used
>> "timeo=600,retrans=6", which can lead to a 21-minute timeout, but in
>> practice we saw up to a 30-minute timeout.
>>
>> In
>> https://github.com/oVirt/vdsm/commit/672a98bbf3e55d1077669f06c37305185fbdc289
>> we changed this to the recommended setting:
>> "timeo=100,retrans=3"
>>
>> which, according to the docs, should fail in 60 seconds if all retries
>> fail. But in practice we saw up to a 270-second timeout with this setting,
>> which does not play well with sanlock.
>>
>> We assumed that the timeout value should not be less than the sanlock io
>> timeout (10 seconds), but I'm not sure this assumption is correct.
>>
>> You can set a smaller timeout value in the engine storage domain
>> "custom connection parameters":
>> - Retransmissions - mapped to the "retrans" mount option
>> - Timeout (deciseconds) - mapped to the "timeo" mount option
>>
>> For example:
>> Retransmissions: 3
>> Timeout: 5
>
> Correction:
>
> Timeout: 50 (5 seconds, 50 deciseconds)
>
>> Theoretically this will behave like this:
>>
>> 00:00 retry 1 (5 seconds timeout)
>> 00:10 retry 2 (10 seconds timeout)
>> 00:30 retry 3 (15 seconds timeout)
>> 00:45 request fail
>>
>> But based on what we see with the defaults, this is likely to take more
>> time. If it fails before 140 seconds, the VM will be killed and the host
>> will not reboot.
>>
>> The other way is to increase the sanlock timeout in the vdsm
>> configuration. Note that changing the sanlock timeout also requires
>> changing other settings (e.g. spm:watchdog_interval).
>>
>> Add this file on all hosts:
>>
>> $ cat /etc/vdsm/vdsm.conf.d/99-local.conf
>> [spm]
>>
>> # If enabled, monitor the SPM lease status and panic if the lease
>> # status is not expected. The SPM host will lose the SPM role, and
>> # engine will select a new SPM host. (default true)
>> # watchdog_enable = true
>>
>> # Watchdog check interval in seconds. The recommended value is
>> # sanlock:io_timeout * 2. (default 20)
>> watchdog_interval = 40
>>
>> [sanlock]
>>
>> # I/O timeout in seconds. All sanlock timeouts are computed based on
>> # this value. Using a larger timeout will make VMs more resilient to
>> # short storage outages, but will increase VM failover time and the
>> # time to acquire a host id. For more info on sanlock timeouts please
>> # check the sanlock source:
>> # https://pagure.io/sanlock/raw/master/f/src/timeouts.h. If your
>> # storage requires larger timeouts, you can increase the value to 15
>> # or 20 seconds. If you change this you also need to update multipath
>> # no_path_retry. For more info on configuring multipath please check
>> # /etc/multipath.conf. oVirt is tested only with the default value
>> # (10 seconds).
>> io_timeout = 20
>>
>> You can check https://github.com/oVirt/vdsm/blob/master/doc/io-timeouts.md
>> to learn more about sanlock timeouts.
>>
>> Alternatively, you can make a small change in the NFS timeout and a small
>> change in the sanlock timeout to make them work better together.
>>
>> All this is, of course, to handle the case when the NFS server is not
>> accessible, but that is something that should not happen in a healthy
>> cluster. You need to check why the server was not accessible and fix that
>> problem.
>>
>> Nir
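The timeout arithmetic behind the 80- and 140-second figures in this thread can be sketched in a few lines. This is a rough model, not an official tool, based on the formulas in sanlock's timeouts.h referenced above: lease renewal is declared failed after 8 * io_timeout, and the watchdog fires roughly 60 seconds after that; verify the constants against the sanlock source.

```python
# Rough model of sanlock deadlines as a function of io_timeout.
# Assumption: renewal failure at 8 * io_timeout and a 60-second watchdog
# window after lease expiry, per sanlock's src/timeouts.h.

WATCHDOG_FIRE_WINDOW = 60  # seconds between lease expiry and watchdog reboot

def sanlock_deadlines(io_timeout: int) -> dict:
    """Key sanlock deadlines (in seconds) for a given I/O timeout."""
    renewal_fail = 8 * io_timeout  # e.g. "s1 check_our_lease failed 80"
    host_dead = renewal_fail + WATCHDOG_FIRE_WINDOW  # host reboot deadline
    return {"renewal_fail": renewal_fail, "host_dead": host_dead}

for io_timeout in (10, 20):
    print(io_timeout, sanlock_deadlines(io_timeout))
```

With the default io_timeout of 10 seconds this reproduces the 80-second renewal failure and the 140-second window mentioned in the thread; with io_timeout = 20, storage requests would need to fail well before 160 seconds for processes to be killed instead of the host rebooting.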
[ovirt-users] Re: Random reboots
On 2/16/22 23:37, Nir Soffer wrote:

On Wed, Feb 16, 2022 at 9:18 PM Nir Soffer wrote:

On Wed, Feb 16, 2022 at 5:12 PM Nir Soffer wrote:

On Wed, Feb 16, 2022 at 10:10 AM Pablo Olivera wrote:

Hi community,

We're dealing with an issue, as we occasionally have random reboots on any of our hosts. We're using oVirt 4.4.3 in production with about 60 VMs distributed over 5 hosts. We have a virtualized engine and a DRBD storage mounted over NFS. The infrastructure is interconnected by a Cisco 9000 switch.

The last random reboot was yesterday, February 14th, at 03:03 PM (in the log it appears as 15:03 due to our time configuration) on the host 'nodo1'. At the moment of the reboot we detected in the switch log a link-down on the port where the host is connected. I attach the logs of the engine and host 'nodo1' in case you can help us find the cause of these random reboots.

According to messages:

1. Sanlock could not renew the lease for 80 seconds:

Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257 [2017]: s1 check_our_lease failed 80

2. In this case sanlock must terminate the processes holding a lease on that storage - I guess that pid 6398 is vdsm.

Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257 [2017]: s1 kill 6398 sig 15 count 1
Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655258 [2017]: s1 kill 6398 sig 15 count 2

pid 6398 is not vdsm:

Feb 14 15:02:51 nodo1 vdsm[4338]

The fact that we see "sig 15" means sanlock is trying to send SIGTERM. If pid 6398 is a VM (hosted engine vm?) we would expect to see:

[2017]: s1 kill 6398 sig 100 count 1

exactly once - which means run the killpath program registered by libvirt, which will terminate the VM.

I reproduced this issue locally - we never use the killpath program, because we don't configure libvirt on_lockfailure in the domain XML. So we get the default behavior, which is sanlock terminating the VM.

So my guess is that this is not a VM, so the only other option is the hosted engine broker, using a lease on the whiteboard.

...
Feb 14 15:03:36 nodo1 sanlock[2017]: 2022-02-14 15:03:36 1655288 [2017]: s1 kill 6398 sig 15 count 32

3. Terminating pid 6398 stopped here, and we see:

Feb 14 15:03:36 nodo1 wdmd[2033]: test failed rem 19 now 1655288 ping 1655237 close 1655247 renewal 1655177 expire 1655257 client 2017 sanlock_a5c35d19-4c34-4571-ac77-1b10de484426:1

According to David, this means we have 19 more attempts to kill the process holding the lease.

4. So it looks like wdmd rebooted the host.

Feb 14 15:08:09 nodo1 kernel: Linux version 4.18.0-193.28.1.el8_2.x86_64 (mockbu...@kbuilder.bsys.centos.org) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Oct 22 00:20:22 UTC 2020

This is strange, since sanlock should try to kill pid 6398 40 times, and then switch to SIGKILL. The watchdog should not reboot the host before sanlock finishes the attempt to kill the processes. David, do you think this is expected? Do we have any issue in sanlock?

I discussed it with David (sanlock author). What we see here may be truncated logs when a host is rebooted by the watchdog. The last time logs were synced to storage was probably Feb 14 15:03:36. Any message written after that was lost in the host page cache.

It is possible that sanlock will not be able to terminate a process if the process is blocked on inaccessible storage. This seems to be the case here. In the vdsm log we see that storage is indeed inaccessible:

2022-02-14 15:03:03,149+0100 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/newstoragedrbd.andromeda.com:_var_nfsshare_data/a5c35d19-4c34-4571-ac77-1b10de484426/dom_md/metadata' is blocked for 60.00 seconds (check:282)

But we don't see any termination request - so this host is not the SPM. I guess this host was running the hosted engine VM, which uses a storage lease. If you lose access to storage, sanlock will kill the hosted engine VM, so the system can start it elsewhere. If the hosted engine VM is stuck on storage, sanlock cannot kill it and it will reboot the host.

Pablo, can you locate the process with pid 6398? Looking in hosted engine logs and other logs on the system may reveal what this process was. When we find the process, we can check the source to understand why it was not terminating - likely blocked on the inaccessible NFS server.

The process is most likely a VM - I reproduced the exact scenario locally. You can file a vdsm bug for this. The system behaves as designed, but the design is problematic; one VM with a lease stuck on an NFS server can cause the entire host to be rebooted.

With block storage we don't have this issue, since we have exact control over multipath timeouts. Multipath will fail I/O in 80 seconds, after sanlock failed to renew the lease. When I/O fails, the process blocked on storage will be unblocked and will be terminated by the kernel.

I observe this or similar behavior also in my glusterfs HCI cluster (but not on
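The 80-second multipath figure mentioned above follows from no_path_retry multiplied by the path checker's polling_interval. A minimal sketch of a multipath configuration fragment, assuming illustrative values (16 retries at a 5-second polling interval); vdsm ships its own /etc/multipath.conf, which is the authoritative reference for these settings:

```
# Illustrative only: with polling_interval 5 and no_path_retry 16,
# multipath queues I/O for 16 checks * 5 s = 80 s before failing,
# matching the sanlock renewal failure window for io_timeout = 10.
defaults {
    polling_interval 5
    no_path_retry 16
}
```

If sanlock's io_timeout is raised, no_path_retry should be raised proportionally so that multipath still fails I/O around the time the sanlock lease expires.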
[ovirt-users] Re: Random reboots
Hi Nir, Thank you very much for your detailed explanations. The pid 6398 looks like it's HostedEngine: /audit/audit.log:type=VIRT_CONTROL msg=audit(1644587639.935:7895): pid=3629 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:virtd_t:s0-s0:c0.c1023 msg='virt=kvm op=start reason=booted //*vm="HostedEngine"*//uuid=37a75c8e-50a2-4abd-a887-8a62a75814cc //*vm-pid=6398*//exe="/usr/sbin/libvirtd" hostname=? addr=? terminal=? res=success'UID="root" AUID="unset"/ So, I understand that SanLock has problems with the storage (it loses connection with NFS storage). The watchdog begins to check connectivity with the MV and after the established time, the order to reboot the machine. I don't know if I can somehow increase these timeouts, or try to make sanlock force the reconnection or renewal with the storage and in this way try to avoid host reboots for this reason. Pablo. El 16/02/2022 a las 23:37, Nir Soffer escribió: On Wed, Feb 16, 2022 at 9:18 PM Nir Soffer wrote: On Wed, Feb 16, 2022 at 5:12 PM Nir Soffer wrote: On Wed, Feb 16, 2022 at 10:10 AM Pablo Olivera wrote: Hi community, We're dealing with an issue as we occasionally have random reboots on any of our hosts. We're using ovirt 4.4.3 in production with about 60 VM distributed over 5 hosts. We've a virtualized engine and a DRBD storage mounted by NFS. The infrastructure is interconnected by a Cisco 9000 switch. The last random reboot was yesterday February 14th at 03:03 PM (in the log it appears as: 15:03 due to our time configuration) of the host: 'nodo1'. At the moment of the reboot we detected in the log of the switch a link-down in the port where the host is connected. I attach log of the engine and host 'nodo1' in case you can help us to find the cause of these random reboots. According to messages: 1. Sanlock could not renew the lease for 80 seconds: Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257 [2017]: s1 check_our_lease failed 80 2. 
In this case sanlock must terminate the processes holding a lease on the that storage - I guess that pid 6398 is vdsm. Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257 [2017]: s1 kill 6398 sig 15 count 1 Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655258 [2017]: s1 kill 6398 sig 15 count 2 pid 6398 is not vdsm: Feb 14 15:02:51 nodo1 vdsm[4338] The fact that we see "sig 15" means sanlock is trying to send SIGTERM. If pid 6398 is a VM (hosted engine vm?) we would expect to see: [2017]: s1 kill 6398 sig 100 count 1 Exactly once - which means run the killpath program registered by libvirt, which will terminate the vm. I reproduce this issue locally - we never use killpath program, because we don't configure libvirt on_lockfailure in the domain xml. So we get the default behavior, which is sanlock terminating the vm. So my guess is that this is not a VM, so the only other option is hosted engine broker, using a lease on the whiteboard. ... Feb 14 15:03:36 nodo1 sanlock[2017]: 2022-02-14 15:03:36 1655288 [2017]: s1 kill 6398 sig 15 count 32 3. Terminating pid 6398 stopped here, and we see: Feb 14 15:03:36 nodo1 wdmd[2033]: test failed rem 19 now 1655288 ping 1655237 close 1655247 renewal 1655177 expire 1655257 client 2017 sanlock_a5c35d19-4c34-4571-ac77-1b10de484426:1 According to David, this means we have 19 more attempts to kill the process holding the lease. 4. So it looks like wdmd rebooted the host. Feb 14 15:08:09 nodo1 kernel: Linux version 4.18.0-193.28.1.el8_2.x86_64 (mockbu...@kbuilder.bsys.centos.org) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Oct 22 00:20:22 UTC 2020 This is strange, since sanlock should try to kill pid 6398 40 times, and then switch to SIGKILL. The watchdog should not reboot the host before sanlock finish the attempt to kill the processes. David, do you think this is expected? do we have any issue in sanlock? I discussed it with David (sanlock author). 
What we see here may be truncated logs when a host is rebooted by the watchdog. The last time logs were synced to storage was probably Feb 14 15:03:36. Any message written after that was lost in the host page cache. It is possible that sanlock will not be able to terminate a process if the process is blocked on inaccessible storage. This seems to be the case here. In vdsm log we see that storage is indeed inaccessible: 2022-02-14 15:03:03,149+0100 WARN (check/loop) [storage.check] Checker '/rhev/data-center/mnt/newstoragedrbd.andromeda.com:_var_nfsshare_data/a5c35d19-4c34-4571-ac77-1b10de484426/dom_md/metadata' is blocked for 60.00 seconds (check:282) But we don't see any termination request - so this host is not the SPM. I guess this host was running the hosted engine vm, which uses a storage lease. If you lose access to storage, sanlcok will kill the hosted engine vm, so the system can start it elsewhere. If the hosted engine vm is stuck on storage, sanlock cannot kill it and it will reboot the
[ovirt-users] Re: Random reboots
On Wed, Feb 16, 2022 at 9:18 PM Nir Soffer wrote:
> On Wed, Feb 16, 2022 at 5:12 PM Nir Soffer wrote:
> > On Wed, Feb 16, 2022 at 10:10 AM Pablo Olivera wrote:
> > > Hi community,
> > >
> > > We're dealing with an issue as we occasionally have random reboots on
> > > any of our hosts.
> > > We're using oVirt 4.4.3 in production with about 60 VMs distributed
> > > over 5 hosts. We have a virtualized engine and DRBD storage mounted
> > > via NFS. The infrastructure is interconnected by a Cisco 9000 switch.
> > > The last random reboot was yesterday, February 14th at 03:03 PM (in
> > > the log it appears as 15:03 due to our time configuration), on the
> > > host 'nodo1'.
> > > At the moment of the reboot we detected in the switch log a link-down
> > > on the port where the host is connected.
> > > I attach logs of the engine and of host 'nodo1' in case you can help
> > > us find the cause of these random reboots.
> >
> > According to /var/log/messages:
> >
> > 1. Sanlock could not renew the lease for 80 seconds:
> >
> > Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257
> > [2017]: s1 check_our_lease failed 80
> >
> > 2. In this case sanlock must terminate the processes holding a lease
> > on that storage - I guess that pid 6398 is vdsm.
> >
> > Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655257
> > [2017]: s1 kill 6398 sig 15 count 1
> > Feb 14 15:03:06 nodo1 sanlock[2017]: 2022-02-14 15:03:06 1655258
> > [2017]: s1 kill 6398 sig 15 count 2
>
> pid 6398 is not vdsm:
>
> Feb 14 15:02:51 nodo1 vdsm[4338]
>
> The fact that we see "sig 15" means sanlock is trying to send SIGTERM.
> If pid 6398 is a VM (hosted engine vm?) we would expect to see:
>
> [2017]: s1 kill 6398 sig 100 count 1
>
> exactly once - which means run the killpath program registered by
> libvirt, which will terminate the vm.

I reproduced this issue locally - we never use the killpath program,
because we don't configure libvirt on_lockfailure in the domain xml.
So we get the default behavior, which is sanlock terminating the vm.

> So my guess is that this is not a VM, so the only other option is the
> hosted engine broker, using a lease on the whiteboard.
>
> > ...
> > Feb 14 15:03:36 nodo1 sanlock[2017]: 2022-02-14 15:03:36 1655288
> > [2017]: s1 kill 6398 sig 15 count 32
> >
> > 3. Terminating pid 6398 stopped here, and we see:
> >
> > Feb 14 15:03:36 nodo1 wdmd[2033]: test failed rem 19 now 1655288 ping
> > 1655237 close 1655247 renewal 1655177 expire 1655257 client 2017
> > sanlock_a5c35d19-4c34-4571-ac77-1b10de484426:1
>
> According to David, this means we have 19 more attempts to kill the
> process holding the lease.
>
> > 4. So it looks like wdmd rebooted the host.
> >
> > Feb 14 15:08:09 nodo1 kernel: Linux version
> > 4.18.0-193.28.1.el8_2.x86_64 (mockbu...@kbuilder.bsys.centos.org) (gcc
> > version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Oct 22
> > 00:20:22 UTC 2020
> >
> > This is strange, since sanlock should try to kill pid 6398 40 times,
> > and then switch to SIGKILL. The watchdog should not reboot the host
> > before sanlock finishes the attempt to kill the processes.
> >
> > David, do you think this is expected? Do we have any issue in sanlock?
>
> I discussed it with David (the sanlock author). What we see here may be
> truncated logs when a host is rebooted by the watchdog. The last time
> logs were synced to storage was probably Feb 14 15:03:36. Any message
> written after that was lost in the host page cache.
>
> > It is possible that sanlock will not be able to terminate a process if
> > the process is blocked on inaccessible storage. This seems to be the
> > case here.
> >
> > In the vdsm log we see that storage is indeed inaccessible:
> >
> > 2022-02-14 15:03:03,149+0100 WARN (check/loop) [storage.check] Checker
> > '/rhev/data-center/mnt/newstoragedrbd.andromeda.com:_var_nfsshare_data/a5c35d19-4c34-4571-ac77-1b10de484426/dom_md/metadata'
> > is blocked for 60.00 seconds (check:282)
> >
> > But we don't see any termination request - so this host is not the SPM.
> >
> > I guess this host was running the hosted engine vm, which uses a
> > storage lease. If you lose access to storage, sanlock will kill the
> > hosted engine vm, so the system can start it elsewhere. If the hosted
> > engine vm is stuck on storage, sanlock cannot kill it and it will
> > reboot the host.
>
> Pablo, can you locate the process with pid 6398?
>
> Looking in hosted engine logs and other logs on the system may reveal
> what this process was. When we find the process, we can check the source
> to understand why it was not terminating - likely blocked on the
> inaccessible NFS server.

The process is most likely a VM - I reproduced the exact scenario
locally.

You can file a vdsm bug for this. The system behaves as designed, but the
design is problematic; one VM with a lease stuck on the NFS server can
cause the entire host to be rebooted. With block storage we don't have
this problem.
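As a cross-check, the numbers in the wdmd message above are consistent with sanlock's default 10-second io timeout: a lease is considered lost 80 seconds (eight io-timeout intervals) after the last successful renewal, and sanlock then sends SIGTERM once per second. The following is only illustrative arithmetic on the logged values, not sanlock code:

```python
# Check the wdmd log values against sanlock's documented timing:
# "test failed rem 19 now 1655288 ... renewal 1655177 expire 1655257"
IO_TIMEOUT = 10                  # seconds, the sanlock default
LEASE_EXPIRY = 8 * IO_TIMEOUT    # lease is lost 80s after last renewal

renewal = 1655177  # last successful lease renewal (host timestamp)
expire = 1655257   # moment the lease expired
now = 1655288      # moment of the wdmd "test failed" message

# Matches "s1 check_our_lease failed 80" in the sanlock log:
assert expire - renewal == LEASE_EXPIRY == 80

# sanlock started SIGTERM at "expire" and sent one signal per second,
# so by "now" it reached attempt 32, matching "sig 15 count 32":
sigterm_attempts = now - expire + 1
print(sigterm_attempts)  # → 32
```

So the host was rebooted while sanlock was still in its SIGTERM phase, which fits the truncated-log explanation above.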
[ovirt-users] Re: Random reboots
Hi,

Thanks for your answer. On the other nodes there are no errors in this
period. In the Cisco log there are only link-down errors due to the
restart of 'nodo1'; there is no error before that. I attach the Cisco
log for this period. We use a bond between nodo1 and the Cisco switch.
The storage is connected to another port of the same switch and is
shared between all the nodes.
Is it possible that there is some bug with NFS in oVirt?

On 16/02/2022 at 11:59, Valkov, Alexey wrote:
> Hello, Pablo.
>
> It looks like nodo1 has lost connection with the storage (sanlock on
> nodo1 can't renew leases), then nodo1 has been reset by the watchdog.
> Are there any errors in the logs on the other nodes in this period
> (15:02 - 15:03)?
> Are there any errors (near 15:02:13) in the Cisco 9000's log (except
> those which appeared after 15:03:36 - when nodo1 rebooted)?
> Do you use a bond on nodo1 for the storage connection?

2022 Feb 14 13:50:04 N9K_Andromeda %ETHPORT-4-IF_SFP_WARNING: Interface Ethernet3/20, Low Rx Power Warning cleared
2022 Feb 14 13:55:05 N9K_Andromeda %ETHPORT-4-IF_SFP_WARNING: Interface Ethernet3/20, Low Rx Power Warning
2022 Feb 14 14:00:06 N9K_Andromeda %ETHPORT-4-IF_SFP_WARNING: Interface Ethernet3/20, Low Rx Power Warning cleared
2022 Feb 14 14:05:07 N9K_Andromeda %ETHPORT-4-IF_SFP_WARNING: Interface Ethernet3/20, Low Rx Power Warning
2022 Feb 14 15:04:59 N9K_Andromeda %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel30: Ethernet3/37 is down
2022 Feb 14 15:04:59 N9K_Andromeda %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet3/37 is down (Initializing)
2022 Feb 14 15:04:59 N9K_Andromeda %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel30 is down (No operational members)
2022 Feb 14 15:05:00 N9K_Andromeda %ETH_PORT_CHANNEL-5-PORT_DOWN: port-channel30: Ethernet1/43 is down
2022 Feb 14 15:05:00 N9K_Andromeda %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel30: first operational port changed from Ethernet1/43 to none
2022 Feb 14 15:05:00 N9K_Andromeda %ETHPORT-5-IF_DOWN_INITIALIZING: Interface Ethernet1/43 is down (Initializing)
2022 Feb 14 15:05:00 N9K_Andromeda %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel30 is down (No operational members)
2022 Feb 14 15:05:09 N9K_Andromeda last message repeated 1 time
2022 Feb 14 15:05:09 N9K_Andromeda %ETH_PORT_CHANNEL-5-PORT_SUSPENDED: Ethernet3/37: Ethernet3/37 is suspended
2022 Feb 14 15:05:10 N9K_Andromeda %ETH_PORT_CHANNEL-5-PORT_SUSPENDED: Ethernet1/43: Ethernet1/43 is suspended
2022 Feb 14 15:05:19 N9K_Andromeda %ETHPORT-4-IF_SFP_WARNING: Interface Ethernet3/20, Low Rx Power Warning cleared
2022 Feb 14 15:05:50 N9K_Andromeda %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet1/43 is down (Link failure)
2022 Feb 14 15:05:50 N9K_Andromeda %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel30 is down (No operational members)
2022 Feb 14 15:05:50 N9K_Andromeda last message repeated 1 time
2022 Feb 14 15:05:50 N9K_Andromeda %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet3/37 is down (Link failure)
2022 Feb 14 15:05:50 N9K_Andromeda %ETHPORT-5-IF_DOWN_PORT_CHANNEL_MEMBERS_DOWN: Interface port-channel30 is down (No operational members)
2022 Feb 14 15:05:51 N9K_Andromeda last message repeated 2 times
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-SPEED: Interface Ethernet3/37, operational speed changed to 10 Gbps
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-IF_DUPLEX: Interface Ethernet3/37, operational duplex mode changed to Full
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet3/37, operational Receive Flow Control state changed to off
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet3/37, operational Transmit Flow Control state changed to off
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-SPEED: Interface port-channel30, operational speed changed to 10 Gbps
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-IF_DUPLEX: Interface port-channel30, operational duplex mode changed to Full
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface port-channel30, operational Receive Flow Control state changed to off
2022 Feb 14 15:05:51 N9K_Andromeda %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface port-channel30, operational Transmit Flow Control state changed to off
2022 Feb 14 15:05:52 N9K_Andromeda %ETHPORT-5-SPEED: Interface Ethernet1/43, operational speed changed to 10 Gbps
2022 Feb 14 15:05:52 N9K_Andromeda %ETHPORT-5-IF_DUPLEX: Interface Ethernet1/43, operational duplex mode changed to Full
2022 Feb 14 15:05:52 N9K_Andromeda %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet1/43, operational Receive Flow Control state changed to off
2022 Feb 14 15:05:52 N9K_Andromeda %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet1/43, operational Transmit Flow Control state changed to off
2022 Feb 14 15:06:01 N9K_Andromeda %ETH_PORT_CHANNEL-5-PORT_SUSPENDED: Ethernet3/37: Ethernet3/37 is suspended
2022 Feb 14 15:06:02 N9K_Andromeda
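To correlate a log like the one above with the sanlock timestamps on the hosts, it can help to filter just the link-down events. A minimal sketch, assuming the `%FACILITY-SEVERITY-MNEMONIC:` NX-OS syslog format shown in the excerpt (the sample entries below are copied from it):

```python
# Pull link-down events out of a Cisco NX-OS log excerpt so they can be
# lined up against sanlock renewal failures on the hosts. Illustrative
# only; assumes the "%FACILITY-SEV-MNEMONIC: message" format above.
import re

LINE = re.compile(
    r"^(?P<ts>\d{4} \w{3} \d+ \d{2}:\d{2}:\d{2}) \S+ "
    r"%(?P<facility>\w+)-(?P<severity>\d)-(?P<mnemonic>\w+): (?P<msg>.*)$"
)

def link_events(lines, mnemonics=("PORT_DOWN", "IF_DOWN_LINK_FAILURE")):
    """Yield (timestamp, message) for entries matching the given mnemonics."""
    for line in lines:
        m = LINE.match(line)
        if m and m.group("mnemonic") in mnemonics:
            yield m.group("ts"), m.group("msg")

log = [
    "2022 Feb 14 15:04:59 N9K_Andromeda %ETH_PORT_CHANNEL-5-PORT_DOWN: "
    "port-channel30: Ethernet3/37 is down",
    "2022 Feb 14 15:05:19 N9K_Andromeda %ETHPORT-4-IF_SFP_WARNING: "
    "Interface Ethernet3/20, Low Rx Power Warning cleared",
    "2022 Feb 14 15:05:50 N9K_Andromeda %ETHPORT-5-IF_DOWN_LINK_FAILURE: "
    "Interface Ethernet1/43 is down (Link failure)",
]
for ts, msg in link_events(log):
    print(ts, msg)
```

Note that in this excerpt the first link-down (15:04:59) comes after the sanlock renewal failures (15:02-15:03), which is what Alexey's question about events near 15:02:13 is probing.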
[ovirt-users] Re: Random reboots
Hello, Pablo.

It looks like nodo1 has lost connection with the storage (sanlock on
nodo1 can't renew leases), then nodo1 has been reset by the watchdog.
Are there any errors in the logs on the other nodes in this period
(15:02 - 15:03)?
Are there any errors (near 15:02:13) in the Cisco 9000's log (except
those which appeared after 15:03:36 - when nodo1 rebooted)?
Do you use a bond on nodo1 for the storage connection?

--
Alexey

-----Original Message-----
From: Pablo Olivera
Sent: Wednesday, February 16, 2022 11:04 AM
To: users@ovirt.org
Subject: [ovirt-users] Random reboots

Hi community,

We're dealing with an issue as we occasionally have random reboots on
any of our hosts.
We're using oVirt 4.4.3 in production with about 60 VMs distributed over
5 hosts. We have a virtualized engine and DRBD storage mounted via NFS.
The infrastructure is interconnected by a Cisco 9000 switch.
The last random reboot was yesterday, February 14th at 03:03 PM (in the
log it appears as 15:03 due to our time configuration), on the host
'nodo1'.
At the moment of the reboot we detected in the switch log a link-down on
the port where the host is connected.
I attach logs of the engine and of host 'nodo1' in case you can help us
find the cause of these random reboots.

Thanks in advance.

Pablo.
_______________________________________________
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/PIAPQ3DMTAK2HTCV2SEQ7VK7JCFFYHSK/
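The failure mode diagnosed in this thread - sanlock killing the lease holder, and the watchdog resetting the host when the kill fails - can be sketched as a toy escalation loop. The attempt count follows the discussion upthread (40 SIGTERMs, then SIGKILL), not the sanlock sources:

```python
# Toy model of sanlock's kill escalation for a process holding a lease
# on inaccessible storage, per this thread: SIGTERM once per second up
# to 40 times, then SIGKILL; a process stuck in uninterruptible I/O on
# a dead NFS mount survives both, and the wdmd watchdog resets the host.
SIGTERM_ATTEMPTS = 40  # per the thread, not taken from sanlock sources

def escalate(dies_on_sigterm: bool, dies_on_sigkill: bool) -> str:
    """Return the outcome of the escalation for a lease-holding process."""
    for attempt in range(1, SIGTERM_ATTEMPTS + 1):
        if dies_on_sigterm:
            return f"terminated by SIGTERM (attempt {attempt})"
    if dies_on_sigkill:
        return "terminated by SIGKILL"
    return "watchdog resets the host"

# A healthy process exits on the first SIGTERM; a VM blocked on an
# unresponsive NFS mount may ignore both signals, as seen on nodo1:
print(escalate(True, True))    # → terminated by SIGTERM (attempt 1)
print(escalate(False, False))  # → watchdog resets the host
```

The last case is the one later messages in the thread identify: one stuck lease holder is enough to take down the whole host.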