[ovirt-users] Host down/activation loop
Hello,

Have an issue that feels sanlock-related, but that I can't get sorted out in our installation. This is 4.2.1, hosted engine. One of our hosts is stuck in a loop. It:

- gets a VDSM GetStatsVDS timeout and is marked as Down,
- throws a warning about not being fenced (because that's not enabled yet, because of this problem),
- and is set back to Up about a minute later.

This repeats every 4 minutes and 20 seconds. The hosted engine is running on the host that is stuck like this, and it doesn't appear to get in the way of creating new VMs or other operations, but obviously I can't use fencing, which is a big part of the point of running oVirt in the first place.

I tried setting global maintenance and running hosted-engine --reinitialize-lockspace, which (a) took almost exactly 2 minutes to run, making me think something timed out, (b) exited with rc 0, and (c) didn't fix the problem.

Anyone have an idea of how to fix this?

-j

- - details - -

I still can't quite figure out how to interpret what sanlock says, but the -1s look like wrongness.

[sc5-ovirt-1]# sanlock client status
daemon bedae69e-03cc-49f8-88f4-9674a85a3185.sc5-ovirt-
p -1 helper
p -1 listener
p 122268 HostedEngine
p -1 status
s 1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd:1:/rhev/data-center/mnt/glusterSD/172.16.0.151\:_sc5-images/1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd/dom_md/ids:0
s b41eb20a-eafb-481b-9a50-a135cf42b15e:1:/rhev/data-center/mnt/glusterSD/sc5-gluster-10g-1\:_sc5-ovirt__engine/b41eb20a-eafb-481b-9a50-a135cf42b15e/dom_md/ids:0
r b41eb20a-eafb-481b-9a50-a135cf42b15e:8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87:/rhev/data-center/mnt/glusterSD/172.16.0.153\:_sc5-ovirt__engine/b41eb20a-eafb-481b-9a50-a135cf42b15e/images/a9d01d59-f146-47e5-b514-d10f8867678e/8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87.lease:0:5 p 122268
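If more sanlock detail would help, I think I can pull it with the stock sanlock CLI; the options below are my best reading of the man page, so treat them as a sketch rather than something I've already run and checked on this host:

[sc5-ovirt-1]# sanlock client status -D          # same view as above, with extra debug fields per entry
[sc5-ovirt-1]# sanlock client host_status -D     # per-lockspace host_id view (who is holding/renewing what)
[sc5-ovirt-1]# tail -n 200 /var/log/sanlock.log  # the daemon's own log, in case the -1s correspond to errors there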
engine.log:

2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM sc5-ovirt-1 command GetStatsVDS failed: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Command 'GetStatsVDSCommand(HostName = sc5-ovirt-1, VdsIdAndVdsVDSCommandParametersBase:{hostId='be3517e0-f79d-464c-8169-f786d13ac287', vds='Host[sc5-ovirt-1,be3517e0-f79d-464c-8169-f786d13ac287]'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failed getting vds stats, host='sc5-ovirt-1'(be3517e0-f79d-464c-8169-f786d13ac287): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failure to refresh host 'sc5-ovirt-1' runtime info: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failed to refresh VDS, network error, continuing, vds='sc5-ovirt-1'(be3517e0-f79d-464c-8169-f786d13ac287): VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-102682) [] Host 'sc5-ovirt-1' is not responding.
2018-03-21 16:09:26,088-07 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-102682) [] EVENT_ID: VDS_HOST_NOT_RESPONDING(9,027), Host sc5-ovirt-1 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
2018-03-21 16:09:27,070-07 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to sc5-ovirt-1/10.181.26.129
2018-03-21 16:09:27,918-07 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [493fb316] START, GlusterServersListVDSCommand(HostName = sc5-gluster-2, VdsIdVDSCommandParametersBase:{hostId='797cbf42-6553-4a75-b8b1-93b2adbbc0db'}), log id: 6afccc01
2018-03-21 16:09:28,579-07 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler4) [493fb316] FINISH, GlusterServersListVDSCommand, return: [192.168.122.1/24:CONNECTED, sc5-gluster-3:CONNECTED, sc5-gluster-10g-1:CONNECTED], log id: 6afccc01
2018-03-21 16:09:28,606-07 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVo
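On the host side, the only checks I know to run while it's flapping are the generic ones below (vdsm-client and the HA services are standard in 4.2 as far as I know; happy to post the output of any of these if it would be useful):

[sc5-ovirt-1]# systemctl status vdsmd ovirt-ha-agent ovirt-ha-broker   # are the daemons up and not restarting?
[sc5-ovirt-1]# vdsm-client Host getStats                               # same call the engine is timing out on, but local
[sc5-ovirt-1]# journalctl -u vdsmd --since "10 min ago"                # anything from vdsmd around the Down/Up flips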
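For reference, what I mean above by "setting global maintenance and running hosted-engine --reinitialize-lockspace" is the usual hosted-engine sequence, roughly:

[sc5-ovirt-1]# hosted-engine --set-maintenance --mode=global
[sc5-ovirt-1]# hosted-engine --reinitialize-lockspace
[sc5-ovirt-1]# hosted-engine --set-maintenance --mode=none

(those are the standard options as I understand them; I may be misremembering the exact flags I used, so don't take them as gospel).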