[ovirt-users] Host down/activation loop

2018-03-21 Thread Jamie Lawrence
Hello,

I have an issue that feels sanlock-related, but I can't get it sorted with our 
installation. This is 4.2.1, hosted engine. One of our hosts is stuck in a 
loop. It:

- gets a VDSM GetStatsVDS timeout and is marked as down,
- throws a warning about not being fenced (because fencing isn't enabled yet, 
precisely because of this problem), and
- is set back to Up about a minute later.

This repeats every 4 minutes and 20 seconds.
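For what it's worth, I've been watching the cycle by tailing engine.log on the 
engine VM for the events that show up in the excerpt further down (assuming the 
standard log location):

[engine]# tail -F /var/log/ovirt-engine/engine.log | grep -E 'VDS_HOST_NOT_RESPONDING|VDS_BROKER_COMMAND_FAILURE'

I can also run 'vdsm-client Host getStats' directly on the affected host to see 
whether VDSM answers locally while the engine is timing out (assuming I have the 
vdsm-client invocation right); if it does, that would seem to point at the 
communication path rather than VDSM itself being hung.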

The hosted engine is running on the host that is stuck in this loop. The issue 
doesn't appear to get in the way of creating new VMs or other operations, but 
obviously I can't enable fencing, which is a big part of the point of running 
oVirt in the first place.

I tried setting global maintenance and running hosted-engine 
--reinitialize-lockspace, which (a) took almost exactly 2 minutes to run, 
making me suspect something timed out, (b) exited with rc 0, and (c) didn't fix 
the problem.
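For reference, the sequence I ran was (approximately; quoting from memory):

[sc5-ovirt-1]# hosted-engine --set-maintenance --mode=global
[sc5-ovirt-1]# hosted-engine --reinitialize-lockspace
[sc5-ovirt-1]# hosted-engine --set-maintenance --mode=none

The reinitialize step is the one that sat for about 2 minutes before returning 0.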

Anyone have an idea of how to fix this?

-j



- - details - -

I still can't quite figure out how to interpret what sanlock says, but the -1s 
look wrong to me.

[sc5-ovirt-1]# sanlock client status
daemon bedae69e-03cc-49f8-88f4-9674a85a3185.sc5-ovirt-
p -1 helper
p -1 listener
p 122268 HostedEngine
p -1 status
s 1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd:1:/rhev/data-center/mnt/glusterSD/172.16.0.151\:_sc5-images/1aabcd3a-3fd3-4902-b92e-17beaf8fe3fd/dom_md/ids:0
s b41eb20a-eafb-481b-9a50-a135cf42b15e:1:/rhev/data-center/mnt/glusterSD/sc5-gluster-10g-1\:_sc5-ovirt__engine/b41eb20a-eafb-481b-9a50-a135cf42b15e/dom_md/ids:0
r b41eb20a-eafb-481b-9a50-a135cf42b15e:8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87:/rhev/data-center/mnt/glusterSD/172.16.0.153\:_sc5-ovirt__engine/b41eb20a-eafb-481b-9a50-a135cf42b15e/images/a9d01d59-f146-47e5-b514-d10f8867678e/8f0c9f7a-ae6a-476e-b6f3-a830dcb79e87.lease:0:5 p 122268
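If more sanlock detail would help I can post it. I've also been poking at the 
following on the host, using the hosted-engine lockspace UUID from the output 
above (syntax from the sanlock man page, so correct me if I have it wrong):

[sc5-ovirt-1]# sanlock client host_status -s b41eb20a-eafb-481b-9a50-a135cf42b15e
[sc5-ovirt-1]# sanlock client log_dump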


engine.log:

2018-03-21 16:09:26,081-07 ERROR 
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(EE-ManagedThreadFactory-engineScheduled-Thread-67) [] EVENT_ID: 
VDS_BROKER_COMMAND_FAILURE(10,802), VDSM sc5-ovirt-1 command GetStatsVDS 
failed: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 ERROR 
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] 
(EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Command 
'GetStatsVDSCommand(HostName = sc5-ovirt-1, 
VdsIdAndVdsVDSCommandParametersBase:{hostId='be3517e0-f79d-464c-8169-f786d13ac287',
 vds='Host[sc5-ovirt-1,be3517e0-f79d-464c-8169-f786d13ac287]'})' execution 
failed: VDSGenericException: VDSNetworkException: Message timeout which can be 
caused by communication issues
2018-03-21 16:09:26,081-07 ERROR 
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] 
(EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failed getting vds 
stats, host='sc5-ovirt-1'(be3517e0-f79d-464c-8169-f786d13ac287): 
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: 
VDSGenericException: VDSNetworkException: Message timeout which can be caused 
by communication issues
2018-03-21 16:09:26,081-07 ERROR 
[org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] 
(EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failure to refresh host 
'sc5-ovirt-1' runtime info: VDSGenericException: VDSNetworkException: Message 
timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] 
(EE-ManagedThreadFactory-engineScheduled-Thread-67) [] Failed to refresh VDS, 
network error, continuing, 
vds='sc5-ovirt-1'(be3517e0-f79d-464c-8169-f786d13ac287): VDSGenericException: 
VDSNetworkException: Message timeout which can be caused by communication issues
2018-03-21 16:09:26,081-07 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] 
(EE-ManagedThreadFactory-engine-Thread-102682) [] Host 'sc5-ovirt-1' is not 
responding.
2018-03-21 16:09:26,088-07 WARN  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(EE-ManagedThreadFactory-engine-Thread-102682) [] EVENT_ID: 
VDS_HOST_NOT_RESPONDING(9,027), Host sc5-ovirt-1 is not responding. Host cannot 
be fenced automatically because power management for the host is disabled.
2018-03-21 16:09:27,070-07 INFO  
[org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] 
Connecting to sc5-ovirt-1/10.181.26.129
2018-03-21 16:09:27,918-07 INFO  
[org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] 
(DefaultQuartzScheduler4) [493fb316] START, 
GlusterServersListVDSCommand(HostName = sc5-gluster-2, 
VdsIdVDSCommandParametersBase:{hostId='797cbf42-6553-4a75-b8b1-93b2adbbc0db'}), 
log id: 6afccc01
2018-03-21 16:09:28,579-07 INFO  
[org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] 
(DefaultQuartzScheduler4) [493fb316] FINISH, GlusterServersListVDSCommand, 
return: [192.168.122.1/24:CONNECTED, sc5-gluster-3:CONNECTED, 
sc5-gluster-10g-1:CONNECTED], log id: 6afccc01
2018-03-21 16:09:28,606-07 INFO  
[org.ovirt.engine.core.vdsbroker.gluster.GlusterVo
