On Wed, Jul 21, 2021 at 1:01 AM Valerio Luccio <[email protected]> wrote:
>
> Hi list,
>
> I have a hosted engine running on CentOS 8. The engine and all the VMs are
> stored on a 4-node gluster. I had some issues with the gluster and then the
> hosted-engine stopped working (even though the virtualization dashboard
> showed 4 virtual machines running). I tried to "systemctl restart" the
> hosted-engine, but it failed. I then rebooted the server, and the
> hosted-engine still would not come up. Note that the server has no issue
> mounting the gluster:
>
> $ df
> hydra1:/MRIData 390664407040 20530130012 370134277028 6%
> /rhev/data-center/mnt/glusterSD/hydra1:_MRIData
> $ ls -l
> /rhev/data-center/mnt/glusterSD/hydra1\:_MRIData/6547dc22-b89e-4f14-8958-c9e8d27b29a4/
> drwxr-xr-x. 2 vdsm kvm 4.0K Mar 29 12:24 dom_md
> drwxr-xr-x. 2 vdsm kvm 4.0K Jul 20 17:47 ha_agent
> drwxr-xr-x. 12 vdsm kvm 4.0K Apr 1 16:32 images
> drwxr-xr-x. 4 vdsm kvm 4.0K Mar 29 12:24 master
>
> Where "hydra1" is one of my gluster nodes and MRIData is the volume name.
>
> Here is the relevant snippet from /var/log/ovirt-hosted-engine-ha/agent.log
>
> MainThread::INFO::2021-07-20
> 17:29:07,584::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
> ovirt-hosted-engine-ha agent 2.4.6 started
> MainThread::INFO::2021-07-20
> 17:29:07,594::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname)
> Certificate common name not found, using hostname to identify host
> MainThread::INFO::2021-07-20
> 17:29:07,635::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
> Initializing ha-broker connection
> MainThread::INFO::2021-07-20
> 17:29:07,636::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor)
> Starting monitor network, options {'addr': '192.168.39.65', 'network_test':
> 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
> MainThread::ERROR::2021-07-20
> 17:29:07,636::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker)
> Failed to start necessary monitors
> MainThread::ERROR::2021-07-20
> 17:29:07,637::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Traceback (most recent call last):
> File
> "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 85, in start_monitor
> response = self._proxy.start_monitor(type, options)
> File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
> return self.__send(self.__name, args)
> File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
> verbose=self.__verbose
> File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
> return self.single_request(host, handler, request_body, verbose)
> File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
> http_conn = self.send_request(host, handler, request_body, verbose)
> File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
> self.send_content(connection, request_body)
> File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
> connection.endheaders(request_body)
> File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
> self._send_output(message_body, encode_chunked=encode_chunked)
> File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
> self.send(msg)
> File "/usr/lib64/python3.6/http/client.py", line 974, in send
> self.connect()
> File
> "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py",
> line 74, in connect
> self.sock.connect(base64.b16decode(self.host))
> FileNotFoundError: [Errno 2] No such file or directory
This seems to indicate that the broker (ovirt-ha-broker) is down: the agent
could not find the broker's socket. Can you please check it (status, log,
and whether a restart helps)?
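
To illustrate why I read the error this way: per the traceback, the agent
reaches the broker through a Unix socket (unixrpc.py, self.sock.connect), and
connecting to a socket path that does not exist fails with this same errno.
A minimal sketch, not the agent's actual code; try_connect and the temporary
path are made up for the example:

```python
import errno
import os
import socket
import tempfile


def try_connect(sock_path):
    """Connect to a Unix socket; return None on success, else the errno."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(sock_path)
        return None
    except OSError as e:
        return e.errno
    finally:
        s.close()


# Hypothetical path for illustration only: nothing has created this socket,
# which is roughly the situation the agent is in when the broker is not running.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "broker.socket")
    result = try_connect(missing)

# Print the symbolic name of the errno that connect() failed with.
print(errno.errorcode.get(result))
```
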
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
> File
> "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
> line 131, in _run_agent
> return action(he)
> File
> "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py",
> line 55, in action_proper
> return he.start_monitoring()
> File
> "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 437, in start_monitoring
> self._initialize_broker()
> File
> "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py",
> line 561, in _initialize_broker
> m.get('options', {}))
> File
> "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py",
> line 91, in start_monitor
> ).format(t=type, o=options, e=e)
> ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to
> start monitor via ovirt-ha-broker: [Errno 2] No such file or directory,
> [monitor: 'network', options: {'addr': '192.168.39.65', 'network_test':
> 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
>
> MainThread::ERROR::2021-07-20
> 17:29:07,637::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent)
> Trying to restart agent
> MainThread::INFO::2021-07-20
> 17:29:07,637::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run)
> Agent shutting down
>
> I'm puzzled by that "Certificate common name not found", which I had not seen
> before.
This looks like a harmless bug. I have now filed it, mainly for reference;
I am not sure it is worth fixing:
https://bugzilla.redhat.com/show_bug.cgi?id=1984262
> The fqdn of the hosted engine resolves fine on the server, so does the fqdn
> of the server itself. The ip address it seems to try to use for the network
> is that of one of the university's gateways.
>
> Any ideas? Any way to debug this further?
See above - check the broker.
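
If it helps, here is a rough sketch for checking whether the broker's socket
file exists at all. The path below is my assumption about the default
location and may differ on your host; socket_state is a made-up helper, not
part of ovirt-hosted-engine-ha:

```python
import os
import stat


def socket_state(path):
    """Classify a path as 'missing', 'not-a-socket', or 'socket'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    return "socket" if stat.S_ISSOCK(mode) else "not-a-socket"


# ASSUMPTION: default broker socket location; verify it on your host.
BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"

print(BROKER_SOCKET, "->", socket_state(BROKER_SOCKET))
```

If this reports "missing", the broker most likely never started or has
already exited, and the service status and the broker's own log are the next
places to look.
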
Thanks and best regards,
--
Didi