I have updated the Bugzilla report with all of the details I included below, as 
well as some additional ones. 

I figured it was better to err on the side of providing too many details than 
not enough. 


For the oVirt list's edification, I will note that restarting vdsmd on all 3 
hosts did fix the problem -- to an extent. Unfortunately, my hosted-engine 
still isn't starting (although I can now successfully connect to the 
hosted-engine storage), and I see this output every time I try to start it:

[root@cha2-storage ~]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
VM in WaitForLaunch
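
For anyone who wants to poke at vdsm directly, the equivalent checks would be 
something along these lines (a sketch; I believe the vdsm-client tool ships 
with vdsm on 4.4, but double-check the exact syntax on your version):

  # list the VMs vdsm currently knows about on this host
  vdsm-client Host getVMList
  # query the hosted-engine VM by the vmID from the error above
  vdsm-client VM getStats vmID=ffd77d79-a699-455e-88e2-f55ee53166ef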

I'm not sure if that's because I screwed up when I was doing gluster 
maintenance, or something else.
But at this point, does this mean I have to re-deploy the hosted engine?
And to confirm: if I re-deploy the hosted engine, will all of my regular VMs 
remain intact? I have over 20 VMs in this environment, and having to rebuild 
all of them would be a major ordeal.


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Friday, August 13th, 2021 at 2:41 PM, Nir Soffer <nsof...@redhat.com> wrote:

> On Fri, Aug 13, 2021 at 9:13 PM David White via Users <users@ovirt.org> wrote:
> 
> > Hello,
> > 
> > It appears that my Manager / hosted-engine isn't working, and I'm unable
> > to get it to start.
> > 
> > I have a 3-node HCI cluster, but right now, Gluster is only running on 1
> > host (so no replication).
> > 
> > I was hoping to upgrade / replace the storage on my 2nd host today, but
> > aborted that maintenance when I found that I couldn't even get into the
> > Manager.
> > 
> > The storage is mounted, but here's what I see:
> > [root@cha2-storage dwhite]# hosted-engine --vm-status
> > The hosted engine configuration has not been retrieved from shared
> > storage. Please ensure that ovirt-ha-agent is running and the storage
> > server is reachable.
> > 
> > [root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
> > ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
> >    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
> >    Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
> >  Main PID: 3591872 (ovirt-ha-agent)
> >     Tasks: 1 (limit: 409676)
> >    Memory: 21.5M
> >    CGroup: /system.slice/ovirt-ha-agent.service
> >            └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
> > 
> > Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
> > 

> > Any time I try to do anything like connect the engine storage, disconnect
> > the engine storage, or connect to the console (exact invocations below),
> > it just sits there and doesn't do anything, and I eventually have to
> > Ctrl-C out of it.
> > 
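> > For concreteness, these are the commands I mean (option names as I
> > understand them -- check hosted-engine --help on your version):
> > 
> >   hosted-engine --connect-storage
> >   hosted-engine --disconnect-storage
> >   hosted-engine --console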

> > Maybe I just have to be patient? When I hit Ctrl-C, I get a traceback:
> > 

> > [root@cha2-storage dwhite]# hosted-engine --console
> > ^CTraceback (most recent call last):
> >   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
> >     "__main__", mod_spec)
> >   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
> >     exec(code, run_globals)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
> >     args.command(args)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
> >     f(*args, **kwargs)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
> >     cli = ohautil.connect_vdsm_json_rpc()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
> >     __vdsm_json_rpc_connect(logger, timeout)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
> >     timeout=timeout)
> >   File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
> >     outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
> >     nr_retries, reconnect_interval)
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
> >     client = StompClient(utils.create_connected_socket(host, port, sslctx),
> >   File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
> >     sock.connect((host, port))
> >   File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
> >     self._real_connect(addr, False)
> >   File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
> >     self.do_handshake()
> >   File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
> >     self._sslobj.do_handshake()
> >   File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
> >     self._sslobj.do_handshake()
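> > 
> > (For what it's worth, the hang is inside the TLS handshake to vdsm. A
> > quick connectivity probe -- a sketch, assuming vdsm's default port 54321:
> > 
> >   timeout 5 openssl s_client -connect localhost:54321 </dev/null
> > )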
> > 

> > This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
> > 

> > MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
> > MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
> > MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
> >     self._storage_broker_instance = self._get_storage_broker()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
> >     return storage_broker.StorageBroker()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
> >     self._backend.connect()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
> >     sserver.connect_storage_server()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
> >     'Connection to storage server failed'
> > RuntimeError: Connection to storage server failed
> > MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
> > MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
> > MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> > MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> > MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> > MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
> > 

> > And I see this in /var/log/vdsm/vdsm.log:
> > 

> > 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> > Traceback (most recent call last):
> >   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
> >   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
> >   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
> >   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
> >   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> > OSError: [Errno 24] Too many open files
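> > 
> > (A quick way to see how close vdsm is to its fd limit, for anyone checking
> > their own hosts -- a sketch; the pgrep pattern is an assumption, adjust as
> > needed:
> > 
> >   VDSM_PID=$(pgrep -f vdsmd | head -n1)
> >   ls /proc/$VDSM_PID/fd | wc -l
> >   grep 'Max open files' /proc/$VDSM_PID/limits
> > )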
> 

> This may be this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1926589
> 
> Since vdsm will never recover from this error without a reboot, you should
> start by restarting the vdsmd service on all hosts.
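> 
> (On each host, that would be something like the following -- sketch:
> 
>   systemctl restart vdsmd
>   systemctl status vdsmd
> )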
> 
> After restarting vdsmd, connecting to the storage server may succeed.
> 
> Please also report this bug; we need to understand whether this is the same
> issue or a different one.
> 

> Vdsm should recover from such critical errors by exiting, so leaks
> 

> will cause service restarts (maybe every few days) instead of downtime
> 

> of the entire system.
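> 
> (For the curious, you can check how the unit is currently configured to
> handle a crash -- a sketch; Restart/RestartUSec are standard systemd service
> properties:
> 
>   systemctl show vdsmd -p Restart -p RestartUSec
> )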
> 

> Nir
