Of course, right when I sent this email, I went back over to one of my consoles, re-ran "hosted-engine --vm-status", and saw that it was up. I can confirm my hosted engine is now online and healthy.
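(For anyone who wants to keep an eye on this while waiting, something like the following works; the grep just trims the output down to the interesting fields, and the 30-second interval is arbitrary:)

watch -n 30 'hosted-engine --vm-status | grep -E "Engine status|Hostname"'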
So to recap: restarting vdsmd solved my problem (rough commands are at the end of this message). I provided lots of details in the Bugzilla, and I generated an sosreport on two of my three systems prior to restarting vdsmd.

Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 9:31 PM, David White <dmwhite...@protonmail.com> wrote:

> I have updated the Bugzilla with all of the details I included below, as well as additional details.
>
> I figured better to err on the side of providing too many details than not enough.
>
> For the oVirt list's edification, I will note that restarting vdsmd on all 3 hosts did fix the problem -- to an extent. Unfortunately, my hosted-engine is still not starting (although I can now clearly connect to the hosted-engine storage), and I see this output every time I try to start the hosted-engine:
>
> [root@cha2-storage ~]# hosted-engine --vm-start
> Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
> (code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
> VM in WaitForLaunch
>
> I'm not sure if that's because I screwed up when I was doing Gluster maintenance, or what.
>
> But at this point, does this mean I have to re-deploy the hosted engine?
>
> To confirm, if I re-deploy the hosted engine, will all of my regular VMs remain intact? I have over 20 VMs in this environment, and it would be a major deal to have to rebuild all 20+ of those VMs.
>
> Sent with ProtonMail Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Friday, August 13th, 2021 at 2:41 PM, Nir Soffer nsof...@redhat.com wrote:
>
> > On Fri, Aug 13, 2021 at 9:13 PM David White via Users users@ovirt.org wrote:
> >
> > > Hello,
> > >
> > > It appears that my Manager / hosted-engine isn't working, and I'm unable to get it to start.
> > >
> > > I have a 3-node HCI cluster, but right now, Gluster is only running on 1 host (so no replication).
> > >
> > > I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager.
> > >
> > > The storage is mounted, but here's what I see:
> > >
> > > [root@cha2-storage dwhite]# hosted-engine --vm-status
> > > The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
> > >
> > > [root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
> > > ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
> > >    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
> > >    Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
> > >  Main PID: 3591872 (ovirt-ha-agent)
> > >     Tasks: 1 (limit: 409676)
> > >    Memory: 21.5M
> > >    CGroup: /system.slice/ovirt-ha-agent.service
> > >            └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
> > >
> > > Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
> > >
> > > Any time I try to do anything like connect the engine storage, disconnect the engine storage, or connect to the console, it just sits there and doesn't do anything, and I eventually have to Ctrl-C out of it.
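Looking back at this part of my original mail: a faster way to tell whether vdsm itself had stopped answering (as opposed to the storage just being slow) would probably have been to poke vdsm directly instead of waiting on the hosted-engine commands and Ctrl-C'ing out. A rough sketch -- 54321 is the port vdsm listens on for json-rpc, at least on my hosts, and the 10-second timeout is arbitrary:

ss -tlnp | grep 54321        # is vdsmd still listening on its json-rpc port?
timeout 10 vdsm-client Host getCapabilities > /dev/null \
    && echo "vdsm answered" \
    || echo "vdsm did not answer within 10s"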
> > > Maybe I have to be patient? When I Ctrl-C, I get a traceback error:
> > >
> > > [root@cha2-storage dwhite]# hosted-engine --console
> > > ^CTraceback (most recent call last):
> > >   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
> > >     "__main__", mod_spec)
> > >   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
> > >     exec(code, run_globals)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
> > > [root@cha2-storage dwhite]# args.command(args)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
> > >     f(*args, **kwargs)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
> > >     cli = ohautil.connect_vdsm_json_rpc()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
> > >     __vdsm_json_rpc_connect(logger, timeout)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
> > >     timeout=timeout)
> > >   File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
> > >     outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
> > >     nr_retries, reconnect_interval)
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
> > >     client = StompClient(utils.create_connected_socket(host, port, sslctx),
> > >   File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
> > >     sock.connect((host, port))
> > >   File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
> > >     self._real_connect(addr, False)
> > >   File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
> > >     self.do_handshake()
> > >   File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
> > >     self._sslobj.do_handshake()
> > >   File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
> > >     self._sslobj.do_handshake()
> > >
> > > This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
> > >
> > > MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
> > > MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
> > > MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
> > >     self._storage_broker_instance = self._get_storage_broker()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
> > >     return storage_broker.StorageBroker()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
> > >     self._backend.connect()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
> > >     sserver.connect_storage_server()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
> > >     'Connection to storage server failed'
> > > RuntimeError: Connection to storage server failed
> > >
> > > MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
> > > MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
> > > MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> > > MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> > > MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> > > MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> > > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> > > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> > > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> > > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> > > MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
> > >
> > > And I see this in /var/log/vdsm/vdsm.log:
> > >
> > > 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> > > Traceback (most recent call last):
> > >   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
> > >   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
> > >   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
> > >   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
> > >   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> > > OSError: [Errno 24] Too many open files
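That "Too many open files" is the key line: vdsm had run out of file descriptors, which is why every connection attempt (including the hosted-engine helpers) just hung or failed. A rough way to see how close vdsmd is to its fd limit, assuming the systemd MainPID is the vdsm process:

pid=$(systemctl show -p MainPID --value vdsmd)
ls /proc/$pid/fd | wc -l                  # current number of open file descriptors
grep 'Max open files' /proc/$pid/limits   # the limit vdsmd is actually running with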
> > This may be this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1926589
> >
> > Since vdsm will never recover from this error without a restart, you should start by restarting the vdsmd service on all hosts.
> >
> > After restarting vdsmd, connecting to the storage server may succeed.
> >
> > Please also report this bug; we need to understand whether this is the same issue or a different one.
> >
> > Vdsm should recover from such critical errors by exiting, so leaks would cause service restarts (maybe every few days) instead of downtime of the entire system.
> >
> > Nir
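To close the loop on the recap at the top: this is roughly the sequence I went through per host. I only actually captured an sosreport on two of the three hosts before restarting, and the status checks at the end are just sanity checks; --batch simply skips the sosreport prompts:

sosreport --batch            # collect logs/diagnostics before touching anything
systemctl restart vdsmd
systemctl status vdsmd ovirt-ha-broker ovirt-ha-agent --no-pager
hosted-engine --vm-status    # the engine showed up as healthy again a while later (see the top of this mail)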