I have updated the Bugzilla ticket with all of the details I included below, as well as some additional ones. I figured it was better to err on the side of providing too many details than not enough.
For the oVirt list's edification, I will note that restarting vdsmd on all 3
hosts did fix the problem -- to an extent. Unfortunately, my hosted-engine is
still not starting (although I can now clearly connect to the hosted-engine
storage), and I see this output every time I try to start the hosted-engine:
[root@cha2-storage ~]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
VM in WaitForLaunch
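As a side note for anyone debugging the same symptom: a minimal sketch (my addition, not something from this thread) for checking whether vdsm still tracks a given vmID. On a live host you would feed the helper the real output of `vdsm-client Host getVMList`; the sample response below is made up for illustration.

```shell
# Hedged sketch: check whether a vmID appears in vdsm's VM list.
vm_known() {
    # $1 = vmID to look for; stdin = vdsm's JSON list of known vmIDs
    grep -q "$1"
}

# On a live host, replace this sample with: vdsm-client Host getVMList
sample='["0a1b2c3d-0000-1111-2222-333344445555"]'
if echo "$sample" | vm_known 'ffd77d79-a699-455e-88e2-f55ee53166ef'; then
    echo "vdsm knows the hosted engine VM"
else
    echo "vdsm has no record of that vmID"
fi
```

If the hosted engine's vmID is absent from the list, the "Virtual machine does not exist" error above is consistent: vdsm genuinely has no definition for that VM in memory.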
I'm not sure if that's because I screwed up when I was doing gluster
maintenance, or what.
But at this point, does this mean I have to re-deploy the hosted engine?
To confirm: if I re-deploy the hosted engine, will all of my regular VMs remain intact? I have over 20 VMs in this environment, and it would be a major ordeal to have to rebuild all 20+ of them.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 2:41 PM, Nir Soffer <[email protected]> wrote:
> On Fri, Aug 13, 2021 at 9:13 PM David White via Users <[email protected]> wrote:
>
> > Hello,
> >
> > It appears that my Manager / hosted-engine isn't working, and I'm unable to
> > get it to start.
> >
> > I have a 3-node HCI cluster, but right now, Gluster is only running on 1
> > host (so no replication).
> >
> > I was hoping to upgrade / replace the storage on my 2nd host today, but
> > aborted that maintenance when I found that I couldn't even get into the
> > Manager.
> >
> > The storage is mounted, but here's what I see:
> >
> > [root@cha2-storage dwhite]# hosted-engine --vm-status
> > The hosted engine configuration has not been retrieved from shared storage.
> > Please ensure that ovirt-ha-agent is running and the storage server is reachable.
> >
> > [root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
> > ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
> >    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
> >    Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
> >  Main PID: 3591872 (ovirt-ha-agent)
> >     Tasks: 1 (limit: 409676)
> >    Memory: 21.5M
> >    CGroup: /system.slice/ovirt-ha-agent.service
> >            └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
> >
> > Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
> >
> > Any time I try to do anything like connect the engine storage, disconnect
> > the engine storage, or connect to the console, it just sits there doing
> > nothing, and I eventually have to Ctrl-C out of it.
> >
> > Maybe I have to be patient? When I Ctrl-C, I get a traceback error:
> >
> > [root@cha2-storage dwhite]# hosted-engine --console
> > ^CTraceback (most recent call last):
> >   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
> >     "__main__", mod_spec)
> >   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
> >     exec(code, run_globals)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
> >     args.command(args)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
> >     f(*args, **kwargs)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
> >     cli = ohautil.connect_vdsm_json_rpc()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
> >     __vdsm_json_rpc_connect(logger, timeout)
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
> >     timeout=timeout)
> >   File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
> >     outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
> >     nr_retries, reconnect_interval)
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
> >     client = StompClient(utils.create_connected_socket(host, port, sslctx),
> >   File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
> >     sock.connect((host, port))
> >   File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
> >     self._real_connect(addr, False)
> >   File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
> >     self.do_handshake()
> >   File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
> >     self._sslobj.do_handshake()
> >   File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
> >     self._sslobj.do_handshake()
> >
> > This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
> >
> > MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
> >
> > MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
> >
> > MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
> >     self._storage_broker_instance = self._get_storage_broker()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
> >     return storage_broker.StorageBroker()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
> >     self._backend.connect()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
> >     sserver.connect_storage_server()
> >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
> >     'Connection to storage server failed'
> > RuntimeError: Connection to storage server failed
> >
> > MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
> >
> > MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
> >
> > MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> > MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> > MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> > MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> > MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
> >
> > And I see this in /var/log/vdsm/vdsm.log:
> >
> > 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> > Traceback (most recent call last):
> >   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
> >   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
> >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
> >   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
> >   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
> >   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> > OSError: [Errno 24] Too many open files
>
> This may be this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1926589
>
> Since vdsm will never recover from this error without a restart, you should
> start by restarting the vdsmd service on all hosts. After restarting vdsmd,
> connecting to the storage server may succeed.
>
> Please also report this bug; we need to understand whether this is the same
> issue or a different one.
>
> Vdsm should recover from such critical errors by exiting, so leaks would
> cause service restarts (maybe every few days) instead of downtime of the
> entire system.
>
> Nir
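For anyone hitting the same EMFILE error: Errno 24 means the process exhausted its open-file limit, so after restarting vdsmd it is worth watching whether the descriptor count climbs again. Below is a minimal sketch (my addition, not from the thread); the `pgrep -f vdsmd` pattern is an assumption about the vdsm process name on your host.

```shell
# Hedged sketch: count open file descriptors for a process via /proc.
fd_count() {
    # $1 = PID; prints how many file descriptors that process holds open
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Demonstrated here on the current shell; on an oVirt host you would run:
#   fd_count "$(pgrep -f vdsmd | head -n1)"
fd_count "$$"
```

Sampling this periodically (e.g. from cron) and comparing against the "Max open files" line in /proc/&lt;pid&gt;/limits gives early warning before vdsm hits the limit again.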
publickey - [email protected] - 0x320CD582.asc
Description: application/pgp-keys
signature.asc
Description: OpenPGP digital signature
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/MY6L5KK4TQ7CGUWTZPDBK7XW3DLBKAFO/

