Of course, right when I sent this email, I went back over to one of my consoles, re-ran "hosted-engine --vm-status", and saw that it was up. I can confirm my hosted engine is now online and healthy.
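(For anyone who wants to keep an eye on this while waiting, something like the following works; the grep just trims the output down to the interesting fields, and the 30-second interval is arbitrary:)

watch -n 30 'hosted-engine --vm-status | grep -E "Engine status|Hostname"'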
So to recap: restarting vdsmd solved my problem (rough commands are at the end of this message). I provided lots of details in the Bugzilla, and I generated an sosreport on two of my three systems prior to restarting vdsmd.

Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, August 13th, 2021 at 9:31 PM, David White <dmwhite...@protonmail.com> wrote:

> I have updated the Bugzilla with all of the details I included below, as well as additional details.
>
> I figured better to err on the side of providing too many details than not enough.
>
> For the oVirt list's edification, I will note that restarting vdsmd on all 3 hosts did fix the problem -- to an extent. Unfortunately, my hosted-engine is still not starting (although I can now clearly connect to the hosted-engine storage), and I see this output every time I try to start the hosted-engine:
>
> [root@cha2-storage ~]# hosted-engine --vm-start
> Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
> (code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
> VM in WaitForLaunch
>
> I'm not sure if that's because I screwed up when I was doing Gluster maintenance, or what.
>
> But at this point, does this mean I have to re-deploy the hosted engine?
>
> To confirm, if I re-deploy the hosted engine, will all of my regular VMs remain intact? I have over 20 VMs in this environment, and it would be a major deal to have to rebuild all 20+ of those VMs.
>
> Sent with ProtonMail Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Friday, August 13th, 2021 at 2:41 PM, Nir Soffer nsof...@redhat.com wrote:
>
> > On Fri, Aug 13, 2021 at 9:13 PM David White via Users users@ovirt.org wrote:
> >
> > > Hello,
> > >
> > > It appears that my Manager / hosted-engine isn't working, and I'm unable to get it to start.
> > >
> > > I have a 3-node HCI cluster, but right now, Gluster is only running on 1 host (so no replication).
> > >
> > > I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager.
> > >
> > > The storage is mounted, but here's what I see:
> > >
> > > [root@cha2-storage dwhite]# hosted-engine --vm-status
> > > The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
> > >
> > > [root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
> > > ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
> > >    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
> > >    Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
> > >  Main PID: 3591872 (ovirt-ha-agent)
> > >     Tasks: 1 (limit: 409676)
> > >    Memory: 21.5M
> > >    CGroup: /system.slice/ovirt-ha-agent.service
> > >            └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent
> > >
> > > Aug 13 11:10:51 cha2-storage.mgt.barredowlweb.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.
> > >
> > > Any time I try to do anything like connect the engine storage, disconnect the engine storage, or connect to the console, it just sits there and doesn't do anything, and I eventually have to Ctrl-C out of it.
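Looking back at this part of my original mail: a faster way to tell whether vdsm itself had stopped answering (as opposed to the storage just being slow) would probably have been to poke vdsm directly instead of waiting on the hosted-engine commands and Ctrl-C'ing out. A rough sketch -- 54321 is the port vdsm listens on for json-rpc, at least on my hosts, and the 10-second timeout is arbitrary:

ss -tlnp | grep 54321        # is vdsmd still listening on its json-rpc port?
timeout 10 vdsm-client Host getCapabilities > /dev/null \
    && echo "vdsm answered" \
    || echo "vdsm did not answer within 10s"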
> > > Maybe I have to be patient? When I Ctrl-C, I get a traceback error:
> > >
> > > [root@cha2-storage dwhite]# hosted-engine --console
> > > ^CTraceback (most recent call last):
> > >   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
> > >     "__main__", mod_spec)
> > >   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
> > >     exec(code, run_globals)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
> > > [root@cha2-storage dwhite]# args.command(args)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
> > >     f(*args, **kwargs)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
> > >     cli = ohautil.connect_vdsm_json_rpc()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
> > >     __vdsm_json_rpc_connect(logger, timeout)
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
> > >     timeout=timeout)
> > >   File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
> > >     outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
> > >     nr_retries, reconnect_interval)
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
> > >     client = StompClient(utils.create_connected_socket(host, port, sslctx),
> > >   File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
> > >     sock.connect((host, port))
> > >   File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
> > >     self._real_connect(addr, False)
> > >   File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
> > >     self.do_handshake()
> > >   File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
> > >     self._sslobj.do_handshake()
> > >   File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
> > >     self._sslobj.do_handshake()
> > >
> > > This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:
> > >
> > > MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
> > > MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
> > > MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
> > >     self._storage_broker_instance = self._get_storage_broker()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
> > >     return storage_broker.StorageBroker()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
> > >     self._backend.connect()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
> > >     sserver.connect_storage_server()
> > >   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
> > >     'Connection to storage server failed'
> > > RuntimeError: Connection to storage server failed
> > >
> > > MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
> > > MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
> > > MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
> > > MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
> > > MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
> > > MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
> > > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
> > > MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
> > > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
> > > MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
> > > MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors
> > >
> > > And I see this in /var/log/vdsm/vdsm.log:
> > >
> > > 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> > > Traceback (most recent call last):
> > >   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
> > >   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
> > >   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
> > >   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
> > >   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
> > >   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> > > OSError: [Errno 24] Too many open files
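That "Too many open files" is the key line: vdsm had run out of file descriptors, which is why every connection attempt (including the hosted-engine helpers) just hung or failed. A rough way to see how close vdsmd is to its fd limit, assuming the systemd MainPID is the vdsm process:

pid=$(systemctl show -p MainPID --value vdsmd)
ls /proc/$pid/fd | wc -l                  # current number of open file descriptors
grep 'Max open files' /proc/$pid/limits   # the limit vdsmd is actually running with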
> > This may be this bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1926589
> >
> > Since vdsm will never recover from this error without a restart, you should start by restarting the vdsmd service on all hosts.
> >
> > After restarting vdsmd, connecting to the storage server may succeed.
> >
> > Please also report this bug; we need to understand whether this is the same issue or a different one.
> >
> > Vdsm should recover from such critical errors by exiting, so leaks would cause service restarts (maybe every few days) instead of downtime of the entire system.
> >
> > Nir
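To close the loop on the recap at the top: this is roughly the sequence I went through per host. I only actually captured an sosreport on two of the three hosts before restarting, and the status checks at the end are just sanity checks; --batch simply skips the sosreport prompts:

sosreport --batch            # collect logs/diagnostics before touching anything
systemctl restart vdsmd
systemctl status vdsmd ovirt-ha-broker ovirt-ha-agent --no-pager
hosted-engine --vm-status    # the engine showed up as healthy again a while later (see the top of this mail)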