On Sat, Apr 10, 2021 at 1:14 PM David White via Users <[email protected]> wrote:
> This is resolved, and my environment is 100% stable now.

Glad to hear that, thanks for the report!

> Or was, until I then used the engine to "upgrade" one of the hosts, at which point I started having problems again after the reboot, because the old vlan came back.
> I'll finish getting things stabilized today, and hopefully won't run into this again.
>
> I've been turning things on and off quite a bit, because they aren't in a proper data center (yet) and are just sitting here in my home office.
> So I'm sure shutting them down and turning them back on fairly often hasn't helped the situation.
>
> I initially had a few issues going on:
>
> 1. I of course first broke things when I tried to change the management vlan.
> 2. Aside from my notes below and the troubleshooting steps I went through yesterday, I had forgotten that connectivity to the DNS server hadn't been restored. Once I got DNS operational, the engine was able to see two of the hosts, and finally started showing some green.
> 3. I then ran `hosted-engine --vm-stop` to shut down the engine, started it again... and voilà. The last remaining problematic host came online, and a few minutes later the disks, volumes, and datacenter came online.
> 4. I think part of my problem has been this switch. I purchased a Netgear GS324T for my frontend traffic, but I've also needed to put my backend traffic onto some temporary ports on that switch until I can get a controller VM set up to run my other switch, a Ubiquiti US-XG-16, for my permanent backend traffic. The Netgear hasn't been nearly as simple to configure as I had hoped. The vlan behavior has also been inconsistent - sometimes I have vlan settings in place and things work; sometimes they don't. It has also been re-assigning a few of the vlans occasionally after reboots, which has been frustrating. I'm close to being completely done configuring the infrastructure, but I'm also getting increasingly tempted to go find a different switch.
>
> Lessons learned:
>
> 1. Always make sure DNS is functional.
>    1. I was really hoping that I could run DNS as a VM (or multiple VMs) *inside* the cluster.
>    2. That said, if the cluster and the engine won't even start correctly without it, then I may need to run DNS externally. I'm open to feedback on this.
>       1. I have 1 extra U of space at the datacenter reserved, and I do have a 4th spare server that I haven't decided what to do with yet. It has way more CPU and RAM than would be necessary to run an internal DNS server... but perhaps I have no choice. *Thoughts*?

You can also have the IP addresses of the engine and hosts in /etc/hosts of all machines (engine and hosts) - then things should work fine. It does mean you'll have to manually maintain these hosts files somehow (a minimal sketch follows below).

> 2. Make sure your vlan settings are correct *before* you start deploying the hosted engine and configuring oVirt.

Definitely. The same goes for making sure that IP addresses (and netmasks, routes, etc.) are as intended and working, that name resolution is correct (DNS or /etc/hosts), and so on.

> 3. If possible, don't turn off and turn on your servers constantly. :) I realize this is a given. I just don't have much choice in the matter right now, due to the lack of a datacenter; everything is still in my home office.

While definitely not recommended, in principle this should be harmless.
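Going back to the /etc/hosts suggestion above, a minimal sketch of what that could look like, kept identical on the engine VM and on all three hosts. The host names are the ones appearing later in this thread; the engine name and all IP addresses are made-up examples, not values from this environment:

    # /etc/hosts - keep the same entries on the engine and on every host
    # (engine name and addresses below are examples/assumptions)
    10.1.0.5    engine.mgt.example.com         engine
    10.1.0.11   cha1-storage.mgt.example.com   cha1-storage
    10.1.0.12   cha2-storage.mgt.example.com   cha2-storage
    10.1.0.13   cha3-storage.mgt.example.com   cha3-storage

With that in place, the hosts and the engine can resolve each other even when no DNS VM is up yet, at the cost of updating four files by hand whenever an address changes.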
If you find concrete reproducible bugs around these shutdowns, please report them (with clear, accurate details - just "I turn off and on my hosts and things stop working" is not helpful, obviously...).

Thanks again and best regards,

> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Friday, April 9, 2021 5:55 AM, David White via Users <[email protected]> wrote:
>
> I was able to fix the connectivity issues between all 3 hosts.
> It turned out that I hadn't completely deleted the old vlan settings from the host. I re-ran "nmcli connection delete" on the old vlan. After that, I had to edit a network-scripts file and change/fix the bridge to use ifcfg-ovirtmgmt.
> After I did all that, the problematic host was accessible again. All 3 Gluster peers are now able to see each other and communicate over the management network.
>
> From the command line, I was then able to successfully run "hosted-engine --connect-storage" without errors. I was also able to then run "hosted-engine --vm-start".
> Unfortunately, the engine itself is still unstable, and when I access the web UI / oVirt Manager, it shows that all 3 hosts are inaccessible and down.
>
> I don't understand how the web UI is operational at all if the engine thinks that all 3 hosts are inaccessible. What's going on there?
>
> Although the initial problem was my own doing (I changed the management vlan), I'm deeply concerned with how unstable everything became - and has continued to be - ever since I lost connectivity to the 1 host. I thought the point of all of this was that things would (should) continue to work if 1 of the hosts went away.
>
> Anyway, at this point, all 3 hosts are able to communicate with each other over the management network, but the engine still thinks that all 3 hosts are down, and is unable to manage anything.
> Any suggestions on how to proceed would be much appreciated.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, April 7, 2021 8:28 PM, David White <[email protected]> wrote:
>
> I still haven't been able to resurrect the 1st host, so I've spent some time trying to get the hosted engine stable. I would welcome input on how to fix the problematic host so that it can be accessible again.
>
> As per my original email, this all started when I tried to change the management vlan. I honestly cannot remember what I did (if anything) to the actual hosts when this all started, but my troubleshooting steps today have been to fiddle with the vlan settings and /etc/sysconfig/network-scripts/ files on the problematic host to switch from the original vlan (1) to the new vlan (10).
>
> In the meantime, I'm troubleshooting why the hosted engine isn't really working, since the other two hosts are operational.
>
> The hosted engine is "running" -- I can access and navigate around the oVirt Manager.
> However, it appears that all of the storage domains are down, and all of the hosts are "NonOperational". I was, however, able to put two of the hosts into Maintenance Mode, including the problematic 1st host.
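As an aside on the vlan surgery described above (the April 9 fix, and the troubleshooting in the quoted message): on an EL8 host, moving a management vlan by hand looks roughly like the sketch below. This is a sketch only; the connection and device names (eno1, eno1.1, eno1.10) are assumptions, and on an oVirt host VDSM normally owns the ovirtmgmt bridge configuration, so treat it as an illustration rather than a recipe:

    # find and remove the stale vlan-1 connection (names are examples)
    nmcli connection show
    nmcli connection delete eno1.1

    # recreate the vlan on ID 10, attached to the management bridge
    nmcli connection add type vlan con-name eno1.10 ifname eno1.10 \
        dev eno1 id 10 master ovirtmgmt slave-type bridge

    # legacy ifcfg equivalent: /etc/sysconfig/network-scripts/ifcfg-eno1.10
    #   DEVICE=eno1.10
    #   VLAN=yes
    #   BRIDGE=ovirtmgmt
    #   ONBOOT=yes
    nmcli connection reload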
> This is what I see on the 2nd host:
>
> [root@cha2-storage network-scripts]# gluster peer status
> Number of Peers: 2
>
> Hostname: cha1-storage.mgt.example.com
> Uuid: 348de1f3-5efe-4e0c-b58e-9cf48071e8e1
> State: Peer in Cluster (Disconnected)
>
> Hostname: cha3-storage.mgt.example.com
> Uuid: 0563c3e8-237d-4409-a09a-ec51719b0da6
> State: Peer in Cluster (Connected)
>
> [root@cha2-storage network-scripts]# hosted-engine --vm-status
> The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.
>
> [root@cha2-storage network-scripts]# hosted-engine --connect-storage
> Traceback (most recent call last):
>   File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
>     "__main__", mod_spec)
>   File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
>     exec(code, run_globals)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/connect_storage_server.py", line 30, in <module>
>     timeout=ohostedcons.Const.STORAGE_SERVER_TIMEOUT,
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/client/client.py", line 312, in connect_storage_server
>     sserver.connect_storage_server(timeout=timeout)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 394, in connect_storage_server
>     'Connection to storage server failed'
> RuntimeError: Connection to storage server failed
>
> The ovirt-ha-agent service seems to be continuously trying to activate, but failing:
>
> [root@cha2-storage network-scripts]# systemctl status -l ovirt-ha-agent
> ● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
>    Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
>    Active: activating (auto-restart) (Result: exit-code) since Wed 2021-04-07 20:24:46 EDT; 60ms ago
>   Process: 124306 ExecStart=/usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent (code=exited, status=157)
>  Main PID: 124306 (code=exited, status=157)
>
> Some recent entries in /var/log/ovirt-hosted-engine-ha/agent.log:
>
> MainThread::ERROR::2021-04-07 20:22:59,115::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> MainThread::INFO::2021-04-07 20:22:59,115::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
> MainThread::INFO::2021-04-07 20:23:09,717::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.4.6 started
> MainThread::INFO::2021-04-07 20:23:09,742::hosted_engine::242::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_get_hostname) Certificate common name not found, using hostname to identify host
> MainThread::INFO::2021-04-07 20:23:09,837::hosted_engine::548::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Initializing ha-broker connection
> MainThread::INFO::2021-04-07 20:23:09,838::brokerlink::82::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor network, options {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}
> MainThread::ERROR::2021-04-07 20:23:09,839::hosted_engine::564::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
> MainThread::ERROR::2021-04-07 20:23:09,842::agent::143::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 85, in start_monitor
>     response = self._proxy.start_monitor(type, options)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1112, in __call__
>     return self.__send(self.__name, args)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1452, in __request
>     verbose=self.__verbose
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
>     return self.single_request(host, handler, request_body, verbose)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1166, in single_request
>     http_conn = self.send_request(host, handler, request_body, verbose)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1279, in send_request
>     self.send_content(connection, request_body)
>   File "/usr/lib64/python3.6/xmlrpc/client.py", line 1309, in send_content
>     connection.endheaders(request_body)
>   File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
>     self._send_output(message_body, encode_chunked=encode_chunked)
>   File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
>     self.send(msg)
>   File "/usr/lib64/python3.6/http/client.py", line 974, in send
>     self.connect()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/unixrpc.py", line 74, in connect
>     self.sock.connect(base64.b16decode(self.host))
> FileNotFoundError: [Errno 2] No such file or directory
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>     return action(he)
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>     return he.start_monitoring()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 437, in start_monitoring
>     self._initialize_broker()
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 561, in _initialize_broker
>     m.get('options', {}))
>   File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 91, in start_monitor
>     ).format(t=type, o=options, e=e)
> ovirt_hosted_engine_ha.lib.exceptions.RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'addr': '10.1.0.1', 'network_test': 'dns', 'tcp_t_address': '', 'tcp_t_port': ''}]
>
> MainThread::ERROR::2021-04-07 20:23:09,842::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
> MainThread::INFO::2021-04-07 20:23:09,842::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, April 7, 2021 5:36 PM, David White via Users <[email protected]> wrote:
>
> I'm working on setting up my environment prior to production, and have run into an issue.
>
> I got most things configured, but due to a limitation on one of my switches, I decided to change the management vlan that the hosts communicate on. Over the course of changing that vlan, I wound up resetting my router to default settings.
>
> I have the router operational again, and I also have 1 of my switches operational.
> Now, I'm trying to bring the oVirt cluster back online.
> This is oVirt 4.5 running on RHEL 8.3.
> The old vlan is 1, and the new vlan is 10.
>
> Currently, hosts 2 & 3 are accessible over the new vlan, and can ping each other.
> I'm able to ssh to both hosts, and when I run "gluster peer status", I see that they are connected to each other.
>
> However, host 1 is not accessible from anything. I can't ping it, and it cannot get out.
>
> As part of my troubleshooting, I've done the following:
> From the host console, I ran `nmcli connection delete` to delete the old vlan (VLAN 1).
> I moved the /etc/sysconfig/network-scripts/interface.1 file to interface.10, edited the file to make sure the vlan and device settings are set to 10 instead of 1, and rebooted the host.
>
> The engine seems to be running, but I don't understand why.
> From each of the hosts that are working (host 2 and host 3), I ran "hosted-engine --check-liveliness", and both hosts indicate that the engine is NOT running.
>
> Yet the engine loads in a web browser, and I'm able to log into /ovirt-engine/webadmin/.
> The engine thinks that all 3 hosts are nonresponsive. See screenshot below:
>
> [image: Screenshot from 2021-04-07 17-33-48.png]
>
> What I'm really looking for help with is to get the first host back online.
> Once it is healthy and gluster is healthy, I feel confident I can get the engine operational again.
>
> What else should I look for on this host?
>
> _______________________________________________
> Users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/privacy-policy.html
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/6TWZCFKAYF75GFCZQ4DBBWM53LHSWV2O/

--
Didi
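A note on the FileNotFoundError in the agent log quoted above: ovirt-ha-agent talks to ovirt-ha-broker over a local Unix socket, and [Errno 2] at connect time generally means the broker isn't running, so the socket file doesn't exist. A quick sequence of checks, assuming a 4.4-era EL8 host (the socket path is my assumption; verify it on your install):

    # the agent depends on the broker - check both
    systemctl status ovirt-ha-broker ovirt-ha-agent

    # does the broker socket exist? (path assumed)
    ls -l /var/run/ovirt-hosted-engine-ha/broker.socket

    # if the broker is down, start it first, then the agent
    systemctl restart ovirt-ha-broker ovirt-ha-agent

    # then re-check the hosted engine state from the host
    hosted-engine --vm-status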
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/FMMZBSH3SUNOVAQOG2KR7RL46JMSRH3C/

