I didn't report it because nobody else had mentioned it and I thought it was a one-time issue.
I am now quite confident that it is a bug. Are you using the gluster FUSE mounts (the ones in /rhev...) or libgfapi? Can you open a case at bugzilla.redhat.com?

Best Regards,
Strahil Nikolov

On Dec 15, 2019 13:16, Jayme <[email protected]> wrote:
>
> I compared each file across my nodes and synced them. It seems to have
> resolved my issue.
>
> I wonder if there is a problem with 6.5 to 6.6 upgrade that is causing the
> problem? It’s strange that it seems to have happened to more than one
> person. I was also following proper upgrade procedure.
>
> On Sun, Dec 15, 2019 at 3:09 AM <[email protected]> wrote:
>>
>> I don't know. I had the same issues when I migrated my gluster from v6.5 to
>> 6.6 (currently running v7.0).
>> Just get the newest file and rsync it to the rest of the bricks. It will
>> solve the '?????? ??????' problem.
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Sunday, December 15, 2019, 03:49:27 GMT+2, Jayme
>> <[email protected]> wrote:
>>
>> On that page it says to check open bugs and the migration bug you mention
>> does not appear to be on the list. Has it been resolved or is it just
>> missing from this page?
>>
>> On Sat, Dec 14, 2019 at 7:53 PM Strahil Nikolov <[email protected]>
>> wrote:
>>>
>>> Nah... this is not gonna fix your issue and is unnecessary.
>>> Just compare the data from all bricks ... most probably the 'Last Updated'
>>> is different and the gfid of the file is different.
>>> Find the brick that has the most fresh data, and replace (move away as a
>>> backup and rsync) the file from last good copy to the other bricks.
>>> You can also run a 'full heal'.
>>>
>>> Best Regards,
>>> Strahil Nikolov
>>>
>>> On Saturday, December 14, 2019, 21:18:44 GMT+2, Jayme
>>> <[email protected]> wrote:
>>>
>>> *Update*
>>>
>>> Situation has improved. All VMs and engine are running. I'm left right
>>> now with about 2 heal entries in each glusterfs storage volume that will
>>> not heal.
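The advice quoted above (compare the copy of the file on every brick, find the freshest one, then back up and replace the others) could be checked with something like the sketch below. It is only an illustration under assumptions: the function and class names are made up for this example, it expects the three copies to be readable from one place (for instance after fetching them from each host's /gluster_bricks directory), and it uses modification time as the freshness signal, which may not match the 'Last Updated' value stored inside an oVirt .meta file. Reading the trusted.gfid attribute normally requires root on the brick host, so the sketch treats it as optional.

```python
import hashlib
import os
from typing import List, NamedTuple, Optional


class BrickCopy(NamedTuple):
    """One brick's copy of the same file (hypothetical helper type)."""
    path: str
    mtime: float
    sha256: str
    gfid: Optional[str]


def read_gfid(path: str) -> Optional[str]:
    """Return the trusted.gfid xattr as hex, or None if it cannot be read."""
    try:
        return os.getxattr(path, "trusted.gfid").hex()
    except (OSError, AttributeError):
        # Not root, attribute missing, or platform without os.getxattr.
        return None


def survey_copies(paths: List[str]) -> List[BrickCopy]:
    """Collect mtime, checksum and gfid for each copy."""
    copies = []
    for path in paths:
        with open(path, "rb") as fh:  # .meta files are small; read whole
            digest = hashlib.sha256(fh.read()).hexdigest()
        copies.append(
            BrickCopy(path, os.stat(path).st_mtime, digest, read_gfid(path))
        )
    return copies


def freshest(copies: List[BrickCopy]) -> BrickCopy:
    """Pick the most recently modified copy."""
    return max(copies, key=lambda c: c.mtime)


def stale_copies(copies: List[BrickCopy]) -> List[BrickCopy]:
    """Copies whose content differs from the freshest one."""
    best = freshest(copies)
    return [c for c in copies if c.sha256 != best.sha256]
```

The sketch only reports which copies differ. Moving the stale copies away as a backup and syncing the good one over them is left as the manual step described in the thread, since that is the part worth doing by hand.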
>>>
>>> In all cases each heal entry is related to an OVF_STORE image and the
>>> problem appears to be an issue with the gluster metadata for those
>>> ovf_store images. When I look at the files shown in gluster volume heal
>>> info output I'm seeing question marks on the meta files which indicates an
>>> attribute/gluster problem (even though there is no split-brain). And I get
>>> input/output error when attempting to do anything with the files.
>>>
>>> If I look at the files on each host in /gluster_bricks they all look fine.
>>> I only see question marks on the meta files when I look at the file in /rhev
>>> mounts
>>>
>>> Does anyone know how I can correct the attributes on these OVF_STORE files?
>>> I've tried putting each host in maintenance and re-activating to re-mount
>>> gluster volumes. I've also stopped and started all gluster volumes.
>>>
>>> I'm thinking I might be able to solve this by shutting down all VMs and
>>> placing all hosts in maintenance and safely restarting the entire cluster..
>>> but that may not be necessary?
>>>
>>> On Fri, Dec 13, 2019 at 12:59 AM Jayme <[email protected]> wrote:
>>>>
>>>> I believe I was able to get past this by stopping the engine volume then
>>>> unmounting the glusterfs engine mount on all hosts and re-starting the
>>>> volume. I was able to start hostedengine on host0.
>>>>
>>>> I'm still facing a few problems:
>>>>
>>>> 1. I'm still seeing this issue in each host's logs:
>>>>
>>>> Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID': u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID': u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201, message=Volume does not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))
>>>> Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Unable to identify the OVF_STORE volume, falling back to initial vm.conf. Please ensure you already added your first data domain for regular VMs
>>>>
>>>> 2. Most of my gluster volumes still have un-healed entries which I can't
>>>> seem to heal. I'm not sure what the answer is here.
>>>>
>>>> On Fri, Dec 13, 2019 at 12:33 AM Jayme <[email protected]> wrote:
>>>>>
>>>>> I was able to get the hosted engine started manually via Virsh after
>>>>> re-creating a missing symlink in /var/run/vdsm/storage -- I later shut it
>>>>> down and am still having the same problem with ha broker starting. It
>>>>> appears that the problem *might* be with a corrupt ha metadata file,
>>>>> although gluster is not stating there is split brain on the engine volume
>>>>>
>>>>> I'm seeing this:
>>>>>
>>>>> ls -al /rhev/data-center/mnt/glusterSD/orchard0\:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
>>>>> ls: cannot access /rhev/data-center/mnt/glusterSD/orchard0:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/hosted-engine.metadata: Input/output error
>>>>> total 0
>>>>> drwxr-xr-x. 2 vdsm kvm 67 Dec 13 00:30 .
>>>>> drwxr-xr-x. 6 vdsm kvm 64 Aug 6 2018 ..
>>>>> lrwxrwxrwx. 1 vdsm kvm 132 Dec 13 00:30 hosted-engine.lockspace -> /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
>>>>> l?????????? ? ? ? ? ? hosted-engine.metadata
>>>>>
>>>>> Clearly showing some sort of issue with hosted-engine.metadata on the
>>>>> client mount.
>>>>>
>>>>> on each node in /gluster_bricks I see this:
>>>>>
>>>>> # ls -al /gluster_bricks/engine/engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
>>>>> total 0
>>>>> drwxr-xr-x. 2 vdsm kvm 67 Dec 13 00:31 .
>>>>> drwxr-xr-x. 6 vdsm kvm 64 Aug 6 2018 ..
>>>>> lrwxrwxrwx. 2 vdsm kvm 132 Dec 13 00:31 hosted-engine.lockspace -> /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
>>>>> lrwxrwxrwx. 2 vdsm kvm 132 Dec 12 16:30 hosted-engine.metadata -> /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>>>>>
>>>>> ls -al /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>>>>> -rw-rw----. 1 vdsm kvm 1073741824 Dec 12 16:48 /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>>>>>
>>>>> I'm not sure how to proceed at this point. Do I have data corruption, a
>>>>> gluster split-brain issue or something else? Maybe I just need to
>>>>> re-generate metadata for the hosted engine?
>>>>>
>>>>> On Thu, Dec 12, 2019 at 6:36 PM Jayme <[email protected]> wrote:
>>>>>>
>>>>>> I'm running a three server HCI. Up and running on 4.3.7 with no
>>>>>> problems. Today I updated to 4.3.8. Engine upgraded fine, rebooted.
>>>>>> First host updated fine, rebooted and let all gluster volumes heal. Put
>>>>>> second host in maintenance, upgraded successfully, rebooted. Waited for
>>>>>> gluster volumes to heal for over an hour but the heal process was not
>>>>>> completing. I tried restarting gluster services as well as the host
>>>>>> with no success.
>>>>>>
>>>>>> I'm in a state right now where there are pending heals on almost all of
>>>>>> my volumes. Nothing is reporting split-brain, but the heals are not
>>>>>> completing.
>>>>>>
>>>>>> All vms are still currently running except hosted engine. Hosted engine
>>>>>> was running but on the 2nd host I upgraded I was seeing errors such as:
>>>>>>
>>>>>> Dec 12 16:34:39 orchard2 journal: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed scanning for OVF_STORE due to Command Volume.getInfo with args {'storagepoolID': '00000000-0000-0000-0000-000000000000', 'storagedomainID': 'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID': u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID': u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201, message=Volume does not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))
>>>>>>
>>>>>> I shut down the engine VM and attempted a manual heal on the engine
>>>>>> volume. I cannot start the engine on any host now. I get:
>>>>>>
>>>>>> The hosted engine configuration has not been retrieved from shared
>>>>>> storage. Please ensure that ovirt-ha-agent is running and the storage
>>>>>> server is reachable.
>>>>>>
>>>>>> I'm seeing ovirt-ha-agent crashing on all three nodes:
>>>>>>
>>>>>> Dec 12 18:30:48 orchard0 python: detected unhandled Python exception in '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
>>>>>> Dec 12 18:30:48 orchard0 abrt-server: Duplicate: core backtrace
>>>>>> Dec 12 18:30:48 orchard0 abrt-server: DUP_OF_DIR: /var/tmp/abrt/Python-2019-03-14-21:02:52-44318
>>>>>> Dec 12 18:30:48 orchard0 abrt-server: Deleting problem directory Python-2019-12-12-18:30:48-23193 (dup of Python-2019-03-14-21:02:52-44318)
>>>>>> Dec 12 18:30:49 orchard0 vdsm[6087]: ERROR failed to retrieve Hosted Engine HA score '[Errno 2] No such file or directory'Is the Hosted Engine setup finished?
>>>>>> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service: main process exited, code=exited, status=1/FAILURE
>>>>>> Dec 12 18:30:49 orchard0 systemd: Unit ovirt-ha-broker.service entered failed state.
>>>>>> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service failed.
>>>>>> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service holdoff time over, scheduling restart.
>>>>>> Dec 12 18:30:49 orchard0 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
>>>>>> Dec 12 18:30:49 orchard0 systemd: Stopped oVirt Hosted Engine High Availability Communications Broker.
>>>>>>
>>>>>> Here is what gluster volume heal info on engine looks like, it's similar
>>>>>> on other volumes as well (although more heals pending on some of those):
>>>>>>
>>>>>> gluster volume heal engine info
>>>>>> Brick gluster0:/gluster_bricks/engine/engine
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
>>>>>> Status: Connected
>>>>>> Number of entries: 4
>>>>>>
>>>>>> Brick gluster1:/gluster_bricks/engine/engine
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
>>>>>> /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
>>>>>> Status: Connected
>>>>>> Number of entries: 4
>>>>>>
>>>>>> Brick gluster2:/gluster_bricks/engine/engine
>>>>>> Status: Connected
>>>>>> Number of entries: 0
>>>>>>
>>>>>> I don't see much in vdsm.log and gluster logs look fairly normal to me,
>>>>>> I'm not seeing any obvious errors in the gluster logs.
>>>>>>
>>>>>> As far as I can tell the underlying storage is fine. Why are my gluster
>>>>>> volumes not healing and why is self-hosted engine failing to start?
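When the same heal info check has to be repeated across several volumes, the output quoted above can be summarised per brick with a small parser. The following is a rough sketch, not an official tool: the function names are invented for this example, and the parser assumes the plain-text layout shown in this thread (a "Brick ..." header, one entry per line, then "Status" and "Number of entries" lines). The sample text is copied from the engine volume output above.

```python
from collections import defaultdict

# Sample copied from the `gluster volume heal engine info` output in this thread.
SAMPLE = """\
Brick gluster0:/gluster_bricks/engine/engine
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
/d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
Status: Connected
Number of entries: 4

Brick gluster1:/gluster_bricks/engine/engine
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
/d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
Status: Connected
Number of entries: 4

Brick gluster2:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0
"""


def parse_heal_info(text):
    """Return {brick: [entries]} from heal info text in the layout above."""
    bricks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            current = line[len("Brick "):]
            bricks[current] = []
        elif current is not None and line.startswith(("/", "<gfid:")):
            # Entries are listed either as paths or as <gfid:...> placeholders.
            bricks[current].append(line)
        # "Status:", "Number of entries:" and blank lines are skipped.
    return bricks


def bricks_per_entry(parsed):
    """Invert the mapping: for each entry, which bricks list it as pending."""
    by_entry = defaultdict(list)
    for brick, entries in parsed.items():
        for entry in entries:
            by_entry[entry].append(brick)
    return {entry: sorted(names) for entry, names in by_entry.items()}


if __name__ == "__main__":
    for entry, names in sorted(bricks_per_entry(parse_heal_info(SAMPLE)).items()):
        print(len(names), "brick(s):", entry)
```

For the sample, every one of the four entries is listed by gluster0 and gluster1 and by neither of them on gluster2. Grouping by entry rather than by brick makes that pattern easier to spot when more volumes and more pending heals are involved.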
>>>>>>
>>>>>> More agent and broker logs:
>>>>>>
>>>>>> ==> agent.log <==
>>>>>> MainThread::ERROR::2019-12-12 18:36:09,056::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
>>>>>> MainThread::ERROR::2019-12-12 18:36:09,058::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>>>>>>     return action(he)
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>>>>>>     return he.start_monitoring()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
>>>>>>     self._initialize_broker()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
>>>>>>     m.get('options', {}))
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
>>>>>>     ).format(t=type, o=options, e=e)
>>>>>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>>
>>>>>> MainThread::ERROR::2019-12-12 18:36:09,058::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>>>>>> MainThread::ERROR::2019-12-12 18:36:19,619::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
>>>>>> MainThread::ERROR::2019-12-12 18:36:19,619::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>>>>>>     return action(he)
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>>>>>>     return he.start_monitoring()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
>>>>>>     self._initialize_broker()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
>>>>>>     m.get('options', {}))
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
>>>>>>     ).format(t=type, o=options, e=e)
>>>>>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>>
>>>>>> MainThread::ERROR::2019-12-12 18:36:19,619::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>>>>>> MainThread::ERROR::2019-12-12 18:36:30,568::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
>>>>>> MainThread::ERROR::2019-12-12 18:36:30,570::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>>>>>>     return action(he)
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>>>>>>     return he.start_monitoring()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
>>>>>>     self._initialize_broker()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
>>>>>>     m.get('options', {}))
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
>>>>>>     ).format(t=type, o=options, e=e)
>>>>>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>>
>>>>>> MainThread::ERROR::2019-12-12 18:36:30,570::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>>>>>> MainThread::ERROR::2019-12-12 18:36:41,581::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) Failed to start necessary monitors
>>>>>> MainThread::ERROR::2019-12-12 18:36:41,583::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Traceback (most recent call last):
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 131, in _run_agent
>>>>>>     return action(he)
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 55, in action_proper
>>>>>>     return he.start_monitoring()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 432, in start_monitoring
>>>>>>     self._initialize_broker()
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 556, in _initialize_broker
>>>>>>     m.get('options', {}))
>>>>>>   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 89, in start_monitor
>>>>>>     ).format(t=type, o=options, e=e)
>>>>>> RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: [Errno 2] No such file or directory, [monitor: 'network', options: {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': '10.11.0.254'}]
>>>>>>
>>>>>> MainThread::ERROR::2019-12-12 18:36:41,583::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Trying to restart agent
>>>>>>
>>> _______________________________________________
>>> Users mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/U5YFDWCQJYNALSVNPZG4FLUO7KB2Z2XI/

