I compared each file across my nodes and synced them. It seems to have resolved my issue.
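For the record, the compare-and-sync step was roughly the following (a
sketch, not an exact transcript; the file path is one of the OVF_STORE
meta files from the heal info output further down, and gluster0-2 are
the brick hosts):

  # compare one affected file across the three bricks
  f='d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta'
  for h in gluster0 gluster1 gluster2; do
      echo "== $h =="
      ssh "$h" "stat -c '%y %s' /gluster_bricks/engine/engine/$f; md5sum /gluster_bricks/engine/engine/$f"
  done

  # keep the newest copy; on each stale node move the old file aside,
  # then push the good copy over (this writes to the bricks directly,
  # as suggested below, so take backups first)
  ssh gluster1 "mv /gluster_bricks/engine/engine/$f /gluster_bricks/engine/engine/$f.bak"
  rsync -av gluster0:/gluster_bricks/engine/engine/$f gluster1:/gluster_bricks/engine/engine/$f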
I wonder if there is a problem with the 6.5 to 6.6 upgrade that is
causing this? It's strange that it seems to have happened to more than
one person, and I was also following the proper upgrade procedure.

On Sun, Dec 15, 2019 at 3:09 AM <[email protected]> wrote:

> I don't know. I had the same issues when I migrated my gluster from v6.5
> to 6.6 (currently running v7.0).
> Just get the newest file and rsync it to the rest of the bricks. It will
> solve the '?????? ??????' problem.
>
> Best Regards,
> Strahil Nikolov
>
> On Sunday, December 15, 2019 at 3:49:27 GMT+2, Jayme <[email protected]> wrote:
>
> On that page it says to check open bugs, and the migration bug you
> mention does not appear to be on the list. Has it been resolved, or is
> it just missing from this page?
>
> On Sat, Dec 14, 2019 at 7:53 PM Strahil Nikolov <[email protected]> wrote:
>
> Nah... this is not gonna fix your issue and is unnecessary.
> Just compare the data from all bricks ... most probably the 'Last
> Updated' is different and the gfid of the file is different.
> Find the brick that has the freshest data, and replace (move away as a
> backup and rsync) the file from the last good copy to the other bricks.
> You can also run a 'full heal'.
>
> Best Regards,
> Strahil Nikolov
>
> On Saturday, December 14, 2019 at 21:18:44 GMT+2, Jayme <[email protected]> wrote:
>
> *Update*
>
> The situation has improved. All VMs and the engine are running. I'm left
> right now with about 2 heal entries in each glusterfs storage volume
> that will not heal.
>
> In all cases each heal entry is related to an OVF_STORE image, and the
> problem appears to be an issue with the gluster metadata for those
> OVF_STORE images. When I look at the files shown in the gluster volume
> heal info output, I'm seeing question marks on the meta files, which
> indicates an attribute/gluster problem (even though there is no
> split-brain), and I get input/output errors when attempting to do
> anything with the files.
>
> If I look at the files on each host in /gluster_bricks they all look
> fine. I only see question marks on the meta files when looking at the
> files in the /rhev mounts.
>
> Does anyone know how I can correct the attributes on these OVF_STORE
> files? I've tried putting each host in maintenance and re-activating it
> to re-mount the gluster volumes. I've also stopped and started all
> gluster volumes.
>
> I'm thinking I might be able to solve this by shutting down all VMs and
> placing all hosts in maintenance and safely restarting the entire
> cluster... but that may not be necessary?
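A sketch of how to check for the gfid divergence Strahil describes
above, run on each gluster node against its local brick copy of one of
the affected meta files (getfattr comes with the attr package):

  # dump the gluster xattrs for the file straight off the brick
  getfattr -d -m . -e hex \
    /gluster_bricks/engine/engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta

Compare the trusted.gfid value across the three nodes: if the copies
disagree, the fuse client can't resolve the file, which is what shows up
as the question marks and input/output errors on the /rhev mounts.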
> On Fri, Dec 13, 2019 at 12:59 AM Jayme <[email protected]> wrote:
>
> I believe I was able to get past this by stopping the engine volume,
> then unmounting the glusterfs engine mount on all hosts and re-starting
> the volume. I was able to start the hosted engine on host0.
>
> I'm still facing a few problems:
>
> 1. I'm still seeing this issue in each host's logs:
>
> Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR
> Failed scanning for OVF_STORE due to Command Volume.getInfo with args
> {'storagepoolID': '00000000-0000-0000-0000-000000000000',
> 'storagedomainID': 'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID':
> u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID':
> u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201,
> message=Volume does not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))
> Dec 13 00:57:54 orchard0 journal: ovirt-ha-agent
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR
> Unable to identify the OVF_STORE volume, falling back to initial
> vm.conf. Please ensure you already added your first data domain for
> regular VMs
>
> 2. Most of my gluster volumes still have unhealed entries which I can't
> seem to heal. I'm not sure what the answer is here.
>
> On Fri, Dec 13, 2019 at 12:33 AM Jayme <[email protected]> wrote:
>
> I was able to get the hosted engine started manually via virsh after
> re-creating a missing symlink in /var/run/vdsm/storage -- I later shut
> it down and am still having the same problem with the HA broker
> starting. It appears that the problem *might* be a corrupt HA metadata
> file, although gluster is not reporting split-brain on the engine
> volume.
>
> I'm seeing this:
>
> ls -al /rhev/data-center/mnt/glusterSD/orchard0\:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
> ls: cannot access /rhev/data-center/mnt/glusterSD/orchard0:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/hosted-engine.metadata:
> Input/output error
> total 0
> drwxr-xr-x. 2 vdsm kvm  67 Dec 13 00:30 .
> drwxr-xr-x. 6 vdsm kvm  64 Aug  6  2018 ..
> lrwxrwxrwx. 1 vdsm kvm 132 Dec 13 00:30 hosted-engine.lockspace ->
> /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
> l?????????? ? ? ? ? ? hosted-engine.metadata
>
> This clearly shows some sort of issue with hosted-engine.metadata on
> the client mount.
>
> On each node in /gluster_bricks I see this:
>
> # ls -al /gluster_bricks/engine/engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/ha_agent/
> total 0
> drwxr-xr-x. 2 vdsm kvm  67 Dec 13 00:31 .
> drwxr-xr-x. 6 vdsm kvm  64 Aug  6  2018 ..
> lrwxrwxrwx. 2 vdsm kvm 132 Dec 13 00:31 hosted-engine.lockspace ->
> /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/03a8ee8e-91f5-4e06-904b-9ed92a9706eb/db2699ce-6349-4020-b52d-8ab11d01e26d
> lrwxrwxrwx. 2 vdsm kvm 132 Dec 12 16:30 hosted-engine.metadata ->
> /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>
> ls -al /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
> -rw-rw----. 1 vdsm kvm 1073741824 Dec 12 16:48
> /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317/a2250415-5ff0-42ab-8071-cd9d67c3048c
>
> I'm not sure how to proceed at this point. Do I have data corruption, a
> gluster split-brain issue, or something else? Maybe I just need to
> re-generate the metadata for the hosted engine?
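For context on the symlink re-creation mentioned above: vdsm keeps
per-image links under /var/run/vdsm/storage that point back at the image
directory on the storage-domain mount. A sketch of re-creating a missing
one, assuming that layout (the UUIDs are the ones from the listings
above; verify the correct target on a healthy host before creating
anything):

  # on a healthy host, see what the links are supposed to look like
  ls -l /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/

  # on the broken host, re-create the missing link; the assumed target
  # is the image directory under the gluster storage-domain mount
  ln -s \
    '/rhev/data-center/mnt/glusterSD/orchard0:_engine/d70b171e-7488-4d52-8cad-bbc581dbf16e/images/66bf05fa-bf50-45ec-98d8-d00002040317' \
    /var/run/vdsm/storage/d70b171e-7488-4d52-8cad-bbc581dbf16e/66bf05fa-bf50-45ec-98d8-d00002040317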
> On Thu, Dec 12, 2019 at 6:36 PM Jayme <[email protected]> wrote:
>
> I'm running a three-server HCI, up and running on 4.3.7 with no
> problems. Today I updated to 4.3.8. The engine upgraded fine; rebooted.
>
> The first host updated fine; rebooted and let all gluster volumes heal.
> Put the second host in maintenance, upgraded successfully, rebooted.
> Waited for the gluster volumes to heal for over an hour, but the heal
> process was not completing. I tried restarting the gluster services as
> well as the host, with no success.
>
> I'm in a state right now where there are pending heals on almost all of
> my volumes. Nothing is reporting split-brain, but the heals are not
> completing.
>
> All VMs are still currently running except the hosted engine. The
> hosted engine was running, but on the 2nd host I upgraded I was seeing
> errors such as:
>
> Dec 12 16:34:39 orchard2 journal: ovirt-ha-agent
> ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR
> Failed scanning for OVF_STORE due to Command Volume.getInfo with args
> {'storagepoolID': '00000000-0000-0000-0000-000000000000',
> 'storagedomainID': 'd70b171e-7488-4d52-8cad-bbc581dbf16e', 'volumeID':
> u'2632f423-ed89-43d9-93a9-36738420b866', 'imageID':
> u'd909dc74-5bbd-4e39-b9b5-755c167a6ee8'} failed:#012(code=201,
> message=Volume does not exist: (u'2632f423-ed89-43d9-93a9-36738420b866',))
>
> I shut down the engine VM and attempted a manual heal on the engine
> volume. I cannot start the engine on any host now. I get:
>
> The hosted engine configuration has not been retrieved from shared
> storage. Please ensure that ovirt-ha-agent is running and the storage
> server is reachable.
>
> I'm seeing ovirt-ha-agent crashing on all three nodes:
>
> Dec 12 18:30:48 orchard0 python: detected unhandled Python exception in
> '/usr/share/ovirt-hosted-engine-ha/ovirt-ha-broker'
> Dec 12 18:30:48 orchard0 abrt-server: Duplicate: core backtrace
> Dec 12 18:30:48 orchard0 abrt-server: DUP_OF_DIR:
> /var/tmp/abrt/Python-2019-03-14-21:02:52-44318
> Dec 12 18:30:48 orchard0 abrt-server: Deleting problem directory
> Python-2019-12-12-18:30:48-23193 (dup of Python-2019-03-14-21:02:52-44318)
> Dec 12 18:30:49 orchard0 vdsm[6087]: ERROR failed to retrieve Hosted
> Engine HA score '[Errno 2] No such file or directory' Is the Hosted
> Engine setup finished?
> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service: main process
> exited, code=exited, status=1/FAILURE
> Dec 12 18:30:49 orchard0 systemd: Unit ovirt-ha-broker.service entered
> failed state.
> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service failed.
> Dec 12 18:30:49 orchard0 systemd: ovirt-ha-broker.service holdoff time
> over, scheduling restart.
> Dec 12 18:30:49 orchard0 systemd: Cannot add dependency job for unit
> lvm2-lvmetad.socket, ignoring: Unit is masked.
> Dec 12 18:30:49 orchard0 systemd: Stopped oVirt Hosted Engine High
> Availability Communications Broker.
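A quick way to watch the HA services while chasing a crash loop like
this, using the standard unit names (a sketch, nothing exotic):

  # check service state and the broker's recent journal on each host
  systemctl status ovirt-ha-broker ovirt-ha-agent
  journalctl -u ovirt-ha-broker --since '10 minutes ago'

  # overall HA view, once the broker manages to stay up
  hosted-engine --vm-status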
> Here is what gluster volume heal info on the engine volume looks like;
> it's similar on other volumes as well (although more heals are pending
> on some of those):
>
> gluster volume heal engine info
> Brick gluster0:/gluster_bricks/engine/engine
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
> Status: Connected
> Number of entries: 4
>
> Brick gluster1:/gluster_bricks/engine/engine
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/d909dc74-5bbd-4e39-b9b5-755c167a6ee8/2632f423-ed89-43d9-93a9-36738420b866.meta
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/master/tasks/a9b11e33-9b93-46a0-a36e-85063fd53ebe.backup
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/images/053171e4-f782-42d7-9115-c602beb3c826/627b8f93-5373-48bb-bd20-a308a455e082.meta
> /d70b171e-7488-4d52-8cad-bbc581dbf16e/dom_md/ids
> Status: Connected
> Number of entries: 4
>
> Brick gluster2:/gluster_bricks/engine/engine
> Status: Connected
> Number of entries: 0
>
> I don't see much in vdsm.log, and the gluster logs look fairly normal
> to me; I'm not seeing any obvious errors in them.
>
> As far as I can tell the underlying storage is fine. Why are my gluster
> volumes not healing, and why is the self-hosted engine failing to
> start?
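A sketch of forcing a full heal and watching the pending counts come
down, which is also what gets suggested further up-thread ('heal info
summary' is available on the gluster 6.x in use here; plain 'heal info'
works everywhere):

  # trigger a full self-heal crawl on the engine volume
  gluster volume heal engine full

  # re-check the per-brick pending counts every 10 seconds
  watch -n 10 'gluster volume heal engine info summary'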
"/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > line 55, in action_proper > return he.start_monitoring() > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 432, in start_monitoring > self._initialize_broker() > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 556, in _initialize_broker > m.get('options', {})) > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", > line 89, in start_monitor > ).format(t=type, o=options, e=e) > RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: > [Errno 2] No such file or directory, [monitor: 'network', options: > {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': > '10.11.0.254'}] > > MainThread::ERROR::2019-12-12 > 18:36:19,619::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) > Trying to restart agent > MainThread::ERROR::2019-12-12 > 18:36:30,568::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) > Failed to start necessary monitors > MainThread::ERROR::2019-12-12 > 18:36:30,570::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) > Traceback (most recent call last): > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > line 131, in _run_agent > return action(he) > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > line 55, in action_proper > return he.start_monitoring() > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 432, in start_monitoring > self._initialize_broker() > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 556, in _initialize_broker > m.get('options', {})) > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", > line 89, in start_monitor > ).format(t=type, o=options, e=e) > RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: > [Errno 2] No such file or directory, [monitor: 'network', options: > {'tcp_t_address': None, 'network_test': None, 'tcp_t_port': None, 'addr': > '10.11.0.254'}] > > MainThread::ERROR::2019-12-12 > 18:36:30,570::agent::145::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) > Trying to restart agent > MainThread::ERROR::2019-12-12 > 18:36:41,581::hosted_engine::559::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_broker) > Failed to start necessary monitors > MainThread::ERROR::2019-12-12 > 18:36:41,583::agent::144::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) > Traceback (most recent call last): > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > line 131, in _run_agent > return action(he) > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/agent.py", > line 55, in action_proper > return he.start_monitoring() > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 432, in start_monitoring > self._initialize_broker() > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", > line 556, in _initialize_broker > m.get('options', {})) > File > "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", > line 89, in start_monitor > ).format(t=type, o=options, e=e) > RequestError: brokerlink - failed to start monitor via ovirt-ha-broker: > [Errno 2] No such file or 
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/A2WDQBXRNXL3UFH67WA7RQ7ODW6JTMOM/

