Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-02-05 Thread Christopher Cox
Answering my own post... a restart of vdsmd on the affected blade has 
fixed everything.  Thanks everyone who helped.
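
For reference, a minimal sketch of that recovery step, assuming a systemd-managed
EL7-style oVirt 3.6 node (the wrapper itself is only illustrative; restarting vdsmd
does not touch the running qemu processes, only the management agent reconnects):

# Hedged sketch: restart vdsmd on the affected blade and show its unit state.
import subprocess

def restart_vdsmd():
    # Restart the VDSM daemon; running VMs keep running, only management reconnects.
    subprocess.check_call(["systemctl", "restart", "vdsmd"])
    # Print the resulting unit status so we can confirm it is active again.
    subprocess.check_call(["systemctl", "status", "vdsmd", "--no-pager"])

if __name__ == "__main__":
    restart_vdsmd()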



On 02/05/2018 10:02 AM, Christopher Cox wrote:
Forgive the top post.  I guess what I need to know now is whether there 
is a recovery path that doesn't lead to total loss of the VMs that are 
currently in the "Unknown" "Not responding" state.


We are planning a total oVirt shutdown.  I just would like to know if 
we've effectively lost those VMs or not.  Again, the VMs are currently 
"up".  And we use a file backup process, so in theory they can be 
restored, just somewhat painfully, from scratch.


But if we shut down all the bad VMs and the blade, is there some way 
oVirt can know the VMs are "ok" to start up?  Will changing their state 
directly to "down" in the db stick if the blade is down?  That is, will 
we get to a known state where the VMs can actually be started and brought 
back up cleanly?
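
Since changing the db by hand is discouraged further down in this thread, anything
at the db level is safer kept read-only.  A hedged sketch (assuming the default
"engine" database name, the usual vm_static/vm_dynamic tables joined on vm_guid,
and placeholder credentials) for watching what status value the engine keeps
recording for each VM:

# Read-only look at the engine's recorded VM states; nothing is modified.
import psycopg2

conn = psycopg2.connect(dbname="engine", user="engine",
                        host="localhost", password="CHANGE_ME")  # placeholder credentials
cur = conn.cursor()
# vm_static carries the VM names, vm_dynamic the runtime status (8 is the value
# showing up as "Unknown"/"Not responding" in this thread, 0 the one set for Down).
cur.execute("""
    SELECT s.vm_name, d.status
      FROM vm_dynamic d
      JOIN vm_static s ON s.vm_guid = d.vm_guid
     ORDER BY s.vm_name
""")
for vm_name, status in cur.fetchall():
    print(vm_name, status)
conn.close()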


Right now, we're feeling there's a good chance we will not be able to 
recover these VMs, even though they are "up" right now.  I really need 
some way to force oVirt into an integral state, even if it means we take 
the whole thing down.


Possible?


On 01/25/2018 06:57 PM, Christopher Cox wrote:



On 01/25/2018 04:57 PM, Douglas Landgraf wrote:
On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox wrote:

On 01/25/2018 02:25 PM, Douglas Landgraf wrote:

On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox wrote:


Would restarting vdsm on the node in question help fix this?  Again, all 
the VMs are up on the node.  Prior attempts to fix this problem have left 
the node in a state where I can issue the "has been rebooted" command to 
it, it's confused.

So... node is up.  All VMs are up.  Can't issue "has been rebooted" to the 
node, all VMs show Unknown and not responding but they are up.

Changing the status in the ovirt db to 0 works for a second and then it 
goes immediately back to 8 (which is why I'm wondering if I should restart 
vdsm on the node).



It's not recommended to change db manually.



Oddly enough, we're running all of this in production.  So, watching it 
all go down isn't the best option for us.

Any advice is welcome.




We would need to see the node/engine logs.  Have you found any errors in 
the vdsm.log (from the nodes) or engine.log?  Could you please share the error?
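
A hedged sketch for pulling out the relevant lines, assuming the default log
locations (/var/log/vdsm/vdsm.log on each node, /var/log/ovirt-engine/engine.log
on the manager); the helper below is only illustrative:

# Print ERROR/WARNING lines from a vdsm or engine log so they can be shared.
import re

def interesting_lines(path, pattern=re.compile(r"ERROR|WARN")):
    # Yield only the lines matching the pattern, stripped of trailing newlines.
    with open(path) as f:
        for line in f:
            if pattern.search(line):
                yield line.rstrip()

for line in interesting_lines("/var/log/vdsm/vdsm.log"):
    print(line)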




In short, the error is that our ovirt manager lost network (our problem) and 
crashed hard (hardware issue on the server).  On bring-up, we had some 
network changes (the ones that caused the lost-network problem), so our LACP 
bond was down for a bit while we were trying to bring it up (noting the ovirt 
manager is up while we're re-establishing the network on the switch side).

In other words, that's the "error" so to speak that got us to where we are.

Full DEBUG is enabled on the logs... The error messages seem obvious to me.  
It starts like this (noting the ISO DOMAIN was coming off an NFS mount off 
the ovirt management server... yes... we know... we do have plans to move that).


So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):

(hopefully no surprise here)

Thread-2426633::WARNING::2018-01-23 13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 735, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 536, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 421, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-27::ERROR::2018-01-23 13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain) domain e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
Thread-27::ERROR::2018-01-23 13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File 

Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-02-05 Thread Christopher Cox
Forgive the top post.  I guess what I need to know now is whether there 
is a recovery path that doesn't lead to total loss of the VMs that are 
currently in the "Unknown" "Not responding" state.


We are planning a total oVirt shutdown.  I just would like to know if 
we've effectively lost those VMs or not.  Again, the VMs are currently 
"up".  And we use a file backup process, so in theory they can be 
restored, just somewhat painfully, from scratch.


But if we shut down all the bad VMs and the blade, is there some way 
oVirt can know the VMs are "ok" to start up?  Will changing their state 
directly to "down" in the db stick if the blade is down?  That is, will 
we get to a known state where the VMs can actually be started and brought 
back up cleanly?


Right now, we're feeling there's a good chance we will not be able to 
recover these VMs, even though they are "up" right now.  I really need 
some way to force oVirt into an integral state, even if it means we take 
the whole thing down.


Possible?


On 01/25/2018 06:57 PM, Christopher Cox wrote:



On 01/25/2018 04:57 PM, Douglas Landgraf wrote:
On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox wrote:

On 01/25/2018 02:25 PM, Douglas Landgraf wrote:

On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox wrote:


Would restarting vdsm on the node in question help fix this?  Again, all 
the VMs are up on the node.  Prior attempts to fix this problem have left 
the node in a state where I can issue the "has been rebooted" command to 
it, it's confused.

So... node is up.  All VMs are up.  Can't issue "has been rebooted" to the 
node, all VMs show Unknown and not responding but they are up.

Changing the status in the ovirt db to 0 works for a second and then it 
goes immediately back to 8 (which is why I'm wondering if I should restart 
vdsm on the node).



It's not recommended to change db manually.



Oddly enough, we're running all of this in production.  So, watching it 
all go down isn't the best option for us.

Any advice is welcome.

Any advice is welcome.




We would need to see the node/engine logs.  Have you found any errors in 
the vdsm.log (from the nodes) or engine.log?  Could you please share the error?




In short, the error is that our ovirt manager lost network (our problem) and 
crashed hard (hardware issue on the server).  On bring-up, we had some 
network changes (the ones that caused the lost-network problem), so our LACP 
bond was down for a bit while we were trying to bring it up (noting the ovirt 
manager is up while we're re-establishing the network on the switch side).

In other words, that's the "error" so to speak that got us to where we are.

Full DEBUG is enabled on the logs... The error messages seem obvious to me.  
It starts like this (noting the ISO DOMAIN was coming off an NFS mount off 
the ovirt management server... yes... we know... we do have plans to move that).


So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):

(hopefully no surprise here)

Thread-2426633::WARNING::2018-01-23 13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 735, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 536, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 421, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-27::ERROR::2018-01-23 13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain) domain e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
Thread-27::ERROR::2018-01-23 13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest

Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-01-29 Thread Giuseppe Ragusa
From: users-boun...@ovirt.org on behalf of Christopher Cox
Sent: Friday, January 26, 2018 01:57
To: dougsl...@redhat.com
Cc: users
Subject: Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad 
way and all VMs for one node marked Unknown and Not Responding while up

>On 01/25/2018 04:57 PM, Douglas Landgraf wrote:
>> On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox  wrote:
>>> On 01/25/2018 02:25 PM, Douglas Landgraf wrote:

 On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox 
 wrote:
>

 Probably it's time to think to upgrade your environment from 3.6.
>>>
>>>
>>> I know.  But from a production standpoint mid-2016 wasn't that long ago.
>>> And 4 was just coming out of beta at the time.
>>>
>>> We were upgrading from 3.4 to 3.6.  And it took a long time (again, because
>>> it's all "live").  Trust me, the move to 4.0 was discussed, it was just a
>>> timing thing.
>>>
>>> With that said, I do "hear you" and certainly it's being discussed. We
>>> just don't see a "good" migration path... we see a slow path (moving nodes
>>> out, upgrading, etc.) and knowing that as with all things, nobody can
>>> guarantee "success", which would be a very bad thing.  So going from working
>>> 3.6 to totally (potential) broken 4.2, isn't going to impress anyone here,
>>> you know?  If all goes according to our best guesses, then great, but when
>>> things go bad, and the chance is not insignificant, well... I'm just not
>>> quite prepared with my résumé if you know what I mean.
>>>
>>> Don't get me wrong, our move from 3.4 to 3.6 had some similar risks, but we
>>> also migrated to whole new infrastructure, a luxury we will not have this
>>> time.  And somehow 3.4 to 3.6 doesn't sound as risky as 3.6 to 4.2.
>>
>> I see your concern. However, keeping your system updated with recent
>> software is something I would recommend. You could set up a parallel
>> 4.2 env and move the VMs slowly from 3.6.
>
>Understood.  But would people want software that changes so quickly?
>This isn't like moving from RH 7.2 to 7.3 in a matter of months, it's
>more like moving from major release to major release in a matter of
>months and doing it again potentially in a matter of months.  Granted we're
>running oVirt and not RHV, so maybe we should be on the Fedora style
>upgrade plan.  Just not conducive to an enterprise environment (oVirt
>people, stop laughing).

The analogy you made is exactly on point: I think that, given the
maturity of the oVirt project, the time has come to complete the picture ;-)

RHEL -> CentOS

RHV -> ???

Note: I should mention RHGS too (or at least a subset) because we have the
oVirt hyperconverged setup to care for (RHHI)

So: is anyone interested in the rebuild of RHV/RHGS upstream packages?

If there is interest, I think that the proper path would be to join the CentOS
Virtualization SIG and perform the proposal/work there.

Best regards,
Giuseppe

>>> Is there a path from oVirt to RHEV?  Every bit of help we get helps us in
>>> making that decision as well, which I think would be a very good thing for
>>> both of us. (I inherited all this oVirt and I was the "guy" doing the 3.4 to
>>> 3.6 with the all new infrastructure).
>>
>> Yes, you can import your setup to RHEV.
>
>Good to know. Because of the fragility (support-wise... I mean our
>oVirt has been rock solid, apart from rare glitches like this), we may
>follow this path.




Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-01-25 Thread Christopher Cox



On 01/25/2018 04:57 PM, Douglas Landgraf wrote:

On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox  wrote:

On 01/25/2018 02:25 PM, Douglas Landgraf wrote:


On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox wrote:


Would restarting vdsm on the node in question help fix this?  Again, all 
the VMs are up on the node.  Prior attempts to fix this problem have left 
the node in a state where I can issue the "has been rebooted" command to 
it, it's confused.

So... node is up.  All VMs are up.  Can't issue "has been rebooted" to the 
node, all VMs show Unknown and not responding but they are up.

Changing the status in the ovirt db to 0 works for a second and then it 
goes immediately back to 8 (which is why I'm wondering if I should restart 
vdsm on the node).



It's not recommended to change db manually.



Oddly enough, we're running all of this in production.  So, watching it 
all go down isn't the best option for us.

Any advice is welcome.




We would need to see the node/engine logs.  Have you found any errors in 
the vdsm.log (from the nodes) or engine.log?  Could you please share the error?




In short, the error is that our ovirt manager lost network (our problem) and 
crashed hard (hardware issue on the server).  On bring-up, we had some 
network changes (the ones that caused the lost-network problem), so our LACP 
bond was down for a bit while we were trying to bring it up (noting the ovirt 
manager is up while we're re-establishing the network on the switch side).

In other words, that's the "error" so to speak that got us to where we are.

Full DEBUG is enabled on the logs... The error messages seem obvious to me.  
It starts like this (noting the ISO DOMAIN was coming off an NFS mount off 
the ovirt management server... yes... we know... we do have plans to move that).

So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):

(hopefully no surprise here)

Thread-2426633::WARNING::2018-01-23
13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles) Could not
collect metadata file for domain path
/rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844
Traceback (most recent call last):
   File "/usr/share/vdsm/storage/fileSD.py", line 735, in collectMetaFiles
 sd.DOMAIN_META_DATA))
   File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
 return self._iop.glob(pattern)
   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 536,
in glob
 return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 421,
in _sendCommand
 raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-27::ERROR::2018-01-23
13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain) domain
e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
Traceback (most recent call last):
   File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
 dom = findMethod(sdUUID)
   File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
 return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
   File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
 raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist:
(u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
Thread-27::ERROR::2018-01-23
13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error
monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
Traceback (most recent call last):
   File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
 self._performDomainSelftest()
   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in
wrapper
 value = meth(self, *a, **kw)
   File "/usr/share/vdsm/storage/monitor.py", line 339, in
_performDomainSelftest
 self.domain.selftest()
   File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
 return getattr(self.getRealDomain(), attrName)
   File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
 return self._cache._realProduce(self._sdUUID)
   File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
 domain = self._findDomain(sdUUID)
   File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
 dom = findMethod(sdUUID)
   File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
 return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
   File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
 raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist:
(u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)


Again, all the hypervisor nodes will complain about the NFS area for the ISO 
DOMAIN now being gone.  Remember, the ovirt manager node held this, and its 
network went out and the node crashed (note: the ovirt node (the actual 
server box) shouldn't crash due to the network outage, but it did).
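
The Timeout in the trace above is ioprocess giving up on that NFS-backed ISO
domain path.  A hedged sketch of the same kind of probe, run from a node with a
hard time limit so an unreachable NFS server can't hang the caller (the path is
the one from the log; the 5-second limit and the helper are only illustrative):

# Probe the ISO-domain mount in a child process and give up after a timeout,
# mirroring what the vdsm ioprocess glob above is doing when it times out.
import errno
import multiprocessing
import os

ISO_PATH = ("/rhev/data-center/mnt/"
            "d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844")

def _probe(path, queue):
    try:
        # Any directory listing is enough to prove the mount is alive.
        queue.put(("ok", os.listdir(path)[:5]))
    except OSError as e:
        queue.put(("error", os.strerror(e.errno)))

if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=_probe, args=(ISO_PATH, q))
    p.start()
    p.join(5)  # seconds to wait before declaring the mount dead
    if p.is_alive():
        p.terminate()
        print("timed out: " + os.strerror(errno.ETIMEDOUT))
    else:
        print(q.get())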



I have added VDSM 

Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-01-25 Thread Douglas Landgraf
On Thu, Jan 25, 2018 at 5:12 PM, Christopher Cox  wrote:
> On 01/25/2018 02:25 PM, Douglas Landgraf wrote:
>>
>> On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox 
>> wrote:
>>>
>>> Would restarting vdsm on the node in question help fix this?  Again, all
>>> the
>>> VMs are up on the node.  Prior attempts to fix this problem have left the
>>> node in a state where I can issue the "has been rebooted" command to it,
>>> it's confused.
>>>
>>> So... node is up.  All VMs are up.  Can't issue "has been rebooted" to
>>> the
>>> node, all VMs show Unknown and not responding but they are up.
>>>
>>> Changing the status in the ovirt db to 0 works for a second and then it
>>> goes
>>> immediately back to 8 (which is why I'm wondering if I should restart
>>> vdsm
>>> on the node).
>>
>>
>> It's not recommended to change db manually.
>>
>>>
>>> Oddly enough, we're running all of this in production.  So, watching it
>>> all
>>> go down isn't the best option for us.
>>>
>>> Any advice is welcome.
>>
>>
>>
>> We would need to see the node/engine logs, have you found any error in
>> the vdsm.log
>> (from nodes) or engine.log? Could you please share the error?
>
>
>
> In short, the error is that our ovirt manager lost network (our problem) and
> crashed hard (hardware issue on the server).  On bring-up, we had some
> network changes (that caused the lost-network problem), so our LACP bond was
> down for a bit while we were trying to bring it up (noting the ovirt manager
> is up while we're re-establishing the network on the switch side).
>
> In other words, that's the "error" so to speak that got us to where we are.
>
> Full DEBUG is enabled on the logs... The error messages seem obvious to me.
> It starts like this (noting the ISO DOMAIN was coming off an NFS mount off the
> ovirt management server... yes... we know... we do have plans to move that).
>
> So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):
>
> (hopefully no surprise here)
>
> Thread-2426633::WARNING::2018-01-23
> 13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles) Could not
> collect metadata file for domain path
> /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/fileSD.py", line 735, in collectMetaFiles
> sd.DOMAIN_META_DATA))
>   File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
> return self._iop.glob(pattern)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 536,
> in glob
> return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
>   File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 421,
> in _sendCommand
> raise Timeout(os.strerror(errno.ETIMEDOUT))
> Timeout: Connection timed out
> Thread-27::ERROR::2018-01-23
> 13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain) domain
> e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
> dom = findMethod(sdUUID)
>   File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
> return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>   File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
> raise se.StorageDomainDoesNotExist(sdUUID)
> StorageDomainDoesNotExist: Storage domain does not exist:
> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
> Thread-27::ERROR::2018-01-23
> 13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error
> monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
> Traceback (most recent call last):
>   File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
> self._performDomainSelftest()
>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in
> wrapper
> value = meth(self, *a, **kw)
>   File "/usr/share/vdsm/storage/monitor.py", line 339, in
> _performDomainSelftest
> self.domain.selftest()
>   File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
> return getattr(self.getRealDomain(), attrName)
>   File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
> return self._cache._realProduce(self._sdUUID)
>   File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
> domain = self._findDomain(sdUUID)
>   File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
> dom = findMethod(sdUUID)
>   File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
> return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
>   File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
> raise se.StorageDomainDoesNotExist(sdUUID)
> StorageDomainDoesNotExist: Storage domain does not exist:
> (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
>
>
> Again, all the hypervisor nodes will complain about the NFS area for the
> ISO DOMAIN now being gone.  Remember the ovirt manager node held this and it 

Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-01-25 Thread Christopher Cox

On 01/25/2018 02:25 PM, Douglas Landgraf wrote:

On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox  wrote:

Would restarting vdsm on the node in question help fix this?  Again, all the
VMs are up on the node.  Prior attempts to fix this problem have left the
node in a state where I can issue the "has been rebooted" command to it,
it's confused.

So... node is up.  All VMs are up.  Can't issue "has been rebooted" to the
node, all VMs show Unknown and not responding but they are up.

Changing the status in the ovirt db to 0 works for a second and then it goes
immediately back to 8 (which is why I'm wondering if I should restart vdsm
on the node).


It's not recommended to change db manually.



Oddly enough, we're running all of this in production.  So, watching it all
go down isn't the best option for us.

Any advice is welcome.



We would need to see the node/engine logs.  Have you found any errors in 
the vdsm.log (from the nodes) or engine.log?  Could you please share the error?



In short, the error is that our ovirt manager lost network (our problem) and 
crashed hard (hardware issue on the server).  On bring-up, we had some 
network changes (the ones that caused the lost-network problem), so our LACP 
bond was down for a bit while we were trying to bring it up (noting the ovirt 
manager is up while we're re-establishing the network on the switch side).

In other words, that's the "error" so to speak that got us to where we are.

Full DEBUG is enabled on the logs... The error messages seem obvious to me.  
It starts like this (noting the ISO DOMAIN was coming off an NFS mount off 
the ovirt management server... yes... we know... we do have plans to move that).


So on the hypervisor node itself, from the vdsm.log (vdsm.log.33.xz):

(hopefully no surprise here)

Thread-2426633::WARNING::2018-01-23 13:50:56,672::fileSD::749::Storage.scanDomains::(collectMetaFiles) Could not collect metadata file for domain path /rhev/data-center/mnt/d0lppc129.skopos.me:_var_lib_exports_iso-20160408002844
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/fileSD.py", line 735, in collectMetaFiles
    sd.DOMAIN_META_DATA))
  File "/usr/share/vdsm/storage/outOfProcess.py", line 121, in glob
    return self._iop.glob(pattern)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 536, in glob
    return self._sendCommand("glob", {"pattern": pattern}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 421, in _sendCommand
    raise Timeout(os.strerror(errno.ETIMEDOUT))
Timeout: Connection timed out
Thread-27::ERROR::2018-01-23 13:50:56,672::sdc::145::Storage.StorageDomainCache::(_findDomain) domain e5ecae2f-5a06-4743-9a43-e74d83992c35 not found
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)
Thread-27::ERROR::2018-01-23 13:50:56,673::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain e5ecae2f-5a06-4743-9a43-e74d83992c35
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 272, in _monitorDomain
    self._performDomainSelftest()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 769, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 339, in _performDomainSelftest
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 124, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 143, in _findDomain
    dom = findMethod(sdUUID)
  File "/usr/share/vdsm/storage/nfsSD.py", line 122, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/nfsSD.py", line 112, in findDomainPath
    raise se.StorageDomainDoesNotExist(sdUUID)
StorageDomainDoesNotExist: Storage domain does not exist: (u'e5ecae2f-5a06-4743-9a43-e74d83992c35',)



Again, all the hypervisor nodes will complain about the NFS area for the ISO 
DOMAIN now being gone.  Remember, the ovirt manager node held this, and its 
network went out and the node crashed (note: the ovirt node (the actual 
server box) shouldn't crash due to the network outage, but it did).


So here is the engine collapse as it lost network connectivity (before 
the server actually crashed hard).


2018-01-23 13:45:33,666 ERROR 

Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-01-25 Thread Douglas Landgraf
On Wed, Jan 24, 2018 at 10:18 AM, Christopher Cox  wrote:
> Would restarting vdsm on the node in question help fix this?  Again, all the
> VMs are up on the node.  Prior attempts to fix this problem have left the
> node in a state where I can issue the "has been rebooted" command to it,
> it's confused.
>
> So... node is up.  All VMs are up.  Can't issue "has been rebooted" to the
> node, all VMs show Unknown and not responding but they are up.
>
> Changing the status in the ovirt db to 0 works for a second and then it goes
> immediately back to 8 (which is why I'm wondering if I should restart vdsm
> on the node).

It's not recommended to change db manually.

>
> Oddly enough, we're running all of this in production.  So, watching it all
> go down isn't the best option for us.
>
> Any advice is welcome.


We would need to see the node/engine logs.  Have you found any errors in 
the vdsm.log (from the nodes) or engine.log?  Could you please share the error?

Probably it's time to think about upgrading your environment from 3.6.

>
>
> On 01/23/2018 03:58 PM, Christopher Cox wrote:
>>
>> Like the subject says.. I tried to clear the status from the vm_dynamic
>> for a
>> VM, but it just goes back to 8.
>>
>> Any hints on how to get things back to a known state?
>>
>> I tried marking the node in maint, but it can't move the "Unknown" VMs, so
>> that
>> doesn't work.  I tried rebooting a VM, that doesn't work.
>>
>> The state of the VMs is up and I think they are running on the node
>> they say
>> they are running on, we just have the Unknown problem with VMs on that one
>> node.  So... can't move them, rebooting VMs doesn't fix it...
>>
>> Any trick to restoring state so that oVirt is ok???
>>
>> (what a mess)
>
>



-- 
Cheers
Douglas


Re: [ovirt-users] ovirt 3.6, we had the ovirt manager go down in a bad way and all VMs for one node marked Unknown and Not Responding while up

2018-01-24 Thread Christopher Cox
Would restarting vdsm on the node in question help fix this?  Again, all the VMs 
are up on the node.  Prior attempts to fix this problem have left the node in a 
state where I can issue the "has been rebooted" command to it, it's confused.


So... node is up.  All VMs are up.  Can't issue "has been rebooted" to the node, 
all VMs show Unknown and not responding but they are up.


Changing the status in the ovirt db to 0 works for a second and then it goes 
immediately back to 8 (which is why I'm wondering if I should restart vdsm on 
the node).


Oddly enough, we're running all of this in production.  So, watching it all go 
down isn't the best option for us.


Any advice is welcome.

On 01/23/2018 03:58 PM, Christopher Cox wrote:

Like the subject says.. I tried to clear the status from the vm_dynamic for a
VM, but it just goes back to 8.

Any hints on how to get things back to a known state?

I tried marking the node in maint, but it can't move the "Unknown" VMs, so that
doesn't work.  I tried rebooting a VM, that doesn't work.

The state of the VMs is up and I think they are running on the node they say
they are running on, we just have the Unknown problem with VMs on that one
node.  So... can't move them, rebooting VMs doesn't fix it...

Any trick to restoring state so that oVirt is ok???

(what a mess)

