Re: [ovirt-users] HE in bad status, will not start following storage issue - HELP

2017-03-12 Thread Ian Neilsen
I've checked the 'ids' file in /rhev/data-center/mnt/glusterSD/*/dom_md/

# -rw-rw. 1 vdsm kvm  1048576 Mar 12 05:14 ids

seems ok
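
For reference, this is roughly what I ran (the glusterSD directory below is a placeholder for the real mount/storage-domain path); the 'ids' file needs to stay owned by vdsm:kvm, otherwise sanlock gets permission errors:

# ls -l /rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd_uuid>/dom_md/
# chown vdsm:kvm /rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd_uuid>/dom_md/ids

(the chown only if the owner/group had turned out wrong, which it didn't here)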

sanlock.log is showing:
---
r14 acquire_token open error -13
r14 cmd_acquire 2,11,89283 acquire_token -13
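
Error -13 is EACCES, i.e. sanlock itself is being denied access to the lockspace even though the listing above looks fine. A few generic checks I'm using to narrow it down (nothing oVirt-specific; the path is the same dom_md directory as above):

# id sanlock
# ausearch -m avc -ts recent
# sudo -u sanlock dd if=/rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd_uuid>/dom_md/ids of=/dev/null bs=4096 count=1 iflag=direct

The first shows which groups the sanlock user actually has, the second whether SELinux is denying the access, and the dd repeats the kind of direct read sanlock performs, but as the sanlock user.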

Now I'm not quite sure which direction to take.

Lockspace
---
"hosted-engine --reinitialize-lockspace" is throwing an exception;

Exception("Lockfile reset cannot be performed with"
Exception: Lockfile reset cannot be performed with an active agent.
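
Reading that error, the reset apparently refuses to run while ovirt-ha-agent is still active, so my understanding is the sequence would have to be: stop the HA services on every host, run the reset once from one host, then start the services again. Something like this (standard service names, please correct me if this is off):

# systemctl stop ovirt-ha-agent ovirt-ha-broker
# hosted-engine --reinitialize-lockspace
# systemctl start ovirt-ha-broker ovirt-ha-agent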


@didi - I am in "Global Maintenance".
I just noticed that host 1 now shows:
Engine status: unknown stale-data
state= AgentStopped
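
As far as I understand, "stale-data" with state=AgentStopped just means the agent on host 1 is no longer publishing its metadata, so I'm restarting it and re-checking while staying in Global Maintenance:

# systemctl restart ovirt-ha-agent
# hosted-engine --vm-status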

I'm pretty sure I've been able to start the Engine VM while in Global
Maintenance. But you raise a good question. I don't see why you would be
restricted from running the engine, or even starting the VM, while in Global
Maintenance. If so, that is a little backwards.






On 12 March 2017 at 16:28, Yedidyah Bar David  wrote:

> On Fri, Mar 10, 2017 at 12:39 PM, Martin Sivak  wrote:
> > Hi Ian,
> >
> > it is normal that VDSMs are competing for the lock, one should win
> > though. If that is not the case then the lockspace might be corrupted
> > or the sanlock daemons can't reach it.
> >
> > I would recommend putting the cluster to global maintenance and
> > attempting a manual start using:
> >
> > # hosted-engine --set-maintenance --mode=global
> > # hosted-engine --vm-start
>
> Is that possible? See also:
>
> http://lists.ovirt.org/pipermail/users/2016-January/036993.html
>
> >
> > You will need to check your storage connectivity and sanlock status on
> > all hosts if that does not work.
> >
> > # sanlock client status
> >
> > There are a couple of locks I would expect to be there (ha_agent, spm),
> > but no lock for the hosted engine disk should be visible.
> >
> > Next steps depend on whether you have important VMs running on the
> > cluster and on the Gluster status (I can't help you there
> > unfortunately).
> >
> > Best regards
> >
> > --
> > Martin Sivak
> > SLA / oVirt
> >
> >
> > On Fri, Mar 10, 2017 at 7:37 AM, Ian Neilsen 
> wrote:
> >> I just noticed this in the vdsm.logs.  The agent looks like it is
> trying to
> >> start hosted engine on both machines??
> >>
> >> <on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash>
> >> Thread-7517::ERROR::2017-03-10
> >> 01:26:13,053::vm::773::virt.vm::(_startUnderlyingVm)
> >> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::The vm start process
> failed
> >> Traceback (most recent call last):
> >>   File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
> >> self._run()
> >>   File "/usr/share/vdsm/virt/vm.py", line 2026, in _run
> >> self._connection.createXML(domxml, flags),
> >>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py",
> line
> >> 123, in wrapper ret = f(*args, **kwargs)
> >>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in
> >> wrapper return func(inst, *args, **kwargs)
> >>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in
> >> createXML if ret is None:raise libvirtError('virDomainCreateXML()
> failed',
> >> conn=self)
> >>
> >> libvirtError: Failed to acquire lock: Permission denied
> >>
> >> INFO::2017-03-10 01:26:13,054::vm::1330::virt.vm::(setDownStatus)
> >> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Changed state to Down:
> Failed
> >> to acquire lock: Permission denied (code=1)
> >> INFO::2017-03-10 01:26:13,054::guestagent::430::virt.vm::(stop)
> >> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Stopping connection
> >>
> >> DEBUG::2017-03-10 01:26:13,054::vmchannels::238::vds::(unregister)
> Delete
> >> fileno 56 from listener.
> >> DEBUG::2017-03-10 01:26:13,055::vmchannels::66::vds::(_unregister_fd)
> Failed
> >> to unregister FD from epoll (ENOENT): 56
> >> DEBUG::2017-03-10 01:26:13,055::__init__::209::
> jsonrpc.Notification::(emit)
> >> Sending event {"params": {"2419f9fe-4998-4b7a-9fe9-151571d20379":
> {"status":
> >> "Down", "exitReason": 1, "exitMessage": "Failed to acquire lock:
> Permission
> >> denied", "exitCode": 1}, "notify_time": 4308740560}, "jsonrpc": "2.0",
> >> "method": "|virt|VM_status|2419f9fe-4998-4b7a-9fe9-151571d20379"}
> >> VM Channels Listener::DEBUG::2017-03-10
> >> 01:26:13,475::vmchannels::142::vds::(_do_del_channels) fileno 56 was
> removed
> >> from listener.
> >> DEBUG::2017-03-10 01:26:14,430::check::296::storage.check::(_start_
> process)
> >> START check
> >> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/
> a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
> >> cmd=['/usr/bin/taskset', '--cpu-list', '0-39', '/usr/bin/dd',
> >> u'if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/
> a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata',
> >> 'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct'] delay=0.00
> >> DEBUG::2017-03-10 01:26:14,481::asyncevent::564:
> :storage.asyncevent::(reap)
> >> Process  terminated (count=1)
> >> DEBUG::2017-03-10
> >> 

Re: [ovirt-users] HE in bad status, will not start following storage issue - HELP

2017-03-11 Thread Yedidyah Bar David
On Fri, Mar 10, 2017 at 12:39 PM, Martin Sivak  wrote:
> Hi Ian,
>
> it is normal that VDSMs are competing for the lock, one should win
> though. If that is not the case then the lockspace might be corrupted
> or the sanlock daemons can't reach it.
>
> I would recommend putting the cluster to global maintenance and
> attempting a manual start using:
>
> # hosted-engine --set-maintenance --mode=global
> # hosted-engine --vm-start

Is that possible? See also:

http://lists.ovirt.org/pipermail/users/2016-January/036993.html
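
For completeness: if the manual start really is blocked by global maintenance, as discussed in that thread, the counterpart that hands control back to the agents so they can start the VM themselves is:

# hosted-engine --set-maintenance --mode=none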

>
> You will need to check your storage connectivity and sanlock status on
> all hosts if that does not work.
>
> # sanlock client status
>
> There are a couple of locks I would expect to be there (ha_agent, spm),
> but no lock for the hosted engine disk should be visible.
>
> Next steps depend on whether you have important VMs running on the
> cluster and on the Gluster status (I can't help you there
> unfortunately).
>
> Best regards
>
> --
> Martin Sivak
> SLA / oVirt
>
>
> On Fri, Mar 10, 2017 at 7:37 AM, Ian Neilsen  wrote:
>> I just noticed this in the vdsm.logs.  The agent looks like it is trying to
>> start hosted engine on both machines??
>>
>> <on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash>
>> Thread-7517::ERROR::2017-03-10
>> 01:26:13,053::vm::773::virt.vm::(_startUnderlyingVm)
>> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::The vm start process failed
>> Traceback (most recent call last):
>>   File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
>> self._run()
>>   File "/usr/share/vdsm/virt/vm.py", line 2026, in _run
>> self._connection.createXML(domxml, flags),
>>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line
>> 123, in wrapper ret = f(*args, **kwargs)
>>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in
>> wrapper return func(inst, *args, **kwargs)
>>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in
>> createXML if ret is None:raise libvirtError('virDomainCreateXML() failed',
>> conn=self)
>>
>> libvirtError: Failed to acquire lock: Permission denied
>>
>> INFO::2017-03-10 01:26:13,054::vm::1330::virt.vm::(setDownStatus)
>> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Changed state to Down: Failed
>> to acquire lock: Permission denied (code=1)
>> INFO::2017-03-10 01:26:13,054::guestagent::430::virt.vm::(stop)
>> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Stopping connection
>>
>> DEBUG::2017-03-10 01:26:13,054::vmchannels::238::vds::(unregister) Delete
>> fileno 56 from listener.
>> DEBUG::2017-03-10 01:26:13,055::vmchannels::66::vds::(_unregister_fd) Failed
>> to unregister FD from epoll (ENOENT): 56
>> DEBUG::2017-03-10 01:26:13,055::__init__::209::jsonrpc.Notification::(emit)
>> Sending event {"params": {"2419f9fe-4998-4b7a-9fe9-151571d20379": {"status":
>> "Down", "exitReason": 1, "exitMessage": "Failed to acquire lock: Permission
>> denied", "exitCode": 1}, "notify_time": 4308740560}, "jsonrpc": "2.0",
>> "method": "|virt|VM_status|2419f9fe-4998-4b7a-9fe9-151571d20379"}
>> VM Channels Listener::DEBUG::2017-03-10
>> 01:26:13,475::vmchannels::142::vds::(_do_del_channels) fileno 56 was removed
>> from listener.
>> DEBUG::2017-03-10 01:26:14,430::check::296::storage.check::(_start_process)
>> START check
>> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
>> cmd=['/usr/bin/taskset', '--cpu-list', '0-39', '/usr/bin/dd',
>> u'if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata',
>> 'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct'] delay=0.00
>> DEBUG::2017-03-10 01:26:14,481::asyncevent::564::storage.asyncevent::(reap)
>> Process  terminated (count=1)
>> DEBUG::2017-03-10
>> 01:26:14,481::check::327::storage.check::(_check_completed) FINISH check
>> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
>> rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n300 bytes (300 B)
>> copied, 8.7603e-05 s, 3.4 MB/s\n') elapsed=0.06
>>
>>
>> On 10 March 2017 at 10:40, Ian Neilsen  wrote:
>>>
>>> Hi All
>>>
>>> I had a storage issue with my gluster volumes running under ovirt hosted.
>>> I now cannot start the hosted engine manager vm from "hosted-engine
>>> --vm-start".
>>> I've scoured the net to find a way, but can't seem to find anything
>>> concrete.
>>>
>>> Running CentOS 7, oVirt 4.0 and Gluster 3.8.9
>>>
>>> How do I recover the engine manager? I'm at a loss!
>>>
>>> Engine status: the score was 0 on all nodes; now node 1 is reading 3400,
>>> but all the others are 0
>>>
>>> {"reason": "bad vm status", "health": "bad", "vm": "down", "detail":
>>> "down"}
>>>
>>>
>>> Logs from agent.log
>>> ==
>>>
>>> INFO::2017-03-09
>>> 19:32:52,600::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>>> Global maintenance detected
>>> INFO::2017-03-09
>>> 

Re: [ovirt-users] HE in bad status, will not start following storage issue - HELP

2017-03-10 Thread Martin Sivak
Hi Ian,

it is normal that VDSMs are competing for the lock, one should win
though. If that is not the case then the lockspace might be corrupted
or the sanlock daemons can't reach it.

I would recommend putting the cluster to global maintenance and
attempting a manual start using:

# hosted-engine --set-maintenance --mode=global
# hosted-engine --vm-start

You will need to check your storage connectivity and sanlock status on
all hosts if that does not work.

# sanlock client status

There are a couple of locks I would expect to be there (ha_agent, spm),
but no lock for the hosted engine disk should be visible.
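
For the storage side, a few generic checks (the volume name is a placeholder, and the dd line just repeats the direct read VDSM itself does against the domain metadata):

# gluster volume status <volume>
# gluster volume heal <volume> info
# dd if=/rhev/data-center/mnt/glusterSD/<server>:_<volume>/<sd_uuid>/dom_md/metadata of=/dev/null bs=4096 count=1 iflag=direct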

Next steps depend on whether you have important VMs running on the
cluster and on the Gluster status (I can't help you there
unfortunately).

Best regards

--
Martin Sivak
SLA / oVirt


On Fri, Mar 10, 2017 at 7:37 AM, Ian Neilsen  wrote:
> I just noticed this in the vdsm.logs.  The agent looks like it is trying to
> start hosted engine on both machines??
>
> <on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash>
> Thread-7517::ERROR::2017-03-10
> 01:26:13,053::vm::773::virt.vm::(_startUnderlyingVm)
> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::The vm start process failed
> Traceback (most recent call last):
>   File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
> self._run()
>   File "/usr/share/vdsm/virt/vm.py", line 2026, in _run
> self._connection.createXML(domxml, flags),
>   File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line
> 123, in wrapper ret = f(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in
> wrapper return func(inst, *args, **kwargs)
>   File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in
> createXML if ret is None:raise libvirtError('virDomainCreateXML() failed',
> conn=self)
>
> libvirtError: Failed to acquire lock: Permission denied
>
> INFO::2017-03-10 01:26:13,054::vm::1330::virt.vm::(setDownStatus)
> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Changed state to Down: Failed
> to acquire lock: Permission denied (code=1)
> INFO::2017-03-10 01:26:13,054::guestagent::430::virt.vm::(stop)
> vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Stopping connection
>
> DEBUG::2017-03-10 01:26:13,054::vmchannels::238::vds::(unregister) Delete
> fileno 56 from listener.
> DEBUG::2017-03-10 01:26:13,055::vmchannels::66::vds::(_unregister_fd) Failed
> to unregister FD from epoll (ENOENT): 56
> DEBUG::2017-03-10 01:26:13,055::__init__::209::jsonrpc.Notification::(emit)
> Sending event {"params": {"2419f9fe-4998-4b7a-9fe9-151571d20379": {"status":
> "Down", "exitReason": 1, "exitMessage": "Failed to acquire lock: Permission
> denied", "exitCode": 1}, "notify_time": 4308740560}, "jsonrpc": "2.0",
> "method": "|virt|VM_status|2419f9fe-4998-4b7a-9fe9-151571d20379"}
> VM Channels Listener::DEBUG::2017-03-10
> 01:26:13,475::vmchannels::142::vds::(_do_del_channels) fileno 56 was removed
> from listener.
> DEBUG::2017-03-10 01:26:14,430::check::296::storage.check::(_start_process)
> START check
> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
> cmd=['/usr/bin/taskset', '--cpu-list', '0-39', '/usr/bin/dd',
> u'if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata',
> 'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct'] delay=0.00
> DEBUG::2017-03-10 01:26:14,481::asyncevent::564::storage.asyncevent::(reap)
> Process  terminated (count=1)
> DEBUG::2017-03-10
> 01:26:14,481::check::327::storage.check::(_check_completed) FINISH check
> u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
> rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n300 bytes (300 B)
> copied, 8.7603e-05 s, 3.4 MB/s\n') elapsed=0.06
>
>
> On 10 March 2017 at 10:40, Ian Neilsen  wrote:
>>
>> Hi All
>>
>> I had a storage issue with my gluster volumes running under ovirt hosted.
>> I now cannot start the hosted engine manager vm from "hosted-engine
>> --vm-start".
>> I've scoured the net to find a way, but can't seem to find anything
>> concrete.
>>
>> Running CentOS 7, oVirt 4.0 and Gluster 3.8.9
>>
>> How do I recover the engine manager? I'm at a loss!
>>
>> Engine status: the score was 0 on all nodes; now node 1 is reading 3400,
>> but all the others are 0
>>
>> {"reason": "bad vm status", "health": "bad", "vm": "down", "detail":
>> "down"}
>>
>>
>> Logs from agent.log
>> ==
>>
>> INFO::2017-03-09
>> 19:32:52,600::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
>> Global maintenance detected
>> INFO::2017-03-09
>> 19:32:52,603::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm)
>> Initializing VDSM
>> INFO::2017-03-09
>> 19:32:54,820::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images)
>> Connecting the storage
>> 

Re: [ovirt-users] HE in bad status, will not start following storage issue - HELP

2017-03-10 Thread Ian Neilsen
I just noticed this in the vdsm logs.  The agent looks like it is trying to
start the hosted engine on both machines??
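
A quick way to compare what each agent thinks it should be doing is to run this on every host and look at the Engine status, state and score lines for both hosts:

# hosted-engine --vm-status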

<on_poweroff>destroy</on_poweroff><on_reboot>destroy</on_reboot><on_crash>destroy</on_crash>
Thread-7517::ERROR::2017-03-10
01:26:13,053::vm::773::virt.vm::(_startUnderlyingVm)
vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 714, in _startUnderlyingVm
self._run()
  File "/usr/share/vdsm/virt/vm.py", line 2026, in _run
self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line
123, in wrapper ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in
wrapper return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3782, in
createXML if ret is None:raise libvirtError('virDomainCreateXML() failed',
conn=self)

libvirtError: Failed to acquire lock: Permission denied

INFO::2017-03-10 01:26:13,054::vm::1330::virt.vm::(setDownStatus)
vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Changed state to Down: Failed
to acquire lock: Permission denied (code=1)
INFO::2017-03-10 01:26:13,054::guestagent::430::virt.vm::(stop)
vmId=`2419f9fe-4998-4b7a-9fe9-151571d20379`::Stopping connection

DEBUG::2017-03-10 01:26:13,054::vmchannels::238::vds::(unregister) Delete
fileno 56 from listener.
DEBUG::2017-03-10 01:26:13,055::vmchannels::66::vds::(_unregister_fd)
Failed to unregister FD from epoll (ENOENT): 56
DEBUG::2017-03-10 01:26:13,055::__init__::209::jsonrpc.Notification::(emit)
Sending event {"params": {"2419f9fe-4998-4b7a-9fe9-151571d20379":
{"status": "Down", "exitReason": 1, "exitMessage": "Failed to acquire lock:
Permission denied", "exitCode": 1}, "notify_time": 4308740560}, "jsonrpc":
"2.0", "method": "|virt|VM_status|2419f9fe-4998-4b7a-9fe9-151571d20379"}
VM Channels Listener::DEBUG::2017-03-10
01:26:13,475::vmchannels::142::vds::(_do_del_channels) fileno 56 was
removed from listener.
DEBUG::2017-03-10 01:26:14,430::check::296::storage.check::(_start_process)
START check 
u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
cmd=['/usr/bin/taskset', '--cpu-list', '0-39', '/usr/bin/dd',
u'if=/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata',
'of=/dev/null', 'bs=4096', 'count=1', 'iflag=direct'] delay=0.00
DEBUG::2017-03-10 01:26:14,481::asyncevent::564::storage.asyncevent::(reap)
Process  terminated (count=1)
DEBUG::2017-03-10
01:26:14,481::check::327::storage.check::(_check_completed) FINISH check
u'/rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/metadata'
rc=0 err=bytearray(b'0+1 records in\n0+1 records out\n300 bytes (300 B)
copied, 8.7603e-05 s, 3.4 MB/s\n') elapsed=0.06
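
The "Failed to acquire lock: Permission denied" from libvirt above is sanlock refusing the disk lease, so besides the metadata check that clearly succeeds, I'm also looking at the sanlock side (same storage domain path as in the log):

# sanlock client status
# ls -l /rhev/data-center/mnt/glusterSD/192.168.3.10:_data/a08822ec-3f5b-4dba-ac2d-5510f0b4b6a2/dom_md/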


On 10 March 2017 at 10:40, Ian Neilsen  wrote:

> Hi All
>
> I had a storage issue with my gluster volumes running under ovirt hosted.
> I now cannot start the hosted engine manager vm from "hosted-engine
> --vm-start".
> I've scoured the net to find a way, but can't seem to find anything
> concrete.
>
> Running CentOS 7, oVirt 4.0 and Gluster 3.8.9
>
> How do I recover the engine manager? I'm at a loss!
>
> Engine status: the score was 0 on all nodes; now node 1 is reading 3400,
> but all the others are 0
>
> {"reason": "bad vm status", "health": "bad", "vm": "down", "detail":
> "down"}
>
>
> Logs from agent.log
> ==
>
> INFO::2017-03-09 19:32:52,600::state_decorators::51::ovirt_hosted_
> engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance
> detected
> INFO::2017-03-09 19:32:52,603::hosted_engine::612::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
> INFO::2017-03-09 19:32:54,820::hosted_engine::639::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting
> the storage
> INFO::2017-03-09 19:32:54,821::storage_server::
> 219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
> Connecting storage server
> INFO::2017-03-09 19:32:59,194::storage_server::
> 226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
> Connecting storage server
> INFO::2017-03-09 19:32:59,211::storage_server::
> 233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
> Refreshing the storage domain
> INFO::2017-03-09 19:32:59,328::hosted_engine::666::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing
> images
> INFO::2017-03-09 
> 19:32:59,328::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images)
> Preparing images
> INFO::2017-03-09 19:33:01,748::hosted_engine::669::ovirt_hosted_engine_ha.
> agent.hosted_engine.HostedEngine::(_initialize_storage_images) Reloading
> vm.conf from the shared storage domain
> INFO::2017-03-09