Re: [ovirt-users] vms in paused state

2016-05-13 Thread Milan Zamazal
We've found out that if libvirtd got restarted then VMs with disabled
memory balloon device are wrongly reported as being in the paused state.
It's a bug and we're working on a fix.
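
In the meantime, if you want to check whether a particular VM is affected, the
relevant detail is whether its libvirt domain XML has the balloon device disabled
(typically model "none"). A minimal Python sketch of that check, run on the host
(the VM name is only an example taken from this thread):

    import subprocess
    import xml.etree.ElementTree as ET

    def balloon_disabled(vm_name):
        # Read-only virsh connection, so this does not interfere with Vdsm.
        xml = subprocess.check_output(["virsh", "-r", "dumpxml", vm_name])
        balloon = ET.fromstring(xml).find("./devices/memballoon")
        # A missing element or model="none" both mean the balloon is disabled.
        return balloon is None or balloon.get("model") == "none"

    print(balloon_disabled("api1.test.j2noc.com"))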


Re: [ovirt-users] vms in paused state

2016-05-04 Thread Milan Zamazal
Bill James  writes:

> .recovery setting before removing:
> p298
> sS'status'
> p299
> S'Paused'
> p300
>
> After removing .recovery file and shutdown and restart:
> V0
> sS'status'
> p51
> S'Up'
> p52

Thank you for the information.  I was able to reproduce the problem with
mistakenly reported paused state when Vdsm receives unexpected data from
libvirt.  I'll try to look at it.

Restarting Vdsm (4.17.18 and some newer versions) afterwards remedies
the problem for me, even without removing the recovery file.

Milan


Re: [ovirt-users] vms in paused state

2016-05-02 Thread Bill James

.recovery setting before removing:
p298
sS'status'
p299
S'Paused'
p300



After removing .recovery file and shutdown and restart:
V0
sS'status'
p51
S'Up'
p52


So far looks good, the GUI shows the VM as Up.


another host was:
p318
sS'status'
p319
S'Paused'
p320

after moving .recovery file and restarting:
V0
sS'status'
p51
S'Up'


Thanks.
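
For reference, those .recovery files are plain Python pickles of the VM parameter
dictionary, so the stored status can be checked without editing or removing anything.
A minimal sketch, assuming the /run/vdsm location from Nir's hint below and the
'status' key visible in the dumps above:

    import glob
    import pickle

    # Print the status recorded in each VM recovery file.
    for path in glob.glob("/run/vdsm/*.recovery"):
        with open(path, "rb") as f:
            conf = pickle.load(f)   # protocol-0 pickle, as shown in the dumps above
        print(path, conf.get("status"))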

On 04/29/2016 02:36 PM, Nir Soffer wrote:

/run/vdsm/.recovery

On Fri, Apr 29, 2016 at 10:59 PM, Bill James wrote:


where do I find the recovery files?

[root@ovirt1 test vdsm]# pwd
/var/lib/vdsm
[root@ovirt1 test vdsm]# ls -la
total 16
drwxr-xr-x   6 vdsm kvm    100 Mar 17 16:33 .
drwxr-xr-x. 45 root root  4096 Apr 29 12:01 ..
-rw-r--r--   1 vdsm kvm  10170 Jan 19 05:04 bonding-defaults.json
drwxr-xr-x   2 vdsm root 6 Apr 19 11:34 netconfback
drwxr-xr-x   3 vdsm kvm 54 Apr 19 11:35 persistence
drwxr-x---.  2 vdsm kvm  6 Mar 17 16:33 transient
drwxr-xr-x   2 vdsm kvm 40 Mar 17 16:33 upgrade



On 4/29/16 10:02 AM, Michal Skrivanek wrote:



On 29 Apr 2016, at 18:26, Bill James wrote:


yes they are still saying "paused" state.
No, bouncing libvirt didn't help.


Then my suspicion of vm recovery gets closer to a certainty:)
Can you get one of the paused vm's .recovery file from
/var/lib/vdsm and check it says Paused there? It's worth a shot
to try to remove that file and restart vdsm, then check logs and
that vm status...it should recover "good enough" from libvirt only.
Try it with one first


I noticed the errors about the ISO domain. Didn't think that was
related.
I have been migrating a lot of VMs to ovirt lately, and recently
added another node.
Also had some problems with /etc/exports for a while, but I
think those issues are all resolved.


Last "unresponsive" message in vdsm.log was:

vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::*2016-04-21*
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout)
vmId=`b6a13808-9552-401b-840b-4f7022e8293d`::monitor become
unresponsive (command timeout, age=310323.97)
vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout)
vmId=`5bfb140a-a971-4c9c-82c6-277929eb45d4`::monitor become
unresponsive (command timeout, age=310323.97)



Thanks.



On 4/29/16 1:40 AM, Michal Skrivanek wrote:



On 28 Apr 2016, at 19:40, Bill James wrote:

thank you for response.
I bold-ed the ones that are listed as "paused".


[root@ovirt1 test vdsm]# virsh -r list --all
 Id    Name                           State






Looks like problem started around 2016-04-17 20:19:34,822,
based on engine.log attached.


yes, that time looks correct. Any idea what might have been a
trigger? Anything interesting happened at that time (power
outage of some host, some maintenance action, anything)?
logs indicate a problem when vdsm talks to libvirt (all those
"monitor become unresponsive")

It does seem that at that time you started to have some storage
connectivity issues - first one at 2016-04-17 20:06:53,929.
And it doesn't look temporary, because such errors are still
there a couple of hours later (in your most recent file you attached
I can see one at 23:00:54).
When I/O gets blocked the VMs may experience issues (then the VM
gets Paused), or their qemu process gets stuck (resulting in
libvirt either reporting an error or getting stuck as well ->
resulting in what vdsm sees as "monitor unresponsive").

Since you now bounced libvirtd - did it help? Do you still see
wrong status for those VMs and still those "monitor
unresponsive" errors in vdsm.log?
If not... then I would suspect the "vm recovery" code is not
working correctly. Milan is looking at that.

Thanks,
michal



There's a lot of vdsm logs!

fyi, the storage domain for these Vms is a "local" nfs share,
7e566f55-e060-47b7-bfa4-ac3c48d70dda.

attached more logs.


On 04/28/2016 12:53 AM, Michal Skrivanek wrote:

On 27 Apr 2016, at 19:16, Bill James 
  wrote:

virsh # list --all
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such 
file or directory


you need to run virsh in read-only mode
virsh -r list --all


[root@ovirt1 test vdsm]# systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; 
vendor preset: enabled)
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─unlimited-core.conf

Re: [ovirt-users] vms in paused state

2016-04-29 Thread Nir Soffer
/run/vdsm/.recovery

On Fri, Apr 29, 2016 at 10:59 PM, Bill James  wrote:

> where do I find the recovery files?
>
> [root@ovirt1 test vdsm]# pwd
> /var/lib/vdsm
> [root@ovirt1 test vdsm]# ls -la
> total 16
> drwxr-xr-x   6 vdsm kvm    100 Mar 17 16:33 .
> drwxr-xr-x. 45 root root  4096 Apr 29 12:01 ..
> -rw-r--r--   1 vdsm kvm  10170 Jan 19 05:04 bonding-defaults.json
> drwxr-xr-x   2 vdsm root 6 Apr 19 11:34 netconfback
> drwxr-xr-x   3 vdsm kvm 54 Apr 19 11:35 persistence
> drwxr-x---.  2 vdsm kvm  6 Mar 17 16:33 transient
> drwxr-xr-x   2 vdsm kvm 40 Mar 17 16:33 upgrade
> [root@ovirt1 test vdsm]# locate recovery
> /opt/hp/hpdiags/en/tcstorage.ldinterimrecovery.htm
> /opt/hp/hpdiags/en/tcstorage.ldrecoveryready.htm
> /usr/share/doc/postgresql-9.2.15/html/archive-recovery-settings.html
> /usr/share/doc/postgresql-9.2.15/html/recovery-config.html
> /usr/share/doc/postgresql-9.2.15/html/recovery-target-settings.html
> /usr/share/pgsql/recovery.conf.sample
> /var/lib/nfs/v4recovery
>
>
> [root@ovirt1 test vdsm]# locate 757a5  (disk id)
>
> /ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118
>
> /ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118/211581dc-fa98-41be-a0b9-ace236149bc2
>
> /ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118/211581dc-fa98-41be-a0b9-ace236149bc2.lease
>
> /ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118/211581dc-fa98-41be-a0b9-ace236149bc2.meta
> [root@ovirt1 test vdsm]# locate 5bfb140 (vm id)
>
> /var/lib/libvirt/qemu/channels/5bfb140a-a971-4c9c-82c6-277929eb45d4.com.redhat.rhevm.vdsm
>
> /var/lib/libvirt/qemu/channels/5bfb140a-a971-4c9c-82c6-277929eb45d4.org.qemu.guest_agent.0
>
>
>
>
> On 4/29/16 10:02 AM, Michal Skrivanek wrote:
>
>
>
> On 29 Apr 2016, at 18:26, Bill James <bill.ja...@j2.com> wrote:
>
> yes they are still saying "paused" state.
> No, bouncing libvirt didn't help.
>
>
> Then my suspicion of vm recovery gets closer to a certainty:)
> Can you get one of the paused vm's .recovery file from /var/lib/vdsm and
> check it says Paused there? It's worth a shot to try to remove that file
> and restart vdsm, then check logs and that vm status...it should recover
> "good enough" from libvirt only.
> Try it with one first
>
> I noticed the errors about the ISO domain. Didn't think that was related.
> I have been migrating a lot of VMs to ovirt lately, and recently added
> another node.
> Also had some problems with /etc/exports for a while, but I think those
> issues are all resolved.
>
>
> Last "unresponsive" message in vdsm.log was:
>
> vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::*2016-04-21*
> 11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout)
> vmId=`b6a13808-9552-401b-840b-4f7022e8293d`::monitor become unresponsive
> (command timeout, age=310323.97)
> vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21
> 11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout)
> vmId=`5bfb140a-a971-4c9c-82c6-277929eb45d4`::monitor become unresponsive
> (command timeout, age=310323.97)
>
>
>
> Thanks.
>
>
>
> On 4/29/16 1:40 AM, Michal Skrivanek wrote:
>
>
> On 28 Apr 2016, at 19:40, Bill James  wrote:
>
> thank you for response.
> I bold-ed the ones that are listed as "paused".
>
>
> [root@ovirt1 test vdsm]# virsh -r list --all
>  Id    Name                           State
> 
>
>
>
>
> Looks like problem started around 2016-04-17 20:19:34,822, based on
> engine.log attached.
>
>
> yes, that time looks correct. Any idea what might have been a trigger?
> Anything interesting happened at that time (power outage of some host, some
> maintenance action, anything)?
> logs indicate a problem when vdsm talks to libvirt (all those "monitor
> become unresponsive")
>
> It does seem that at that time you started to have some storage
> connectivity issues - first one at 2016-04-17 20:06:53,929. And it
> doesn't look temporary, because such errors are still there a couple of hours
> later (in your most recent file you attached I can see one at 23:00:54).
> When I/O gets blocked the VMs may experience issues (then the VM gets Paused),
> or their qemu process gets stuck (resulting in libvirt either reporting an
> error or getting stuck as well -> resulting in what vdsm sees as "monitor
> unresponsive").
>
> Since you now bounced libvirtd - did it help? Do you still see wrong
> status for those VMs and still those "monitor unresponsive" errors in
> vdsm.log?
> If not... then I would suspect the "vm recovery" code is not working
> correctly. Milan is looking at that.
>
> Thanks,
> michal
>
>
> There's a lot of vdsm logs!
>
> fyi, the storage domain for these Vms is a "local" nfs share,
> 7e566f55-e060-47b7-bfa4-ac3c48d70dda.
>
> attached more logs.
>
>

Re: [ovirt-users] vms in paused state

2016-04-29 Thread Bill James

where do I find the recovery files?

[root@ovirt1 test vdsm]# pwd
/var/lib/vdsm
[root@ovirt1 test vdsm]# ls -la
total 16
drwxr-xr-x   6 vdsm kvm    100 Mar 17 16:33 .
drwxr-xr-x. 45 root root  4096 Apr 29 12:01 ..
-rw-r--r--   1 vdsm kvm  10170 Jan 19 05:04 bonding-defaults.json
drwxr-xr-x   2 vdsm root 6 Apr 19 11:34 netconfback
drwxr-xr-x   3 vdsm kvm 54 Apr 19 11:35 persistence
drwxr-x---.  2 vdsm kvm  6 Mar 17 16:33 transient
drwxr-xr-x   2 vdsm kvm 40 Mar 17 16:33 upgrade
[root@ovirt1 test vdsm]# locate recovery
/opt/hp/hpdiags/en/tcstorage.ldinterimrecovery.htm
/opt/hp/hpdiags/en/tcstorage.ldrecoveryready.htm
/usr/share/doc/postgresql-9.2.15/html/archive-recovery-settings.html
/usr/share/doc/postgresql-9.2.15/html/recovery-config.html
/usr/share/doc/postgresql-9.2.15/html/recovery-target-settings.html
/usr/share/pgsql/recovery.conf.sample
/var/lib/nfs/v4recovery


[root@ovirt1 test vdsm]# locate 757a5  (disk id)
/ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118
/ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118/211581dc-fa98-41be-a0b9-ace236149bc2
/ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118/211581dc-fa98-41be-a0b9-ace236149bc2.lease
/ovirt-store/nfs1/7e566f55-e060-47b7-bfa4-ac3c48d70dda/images/757a5e69-a791-4391-9d7d-9516bf7f2118/211581dc-fa98-41be-a0b9-ace236149bc2.meta
[root@ovirt1 test vdsm]# locate 5bfb140 (vm id)
/var/lib/libvirt/qemu/channels/5bfb140a-a971-4c9c-82c6-277929eb45d4.com.redhat.rhevm.vdsm
/var/lib/libvirt/qemu/channels/5bfb140a-a971-4c9c-82c6-277929eb45d4.org.qemu.guest_agent.0



On 4/29/16 10:02 AM, Michal Skrivanek wrote:



On 29 Apr 2016, at 18:26, Bill James wrote:



yes they are still saying "paused" state.
No, bouncing libvirt didn't help.


Then my suspicion of vm recovery gets closer to a certainty:)
Can you get one of the paused vm's .recovery file from /var/lib/vdsm 
and check it says Paused there? It's worth a shot to try to remove 
that file and restart vdsm, then check logs and that vm status...it 
should recover "good enough" from libvirt only.

Try it with one first


I noticed the errors about the ISO domain. Didn't think that was related.
I have been migrating a lot of VMs to ovirt lately, and recently 
added another node.
Also had some problems with /etc/exports for a while, but I think 
those issues are all resolved.



Last "unresponsive" message in vdsm.log was:

vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::*2016-04-21* 
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
vmId=`b6a13808-9552-401b-840b-4f7022e8293d`::monitor become 
unresponsive (command timeout, age=310323.97)
vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21 
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
vmId=`5bfb140a-a971-4c9c-82c6-277929eb45d4`::monitor become 
unresponsive (command timeout, age=310323.97)




Thanks.



On 4/29/16 1:40 AM, Michal Skrivanek wrote:



On 28 Apr 2016, at 19:40, Bill James  wrote:

thank you for response.
I bold-ed the ones that are listed as "paused".


[root@ovirt1 test vdsm]# virsh -r list --all
 Id    Name                           State






Looks like problem started around 2016-04-17 20:19:34,822, based on 
engine.log attached.


yes, that time looks correct. Any idea what might have been a 
trigger? Anything interesting happened at that time (power outage of 
some host, some maintenance action, anything)?
logs indicate a problem when vdsm talks to libvirt (all those
"monitor become unresponsive")


It does seem that at that time you started to have some storage 
connectivity issues - the first one at 2016-04-17 20:06:53,929. And it
doesn't look temporary, because such errors are still there a couple of
hours later (in your most recent file you attached I can see one at 23:00:54).
When I/O gets blocked the VMs may experience issues (then the VM gets
Paused), or their qemu process gets stuck (resulting in libvirt
either reporting an error or getting stuck as well -> resulting in what
vdsm sees as "monitor unresponsive").


Since you now bounced libvirtd - did it help? Do you still see wrong 
status for those VMs and still those "monitor unresponsive" errors 
in vdsm.log?
If not... then I would suspect the "vm recovery" code is not working
correctly. Milan is looking at that.


Thanks,
michal



There's a lot of vdsm logs!

fyi, the storage domain for these Vms is a "local" nfs share, 
7e566f55-e060-47b7-bfa4-ac3c48d70dda.


attached more logs.


On 04/28/2016 12:53 AM, Michal Skrivanek wrote:

On 27 Apr 2016, at 19:16, Bill James  wrote:

virsh # list --all
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

Re: [ovirt-users] vms in paused state

2016-04-29 Thread Michal Skrivanek


> On 29 Apr 2016, at 18:26, Bill James  wrote:
> 
> yes they are still saying "paused" state.
> No, bouncing libvirt didn't help.

Then my suspicion of vm recovery gets closer to a certainty:)
Can you get one of the paused vm's .recovery file from /var/lib/vdsm and check 
it says Paused there? It's worth a shot to try to remove that file and restart 
vdsm, then check logs and that vm status...it should recover "good enough" from 
libvirt only. 
Try it with one first
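
In concrete terms that is just moving the file aside (keeping a backup rather than
deleting it) and restarting vdsmd. A sketch for a single VM, assuming the /run/vdsm
path noted elsewhere in the thread and using one of the VM UUIDs from the vdsm.log
excerpts purely as an example:

    import shutil
    import subprocess

    vm_id = "5bfb140a-a971-4c9c-82c6-277929eb45d4"   # example UUID from the logs in this thread
    recovery = "/run/vdsm/%s.recovery" % vm_id
    shutil.move(recovery, recovery + ".bak")          # keep a copy instead of deleting
    subprocess.check_call(["systemctl", "restart", "vdsmd"])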

> I noticed the errors about the ISO domain. Didn't think that was related.
> I have been migrating a lot of VMs to ovirt lately, and recently added 
> another node.
> Also had some problems with /etc/exports for a while, but I think those 
> issues are all resolved.
> 
> 
> Last "unresponsive" message in vdsm.log was:
> 
> vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21 
> 11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
> vmId=`b6a13808-9552-401b-840b-4f7022e8293d`::monitor become unresponsive 
> (command timeout, age=310323.97)
> vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21 
> 11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
> vmId=`5bfb140a-a971-4c9c-82c6-277929eb45d4`::monitor become unresponsive 
> (command timeout, age=310323.97)
> 
> 
> 
> Thanks.
> 
> 
> 
>> On 4/29/16 1:40 AM, Michal Skrivanek wrote:
>> 
>>> On 28 Apr 2016, at 19:40, Bill James  wrote:
>>> 
>>> thank you for response.
>>> I bold-ed the ones that are listed as "paused".
>>> 
>>> 
>>> [root@ovirt1 test vdsm]# virsh -r list --all
>>>  Id    Name                           State
>>> 
>> 
>> 
>>> 
>>> 
>>> Looks like problem started around 2016-04-17 20:19:34,822, based on 
>>> engine.log attached.
>> 
>> yes, that time looks correct. Any idea what might have been a trigger? 
>> Anything interesting happened at that time (power outage of some host, some 
>> maintenance action, anything)?
>> logs indicate a problem when vdsm talks to libvirt (all those "monitor become
>> unresponsive")
>> 
>> It does seem that at that time you started to have some storage connectivity 
>> issues - the first one at 2016-04-17 20:06:53,929. And it doesn't look
>> temporary, because such errors are still there a couple of hours later (in your
>> most recent file you attached I can see one at 23:00:54).
>> When I/O gets blocked the VMs may experience issues (then the VM gets Paused),
>> or their qemu process gets stuck (resulting in libvirt either reporting an error
>> or getting stuck as well -> resulting in what vdsm sees as "monitor
>> unresponsive").
>> 
>> Since you now bounced libvirtd - did it help? Do you still see wrong status 
>> for those VMs and still those "monitor unresponsive" errors in vdsm.log?
>> If not... then I would suspect the "vm recovery" code is not working
>> correctly. Milan is looking at that.
>> 
>> Thanks,
>> michal
>> 
>> 
>>> There's a lot of vdsm logs!
>>> 
>>> fyi, the storage domain for these Vms is a "local" nfs share, 
>>> 7e566f55-e060-47b7-bfa4-ac3c48d70dda.
>>> 
>>> attached more logs.
>>> 
>>> 
 On 04/28/2016 12:53 AM, Michal Skrivanek wrote:
>> On 27 Apr 2016, at 19:16, Bill James  wrote:
>> 
>> virsh # list --all
>> error: failed to connect to the hypervisor
>> error: no valid connection
>> error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No 
>> such file or directory
>> 
> you need to run virsh in read-only mode
> virsh -r list --all
> 
> [root@ovirt1 test vdsm]# systemctl status libvirtd
> ● libvirtd.service - Virtualization daemon
>   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; 
> vendor preset: enabled)
>  Drop-In: /etc/systemd/system/libvirtd.service.d
>   └─unlimited-core.conf
>   Active: active (running) since Thu 2016-04-21 16:00:03 PDT; 5 days ago
> 
> 
> tried systemctl restart libvirtd.
> No change.
> 
> Attached vdsm.log and supervdsm.log.
> 
> 
> [root@ovirt1 test vdsm]# systemctl status vdsmd
> ● vdsmd.service - Virtual Desktop Server Manager
>   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor 
> preset: enabled)
>   Active: active (running) since Wed 2016-04-27 10:09:14 PDT; 3min 46s ago
> 
> 
> vdsm-4.17.18-0.el7.centos.noarch
 the vdsm.log attached is good, but it covers too short an interval; it only shows the
 recovery (vdsm restart) phase when the VMs are identified as paused... can
 you add earlier logs? Did you restart vdsm yourself or did it crash?
 
 
> libvirt-daemon-1.2.17-13.el7_2.4.x86_64
> 
> 
> Thanks.
> 
> 
> On 04/26/2016 11:35 PM, Michal Skrivanek wrote:
 On 27 Apr 2016, at 02:04, Nir Soffer  wrote:
 
 On Wed, Apr 27, 2016 at 2:03 AM, Bill James 

Re: [ovirt-users] vms in paused state

2016-04-29 Thread Bill James

yes they are still saying "paused" state.
No, bouncing libvirt didn't help.

I noticed the errors about the ISO domain. Didn't think that was related.
I have been migrating a lot of VMs to ovirt lately, and recently added 
another node.
Also had some problems with /etc/exports for a while, but I think those 
issues are all resolved.



Last "unresponsive" message in vdsm.log was:

vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::*2016-04-21* 
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
vmId=`b6a13808-9552-401b-840b-4f7022e8293d`::monitor become unresponsive 
(command timeout, age=310323.97)
vdsm.log.49.xz:jsonrpc.Executor/0::WARNING::2016-04-21 
11:00:54,703::vm::5067::virt.vm::(_setUnresponsiveIfTimeout) 
vmId=`5bfb140a-a971-4c9c-82c6-277929eb45d4`::monitor become unresponsive 
(command timeout, age=310323.97)




Thanks.



On 4/29/16 1:40 AM, Michal Skrivanek wrote:


On 28 Apr 2016, at 19:40, Bill James wrote:


thank you for response.
I bold-ed the ones that are listed as "paused".


[root@ovirt1 test vdsm]# virsh -r list --all
 Id    Name                           State






Looks like problem started around 2016-04-17 20:19:34,822, based on 
engine.log attached.


yes, that time looks correct. Any idea what might have been a trigger? 
Anything interesting happened at that time (power outage of some host, 
some maintenance action, anything)?
logs indicate a problem when vdsm talks to libvirt (all those "monitor
become unresponsive")


It does seem that at that time you started to have some storage 
connectivity issues - the first one at 2016-04-17 20:06:53,929. And it
doesn't look temporary, because such errors are still there a couple of
hours later (in your most recent file you attached I can see one at 23:00:54).
When I/O gets blocked the VMs may experience issues (then the VM gets
Paused), or their qemu process gets stuck (resulting in libvirt either
reporting an error or getting stuck as well -> resulting in what vdsm
sees as "monitor unresponsive").


Since you now bounced libvirtd - did it help? Do you still see wrong 
status for those VMs and still those "monitor unresponsive" errors in 
vdsm.log?
If not... then I would suspect the "vm recovery" code is not working
correctly. Milan is looking at that.


Thanks,
michal



There's a lot of vdsm logs!

fyi, the storage domain for these Vms is a "local" nfs share, 
7e566f55-e060-47b7-bfa4-ac3c48d70dda.


attached more logs.


On 04/28/2016 12:53 AM, Michal Skrivanek wrote:

On 27 Apr 2016, at 19:16, Bill James  wrote:

virsh # list --all
error: failed to connect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such 
file or directory


you need to run virsh in read-only mode
virsh -r list --all


[root@ovirt1 test vdsm]# systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor 
preset: enabled)
  Drop-In: /etc/systemd/system/libvirtd.service.d
   └─unlimited-core.conf
   Active: active (running) since Thu 2016-04-21 16:00:03 PDT; 5 days ago


tried systemctl restart libvirtd.
No change.

Attached vdsm.log and supervdsm.log.


[root@ovirt1 test vdsm]# systemctl status vdsmd
● vdsmd.service - Virtual Desktop Server Manager
   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor 
preset: enabled)
   Active: active (running) since Wed 2016-04-27 10:09:14 PDT; 3min 46s ago


vdsm-4.17.18-0.el7.centos.noarch

the vdsm.log attached is good, but it covers too short an interval; it only shows the
recovery (vdsm restart) phase when the VMs are identified as paused... can you add
earlier logs? Did you restart vdsm yourself or did it crash?



libvirt-daemon-1.2.17-13.el7_2.4.x86_64


Thanks.


On 04/26/2016 11:35 PM, Michal Skrivanek wrote:

On 27 Apr 2016, at 02:04, Nir Soffer  wrote:

On Wed, Apr 27, 2016 at 2:03 AM, Bill James  wrote:

I have a hardware node that has 26 VMs.
9 are listed as "running", 17 are listed as "paused".

In truth all VMs are up and running fine.

I tried telling the db they are up:

engine=> update vm_dynamic set status = 1 where vm_guid =(select
vm_guid from vm_static where vm_name = 'api1.test.j2noc.com');

GUI then shows it up for a short while,

then puts it back in paused state.

2016-04-26 15:16:46,095 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
(DefaultQuartzScheduler_Worker-16) [157cc21e] VM '242ca0af-4ab2-4dd6-b515-5
d435e6452c4'(api1.test.j2noc.com) moved from 'Up' --> 'Paused'
2016-04-26 15:16:46,221 INFO [org.ovirt.engine.core.dal.dbbroker.auditlogh
andling.AuditLogDirector] (DefaultQuartzScheduler_Worker-16) [157cc21e] Cor
relation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM api1.
test.j2noc.com has been paused.

Re: [ovirt-users] vms in paused state

2016-04-29 Thread Michal Skrivanek

> On 28 Apr 2016, at 19:40, Bill James  wrote:
> 
> thank you for response.
> I bold-ed the ones that are listed as "paused".
> 
> 
> [root@ovirt1 test vdsm]# virsh -r list --all
>  Id    Name                           State
> 
>  2     puppet.test.j2noc.com          running
>  4     sftp2.test.j2noc.com           running
>  5     oct.test.j2noc.com             running
>  6     sftp2.dev.j2noc.com            running
>  10    darmaster1.test.j2noc.com      running
>  14    api1.test.j2noc.com            running
>  25    ftp1.frb.test.j2noc.com        running
>  26    auto7.test.j2noc.com           running
>  32    epaymv02.j2noc.com             running
>  34    media2.frb.test.j2noc.com      running
>  36    auto2.j2noc.com                running
>  44    nfs.testhvy2.colo.j2noc.com    running
>  53    billapp-zuma1.dev.j2noc.com    running
>  54    billing-ci.dev.j2noc.com       running
>  60    log2.test.j2noc.com            running
>  63    log1.test.j2noc.com            running
>  69    sonar.dev.j2noc.com            running
>  73    billapp-ui1.dev.j2noc.com      running
>  74    billappvm01.dev.j2noc.com      running
>  75    db2.frb.test.j2noc.com         running
>  83    billapp-ui1.test.j2noc.com     running
>  84    epayvm01.test.j2noc.com        running
>  87    billappvm01.test.j2noc.com     running
>  89    etapi1.test.j2noc.com          running
>  93    billapp-zuma2.test.j2noc.com   running
>  94    git.dev.j2noc.com              running
> 
> Yes I did "systemctl restart libvirtd" which apparently also restarts vdsm?

yes, it does. 

> 
> 
> Looks like problem started around 2016-04-17 20:19:34,822, based on 
> engine.log attached.

yes, that time looks correct. Any idea what might have been a trigger? Anything 
interesting happened at that time (power outage of some host, some maintenance 
action, anything)?
logs indicate a problem when vdsm talks to libvirt (all those "monitor become
unresponsive")

It does seem that at that time you started to have some storage connectivity 
issues - the first one at 2016-04-17 20:06:53,929. And it doesn't look temporary,
because such errors are still there a couple of hours later (in your most recent file
you attached I can see one at 23:00:54).
When I/O gets blocked the VMs may experience issues (then the VM gets Paused), or
their qemu process gets stuck (resulting in libvirt either reporting an error or
getting stuck as well -> resulting in what vdsm sees as "monitor unresponsive").

Since you now bounced libvirtd - did it help? Do you still see wrong status for 
those VMs and still those "monitor unresponsive" errors in vdsm.log?
If not... then I would suspect the "vm recovery" code is not working correctly. Milan
is looking at that.
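
To narrow down when those warnings started without wading through every rotated
log by hand, a quick scan of the compressed vdsm logs is enough. A Python 3 sketch,
with the file naming taken from the vdsm.log.NN.xz excerpts quoted above:

    import glob
    import lzma

    # Print every "monitor unresponsive" warning with the log file it came from.
    for path in sorted(glob.glob("/var/log/vdsm/vdsm.log*.xz")):
        with lzma.open(path, "rt", errors="replace") as f:
            for line in f:
                if "_setUnresponsiveIfTimeout" in line:
                    print(path, line.rstrip())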

Thanks,
michal


> There's a lot of vdsm logs!
> 
> fyi, the storage domain for these Vms is a "local" nfs share, 
> 7e566f55-e060-47b7-bfa4-ac3c48d70dda.
> 
> attached more logs.
> 
> 
> On 04/28/2016 12:53 AM, Michal Skrivanek wrote:
>>> On 27 Apr 2016, at 19:16, Bill James wrote:
>>> 
>>> virsh # list --all
>>> error: failed to connect to the hypervisor
>>> error: no valid connection
>>> error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such 
>>> file or directory
>>> 
>> you need to run virsh in read-only mode
>> virsh -r list --all
>> 
>>> [root@ovirt1 test vdsm]# systemctl status libvirtd
>>> ● libvirtd.service - Virtualization daemon
>>>   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor 
>>> preset: enabled)
>>>  Drop-In: /etc/systemd/system/libvirtd.service.d
>>>   └─unlimited-core.conf
>>>   Active: active (running) since Thu 2016-04-21 16:00:03 PDT; 5 days ago
>>> 
>>> 
>>> tried systemctl restart libvirtd.
>>> No change.
>>> 
>>> Attached vdsm.log and supervdsm.log.
>>> 
>>> 
>>> [root@ovirt1 test vdsm]# systemctl status vdsmd
>>> ● vdsmd.service - Virtual Desktop Server Manager
>>>   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor 
>>> preset: enabled)
>>>   Active: active (running) since Wed 2016-04-27 10:09:14 PDT; 3min 46s ago
>>> 
>>> 
>>> vdsm-4.17.18-0.el7.centos.noarch
>> the vdsm.log attached is good, but it covers too short an interval; it only shows the
>> recovery (vdsm restart) phase when the VMs are identified as paused... can you
>> add earlier logs? Did you restart vdsm yourself or did it crash?
>> 
>> 
>>> libvirt-daemon-1.2.17-13.el7_2.4.x86_64
>>> 
>>> 
>>> Thanks.
>>> 
>>> 
>>> On 04/26/2016 11:35 PM, Michal Skrivanek wrote:
> On 27 Apr 2016, at 02:04, Nir Soffer wrote:
> 
> On Wed, Apr 27, 2016 at 2:03 AM, Bill James wrote:
>> I have a hardware node that has 26 VMs.
>> 9 are listed as "running", 17 are listed as "paused".
>> 
>> In truth all VMs are up and running fine.
>> 
>> I tried 

Re: [ovirt-users] vms in paused state

2016-04-28 Thread Michal Skrivanek

> On 27 Apr 2016, at 19:16, Bill James  wrote:
> 
> virsh # list --all
> error: failed to connect to the hypervisor
> error: no valid connection
> error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such 
> file or directory
> 

you need to run virsh in read-only mode
virsh -r list --all

> [root@ovirt1 test vdsm]# systemctl status libvirtd
> ● libvirtd.service - Virtualization daemon
>   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor 
> preset: enabled)
>  Drop-In: /etc/systemd/system/libvirtd.service.d
>   └─unlimited-core.conf
>   Active: active (running) since Thu 2016-04-21 16:00:03 PDT; 5 days ago
> 
> 
> tried systemctl restart libvirtd.
> No change.
> 
> Attached vdsm.log and supervdsm.log.
> 
> 
> [root@ovirt1 test vdsm]# systemctl status vdsmd
> ● vdsmd.service - Virtual Desktop Server Manager
>   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor 
> preset: enabled)
>   Active: active (running) since Wed 2016-04-27 10:09:14 PDT; 3min 46s ago
> 
> 
> vdsm-4.17.18-0.el7.centos.noarch

the vdsm.log attached is good, but it covers too short an interval; it only shows the
recovery (vdsm restart) phase when the VMs are identified as paused... can you add
earlier logs? Did you restart vdsm yourself or did it crash?


> libvirt-daemon-1.2.17-13.el7_2.4.x86_64
> 
> 
> Thanks.
> 
> 
> On 04/26/2016 11:35 PM, Michal Skrivanek wrote:
>>> On 27 Apr 2016, at 02:04, Nir Soffer  wrote:
>>> 
>>> On Wed, Apr 27, 2016 at 2:03 AM, Bill James  wrote:
 I have a hardware node that has 26 VMs.
 9 are listed as "running", 17 are listed as "paused".
 
 In truth all VMs are up and running fine.
 
 I tried telling the db they are up:
 
 engine=> update vm_dynamic set status = 1 where vm_guid =(select
 vm_guid from vm_static where vm_name = 'api1.test.j2noc.com');
 
 GUI then shows it up for a short while,
 
 then puts it back in paused state.
 
 2016-04-26 15:16:46,095 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
 (DefaultQuartzScheduler_Worker-16) [157cc21e] VM '242ca0af-4ab2-4dd6-b515-5
 d435e6452c4'(api1.test.j2noc.com) moved from 'Up' --> 'Paused'
 2016-04-26 15:16:46,221 INFO [org.ovirt.engine.core.dal.dbbroker.auditlogh
 andling.AuditLogDirector] (DefaultQuartzScheduler_Worker-16) [157cc21e] Cor
 relation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM api1.
 test.j2noc.com has been paused.
 
 
 Why does the engine think the VMs are paused?
 Attached engine.log.
 
 I can fix the problem by powering off the VM then starting it back up.
 But the VM is working fine! How do I get ovirt to realize that?
>>> If this is an issue in engine, restarting engine may fix this.
>>> but having this problem only with one node, I don't think this is the issue.
>>> 
>>> If this is an issue in vdsm, restarting vdsm may fix this.
>>> 
>>> If this does not help, maybe this is libvirt issue? did you try to check vm
>>> status using virsh?
>> this looks more likely as it seems such status is being reported
>> logs would help, vdsm.log at the very least.
>> 
>>> If virsh thinks that the vms are paused, you can try to restart libvirtd.
>>> 
>>> Please file a bug about this in any case with engine and vdsm logs.
>>> 
>>> Adding Michal in case he has better idea how to proceed.
>>> 
>>> Nir
> 
> 



Re: [ovirt-users] vms in paused state

2016-04-27 Thread Michal Skrivanek

> On 27 Apr 2016, at 02:04, Nir Soffer  wrote:
> 
> On Wed, Apr 27, 2016 at 2:03 AM, Bill James  wrote:
>> I have a hardware node that has 26 VMs.
>> 9 are listed as "running", 17 are listed as "paused".
>> 
>> In truth all VMs are up and running fine.
>> 
>> I tried telling the db they are up:
>> 
>> engine=> update vm_dynamic set status = 1 where vm_guid =(select
>> vm_guid from vm_static where vm_name = 'api1.test.j2noc.com');
>> 
>> GUI then shows it up for a short while,
>> 
>> then puts it back in paused state.
>> 
>> 2016-04-26 15:16:46,095 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
>> (DefaultQuartzScheduler_Worker-16) [157cc21e] VM '242ca0af-4ab2-4dd6-b515-5
>> d435e6452c4'(api1.test.j2noc.com) moved from 'Up' --> 'Paused'
>> 2016-04-26 15:16:46,221 INFO [org.ovirt.engine.core.dal.dbbroker.auditlogh
>> andling.AuditLogDirector] (DefaultQuartzScheduler_Worker-16) [157cc21e] Cor
>> relation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM api1.
>> test.j2noc.com has been paused.
>> 
>> 
>> Why does the engine think the VMs are paused?
>> Attached engine.log.
>> 
>> I can fix the problem by powering off the VM then starting it back up.
>> But the VM is working fine! How do I get ovirt to realize that?
> 
> If this is an issue in engine, restarting engine may fix this.
> but having this problem only with one node, I don't think this is the issue.
> 
> If this is an issue in vdsm, restarting vdsm may fix this.
> 
> If this does not help, maybe this is libvirt issue? did you try to check vm
> status using virsh?

this looks more likely as it seems such status is being reported
logs would help, vdsm.log at the very least.

> 
> If virsh thinks that the vms are paused, you can try to restart libvirtd.
> 
> Please file a bug about this in any case with engine and vdsm logs.
> 
> Adding Michal in case he has better idea how to proceed.
> 
> Nir



Re: [ovirt-users] vms in paused state

2016-04-26 Thread Nir Soffer
On Wed, Apr 27, 2016 at 2:03 AM, Bill James  wrote:
> I have a hardware node that has 26 VMs.
> 9 are listed as "running", 17 are listed as "paused".
>
> In truth all VMs are up and running fine.
>
> I tried telling the db they are up:
>
> engine=> update vm_dynamic set status = 1 where vm_guid =(select
> vm_guid from vm_static where vm_name = 'api1.test.j2noc.com');
>
> GUI then shows it up for a short while,
>
> then puts it back in paused state.
>
> 2016-04-26 15:16:46,095 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer]
> (DefaultQuartzScheduler_Worker-16) [157cc21e] VM '242ca0af-4ab2-4dd6-b515-5
> d435e6452c4'(api1.test.j2noc.com) moved from 'Up' --> 'Paused'
> 2016-04-26 15:16:46,221 INFO [org.ovirt.engine.core.dal.dbbroker.auditlogh
> andling.AuditLogDirector] (DefaultQuartzScheduler_Worker-16) [157cc21e] Cor
> relation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM api1.
> test.j2noc.com has been paused.
>
>
> Why does the engine think the VMs are paused?
> Attached engine.log.
>
> I can fix the problem by powering off the VM then starting it back up.
> But the VM is working fine! How do I get ovirt to realize that?

If this is an issue in engine, restarting engine may fix this.
but having this problem only with one node, I don't think this is the issue.

If this is an issue in vdsm, restarting vdsm may fix this.

If this does not help, maybe this is libvirt issue? did you try to check vm
status using virsh?
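
For a single VM that can be as simple as asking libvirt directly over the read-only
connection (the VM name below is just the example from your log excerpt):

    import subprocess

    # What libvirt itself thinks the domain state is, e.g. "running" or "paused".
    vm = "api1.test.j2noc.com"
    state = subprocess.check_output(["virsh", "-r", "domstate", vm])
    print(state.decode().strip())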

If virsh thinks that the vms are paused, you can try to restart libvirtd.

Please file a bug about this in any case with engine and vdsm logs.

Adding Michal in case he has better idea how to proceed.

Nir