> On 28 Apr 2016, at 19:40, Bill James <bill.ja...@j2.com> wrote:
> 
> Thank you for the response.
> I bolded the ones that are listed as "paused".
> 
> 
> [root@ovirt1 test vdsm]# virsh -r list --all
>  Id    Name                           State
> ----------------------------------------------------
>  2     puppet.test.j2noc.com          running
>  4     sftp2.test.j2noc.com           running
>  5     oct.test.j2noc.com             running
>  6     sftp2.dev.j2noc.com            running
>  10    darmaster1.test.j2noc.com      running
>  14    api1.test.j2noc.com            running
>  25    ftp1.frb.test.j2noc.com        running
>  26    auto7.test.j2noc.com           running
>  32    epaymv02.j2noc.com             running
>  34    media2.frb.test.j2noc.com      running
>  36    auto2.j2noc.com                running
>  44    nfs.testhvy2.colo.j2noc.com    running
>  53    billapp-zuma1.dev.j2noc.com    running
>  54    billing-ci.dev.j2noc.com       running
>  60    log2.test.j2noc.com            running
>  63    log1.test.j2noc.com            running
>  69    sonar.dev.j2noc.com            running
>  73    billapp-ui1.dev.j2noc.com      running
>  74    billappvm01.dev.j2noc.com      running
>  75    db2.frb.test.j2noc.com         running
>  83    billapp-ui1.test.j2noc.com     running
>  84    epayvm01.test.j2noc.com        running
>  87    billappvm01.test.j2noc.com     running
>  89    etapi1.test.j2noc.com          running
>  93    billapp-zuma2.test.j2noc.com   running
>  94    git.dev.j2noc.com              running
> 
> Yes, I did "systemctl restart libvirtd", which apparently also restarts vdsm?

yes, it does. 
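(vdsmd's unit requires libvirtd, and systemd propagates a restart of libvirtd to the 
units that require it. If you want to verify that on your host - a quick check, 
assuming stock unit names:

systemctl show -p Requires vdsmd.service
systemctl list-dependencies --reverse libvirtd.service

the first should list libvirtd.service, the second should list vdsmd.service)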

> 
> 
> Looks like problem started around 2016-04-17 20:19:34,822, based on 
> engine.log attached.

Yes, that time looks correct. Any idea what might have been the trigger? Did anything 
interesting happen at that time (a power outage of some host, some maintenance 
action, anything)? 
The logs indicate a problem when vdsm talks to libvirt (all those "monitor become 
unresponsive" errors).

It does seem that at that time you started to have storage connectivity issues - 
the first one at 2016-04-17 20:06:53,929. And it doesn't look temporary, because such 
errors are still there a couple of hours later (in the most recent file you attached 
I can see one at 23:00:54).
When I/O gets blocked, the VMs may experience issues (then the VM gets paused), or 
their qemu process gets stuck (resulting in libvirt either reporting an error or 
getting stuck as well -> resulting in what vdsm sees as "monitor unresponsive").
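If you want to line the two up, grepping both logs for the patterns quoted above 
should be enough (paths assume a default install; vdsm.log is on the host, 
engine.log on the engine machine):

grep -n "monitor become unresponsive" /var/log/vdsm/vdsm.log*
grep -n "moved from 'Up' --> 'Paused'" /var/log/ovirt-engine/engine.log

If the storage errors consistently precede the pause events, that points to I/O 
blocking as the cause.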

Since you have now bounced libvirtd - did it help? Do you still see the wrong status 
for those VMs, and still those "monitor unresponsive" errors in vdsm.log?
If not, then I would suspect the "vm recovery" code is not working correctly. Milan 
is looking into that.
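(A quick recheck on that host - same commands as before, assuming the default log 
location:

virsh -r list --all
tail -n 5000 /var/log/vdsm/vdsm.log | grep -c "monitor become unresponsive"

if virsh now agrees with reality and the count stays at 0, the monitors recovered 
and the recovery code is the remaining suspect.)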

Thanks,
michal


> There are a lot of vdsm logs!
> 
> FYI, the storage domain for these VMs is a "local" NFS share, 
> 7e566f55-e060-47b7-bfa4-ac3c48d70dda.
> 
> attached more logs.
> 
> 
> On 04/28/2016 12:53 AM, Michal Skrivanek wrote:
>>> On 27 Apr 2016, at 19:16, Bill James <bill.ja...@j2.com> wrote:
>>> 
>>> virsh # list --all
>>> error: failed to connect to the hypervisor
>>> error: no valid connection
>>> error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such 
>>> file or directory
>>> 
>> You need to run virsh in read-only mode:
>> virsh -r list --all
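>> (and for a single VM, e.g.:
>> virsh -r domstate api1.test.j2noc.com
>> will show whether libvirt itself considers it paused)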
>> 
>>> [root@ovirt1 test vdsm]# systemctl status libvirtd
>>> ● libvirtd.service - Virtualization daemon
>>>   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor 
>>> preset: enabled)
>>>  Drop-In: /etc/systemd/system/libvirtd.service.d
>>>           └─unlimited-core.conf
>>>   Active: active (running) since Thu 2016-04-21 16:00:03 PDT; 5 days ago
>>> 
>>> 
>>> Tried systemctl restart libvirtd.
>>> No change.
>>> 
>>> Attached vdsm.log and supervdsm.log.
>>> 
>>> 
>>> [root@ovirt1 test vdsm]# systemctl status vdsmd
>>> ● vdsmd.service - Virtual Desktop Server Manager
>>>   Loaded: loaded (/usr/lib/systemd/system/vdsmd.service; enabled; vendor 
>>> preset: enabled)
>>>   Active: active (running) since Wed 2016-04-27 10:09:14 PDT; 3min 46s ago
>>> 
>>> 
>>> vdsm-4.17.18-0.el7.centos.noarch
>> the vdsm.log attachment is good, but it covers too short an interval - it only 
>> shows the recovery (vdsm restart) phase, when the VMs are identified as paused. 
>> Can you add earlier logs? Did you restart vdsm yourself, or did it crash?
>> 
>> 
>>> libvirt-daemon-1.2.17-13.el7_2.4.x86_64
>>> 
>>> 
>>> Thanks.
>>> 
>>> 
>>> On 04/26/2016 11:35 PM, Michal Skrivanek wrote:
>>>>> On 27 Apr 2016, at 02:04, Nir Soffer <nsof...@redhat.com> wrote:
>>>>> 
>>>>> On Wed, Apr 27, 2016 at 2:03 AM, Bill James <bill.ja...@j2.com> wrote:
>>>>>> I have a hardware node that has 26 VMs.
>>>>>> 9 are listed as "running", 17 are listed as "paused".
>>>>>> 
>>>>>> In truth all VMs are up and running fine.
>>>>>> 
>>>>>> I tried telling the db they are up:
>>>>>> 
>>>>>> engine=> update vm_dynamic set status = 1 where vm_guid =(select
>>>>>> vm_guid from vm_static where vm_name = 'api1.test.j2noc.com');
>>>>>> 
>>>>>> GUI then shows it up for a short while,
>>>>>> 
>>>>>> then puts it back in paused state.
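>>>>>> A direct way to watch it flip back, using the same two tables (1 being the 
>>>>>> "up" value I set above):
>>>>>> 
>>>>>> engine=> select vs.vm_name, vd.status from vm_dynamic vd
>>>>>>         join vm_static vs on vd.vm_guid = vs.vm_guid
>>>>>>         where vs.vm_name = 'api1.test.j2noc.com';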
>>>>>> 
>>>>>> 2016-04-26 15:16:46,095 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-16) [157cc21e] VM '242ca0af-4ab2-4dd6-b515-5d435e6452c4'(api1.test.j2noc.com) moved from 'Up' --> 'Paused'
>>>>>> 2016-04-26 15:16:46,221 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-16) [157cc21e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM api1.test.j2noc.com has been paused.
>>>>>> 
>>>>>> 
>>>>>> Why does the engine think the VMs are paused?
>>>>>> Attached engine.log.
>>>>>> 
>>>>>> I can fix the problem by powering off the VM then starting it back up.
>>>>>> But the VM is working fine! How do I get ovirt to realize that?
>>>>> If this is an issue in the engine, restarting the engine may fix it,
>>>>> but since you have this problem on only one node, I don't think this is the 
>>>>> issue.
>>>>> 
>>>>> If this is an issue in vdsm, restarting vdsm may fix this.
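>>>>> (e.g. on the host: systemctl restart vdsmd)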
>>>>> 
>>>>> If this does not help, maybe this is a libvirt issue? Did you try to check 
>>>>> the VM status using virsh?
>>>> This looks more likely, as it seems such a status is being reported.
>>>> Logs would help - vdsm.log at the very least.
>>>> 
>>>>> If virsh thinks that the VMs are paused, you can try to restart libvirtd.
>>>>> 
>>>>> Please file a bug about this in any case, with engine and vdsm logs.
>>>>> 
>>>>> Adding Michal in case he has a better idea of how to proceed.
>>>>> 
>>>>> Nir
>>> 
>>> <supervdsm.log.gz><vdsm.log.gz>
> 
> <engine.log-20160421.gz><vdsm.logs.tar.gz>

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users
