Hi Kirk,

Thanks for your reply. This is a blocker for us and is currently affecting all
of our work, so any help digging further into the issue is much appreciated. I
have a question.

1. What should the directory be when mounting the system VM template
location [2] on the NFS share?
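For reference, this is roughly what I tried, following your [2] (the host and
export path below are the placeholders from your example, not our real values):

```shell
# Manually mount the system VM template directory from the secondary
# storage NFS share (placeholder host/path from Kirk's example; substitute
# the actual server and export for your environment).
mkdir -p /mnt/tmpl-test
mount -t nfs nfshost:/share/template/tmpl/2/123 /mnt/tmpl-test

# If the mount succeeds, the template file should be visible here.
ls -l /mnt/tmpl-test

umount /mnt/tmpl-test
```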


Below is the error from agent.log on the host. It clearly points to an issue
with the libvirt storage pools. Can you please help me understand whether
anything else needs to be addressed to get this resolved?


2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-3:null) Request:Seq 14-1592393816:  { Cmd , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 100111,
[{"storage.CreateCommand":{"vo
lId":9817,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-9736","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":9817,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
}
2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-3:null) Processing command:
com.cloud.agent.api.storage.CreateCommand
2013-09-17 02:17:36,779 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-3:null) Failed to create volume:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'
2013-09-17 02:17:36,781 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-3:null) Seq 14-1592393816:  { Ans: , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 110,
[{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
com.cloud.utils.exception.CloudRuntimeException\nMessage:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Request:Seq 14-1592393817:  { Cmd , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 100111,
[{"StopCommand":{"isProxy":false,"vmName":"s-9736-VM","wait":0}}] }
2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Processing command:
com.cloud.agent.api.StopCommand
2013-09-17 02:17:36,891 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Failed to get dom xml:
org.libvirt.LibvirtException: Domain not found: no domain with matching
uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Failed to get dom xml:
org.libvirt.LibvirtException: Domain not found: no domain with matching
uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Try to stop the vm at first
2013-09-17 02:17:36,895 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Failed to stop VM :s-9736-VM :
org.libvirt.LibvirtException: Domain not found: no domain with matching
uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
        at org.libvirt.ErrorHandler.processError(Unknown Source)
        at org.libvirt.Connect.processError(Unknown Source)
        at org.libvirt.Connect.domainLookupByUUIDString(Unknown Source)
        at org.libvirt.Connect.domainLookupByUUID(Unknown Source)
        at
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(LibvirtComputingResource.java:4023)
        at
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(Libvi
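Since the agent complains about a missing libvirt storage volume, the pool can
also be inspected directly on the host with virsh (using the pool uuid and
volume name from the log above), e.g.:

```shell
# List all libvirt storage pools known to this host
virsh pool-list --all

# Ask libvirt to re-scan the primary pool's volumes
virsh pool-refresh 9c6fd9a3-43e5-389a-9594-faecf178b4b9

# Check whether the volume the agent is looking for actually exists
virsh vol-list 9c6fd9a3-43e5-389a-9594-faecf178b4b9 \
  | grep f23a16e7-b628-429e-83e1-698935588465
```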


Thanks
Sriharsha.


On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <kirkkosin...@gmail.com> wrote:

> Hi, here is the error:
>
> 2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
> (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
> MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
>
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> com.cloud.utils.exception.CloudRuntimeException\nMessage:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
>
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
>
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
>
> I'm not certain what volume it is complaining about, but I suspect
> secondary storage.  Log on to a host (in particular host 13 [1] since it
> is confirmed to suffer from the issue) and try to manually mount the
> full path of the directory with the system VM template of the secondary
> storage NFS share [2].  The idea is to confirm the share and
> subdirectories of the share are mountable.  Maybe during the maintenance
> some hosts changed IPs and/or the secondary storage NFS share
> permissions (or other settings) were messed up.
>
> If the mount doesn't work, fix whatever is causing it.  If it does work,
> please collect additional info.  Enable DEBUG logging on the hosts [3]
> (if necessary), wait for the error to occur, and upload the agent.log
> from the host with the error.  It should have more details besides the
> exception shown in the management-server.log.  If you have a lot of
> hosts and don't want to enable DEBUG logging on every one, temporarily
> disable most of them and do it on the remaining few.
>
> Best regards,
> Kirk
>
> [1] "13" is the id of the host in the CloudStack database, so find out
> which host it is with:
> select * from `cloud`.`host` where id = 13 \G
>
> [2] Something like:
> nfshost:/share/template/tmpl/2/123
>
> [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for FILE
> and com.cloud to DEBUG.  Depending on the CloudStack version, it may or
> may not be enabled by default, and the path may be /etc/cloud/agent/.
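> For example, the relevant entries look something like this (a sketch; the
> exact layout varies by CloudStack version):
>
> ```
> <appender name="FILE" class="org.apache.log4j.rolling.RollingFileAppender">
>   ...
>   <param name="Threshold" value="DEBUG"/>
> </appender>
>
> <category name="com.cloud">
>   <priority value="DEBUG"/>
> </category>
> ```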
>
>
> On 09/16/2013 07:36 PM, sriharsha work wrote:
> > Replying on behalf of Matt. We are able to write data to the NFS drives.
> > That's not an issue.
> >
> > Thanks
> > Sriharsha
> >
> > Sent from my iPhone
> >
> >> On Sep 16, 2013, at 19:30, Ahmad Emneina <aemne...@gmail.com> wrote:
> >>
> >> Try to mount your primary storage to a compute host and try to write to
> it.
> >> Your NFS server might not have come back up properly (settings-wise or
> all
> >> the relevant services).
> >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mfo...@hortonworks.com> wrote:
> >>>
> >>> Thank you Chiradeep.  Log snippet now available as
> http://apaste.info/qBIB
> >>> --Matt
> >>>
> >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
> >>> chiradeep.vit...@citrix.com> wrote:
> >>>
> >>>> Attachments are stripped. Can you paste (say at http://apaste.info/)
> >>>>
> >>>> From: Matt Foley <mfo...@hortonworks.com>
> >>>> Date: Monday, September 16, 2013 4:58 PM
> >>>>
> >>>> We had a planned network outage this weekend, which inadvertently
> >>> resulted
> >>>> in making the NFS Shared Primary Storage (used by System VMs)
> unavailable
> >>>> for a day and a half.  (Guest VMs use local storage only, but System
> VMs
> >>>> use shared storage only.)  Cloudstack was not brought down prior to
> the
> >>>> outage.
> >>>>
> >>>> After the network came back, we gracefully brought down all services
> >>> including
> >>>> cloudstack-management, mysql, and NFS, then actually rebooted all
> servers
> >>>> in the cluster and the NFS server (to make sure no stale file
> handles),
> >>>> then brought up services in the appropriate order.  Also checked mysql
> >>> for
> >>>> table corruption, and found none.  Confirmed that the NFS volumes are
> >>>> mountable from all hosts, and in fact Shared Primary Storage is being
> >>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
> >>>>
> >>>> Nevertheless, when we try to bring up the cluster, we fail to start the
> >>>> system VMs, with errors "InsufficientServerCapacityException: Unable
> to
> >>>> create a deployment for VM".  The cause is not really insufficient
> >>>> capacity, as actual usage of resources is tiny; these error messages
> are
> >>>> false explanations of the failure to create primary storage volume for
> >>> the
> >>>> System VMs.
> >>>>
> >>>> Digging into management-server.log, the core issue seems to be the
> ~160
> >>>> line snippet from the log attached to this message as
> >>>> cloudstack_debug_2013.09.16.log. The only Shared Primary Storage pool
> is
> >>>> pool 201, named "cs-primary".  It is mounted on all hosts as
> >>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log
> >>>> shows the management server correctly identifying a particular host as
> >>>> being able to access pool 201, then trying to allocate a primary
> storage
> >>>> volume using the template with uuid
> f23a16e7-b628-429e-83e1-698935588465.
> >>>> It fails, but I cannot tell why.  I suspect its claim that "Template 3
> >>> has
> >>>> already been downloaded to pool 201" is false, but I don't know how to
> >>>> check this (or fix if wrong).
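> >>>> (I assume this is tracked in the database, perhaps in a table like
> >>>> `template_spool_ref`; a sketch of a query, assuming that schema, which
> >>>> may differ by version:
> >>>>
> >>>> ```
> >>>> select download_state, install_path from `cloud`.`template_spool_ref`
> >>>>   where pool_id = 201 and template_id = 3 \G
> >>>> ```
> >>>> )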
> >>>>
> >>>> Any guidance for further debugging or fixing this would be GREATLY
> >>>> appreciated.
> >>>> Thanks,
> >>>> --Matt
> >>>
>



-- 
Thanks & Regards
Sriharsha Devineni
