Okay, so the system VMs are using NFS primary storage (I misread the
OP, sorry).  Make sure the KVM hosts can mount and write to:

10.42.1.101:/srv/nfs/eng/cs-primary

Also check libvirtd.log for errors.
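A quick manual check from one of the KVM hosts might look like this
(untested sketch: the export path is the one above, the mount point is
arbitrary, and the virsh pool name assumes CloudStack registered the
pool under its storage pool UUID, as it normally does on KVM):

```shell
# Run as root on a KVM host. Mount the primary storage export by hand
# and confirm it is writable; a hanging or failing touch points at the
# NFS server/export settings rather than at CloudStack itself.
mkdir -p /mnt/cs-primary-check
mount -t nfs 10.42.1.101:/srv/nfs/eng/cs-primary /mnt/cs-primary-check
touch /mnt/cs-primary-check/.rw-test && echo "read/write OK"
rm -f /mnt/cs-primary-check/.rw-test
umount /mnt/cs-primary-check

# Then check what libvirt itself sees. CloudStack usually names the
# libvirt pool after the storage pool UUID (9c6fd9a3-... in the logs
# quoted below), so refresh it and list its volumes.
virsh pool-list --all
virsh pool-refresh 9c6fd9a3-43e5-389a-9594-faecf178b4b9
virsh vol-list 9c6fd9a3-43e5-389a-9594-faecf178b4b9
```

If the mount works but `virsh vol-list` doesn't show the template
volume, the pool may simply need the refresh shown above.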

If you're not making progress and want to get up and running ASAP, try
reconfiguring CloudStack to use local storage for system VMs and
(assuming this works) sort out the NFS primary storage problem later.
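For reference, the usual way to do that is via the
`system.vm.use.local.storage` global setting (a sketch, assuming the
standard setting name and default database/service names; double-check
both for your CloudStack version, and note the same change can be made
in the UI under Global Settings):

```shell
# Tell CloudStack to place system VMs on local storage, then restart
# the management server so the new setting takes effect.
mysql -u cloud -p cloud -e \
  "UPDATE configuration SET value = 'true' WHERE name = 'system.vm.use.local.storage';"
service cloudstack-management restart
```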

Best regards,
Kirk

On 09/17/2013 02:22 AM, sriharsha work wrote:
> Hi Kirk,
> 
> Thanks for your reply. This is a blocker for us and is currently
> affecting all of our work, so any help digging further into the issue
> is much appreciated. I have a question.
> 
> 1. What should the directory be when mounting the system VM template
> location [2] on the NFS share?
> 
> 
> Error from agent.log on the host. It clearly points to an issue with
> the libvirt storage pools. Can you please help me understand whether
> anything else needs to be addressed to resolve this?
> 
> 
> 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-3:null) Request:Seq 14-1592393816:  { Cmd ,
> MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> [{"storage.CreateCommand":{"vo
> lId":9817,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-9736","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":9817,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
> }
> 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-3:null) Processing command:
> com.cloud.agent.api.storage.CreateCommand
> 2013-09-17 02:17:36,779 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-3:null) Failed to create volume:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'
> 2013-09-17 02:17:36,781 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-3:null) Seq 14-1592393816:  { Ans: , MgmtId:
> 161340856362, via: 14, Ver: v1, Flags: 110,
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> com.cloud.utils.exception.CloudRuntimeException\nMessage:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-4:null) Request:Seq 14-1592393817:  { Cmd ,
> MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> [{"StopCommand":{"isProxy":false,"vmName":"s-9736-VM","wait":0}}] }
> 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-4:null) Processing command:
> com.cloud.agent.api.StopCommand
> 2013-09-17 02:17:36,891 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Failed to get dom xml:
> org.libvirt.LibvirtException: Domain not found: no domain with matching
> uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Failed to get dom xml:
> org.libvirt.LibvirtException: Domain not found: no domain with matching
> uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Try to stop the vm at first
> 2013-09-17 02:17:36,895 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Failed to stop VM :s-9736-VM :
> org.libvirt.LibvirtException: Domain not found: no domain with matching
> uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
>         at org.libvirt.ErrorHandler.processError(Unknown Source)
>         at org.libvirt.Connect.processError(Unknown Source)
>         at org.libvirt.Connect.domainLookupByUUIDString(Unknown Source)
>         at org.libvirt.Connect.domainLookupByUUID(Unknown Source)
>         at
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(LibvirtComputingResource.java:4023)
>         at
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(Libvi
> 
> 
> Thanks 
> Sriharsha.
> 
> 
> On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <kirkkosin...@gmail.com> wrote:
> 
>     Hi, here is the error:
> 
>     2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
>     (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
>     MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
>     
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
>     com.cloud.utils.exception.CloudRuntimeException\nMessage:
>     org.libvirt.LibvirtException: Storage volume not found: no storage vol
>     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
>     com.cloud.utils.exception.CloudRuntimeException:
>     org.libvirt.LibvirtException: Storage volume not found: no storage vol
>     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
>     
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
>     
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
>     
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
>     
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
>     
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
>     com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
>     com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
>     com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
>     
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
>     
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
>     java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> 
>     I'm not certain what volume it is complaining about, but I suspect
>     secondary storage.  Log on to a host (in particular host 13 [1] since it
>     is confirmed to suffer from the issue) and try to manually mount the
>     full path of the directory with the system VM template of the secondary
>     storage NFS share [2].  The idea is to confirm the share and
>     subdirectories of the share are mountable.  Maybe during the maintenance
>     some hosts changed IPs and/or the secondary storage NFS share
>     permissions (or other settings) were messed up.
> 
>     If the mount doesn't work, fix whatever is causing it.  If it does work,
>     please collect additional info.  Enable DEBUG logging on the hosts [3]
>     (if necessary), wait for the error to occur, and upload the agent.log
>     from the host with the error.  It should have more details besides the
>     exception shown in the management-server.log.  If you have a lot of
>     hosts and don't want to enable DEBUG logging on every one, temporarily
>     disable most of them and do it on the remaining few.
> 
>     Best regards,
>     Kirk
> 
>     [1] "13" is the id of the host in the CloudStack database, so find out
>     which host it is with:
>     select * from `cloud`.`host` where id = 13 \G
> 
>     [2] Something like:
>     nfshost:/share/template/tmpl/2/123
> 
>     [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for FILE
>     and com.cloud to DEBUG.  Depending on the CloudStack version, it may or
>     may not be enabled by default, and the path may be /etc/cloud/agent/.
> 
> 
>     On 09/16/2013 07:36 PM, sriharsha work wrote:
>     > Replying on behalf of Matt. We are able to write data to the NFS
>     > drives. That's not an issue.
>     >
>     > Thanks
>     > Sriharsha
>     >
>     > Sent from my iPhone
>     >
>     >> On Sep 16, 2013, at 19:30, Ahmad Emneina <aemne...@gmail.com> wrote:
>     >>
>     >> Try to mount your primary storage to a compute host and try to
>     write to it.
>     >> Your NFS server might not have come back up properly
>     (settings-wise or all
>     >> the relevant services).
>     >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mfo...@hortonworks.com> wrote:
>     >>>
>     >>> Thank you Chiradeep.  Log snippet now available as
>     http://apaste.info/qBIB
>     >>> --Matt
>     >>>
>     >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal
>     >>> <chiradeep.vit...@citrix.com> wrote:
>     >>>
>     >>>> Attachments are stripped. Can you paste (say at
>     http://apaste.info/)
>     >>>>
>     >>>> From: Matt Foley <mfo...@hortonworks.com>
>     >>>> Date: Monday, September 16, 2013 4:58 PM
>     >>>>
>     >>>> We had a planned network outage this weekend, which inadvertently
>     >>> resulted
>     >>>> in making the NFS Shared Primary Storage (used by System VMs)
>     unavailable
>     >>>> for a day and a half.  (Guest VMs use local storage only, but
>     System VMs
>     >>>> use shared storage only.)  Cloudstack was not brought down
>     prior to the
>     >>>> outage.
>     >>>>
>     >>>> After network came back, we gracefully brought down all services
>     >>> including
>     >>>> cloudstack-management, mysql, and NFS, then actually rebooted
>     all servers
>     >>>> in the cluster and the NFS server (to make sure no stale file
>     handles),
>     >>>> then brought up services in the appropriate order.  Also
>     checked mysql
>     >>> for
>     >>>> table corruption, and found none.  Confirmed that the NFS
>     volumes are
>     >>>> mountable from all hosts, and in fact Shared Primary Storage is
>     being
>     >>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
>     >>>>
>     >>>> Nevertheless, when we try to bring up the cluster, we fail to
>     start the
>     >>>> system VMs, with errors "InsufficientServerCapacityException:
>     Unable to
>     >>>> create a deployment for VM".  The cause is not really insufficient
>     >>>> capacity, as actual usage of resources is tiny; these error
>     messages are
>     >>>> false explanations of the failure to create primary storage
>     volume for
>     >>> the
>     >>>> System VMs.
>     >>>>
>     >>>> Digging into management-server.log, the core issue seems to be
>     the ~160
>     >>>> line snippet from the log attached to this message as
>     >>>> cloudstack_debug_2013.09.16.log. The only Shared Primary
>     Storage pool is
>     >>>> pool 201, named "cs-primary".  It is mounted on all hosts as
>     >>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.
>      The log
>     >>>> shows the management server correctly identifying a particular
>     host as
>     >>>> being able to access pool 201, then trying to allocate a
>     primary storage
>     >>>> volume using the template with uuid
>     f23a16e7-b628-429e-83e1-698935588465.
>     >>>> It fails, but I cannot tell why.  I suspect its claim that
>     "Template 3
>     >>> has
>     >>>> already been downloaded to pool 201" is false, but I don't know
>     how to
>     >>>> check this (or fix if wrong).
>     >>>>
>     >>>> Any guidance for further debugging or fixing this would be GREATLY
>     >>>> appreciated.
>     >>>> Thanks,
>     >>>> --Matt
>     >>>
> 
> 
> 
> 
> -- 
> Thanks & Regards
> Sriharsha Devineni
