Thank you, Kirk, that was most helpful. I currently have the system running with system VMs on local storage, after losing all of our (supposedly) persistent volumes on shared storage. Once we get past the emergency I will try Kirk's suggestion of deleting the stale row from the template_spool_ref table.
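For the record, here is roughly what I plan to run, per Kirk's note below (a sketch only, not yet tested; the backup step is ordinary caution on my part, and keying the DELETE on local_path just mirrors Kirk's SELECT):

    -- Back up the cloud database first (e.g. with mysqldump), then
    -- confirm the stale row is the one in question:
    SELECT * FROM template_spool_ref
    WHERE local_path = 'f23a16e7-b628-429e-83e1-698935588465'\G

    -- If download_state = 'DOWNLOADED' but the file is absent from
    -- primary storage, remove the row so CloudStack re-copies the
    -- template from secondary storage on the next deployment:
    DELETE FROM template_spool_ref
    WHERE local_path = 'f23a16e7-b628-429e-83e1-698935588465';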
In the meantime, here's what I learned:

The volume f23a16e7-b628-429e-83e1-698935588465 is present in the template_spool_ref table with download_state = DOWNLOADED, but is not at the cs-primary pool location on NFS. In fact, there are no volume or template objects at all in the cs-primary (shared primary) pool location, even though there should be several.

To see what they should look like, I created new DATA and ROOT disks on shared storage, and both worked fine. The DATA disk created a volume object in the NFS directory; the ROOT disk created both a template and a volume object, with the path name bound to the volume id via the volumes table in the database, and a new DOWNLOADED entry for the ROOT volume in the template_spool_ref table.
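For anyone who wants to reproduce those checks, they can be done with queries along these lines (a sketch; the disk name is hypothetical and I am writing the column names from memory, so adjust to your schema):

    -- Find the NFS path CloudStack recorded for the new DATA volume:
    SELECT id, name, path, pool_id, state
    FROM volumes
    WHERE name = 'test-data-disk';  -- hypothetical name of my test disk

    -- Confirm the ROOT deployment added a DOWNLOADED template entry:
    SELECT pool_id, template_id, local_path, download_state
    FROM template_spool_ref
    ORDER BY id DESC LIMIT 5;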
The shared primary storage should have contained several high-value DATA volumes, as well as the system VM template the system evidently believed it had previously downloaded. I infer that the primary storage was deleted and recreated by CloudStack when the NFS storage came back online after an outage of more than 24 hours. This is disappointing, and rather ironic, since shared storage was chosen to be MORE persistent, not more vulnerable. I suspect an implicit re-registration of the primary storage, after the lengthy NFS outage, triggered the logic that erases primary storage upon registration. If so, that would be a major bug.

Thank you for your kind help.
--Matt

On Tue, Sep 17, 2013 at 7:01 PM, Kirk Kosinski <kirkkosin...@gmail.com> wrote:

> Hi, secondary storage is only mounted on an as-needed basis. When a KVM
> or XenServer host needs to do something on secondary storage, it will
> mount the full path it needs (e.g. nfshost:/share/template/tmpl/2/123),
> do what it needs to do, and unmount it.
>
> The error seems to be that CloudStack is looking for and not finding a
> volume (qcow2 disk) named "f23a16e7-b628-429e-83e1-698935588465" on the
> NFS primary storage. This file seems to be the system VM template.
> Does this file exist or not? I'd guess not, since CS says it can't find
> it.
>
> Check the status of this volume in the template_spool_ref table:
> SELECT * FROM template_spool_ref WHERE local_path =
> 'f23a16e7-b628-429e-83e1-698935588465'\G
>
> If it shows up in the database as download_state = DOWNLOADED but it
> does not exist on primary storage, back up the cloud database, then
> delete the row in template_spool_ref. This should force CS to
> re-download it (i.e. copy it from secondary storage to primary again,
> use it to deploy system VMs, and create a new entry for it in
> template_spool_ref).
>
> If it does exist on primary storage, maybe the file is corrupt. Compare
> the size and md5sum to the original on secondary storage. Let us know
> how it goes.
>
> Best regards,
> Kirk
>
> On 09/17/2013 04:47 PM, Matt Foley wrote:
> > Hi,
> > I've now heard that this problem, of CloudStack being messed up after
> > an interruption of NFS shared-storage access, is well known. Does
> > anyone have a fix or work-around?
> >
> > Kirk, thanks for your help so far.
> > Both the master and the host servers can mount both the primary and
> > secondary stores, and read and write them. No permissions or IP-access
> > restrictions seem broken.
> >
> > I also checked the log levels on the hosts; both FILE and com.cloud
> > were already set to DEBUG. I tried setting them to TRACE, but got no
> > additional useful information.
> >
> > On the host, I tried simply restarting the cloudstack-agent service.
> > The resulting logs contain the snippet below. The best interpretation
> > I can make of it is that "no storage vol with matching name
> > 'f23a16e7-b628-429e-83e1-698935588465'" is the key issue, and that it
> > should relate to secondary storage, where the templates are stored.
> > But this uuid doesn't seem to be related to the actual secondary
> > storage pool, whose uuid is b7fd7b11-c0f7-4717-8343-ff6fb9bff860. The
> > primary storage pool is uuid 9c6fd9a3-43e5-389a-9594-faecf178b4b9, and
> > it seems to be properly automatically mounted on all hosts and the
> > master.
> >
> > ** It concerns me that the secondary storage pool does NOT seem to be
> > automatically mounted. Is it supposed to be? If not, how are the hosts
> > supposed to find the templates before a System Router VM can even be
> > set up?
> >
> > Below is the relevant host agent.log snippet, and also a dump of the
> > storage_pool table from mysql.
> >
> > Thanks in advance for any suggestions.
> > --Matt
> >
> > ======================
> ...truncated...
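P.S. For anyone comparing notes: the pool uuids mentioned above can be pulled straight from the database with something like the following (column names from memory of the schema, so treat as approximate):

    -- List the registered storage pools and where they point:
    SELECT id, name, uuid, pool_type, host_address, path
    FROM storage_pool;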