Thank you, Kirk, that was most helpful. I currently have the system running with system VMs on local storage, after losing all of our (supposedly) persistent volumes on shared storage. Once we get past the emergency I will try Kirk's suggestion of deleting the stale row from the template_spool_ref table.
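For the record, here is roughly what I plan to run, per Kirk's note below (a sketch only, not yet tested; the backup step is ordinary caution on my part, and keying the DELETE on local_path just mirrors Kirk's SELECT):

    -- Back up the cloud database first (e.g. with mysqldump), then
    -- confirm the stale row is the one in question:
    SELECT * FROM template_spool_ref
    WHERE local_path = 'f23a16e7-b628-429e-83e1-698935588465'\G

    -- If download_state = 'DOWNLOADED' but the file is absent from
    -- primary storage, remove the row so CloudStack re-copies the
    -- template from secondary storage on the next deployment:
    DELETE FROM template_spool_ref
    WHERE local_path = 'f23a16e7-b628-429e-83e1-698935588465';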
In the meantime, here's what I learned:

The volume f23a16e7-b628-429e-83e1-698935588465 is present in the template_spool_ref table with download_state = DOWNLOADED, but is not at the cs-primary pool location on NFS. In fact, there are no volume or template objects at all in the cs-primary (shared primary) pool location, even though there should be several.

To see what they should look like, I created new DATA and ROOT disks on shared storage, and both worked fine. The DATA disk created a volume object in the NFS directory; the ROOT disk created both a template and a volume object, with the path name bound to the volume id via the volumes table in the database, and a new DOWNLOADED entry for the ROOT volume in the template_spool_ref table.
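For anyone who wants to reproduce those checks, they can be done with queries along these lines (a sketch; the disk name is hypothetical and I am writing the column names from memory, so adjust to your schema):

    -- Find the NFS path CloudStack recorded for the new DATA volume:
    SELECT id, name, path, pool_id, state
    FROM volumes
    WHERE name = 'test-data-disk';  -- hypothetical name of my test disk

    -- Confirm the ROOT deployment added a DOWNLOADED template entry:
    SELECT pool_id, template_id, local_path, download_state
    FROM template_spool_ref
    ORDER BY id DESC LIMIT 5;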
The shared primary storage should have contained several high-value DATA volumes, as well as the system VM template the system evidently believed it had previously downloaded. I infer that the primary storage was deleted and recreated by CloudStack when the NFS storage came back online after an outage of more than 24 hours. This is disappointing, and rather ironic, since shared storage was chosen to be MORE persistent, not more vulnerable. I suspect an implicit re-registration of the primary storage, after the lengthy NFS outage, triggered the logic that erases primary storage upon registration. If so, that would be a major bug.

Thank you for your kind help.
--Matt

On Tue, Sep 17, 2013 at 7:01 PM, Kirk Kosinski <kirkkosin...@gmail.com> wrote:

> Hi, secondary storage is only mounted on an as-needed basis. When a KVM
> or XenServer host needs to do something on secondary storage, it will
> mount the full path it needs (e.g. nfshost:/share/template/tmpl/2/123),
> do what it needs to do, and unmount it.
>
> The error seems to be that CloudStack is looking for and not finding a
> volume (qcow2 disk) named "f23a16e7-b628-429e-83e1-698935588465" on the
> NFS primary storage. This file seems to be the system VM template.
> Does this file exist or not? I'd guess not, since CS says it can't find
> it.
>
> Check the status of this volume in the template_spool_ref table:
> SELECT * FROM template_spool_ref WHERE local_path =
> 'f23a16e7-b628-429e-83e1-698935588465'\G
>
> If it shows up in the database as download_state = DOWNLOADED but it
> does not exist on primary storage, back up the cloud database, then
> delete the row in template_spool_ref. This should force CS to
> re-download it (i.e. copy it from secondary storage to primary again,
> use it to deploy system VMs, and create a new entry for it in
> template_spool_ref).
>
> If it does exist on primary storage, maybe the file is corrupt. Compare
> the size and md5sum to the original on secondary storage. Let us know
> how it goes.
>
> Best regards,
> Kirk
>
> On 09/17/2013 04:47 PM, Matt Foley wrote:
> > Hi,
> > I've now heard that this problem, of CloudStack being messed up after
> > an interruption of NFS shared-storage access, is well known. Does
> > anyone have a fix or work-around?
> >
> > Kirk, thanks for your help so far.
> > Both the master and the host servers can mount both the primary and
> > secondary stores, and read and write them. No permissions or IP-access
> > restrictions seem broken.
> >
> > I also checked the log levels on the hosts; both FILE and com.cloud
> > were already set to DEBUG. I tried setting them to TRACE, but got no
> > additional useful information.
> >
> > On the host, I tried simply restarting the cloudstack-agent service.
> > The resulting logs contain the snippet below. The best interpretation
> > I can make of it is that "no storage vol with matching name
> > 'f23a16e7-b628-429e-83e1-698935588465'" is the key issue, and that it
> > should relate to secondary storage, where the templates are stored.
> > But this uuid doesn't seem to be related to the actual secondary
> > storage pool, whose uuid is b7fd7b11-c0f7-4717-8343-ff6fb9bff860. The
> > primary storage pool is uuid 9c6fd9a3-43e5-389a-9594-faecf178b4b9, and
> > it seems to be properly automatically mounted on all hosts and the
> > master.
> >
> > ** It concerns me that the secondary storage pool does NOT seem to be
> > automatically mounted. Is it supposed to be? If not, how are the hosts
> > supposed to find the templates before a System Router VM can even be
> > set up?
> >
> > Below is the relevant host agent.log snippet, and also a dump of the
> > storage_pool table from mysql.
> >
> > Thanks in advance for any suggestions.
> > --Matt
> >
> > ======================
> ...truncated...
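P.S. For anyone comparing notes: the pool uuids mentioned above can be pulled straight from the database with something like the following (column names from memory of the schema, so treat as approximate):

    -- List the registered storage pools and where they point:
    SELECT id, name, uuid, pool_type, host_address, path
    FROM storage_pool;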