Hi,

Colleague of Daniel's here; I have been dealing a lot with the volume snapshot
problems.

We are currently seeing two problems with the volume snapshots.
The first is long-running snapshots: they go into the "BackingUp" state, never
finish, and thus block all further snapshots. These can be fixed reasonably
easily by hacking the database and setting the stuck snapshot to "Error";
usually the next snapshot then resumes normally. The same can also happen due
to restarts of the management service while OVF exports are running; in that
case the ACPdoctor's cleanup usually fixes the stale snapshots.
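
For reference, the database hack amounts to something like the following (an
untested sketch against our 4.7.1 "cloud" schema; table and column names may
differ on other versions, so take a dump first):

    # Untested sketch: mark snapshots stuck in "BackingUp" as "Error" so the
    # next scheduled snapshot can run again. Table/column names are from our
    # CS 4.7.1 "cloud" schema and may differ -- take a database dump first.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="cloud",
                                   password="secret", database="cloud")
    cur = conn.cursor()
    # Only touch snapshots that have been "BackingUp" for more than 24 hours.
    cur.execute(
        "UPDATE snapshots SET status = 'Error' "
        "WHERE status = 'BackingUp' AND created < NOW() - INTERVAL 24 HOUR"
    )
    conn.commit()
    print(cur.rowcount, "stuck snapshot(s) marked as Error")
    conn.close()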

A second issue we have been seeing since late last year is snapshots failing
because they cannot locate the volume they are supposed to snapshot. This
seems to be due to inconsistencies between the database and VMware,
specifically in the "Path" and "ChainInfo" fields. The Accelerite support
engineer I am currently working with suspects a race condition causes this for
VMs with multiple volumes.
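
To spot affected volumes, we compare what the database thinks with what VMware
shows, roughly like this (a sketch; I am assuming volumes.path holds the
current vmdk and volumes.chain_info the disk chain as JSON, as in our schema,
and the instance name is a placeholder):

    # Sketch: dump path/chain_info for all volumes of one VM so they can be
    # compared manually against the vmdk chain on the datastore.
    # Schema details assumed from our CS 4.7.1 installation.
    import json
    import mysql.connector

    VM_NAME = "i-2-123-VM"  # placeholder instance name

    conn = mysql.connector.connect(host="localhost", user="cloud",
                                   password="secret", database="cloud")
    cur = conn.cursor()
    cur.execute(
        "SELECT v.name, v.path, v.chain_info FROM volumes v "
        "JOIN vm_instance i ON v.instance_id = i.id "
        "WHERE i.instance_name = %s AND v.removed IS NULL",
        (VM_NAME,),
    )
    for name, path, chain_info in cur.fetchall():
        chain = json.loads(chain_info) if chain_info else {}
        print(name, "path:", path)
        print("  chain:", chain.get("diskChain", chain))
    conn.close()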

The current suspicion is:

- The root volume snapshot triggers; this causes the creation of
vmnamehere-0003.vmdk.
- The data volume snapshot triggers; this causes the creation of
vmnamehere-0004.vmdk.
- Once the snapshot is created, CP parses it and then copies it to secondary
storage; during this process CP was unable to find the 0004.vmdk as part of
the snapshot.
- When parsing the snapshot for the data disk, it does find the 0004.vmdk on
the datastore, but not when parsing the snapshot info from the vmsd, as you
can see above (notice the snapshot.numSnapshots = "4" in both cases, when
parsing the ROOT and the DATA disk).
- Later, once the snapshot copying is done, CP deletes the snapshot of the VM
and then consolidates the disks.
- For the root disk, CP tries to delete 0003.vmdk after copying, but this
fails. It does mark 0004 as the top vmdk, though.
- CP keeps updating 0004.vmdk, but this vmdk is removed once the VM runs a
consolidation.
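
One way to check the vmsd side is a quick parse of the file and a diff against
the datastore contents (a rough sketch; the path is a placeholder, and I am
assuming the usual plain key = "value" vmsd format):

    # Sketch: parse a .vmsd (plain key = "value" lines) and list the vmdk
    # files each snapshot references, so the result can be diffed against the
    # files actually present on the datastore. The path is a placeholder.
    import re

    VMSD_PATH = "/vmfs/volumes/datastore1/vmnamehere/vmnamehere.vmsd"

    entries = {}
    with open(VMSD_PATH) as f:
        for line in f:
            m = re.match(r'\s*([\w.]+)\s*=\s*"(.*)"\s*$', line)
            if m:
                entries[m.group(1)] = m.group(2)

    print("numSnapshots:", entries.get("snapshot.numSnapshots"))
    for key in sorted(entries):
        # keys look like: snapshot0.disk0.fileName = "vmnamehere-000003.vmdk"
        if key.endswith(".fileName"):
            print(key, "->", entries[key])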

I'm currently looking to verify that this is indeed the root cause of the
described error.

Regards


Simon Völker
Fraunhofer-Gesellschaft e.V.
Abteilung C7 Kommunikationsmanagement
Schloss Birlinghoven IZB, 53754 Sankt Augustin
Phone: (02241) 14-2311
E-mail: simon.voel...@zv.fraunhofer.de



On 08.02.2018 at 09:25, daniel.herrm...@zv.fraunhofer.de wrote:

Hi Sebastián,

Thank you for your answer. This is exactly the same problem we are facing.
Some customers have >1 TB volumes, and it just takes ages to complete them.
That by itself would not be the actual problem, but sometimes CS does not even
create the snapshot from the recurring snapshot policy or, even worse, a
snapshot is created but never finishes (>7 d) and remains in the BackingUp
state, causing new snapshots of this series not to be created.

I read about a solution with Veeam and tags (e.g. let the customer tag the
virtual machines, and Veeam automatically backs up the tagged machines), but
this adds problems such as:

- How do we bill the usage of this method?
- We could restore the virtual machine to an earlier state, but if the
customer accidentally deleted the machine, we cannot recreate it from the
backup, as CS would not recognize it again.

So... if anyone has further insight, we'd be happy to hear about it.

Regards
Daniel

On 08.02.18, 08:21, "Sebastian Gomez" <tioc...@gmail.com> wrote:

   Hello Daniel,

   We have the same environment, and the same problem.
    I agree, the volume snapshots are painfully time-consuming. A volume
    snapshot does a full copy of the volume through the network from primary
    storage to the secondary. Maybe there is a storage configuration that
    could optimize this action, but we have iSCSI for primary storage and NFS
    for secondary... For some big volumes it takes up to 8 h to complete the
    snap.
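
    As a back-of-the-envelope check (assumed numbers, not measurements): a
    full copy of a 1 TiB volume at an effective ~35 MiB/s over the network
    works out to roughly those 8 h:

        # Rough arithmetic with assumed throughput, not a measurement.
        volume_bytes = 1 * 1024**4    # 1 TiB volume
        throughput = 35 * 1024**2     # ~35 MiB/s effective over the network
        hours = volume_bytes / throughput / 3600
        print(f"{hours:.1f} h")       # ~8.3 h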

   This is NOT a sustainable solution.

    Our customers use the agents of their own backup solutions, and we (as
    providers) take backups at the VMware level (you can find many solutions
    like vRanger, Veeam, ...); this is the only way we have found to get a
    disaster-recovery backup of the platform. We are now working on how to
    offer this backup to customers as a service, aiming at a single, scalable
    backup solution for the whole platform and its users.

    Along these lines, Veeam Backup for example offers many options to allow
    users to recover their own data (configuring access via its API), but the
    problem here is that in CloudStack you can't restore a virtual machine at
    the virtualization layer without informing the cloud manager...


    Perhaps someone else can enlighten us.




   Regards.





    Best regards,
   Sebastián Gómez

On Wed, Feb 7, 2018 at 3:35 PM, <daniel.herrm...@zv.fraunhofer.de> wrote:

Hi All,

We are using CS 4.7.1 with the VMware hypervisor and advanced networking in a
private cloud environment. Currently, most of our (internal) customers
hosting internal services within this environment use volume snapshots to
facilitate backups of their virtual machines. Besides the obvious downsides
of this approach (consistent snapshots of multiple volumes, …), we encounter
serious problems using this feature. In ~10% of the cases, snapshots get
stuck in the BackingUp state, which sometimes causes the whole snapshot queue
to stall. In some other cases, recurring snapshots are correctly configured,
but CS does not even try to create the snapshot; there is no entry in the
database.
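
A rough way to at least detect the missing runs would be a query along these
lines (an untested sketch; the table and column names are guesses based on our
4.7.1 schema):

    # Untested sketch: list volumes with an active snapshot policy whose
    # newest snapshot is older than 2 days (or that have no snapshot at all).
    # Table/column names assumed from our CS 4.7.1 "cloud" schema.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="cloud",
                                   password="secret", database="cloud")
    cur = conn.cursor()
    cur.execute(
        "SELECT p.volume_id, MAX(s.created) AS last_snap "
        "FROM snapshot_policy p "
        "LEFT JOIN snapshots s "
        "  ON s.volume_id = p.volume_id AND s.removed IS NULL "
        "WHERE p.active = 1 "
        "GROUP BY p.volume_id "
        "HAVING last_snap IS NULL OR last_snap < NOW() - INTERVAL 2 DAY"
    )
    for volume_id, last_snap in cur.fetchall():
        print("volume", volume_id, "last snapshot:", last_snap)
    conn.close()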

In summary, we are currently evaluating different options, hence my
questions here:


 *   Are we the only ones encountering such massive problems with volume
snapshots? Or is this a known problem? Anything we could look at, or a hint
where we could start troubleshooting?
 *   How are you actually providing backup services to your customers? Are
there other solutions or products that integrate with CS?

If we move to an option other than the volume snapshots in CS, the most
important factor would be to keep the ability for the customer to configure
everything in self-service.

Thanks and regards
Daniel



