Hi, I am a colleague of Daniel's and have been dealing a lot with the volume snapshot problems.

We are currently seeing two problems with the volume snapshots.

The first is long-running snapshots: they go into the "BackingUp" state, never finish, and thus block all further snapshots. These can be fixed reasonably easily by hacking the database and setting the stuck snapshot to "Error"; usually the next snapshot then resumes normally (a sketch of that fix is below). This can also happen due to restarts of the management service while OVF exports are running; in that case the ACPdoctor's cleanup usually fixes the stale snapshots.
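For reference, the "fix" is really just an UPDATE against the cloud database. A minimal sketch, assuming the stuck row lives in the snapshots table with its state in the status column (that matches our 4.7-era schema, but verify the table and column names and the snapshot id against your own installation before running anything):

# Sketch: mark a snapshot stuck in "BackingUp" as failed so the next
# scheduled snapshot of the series can run again.
# Table/column names ("snapshots", "status") are assumptions based on our
# 4.7-era schema; check them against your database first.
import pymysql

SNAPSHOT_ID = 12345  # hypothetical id of the stuck snapshot

conn = pymysql.connect(host="localhost", user="cloud",
                       password="secret", database="cloud")
try:
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE snapshots SET status = 'Error' "
            "WHERE id = %s AND status = 'BackingUp'",
            (SNAPSHOT_ID,),
        )
    conn.commit()
finally:
    conn.close()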
A second issue we have been seeing since late last year is snapshots failing because they cannot locate the volume they are supposed to snapshot. This seems to be due to inconsistencies between the database and VMware, specifically in the "Path" and "ChainInfo" fields. The Accelerite support engineer I am working with suspects a race condition causes this for VMs with multiple volumes. The current suspicion is:

- The root volume snapshot triggers; this causes the creation of vmnamehere-0003.vmdk.
- The data volume snapshot triggers; this causes the creation of vmnamehere-0004.vmdk.
- Once the snapshot is created, CP parses it and it is copied to secondary storage. During this process CP was unable to find the 0004.vmdk as part of the snapshot: when parsing the snapshot for the data disk it does find the 0004.vmdk on the datastore, but not when parsing the snapshot info from the vmsd, as you can see above (notice the snapshot.numSnapshots = "4" in both cases, when parsing the ROOT and the DATA disks).
- Later, once the snapshot copying is done, CP deletes the snapshot of the VM and then consolidates the disks. For the root disk CP tries to delete 0003.vmdk after copying, but this fails. It does mark 0004 as the top vmdk, though.
- CP keeps updating 0004.vmdk, but this vmdk is removed once the VM runs a consolidation.

I'm currently looking to verify this is indeed the root cause of the described error; one way to check is to compare what the vmsd reports against what actually sits on the datastore and in the database (see the sketch below).
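As a rough illustration of that check (not the tooling from support, just a sketch): the .vmsd file is a plain key = "value" dictionary, so something like the following can dump snapshot.numSnapshots and the per-snapshot disk file names for comparison with the vmdk files on the datastore and with the Path/ChainInfo values in the database. The key names match what a typical .vmsd contains; the datastore path in the example is only an assumption.

# Sketch: list the snapshot/disk entries VMware records in a VM's .vmsd
# file, to compare against the vmdk files on the datastore and the
# Path/ChainInfo stored in the cloud database.
import re
import sys

def parse_vmsd(path):
    # .vmsd lines look like: snapshot0.disk0.fileName = "vmname.vmdk"
    entries = {}
    pattern = re.compile(r'^\s*([\w.]+)\s*=\s*"(.*)"\s*$')
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            match = pattern.match(line)
            if match:
                entries[match.group(1)] = match.group(2)
    return entries

def main(vmsd_path):
    entries = parse_vmsd(vmsd_path)
    print("snapshot.numSnapshots =", entries.get("snapshot.numSnapshots"))
    for key in sorted(entries):
        if re.fullmatch(r"snapshot\d+\.disk\d+\.fileName", key):
            print(key, "=", entries[key])

if __name__ == "__main__":
    # example: python check_vmsd.py /vmfs/volumes/<datastore>/<vm>/<vm>.vmsd
    main(sys.argv[1])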
Regards
Simon Völker

Fraunhofer-Gesellschaft e.V.
Abteilung C7 Kommunikationsmanagement
Schloss Birlinghoven IZB, 53754 Sankt Augustin
Phone: (02241) 14-2311
E-mail: simon.voel...@zv.fraunhofer.de

On 08.02.2018 at 09:25, daniel.herrm...@zv.fraunhofer.de wrote:

Hi Sebastián,

Thank you for your answer. This is exactly the same problem we are facing. Some customers have >1 TB volumes, and it just takes ages to complete them. That by itself would not be the actual problem, but sometimes CS does not even create the snapshot from the recurring snapshot policy or, even worse, a snapshot is created but never finishes (>7 days) and remains in the BackingUp state, causing new snapshots of this series not to be created.

I read about a solution with Veeam and tags (e.g. let the customer tag the virtual machines, and Veeam automatically backs up the tagged machines), but this adds problems such as:
- How do we bill the usage of this method?
- We could restore the virtual machine to an earlier state, but if the customer accidentally deleted the machine, we cannot create it back from the backup, as CS would not recognize it again.

So... if anyone has further insight, we'd be happy to hear about it.

Regards
Daniel

On 08.02.18, 08:21, "Sebastian Gomez" <tioc...@gmail.com> wrote:

Hello Daniel,

We have the same environment, and the same problem. I agree, the volume snapshots are a pain in terms of the time they need. A volume snapshot does a full copy of the volume over the network from primary storage to secondary storage. Maybe there is a storage configuration that could optimize this, but we have iSCSI for primary storage and NFS for secondary... For some big volumes it takes up to 8 h to complete the snapshot. This is NOT a sustainable solution.

Our customers use the agents of their own backup solutions, and we (as providers) have a backup at the VMware level (you can find many solutions like vRanger, Veeam, ...); that is the only way we have found to have a disaster-recovery backup of the platform. We are now working on how to offer this backup to customers as a service, aiming at a single, scalable backup solution for the whole platform and all users. Veeam Backup, for example, offers many options to allow users to recover their own data (configuring access via its API), but the problem here is that in CloudStack you can't recover a virtual machine on the virtualization layer without informing the cloud manager...

Perhaps someone else can enlighten us.

Regards.

Sincerely,
Sebastián Gómez

On Wed, Feb 7, 2018 at 3:35 PM, <daniel.herrm...@zv.fraunhofer.de> wrote:

Hi All,

We are using CS 4.7.1 with the VMware hypervisor and advanced networking in a private cloud environment. Currently, most of our (internal) customers hosting internal services within this environment are using volume snapshots to facilitate backups of their virtual machines. Besides the obvious downsides of this approach (consistent snapshots of multiple volumes, …), we encounter serious problems using this feature. In ~10% of the cases, snapshots get stuck in the BackingUp state, which sometimes causes the whole snapshot queue to stall. In some other cases, recurring snapshots are correctly configured, but CS does not even try to create the snapshot; there is no entry in the database.

In summary, we are currently evaluating different options, hence my questions here:

* Are we the only ones encountering such massive problems with volume snapshots? Or is this a known problem? Is there anything we could look at, or a hint where we could start troubleshooting?
* How are you actually providing backup services to the customer? Are there other solutions or products that integrate with CS? When using an option other than the volume snapshots in CS, the most important factor would be to keep the ability for the customer to configure everything in self-service.

Thanks and regards
Daniel