Hi All, Maybe this is not related, but it seems to be a known issue that QEMU corrupts .qcow2 images with internal snapshots:
https://www.linux-kvm.org/images/6/65/02x08B-Max_Reitz-Backups_with_QEMU.pdf (slide 13/15)

Nicolas Bouige
DIMSI
cloud.dimsi.fr
4, avenue Laurent Cely
Tour d'Asnière - 92600 Asnière sur Seine
T/ +33 (0)6 28 98 53 40

________________________________
From: cloudstack-fan <cloudstack-...@protonmail.com.INVALID>
Sent: Saturday, 18 August 2018 13:06:08
To: users@cloudstack.apache.org
Subject: Re: qemu2 images are being corrupted

Dear colleagues,

You might find this interesting:
https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/

It seems that qemu-kvm really could corrupt a QCOW2 image. :-(

What do you think, is it possible to avoid that? Maybe there's an option to use the RAW format instead of QCOW2?

Thanks!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On 2 July 2018 12:21 PM, cloudstack-fan <cloudstack-...@protonmail.com> wrote:

> Dear colleagues,
>
> I'm posting as an anonymous user, because there's a thing that concerns me a
> little and I'd like to share my experience with you, so maybe some people
> can relate to it. ACS is amazing, it has been solving my tasks for 6 years; I'm
> running a few ACS-backed clouds that contain hundreds and hundreds of VMs.
> I'm enjoying ACS very much, but there's one thing that scares me sometimes.
>
> It happens pretty seldom, but the more VMs you have, the more chances you
> have of running into this glitch. It usually happens on the sly: you don't get any
> error messages in the log-files of your cloudstack-management server or
> cloudstack-agent, so you don't even know that something has happened until
> you see that a virtual machine is having major problems. If you're lucky, you
> see it on the same day it happens, but if you aren't, you won't suspect
> anything unusual for a week, and at some point you realize that the
> filesystem has become a mess and you can't do anything to restore it.
> You're trying to restore it from a snapshot, but if you don't have a snapshot
> that was created before the incident, your snapshots won't help. :-(
>
> I have experienced it about 5-7 times during the last 5-6 years, and a few
> conditions are always present:
> * it happens on KVM-based hosts (I experienced it with CentOS 6 and CentOS
> 7) with qcow2 images (both the 0.10 and 1.1 versions);
> * it happens on primary storages running different filesystems (I
> experienced it with local XFS and with network-based GFS2 and NFS);
> * it happens when a volume snapshot is being made, according to the
> log-files inside the VM (the guest operating system's kernel starts
> complaining about filesystem errors);
> * at the same time, as I wrote before, there are NO error messages in the
> log-files outside the VM whose disk image is corrupted;
> * but when you run `qemu-img check ...` against the image, you may see a
> lot of leaked clusters (that's why I'd strongly advise checking each and
> every image on each and every primary storage at least once per hour with a
> script run by your monitoring system, something like `for imagefile
> in $(find /var/lib/libvirt/images -maxdepth 1 -type f); do {
> /usr/bin/qemu-img check "${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... } fi;
> } done`);
> * when it happens, you can also find a record in the snapshot_store_ref table
> that refers to the snapshot on a primary storage (see an example here:
> https://pastebin.com/BuxCXVSq) - this record should have been removed when
> the snapshot's state changed from "BackingUp" to "BackedUp", but it
> isn't removed in this case.
> At the same time, this snapshot isn't
> listed in the output of `qemu-img snapshot -l ...`, which is why I suppose
> that the image gets corrupted when ACS deletes the snapshot that has been
> backed up (it tries to delete the snapshot, but something goes wrong, the
> image gets corrupted, and ACS thinks that everything is fine and changes the
> status to "BackedUp" without a bit of qualm);
> * if you try to restore this VM's image from the same snapshot that
> caused the destruction, or from any other snapshot made after it,
> you'll find the same corrupted filesystem inside, but the snapshot's image
> stored in your secondary storage doesn't show anything wrong when you
> run `qemu-img check ...` (so you can restore your image only if you have a
> snapshot that was created AND stored before the incident).
>
> As I wrote, I have seen this several times in different environments and
> different versions of ACS. I'm pretty sure it's not only me who has had the
> luck to experience this glitch, so let's share our stories. Maybe together
> we'll find out why it happens and how to prevent it in the future.
>
> Thanks in advance,
> An Anonymous ACS Fan
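For anyone who wants to set up the hourly integrity check described above, the inline one-liner could be expanded into a standalone script along these lines. This is only a sketch: the image directory, the use of a non-zero exit status for alerting, and the function name are assumptions to adapt to your own monitoring setup, not something prescribed by the original post.

```shell
#!/bin/sh
# Sketch of an hourly qcow2 integrity check, based on the one-liner above.
# ASSUMPTIONS: images live in /var/lib/libvirt/images (override via $1),
# and the caller (cron / monitoring agent) alerts on a non-zero exit code.

check_images() {
    image_dir="${1:-/var/lib/libvirt/images}"
    rc=0
    # -maxdepth 1, as in the original snippet: only top-level image files.
    # NOTE: this word-splitting loop (kept from the original) breaks on
    # filenames containing whitespace; use a while-read loop if you have any.
    for imagefile in $(find "$image_dir" -maxdepth 1 -type f 2>/dev/null); do
        if ! qemu-img check "$imagefile" >/dev/null 2>&1; then
            # Leaked clusters or corruption make qemu-img exit non-zero.
            echo "PROBLEM: $imagefile failed qemu-img check" >&2
            rc=1
        fi
    done
    return $rc
}

check_images "$@"
```

One caveat worth noting: running `qemu-img check` against an image that a VM is actively writing to can report transient leaked clusters, so treat a single failure as a signal to re-check rather than as proof of corruption.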