Hi All, Maybe this is not related, but it seems to be a known issue that QEMU corrupts .qcow2 images with internal snapshots:
https://www.linux-kvm.org/images/6/65/02x08B-Max_Reitz-Backups_with_QEMU.pdf (slide 13/15)

Nicolas Bouige
DIMSI
cloud.dimsi.fr
4, avenue Laurent Cely
Tour d'Asnière - 92600 Asnière sur Seine
T/ +33 (0)6 28 98 53 40

________________________________
From: cloudstack-fan <cloudstack-...@protonmail.com.INVALID>
Sent: Saturday, 18 August 2018 13:06:08
To: users@cloudstack.apache.org
Subject: Re: qemu2 images are being corrupted

Dear colleagues,

You might find this interesting:
https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/

It seems that qemu-kvm really could corrupt a QCOW2 image. :-(

What do you think, is it possible to avoid that? Maybe there's an option to use the RAW format instead of QCOW2?

Thanks!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On 2 July 2018 12:21 PM, cloudstack-fan <cloudstack-...@protonmail.com> wrote:

> Dear colleagues,
>
> I'm posting as an anonymous user, because there's a thing that concerns me a
> little and I'd like to share my experience with you, so maybe some people
> can relate to it. ACS is amazing, it has been solving my tasks for 6 years; I'm
> running a few ACS-backed clouds that contain hundreds and hundreds of VMs.
> I'm enjoying ACS very much, but there's one thing that scares me sometimes.
>
> It happens pretty seldom, but the more VMs you have, the more chances you
> have of running into this glitch. It usually happens on the sly: you don't get any
> error messages in the log-files of your cloudstack-management server or
> cloudstack-agent, so you don't even know that something has happened until
> you see that a virtual machine is having major problems. If you're lucky, you
> see it on the same day it happens, but if you aren't, you won't suspect
> anything unusual for a week, and at some point you realize that the
> filesystem has become a mess and you can't do anything to restore it.
> You're trying to restore it from a snapshot, but if you don't have a snapshot
> that was created before the incident, your snapshots won't help. :-(
>
> I have experienced it about 5-7 times during the last 5-6 years, and a few
> conditions are always present:
> * it happens on KVM-based hosts (I experienced it with CentOS 6 and CentOS
> 7) with qcow2 images (both the 0.10 and 1.1 versions);
> * it happens on primary storages running different filesystems (I
> experienced it with local XFS and with network-based GFS2 and NFS);
> * it happens when a volume snapshot is being made, according to the
> log-files inside the VM (the guest operating system's kernel starts
> complaining about filesystem errors);
> * at the same time, as I wrote before, there are NO error messages in the
> log-files outside the VM whose disk image is corrupted;
> * but when you run `qemu-img check ...` against the image, you may see a
> lot of leaked clusters (that's why I'd strongly advise checking each and
> every image on each and every primary storage at least once per hour with a
> script run by your monitoring system, something like `for imagefile
> in $(find /var/lib/libvirt/images -maxdepth 1 -type f); do {
> /usr/bin/qemu-img check "${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... } fi;
> } done`);
> * when it happens, you can also find a record in the snapshot_store_ref table
> that refers to the snapshot on a primary storage (see an example here:
> https://pastebin.com/BuxCXVSq) - this record should have been removed when
> the snapshot's state changed from "BackingUp" to "BackedUp", but it
> isn't removed in this case.
> At the same time, this snapshot isn't
> listed in the output of `qemu-img snapshot -l ...`, which is why I suppose
> that the image gets corrupted when ACS deletes the snapshot that has been
> backed up (it tries to delete the snapshot, but something goes wrong, the
> image gets corrupted, and ACS thinks that everything is fine and changes the
> status to "BackedUp" without a bit of qualm);
> * if you try to restore this VM's image from the same snapshot that
> caused the destruction, or from any other snapshot made after it,
> you'll find the same corrupted filesystem inside, but the snapshot's image
> stored in your secondary storage doesn't show anything wrong when you
> run `qemu-img check ...` (so you can restore your image only if you have a
> snapshot that was created AND stored before the incident).
>
> As I wrote, I have seen this several times in different environments and
> different versions of ACS. I'm pretty sure it's not only me who has had the
> luck to experience this glitch, so let's share our stories. Maybe together
> we'll find out why it happens and how to prevent it in the future.
>
> Thanks in advance,
> An Anonymous ACS Fan
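For anyone who wants to set up the hourly integrity check described above, the inline one-liner could be expanded into a standalone script along these lines. This is only a sketch: the image directory, the use of a non-zero exit status for alerting, and the function name are assumptions to adapt to your own monitoring setup, not something prescribed by the original post.

```shell
#!/bin/sh
# Sketch of an hourly qcow2 integrity check, based on the one-liner above.
# ASSUMPTIONS: images live in /var/lib/libvirt/images (override via $1),
# and the caller (cron / monitoring agent) alerts on a non-zero exit code.

check_images() {
    image_dir="${1:-/var/lib/libvirt/images}"
    rc=0
    # -maxdepth 1, as in the original snippet: only top-level image files.
    # NOTE: this word-splitting loop (kept from the original) breaks on
    # filenames containing whitespace; use a while-read loop if you have any.
    for imagefile in $(find "$image_dir" -maxdepth 1 -type f 2>/dev/null); do
        if ! qemu-img check "$imagefile" >/dev/null 2>&1; then
            # Leaked clusters or corruption make qemu-img exit non-zero.
            echo "PROBLEM: $imagefile failed qemu-img check" >&2
            rc=1
        fi
    done
    return $rc
}

check_images "$@"
```

One caveat worth noting: running `qemu-img check` against an image that a VM is actively writing to can report transient leaked clusters, so treat a single failure as a signal to re-check rather than as proof of corruption.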