Hello, I also ran into this once in the past. I bet it's closely connected to qemu snapshots.
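If you want to check whether your own hosts are affected, here is a minimal sketch, assuming local qcow2 images (the directory is a placeholder, point it at your own primary storage mount):

    #!/usr/bin/env bash
    # Minimal sketch: list leftover internal snapshots and check every
    # image on a KVM host for leaked clusters.
    # /var/lib/libvirt/images is an assumption -- use your primary storage
    # mount point instead.
    IMAGE_DIR="/var/lib/libvirt/images"

    for image in "${IMAGE_DIR}"/*; do
        [[ -f "${image}" ]] || continue
        echo "=== ${image} ==="
        # Internal snapshots that ACS no longer knows about are a red flag.
        qemu-img snapshot -l "${image}"
        # Checking an image while its VM is running may report spurious
        # leaks, so re-check suspicious images when the VM is idle.
        qemu-img check "${image}" || echo "WARNING: ${image} failed the check"
    done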
2018-07-02 16:21 GMT+07:00 cloudstack-fan <cloudstack-...@protonmail.com.invalid>:

> Dear colleagues,
>
> I'm posting as an anonymous user because there's a thing that concerns me a little, and I'd like to share my experience with you, as maybe some people can relate. ACS is amazing and has been solving my tasks for 6 years; I'm running a few ACS-backed clouds that contain hundreds and hundreds of VMs. I'm enjoying ACS very much, but there's a thing that scares me sometimes.
>
> It happens pretty seldom, but the more VMs you have, the more likely you are to run into this glitch. It usually happens on the sly: you don't get any error messages in the log files of your cloudstack-management server or cloudstack-agent, so you don't even know that something has happened until you see that a virtual machine is having major problems. If you're lucky, you see it on the same day it happens; if you aren't, you won't suspect anything unusual for a week, until at some moment you realize that the filesystem has become a mess and you can't do anything to restore it. You try to restore it from a snapshot, but if you don't have a snapshot that was created before the incident, your snapshots won't help. :-(
>
> I experienced it about 5-7 times during the last 5-6 years, and a few conditions are always present:
> * it happens on KVM-based hosts (I experienced it with CentOS 6 and CentOS 7) with qcow2 images (both the 0.10 and 1.1 versions);
> * it happens on primary storages running different filesystems (I experienced it with local XFS and with network-based GFS2 and NFS);
> * it happens when a volume snapshot is being made, according to the log files inside the VM (the guest operating system's kernel starts complaining about filesystem errors);
> * at the same time, as I wrote before, there are NO error messages in the log files outside the VM whose disk image is corrupted;
> * but when you run `qemu-img check ...` against the image, you may see a lot of leaked clusters (that's why I'd strongly advise checking each and every image on each and every primary storage at least once per hour with a script run by your monitoring system, something like `for imagefile in $(find /var/lib/libvirt/images -maxdepth 1 -type f); do { /usr/bin/qemu-img check "${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... } fi; } done`);
> * when it happens, you can also find a record in the snapshot_store_ref table that refers to the snapshot on a primary storage (see an example here: https://pastebin.com/BuxCXVSq) - this record should have been removed when the snapshot's state changed from "BackingUp" to "BackedUp", but in this case it isn't removed (a sketch for automating this check appears at the end of this message).
> At the same time, this snapshot isn't listed in the output of `qemu-img snapshot -l ...`, which is why I suppose the image gets corrupted when ACS deletes the snapshot that has just been backed up (it tries to delete the snapshot, something goes wrong and the image is corrupted, but ACS thinks that everything is fine and changes the status to "BackedUp" without a qualm);
> * if you try to restore this VM's image from the same snapshot that caused the destruction, or from any other snapshot made after it, you'll find the same corrupted filesystem inside, yet the snapshot image stored on your secondary storage doesn't show anything wrong when you run `qemu-img check ...` (so you can restore your image only if you have a snapshot that was created AND stored before the incident).
>
> As I wrote, I've seen this several times, in different environments and different versions of ACS. I'm pretty sure it's not only me who has had the luck to experience this glitch, so let's share our stories. Maybe together we'll find out why it happens and how to prevent it in the future.
>
> Thanks in advance,
> An Anonymous ACS Fan

--
With best regards,
Ivan Kudryavtsev
Bitworks Software, Ltd.
Cell: +7-923-414-1515
WWW: http://bitworks.software/ <http://bw-sw.com/>
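For anyone who wants to automate the snapshot_store_ref check described above, here is a minimal sketch. The database name, credentials, and column names are assumptions recalled from a 4.x "cloud" schema, so verify them against your own deployment first:

    #!/usr/bin/env bash
    # Hedged sketch: find leftover snapshot_store_ref rows that still point
    # at primary storage although the snapshot is already "BackedUp" -- the
    # stuck-record condition described in the message above.
    # User, database, and column names are assumptions; adjust as needed.
    mysql -u cloud -p cloud -e "
        SELECT ssr.id, ssr.snapshot_id, ssr.store_role, s.status
        FROM snapshot_store_ref ssr
        JOIN snapshots s ON s.id = ssr.snapshot_id
        WHERE ssr.store_role = 'Primary'
          AND s.status = 'BackedUp';"

Any row this returns is worth cross-checking with `qemu-img snapshot -l` against the corresponding volume on primary storage: a stuck record combined with a snapshot missing from the qcow2 header matches the corruption pattern described above.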