Re: [ceph-users] RBD snapshot atomicity guarantees?
On 21/12/2018 03.02, Gregory Farnum wrote:
> RBD snapshots are indeed crash-consistent. :)
> -Greg

Thanks for the confirmation! May I suggest putting this little nugget in the docs somewhere? It might help clarify things for others :)

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD snapshot atomicity guarantees?
On Tue, Dec 18, 2018 at 1:11 AM Hector Martin wrote:
> Hi list,
>
> I'm running libvirt qemu guests on RBD, and currently taking backups by
> issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw.
> This seems to be a common approach.
>
> This is safe, but it's impactful: the guest has frozen I/O for the
> duration of the snapshot. This is usually only a few seconds.
> Unfortunately, the freeze action doesn't seem to be very reliable.
> Sometimes it times out, leaving the guest in a messy situation with
> frozen I/O (thaw times out too when this happens, or returns success but
> FSes end up frozen anyway). This is clearly a bug somewhere, but I
> wonder whether the freeze is a hard requirement or not.
>
> Are there any atomicity guarantees for RBD snapshots taken *without*
> freezing the filesystem? Obviously the filesystem will be dirty and will
> require journal recovery, but that is okay; it's equivalent to a hard
> shutdown/crash. But is there any chance of corruption related to the
> snapshot being taken in a non-atomic fashion?

RBD snapshots are indeed crash-consistent. :)
-Greg

> Filesystems and applications these days should have no trouble with hard
> shutdowns, as long as storage writes follow ordering guarantees (no
> writes getting reordered across a barrier and such).
>
> Put another way: do RBD snapshots have ~identical atomicity guarantees
> to e.g. LVM snapshots?
>
> If we can get away without the freeze, honestly I'd rather go that
> route. If I really need to pause I/O during the snapshot creation, I
> might end up resorting to pausing the whole VM (suspend/resume), which
> has higher impact but also probably a much lower chance of messing up
> (or having excess latency), since it doesn't involve the guest OS or the
> qemu agent at all...
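The freeze/snapshot/thaw sequence described above could be sketched roughly as follows. This is a minimal illustration, not anyone's actual script from this thread; the domain name, image name and snapshot naming scheme are hypothetical, and error handling is deliberately simple.

```shell
#!/bin/sh
# Minimal sketch of the freeze -> snapshot -> thaw backup sequence.
snapshot_with_freeze() {
    dom=$1; image=$2
    snap="backup-$(date +%Y%m%d)"

    # Quiesce guest filesystems via the qemu guest agent.
    virsh domfsfreeze "$dom" || return 1

    # Take the RBD snapshot while guest I/O is frozen.
    if rbd snap create "$image@$snap"; then rc=0; else rc=1; fi

    # Always thaw, even if the snapshot failed.
    virsh domfsthaw "$dom"
    return $rc
}
```

Usage would be e.g. `snapshot_with_freeze web01 rbd/web01-disk0` (names hypothetical). Note that this naive form is exactly what goes wrong later in the thread when the freeze itself hangs.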
Re: [ceph-users] RBD snapshot atomicity guarantees?
On 18/12/2018 20:29, Oliver Freyermuth wrote:
> Potentially, if granted arbitrary command execution by the guest agent,
> you could check (there might be a better interface than parsing
> meminfo...):
>
>   cat /proc/meminfo | grep -i dirty
>   Dirty:             19476 kB
>
> You could guess from that information how long the fsfreeze may take
> (ideally, combining that with the allowed IOPS). Of course, if you have
> control over your VMs, you may also play with vm.dirty_ratio and
> vm.dirty_background_ratio.

I have that data (from node_exporter), but it looks like Dirty was only a few kB, peaking at about 3 MB during the problem interval. The problem is that there's no way to tell how long flushing will take without knowing the average I/O size involved: 1 MB of contiguous data will complete in negligible time, while 1 MB of 4 kB random writes will take a few seconds. I do have access to the VMs; customer stuff runs higher in the stack.

Still, given the time it took to flush and the I/Os involved (~23k I/Os during the time range of interest), I think something other than what the Dirty number accounts for was in play. 23k I/Os * 4 kB (page size, worst case) is ~94 MB, which is definitely not what I had as Dirty. Perhaps it was dirty entries in the inode cache (which would explain the peak in Dirty, as they were flushed to disk buffers first and then to disk).

> Interestingly, tuned on CentOS 7 configures vm.dirty_ratio = 30 for its
> "virtual-guest" profile (the default is 20 %), so they optimize for
> performance by increasing the dirty buffers to delay writeback even more.
> They take the opposite approach for their "virtual-host" profile:
> vm.dirty_background_ratio = 5 (the default is 10 %). I believe these
> choices are good for performance, but may increase the time it takes to
> freeze the VMs, especially if IOPS are limited and there's a lot of
> dirty data.

Yeah, I may need to try playing with some of those settings if this becomes a further problem in the future. FWIW, our hosts and VMs are both Ubuntu 16.04.
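A rough guess of the kind Oliver suggests could be sketched as below. The model is an assumption of mine, not something from this thread: it pessimistically treats every dirty page as a separate 4 kB random write against a given IOPS budget, and, as discussed above, it cannot see dirty inode/metadata state, so it is only a lower bound on what a freeze has to flush.

```shell
#!/bin/sh
# Hypothetical helper: estimate worst-case fsfreeze flush time from Dirty:
# in /proc/meminfo, assuming 4 kB random writes at a fixed IOPS budget.
estimate_flush_seconds() {
    iops=$1
    meminfo=${2:-/proc/meminfo}   # overridable for testing
    dirty_kb=$(awk '/^Dirty:/ {print $2}' "$meminfo")
    pages=$((dirty_kb / 4))
    # Round up to whole seconds.
    echo $(( (pages + iops - 1) / iops ))
}
```

For example, `estimate_flush_seconds 100` inside the guest would turn Oliver's sample value of 19476 kB into roughly 49 seconds at 100 IOPS.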
> Since we also have 1 Gbps links and HDD OSDs, and plan to add more and
> more VMs and hosts, we may also observe this one day... So I'm curious:
> how did you implement the timeout in your case? Are you using
> qemu-agent-command to issue fsfreeze with --async and --timeout instead
> of domfsfreeze? We are using domfsfreeze as of now, which (probably) has
> an infinite timeout, or at least no timeout documented in the manpage.

We have a wrapper that takes the snapshots; it just uses domfsfreeze and times out and kills the command if it takes too long. Unsurprisingly, that doesn't abort the freeze, so libvirt just ends up running it in the background (with a lock taken, so domfsthaw doesn't work until it completes).

The logic I have right now actually tries several times to thaw the filesystems, and if that doesn't succeed it resets the VM to avoid leaving it in a frozen state. However, I had a logic bug: if the freeze itself timed out, it did not do that (on the assumption that the VM wasn't frozen), when in this case the freeze was just taking a while. That leaves the VM frozen and broken. I'll probably add some alerting to complain loudly when this happens, increase the thaw timeout/retries, and then switch to unconditionally resetting the VM if thawing fails.

Ultimately this whole thing is kind of fragile, so if I can get away without freezing at all, it would probably make the whole process a lot more robust.

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
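The corrected wrapper logic Hector describes (bounded freeze, thaw retries, hard reset as last resort) might look roughly like this. It is a sketch under assumptions, not the actual script from this thread; domain/image names, timeouts and retry counts are illustrative.

```shell
#!/bin/sh
# Sketch: bound the freeze with a timeout, snapshot, thaw with retries,
# and hard-reset the VM as a last resort so it is never left frozen.
FREEZE_TIMEOUT=${FREEZE_TIMEOUT:-60}
THAW_RETRIES=${THAW_RETRIES:-5}

safe_snapshot() {
    dom=$1; image=$2; snap=$3

    # A timed-out freeze may still complete in the background and leave
    # the guest frozen, so never skip the thaw/reset path below.
    if ! timeout "$FREEZE_TIMEOUT" virsh domfsfreeze "$dom"; then
        echo "warning: freeze of $dom failed or timed out" >&2
    fi

    rbd snap create "$image@$snap"

    i=0
    while [ "$i" -lt "$THAW_RETRIES" ]; do
        if virsh domfsthaw "$dom"; then
            return 0
        fi
        i=$((i + 1))
        sleep 5
    done

    # Thawing never succeeded: reset rather than leave the guest frozen.
    virsh reset "$dom"
    return 1
}
```

The key design point is that the freeze result is treated as untrustworthy: whether it "succeeded", "failed" or "timed out", the guest may still be frozen, so the thaw/reset path runs unconditionally.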
Re: [ceph-users] RBD snapshot atomicity guarantees?
On 18.12.18 at 11:48, Hector Martin wrote:
> On 18/12/2018 18:28, Oliver Freyermuth wrote:
>> We have yet to observe these hangs; we have been running this with ~5 VMs
>> with ~10 disks for about half a year now, with daily snapshots. But all
>> of these VMs have very "low" I/O, since we put anything I/O intensive on
>> bare metal (but with automated provisioning, of course).
>>
>> So I'll chime in on your question, especially since there might be VMs on
>> our cluster in the future where the inner OS may not be running an agent.
>> Since we did not observe this yet, I'll also add: What's your "scale", is
>> it hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?
>
> 5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads);
> usually not that high, but it can easily peak at 100% when certain things
> happen. We don't have great I/O performance (RBD over 1 Gbps links to HDD
> OSDs).
>
> I'm poring through monitoring graphs now and I think the issue this time
> around was just too much dirty data in the page cache of a guest. The VM
> that failed spent 3 minutes flushing out writes to disk before its I/O
> was quiesced, at around 100 IOPS throughput (the actual data throughput
> was low, though, so small writes). That exceeded our timeout and then
> things went south from there.
>
> I wasn't sure if fsfreeze did a full sync to disk, but given the I/O
> behavior I'm seeing that seems to be the case. Unfortunately, coming up
> with an upper bound for the freeze time seems tricky now. I'm increasing
> our timeout to 15 minutes; we'll see if the problem recurs.
>
> Given this, it makes even more sense to just avoid the freeze if at all
> reasonable. There's no real way to guarantee that an fsfreeze will
> complete in a "reasonable" amount of time, as far as I can tell.
Potentially, if granted arbitrary command execution by the guest agent, you could check (there might be a better interface than parsing meminfo...):

  cat /proc/meminfo | grep -i dirty
  Dirty:             19476 kB

You could guess from that information how long the fsfreeze may take (ideally, combining that with the allowed IOPS). Of course, if you have control over your VMs, you may also play with vm.dirty_ratio and vm.dirty_background_ratio.

Interestingly, tuned on CentOS 7 configures vm.dirty_ratio = 30 for its "virtual-guest" profile (the default is 20 %), so it optimizes for performance by increasing the dirty buffers to delay writeback even more. It takes the opposite approach for its "virtual-host" profile: vm.dirty_background_ratio = 5 (the default is 10 %). I believe these choices are good for performance, but they may increase the time it takes to freeze the VMs, especially if IOPS are limited and there's a lot of dirty data.

Since we also have 1 Gbps links and HDD OSDs, and plan to add more and more VMs and hosts, we may also observe this one day... So I'm curious: how did you implement the timeout in your case? Are you using qemu-agent-command to issue fsfreeze with --async and --timeout instead of domfsfreeze? We are using domfsfreeze as of now, which (probably) has an infinite timeout, or at least no timeout documented in the manpage.

Cheers,
Oliver
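Going in the opposite direction from tuned's virtual-guest profile, a guest could bound how much dirty data can accumulate before a freeze with a sysctl drop-in along these lines. The file path and values are hypothetical illustrations, not a recommendation from this thread; lower thresholds trade steady-state write performance for a shorter worst-case flush.

```ini
# /etc/sysctl.d/90-writeback.conf (hypothetical example)
# Start background writeback at 2% of RAM instead of the 10% default:
vm.dirty_background_ratio = 2
# Force writers into synchronous writeback at 10% instead of 20%:
vm.dirty_ratio = 10
```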
Re: [ceph-users] RBD snapshot atomicity guarantees?
On 18/12/2018 18:28, Oliver Freyermuth wrote:
> We have yet to observe these hangs; we have been running this with ~5 VMs
> with ~10 disks for about half a year now, with daily snapshots. But all
> of these VMs have very "low" I/O, since we put anything I/O intensive on
> bare metal (but with automated provisioning, of course).
>
> So I'll chime in on your question, especially since there might be VMs on
> our cluster in the future where the inner OS may not be running an agent.
> Since we did not observe this yet, I'll also add: What's your "scale", is
> it hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?

5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads); usually not that high, but it can easily peak at 100% when certain things happen. We don't have great I/O performance (RBD over 1 Gbps links to HDD OSDs).

I'm poring through monitoring graphs now and I think the issue this time around was just too much dirty data in the page cache of a guest. The VM that failed spent 3 minutes flushing out writes to disk before its I/O was quiesced, at around 100 IOPS throughput (the actual data throughput was low, though, so small writes). That exceeded our timeout and then things went south from there.

I wasn't sure if fsfreeze did a full sync to disk, but given the I/O behavior I'm seeing, that seems to be the case. Unfortunately, coming up with an upper bound for the freeze time seems tricky now. I'm increasing our timeout to 15 minutes; we'll see if the problem recurs.

Given this, it makes even more sense to just avoid the freeze if at all reasonable. There's no real way to guarantee that an fsfreeze will complete in a "reasonable" amount of time, as far as I can tell.

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
Re: [ceph-users] RBD snapshot atomicity guarantees?
For what it's worth, we have been using snapshots on a daily basis for a couple of thousand RBD volumes for some time now. So far so good; we have not caught any issues.

On 12/18/2018 10:28 AM, Oliver Freyermuth wrote:
> Dear Hector,
>
> we are using the very same approach on CentOS 7 (freeze + thaw), but
> preceded by an fstrim. With virtio-scsi, using fstrim propagates the
> discards from within the VM to Ceph RBD (if qemu is configured
> accordingly), and a lot of space is saved.
>
> We have yet to observe these hangs; we are running this with ~5 VMs with
> ~10 disks for about half a year now with daily snapshots. But all of
> these VMs have very "low" I/O, since we put anything I/O intensive on
> bare metal (but with automated provisioning of course).
>
> So I'll chime in on your question, especially since there might be VMs on
> our cluster in the future where the inner OS may not be running an agent.
> Since we did not observe this yet, I'll also add: What's your "scale", is
> it hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?
>
> Cheers,
> Oliver
>
> Am 18.12.18 um 10:10 schrieb Hector Martin:
>> Hi list,
>>
>> I'm running libvirt qemu guests on RBD, and currently taking backups by
>> issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw.
>> This seems to be a common approach.
>>
>> This is safe, but it's impactful: the guest has frozen I/O for the
>> duration of the snapshot. This is usually only a few seconds.
>> Unfortunately, the freeze action doesn't seem to be very reliable.
>> Sometimes it times out, leaving the guest in a messy situation with
>> frozen I/O (thaw times out too when this happens, or returns success
>> but FSes end up frozen anyway). This is clearly a bug somewhere, but I
>> wonder whether the freeze is a hard requirement or not.
>>
>> Are there any atomicity guarantees for RBD snapshots taken *without*
>> freezing the filesystem? Obviously the filesystem will be dirty and
>> will require journal recovery, but that is okay; it's equivalent to a
>> hard shutdown/crash. But is there any chance of corruption related to
>> the snapshot being taken in a non-atomic fashion? Filesystems and
>> applications these days should have no trouble with hard shutdowns, as
>> long as storage writes follow ordering guarantees (no writes getting
>> reordered across a barrier and such).
>>
>> Put another way: do RBD snapshots have ~identical atomicity guarantees
>> to e.g. LVM snapshots?
>>
>> If we can get away without the freeze, honestly I'd rather go that
>> route. If I really need to pause I/O during the snapshot creation, I
>> might end up resorting to pausing the whole VM (suspend/resume), which
>> has higher impact but also probably a much lower chance of messing up
>> (or having excess latency), since it doesn't involve the guest OS or
>> the qemu agent at all...
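The fstrim trick quoted above relies on discard passthrough from the guest down to RBD. A minimal sketch of the relevant libvirt disk definition is below; the pool/image name, monitor host and device names are hypothetical, and the essential parts are `discard='unmap'` together with a virtio-scsi bus.

```xml
<!-- Hypothetical libvirt disk stanza for fstrim/discard passthrough -->
<controller type='scsi' model='virtio-scsi'/>
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' discard='unmap'/>
  <source protocol='rbd' name='rbd/web01-disk0'>
    <host name='ceph-mon1' port='6789'/>
  </source>
  <target dev='sda' bus='scsi'/>
</disk>
```

With this in place, running `fstrim -av` inside the guest before the freeze/snapshot releases unused blocks back to the Ceph pool.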
Re: [ceph-users] RBD snapshot atomicity guarantees?
Dear Hector,

we are using the very same approach on CentOS 7 (freeze + thaw), but preceded by an fstrim. With virtio-scsi, using fstrim propagates the discards from within the VM to Ceph RBD (if qemu is configured accordingly), and a lot of space is saved.

We have yet to observe these hangs; we are running this with ~5 VMs with ~10 disks for about half a year now with daily snapshots. But all of these VMs have very "low" I/O, since we put anything I/O intensive on bare metal (but with automated provisioning of course).

So I'll chime in on your question, especially since there might be VMs on our cluster in the future where the inner OS may not be running an agent. Since we did not observe this yet, I'll also add: What's your "scale", is it hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?

Cheers,
Oliver

On 18.12.18 at 10:10, Hector Martin wrote:
> Hi list,
>
> I'm running libvirt qemu guests on RBD, and currently taking backups by
> issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw.
> This seems to be a common approach.
>
> This is safe, but it's impactful: the guest has frozen I/O for the
> duration of the snapshot. This is usually only a few seconds.
> Unfortunately, the freeze action doesn't seem to be very reliable.
> Sometimes it times out, leaving the guest in a messy situation with
> frozen I/O (thaw times out too when this happens, or returns success but
> FSes end up frozen anyway). This is clearly a bug somewhere, but I
> wonder whether the freeze is a hard requirement or not.
>
> Are there any atomicity guarantees for RBD snapshots taken *without*
> freezing the filesystem? Obviously the filesystem will be dirty and will
> require journal recovery, but that is okay; it's equivalent to a hard
> shutdown/crash. But is there any chance of corruption related to the
> snapshot being taken in a non-atomic fashion? Filesystems and
> applications these days should have no trouble with hard shutdowns, as
> long as storage writes follow ordering guarantees (no writes getting
> reordered across a barrier and such).
>
> Put another way: do RBD snapshots have ~identical atomicity guarantees
> to e.g. LVM snapshots?
>
> If we can get away without the freeze, honestly I'd rather go that
> route. If I really need to pause I/O during the snapshot creation, I
> might end up resorting to pausing the whole VM (suspend/resume), which
> has higher impact but also probably a much lower chance of messing up
> (or having excess latency), since it doesn't involve the guest OS or the
> qemu agent at all...
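The suspend/resume fallback mentioned above could be sketched as follows: pause the whole VM at the hypervisor level (no guest OS or agent involvement), snapshot, then resume. The resulting snapshot is crash-consistent rather than filesystem-quiesced. Names are hypothetical and this is an illustration, not a tested procedure from this thread.

```shell
#!/bin/sh
# Sketch: hypervisor-level pause instead of guest-agent freeze.
snapshot_with_suspend() {
    dom=$1; image=$2; snap=$3

    # Pause vCPUs; in-flight guest writes stop without touching the agent.
    virsh suspend "$dom" || return 1

    if rbd snap create "$image@$snap"; then rc=0; else rc=1; fi

    # Resume the guest regardless of snapshot success.
    virsh resume "$dom"
    return $rc
}
```

Compared with domfsfreeze, this avoids depending on the qemu guest agent entirely, at the cost of a brief full pause and a snapshot that needs journal recovery on restore.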