Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-20 Thread Hector Martin
On 21/12/2018 03.02, Gregory Farnum wrote:
> RBD snapshots are indeed crash-consistent. :)
> -Greg

Thanks for the confirmation! May I suggest putting this little nugget in
the docs somewhere? This might help clarify things for others :)

-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-20 Thread Gregory Farnum
On Tue, Dec 18, 2018 at 1:11 AM Hector Martin wrote:

> Hi list,
>
> I'm running libvirt qemu guests on RBD, and currently taking backups by
> issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw.
> This seems to be a common approach.
>
> This is safe, but it's impactful: the guest has frozen I/O for the
> duration of the snapshot. This is usually only a few seconds.
> Unfortunately, the freeze action doesn't seem to be very reliable.
> Sometimes it times out, leaving the guest in a messy situation with
> frozen I/O (thaw times out too when this happens, or returns success but
> FSes end up frozen anyway). This is clearly a bug somewhere, but I
> wonder whether the freeze is a hard requirement or not.
>
> Are there any atomicity guarantees for RBD snapshots taken *without*
> freezing the filesystem? Obviously the filesystem will be dirty and will
> require journal recovery, but that is okay; it's equivalent to a hard
> shutdown/crash. But is there any chance of corruption related to the
> snapshot being taken in a non-atomic fashion?


RBD snapshots are indeed crash-consistent. :)
-Greg

> Filesystems and
> applications these days should have no trouble with hard shutdowns, as
> long as storage writes follow ordering guarantees (no writes getting
> reordered across a barrier and such).
>
> Put another way: do RBD snapshots have ~identical atomicity guarantees
> to e.g. LVM snapshots?
>
> If we can get away without the freeze, honestly I'd rather go that
> route. If I really need to pause I/O during the snapshot creation, I
> might end up resorting to pausing the whole VM (suspend/resume), which
> has higher impact but also probably a much lower chance of messing up
> (or having excess latency), since it doesn't involve the guest OS or the
> qemu agent at all...
>
> --
> Hector Martin (hec...@marcansoft.com)
> Public Key: https://marcan.st/marcan.asc


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Hector Martin

On 18/12/2018 20:29, Oliver Freyermuth wrote:

> Potentially, if granted arbitrary command execution by the guest agent, you
> could check (there might be a better interface than parsing meminfo...):
>    cat /proc/meminfo | grep -i dirty
>    Dirty: 19476 kB
> You could guess from that information how long the fsfreeze may take
> (ideally, combining that with allowed IOPS).
> Of course, if you have control over your VMs, you may also play with the
> vm.dirty_ratio and vm.dirty_background_ratio.


I have that data (from node_exporter), but it looks like it was only a few 
kB, peaking at about 3 MB during the problem interval. The problem is that 
there's no way to tell how long flushing that is going to take without 
knowing the average I/O size required: 1 MB of contiguous data will complete 
in negligible time, while 1 MB of 4 kB random writes will take a few 
seconds. I do have access to the VMs; customer stuff runs higher in the stack.


Still, the time it took to flush and the I/Os involved (looks like ~23k IOs 
during the time range of interest) make me think there was something else 
involved beyond what the Dirty number accounts for. 23k IOs * 4 kB (page 
size, worst case) is 94 MB, which is definitely not what I had as Dirty. 
Perhaps it was dirty entries in the inode cache (which would explain the 
peak in Dirty as they were flushed to disk buffers first and then to disk).
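
If anyone wants to check this theory next time, /proc/vmstat exposes the 
writeback counters alongside what meminfo reports (a quick sketch; nothing 
guest-agent specific here):

   # nr_dirty / nr_writeback are in pages (4 kB on most systems); dirty
   # inodes and other filesystem metadata are *not* included in these.
   grep -E '^(nr_dirty|nr_writeback)' /proc/vmstat
   grep -E '^(Dirty|Writeback):' /proc/meminfo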



> Interestingly, tuned on CentOS 7 configures for a "virtual-guest" profile:
> vm.dirty_ratio = 30
> (default is 20 %) so they optimize for performance by increasing the dirty
> buffers to delay writeback even more.
> They take the opposite for their "virtual-host" profile:
> vm.dirty_background_ratio = 5
> (default is 10 %).
> I believe these choices are good for performance, but may increase the time
> it takes to freeze the VMs, especially if IOPS are limited and there's a lot
> of dirty data.


Yeah, I may need to try playing with some of those settings if this 
becomes a problem again in the future. FWIW, our hosts and VMs are both 
running Ubuntu 16.04.



> Since we also have 1 Gbps links and HDD OSDs, and plan to add more and more
> VMs and hosts, we may also observe this one day...
> So I'm curious:
> How did you implement the timeout in your case? Are you using a
> qemu-agent-command issuing fsfreeze with --async and --timeout instead of
> domfsfreeze?
> We are using domfsfreeze as of now, which (probably) has an infinite timeout,
> or at least no timeout documented in the manpage.


We have a wrapper that takes the snapshots; it just uses domfsfreeze, and 
times out and kills the command if it takes too long. Unsurprisingly, that 
doesn't abort the freeze, so libvirt just ends up running it in the 
background (with a lock taken, so domfsthaw doesn't work until it completes).


The logic I have right now actually tries several times to thaw the 
filesystems, and if that doesn't succeed it resets the VM to avoid leaving 
it in a frozen state. However, I had a logic bug: if the freeze itself timed 
out, the wrapper did not do that (it assumed the VM wasn't frozen), when in 
this case the freeze had just taken a while to complete. That leaves the VM 
frozen and broken. I'll probably add some alerting to complain loudly when 
this happens, increase the thaw timeout/retries, and then switch to 
unconditionally resetting the VM if thawing fails.
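
In case it's useful to anyone, here's a minimal sketch of what that logic 
boils down to (the domain/pool/image names and timeout values are 
placeholders, not our actual setup):

   #!/bin/bash
   # Freeze with a timeout, snapshot, thaw with retries, and hard-reset
   # as a last resort rather than leaving the guest frozen.
   DOM=vm1 POOL=rbd IMG=vm1-disk SNAP="backup-$(date +%Y%m%d)"

   timeout 60 virsh domfsfreeze "$DOM" \
       || echo "freeze timed out; guest may still freeze later" >&2

   rbd snap create "$POOL/$IMG@$SNAP"

   for attempt in 1 2 3; do
       timeout 30 virsh domfsthaw "$DOM" && exit 0
       sleep 5
   done
   echo "thaw failed; resetting $DOM" >&2
   virsh reset "$DOM"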


Ultimately this whole thing is kind of fragile, so if I can get away 
without freezing at all it would probably make the whole process a lot 
more robust.
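
For comparison, the no-freeze variants are almost trivial (a sketch; the 
names are placeholders):

   # Crash-consistent snapshot with no guest involvement at all:
   rbd snap create rbd/vm1-disk@backup-$(date +%Y%m%d)

   # Or with a brief vCPU pause instead of a filesystem freeze:
   virsh suspend vm1
   rbd snap create rbd/vm1-disk@backup-$(date +%Y%m%d)
   virsh resume vm1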


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Oliver Freyermuth
On 18.12.18 at 11:48, Hector Martin wrote:
> On 18/12/2018 18:28, Oliver Freyermuth wrote:
>> We have yet to observe these hangs, we are running this with ~5 VMs with ~10 
>> disks for about half a year now with daily snapshots. But all of these VMs 
>> have very "low" I/O,
>> since we put anything I/O intensive on bare metal (but with automated 
>> provisioning of course).
>>
>> So I'll chime in on your question, especially since there might be VMs on 
>> our cluster in the future where the inner OS may not be running an agent.
>> Since we did not observe this yet, I'll also add: What's your "scale", is it 
>> hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?
> 
> 5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads); 
> usually not that high, but it can easily peak at 100% when certain things 
> happen. We don't have great I/O performance (RBD over 1 Gbps links to HDD 
> OSDs).
> 
> I'm poring through monitoring graphs now and I think the issue this time 
> around was just too much dirty data in the page cache of a guest. The VM that 
> failed spent 3 minutes flushing out writes to disk before its I/O was 
> quiesced, at around 100 IOPS throughput (the actual data throughput was low, 
> though, so small writes). That exceeded our timeout and then things went 
> south from there.
> 
> I wasn't sure if fsfreeze did a full sync to disk, but given the I/O behavior 
> I'm seeing, that seems to be the case. Unfortunately, coming up with an upper 
> bound for the freeze time seems tricky now. I'm increasing our timeout to 15 
> minutes; we'll see if the problem recurs.
> 
> Given this, it makes even more sense to just avoid the freeze if at all 
> reasonable. There's no real way to guarantee that a fsfreeze will complete in 
> a "reasonable" amount of time as far as I can tell.

Potentially, if granted arbitrary command execution by the guest agent, you 
could check (there might be a better interface than parsing meminfo...):
  cat /proc/meminfo | grep -i dirty
  Dirty: 19476 kB
You could guess from that information how long the fsfreeze may take (ideally, 
combining that with allowed IOPS). 
Of course, if you have control over your VMs, you may also play with the 
vm.dirty_ratio and vm.dirty_background_ratio. 
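
For the record, that check can be done through the agent without needing a 
shell in the guest; a sketch using the standard guest-exec QGA commands (the 
domain name and the returned pid are placeholders):

   # Run 'cat /proc/meminfo' inside the guest and capture its output.
   virsh qemu-agent-command vm1 '{"execute":"guest-exec","arguments":
     {"path":"/bin/cat","arg":["/proc/meminfo"],"capture-output":true}}'
   # The reply is {"return":{"pid":1234}}; fetch the (base64) output with:
   virsh qemu-agent-command vm1 \
     '{"execute":"guest-exec-status","arguments":{"pid":1234}}'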

Interestingly, tuned on CentOS 7 configures for a "virtual-guest" profile:
vm.dirty_ratio = 30
(default is 20 %) so they optimize for performance by increasing the dirty 
buffers to delay writeback even more. 
They take the opposite for their "virtual-host" profile:
vm.dirty_background_ratio = 5
(default is 10 %). 
I believe these choices are good for performance, but may increase the time it 
takes to freeze the VMs, especially if IOPS are limited and there's a lot of 
dirty data. 
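
If predictable freeze times matter more than raw throughput, the byte-based 
variants of these knobs give a fixed cap that does not scale with RAM size. 
A sketch (the values are just an example, not a recommendation):

   # /etc/sysctl.d/99-dirty.conf
   # Setting the *_bytes knobs makes the kernel ignore the *_ratio ones,
   # so a pre-snapshot sync never has more than ~256 MB left to flush.
   vm.dirty_background_bytes = 67108864   # start writeback at 64 MB
   vm.dirty_bytes = 268435456             # throttle writers at 256 MB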

Since we also have 1 Gbps links and HDD OSDs, and plan to add more and more VMs 
and hosts, we may also observe this one day... 
So I'm curious:
How did you implement the timeout in your case? Are you using a 
qemu-agent-command issuing fsfreeze with --async and --timeout instead of 
domfsfreeze? 
We are using domfsfreeze as of now, which (probably) has an infinite timeout, 
or at least no timeout documented in the manpage. 
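
The variant I have in mind would look something like the sketch below (the 
domain name is a placeholder; note that the --timeout belongs to virsh 
qemu-agent-command itself, since guest-fsfreeze-freeze takes no arguments):

   # Fail after 30 seconds instead of potentially blocking forever the
   # way domfsfreeze seems to.
   virsh qemu-agent-command vm1 --timeout 30 \
     '{"execute":"guest-fsfreeze-freeze"}'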

Cheers,
Oliver





Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Hector Martin

On 18/12/2018 18:28, Oliver Freyermuth wrote:

> We have yet to observe these hangs, we are running this with ~5 VMs with ~10
> disks for about half a year now with daily snapshots. But all of these VMs
> have very "low" I/O, since we put anything I/O intensive on bare metal (but
> with automated provisioning of course).
>
> So I'll chime in on your question, especially since there might be VMs on our
> cluster in the future where the inner OS may not be running an agent.
> Since we did not observe this yet, I'll also add: What's your "scale", is it
> hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?


5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads); 
usually not that high, but it can easily peak at 100% when certain 
things happen. We don't have great I/O performance (RBD over 1 Gbps links 
to HDD OSDs).


I'm poring through monitoring graphs now and I think the issue this time 
around was just too much dirty data in the page cache of a guest. The VM 
that failed spent 3 minutes flushing out writes to disk before its I/O 
was quiesced, at around 100 IOPS throughput (the actual data throughput 
was low, though, so small writes). That exceeded our timeout and then 
things went south from there.


I wasn't sure if fsfreeze did a full sync to disk, but given the I/O 
behavior I'm seeing, that seems to be the case. Unfortunately, coming up 
with an upper bound for the freeze time seems tricky now. I'm increasing 
our timeout to 15 minutes; we'll see if the problem recurs.


Given this, it makes even more sense to just avoid the freeze if at all 
reasonable. There's no real way to guarantee that a fsfreeze will 
complete in a "reasonable" amount of time as far as I can tell.


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread ceph
For what it's worth, we have been using snapshots on a daily basis for a
couple of thousand RBD volumes for quite some time.

So far so good; we have not caught any issues.

On 12/18/2018 10:28 AM, Oliver Freyermuth wrote:
> Dear Hector,
> 
> we are using the very same approach on CentOS 7 (freeze + thaw), but
> preceded by an fstrim. With virtio-scsi, using fstrim propagates the
> discards from within the VM to Ceph RBD (if qemu is configured
> accordingly),
> and a lot of space is saved.
> 
> We have yet to observe these hangs, we are running this with ~5 VMs with
> ~10 disks for about half a year now with daily snapshots. But all of
> these VMs have very "low" I/O,
> since we put anything I/O intensive on bare metal (but with automated
> provisioning of course).
> 
> So I'll chime in on your question, especially since there might be VMs
> on our cluster in the future where the inner OS may not be running an
> agent.
> Since we did not observe this yet, I'll also add: What's your "scale",
> is it hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?
> 
> Cheers,
> Oliver
> 
> On 18.12.18 at 10:10, Hector Martin wrote:
>> Hi list,
>>
>> I'm running libvirt qemu guests on RBD, and currently taking backups
>> by issuing a domfsfreeze, taking a snapshot, and then issuing a
>> domfsthaw. This seems to be a common approach.
>>
>> This is safe, but it's impactful: the guest has frozen I/O for the
>> duration of the snapshot. This is usually only a few seconds.
>> Unfortunately, the freeze action doesn't seem to be very reliable.
>> Sometimes it times out, leaving the guest in a messy situation with
>> frozen I/O (thaw times out too when this happens, or returns success
>> but FSes end up frozen anyway). This is clearly a bug somewhere, but I
>> wonder whether the freeze is a hard requirement or not.
>>
>> Are there any atomicity guarantees for RBD snapshots taken *without*
>> freezing the filesystem? Obviously the filesystem will be dirty and
>> will require journal recovery, but that is okay; it's equivalent to a
>> hard shutdown/crash. But is there any chance of corruption related to
>> the snapshot being taken in a non-atomic fashion? Filesystems and
>> applications these days should have no trouble with hard shutdowns, as
>> long as storage writes follow ordering guarantees (no writes getting
>> reordered across a barrier and such).
>>
>> Put another way: do RBD snapshots have ~identical atomicity guarantees
>> to e.g. LVM snapshots?
>>
>> If we can get away without the freeze, honestly I'd rather go that
>> route. If I really need to pause I/O during the snapshot creation, I
>> might end up resorting to pausing the whole VM (suspend/resume), which
>> has higher impact but also probably a much lower chance of messing up
>> (or having excess latency), since it doesn't involve the guest OS or
>> the qemu agent at all...
>>


Re: [ceph-users] RBD snapshot atomicity guarantees?

2018-12-18 Thread Oliver Freyermuth

Dear Hector,

we are using the very same approach on CentOS 7 (freeze + thaw), but preceded 
by an fstrim. With virtio-scsi, using fstrim propagates the discards from 
within the VM to Ceph RBD (if qemu is configured accordingly),
and a lot of space is saved.
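
For anyone wanting to replicate this, the relevant part of the domain XML 
looks roughly like the sketch below (the disk source/target names and the 
monitor host are placeholders):

   <!-- virtio-scsi bus plus discard='unmap' on the driver is what lets
        fstrim in the guest release space back to the RBD pool. -->
   <controller type='scsi' model='virtio-scsi'/>
   <disk type='network' device='disk'>
     <driver name='qemu' type='raw' discard='unmap'/>
     <source protocol='rbd' name='rbd/vm1-disk'>
       <host name='mon1.example.com' port='6789'/>
     </source>
     <target dev='sda' bus='scsi'/>
   </disk>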

We have yet to observe these hangs, we are running this with ~5 VMs with ~10 disks for 
about half a year now with daily snapshots. But all of these VMs have very 
"low" I/O,
since we put anything I/O intensive on bare metal (but with automated 
provisioning of course).

So I'll chime in on your question, especially since there might be VMs on our 
cluster in the future where the inner OS may not be running an agent.
Since we did not observe this yet, I'll also add: What's your "scale", is it 
hundreds of VMs / disks? Hourly snapshots? I/O intensive VMs?

Cheers,
Oliver

On 18.12.18 at 10:10, Hector Martin wrote:

> Hi list,
>
> I'm running libvirt qemu guests on RBD, and currently taking backups by
> issuing a domfsfreeze, taking a snapshot, and then issuing a domfsthaw. This
> seems to be a common approach.
>
> This is safe, but it's impactful: the guest has frozen I/O for the duration
> of the snapshot. This is usually only a few seconds. Unfortunately, the
> freeze action doesn't seem to be very reliable. Sometimes it times out,
> leaving the guest in a messy situation with frozen I/O (thaw times out too
> when this happens, or returns success but FSes end up frozen anyway). This
> is clearly a bug somewhere, but I wonder whether the freeze is a hard
> requirement or not.
>
> Are there any atomicity guarantees for RBD snapshots taken *without*
> freezing the filesystem? Obviously the filesystem will be dirty and will
> require journal recovery, but that is okay; it's equivalent to a hard
> shutdown/crash. But is there any chance of corruption related to the
> snapshot being taken in a non-atomic fashion? Filesystems and applications
> these days should have no trouble with hard shutdowns, as long as storage
> writes follow ordering guarantees (no writes getting reordered across a
> barrier and such).
>
> Put another way: do RBD snapshots have ~identical atomicity guarantees to
> e.g. LVM snapshots?
>
> If we can get away without the freeze, honestly I'd rather go that route. If
> I really need to pause I/O during the snapshot creation, I might end up
> resorting to pausing the whole VM (suspend/resume), which has higher impact
> but also probably a much lower chance of messing up (or having excess
> latency), since it doesn't involve the guest OS or the qemu agent at all...





