Re: Are nocow files snapshot-aware

2014-02-08 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net wrote:

[...]

Difficult to wrap your mind around, but well explained. ;-)

 A snapshot thus looks much like a crash in terms of NOCOW file integrity
 since the blocks of a NOCOW file are simply snapshotted in-place, and
 there's already no checksumming or file integrity verification on such
 files -- they're simply directly written in-place (with the exception of
 a single COW write when a writable snapshottted NOCOW file diverges from
 the shared snapshot version).
 
 But as I said, the applications themselves are normally designed to
 handle and recover from crashes, and in fact, having btrfs try to manage
 it too only complicates things and can actually make it impossible for
 the app to recover what it would have otherwise recovered just fine.
 
 So it should be with these NOCOW in-place snapshotted files, too.  If a
 NOCOW file is put back into operation from a snapshot, and the file was
 being written to at snapshot time, it'll very likely trigger exactly the
 same response from the application as a crash while writing would have
 triggered, but, the point is, such applications are normally designed to
 deal with just that, and thus, they should recover just as they would
 from a crash.  If they could recover from a crash, it shouldn't be an
 issue.  If they couldn't, well...

So we agree that taking a snapshot looks like a crash from the application's 
perspective. That means if there are facilities to instruct the application 
to suspend its operations first, you should use them - like in the InnoDB 
case:

http://dev.mysql.com/doc/refman/5.1/en/lock-tables.html:

| FLUSH TABLES WITH READ LOCK;
| SHOW MASTER STATUS;
| SYSTEM xfs_freeze -f /var/lib/mysql;
| SYSTEM YOUR_SCRIPT_TO_CREATE_SNAPSHOT.sh;
| SYSTEM xfs_freeze -u /var/lib/mysql;
| UNLOCK TABLES;
| EXIT;

Only that way do you get consistent snapshots without triggering crash 
recovery (which might otherwise throw away unrecoverable transactions or 
otherwise harm your data for the sake of consistency). In this scenario, 
InnoDB is more or less like a VM filesystem image on btrfs, so the same 
approach should be taken for VM images if possible. I think VMware has 
facilities to prepare the guest for a snapshot being taken (they are 
triggered when you take snapshots with VMware itself, and by the way, that 
usually takes much longer than btrfs snapshots do).
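As a concrete sketch, the snapshot script invoked in the recipe above might 
look like this. The paths and the read-only flag are my assumptions, not 
part of the original recipe, and the command is only printed (a dry run) 
rather than executed:

```shell
#!/bin/sh
# Hypothetical sketch of YOUR_SCRIPT_TO_CREATE_SNAPSHOT.sh from the
# recipe above. Paths and the -r (read-only) flag are assumptions.
# Dry run: the command is only printed; drop the 'echo' to execute it.
SRC="/var/lib/mysql"                            # subvolume holding the InnoDB files
DST="/snapshots/mysql-$(date +%Y%m%d-%H%M%S)"   # timestamped snapshot path
SNAP_CMD="btrfs subvolume snapshot -r $SRC $DST"
echo "$SNAP_CMD"
```

The -r flag makes the snapshot read-only, which suits a backup source; run 
it only while the tables are locked and the filesystem is frozen.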

Take XFS as an example: although it is crash-safe, it prefers to zero out 
your files during log replay, for security reasons, because it is crash-safe 
only for metadata: if metadata has already allocated blocks but the file 
data has not yet been written, a recovered file could otherwise end up with 
wrong content, so it is cleared out. This _IS_NOT_ the situation you want 
for VM images with XFS inside, hosted on btrfs, when taking a snapshot. You 
should trigger xfs_freeze in the guest before taking the btrfs snapshot on 
the host.
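The freeze-then-snapshot ordering could be scripted from the host roughly 
like this. The hostname and paths are hypothetical, and the commands are 
printed rather than executed:

```shell
#!/bin/sh
# Hypothetical host-side sketch: quiesce the XFS filesystem inside the
# guest over ssh, take the btrfs snapshot on the host, then thaw.
# Dry run: 'run' prints the commands; redefine as run() { "$@"; } to
# execute them for real.
run() { echo "$@"; }
GUEST="root@guest.example.com"
IMAGES="/srv/vm-images"                            # btrfs subvolume holding the image
SNAP="$IMAGES@$(date +%Y%m%d-%H%M%S)"
run ssh "$GUEST" xfs_freeze -f /                   # 1. freeze XFS in the guest
run btrfs subvolume snapshot -r "$IMAGES" "$SNAP"  # 2. atomic snapshot on the host
run ssh "$GUEST" xfs_freeze -u /                   # 3. thaw the guest
```

Keeping the freeze window as short as possible matters here, since the guest 
blocks all writes while frozen.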

I think the same holds true for most other metadata-only-journaling 
filesystems, which probably do not even zero out files during recovery and 
just silently corrupt your files during crash recovery.

So in case of a crash or snapshot (which look the same from the 
application's perspective), btrfs' capabilities won't help you here - at 
least in the nocow case, and probably in the cow case too, because the VM 
guest may write blocks out of order without any possibility of passing write 
barriers down to the btrfs COW mechanism. Taking snapshots of database files 
or VM images without proper preparation only guarantees you crash-like 
rollback situations. Taking snapshots at short intervals only makes this 
worse, with all the extra side effects this has within btrfs.

I think this is important to understand for people planning automated 
snapshots of such file data. Making a file nocow only helps during normal 
operation - after a snapshot, a nocow file is essentially cow again while 
blocks from the old generation are carried over to the new subvolume 
generation on write.

-- 
Replies to list only preferred.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Are nocow files snapshot-aware

2014-02-07 Thread Kai Krakow
Chris Murphy li...@colorremedies.com wrote:

 If the database/virtual machine/whatever is crash safe, then the
 atomic state that a snapshot grabs will be useful.
 
 How fast is this state fixed on disk from the time of the snapshot
 command? Loosely speaking. I'm curious if this is < 1 second; a few
 seconds; or possibly up to the 30 second default commit interval? And also
 if it's even related to the commit interval time at all?

Such constructs can only be crash-safe if write barriers are passed down 
through the COW logic of btrfs to the storage layer. That will probably 
never happen. Atomic and transactional updates cannot happen without write 
barriers or synchronous writes. To make it work, you would need to design 
the storage layers from the ground up to work without write barriers - 
battery-backed write caches, synchronous logical filesystem layers, etc. 
Otherwise, database/VM/whatever transactional/atomic writes simply have an 
undefined status down at the lowest storage layer.

 I'm also curious what happens to files that are presently writing. e.g.
 I'm writing a 1GB file to subvol A and before it completes I snapshot
 subvol A into A.1. If I go find the file I was writing to, in A.1, what's
 its state? Truncated? Or are in-progress writes permitted to complete
 if it's a rw snapshot? Any difference in behavior if it's an ro snapshot?

I have wondered that many times, too. What happens to files being written 
to? I suppose at the time of the snapshot it takes the current state of the 
blocks as they are, ignoring pending writes. This means the file being 
written to is probably in a limbo state.

For example, XFS has an option to freeze the filesystem to take atomic 
snapshots. You can use that feature to take consistent snapshots of MySQL 
InnoDB files to create a hot-copy backup. But: you need to instruct MySQL 
first to complete its transactions and pause before running xfs_freeze; 
after that's done, you can resume MySQL operations. That clearly tells me it 
is probably not safe to take snapshots of online databases, even if they are 
crash-safe (and as far as I know, InnoDB is designed to be crash-safe).

A solution, probably far in the future, could be that a btrfs snapshot 
informs all current file writers to complete their transactions and atomic 
operations, waits until each one signals a ready state, takes the snapshot, 
then signals the processes to resume. For this, the btrfs driver could offer 
some sort of subscription, similar to what inotify offers: processes 
subscribe to notification broadcasts, and btrfs waits for every process to 
report a consistent file state. If I remember right, reiser4 offered a 
similar feature (approaching the problem from the opposite side): processes 
were offered an interface to start and commit transactions within reiser4. 
If btrfs had such information from file writers, it could take consistent 
snapshots of online databases/VMs/whatever (given that, in the VM case, the 
guest could pass this information to the host). Whatever approach is taken, 
however, it will make the time needed to create a snapshot nondeterministic; 
processes may not finish their transactions within a reasonable time...

-- 
Replies to list only preferred.



Re: Are nocow files snapshot-aware

2014-02-07 Thread Chris Murphy

On Feb 7, 2014, at 2:07 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:

 Chris Murphy li...@colorremedies.com wrote:
 
 If the database/virtual machine/whatever is crash safe, then the
 atomic state that a snapshot grabs will be useful.
 
 How fast is this state fixed on disk from the time of the snapshot
 command? Loosely speaking. I'm curious if this is < 1 second; a few
 seconds; or possibly up to the 30 second default commit interval? And also
 if it's even related to the commit interval time at all?
 
 Such constructs can only be crash-safe if write-barriers are passed down 
 through the cow logic of btrfs to the storage layer. That won't probably 
 ever happen. Atomic and transactional updates cannot happen without write-
 barriers or synchronous writes. To make it work, you need to design the 
 storage-layers from the ground up to work without write-barriers, like 
 having battery-backed write-caches, synchronous logical file-system layers 
 etc. Otherwise, database/vm/whatever transactional/atomic writes are just 
 having undefined status down at the lowest storage layer.

This explanation makes sense. But I failed to qualify "the state fixed on 
disk". I'm not concerned about when bits actually arrive on disk; I'm 
wondering what state they describe. So assume no crash or power failure, and 
assume writes eventually make it onto the media without a problem. What I'm 
wondering is: what state of the subvolume I'm snapshotting do I end up with? 
Is there a delay, and how long is it, or is it pretty much instant? The 
command completes really quickly even when the filesystem is actively being 
used, so the feedback suggests the snapshot state is established very fast, 
but I'm not sure what bearing that has in reality.


Chris Murphy



Re: Are nocow files snapshot-aware

2014-02-07 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net wrote:

 The question here is: Does it really make sense to create such snapshots
 of disk images currently online and running a system. They will probably
 be broken anyway after rollback - or at least I'd not fully trust the
 contents.
 
 VM images should not be part of a subvolume of which snapshots are taken
 at a regular and short interval. The problem will go away if you follow
 this rule.
 
 The same applies to probably any kind of file which you make nocow -
 e.g. database files. The only use case is taking _controlled_ snapshots
 - and doing it all 30 seconds is by all means NOT controlled, it's
 completely undeterministic.
 
 I'd absolutely agree -- and that wasn't my report, I'm just recalling it,
 as at the time I didn't understand the interaction between NOCOW and
 snapshots and couldn't quite understand how a NOCOW file was still
 triggering the snapshot-aware-defrag pathology, which in fact we were
 just beginning to realize based on such reports.

Sorry, didn't mean to pin it on you. ;-) I just wanted to give some pointers 
for people stumbling upon this to rethink such practices.

 But some of the snapshotting scripts out there, and the admins running
 them, seem to have the idea that just because it's possible it must be
 done, and they have snapshots taken every minute or more frequently, with
 no automated snapshot thinning at all.  IMO that's pathology run amok
 even if btrfs /was/ stable and mature and /could/ handle it properly.

Yeah, people should stop such bullshit practice (sorry), no matter whether 
there's a technical problem with it. It does not give the protection they 
intended. It's just a false sense of security/safety... There _may_ be 
actual use cases for doing it - but generally I'd suggest it's plain wrong.

 That's regardless of the content so it's from a different angle than you
 were attacking the problem from...  But if admins aren't able to
 recognize the problem with per-minute snapshots without any thinning at
 all for days, weeks, months on end, I doubt they'll be any better at
 recognizing that VMs, databases, etc, should have a dedicated subvolume.

True.

 But be that as it may, since such extreme snapshotting /is/ possible, and
 with automation and downloadable snapper scripts somebody WILL be doing
 it, btrfs should scale to it if it is to be considered mature and
 stable.  People don't want a filesystem that's going to fall over on them
 and lose data or simply become unworkably live-locked just because they
 didn't know what they were doing when they setup the snapper script and
 set it to 1 minute snaps without any corresponding thinning after an hour
 or a day or whatever.

Such, uhm, sorry, bullshit practice should not be a high priority on the 
fix-list for btrfs. There are other areas. It's a technical problem, yes, 
but I think there are more important ones than brute-forcing problems out of 
btrfs that are never hit by normal usage patterns.

It is good that such tests are done, but I don't understand how people can 
expect to need such a feature - right now, at once. Such setups are not 
ready to leave the development sandbox yet.

From a normal use perspective, doing such heavy snapshotting is probably 
almost always nonsense.

I'd be more interested in how btrfs behaves under heavily IO-loaded server 
patterns. One interesting use case for me would be to use btrfs as the 
building block of a system with container virtualization (docker, lxc), 
allowing a high VM density on the machine (with the IO load and 
unpredictable IO behavior that internet-facing servers apply to their 
storage layer), using btrfs snapshots to instantly create new VMs from VM 
templates living in subvolumes (thin provisioning), and spreading btrfs 
across a higher number of disks than the average desktop user / standard 
server has. I think this is one of many very interesting use cases for btrfs 
and its capabilities. And this is how we get back to my initial question: in 
such a scenario I'd like to take ro snapshots of all machines (which 
probably host nocow files for databases), send these to a backup server at 
low IO priority, then remove the snapshots. Apparently, btrfs send/receive 
is still far from stable and bullet-proof from what I read here, so the 
destination would probably be another btrfs or zfs, using in-place rsync 
backups and snapshotting for backlog.
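The thin-provisioning workflow could look roughly like this. All names and 
paths are hypothetical, and the commands are printed rather than executed:

```shell
#!/bin/sh
# Hypothetical sketch of snapshot-based thin provisioning: a new
# container/VM root is an instant snapshot of a template subvolume, and
# backups are ro snapshots shipped with send (or rsync'd, if
# send/receive is not trusted yet). Dry run: 'run' prints the commands;
# redefine as run() { "$@"; } to execute them.
run() { echo "$@"; }
TEMPLATE="/srv/templates/debian-base"   # read-only template subvolume
VM="/srv/vms/web-01"                    # new writable instance
run btrfs subvolume snapshot "$TEMPLATE" "$VM"      # instant, initially space-free clone
run btrfs subvolume snapshot -r "$VM" "$VM@backup"  # ro snapshot as a stable backup source
run btrfs send "$VM@backup"                         # pipe to 'btrfs receive' on the backup host
```

The clone is cheap because it shares all extents with the template until the 
instance starts diverging through writes.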

-- 
Replies to list only preferred.



Re: Are nocow files snapshot-aware

2014-02-07 Thread Kai Krakow
Chris Murphy li...@colorremedies.com wrote:

 
 On Feb 7, 2014, at 2:07 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:
 
 Chris Murphy li...@colorremedies.com wrote:
 
 If the database/virtual machine/whatever is crash safe, then the
 atomic state that a snapshot grabs will be useful.
 
 How fast is this state fixed on disk from the time of the snapshot
 command? Loosely speaking. I'm curious if this is < 1 second; a few
 seconds; or possibly up to the 30 second default commit interval? And
 also if it's even related to the commit interval time at all?
 
 Such constructs can only be crash-safe if write-barriers are passed down
 through the cow logic of btrfs to the storage layer. That won't probably
 ever happen. Atomic and transactional updates cannot happen without
 write- barriers or synchronous writes. To make it work, you need to
 design the storage-layers from the ground up to work without
 write-barriers, like having battery-backed write-caches, synchronous
 logical file-system layers etc. Otherwise, database/vm/whatever
 transactional/atomic writes are just having undefined status down at the
 lowest storage layer.
 
 This explanation makes sense. But I failed to qualify the state fixed on
 disk. I'm not concerned about when bits actually arrive on disk. I'm
 wondering what state they describe. So assume no crash or power failure,
 and assume writes eventually make it onto the media without a problem.
 What I'm wondering is, what state of the subvolume I'm snapshotting do I
 end up with? Is there a delay and how long is it, or is it pretty much
 instant? The command completes really quickly even when the file system is
 actively being used, so the feedback is that the snapshot state is
 established very fast but I'm not sure what bearing that has in reality.

I think from that perspective it is more or less the same whether you take a 
snapshot or cycle the power. For the consistency of the file it means the 
same, I suppose. I got your point about the "state fixed on disk", but I 
implied that, from the perspective of the writing process, it is just the 
same situation: at the moment of the snapshot, the data file is in a crashed 
state. That is like cycling the power without any mechanism to support 
transactional guarantees.

So the question is: do btrfs snapshots give the same guarantees on the 
filesystem level that write barriers give on the storage level - exactly the 
guarantees those processes rely upon? The cleanest solution would be if 
processes could give btrfs hints about what belongs to their transactions, 
so that at the moment of a snapshot the data file would be in a clean state. 
I guess snapshots are atomic in the sense that pending writes will never 
reach the snapshot just taken, which is good.

But what about the ordering of writes? Maybe some younger write requests 
already made it to disk while older ones didn't. The filesystem usually only 
has to care about its own transactional integrity, not that of its writing 
processes, and that is completely unrelated to what the writing process 
expects. In other words: after a subsequent crash, only the active subvolume 
being written to is guaranteed clean from the transactional perspective of 
the process; the snapshot may be broken. As far as I know, user processes 
cannot tell the filesystem when to issue write barriers; they can only issue 
fsyncs (which hurts performance). Otherwise this discussion would be a whole 
different story.

Did you test how btrfs snapshots perform while fsync is running with a lot 
of data to be committed? That could give a clue...
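A rough harness for such a test might look like this. The mount point is 
hypothetical, the commands are printed rather than executed, and a real run 
would start the writer in the background:

```shell
#!/bin/sh
# Hypothetical micro-benchmark: generate heavy synchronous-write (fsync-
# like) traffic, then time how long snapshot creation takes to return.
# Needs a real btrfs mount to be meaningful. Dry run: 'run' prints the
# commands; redefine as run() { "$@"; } to execute them.
run() { echo "$@"; }
MNT="/mnt/btrfs-test"
SUB="$MNT/data"
# writer: many small synchronous writes (oflag=dsync flushes each block,
# similar in effect to calling fsync after every write)
run dd if=/dev/zero of="$SUB/load.bin" bs=4k count=100000 oflag=dsync
# meanwhile, time how quickly the snapshot command returns
run time btrfs subvolume snapshot -r "$SUB" "$MNT/data@snap"
```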

-- 
Replies to list only preferred.



Re: Are nocow files snapshot-aware

2014-02-07 Thread Duncan
Kai Krakow posted on Fri, 07 Feb 2014 23:26:34 +0100 as excerpted:

 So the question is: Do btrfs snapshots give the same guarantees on the
 filesystem level that write-barriers give on the storage level which
 exactly those processes rely upon? The cleanest solution would be if
 processes could give btrfs hints about what belongs to their
 transactions so in the moment of a snapshot the data file would be in
 clean state. I guess snapshots are atomic in that way, that pending
 writes will never reach the snapshots just taken, which is good.

Keep in mind that btrfs' metadata is COW-based also.  Like reiser4 in 
this way, in theory at least, commits are atomic -- they've either made it 
to disk or they haven't; there's no half-there state.  Commits at the leaf 
level propagate up the tree, and are not finalized until the top-level 
root node is written.  AFAIK, if there's dirty data to write, btrfs 
triggers a root node commit every 30 seconds.  Until that root is 
rewritten, it points to the last consistent-state written root node.  
Once it's rewritten, it points to the new one and a new set of writes is 
started, only to be finalized at the next root node write.

And I believe that final write simply updates a pointer to point at the 
latest root node.  There's also a history of root nodes, which is what 
the btrfs-find-root tool uses in combination with btrfs restore, if 
necessary, to find a valid root from the root node pointer log if the 
system crashed in the middle of that final update so the pointer ends up 
pointing at garbage.

Meanwhile, I'm a bit blurry on this but if I understand things correctly, 
between root node writes/full-filesystem-commits there's a log of 
transaction completions at the atomic individual transaction level, such 
that even transactions completed between root node writes can normally be 
replayed.  Of course this is only ~30 seconds worth of activity max, 
since the root node writes should occur every 30 seconds, but this is 
what btrfs-zero-log zeroes out, if/when needed.  You'll lose that few 
seconds of log replay since the last root node write, but if it was 
garbage data due to it being written when the system actually went down, 
dropping those few extra seconds of log can allow the filesystem to mount 
properly from the last full root node commit, where it couldn't, 
otherwise.

It's actually those metadata trees and the atomic root-node commit 
feature that btrfs snapshots depend on, and why they're normally so fast 
to create.  When a snapshot is taken, btrfs simply keeps a record of the 
current root node instead of letting it recede into history and fall off 
the end of the root node log, labeling that record with the name of the 
snapshot for humans as well as the object-ID that btrfs uses.  That root 
node is by definition a record of the filesystem in a consistent state, 
so any snapshot that's a reference to it is similarly by definition in a 
consistent state.

So normally, files in the process of being written out (created) simply 
wouldn't appear in the snapshot.  Of course preexisting files will appear 
(and fallocated files are simply the blanked-out-special-case of 
preexisting), but again, with normal COW-based files at least, will exist 
in a state either before the latest transaction started, or after it 
finished, which of course is where fsync comes in, since that's how 
userspace apps communicate file transactions to the filesystem.

And of course in addition to COW, btrfs normally does checksumming as 
well, and again, the filesystem including that checksumming will be self-
consistent when a root-node is written, or it won't be written until the 
filesystem /is/ self-consistent.  If for whatever reason there's garbage 
when btrfs attempts to read the data back, which is exactly what btrfs 
defines it as if it doesn't pass checksum, btrfs will refuse to use that 
data.  If there's a second copy somewhere (as with raid1 mode), it'll try 
to restore from that second copy.  If it can't, btrfs will return an 
error and simply won't let you access that file.

So one way or another, a snapshot is deterministic and atomic.  No 
partial transactions, at least on ordinary COW and checksummed files.

Which brings us to NOCOW files, where for btrfs NOCOW also turns off 
checksumming.  Btrfs will write these files in-place, and as a result 
there's not the transaction integrity guarantee on these files that there 
is on ordinary files.

*HOWEVER*, the situation isn't as bad as it might seem, because most 
files where NOCOW is recommended, database files, VM images, pre-
allocated torrent files, etc, are created and managed by applications 
that already have their own data integrity management/verification/repair 
methods, since they're designed to work on filesystems without the data 
integrity guarantees btrfs normally provides.

In fact, it's possible, even likely in case of a crash, that the 
application's own data integrity mechanisms can fight with those 

Re: Are nocow files snapshot-aware

2014-02-06 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net wrote:

 Ah okay, that makes it clear. So, actually, in the snapshot the file is
 still nocow - just for the exception that blocks being written to become
 unshared and relocated. This may introduce a lot of fragmentation but it
 won't become worse when rewriting the same blocks over and over again.
 
 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and the
 continuing VM writes would nearly guarantee that many of those snapshots
 had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
 at all!

The question here is: does it really make sense to create such snapshots of 
disk images that are currently online and running a system? They will 
probably be broken anyway after a rollback - or at least I'd not fully trust 
the contents.

VM images should not be part of a subvolume of which snapshots are taken at 
a regular and short interval. The problem will go away if you follow this 
rule.

The same probably applies to any kind of file which you make nocow - e.g. 
database files. Most of those files implement their own kind of transaction 
protection or COW scheme; look at InnoDB files, for example. They neither 
gain anything from IO schedulers (because InnoDB internally does block 
sorting and prioritizing and knows better; doing otherwise even hurts 
performance), nor do they gain from filesystem semantics like COW (because 
InnoDB does its own transactions and atomic updates and can probably do 
better for its use case). Similar reasoning applies to disk images (imagine 
ZFS, NTFS, ReFS, or btrfs images on btrfs). Snapshots can only do harm here 
(the only protection use case would be to have a backup, but snapshots are 
no backups), and COW will probably hurt performance a lot. The only use case 
is taking _controlled_ snapshots - and doing it every 30 seconds is by all 
means NOT controlled; it's completely nondeterministic.

-- 
Replies to list only preferred.



Re: Are nocow files snapshot-aware

2014-02-06 Thread cwillu
On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:
 Duncan 1i5t5.dun...@cox.net wrote:

 Ah okay, that makes it clear. So, actually, in the snapshot the file is
 still nocow - just for the exception that blocks being written to become
 unshared and relocated. This may introduce a lot of fragmentation but it
 won't become worse when rewriting the same blocks over and over again.

 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and the
 continuing VM writes would nearly guarantee that many of those snapshots
 had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
 at all!

 The question here is: Does it really make sense to create such snapshots of
 disk images currently online and running a system. They will probably be
 broken anyway after rollback - or at least I'd not fully trust the contents.

 VM images should not be part of a subvolume of which snapshots are taken at
 a regular and short interval. The problem will go away if you follow this
 rule.

 The same applies to probably any kind of file which you make nocow - e.g.
 database files. Most of those file implement their own way of transaction
 protection or COW system, e.g. look at InnoDB files. Neither they gain
 anything from using IO schedulers (because InnoDB internally does block
 sorting and prioritizing and knows better, doing otherwise even hurts
 performance), nor they gain from file system semantics like COW (because it
 does its own transactions and atomic updates and probably can do better for
 its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
 btrfs images on btrfs). Snapshots can only do harm here (the only
 protection use case would be to have a backup, but snapshots are no
 backups), and COW will probably hurt performance a lot. The only use case is
 taking _controlled_ snapshots - and doing it all 30 seconds is by all means
 NOT controlled, it's completely undeterministic.

If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.


Re: Are nocow files snapshot-aware

2014-02-06 Thread Chris Murphy

On Feb 6, 2014, at 6:01 PM, cwillu cwi...@cwillu.com wrote:

 On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:
 Duncan 1i5t5.dun...@cox.net wrote:
 
 Ah okay, that makes it clear. So, actually, in the snapshot the file is
 still nocow - just for the exception that blocks being written to become
 unshared and relocated. This may introduce a lot of fragmentation but it
 won't become worse when rewriting the same blocks over and over again.
 
 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and the
 continuing VM writes would nearly guarantee that many of those snapshots
 had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
 at all!
 
 The question here is: Does it really make sense to create such snapshots of
 disk images currently online and running a system. They will probably be
 broken anyway after rollback - or at least I'd not fully trust the contents.
 
 VM images should not be part of a subvolume of which snapshots are taken at
 a regular and short interval. The problem will go away if you follow this
 rule.
 
 The same applies to probably any kind of file which you make nocow - e.g.
 database files. Most of those file implement their own way of transaction
 protection or COW system, e.g. look at InnoDB files. Neither they gain
 anything from using IO schedulers (because InnoDB internally does block
 sorting and prioritizing and knows better, doing otherwise even hurts
 performance), nor they gain from file system semantics like COW (because it
 does its own transactions and atomic updates and probably can do better for
 its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
 btrfs images on btrfs). Snapshots can only do harm here (the only
 protection use case would be to have a backup, but snapshots are no
 backups), and COW will probably hurt performance a lot. The only use case is
 taking _controlled_ snapshots - and doing it all 30 seconds is by all means
 NOT controlled, it's completely undeterministic.
 
 If the database/virtual machine/whatever is crash safe, then the
 atomic state that a snapshot grabs will be useful.

How fast is this state fixed on disk from the time of the snapshot command? 
Loosely speaking. I'm curious if this is < 1 second; a few seconds; or possibly 
up to the 30 second default commit interval? And also if it's even related to 
the commit interval time at all?

I'm also curious what happens to files that are presently being written. 
E.g. I'm writing a 1GB file to subvol A, and before it completes I snapshot 
subvol A into A.1. If I go find the file I was writing, in A.1, what's its 
state? Truncated? Or are in-progress writes permitted to complete if it's a 
rw snapshot? Any difference in behavior if it's an ro snapshot?


Chris Murphy



Re: Are nocow files snapshot-aware

2014-02-06 Thread Duncan
Kai Krakow posted on Fri, 07 Feb 2014 01:32:27 +0100 as excerpted:

 Duncan 1i5t5.dun...@cox.net wrote:
 
 That also explains the report of a NOCOW VM-image still triggering the
 snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
 snapshotted btrfs (thousands of snapshots, something like every 30
 seconds or more frequent, without thinning them down right away), and
 the continuing VM writes would nearly guarantee that many of those
 snapshots had unique blocks, so the effect was nearly as bad as if it
 wasn't NOCOW at all!
 
 The question here is: Does it really make sense to create such snapshots
 of disk images that are currently online and running a system? They will
 probably be broken anyway after rollback - or at least I'd not fully
 trust the contents.
 
 VM images should not be part of a subvolume of which snapshots are taken
 at a regular and short interval. The problem will go away if you follow
 this rule.
 
 The same probably applies to any kind of file which you make nocow -
 e.g. database files. The only sensible use case is taking _controlled_
 snapshots - and doing it every 30 seconds is by all means NOT
 controlled, it's completely nondeterministic.

I'd absolutely agree -- and that wasn't my report, I'm just recalling it, 
as at the time I didn't understand the interaction between NOCOW and 
snapshots and couldn't quite understand how a NOCOW file was still 
triggering the snapshot-aware-defrag pathology, which in fact we were 
just beginning to realize based on such reports.

In fact at the time I assumed it was because the NOCOW had been added 
after the file was originally written, such that btrfs couldn't NOCOW it 
properly.  That still might have been the case, but now that I understand 
the interaction between snapshots and NOCOW, I see that such heavy 
snapshotting on an actively written VM could trigger the same issue, even 
if the NOCOW file was created properly and was indeed NOCOW when content 
was actually first written into it.

But definitely agreed.  30 second snapshotting, with a 30 second commit 
deadline, is pretty much off the deep end regardless of the content.  I'd 
even argue that 1 minute snapshotting, without thinning the snapshots down 
to say 5 or 10 minute intervals after an hour or so, is too extreme to be 
practical.  Even after a couple days of that, how are you going to manage 
the thousands of snapshots, or know which precise snapshot to roll back to 
if you had to?  That's why, in the example I posted here some days ago of 
what I considered the extreme end of practical, IIRC I had it take 1 minute 
snapshots but thin them down to 5 or 10 minute intervals after a couple 
hours, to half hour intervals after a couple days, and eventually to 
something like 90 day intervals out to a decade.  Even that I considered 
extreme, although at least reasonably so.  The point was that even starting 
with something as extreme as 1 minute snapshots, and keeping snapshots for 
a decade, reasonable thinning kept it very manageable: something like 250 
snapshots total, well below the thousands or tens of thousands we're 
sometimes seeing in reports.  Those are hardly practical no matter how you 
slice it.  How likely are you to know the exact minute to roll back to, 
even a month out?  And even if you do, if you survived a month before 
detecting the problem, how important is rolling back to precisely the last 
minute before it?  At a month out, perhaps the hour matters, but the 
minute?
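The snapshot count under such a thinning schedule can be checked with a 
short script. This is only a sketch: the tier boundaries below (per-minute 
for an hour, 10-minute out to 6 hours, half-hour out to 2 days, daily out 
to 90 days, then 90-day out to a decade) are one plausible reading of the 
schedule described above, not an exact reproduction of it:

```python
# Sketch: count snapshots retained under a tiered thinning schedule.
# Tier boundaries are assumptions (one plausible reading of the schedule
# described above), not an exact reproduction of it.

def retained_snapshots(tiers):
    """tiers: ordered list of (interval_minutes, keep_until_minutes)."""
    total, prev_horizon = 0, 0
    for interval, horizon in tiers:
        # Snapshots kept at this interval between the previous horizon
        # and this one.
        total += (horizon - prev_horizon) // interval
        prev_horizon = horizon
    return total

MIN_PER_DAY = 24 * 60
tiers = [
    (1, 60),                                     # per-minute for the first hour
    (10, 6 * 60),                                # every 10 min out to 6 hours
    (30, 2 * MIN_PER_DAY),                       # every half hour out to 2 days
    (MIN_PER_DAY, 90 * MIN_PER_DAY),             # daily out to 90 days
    (90 * MIN_PER_DAY, 10 * 365 * MIN_PER_DAY),  # every 90 days out to a decade
]
print(retained_snapshots(tiers))  # → 301
```

About 300 retained snapshots for a decade of history, in the same ballpark 
as the "something like 250" figure, versus 43,200 snapshots for just one 
month of unthinned per-minute snapshotting.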

But some of the snapshotting scripts out there, and the admins running 
them, seem to have the idea that just because it's possible it must be 
done, and they have snapshots taken every minute or more frequently, with 
no automated snapshot thinning at all.  IMO that's pathology run amok 
even if btrfs /was/ stable and mature and /could/ handle it properly.

That's regardless of the content, so it comes at the problem from a 
different angle than the one you were attacking it from...  But if admins 
aren't able to 
recognize the problem with per-minute snapshots without any thinning at 
all for days, weeks, months on end, I doubt they'll be any better at 
recognizing that VMs, databases, etc, should have a dedicated subvolume.  
Taking the long view, with a bit of luck we'll get to the point where 
database and VM setup scripts and/or documentation recommend setting NOCOW 
on the directory the VMs/DBs/etc will be in, but in practice, even that's 
pushing it, and will take some time (2-5 years) as btrfs stabilizes and 
mainstreams, taking over from ext4 as the assumed Linux default.  Other 
than that, I guess it'll be a case-by-case basis as people report 
problems here.  But with a snapshot-aware-defrag that actually scales, 
hopefully there won't be so many people reporting problems.  True, they 
might not have the best optimized system and may have some minor 
pathologies in their admin practices, but as long as they remain /minor/ 
pathologies because btrfs can deal with them better than it does now thus 
keeping them from 

Re: Are nocow files snapshot-aware

2014-02-05 Thread Kai Krakow
David Sterba dste...@suse.cz schrieb:

 On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
 On 02/04/2014 03:52 PM, Kai Krakow wrote:
 Hi!
 
 I'm curious... The whole snapshot thing on btrfs is based on its COW
 design. But you can make individual files and directory contents nocow
 by applying the C attribute on it using chattr. This is usually
 recommended for database files and VM images. So far, so good...
 
 But what happens to such files when they are part of a snapshot? Do they
 become duplicated during the snapshot? Do they become unshared (as a
 whole) when written to? Or when the parent snapshot becomes deleted? Or maybe
 Or maybe the nocow attribute is just ignored after a snapshot was taken?
 
 After all they are nocow and thus would be handled in another way when
 snapshotted.
 
 When snapshotted, nocow files fall back to normal cow behaviour.
 
 This may seem unclear to people not familiar with the actual
 implementation, and I had to think for a second about that sentence. The
 file will keep the NOCOW status, but any modified blocks will be newly
 allocated on the first write (in a COW manner), then the block location
 will not change anymore (unlike ordinary COW).

Ah okay, that makes it clear. So, actually, in the snapshot the file is 
still nocow - with the single exception that blocks being written to become 
unshared and relocated once. This may introduce a lot of fragmentation, but 
it won't become worse when rewriting the same blocks over and over again.
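That behaviour can be illustrated with a toy model (illustrative only - the 
names and structure below are assumptions for the sketch, not btrfs 
internals):

```python
# Toy model of a NOCOW file across a snapshot (illustrative only -
# these names and structures are assumptions, not btrfs internals).

class NocowFile:
    def __init__(self, nblocks):
        self._next_phys = 0
        self.blocks = [self._alloc() for _ in range(nblocks)]  # physical locations
        self.shared = [False] * nblocks  # is this block shared with a snapshot?

    def _alloc(self):
        self._next_phys += 1
        return self._next_phys

    def snapshot(self):
        # A snapshot shares every physical block in place - no copying.
        self.shared = [True] * len(self.blocks)
        return list(self.blocks)

    def write(self, i):
        # First write to a shared block relocates it once (a single COW);
        # later writes to the same block stay in place.
        if self.shared[i]:
            self.blocks[i] = self._alloc()
            self.shared[i] = False
        return self.blocks[i]

f = NocowFile(4)
snap = f.snapshot()
first = f.write(2)   # diverges from the snapshot: new physical location
second = f.write(2)  # rewritten in place: same location as before
print(first == second, first != snap[2], f.blocks[0] == snap[0])
# → True True True
```

Rewriting block 2 repeatedly never relocates it again, which matches the 
point above: the unsharing (and hence the fragmentation) is a one-time cost 
per snapshot, not a growing one.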

 HTH

Yes, it does. ;-)

-- 
Replies to list only preferred.



Re: Are nocow files snapshot-aware

2014-02-05 Thread Duncan
Kai Krakow posted on Wed, 05 Feb 2014 19:17:10 +0100 as excerpted:

 David Sterba dste...@suse.cz schrieb:
 
 On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
 On 02/04/2014 03:52 PM, Kai Krakow wrote:
 Hi!
 
 I'm curious... The whole snapshot thing on btrfs is based on its COW
 design. But you can make individual files and directory contents
 nocow by applying the C attribute on it using chattr. This is usually
 recommended for database files and VM images. So far, so good...
 
 But what happens to such files when they are part of a snapshot? Do
 they become duplicated during the snapshot? Do they become unshared
 (as a whole) when written to? Or when the parent snapshot becomes
 deleted?
 Or maybe the nocow attribute is just ignored after a snapshot was
 taken?
 
 When snapshotted, nocow files fall back to normal cow behaviour.
 
 This may seem unclear to people not familiar with the actual
 implementation, and I had to think for a second about that sentence.
 The file will keep the NOCOW status, but any modified blocks will be
 newly allocated on the first write (in a COW manner), then the block
 location will not change anymore (unlike ordinary COW).
 
 Ah okay, that makes it clear. So, actually, in the snapshot the file is
 still nocow - with the single exception that blocks being written to
 become unshared and relocated once. This may introduce a lot of
 fragmentation, but it won't become worse when rewriting the same blocks
 over and over again.

That also explains the report of a NOCOW VM-image still triggering the 
snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
snapshotted btrfs (thousands of snapshots, something like every 30 
seconds or more frequent, without thinning them down right away), and the 
continuing VM writes would nearly guarantee that many of those snapshots 
had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW 
at all!

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: Are nocow files snapshot-aware

2014-02-04 Thread Josef Bacik

On 02/04/2014 03:52 PM, Kai Krakow wrote:

Hi!

I'm curious... The whole snapshot thing on btrfs is based on its COW design.
But you can make individual files and directory contents nocow by applying
the C attribute on it using chattr. This is usually recommended for database
files and VM images. So far, so good...

But what happens to such files when they are part of a snapshot? Do they
become duplicated during the snapshot? Do they become unshared (as a whole)
when written to? Or when the parent snapshot becomes deleted? Or maybe
the nocow attribute is just ignored after a snapshot was taken?

After all they are nocow and thus would be handled in another way when
snapshotted.


When snapshotted, nocow files fall back to normal cow behaviour. Thanks,

Josef


Re: Are nocow files snapshot-aware

2014-02-04 Thread David Sterba
On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
 On 02/04/2014 03:52 PM, Kai Krakow wrote:
 Hi!
 
 I'm curious... The whole snapshot thing on btrfs is based on its COW design.
 But you can make individual files and directory contents nocow by applying
 the C attribute on it using chattr. This is usually recommended for database
 files and VM images. So far, so good...
 
 But what happens to such files when they are part of a snapshot? Do they
 become duplicated during the snapshot? Do they become unshared (as a whole)
 when written to? Or when the parent snapshot becomes deleted? Or maybe
 the nocow attribute is just ignored after a snapshot was taken?
 
 After all they are nocow and thus would be handled in another way when
 snapshotted.
 
 When snapshotted, nocow files fall back to normal cow behaviour.

This may seem unclear to people not familiar with the actual
implementation, and I had to think for a second about that sentence. The
file will keep the NOCOW status, but any modified blocks will be newly
allocated on the first write (in a COW manner), then the block location
will not change anymore (unlike ordinary COW).

HTH