Re: btrfsck: unresolved ref root
Just curious, would the observed behavior change if one modifies head in any way between snap and umount? Regards, Andrey

On 02.12.2011 13:28, Jan Schmidt list.bt...@jan-o-sch.net wrote: While hunting another bug, I got distracted by btrfsck error output, which is reproducible as simply as creating a snapshot. Either btrfsck is strange or subvolume snapshot does something wrong.

# mkfs.btrfs /dev/sdo
# mount /dev/sdo /mnt/scratch/
# umount /mnt/scratch/
# btrfsck /dev/sdo
- everything ok
# mount /dev/sdo /mnt/scratch/
# btrfs subvol snap /mnt/scratch/ /mnt/scratch/snap1
# umount /mnt/scratch
# btrfsck /dev/sdo
fs tree 257 refs 2
unresolved ref root 257 dir 256 index 2 namelen 5 name snap1 error 600
[...]

Tested with current for-linus and most current btrfsck. I also have older filesystems with snapshots I never ran btrfsck on before; they also show the unresolved ref error. From a quick look at the btrfsck code, this complaint means that btrfsck is looking for two keys, BTRFS_ROOT_REF_KEY and BTRFS_ROOT_BACKREF_KEY, each in the tree of tree roots. However, there's only one of each (as I would expect):

item 4 key (FS_TREE ROOT_REF 257) itemoff 3238 itemsize 23
	root ref key dirid 256 sequence 2 name snap1
...
item 12 key (257 ROOT_BACKREF 5) itemoff 2315 itemsize 23
	root backref key dirid 256 sequence 2 name snap1

-Jan -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
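The consistency rule btrfsck enforces here can be sketched as a toy userspace model: every ROOT_REF (parent root referencing a subvolume) should be matched by exactly one ROOT_BACKREF with the same dirid, sequence, and name, and an unmatched ref is reported as "unresolved". The struct and helper names below are illustrative, not btrfsck's actual data structures.

```c
#include <string.h>

/* Simplified, hypothetical model of the pairing check btrfsck performs:
 * every ROOT_REF (parent -> child) in the tree of tree roots must be
 * matched by a ROOT_BACKREF (child -> parent) carrying the same dirid,
 * sequence and name.  Real btrfsck walks btree items instead. */
struct root_ref {
    unsigned long long parent;   /* objectid of the referencing root  */
    unsigned long long child;    /* objectid of the referenced subvol */
    unsigned long long dirid;    /* directory holding the entry       */
    unsigned long long sequence; /* dir index of the entry            */
    const char *name;            /* entry name, e.g. "snap1"          */
};

/* Returns 1 when ref and backref describe the same link. */
static int refs_match(const struct root_ref *ref,
                      const struct root_ref *backref)
{
    return ref->parent == backref->parent &&
           ref->child == backref->child &&
           ref->dirid == backref->dirid &&
           ref->sequence == backref->sequence &&
           strcmp(ref->name, backref->name) == 0;
}

/* Count refs that have no matching backref ("unresolved refs"). */
int count_unresolved(const struct root_ref *refs, int nrefs,
                     const struct root_ref *backrefs, int nbackrefs)
{
    int unresolved = 0;
    for (int i = 0; i < nrefs; i++) {
        int found = 0;
        for (int j = 0; j < nbackrefs; j++)
            if (refs_match(&refs[i], &backrefs[j]))
                found = 1;
        if (!found)
            unresolved++;
    }
    return unresolved;
}
```

With the items shown in the dump (FS_TREE ref to root 257, dirid 256, sequence 2, name "snap1", plus the matching backref), this model finds nothing unresolved, which is why the error 600 above looks like a btrfsck-side bug rather than a bad snapshot.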
Re: Quota Implementation
On Fri, Jun 3, 2011 at 8:47 PM, Hugo Mills h...@carfax.org.uk wrote: On Fri, Jun 03, 2011 at 06:24:41PM +0200, Arne Jansen wrote: Hi, If no one is already working on it, I'd like to take the Quota lock and see how far I come. Let me sketch out in short what I'm planning to do:

- Quota will be subvolume based. Only the FS-trees and data extents will be accounted.
- Quota groups can be defined. Every quota group can comprise any number of subvolumes. A subvolume can be assigned to any number of quota groups.
- A quota group can account/limit the total amount of space that is referenced by it and/or the amount of space that is exclusively referenced (i.e. referenced by no other quota group).
- With this it is possible to define a hierarchical quota that need not necessarily reflect the filesystem hierarchy.
- It is also possible to decide for each snapshot whether it should be accounted into the parent group. So in a scenario where each subvolume reflects a user home, it's possible to have some snapshots accounted to the user and others not (e.g. the ones needed for system backups).
- Quota information will be stored in new records, possibly in a separate tree.
- It should be possible to change the quota config and group assignments online, though this might need a full re-scan of the fs.
- It does NOT include any kind of user/group (UID/GID) quota.

Any addenda or arguments why it's impossible or insane welcome.

There's a problem in that in some cases, it's possible to get into a situation where you can't *delete* files because you're going over quota. If I have two subvolumes that share most of their data (e.g. one is a snapshot of the other), and both subvolumes have a limit under the exclusive use clause, then deleting material from subvolume A could cause subvolume B to go over quota.
If users can create their own subvolumes, then using the exclusive use form is also pointless, because as a user, I can simply snapshot (or otherwise CoW copy) all my data into a snapshot, and I then don't pay for it. That one probably comes under "the admin shot himself in the foot", though. Getting out the bike-shed brush, I might suggest the use of some name other than quota, because inevitably people will think of UID/GID-type quotas, and we've got enough confusingly-modified terminology already. Size bounds, storage bounds, possibly? Budget :)? Regards, Andrey

Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Is it true that last known good on Windows XP --- boots into CP/M?
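The referenced/exclusive distinction from the proposal, and the delete trap Hugo points out, can be sketched with a toy model in which each extent carries a bitmask of the quota groups that reference it. The struct layout and names are illustrative, not the btrfs on-disk format.

```c
/* Hypothetical model of the proposed accounting: "referenced" counts
 * every extent a quota group can see; "exclusive" only those extents
 * that no other group shares. */
struct extent {
    unsigned long long bytes;
    unsigned int groups;  /* bit i set => quota group i references it */
};

unsigned long long referenced(const struct extent *e, int n, int group)
{
    unsigned long long sum = 0;
    for (int i = 0; i < n; i++)
        if (e[i].groups & (1u << group))
            sum += e[i].bytes;
    return sum;
}

unsigned long long exclusive(const struct extent *e, int n, int group)
{
    unsigned long long sum = 0;
    for (int i = 0; i < n; i++)
        if (e[i].groups == (1u << group))  /* this group only */
            sum += e[i].bytes;
    return sum;
}
```

The trap falls straight out of this model: when group A drops its reference to a shared extent, that extent becomes exclusive to group B, so B's exclusive total grows on A's delete even though B did nothing.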
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, Mar 25, 2011 at 6:39 AM, Steven Rostedt rost...@goodmis.org wrote: On Thu, Mar 24, 2011 at 10:41:51AM +0100, Tejun Heo wrote: Adaptive owner spinning used to be applied only to mutex_lock(). This patch applies it also to mutex_trylock(). btrfs has developed custom locking to avoid excessive context switches in its btree implementation. Generally, doing away with the custom implementation and just using the mutex shows better behavior; however, there's an interesting distinction in the custom implementation of trylock. It distinguishes between simple trylock and tryspin, where the former just tries once and then fails while the latter does some spinning before giving up. Currently, mutex_trylock() doesn't use adaptive spinning. It tries just once. I got curious whether using adaptive spinning on mutex_trylock() would be beneficial and it seems so, for btrfs anyway. The following results are from dbench 50 run on an opteron two socket eight core machine with 4GiB of memory and an OCZ Vertex SSD. During the run, the disk stays mostly idle and all CPUs are fully occupied, so the difference in locking performance becomes quite visible. SIMPLE is with the locking simplification patch[1] applied, i.e. it basically just uses the mutex. SPIN is with this patch applied on top - mutex_trylock() uses adaptive spinning.

         USER   SYSTEM  SIRQ    CXTSW  THROUGHPUT
 SIMPLE  61107  354977  217   8099529  845.100 MB/sec
 SPIN    63140  364888  214   6840527  879.077 MB/sec

On various runs, the adaptive spinning trylock consistently posts higher throughput. The amount of difference varies but it outperforms consistently. In general, using adaptive spinning on trylock makes sense as trylock failure usually leads to a costly unlock-relock sequence.
[1] http://article.gmane.org/gmane.comp.file-systems.btrfs/9658 Signed-off-by: Tejun Heo t...@kernel.org

I'm curious about the effects that this has on those places that do:

again:
	mutex_lock(A);
	if (!mutex_trylock(B)) {
		mutex_unlock(A);
		goto again;
	}

Where the normal locking order is: B -> A

If another location does:

	mutex_lock(B);
	[...]
	mutex_lock(A);

But another process has A already, and is running, it may spin waiting for A as A's owner is still running. But now, mutex_trylock(B) becomes a spinner too, and since B's owner is running (spinning on A) it will spin as well waiting for A's owner to release it. Unfortunately, A's owner is also spinning waiting for B to release it. If both A and B's owners are real time tasks, then boom! deadlock.

Turning try_lock into an indefinitely spinning one breaks its semantics, so deadlock is to be expected. But what's wrong in this scenario if try_lock spins a bit before giving up? Regards, Andrey -- Steve
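The retry pattern Steven describes can be sketched in userspace with POSIX mutexes; the function names are mine, and pthread_mutex_trylock plays the role of the kernel's mutex_trylock. With a plain (non-spinning) trylock the back-off makes the wrong-order path safe; the danger in the thread arises only if the trylock itself starts spinning on a running owner.

```c
#include <pthread.h>

/* The normal locking order is B -> A, so a path that must take A first
 * can only take B with a trylock, backing off (dropping A) when B is
 * contended.  Userspace sketch of the kernel pattern from the mail. */
pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Acquire both locks from the "wrong" side without deadlocking. */
void lock_both_from_a_side(void)
{
    for (;;) {
        pthread_mutex_lock(&lock_a);
        if (pthread_mutex_trylock(&lock_b) == 0)
            return;                     /* got both, safely */
        pthread_mutex_unlock(&lock_a);  /* back off, let the B -> A path run */
    }
}

void unlock_both(void)
{
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}
```

A spinning trylock turns the back-off branch into a busy-wait on B's owner, which is exactly the cross-spin Steven warns can become a real-time deadlock.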
Re: [PATCH 2/2] mutex: Apply adaptive spinning on mutex_trylock()
On Fri, Mar 25, 2011 at 4:12 PM, Steven Rostedt rost...@goodmis.org wrote: On Fri, 2011-03-25 at 14:13 +0300, Andrey Kuzmin wrote: Turning try_lock into an indefinitely spinning one breaks its semantics, so deadlock is to be expected. But what's wrong in this scenario if try_lock spins a bit before giving up?

Because that will cause this scenario to spin that little longer always, and introduce latencies that did not exist before. Either the solution does not break this scenario, or it should not go in.

Broken semantics and extra latency are two separate issues. If the former is fixed, the latter is easily handled by introducing a new mutex_trylock_spin call that lets one either stick to the existing behavior (try/fail) or choose a new one where the latency penalty is justified by locking patterns. Regards, Andrey -- Steve
Re: [RFC] Tree fragmentation and prefetching
On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote: While looking into the performance of scrub I noticed that a significant amount of time is being used for loading the extent tree and the csum tree. While this is no surprise, I did some prototyping on how to improve on it. The main idea is to load the tree (or parts of it) top-down, order the needed blocks and distribute them over all disks. To keep you interested, some results first.

a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree:   140s
   reading both trees:  324s

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree:   20.4s
   reading both trees:  25.7s

10x speed-up looks indeed impressive. Just for me to be sure, did I get you right in that you attribute this effect specifically to enumerating tree leaves in key order vs. disk order when the two are not aligned? Regards, Andrey

The test setup consists of a filesystem on 7 Seagate ES.2 1TB disks, filled 28%. It is created with the current git tree + the round robin patch and filled with fs_mark -D 512 -t 16 -n 4096 -F -S0. The 'normal' read is done by enumerating the leaves by btrfs_next_leaf() with path->reada=2. Both trees are being enumerated one after the other. The prototype currently just uses raw bios, does not make use of the page cache and does not enter the read pages into the cache. This will probably add some overhead. It also does not check the crcs. While it is very promising to implement it for scrub, I think a more general interface which can be used for every enumeration would be beneficial. Use cases that come to mind are rebalance, reflink, deletion of large files, listing of large directories etc..
I'd imagine an interface along the lines of

int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

to trigger the readahead of parts of a tree. Multiple readahead requests can be given before waiting. This would enable the very beneficial folding seen above for 'reading both trees'. Also it would be possible to add a cascading readahead, where the content of leaves would trigger readaheads in other trees, maybe by giving a callback for the decisions what to read instead of the fixed start/end range. For the implementation I'd need an interface which I haven't been able to find yet. Currently I can trigger the read of several pages / tree blocks and wait for the completion of each of them. What I'd need would be an interface that gives me a callback on each completion, or a waiting function that wakes up on each completion with the information which pages just completed. One way to achieve this would be to add a hook, but I'd gladly take any implementation hints. -- Arne
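The 10x speedup reported above comes largely from issuing tree-block reads in physical (disk) order instead of key order. The core reordering step can be sketched in a few lines of userspace C; the address list is illustrative, and a real implementation would also bucket the requests per device before dispatching bios.

```c
#include <stdlib.h>

/* Compare two physical block addresses for ascending order. */
static int cmp_physical(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Collect the disk addresses of all tree blocks to be read (gathered
 * top-down from the nodes above), sort them, and issue the reads in
 * physical order so each disk seeks monotonically. */
void order_reads(unsigned long long *physical, int n)
{
    qsort(physical, n, sizeof(physical[0]), cmp_physical);
}
```

Enumerating by btrfs_next_leaf() visits leaves in key order, which is effectively a random permutation of the sorted order above; that difference alone accounts for most of the seek time in variant a).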
Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()
On Wed, Mar 23, 2011 at 6:48 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote: Currently, mutex_trylock() doesn't use adaptive spinning. It tries just once. I got curious whether using adaptive spinning on mutex_trylock() would be beneficial and it seems so, at least for btrfs anyway.

Hmm. Seems reasonable to me.

TAS/spin with exponential back-off has been the preferred locking approach in Postgres (and I believe other DBMSes) for years, at least since '04 when I last touched the Postgres code. Even with the 'false negative' cost in user-space being much higher than in the kernel, it's still just a question of scale (no wonder a measurable improvement here is reported from dbench on an SSD capable of a few dozen thousand IOPS). Regards, Andrey

The patch looks clean, although part of that is just the mutex_spin() cleanup that is independent of actually using it in trylock. So no objections from me. Linus
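The test-and-set/exponential-back-off scheme mentioned for Postgres can be sketched with C11 atomics; this is an illustration of the general technique, not the kernel's adaptive mutex (which instead spins only while the lock owner is running on a CPU), and cpu_relax_n is a stand-in for a pause instruction.

```c
#include <stdatomic.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;

/* Busy-wait stand-in for cpu_relax()/pause. */
static void cpu_relax_n(unsigned n)
{
    for (volatile unsigned i = 0; i < n; i++)
        ;
}

/* Test-and-set spinlock with capped exponential backoff: on each
 * failed atomic exchange, pause for a doubling interval before the
 * next attempt, reducing cache-line contention under load. */
void backoff_lock(void)
{
    unsigned delay = 1;
    while (atomic_flag_test_and_set_explicit(&lock_word,
                                             memory_order_acquire)) {
        cpu_relax_n(delay);
        if (delay < 1024)
            delay <<= 1;    /* exponential backoff, capped */
    }
}

void backoff_unlock(void)
{
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}
```

The backoff trades a little acquisition latency for much less coherence traffic, which is why the same idea pays off both in a user-space DBMS and in a kernel trylock path.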
Re: [RFC] Tree fragmentation and prefetching
On Wed, Mar 23, 2011 at 11:28 PM, Arne Jansen sensi...@gmx.net wrote: On 23.03.2011 20:26, Andrey Kuzmin wrote: On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote: While looking into the performance of scrub I noticed that a significant amount of time is being used for loading the extent tree and the csum tree. While this is no surprise, I did some prototyping on how to improve on it. The main idea is to load the tree (or parts of it) top-down, order the needed blocks and distribute them over all disks. To keep you interested, some results first.

a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree:   140s
   reading both trees:  324s

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree:   20.4s
   reading both trees:  25.7s

10x speed-up looks indeed impressive. Just for me to be sure, did I get you right in that you attribute this effect specifically to enumerating tree leaves in key order vs. disk order when the two are not aligned?

Yes. Leaves and the intermediate nodes tend to be quite scattered around the disk with respect to their logical order. Reading them in logical (ascending/descending) order requires lots of seeks.

And the patch actually does on-the-fly defragmentation, right? Why lose it then :)? Regards, Andrey

The test setup consists of a filesystem on 7 Seagate ES.2 1TB disks, filled 28%. It is created with the current git tree + the round robin patch and filled with fs_mark -D 512 -t 16 -n 4096 -F -S0. The 'normal' read is done by enumerating the leaves by btrfs_next_leaf() with path->reada=2. Both trees are being enumerated one after the other. The prototype currently just uses raw bios, does not make use of the page cache and does not enter the read pages into the cache. This will probably add some overhead. It also does not check the crcs. While it is very promising to implement it for scrub, I think a more general interface which can be used for every enumeration would be beneficial.
Use cases that come to mind are rebalance, reflink, deletion of large files, listing of large directories etc.. I'd imagine an interface along the lines of

int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

to trigger the readahead of parts of a tree. Multiple readahead requests can be given before waiting. This would enable the very beneficial folding seen above for 'reading both trees'. Also it would be possible to add a cascading readahead, where the content of leaves would trigger readaheads in other trees, maybe by giving a callback for the decisions what to read instead of the fixed start/end range. For the implementation I'd need an interface which I haven't been able to find yet. Currently I can trigger the read of several pages / tree blocks and wait for the completion of each of them. What I'd need would be an interface that gives me a callback on each completion, or a waiting function that wakes up on each completion with the information which pages just completed. One way to achieve this would be to add a hook, but I'd gladly take any implementation hints. -- Arne
Re: [PATCH] Btrfs: check items for correctness as we search V3
On Fri, Mar 18, 2011 at 3:52 AM, Chris Mason chris.ma...@oracle.com wrote: Excerpts from Andrey Kuzmin's message of 2011-03-17 15:12:32 -0400: On Thu, Mar 17, 2011 at 9:18 PM, Josef Bacik jo...@redhat.com wrote: Currently if we have corrupted items things will blow up in spectacular ways. So as we read in blocks and they are leaves, check the entire leaf to make sure all of the items are correct and point to valid parts in the leaf for the item data they are responsible for. If the item is corrupt we will kick back EIO and not read any of the copies since they are likely to not be correct either. This will catch generic corruptions, it will be up to the individual callers of btrfs_search_slot to make sure their items are right. Thanks,

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 495b1ac..9f31e11 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -323,6 +323,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 	int num_copies = 0;
 	int mirror_num = 0;
+	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
 	while (1) {
 		ret = read_extent_buffer_pages(io_tree, eb, start, 1,
@@ -331,6 +332,14 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 		    !verify_parent_transid(io_tree, eb, parent_transid))
 			return ret;
+		/*
+		 * This buffer's crc is fine, but its contents are corrupted, so
+		 * there is no reason to read the other copies, they won't be
+		 * any less wrong.
+		 */

This sounds like an overstatement to me. You may be dealing with an error pattern the CRC failed to catch, so giving up on reading a mirror at this point seems premature.

But we have no way to tell which one is more correct, at least not without a full fsck.
Voting with two participants (it would be better to have at least three, though theory says even this is insufficient in the presence of failures :)) is naturally deficient, so you are right in general, except in one particular case: when the 2nd copy passes CRC _and_ verification, and the two copies differ by a bit pattern undetectable by the CRC in use. This is a corner case, of course, but the price to pay for a false positive (full fsck with associated downtime) is high enough to make it worth a deeper dive. Regards, Andrey -chris
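The two-copy decision being debated can be sketched as follows: trust a mirror only if its checksum verifies, and accept that with only two copies there is no majority, so "both pass but differ" stays ambiguous. The toy_crc below is a deliberately weak stand-in for the real crc32c, and the whole routine is illustrative, not btrfs code.

```c
#include <stdint.h>

/* Toy checksum: XOR of all bytes.  NOT crc32c; used only so the
 * example is self-contained. */
static uint8_t toy_crc(const uint8_t *data, int len)
{
    uint8_t crc = 0;
    for (int i = 0; i < len; i++)
        crc ^= data[i];
    return crc;
}

/* Two-mirror vote: return the index (0 or 1) of a copy whose stored
 * checksum verifies, preferring copy 0; -1 when neither verifies.
 * With two participants there is no tie-breaker when both verify but
 * differ -- exactly the corner case discussed in the thread. */
int pick_copy(const uint8_t *c0, uint8_t crc0,
              const uint8_t *c1, uint8_t crc1, int len)
{
    int ok0 = toy_crc(c0, len) == crc0;
    int ok1 = toy_crc(c1, len) == crc1;
    if (ok0)
        return 0;
    if (ok1)
        return 1;
    return -1;
}
```

The sketch shows why reading the second mirror is still worthwhile after a content check fails on the first: an error pattern the checksum misses on copy 0 says nothing about copy 1.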
Re: How to implement raid1 repair
On Thu, Mar 17, 2011 at 8:42 PM, Chris Mason chris.ma...@oracle.com wrote: Excerpts from Jan Schmidt's message of 2011-03-17 13:37:54 -0400: On 03/17/2011 06:09 PM, Andrey Kuzmin wrote: On Thu, Mar 17, 2011 at 5:46 PM, Jan Schmidt list.bt...@jan-o-sch.net wrote: - Is it acceptable to retry reading a block immediately after the disk said it won't work? Or in case of a successful read followed by a checksum error? (Which is already being done right now in btrfs.)

These are two pretty different cases. When disk firmware fails a read, it means it has retried a number of times but gave up (suggesting a media error), so an upper layer retry would hardly make sense. A checksum error catches an on-disk EDC fault, so a retry is on the contrary quite reasonable.

Agreed.

- Is it acceptable to always write both mirrors if one is found to be bad (also consider ssds)?

Writing on the read path, bypassing the file-system transaction mechanism, doesn't seem a good idea to me. Just imagine losing power while overwriting the last good copy.

Okay, sounds reasonable to me. Let's say we're bypassing the transaction mechanism in the same rude manner, but only write the bad mirror. Does that seem reasonable?

The bad mirror is fair game. Write away, as long as you're sure you're excluding nodatacow and you don't allow that block to get reallocated elsewhere. You don't actually need to bypass the transaction mechanism, just those two things.

What happens if multiple readers (allowed by read lock) attempt an overwrite? Regards, Andrey -chris
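One answer to the "multiple readers attempt an overwrite" question is a per-block repair flag claimed with an atomic read-modify-write, so only the first reader that detects the bad mirror issues the rewrite. This is a hypothetical sketch with C11 atomics; the struct, field names, and helper are mine, not btrfs's.

```c
#include <stdatomic.h>

/* Per-block repair state, conceptually hanging off the extent buffer. */
struct block_state {
    atomic_int repair_claimed;  /* 0 = free, 1 = some reader owns repair */
    int repairs_issued;
};

/* Returns 1 if this caller won the race and should perform the mirror
 * rewrite, 0 if another reader already claimed it.  The compare-and-
 * exchange guarantees exactly one winner no matter how many readers
 * hit the bad copy concurrently. */
int claim_repair(struct block_state *b)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(&b->repair_claimed, &expected, 1)) {
        b->repairs_issued++;    /* exactly one caller reaches this */
        return 1;
    }
    return 0;
}
```

Losing readers can either proceed with the good copy or wait for the winner to finish, depending on how strict the repair ordering needs to be.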
Re: [PATCH] Btrfs: check items for correctness as we search V3
On Thu, Mar 17, 2011 at 9:18 PM, Josef Bacik jo...@redhat.com wrote: Currently if we have corrupted items things will blow up in spectacular ways. So as we read in blocks and they are leaves, check the entire leaf to make sure all of the items are correct and point to valid parts in the leaf for the item data they are responsible for. If the item is corrupt we will kick back EIO and not read any of the copies since they are likely to not be correct either. This will catch generic corruptions, it will be up to the individual callers of btrfs_search_slot to make sure their items are right. Thanks,

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 495b1ac..9f31e11 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -323,6 +323,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 	int num_copies = 0;
 	int mirror_num = 0;
+	clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 	io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
 	while (1) {
 		ret = read_extent_buffer_pages(io_tree, eb, start, 1,
@@ -331,6 +332,14 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 		    !verify_parent_transid(io_tree, eb, parent_transid))
 			return ret;
+		/*
+		 * This buffer's crc is fine, but its contents are corrupted, so
+		 * there is no reason to read the other copies, they won't be
+		 * any less wrong.
+		 */

This sounds like an overstatement to me. You may be dealing with an error pattern the CRC failed to catch, so giving up on reading a mirror at this point seems premature. Regards, Andrey
Re: Appending data to the middle of a file using btrfs-specific features
On Mon, Dec 6, 2010 at 7:05 PM, Chris Mason chris.ma...@oracle.com wrote: Excerpts from Nirbheek Chauhan's message of 2010-12-06 07:41:16 -0500: Hello, I'd like to know if there has been any discussion about adding a new feature to write (add) data at an offset, but without overwriting existing data, or re-writing the existing data. Essentially, in-place addition/removal of data to a file at a place other than the end of the file. Some possible use-cases of such a feature would be: (a) databases (which currently hack around this by allocating sparse files), (b) delta-patching (rsync, patch, xdelta, etc), (c) video editors (especially if combined with reflink copies). Besides I/O savings, it would also have significant space savings if the current subvolume being written to has been snapshotted (a common use-case for incremental backups). I've been told that the problem is somewhat difficult to solve properly under a block-based representation of data, but I was hoping that btrfs' reflink mechanism and its space-efficient packing of small files might make it doable. A hack I can think of is to do a BTRFS_IOC_CLONE_RANGE into a new file (up to the offset), writing whatever data is required, and then doing another BTRFS_IOC_CLONE_RANGE with an offset for the rest of the original file. This can be followed by a rename() over the original file. Similarly for removing data from the middle of a file. Would this work? Would it be cleaner to implement something equivalent internally?

It would work, yes. The operation has three cases:

1) file size doesn't change
2) extend the file with new bytes in the middle
3) make the file smaller, removing bytes in the middle

#1 is the easiest case, you can just use the clone range ioctl directly.

This doesn't seem to be interesting, looking just like a traditional COW overwrite.

For #2 and #3, all of the file pointers past the bytes you want to add or remove need to be updated with a new file offset.
I'd say for an initial implementation to use the IOC_CLONE_RANGE code, and after everything is working we can look at optimizing it with a shift ioctl if it makes sense.

Not sure how btrfs implements versioned B-trees, but other snapshot-capable file-systems I'm aware of utilize a DITTO B-tree entry that says "for this range, consult the previous version tree". One can imagine a DITTO(n) extension that would say "subtract n from the lookup key and then consult the previous version tree", effectively achieving range-shift behavior. FWIW. Regards, Andrey

Of the use cases you list, video editors seems the most useful. Databases already have things pretty much under control, and delta patching wants to go to a new file anyway. Video editing software has long been looking for ways to do this. -chris
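The DITTO(n) idea can be made concrete with a toy model in which each tree version holds its own entries plus an optional redirect rule: keys in a given range are resolved in the previous version after subtracting n. Everything here is illustrative (flat arrays instead of B-trees, invented names), not how btrfs stores its trees.

```c
struct entry {
    unsigned long long key;
    int value;
};

/* DITTO(n) redirect: keys in [start, end] resolve in the previous
 * version at key - shift. */
struct ditto {
    unsigned long long start, end;
    unsigned long long shift;
};

struct version {
    const struct version *prev;   /* earlier tree version, or NULL */
    const struct entry *entries;
    int nentries;
    const struct ditto *redirect; /* NULL when no DITTO entry */
};

/* Look up key in a version, following DITTO redirects into older
 * versions.  Returns 1 and fills *value on success, 0 on miss. */
int lookup(const struct version *v, unsigned long long key, int *value)
{
    for (int i = 0; i < v->nentries; i++) {
        if (v->entries[i].key == key) {
            *value = v->entries[i].value;
            return 1;
        }
    }
    if (v->redirect && key >= v->redirect->start && key <= v->redirect->end)
        return lookup(v->prev, key - v->redirect->shift, value);
    return 0;
}
```

Inserting two keys in the middle of a file then costs one DITTO(2) entry covering the shifted tail, instead of rewriting every pointer past the insertion point, which is exactly the range-shift behavior the mail describes.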
Re: Default to read-only on snapshot creation and have a flag if snapshot should be writable (was: [PATCH 0/5] btrfs: Readonly snapshots)
In my opinion, the point is not the default snapshot creation mode but rather the default usage, i.e. the user's expectation. On 11/30/10, Li Zefan l...@cn.fujitsu.com wrote: C Anthony Risinger wrote: On Nov 29, 2010, at 3:48 PM, Andrey Kuzmin andrey.v.kuz...@gmail.com wrote: I'm not sure why zfs came up, they don't own the term :). As to the zfs/overhead topic, I doubt there's any difference between a clone and a writable snapshot (there should be none, of course; it's just two different names for the same concept). Regards, Andrey

On Tue, Nov 30, 2010 at 12:43 AM, Mike Fedyk mfe...@mikefedyk.com wrote: On Mon, Nov 29, 2010 at 1:31 PM, Andrey Kuzmin andrey.v.kuz...@gmail.com wrote: This may sound excessive, as any new concept introduction that late in development, but readonly/writable snapshots could be further differentiated by naming the latter clones. This way the end-user would naturally perceive a snapshot as a read-only PIT fs image, while clone would naturally refer to a (writable) head fork.

I'm not sure we want to take all of the terminology that zfs uses as it may also bring the perceived drawbacks as well. Isn't there some additional overhead for a zfs clone compared to a snapshot? I'm not very familiar with zfs so that's why I ask.

I don't like the idea of readonly by default, or further changes to terminology, for several reasons:

I quite agree with you. LVM2 also defaults to read/write for snapshots.

) readonly by default offers no real enhancement whatsoever other than breaking _anything_ that's written right now

This was the first thing that came to my mind.

) btrfs readonly is not even really readonly; a superuser could simply flip a flag to enable writes, so readonly merely prevents accidental writes or misbehaving apps... ie.
protecting you from yourself

) backups are the simple/obvious use case; I personally use btrfs heavily for LXC containers, in which case nearly every single snapshot is intended to be writable -- usually cloning a template into a new domain

) I also use an initramfs hook to provide system rollbacks, also writable; the hook also provides multiple versions of the branch... all writable

) adding new terms is not a good idea imo; I've already spewed out many sentences explaining the difference between subvolumes and snapshots, i.e. that there is none... adding another term only adds to this problem; they each describe the same thing, but differentiate based on origin or current state, neither of which actually describe what it _is_ -- a new named pointer to a tree, like a git branch -- a subvolume.

I think a better solution/compromise would be to leave snapshots writable by default, since that's more true to what's happening internally anyway, but maybe introduce a mount option controlling the default action for that mount point. C Anthony [mobile] -- Regards, Andrey
Re: Default to read-only on snapshot creation and have a flag if snapshot should be writable (was: [PATCH 0/5] btrfs: Readonly snapshots)
This may sound excessive, as any new concept introduction that late in development, but readonly/writable snapshots could be further differentiated by naming the latter clones. This way the end-user would naturally perceive a snapshot as a read-only PIT fs image, while clone would naturally refer to a (writable) head fork. Regards, Andrey

On Tue, Nov 30, 2010 at 12:08 AM, Mike Fedyk mfe...@mikefedyk.com wrote: On Mon, Nov 29, 2010 at 12:41 PM, David Arendt ad...@prnet.org wrote: On 11/29/10 21:02, Mike Fedyk wrote: On Mon, Nov 29, 2010 at 12:02 AM, Li Zefan l...@cn.fujitsu.com wrote: (Cc: Sage Weil s...@newdream.net for changes in async snapshots) This patchset adds readonly-snapshots support. You can create a readonly snapshot, and you can also set a snapshot readonly/writable on the fly. A few readonly checks are added in setattr, permission, remove_xattr and set_xattr callbacks, as well as in some ioctls.

Great work! I have a suggestion on defaults when snapshots are created. I think they should default to being read-only, with a flag that can be set at creation time if they are meant to be read-write (and changeable at a later time as well, of course). This way user/admin preconceptions of a snapshot being read-only can be enforced by default, and the exception, when you want a read-write snapshot, can be available with a switch at the cli level (and probably a flag at the ioctl level). It gives one more natural distinction between a snapshot and a subvolume at the user conceptual level. What do you think?

I completely agree with you. I think lots of people use snapshots for backup purposes, and these ones shouldn't be writable by default.
Re: snapshots of directories
Did I get you right that btrfs does not support snapshots of an arbitrary directory? Regards, Andrey On Tue, Jan 12, 2010 at 5:19 AM, TARUISI Hiroaki taruishi.hir...@jp.fujitsu.com wrote: In btrfs, a snapshot is a clone of a subvolume, not of an arbitrary directory. You specified the '/root' directory, and it is not a subvolume, so the snapshot was created for its parent subvolume, the root of the filesystem. Regards, taruisi (2010/01/12 11:12), Michael Niederle wrote: I tried to take a snapshot of a single directory, e.g. root: btrfsctl -s root.2010-01-12 /root operation complete Btrfs v0.19-4-gab8fb4c-dirty Then I took a look at what's inside the newly created snapshot:

ls -l /root.2010-01-12/
total 0
drwxr-xr-x 1 root root 1192 2010-01-03 20:32:12 bin
drwxr-xr-x 1 root root   76 2009-06-25 0:40:35 boot
drwxr-xr-x 1 root root 1756 2010-01-12 2:33:07 cmds
drwxr-xr-x 1 root root    0 2010-01-06 12:21:46 data
drwxr-xr-x 1 root root 4356 2010-01-12 2:07:00 dev
drwxr-xr-x 1 root root   42 2010-01-04 12:29:45 downloads
drwxr-xr-x 1 root root 4528 2010-01-12 2:12:12 etc
drwxr-xr-x 1 root root   52 2010-01-11 12:57:47 home
drwxr-xr-x 1 root root    0 2007-11-10 4:44:07 initrd
drwxr-xr-x 1 root root 4490 2010-01-05 20:15:53 lib
drwxr-xr-x 1 root root  124 2008-04-27 14:53:39 mnt
drwxr-xr-x 1 root root   62 2008-01-08 0:21:58 net
drwxr-xr-x 1 root root    0 2008-04-09 3:19:16 objects
drwxr-xr-x 1 root root  316 2009-12-28 23:23:13 opt
dr-xr-xr-x 1 root root    0 2007-11-10 3:35:28 proc
drwxr-xr-x 1 root root 7676 2010-01-11 0:35:41 root
drwxr-xr-x 1 root root    0 2010-01-12 1:56:17 save
drwxr-xr-x 1 root root    0 2010-01-12 1:55:58 save2
drwxr-xr-x 1 root root 3804 2010-01-06 2:36:08 sbin
drwxr-xr-x 1 root root    0 2007-11-10 3:35:28 sys
drwxr-xr-x 1 root root  358 2010-01-11 18:44:29 tmp
drwxr-xr-x 1 root root  176 2009-12-29 17:08:37 usr
drwxr-xr-x 1 root root   72 2010-01-05 20:03:00 var

It seems that a snapshot of the root is always taken instead of one of the specified directory? Is this by design?
Snapshotting the root works fine, but if you take several snapshots it's a bit recursive, because every new snapshot contains all previous snapshots. Greetings, Michael
Re: committing new snapshots
On Tue, Dec 8, 2009 at 7:05 PM, Josef Bacik jo...@redhat.com wrote: On Mon, Dec 07, 2009 at 02:25:50PM -0800, Sage Weil wrote: When you create a new snap or subvol, first a new ROOT_ITEM is created while everything commits, and then the referring directory entry is set up (with a corresponding ROOT_BACKREF). First, if you say 'btrfsctl -s foo .' and then 'reboot -f -n' before the next regularly scheduled commit, the snap is created, but lost... there's no reference. Second, the unreferenced ROOT_ITEM is never cleaned up. Are there any existing plans for this? It would be nice if the reference could be committed as well the first time around. That probably requires a bit of futzing to determine what the root objectid is going to be beforehand, then adding the link in the namespace, then flushing things out and updating the root item in the right order? We could probably use the orphan code for this. Just create an orphan item for the snapshot and then delete it when the snapshot is fully created; that way, if somebody does reboot -fn, we clean up the root item and such. Thanks, It would be nice to have atomic behavior. Perhaps something similar to rename with its atomicity guarantees could help? Regards, Andrey Josef
Re: UI issues around RAID1
On Tue, Nov 17, 2009 at 6:25 PM, jim owens jow...@hp.com wrote: snip So we know the raw free blocks, but cannot guarantee how many raw blocks per new user write-block will be consumed, because we do not know what topology will be in effect for a new write. We could cheat and use worst-case topology numbers if all writes are the current default raid. Of course this ignores DUP unless it is set on the whole filesystem. And we also have the problem of metadata, which is dynamic, allocated in large chunks, and has a DUP type; how do we account for that in worst-case calculations? The worst case is probably wrong, but it may be more useful for people to know when they will run out of space. Or at least it might make some of our ENOSPC complaints go away :) Only raw and worst-case can be explained to users, and which we report is up to Chris. Today we report raw. After spending 10 years on a multi-volume filesystem that had (unsolvably) confusing df output, I'm of the opinion that nothing we do will make everyone happy. df is user-centric, and therefore is naturally expected to return used/available _logical_ capacity (how this translates to used physical space is up to file-system-specific tools to find out and report). Returning raw is counter-intuitive and causes surprise similar to Roland's. With topology configuration this flexible (down to per-file), the only option I see for df to return the available logical capacity is to compute it from the file-system object for which df is invoked. For instance, 'df /path/to/some/file' could return the logical capacity for the mountpoint where the file resides, computed from the underlying physical capacity available _and_ the topology for this file. 'df /mount-point' would under this implementation return the available logical capacity assuming the default topology for the referenced file-system.
As to used logical space accounting, this is file-system-specific, and I'm not yet familiar enough with the btrfs code-base to argue for any approach. Regards, Andrey But feel free to run a patch proposal by Chris. jim
Re: [PATCH] Snapshot/subvolume listing feature
Just for clarity, getdents is exactly the other interface option discussed a couple of weeks back (use virtual directories / the standard file-system API). Regards, Andrey On Mon, Nov 16, 2009 at 11:58 AM, TARUISI Hiroaki taruishi.hir...@jp.fujitsu.com wrote: Thank you for your advice. I'm aware of the redundant search, but I didn't think of a getdents-like interface. I'll remake it without the redundant search. Regards, taruisi Yan, Zheng wrote: 2009/11/16 TARUISI Hiroaki taruishi.hir...@jp.fujitsu.com: I made a snapshot/subvolume listing feature. This feature consists of two patches, one for the kernel (ioctl) and one for progs (btrfsctl). I will send these two patches in response to this mail soon. A new option '-l' is introduced to btrfsctl for listing. If this option is specified, btrfsctl calls the new ioctl. The new ioctl searches the root tree and enumerates subtrees. For each subtree, the ioctl searches the directory path to the tree root, and enumerates further descendants until no more subtrees are found. A MANPAGE-like option description and examples follow.

OPTIONS
-l _file_  List all snapshot/subvolume directories under the tree which _file_ belongs to.

EXAMPLES
# btrfsctl -l /work/btrfs
Base path = /work/btrfs/
No. Tree ID Subvolume Relative Path
  1     256 ss1/
  2     257 ss2/
  3     258 svs1/ss1/
  4     259 svs1/ss2/
  5     260 svs2/ss1/
  6     261 svs2/ss2/
  7     262 ss3/
  8     263 ss4/
  9     264 sv_pool/
 10     265 sv_pool/ss01/
 11     266 sv_pool/ss02/
 12     267 sv_pool/ss03/
 13     268 sv_pool/ss04/
 14     269 sv_pool/ss05/
 15     270 sv_pool/ss06/
 16     271 sv_pool/ss07/
 17     272 sv_pool/ss08/
 18     273 sv_pool/ss09/
 19     274 sv_pool/ss10/
operation complete
Btrfs v0.19-9-gd67dad2

Thank you for doing this. I had a quick look at the patches. It seems the ioctl returns the full path to each subvolume and uses a sequence ID to indicate the progress of the listing. Every time the ioctl is called, it tries to build the full list of subvolumes, then skips entries that were already returned. I think the API is suboptimal; a getdents-like API would be better.
(The ioctl would only list subvolumes within a given subvolume; the user program calls the ioctl recursively to list all subvolumes.) Yan, Zheng
Re: [RFC] big fat transaction ioctl
On Wed, Nov 11, 2009 at 6:03 PM, Chris Mason chris.ma...@oracle.com wrote: On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote: On Tue, 10 Nov 2009, Andrey Kuzmin wrote: On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil s...@newdream.net wrote: Hi all, This is an alternative approach to atomic user transactions for btrfs. The old start/end ioctls suffer from some basic limitations, namely - We can't properly reserve space ahead of time to avoid ENOSPC part way through the transaction, and - The process may die (seg fault, SIGKILL) part way through the transaction. Currently when that happens the partial transaction will commit. This patch implements an ioctl that lets the application completely specify the entire transaction in a single syscall. If the process gets killed or seg faults part way through, the entire transaction will still complete. The goal is to atomically commit updates to multiple files, xattrs, directories. But this is still a file system: we don't get rollback if things go wrong. Instead, do what we can up front to make sure things will work out. And if things do go wrong, optionally prevent a partial result from reaching the disk. Why not snapshot respective root (doesn't work if transaction spans multiple file-systems, but this doesn't look like a real-world limitation), run txn against that snapshot and rollback on failure instead? Snapshots are writable, cheap, and this looks like a real transaction abort mechanism. Good question. :) I hadn't looked into this before, but I think the snapshots could be used to achieve both atomicity and rollback. If userspace uses an rw mutex to quiesce writes, it can make sure all transactions complete before creating a snapshot (commit). The problem with this currently is the create snapshot ioctl is relatively slow... it calls commit_transaction, which blocks until everything reaches disk. 
I think to perform well this approach would need a hook to start a commit and then return as soon as it can guarantee that any subsequent operation's start_transaction can't join in that commit. This may be a better way to go about this, though. Does that sound reasonable, Chris? Yes, we could do this, but I don't think it will perform very well compared to your multi-operation ioctl. It really does depend on how often you need to do atomic ops (my guess is very). Honestly you'll get better performance with a simple write-ahead log from userland: Write-ahead logging is necessary anyway if the aim is to provide transactional semantics to an application. But, at the same time, without a snapshot there is no synchronization between the log and the file-system state. Regards, Andrey

step1: write a redo log somewhere in the FS, with enough information to bring all the objects you're about to touch to a consistent state.
step2: fsync the log.
step3: do your operations.
step4: append a record to the log that invalidates the last log op, or just truncate it to zero.
step5: fsync the log.

The big advantage of the log is that you won't be tied to btrfs, but it's two fsyncs where the big transaction framework does none. This should allow you to turn on the fast fsync log again, but I think the multi-operation ioctl would do that as well. -chris
Re: [RFC] big fat transaction ioctl
On Wed, Nov 11, 2009 at 8:19 PM, Sage Weil s...@newdream.net wrote: On Wed, 11 Nov 2009, Chris Mason wrote: On Tue, Nov 10, 2009 at 02:13:10PM -0800, Sage Weil wrote: On Tue, 10 Nov 2009, Andrey Kuzmin wrote: On Tue, Nov 10, 2009 at 11:12 PM, Sage Weil s...@newdream.net wrote: Hi all, This is an alternative approach to atomic user transactions for btrfs. The old start/end ioctls suffer from some basic limitations, namely - We can't properly reserve space ahead of time to avoid ENOSPC part way through the transaction, and - The process may die (seg fault, SIGKILL) part way through the transaction. Currently when that happens the partial transaction will commit. This patch implements an ioctl that lets the application completely specify the entire transaction in a single syscall. If the process gets killed or seg faults part way through, the entire transaction will still complete. The goal is to atomically commit updates to multiple files, xattrs, directories. But this is still a file system: we don't get rollback if things go wrong. Instead, do what we can up front to make sure things will work out. And if things do go wrong, optionally prevent a partial result from reaching the disk. Why not snapshot respective root (doesn't work if transaction spans multiple file-systems, but this doesn't look like a real-world limitation), run txn against that snapshot and rollback on failure instead? Snapshots are writable, cheap, and this looks like a real transaction abort mechanism. Good question. :) I hadn't looked into this before, but I think the snapshots could be used to achieve both atomicity and rollback. If userspace uses an rw mutex to quiesce writes, it can make sure all transactions complete before creating a snapshot (commit). The problem with this currently is the create snapshot ioctl is relatively slow... it calls commit_transaction, which blocks until everything reaches disk. 
I think to perform well this approach would need a hook to start a commit and then return as soon as it can guarantee that any subsequent operation's start_transaction can't join in that commit. This may be a better way to go about this, though. Does that sound reasonable, Chris? Yes, we could do this, but I don't think it will perform very well compared to your multi-operation ioctl. It really does depend on how often you need to do atomic ops (my guess is very). The thing is, I'm not sure using snaps is that different from what I'm doing now. Currently the ioctl transactions don't hit disk until each full commit (flushoncommit, no fsync). Unless the presence of a snapshot adds additional overhead (to the commit, or to cleaning up the slightly longer-living snapped roots), the difference would be that starting transactions would need to be blocked by the application instead of by wait_current_trans in start_transaction, and (currently at least) they would wait longer (the extra writes between blocked = 0 and commit_done = 1 in commit_transaction). The key, as now, is keeping the full fs syncs infrequent. And, if possible, reducing the duration of the blocked == 1 period during commit_transaction. It took me some time to associate you with the Ceph project and to recall what Ceph is, so my original snapshot suggestion was out of context. When put into the Ceph context, it looks too heavy-weight and may turn out to be overkill. Chris's write-ahead logging idea looks much more realistic for your use case. Honestly you'll get better performance with a simple write-ahead log from userland: There actually is a log, but it's optional and not strictly write-ahead... it's only used to reduce the commit latency: 1- apply operations to fs (grouped into atomic transactions) 2- (optionally) write and flush log entry ...repeat... 3- periodically sync the fs, then trim the log, or sync early if a client explicitly requests it. But 1- I don't want to make the log required.
Sometimes you're more concerned about total throughput, not latency, and the log halves your write bw unless you add more spindles. The log-induced latency penalty is the price for transactional consistency :). The traditional mitigation recipe involves a low-latency log device (NVRAM and, recently, SLC flash). Since you specifically target distributed systems, you have a distributed in-memory logging option. Regards, Andrey 2- I don't want it strictly write-ahead because (in the absence of atomic ops) it means you have to wait for the log to sync before applying the ops to the fs (to ensure the fs doesn't get a partial transaction ahead of the log). This marries atomicity to your schedule for durability, which isn't necessarily what you want. (e.g., Ceph makes a distinction between serialized and committed ops, allowing limited sharing of data before it hits disk. That's the nice
Re: snapshot-removal - timeline ?
On Wed, Aug 5, 2009 at 3:18 PM, Roy Sigurd Karlsbakk r...@karlsbakk.net wrote: On 4. aug. 2009, at 20.33, Chris Mason wrote: It's strange that such a small thing should be delayed so much. If snapshot removal was working, I'm quite sure we might get more users and thereby more stable code faster. It's a small feature but it gets deep into the difficult parts of the dentry cache to do it right. So, it definitely isn't easy. I'd say it's a pretty elementary feature to be able to remove something you have created. Snapshots are somewhat counter-intuitive in many respects: for instance, one snapshot-capable file-system performs writes to a dataset with snapshots _faster_ than to the same dataset without snapshots. Snapshot removal is no exception; it's a bit more complex than one would think. Regards, Andrey I know, you can remove the files and so on, but still, having a bunch of old and empty snapshots lying around is no good. roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Re: Data Deduplication with the help of an online filesystem check
On Mon, May 4, 2009 at 10:06 PM, Jan-Frode Myklebust janfr...@tanso.net wrote: Looking at the website content also revealed that VMware will have a similar feature for their workhorse 'ESX Server' in the upcoming release; however, my point still stands. Ship out a service pack for Windows and you get 1.5 GByte of modified data that is not deduped. All desktops that are linked to a master image can be patched or updated simply by updating the master image, without affecting users' settings, data or applications. Writable clone support in the file-system, coupled with a hierarchical settings/data/apps layout, and you have what's described above. Regards, Andrey -jf
Re: Btrfs development plans
On Mon, Apr 20, 2009 at 8:10 PM, Ahmed Kamal email.ahmedka...@googlemail.com wrote: But now Oracle can re-license Solaris and merge ZFS with btrfs. Just kidding, I don't think it would be technically feasible. May I suggest the name ZbtrFS :) Sorry, couldn't resist. On a more serious note though, are there any technical benefits that justify continuing to push money into btrfs? Personally, I don't see any. Porting zfs to Linux will cost (quite) some time and effort, but this is peanuts compared to what's needed to get btrfs (no offense meant) to the maturity level of/feature parity with zfs. The only thing that could prevent this is CDDL licensing issues and patent claims from NTAP over zfs snapshots and other features; btrfs is free from both. Regards, Andrey
Re: Btrfs development plans
On Mon, Apr 20, 2009 at 9:08 PM, Gregory Maxwell gmaxw...@gmail.com wrote: On Mon, Apr 20, 2009 at 12:57 PM, Andrey Kuzmin andrey.v.kuz...@gmail.com wrote: On Mon, Apr 20, 2009 at 8:10 PM, Ahmed Kamal email.ahmedka...@googlemail.com wrote: But now Oracle can re-license Solaris and merge ZFS with btrfs. Just kidding, I don't think it would be technically feasible. May I suggest the name ZbtrFS :) Sorry, couldn't resist. On a more serious note though, are there any technical benefits that justify continuing to push money into btrfs? Personally, I don't see any. Porting zfs to Linux will cost (quite) some time and effort, but this is peanuts compared to what's needed to get btrfs (no offense meant) to the maturity level of/feature parity with zfs. The only thing that could prevent this is CDDL licensing issues and patent claims from NTAP over zfs snapshots and other features; btrfs is free from both. I'm sure that people with far more experience than I will comment. But considering that BTRFS is in the Linux kernel today, the histories of other imported FSes (XFS),

Imported file-systems (someone more experienced may correct me if I'm wrong) have previously been give-aways. This one is different: zfs is in active development, with highly welcomed features like de-duplication coming.

and the state of ZFS in FreeBSD, this may not be strictly true.

This was a one-man effort (though a heroic one, definitely), hardly a case to compare with. Regards, Andrey
Re: Btrfs and raw zvol-like partition
The zvol (interface) does not just 'export a raw device' but rather implements a volume abstraction and integrates volume management into the file-system. Regards, Andrey On Sun, Apr 12, 2009 at 11:26 AM, Sébastien Wacquiez s...@enix.org wrote: Hi, A nice feature of ZFS is the ZVOL layer, which permits you to export (directly) a raw device from your zfs pool of disks, with the benefit of powerful (growing!) snapshots and easy raid management from zfs. It's particularly useful with virtual servers, allowing you to centralize all your backup handling, etc. Does btrfs plan to support this kind of feature? (Please don't tell me that lvm does; lvm just sucks when you make a snapshot of your disk, and it lacks the growing, commit, rollback, and diff-send features.) Thanks! Sébastien Wacquiez PS: see http://opensolaris.org/os/community/zfs/source/zfstour.png if you don't know what zvol does.
Re: [patch] error handling of ERR_PTR() returns
Since both a NULL ptr and IS_ERR(ptr) are treated as an error, why not redefine IS_ERR to handle both, simplifying callers' lives? Regards, Andrey On Tue, Apr 7, 2009 at 5:38 PM, Dan Carpenter erro...@gmail.com wrote: There are a couple of functions which return ERR_PTR as well as NULL. The caller needs to handle both. Smatch also complains about the handling of alloc_extent_map(), but as far as I can see that doesn't actually return an ERR_PTR. Compile tested on 2.6.29. regards, dan carpenter

--- orig/fs/btrfs/disk-io.c 2009-04-07 16:15:36.0 +0300
+++ devel/fs/btrfs/disk-io.c 2009-04-07 16:23:33.0 +0300
@@ -123,7 +123,7 @@
 	spin_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, start, len);
-	if (em) {
+	if (!IS_ERR(em) && em) {
 		em->bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 		spin_unlock(&em_tree->lock);
@@ -1216,8 +1216,8 @@
 	int ret;
 	root = btrfs_read_fs_root_no_name(fs_info, location);
-	if (!root)
-		return NULL;
+	if (!root || IS_ERR(root))
+		return root;
 	if (root->in_sysfs)
 		return root;
@@ -1324,7 +1324,7 @@
 	spin_lock(&em_tree->lock);
 	em = lookup_extent_mapping(em_tree, offset, PAGE_CACHE_SIZE);
 	spin_unlock(&em_tree->lock);
-	if (!em) {
+	if (!em || IS_ERR(em)) {
 		__unplug_io_fn(bdi, page);
 		return;
 	}