Re: [PATCH RFC] btrfs: Simplify locking
Hello, Chris.

On Sun, Mar 20, 2011 at 08:10:51PM -0400, Chris Mason wrote:
> I went through a number of benchmarks with the explicit blocking/spinning
> code and back then it was still significantly faster than the adaptive
> spin. But, it is definitely worth doing these again, how many dbench
> procs did you use?

It was dbench 50.

> The biggest benefit to explicit spinning is that mutex_lock starts with
> might_sleep(), so we skip the cond_resched(). Do you have voluntary
> preempt on?

Ah, right, I of course forgot to actually attach the .config. I had
CONFIG_PREEMPT, not CONFIG_PREEMPT_VOLUNTARY. I'll re-run with VOLUNTARY
and see how its behavior changes.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2 v2] Btrfs: add datacow flag in inode flag
For datacow control, the corresponding inode flags are needed. This is for btrfs use.

v1->v2: Change FS_COW_FL to another bit due to a conflict with the upstream e2fsprogs.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 include/linux/fs.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d069b..dbcb47e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -353,6 +353,8 @@ struct inodes_stat_t {
 #define FS_TOPDIR_FL			0x00020000 /* Top of directory hierarchies*/
 #define FS_EXTENT_FL			0x00080000 /* Extents */
 #define FS_DIRECTIO_FL			0x00100000 /* Use direct i/o */
+#define FS_NOCOW_FL			0x00800000 /* Do not cow file */
+#define FS_COW_FL			0x02000000 /* Cow file */
 #define FS_RESERVED_FL			0x80000000 /* reserved for ext2 lib */

 #define FS_FL_USER_VISIBLE		0x0003DFFF /* User visible flags */
-- 
1.6.5.2
[PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression method (probably LZO) stored in the super block. However, before that, we will wait for that method to be stable enough to be adopted into the super block. So I list it as a long-term goal, and just store it in RAM today.

After applying this patch, we can use the generic FS_IOC_SETFLAGS ioctl to control a file's or directory's datacow and compression attributes.

NOTE:
- The compression type is selected by the following rule: if we mount btrfs with a compress option, i.e. zlib/lzo, that is the type. Otherwise, we'll use the default compress type (zlib today).

v1->v2: Rebase the patch with the latest btrfs.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/ctree.h   |    1 +
 fs/btrfs/disk-io.c |    6 ++
 fs/btrfs/inode.c   |   32
 fs/btrfs/ioctl.c   |   41 +
 4 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b4b9d1..b77d1a5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1283,6 +1283,7 @@ struct btrfs_root {
 #define BTRFS_INODE_NODUMP		(1 << 8)
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
+#define BTRFS_INODE_COMPRESS		(1 << 11)

 /* some macros to generate set/get funcs for the struct fields.  This
  * assumes there is a lefoo_to_cpu for every type, so lets make a simple
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3e1ea3e..a894c12 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1762,6 +1762,12 @@ struct btrfs_root *open_ctree(struct super_block *sb,

 	btrfs_check_super_valid(fs_info, sb->s_flags & MS_RDONLY);

+	/*
+	 * In the long term, we'll store the compression type in the super
+	 * block, and it'll be used for per file compression control.
+	 */
+	fs_info->compress_type = BTRFS_COMPRESS_ZLIB;
+
 	ret = btrfs_parse_options(tree_root, options);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index db67821..e687bb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -381,7 +381,8 @@ again:
 	 */
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS) &&
 	    (btrfs_test_opt(root, COMPRESS) ||
-	     (BTRFS_I(inode)->force_compress))) {
+	     (BTRFS_I(inode)->force_compress) ||
+	     (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))) {
 		WARN_ON(pages);
 		pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);
@@ -1253,7 +1254,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
 	else if (!btrfs_test_opt(root, COMPRESS) &&
-		 !(BTRFS_I(inode)->force_compress))
+		 !(BTRFS_I(inode)->force_compress) &&
+		 !(BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS))
 		ret = cow_file_range(inode, locked_page, start, end,
 				     page_started, nr_written, 1);
 	else
@@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	location->offset = 0;
 	btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);

-	btrfs_inherit_iflags(inode, dir);
-
 	if ((mode & S_IFREG)) {
 		if (btrfs_test_opt(root, NODATASUM))
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
@@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
 	}

+	btrfs_inherit_iflags(inode, dir);
+
 	insert_inode_hash(inode);
 	inode_tree_add(inode);
 	return inode;
@@ -6803,6 +6805,26 @@ static int btrfs_getattr(struct vfsmount *mnt,
 	return 0;
 }

+/*
+ * If a file is moved, it will inherit the cow and compression flags of the
+ * new directory.
+ */
+static void fixup_inode_flags(struct inode *dir, struct inode *inode)
+{
+	struct btrfs_inode *b_dir = BTRFS_I(dir);
+	struct btrfs_inode *b_inode = BTRFS_I(inode);
+
+	if (b_dir->flags & BTRFS_INODE_NODATACOW)
+		b_inode->flags |= BTRFS_INODE_NODATACOW;
+	else
+		b_inode->flags &= ~BTRFS_INODE_NODATACOW;
+
+	if (b_dir->flags & BTRFS_INODE_COMPRESS)
+		b_inode->flags |= BTRFS_INODE_COMPRESS;
+	else
+		b_inode->flags &= ~BTRFS_INODE_COMPRESS;
+}
+
 static int btrfs_rename(struct inode *old_dir, struct dentry
Re: [RFC] a couple of i_nlink fixes in btrfs
Excerpts from Al Viro's message of 2011-03-21 01:17:25 -0400:
> On Mon, Mar 07, 2011 at 11:58:13AM -0500, Chris Mason wrote:
> > Thanks, these both look good but I'll test here as well. Are you
> > planning on pushing for .38?
>
> No, but .39 would be nice ;-) Do you want that to go through btrfs tree
> or through vfs one?

I'll take them in mine, it'll be easier to put in with all these other
changes.

-chris
Re: [PATCH V4] btrfs: implement delayed inode items operation
Excerpts from Miao Xie's message of 2011-03-21 01:05:22 -0400:
> On sun, 20 Mar 2011 20:33:34 -0400, Chris Mason wrote:
> > Excerpts from Miao Xie's message of 2011-03-18 05:24:46 -0400:
> > > Changelog V3 -> V4:
> > > - Fix nested lock, which is reported by Itaru Kitayama, by updating
> > >   space cache inodes in time.
> >
> > I ran some tests on this and had trouble with my stress.sh script:
> >
> >   http://oss.oracle.com/~mason/stress.sh
> >
> > I used: stress.sh -n 50 -c <path to linux kernel git tree> /mnt
> >
> > The git tree has all the .git files but no .o files. The problem was
> > that within about 20 minutes, the filesystem was spending almost all
> > of its time in balance_dirty_pages(). The problem is that data
> > writeback isn't complete until the endio handlers have finished
> > inserting metadata into the btree.
> >
> > The v4 patch calls btrfs_btree_balance_dirty() from all the
> > btrfs_end_transaction variants, which means that the FS writeback
> > code waits for balance_dirty_pages(), which won't make progress until
> > the FS writeback code is done.
> >
> > So I changed things to call the delayed inode balance function only
> > from inside btrfs_btree_balance_dirty(), which did resolve the
> > stalls. But
>
> Ok, but can we invoke the delayed inode balance function before
> balance_dirty_pages_ratelimited_nr(), because the delayed item
> insertion and deletion also bring us some dirty pages.

Yes, good point.

> > I found a few times that when I did rmmod btrfs, there would be
> > delayed inode objects leaked in the slab cache. rmmod will try to
> > destroy the slab cache, which will fail because we haven't freed
> > everything.
> >
> > It looks like we have a race in btrfs_get_or_create_delayed_node,
> > where two concurrent callers can both create delayed nodes and then
> > race on adding it to the inode.
>
> Sorry for my mistake, I thought we updated the inodes when holding
> i_mutex originally, so I didn't use any lock or other method to protect
> delayed_node of the inodes.
>
> But I think we needn't use rcu lock to protect delayed_node when we
> want to get the delayed node, because we won't change it after it is
> created, cmpxchg() and ACCESS_ONCE() can protect it well. What do you
> think about?
>
> PS: I worry about the inode update without holding i_mutex.

We have the tree locks to make sure we're serialized while we actually
change the tree. The only places that go in without locking are the
times updates.

> > I also think that code is racing with the code that frees delayed
> > nodes, but haven't yet triggered my debugging printks to prove either
> > one.
>
> We free delayed nodes when we want to destroy the inode, at that time,
> just one task, which is destroying the inode, can access the delayed
> nodes, so I think ACCESS_ONCE() is enough. What do you think about?

Great, I see what you mean. The bigger problem right now is that we may
do a lot of operations in destroy_inode(), which can block the slab
shrinkers on our metadata operations. That same stress.sh -n 50 run is
running into OOM.

So we need to rework the part where the final free is done. We could
keep a ref on the inode until the delayed items are complete, or we
could let the inode go and make a way to look up the delayed node when
the inode is read. I'll read more today.

-chris
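The cmpxchg()/ACCESS_ONCE() scheme Miao describes can be sketched in userspace with C11 atomics: each racing caller may allocate a candidate node, but only one compare-and-swap publishes it; the loser frees its copy, and subsequent readers need only a one-time atomic load. This is an illustrative model only (the struct names mirror the thread; the kernel would use cmpxchg()/ACCESS_ONCE() directly):

```c
/* Userspace sketch of the cmpxchg()-based once-only install that the
 * delayed-node code needs: two racing callers may both allocate a node,
 * but only one compare-and-swap wins; the loser frees its copy, and both
 * callers end up seeing the same pointer.  Names are illustrative. */
#include <stdatomic.h>
#include <stdlib.h>

struct delayed_node { int dummy; };

struct fake_inode {
	_Atomic(struct delayed_node *) delayed_node;
};

struct delayed_node *get_or_create_delayed_node(struct fake_inode *inode)
{
	struct delayed_node *node, *expected = NULL;

	/* Fast path: already installed, a plain atomic load suffices
	 * (the kernel equivalent of ACCESS_ONCE()). */
	node = atomic_load(&inode->delayed_node);
	if (node)
		return node;

	node = calloc(1, sizeof(*node));
	if (atomic_compare_exchange_strong(&inode->delayed_node,
					   &expected, node))
		return node;	/* we won the race */

	free(node);		/* somebody beat us to it */
	return expected;	/* CAS wrote the winner's pointer here */
}
```

The pointer is written exactly once and never changed afterwards, which is why no lock or RCU grace period is needed on the read side.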
[PATCH] Btrfs: cleanup how we setup free space clusters
This patch makes the free space cluster refilling code a little easier to understand, and fixes some things with the bitmap part of it. Currently we either want to refill a cluster with

1) All normal extent entries (those without bitmaps)
2) A bitmap entry with enough space

The current code has this ugly jump-around logic that will first try and fill up the cluster with extent entries and then if it can't do that it will try and find a bitmap to use. So instead split this out into two functions, one that tries to find only normal entries, and one that tries to find bitmaps.

This also fixes a suboptimal thing we would do with bitmaps. If we used a bitmap we would just tell the cluster that we were pointing at a bitmap and it would do the tree search in the block group for that entry every time we tried to make an allocation. Instead of doing that now we just add it to the cluster's group.

I tested this with my ENOSPC tests and xfstests and it survived.

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/ctree.h            |    3 -
 fs/btrfs/free-space-cache.c |  361 +--
 2 files changed, 179 insertions(+), 185 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6036fdb..0ee679b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -783,9 +783,6 @@ struct btrfs_free_cluster {
 	/* first extent starting offset */
 	u64 window_start;

-	/* if this cluster simply points at a bitmap in the block group */
-	bool points_to_bitmap;
-
 	struct btrfs_block_group_cache *block_group;
 	/*
 	 * when a cluster is allocated from a block group, we put the
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 7a808d7..a328af9 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1644,30 +1644,28 @@ __btrfs_return_cluster_to_free_space(
 {
 	struct btrfs_free_space *entry;
 	struct rb_node *node;
-	bool bitmap;

 	spin_lock(&cluster->lock);
 	if (cluster->block_group != block_group)
 		goto out;

-	bitmap = cluster->points_to_bitmap;
 	cluster->block_group = NULL;
 	cluster->window_start = 0;
 	list_del_init(&cluster->block_group_list);
-	cluster->points_to_bitmap = false;
-
-	if (bitmap)
-		goto out;

 	node = rb_first(&cluster->root);
 	while (node) {
+		bool bitmap;
+
 		entry = rb_entry(node, struct btrfs_free_space, offset_index);
 		node = rb_next(&entry->offset_index);
 		rb_erase(&entry->offset_index, &cluster->root);
-		BUG_ON(entry->bitmap);
-		try_merge_free_space(block_group, entry, false);
+
+		bitmap = (entry->bitmap != NULL);
+		if (!bitmap)
+			try_merge_free_space(block_group, entry, false);
 		tree_insert_offset(&block_group->free_space_offset,
-				   entry->offset, &entry->offset_index, 0);
+				   entry->offset, &entry->offset_index, bitmap);
 	}
 	cluster->root = RB_ROOT;

@@ -1790,50 +1788,24 @@ int btrfs_return_cluster_to_free_space(
 static u64 btrfs_alloc_from_bitmap(struct btrfs_block_group_cache *block_group,
 				   struct btrfs_free_cluster *cluster,
+				   struct btrfs_free_space *entry,
 				   u64 bytes, u64 min_start)
 {
-	struct btrfs_free_space *entry;
 	int err;
 	u64 search_start = cluster->window_start;
 	u64 search_bytes = bytes;
 	u64 ret = 0;

-	spin_lock(&block_group->tree_lock);
-	spin_lock(&cluster->lock);
-
-	if (!cluster->points_to_bitmap)
-		goto out;
-
-	if (cluster->block_group != block_group)
-		goto out;
-
-	/*
-	 * search_start is the beginning of the bitmap, but at some point it may
-	 * be a good idea to point to the actual start of the free area in the
-	 * bitmap, so do the offset_to_bitmap trick anyway, and set bitmap_only
-	 * to 1 to make sure we get the bitmap entry
-	 */
-	entry = tree_search_offset(block_group,
-				   offset_to_bitmap(block_group, search_start),
-				   1, 0);
-	if (!entry || !entry->bitmap)
-		goto out;
-
 	search_start = min_start;
 	search_bytes = bytes;

 	err = search_bitmap(block_group, entry, &search_start, &search_bytes);
 	if (err)
-		goto out;
+		return 0;

 	ret = search_start;
 	bitmap_clear_bits(block_group, entry, ret, bytes);
-	if (entry->bytes == 0)
-		free_bitmap(block_group, entry);
-out:
-	spin_unlock(&cluster->lock);
-	spin_unlock(&block_group->tree_lock);

 	return ret;
 }
@@ 
-1851,10 +1823,6 @@
[PATCH] Btrfs: don't be as aggressive about using bitmaps V2
We have been creating bitmaps for small extents unconditionally forever. This was great when testing to make sure the bitmap stuff was working, but is overkill normally. So instead of always adding small chunks of free space to bitmaps, only start doing it if we go past half of our extent threshold. This will keep us from creating a bitmap for just one small free extent at the front of the block group, and will make the allocator a little faster as a result. Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
V1->V2:
- fix a formatting problem
- make the small extent threshold <= 4 sectors, not < 4 sectors

 fs/btrfs/free-space-cache.c |   19 ---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 63776ae..4ab35ea 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1287,9 +1287,22 @@ static int insert_into_bitmap(struct btrfs_block_group_cache *block_group,
 	 * If we are below the extents threshold then we can add this as an
 	 * extent, and don't have to deal with the bitmap
 	 */
-	if (block_group->free_extents < block_group->extents_thresh &&
-	    info->bytes > block_group->sectorsize * 4)
-		return 0;
+	if (block_group->free_extents < block_group->extents_thresh) {
+		/*
+		 * If this block group has some small extents we don't want to
+		 * use up all of our free slots in the cache with them, we want
+		 * to reserve them to larger extents, however if we have plenty
+		 * of cache left then go ahead and add them, no sense in adding
+		 * the overhead of a bitmap if we don't have to.
+		 */
+		if (info->bytes <= block_group->sectorsize * 4) {
+			if (block_group->free_extents * 2 <=
+			    block_group->extents_thresh)
+				return 0;
+		} else {
+			return 0;
+		}
+	}

 	/*
 	 * some block groups are so tiny they can't be enveloped by a bitmap, so
-- 
1.7.2.3
cloning single-device btrfs file system onto multi-device one
Hiya,

I'm trying to move a btrfs FS that's on a hardware raid 5 (6TB large, 4 of which are in use) to another machine with 3 3TB HDs and preserve all the subvolumes/snapshots. Is there a way to do that without using a software/hardware raid on the new machine (that is, just use btrfs multi-device)?

If fewer than 3TB were occupied, I suppose I could just resize it so that it fits on one 3TB hd, then copy device to device onto a 3TB disk, add the 2 other ones and do a balance, but here, I can't do that.

I suspect that if compression was enabled, the FS could fit on 3TB, but AFAICT, compression is enabled at mount time and would only apply to newly created files. Is there a way to compress files already in a btrfs filesystem?

Any help would be appreciated.

Stephane
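One way out of the "compression only applies to new writes" limitation, with a recent enough kernel and btrfs-progs, is to rewrite the existing data through the defragment ioctl, which can recompress files in place. A hedged sketch (exact option spelling varies between btrfs-progs versions; `/mnt` stands for the mount point):

```
# Remount with compression so newly written data is compressed
mount -o remount,compress /mnt

# Rewrite (and thereby compress) files that already exist; older
# btrfs-progs have no recursive defrag flag, hence the find loop
find /mnt -xdev -type f -exec btrfs filesystem defragment -c {} \;
```

Note that defragmenting rewrites the data, so extent sharing with existing snapshots is broken and space usage can temporarily grow while old extents are still referenced.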
Re: [PATCH RFC] btrfs: Simplify locking
Hello,

Here are the results with voluntary preemption. I've moved to a beefier machine for testing. It's dual Opteron 2347, so dual socket, eight core. The memory is limited to 1GiB to force IOs and the disk is the same OCZ Vertex 60gig SSD. /proc/stat is captured before and after dbench 50. I ran the following four setups.

DFL     The current custom locking implementation.
SIMPLE  Simple mutex conversion. The first patch in this thread.
SPIN    SIMPLE + mutex_tryspin(). The second patch in this thread.
SPIN2   SPIN + mutex_tryspin() in btrfs_tree_lock(). Patch below.

SPIN2 should alleviate the voluntary preemption by might_sleep() in mutex_lock().

         USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
 DFL     49427  458210  1433  7683488  642.947
 SIMPLE  52836  471398  1427  3055384  705.369
 SPIN    52267  473115  1467  3005603  705.369
 SPIN2   52063  470453  1446  3092091  701.826

I'm running DFL again just in case but SIMPLE or SPIN seems to be a much better choice.

Thanks.

NOT-Signed-off-by: Tejun Heo t...@kernel.org
---
 fs/btrfs/locking.h |    2 ++
 1 file changed, 2 insertions(+)

Index: work/fs/btrfs/locking.h
===================================================================
--- work.orig/fs/btrfs/locking.h
+++ work/fs/btrfs/locking.h
@@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s

 static inline void btrfs_tree_lock(struct extent_buffer *eb)
 {
+	if (mutex_tryspin(&eb->lock))
+		return;
 	mutex_lock(&eb->lock);
 }
Re: [PATCH RFC] btrfs: Simplify locking
On Mon, Mar 21, 2011 at 05:59:55PM +0100, Tejun Heo wrote:
> I'm running DFL again just in case but SIMPLE or SPIN seems to be a
> much better choice.

Got 644.176 MB/sec, so yeah the custom locking is definitely worse than
just using mutex.

Thanks.

-- 
tejun
Re: [PATCH RFC] btrfs: Simplify locking
Excerpts from Tejun Heo's message of 2011-03-21 12:59:55 -0400:
> Hello,
>
> Here are the results with voluntary preemption. I've moved to a beefier
> machine for testing. It's dual Opteron 2347, so dual socket, eight
> core. The memory is limited to 1GiB to force IOs and the disk is the
> same OCZ Vertex 60gig SSD. /proc/stat is captured before and after
> dbench 50. I ran the following four setups.
>
> DFL     The current custom locking implementation.
> SIMPLE  Simple mutex conversion. The first patch in this thread.
> SPIN    SIMPLE + mutex_tryspin(). The second patch in this thread.
> SPIN2   SPIN + mutex_tryspin() in btrfs_tree_lock(). Patch below.
>
> SPIN2 should alleviate the voluntary preemption by might_sleep() in
> mutex_lock().
>
>          USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
>  DFL     49427  458210  1433  7683488  642.947
>  SIMPLE  52836  471398  1427  3055384  705.369
>  SPIN    52267  473115  1467  3005603  705.369
>  SPIN2   52063  470453  1446  3092091  701.826
>
> I'm running DFL again just in case but SIMPLE or SPIN seems to be a
> much better choice.

Very interesting. Ok, I'll definitely rerun my benchmarks as well. I
used dbench extensively during the initial tuning, but you're forcing
the memory low in order to force IO. This case doesn't really hammer on
the locks, it hammers on the transition from spinning to blocking. We
also want to compare dbench entirely in ram, which will hammer on the
spinning portion.

-chris

> Thanks.
>
> NOT-Signed-off-by: Tejun Heo t...@kernel.org
> ---
>  fs/btrfs/locking.h |    2 ++
>  1 file changed, 2 insertions(+)
>
> Index: work/fs/btrfs/locking.h
> ===================================================================
> --- work.orig/fs/btrfs/locking.h
> +++ work/fs/btrfs/locking.h
> @@ -28,6 +28,8 @@ static inline bool btrfs_try_spin_lock(s
>
>  static inline void btrfs_tree_lock(struct extent_buffer *eb)
>  {
> +	if (mutex_tryspin(&eb->lock))
> +		return;
>  	mutex_lock(&eb->lock);
>  }
Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression
On Mon, Mar 21, 2011 at 04:57:13PM +0800, liubo wrote:
> @@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>  	location->offset = 0;
>  	btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);
>
> -	btrfs_inherit_iflags(inode, dir);
> -
>  	if ((mode & S_IFREG)) {
>  		if (btrfs_test_opt(root, NODATASUM))
>  			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
> @@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>  			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
>  	}
>
> +	btrfs_inherit_iflags(inode, dir);

The problem is that btrfs_inherit_iflags() overwrites
BTRFS_I(inode)->flags with the parent's flags, so you lose
BTRFS_INODE_NODATA{SUM|COW}.

Johann
btrfs device delete not working after failed/missing device
Hi, I decided to try btrfs for a few file systems on my not-too-critical home server, including my root fs. Most file systems are on a RAID5 MD software array, but my rootfs is btrfs striped as RAID1 over 3 partitions. I got hit by the Intel Sandy Bridge SATA chipset bug, so eventually the 3rd SATA drive (/dev/sdc) failed with all kinds of bus errors. My btrfs rootfs stayed up and working, but btrfs device delete /dev/sdc3 / did not work, and gave a vague Error removing device (iirc). After rebooting with the drive on the bad SATA bus removed, the file system didn't come up, but passing a -o degraded fixed that. However, I was still not able to remove the missing device from the rootfs. Neither btrfs device delete missing / nor btrfs device delete /dev/sdc3 / succeeded. btrfs fi balance / succeeded without errors however. I'll paste the relevant bits of dmesg below. My btrfs rootfs is now mountable and working in degraded mode, and I have a (daily rsynced) backup on another filesystem anyway. I decided to report it anyway, as it would be good to get things stable and this bug fixed. ;) The running kernel is 2.6.38.4, and btrfs utils are version v0.19 (Ubuntu Maverick). 
dmesg:
[  197.810007] device label root devid 1 transid 65058 /dev/sda3
[  197.811844] btrfs: failed to read the system array on sda3
[  197.876797] btrfs: open_ctree failed
[  207.743237] device label root devid 1 transid 65058 /dev/sda3
[  207.745460] btrfs: failed to read the system array on sda3
[  207.793912] btrfs: open_ctree failed
[  249.18] device label backups devid 1 transid 7325 /dev/dm-4
[  250.002555] device label root devid 2 transid 65058 /dev/sdb3
[  250.003545] device label root devid 1 transid 65058 /dev/sda3
[  488.217867] device label root devid 1 transid 65058 /dev/sda3
[  488.218325] btrfs: allowing degraded mounts
[  509.983121] btrfs: relocating block group 12108955648 flags 17
[  513.096861] btrfs: found 695 extents
[  520.176191] btrfs: found 695 extents
[  520.888601] btrfs: relocating block group 11035213824 flags 17
[  537.663032] btrfs: found 4836 extents
[  550.237641] btrfs: found 4836 extents
[  551.527503] btrfs: relocating block group 9961472000 flags 17
[  556.314350] btrfs: found 1602 extents
[  564.737818] btrfs: found 1602 extents
[  565.358244] btrfs: relocating block group 9693036544 flags 20
[  586.400905] btrfs: found 5548 extents
[  586.773695] btrfs: relocating block group 8619294720 flags 17
[  593.677386] btrfs: found 3839 extents
[  599.059888] btrfs: found 3839 extents
[  600.155579] btrfs: relocating block group 7545552896 flags 17
[  612.330001] btrfs: found 3139 extents
[  621.223148] btrfs: found 3139 extents
[  622.054094] btrfs: relocating block group 6471811072 flags 17
[  634.848723] btrfs: found 5541 extents
[  649.045142] btrfs: found 5541 extents
[  649.685956] btrfs: relocating block group 5398069248 flags 17
[  663.123926] btrfs: found 12743 extents
[  683.670746] btrfs: found 12743 extents
[  684.595137] btrfs: relocating block group 4324327424 flags 17
[  717.828652] btrfs: found 13762 extents
[  740.723221] btrfs: found 13762 extents
[  742.037898] btrfs: relocating block group 29360128 flags 20
[  826.976047] btrfs: found 34862 extents
[  827.084723] btrfs: relocating block group 20971520 flags 18
[  827.126034] btrfs allocation failed flags 18, wanted 4096
[  827.127621] space_info has 0 free, is not full
[  827.127623] space_info total=12582912, used=4096, pinned=0, reserved=0, may_use=0, readonly=12578816
[  827.127626] block group 20971520 has 8388608 bytes, 4096 used 0 pinned 0 reserved
[  827.127629] entry offset 20971520, bytes 4096, bitmap no
[  827.129215] entry offset 20979712, bytes 8380416, bitmap no
[  827.130778] block group has cluster?: no
[  827.130780] 2 blocks of free space at or bigger than bytes is
[  827.130782] block group 0 has 4194304 bytes, 0 used 0 pinned 0 reserved
[  827.130784] entry offset 131072, bytes 4063232, bitmap no
[  827.132340] block group has cluster?: no
[  827.132342] 1 blocks of free space at or bigger than bytes is
[  827.174753] btrfs: relocating block group 20971520 flags 18
[  827.213303] btrfs allocation failed flags 18, wanted 4096
[  827.214347] space_info has 0 free, is not full
[  827.214348] space_info total=12582912, used=4096, pinned=0, reserved=0, may_use=0, readonly=12578816
[  827.214350] block group 20971520 has 8388608 bytes, 4096 used 0 pinned 0 reserved
[  827.214352] entry offset 20971520, bytes 4096, bitmap no
[  827.215367] entry offset 20979712, bytes 8380416, bitmap no
[  827.216261] block group has cluster?: no
[  827.216262] 2 blocks of free space at or bigger than bytes is
[  827.216263] block group 0 has 4194304 bytes, 0 used 0 pinned 0 reserved
[  827.216265] entry offset 131072, bytes 4063232, bitmap no
[  827.217144] block group has cluster?: no
[  827.217145] 1 blocks of free space at or bigger than bytes is

btrfs filesystem show:
Label: 'root' uuid:
Re: [PATCH RFC] btrfs: Simplify locking
Hello,

On Mon, Mar 21, 2011 at 01:24:37PM -0400, Chris Mason wrote:
> Very interesting. Ok, I'll definitely rerun my benchmarks as well. I
> used dbench extensively during the initial tuning, but you're forcing
> the memory low in order to force IO. This case doesn't really hammer
> on the locks, it hammers on the transition from spinning to blocking.
> We also want to compare dbench entirely in ram, which will hammer on
> the spinning portion.

Here's a re-run of DFL and SIMPLE with the memory restriction lifted.
Memory is 4GiB and the disk remains mostly idle with all CPUs running
full.

         USER   SYSTEM  SIRQ  CXTSW    THROUGHPUT
 DFL     59898  504517  377   6814245  782.295
 SIMPLE  61090  493441  457   1631688  827.751

So, about the same picture.

Thanks.

-- 
tejun
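For readers outside the thread, the SPIN2 idea amounts to a bounded optimistic spin with trylock before falling back to the sleeping lock. A userspace sketch with pthreads (mutex_tryspin() is not a mainline API; the spin budget and sched_yield() stand in for the kernel's adaptive spinning and cpu_relax()):

```c
/* Userspace sketch of the SPIN2 idea: btrfs_tree_lock() first tries a
 * bounded optimistic spin with trylock (standing in for the proposed
 * mutex_tryspin(), which is not a mainline API), and only falls back to
 * the sleeping lock path when the spin fails. */
#include <pthread.h>
#include <sched.h>

#define TREE_LOCK_SPINS 100	/* arbitrary spin budget for the sketch */

struct extent_buffer {
	pthread_mutex_t lock;
};

static int mutex_tryspin(pthread_mutex_t *m)
{
	for (int i = 0; i < TREE_LOCK_SPINS; i++) {
		if (pthread_mutex_trylock(m) == 0)
			return 1;	/* got it without sleeping */
		sched_yield();		/* kernel would cpu_relax() here */
	}
	return 0;
}

static void btrfs_tree_lock(struct extent_buffer *eb)
{
	if (mutex_tryspin(&eb->lock))
		return;
	pthread_mutex_lock(&eb->lock);	/* slow path: may sleep */
}

static void btrfs_tree_unlock(struct extent_buffer *eb)
{
	pthread_mutex_unlock(&eb->lock);
}
```

The benchmark tables above suggest most of the win comes from avoiding the context-switch storm of the custom blocking path (CXTSW drops by half or more), not from the spin itself.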
btrfs device returned from stat vs /proc/pid/maps
Hi,

I noticed that btrfs_getattr() is filling stat->dev with an anonymous device (for the per-snapshot root?):

	stat->dev = BTRFS_I(inode)->root->anon_super.s_dev;

but /proc/pid/maps uses the real block device:

	dev = inode->i_sb->s_dev;

This results in some unfortunate behavior for lsof as it reports some duplicate paths (except on different block devices). The easiest way to see this (if your root partition is btrfs):

$ lsof | grep lsof
<snip>
lsof  9238  root  txt  REG  0,19  139736  14478  /usr/bin/lsof
lsof  9238  root  mem  REG  0,17          14478  /usr/bin/lsof (path dev=0,19)

Ultimately, this breaks existing software. In my case, "zypper ps" gets really unhappy (which may partially also be due to a zypper bug, hooray!)

I'm not really quite sure how this should be handled though. Do we have /proc/pid/maps report the subvolume's device (via some callback I suppose)? Another alternative of course is to return the true block device in btrfs_getattr(), but that has some obvious downsides too.

Thanks and best regards,
	--Mark

--
Mark Fasheh
Re: [PATCH 2/2 v2] Btrfs: Per file/directory controls for COW and compression
On 03/22/2011 01:43 AM, Johann Lombardi wrote:
> On Mon, Mar 21, 2011 at 04:57:13PM +0800, liubo wrote:
> > @@ -4581,8 +4583,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
> >  	location->offset = 0;
> >  	btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);
> >
> > -	btrfs_inherit_iflags(inode, dir);
> > -
> >  	if ((mode & S_IFREG)) {
> >  		if (btrfs_test_opt(root, NODATASUM))
> >  			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
> > @@ -4590,6 +4590,8 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
> >  			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATACOW;
> >  	}
> >
> > +	btrfs_inherit_iflags(inode, dir);
>
> The problem is that btrfs_inherit_iflags() overwrites
> BTRFS_I(inode)->flags with the parent's flags, so you lose
> BTRFS_INODE_NODATA{SUM|COW}.

Thanks for pointing this out, will fix it.

thanks,
liubo
Re: [PATCH V4] btrfs: implement delayed inode items operation
Hi Miao,

Here is an excerpt of the V4 patch applied kernel boot log:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.36-xie+ #117
-------------------------------------------------------
vgs/1210 is trying to acquire lock:
 (delayed_node->mutex){+.+...}, at: [8121184b] btrfs_delayed_update_inode+0x45/0x101

but task is already holding lock:
 (mm->mmap_sem){++++++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (mm->mmap_sem){++++++}:
       [81076a3d] lock_acquire+0x11d/0x143
       [810edc79] might_fault+0x95/0xb8
       [8112b5ce] filldir+0x75/0xd0
       [811d77f8] btrfs_real_readdir+0x3d7/0x528
       [8112b75c] vfs_readdir+0x79/0xb6
       [8112b8e9] sys_getdents+0x85/0xd8
       [81002ddb] system_call_fastpath+0x16/0x1b

-> #0 (delayed_node->mutex){+.+...}:
       [81076612] __lock_acquire+0xa98/0xda6
       [81076a3d] lock_acquire+0x11d/0x143
       [814c38b1] __mutex_lock_common+0x5a/0x444
       [814c3d50] mutex_lock_nested+0x39/0x3e
       [8121184b] btrfs_delayed_update_inode+0x45/0x101
       [811dbd4f] btrfs_update_inode+0x2e/0x129
       [811de008] btrfs_dirty_inode+0x57/0x113
       [8113c2a5] __mark_inode_dirty+0x33/0x1aa
       [81130939] touch_atime+0x107/0x12a
       [811e15b2] btrfs_file_mmap+0x3e/0x57
       [810f5f40] mmap_region+0x2bb/0x4c4
       [810f63d9] do_mmap_pgoff+0x290/0x2f3
       [810f6532] sys_mmap_pgoff+0xf6/0x12e
       [81006e9a] sys_mmap+0x22/0x24
       [81002ddb] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

1 lock held by vgs/1210:
 #0:  (mm->mmap_sem){++++++}, at: [810f6512] sys_mmap_pgoff+0xd6/0x12e

stack backtrace:
Pid: 1210, comm: vgs Not tainted 2.6.36-xie+ #117
Call Trace:
 [81074c15] print_circular_bug+0xaf/0xbd
 [81076612] __lock_acquire+0xa98/0xda6
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [81076a3d] lock_acquire+0x11d/0x143
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [814c38b1] __mutex_lock_common+0x5a/0x444
 [8121184b] ? btrfs_delayed_update_inode+0x45/0x101
 [8107162f] ? debug_mutex_init+0x31/0x3c
 [814c3d50] mutex_lock_nested+0x39/0x3e
 [8121184b] btrfs_delayed_update_inode+0x45/0x101
 [814c36c6] ? __mutex_unlock_slowpath+0x129/0x13a
 [811dbd4f] btrfs_update_inode+0x2e/0x129
 [811de008] btrfs_dirty_inode+0x57/0x113
 [8113c2a5] __mark_inode_dirty+0x33/0x1aa
 [81130939] touch_atime+0x107/0x12a
 [811e15b2] btrfs_file_mmap+0x3e/0x57
 [810f5f40] mmap_region+0x2bb/0x4c4
 [81229f10] ? file_map_prot_check+0x9a/0xa3
 [810f63d9] do_mmap_pgoff+0x290/0x2f3
 [810f6512] ? sys_mmap_pgoff+0xd6/0x12e
 [810f6532] sys_mmap_pgoff+0xf6/0x12e
 [814c4b75] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [81006e9a] sys_mmap+0x22/0x24
 [81002ddb] system_call_fastpath+0x16/0x1b

As the corresponding delayed node mutex lock is taken in btrfs_real_readdir, that seems deadlockable. vfs_readdir holds i_mutex, so I wonder if we can execute btrfs_readdir_delayed_dir_index without taking the node lock.
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Tue, 22 Mar 2011 11:33:10 +0900, Itaru Kitayama wrote:
> Here is an excerpt of the kernel boot log with the V4 patch applied:
> [snip: lockdep circular locking dependency report, quoted in full above]
> As the corresponding delayed node mutex is taken in btrfs_real_readdir, this looks deadlockable. vfs_readdir holds i_mutex; I wonder if we can execute btrfs_readdir_delayed_dir_index without taking the node lock.

We can't fix it this way, because the worker threads may do insertions or deletions at the same time, and we may lose some directory items.

Maybe we can fix it by adding a reference count to the delayed directory items, and do readdir like this:
1. hold the node lock
2. increase the directory items' reference count and put all the directory items into a list
3. release the node lock
4. read dir
5. decrease the directory items' reference count and free the items whose count drops to zero

What do you think?

Thanks
Miao
Re: [PATCH V4] btrfs: implement delayed inode items operation
On Tue, 22 Mar 2011 11:12:37 +0800, Miao Xie mi...@cn.fujitsu.com wrote:
> We can't fix it this way, because the worker threads may do insertions or deletions at the same time, and we may lose some directory items.

Ok.

> Maybe we can fix it by adding a reference count to the delayed directory items, and do readdir like this:
> 1. hold the node lock
> 2. increase the directory items' reference count and put all the directory items into a list
> 3. release the node lock
> 4. read dir
> 5. decrease the directory items' reference count and free the items whose count drops to zero
>
> What do you think?

Sounds doable to me.

itaru