[PATCH 1/2] btrfs-progs: Allow btrfs_leaf_free_space to accept NULL root
Btrfs_leaf_free_space() function is used to determine the leaf/node size.

It's OK to use root->nodesize to determine the nodesize, but in fact
extent_buffer->len can also be used to determine the nodesize, if the
caller can ensure it's a tree block.

So this patch adds support for a NULL root in btrfs_leaf_free_space(),
to allow btrfs_print_leaf() to be called in gdb, or to debug a temporary
btrfs in make_btrfs() without a valid root.

Signed-off-by: Qu Wenruo
---
 ctree.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/ctree.c b/ctree.c
index 1434007..46153e3 100644
--- a/ctree.c
+++ b/ctree.c
@@ -1619,13 +1619,14 @@ static int leaf_space_used(struct extent_buffer *l, int start, int nr)
  */
 int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf)
 {
+	u32 nodesize = (root ? BTRFS_LEAF_DATA_SIZE(root) : leaf->len);
 	int nritems = btrfs_header_nritems(leaf);
 	int ret;
-	ret = BTRFS_LEAF_DATA_SIZE(root) - leaf_space_used(leaf, 0, nritems);
+	ret = nodesize - leaf_space_used(leaf, 0, nritems);
 	if (ret < 0) {
-		printk("leaf free space ret %d, leaf data size %lu, used %d nritems %d\n",
-		       ret, (unsigned long) BTRFS_LEAF_DATA_SIZE(root),
-		       leaf_space_used(leaf, 0, nritems), nritems);
+		printk("leaf free space ret %d, leaf data size %u, used %d nritems %d\n",
+		       ret, nodesize, leaf_space_used(leaf, 0, nritems),
+		       nritems);
 	}
 	return ret;
 }
-- 
2.6.2
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs-progs: debug-tree: Add option to show ondisk block without open_ctree
Add a new option '-B' to show a tree block without calling open_ctree.
It's very useful to debug a non-standard super or a heavily damaged
filesystem.

As it needs the nodesize, also add a new option '-n' to specify the
nodesize.

Normal users should avoid calling it on random bytes, as it won't
validate the tree block.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-debug-tree.asciidoc | 12 -
 btrfs-debug-tree.c | 81 +
 2 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/Documentation/btrfs-debug-tree.asciidoc b/Documentation/btrfs-debug-tree.asciidoc
index 23fc115..a5528cb 100644
--- a/Documentation/btrfs-debug-tree.asciidoc
+++ b/Documentation/btrfs-debug-tree.asciidoc
@@ -25,8 +25,16 @@ Print detailed extents info.
 Print info of btrfs device and root tree dirs only.
 -r::
 Print info of roots only.
--b <block_num>::
-Print info of the specified block only.
+-b <bytenr>::
+Print info of the specified block at logical bytenr.
+-B <bytenr>::
+Print info of the specified block at on-disk bytenr.
+Need to use with the '-n <nodesize>' option.
++
+Use with caution, as it won't do the normal tree block checks.
+
+-n <nodesize>::
+Specify the nodesize for the '-B <bytenr>' option.
 EXIT STATUS
---

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 8adc39f..52e3de3 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -21,6 +21,8 @@
 #include
 #include
 #include
+#include
+#include

 #include "kerncompat.h"
 #include "radix-tree.h"
@@ -41,8 +43,11 @@ static int print_usage(int ret)
 	fprintf(stderr, "\t-r : print info of roots only\n");
 	fprintf(stderr, "\t-R : print info of roots and root backups\n");
 	fprintf(stderr, "\t-u : print info of uuid tree only\n");
-	fprintf(stderr, "\t-b block_num : print info of the specified block"
-		" only\n");
+	fprintf(stderr, "\t-b bytenr: print info of the specified block"
+		" at logical bytenr only\n");
+	fprintf(stderr, "\t-B bytenr: print info of the specified block"
+		" at disk bytenr only, need -n option (use with caution)\n");
+	fprintf(stderr, "\t-n nodesize: specify the nodesize to use with -B\n");
 	fprintf(stderr, "\t-t tree_id : print only the tree with the given id\n");
 	fprintf(stderr, "%s\n", PACKAGE_STRING);
@@ -122,6 +127,41 @@ static void print_old_roots(struct btrfs_super_block *super)
 	}
 }

+static int print_ondisk_leaf(const char *device, u64 ondisk_bytenr,
+			     u32 nodesize)
+{
+	struct extent_buffer *buf = NULL;
+	int fd;
+	int ret;
+
+	buf = malloc(sizeof(*buf) + nodesize);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	memset(buf, 0, sizeof(*buf) + nodesize);
+	buf->start = ondisk_bytenr;
+	buf->len = nodesize;
+	buf->refs = 1;
+	fd = open(device, O_RDONLY);
+	if (fd < 0) {
+		ret = -errno;
+		goto out;
+	}
+
+	ret = pread(fd, buf->data, nodesize, ondisk_bytenr);
+	if (ret < nodesize) {
+		ret = (ret < 0 ? ret : -EIO);
+		goto out;
+	}
+
+	/* We don't check anything, user should be responsible for it */
+	btrfs_print_leaf(NULL, buf);
+out:
+	free(buf);
+	return ret;
+}
+
 int main(int ac, char **av)
 {
 	struct btrfs_root *root;
@@ -140,7 +180,9 @@ int main(int ac, char **av)
 	int uuid_tree_only = 0;
 	int roots_only = 0;
 	int root_backups = 0;
-	u64 block_only = 0;
+	u64 logical_only = 0;
+	u64 ondisk_only = 0;
+	u32 nodesize = 0;
 	struct btrfs_root *tree_root_scan;
 	u64 tree_id = 0;
@@ -153,7 +195,7 @@ int main(int ac, char **av)
 		{ NULL, 0, NULL, 0 }
 	};

-	c = getopt_long(ac, av, "deb:rRut:", long_options, NULL);
+	c = getopt_long(ac, av, "deb:B:rRut:n:", long_options, NULL);
 	if (c < 0)
 		break;
 	switch(c) {
@@ -174,7 +216,13 @@ int main(int ac, char **av)
 		root_backups = 1;
 		break;
 	case 'b':
-		block_only = arg_strtou64(optarg);
+		logical_only = arg_strtou64(optarg);
+		break;
+	case 'B':
+		ondisk_only = arg_strtou64(optarg);
+		break;
+	case 'n':
+		nodesize = arg_strtou64(optarg);
 		break;
 	case 't':
 		tree_id = arg_strtou64(optarg);
@@ -196,6 +244,21 @@ int main(int ac, char **av)
 		exit(1);
 	}

+	/* Ondisk_only means we won't need to go
[PATCH] Btrfs: fix race waiting for qgroup rescan worker
From: Filipe Manana

We were initializing the completion (fs_info->qgroup_rescan_completion)
object after releasing the qgroup rescan lock, which gives a small time
window for a rescan waiter to not actually wait for the rescan worker
to finish. Example:

         CPU 1                                          CPU 2

                                        fs_info->qgroup_rescan_completion->done is 0

                                        btrfs_qgroup_rescan_worker()
                                          complete_all(&fs_info->qgroup_rescan_completion)
                                          sets fs_info->qgroup_rescan_completion->done
                                          to UINT_MAX / 2

                                        ... do some other stuff

 qgroup_rescan_init()
   mutex_lock(&fs_info->qgroup_rescan_lock)
   set flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
     in fs_info->qgroup_flags
   mutex_unlock(&fs_info->qgroup_rescan_lock)

                                        btrfs_qgroup_wait_for_completion()
                                          mutex_lock(&fs_info->qgroup_rescan_lock)
                                          sees flag BTRFS_QGROUP_STATUS_FLAG_RESCAN
                                            in fs_info->qgroup_flags
                                          mutex_unlock(&fs_info->qgroup_rescan_lock)
                                          wait_for_completion_interruptible(
                                            &fs_info->qgroup_rescan_completion)
                                          fs_info->qgroup_rescan_completion->done
                                          is > 0 so it returns immediately

 init_completion(&fs_info->qgroup_rescan_completion)
 sets fs_info->qgroup_rescan_completion->done to 0

So fix this by initializing the completion object while holding the
mutex fs_info->qgroup_rescan_lock.
Signed-off-by: Filipe Manana
---
 fs/btrfs/qgroup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 75c0249..75bb4af9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2387,12 +2387,11 @@ qgroup_rescan_init(struct btrfs_fs_info *fs_info, u64 progress_objectid,
 	memset(&fs_info->qgroup_rescan_progress, 0,
 		sizeof(fs_info->qgroup_rescan_progress));
 	fs_info->qgroup_rescan_progress.objectid = progress_objectid;
+	init_completion(&fs_info->qgroup_rescan_completion);
 	spin_unlock(&fs_info->qgroup_lock);
 	mutex_unlock(&fs_info->qgroup_rescan_lock);

-	init_completion(&fs_info->qgroup_rescan_completion);
-
 	memset(&fs_info->qgroup_rescan_work, 0,
 		sizeof(fs_info->qgroup_rescan_work));
 	btrfs_init_work(&fs_info->qgroup_rescan_work,
-- 
2.1.3
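The fix above boils down to an ordering rule: the state a waiter checks
(the RESCAN flag) and the object it then waits on (the completion) must
be published together, under the same lock. The following is a minimal
single-threaded userspace sketch of that pattern, not btrfs code: the
struct and function names are invented for illustration, the "lock" is
implicit (the interleaving point is written out by hand), and the
completion is modeled as a bare counter the way the commit message
describes it (complete_all() sets done to UINT_MAX/2).

```c
#include <assert.h>
#include <limits.h>

/* Userspace stand-in for the kernel's struct completion. */
struct completion { unsigned int done; };

struct rescan_state {
	int rescan_flag;        /* stands in for BTRFS_QGROUP_STATUS_FLAG_RESCAN */
	struct completion comp; /* stands in for qgroup_rescan_completion */
};

/* What the waiter observes: it actually waits only if the rescan flag is
 * set AND the completion has not already been marked done. */
static int waiter_would_wait(const struct rescan_state *s)
{
	return s->rescan_flag && s->comp.done == 0;
}

/* Buggy ordering: the flag becomes visible (lock dropped) before the
 * completion is re-initialized.  A waiter running in that window sees the
 * flag but a stale, already-done completion, so it returns immediately. */
static int buggy_window_lets_waiter_through(void)
{
	/* previous rescan finished: complete_all() left done = UINT_MAX/2 */
	struct rescan_state s = { 0, { UINT_MAX / 2 } };
	int waits;

	s.rescan_flag = 1;              /* done under the rescan lock ... */
	/* ... lock released here: this is the race window */
	waits = waiter_would_wait(&s);  /* waiter runs now */
	s.comp.done = 0;                /* init_completion(), too late */
	return waits;
}

/* Fixed ordering: flag and completion are published atomically (both set
 * before the lock is dropped), so any waiter that sees the flag also sees
 * a fresh completion. */
static int fixed_ordering_makes_waiter_wait(void)
{
	struct rescan_state s = { 0, { UINT_MAX / 2 } };

	s.rescan_flag = 1;              /* both done while holding the lock */
	s.comp.done = 0;
	/* lock released; waiter runs now */
	return waiter_would_wait(&s);
}
```

With the buggy ordering the simulated waiter falls straight through the
window; with the fixed ordering it blocks, which is exactly what moving
init_completion() under qgroup_rescan_lock achieves in the patch.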
Re: [PATCH v3] btrfs: qgroup: exit the rescan worker during umount
On Wed, Nov 4, 2015 at 11:56 PM, Justin Maggard wrote:
> I was hitting a consistent NULL pointer dereference during shutdown that
> showed the trace running through end_workqueue_bio().  I traced it back
> to the endio_meta_workers workqueue being poked after it had already
> been destroyed.
>
> Eventually I found that the root cause was a qgroup rescan that was
> still in progress while we were stopping all the btrfs workers.
>
> Currently we explicitly pause balance and scrub operations in
> close_ctree(), but we do nothing to stop the qgroup rescan.  We should
> probably be doing the same for qgroup rescan, but that's a much larger
> change.  This small change is good enough to allow me to unmount without
> crashing.
>
> v3: avoid more races by calling btrfs_qgroup_wait_for_completion()

Btw, information about what changed between versions should go after the
"---" (triple dash) below. You should also have mentioned what changed
between v1 and v2 as well (see
https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Repeated_submissions).

>
> Signed-off-by: Justin Maggard

Reviewed-by: Filipe Manana

I've added this to my integration branch for 4.4 [1] and will later
prepare a pull request for Chris. I've removed the v3 line from the
change log there, as it's not intended to be in the commit message; it
only serves for patch review on the list.

Thanks for doing this, and for the test case for fstests.
[1] http://git.kernel.org/cgit/linux/kernel/git/fdmanana/linux.git/log/?h=integration-4.4

> ---
>  fs/btrfs/disk-io.c | 3 +++
>  fs/btrfs/qgroup.c  | 9 ++---
>  2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2d46675..1eb0839 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3780,6 +3780,9 @@ void close_ctree(struct btrfs_root *root)
>  	fs_info->closing = 1;
>  	smp_mb();
>
> +	/* wait for the qgroup rescan worker to stop */
> +	btrfs_qgroup_wait_for_completion(fs_info);
> +
>  	/* wait for the uuid_scan task to finish */
>  	down(&fs_info->uuid_tree_rescan_sem);
>  	/* avoid complains from lockdep et al., set sem back to initial state
>  	 */
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 46476c2..75c0249 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2286,7 +2286,7 @@ static void btrfs_qgroup_rescan_worker(struct btrfs_work *work)
>  		goto out;
>
>  	err = 0;
> -	while (!err) {
> +	while (!err && !btrfs_fs_closing(fs_info)) {
>  		trans = btrfs_start_transaction(fs_info->fs_root, 0);
>  		if (IS_ERR(trans)) {
>  			err = PTR_ERR(trans);
> @@ -2307,7 +2307,8 @@ out:
>  	btrfs_free_path(path);
>
>  	mutex_lock(&fs_info->qgroup_rescan_lock);
> -	fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
> +	if (!btrfs_fs_closing(fs_info))
> +		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
>
>  	if (err > 0 &&
>  	    fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT) {
> @@ -2336,7 +2337,9 @@ out:
>  	}
>  	btrfs_end_transaction(trans, fs_info->quota_root);
>
> -	if (err >= 0) {
> +	if (btrfs_fs_closing(fs_info)) {
> +		btrfs_info(fs_info, "qgroup scan paused");
> +	} else if (err >= 0) {
>  		btrfs_info(fs_info, "qgroup scan completed%s",
>  			err > 0 ? " (inconsistency flag cleared)" : "");
>  	} else {
> --
> 2.6.2

-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
Re: Btrfs/RAID5 became unmountable after SATA cable fault
On 2015-11-04 23:06, Duncan wrote:
> (Tho I should mention, while not on zfs, I've actually had my own
> problems with ECC RAM too. In my case, the RAM was certified to run at
> speeds faster than it was actually reliable at, such that actually
> stored data, what the ECC protects, was fine; the data was actually
> getting damaged in transit to/from the RAM.
>
> On a lightly loaded system, such as one running many memory tests or
> under normal desktop usage conditions, the RAM was generally fine, no
> problems. But on a heavily loaded system, such as when doing parallel
> builds (I run gentoo, which builds from sources in order to get the
> higher level of option flexibility that comes only when you can toggle
> build-time options), I'd often have memory faults and my builds would
> fail. The most common failure, BTW, was on tarball decompression,
> bunzip2 or the like, since the tarballs contained checksums that were
> verified on decompression, and often they'd fail to verify.
>
> Once I updated the BIOS to one that would let me set the memory speed
> instead of using the speed the modules themselves reported, and I
> declocked the memory just one notch (this was DDR1; IIRC I declocked
> from the PC3200 it was rated at to PC3000 speeds), not only was the
> memory then 100% reliable, but I could and did actually reduce the
> number of wait-states for various operations, and it was STILL 100%
> reliable. It simply couldn't handle the raw speeds it was certified to
> run at, is all, tho it did handle them well enough, enough of the time,
> to make the problem far more difficult to diagnose and confirm than it
> would have been had it appeared at low load as well.
>
> As it happens, I was running reiserfs at the time, and it handled both
> that hardware issue, and a number of others I've had, far better than
> I'd have expected of /any/ filesystem, when the memory feeding it is
> simply not reliable. Reiserfs metadata, in particular, seems incredibly
> resilient in the face of hardware issues, and I lost far less data than
> I might have expected, tho without checksums and with bad memory, I
> imagine I had occasional undetected bitflip corruption in files here or
> there, but generally nothing I detected. I still use reiserfs on my
> spinning rust today, but it's not well suited to SSD, which is where I
> run btrfs.
>
> But the point for this discussion is that just because it's ECC RAM
> doesn't mean you can't have memory related errors, just that if you do,
> they're likely to be different errors, "transit errors", that will tend
> to go undetected by many memory checkers - at least the ones that don't
> run at full memory bandwidth, if they're simply checking that what was
> stored in a cell can be read back unchanged.)

I've actually seen similar issues with both ECC and non-ECC memory
myself. Any time I'm getting RAM for a system that I can afford to
over-spec, I get the next higher speed and under-clock it (which in turn
means I can lower the timing parameters and usually get a faster system
than if I was running it at the rated speed).

FWIW, I also make a point of doing multiple memtest86+ runs (at a
minimum, one running single core and one with forced SMP) when I get new
RAM, and I even have a run-level configured on my Gentoo based home
server system where it boots Xen and fires up twice as many VMs running
memtest86+ as I have CPU cores, which is usually enough to fully
saturate memory bandwidth and check for the type of issues you mentioned
having above (although the BOINC client I run usually does a good job of
triggering those kinds of issues fast; distributed computing apps tend
to be memory bound and use a lot of memory bandwidth).
Re: [PATCH 1/2] btrfs-progs: Allow btrfs_leaf_free_space to accept NULL root
On Thu, Nov 05, 2015 at 04:32:59PM +0800, Qu Wenruo wrote:
> Btrfs_leaf_free_space() function is used to determine the leaf/node
> size.
> It's OK to use root->nodesize to determine nodesize, but in fact,
> extent_buffer->len can also be used to determine the nodesize if caller
> can ensure it's a tree block.
>
> So this patch will add support for NULL root for btrfs_leaf_free_space()
> function, to allow btrfs_print_leaf() functions to be called in gdb or
> to debug temporary btrfs in make_btrfs() without a valid root.
>
> Signed-off-by: Qu Wenruo

Applied, thanks.
Re: [PATCH 2/2] btrfs-progs: debug-tree: Add option to show ondisk block without open_ctree
On Thu, Nov 05, 2015 at 04:33:00PM +0800, Qu Wenruo wrote:
> Add new option '-B' to show tree block without calling open_ctree.
> It's very useful to debug non-standard super or heavily damaged case.
>
> As it needs nodesize, also adds a new option '-n' to specify nodesize.
>
> Normal user should avoid calling it on random bytes, as it won't check
> the validation of the tree block.
>
> Signed-off-by: Qu Wenruo

I'm going to postpone this patch to the next development cycle. It's
likely that we'll add more options, so I'd like to not make it a mess
(similar to what the btrfs-corrupt-block options are).
Re: [PATCH] btrfs: clear bio reference after submit_one_bio()
On 10/11/15 20:09, Alex Lyakas wrote:
> Hi Naota,
>
> What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we
> return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And
> if *bio_ret was non-NULL upon entry into submit_extent_page, then we
> had submitted this bio before getting to btrfs_bio_alloc(). So should
> btrfs_bio_alloc() failure be handled in the same way?
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3915c94..cd443bc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree,
>
>  	bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
>  			      GFP_NOFS | __GFP_HIGH);
> -	if (!bio)
> +	if (!bio) {
> +		if (bio_ret)
> +			*bio_ret = NULL;
>  		return -ENOMEM;
> +	}
>
>  	bio_add_page(bio, page, page_size, offset);
>  	bio->bi_end_io = end_io_func;
>

Did you get any feedback on this? It seems it could cause data loss or
corruption on allocation failures, no?

-h
Re: Unable to allocate for space usage in particular btrfs volume
On Thu, Nov 05, 2015 at 10:44:51AM +, OmegaPhil wrote:
> So a couple of gig still unaccounted for, but irrelevant. Thanks,
> problem solved! Although hopefully checksumming will be allowed on
> nocow files in the future, as that's currently 17% of all data
> unprotected, and will get worse...

It's not likely to happen. If you're modifying data in-place (as is the
case with nodatacow files), then there's a window of vulnerability
between modifying the data and modifying the checksum records during
which the checksums are incorrect. This breaks the consistency
guarantees of the FS.

   Hugo.

-- 
Hugo Mills             | Nostalgia isn't what it used to be.
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 |
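Hugo's point can be made concrete with a toy model: when data is
rewritten in place and the checksum record is updated as a separate
step, anything that observes the state between the two steps (a crash, a
concurrent read) sees data whose checksum does not match, even though
nothing is actually corrupt. The sketch below is illustrative only - the
block layout and the checksum function are invented, not btrfs's on-disk
format (btrfs uses crc32c):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy "on disk" state: a data block plus its checksum record. */
struct block {
	uint8_t  data[16];
	uint32_t csum;
};

/* Stand-in checksum; any function exhibits the same window. */
static uint32_t toy_csum(const uint8_t *p, size_t n)
{
	uint32_t c = 0;
	for (size_t i = 0; i < n; i++)
		c = c * 31 + p[i];
	return c;
}

static int verify(const struct block *b)
{
	return toy_csum(b->data, sizeof(b->data)) == b->csum;
}

/* nodatacow-style rewrite: data is overwritten in place first, the
 * checksum record second.  Returns the verify() result as observed
 * between the two steps - the window Hugo describes. */
static int rewrite_in_place(struct block *b, const uint8_t *newdata)
{
	int mid;

	memcpy(b->data, newdata, sizeof(b->data));    /* step 1: new data  */
	mid = verify(b);                              /* crash/reader here */
	b->csum = toy_csum(b->data, sizeof(b->data)); /* step 2: new csum  */
	return mid;
}
```

In the window, verify() fails on perfectly good data, so a checksum
mismatch could no longer be trusted to mean corruption. With
copy-on-write, by contrast, the new data and its new checksums become
visible atomically at transaction commit, which is why btrfs only keeps
checksums for COW data.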
[GIT PULL] Btrfs fixes
From: Filipe Manana

Hi Chris, please consider the following fixes for the 4.4 merge window
(they were all previously sent to the mailing list).

One fixes a sleep-inside-atomic-context issue, introduced by one patch
in the integration-4.4 branch. Another two fix races regarding waiting
for the qgroup rescan worker to finish, and a race between the qgroup
rescan worker and unmounting the filesystem (leading to crashes). The
remaining patch fixes an issue with partial direct IO writes, which was
introduced in the 4.0 kernel and results either in an assertion failure
(BUG_ON) when CONFIG_BTRFS_ASSERT=y, or in arithmetic underflow of an
inode's outstanding extents counter (used for proper space reservation)
when assertions are disabled.

Two test cases for fstests were sent recently to cover the issues
regarding the races and the direct IO partial write regression.

Thanks.

The following changes since commit 2959a32a858a2c44bbbce83d19c158d54cc5998a:

  Btrfs: fix hole punching when using the no-holes feature (2015-11-03 07:44:20 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git integration-4.4

for you to fetch changes up to 3b2ba7b31d56c3d8f57cd5d32b8fb5101ab446e4:

  Btrfs: fix sleeping inside atomic context in qgroup rescan worker (2015-11-05 11:02:22 +)

Filipe Manana (3):
      Btrfs: fix extent accounting for partial direct IO writes
      Btrfs: fix race waiting for qgroup rescan worker
      Btrfs: fix sleeping inside atomic context in qgroup rescan worker

Justin Maggard (1):
      btrfs: qgroup: exit the rescan worker during umount

 fs/btrfs/disk-io.c |  3 +++
 fs/btrfs/inode.c   | 52 +---
 fs/btrfs/qgroup.c  | 13 +++--
 3 files changed, 47 insertions(+), 21 deletions(-)
-- 
2.1.3
[PATCH] Btrfs: fix sleeping inside atomic context in qgroup rescan worker
From: Filipe Manana

We are holding a btree path with spinning locks and then we attempt to
clone an extent buffer, which calls kmem_cache_alloc(), and this
function can sleep, causing the following trace to be reported on a
debug kernel:

[107118.218536] BUG: sleeping function called from invalid context at mm/slab.c:2871
[107118.224110] in_atomic(): 1, irqs_disabled(): 0, pid: 19148, name: kworker/u32:3
[107118.226120] INFO: lockdep is turned off.
[107118.226843] Preemption disabled at: [] btrfs_clear_lock_blocking_rw+0x96/0xea [btrfs]
[107118.229175] CPU: 3 PID: 19148 Comm: kworker/u32:3 Tainted: GW 4.3.0-rc5-btrfs-next-17+ #1
[107118.231326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[107118.233687] Workqueue: btrfs-qgroup-rescan btrfs_qgroup_rescan_helper [btrfs]
[107118.236835]  880424bf3b78 812566f4
[107118.238369]  880424bf3ba0 81070664 817f1cd5 0b37
[107118.239769]  880424bf3bc8 8107070a 8850
[107118.241244] Call Trace:
[107118.241729]  [] dump_stack+0x4e/0x79
[107118.242602]  [] ___might_sleep+0x23a/0x241
[107118.243586]  [] __might_sleep+0x9f/0xa6
[107118.244532]  [] cache_alloc_debugcheck_before+0x25/0x36
[107118.245939]  [] kmem_cache_alloc+0x50/0x215
[107118.246930]  [] __alloc_extent_buffer+0x2a/0x11f [btrfs]
[107118.248121]  [] btrfs_clone_extent_buffer+0x3d/0xdd [btrfs]
[107118.249451]  [] btrfs_qgroup_rescan_worker+0x16d/0x434 [btrfs]
[107118.250755]  [] ? arch_local_irq_save+0x9/0xc
[107118.251754]  [] normal_work_helper+0x14c/0x32a [btrfs]
[107118.252899]  [] ? normal_work_helper+0x14c/0x32a [btrfs]
[107118.254195]  [] btrfs_qgroup_rescan_helper+0x12/0x14 [btrfs]
[107118.255436]  [] process_one_work+0x24a/0x4ac
[107118.263690]  [] worker_thread+0x206/0x2c2
[107118.264888]  [] ? rescuer_thread+0x2cb/0x2cb
[107118.267413]  [] kthread+0xef/0xf7
[107118.268417]  [] ? kthread_parkme+0x24/0x24
[107118.269505]  [] ret_from_fork+0x3f/0x70
[107118.270491]  [] ? kthread_parkme+0x24/0x24

So just use blocking locks for our path to solve this.

This fixes the patch titled:
"btrfs: qgroup: Don't copy extent buffer to do qgroup rescan"

Signed-off-by: Filipe Manana
---
 fs/btrfs/qgroup.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 75bb4af9..93e12c1 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2198,7 +2198,6 @@ qgroup_rescan_leaf(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
 	int slot;
 	int ret;

-	path->leave_spinning = 1;
 	mutex_lock(&fs_info->qgroup_rescan_lock);
 	ret = btrfs_search_slot_for_read(fs_info->extent_root,
 					 &fs_info->qgroup_rescan_progress,
-- 
2.1.3
Re: [PATCH v2] btrfs: test unmount during quota rescan
On Thu, Nov 5, 2015 at 12:50 AM, Justin Maggard wrote:
> This test case tests if we are able to unmount a filesystem while
> a quota rescan is running.  Up to now (4.3) this would result
> in a kernel NULL pointer dereference.
>
> Fixed by patch (btrfs: qgroup: exit the rescan worker during umount).
>
> Signed-off-by: Justin Maggard

Reviewed-by: Filipe Manana

> ---

Btw, for future patches/versions, here after the --- you should mention
what changed between versions of the patch.

Thanks.

>  tests/btrfs/114     | 61 +
>  tests/btrfs/114.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 64 insertions(+)
>  create mode 100644 tests/btrfs/114
>  create mode 100644 tests/btrfs/114.out
>
> diff --git a/tests/btrfs/114 b/tests/btrfs/114
> new file mode 100644
> index 000..0a0e8ba
> --- /dev/null
> +++ b/tests/btrfs/114
> @@ -0,0 +1,61 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/114
> +#
> +# btrfs quota scan/unmount sanity test
> +# Make sure that unmounting during a quota rescan doesn't crash
> +#
> +#---
> +# Copyright (c) 2015 NETGEAR, Inc.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +
> +for i in `seq 0 1 45`; do
> +	echo -n > $SCRATCH_MNT/file.$i
> +done
> +echo 3 > /proc/sys/vm/drop_caches
> +$BTRFS_UTIL_PROG quota enable $SCRATCH_MNT
> +_scratch_unmount
> +
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/114.out b/tests/btrfs/114.out
> new file mode 100644
> index 000..a2aa4a2
> --- /dev/null
> +++ b/tests/btrfs/114.out
> @@ -0,0 +1,2 @@
> +QA output created by 114
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index 7cf7dd7..10ab26b 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -116,3 +116,4 @@
>  111 auto quick send
>  112 auto quick clone
>  113 auto quick compress clone
> +114 auto qgroup
> --
> 2.6.2

-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
Re: Unable to allocate for space usage in particular btrfs volume
On 05/11/15 04:18, Duncan wrote:
> OmegaPhil posted on Wed, 04 Nov 2015 21:53:09 + as excerpted:
>
>> The volume doesn't change hugely over time, so it really ought not to
>> have broken so quickly - a quick rundown of the storage usage:
>>
>> 36% general (small files, some smallish videos)
>> 24% music
>> 23% pr0n
>> 17% VMs
>>
>> But in terms of 'large files changing', it could be the VM disks
>> perhaps - I'll move them out, balance, and then back in again;
>> hopefully that'd be a meaningful test.
>
> VM image files (and large database files, for the same reason) are a bit
> of a problem on btrfs, and indeed on any COW-based filesystem, since the
> random rewrite pattern matching that use-case is pretty much the
> absolute worst-case match for a COW-based filesystem there is.
>
> And that would be the worst-case in terms of the unsplit extents issue
> Hugo was talking about as well. So they may well be the problem, indeed.
>
> Since you're not doing snapshotting (which conflicts with this option,
> with an imperfect workaround), setting nocow on those files may well
> eliminate the problem, but be aware if you aren't already that (1) nocow
> turns off checksumming as well, in order to avoid a race that could
> easily lead to data corruption, and (2) you can't just activate nocow on
> an existing file and expect it to work; the procedure is a bit more
> complicated than that, since nocow is only guaranteed to work if it's
> set at file creation. Detailed instructions for #2 skipped as they've
> been posted many times, but if you are interested and haven't seen them,
> ask.

Thank you Hugo, Duncan - moving the VMs out, then:

=====
sudo chattr +C '/mnt/storage-1/Virtual Machines'
sudo btrfs balance start -mconvert=dup /mnt/storage-1
sudo btrfs fi defrag -rv /mnt/storage-1
=====

And after moving the VMs back, du reports 884GB used:

=====
$ sudo btrfs fi usage /mnt/storage-1
Overall:
    Device size:               953.87GiB
    Device allocated:          924.07GiB
    Device unallocated:         29.80GiB
    Device missing:                0.00B
    Used:                      886.11GiB
    Free (estimated):           65.72GiB  (min: 50.82GiB)
    Data ratio:                     1.00
    Metadata ratio:                 2.00
    Global reserve:            512.00MiB  (used: 0.00B)

Data,single: Size:918.01GiB, Used:882.09GiB
   /dev/sdb  918.01GiB

Metadata,DUP: Size:3.00GiB, Used:2.01GiB
   /dev/sdb    6.00GiB

System,DUP: Size:32.00MiB, Used:128.00KiB
   /dev/sdb   64.00MiB

Unallocated:
   /dev/sdb   29.80GiB
=====

So a couple of gig still unaccounted for, but irrelevant. Thanks,
problem solved! Although hopefully checksumming will be allowed on nocow
files in the future, as that's currently 17% of all data unprotected,
and will get worse...
Possible project idea
I'd been looking at the wiki page with project ideas, and I realized
that none of the listed ideas suggested adding support for arbitrary
erasure coding methods. Ceph, for example, has an option that allows you
to set arbitrary erasure coding such that you use n devices to store the
data, and can tolerate the loss of any m devices out of the set.

Technically, this would be covered as a special case by the whole
copies/stripes/parity thing that was discussed a while back before the
raid56 code made it into the kernel (I've tried to find the thread for
reference, but haven't been able to locate it), but I didn't see that
'project' listed anywhere on the page either.

IMHO, this would be an excellent feature to differentiate BTRFS from
bcachefs and ZFS (although I would not be surprised if they both copied
it themselves).
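The simplest instance of the n+m scheme described above is m = 1: one
parity device holding the byte-wise XOR of the n data devices, from
which any single lost device can be rebuilt (this is essentially what
RAID5 parity does; general n+m codes such as Ceph's use Reed-Solomon
over GF(2^8) instead). The sketch below is a toy illustration with an
invented stripe layout, not btrfs code:

```c
#include <stdint.h>
#include <string.h>

#define NDATA   4    /* n data devices */
#define STRIPE 16    /* bytes per device in one stripe */

/* Compute the parity device: byte-wise XOR of all data devices. */
static void make_parity(const uint8_t data[NDATA][STRIPE],
			uint8_t parity[STRIPE])
{
	memset(parity, 0, STRIPE);
	for (int d = 0; d < NDATA; d++)
		for (int i = 0; i < STRIPE; i++)
			parity[i] ^= data[d][i];
}

/* Rebuild one lost data device: XOR of the parity device and all
 * surviving data devices recovers the missing one, because each byte
 * XORs to zero across the full stripe. */
static void rebuild(uint8_t data[NDATA][STRIPE],
		    const uint8_t parity[STRIPE], int lost)
{
	memcpy(data[lost], parity, STRIPE);
	for (int d = 0; d < NDATA; d++)
		if (d != lost)
			for (int i = 0; i < STRIPE; i++)
				data[lost][i] ^= data[d][i];
}
```

Tolerating the loss of any m > 1 devices needs a real erasure code (the
btrfs raid6 code and Ceph's erasure-coded pools use Reed-Solomon), but
the recover-from-survivors structure stays the same, which is why the
copies/stripes/parity proposal could subsume it as a parameter.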
Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
On Wed, Nov 04, 2015 at 09:01:36AM +0800, Qu Wenruo wrote: > > > Mark Fasheh wrote on 2015/11/03 11:26 -0800: > >On Mon, Nov 02, 2015 at 09:34:24AM +0800, Qu Wenruo wrote: > >> > >> > >>Stefan Priebe wrote on 2015/11/01 21:49 +0100: > >>>Hi, > >>> > >>>this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html > >>> > >>>adds a regression to my test systems with very large disks (30tb and 50tb). > >>> > >>>btrfs balance is super slow afterwards while heavily making use of cp > >>>--reflink=always on big files (200gb - 500gb). > >>> > >>>Sorry didn't know how to correctly reply to that "old" message. > >>> > >>>Greets, > >>>Stefan > >>>-- > >>>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > >>>the body of a message to majord...@vger.kernel.org > >>>More majordomo info at http://vger.kernel.org/majordomo-info.html > >> > >>Thanks for the testing. > >> > >>Are you using qgroup or just doing normal balance with qgroup disabled? > >> > >>For the latter case, that's should be optimized to skip the dirty > >>extent insert in qgroup disabled case. > >> > >>For qgroup enabled case, I'm afraid that's the design. > >>As relocation will drop a subtree to relocate, and to ensure qgroup > >>consistent, we must walk down all the tree blocks and mark them > >>dirty for later qgroup accounting. > > > >Qu, we're always going to have to walk the tree when deleting it, this is > >part of removing a subvolume. We've walked shared subtrees in this code for > >numerous kernel releases without incident before it was removed in 4.2. > > > >Do you have any actual evidence that this is a major performance regression? > > From our previous conversations you seemed convinced of this, without even > >having a working subtree walk to test. I remember the hand wringing > >about an individual commit being too heavy with the qgroup code (even though > >I pointed out that tree walk is a restartable transaction). 
> > It seems that you are confused still about how we handle removing a
> > volume wrt qgroups.
> >
> > If you have questions or concerns I would be happy to explain them but
> > IMHO your statements there are opinion and not based in fact.
>
> Yes, I don't deny it.
> But it's quite hard to prove it, as we need such huge storage as Stefan's.
> What I have is only several hundred GB of test storage.
> Even counting all of my home NAS, I only have 2T, far from the
> storage Stefan has.
>
> And what Stefan reports should already give some hint about the
> performance issue.
>
> In your words, "it won't be doing anything (ok some kmalloc/free of a
> very tiny object)", it's already slowing down balance, since balance
> also uses btrfs_drop_subtree().

When I wrote that I was under the impression that the qgroup code was doing
its own sanity checking (it used to), and since Stefan had qgroups disabled
they couldn't be causing the problem. I read your e-mail explaining that the
qgroup API was now intertwined with delayed ref locking after this one.

The same exact code ran in either case before and after your patches, so my
guess is that the issue is actually inside the qgroup code that shouldn't
have been run. I wonder if we even just filled up his memory but never
cleaned the objects. The only other thing I can think of is if
account_leaf_items() got run in a really tight loop for some reason.

Kmalloc in the way we are using it is not usually a performance issue,
especially if we've been reading off disk in the same process. Ask yourself
this: your own patch series does the same kmalloc for every qgroup
operation. Did you notice a complete and massive performance slowdown like
the one Stefan reported?

I will say that we never had this problem reported before, and
account_leaf_items() is always run in all kernels, even without qgroups
enabled. That will change with my new patch though.
What we can say for sure is that drop_snapshot in the qgroup case will read
more disk, and obviously that will have a negative impact depending on what
the tree looks like. So IMHO we ought to be focusing on reducing the amount
of I/O involved.

> But we can't just ignore such a "possible" performance issue just
> because the old code did the same thing. (Although not the same now,
> we're marking all subtree blocks dirty, not only shared ones.)

Well, I can't disagree with that - the only reason we are talking right now
is because you intentionally ignored the qgroup code in drop_snapshot().

So let's start with this - no more 'fixing' code by tearing it out and
replacing it with /* TODO: somebody else re-implement this */ ;)
	--Mark

--
Mark Fasheh
[PATCH 3/3] btrfs: qgroup: account shared subtree during snapshot delete
Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup
mechanism.') removed our qgroup accounting during btrfs_drop_snapshot().
Predictably, this results in qgroup numbers going bad shortly after a
snapshot is removed.

Fix this by adding a dirty extent record when we encounter extents during
our shared subtree walk. This effectively restores the functionality we had
with the original shared subtree walking code in 1152651 (btrfs: qgroup:
account shared subtrees during snapshot delete).

The idea with the original patch (and this one) is that shared subtrees can
get skipped during drop_snapshot. The shared subtree walk then allows us a
chance to visit those extents and add them to the qgroup work for later
processing. This ultimately makes the accounting for drop snapshot work.

The new qgroup code nicely handles all the other extents during the tree
walk via the ref dec/inc functions so we don't have to add actions beyond
what we had originally.

Signed-off-by: Mark Fasheh
---
 fs/btrfs/extent-tree.c | 47 ++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/qgroup.c      |  2 ++
 2 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 601d7d4..410b46d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7850,21 +7850,47 @@ reada:
 }
 
 /*
- * TODO: Modify related function to add related node/leaf to
- * dirty_extent_root, for later qgroup accounting.
- *
- * Current, this function does nothing.
+ * These may not be seen by the usual inc/dec ref code so we have to
+ * add them here.
  */
+static int record_one_subtree_extent(struct btrfs_trans_handle *trans,
+				     struct btrfs_root *root, u64 bytenr,
+				     u64 num_bytes)
+{
+	struct btrfs_qgroup_extent_record *qrecord;
+	struct btrfs_delayed_ref_root *delayed_refs;
+
+	qrecord = kmalloc(sizeof(*qrecord), GFP_NOFS);
+	if (!qrecord)
+		return -ENOMEM;
+
+	qrecord->bytenr = bytenr;
+	qrecord->num_bytes = num_bytes;
+	qrecord->old_roots = NULL;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+	spin_lock(&delayed_refs->lock);
+	if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+		kfree(qrecord);
+	spin_unlock(&delayed_refs->lock);
+
+	return 0;
+}
+
 static int account_leaf_items(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct extent_buffer *eb)
 {
 	int nr = btrfs_header_nritems(eb);
-	int i, extent_type;
+	int i, extent_type, ret;
 	struct btrfs_key key;
 	struct btrfs_file_extent_item *fi;
 	u64 bytenr, num_bytes;
 
+	/* We can be called directly from walk_up_proc() */
+	if (!root->fs_info->quota_enabled)
+		return 0;
+
 	for (i = 0; i < nr; i++) {
 		btrfs_item_key_to_cpu(eb, &key, i);
 
@@ -7883,6 +7909,10 @@ static int account_leaf_items(struct btrfs_trans_handle *trans,
 			continue;
 
 		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+		ret = record_one_subtree_extent(trans, root, bytenr, num_bytes);
+		if (ret)
+			return ret;
 	}
 	return 0;
 }
@@ -7951,8 +7981,6 @@ static int adjust_slots_upwards(struct btrfs_root *root,
 
 /*
  * root_eb is the subtree root and is locked before this function is called.
- * TODO: Modify this function to mark all (including complete shared node)
- * to dirty_extent_root to allow it get accounted in qgroup.
  */
 static int account_shared_subtree(struct btrfs_trans_handle *trans,
 				  struct btrfs_root *root,
@@ -8030,6 +8058,11 @@ walk_down:
 			btrfs_tree_read_lock(eb);
 			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+			ret = record_one_subtree_extent(trans, root, child_bytenr,
+							root->nodesize);
+			if (ret)
+				goto out;
 		}
 
 		if (level == 0) {
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b068209..ce1cdcf 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1461,6 +1461,8 @@ struct btrfs_qgroup_extent_record
 	struct btrfs_qgroup_extent_record *entry;
 	u64 bytenr = record->bytenr;
 
+	assert_spin_locked(&delayed_refs->lock);
+
 	trace_btrfs_qgroup_insert_dirty_extent(record);
 
 	while (*p) {
--
2.1.2
[PATCH 2/3] btrfs: Add qgroup tracing
This patch adds tracepoints to the qgroup code on both the reporting side
(insert_dirty_extents) and the accounting side. Taken together it allows us
to see what qgroup operations have happened, and what their result was.

Signed-off-by: Mark Fasheh
---
 fs/btrfs/qgroup.c            | 10 ++++
 include/trace/events/btrfs.h | 88 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index d904ee1..b068209 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1461,6 +1461,8 @@ struct btrfs_qgroup_extent_record
 	struct btrfs_qgroup_extent_record *entry;
 	u64 bytenr = record->bytenr;
 
+	trace_btrfs_qgroup_insert_dirty_extent(record);
+
 	while (*p) {
 		parent_node = *p;
 		entry = rb_entry(parent_node, struct btrfs_qgroup_extent_record,
@@ -1591,6 +1593,9 @@ static int qgroup_update_counters(struct btrfs_fs_info *fs_info,
 		cur_old_count = btrfs_qgroup_get_old_refcnt(qg, seq);
 		cur_new_count = btrfs_qgroup_get_new_refcnt(qg, seq);
 
+		trace_qgroup_update_counters(qg->qgroupid, cur_old_count,
+					     cur_new_count);
+
 		/* Rfer update part */
 		if (cur_old_count == 0 && cur_new_count > 0) {
 			qg->rfer += num_bytes;
@@ -1684,6 +1689,9 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
 		goto out_free;
 	BUG_ON(!fs_info->quota_root);
 
+	trace_btrfs_qgroup_account_extent(bytenr, num_bytes, nr_old_roots,
+					  nr_new_roots);
+
 	qgroups = ulist_alloc(GFP_NOFS);
 	if (!qgroups) {
 		ret = -ENOMEM;
@@ -1753,6 +1761,8 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans,
 		record = rb_entry(node, struct btrfs_qgroup_extent_record,
 				  node);
 
+		trace_btrfs_qgroup_account_extents(record);
+
 		if (!ret) {
 			/*
 			 * Use (u64)-1 as time_seq to do special search, which
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0b73af9..9d7b545 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -23,7 +23,7 @@ struct map_lookup;
 struct extent_buffer;
 struct btrfs_work;
 struct __btrfs_workqueue;
-struct btrfs_qgroup_operation;
+struct btrfs_qgroup_extent_record;
 
 #define show_ref_type(type)					\
 	__print_symbolic(type,					\
@@ -1117,6 +1117,92 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
 	TP_ARGS(wq)
 );
 
+DECLARE_EVENT_CLASS(btrfs_qgroup_extent,
+	TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+	TP_ARGS(rec),
+
+	TP_STRUCT__entry(
+		__field(u64, bytenr)
+		__field(u64, num_bytes)
+	),
+
+	TP_fast_assign(
+		__entry->bytenr = rec->bytenr,
+		__entry->num_bytes = rec->num_bytes;
+	),
+
+	TP_printk("bytenr = %llu, num_bytes = %llu",
+		  (unsigned long long)__entry->bytenr,
+		  (unsigned long long)__entry->num_bytes)
+);
+
+DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_account_extents,
+
+	TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+	TP_ARGS(rec)
+);
+
+DEFINE_EVENT(btrfs_qgroup_extent, btrfs_qgroup_insert_dirty_extent,
+
+	TP_PROTO(struct btrfs_qgroup_extent_record *rec),
+
+	TP_ARGS(rec)
+);
+
+TRACE_EVENT(btrfs_qgroup_account_extent,
+
+	TP_PROTO(u64 bytenr, u64 num_bytes, u64 nr_old_roots, u64 nr_new_roots),
+
+	TP_ARGS(bytenr, num_bytes, nr_old_roots, nr_new_roots),
+
+	TP_STRUCT__entry(
+		__field(u64, bytenr)
+		__field(u64, num_bytes)
+		__field(u64, nr_old_roots)
+		__field(u64, nr_new_roots)
+	),
+
+	TP_fast_assign(
+		__entry->bytenr = bytenr;
+		__entry->num_bytes = num_bytes;
+		__entry->nr_old_roots = nr_old_roots;
+		__entry->nr_new_roots = nr_new_roots;
+	),
+
+	TP_printk("bytenr = %llu, num_bytes = %llu, nr_old_roots = %llu, "
+		  "nr_new_roots = %llu",
+		  __entry->bytenr,
+		  __entry->num_bytes,
+		  __entry->nr_old_roots,
+		  __entry->nr_new_roots)
+);
+
+TRACE_EVENT(qgroup_update_counters,
+
+	TP_PROTO(u64 qgid, u64 cur_old_count, u64 cur_new_count),
+
+	TP_ARGS(qgid, cur_old_count, cur_new_count),
+
+	TP_STRUCT__entry(
+		__field(u64, qgid
[PATCH 0/3] btrfs: update qgroups in drop snapshot, V2
Hi,

The following 3 patches fix a regression introduced in Linux 4.2 where
btrfs_drop_snapshot() wasn't updating qgroups, resulting in them going bad.
The original e-mail pointing this out is below:

http://www.spinics.net/lists/linux-btrfs/msg46093.html

The first patch is from Josef and fixes bugs in our counting of roots
(which is critical for qgroups to work correctly). It was previously sent
to the list:

http://www.spinics.net/lists/linux-btrfs/msg47035.html

Truth be told, most of the time fixing this was spent figuring out that and
another issue (which has also been fixed). Once I realized I was seeing a
bug and we fixed it correctly, my drop snapshot patch got dramatically
smaller.

I also re-added some of the tracing in qgroup.c that we recently lost. It
is again possible to debug qgroup operations on a live system, allowing us
to find issues like the two above by narrowing down our operations and
manually walking through them via cat sys/debug/tracing.

The entire patch series can be tested via xfstests test btrfs/104.

Thanks,
	--Mark

Changes from V1, thanks to Qu for his comments:
- lock around call to btrfs_qgroup_insert_dirty_extent()
- check whether qgroups are enabled in account_leaf_items()
[PATCH 1/3] Btrfs: use btrfs_get_fs_root in resolve_indirect_ref
From: Josef Bacik

The backref code will look up the fs_root we're trying to resolve our
indirect refs for; unfortunately we use btrfs_read_fs_root_no_name, which
returns -ENOENT if the ref is 0. This isn't helpful for the qgroup stuff
with snapshot delete as it won't be able to search down the snapshot we are
deleting, which will cause us to miss roots. So use btrfs_get_fs_root and
send false for check_ref so we can always get the root we're looking for.
Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Mark Fasheh
---
 fs/btrfs/backref.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 9a2ec79..0e9da72 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -355,7 +355,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 
 	index = srcu_read_lock(&fs_info->subvol_srcu);
 
-	root = btrfs_read_fs_root_no_name(fs_info, &root_key);
+	root = btrfs_get_fs_root(fs_info, &root_key, false);
 	if (IS_ERR(root)) {
 		srcu_read_unlock(&fs_info->subvol_srcu, index);
 		ret = PTR_ERR(root);
--
2.1.2
Re: kernel BUG when fsync'ing file in a overlayfs merged dir, located on btrfs
On 9/30/15 3:57 PM, Roman Lebedev wrote:
> Hello.
>
> My / is btrfs.
> To do some my local stuff more cleanly i wanted to use overlayfs,
> but it didn't quite work.
>
> Simple non-automatic sequence to reproduce the issue:
> mkdir lower upper work merged
> mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
> vi merged/file
> :wq

Filipe and I got a chance to look into this today. The crash is due to
commit 4bacc9c9234 (overlayfs: Make f_path always point to the overlay and
f_inode to the underlay).

Incidentally, the test case is as simple as ":> file ; fsync file" after
mounting.

The short version is that after this commit, we see:

file->f_mapping->host = file->f_inode =
file->f_path.dentry->d_inode =

So now file_operations callbacks can't assume that file->f_path.dentry
belongs to the same file system that implements the callback. More than
that, any code that could ultimately get a dentry that comes from an open
file can't trust that it's from the same file system.

This crash is due to this issue. Unlike xfs and ext2/3/4, we use
file->f_path.dentry->d_inode to resolve the inode. Using file_inode() is an
easy enough fix here, but we run into trouble later. We have logic in the
btrfs fsync() call path (check_parent_dirs_for_sync) that walks back up the
dentry chain examining the inode's last transaction and last unlink
transaction to determine whether a full transaction commit is required.
This obviously doesn't work if we're walking the overlayfs path instead.

Regardless of any argument over whether that's doing the right thing, it's
a pretty common pattern to assume that file->f_path.dentry comes from the
same file system when using a file_operation. Is it intended that that
assumption is no longer valid?
-Jeff

> Results in vi being killed on exit, and the following trace appears in dmesg:
>
> [34304.047841] BUG: unable to handle kernel paging request at 09618e56
> [34304.047846] IP: [] btrfs_sync_file+0xa6/0x350 [btrfs]
> [34304.047864] PGD 0
> [34304.047866] Oops: 0002 [#12] SMP
> [34304.047867] Modules linked in: overlay cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc fglrx(PO) nls_utf8 joydev nls_cp437 vfat fat hid_generic usbhid kvm_amd hid kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi sha256_ssse3 sha256_generic snd_hda_intel snd_hda_codec hmac drbg ansi_cprng aesni_intel snd_hda_core aes_x86_64 mxm_wmi snd_hwdep lrw eeepc_wmi snd_pcm gf128mul asus_wmi sparse_keymap rfkill video snd_timer glue_helper sp5100_tco evdev e1000e ohci_pci pcspkr snd ohci_hcd xhci_pci edac_mce_amd ehci_pci serio_raw xhci_hcd soundcore fam15h_power ehci_hcd cryptd edac_core ptp pps_core usbcore k10temp i2c_piix4
> [34304.047893] sg usb_common acpi_cpufreq wmi tpm_infineon button processor shpchp tpm_tis tpm thermal_sys tcp_yeah tcp_vegas it87 hwmon_vid loop parport_pc ppdev lp parport autofs4 crc32c_generic btrfs xor raid6_pq sd_mod crc32c_intel ahci libahci libata scsi_mod
> [34304.047905] CPU: 4 PID: 13990 Comm: vi Tainted: P D O 4.2.0-1-amd64 #1 Debian 4.2.1-2
> [34304.047906] Hardware name: To be filled by O.E.M. To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/2015
> [34304.047908] task: 8803d5f7f2c0 ti: 8806a3ec8000 task.ti: 8806a3ec8000
> [34304.047909] RIP: 0010:[] [] btrfs_sync_file+0xa6/0x350 [btrfs]
> [34304.047920] RSP: 0018:8806a3ecbe88 EFLAGS: 00010246
> [34304.047921] RAX: 8803d5f7f2c0 RBX: 8807b2d46600 RCX: 81a6ad00
> [34304.047922] RDX: 8000 RSI: RDI: 8807c19f8970
> [34304.047923] RBP: 8807c19f8970 R08: R09: 0001
> [34304.047924] R10: R11: 0246 R12: 8807c19f88c8
> [34304.047925] R13: R14: 09618b22 R15: 55cb20184a70
> [34304.047926] FS: 7f31c5492800() GS:88082fd0() knlGS:
> [34304.047927] CS: 0010 DS: ES: CR0: 80050033
> [34304.047928] CR2: 09618e56 CR3: 00044af44000 CR4: 000406e0
> [34304.047929] Stack:
> [34304.047930]  0001 7fff 880403d5b918 8000
> [34304.047932]  55cb20186d40 8807b2d46600
> [34304.047933]  0004 88044b249000 0020 8807b2d46600
> [34304.047935] Call Trace:
> [34304.047939]  [] ? do_fsync+0x38/0x60
> [34304.047940]  [] ? SyS_fsync+0x10/0x20
> [34304.047943]  [] ? system_call_fast_compare_end+0xc/0x6b
> [34304.047944] Code: 49 8b 0f 48 85 c9 75 e9 eb b3 48 8b 44 24 08 49 8d ac 24 a8 00 00 00 48 89 ef 4c 29 e8 48 83 c0 01 48 89 44 24 18 e8 3a 59 3e e1 41 ff 86 34 03 00 00 49 8b 84 24 70 ff ff
Re: Btrfs/RAID5 became unmountable after SATA cable fault
Duncan wrote:

Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as excerpted:

On 2015-11-04 13:01, Janos Toth F. wrote:

But the worst part is that there are some ISO files which were seemingly copied without errors but their external checksums (the one which I can calculate with md5sum and compare to the one supplied by the publisher of the ISO file) don't match! Well... this, I cannot understand. How could these files become corrupt from a single disk failure? And more importantly: how could these files be copied without errors? Why didn't Btrfs give a read error when the checksums didn't add up?

If you can prove that there was a checksum mismatch and BTRFS returned invalid data instead of a read error or going to the other disk, then that is a very serious bug that needs to be fixed. You need to keep in mind also however that it's completely possible that the data was bad before you wrote it to the filesystem, and if that's the case, there's nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned off checksumming with nodatasum or nocow, then it's almost certainly returning the data it was given to write out in the first place. Whether that data it was given to write out was correct, however, is an /entirely/ different matter.

If ISOs are failing their external checksums, then something is going on. Had you verified the external checksums when you first got the files? That is, are you sure the files were correct as downloaded and/or ripped? Where were the ISOs stored between original procurement/validation and writing to btrfs? Is it possible you still have some/all of them on that media? Do they still external-checksum-verify there?

Basically, assuming btrfs checksums are validating, there are three other likely possibilities for where the corruption could have come from before writing to btrfs.
Either the files were bad as downloaded or otherwise procured -- which is why I asked whether you verified them upon receipt -- or you have memory that's going bad, or your temporary storage is going bad, before the files ever got written to btrfs. The memory going bad is a particularly worrying possibility, considering...

Now I am really considering to move from Linux to Windows and from Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new motherboard with more native SATA connectors and an extra HDD). That one seemed to actually do what it promises (abort any read operations upon checksum errors [which always happens seamlessly on every read] but look at the redundant data first and seamlessly "self-heal" if possible). The only thing which made Btrfs look like a better alternative was the RAID-5 support. But I recently experienced two cases of 1 drive failing of 3 and it always turned out as a smaller or bigger disaster (completely lost data or inconsistent data).

Have you considered looking into ZFS? I hate to suggest it as an alternative to BTRFS, but it's a much more mature and well tested technology than ReFS, and has many of the same features as BTRFS (and even has the option for triple parity instead of the double you get with RAID6). If you do consider ZFS, make a point to look at FreeBSD in addition to the Linux version; the BSD one was a much better written port of the original Solaris drivers, and has better performance in many cases (and as much as I hate to admit it, BSD is way more reliable than Linux in most use cases).

You should also seriously consider whether the convenience of having a filesystem that fixes internal errors itself with no user intervention is worth the risk of it corrupting your data. Returning correct data whenever possible is one thing, being 'self-healing' is completely different.
When you start talking about things that automatically fix internal errors without user intervention is when most seasoned system administrators start to get really nervous. Self-correcting systems have just as much chance to make things worse as they do to make things better, and most of them depend on the underlying hardware working correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one related smaller one!

The people who have a lot of ZFS experience say it's generally quite reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*! The self-healing works well, *PROVIDED* memory isn't producing errors. Absolutely reliable memory is in fact *so* critical, that running ZFS on non-ECC memory is severely discouraged as a very real risk to your data.

Which is why the above hints that your memory may be bad are so worrying. Don't even *THINK* about ZFS, particularly its self-healing features, if you're not absolutely sure your memory is 100% reliable, because apparently, based on the comments I've seen, if it's not, you
Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
On Fri, Nov 06, 2015 at 09:02:13AM +0800, Qu Wenruo wrote:
> > The same exact code ran in either case before and after your patches, so
> > my guess is that the issue is actually inside the qgroup code that
> > shouldn't have been run. I wonder if we even just filled up his memory
> > but never cleaned the objects. The only other thing I can think of is if
> > account_leaf_items() got run in a really tight loop for some reason.
> >
> > Kmalloc in the way we are using it is not usually a performance issue,
> > especially if we've been reading off disk in the same process. Ask
> > yourself this - your own patch series does the same kmalloc for every
> > qgroup operation. Did you notice a complete and massive performance
> > slowdown like the one Stefan reported?
>
> You're right, such memory allocation may impact performance but not
> so noticeably, compared to other operations which may kick disk IO,
> like btrfs_find_all_roots().
>
> But at least, enabling qgroup will impact performance.
>
> Yeah, this time I have test data now.
> In an environment with 100 different snapshots, sysbench shows an
> overall performance drop of about 5%, and in some cases up to 7%, with
> qgroup enabled.
>
> Not sure about the kmalloc impact, maybe less than 1% or maybe 2~3%,
> but at least it's worth trying to use kmem cache.

Ok cool, what'd you do to generate the snapshots? I can try a similar test
on one of my machines and see what I get. I'm not surprised that the
overhead is noticeable, and I agree it's easy enough to try things like
replacing the allocation once we have a test going.

Thanks,
	--Mark

--
Mark Fasheh
Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
Mark Fasheh wrote on 2015/11/05 11:23 -0800:

On Wed, Nov 04, 2015 at 09:01:36AM +0800, Qu Wenruo wrote:

Mark Fasheh wrote on 2015/11/03 11:26 -0800:

On Mon, Nov 02, 2015 at 09:34:24AM +0800, Qu Wenruo wrote:

Stefan Priebe wrote on 2015/11/01 21:49 +0100:

Hi,

this one: http://www.spinics.net/lists/linux-btrfs/msg47377.html

adds a regression to my test systems with very large disks (30tb and 50tb).

btrfs balance is super slow afterwards while heavily making use of cp --reflink=always on big files (200gb - 500gb).

Sorry didn't know how to correctly reply to that "old" message.

Greets,
Stefan

Thanks for the testing.

Are you using qgroup or just doing normal balance with qgroup disabled?

For the latter case, that should be optimized to skip the dirty extent insert in the qgroup disabled case.

For the qgroup enabled case, I'm afraid that's the design. As relocation will drop a subtree to relocate, and to ensure qgroup consistency, we must walk down all the tree blocks and mark them dirty for later qgroup accounting.

Qu, we're always going to have to walk the tree when deleting it, this is part of removing a subvolume. We've walked shared subtrees in this code for numerous kernel releases without incident before it was removed in 4.2.

Do you have any actual evidence that this is a major performance regression? From our previous conversations you seemed convinced of this, without even having a working subtree walk to test. I remember the hand wringing about an individual commit being too heavy with the qgroup code (even though I pointed out that tree walk is a restartable transaction).

It seems that you are confused still about how we handle removing a volume wrt qgroups.
If you have questions or concerns I would be happy to explain them but IMHO your statements there are opinion and not based in fact.

Yes, I don't deny it. But it's quite hard to prove it, as we need such huge storage as Stefan has. What I have is only several hundred GB of test storage. Even counting all of my home NAS, I only have 2T, far from the storage Stefan has.

And what Stefan reports should already give some hint about the performance issue.

In your words, "it won't be doing anything (ok some kmalloc/free of a very tiny object)", it's already slowing down balance, since balance also uses btrfs_drop_subtree().

When I wrote that I was under the impression that the qgroup code was doing its own sanity checking (it used to) and since Stefan had them disabled they couldn't be causing the problem. I read your e-mail explaining that the qgroup API was now intertwined with delayed ref locking after this one.

My fault, as btrfs_qgroup_mark_extent_dirty() is an exception which doesn't have the qgroup status check and depends on existing locks.

The same exact code ran in either case before and after your patches, so my guess is that the issue is actually inside the qgroup code that shouldn't have been run. I wonder if we even just filled up his memory but never cleaned the objects. The only other thing I can think of is if account_leaf_items() got run in a really tight loop for some reason.

Kmalloc in the way we are using it is not usually a performance issue, especially if we've been reading off disk in the same process. Ask yourself this - your own patch series does the same kmalloc for every qgroup operation. Did you notice a complete and massive performance slowdown like the one Stefan reported?

You're right, such memory allocation may impact performance but not so noticeably, compared to other operations which may kick disk IO, like btrfs_find_all_roots().

But at least, enabling qgroup will impact performance.

Yeah, this time I have test data now.
In an environment with 100 different snapshots, sysbench shows an overall performance drop of about 5%, and in some cases up to 7%, with qgroup enabled.

Not sure about the kmalloc impact, maybe less than 1% or maybe 2~3%, but at least it's worth trying to use kmem cache.

I will say that we never had this problem reported before, and account_leaf_items() is always run in all kernels, even without qgroups enabled. That will change with my new patch though.

What we can say for sure is that drop_snapshot in the qgroup case will read more disk and obviously that will have a negative impact depending on what the tree looks like. So IMHO we ought to be focusing on reducing the amount of I/O involved.

Totally agree.

Thanks,
Qu

But we can't just ignore such a "possible" performance issue just because the old code did the same thing. (Although not the same now, we're marking all subtree blocks dirty other than shared ones.)

Well, I can't disagree with that - the only reason we are talking right now is because you intentionally ignored the qgroup code in drop_snapshot(). So
Re: Regression in: [PATCH 4/4] btrfs: qgroup: account shared subtree during snapshot delete
Mark Fasheh wrote on 2015/11/05 19:15 -0800:

On Fri, Nov 06, 2015 at 09:02:13AM +0800, Qu Wenruo wrote:

The same exact code ran in either case before and after your patches, so my guess is that the issue is actually inside the qgroup code that shouldn't have been run. I wonder if we even just filled up his memory but never cleaned the objects. The only other thing I can think of is if account_leaf_items() got run in a really tight loop for some reason.

Kmalloc in the way we are using it is not usually a performance issue, especially if we've been reading off disk in the same process. Ask yourself this - your own patch series does the same kmalloc for every qgroup operation. Did you notice a complete and massive performance slowdown like the one Stefan reported?

You're right, such memory allocation may impact performance but not so noticeably, compared to other operations which may kick disk IO, like btrfs_find_all_roots().

But at least, enabling qgroup will impact performance.

Yeah, this time I have test data now. In an environment with 100 different snapshots, sysbench shows an overall performance drop of about 5%, and in some cases up to 7%, with qgroup enabled.

Not sure about the kmalloc impact, maybe less than 1% or maybe 2~3%, but at least it's worth trying to use kmem cache.

Ok cool, what'd you do to generate the snapshots? I can try a similar test on one of my machines and see what I get. I'm not surprised that the overhead is noticeable, and I agree it's easy enough to try things like replacing the allocation once we have a test going.

Thanks,
	--Mark

Doing fsstress in a subvolume with 4 threads, creating a snapshot of that subvolume about every 5 seconds. And doing sysbench inside the 50th snapshot.

Such a test takes both the overhead of btrfs_find_all_roots() and kmalloc(), so I'm not sure which overhead is bigger.
Thanks,
Qu

> --
> Mark Fasheh
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG when fsync'ing file in an overlayfs merged dir, located on btrfs
On 11/5/15 10:18 PM, Al Viro wrote:
> On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote:
>
>> So now file_operations callbacks can't assume that file->f_path.dentry belongs to the same file system that implements the callback. More than that, any code that could ultimately get a dentry that comes from an open file can't trust that it's from the same file system.
>
> Use file_inode() for inode.
>
>> This crash is due to this issue. Unlike xfs and ext2/3/4, we use file->f_path.dentry->d_inode to resolve the inode. Using file_inode() is an easy enough fix here, but we run into trouble later. We have logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that walks back up the dentry chain examining the inode's last transaction and last unlink transaction to determine whether a full transaction commit is required. This obviously doesn't work if we're walking the overlayfs path instead. Regardless of any argument over whether that's doing the right thing, it's a pretty common pattern to assume that file->f_path.dentry comes from the same file system when using a file_operation. Is it intended that that assumption is no longer valid?
>
> It's actually rare, and your example is a perfect demonstration of the reasons why it is so rare. What's to protect btrfs_log_dentry_safe() from racing with rename(2)? Sure, you do dget_parent(). Which protects you from having one-time parent dentry freed under you. What it doesn't do is make any promises about its relationship with your file.

I suppose the irony here is that, AFAIK, that code is to ensure a file doesn't get lost between transactions due to rename.

Isn't the file->f_path.dentry relationship stable otherwise, though? The name might change and the parent might change, but the dentry that the file points to won't. I did find a few other places where that assumption happens without any questionable traversals.
Sure, all three are in file systems unlikely to be used with overlayfs.

ocfs2_prepare_inode_for_write uses file->f_path.dentry for should_remove_suid (due to needing to do it early, since cluster locking is unknown in setattr, according to the commit). Having should_remove_suid operate on an inode would solve that easily.

fat_ioctl_set_attributes uses it to call fat_setattr, but that only uses the inode and could have the inode_operation use a wrapper.

cifs_new_fileinfo keeps a reference to the dentry, but it seems to be used mostly to access the inode, except for the nasty-looking call to build_path_from_dentry in cifs_reopen_file, which I won't be touching. That does look like a questionable traversal, especially with the "we can't take the rename lock here" comment.

-Jeff

--
Jeff Mahoney
Re: kernel BUG when fsync'ing file in an overlayfs merged dir, located on btrfs
On Thu, Nov 05, 2015 at 09:57:35PM -0500, Jeff Mahoney wrote:

> So now file_operations callbacks can't assume that file->f_path.dentry belongs to the same file system that implements the callback. More than that, any code that could ultimately get a dentry that comes from an open file can't trust that it's from the same file system.

Use file_inode() for inode.

> This crash is due to this issue. Unlike xfs and ext2/3/4, we use file->f_path.dentry->d_inode to resolve the inode. Using file_inode() is an easy enough fix here, but we run into trouble later. We have logic in the btrfs fsync() call path (check_parent_dirs_for_sync) that walks back up the dentry chain examining the inode's last transaction and last unlink transaction to determine whether a full transaction commit is required. This obviously doesn't work if we're walking the overlayfs path instead. Regardless of any argument over whether that's doing the right thing, it's a pretty common pattern to assume that file->f_path.dentry comes from the same file system when using a file_operation. Is it intended that that assumption is no longer valid?

It's actually rare, and your example is a perfect demonstration of the reasons why it is so rare. What's to protect btrfs_log_dentry_safe() from racing with rename(2)? Sure, you do dget_parent(). Which protects you from having one-time parent dentry freed under you. What it doesn't do is make any promises about its relationship with your file.
Re: [GIT PULL] Btrfs fixes
On Thu, Nov 05, 2015 at 11:20:37AM +, fdman...@kernel.org wrote:
> From: Filipe Manana
>
> Hi Chris,
>
> please consider the following fixes for the 4.4 merge window (they were all previously sent to the mailing list already).
>
> One fixes a sleep inside atomic context issue, introduced by one patch in the integration-4.4 branch. Another two fix races regarding waiting for the qgroup rescan worker to finish and a race between the qgroup rescan worker and unmounting the filesystem (leading to crashes). The remaining patch fixes an issue with partial direct IO writes, which was introduced in the 4.0 kernel and results either in an assertion failure (BUG_ON) when CONFIG_BTRFS_ASSERT=y or arithmetic underflow of an inode's outstanding extents counter (used for proper space reservation) when assertions are disabled.
>
> Two test cases for fstests were sent recently to cover the issues regarding the races and the direct IO partial write regression.

Great, thanks Filipe. I'll send these for my second pull (probably next Wed).

-chris