btrfs raid10 performance
I am running a single btrfs RAID10 volume of eight LUKS devices, each using a 2TB SATA hard drive as a backing store. The SATA drives are a mixture of Seagate and Western Digital drives, some with RPMs ranging from 5400 to 7200. Each seems to individually performance test where I would expect for drives of this caliber. They are all attached to an LSI PCIe SAS controller and configured in JBOD. I have a relatively beefy quad core Xeon CPU that supports AES-NI and don't think LUKS is my bottleneck. Here's some info from the resulting filesystem: btrfs fi df /storage Data, RAID10: total=6.30TiB, used=6.29TiB System, RAID10: total=8.00MiB, used=560.00KiB Metadata, RAID10: total=9.00GiB, used=7.64GiB GlobalReserve, single: total=512.00MiB, used=0.00B In general I see good performance, especially read performance which is enough to regularly saturate my gigabit network when copying files from this host via samba. Reads are definitely taking advantage of the multiple copies of data available and spreading the load among all drives. Writes aren't quite as rosy, however. When writing files using dd like in this example: dd if=/dev/zero of=tempfile bs=1M count=10240 conv=fdatasync,notrun c status=progress And running a command like: iostat -m 1 to monitor disk I/O, writes seem to only focus on one of the eight disks at a time, moving from one drive to the next. This results in a sustained 55-90 MB/sec throughput depending on which disk is being written to (remember, some have faster spindle speed than others). Am I wrong to expect btrfs' RAID10 mode to write to multiple disks simultaneously and to break larger writes into smaller stripes across my four pairs of disks? I had trouble identifying whether btrfs RAID10 is writing (64K?) stripes or (1GB?) blocks to disk in this mode. The latter might make more sense based upon what I'm seeing? Anything else I should be trying to narrow down the bottleneck? Thanks! -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Use iocb to derive pos instead of passing a separate parameter
On 2018/06/26 1:20, David Sterba wrote: > On Mon, Jun 25, 2018 at 01:58:58PM +0900, Misono Tomohiro wrote: >> So, this is the updated version of >> https://patchwork.kernel.org/patch/10063039/ >> >> This time xfstest is ok and >> Reviewed-by: Misono Tomohiro > > Your comment about invalidate_mapping_pages is also ok, right? As > filemap_fdatawait_range and invalidate_mapping_pages use the same > start/end of the range. > This time local variable 'pos' is kept to have the same value before and invalidate_mapping_pages() uses it, so it should be ok. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Use iocb to derive pos instead of passing a separate parameter
On 06-25 18:20, David Sterba wrote: > On Mon, Jun 25, 2018 at 01:58:58PM +0900, Misono Tomohiro wrote: > > So, this is the updated version of > > https://patchwork.kernel.org/patch/10063039/ > > > > This time xfstest is ok and > > Reviewed-by: Misono Tomohiro Yes, thats right. > > Your comment about invalidate_mapping_pages is also ok, right? As > filemap_fdatawait_range and invalidate_mapping_pages use the same > start/end of the range. I did not mess around with other functions which are affected by iocb->ki_pos (as opposed to local pos). So, this should be safe with respect to invalidate_mapping_pages(). -- Goldwyn -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still
On Mon, Jun 25, 2018 at 01:07:10PM -0400, Austin S. Hemmelgarn wrote: > > - mount -o recovery still hung > > - mount -o ro did not hang though > One tip here specifically, if you had to reboot during a balance and the FS > hangs when it mounts, try mounting with `-o skip_balance`. That should > pause the balance instead of resuming it on mount, at which point you should > also be able to cancel it without it hanging. Very good tip, I have this in all my mountpoints :) #LABEL=dshelf2 /mnt/btrfs_pool2 btrfs defaults,compress=lzo,skip_balance,noatime Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] fs: add initial bh_result->b_private value to __blockdev_direct_IO()
On Mon, May 14, 2018 at 06:35:48PM +0200, David Sterba wrote: > On Fri, May 11, 2018 at 01:30:01PM -0700, Omar Sandoval wrote: > > On Fri, May 11, 2018 at 09:05:38PM +0100, Al Viro wrote: > > > On Thu, May 10, 2018 at 11:30:10PM -0700, Omar Sandoval wrote: > > > > do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode, > > > > struct block_device *bdev, struct iov_iter *iter, > > > > get_block_t get_block, dio_iodone_t end_io, > > > > - dio_submit_t submit_io, int flags) > > > > + dio_submit_t submit_io, int flags, void *private) > > > > > > Oh, dear... That's what, 9 arguments? I agree that the hack in question > > > is obscene, but so is this ;-/ > > > > So looking at these one by one, obviously needed: > > > > - iocb > > - inode > > - iter > > > > bdev is almost always inode->i_sb->s_bdev, except for Btrfs :( > > > > These could _maybe_ go in struct kiocb: > > > > - flags could maybe be folded into ki_flags > > - private could maybe go in iocb->private, but I haven't yet read > > through to figure out if we're already using iocb->private for direct > > I/O > > I think the kiocb::private can be used for the purpose. There's only one > user, ext4, that also passes some DIO data around so it would in line > with the interface AFAICS. Omar, do you have an update to the patchset? Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still
On 2018-06-25 12:07, Marc MERLIN wrote: On Tue, Jun 19, 2018 at 12:58:44PM -0400, Austin S. Hemmelgarn wrote: In your situation, I would run "btrfs pause ", wait to hear from a btrfs developer, and not use the volume whatsoever in the meantime. I would say this is probably good advice. I don't really know what's going on here myself actually, though it looks like the balance got stuck (the output hasn't changed for over 36 hours, unless you've got an insanely slow storage array, that's extremely unusual (it should only be moving at most 3GB of data per chunk)). I didn't hear from any developer, so I had to continue. - btrfs scrub cancel did not work (hang) - at reboot mounting the filesystem hung, even with 4.17, which is disappointing (it should not hang) - mount -o recovery still hung - mount -o ro did not hang though One tip here specifically, if you had to reboot during a balance and the FS hangs when it mounts, try mounting with `-o skip_balance`. That should pause the balance instead of resuming it on mount, at which point you should also be able to cancel it without it hanging. Sigh, why is my FS corrupted again? Anyway, back to btrfs check --repair and, it took all my 32GB of RAM on a system I can't add more RAM to, so I'm hosed. I'll note in passing (and it's not ok at all) that check --repair after a 20 to 30mn pause, takes all the kernel RAM more quickly than the system can OOM or log anything, and just deadlocks it. This is repeateable and totally not ok :( I'm now left with btrfs-progs git master, and lowmem which finally does a bit of repair. So far: gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 enabling repair mode WARNING: low-memory mode repair support is only partial Checking filesystem on /dev/mapper/dshelf2 UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d Fixed 0 roots. ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4 Created new chunk [18457780224000 1073741824] Delete backref in extent [84302495744 69632] ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4 Delete backref in extent [84302495744 69632] ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240 Delete backref in extent [125712527360 12214272] ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115 Delete backref in extent [125730848768 5111808] ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115 Delete backref in extent [125730848768 5111808] ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143 Delete backref in extent [125736914944 6037504] ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143 Delete backref in extent [125736914944 6037504] ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431 Delete backref in extent [129952120832 20242432] ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433 Delete backref in extent [129952120832 20242432] ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240 Delete backref in extent [134925357056 11829248] ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240 Delete backref in extent [134925357056 11829248] ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249 Delete backref in extent [147895111680 12345344] ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251 Delete backref in extent [147895111680 12345344] ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418 Delete backref in extent [150850146304 17522688] ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449 Deleted root 2 item[156909494272, 178, 5476627808561673095] ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452 Deleted root 2 item[156909494272, 178, 7338474132555182983] At the rate it's going, it'll probably take days though, it's already been 36H Marc -- To unsubscribe from this list: send the line "unsubscribe
[PATCH v2] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
The vm_fault_t conversion commit introduced a ret2 variable for tracking the integer return values from internal btrfs functions. It was sometimes returning VM_FAULT_LOCKED for pages that were actually invalid and had been removed from the radix. Something like this: ret2 = btrfs_delalloc_reserve_space() // returns zero on success lock_page(page) if (page->mapping != inode->i_mapping) goto out_unlock; ... out_unlock: if (!ret2) { ... return VM_FAULT_LOCKED; } This ends up triggering this WARNING in btrfs_destroy_inode() WARN_ON(BTRFS_I(inode)->block_rsv.size); xfstests generic/095 was able to reliably reproduce the errors. Since out_unlock: is only used for errors, this fix moves it below the if (!ret2) check we use to return VM_FAULT_LOCKED for success. Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t) Signed-off-by: Chris Mason --- Changes since v1: don't set the vmfault_t 'ret' to zero, just move our goto taret around around instead. fs/btrfs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 193f933..63f713a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9036,13 +9036,14 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) unlock_extent_cached(io_tree, page_start, page_end, _state); -out_unlock: if (!ret2) { btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true); sb_end_pagefault(inode->i_sb); extent_changeset_free(data_reserved); return VM_FAULT_LOCKED; } + +out_unlock: unlock_page(page); out: btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, (ret != 0)); -- 2.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unsolvable technical issues?
On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote: [snip] > I hope I am not asking for too much (but I know I probably am), but > I suggest that having a small snippet of information on the status > page showing a little bit about what is either currently the > development focus , or what people are known for working at would be > very valuable for users and it may of course work both ways, such as > exciting people or calming them down. ;) > > For example something simple like a "development focus" list... > 2018-Q4: (planned) Renaming the grotesque "RAID" terminology > 2018-Q3: (planned) Magical feature X > 2018-Q2: N-Way mirroring > 2018-Q1: Feature work "RAID"5/6 > > I think it would be good for people living their lives outside as it > would perhaps spark some attention from developers and perhaps even > media as well. I started doing this a couple of years ago, but it turned out to be impossible to keep even vaguely accurate or up to date, without going round and bugging the developers individually on a per-release basis. I don't think it's going to happen. Hugo. -- Hugo Mills | emacs: Emacs Makes A Computer Slow. hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 | signature.asc Description: Digital signature
Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still
On Mon, Jun 25, 2018 at 06:24:37PM +0200, Hans van Kranenburg wrote: > >> output hasn't changed for over 36 hours, unless you've got an insanely slow > >> storage array, that's extremely unusual (it should only be moving at most > >> 3GB of data per chunk)). > > > > I didn't hear from any developer, so I had to continue. > > - btrfs scrub cancel did not work (hang) > > Did you mean balance cancel? It waits until the current block group is > finished. Yes, I meant that, thanks for correcting me. And you're correct that because it was hung, cancel wasn't going to go anywhere. At least my filesystem was still working at the time (as in IO was going on just fine) > > - at reboot mounting the filesystem hung, even with 4.17, which is > > disappointing (it should not hang) > > - mount -o recovery still hung > > - mount -o ro did not hang though > > > > Sigh, why is my FS corrupted again? > > Again? Do you think balance is corrupting the filesystem? Or have there > been previous btrfs check --repair operations which made smaller > problems bigger in the past? Honestly, I don't fully remember at this point, I keep notes, but not detailled enough and it's been a little while. I know I've had to delete/recreate this filesystem twice already over the last years, but I'm not fully certain I remember when this one was last wiped. Yes, I do run balance along with scrub once a month: btrfs balance start -musage=0 -v $mountpoint 2>&1 | grep -Ev "$FILTER" # After metadata, let's do data: btrfs balance start -dusage=0 -v $mountpoint 2>&1 | grep -Ev "$FILTER" btrfs balance start -dusage=20 -v $mountpoint 2>&1 | grep -Ev "$FILTER" echo btrfs scrub start -Bd $mountpoint ionice -c 3 nice -10 btrfs scrub start -Bd $mountpoint Hard to say if balance has damaged my filesystem over time, but it's definitely possible. > Am I right to interpret the messages below, and see that you have > extents that are referenced hundreds of times? I'm not certain, but it's a backup server with many blocks that are the same, so it could be some COW stuff, even if I didn't run any dedupe commands myself. > Is there heavy snapshotting or deduping going on in this filesystem? If > so, it's not surprising balance will get a hard time moving extents > around, since it has to update all of the metadata for each extent again > in hundreds of places. There is some snapshotting, but maybe around 20 or so per subvolume, not hundreds. > Did you investigate what balance was doing if it takes long? Is is using > cpu all the time, or is it reading from disk slowly (random reads) or is > it writing to disk all the time at full speed? I couldn't see what it was doing, but it's running in the kernel, is it not? (or can you just strace the user space command?) Either way, it's too late for that now, and given that it didn't make progress of a single block in 36H, I'm assuming it was well deadlocked. Thanks for the reply. Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unsolvable technical issues?
David Sterba wrote: On Fri, Jun 22, 2018 at 01:13:31AM +0200, waxhead wrote: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 , section 1.2 It claims that BTRFS still have significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS biggest technical problems? The subject you write is 'unsolvable', which I read as 'impossible to solve', eg. on the design level. I'm not aware of such issues. Alright , so I interpret this as there is no showstopper regarding implementation of existing and planned features... If this is about issues that are difficult either to implement or getting right, there are a few known ones. Alright again, and I interpret this as there might be some code that is not flexible enough and changing that might affect working / stable parts of the code so therefore other solutions are looked at which is not that uncommon for software. Apart from not listing the known issues I think I got my questions answered :) and now it is perhaps finally appropriate to file a request at the Stratis bugtracker to ask what specifically they are referring to. If you forget about the "RAID"5/6 like features then the only annoyances that I have with BTRFS so far is... 1. Lack of per subvolume "RAID" levels 2. Lack of not using the deviceid to re-discover and re-add dropped devices And that's about it really... This could quickly turn into 'my faviourite bug/feature' list that can be very long. The most asked for are raid56, and performance of qgroups. Qu Wenruo improved some of the core problems and Jeff is working on the performance problem. So there are people working on that. On the raid56 front, there were some recent updates that fixed some bugs, but the fix for write hole is still missing so we can't raise the status yet. I have some some good news but nobody should get too excited until the code lands. I have prototype for the N-copy raid (where N is 3 or 4). This will provide the underlying infrastructure for the raid5/6 logging mechanism, the rest can be taken from Liu Bo's patchset sent some time ago. In the end the N-copy can be used for data and metadata too, independently and flexibly switched via the balance filters. This will cost one incompatibility bit. I hope I am not asking for too much (but I know I probably am), but I suggest that having a small snippet of information on the status page showing a little bit about what is either currently the development focus , or what people are known for working at would be very valuable for users and it may of course work both ways, such as exciting people or calming them down. ;) For example something simple like a "development focus" list... 2018-Q4: (planned) Renaming the grotesque "RAID" terminology 2018-Q3: (planned) Magical feature X 2018-Q2: N-Way mirroring 2018-Q1: Feature work "RAID"5/6 I think it would be good for people living their lives outside as it would perhaps spark some attention from developers and perhaps even media as well. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: remove warnings superseded by refcount_t usage
There are several WARN_ON calls that catch incrorrect reference counter updates, but this is what the refcount_t does already: * refcount_inc from 0 will warn * refcount_dec_and_test from 0 will warn Signed-off-by: David Sterba --- fs/btrfs/delayed-ref.h | 1 - fs/btrfs/extent_map.c | 2 +- fs/btrfs/transaction.c | 1 - fs/btrfs/volumes.c | 1 - 4 files changed, 1 insertion(+), 4 deletions(-) diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index ea1aecb6a50d..16ee92c541bf 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -210,7 +210,6 @@ btrfs_free_delayed_extent_op(struct btrfs_delayed_extent_op *op) static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref) { - WARN_ON(refcount_read(>refs) == 0); if (refcount_dec_and_test(>refs)) { WARN_ON(ref->in_tree); switch (ref->type) { diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 6648d55e5339..6808ad4ed3c9 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -72,7 +72,7 @@ void free_extent_map(struct extent_map *em) { if (!em) return; - WARN_ON(refcount_read(>refs) == 0); + if (refcount_dec_and_test(>refs)) { WARN_ON(extent_map_in_tree(em)); WARN_ON(!list_empty(>list)); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 4485eae41e88..f6b68099b767 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -41,7 +41,6 @@ static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = { void btrfs_put_transaction(struct btrfs_transaction *transaction) { - WARN_ON(refcount_read(>use_count) == 0); if (refcount_dec_and_test(>use_count)) { BUG_ON(!list_empty(>list)); WARN_ON(!RB_EMPTY_ROOT(>delayed_refs.href_root)); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e034ad9e23b4..f64d26b4f278 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5330,7 +5330,6 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes) void btrfs_get_bbio(struct btrfs_bio *bbio) { - WARN_ON(!refcount_read(>refs)); refcount_inc(>refs); } -- 2.17.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still
On 06/25/2018 06:07 PM, Marc MERLIN wrote: > On Tue, Jun 19, 2018 at 12:58:44PM -0400, Austin S. Hemmelgarn wrote: >>> In your situation, I would run "btrfs pause ", wait to hear from >>> a btrfs developer, and not use the volume whatsoever in the meantime. >> I would say this is probably good advice. I don't really know what's going >> on here myself actually, though it looks like the balance got stuck (the >> output hasn't changed for over 36 hours, unless you've got an insanely slow >> storage array, that's extremely unusual (it should only be moving at most >> 3GB of data per chunk)). > > I didn't hear from any developer, so I had to continue. > - btrfs scrub cancel did not work (hang) Did you mean balance cancel? It waits until the current block group is finished. > - at reboot mounting the filesystem hung, even with 4.17, which is > disappointing (it should not hang) > - mount -o recovery still hung > - mount -o ro did not hang though > > Sigh, why is my FS corrupted again? Again? Do you think balance is corrupting the filesystem? Or have there been previous btrfs check --repair operations which made smaller problems bigger in the past? > Anyway, back to > btrfs check --repair > and, it took all my 32GB of RAM on a system I can't add more RAM to, so > I'm hosed. I'll note in passing (and it's not ok at all) that check > --repair after a 20 to 30mn pause, takes all the kernel RAM more quickly > than the system can OOM or log anything, and just deadlocks it. > This is repeateable and totally not ok :( > > I'm now left with btrfs-progs git master, and lowmem which finally does > a bit of repair. > So far: > gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 > enabling repair mode > WARNING: low-memory mode repair support is only partial > Checking filesystem on /dev/mapper/dshelf2 > UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d > Fixed 0 roots. Am I right to interpret the messages below, and see that you have extents that are referenced hundreds of times? Is there heavy snapshotting or deduping going on in this filesystem? If so, it's not surprising balance will get a hard time moving extents around, since it has to update all of the metadata for each extent again in hundreds of places. Did you investigate what balance was doing if it takes long? Is is using cpu all the time, or is it reading from disk slowly (random reads) or is it writing to disk all the time at full speed? K > ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, > owner: 374857, offset: 3407872) wanted: 3, have: 4 > Created new chunk [18457780224000 1073741824] > Delete backref in extent [84302495744 69632] > ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, > owner: 374857, offset: 3407872) wanted: 3, have: 4 > Delete backref in extent [84302495744 69632] > ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, > owner: 374857, offset: 114540544) wanted: 181, have: 240 > Delete backref in extent [125712527360 12214272] > ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, > owner: 374857, offset: 126754816) wanted: 68, have: 115 > Delete backref in extent [125730848768 5111808] > ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, > owner: 374857, offset: 126754816) wanted: 68, have: 115 > Delete backref in extent [125730848768 5111808] > ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, > owner: 374857, offset: 131866624) wanted: 115, have: 143 > Delete backref in extent [125736914944 6037504] > ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, > owner: 374857, offset: 131866624) wanted: 115, have: 143 > Delete backref in extent [125736914944 6037504] > ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, > owner: 374857, offset: 148234240) wanted: 302, have: 431 > Delete backref in extent [129952120832 20242432] > ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, > owner: 374857, offset: 148234240) wanted: 356, have: 433 > Delete backref in extent [129952120832 20242432] > ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, > owner: 374857, offset: 180371456) wanted: 161, have: 240 > Delete backref in extent [134925357056 11829248] > ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, > owner: 374857, offset: 180371456) wanted: 162, have: 240 > Delete backref in extent [134925357056 11829248] > ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, > owner: 374857, offset: 192200704) wanted: 170, have: 249 > Delete backref in extent [147895111680 12345344] > ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, > owner: 374857, offset: 192200704) wanted: 172, have: 251 > Delete backref in extent [147895111680 12345344] > ERROR: extent[150850146304,
Re: [PATCH] btrfs: Use iocb to derive pos instead of passing a separate parameter
On Mon, Jun 25, 2018 at 01:58:58PM +0900, Misono Tomohiro wrote: > So, this is the updated version of > https://patchwork.kernel.org/patch/10063039/ > > This time xfstest is ok and > Reviewed-by: Misono Tomohiro Your comment about invalidate_mapping_pages is also ok, right? As filemap_fdatawait_range and invalidate_mapping_pages use the same start/end of the range. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still
On Tue, Jun 19, 2018 at 12:58:44PM -0400, Austin S. Hemmelgarn wrote: > > In your situation, I would run "btrfs pause ", wait to hear from > > a btrfs developer, and not use the volume whatsoever in the meantime. > I would say this is probably good advice. I don't really know what's going > on here myself actually, though it looks like the balance got stuck (the > output hasn't changed for over 36 hours, unless you've got an insanely slow > storage array, that's extremely unusual (it should only be moving at most > 3GB of data per chunk)). I didn't hear from any developer, so I had to continue. - btrfs scrub cancel did not work (hang) - at reboot mounting the filesystem hung, even with 4.17, which is disappointing (it should not hang) - mount -o recovery still hung - mount -o ro did not hang though Sigh, why is my FS corrupted again? Anyway, back to btrfs check --repair and, it took all my 32GB of RAM on a system I can't add more RAM to, so I'm hosed. I'll note in passing (and it's not ok at all) that check --repair after a 20 to 30mn pause, takes all the kernel RAM more quickly than the system can OOM or log anything, and just deadlocks it. This is repeateable and totally not ok :( I'm now left with btrfs-progs git master, and lowmem which finally does a bit of repair. So far: gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2 enabling repair mode WARNING: low-memory mode repair support is only partial Checking filesystem on /dev/mapper/dshelf2 UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d Fixed 0 roots. ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, owner: 374857, offset: 3407872) wanted: 3, have: 4 Created new chunk [18457780224000 1073741824] Delete backref in extent [84302495744 69632] ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, owner: 374857, offset: 3407872) wanted: 3, have: 4 Delete backref in extent [84302495744 69632] ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, owner: 374857, offset: 114540544) wanted: 181, have: 240 Delete backref in extent [125712527360 12214272] ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, owner: 374857, offset: 126754816) wanted: 68, have: 115 Delete backref in extent [125730848768 5111808] ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, owner: 374857, offset: 126754816) wanted: 68, have: 115 Delete backref in extent [125730848768 5111808] ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, owner: 374857, offset: 131866624) wanted: 115, have: 143 Delete backref in extent [125736914944 6037504] ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, owner: 374857, offset: 131866624) wanted: 115, have: 143 Delete backref in extent [125736914944 6037504] ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, owner: 374857, offset: 148234240) wanted: 302, have: 431 Delete backref in extent [129952120832 20242432] ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, owner: 374857, offset: 148234240) wanted: 356, have: 433 Delete backref in extent [129952120832 20242432] ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, owner: 374857, offset: 180371456) wanted: 161, have: 240 Delete backref in extent [134925357056 11829248] ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, owner: 374857, offset: 180371456) wanted: 162, have: 240 Delete backref in extent [134925357056 11829248] ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, owner: 374857, offset: 192200704) wanted: 170, have: 249 Delete backref in extent [147895111680 12345344] ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, owner: 374857, offset: 192200704) wanted: 172, have: 251 Delete backref in extent [147895111680 12345344] ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, owner: 374857, offset: 217653248) wanted: 348, have: 418 Delete backref in extent [150850146304 17522688] ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, owner: 374857, offset: 235175936) wanted: 555, have: 1449 Deleted root 2 item[156909494272, 178, 5476627808561673095] ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, owner: 374857, offset: 235175936) wanted: 556, have: 1452 Deleted root 2 item[156909494272, 178, 7338474132555182983] At the rate it's going, it'll probably take days though, it's already been 36H Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08 -- To unsubscribe from this list: send the line "unsubscribe
Re: [PATCH 2/2] btrfs: Add graceful handling of V0 extents
On Mon, Jun 25, 2018 at 11:24:50AM +0300, Nikolay Borisov wrote: > Following the removal of the v0 handling code let's be courteous and > print an error message when such extents are handled. In the cases > where we have a transaction just abort it, otherwise just call > btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO. > > Signed-off-by: Nikolay Borisov > --- > fs/btrfs/extent-tree.c | 44 > fs/btrfs/print-tree.c | 11 --- > fs/btrfs/relocation.c | 36 +++- > 3 files changed, 79 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 4129831523a2..131773528683 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -870,8 +870,17 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle > *trans, > num_refs = btrfs_extent_refs(leaf, ei); > extent_flags = btrfs_extent_flags(leaf, ei); > } else { > - BUG(); > + ret = -EINVAL; > + btrfs_err(fs_info, > + "Unsupported V0 extent filesystem detected. Aborting. Please > re-create your filesystem with a newer kernel"); Can you please add a helper that prints the message so there aren't several different copies of that? Also btrfs_err will add the last '\n', so it's not needed. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unsolvable technical issues?
On Sat, Jun 23, 2018 at 05:11:52AM +, Duncan wrote: > > According to this: > > > > https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 , > > section 1.2 > > > > It claims that BTRFS still have significant technical issues that may > > never be resolved. > > Could someone shed some light on exactly what these technical issues > > might be?! What are BTRFS biggest technical problems? > > > > If you forget about the "RAID"5/6 like features then the only annoyances > > that I have with BTRFS so far is... > > > > 1. Lack of per subvolume "RAID" levels > > 2. Lack of not using the deviceid to re-discover and re-add dropped > > devices > > > > And that's about it really... > > ... And those both have solutions on the roadmap, with RFC patches > already posted for #2 (tho I'm not sure they use devid) altho > realistically they're likely to take years to appear and be tested to > stability. Meanwhile... > > While as the others have said you really need to go to the author to get > what was referred to, and I agree, I can speculate a bit. While this > *is* speculation, admittedly somewhat uninformed as I don't claim to be a > dev, and I'd actually be interested in what others think so don't be > afraid to tell me I haven't a clue, as long as you say why... based on > several years reading the list now... > > 1) When I see btrfs "technical issue that may never be resolved", the #1 > first thing I think of, that AFAIK there are _definitely_ no plans to > resolve, because it's very deeply woven into the btrfs core by now, is... > > Filesystem UUID Identification. > Btrfs takes the UU bit of Universally > Unique quite literally, assuming they really *are* unique, at least on > that system, and uses them to identify the possibly multiple devices that > may be components of the filesystem, a problem most filesystems don't > have to deal with since they're single-device-only. Because btrfs uses > this supposedly unique ID to ID devices that belong to the filesystem, it > can get *very* mixed up, with results possibly including dataloss, if it > sees devices that don't actually belong to a filesystem with the same UUID > as a mounted filesystem. > > But technologies such as LVM allow cloning devices and these additional > devices naturally have the same filesystem metadata, including filesystem > UUID, as the original. Making the problem worse is udev with its plug-n- > play style detection, which will normally trigger a btrfs device scan, > thus making btrfs aware of new devices containing (a component of) a > btrfs, as soon as udev detects the device. The automatic scanning is partially making it hard and would require either extending the scanning mechanim to distinguish automatic and manual scan, and using that information in kernel. Right now, a cloned device will not be added to the filesystem UUID set if the fs is mounted, but otherwise it's up to the administrator. The misisng bit is possibly a way to tell the kernel module to 'forget' a device (forget and never auto-scan). > 2) Subvolume and (more technically) reflink-aware defrag. There was a discussion in the mailinglist recently, some additions to the interface were requested. The code to avoid the OOM exists but the original author is not apparently interested and noone else has that high enough in the todo list. > 3) N-way-mirroring. I have a prototype code for that, 3-copy and 4-copy type of profile. Doning a fully dynamic N-way would become a messs once thre are mixed N-way chunks for different N. Adding N=5 would not be too hard, but I'm not sure if this makes sense. The raid5 write-hole log will build on top of that, but the code has not been written yet, other than the separate device logging sent by Liu Bo. > 4) (Until relatively recently, and still in terms of scaling) Quotas. That's ongoing WIP, as qgroups touch the core parts. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
On 25 Jun 2018, at 9:54, David Sterba wrote: > On Mon, Jun 25, 2018 at 06:45:32AM -0700, Chris Mason wrote: >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index 193f933..38403aa 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -9013,6 +9013,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) >> goto out_unlock; >> } >> ret2 = 0; >> +ret = 0; > > That's something I'd rather avoid, now there are two error values that > need to be tracked. Shouldn't the 'ret2 = 0' be removed completely? Yeah, the whole thing hurts my eyes. Fixing. > >> >> /* page is wholly or partially inside EOF */ >> if (page_start + PAGE_SIZE > size) >> @@ -9037,7 +9038,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) >> unlock_extent_cached(io_tree, page_start, page_end, _state); >> >> out_unlock: >> -if (!ret2) { >> +if (!ret) { > > I'm not sure is safe, the new ret is one of the VM_FAULT_ values and > it's not 0 by default and in fact I don't see anything that would set it > to 0 directly or indirectly. Which means that the code in the out: label > also needs to be revisited: Good point, ok. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
On Mon, Jun 25, 2018 at 06:45:32AM -0700, Chris Mason wrote: > The vm_fault_t conversion commit introduced a ret2 variable for > tracking the integer return values from internal btrfs functions. It > was sometimes returning VM_FAULT_LOCKED for pages that were actually > invalid and had been removed from the radix. Something like this: > > ret2 = btrfs_delalloc_reserve_space() // returns zero on success > > lock_page(page) > if (page->mapping != inode->i_mapping) > goto out_unlock; > > ... > > out_unlock: > if (!ret2) { > ... > return VM_FAULT_LOCKED; > } > > This ends up triggering this WARNING in btrfs_destroy_inode() > WARN_ON(BTRFS_I(inode)->block_rsv.size); > > The fix used here is to use ret instead of ret2 in our check for success. > > Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to > vm_fault_t) > Signed-off-by: Chris Mason > --- > fs/btrfs/inode.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 193f933..38403aa 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9013,6 +9013,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) > goto out_unlock; > } > ret2 = 0; > + ret = 0; That's something I'd rather avoid, now there are two error values that need to be tracked. Shouldn't the 'ret2 = 0' be removed completely? > > /* page is wholly or partially inside EOF */ > if (page_start + PAGE_SIZE > size) > @@ -9037,7 +9038,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) > unlock_extent_cached(io_tree, page_start, page_end, _state); > > out_unlock: > - if (!ret2) { > + if (!ret) { I'm not sure is safe, the new ret is one of the VM_FAULT_ values and it's not 0 by default and in fact I don't see anything that would set it to 0 directly or indirectly. Which means that the code in the out: label also needs to be revisited: 9016 out: 9017 btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, (ret != 0)); 9018 btrfs_delalloc_release_space(inode, data_reserved, page_start, 9019 reserved_space, (ret != 0)); > btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true); > sb_end_pagefault(inode->i_sb); > extent_changeset_free(data_reserved); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits
On 25 Jun 2018, at 7:10, David Sterba wrote: On Fri, Jun 22, 2018 at 05:25:54PM -0400, Chris Mason wrote: The bug came here: commit a528a24150870c5c16cbbbec69dba7e992b08456 Author: Souptick Joarder Date: Wed Jun 6 19:54:44 2018 +0530 btrfs: change return type of btrfs_page_mkwrite to vm_fault_t When page->mapping != mapping, we improperly return success because ret2 is zero on goto out_unlock. I'm running a fix through a full xfstests and I'll post soon. The ret/ret2 pattern is IMO used in the wrong way, the ret2 is some temporary value and it should be set and tested next to each other, not holding the state accross the function. The fix I'd propose is to avoid relying on it and add a separate exit block, similar to out_noreserve: --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8981,7 +8981,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) ret = VM_FAULT_SIGBUS; goto out_unlock; } - ret2 = 0; /* page is wholly or partially inside EOF */ if (page_start + PAGE_SIZE > size) @@ -9004,14 +9003,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; unlock_extent_cached(io_tree, page_start, page_end, _state); - -out_unlock: - if (!ret2) { - btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true); - sb_end_pagefault(inode->i_sb); - extent_changeset_free(data_reserved); - return VM_FAULT_LOCKED; - } unlock_page(page); VM_FAULT_LOCKED is where we return success. The out_unlock goto is confusing because it's actually only used for failure, but the goto lands right above the if (everything actually worked) {} test. On top of the patch I sent today, moving out_unlock: after the if would make it more clear. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion
The vm_fault_t conversion commit introduced a ret2 variable for tracking the integer return values from internal btrfs functions. It was sometimes returning VM_FAULT_LOCKED for pages that were actually invalid and had been removed from the radix. Something like this: ret2 = btrfs_delalloc_reserve_space() // returns zero on success lock_page(page) if (page->mapping != inode->i_mapping) goto out_unlock; ... out_unlock: if (!ret2) { ... return VM_FAULT_LOCKED; } This ends up triggering this WARNING in btrfs_destroy_inode() WARN_ON(BTRFS_I(inode)->block_rsv.size); The fix used here is to use ret instead of ret2 in our check for success. Fixes: a528a2415087 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t) Signed-off-by: Chris Mason --- fs/btrfs/inode.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 193f933..38403aa 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9013,6 +9013,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) goto out_unlock; } ret2 = 0; + ret = 0; /* page is wholly or partially inside EOF */ if (page_start + PAGE_SIZE > size) @@ -9037,7 +9038,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) unlock_extent_cached(io_tree, page_start, page_end, _state); out_unlock: - if (!ret2) { + if (!ret) { btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true); sb_end_pagefault(inode->i_sb); extent_changeset_free(data_reserved); -- 2.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unsolvable technical issues?
On Fri, Jun 22, 2018 at 01:13:31AM +0200, waxhead wrote: > According to this: > > https://stratis-storage.github.io/StratisSoftwareDesign.pdf > Page 4 , section 1.2 > > It claims that BTRFS still have significant technical issues that may > never be resolved. > Could someone shed some light on exactly what these technical issues > might be?! What are BTRFS biggest technical problems? The subject you write is 'unsolvable', which I read as 'impossible to solve', eg. on the design level. I'm not aware of such issues. If this is about issues that are difficult either to implement or getting right, there are a few known ones. > If you forget about the "RAID"5/6 like features then the only annoyances > that I have with BTRFS so far is... > > 1. Lack of per subvolume "RAID" levels > 2. Lack of not using the deviceid to re-discover and re-add dropped devices > > And that's about it really... This could quickly turn into 'my faviourite bug/feature' list that can be very long. The most asked for are raid56, and performance of qgroups. Qu Wenruo improved some of the core problems and Jeff is working on the performance problem. So there are people working on that. On the raid56 front, there were some recent updates that fixed some bugs, but the fix for write hole is still missing so we can't raise the status yet. I have some some good news but nobody should get too excited until the code lands. I have prototype for the N-copy raid (where N is 3 or 4). This will provide the underlying infrastructure for the raid5/6 logging mechanism, the rest can be taken from Liu Bo's patchset sent some time ago. In the end the N-copy can be used for data and metadata too, independently and flexibly switched via the balance filters. This will cost one incompatibility bit. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: unsolvable technical issues?
On 2018-06-24 16:22, Goffredo Baroncelli wrote: On 06/23/2018 07:11 AM, Duncan wrote: waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted: According to this: https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 , section 1.2 It claims that BTRFS still have significant technical issues that may never be resolved. Could someone shed some light on exactly what these technical issues might be?! What are BTRFS biggest technical problems? If you forget about the "RAID"5/6 like features then the only annoyances that I have with BTRFS so far is... 1. Lack of per subvolume "RAID" levels 2. Lack of not using the deviceid to re-discover and re-add dropped devices And that's about it really... ... And those both have solutions on the roadmap, with RFC patches already posted for #2 (tho I'm not sure they use devid) altho realistically they're likely to take years to appear and be tested to stability. Meanwhile... While as the others have said you really need to go to the author to get what was referred to, and I agree, I can speculate a bit. While this *is* speculation, admittedly somewhat uninformed as I don't claim to be a dev, and I'd actually be interested in what others think so don't be afraid to tell me I haven't a clue, as long as you say why... based on several years reading the list now... 1) When I see btrfs "technical issue that may never be resolved", the #1 first thing I think of, that AFAIK there are _definitely_ no plans to resolve, because it's very deeply woven into the btrfs core by now, is... Filesystem UUID Identification. Btrfs takes the UU bit of Universally Unique quite literally, assuming they really *are* unique, at least on that system, and uses them to identify the possibly multiple devices that may be components of the filesystem, a problem most filesystems don't have to deal with since they're single-device-only. Because btrfs uses this supposedly unique ID to ID devices that belong to the filesystem, it can get *very* mixed up, with results possibly including dataloss, if it sees devices that don't actually belong to a filesystem with the same UUID as a mounted filesystem. As partial workaround you can disable udev btrfs rules and then do a "btrfs dev scan" manually only for the device which you need. The you can mount the filesystem. Unfortunately you cannot mount two filesystem with the same UUID. However I have to point out that also LVM/dm might have problems if you clone a PV You don't even need `btrfs dev scan` if you just specify the exact set of devices in the mount options. The `device=` mount option tells the kernel to check that device during the mount process. Also, while LVM does have 'issues' with cloned PV's, it fails safe (by refusing to work on VG's that have duplicate PV's), while BTRFS fails very unsafely (by randomly corrupting data). [...] der say 3-5 (or 5-7, or whatever) years. These could arguably include: 2) Subvolume and (more technically) reflink-aware defrag. It was there for a couple kernel versions some time ago, but "impossibly" slow, so it was disabled until such time as btrfs could be made to scale rather better in this regard. Did you try something like that with XFS+DM snapshot ? No you can't, because defrag in XFS cannot traverse snapshot (and I have to suppose that defrag cannot be effective on a dm-snapshot at all).. What I am trying to point out is that even tough btrfs is not the fastest filesystem (and for some workload is VERY slow), when you compare it when few snapshots were presents LVM/dm is a lot slower. IMHO most of the complaint which affect BTRFS, are due to the fact that with BTRFS an user can quite easily exploit a lot of features and their combinations. When a the slowness issue appears when some advance features combinations are used (i.e. multiple disks profile and (a lot of ) snapshots), this is reported as a BTRFS failure. But in fact even LVM/dm is very slow when the snapshot is used. I still contend that the biggest issue WRT reflink-aware defrag was that it was not optional. The only way to get the old defrag behavior was to boot a kernel that didn't have reflink-aware defrag support. IOW, _everyone_ had to deal with the performance issues, not just the people who wanted to use reflink-aware defrag. There's no hint yet as to when that might actually be, if it will _ever_ be, so this can arguably be validly added to the "may never be resolved" list. 3) N-way-mirroring. [...] This is not an issue, but a not implemented feature If you're looking at feature parity with competitors, it's an issue. 4) (Until relatively recently, and still in terms of scaling) Quotas. Until relatively recently, quotas could arguably be added to the list. They were rewritten multiple times, and until recently, appeared to be effectively eternally broken. Even tough what you are reporting is correct, I have to point out that the quota in BTRFS is more
Re: [PATCH RFC 0/2] Btrfs: fix file data corruptions due to lost dirty bits
On Fri, Jun 22, 2018 at 05:25:54PM -0400, Chris Mason wrote: > The bug came here: > > commit a528a24150870c5c16cbbbec69dba7e992b08456 > Author: Souptick Joarder > Date: Wed Jun 6 19:54:44 2018 +0530 > > btrfs: change return type of btrfs_page_mkwrite to vm_fault_t > > When page->mapping != mapping, we improperly return success because ret2 > is zero on goto out_unlock. > > I'm running a fix through a full xfstests and I'll post soon. The ret/ret2 pattern is IMO used in the wrong way, the ret2 is some temporary value and it should be set and tested next to each other, not holding the state accross the function. The fix I'd propose is to avoid relying on it and add a separate exit block, similar to out_noreserve: --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8981,7 +8981,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) ret = VM_FAULT_SIGBUS; goto out_unlock; } - ret2 = 0; /* page is wholly or partially inside EOF */ if (page_start + PAGE_SIZE > size) @@ -9004,14 +9003,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; unlock_extent_cached(io_tree, page_start, page_end, _state); - -out_unlock: - if (!ret2) { - btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true); - sb_end_pagefault(inode->i_sb); - extent_changeset_free(data_reserved); - return VM_FAULT_LOCKED; - } unlock_page(page); out: btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, (ret != 0)); @@ -9021,6 +9012,12 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) sb_end_pagefault(inode->i_sb); extent_changeset_free(data_reserved); return ret; + +out_unlock: + btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true); + sb_end_pagefault(inode->i_sb); + extent_changeset_free(data_reserved); + return VM_FAULT_LOCKED; } static int btrfs_truncate(struct inode *inode, bool skip_writeback) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: fix physical offset reported by fiemap for inline extents
On 20.06.2018 12:02, fdman...@kernel.org wrote: > From: Filipe Manana > > So fix this by ensuring the physical offset is always set to 0 when we > are processing an extent with a special block_start value. > > Fixes: 9d311e11fc1f ("Btrfs: fiemap: pass correct bytenr when fm_extent_count > is zero") > Signed-off-by: Filipe Manana At first I was "wtf" since I had missed the u64 cast. After realising that the constants are in fact always parsed as positive numbers then it makes sense. Reviewed-by: Nikolay Borisov > --- > > v2: Set the physical offset to 0 for other extent maps with a special > block_start value as well. > > fs/btrfs/extent_io.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 8e4a7cdbc9f5..1aa91d57404a 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -4545,8 +4545,11 @@ int extent_fiemap(struct inode *inode, struct > fiemap_extent_info *fieinfo, > offset_in_extent = em_start - em->start; > em_end = extent_map_end(em); > em_len = em_end - em_start; > - disko = em->block_start + offset_in_extent; > flags = 0; > + if (em->block_start < EXTENT_MAP_LAST_BYTE) > + disko = em->block_start + offset_in_extent; > + else > + disko = 0; > > /* >* bump off for our next call to get_extent > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: remove -ERANGE check and avoid errno overriding in btrfs_get_acl()
On 25.06.2018 11:44, Nikolay Borisov wrote: > > > On 23.06.2018 09:38, Chengguang Xu wrote: >> Remove -ERANGE error check because there is no chance to get into >> this condition and meanwhile avoid overriding errno to -EIO in >> btrfs_get_acl(). > > This is way too terse. The reason why we can't get an ERANGE error is > because we first call btrfs_getxattr to get the length of the attribute, > then we do a subsequent call with the size from the first call. Between > the 2 calls the size shouldn't change. > > Furthermore, I see one more bad practice in this code - the first call > to btrfs_getxattr is not using the value buffer hence it should be > passing NULL to make this obvious, instead it's passing an empty string > form the rodata section. While at it, I guess you could also remove the BUG() in the default case of the switch statement in btrfs_get_acl. A simple return ERR_PTR(-EINVAL) should suffice. > >> >> Signed-off-by: Chengguang Xu >> --- >> v2: >> - Avoid errno overriding instead of print error message in error case. >> - Change commit log for better understanding. >> >> fs/btrfs/acl.c | 10 +- >> 1 file changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c >> index 15e1dfef56a5..b71a875036af 100644 >> --- a/fs/btrfs/acl.c >> +++ b/fs/btrfs/acl.c >> @@ -40,13 +40,13 @@ struct posix_acl *btrfs_get_acl(struct inode *inode, int >> type) >> return ERR_PTR(-ENOMEM); >> size = btrfs_getxattr(inode, name, value, size); >> } >> -if (size > 0) { >> +if (size > 0) >> acl = posix_acl_from_xattr(_user_ns, value, size); >> -} else if (size == -ERANGE || size == -ENODATA || size == 0) { >> +else if (size == -ENODATA || size == 0) >> acl = NULL; >> -} else { >> -acl = ERR_PTR(-EIO); >> -} >> +else >> +acl = ERR_PTR(size); >> + >> kfree(value); >> >> return acl; >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: remove -ERANGE check and avoid errno overriding in btrfs_get_acl()
On 23.06.2018 09:38, Chengguang Xu wrote: > Remove -ERANGE error check because there is no chance to get into > this condition and meanwhile avoid overriding errno to -EIO in > btrfs_get_acl(). This is way too terse. The reason why we can't get an ERANGE error is because we first call btrfs_getxattr to get the length of the attribute, then we do a subsequent call with the size from the first call. Between the 2 calls the size shouldn't change. Furthermore, I see one more bad practice in this code - the first call to btrfs_getxattr is not using the value buffer hence it should be passing NULL to make this obvious, instead it's passing an empty string form the rodata section. > > Signed-off-by: Chengguang Xu > --- > v2: > - Avoid errno overriding instead of print error message in error case. > - Change commit log for better understanding. > > fs/btrfs/acl.c | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c > index 15e1dfef56a5..b71a875036af 100644 > --- a/fs/btrfs/acl.c > +++ b/fs/btrfs/acl.c > @@ -40,13 +40,13 @@ struct posix_acl *btrfs_get_acl(struct inode *inode, int > type) > return ERR_PTR(-ENOMEM); > size = btrfs_getxattr(inode, name, value, size); > } > - if (size > 0) { > + if (size > 0) > acl = posix_acl_from_xattr(_user_ns, value, size); > - } else if (size == -ERANGE || size == -ENODATA || size == 0) { > + else if (size == -ENODATA || size == 0) > acl = NULL; > - } else { > - acl = ERR_PTR(-EIO); > - } > + else > + acl = ERR_PTR(size); > + > kfree(value); > > return acl; > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs: Remove V0 extent support
The v0 compat code was introduced in commit 5d4f98a28c7d ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9 years ago, which was merged in 2.6.31. This means that the code is there to support filesystems which are _VERY_ old and if you are using btrfs on such an old kernel, you have much bigger problems. This coupled with the fact that no one is likely testing/maintining this code likely means it has bugs lurking. All things considered I think 43 kernel releases later it's high time this remnant of the past got removed. Signed-off-by: Nikolay Borisov --- fs/btrfs/ctree.c | 6 +- fs/btrfs/ctree.h | 2 - fs/btrfs/extent-tree.c | 209 + fs/btrfs/print-tree.c | 30 +-- fs/btrfs/relocation.c | 151 +-- 5 files changed, 4 insertions(+), 394 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 6879697520d5..d1273ec2e0d5 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -888,11 +888,7 @@ int btrfs_block_can_be_shared(struct btrfs_root *root, btrfs_root_last_snapshot(>root_item) || btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) return 1; -#ifdef BTRFS_COMPAT_EXTENT_TREE_V0 - if (test_bit(BTRFS_ROOT_REF_COWS, >state) && - btrfs_header_backref_rev(buf) < BTRFS_MIXED_BACKREF_REV) - return 1; -#endif + return 0; } diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e671a1fcbbec..bc52bf7ac572 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -55,8 +55,6 @@ struct btrfs_ordered_sum; #define BTRFS_OLDEST_GENERATION0ULL -#define BTRFS_COMPAT_EXTENT_TREE_V0 - /* * the max metadata block size. This limit is somewhat artificial, * but the memmove costs go through the roof for larger blocks. diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0f7e797236dc..4129831523a2 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -870,17 +870,7 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, num_refs = btrfs_extent_refs(leaf, ei); extent_flags = btrfs_extent_flags(leaf, ei); } else { -#ifdef BTRFS_COMPAT_EXTENT_TREE_V0 - struct btrfs_extent_item_v0 *ei0; - BUG_ON(item_size != sizeof(*ei0)); - ei0 = btrfs_item_ptr(leaf, path->slots[0], -struct btrfs_extent_item_v0); - num_refs = btrfs_extent_refs_v0(leaf, ei0); - /* FIXME: this isn't correct for data */ - extent_flags = BTRFS_BLOCK_FLAG_FULL_BACKREF; -#else BUG(); -#endif } BUG_ON(num_refs == 0); } else { @@ -1039,89 +1029,6 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, * tree block info structure. */ -#ifdef BTRFS_COMPAT_EXTENT_TREE_V0 -static int convert_extent_item_v0(struct btrfs_trans_handle *trans, - struct btrfs_fs_info *fs_info, - struct btrfs_path *path, - u64 owner, u32 extra_size) -{ - struct btrfs_root *root = fs_info->extent_root; - struct btrfs_extent_item *item; - struct btrfs_extent_item_v0 *ei0; - struct btrfs_extent_ref_v0 *ref0; - struct btrfs_tree_block_info *bi; - struct extent_buffer *leaf; - struct btrfs_key key; - struct btrfs_key found_key; - u32 new_size = sizeof(*item); - u64 refs; - int ret; - - leaf = path->nodes[0]; - BUG_ON(btrfs_item_size_nr(leaf, path->slots[0]) != sizeof(*ei0)); - - btrfs_item_key_to_cpu(leaf, , path->slots[0]); - ei0 = btrfs_item_ptr(leaf, path->slots[0], -struct btrfs_extent_item_v0); - refs = btrfs_extent_refs_v0(leaf, ei0); - - if (owner == (u64)-1) { - while (1) { - if (path->slots[0] >= btrfs_header_nritems(leaf)) { - ret = btrfs_next_leaf(root, path); - if (ret < 0) - return ret; - BUG_ON(ret > 0); /* Corruption */ - leaf = path->nodes[0]; - } - btrfs_item_key_to_cpu(leaf, _key, - path->slots[0]); - BUG_ON(key.objectid != found_key.objectid); - if (found_key.type != BTRFS_EXTENT_REF_V0_KEY) { - path->slots[0]++; - continue; - } - ref0 = btrfs_item_ptr(leaf, path->slots[0], - struct btrfs_extent_ref_v0); -
[PATCH 0/2] Remove v0 extent support
It's unlikely there is anyone using that or even if they are they have bigger problems than this patchset :). After all, v0 was introduced 9 years ago and it was already conditionally compiled by the time BTRFS was merged in the upstream kernel. The patches themselves are really simple - patch 1 removes all the code within ifdef guards. Patch 2, in turn, ads graceful handling by aborting transaction where it makes sense or calling btrfs_handle_fs_error and printing an informative error message. Nikolay Borisov (2): btrfs: Remove V0 extent support btrfs: Add graceful handling of V0 extents fs/btrfs/ctree.c | 6 +- fs/btrfs/ctree.h | 2 - fs/btrfs/extent-tree.c | 239 +++-- fs/btrfs/print-tree.c | 35 ++-- fs/btrfs/relocation.c | 181 ++--- 5 files changed, 70 insertions(+), 393 deletions(-) -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] btrfs: Add graceful handling of V0 extents
Following the removal of the v0 handling code let's be courteous and print an error message when such extents are handled. In the cases where we have a transaction just abort it, otherwise just call btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO. Signed-off-by: Nikolay Borisov --- fs/btrfs/extent-tree.c | 44 fs/btrfs/print-tree.c | 11 --- fs/btrfs/relocation.c | 36 +++- 3 files changed, 79 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4129831523a2..131773528683 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -870,8 +870,17 @@ int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, num_refs = btrfs_extent_refs(leaf, ei); extent_flags = btrfs_extent_flags(leaf, ei); } else { - BUG(); + ret = -EINVAL; + btrfs_err(fs_info, + "Unsupported V0 extent filesystem detected. Aborting. Please re-create your filesystem with a newer kernel"); + if (trans) + btrfs_abort_transaction(trans, ret); + else + btrfs_handle_fs_error(fs_info, ret, NULL); + + goto out_free; } + BUG_ON(num_refs == 0); } else { num_refs = 0; @@ -1302,6 +1311,11 @@ static noinline int remove_extent_data_ref(struct btrfs_trans_handle *trans, ref2 = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_shared_data_ref); num_refs = btrfs_shared_data_ref_count(leaf, ref2); + } else if (key.type == BTRFS_EXTENT_REF_V0_KEY) { + btrfs_err(fs_info, + "Unsupported V0 extent filesystem detected. Aborting. Please re-create your filesystem with a newer kernel"); + btrfs_abort_transaction(trans, -EINVAL); + return -EINVAL; } else { BUG(); } @@ -1334,6 +1348,8 @@ static noinline u32 extent_data_ref_count(struct btrfs_path *path, leaf = path->nodes[0]; btrfs_item_key_to_cpu(leaf, , path->slots[0]); + + BUG_ON(key.type == BTRFS_EXTENT_REF_V0_KEY); if (iref) { /* * If type is invalid, we should have bailed out earlier than @@ -1549,7 +1565,13 @@ int lookup_inline_extent_backref(struct btrfs_trans_handle *trans, leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, path->slots[0]); - BUG_ON(item_size < sizeof(*ei)); + if (item_size < sizeof(*ei)) { + err = -EINVAL; + btrfs_err(fs_info, + "Unsupported V0 extent filesystem detected. Aborting. Please re-create your filesystem with a newer kernel"); + btrfs_abort_transaction(trans, err); + goto out; + } ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item); flags = btrfs_extent_flags(leaf, ei); @@ -2282,7 +2304,15 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans, leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, path->slots[0]); - BUG_ON(item_size < sizeof(*ei)); + + if (item_size < sizeof(*ei)) { + err = -EINVAL; + btrfs_err(fs_info, + "Unsupported V0 extent filesystem detected. Aborting. Please re-create your filesystem with a newer kernel"); + btrfs_abort_transaction(trans, err); + goto out; + } + ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_item); __run_delayed_extent_op(extent_op, leaf, ei); @@ -6815,7 +6845,13 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, leaf = path->nodes[0]; item_size = btrfs_item_size_nr(leaf, extent_slot); - BUG_ON(item_size < sizeof(*ei)); + if (item_size < sizeof(*ei)) { + ret = -EINVAL; + btrfs_err(info, + "Unsupported V0 extent filesystem detected. Aborting. Please re-create your filesystem with a newer kernel"); + btrfs_abort_transaction(trans, ret); + goto out; + } ei = btrfs_item_ptr(leaf, extent_slot, struct btrfs_extent_item); if (owner_objectid < BTRFS_FIRST_FREE_OBJECTID && diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c index b03040b84fe9..cd4387fc766c 100644 --- a/fs/btrfs/print-tree.c +++ b/fs/btrfs/print-tree.c @@ -52,8 +52,11 @@ static void print_extent_item(struct extent_buffer *eb, int slot, int type) u64 offset; int ref_index = 0; - if (item_size < sizeof(*ei)) - BUG(); + if (item_size <