Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents
On Sun, Dec 11, 2016 at 10:56:59PM +0100, Xin Zhou wrote:
> Hi Zygo,
>
> Since the corruption happens after I/O and checksum,
> could it be possible to add some bug catcher code in debug build, to help
> narrowing down the issue?

The corruption happens only during reads, after I/O and checksum verification. It happens at the point I have patched: after decompression, when the decompressed data doesn't fill the extent and there is a hole after the extent. No other code exists in btrfs to fill the hole that forms in such cases. There is nothing left to narrow down.

Chris Mason discovered the same bug in 2009 and fixed half of it in commit 93c82d5750. His fix covered only one of the two cases where corruption occurs (after uncompressed extents). The compressed-extent case remained unfixed until now.

To be fair to everyone who missed this bug for seven years: it was quite a hard bug to spot. Over a period of 28 months I observed backup verification failures and eliminated competing theories until only this bug remained. At least two other bugs fixed between 2009 and 2014 would also have produced corrupted data in compressed files under slightly different circumstances, so anyone looking for data corruption in compressed extents could have thought they'd solved the problem at least twice before now.

There was a memset in *almost* the right spot in the code to fix the corruption bug until early 2014. Ironically, that memset (introduced in 2009) was itself the cause of another corruption bug. It was removed in 166ae5a418 ("btrfs: fix inline compressed read err corruption"), but that commit removed the memset entirely instead of fixing it.

It's hard to detect this bug in the wild because the uninitialized kernel data is often zero by accident (zero is typically the most common byte value in kernel memory), and the corrupted data always lands in holes.
The programs that create files with the specific extent structure that triggers the corruption don't write any data to the corrupted regions of the files (e.g. the corrupted bytes are unused bytes in ELF headers, or EXIF thumbnail data in JPEG files). Such programs typically won't read the corrupted data bytes, so the corruption often has no visible impact on users even when it occurs.

The file on disk is not corrupted--the kernel corrupts the file while reading it--so repeatedly copying and verifying the same files over and over (e.g. daily backups of the same origin data) will *eventually* succeed. That makes the problem look like a hardware issue. For several months I thought it *was* a hardware issue, but I kept finding similar corruption on too many dissimilar machines that were showing no other evidence of hardware problems.

All that seems to be needed to produce corruption is a filesystem mounted with the compress option and max_inline >= 2048. Any version of btrfs from 2012 to the present behaves the same way (older kernels probably also behave the same way, but they crash too early to finish running my tests).

So far I've discovered corrupted files in these places in the wild:

 - in build directories (e.g. Linux kernel or anything using GCC)
 - in /etc/lvm/{archive,backup}/*
 - in /var/log/ntpstats
 - in backups made with rsync -avxzHSP (but not with rsync -avxzHP)

Here is a one-line script that will find potentially corrupted files on an existing filesystem. There is no special setup or repro script required, other than setting 'compress' in the mount options.
This is just a developer's work box chosen at random:

# cd /usr/src/linux
# find -type f -size +4097c -exec sh -c 'for x; do if filefrag -v "$x" | sed -n "4p" | grep -q inline; then echo "$x"; fi; done' -- {} +
./arch/x86/kernel/.module.o.cmd
./block/.scsi_ioctl.o.cmd
./drivers/ata/.pata_sis.o.cmd
# filefrag -v ./drivers/ata/.pata_sis.o.cmd
Filesystem type is: 9123683e
File size of ./drivers/ata/.pata_sis.o.cmd is 45677 (12 blocks of 4096 bytes)
 ext: logical_offset:     physical_offset: length: expected: flags:
   0:    0..  4095:          0..     4095:   4096:           not_aligned,inline
   1:    1..     1:   35322749.. 35322749:      1:        1: shared
   2:    2..     2:   35371776.. 35371776:      1: 35322750: shared
   3:    3..     3:   35532566.. 35532566:      1: 35371777: shared
   4:    4..     4:   35315682.. 35315682:      1: 35532567: shared
   5:    5..     5:   36010227.. 36010227:      1: 35315683: shared
   6:    6..     9:   35396173.. 35396176:      4: 36010228: encoded,shared
   7:   10..    10:   35913574.. 35913574:      1: 35396177: shared
   8:   11..    11:   35305969.. 35305969:      1: 35913575: last,eof
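A hedged follow-up check I use on candidates found by the script above (this helper is my own illustration, not part of the original report): since the on-disk data is intact and only reads are corrupted, hashing a file twice across a page-cache drop can expose the nondeterminism. It needs root for drop_caches, and a match proves nothing, because the uninitialized bytes are often zero anyway.

```shell
# Sketch only: returns success (0) if two reads of the same file differ
# across a page-cache drop, which suggests read-time corruption.
reread_differs() {
    f=$1
    h1=$(sha256sum "$f" | cut -d' ' -f1)
    echo 3 > /proc/sys/vm/drop_caches   # needs root; harmless no-op otherwise
    h2=$(sha256sum "$f" | cut -d' ' -f1)
    [ "$h1" != "$h2" ]
}
```

Run it in a loop over the files reported by the find one-liner; any mismatch is worth reporting.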
v4.9-rc7 scrub kernel panic
Hi,

When testing Chris' for-linus-4.10 branch, I found that even at the branch base, v4.9-rc7, btrfs can't pass quite a lot of scrub tests, including btrfs/011 and btrfs/069. btrfs/069 fails 100% of the time, with the following back trace:

general protection fault: [#1] SMP
Modules linked in: btrfs(O) xor zlib_deflate raid6_pq x86_pkg_temp_thermal ext4 jbd2 mbcache e1000e efivarfs [last unloaded: btrfs]
CPU: 3 PID: 5300 Comm: kworker/u8:4 Tainted: G O 4.9.0-rc7+ #20
Hardware name: FUJITSU ESPRIMO P720/D3221-A1, BIOS V4.6.5.4 R1.17.0 for D3221-A1x 03/06/2014
Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
task: 88008dbcb740 task.stack: c9000123
RIP: 0010:[] [] generic_make_request_checks+0x198/0x5a0
RSP: 0018:c90001233b08 EFLAGS: 00010202
RAX:  RBX: 88007f963228 RCX: 0001
RDX: 8000 RSI:  RDI: 6b6b6b6b6b6b6b6b
RBP: c90001233b68 R08: 868a9b14 R09: eab761b2
R10: 0001 R11: 0001 R12: 0040
R13: 0004 R14: 88008dbc5a88 R15: 0010
FS: () GS:880119e0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 019c7058 CR3: 0001150b4000 CR4: 001406e0
Stack:
 0002 813a4cff 0296 0292
 88008dbcb740 0003 88007f963228 0004
 88008dbc5a88 0010
Call Trace:
 [] ? generic_make_request+0xcf/0x290
 [] generic_make_request+0x24/0x290
 [] ? generic_make_request+0xcf/0x290
 [] submit_bio+0x6e/0x120
 [] ? page_in_rbio+0x4d/0x80 [btrfs]
 [] ? rbio_orig_end_io+0x80/0x80 [btrfs]
 [] finish_rmw+0x401/0x550 [btrfs]
 [] validate_rbio_for_rmw+0x36/0x40 [btrfs]
 [] raid_rmw_end_io+0x7d/0x90 [btrfs]
 [] bio_endio+0x56/0x60
 [] end_workqueue_fn+0x3c/0x40 [btrfs]
 [] btrfs_scrubparity_helper+0xef/0x610 [btrfs]
 [] btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
 [] process_one_work+0x2af/0x720
 [] ? process_one_work+0x22b/0x720
 [] worker_thread+0x4b/0x4f0
 [] ? process_one_work+0x720/0x720
 [] ? process_one_work+0x720/0x720
 [] kthread+0xf3/0x110
 [] ? kthread_park+0x60/0x60
 [] ret_from_fork+0x27/0x40
Code: 00 00 0f 1f 44 00 00 65 8b 05 9d 71 c6 7e 89 c0 48 0f a3 05 c3 13 b8 00 0f 92 c3 0f 82 dd 02 00 00 bb 01 00 00 00 e9 8c 00 00 00 <48> 8b 47 08 48 8b 40 50 48 c1 f8 09 48 85 c0 0f 84 99 fe ff ff
RIP [] generic_make_request_checks+0x198/0x5a0
 RSP

Is this a known bug, or a new one caused by the block layer change?

Thanks,
Qu

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: Fix disable backtrace assert error
At 12/13/2016 02:00 AM, David Sterba wrote:
> On Thu, Dec 08, 2016 at 08:18:44AM +0800, Qu Wenruo wrote:
>>> On 12/06/2016 07:29 PM, Qu Wenruo wrote:
>>>> Due to commit 00e769d04c2c83029d6c71 (btrfs-progs: Correct value
>>>> printed by assertions/BUG_ON/WARN_ON), which changed the
>>>> assert_trace() parameters, the conditions passed to
>>>> assert/WARN_ON/BUG_ON are logically inverted between the backtrace
>>>> enabled and disabled cases.
>>>>
>>>> Such behavior makes it easier to pass the wrong value, and in fact
>>>> it did cause us to pass the wrong condition to ASSERT().
>>>>
>>>> Instead of passing different conditions to ASSERT/WARN_ON/BUG_ON()
>>>> manually, this patch uses BUG_ON() to implement the remaining
>>>> ASSERT/WARN_ON/BUG(), so we don't need to pass 3 different
>>>> conditions but only one.
>>>>
>>>> And to further inform reviewers that the condition should be
>>>> different, rename "assert_trace" to "bugon_trace": unlike assert,
>>>> we only trigger the bug when the condition is true.
>>>>
>>>> Also, move WARN_ON() out of the ifdef branch, as it's completely
>>>> the same for both branches.
>>>>
>>>> Cc: Goldwyn Rodrigues
>>>> Signed-off-by: Qu Wenruo
>>>> ---
>>>>  kerncompat.h | 19 +++++++++++--------
>>>>  1 file changed, 11 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/kerncompat.h b/kerncompat.h
>>>> index e374614..be77608 100644
>>>> --- a/kerncompat.h
>>>> +++ b/kerncompat.h
>>>> @@ -277,7 +277,7 @@ static inline long IS_ERR(const void *ptr)
>>>>  #define vfree(x) free(x)
>>>>
>>>>  #ifndef BTRFS_DISABLE_BACKTRACE
>>>> -static inline void assert_trace(const char *assertion, const char *filename,
>>>> +static inline void bugon_trace(const char *assertion, const char *filename,
>>>>  		const char *func, unsigned line, long val)
>>>>  {
>>>>  	if (!val)
>>>
>>> To keep confusion to the minimum, you can call this *condition
>>> instead of *assertion.
>>
>> Right, I'll update it.
>>
>>>> @@ -287,17 +287,20 @@ static inline void assert_trace(const char *assertion, const char *filename,
>>>>  	exit(1);
>>>>  }
>>>>
>>>> -#define BUG_ON(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
>>>> -#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
>>>> -#define ASSERT(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)!(c))
>>>> -#define BUG() assert_trace(NULL, __FILE__, __func__, __LINE__, 1)
>>>> +#define BUG_ON(c) bugon_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
>>>>  #else
>>>>  #define BUG_ON(c) assert(!(c))
>>>> -#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
>>>> -#define ASSERT(c) assert(!(c))
>>>> -#define BUG() assert(0)
>>>>  #endif
>>>>
>>>> +#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
>>>> +/*
>>>> + * TODO: ASSERT() should be deprecated. In a case like ASSERT(ret == 0), it
>>>> + * won't output any useful value for ret.
>>>> + * Should be replaced by BUG_ON(ret);
>>>> + */
>>>> +#define ASSERT(c) BUG_ON(!(c))
>>>
>>> I am not sure of this. As you are stating, this (double negation)
>>> will kill the value of the condition. Won't it be better to remove
>>> all ASSERTs first instead of putting this TODO?
>>
>> IIRC the ASSERT/BUG_ON will be removed step by step. And we have
>> about 60+ ASSERTs in the current code base, not an easy thing to fix
>> soon.
>>
>> So I prefer to mark ASSERT() deprecated and remove them in later
>> cleanups.
>
> But asserts have their meaning. We want to get rid of BUG_ONs that are
> abused for error not-handling. The asserts e.g. catch programmer
> errors like incorrect combinations of function parameters, "explicit
> checks for implicit assumptions", and other "never happens"
> conditions. Anything that depends on external input must not be
> handled via assert/bug_on.
>
> I have some write-ups in progress about the coding conventions; they
> will be put into git once they're more or less complete, so we're all
> on the same page.

I'm OK to follow the newly established coding standard. A lot of BUG_ONs and ASSERTs are abused, but we still need to fix the behavior first, or a lot of normal btrfsck operations will fail in the backtrace-disabled case.

So what about the patch without the comment? At least it's easier to maintain.

Thanks,
Qu
[PATCH v2] btrfs: limit async_work allocation and worker func duration
Problem statement: an unprivileged user who has read-write access to more than one btrfs subvolume may easily consume all kernel memory (eventually triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh

DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128    10144    10144  128  32  1 : tunables 0 0 0 : slabdata 317 317 0
kmalloc-128  9992352  9992352  128  32  1 : tunables 0 0 0 : slabdata 312261 312261 0
kmalloc-128 24226752 24226752  128  32  1 : tunables 0 0 0 : slabdata 757086 757086 0
kmalloc-128 42754240 42754240  128  32  1 : tunables 0 0 0 : slabdata 1336070 1336070 0

The huge numbers above come from an insane number of async_work-s allocated and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The worker func (btrfs_async_run_delayed_root) processes at least BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery works as expected while the list is almost empty. As soon as it gets bigger, the worker func starts to process more than one item at a time; that takes longer, and the chance of having more async_works queued than needed gets higher.

The problem above is worsened by another flaw of the delayed-inode implementation: if async_work was queued in a throttling branch (number of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that the func occupies the CPU indefinitely (up to 30 sec in my experiments): while the func is trying to drain the list, user activity may add more and more items to the list.

The patch fixes both problems in a straightforward way: refuse to queue too many works in btrfs_wq_run_delayed_node, and bail out of the worker func if at least BTRFS_DELAYED_WRITEBACK items have been processed.

Changed in v2: removed support of thresh == NO_THRESHOLD.

Signed-off-by: Maxim Patlasov
---
 fs/btrfs/async-thread.c  | 14 ++++++++++++++
 fs/btrfs/async-thread.h  |  1 +
 fs/btrfs/delayed-inode.c |  6 ++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index e0f071f..63d1977 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,20 @@ btrfs_work_owner(struct btrfs_work *work)
 	return work->wq->fs_info;
 }

+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+	/*
+	 * We could compare wq->normal->pending with num_online_cpus()
+	 * to support "thresh == NO_THRESHOLD" case, but it requires
+	 * moving up atomic_inc/dec in thresh_queue/exec_hook. Let's
+	 * postpone it until someone needs the support of that case.
+	 */
+	if (wq->normal->thresh == NO_THRESHOLD)
+		return false;
+
+	return atomic_read(&wq->normal->pending) > wq->normal->thresh * 2;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index 8e52484..1f95973 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -84,4 +84,5 @@ void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max);
 void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(struct __btrfs_workqueue *wq);
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq);
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 3eeb9cd..de946dd 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1356,7 +1356,8 @@ release_path:
 	total_done++;

 	btrfs_release_prepared_delayed_node(delayed_node);
-	if (async_work->nr == 0 || total_done < async_work->nr)
+	if ((async_work->nr == 0 && total_done < BTRFS_DELAYED_WRITEBACK) ||
+	    total_done < async_work->nr)
 		goto again;

 free_path:
@@ -1372,7 +1373,8 @@ static int btrfs_wq_run_delayed_node(struct btrfs_delayed_root *delayed_root,
 {
 	struct btrfs_async_delayed_work *async_work;

-	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND)
+	if (atomic_read(&delayed_root->items) <
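The ./mkrmdir tool used in the reproducer above is not included in the thread; only its behavior ("essentially loops over mkdir/rmdir") is described. A hypothetical shell equivalent, for readers who want to recreate the load without the original binary (the meaning of the two numeric arguments is my assumption):

```shell
# Assumed semantics: mkrmdir DIR OUTER INNER repeatedly creates and
# removes INNER directories under DIR, OUTER times. Enough mkdir/rmdir
# churn on the subvolumes set up by prep.sh drives the delayed-item
# growth shown in the slabinfo output above.
mkrmdir() {
    dir=$1 outer=$2 inner=$3
    i=0
    while [ "$i" -lt "$outer" ]; do
        j=0
        while [ "$j" -lt "$inner" ]; do
            mkdir "$dir/d$j" && rmdir "$dir/d$j"
            j=$((j + 1))
        done
        i=$((i + 1))
    done
}
```

A shell loop is far slower than the original C tool, so it may need larger counts or more parallel instances to reproduce the same memory pressure.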
Re: [PATCH] btrfs: limit async_work allocation and worker func duration
On 12/12/2016 06:54 AM, David Sterba wrote:
> On Fri, Dec 02, 2016 at 05:51:36PM -0800, Maxim Patlasov wrote:
>> Problem statement: an unprivileged user who has read-write access to
>> more than one btrfs subvolume may easily consume all kernel memory
>> (eventually triggering the oom-killer).
>>
>> Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):
>>
>> [root@kteam1 ~]# cat prep.sh
>>
>> DEV=/dev/sdb
>> mkfs.btrfs -f $DEV
>> mount $DEV /mnt
>> for i in `seq 1 16`
>> do
>> 	mkdir /mnt/$i
>> 	btrfs subvolume create /mnt/SV_$i
>> 	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
>> 	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
>> 	chmod a+rwx /mnt/$i
>> done
>>
>> [root@kteam1 ~]# sh prep.sh
>>
>> [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done
>>
>> [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
>> kmalloc-128    10144    10144  128  32  1 : tunables 0 0 0 : slabdata 317 317 0
>> kmalloc-128  9992352  9992352  128  32  1 : tunables 0 0 0 : slabdata 312261 312261 0
>> kmalloc-128 24226752 24226752  128  32  1 : tunables 0 0 0 : slabdata 757086 757086 0
>> kmalloc-128 42754240 42754240  128  32  1 : tunables 0 0 0 : slabdata 1336070 1336070 0
>>
>> The huge numbers above come from an insane number of async_work-s
>> allocated and queued by btrfs_wq_run_delayed_node.
>>
>> The problem is caused by btrfs_wq_run_delayed_node() queuing more and
>> more works if the number of delayed items is above
>> BTRFS_DELAYED_BACKGROUND. The worker func (btrfs_async_run_delayed_root)
>> processes at least BTRFS_DELAYED_BATCH items (if they are present in the
>> list). So, the machinery works as expected while the list is almost
>> empty. As soon as it gets bigger, the worker func starts to process more
>> than one item at a time; that takes longer, and the chance of having
>> more async_works queued than needed gets higher.
>>
>> The problem above is worsened by another flaw of the delayed-inode
>> implementation: if async_work was queued in a throttling branch (number
>> of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func
>> won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2.
>> So, it is possible that the func occupies the CPU indefinitely (up to
>> 30 sec in my experiments): while the func is trying to drain the list,
>> user activity may add more and more items to the list.
>
> Nice analysis!
>
>> The patch fixes both problems in a straightforward way: refuse to queue
>> too many works in btrfs_wq_run_delayed_node, and bail out of the worker
>> func if at least BTRFS_DELAYED_WRITEBACK items have been processed.
>>
>> Signed-off-by: Maxim Patlasov
>> ---
>>  fs/btrfs/async-thread.c  |  8 ++++++++
>>  fs/btrfs/async-thread.h  |  1 +
>>  fs/btrfs/delayed-inode.c |  6 ++++--
>>  3 files changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
>> index e0f071f..29f6252 100644
>> --- a/fs/btrfs/async-thread.c
>> +++ b/fs/btrfs/async-thread.c
>> @@ -86,6 +86,14 @@ btrfs_work_owner(struct btrfs_work *work)
>>  	return work->wq->fs_info;
>>  }
>>
>> +bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
>> +{
>> +	int thresh = wq->normal->thresh != NO_THRESHOLD ?
>> +		wq->normal->thresh : num_possible_cpus();
>
> Why not num_online_cpus? I vaguely remember we should be checking online
> cpus, but don't have the mails for reference. We use it elsewhere for
> spreading the work over cpus, but it's still not bullet proof regarding
> cpu onlining/offlining.

Thank you for the review, David!

I borrowed num_possible_cpus from the definition of WQ_UNBOUND_MAX_ACTIVE in workqueue.h, but if btrfs uses num_online_cpus elsewhere, that must be OK as well.

Another problem, which I realized only now, is that nobody increments/decrements wq->normal->pending if thresh == NO_THRESHOLD, so the code looks pretty misleading: it looks as though assigning thresh to num_possible_cpus (or num_online_cpus) matters, but the next line compares it with "pending", which is always zero.

As we don't have any NO_THRESHOLD users of btrfs_workqueue_normal_congested for now, I tend to think it's better to add a descriptive comment and simply return "false" from btrfs_workqueue_normal_congested rather than trying to address some future needs now. Please see v2 of the patch.

Thanks,
Maxim

> Otherwise looks good to me, as far as I can imagine the possible
> behaviour of the various async parameters just from reading the code.
Re: How to get back a deleted sub-volume.
Chris, the "btrfs-show-super -fa" gives me nothing useful to work with.

The "btrfs-find-root -a" is actually something that I was already using (see original post), but the list of roots given had a rather LARGE hole of 200 generations, located between right after I had everything removed and one month before the whole situation.

On 12 December 2016 at 04:14, Chris Murphy wrote:
> Tomasz - try using 'btrfs-find-root -a ' I totally forgot about
> this option. It goes through the extent tree and might have a chance
> of finding additional generations that aren't otherwise being found.
> You can then plug those tree roots into 'btrfs restore -t '
> and do it with the -D and -v options so it's a verbose dry run, and
> see if the file listing it spits out is at all useful - if it has any
> of the data you're looking for.
>
> Chris Murphy
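The per-root dry run Chris describes can be wrapped in a small helper. This is my own sketch, not something posted in the thread (the function name, argument order, and the /dev/null destination trick are assumptions):

```shell
# Dry-run 'btrfs restore' against one candidate tree root reported by
# btrfs-find-root. $1: device, $2: tree root byte number.
try_root() {
    # -t selects the tree root, -D makes it a dry run, -v lists files.
    btrfs restore -t "$2" -D -v "$1" /dev/null
}
```

For example, `try_root /dev/sdX 131072` prints the file listing for that root; inspect it for the data you need before attempting a real restore to a scratch directory.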
Re: [PATCH] Btrfs: Coding style fixes
On Mon, Dec 12, 2016 at 05:11:56PM +0100, David Sterba wrote:
> This type of change is more like a cleanup and you can find more
> instances where the type is applied to just one of the operands, while
> min_t/max_t would be better. Feel free to send a separate patch for
> that.

Thanks for the feedback. I will try to do the sweep in the following days.

I'm sorry, but I didn't quite understand: should I resend the min/min_t change of this patch as a separate patch?
Re: [PATCH] btrfs-progs: Fix disable backtrace assert error
On Thu, Dec 08, 2016 at 08:18:44AM +0800, Qu Wenruo wrote:
> > On 12/06/2016 07:29 PM, Qu Wenruo wrote:
> >> Due to commit 00e769d04c2c83029d6c71 (btrfs-progs: Correct value
> >> printed by assertions/BUG_ON/WARN_ON), which changed the
> >> assert_trace() parameters, the conditions passed to
> >> assert/WARN_ON/BUG_ON are logically inverted between the backtrace
> >> enabled and disabled cases.
> >>
> >> Such behavior makes it easier to pass the wrong value, and in fact it
> >> did cause us to pass the wrong condition to ASSERT().
> >>
> >> Instead of passing different conditions to ASSERT/WARN_ON/BUG_ON()
> >> manually, this patch uses BUG_ON() to implement the remaining
> >> ASSERT/WARN_ON/BUG(), so we don't need to pass 3 different conditions
> >> but only one.
> >>
> >> And to further inform reviewers that the condition should be
> >> different, rename "assert_trace" to "bugon_trace": unlike assert, we
> >> only trigger the bug when the condition is true.
> >>
> >> Also, move WARN_ON() out of the ifdef branch, as it's completely the
> >> same for both branches.
> >>
> >> Cc: Goldwyn Rodrigues
> >> Signed-off-by: Qu Wenruo
> >> ---
> >>  kerncompat.h | 19 +++++++++++--------
> >>  1 file changed, 11 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kerncompat.h b/kerncompat.h
> >> index e374614..be77608 100644
> >> --- a/kerncompat.h
> >> +++ b/kerncompat.h
> >> @@ -277,7 +277,7 @@ static inline long IS_ERR(const void *ptr)
> >>  #define vfree(x) free(x)
> >>
> >>  #ifndef BTRFS_DISABLE_BACKTRACE
> >> -static inline void assert_trace(const char *assertion, const char *filename,
> >> +static inline void bugon_trace(const char *assertion, const char *filename,
> >>  		const char *func, unsigned line, long val)
> >>  {
> >>  	if (!val)
> >
> > To keep confusion to the minimum, you can call this *condition instead
> > of *assertion.
>
> Right, I'll update it.
>
> >> @@ -287,17 +287,20 @@ static inline void assert_trace(const char *assertion, const char *filename,
> >>  	exit(1);
> >>  }
> >>
> >> -#define BUG_ON(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> -#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> -#define ASSERT(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)!(c))
> >> -#define BUG() assert_trace(NULL, __FILE__, __func__, __LINE__, 1)
> >> +#define BUG_ON(c) bugon_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >>  #else
> >>  #define BUG_ON(c) assert(!(c))
> >> -#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> -#define ASSERT(c) assert(!(c))
> >> -#define BUG() assert(0)
> >>  #endif
> >>
> >> +#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> +/*
> >> + * TODO: ASSERT() should be deprecated. In a case like ASSERT(ret == 0), it
> >> + * won't output any useful value for ret.
> >> + * Should be replaced by BUG_ON(ret);
> >> + */
> >> +#define ASSERT(c) BUG_ON(!(c))
> >
> > I am not sure of this. As you are stating, this (double negation) will
> > kill the value of the condition. Won't it be better to remove all
> > ASSERTs first instead of putting this TODO?
>
> IIRC the ASSERT/BUG_ON will be removed step by step.
> And we have about 60+ ASSERTs in the current code base, not an easy thing
> to fix soon.
>
> So I prefer to mark ASSERT() deprecated and remove them in later cleanups.

But asserts have their meaning. We want to get rid of BUG_ONs that are abused for error not-handling. The asserts e.g. catch programmer errors like incorrect combinations of function parameters, "explicit checks for implicit assumptions", and other "never happens" conditions. Anything that depends on external input must not be handled via assert/bug_on.

I have some write-ups in progress about the coding conventions; they will be put into git once they're more or less complete, so we're all on the same page.
Re: [PATCH] recursive defrag cleanup
On Tue, Dec 06, 2016 at 12:39:37PM +0800, Anand Jain wrote:
> The command,
>    btrfs fi defrag -v /btrfs
> does nothing; it won't defrag the files under /btrfs as the user
> may expect. The command with the recursive option,
>    btrfs fi defrag -vr /btrfs
> would defrag all the files under /btrfs, including files in
> its subdirectories.
>
> While attempting to fix this, the patch below 1/1 provides
> a cleanup. The actual fix is pending: to my understanding,
> nftw() does not provide a way to list files without also
> including the files under subdirectories.

What kind of fix do you mean? We could detect if there's a directory in the list of arguments and assume the recursive mode. I think this is what most users expect.

Currently, passing a directory will defragment the extent tree, but I think we should extend the defrag ioctl flags to explicitly ask for that. At minimum, a directory without -r could print a warning about what it's really doing. But I'm open to other ideas.
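Until the tooling grows an implicit recursive mode, a per-file sweep can stand in for -r from a script. This wrapper is my own illustration, not something proposed in the thread; it assumes find and btrfs-progs are in PATH, stays on one filesystem, and does not handle filenames containing newlines:

```shell
# Defragment every regular file under a directory, one invocation per
# file. -xdev keeps find from crossing into other mounted filesystems.
defrag_tree() {
    find "$1" -xdev -type f | while IFS= read -r f; do
        btrfs filesystem defrag -v "$f"
    done
}
```

Usage: `defrag_tree /btrfs` approximates `btrfs fi defrag -vr /btrfs`, at the cost of one process per file.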
Re: btrfs check --repair question
On Mon, Dec 12, 2016 at 5:19 AM, Tim Walberg wrote:
> All -
>
> I have a file system I'm having some issues with. The initial symptoms were that mount
> would run for several hours, either committing or rolling back transactions (primarily
> due to a balance that was running when the system was rebooted for other reasons -
> the skip_balance mount option was specified because of this), but would then be killed
> due to an OOM condition (not much else running on the box at the time - a desktop system
> where everything else was waiting for the mount to finish). That's the background. Kernel
> 4.8.1 - custom config, but otherwise stock kernel - and btrfs-tools 4.8.3.

There were some OOM related issues early in 4.8; I would try 4.8.12 or even 4.8.14.

> Ran btrfs check, and the only thing it reports is a sequence of these:
>
> ref mismatch on [5400814960640 16384] extent item 0, found 1
> Backref 5400814960640 parent 5401010913280 root 5401010913280 not found in extent tree
> backpointer mismatch on [5400814960640 16384]
> owner ref check failed [5400814960640 16384]
>
> Which, to my reading, are simply some missing backrefs, and probably should be one of the
> easier issues to correct, but I know --repair is still considered experimental/dangerous,
> so I thought I'd ask before I run it... Is this a case that --repair can be reasonably
> expected to handle, or would I be better off recreating the file system and restoring from
> either my saved btrfs send archives or the more reliable backups?

It might be that a usebackuproot,ro mount will happen faster, and you can update the backups. Then use --repair. It's still listed as dangerous, but it's gotten quite a bit less dangerous in later versions.

--
Chris Murphy
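The order Chris suggests - refresh backups from a read-only usebackuproot mount, only then attempt --repair - can be condensed into a script. A sketch under assumptions (the function name, rsync flags, and backup target are mine, not from the thread):

```shell
# $1: device, $2: mountpoint, $3: backup destination directory.
# Copy data out while the filesystem is mounted read-only with backup
# roots, and only run the (still experimental) repair afterwards.
recover_then_repair() {
    dev=$1 mnt=$2 backup=$3
    mount -o usebackuproot,ro "$dev" "$mnt" || return 1
    rsync -aHAX "$mnt/" "$backup/"
    umount "$mnt"
    btrfs check --repair "$dev"
}
```

For example, `recover_then_repair /dev/sdX /mnt/recovery /backup/latest` (all three paths are placeholders).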
Re: [RESEND][PATCH v2] btrfs-progs: add dev stats returncode option
On Thu, Dec 08, 2016 at 12:54:14PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-12-08 12:20, David Sterba wrote:
> > On Mon, Dec 05, 2016 at 01:35:20PM -0500, Austin S. Hemmelgarn wrote:
> >> Currently, `btrfs device stats` returns non-zero only when there was an
> >> error getting the counter values. This is fine for when it gets run by a
> >> user directly, but is a serious pain when trying to use it in a script or
> >> for monitoring, since you need to parse the (not at all machine friendly)
> >> output to check the counter values.
> >>
> >> This patch adds an option ('-s') which causes `btrfs device stats`
> >> to set bit 6 in the return code if any of the counters are non-zero.
> >> This greatly simplifies checking from a script or monitoring software if
> >> any errors have been recorded. In the event that this switch is passed
> >> and an error occurs reading the stats, the return code will have bit
> >> 0 set (so if there are errors reading counters, and the counters which
> >> were read were non-zero, the return value will be 65).
> >
> > So a typical check in a script would look for either 64 or 65 returned
> > from the command; I don't think we can do it simpler. The option naming
> > is a bit confusing to me, as it duplicates the 'stats' from the command
> > itself. I'd suggest using '--check' instead; does that sound OK to you?
> >
> > I'll apply the patch as-is for now (and maybe do some cleanups in the
> > surrounding code).
>
> Yeah, --check is fine. Like I said, I'm not too picky about the name as
> long as it works.

Thanks. Changed to -c and added the long option --check for that.
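The return-code contract described above (bit 6 set when any counter is non-zero, bit 0 set when reading stats failed, 65 when both) makes the monitoring check a couple of lines. A sketch, assuming a btrfs-progs build that has the -c/--check option; the function name is mine:

```shell
# Interpret the exit code of `btrfs device stats -c`:
# bit 6 (64) = some error counter is non-zero, bit 0 (1) = stats unreadable.
check_dev_stats() {
    btrfs device stats -c "$1" > /dev/null 2>&1
    rc=$?
    [ $((rc & 64)) -ne 0 ] && echo "non-zero error counters on $1"
    [ $((rc & 1)) -ne 0 ] && echo "failed to read some counters on $1"
    return $rc
}
```

A cron job can call this per filesystem and mail any output, without parsing the human-readable counter listing.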
Re: [PATCH 1/3] Btrfs-progs: subvol_uuid_search: Return error on memory allocation failure
On Sat, Dec 10, 2016 at 07:17:42PM +0530, Prasanth K S R wrote:
> From: Prasanth K S R
>
> This commit fixes coverity defect CID 1328695.
>
> Signed-off-by: Prasanth K S R

Thanks, 1-3 applied.

JFYI, I've realized that subvol_uuid_search is part of the public API, so changing the return value would break existing users (I know that snapper uses this function), as it now returns just a pointer or NULL. I'll fix that; a new function and a library version bump will be needed.
Re: [PATCH] Btrfs: Coding style fixes
On Mon, Dec 12, 2016 at 03:10:05PM +0100, Seraphime Kirkovski wrote:
> Per Documentation/CodingStyle, remove brackets on single-expression
> if statements, add missing spaces after `,` and around `=`,
> remove unnecessary line continuations, add missing blank lines
> after declarations, and fix mixed spaces and tabs.

I'm sure you could find more things to fix; there are e.g. some missing spaces around + or - binary operators. Patches that just fix the style are usually not very welcome, but I think that doing a sweep once in a while is good. We had comment typos fixed in a similar fashion, so I'm not against improving code aesthetics.

> @@ -394,7 +394,7 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg)
> 		q = bdev_get_queue(device->bdev);
> 		if (blk_queue_discard(q)) {
> 			num_devices++;
> -			minlen = min((u64)q->limits.discard_granularity,
> +			minlen = min_t(u64, q->limits.discard_granularity,
> 				     minlen);

This type of change is more like a cleanup, and you can find more instances where the type is applied to just one of the operands, where min_t/max_t would be better. Feel free to send a separate patch for that.
Re: [PATCH] btrfs: limit async_work allocation and worker func duration
On Fri, Dec 02, 2016 at 05:51:36PM -0800, Maxim Patlasov wrote: > Problem statement: unprivileged user who has read-write access to more than > one btrfs subvolume may easily consume all kernel memory (eventually > triggering oom-killer). > > Reproducer (./mkrmdir below essentially loops over mkdir/rmdir): > > [root@kteam1 ~]# cat prep.sh > > DEV=/dev/sdb > mkfs.btrfs -f $DEV > mount $DEV /mnt > for i in `seq 1 16` > do > mkdir /mnt/$i > btrfs subvolume create /mnt/SV_$i > ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2` > mount -t btrfs -o subvolid=$ID $DEV /mnt/$i > chmod a+rwx /mnt/$i > done > > [root@kteam1 ~]# sh prep.sh > > [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done > > [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | > grep -v dma; sleep 60; done > kmalloc-128 10144 10144 128 32 1 : tunables 0 0 0 : > slabdata 317 317 0 > kmalloc-128 9992352 9992352 128 32 1 : tunables 0 0 0 > : slabdata 312261 312261 0 > kmalloc-128 24226752 24226752 128 32 1 : tunables 0 0 > 0 : slabdata 757086 757086 0 > kmalloc-128 42754240 42754240 128 32 1 : tunables 0 0 > 0 : slabdata 1336070 1336070 0 > > The huge numbers above come from insane number of async_work-s allocated > and queued by btrfs_wq_run_delayed_node. > > The problem is caused by btrfs_wq_run_delayed_node() queuing more and more > works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The > worker func (btrfs_async_run_delayed_root) processes at least > BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery > works as expected while the list is almost empty. As soon as it is getting > bigger, worker func starts to process more than one item at a time, it takes > longer, and the chances to have async_works queued more than needed is getting > higher.
> > The problem above is worsened by another flaw of delayed-inode implementation: > if async_work was queued in a throttling branch (number of items >= > BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until > the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that > the func occupies CPU infinitely (up to 30sec in my experiments): while the > func is trying to drain the list, the user activity may add more and more > items to the list. Nice analysis! > The patch fixes both problems in straightforward way: refuse queuing too > many works in btrfs_wq_run_delayed_node and bail out of worker func if > at least BTRFS_DELAYED_WRITEBACK items are processed. > > Signed-off-by: Maxim Patlasov> --- > fs/btrfs/async-thread.c | 8 ++++++++ > fs/btrfs/async-thread.h | 1 + > fs/btrfs/delayed-inode.c | 6 ++++-- > 3 files changed, 13 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c > index e0f071f..29f6252 100644 > --- a/fs/btrfs/async-thread.c > +++ b/fs/btrfs/async-thread.c > @@ -86,6 +86,14 @@ btrfs_work_owner(struct btrfs_work *work) > return work->wq->fs_info; > } > > +bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq) > +{ > + int thresh = wq->normal->thresh != NO_THRESHOLD ? > + wq->normal->thresh : num_possible_cpus(); Why not num_online_cpus? I vaguely remember we should be checking online cpus, but don't have the mails for reference. We use it elsewhere for spreading the work over cpus, but it's still not bulletproof regarding cpu onlining/offlining. Otherwise looks good to me, as far as I can imagine the possible behaviour of the various async parameters just from reading the code.
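The congestion check quoted above reduces to a simple threshold test. A hypothetical userspace sketch of that logic — NO_THRESHOLD and the flat parameters are stand-ins; the real function reads wq->normal->thresh and a pending-work counter from the workqueue itself:

```c
#include <assert.h>
#include <stdbool.h>

#define NO_THRESHOLD (-1)

/* Sketch of the check btrfs_wq_run_delayed_node() can make before
 * queuing yet another async_work: the queue counts as congested once
 * the number of pending works exceeds the threshold, falling back to
 * the CPU count when no explicit threshold is configured. */
static bool wq_congested(int thresh, int pending, int ncpus)
{
	if (thresh == NO_THRESHOLD)
		thresh = ncpus;
	return pending > thresh;
}
```

With such a predicate, the fix is simply to skip queuing new work while the predicate returns true, bounding the number of allocated async_work structures.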
[PATCH] Btrfs: Coding style fixes
Per Documentation/CodingStyle remove brackets on single expression if statements, add missing spaces after `,` and around `=`, remove unnecessary line continuations, add missing blank lines after declarations and fix mixed spaces and tabs. Signed-off-by: Seraphime Kirkovski--- fs/btrfs/ioctl.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 7acbd2c..fc6dc14 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -200,9 +200,9 @@ static int btrfs_ioctl_getflags(struct file *file, void __user *arg) static int check_flags(unsigned int flags) { - if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL | \ - FS_NOATIME_FL | FS_NODUMP_FL | \ - FS_SYNC_FL | FS_DIRSYNC_FL | \ + if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL | + FS_NOATIME_FL | FS_NODUMP_FL | + FS_SYNC_FL | FS_DIRSYNC_FL | FS_NOCOMP_FL | FS_COMPR_FL | FS_NOCOW_FL)) return -EOPNOTSUPP; @@ -301,7 +301,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) if (S_ISREG(mode)) { if (inode->i_size == 0) ip->flags &= ~(BTRFS_INODE_NODATACOW -| BTRFS_INODE_NODATASUM); + | BTRFS_INODE_NODATASUM); } else { ip->flags &= ~BTRFS_INODE_NODATACOW; } @@ -394,7 +394,7 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg) q = bdev_get_queue(device->bdev); if (blk_queue_discard(q)) { num_devices++; - minlen = min((u64)q->limits.discard_granularity, + minlen = min_t(u64, q->limits.discard_granularity, minlen); } } @@ -1464,9 +1464,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, atomic_dec(&root->fs_info->async_submit_draining); } - if (range->compress_type == BTRFS_COMPRESS_LZO) { + if (range->compress_type == BTRFS_COMPRESS_LZO) btrfs_set_fs_incompat(root->fs_info, COMPRESS_LZO); - } ret = defrag_count; @@ -1659,6 +1658,7 @@ static noinline int btrfs_ioctl_snap_create_transid(struct file *file, } else { struct fd src = fdget(fd); struct inode *src_inode; + if (!src.file) { ret = -EINVAL; goto out_drop_write; @@ 
-2236,7 +2236,7 @@ static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info, struct btrfs_path *path; if (dirid == BTRFS_FIRST_FREE_OBJECTID) { - name[0]='\0'; + name[0] = '\0'; return 0; } @@ -2679,7 +2679,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_root *root, void __user *arg) ret = btrfs_init_new_device(root, vol_args->name); if (!ret) - btrfs_info(root->fs_info, "disk added %s",vol_args->name); + btrfs_info(root->fs_info, "disk added %s", vol_args->name); kfree(vol_args); out: @@ -2773,7 +2773,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void __user *arg) mutex_unlock(&root->fs_info->volume_mutex); if (!ret) - btrfs_info(root->fs_info, "disk deleted %s",vol_args->name); + btrfs_info(root->fs_info, "disk deleted %s", vol_args->name); kfree(vol_args); out: atomic_set(&root->fs_info->mutually_exclusive_operation_running, 0); @@ -2927,6 +2927,7 @@ static int lock_extent_range(struct inode *inode, u64 off, u64 len, */ while (1) { struct btrfs_ordered_extent *ordered; + lock_extent(&BTRFS_I(inode)->io_tree, off, off + len - 1); ordered = btrfs_lookup_first_ordered_extent(inode, off + len - 1); @@ -3895,11 +3896,10 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src, if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode)) return -EISDIR; - if (!same_inode) { + if (!same_inode) btrfs_double_inode_lock(src, inode); - } else { + else inode_lock(src); - } /* determine range to clone */ ret = -EINVAL; @@ -4972,11 +4972,10 @@ static long btrfs_ioctl_qgroup_create(struct file *file, void __user *arg) } /* FIXME: check if the IDs really exist */ - if (sa->create) { + if (sa->create) ret = btrfs_create_qgroup(trans, root->fs_info, sa->qgroupid); - } else { + else ret =
Re: [PATCH 2/3] btrfs: cow_file_range() num_bytes and disk_num_bytes are same
On Tue, Dec 06, 2016 at 12:43:08PM +0800, Anand Jain wrote: > This patch deletes local variable disk_num_bytes as its value > is same as num_bytes in the function cow_file_range(). > > Signed-off-by: Anand Jain
Reviewed-by: David Sterba
Re: [PATCH 3/3] btrfs: consolidate auto defrag kick off policies
On Tue, Dec 06, 2016 at 12:43:09PM +0800, Anand Jain wrote: > As of now writes smaller than 64k for non compressed extents and 16k > for compressed extents inside eof are considered as candidate > for auto defrag, put them together at a place. Missing sign-off. > --- > fs/btrfs/inode.c | 27 +++ > 1 file changed, 19 insertions(+), 8 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 79f073e94f2d..b157575166c6 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -388,6 +388,22 @@ static inline int inode_need_compress(struct inode > *inode) > return 0; > } > > +static inline void inode_should_defrag(struct inode *inode, > + u64 start, u64 end, u64 num_bytes, int comp_type) > +{ > + u64 small_write = SZ_64K; > + if (comp_type) > + small_write = SZ_16K; I think the small_write value should be passed directly, not the compression type. > + > + if (!num_bytes) > + num_bytes = end - start + 1; And the callers should pass the range length. Calculating inside the function makes it less clear. One of the callers passes a blocksize-aligned value and the other does not, so both know what the right value is. Otherwise the cleanup is good. I'll merge the other two patches, please update and resend this one. Thanks.
Re: [PATCH 1/1] btrfs: fix improper return value
On Sun, Dec 04, 2016 at 11:57:07AM -0800, Omar Sandoval wrote: > On Sun, Dec 04, 2016 at 12:51:53PM +0800, Pan Bian wrote: > > In function btrfs_uuid_tree_iterate(), errno is assigned to variable ret > > on errors. However, it directly returns 0. It may be better to return > > ret. This patch also removes the warning, because the caller already > > prints a warning. > > > > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188731 > > Looks good. > > Reviewed-by: Omar Sandoval
Added to 4.10 queue.
Re: [PATCH] Btrfs: fix lockdep warning about log_mutex
On Thu, Dec 01, 2016 at 01:45:04PM -0800, Liu Bo wrote: > + mutex_lock_nested(&BTRFS_I(inode)->log_mutex, 1); Please also use SINGLE_DEPTH_NESTING here.
Re: [PATCH] Btrfs: use down_read_nested to make lockdep silent
On Thu, Dec 01, 2016 at 01:44:19PM -0800, Liu Bo wrote: > If @block_group is not @used_bg, it'll try to get @used_bg's lock without > dropping @block_group's lock and lockdep has thrown a scary deadlock warning > about it. > Fix it by using down_read_nested. > > Signed-off-by: Liu Bo> --- > fs/btrfs/extent-tree.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 210c94a..cdb082a 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -7394,7 +7394,8 @@ btrfs_lock_cluster(struct btrfs_block_group_cache > *block_group, > > spin_unlock(&cluster->refill_lock); > > - down_read(&used_bg->data_rwsem); > + /* We should only have one-level nested. */ > + down_read_nested(&used_bg->data_rwsem, 1); While looking at what down_read_nested does, I've found SINGLE_DEPTH_NESTING, defined to 1. I'd rather use it instead of '1', as it gives more pointers to the docs.
Re: [PATCH] Btrfs: add truncated_len for ordered extent tracepoints
On Thu, Dec 01, 2016 at 01:44:10PM -0800, Liu Bo wrote: > This can help us monitor truncated ordered extents. > > Signed-off-by: Liu Bo
Reviewed-by: David Sterba
Re: [PATCH] Btrfs: add 'inode' for extent map tracepoint
On Thu, Dec 01, 2016 at 01:44:01PM -0800, Liu Bo wrote: > TP_STRUCT__entry_btrfs( > __field(u64, root_objectid ) > + __field(u64, ino ) > __field(u64, start ) > __field(u64, len ) > __field(u64, orig_start) > @@ -204,7 +206,8 @@ TRACE_EVENT_CONDITION(btrfs_get_extent, > > TP_fast_assign_btrfs(root->fs_info, > __entry->root_objectid = root->root_key.objectid; > - __entry->start = map->start; > + __entry->ino= btrfs_ino(inode); > + __entry->start = map->start; > __entry->len= map->len; > __entry->orig_start = map->orig_start; > __entry->block_start= map->block_start; > @@ -214,12 +217,12 @@ TRACE_EVENT_CONDITION(btrfs_get_extent, > __entry->compress_type = map->compress_type; > ), > > - TP_printk_btrfs("root = %llu(%s), start = %llu, len = %llu, " > + TP_printk_btrfs("root = %llu(%s), ino = %llu start = %llu, len = %llu, " > "orig_start = %llu, block_start = %llu(%s), " > "block_len = %llu, flags = %s, refs = %u, " > "compress_type = %u", > show_root_type(__entry->root_objectid), > - (unsigned long long)__entry->start, > + __entry->ino, (unsigned long long)__entry->start, I think the (unsigned long long) cast should be used here as well, at least for consistency with the other u64. > (unsigned long long)__entry->len, > (unsigned long long)__entry->orig_start, > show_map_type(__entry->block_start),
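The cast David asks for matters because u64 is 'unsigned long long' on some architectures but plain 'unsigned long' on others, while the "%llu" conversion always expects unsigned long long. A small userspace sketch of the same idiom — format_ino() is a made-up helper, not a kernel function:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Formatting a u64 portably with "%llu" requires an explicit cast to
 * unsigned long long (or PRIu64), exactly like the casts applied to
 * the other u64 fields in the tracepoint's TP_printk(). */
static int format_ino(char *buf, size_t len, uint64_t ino)
{
	return snprintf(buf, len, "ino = %llu", (unsigned long long)ino);
}
```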
btrfs check --repair question
All - I have a file system I'm having some issues with. The initial symptoms were that mount would run for several hours, either committing or rolling back transactions (primarily due to a balance that was running when the system was rebooted for other reasons - the skip_balance mount option was specified because of this), but would then be killed due to an OOM condition (not much else running on the box at the time - a desktop system where everything else was waiting for the mount to finish). That's the background. Kernel 4.8.1 - custom config, but otherwise stock kernel - and btrfs-tools 4.8.3. Ran btrfs check, and the only thing it reports is a sequence of these: ref mismatch on [5400814960640 16384] extent item 0, found 1 Backref 5400814960640 parent 5401010913280 root 5401010913280 not found in extent tree backpointer mismatch on [5400814960640 16384] owner ref check failed [5400814960640 16384] Which, to my reading are simply some missing backrefs, and probably should be one of the easier issues to correct, but I know --repair is still considered experimental/dangerous, so I thought I'd ask before I run it... Is this a case that --repair can be reasonably expected to handle, or would I be better off recreating the file system and restoring from either my saved btrfs send archives or the more reliable backups? tw -- twalb...@gmail.com
Re: btrfs-find-root duration?
On 2016-12-10 20:42, Markus Binsteiner wrote: Hi Xin, thanks. I did not enable autosnap, and I'm pretty sure Debian didn't do it for me either, as I would have seen the subvolumes created by it at some stage. Good to know about this feature though, will definitely use it next time around. BTRFS itself has no such feature. What Xin is probably thinking of is a piece of third-party software called 'snapper' that gets installed by default on at least SUSE and Ubuntu when using BTRFS, but not on Debian because they make no assumptions about individual use-case (and there are _lots_ of people (myself included) who absolutely do not want automatic snapshotting eating up space on our filesystems behind our back). If you do choose to use snapper, make sure to update the settings to fit your needs, as the defaults (at least, the defaults upstream and on SUSE) are absolutely brain-dead and will eat your filesystem (or at least completely kill its performance) in a couple of months on average. Cheers, Markus On Sun, Dec 11, 2016 at 2:17 PM, Xin Zhou wrote: Hi Markus, Some file systems automatically generate snapshots and create a hidden folder for recovery, in case the user accidentally deletes some files. It seems btrfs also has an autosnap feature, so if this option was enabled before the deletion, or snapshots have been manually generated for the volume, then it might be able to perform a fast recovery. Regards, Xin Sent: Saturday, December 10, 2016 at 4:12 PM From: "Markus Binsteiner" To: linux-btrfs@vger.kernel.org Subject: btrfs-find-root duration? It seems I've accidentally deleted all files in my home directory, which sits in its own btrfs partition (lvm on luks). Now I'm trying to find the roots to be able to use btrfs restore later on. btrfs-find-root seems to be taking ages though. I've run it like so: btrfs-find-root /dev/mapper/think--big-home -o 5 > roots.txt After 16 hours, there is still no output, but it's still running utilizing 100% of one core.
Is there any way to gauge how much longer it'll take? Should there have been output already while it's running? When I run it without redirecting stdout, I get: $ btrfs-find-root /dev/mapper/think--big-home -o 5 Superblock doesn't contain generation info for root 5 Superblock doesn't contain the level info for root 5 When I omit the '-o 5', it says: $ btrfs-find-root /dev/mapper/think--big-home Superblock thinks the generation is 593818 Superblock thinks the level is 0 Is the latter the way to run it? Did that initially, but that didn't return any results in a reasonable timeframe either. The filesystem was created with Debian Jessie, but I'm using Ubuntu ( btrfs-progs v4.7.3 ) to try to restore the files at the moment. Thanks!
[PATCH 3/3] btrfs: raid56: Use correct stolen pages to calculate P/Q
In the following situation, scrub will calculate wrong parity to overwrite the correct one:

RAID5 full stripe:

Before
|     Dev 1     |     Dev 2     |     Dev 3     |
| Data stripe 1 | Data stripe 2 | Parity Stripe |
------------------------------------------------- 0
| 0x0000 (Bad)  |    0xcdcd     |    0x0000     |
------------------------------------------------- 4K
|    0xcdcd     |    0xcdcd     |    0x0000     |
...
|    0xcdcd     |    0xcdcd     |    0x0000     |
------------------------------------------------- 64K

After scrubbing dev3 only:
|     Dev 1     |     Dev 2     |     Dev 3     |
| Data stripe 1 | Data stripe 2 | Parity Stripe |
------------------------------------------------- 0
| 0xcdcd (Good) |    0xcdcd     | 0xcdcd (Bad)  |
------------------------------------------------- 4K
|    0xcdcd     |    0xcdcd     |    0x0000     |
...
|    0xcdcd     |    0xcdcd     |    0x0000     |
------------------------------------------------- 64K

The calltrace of such corruption is as follows:

scrub_bio_end_io_worker() gets called for each extent read out
|- scrub_block_complete()
   |- Data extent csum mismatch
   |- scrub_handle_errored_block
      |- scrub_recheck_block()
         |- scrub_submit_raid56_bio_wait()
            |- raid56_parity_recover()

Now we have a rbio with correct data stripe 1 recovered. Let's call it "good_rbio".

scrub_parity_check_and_repair()
|- raid56_parity_submit_scrub_rbio()
   |- lock_stripe_add()
   |  |- steal_rbio()
   |     |- Recovered data is stolen from "good_rbio", stored into
   |        rbio->stripe_pages[].
   |        Now rbio->bio_pages[] are bad data read from disk.
   |- async_scrub_parity()
      |- scrub_parity_work() (delayed call to scrub_parity_work)

scrub_parity_work()
|- raid56_parity_scrub_stripe()
   |- validate_rbio_for_parity_scrub()
      |- finish_parity_scrub()
         |- Recalculate parity using *BAD* pages in rbio->bio_pages[],
            so good parity is overwritten with the *BAD* one

The fix is to introduce 2 new members, bad_ondisk_a/b, to struct btrfs_raid_bio, to tell the scrub code to use the correct data pages to re-calculate parity.
Reported-by: Goffredo Baroncelli
Signed-off-by: Qu Wenruo
--- fs/btrfs/raid56.c | 62 +++ 1 file changed, 58 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index d2a9a1e..453eefd 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -133,6 +133,16 @@ struct btrfs_raid_bio { /* second bad stripe (for raid6 use) */ int failb; + /* +* For steal_rbio, we can steal recovered correct page, +* but in finish_parity_scrub(), we still use bad on-disk +* page to calculate parity. +* Use these members to info finish_parity_scrub() to use +* correct pages +*/ + int bad_ondisk_a; + int bad_ondisk_b; + int scrubp; /* * number of pages needed to represent the full @@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct btrfs_raid_bio *dest) if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags)) return; + /* Record recovered stripe number */ + if (src->faila != -1) + dest->bad_ondisk_a = src->faila; + if (src->failb != -1) + dest->bad_ondisk_b = src->failb; + for (i = 0; i < dest->nr_pages; i++) { s = src->stripe_pages[i]; if (!s || !PageUptodate(s)) { @@ -999,6 +1015,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info, rbio->stripe_npages = stripe_npages; rbio->faila = -1; rbio->failb = -1; + rbio->bad_ondisk_a = -1; + rbio->bad_ondisk_b = -1; atomic_set(&rbio->refs, 1); atomic_set(&rbio->error, 0); atomic_set(&rbio->stripes_pending, 0); @@ -2261,6 +2279,9 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio) int bit; int index; struct page *page; + struct page *bio_page; + void *ptr; + void *bio_ptr; for_each_set_bit(bit, rbio->dbitmap, rbio->stripe_npages) { for (i = 0; i < rbio->real_stripes; i++) { @@ -2271,6 +2292,23 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio) page = alloc_page(GFP_NOFS | __GFP_HIGHMEM); if (!page) return -ENOMEM; + + /* +* It's possible that only several pages need recover, +* and rest are all good. 
+* In that case we need to copy good bio pages into +* stripe_pages[], as we will use stripe_pages[] other +* than bio_pages[] in finish_parity_scrub(). +*/ +
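The bug this patch fixes reduces to RAID5's parity arithmetic: P = D1 xor D2, so recomputing P from a stale on-disk copy of D1 replaces good parity with bad. A userspace sketch using the 0xcdcd/0x0000 pattern from the commit message (calc_parity is an illustrative stand-in for what finish_parity_scrub() recomputes, not kernel code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* RAID5 parity for a two-data-stripe layout: byte-wise XOR of the
 * data stripes. Feeding it the bad on-disk copy of a stripe instead
 * of the recovered copy yields bad parity. */
static void calc_parity(const uint8_t *d1, const uint8_t *d2,
			uint8_t *p, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		p[i] = d1[i] ^ d2[i];
}
```

With both data stripes holding 0xcdcd the parity is all zeroes; recomputing from a zeroed (bad) copy of data stripe 1 produces 0xcdcd parity, which is exactly the "0xcdcd (Bad)" parity in the commit message's "After" table.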
[PATCH 1/3] btrfs: scrub: Introduce full stripe lock for RAID56
Unlike mirror based profiles, RAID5/6 recovery needs to read out the whole full stripe. And if we don't protect it properly, it can easily cause a race condition. Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe() for RAID5/6, which store an rb_tree of mutexes for full stripes, so scrub callers can use them to lock a full stripe to avoid races. Signed-off-by: Qu Wenruo--- fs/btrfs/ctree.h | 4 ++ fs/btrfs/extent-tree.c | 3 + fs/btrfs/scrub.c | 177 + 3 files changed, 184 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 50bcfb8..5896059 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -639,6 +639,10 @@ struct btrfs_block_group_cache { * Protected by free_space_lock. */ int needs_free_space; + + /* Scrub full stripe lock tree for RAID5/6 scrub */ + struct rb_root scrub_lock_root; + spinlock_t scrub_lock; }; /* delayed seq elem */ diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e97302f..3e61604 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -130,6 +130,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache) if (atomic_dec_and_test(&cache->count)) { WARN_ON(cache->pinned > 0); WARN_ON(cache->reserved > 0); + WARN_ON(!RB_EMPTY_ROOT(&cache->scrub_lock_root)); kfree(cache->free_space_ctl); kfree(cache); } @@ -9906,6 +9907,8 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info, atomic_set(&cache->count, 1); spin_lock_init(&cache->lock); + spin_lock_init(&cache->scrub_lock); + cache->scrub_lock_root = RB_ROOT; init_rwsem(&cache->data_rwsem); INIT_LIST_HEAD(&cache->list); INIT_LIST_HEAD(&cache->cluster_list); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 0d63d99..2b9555c 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -240,6 +240,13 @@ struct scrub_warning { struct btrfs_device *dev; }; +struct scrub_full_stripe_lock { + struct rb_node node; + u64 logical; + u64 refs; + struct mutex mutex; +}; + static void scrub_pending_bio_inc(struct scrub_ctx *sctx); static void scrub_pending_bio_dec(struct scrub_ctx *sctx); 
static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx); @@ -351,6 +358,176 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info) } /* + * Caller must hold cache->scrub_root_lock. + * + * Return existing full stripe and increase it refs + * Or return NULL, and insert @fstripe_lock into the bg cache + */ +static struct scrub_full_stripe_lock * +add_scrub_lock(struct btrfs_block_group_cache *cache, + struct scrub_full_stripe_lock *fstripe_lock) +{ + struct rb_node **p; + struct rb_node *parent = NULL; + struct scrub_full_stripe_lock *entry; + + p = &cache->scrub_lock_root.rb_node; + while (*p) { + parent = *p; + entry = rb_entry(parent, struct scrub_full_stripe_lock, node); + if (fstripe_lock->logical < entry->logical) { + p = &(*p)->rb_left; + } else if (fstripe_lock->logical > entry->logical) { + p = &(*p)->rb_right; + } else { + entry->refs++; + return entry; + } + } + /* Insert new one */ + rb_link_node(&fstripe_lock->node, parent, p); + rb_insert_color(&fstripe_lock->node, &cache->scrub_lock_root); + + return NULL; +} + +static struct scrub_full_stripe_lock * +search_scrub_lock(struct btrfs_block_group_cache *cache, u64 bytenr) +{ + struct rb_node *node; + struct scrub_full_stripe_lock *entry; + + node = cache->scrub_lock_root.rb_node; + while (node) { + entry = rb_entry(node, struct scrub_full_stripe_lock, node); + if (bytenr < entry->logical) + node = node->rb_left; + else if (bytenr > entry->logical) + node = node->rb_right; + else + return entry; + } + return NULL; +} + +/* + * Helper to get full stripe logical from a normal bytenr. + * Thanks to the chaos of scrub structures, we need to get it all + * by ourselves, using btrfs_map_sblock(). 
+ */ +static int get_full_stripe_logical(struct btrfs_fs_info *fs_info, u64 bytenr, + u64 *bytenr_ret) +{ + struct btrfs_bio *bbio = NULL; + u64 len; + int ret; + + /* Just use map_sblock() to get full stripe logical */ + ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, bytenr, + &len, &bbio, 0, 1); + if (ret || !bbio || !bbio->raid_map) + goto error; + *bytenr_ret = bbio->raid_map[0]; + btrfs_put_bbio(bbio); + return 0; error: +
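get_full_stripe_logical() above goes through btrfs_map_sblock(), but conceptually it just rounds a logical byte number down to the start of its full stripe, so that every scrub thread touching the same full stripe keys on the same rb-tree entry. A hypothetical userspace sketch of that arithmetic (chunk_start and full_stripe_len would come from the chunk mapping; the round-down formula is the assumption here, not the kernel's literal code path):

```c
#include <assert.h>
#include <stdint.h>

/* Round a logical byte number down to the start of its full stripe.
 * For a 3-device RAID5 with 64K stripes, full_stripe_len would be
 * 2 data stripes * 64K = 128K. */
static uint64_t full_stripe_start(uint64_t chunk_start,
				  uint64_t full_stripe_len,
				  uint64_t bytenr)
{
	return chunk_start +
	       (bytenr - chunk_start) / full_stripe_len * full_stripe_len;
}
```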
[PATCH 2/3] btrfs: scrub: Fix RAID56 recovery race condition
When scrubbing a RAID5 which has recoverable data corruption (only one data stripe is corrupted), sometimes scrub will report more csum errors than expected. Sometimes even an unrecoverable error will be reported. The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2, to avoid extra data space usage
3) Create a 128K file and sync the fs, unmount it. Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of the data chunk on dev3. Dev3 is the 1st data stripe
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error number should be 16 (assuming x86_64). A larger csum error number can be reported with a 1/3 chance. And an unrecoverable error can also be reported with a 1/10 chance. The root cause of the problem is that the RAID5/6 recover code has a race condition, due to the fact that a full scrub is initiated per device. While for other mirror based profiles, each mirror is independent of the others, so a race won't cause any big problem. For example:

Corrupted             | Correct              | Correct
Scrub dev3 (D1)       | Scrub dev2 (D2)      | Scrub dev1 (P)
Read out D1           | Read out D2          | Read full stripe
Check csum            | Check csum           | Check parity
Csum mismatch         | Csum match, continue | Parity mismatch
handle_errored_block  |                      | handle_errored_block
Read out full stripe  |                      | Read out full stripe
D1 csum error (err++) |                      | D1 csum error (err++)
Recover D1            |                      | Recover D1

So D1's csum error is accounted twice, just because handle_errored_block() doesn't have enough protection, and the race can happen. In an even worse case, for example, D1's recovery code is re-writing D1/D2/P, and P's recovery code is just reading out the full stripe, then we can cause an unrecoverable error. This patch will use the previously introduced lock_full_stripe() and unlock_full_stripe() to protect the whole scrub_handle_errored_block() function for RAID56 recovery. 
So no extra csum error nor unrecoverable error.
Reported-by: Goffredo Baroncelli
Signed-off-by: Qu Wenruo
--- fs/btrfs/scrub.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 2b9555c..5b2f94a 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1067,6 +1067,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) unsigned int have_csum; struct scrub_block *sblocks_for_recheck; /* holds one for each mirror */ struct scrub_block *sblock_bad; + int lock_ret; int ret; int mirror_index; int page_num; @@ -1096,6 +1097,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) have_csum = sblock_to_check->pagev[0]->have_csum; dev = sblock_to_check->pagev[0]->dev; + /* +* For RAID5/6 race can happen for different dev scrub thread. +* For data corruption, Parity and Data thread will both try +* to recovery the data. +* Race can lead to double added csum error, or even unrecoverable +* error. +*/ + ret = lock_full_stripe(fs_info, logical, GFP_NOFS); + if (ret < 0) + return ret; + if (sctx->is_dev_replace && !is_metadata && !have_csum) { sblocks_for_recheck = NULL; goto nodatasum_case; @@ -1430,6 +1442,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) kfree(sblocks_for_recheck); } + lock_ret = unlock_full_stripe(fs_info, logical); + if (lock_ret < 0) + return lock_ret; return 0; } -- 2.10.2
[PATCH 0/3 for-4.10] RAID56 scrub fixes
Can be fetched from github: https://github.com/adam900710/linux.git raid56_fixes Fixes 2 scrub bugs: 1) Scrub recovers correct data, but writes wrong parity 2) Scrub reports a wrong csum error number, or even an unrecoverable error The patches are still undergoing xfstests, but the current for-linus-4.10 is already causing a deadlock for btrfs/011, even without the patches. So I'd remove btrfs/011 and continue the test, even though these test cases won't trigger the real recover code. But the current internal test cases are quite good so far. I'll test them for several extra loops, and submit the internal test for reference. (Since it's not suitable for xfstests, I'd only submit the test script, which needs manual probing of the chunk layout) Qu Wenruo (3): btrfs: scrub: Introduce full stripe lock for RAID56 btrfs: scrub: Fix RAID56 recovery race condition btrfs: raid56: Use correct stolen pages to calculate P/Q fs/btrfs/ctree.h | 4 ++ fs/btrfs/extent-tree.c | 3 + fs/btrfs/raid56.c | 62 ++-- fs/btrfs/scrub.c | 192 + 4 files changed, 257 insertions(+), 4 deletions(-) -- 2.10.2