Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-12-12 Thread Zygo Blaxell
On Sun, Dec 11, 2016 at 10:56:59PM +0100, Xin Zhou wrote:
>Hi Zygo,
> 
>Since the corruption happens after I/O and checksum,
>could it be possible to add some bug catcher code in debug build, to help
>narrowing down the issue?

The corruption happens only during reads, after IO and checksum.
It happens at the point that I have patched, after decompression when
the decompressed data doesn't fill the extent and there is a hole after
the extent.  No other code exists in btrfs to fill the hole that forms
in such cases.  There is nothing left to narrow down.

Chris Mason discovered the same bug in 2009 and fixed half of it in commit
93c82d5750.  His fix only fixed one of two cases where corruption occurs
(i.e. after uncompressed extents).  The compressed extent case remained
unfixed until the present.

To be fair to everyone who missed this bug for seven years:  it was
quite a hard bug to spot.  Over a period of 28 months I observed backup
verification failures and eliminated competing theories until only this
bug remained.

There were at least two other bugs fixed between 2009 and 2014 which
would have also produced corrupted data in compressed files under
slightly different circumstances, so anyone looking for data corruption
in compressed extents would have thought they'd solved the problem at
least twice before now.

There was a memset in *almost* the right spot in the code to fix the
corruption bug until early 2014.  Ironically, that memset (which was
introduced in 2009) was itself the cause of another corruption bug.
The memset was removed in 166ae5a418 ("btrfs: fix inline compressed read
err corruption") but that commit removed the memset entirely instead of
fixing it.

It's hard to detect this bug in the wild because the uninitialized
kernel data is often zero by accident (zero is typically the most common
byte value in kernel memory) and the corrupted data is always holes.
The programs that create files with the specific extent structure that
triggers the corruption don't write any data to the corrupted regions
of the files (e.g. the corrupted bytes are unused bytes in ELF headers,
or EXIF thumbnail data in JPEG files).  Such programs typically won't
read the corrupted data bytes, so the corruption often has no visible
impact on users even when it occurs.

The file on disk is not corrupted--the kernel corrupts the file while
reading it--so repeatedly copying and verifying the same files over
and over (e.g. daily backups of the same origin data) will *eventually*
succeed.  That makes the problem look like a hardware issue.  For several
months I thought it *was* a hardware issue, but I kept finding similar
corruption on too many dissimilar machines that were showing no other
evidence of hardware problems.

All that seems to be needed to produce corruption is a filesystem mounted
with the compress option and max_inline >= 2048.  Any version of btrfs
from 2012 to the present behaves the same way (older kernels probably also
behave the same way, but they crash too early to finish running my tests).

So far I've discovered corrupted files in these places in the wild:

- in build directories (e.g. Linux kernel or anything using GCC)

- in /etc/lvm/{archive,backup}/*

- in /var/log/ntpstats

- in backups made with rsync -avxzHSP (but not with rsync -avxzHP)

Here is a one-line script that will find potentially corrupted files on an
existing filesystem.  There is no special setup or repro script required,
other than to set 'compress' in the mount options.  This is just a
developer's work box chosen at random:

# cd /usr/src/linux
# find -type f -size +4097c -exec sh -c 'for x; do if filefrag -v "$x" \
    | sed -n "4p" | grep -q inline; then echo "$x"; fi; done' -- {} +
./arch/x86/kernel/.module.o.cmd
./block/.scsi_ioctl.o.cmd
./drivers/ata/.pata_sis.o.cmd

# filefrag -v ./drivers/ata/.pata_sis.o.cmd
Filesystem type is: 9123683e
File size of ./drivers/ata/.pata_sis.o.cmd is 45677 (12 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             not_aligned,inline
   1:        1..       1:   35322749..  35322749:      1:          1: shared
   2:        2..       2:   35371776..  35371776:      1:   35322750: shared
   3:        3..       3:   35532566..  35532566:      1:   35371777: shared
   4:        4..       4:   35315682..  35315682:      1:   35532567: shared
   5:        5..       5:   36010227..  36010227:      1:   35315683: shared
   6:        6..       9:   35396173..  35396176:      4:   36010228: encoded,shared
   7:       10..      10:   35913574..  35913574:      1:   35396177: shared
   8:       11..      11:   35305969..  35305969:      1:   35913575: last,eof

v4.9-rc7 scrub kernel panic

2016-12-12 Thread Qu Wenruo

Hi

When testing Chris' for-linus-4.10 branch, I found that even at the branch
base, v4.9-rc7, btrfs can't pass quite a lot of scrub tests, including
btrfs/011 and btrfs/069.

btrfs/069 fails 100% of the time, with the following backtrace:

general protection fault:  [#1] SMP
Modules linked in: btrfs(O) xor zlib_deflate raid6_pq x86_pkg_temp_thermal ext4 jbd2 mbcache e1000e efivarfs [last unloaded: btrfs]

CPU: 3 PID: 5300 Comm: kworker/u8:4 Tainted: G   O    4.9.0-rc7+ #20
Hardware name: FUJITSU ESPRIMO P720/D3221-A1, BIOS V4.6.5.4 R1.17.0 for D3221-A1x 03/06/2014

Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
task: 88008dbcb740 task.stack: c9000123
RIP: 0010:[]  [] generic_make_request_checks+0x198/0x5a0

RSP: 0018:c90001233b08  EFLAGS: 00010202
RAX:  RBX: 88007f963228 RCX: 0001
RDX: 8000 RSI:  RDI: 6b6b6b6b6b6b6b6b
RBP: c90001233b68 R08: 868a9b14 R09: eab761b2
R10: 0001 R11: 0001 R12: 0040
R13: 0004 R14: 88008dbc5a88 R15: 0010
FS:  () GS:880119e0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 019c7058 CR3: 0001150b4000 CR4: 001406e0
Stack:
 0002 813a4cff  0296
 0292 88008dbcb740 0003 88007f963228
  0004 88008dbc5a88 0010
Call Trace:
 [] ? generic_make_request+0xcf/0x290
 [] generic_make_request+0x24/0x290
 [] ? generic_make_request+0xcf/0x290
 [] submit_bio+0x6e/0x120
 [] ? page_in_rbio+0x4d/0x80 [btrfs]
 [] ? rbio_orig_end_io+0x80/0x80 [btrfs]
 [] finish_rmw+0x401/0x550 [btrfs]
 [] validate_rbio_for_rmw+0x36/0x40 [btrfs]
 [] raid_rmw_end_io+0x7d/0x90 [btrfs]
 [] bio_endio+0x56/0x60
 [] end_workqueue_fn+0x3c/0x40 [btrfs]
 [] btrfs_scrubparity_helper+0xef/0x610 [btrfs]
 [] btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
 [] process_one_work+0x2af/0x720
 [] ? process_one_work+0x22b/0x720
 [] worker_thread+0x4b/0x4f0
 [] ? process_one_work+0x720/0x720
 [] ? process_one_work+0x720/0x720
 [] kthread+0xf3/0x110
 [] ? kthread_park+0x60/0x60
 [] ret_from_fork+0x27/0x40
Code: 00 00 0f 1f 44 00 00 65 8b 05 9d 71 c6 7e 89 c0 48 0f a3 05 c3 13 b8 00 0f 92 c3 0f 82 dd 02 00 00 bb 01 00 00 00 e9 8c 00 00 00 <48> 8b 47 08 48 8b 40 50 48 c1 f8 09 48 85 c0 0f 84 99 fe ff ff

RIP  [] generic_make_request_checks+0x198/0x5a0
 RSP 

Is this a known bug or a new one caused by the block layer change?

Thanks,
Qu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Fix disable backtrace assert error

2016-12-12 Thread Qu Wenruo



At 12/13/2016 02:00 AM, David Sterba wrote:

On Thu, Dec 08, 2016 at 08:18:44AM +0800, Qu Wenruo wrote:

On 12/06/2016 07:29 PM, Qu Wenruo wrote:

Due to commit 00e769d04c2c83029d6c71 (btrfs-progs: Correct value printed
by assertions/BUG_ON/WARN_ON), which changed the assert_trace()
parameters, the conditions passed to assert/WARN_ON/BUG_ON are logically
negated between the backtrace-enabled and backtrace-disabled cases.

Such behavior makes it easy to pass the wrong value, and in fact it did
cause us to pass the wrong condition for ASSERT().

Instead of manually passing different conditions to ASSERT/WARN_ON/BUG_ON(),
this patch uses BUG_ON() to implement the remaining ASSERT/WARN_ON/BUG(),
so we only need to pass one condition instead of three different ones.

And to remind reviewers that the condition should be different, rename
"assert_trace" to "bugon_trace": unlike assert, we only trigger the bug
when the condition is true.

Also, move WARN_ON() out of the ifdef branch, as it's completely the
same for both branches.

Cc: Goldwyn Rodrigues 
Signed-off-by: Qu Wenruo 
---
 kerncompat.h | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kerncompat.h b/kerncompat.h
index e374614..be77608 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -277,7 +277,7 @@ static inline long IS_ERR(const void *ptr)
 #define vfree(x) free(x)

 #ifndef BTRFS_DISABLE_BACKTRACE
-static inline void assert_trace(const char *assertion, const char *filename,
+static inline void bugon_trace(const char *assertion, const char *filename,
  const char *func, unsigned line, long val)
 {
if (!val)


To keep confusion to the minimum, you can call this *condition instead
of *assertion.


Right, I'll update it.




@@ -287,17 +287,20 @@ static inline void assert_trace(const char *assertion, const char *filename,
 	exit(1);
 }
 
-#define BUG_ON(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
-#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
-#define	ASSERT(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)!(c))
-#define BUG() assert_trace(NULL, __FILE__, __func__, __LINE__, 1)
+#define BUG_ON(c) bugon_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
 #else
 #define BUG_ON(c) assert(!(c))
-#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
-#define ASSERT(c) assert(!(c))
-#define BUG() assert(0)
 #endif

+#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
+/*
+ * TODO: ASSERT() should be deprecated. In a case like ASSERT(ret == 0), it
+ * won't output any useful value for ret.
+ * Should be replaced by BUG_ON(ret);
+ */
+#define	ASSERT(c) BUG_ON(!(c))


I am not sure of this. As you are stating, this (double negation) will
kill the value of the condition. Won't it be better to remove all
ASSERTs first instead of putting this TODO?


IIRC the ASSERT/BUG_ON calls will be removed step by step.
And we have about 60+ ASSERTs in the current code base, so it's not an
easy thing to fix soon.

So I prefer to mark ASSERT() deprecated and remove the calls in later
cleanups.


But asserts have their meaning. We want to get rid of the BUG_ONs that are
abused for error not-handling. The asserts e.g. catch programmer errors
like incorrect combinations of function parameters, "explicit checks for
implicit assumptions", and other "never happens" conditions.  Anything
that depends on external input must not be handled via assert/bug_on.
I have some write-ups in progress about the coding conventions; they will
be put to git once they're more or less complete, so we're all on the
same page.
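The convention described here can be illustrated with a small user-space sketch. Both functions below are invented for illustration; the point is only where assert() belongs and where it does not:

```c
#include <assert.h>
#include <stddef.h>

/* External input: never assert on it -- validate and return an error,
 * because bad input is expected to happen in the wild. */
static int parse_flag(const char *arg)
{
    if (arg == NULL || arg[0] != '-')
        return -1;
    return 0;
}

/* Internal invariant: callers are required to pass an even value, so a
 * violation is a programmer error and assert() is appropriate. */
static size_t half_of(size_t n)
{
    assert(n % 2 == 0);
    return n / 2;
}
```

Following this split, BUG_ON/assert never fires on anything a user can feed the program, only on broken code paths.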




I'm OK to follow the newly established coding standard.
A lot of BUG_ON and ASSERT calls are abused.

But we still need to fix the behavior first.
Otherwise a lot of normal btrfsck operations will fail in the
backtrace-disabled case.


So what about the patch without the comment?

At least it's easier to maintain.

Thanks,
Qu




[PATCH v2] btrfs: limit async_work allocation and worker func duration

2016-12-12 Thread Maxim Patlasov
Problem statement: an unprivileged user with read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh

DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
mkdir /mnt/$i
btrfs subvolume create /mnt/SV_$i
ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | 
grep -v dma; sleep 60; done
kmalloc-128        10144    10144    128   32    1 : tunables    0    0    0 : slabdata     317     317      0
kmalloc-128      9992352  9992352    128   32    1 : tunables    0    0    0 : slabdata  312261  312261      0
kmalloc-128     24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata  757086  757086      0
kmalloc-128     42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0

The huge numbers above come from the insane number of async_works allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works while the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So the machinery
works as expected while the list is almost empty. As soon as it gets bigger,
the worker func starts to process more than one item at a time, it takes
longer, and the chance of having more async_works queued than needed gets
higher.

The problem above is worsened by another flaw of the delayed-inode
implementation: if the async_work was queued in a throttling branch (number
of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't
quit until the number of items drops below BTRFS_DELAYED_BACKGROUND / 2. So
it is possible for the func to occupy a CPU indefinitely (up to 30 sec in my
experiments): while the func is trying to drain the list, user activity may
keep adding more items to it.

The patch fixes both problems in a straightforward way: refuse to queue too
many works in btrfs_wq_run_delayed_node, and bail out of the worker func once
at least BTRFS_DELAYED_WRITEBACK items have been processed.
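Both fixes can be sketched as a toy user-space model. The names and constants below are made up for illustration and do not match the btrfs sources:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two fixes; names and constants are illustrative. */
enum { TOY_THRESH = 4, TOY_DELAYED_WRITEBACK = 16 };

struct toy_wq {
    int pending;    /* number of queued-but-unfinished works */
};

/* mirror of the congestion check: refuse to queue more work once the
 * backlog exceeds twice the threshold */
static bool toy_congested(const struct toy_wq *wq)
{
    return wq->pending > TOY_THRESH * 2;
}

/* the worker bails out after TOY_DELAYED_WRITEBACK items even if the
 * list is still long, so one worker cannot occupy a CPU indefinitely;
 * returns the number of items actually processed */
static int toy_worker(int items_in_list)
{
    int done = 0;

    while (items_in_list > 0 && done < TOY_DELAYED_WRITEBACK) {
        items_in_list--;
        done++;
    }
    return done;
}
```

Queueing is skipped while toy_congested() returns true, which bounds the number of outstanding async_works, and the per-invocation cap bounds the worker's runtime.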

Changed in v2: remove support of thresh == NO_THRESHOLD.

Signed-off-by: Maxim Patlasov 
---
 fs/btrfs/async-thread.c  |   14 ++
 fs/btrfs/async-thread.h  |1 +
 fs/btrfs/delayed-inode.c |6 --
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index e0f071f..63d1977 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,20 @@ btrfs_work_owner(struct btrfs_work *work)
return work->wq->fs_info;
 }
 
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+   /*
+* We could compare wq->normal->pending with num_online_cpus()
+* to support "thresh == NO_THRESHOLD" case, but it requires
+* moving up atomic_inc/dec in thresh_queue/exec_hook. Let's
+* postpone it until someone needs the support of that case.
+*/
+   if (wq->normal->thresh == NO_THRESHOLD)
+   return false;
+
+	return atomic_read(&wq->normal->pending) > wq->normal->thresh * 2;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index 8e52484..1f95973 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -84,4 +84,5 @@ void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int 
max);
 void btrfs_set_work_high_priority(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_work_owner(struct btrfs_work *work);
 struct btrfs_fs_info *btrfs_workqueue_owner(struct __btrfs_workqueue *wq);
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq);
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 3eeb9cd..de946dd 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1356,7 +1356,8 @@ release_path:
total_done++;
 
btrfs_release_prepared_delayed_node(delayed_node);
-   if (async_work->nr == 0 || total_done < async_work->nr)
+   if ((async_work->nr == 0 && total_done < BTRFS_DELAYED_WRITEBACK) ||
+   total_done < async_work->nr)
goto again;
 
 free_path:
@@ -1372,7 +1373,8 @@ static int btrfs_wq_run_delayed_node(struct 
btrfs_delayed_root *delayed_root,
 {
struct btrfs_async_delayed_work *async_work;
 
-	if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND)
+	if (atomic_read(&delayed_root->items) < 

Re: [PATCH] btrfs: limit async_work allocation and worker func duration

2016-12-12 Thread Maxim Patlasov

On 12/12/2016 06:54 AM, David Sterba wrote:


On Fri, Dec 02, 2016 at 05:51:36PM -0800, Maxim Patlasov wrote:

Problem statement: an unprivileged user with read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering the oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh

DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
mkdir /mnt/$i
btrfs subvolume create /mnt/SV_$i
ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | 
grep -v dma; sleep 60; done
kmalloc-128        10144    10144    128   32    1 : tunables    0    0    0 : slabdata     317     317      0
kmalloc-128      9992352  9992352    128   32    1 : tunables    0    0    0 : slabdata  312261  312261      0
kmalloc-128     24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata  757086  757086      0
kmalloc-128     42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0

The huge numbers above come from the insane number of async_works allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works while the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So the machinery
works as expected while the list is almost empty. As soon as it gets bigger,
the worker func starts to process more than one item at a time, it takes
longer, and the chance of having more async_works queued than needed gets
higher.

The problem above is worsened by another flaw of the delayed-inode
implementation: if the async_work was queued in a throttling branch (number
of items >= BTRFS_DELAYED_WRITEBACK), the corresponding worker func won't
quit until the number of items drops below BTRFS_DELAYED_BACKGROUND / 2. So
it is possible for the func to occupy a CPU indefinitely (up to 30 sec in my
experiments): while the func is trying to drain the list, user activity may
keep adding more items to it.

Nice analysis!


The patch fixes both problems in a straightforward way: refuse to queue too
many works in btrfs_wq_run_delayed_node, and bail out of the worker func once
at least BTRFS_DELAYED_WRITEBACK items have been processed.

Signed-off-by: Maxim Patlasov 
---
  fs/btrfs/async-thread.c  |8 
  fs/btrfs/async-thread.h  |1 +
  fs/btrfs/delayed-inode.c |6 --
  3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index e0f071f..29f6252 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -86,6 +86,14 @@ btrfs_work_owner(struct btrfs_work *work)
return work->wq->fs_info;
  }
  
+bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
+{
+	int thresh = wq->normal->thresh != NO_THRESHOLD ?
+			wq->normal->thresh : num_possible_cpus();

Why not num_online_cpus? I vaguely remember we should be checking online
cpus, but don't have the mails for reference. We use it elsewhere for
spreading the work over cpus, but it's still not bullet proof regarding
cpu onlining/offlining.


Thank you for review, David! I borrowed num_possible_cpus from the 
definition of WQ_UNBOUND_MAX_ACTIVE in workqueue.h, but if btrfs uses 
num_online_cpus elsewhere, it must be OK as well.


Another problem that I realized only now is that nobody
increments/decrements wq->normal->pending if thresh == NO_THRESHOLD, so
the code looks pretty misleading: it looks as though assigning thresh to
num_possible_cpus (or num_online_cpus) matters, but the next line
compares it with "pending", which is always zero.


Since we don't have any NO_THRESHOLD users of
btrfs_workqueue_normal_congested for now, I tend to think it's better to
add a descriptive comment and simply return "false" from
btrfs_workqueue_normal_congested rather than trying to address some
future needs now. Please see v2 of the patch.


Thanks,
Maxim



Otherwise looks good to me, as far as I can imagine the possible
behaviour of the various async parameters just from reading the code.




Re: How to get back a deleted sub-volume.

2016-12-12 Thread Tomasz Kusmierz
Chris,

the "btrfs-show-super -fa" gives me nothing useful to work with.

the "btrfs-find-root -a " is actually something that I was
already using (see original post), but the list of roots given had a
rather LARGE hole of 200 generations that is located between right
after I've had everything removed and 1 month before the whole
situation.

On 12 December 2016 at 04:14, Chris Murphy  wrote:
> Tomasz - try using 'btrfs-find-root -a ' I totally forgot about
> this option. It goes through the extent tree and might have a chance
> of finding additional generations that aren't otherwise being found.
> You can then plug those tree roots into 'btrfs restore -t '
> and do it with the -D and -v options so it's a verbose dry run, and
> see if the file listing it spits out is at all useful - if it has any
> of the data you're looking for.
>
>
> Chris Murphy


Re: [PATCH] Btrfs: Coding style fixes

2016-12-12 Thread Seraphime Kirkovski
On Mon, Dec 12, 2016 at 05:11:56PM +0100, David Sterba wrote:
> This type of change is more like a cleanup and you can find more
> instances where the type is applied to just one of the operands, while
> min_t/max_t would be better. Feel free to send a separate patch for
> that.

Thanks for the feedback. I will try to do the sweep in the following
days.

I'm sorry, but I didn't quite understand. Should I resend the min/min_t
change of this patch as a separate patch?


Re: [PATCH] btrfs-progs: Fix disable backtrace assert error

2016-12-12 Thread David Sterba
On Thu, Dec 08, 2016 at 08:18:44AM +0800, Qu Wenruo wrote:
> > On 12/06/2016 07:29 PM, Qu Wenruo wrote:
> >> Due to commit 00e769d04c2c83029d6c71 (btrfs-progs: Correct value printed
> >> by assertions/BUG_ON/WARN_ON), which changed the assert_trace()
> >> parameters, the conditions passed to assert/WARN_ON/BUG_ON are logically
> >> negated between the backtrace-enabled and backtrace-disabled cases.
> >>
> >> Such behavior makes it easy to pass the wrong value, and in fact it did
> >> cause us to pass the wrong condition for ASSERT().
> >>
> >> Instead of manually passing different conditions to ASSERT/WARN_ON/BUG_ON(),
> >> this patch uses BUG_ON() to implement the remaining ASSERT/WARN_ON/BUG(),
> >> so we only need to pass one condition instead of three different ones.
> >>
> >> And to remind reviewers that the condition should be different, rename
> >> "assert_trace" to "bugon_trace": unlike assert, we only trigger the bug
> >> when the condition is true.
> >>
> >> Also, move WARN_ON() out of the ifdef branch, as it's completely the
> >> same for both branches.
> >>
> >> Cc: Goldwyn Rodrigues 
> >> Signed-off-by: Qu Wenruo 
> >> ---
> >>  kerncompat.h | 19 +++
> >>  1 file changed, 11 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kerncompat.h b/kerncompat.h
> >> index e374614..be77608 100644
> >> --- a/kerncompat.h
> >> +++ b/kerncompat.h
> >> @@ -277,7 +277,7 @@ static inline long IS_ERR(const void *ptr)
> >>  #define vfree(x) free(x)
> >>
> >>  #ifndef BTRFS_DISABLE_BACKTRACE
> >> -static inline void assert_trace(const char *assertion, const char *filename,
> >> +static inline void bugon_trace(const char *assertion, const char *filename,
> >>  			const char *func, unsigned line, long val)
> >>  {
> >> 	if (!val)
> >
> > To keep confusion to the minimum, you can call this *condition instead
> > of *assertion.
> 
> Right, I'll update it.
> 
> >
> >> @@ -287,17 +287,20 @@ static inline void assert_trace(const char *assertion, const char *filename,
> >>  	exit(1);
> >>  }
> >>
> >> -#define BUG_ON(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> -#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> -#define	ASSERT(c) assert_trace(#c, __FILE__, __func__, __LINE__, (long)!(c))
> >> -#define BUG() assert_trace(NULL, __FILE__, __func__, __LINE__, 1)
> >> +#define BUG_ON(c) bugon_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >>  #else
> >>  #define BUG_ON(c) assert(!(c))
> >> -#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> -#define ASSERT(c) assert(!(c))
> >> -#define BUG() assert(0)
> >>  #endif
> >>
> >> +#define WARN_ON(c) warning_trace(#c, __FILE__, __func__, __LINE__, (long)(c))
> >> +/*
> >> + * TODO: ASSERT() should be deprecated. In a case like ASSERT(ret == 0), it
> >> + * won't output any useful value for ret.
> >> + * Should be replaced by BUG_ON(ret);
> >> + */
> >> +#define	ASSERT(c) BUG_ON(!(c))
> >
> > I am not sure of this. As you are stating, this (double negation) will
> > kill the value of the condition. Won't it be better to remove all
> > ASSERTs first instead of putting this TODO?
> 
> IIRC the ASSERT/BUG_ON calls will be removed step by step.
> And we have about 60+ ASSERTs in the current code base, so it's not an
> easy thing to fix soon.
> 
> So I prefer to mark ASSERT() deprecated and remove the calls in later cleanups.

But asserts have their meaning. We want to get rid of the BUG_ONs that are
abused for error not-handling. The asserts e.g. catch programmer errors
like incorrect combinations of function parameters, "explicit checks for
implicit assumptions", and other "never happens" conditions.  Anything
that depends on external input must not be handled via assert/bug_on.
I have some write-ups in progress about the coding conventions; they will
be put to git once they're more or less complete, so we're all on the
same page.


Re: [PATCH] recursive defrag cleanup

2016-12-12 Thread David Sterba
On Tue, Dec 06, 2016 at 12:39:37PM +0800, Anand Jain wrote:
> The command,
>    btrfs fi defrag -v /btrfs
>  does nothing; it won't defrag the files under /btrfs as the user
>  may expect. The command with the recursive option,
>    btrfs fi defrag -vr /btrfs
>  would defrag all the files under /btrfs, including files in
>  its subdirectories.
> 
>  While attempting to fix this, the patch below (1/1) provides
>  a cleanup. The actual fix is pending since, to my understanding,
>  nftw() does not provide a way to list files without also including
>  files under subdirectories.

What kind of fix do you mean? We could detect if there's a directory in
the list of arguments and assume the recursive mode. I think this is
what most users expect.

Currently passing a directory will defragment the extent tree, but I
think we should extend the defrag ioctl flags to explicitly ask for that.
At minimum, a directory without -r could print a warning about what it's
really doing. But I'm open to other ideas.


Re: btrfs check --repair question

2016-12-12 Thread Chris Murphy
On Mon, Dec 12, 2016 at 5:19 AM, Tim Walberg  wrote:
> All -
>
> I have a file system I'm having some issues with. The initial symptoms
> were that mount would run for several hours, either committing or
> rolling back transactions (primarily due to a balance that was running
> when the system was rebooted for other reasons - the skip_balance mount
> option was specified because of this), but would then be killed due to
> an OOM condition (not much else running on the box at the time - a
> desktop system where everything else was waiting for the mount to
> finish). That's the background. Kernel 4.8.1 - custom config, but
> otherwise stock kernel - and btrfs-tools 4.8.3.

There were some OOM related issues early in 4.8, I would try 4.8.12 or
even 4.8.14.


>
> Ran btrfs check, and the only thing it reports is a sequence of these:
>
> ref mismatch on [5400814960640 16384] extent item 0, found 1
> Backref 5400814960640 parent 5401010913280 root 5401010913280 not found in extent tree
> backpointer mismatch on [5400814960640 16384]
> owner ref check failed [5400814960640 16384]
>
> Which, to my reading, are simply some missing backrefs, and probably
> should be one of the easier issues to correct, but I know --repair is
> still considered experimental/dangerous, so I thought I'd ask before I
> run it... Is this a case that --repair can be reasonably expected to
> handle, or would I be better off recreating the file system and
> restoring from either my saved btrfs send archives or the more reliable
> backups?

It might be that a usebackuproot,ro mount will happen faster, and you
can update the backups. Then use --repair. It's still listed as
dangerous but it's gotten quite a bit less dangerous in later
versions.


-- 
Chris Murphy


Re: [RESEND][PATCH v2] btrfs-progs: add dev stats returncode option

2016-12-12 Thread David Sterba
On Thu, Dec 08, 2016 at 12:54:14PM -0500, Austin S. Hemmelgarn wrote:
> On 2016-12-08 12:20, David Sterba wrote:
> > On Mon, Dec 05, 2016 at 01:35:20PM -0500, Austin S. Hemmelgarn wrote:
> >> Currently, `btrfs device stats` returns non-zero only when there was an
> >> error getting the counter values.  This is fine for when it gets run by a
> >> user directly, but is a serious pain when trying to use it in a script or
> >> for monitoring since you need to parse the (not at all machine friendly)
> >> output to check the counter values.
> >>
> >> This patch adds an option ('-s') which causes `btrfs device stats`
> >> to set bit 6 in the return code if any of the counters are non-zero.
> >> This greatly simplifies checking from a script or monitoring software if
> >> any errors have been recorded.  In the event that this switch is passed
> >> and an error occurs reading the stats, the return code will have bit
> >> 0 set (so if there are errors reading counters, and the counters which
> >> were read were non-zero, the return value will be 65).
> >
> > So a typical check in a script would look for either 64 or 65 returned
> > from the command; I don't think we can do it any simpler. The option
> > naming is a bit confusing to me, as it duplicates the 'stats' from the
> > command itself. I'd suggest using '--check' instead - does that sound OK
> > to you?
> >
> > I'll apply the patch as-is for now (and maybe do some cleanups in the
> > surrounding code).
> >
> Yeah, --check is fine.  Like I said, I'm not too picky about the name as 
> long as it works.

Thanks. Changed to -c and added the long option --check for that.
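A consumer of the new return code might decode the bits like this. This is a hedged sketch: the helper names are invented, and the convention (bit 6, value 64, for nonzero counters; bit 0 for a read error, so 65 means both) is taken from the patch description above:

```c
#include <assert.h>

/* Invented helper names; the bit layout follows the patch description:
 * bit 6 (64) = some error counter is nonzero, bit 0 (1) = failed to read
 * the stats at all. */
static int counters_nonzero(int status)
{
    return (status & 64) != 0;
}

static int stats_read_error(int status)
{
    return (status & 1) != 0;
}
```

A monitoring wrapper would run `btrfs device stats -c`, capture the exit status, and alert when counters_nonzero() is true.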


Re: [PATCH 1/3] Btrfs-progs: subvol_uuid_search: Return error on memory allocation failure

2016-12-12 Thread David Sterba
On Sat, Dec 10, 2016 at 07:17:42PM +0530, Prasanth K S R wrote:
> From: Prasanth K S R 
> 
> This commit fixes coverity defect CID 1328695.
> 
> Signed-off-by: Prasanth K S R 

Thanks, 1-3 applied.

JFYI, I've realized that subvol_uuid_search is part of the public API so
changing the return value would break the existing users (I know that
snapper uses this function), as it now returns just a pointer or NULL.
I'll fix that, a new function and library version bump would be needed.


Re: [PATCH] Btrfs: Coding style fixes

2016-12-12 Thread David Sterba
On Mon, Dec 12, 2016 at 03:10:05PM +0100, Seraphime Kirkovski wrote:
> Per Documentation/CodingStyle remove brackets on single expression
> if statements, add missing spaces after `,` and around `=`,
> remove unnecessary line continuations, add missing blank lines
> after declarations and fix mixed spaces and tabs.

I'm sure you could find more things to fix, there are e.g. some missing
spaces around + or - binary operators. Patches that just fix the style
are usually not very welcome, but I think that doing a sweep once in a
while is good. We had comment typos fixed in a similar fashion, so I'm not
against improving code aesthetics.

> @@ -394,7 +394,7 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, 
> void __user *arg)
>   q = bdev_get_queue(device->bdev);
>   if (blk_queue_discard(q)) {
>   num_devices++;
> - minlen = min((u64)q->limits.discard_granularity,
> + minlen = min_t(u64, q->limits.discard_granularity,
>minlen);

This type of change is more like a cleanup and you can find more
instances where the type is applied to just one of the operands, while
min_t/max_t would be better. Feel free to send a separate patch for
that.


Re: [PATCH] btrfs: limit async_work allocation and worker func duration

2016-12-12 Thread David Sterba
On Fri, Dec 02, 2016 at 05:51:36PM -0800, Maxim Patlasov wrote:
> Problem statement: unprivileged user who has read-write access to more than
> one btrfs subvolume may easily consume all kernel memory (eventually
> triggering oom-killer).
> 
> Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):
> 
> [root@kteam1 ~]# cat prep.sh
> 
> DEV=/dev/sdb
> mkfs.btrfs -f $DEV
> mount $DEV /mnt
> for i in `seq 1 16`
> do
>   mkdir /mnt/$i
>   btrfs subvolume create /mnt/SV_$i
>   ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
>   mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
>   chmod a+rwx /mnt/$i
> done
> 
> [root@kteam1 ~]# sh prep.sh
> 
> [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done
> 
> [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | 
> grep -v dma; sleep 60; done
> kmalloc-128    10144    10144    128   32    1 : tunables 0 0 0 : slabdata     317     317  0
> kmalloc-128  9992352  9992352    128   32    1 : tunables 0 0 0 : slabdata  312261  312261  0
> kmalloc-128 24226752 24226752    128   32    1 : tunables 0 0 0 : slabdata  757086  757086  0
> kmalloc-128 42754240 42754240    128   32    1 : tunables 0 0 0 : slabdata 1336070 1336070  0
> 
> The huge numbers above come from insane number of async_work-s allocated
> and queued by btrfs_wq_run_delayed_node.
> 
> The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
> works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
> worker func (btrfs_async_run_delayed_root) processes at least
> BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery
> works as expected while the list is almost empty. As soon as it gets
> bigger, the worker func starts to process more than one item at a time, it takes
> longer, and the chance of having more async_works queued than needed gets
> higher.
> 
> The problem above is worsened by another flaw of delayed-inode implementation:
> if async_work was queued in a throttling branch (number of items >=
> BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until
> the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that
> the func occupies CPU infinitely (up to 30sec in my experiments): while the
> func is trying to drain the list, the user activity may add more and more
> items to the list.

Nice analysis!

> The patch fixes both problems in straightforward way: refuse queuing too
> many works in btrfs_wq_run_delayed_node and bail out of worker func if
> at least BTRFS_DELAYED_WRITEBACK items are processed.
> 
> Signed-off-by: Maxim Patlasov 
> ---
>  fs/btrfs/async-thread.c  |8 
>  fs/btrfs/async-thread.h  |1 +
>  fs/btrfs/delayed-inode.c |6 --
>  3 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index e0f071f..29f6252 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -86,6 +86,14 @@ btrfs_work_owner(struct btrfs_work *work)
>   return work->wq->fs_info;
>  }
>  
> +bool btrfs_workqueue_normal_congested(struct btrfs_workqueue *wq)
> +{
> + int thresh = wq->normal->thresh != NO_THRESHOLD ?
> + wq->normal->thresh : num_possible_cpus();

Why not num_online_cpus? I vaguely remember we should be checking online
cpus, but don't have the mails for reference. We use it elsewhere for
spreading the work over cpus, but it's still not bullet proof regarding
cpu onlining/offlining.

Otherwise looks good to me, as far as I can imagine the possible
behaviour of the various async parameters just from reading the code.


[PATCH] Btrfs: Coding style fixes

2016-12-12 Thread Seraphime Kirkovski
Per Documentation/CodingStyle remove brackets on single expression
if statements, add missing spaces after `,` and around `=`,
remove unnecessary line continuations, add missing blank lines
after declarations and fix mixed spaces and tabs.

Signed-off-by: Seraphime Kirkovski 
---
 fs/btrfs/ioctl.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7acbd2c..fc6dc14 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -200,9 +200,9 @@ static int btrfs_ioctl_getflags(struct file *file, void 
__user *arg)
 
 static int check_flags(unsigned int flags)
 {
-   if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL | \
- FS_NOATIME_FL | FS_NODUMP_FL | \
- FS_SYNC_FL | FS_DIRSYNC_FL | \
+   if (flags & ~(FS_IMMUTABLE_FL | FS_APPEND_FL |
+ FS_NOATIME_FL | FS_NODUMP_FL |
+ FS_SYNC_FL | FS_DIRSYNC_FL |
  FS_NOCOMP_FL | FS_COMPR_FL |
  FS_NOCOW_FL))
return -EOPNOTSUPP;
@@ -301,7 +301,7 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (S_ISREG(mode)) {
if (inode->i_size == 0)
ip->flags &= ~(BTRFS_INODE_NODATACOW
-| BTRFS_INODE_NODATASUM);
+   | BTRFS_INODE_NODATASUM);
} else {
ip->flags &= ~BTRFS_INODE_NODATACOW;
}
@@ -394,7 +394,7 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, 
void __user *arg)
q = bdev_get_queue(device->bdev);
if (blk_queue_discard(q)) {
num_devices++;
-   minlen = min((u64)q->limits.discard_granularity,
+   minlen = min_t(u64, q->limits.discard_granularity,
 minlen);
}
}
@@ -1464,9 +1464,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
	atomic_dec(&root->fs_info->async_submit_draining);
}
 
-   if (range->compress_type == BTRFS_COMPRESS_LZO) {
+   if (range->compress_type == BTRFS_COMPRESS_LZO)
btrfs_set_fs_incompat(root->fs_info, COMPRESS_LZO);
-   }
 
ret = defrag_count;
 
@@ -1659,6 +1658,7 @@ static noinline int 
btrfs_ioctl_snap_create_transid(struct file *file,
} else {
struct fd src = fdget(fd);
struct inode *src_inode;
+
if (!src.file) {
ret = -EINVAL;
goto out_drop_write;
@@ -2236,7 +2236,7 @@ static noinline int btrfs_search_path_in_tree(struct 
btrfs_fs_info *info,
struct btrfs_path *path;
 
if (dirid == BTRFS_FIRST_FREE_OBJECTID) {
-   name[0]='\0';
+   name[0] = '\0';
return 0;
}
 
@@ -2679,7 +2679,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_root *root, 
void __user *arg)
ret = btrfs_init_new_device(root, vol_args->name);
 
if (!ret)
-   btrfs_info(root->fs_info, "disk added %s",vol_args->name);
+   btrfs_info(root->fs_info, "disk added %s", vol_args->name);
 
kfree(vol_args);
 out:
@@ -2773,7 +2773,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void 
__user *arg)
	mutex_unlock(&root->fs_info->volume_mutex);
 
if (!ret)
-   btrfs_info(root->fs_info, "disk deleted %s",vol_args->name);
+   btrfs_info(root->fs_info, "disk deleted %s", vol_args->name);
kfree(vol_args);
 out:
	atomic_set(&root->fs_info->mutually_exclusive_operation_running, 0);
@@ -2927,6 +2927,7 @@ static int lock_extent_range(struct inode *inode, u64 
off, u64 len,
 */
while (1) {
struct btrfs_ordered_extent *ordered;
+
		lock_extent(&BTRFS_I(inode)->io_tree, off, off + len - 1);
ordered = btrfs_lookup_first_ordered_extent(inode,
off + len - 1);
@@ -3895,11 +3896,10 @@ static noinline int btrfs_clone_files(struct file 
*file, struct file *file_src,
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
return -EISDIR;
 
-   if (!same_inode) {
+   if (!same_inode)
btrfs_double_inode_lock(src, inode);
-   } else {
+   else
inode_lock(src);
-   }
 
/* determine range to clone */
ret = -EINVAL;
@@ -4972,11 +4972,10 @@ static long btrfs_ioctl_qgroup_create(struct file 
*file, void __user *arg)
}
 
/* FIXME: check if the IDs really exist */
-   if (sa->create) {
+   if (sa->create)
ret = btrfs_create_qgroup(trans, root->fs_info, sa->qgroupid);
-   } else {
+   else
ret = 

Re: [PATCH 2/3] btrfs: cow_file_range() num_bytes and disk_num_bytes are same

2016-12-12 Thread David Sterba
On Tue, Dec 06, 2016 at 12:43:08PM +0800, Anand Jain wrote:
> This patch deletes local variable disk_num_bytes as its value
> is same as num_bytes in the function cow_file_range().
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 


Re: [PATCH 3/3] btrfs: consolidate auto defrag kick off policies

2016-12-12 Thread David Sterba
On Tue, Dec 06, 2016 at 12:43:09PM +0800, Anand Jain wrote:
> As of now writes smaller than 64k for non compressed extents and 16k
> for compressed extents inside eof are considered as candidate
> for auto defrag, put them together at a place.

Missing sign-off.

> ---
>  fs/btrfs/inode.c | 27 +++
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 79f073e94f2d..b157575166c6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -388,6 +388,22 @@ static inline int inode_need_compress(struct inode 
> *inode)
>   return 0;
>  }
>  
> +static inline void inode_should_defrag(struct inode *inode,
> + u64 start, u64 end, u64 num_bytes, int comp_type)
> +{
> + u64 small_write = SZ_64K;
> + if (comp_type)
> + small_write = SZ_16K;

I think the small_write value should be passed directly, not the
compression type.

> +
> + if (!num_bytes)
> + num_bytes = end - start + 1;

And the callers should pass the range length. Calculating it inside the
function makes it less clear. One of the callers passes a blocksize-aligned
value and the other does not, so both know what the right value is.

Otherwise the cleanup is good. I'll merge the other two patches, please
update and resend this one. Thanks.


Re: [PATCH 1/1] btrfs: fix improper return value

2016-12-12 Thread David Sterba
On Sun, Dec 04, 2016 at 11:57:07AM -0800, Omar Sandoval wrote:
> On Sun, Dec 04, 2016 at 12:51:53PM +0800, Pan Bian wrote:
> > In function btrfs_uuid_tree_iterate(), errno is assigned to variable ret
> > on errors. However, it directly returns 0. It may be better to return
> > ret. This patch also removes the warning, because the caller already
> > prints a warning.
> > 
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188731
> 
> Looks good.
> 
> Reviewed-by: Omar Sandoval 

Added to 4.10 queue.


Re: [PATCH] Btrfs: fix lockdep warning about log_mutex

2016-12-12 Thread David Sterba
On Thu, Dec 01, 2016 at 01:45:04PM -0800, Liu Bo wrote:
> + mutex_lock_nested(&BTRFS_I(inode)->log_mutex, 1);

Please also use SINGLE_DEPTH_NESTING here.


Re: [PATCH] Btrfs: use down_read_nested to make lockdep silent

2016-12-12 Thread David Sterba
On Thu, Dec 01, 2016 at 01:44:19PM -0800, Liu Bo wrote:
> If @block_group is not @used_bg, it'll try to get @used_bg's lock without
> dropping @block_group's lock, and lockdep has thrown a scary deadlock warning
> about it.
> Fix it by using down_read_nested.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/extent-tree.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 210c94a..cdb082a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7394,7 +7394,8 @@ btrfs_lock_cluster(struct btrfs_block_group_cache 
> *block_group,
>  
> 	spin_unlock(&cluster->refill_lock);
>  
> -	down_read(&used_bg->data_rwsem);
> +	/* We should only have one-level nested. */
> +	down_read_nested(&used_bg->data_rwsem, 1);

While looking at what down_read_nested does, I've found
SINGLE_DEPTH_NESTING, defined to 1. I'd rather use it instead of '1', as
it gives more pointers to the docs.


Re: [PATCH] Btrfs: add truncated_len for ordered extent tracepoints

2016-12-12 Thread David Sterba
On Thu, Dec 01, 2016 at 01:44:10PM -0800, Liu Bo wrote:
> This can help us monitor truncated ordered extents.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: add 'inode' for extent map tracepoint

2016-12-12 Thread David Sterba
On Thu, Dec 01, 2016 at 01:44:01PM -0800, Liu Bo wrote:
>   TP_STRUCT__entry_btrfs(
>   __field(u64,  root_objectid )
> + __field(u64,  ino   )
>   __field(u64,  start )
>   __field(u64,  len   )
>   __field(u64,  orig_start)
> @@ -204,7 +206,8 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
>  
>   TP_fast_assign_btrfs(root->fs_info,
>   __entry->root_objectid  = root->root_key.objectid;
> - __entry->start  = map->start;
> + __entry->ino= btrfs_ino(inode);
> + __entry->start  = map->start;
>   __entry->len= map->len;
>   __entry->orig_start = map->orig_start;
>   __entry->block_start= map->block_start;
> @@ -214,12 +217,12 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
>   __entry->compress_type  = map->compress_type;
>   ),
>  
> - TP_printk_btrfs("root = %llu(%s), start = %llu, len = %llu, "
> + TP_printk_btrfs("root = %llu(%s), ino = %llu start = %llu, len = %llu, "
> "orig_start = %llu, block_start = %llu(%s), "
> "block_len = %llu, flags = %s, refs = %u, "
> "compress_type = %u",
> show_root_type(__entry->root_objectid),
> -   (unsigned long long)__entry->start,
> +   __entry->ino, (unsigned long long)__entry->start,

I think the (unsigned long long) cast should be used here as well, at
least for consistency with the other u64.

> (unsigned long long)__entry->len,
> (unsigned long long)__entry->orig_start,
> show_map_type(__entry->block_start),


btrfs check --repair question

2016-12-12 Thread Tim Walberg
All -

I have a file system I'm having some issues with. The initial symptoms were
that mount would run for several hours, either committing or rolling back
transactions (primarily due to a balance that was running when the system was
rebooted for other reasons - the skip_balance mount option was specified
because of this), but would then be killed due to an OOM condition (not much
else running on the box at the time - a desktop system where everything else
was waiting for the mount to finish). That's the background. Kernel 4.8.1 -
custom config, but otherwise stock kernel - and btrfs-tools 4.8.3.

Ran btrfs check, and the only thing it reports is a sequence of these:

ref mismatch on [5400814960640 16384] extent item 0, found 1
Backref 5400814960640 parent 5401010913280 root 5401010913280 not found in 
extent tree
backpointer mismatch on [5400814960640 16384]
owner ref check failed [5400814960640 16384]

Which, to my reading, are simply some missing backrefs, and probably should be
one of the easier issues to correct, but I know --repair is still considered
experimental/dangerous, so I thought I'd ask before I run it... Is this a case
that --repair can be reasonably expected to handle, or would I be better off
recreating the file system and restoring from either my saved btrfs send
archives or the more reliable backups?


  tw



-- 
twalb...@gmail.com


Re: btrfs-find-root duration?

2016-12-12 Thread Austin S. Hemmelgarn

On 2016-12-10 20:42, Markus Binsteiner wrote:

Hi Xin,

thanks. I did not enable autosnap, and I'm pretty sure Debian didn't
do it for me either, as I would have seen the subvolumes created by it
at some stage. Good to know about this feature though, will definitely
use it next time around.
BTRFS itself has no such feature.  What Xin is probably thinking of is a
piece of third-party software called 'snapper' that gets installed by
default on at least SUSE and Ubuntu when using BTRFS, but not on Debian,
because they make no assumptions about individual use cases (and there
are _lots_ of people (myself included) who absolutely do not want
automatic snapshotting eating up space on our filesystems behind our
backs).  If you do choose to use snapper, make sure to update the
settings to fit your needs, as the defaults (at least, the defaults
upstream and on SUSE) are absolutely brain-dead and will eat your
filesystem (or at least completely kill its performance) in a couple of
months on average.


Cheers,
Markus

On Sun, Dec 11, 2016 at 2:17 PM, Xin Zhou  wrote:

Hi Markus,

Some file systems automatically generate snapshots and create a hidden
folder for recovery,
in case the user accidentally deletes some files.

It seems btrfs also has an autosnap feature,
so if this option had been enabled before the deletion,
or the volume has manually generated snapshots, then probably it might
be able to perform a fast recovery.

Regards,
Xin

Sent: Saturday, December 10, 2016 at 4:12 PM
From: "Markus Binsteiner" 
To: linux-btrfs@vger.kernel.org
Subject: btrfs-find-root duration?
It seems I've accidentally deleted all files in my home directory,
which sits in its own btrfs partition (lvm on luks). Now I'm trying to
find the roots to be able to use btrfs restore later on.

btrfs-find-root seems to be taking ages though. I've run it like so:

btrfs-find-root /dev/mapper/think--big-home -o 5 > roots.txt

After 16 hours, there is still no output, but it's still running
utilizing 100% of one core. Is there any way to gauge how much longer
it'll take? Should there have been output already while it's running?

When I run it without redirecting stdout, I get:

$ btrfs-find-root /dev/mapper/think--big-home -o 5

Superblock doesn't contain generation info for root 5
Superblock doesn't contain the level info for root 5

When I omit the '-o 5', it says:

$ btrfs-find-root /dev/mapper/think--big-home

Superblock thinks the generation is 593818
Superblock thinks the level is 0

Is the latter the way to run it? Did that initially, but that didn't
return any results in a reasonable timeframe either.

The filesystem was created with Debian Jessie, but I'm using Ubuntu (
btrfs-progs v4.7.3 ) to try to restore the files at the moment.

Thanks!






[PATCH 3/3] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-12-12 Thread Qu Wenruo
In the following situation, scrub will calculate a wrong parity and
overwrite the correct one:

RAID5 full stripe:

Before
| Dev 1          | Dev 2          | Dev 3         |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--- 0
| 0x0000 (Bad)   | 0xcdcd         | 0x0000        |
--- 4K
| 0xcdcd         | 0xcdcd         | 0x0000        |
...
| 0xcdcd         | 0xcdcd         | 0x0000        |
--- 64K

After scrubbing dev3 only:

| Dev 1          | Dev 2          | Dev 3         |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--- 0
| 0xcdcd (Good)  | 0xcdcd         | 0xcdcd (Bad)  |
--- 4K
| 0xcdcd         | 0xcdcd         | 0x0000        |
...
| 0xcdcd         | 0xcdcd         | 0x0000        |
--- 64K

The calltrace of such corruption is as follows:

scrub_bio_end_io_worker() gets called for each extent read out
|- scrub_block_complete()
   |- Data extent csum mismatch
   |- scrub_handle_errored_block()
      |- scrub_recheck_block()
         |- scrub_submit_raid56_bio_wait()
            |- raid56_parity_recover()

Now we have a rbio with correct data stripe 1 recovered.
Let's call it "good_rbio".

scrub_parity_check_and_repair()
|- raid56_parity_submit_scrub_rbio()
   |- lock_stripe_add()
   |  |- steal_rbio()
   | |- Recovered data are stolen from "good_rbio", stored into
   |rbio->stripe_pages[]
   |Now rbio->bio_pages[] are bad data read from disk.
   |- async_scrub_parity()
  |- scrub_parity_work() (delayed_call to scrub_parity_work)

scrub_parity_work()
|- raid56_parity_scrub_stripe()
   |- validate_rbio_for_parity_scrub()
  |- finish_parity_scrub()
 |- Recalculate parity using *BAD* pages in rbio->bio_pages[]
So good parity is overwritten with *BAD* one

The fix is to introduce 2 new members, bad_ondisk_a/b, in struct
btrfs_raid_bio, to tell the scrub code to use the correct data pages to
re-calculate parity.

Reported-by: Goffredo Baroncelli 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/raid56.c | 62 +++
 1 file changed, 58 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d2a9a1e..453eefd 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -133,6 +133,16 @@ struct btrfs_raid_bio {
/* second bad stripe (for raid6 use) */
int failb;
 
+   /*
+* For steal_rbio, we can steal recovered correct page,
+* but in finish_parity_scrub(), we still use bad on-disk
+* page to calculate parity.
+* Use these members to info finish_parity_scrub() to use
+* correct pages
+*/
+   int bad_ondisk_a;
+   int bad_ondisk_b;
+
int scrubp;
/*
 * number of pages needed to represent the full
@@ -310,6 +320,12 @@ static void steal_rbio(struct btrfs_raid_bio *src, struct 
btrfs_raid_bio *dest)
	if (!test_bit(RBIO_CACHE_READY_BIT, &src->flags))
return;
 
+   /* Record recovered stripe number */
+   if (src->faila != -1)
+   dest->bad_ondisk_a = src->faila;
+   if (src->failb != -1)
+   dest->bad_ondisk_b = src->failb;
+
for (i = 0; i < dest->nr_pages; i++) {
s = src->stripe_pages[i];
if (!s || !PageUptodate(s)) {
@@ -999,6 +1015,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_fs_info *fs_info,
rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
+   rbio->bad_ondisk_a = -1;
+   rbio->bad_ondisk_b = -1;
	atomic_set(&rbio->refs, 1);
	atomic_set(&rbio->error, 0);
	atomic_set(&rbio->stripes_pending, 0);
@@ -2261,6 +2279,9 @@ static int alloc_rbio_essential_pages(struct 
btrfs_raid_bio *rbio)
int bit;
int index;
struct page *page;
+   struct page *bio_page;
+   void *ptr;
+   void *bio_ptr;
 
for_each_set_bit(bit, rbio->dbitmap, rbio->stripe_npages) {
for (i = 0; i < rbio->real_stripes; i++) {
@@ -2271,6 +2292,23 @@ static int alloc_rbio_essential_pages(struct 
btrfs_raid_bio *rbio)
page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
if (!page)
return -ENOMEM;
+
+   /*
+* It's possible that only several pages need recover,
+* and rest are all good.
+* In that case we need to copy good bio pages into
+* stripe_pages[], as we will use stripe_pages[] other
+* than bio_pages[] in finish_parity_scrub().
+*/
+   

[PATCH 1/3] btrfs: scrub: Introduce full stripe lock for RAID56

2016-12-12 Thread Qu Wenruo
Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't provide proper protection, it can easily cause a race
condition.

Introduce 2 new functions, lock_full_stripe() and unlock_full_stripe(),
for RAID5/6, which store an rb_tree of mutexes for full stripes, so scrub
callers can use them to lock a full stripe to avoid races.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 177 +
 3 files changed, 184 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 50bcfb8..5896059 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -639,6 +639,10 @@ struct btrfs_block_group_cache {
 * Protected by free_space_lock.
 */
int needs_free_space;
+
+   /* Scrub full stripe lock tree for RAID5/6 scrub */
+   struct rb_root scrub_lock_root;
+   spinlock_t scrub_lock;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e97302f..3e61604 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -130,6 +130,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache 
*cache)
	if (atomic_dec_and_test(&cache->count)) {
		WARN_ON(cache->pinned > 0);
		WARN_ON(cache->reserved > 0);
+		WARN_ON(!RB_EMPTY_ROOT(&cache->scrub_lock_root));
kfree(cache->free_space_ctl);
kfree(cache);
}
@@ -9906,6 +9907,8 @@ btrfs_create_block_group_cache(struct btrfs_fs_info 
*fs_info,
 
	atomic_set(&cache->count, 1);
	spin_lock_init(&cache->lock);
+	spin_lock_init(&cache->scrub_lock);
+	cache->scrub_lock_root = RB_ROOT;
	init_rwsem(&cache->data_rwsem);
	INIT_LIST_HEAD(&cache->list);
	INIT_LIST_HEAD(&cache->cluster_list);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 0d63d99..2b9555c 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -240,6 +240,13 @@ struct scrub_warning {
struct btrfs_device *dev;
 };
 
+struct scrub_full_stripe_lock {
+   struct rb_node node;
+   u64 logical;
+   u64 refs;
+   struct mutex mutex;
+};
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
@@ -351,6 +358,176 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info 
*fs_info)
 }
 
 /*
+ * Caller must hold cache->scrub_lock.
+ *
+ * Return the existing full stripe lock and increase its refs,
+ * or return NULL and insert @fstripe_lock into the bg cache
+ */
+static struct scrub_full_stripe_lock *
+add_scrub_lock(struct btrfs_block_group_cache *cache,
+  struct scrub_full_stripe_lock *fstripe_lock)
+{
+   struct rb_node **p;
+   struct rb_node *parent = NULL;
+   struct scrub_full_stripe_lock *entry;
+
+   p = &cache->scrub_lock_root.rb_node;
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct scrub_full_stripe_lock, node);
+   if (fstripe_lock->logical < entry->logical) {
+   p = &(*p)->rb_left;
+   } else if (fstripe_lock->logical > entry->logical) {
+   p = &(*p)->rb_right;
+   } else {
+   entry->refs++;
+   return entry;
+   }
+   }
+   /* Insert new one */
+   rb_link_node(&fstripe_lock->node, parent, p);
+   rb_insert_color(&fstripe_lock->node, &cache->scrub_lock_root);
+
+   return NULL;
+}
+
+static struct scrub_full_stripe_lock *
+search_scrub_lock(struct btrfs_block_group_cache *cache, u64 bytenr)
+{
+   struct rb_node *node;
+   struct scrub_full_stripe_lock *entry;
+
+   node = cache->scrub_lock_root.rb_node;
+   while (node) {
+   entry = rb_entry(node, struct scrub_full_stripe_lock, node);
+   if (bytenr < entry->logical)
+   node = node->rb_left;
+   else if (bytenr > entry->logical)
+   node = node->rb_right;
+   else
+   return entry;
+   }
+   return NULL;
+}
+
+/*
+ * Helper to get full stripe logical from a normal bytenr.
+ * Thanks to the chaos of scrub structures, we need to get it all
+ * by ourselves, using btrfs_map_sblock().
+ */
+static int get_full_stripe_logical(struct btrfs_fs_info *fs_info, u64 bytenr,
+  u64 *bytenr_ret)
+{
+   struct btrfs_bio *bbio = NULL;
+   u64 len;
+   int ret;
+
+   /* Just use map_sblock() to get full stripe logical */
+   ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, bytenr,
+  &bbio, &len, 0, 1);
+   if (ret || !bbio || !bbio->raid_map)
+   goto error;
+   *bytenr_ret = bbio->raid_map[0];
+   btrfs_put_bbio(bbio);
+   return 0;
+error:
+   

[PATCH 2/3] btrfs: scrub: Fix RAID56 recovery race condition

2016-12-12 Thread Qu Wenruo
When scrubbing a RAID5 which has recoverable data corruption (only one
data stripe is corrupted), sometimes scrub will report more csum errors
than expected. Sometimes even an unrecoverable error will be reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error number should be 16 (assuming x86_64).
A larger csum error number can be reported with a 1/3 chance,
and an unrecoverable error can also be reported with a 1/10 chance.

The root cause of the problem is that the RAID5/6 recovery code has a race
condition, due to the fact that a full scrub is initiated per device.

For other mirror based profiles, each mirror is independent of the
others, so a race won't cause any big problem.

For example:
Corrupted   |   Correct  |  Correct|
|   Scrub dev3 (D1) |Scrub dev2 (D2) |Scrub dev1(P)|

Read out D1 |Read out D2 |Read full stripe |
Check csum  |Check csum  |Check parity |
Csum mismatch   |Csum match, continue|Parity mismatch  |
handle_errored_block||handle_errored_block |
 Read out full stripe   || Read out full stripe|
 D1 csum error(err++)   || D1 csum error(err++)|
 Recover D1 || Recover D1  |

So D1's csum error is accounted twice, just because
handle_errored_block() doesn't have enough protection, and the race can
happen.

In an even worse case, for example, D1's recovery code is re-writing
D1/D2/P while P's recovery code is just reading out the full stripe; then we
can cause an unrecoverable error.

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery, so there are no extra csum errors and no
unrecoverable errors.

Reported-by: Goffredo Baroncelli 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/scrub.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2b9555c..5b2f94a 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1067,6 +1067,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
unsigned int have_csum;
struct scrub_block *sblocks_for_recheck; /* holds one for each mirror */
struct scrub_block *sblock_bad;
+   int lock_ret;
int ret;
int mirror_index;
int page_num;
@@ -1096,6 +1097,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
have_csum = sblock_to_check->pagev[0]->have_csum;
dev = sblock_to_check->pagev[0]->dev;
 
+   /*
+    * For RAID5/6, a race can happen between different per-device scrub
+    * threads: on data corruption, the parity and data threads will both
+    * try to recover the data.
+    * The race can lead to a doubly accounted csum error, or even an
+    * unrecoverable error.
+    */
+   ret = lock_full_stripe(fs_info, logical, GFP_NOFS);
+   if (ret < 0)
+   return ret;
+
if (sctx->is_dev_replace && !is_metadata && !have_csum) {
sblocks_for_recheck = NULL;
goto nodatasum_case;
@@ -1430,6 +1442,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
kfree(sblocks_for_recheck);
}
 
+   lock_ret = unlock_full_stripe(fs_info, logical);
+   if (lock_ret < 0)
+   return lock_ret;
return 0;
 }
 
-- 
2.10.2



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3 for-4.10] RAID56 scrub fixes

2016-12-12 Thread Qu Wenruo
Can be fetched from github:
https://github.com/adam900710/linux.git raid56_fixes

Fixes 2 scrub bugs:
1) Scrub recover correct data, but wrong parity
2) Scrub report wrong csum error number, or even unrecoverable error

The patches are still undergoing xfstests, but the current for-linus-4.10
branch already causes a deadlock in btrfs/011, even without these patches.

So I removed btrfs/011 and continued the tests; those test cases wouldn't
trigger the real recovery code anyway.

But the current internal test cases look quite good so far.
I'll run them for several extra loops, and submit the internal test for
reference.
(Since it's not suitable for xfstests, I'll only submit the test script,
which requires manually probing the chunk layout.)

Qu Wenruo (3):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: raid56: Use correct stolen pages to calculate P/Q

 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/raid56.c  |  62 ++--
 fs/btrfs/scrub.c   | 192 +
 4 files changed, 257 insertions(+), 4 deletions(-)

-- 
2.10.2


