[PATCH] btrfs-progs: image: Rebuild dev extents for restore

2018-08-16 Thread Qu Wenruo
When restoring an image from a dump of multiple devices, we just restore
the devices as is, even if the restore destination is a single device.

In that case, due to the dev extent mismatch, the latest kernel will
refuse to mount it, as it now detects such mismatches as well as btrfs
check does.

Fix it by rebuilding the whole device tree when finishing the restore, so
that neither the kernel nor btrfs check complains about the restored
image any more.

This fixes the misc/021 test case.

Reported-by: Nikolay Borisov 
Signed-off-by: Qu Wenruo 
---
 ctree.h  |   6 +++
 image/main.c | 129 +++
 2 files changed, 135 insertions(+)

diff --git a/ctree.h b/ctree.h
index 4719962df67d..69bf3be4b9b1 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1707,6 +1707,12 @@ BTRFS_SETGET_FUNCS(dev_extent_chunk_offset, struct 
btrfs_dev_extent,
   chunk_offset, 64);
 BTRFS_SETGET_FUNCS(dev_extent_length, struct btrfs_dev_extent, length, 64);
 
+BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_chunk_tree, struct btrfs_dev_extent,
+  chunk_tree, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_chunk_objectid,
+  struct btrfs_dev_extent, chunk_objectid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_chunk_offset, struct btrfs_dev_extent,
+  chunk_offset, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_dev_extent_length, struct btrfs_dev_extent,
 length, 64);
 
diff --git a/image/main.c b/image/main.c
index 351c5a256938..9a899105f65e 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2171,6 +2171,128 @@ again:
return 0;
 }
 
+/* Insert dev extents using one chunk (@ce) */
+static int insert_dev_extents(struct btrfs_trans_handle *trans,
+ struct cache_extent *ce)
+{
+   struct btrfs_root *root = trans->fs_info->dev_root;
+   struct btrfs_key key;
+   struct btrfs_dev_extent de;
+   struct map_lookup *map;
+   u64 stripe_len;
+   int i;
+   int ret;
+
+   map = container_of(ce, struct map_lookup, ce);
+
+   stripe_len = calc_stripe_length(map->type, ce->size,
+   map->num_stripes);
+
+   /* These members are shared between dev extents of this chunk */
+   btrfs_set_stack_dev_extent_chunk_objectid(&de,
+   BTRFS_FIRST_CHUNK_TREE_OBJECTID);
+   btrfs_set_stack_dev_extent_chunk_tree(&de, BTRFS_CHUNK_TREE_OBJECTID);
+   btrfs_set_stack_dev_extent_length(&de, stripe_len);
+   btrfs_set_stack_dev_extent_chunk_offset(&de, ce->start);
+   read_extent_buffer(root->node, btrfs_dev_extent_chunk_tree_uuid(&de),
+   btrfs_header_chunk_tree_uuid(root->node),
+   BTRFS_UUID_SIZE);
+
+   /* Insert dev extents items */
+   for (i = 0; i < map->num_stripes; i++) {
+   key.objectid = map->stripes[i].dev->devid;
+   key.offset = map->stripes[i].physical;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+
+   ret = btrfs_insert_item(trans, root, &key, &de, sizeof(de));
+   if (ret < 0) {
+   error(
+   "failed to insert dev extent for devid %llu offset %llu: %s",
+ key.objectid, key.offset, strerror(-ret));
+   return ret;
+   }
+   }
+   return 0;
+}
+
+/*
+ * btrfs-image restore just restores the tree dump as is.
+ * For most trees this behavior is fine, as we have chunk level mapping.
+ * But for trees that hold info related to physical offsets, namely the
+ * device tree, we need to fix the dev extents to reflect the real chunk
+ * mapping.
+ */
+static int rebuild_dev_extents(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_root *root = fs_info->dev_root;
+   struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+   struct btrfs_trans_handle *trans;
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct cache_extent *ce;
+   int ret;
+
+   trans = btrfs_start_transaction(root, 1);
+   if (IS_ERR(trans)) {
+   ret = PTR_ERR(trans);
+   error("failed to start transaction: %s", strerror(-ret));
+   return ret;
+   }
+
+   key.objectid = 1;
+   key.offset = 0;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   btrfs_init_path(&path);
+   ret = btrfs_search_slot(trans, root, &key, &path, -1, 1);
+   if (ret < 0) {
+   error("failed to search dev extents: %s", strerror(-ret));
+   goto out;
+   }
+
+   /*
+    * Delete all existing dev extents first. There is no function to drop
+    * a whole tree yet, so do it by deleting all items.
+    */
+   while (1) {
+   int nr_items = btrfs_header_nritems(path.nodes[0]);
+
+   if (path.nodes[1] == NULL && path.slots[0] >= nr_items)
+   break;
+
+   ret = btrfs_del_items(trans, root, &path, path.slots[0],
+  
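The length that insert_dev_extents() stores in every dev extent is the per-device stripe length of the chunk, which depends on the chunk profile. The block below is a minimal, self-contained sketch of that calculation. The enum and the function name are hypothetical stand-ins, and the per-profile formulas are an assumption about what calc_stripe_length() computes, not a copy of the btrfs-progs code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical profile tags; the real code tests BTRFS_BLOCK_GROUP_* bits. */
enum toy_profile {
	TOY_SINGLE,	/* also covers DUP and RAID1: each stripe is a full copy */
	TOY_RAID0,
	TOY_RAID10,
	TOY_RAID5,
	TOY_RAID6,
};

/*
 * Bytes each stripe occupies on its device, i.e. the length of each
 * dev extent that belongs to a chunk of chunk_len logical bytes.
 */
uint64_t toy_stripe_length(enum toy_profile profile, uint64_t chunk_len,
			   int num_stripes)
{
	assert(num_stripes > 0);

	switch (profile) {
	case TOY_RAID0:
		return chunk_len / num_stripes;
	case TOY_RAID10:
		/* two copies of the data, striped over all devices */
		return chunk_len * 2 / num_stripes;
	case TOY_RAID5:
		assert(num_stripes > 1);
		return chunk_len / (num_stripes - 1);	/* one parity stripe */
	case TOY_RAID6:
		assert(num_stripes > 2);
		return chunk_len / (num_stripes - 2);	/* two parity stripes */
	default:
		return chunk_len;
	}
}
```

Under these assumptions, a 1 GiB RAID0 chunk over four devices yields four dev extents of 256 MiB each, one per stripe, keyed by the stripe's devid and physical offset.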

Re: btrfsck out of memory for big fs

2018-08-16 Thread Qu Wenruo


On 2018/8/17 1:26 PM, litaibaich...@gmail.com wrote:
> Thanks, Qu.
> I am running 4.12.
> /# btrfs --version
> btrfs-progs v4.12
> 
> Do you think btrfsck can fix the fs issue?

Nope, transid errors are pretty tricky to fix, especially for very old
corruption.

> Or should we back up the data and recreate the fs?

I'd recommend backing up the data ASAP.

Thanks,
Qu






Re: BTRFS w/ quotas hangs on read-write mount using all available RAM

2018-08-16 Thread Qu Wenruo


On 2018/8/17 11:47 AM, Loren M. Lang wrote:
> Hello,
> 
> I am unable to mount my btrfs in read-write mode after enabling quotas 
> and running a full balance on it. My server is running Ubuntu 17.10 
> with Linux kernel 4.13.0-17-generic and btrfs-progs 4.12. I am trying to 
> recover with a live CD of Ubuntu 18.04.1 running Linux 4.15.0-29-generic 
> with btrfs-progs 4.15.1. My system has two ~3 TB drives in the btrfs 
> array with RAID1 for data and metadata and two sub volumes, / and /home.
> 
> I was attempting to track down where my free space was going when I 
> discovered apt-btrfs-snapshot which was creating a snapshot for every 
> package install I had done, some quite old. I had at least 20 snapshots 
> created when I found it and told apt-btrfs-snapshot to delete all of 
> them.

Since snapshot deletion is delayed, btrfs may still be deleting all
these snapshots.

Also, too many snapshots hurt quota performance.

> Still not getting back the free space I was expecting, I found a 
> script called btrfs-size.sh which can produce a report, but requires 
> quotas to be enabled so I ran "sudo btrfs quota enable /". After losing 
> some patience trying to figure it out, I decided to try and just run a 
> full balance across everything, something like "sudo btrfs balance start 
> -d -m -v /",

Then you're pushing the performance impact to the maximum.
Balance with a lot of snapshots/reflinked files makes quota as slow
as hell.

> but I can’t remember the exact command. I then went to bed 
> only to find my server was completely hung the next day. I couldn’t get 
> the screen to wake or any other response, so I was forced to power cycle 
> it. However, on repeated attempts to get it to boot, it hangs at the 
> point that it tries to mount read-write and starts slowly consuming more 
> and more RAM until the system starts to hang. Switching to my 18.04.1 
> recovery disk, I find I can mount it read only and look around, but I 
> can’t disable quotas in read-only mode. If I run "mount -o remount,rw 
> /mnt" to enable read-write, mount hangs in the D state forever and I can 
> slowly see the RAM usage increasing. I added a massive 32 GB of swap 
> partitions, but eventually the system hangs due to out of memory.
> 
> Lastly, I’ve tried unmounting it and running btrfs check on the drive. I 
> see errors such as the following:
> 
> $ sudo btrfs check -p /dev/sda4
> ...
> ref mismatch on [3994222952448 16384] extent item 0, found 1
> tree backref 3994222952448 parent 9688891392 root 9688891392 not found
> in extent tree
> backpointer mismatch on [3994222952448 16384]
> owner ref check failed [3994222952448 16384]
> ref mismatch on [3994223329280 16384] extent item 0, found 1
> tree backref 3994223329280 parent 9688891392 root 9688891392 not found
> in extent tree
> backpointer mismatch on [3994223329280 16384]
> owner ref check failed [3994223329280 16384]
> ref mismatch on [3994271203328 16384] extent item 0, found 1
> tree backref 3994271203328 parent 9688891392 root 9688891392 not found
> in extent tree
> backpointer mismatch on [3994271203328 16384]
> owner ref check failed [3994271203328 16384]
> ref mismatch on [3994276593664 16384] extent item 0, found 1
> tree backref 3994276593664 parent 9688891392 root 9688891392 not found
> in extent tree
> backpointer mismatch on [3994276593664 16384]
> owner ref check failed [3994276593664 16384]
> ref mismatch on [3994278756352 16384] extent item 0, found 1
> tree backref 3994278756352 parent 9688891392 root 9688891392 not found
> in extent tree
> backpointer mismatch on [3994278756352 16384]
> owner ref check failed [3994278756352 16384]
> 
> ERROR: errors found in extent allocation tree or chunk allocation
> block group 3520760643584 has wrong amount of free space
> failed to load free space cache for block group 3520760643584
> checking free space cache [O]
> checking fs roots [.][o].][o]
> checking csums
> checking root refs
> checking quota groups
> ^C
> ubuntu@ubuntu:~$ 
> 
> It hung at checking quota groups for 12 hours before I killed it.

Considering how many snapshots you have, btrfs-progs won't be any
quicker than the kernel.

> The 
> errors above are only a small snippet, but seem to keep repeating the 
> same basic thing. I have not tried a repair yet.
> 
> What’s the next step?

Disable quota first, of course.
(Only enable it when the number of snapshots is kept pretty low, don't
try offline dedupe, and don't run balance unless it is really needed.)

You can disable quota using this branch of btrfs-progs:
https://github.com/adam900710/btrfs-progs/tree/quota_disable

Or apply this patch on btrfs-progs 4.17.1:
https://patchwork.kernel.org/patch/10563589/

Then compile btrfs-progs and use the following command to disable quota
while the filesystem is unmounted:

# ./btrfs rescue disable-quota /dev/sda4

It should finish pretty quickly.

Then retry btrfs check, mount rw (to let the balance continue), and run
btrfs check again after the balance has finished.

The reported error could be a f

Re: btrfsck out of memory for big fs

2018-08-16 Thread Qu Wenruo


On 2018/8/17 10:44 AM, litaibaich...@gmail.com wrote:
> Hi Guys,
> 
> I have a big btrfs on an md device. It can be mounted, but after a while it
> becomes read-only:
> # btrfs fi df /data/
> Data, single: total=24.46TiB, used=24.46TiB
> System, DUP: total=8.00MiB, used=2.59MiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=81.00GiB, used=79.71GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=264.28MiB
> 
> # dmesg -T
> [Thu Aug 16 18:16:31 2018] BTRFS error (device md127): parent transid verify 
> failed on 26603622694912 wanted 185320 found 207817
> [Thu Aug 16 18:16:31 2018] BTRFS error (device md127): parent transid verify 
> failed on 26603622694912 wanted 185320 found 207817

Transaction ID mismatch. Normally this means part of the fs was already
corrupted earlier.

And considering the transid gap, the corruption happened quite a long
time ago.
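To put a number on that gap, the two transids can be pulled out of the kernel message and subtracted. The helper below is a hypothetical sketch (transid_gap() is not part of btrfs-progs), and it assumes the message has exactly the wording shown in the dmesg output above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse one "parent transid verify failed" message and return how many
 * generations apart the wanted and found transids are, or -1 if the
 * text does not match. transid_gap() is a hypothetical helper.
 */
long long transid_gap(const char *msg)
{
	unsigned long long bytenr, wanted, found;

	assert(msg != NULL);
	if (sscanf(msg, "parent transid verify failed on %llu wanted %llu found %llu",
		   &bytenr, &wanted, &found) != 3)
		return -1;

	return found > wanted ? (long long)(found - wanted)
			      : (long long)(wanted - found);
}
```

For the message above (wanted 185320, found 207817) the gap is 22497 transactions.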

> [Thu Aug 16 18:16:31 2018] BTRFS warning (device md127): Skipping commit of 
> aborted transaction.
> [Thu Aug 16 18:16:31 2018] BTRFS: error (device md127) in 
> cleanup_transaction:1864: errno=-5 IO failure
> [Thu Aug 16 18:16:31 2018] BTRFS info (device md127): forced readonly
> [Thu Aug 16 18:16:31 2018] BTRFS info (device md127): delayed_refs has NO 
> entry
> 
> I want to use btrfsck to check it, but it runs out of memory (OOM):
> # btrfsck /dev/md127
> Checking filesystem on /dev/md127
> UUID: 6b87a52f-9a5f-4d03-b345-9d954c2ce259
> checking extents
> Killed
> 
> I am trying to use lowmem mode, but as I found when I tried it before, it
> may OOM too. Any ideas?

This depends on which version you tried before.

Lowmem mode used to do a partial lowmem and partial normal mode check,
and the normal mode part caused the OOM.

The latest lowmem mode should not cause OOM, but please get ready for a
very long run time.

Thanks,
Qu
> 
> Thanks.
> 





BTRFS w/ quotas hangs on read-write mount using all available RAM

2018-08-16 Thread Loren M. Lang
Hello,

I am unable to mount my btrfs in read-write mode after enabling quotas 
and running a full balance on it. My server is running Ubuntu 17.10 
with Linux kernel 4.13.0-17-generic and btrfs-progs 4.12. I am trying to 
recover with a live CD of Ubuntu 18.04.1 running Linux 4.15.0-29-generic 
with btrfs-progs 4.15.1. My system has two ~3 TB drives in the btrfs 
array with RAID1 for data and metadata and two sub volumes, / and /home.

I was attempting to track down where my free space was going when I 
discovered apt-btrfs-snapshot which was creating a snapshot for every 
package install I had done, some quite old. I had at least 20 snapshots 
created when I found it and told apt-btrfs-snapshot to delete all of 
them. Still not getting back the free space I was expecting, I found a 
script called btrfs-size.sh which can produce a report, but requires 
quotas to be enabled so I ran "sudo btrfs quota enable /". After losing 
some patience trying to figure it out, I decided to try and just run a 
full balance across everything, something like "sudo btrfs balance start 
-d -m -v /", but I can’t remember the exact command. I then went to bed 
only to find my server was completely hung the next day. I couldn’t get 
the screen to wake or any other response, so I was forced to power cycle 
it. However, on repeated attempts to get it to boot, it hangs at the 
point that it tries to mount read-write and starts slowly consuming more 
and more RAM until the system starts to hang. Switching to my 18.04.1 
recovery disk, I find I can mount it read only and look around, but I 
can’t disable quotas in read-only mode. If I run "mount -o remount,rw 
/mnt" to enable read-write, mount hangs in the D state forever and I can 
slowly see the RAM usage increasing. I added a massive 32 GB of swap 
partitions, but eventually the system hangs due to out of memory.

Lastly, I’ve tried unmounting it and running btrfs check on the drive. I 
see errors such as the following:

$ sudo btrfs check -p /dev/sda4
...
ref mismatch on [3994222952448 16384] extent item 0, found 1
tree backref 3994222952448 parent 9688891392 root 9688891392 not found
in extent tree
backpointer mismatch on [3994222952448 16384]
owner ref check failed [3994222952448 16384]
ref mismatch on [3994223329280 16384] extent item 0, found 1
tree backref 3994223329280 parent 9688891392 root 9688891392 not found
in extent tree
backpointer mismatch on [3994223329280 16384]
owner ref check failed [3994223329280 16384]
ref mismatch on [3994271203328 16384] extent item 0, found 1
tree backref 3994271203328 parent 9688891392 root 9688891392 not found
in extent tree
backpointer mismatch on [3994271203328 16384]
owner ref check failed [3994271203328 16384]
ref mismatch on [3994276593664 16384] extent item 0, found 1
tree backref 3994276593664 parent 9688891392 root 9688891392 not found
in extent tree
backpointer mismatch on [3994276593664 16384]
owner ref check failed [3994276593664 16384]
ref mismatch on [3994278756352 16384] extent item 0, found 1
tree backref 3994278756352 parent 9688891392 root 9688891392 not found
in extent tree
backpointer mismatch on [3994278756352 16384]
owner ref check failed [3994278756352 16384]

ERROR: errors found in extent allocation tree or chunk allocation
block group 3520760643584 has wrong amount of free space
failed to load free space cache for block group 3520760643584
checking free space cache [O]
checking fs roots [.][o].][o]
checking csums
checking root refs
checking quota groups
^C
ubuntu@ubuntu:~$ 

It hung at checking quota groups for 12 hours before I killed it. The 
errors above are only a small snippet, but seem to keep repeating the 
same basic thing. I have not tried a repair yet.

What’s the next step?

-- 
Loren M. Lang
lor...@north-winds.org
http://www.north-winds.org/


Public Key: ftp://ftp.north-winds.org/pub/lorenl_pubkey.asc
Fingerprint: 10A0 7AE2 DAF5 4780 888A  3FA4 DCEE BB39 7654 DE5B


btrfsck out of memory for big fs

2018-08-16 Thread litaibaich...@gmail.com
Hi Guys,

I have a big btrfs on an md device. It can be mounted, but after a while it
becomes read-only:
# btrfs fi df /data/
Data, single: total=24.46TiB, used=24.46TiB
System, DUP: total=8.00MiB, used=2.59MiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=81.00GiB, used=79.71GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=264.28MiB

# dmesg -T
[Thu Aug 16 18:16:31 2018] BTRFS error (device md127): parent transid verify 
failed on 26603622694912 wanted 185320 found 207817
[Thu Aug 16 18:16:31 2018] BTRFS error (device md127): parent transid verify 
failed on 26603622694912 wanted 185320 found 207817
[Thu Aug 16 18:16:31 2018] BTRFS warning (device md127): Skipping commit of 
aborted transaction.
[Thu Aug 16 18:16:31 2018] BTRFS: error (device md127) in 
cleanup_transaction:1864: errno=-5 IO failure
[Thu Aug 16 18:16:31 2018] BTRFS info (device md127): forced readonly
[Thu Aug 16 18:16:31 2018] BTRFS info (device md127): delayed_refs has NO entry

I want to use btrfsck to check it, but it runs out of memory (OOM):
# btrfsck /dev/md127
Checking filesystem on /dev/md127
UUID: 6b87a52f-9a5f-4d03-b345-9d954c2ce259
checking extents
Killed

I am trying to use lowmem mode, but as I found when I tried it before, it
may OOM too. Any ideas?

Thanks.

[PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking

2018-08-16 Thread Liu Bo
Btrfs's btree locking has two modes, spinning mode and blocking mode.
While searching the btree, locks are always acquired in spinning mode and
then converted to blocking mode if necessary, and in some hot paths we may
switch the locks back to spinning mode with btrfs_clear_path_blocking().

When acquiring locks, both readers and writers need to wait for blocking
readers and writers to complete before doing read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() needs to switch nodes
in the path to blocking mode at first (by btrfs_set_path_blocking) to
make lockdep happy before doing its actual clearing blocking job.

Switching from spinning mode to blocking mode consists of

step 1) bumping up the blocking readers counter and
step 2) read_unlock()/write_unlock(),

which causes a serious ping-pong effect when there is a large number of
concurrent readers/writers, as waiters are woken up and go back to
sleep immediately.
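The ping-pong described above can be illustrated with a toy counter model. This is only a sketch under stated assumptions: struct toy_lock and the toy_* functions are hypothetical names, no real locking is performed, and every switch to blocking mode is assumed to wake each waiter exactly once.

```c
#include <assert.h>

/* Toy model of one tree lock; the fields are illustrative, not the kernel's. */
struct toy_lock {
	int blocking_holders;	/* holders that switched to blocking mode */
	int waiters;		/* tasks sleeping until the lock is usable */
	long wasted_wakeups;	/* wakeups after which the waiter slept again */
};

/*
 * Step 1 bumps the blocking counter, step 2 drops the spinning lock,
 * which wakes the waiters. Each of them sees a nonzero blocking counter
 * and goes straight back to sleep, so the wakeup was wasted.
 */
void toy_set_blocking(struct toy_lock *l)
{
	l->blocking_holders++;
	l->wasted_wakeups += l->waiters;
}

/* Leaving blocking mode; waiters can only proceed once the counter is 0. */
void toy_clear_blocking(struct toy_lock *l)
{
	assert(l->blocking_holders > 0);
	l->blocking_holders--;
}

/* Count wasted wakeups for a given number of waiters and mode switches. */
long toy_wasted_wakeups(int waiters, int switches)
{
	struct toy_lock l = { 0, waiters, 0 };

	for (int i = 0; i < switches; i++) {
		toy_set_blocking(&l);
		toy_clear_blocking(&l);
	}
	return l.wasted_wakeups;
}
```

In this model the wasted wakeups grow as waiters times switches, so 15 waiters and 1000 switches already give 15000 pointless wake/sleep cycles.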

1) Killing this kind of ping-pong results in a big improvement in my
script that creates 1600k files:

MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/sdf $MNT
time fsmark  -D  1  -S0  -n  10  -s  0  -L  1 -l /tmp/fs_log.txt \
-d  $MNT/0  -d  $MNT/1 \
-d  $MNT/2  -d  $MNT/3 \
-d  $MNT/4  -d  $MNT/5 \
-d  $MNT/6  -d  $MNT/7 \
-d  $MNT/8  -d  $MNT/9 \
-d  $MNT/10  -d  $MNT/11 \
-d  $MNT/12  -d  $MNT/13 \
-d  $MNT/14  -d  $MNT/15

w/o patch:
real    2m27.307s
user    0m12.839s
sys     13m42.831s

w/ patch:
real    1m2.273s
user    0m15.802s
sys     8m16.495s

2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16

w/o patch:
Throughput 158.363 MB/sec

w/ patch:
Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.
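The results in 1) and 2) can be reduced to ratios for easier comparison. The helpers below are a small arithmetic sketch; the function names are hypothetical and only the figures quoted above are used.

```c
#include <assert.h>

/* Convert a time(1)-style "XmY.YYYs" reading to seconds. */
double to_seconds(int minutes, double seconds)
{
	assert(minutes >= 0 && seconds >= 0.0);
	return minutes * 60.0 + seconds;
}

/* Ratio of the baseline time to the patched time; > 1 means less time. */
double time_speedup(double before_s, double after_s)
{
	assert(after_s > 0.0);
	return before_s / after_s;
}

/* Ratio of patched to baseline throughput; > 1 means more MB/sec. */
double throughput_gain(double before_mbs, double after_mbs)
{
	assert(before_mbs > 0.0);
	return after_mbs / before_mbs;
}
```

With the numbers above, time_speedup(to_seconds(2, 27.307), to_seconds(1, 2.273)) is roughly 2.37 for wall-clock time, the same calculation on sys time gives roughly 1.66, and throughput_gain(158.363, 449.52) is roughly 2.84 for dbench.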

One thing to note is that callers may set leave_spinning to keep all
nodes in the path in spinning mode, which means those callers are
prepared not to sleep before releasing the path. However, not sleeping
while holding a path in blocking mode causes no problems, IOW, we can
just get rid of leave_spinning.

Signed-off-by: Liu Bo 
---
 fs/btrfs/ctree.c | 57 
 fs/btrfs/ctree.h |  2 --
 fs/btrfs/delayed-inode.c |  3 ---
 3 files changed, 4 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d436fb4c002e..8b31caa60b0a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
}
 }
 
-/*
- * reset all the locked nodes in the patch to spinning locks.
- *
- * held is used to keep lockdep happy, when lockdep is enabled
- * we set held to a blocking lock before we go around and
- * retake all the spinlocks in the path.  You can safely use NULL
- * for held
- */
-noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
-   struct extent_buffer *held, int held_rw)
-{
-   int i;
-
-   if (held) {
-   btrfs_set_lock_blocking_rw(held, held_rw);
-   if (held_rw == BTRFS_WRITE_LOCK)
-   held_rw = BTRFS_WRITE_LOCK_BLOCKING;
-   else if (held_rw == BTRFS_READ_LOCK)
-   held_rw = BTRFS_READ_LOCK_BLOCKING;
-   }
-   btrfs_set_path_blocking(p);
-
-   for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
-   if (p->nodes[i] && p->locks[i]) {
-   btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
-   if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
-   p->locks[i] = BTRFS_WRITE_LOCK;
-   else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
-   p->locks[i] = BTRFS_READ_LOCK;
-   }
-   }
-
-   if (held)
-   btrfs_clear_lock_blocking_rw(held, held_rw);
-}
-
 /* this also releases the path */
 void btrfs_free_path(struct btrfs_path *p)
 {
@@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
}
}
 
-   btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
btrfs_tree_read_unlock_blocking(eb);
free_extent_buffer(eb);
 
@@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
btrfs_set_path_blocking(p);
reada_for_balance(fs_info, p, level);
sret = split_node(trans, root, p, level);
-   btrfs_clear_path_blocking(p, NULL, 0);
 
BUG_ON(sret > 0);
if (sret) {
@@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path 
*path, int level)
btrfs_set_path_blocking(p);
reada_for_balance(fs_info, p, level);
sret = balance_level(trans, root, p, level);
-   btrfs_clear_path_blocking(p, NULL, 0);
 
if (sret) {
ret = sret;
@@ -2789,7 +2750,10 @@ 

[PATCH 2/2] Btrfs: kill leave_spinning

2018-08-16 Thread Liu Bo
As btrfs_clear_path_blocking() turned out to be a major source of lock
contention, we've killed it. Without it, btrfs_search_slot() and
btrfs_search_old_slot() are not able to return a path in spinning
mode, so let's kill leave_spinning, too.

Signed-off-by: Liu Bo 
---
 fs/btrfs/backref.c|  3 ---
 fs/btrfs/ctree.c  | 16 +++-
 fs/btrfs/ctree.h  |  1 -
 fs/btrfs/delayed-inode.c  |  4 
 fs/btrfs/dir-item.c   |  1 -
 fs/btrfs/export.c |  1 -
 fs/btrfs/extent-tree.c|  7 ---
 fs/btrfs/extent_io.c  |  1 -
 fs/btrfs/file-item.c  |  4 
 fs/btrfs/free-space-tree.c|  2 --
 fs/btrfs/inode-item.c |  6 --
 fs/btrfs/inode.c  |  8 
 fs/btrfs/ioctl.c  |  3 ---
 fs/btrfs/qgroup.c |  2 --
 fs/btrfs/super.c  |  2 --
 fs/btrfs/tests/qgroup-tests.c |  4 
 16 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index ae750b1574a2..70c629b90710 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1613,13 +1613,11 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, 
struct btrfs_path *path,
s64 bytes_left = ((s64)size) - 1;
struct extent_buffer *eb = eb_in;
struct btrfs_key found_key;
-   int leave_spinning = path->leave_spinning;
struct btrfs_inode_ref *iref;
 
if (bytes_left >= 0)
dest[bytes_left] = '\0';
 
-   path->leave_spinning = 1;
while (1) {
bytes_left -= name_len;
if (bytes_left >= 0)
@@ -1665,7 +1663,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, 
struct btrfs_path *path,
}
 
btrfs_release_path(path);
-   path->leave_spinning = leave_spinning;
 
if (ret)
return ERR_PTR(ret);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 8b31caa60b0a..d2df7cfbec06 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2875,14 +2875,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, 
struct btrfs_root *root,
}
ret = 1;
 done:
-   /*
-* we don't really know what they plan on doing with the path
-* from here on, so for now just mark it as blocking
-*/
-   if (!p->leave_spinning)
-   btrfs_set_path_blocking(p);
if (ret < 0 && !p->skip_release_on_error)
btrfs_release_path(p);
+
+   /* path is supposed to be in blocking mode from now on. */
return ret;
 }
 
@@ -2987,11 +2983,10 @@ int btrfs_search_old_slot(struct btrfs_root *root, 
const struct btrfs_key *key,
}
ret = 1;
 done:
-   if (!p->leave_spinning)
-   btrfs_set_path_blocking(p);
if (ret < 0)
btrfs_release_path(p);
 
+   /* path is supposed to be in blocking mode from now on.*/
return ret;
 }
 
@@ -5628,7 +5623,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
struct btrfs_key key;
u32 nritems;
int ret;
-   int old_spinning = path->leave_spinning;
int next_rw_lock = 0;
 
nritems = btrfs_header_nritems(path->nodes[0]);
@@ -5643,7 +5637,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
btrfs_release_path(path);
 
path->keep_locks = 1;
-   path->leave_spinning = 1;
 
if (time_seq)
ret = btrfs_search_old_slot(root, &key, path, time_seq);
@@ -5780,9 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct 
btrfs_path *path,
ret = 0;
 done:
unlock_up(path, 0, 1, 0, NULL);
-   path->leave_spinning = old_spinning;
-   if (!old_spinning)
-   btrfs_set_path_blocking(path);
 
return ret;
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1aeed3c0e949..e8bddf21b7f7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -339,7 +339,6 @@ struct btrfs_path {
unsigned int search_for_split:1;
unsigned int keep_locks:1;
unsigned int skip_locking:1;
-   unsigned int leave_spinning:1;
unsigned int search_commit_root:1;
unsigned int need_commit_sem:1;
unsigned int skip_release_on_error:1;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index db9f45082c61..e6fbcdbc313e 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1138,7 +1138,6 @@ static int __btrfs_run_delayed_items(struct 
btrfs_trans_handle *trans, int nr)
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
-   path->leave_spinning = 1;
 
block_rsv = trans->block_rsv;
trans->block_rsv = &fs_info->delayed_block_rsv;
@@ -1203,7 +1202,6 @@ int btrfs_commit_inode_delayed_items(struct 
btrfs_trans_handle *trans,
btrfs_release_delayed_node(delayed_node);
return -ENOMEM;
}
-   path->leave_spi

[PATCH 0/2] address lock contention of btree root

2018-08-16 Thread Liu Bo
The lock contention on btree nodes (esp. the root node) is apparently a
bottleneck when multiple readers and writers are concurrently trying to
access them.  Unfortunately this is by design and it's not easy to fix
without some complex changes. However, there is still some room for
improvement.

With a stable workload based on fsmark which has 16 threads creating
1,600K files, we can see that a good amount of overhead comes from
switching the path between spinning mode and blocking mode in
btrfs_search_slot().

Patch 1 provides more details about the overhead and test results from
fsmark and dbench.
Patch 2 kills leave_spinning due to the behaviour change from patch 1.

Liu Bo (2):
  Btrfs: kill btrfs_clear_path_blocking
  Btrfs: kill leave_spinning

 fs/btrfs/backref.c|  3 --
 fs/btrfs/ctree.c  | 73 +--
 fs/btrfs/ctree.h  |  3 --
 fs/btrfs/delayed-inode.c  |  7 -
 fs/btrfs/dir-item.c   |  1 -
 fs/btrfs/export.c |  1 -
 fs/btrfs/extent-tree.c|  7 -
 fs/btrfs/extent_io.c  |  1 -
 fs/btrfs/file-item.c  |  4 ---
 fs/btrfs/free-space-tree.c|  2 --
 fs/btrfs/inode-item.c |  6 
 fs/btrfs/inode.c  |  8 -
 fs/btrfs/ioctl.c  |  3 --
 fs/btrfs/qgroup.c |  2 --
 fs/btrfs/super.c  |  2 --
 fs/btrfs/tests/qgroup-tests.c |  4 ---
 16 files changed, 7 insertions(+), 120 deletions(-)

-- 
1.8.3.1



[PATCH] Btrfs: remove always true if branch in btrfs_get_extent

2018-08-16 Thread Liu Bo
@path is always NULL when execution reaches the if branch, so the check
is always true.

Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8b135a46835f..4b79916472fb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6825,18 +6825,15 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode 
*inode,
em->len = (u64)-1;
em->block_len = (u64)-1;
 
+   path = btrfs_alloc_path();
if (!path) {
-   path = btrfs_alloc_path();
-   if (!path) {
-   err = -ENOMEM;
-   goto out;
-   }
-   /*
-* Chances are we'll be called again, so go ahead and do
-* readahead
-*/
-   path->reada = READA_FORWARD;
+   err = -ENOMEM;
+   goto out;
}
+   /*
+* Chances are we'll be called again, so go ahead and do readahead.
+*/
+   path->reada = READA_FORWARD;
 
ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
if (ret < 0) {
-- 
1.8.3.1



Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-16 Thread erenthetitan
Could you show scrub status -d, then start a new scrub (all drives) and show 
scrub status -d again? This may help us diagnose the problem.

On 15-Aug-2018 09:27:40 +0200, men...@gmail.com wrote:
> I needed to resume the scrub two times after an unclean shutdown (I was
> cooking and using too much electricity) and two times after a manual
> cancel, because I wanted to watch a 4k movie and the array
> performance was not enough with the scrub active.
> Each time I resumed it, I also checked the status, and the total
> amount of data scrubbed kept counting up (it never restarted from zero).
> On Wed, 15 Aug 2018 at 05:33, Zygo Blaxell wrote:
> >
> > On Tue, Aug 14, 2018 at 09:32:51AM +0200, Menion wrote:
> > > Hi
> > > Well, I think it is worth giving more details on the array.
> > > The array is built with 5x8TB HDDs in an external USB3.0 to SATAIII
> > > enclosure.
> > > The enclosure is a cheap JMicron-based Chinese product (from Orico).
> > > There is one USB3.0 link for all 5 HDDs with a SATAIII 3.0Gb
> > > multiplexer behind it. So you cannot expect peak performance, which is
> > > not the goal of this array (domestic data storage).
> > > Also the USB to SATA firmware is buggy, so UAS operations are not
> > > stable; it runs in BOT mode.
> > > Having said so, the scrub has been started (and resumed) on the array
> > > mount point:
> > >
> > > sudo btrfs scrub start(resume) /media/storage/das1
> >
> > So is 2.59TB the amount scrubbed _since resume_? If you run a complete
> > scrub end to end without cancelling or rebooting in between, what is
> > the size on all disks (btrfs scrub status -d)?
> >
> > > even though, reading the documentation, I understand that invoking it
> > > on the mountpoint is the same as invoking it on one of the HDDs in the
> > > array.
> > > In the end, especially for a RAID5 array, does it really make sense to
> > > scrub only one disk in the array?
> >
> > You would set up a shell for-loop and scrub each disk of the array
> > in turn. Each scrub would correct errors on a single device.
> >
> > There was a bug in btrfs scrub where scrubbing the filesystem would
> > create one thread for each disk, and the threads would issue commands
> > to all disks and compete with each other for IO, resulting in terrible
> > performance on most non-SSD hardware. By scrubbing disks one at a time,
> > there are no competing threads, so the scrub runs many times faster.
> > With this bug the total time to scrub all disks individually is usually
> > less than the time to scrub the entire filesystem at once, especially
> > on HDD (and even if it's not faster, one-at-a-time disk scrubs are
> > much kinder to any other process trying to use the filesystem at the
> > same time).
> >
> > It appears this bug is not fixed, based on some timing results I am
> > getting from a test array. iostat shows 10x more reads than writes on
> > all disks even when all blocks on one disk are corrupted and the scrub
> > is given only a single disk to process (that should result in roughly
> > equal reads on all disks slightly above the number of writes on the
> > corrupted disk).
> >
> > This is where my earlier caveat about performance comes from. Many parts
> > of btrfs raid5 are somewhere between slower and *much* slower than
> > comparable software raid5 implementations. Some of that is by design:
> > btrfs must be at least 1% slower than mdadm because btrfs needs to read
> > metadata to verify data block csums in scrub, and the difference would
> > be much larger in practice due to HDD seek times, but 500%-900% overhead
> > still seems high especially when compared to btrfs raid1 that has the
> > same metadata csum reading issue without the huge performance gap.
> >
> > It seems like btrfs raid5 could still use a thorough profiling to figure
> > out where it's spending all its IO.
> >
> > > Regarding the data usage, here you have the current figures:
> > >
> > > menion@Menionubuntu:~$ sudo btrfs fi show
> > > [sudo] password for menion:
> > > Label: none uuid: 6db4baf7-fda8-41ac-a6ad-1ca7b083430f
> > > Total devices 1 FS bytes used 11.44GiB
> > > devid 1 size 27.07GiB used 18.07GiB path /dev/mmcblk0p3
> > >
> > > Label: none uuid: 931d40c6-7cd7-46f3-a4bf-61f3a53844bc
> > > Total devices 5 FS bytes used 6.57TiB
> > > devid 1 size 7.28TiB used 1.64TiB path /dev/sda
> > > devid 2 size 7.28TiB used 1.64TiB path /dev/sdb
> > > devid 3 size 7.28TiB used 1.64TiB path /dev/sdc
> > > devid 4 size 7.28TiB used 1.64TiB path /dev/sdd
> > > devid 5 size 7.28TiB used 1.64TiB path /dev/sde
> > >
> > > menion@Menionubuntu:~$ sudo btrfs fi df /media/storage/das1
> > > Data, RAID5: total=6.57TiB, used=6.56TiB
> > > System, RAID5: total=12.75MiB, used=416.00KiB
> > > Metadata, RAID5: total=9.00GiB, used=8.16GiB
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> > > menion@Menionubuntu:~$ sudo btrfs fi usage /media/storage/das1
> > > WARNING: RAID56 detected, not implemented
> > > WARNING: RAID56 detected, not implemented
> > > WARNING: RAI

[PATCH 3/8] btrfs-progs: Add delayed refs infrastructure

2018-08-16 Thread Nikolay Borisov
This commit pulls those portions of the kernel implementation of
delayed refs which are necessary to have them working in user-space.
I've done the following modifications:

1. Replaced all kmem_cache_alloc calls with kmalloc.

2. Removed all locking-related code, since we are single threaded in
userspace.

3. Removed code which deals with data refs - delayed refs in user space
are going to be used only for cowonly trees.

Signed-off-by: Nikolay Borisov 
Signed-off-by: David Sterba 
---
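The ordering that comp_refs in the diff below imposes on delayed refs (type first, then root or parent depending on the type, then optionally the sequence number) can be sketched in isolation. This is only an illustration: the toy_ names, the flattened struct and the enum values are assumptions made for the sketch, not the btrfs-progs definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two tree-block ref types handled here. */
enum toy_ref_type { TOY_TREE_BLOCK_REF = 1, TOY_SHARED_BLOCK_REF = 2 };

/* Simplified, flattened version of a delayed tree ref (illustration only). */
struct toy_ref {
	enum toy_ref_type type;
	uint64_t root;   /* compared for TOY_TREE_BLOCK_REF */
	uint64_t parent; /* compared for TOY_SHARED_BLOCK_REF */
	uint64_t seq;
};

/* Three-way compare of two u64 values: -1, 0 or 1. */
static int cmp_u64(uint64_t a, uint64_t b)
{
	return (a > b) - (a < b);
}

/* Order by type, then by root or parent, then (optionally) by seq. */
static int toy_comp_refs(const struct toy_ref *a, const struct toy_ref *b,
			 bool check_seq)
{
	int ret;

	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->type == TOY_TREE_BLOCK_REF)
		ret = cmp_u64(a->root, b->root);
	else
		ret = cmp_u64(a->parent, b->parent);
	if (ret || !check_seq)
		return ret;
	return cmp_u64(a->seq, b->seq);
}
```

Two refs that differ only in seq compare equal when check_seq is false, which is what allows them to be treated as the same ref when sequence numbers do not matter.
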
 Makefile  |   3 +-
 ctree.h   |   3 +
 delayed-ref.c | 607 ++
 delayed-ref.h | 208 
 extent-tree.c | 226 ++
 kerncompat.h  |   8 +
 transaction.h |   4 +
 7 files changed, 1058 insertions(+), 1 deletion(-)
 create mode 100644 delayed-ref.c
 create mode 100644 delayed-ref.h

diff --git a/Makefile b/Makefile
index fcfc815a2a5b..f4ab14ea74c8 100644
--- a/Makefile
+++ b/Makefile
@@ -116,7 +116,8 @@ objects = ctree.o disk-io.o kernel-lib/radix-tree.o extent-tree.o print-tree.o \
  qgroup.o free-space-cache.o kernel-lib/list_sort.o props.o \
  kernel-shared/ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
  inode.o file.o find-root.o free-space-tree.o help.o send-dump.o \
- fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o
+ fsfeatures.o kernel-lib/tables.o kernel-lib/raid56.o transaction.o \
+ delayed-ref.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o check/main.o \
diff --git a/ctree.h b/ctree.h
index 4719962df67d..5242595fe355 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2790,4 +2790,7 @@ int btrfs_punch_hole(struct btrfs_trans_handle *trans,
 int btrfs_read_file(struct btrfs_root *root, u64 ino, u64 start, int len,
char *dest);
 
+/* extent-tree.c */
+int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long nr);
+
 #endif
diff --git a/delayed-ref.c b/delayed-ref.c
new file mode 100644
index ..e8123436a58f
--- /dev/null
+++ b/delayed-ref.c
@@ -0,0 +1,607 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2009 Oracle.  All rights reserved.
+ */
+
+#include "ctree.h"
+#include "btrfs-list.h"
+#include "delayed-ref.h"
+#include "transaction.h"
+
+/*
+ * delayed back reference update tracking.  For subvolume trees
+ * we queue up extent allocations and backref maintenance for
+ * delayed processing.   This avoids deep call chains where we
+ * add extents in the middle of btrfs_search_slot, and it allows
+ * us to buffer up frequently modified backrefs in an rb tree instead
+ * of hammering updates on the extent allocation tree.
+ */
+
+/*
+ * compare two delayed tree backrefs with same bytenr and type
+ */
+static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref1,
+ struct btrfs_delayed_tree_ref *ref2)
+{
+   if (ref1->node.type == BTRFS_TREE_BLOCK_REF_KEY) {
+   if (ref1->root < ref2->root)
+   return -1;
+   if (ref1->root > ref2->root)
+   return 1;
+   } else {
+   if (ref1->parent < ref2->parent)
+   return -1;
+   if (ref1->parent > ref2->parent)
+   return 1;
+   }
+   return 0;
+}
+
+static int comp_refs(struct btrfs_delayed_ref_node *ref1,
+struct btrfs_delayed_ref_node *ref2,
+bool check_seq)
+{
+   int ret = 0;
+
+   if (ref1->type < ref2->type)
+   return -1;
+   if (ref1->type > ref2->type)
+   return 1;
+   if (ref1->type == BTRFS_TREE_BLOCK_REF_KEY ||
+   ref1->type == BTRFS_SHARED_BLOCK_REF_KEY)
+   ret = comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref1),
+btrfs_delayed_node_to_tree_ref(ref2));
+   else
+   BUG();
+
+   if (ret)
+   return ret;
+   if (check_seq) {
+   if (ref1->seq < ref2->seq)
+   return -1;
+   if (ref1->seq > ref2->seq)
+   return 1;
+   }
+   return 0;
+}
+
+/* insert a new ref to head ref rbtree */
+static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root,
+  struct rb_node *node)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent_node = NULL;
+   struct btrfs_delayed_ref_head *entry;
+   struct btrfs_delayed_ref_head *ins;
+   u64 bytenr;
+
+   ins = rb_entry(node, struct btrfs_delayed_ref_head, href_node);
+   bytenr = ins->bytenr;
+   while (*p) {
+   parent_node = *p;
+   entry = rb_entry(parent_node, struct btrfs_delayed_ref_head,
+  

[PATCH 1/8] btrfs-progs: Add __free_extent2 function

2018-08-16 Thread Nikolay Borisov
This is a simple adapter to convert the delayed ref arguments to the
existing arguments of __free_extent.

Signed-off-by: Nikolay Borisov 
Signed-off-by: David Sterba 
---
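The adapter idea can be sketched outside btrfs-progs as follows. Everything prefixed toy_ and the last_ variables are hypothetical stand-ins; the only part taken from the patch is the argument mapping (the tree level is passed in the owner position, the offset is 0, and one ref is dropped).

```c
#include <stdint.h>

/* Hypothetical, trimmed-down node: only the fields the adapter forwards. */
struct toy_node {
	uint64_t bytenr;
	uint64_t num_bytes;
	uint64_t parent;
	uint64_t root;
	int level;
};

/* Records the last call so the forwarding can be inspected. */
static uint64_t last_bytenr, last_owner;
static int last_refs_to_drop;

/* Stand-in for the existing function that takes positional arguments. */
static int toy_free_extent(uint64_t bytenr, uint64_t num_bytes,
			   uint64_t parent, uint64_t root, uint64_t owner,
			   uint64_t offset, int refs_to_drop)
{
	(void)num_bytes; (void)parent; (void)root; (void)offset;
	last_bytenr = bytenr;
	last_owner = owner;
	last_refs_to_drop = refs_to_drop;
	return 0;
}

/* Adapter: unpack the node and forward to the positional function. */
static int toy_free_extent2(const struct toy_node *node)
{
	return toy_free_extent(node->bytenr, node->num_bytes, node->parent,
			       node->root, (uint64_t)node->level, 0, 1);
}
```
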
 extent-tree.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/extent-tree.c b/extent-tree.c
index 5d49af5a901e..34409e600087 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2136,6 +2136,17 @@ void btrfs_unpin_extent(struct btrfs_fs_info *fs_info,
update_pinned_extents(fs_info, bytenr, num_bytes, 0);
 }
 
+static int __free_extent2(struct btrfs_trans_handle *trans,
+ struct btrfs_delayed_ref_node *node,
+ struct btrfs_delayed_extent_op *extent_op)
+{
+
+   struct btrfs_delayed_tree_ref *ref = btrfs_delayed_node_to_tree_ref(node);
+
+   return __free_extent(trans, node->bytenr, node->num_bytes,
+ref->parent, ref->root, ref->level, 0, 1);
+}
+
 /*
  * remove an extent from the root, returns 0 on success
  */
-- 
2.7.4



[PATCH 8/8] btrfs-progs: Merge alloc_reserved_tree_block(2|)

2018-08-16 Thread Nikolay Borisov
Now that delayed refs have been wired up, let's merge the two functions.
In the process also remove one BUG_ON, since alloc_reserved_tree_block's
callers can handle errors. No functional changes.

Signed-off-by: Nikolay Borisov 
---
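The size computation that the merged function performs (extent item plus one inline ref, plus a tree block info only when skinny metadata is off) can be sketched like this. The toy_ structs and their field layouts are assumptions made for illustration, and the packed attribute is a GCC/Clang extension.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy on-disk structs; field layouts are assumptions for illustration. */
struct toy_extent_item {
	uint64_t refs, generation, flags;
} __attribute__((packed));

struct toy_inline_ref {
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

struct toy_block_info {
	uint8_t key[17];
	uint8_t level;
} __attribute__((packed));

/* Item size for a tree block: the block info is only needed without skinny metadata. */
static size_t toy_tree_block_item_size(bool skinny_metadata)
{
	size_t size = sizeof(struct toy_extent_item) +
		      sizeof(struct toy_inline_ref);

	if (!skinny_metadata)
		size += sizeof(struct toy_block_info);
	return size;
}
```
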
 extent-tree.c | 77 +++
 1 file changed, 30 insertions(+), 47 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 1a63efdd9681..b9a30644720b 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -44,10 +44,6 @@ struct pending_extent_op {
int level;
 };
 
-static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
-u64 root_objectid, u64 generation,
-u64 flags, struct btrfs_disk_key *key,
-int level, struct btrfs_key *ins);
 static int __free_extent(struct btrfs_trans_handle *trans,
 u64 bytenr, u64 num_bytes, u64 parent,
 u64 root_objectid, u64 owner_objectid,
@@ -2528,16 +2524,22 @@ int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
return ret;
 }
 
-static int alloc_reserved_tree_block2(struct btrfs_trans_handle *trans,
+static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
  struct btrfs_delayed_ref_node *node,
  struct btrfs_delayed_extent_op *extent_op)
 {
 
struct btrfs_delayed_tree_ref *ref = btrfs_delayed_node_to_tree_ref(node);
-   struct btrfs_key ins;
bool skinny_metadata = btrfs_fs_incompat(trans->fs_info, SKINNY_METADATA);
-   int ret;
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_extent_item *extent_item;
+   struct btrfs_extent_inline_ref *iref;
+   struct extent_buffer *leaf;
+   struct btrfs_path *path;
+   struct btrfs_key ins;
+   u32 size = sizeof(*extent_item) + sizeof(*iref);
u64 start, end;
+   int ret;
 
ins.objectid = node->bytenr;
if (skinny_metadata) {
@@ -2546,6 +2548,8 @@ static int alloc_reserved_tree_block2(struct btrfs_trans_handle *trans,
} else {
ins.offset = node->num_bytes;
ins.type = BTRFS_EXTENT_ITEM_KEY;
+
+   size += sizeof(struct btrfs_tree_block_info);
}
 
if (ref->root == BTRFS_EXTENT_TREE_OBJECTID) {
@@ -2557,69 +2561,48 @@ static int alloc_reserved_tree_block2(struct btrfs_trans_handle *trans,
ASSERT(end == node->bytenr + node->num_bytes - 1);
}
 
-   ret = alloc_reserved_tree_block(trans, ref->root, trans->transid,
-   extent_op->flags_to_set,
-   &extent_op->key, ref->level, &ins);
-
-   if (ref->root == BTRFS_EXTENT_TREE_OBJECTID) {
-   clear_extent_bits(&trans->fs_info->extent_ins, start, end,
- EXTENT_LOCKED);
-   }
-
-   return ret;
-}
-
-static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
-u64 root_objectid, u64 generation,
-u64 flags, struct btrfs_disk_key *key,
-int level, struct btrfs_key *ins)
-{
-   int ret;
-   struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_extent_item *extent_item;
-   struct btrfs_tree_block_info *block_info;
-   struct btrfs_extent_inline_ref *iref;
-   struct btrfs_path *path;
-   struct extent_buffer *leaf;
-   u32 size = sizeof(*extent_item) + sizeof(*iref);
-   int skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
-
-   if (!skinny_metadata)
-   size += sizeof(*block_info);
-
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
 
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
- ins, size);
-   BUG_ON(ret);
+ &ins, size);
+   if (ret)
+   return ret;
 
leaf = path->nodes[0];
extent_item = btrfs_item_ptr(leaf, path->slots[0],
 struct btrfs_extent_item);
btrfs_set_extent_refs(leaf, extent_item, 1);
-   btrfs_set_extent_generation(leaf, extent_item, generation);
+   btrfs_set_extent_generation(leaf, extent_item, trans->transid);
btrfs_set_extent_flags(leaf, extent_item,
-  flags | BTRFS_EXTENT_FLAG_TREE_BLOCK);
+  extent_op->flags_to_set |
+  BTRFS_EXTENT_FLAG_TREE_BLOCK);
 
if (skinny_metadata) {
iref = (struct btrfs_extent_inline_ref *)(extent_item + 1);
} else {
+   struct btrfs_tree_block_info *block_info;
block_info = (struct btrfs_tree_block_info *)(extent_item + 1);
-

[PATCH 2/8] btrfs-progs: Add alloc_reserved_tree_block2 function

2018-08-16 Thread Nikolay Borisov
This is a simple adapter function to convert the delayed-refs structures
to the current arguments of alloc_reserved_tree_block.

Signed-off-by: Nikolay Borisov 
Signed-off-by: David Sterba 
---
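How the adapter picks the key for the new extent item can be sketched as follows: with skinny metadata the key offset carries the tree level, otherwise it carries the extent length. The toy_ types and enum values are hypothetical stand-ins; the real key type constants live in ctree.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two key types used in the patch. */
enum toy_key_type { TOY_EXTENT_ITEM_KEY, TOY_METADATA_ITEM_KEY };

struct toy_key {
	uint64_t objectid;
	enum toy_key_type type;
	uint64_t offset;
};

/* Build the key for a tree block at bytenr, following the patch's branch. */
static struct toy_key toy_tree_block_key(uint64_t bytenr, uint64_t num_bytes,
					 int level, bool skinny_metadata)
{
	struct toy_key key = { .objectid = bytenr };

	if (skinny_metadata) {
		key.type = TOY_METADATA_ITEM_KEY;
		key.offset = (uint64_t)level;
	} else {
		key.type = TOY_EXTENT_ITEM_KEY;
		key.offset = num_bytes;
	}
	return key;
}
```
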
 extent-tree.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/extent-tree.c b/extent-tree.c
index 34409e600087..f7b59f84bf3d 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2687,6 +2687,30 @@ int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
return ret;
 }
 
+static int alloc_reserved_tree_block2(struct btrfs_trans_handle *trans,
+ struct btrfs_delayed_ref_node *node,
+ struct btrfs_delayed_extent_op *extent_op)
+{
+
+   struct btrfs_delayed_tree_ref *ref = btrfs_delayed_node_to_tree_ref(node);
+   struct btrfs_key ins;
+   bool skinny_metadata = btrfs_fs_incompat(trans->fs_info, SKINNY_METADATA);
+
+   ins.objectid = node->bytenr;
+   if (skinny_metadata) {
+   ins.offset = ref->level;
+   ins.type = BTRFS_METADATA_ITEM_KEY;
+   } else {
+   ins.offset = node->num_bytes;
+   ins.type = BTRFS_EXTENT_ITEM_KEY;
+   }
+
+   return alloc_reserved_tree_block(trans, ref->root, trans->transid,
+extent_op->flags_to_set,
+&extent_op->key, ref->level, &ins);
+
+}
+
 static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 u64 root_objectid, u64 generation,
 u64 flags, struct btrfs_disk_key *key,
-- 
2.7.4



[PATCH 6/8] btrfs-progs: Remove old delayed refs infrastructure

2018-08-16 Thread Nikolay Borisov
Given that the new delayed refs infrastructure is implemented and
wired up, there is no point in keeping the old code. So just remove it.

Signed-off-by: Nikolay Borisov 
Signed-off-by: David Sterba 
---
 ctree.h   |   2 -
 disk-io.c |   2 -
 extent-tree.c | 137 --
 3 files changed, 141 deletions(-)

diff --git a/ctree.h b/ctree.h
index 75675ef3f781..49f0f5181512 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1098,7 +1098,6 @@ struct btrfs_fs_info {
struct extent_io_tree free_space_cache;
struct extent_io_tree block_group_cache;
struct extent_io_tree pinned_extents;
-   struct extent_io_tree pending_del;
struct extent_io_tree extent_ins;
struct extent_io_tree *excluded_extents;
 
@@ -2481,7 +2480,6 @@ int btrfs_fix_block_accounting(struct btrfs_trans_handle *trans);
 void btrfs_pin_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes);
 void btrfs_unpin_extent(struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes);
-int btrfs_extent_post_op(struct btrfs_trans_handle *trans);
 struct btrfs_block_group_cache *btrfs_lookup_block_group(struct
 btrfs_fs_info *info,
 u64 bytenr);
diff --git a/disk-io.c b/disk-io.c
index 26e4f6e93ed6..2e6d56a36af9 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -730,7 +730,6 @@ struct btrfs_fs_info *btrfs_new_fs_info(int writable, u64 sb_bytenr)
extent_io_tree_init(&fs_info->free_space_cache);
extent_io_tree_init(&fs_info->block_group_cache);
extent_io_tree_init(&fs_info->pinned_extents);
-   extent_io_tree_init(&fs_info->pending_del);
extent_io_tree_init(&fs_info->extent_ins);
fs_info->excluded_extents = NULL;
 
@@ -988,7 +987,6 @@ void btrfs_cleanup_all_caches(struct btrfs_fs_info *fs_info)
extent_io_tree_cleanup(&fs_info->free_space_cache);
extent_io_tree_cleanup(&fs_info->block_group_cache);
extent_io_tree_cleanup(&fs_info->pinned_extents);
-   extent_io_tree_cleanup(&fs_info->pending_del);
extent_io_tree_cleanup(&fs_info->extent_ins);
 }
 
diff --git a/extent-tree.c b/extent-tree.c
index 2fa51bbc0359..6893b4c07019 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -52,8 +52,6 @@ static int __free_extent(struct btrfs_trans_handle *trans,
 u64 bytenr, u64 num_bytes, u64 parent,
 u64 root_objectid, u64 owner_objectid,
 u64 owner_offset, int refs_to_drop);
-static int finish_current_insert(struct btrfs_trans_handle *trans);
-static int del_pending_extents(struct btrfs_trans_handle *trans);
 static struct btrfs_block_group_cache *
 btrfs_find_block_group(struct btrfs_root *root, struct btrfs_block_group_cache
   *hint, u64 search_start, int data, int owner);
@@ -1422,13 +1420,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
return err;
 }
 
-int btrfs_extent_post_op(struct btrfs_trans_handle *trans)
-{
-   finish_current_insert(trans);
-   del_pending_extents(trans);
-   return 0;
-}
-
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr,
 u64 offset, int metadata, u64 *refs, u64 *flags)
@@ -2012,74 +2003,6 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
return 0;
 }
 
-static int extent_root_pending_ops(struct btrfs_fs_info *info)
-{
-   u64 start;
-   u64 end;
-   int ret;
-
-   ret = find_first_extent_bit(&info->extent_ins, 0, &start,
-   &end, EXTENT_LOCKED);
-   if (!ret) {
-   ret = find_first_extent_bit(&info->pending_del, 0, &start, &end,
-   EXTENT_LOCKED);
-   }
-   return ret == 0;
-
-}
-static int finish_current_insert(struct btrfs_trans_handle *trans)
-{
-   u64 start;
-   u64 end;
-   u64 priv;
-   struct btrfs_fs_info *info = trans->fs_info;
-   struct btrfs_root *extent_root = info->extent_root;
-   struct pending_extent_op *extent_op;
-   struct btrfs_key key;
-   int ret;
-   int skinny_metadata =
-   btrfs_fs_incompat(extent_root->fs_info, SKINNY_METADATA);
-
-
-   while(1) {
-   ret = find_first_extent_bit(&info->extent_ins, 0, &start,
-   &end, EXTENT_LOCKED);
-   if (ret)
-   break;
-
-   ret = get_state_private(&info->extent_ins, start, &priv);
-   BUG_ON(ret);
-   extent_op = (struct pending_extent_op *)(unsigned long)priv;
-
-   if (extent_op->type == PENDING_EXTENT_INSERT) {
-   key.objectid = start;
-   if (skinny_metadata) {
-   

[PATCH 7/8] btrfs-progs: Remove __free_extent2

2018-08-16 Thread Nikolay Borisov
Now that delayed refs have all been wired up, remove the __free_extent2
adapter function, since it's no longer needed. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 extent-tree.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/extent-tree.c b/extent-tree.c
index 6893b4c07019..1a63efdd9681 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -2052,17 +2052,6 @@ void btrfs_unpin_extent(struct btrfs_fs_info *fs_info,
update_pinned_extents(fs_info, bytenr, num_bytes, 0);
 }
 
-static int __free_extent2(struct btrfs_trans_handle *trans,
- struct btrfs_delayed_ref_node *node,
- struct btrfs_delayed_extent_op *extent_op)
-{
-
-   struct btrfs_delayed_tree_ref *ref = btrfs_delayed_node_to_tree_ref(node);
-
-   return __free_extent(trans, node->bytenr, node->num_bytes,
-ref->parent, ref->root, ref->level, 0, 1);
-}
-
 /*
  * remove an extent from the root, returns 0 on success
  */
@@ -4183,7 +4172,9 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans,
BUG_ON(!extent_op || !extent_op->update_flags);
ret = alloc_reserved_tree_block2(trans, node, extent_op);
} else if (node->action == BTRFS_DROP_DELAYED_REF) {
-   ret = __free_extent2(trans, node, extent_op);
+   struct btrfs_delayed_tree_ref *ref = btrfs_delayed_node_to_tree_ref(node);
+   ret =  __free_extent(trans, node->bytenr, node->num_bytes,
+ref->parent, ref->root, ref->level, 0, 1);
} else {
BUG();
}
-- 
2.7.4



[PATCH 5/8] btrfs-progs: Wire up delayed refs

2018-08-16 Thread Nikolay Borisov
This commit enables the delayed refs infrastructure. This entails doing
the following:

1. Replacing existing calls of btrfs_extent_post_op (which is the
equivalent of delayed refs) with the proper btrfs_run_delayed_refs, as
well as eliminating open-coded calls to finish_current_insert and
del_pending_extents, which execute the delayed ops.

2. Wiring up the addition of delayed refs when freeing extents
(btrfs_free_extent) and when adding new extents (alloc_tree_block).

3. Adding calls to btrfs_run_delayed_refs in the transaction commit
path, alongside comments explaining why every call is needed, since it's
not always obvious (those call sites were derived empirically by running
and debugging existing tests).

4. Correctly flagging the transaction in which we are reinitialising
the extent tree.

5. Moving the btrfs_write_dirty_block_groups call into the main
transaction commit routine, since block groups should be written to disk
after the last delayed refs have been run.

Signed-off-by: Nikolay Borisov 
Signed-off-by: David Sterba 
---
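The ordering this patch aims for (drain all queued refs, then write block groups) can be illustrated with a toy queue. The call sites in the diff pass -1 as nr; the sketch assumes this means "run everything", which is how (unsigned long)-1 is treated by the kernel's btrfs_run_delayed_refs. All toy_ names are hypothetical.

```c
/* Toy queue: counts of operations still queued and already applied. */
struct toy_queue {
	int pending;
	int applied;
};

/* Apply up to nr queued operations; (unsigned long)-1 means "all of them". */
static int toy_run_delayed_refs(struct toy_queue *q, unsigned long nr)
{
	int run_all = (nr == (unsigned long)-1);

	while (q->pending > 0 && (run_all || nr > 0)) {
		q->pending--;
		q->applied++;
		if (!run_all)
			nr--;
	}
	return 0;
}

/* Commit: drain every queued ref first, only then write block groups. */
static int toy_commit(struct toy_queue *q, int *block_groups_written)
{
	int ret = toy_run_delayed_refs(q, -1); /* -1 converts to ULONG_MAX */

	if (ret)
		return ret;
	*block_groups_written = 1;
	return 0;
}
```

Passing a small nr processes only part of the queue, while the commit path always drains it completely before anything that depends on final extent accounting is written.
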
 check/main.c  |   3 +-
 extent-tree.c | 166 ++
 transaction.c |  27 +-
 3 files changed, 112 insertions(+), 84 deletions(-)

diff --git a/check/main.c b/check/main.c
index bc2ee22f7943..b361cd7e26a0 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8710,7 +8710,7 @@ static int reinit_extent_tree(struct btrfs_trans_handle *trans,
fprintf(stderr, "Error adding block group\n");
return ret;
}
-   btrfs_extent_post_op(trans);
+   btrfs_run_delayed_refs(trans, -1);
}
 
ret = reset_balance(trans, fs_info);
@@ -9767,6 +9767,7 @@ int cmd_check(int argc, char **argv)
goto close_out;
}
 
+   trans->reinit_extent_tree = true;
if (init_extent_tree) {
printf("Creating a new extent tree\n");
ret = reinit_extent_tree(trans, info,
diff --git a/extent-tree.c b/extent-tree.c
index 7d6c37c6b371..2fa51bbc0359 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1418,8 +1418,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
err = ret;
 out:
btrfs_free_path(path);
-   finish_current_insert(trans);
-   del_pending_extents(trans);
BUG_ON(err);
return err;
 }
@@ -1602,8 +1600,6 @@ int btrfs_set_block_flags(struct btrfs_trans_handle *trans, u64 bytenr,
btrfs_set_extent_flags(l, item, flags);
 out:
btrfs_free_path(path);
-   finish_current_insert(trans);
-   del_pending_extents(trans);
return ret;
 }
 
@@ -1701,7 +1697,6 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
 struct btrfs_block_group_cache *cache)
 {
int ret;
-   int pending_ret;
struct btrfs_root *extent_root = trans->fs_info->extent_root;
unsigned long bi;
struct extent_buffer *leaf;
@@ -1717,12 +1712,8 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
btrfs_mark_buffer_dirty(leaf);
btrfs_release_path(path);
 fail:
-   finish_current_insert(trans);
-   pending_ret = del_pending_extents(trans);
if (ret)
return ret;
-   if (pending_ret)
-   return pending_ret;
return 0;
 
 }
@@ -2049,6 +2040,7 @@ static int finish_current_insert(struct btrfs_trans_handle *trans)
int skinny_metadata =
btrfs_fs_incompat(extent_root->fs_info, SKINNY_METADATA);
 
+
while(1) {
ret = find_first_extent_bit(&info->extent_ins, 0, &start,
&end, EXTENT_LOCKED);
@@ -2080,6 +2072,8 @@ static int finish_current_insert(struct btrfs_trans_handle *trans)
BUG_ON(1);
}
 
+
+   printf("shouldn't be executed\n");
clear_extent_bits(&info->extent_ins, start, end, EXTENT_LOCKED);
kfree(extent_op);
}
@@ -2379,7 +2373,6 @@ static int __free_extent(struct btrfs_trans_handle *trans,
}
 fail:
btrfs_free_path(path);
-   finish_current_insert(trans);
return ret;
 }
 
@@ -2462,33 +2455,30 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
  u64 bytenr, u64 num_bytes, u64 parent,
  u64 root_objectid, u64 owner, u64 offset)
 {
-   struct btrfs_root *extent_root = root->fs_info->extent_root;
-   int pending_ret;
int ret;
 
WARN_ON(num_bytes < root->fs_info->sectorsize);
-   if (root == extent_root) {
-   struct pending_extent_op *extent_op;
-
-   extent_op = kmalloc(sizeof(*extent_op), GFP_NOFS);
-   BUG_ON(!extent_op);
-
-   extent_op->type = PENDING_EXTENT_DELETE;
-   extent_op->bytenr = bytenr;
-   extent_op->num

[PATCH 4/8] btrfs-progs: Make btrfs_write_dirty_block_groups take only trans argument

2018-08-16 Thread Nikolay Borisov
The root argument is used only to get a reference to the fs_info. This
can be achieved with the transaction handle being passed, so use that.
This is in preparation for moving this function into the main transaction
commit routine. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 ctree.h   | 3 +--
 extent-tree.c | 5 ++---
 transaction.c | 4 ++--
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/ctree.h b/ctree.h
index 5242595fe355..75675ef3f781 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2523,8 +2523,7 @@ int btrfs_update_extent_ref(struct btrfs_trans_handle *trans,
u64 orig_parent, u64 parent,
u64 root_objectid, u64 ref_generation,
u64 owner_objectid);
-int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root);
+int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_free_block_groups(struct btrfs_fs_info *info);
 int btrfs_read_block_groups(struct btrfs_root *root);
 struct btrfs_block_group_cache *
diff --git a/extent-tree.c b/extent-tree.c
index 3356dd2e4cf6..7d6c37c6b371 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1727,8 +1727,7 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
 
 }
 
-int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root)
+int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans)
 {
struct extent_io_tree *block_group_cache;
struct btrfs_block_group_cache *cache;
@@ -1739,7 +1738,7 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
u64 end;
u64 ptr;
 
-   block_group_cache = &root->fs_info->block_group_cache;
+   block_group_cache = &trans->fs_info->block_group_cache;
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
diff --git a/transaction.c b/transaction.c
index ecafbb156610..96d9891b0d1c 100644
--- a/transaction.c
+++ b/transaction.c
@@ -61,7 +61,7 @@ static int update_cowonly_root(struct btrfs_trans_handle *trans,
u64 old_root_bytenr;
struct btrfs_root *tree_root = root->fs_info->tree_root;
 
-   btrfs_write_dirty_block_groups(trans, root);
+   btrfs_write_dirty_block_groups(trans);
while(1) {
old_root_bytenr = btrfs_root_bytenr(&root->root_item);
if (old_root_bytenr == root->node->start)
@@ -75,7 +75,7 @@ static int update_cowonly_root(struct btrfs_trans_handle *trans,
&root->root_key,
&root->root_item);
BUG_ON(ret);
-   btrfs_write_dirty_block_groups(trans, root);
+   btrfs_write_dirty_block_groups(trans);
}
return 0;
 }
-- 
2.7.4



[PATCH 0/8 V2] Add delayed-refs support to btrfs-progs

2018-08-16 Thread Nikolay Borisov
Hello, 

Here is the second version of delayed refs support for btrfs-progs. The first
version can be found here [1]. I've taken into account all the feedback from
Misono, verified that the code works, and rebased it atop btrfs-progs
4.17.1.

Changes since v1: 
 * Removed num_entries variable from delayed ref root
 
 * Added a patch to refactor btrfs_write_dirty_block_groups and subsequently
 changed when this function is called to fix an issue reported by Misono. I
 verified that 'make test-fsck TEST_ENABLE_OVERRIDE=true TEST_ARGS_CHECK=--mode=lowmem'
 produces no errors

 * Added 2 patches which remove the adapter functions newly added at the
 beginning of the series, following the wiring up of the delayed refs
 infrastructure. The first one (dealing with __free_extent2) is trivial, while
 the second one (for alloc_reserved_tree_block2) is a bit more involved, since
 I've opted to merge the two functions.

 * Rebased atop latest btrfs-progs release - 4.17.1

 * Dropped patches which have been merged in the meantime


[1] https://www.spinics.net/lists/linux-btrfs/msg79173.html

Nikolay Borisov (8):
  btrfs-progs: Add __free_extent2 function
  btrfs-progs: Add alloc_reserved_tree_block2 function
  btrfs-progs: Add delayed refs infrastructure
  btrfs-progs: Make btrfs_write_dirty_block_groups take only trans
argument
  btrfs-progs: Wire up delayed refs
  btrfs-progs: Remove old delayed refs infrastructure
  btrfs-progs: Remove __free_extent2
  btrfs-progs: Merge alloc_reserved_tree_block(2|)

 Makefile  |   3 +-
 check/main.c  |   3 +-
 ctree.h   |   8 +-
 delayed-ref.c | 607 ++
 delayed-ref.h | 208 
 disk-io.c |   2 -
 extent-tree.c | 575 +++---
 kerncompat.h  |   8 +
 transaction.c |  29 ++-
 transaction.h |   4 +
 10 files changed, 1199 insertions(+), 248 deletions(-)
 create mode 100644 delayed-ref.c
 create mode 100644 delayed-ref.h

-- 
2.7.4



Re: Are the btrfs mount options inconsistent?

2018-08-16 Thread David Sterba
On Thu, Aug 16, 2018 at 12:01:25PM +0100, David Howells wrote:
> I'm trying to convert btrfs to use the new mount API stuff and I'm finding it
> hard to work out the relationships between some of the arguments, specifically
> datacow, datasum and compress*.
> 
> What I see is that enabling datasum implies enabling datacow and that
> disabling datacow implies disabling datasum - which seems reasonable.
> 
> However, selecting compression implies enabling datacow and datasum, and
> disabling datacow, as one might expect, disables compression - but disabling
> datasum does not.  Is that correct?

No, it's not. Compression needs the checksums, so nodatasum should disable
compression, which is missing as you found out.

This invalid combination also caused some problems during device
replace, but that was fixed in 4.18, so the missing part is the mount
options.
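
The relationships described in this thread can be written down as a small flag normalizer. This is a sketch of the rules as stated above, not the kernel's option parser: the TOY_ flags and toy_apply_opt are hypothetical, and the nodatasum branch implements the behaviour described as missing.

```c
#include <string.h>

#define TOY_NODATACOW (1u << 0)
#define TOY_NODATASUM (1u << 1)
#define TOY_COMPRESS  (1u << 2)

/* Apply one mount option to the flag set and keep the set consistent. */
static unsigned toy_apply_opt(unsigned flags, const char *opt)
{
	if (!strcmp(opt, "compress")) {
		/* Compression implies datacow and datasum. */
		flags |= TOY_COMPRESS;
		flags &= ~(TOY_NODATACOW | TOY_NODATASUM);
	} else if (!strcmp(opt, "nodatacow")) {
		/* nodatacow implies nodatasum and disables compression. */
		flags |= TOY_NODATACOW | TOY_NODATASUM;
		flags &= ~TOY_COMPRESS;
	} else if (!strcmp(opt, "nodatasum")) {
		/* Compression needs checksums, so drop it as well. */
		flags |= TOY_NODATASUM;
		flags &= ~TOY_COMPRESS;
	} else if (!strcmp(opt, "datasum")) {
		/* datasum implies datacow. */
		flags &= ~(TOY_NODATASUM | TOY_NODATACOW);
	} else if (!strcmp(opt, "datacow")) {
		flags &= ~TOY_NODATACOW;
	}
	return flags;
}
```

Options are applied one at a time, so a later option overrides the effect of an earlier one.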


Are the btrfs mount options inconsistent?

2018-08-16 Thread David Howells
Hi Chris,

I'm trying to convert btrfs to use the new mount API stuff and I'm finding it
hard to work out the relationships between some of the arguments, specifically
datacow, datasum and compress*.

What I see is that enabling datasum implies enabling datacow and that
disabling datacow implies disabling datasum - which seems reasonable.

However, selecting compression implies enabling datacow and datasum, and
disabling datacow, as one might expect, disables compression - but disabling
datasum does not.  Is that correct?

David


Re: Transaction aborted - 4.16.17 kernel

2018-08-16 Thread Nikolay Borisov



On 16.08.2018 11:07, Qu Wenruo wrote:
> 
> 
> On 2018/8/16 3:51 PM, David Goodwin wrote:
>> I've just spotted this on one server.
>>
>> Running : umount /backups && mount /backups
>>
>> seems to allow it to become r/w again, but it does write :
>>
>> BTRFS error (device xvdj): cleaner transaction attach returned -30
>>
>> to 'dmesg'.
>>
>>
>> David.
>>
>> [ cut here ]
>>
>> BTRFS: Transaction aborted (error -28)
>> WARNING: CPU: 3 PID: 992 at fs/btrfs/extent-tree.c:7004
> 
> Looks like some tree block backref mismatch.

Nope, -28 is ENOSPC. I think this is the problem that Josef sent some
patches for, i.e. when running delayed refs puts too much pressure on
the transaction commit and it's not possible to satisfy it. Unfortunately
those patches haven't been merged.
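
For reference, the two codes in this report are negated errno values. A minimal sketch that maps them back to names, assuming the Linux errno numbering (ENOSPC is 28, EROFS is 30); toy_errname is a hypothetical helper:

```c
#include <errno.h>

/* Map a kernel-style negative return code to its errno name. */
static const char *toy_errname(int ret)
{
	switch (-ret) {
	case ENOSPC:
		return "ENOSPC"; /* no space left on device */
	case EROFS:
		return "EROFS";  /* read-only file system */
	default:
		return "other";
	}
}
```

So the abort (-28) was an out-of-space condition, and the later -30 from the cleaner is the read-only state that followed it.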

> 
> Would you please try "btrfs check --readonly " to see if it
> reports any error?
> 
> Thanks,
> Qu
> 
>> __btrfs_free_extent.isra.63+0x3d2/0xd20
>> Modules linked in: dm_mod dax ipt_REJECT nf_reject_ipv4 nfsv3
>> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp nf_conntrack_ftp nf_nat
>> nf_conntrack libcrc32c xt_multiport iptable_filter ip_tables x_tables
>> autofs4 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc
>> intel_rapl crct10dif_pclmul crc32_pclmul ghash_clmulni_intel evdev pcbc
>> snd_pcsp snd_pcm aesni_intel snd_timer aes_x86_64 crypto_simd snd
>> glue_helper cryptd soundcore xen_netfront xen_blkfront crc32c_intel
>> CPU: 3 PID: 992 Comm: btrfs-transacti Not tainted 4.16.17-dg1 #1
>> RIP: e030:__btrfs_free_extent.isra.63+0x3d2/0xd20
>> RSP: e02b:c9004290bc68 EFLAGS: 00010292
>> RAX: 0026 RBX: 0180c5538000 RCX: 0006
>> RDX: 0007 RSI: 0001 RDI: 88039a996650
>> RBP: ffe4 R08: 0001 R09: 1d80
>> R10: 0001 R11: 1d80 R12: 880392d64000
>> R13: 880251fedcb0 R14:  R15: 0002
>> FS:  () GS:88039a98()
>> knlGS:
>> CS:  e033 DS:  ES:  CR0: 80050033
>> CR2: 7fdee11425c0 CR3: 0002d370 CR4: 0660
>> Call Trace:
>>  ? btrfs_merge_delayed_refs+0x23c/0x3c0
>>  __btrfs_run_delayed_refs+0x320/0x1180
>>  btrfs_run_delayed_refs+0x105/0x1c0
>>  btrfs_commit_transaction+0x393/0x8a0
>>  ? wait_woken+0x80/0x80
>>  transaction_kthread+0x195/0x1b0
>>  kthread+0xf8/0x130
>>  ? btrfs_cleanup_transaction+0x540/0x540
>>  ? kthread_create_worker_on_cpu+0x50/0x50
>>  ret_from_fork+0x35/0x40
>> Code: 48 8b 04 24 48 8b 40 50 f0 48 0f ba a8 d0 16 00 00 02 72 19 83 fd
>> fb 0f 84 07 03 00 00 89 ee 48 c7 c7 30 f3 d9 81 e8 2e a6 d7 ff <0f> 0b
>> 48 8b 3c 24 89 e9 ba 5c 1b 00 00 48 c7 c6 80 86 c3 81 e8
>> ---[ end trace c83740faa277d833 ]---
> 


Re: Transaction aborted - 4.16.17 kernel

2018-08-16 Thread Qu Wenruo


On 2018/8/16 3:51 PM, David Goodwin wrote:
> I've just spotted this on one server.
> 
> Running : umount /backups && mount /backups
> 
> seems to allow it to become r/w again, but it does write :
> 
> BTRFS error (device xvdj): cleaner transaction attach returned -30
> 
> to 'dmesg'.
> 
> 
> David.
> 
> [ cut here ]
> 
> BTRFS: Transaction aborted (error -28)
> WARNING: CPU: 3 PID: 992 at fs/btrfs/extent-tree.c:7004

Looks like some tree block backref mismatch.

Would you please try "btrfs check --readonly <device>" to see if it
reports any error?

Thanks,
Qu

> __btrfs_free_extent.isra.63+0x3d2/0xd20
> Modules linked in: dm_mod dax ipt_REJECT nf_reject_ipv4 nfsv3
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp nf_conntrack_ftp nf_nat
> nf_conntrack libcrc32c xt_multiport iptable_filter ip_tables x_tables
> autofs4 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc
> intel_rapl crct10dif_pclmul crc32_pclmul ghash_clmulni_intel evdev pcbc
> snd_pcsp snd_pcm aesni_intel snd_timer aes_x86_64 crypto_simd snd
> glue_helper cryptd soundcore xen_netfront xen_blkfront crc32c_intel
> CPU: 3 PID: 992 Comm: btrfs-transacti Not tainted 4.16.17-dg1 #1
> RIP: e030:__btrfs_free_extent.isra.63+0x3d2/0xd20
> RSP: e02b:c9004290bc68 EFLAGS: 00010292
> RAX: 0026 RBX: 0180c5538000 RCX: 0006
> RDX: 0007 RSI: 0001 RDI: 88039a996650
> RBP: ffe4 R08: 0001 R09: 1d80
> R10: 0001 R11: 1d80 R12: 880392d64000
> R13: 880251fedcb0 R14:  R15: 0002
> FS:  () GS:88039a98()
> knlGS:
> CS:  e033 DS:  ES:  CR0: 80050033
> CR2: 7fdee11425c0 CR3: 0002d370 CR4: 0660
> Call Trace:
>  ? btrfs_merge_delayed_refs+0x23c/0x3c0
>  __btrfs_run_delayed_refs+0x320/0x1180
>  btrfs_run_delayed_refs+0x105/0x1c0
>  btrfs_commit_transaction+0x393/0x8a0
>  ? wait_woken+0x80/0x80
>  transaction_kthread+0x195/0x1b0
>  kthread+0xf8/0x130
>  ? btrfs_cleanup_transaction+0x540/0x540
>  ? kthread_create_worker_on_cpu+0x50/0x50
>  ret_from_fork+0x35/0x40
> Code: 48 8b 04 24 48 8b 40 50 f0 48 0f ba a8 d0 16 00 00 02 72 19 83 fd
> fb 0f 84 07 03 00 00 89 ee 48 c7 c7 30 f3 d9 81 e8 2e a6 d7 ff <0f> 0b
> 48 8b 3c 24 89 e9 ba 5c 1b 00 00 48 c7 c6 80 86 c3 81 e8
> ---[ end trace c83740faa277d833 ]---





Transaction aborted - 4.16.17 kernel

2018-08-16 Thread David Goodwin

I've just spotted this on one server.

Running : umount /backups && mount /backups

seems to allow it to become r/w again, but it does write :

BTRFS error (device xvdj): cleaner transaction attach returned -30

to 'dmesg'.


David.

[ cut here ]

BTRFS: Transaction aborted (error -28)
WARNING: CPU: 3 PID: 992 at fs/btrfs/extent-tree.c:7004 
__btrfs_free_extent.isra.63+0x3d2/0xd20
Modules linked in: dm_mod dax ipt_REJECT nf_reject_ipv4 nfsv3 
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat_ftp nf_conntrack_ftp nf_nat 
nf_conntrack libcrc32c xt_multiport iptable_filter ip_tables x_tables 
autofs4 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc 
intel_rapl crct10dif_pclmul crc32_pclmul ghash_clmulni_intel evdev pcbc 
snd_pcsp snd_pcm aesni_intel snd_timer aes_x86_64 crypto_simd snd 
glue_helper cryptd soundcore xen_netfront xen_blkfront crc32c_intel

CPU: 3 PID: 992 Comm: btrfs-transacti Not tainted 4.16.17-dg1 #1
RIP: e030:__btrfs_free_extent.isra.63+0x3d2/0xd20
RSP: e02b:c9004290bc68 EFLAGS: 00010292
RAX: 0026 RBX: 0180c5538000 RCX: 0006
RDX: 0007 RSI: 0001 RDI: 88039a996650
RBP: ffe4 R08: 0001 R09: 1d80
R10: 0001 R11: 1d80 R12: 880392d64000
R13: 880251fedcb0 R14:  R15: 0002
FS:  () GS:88039a98() knlGS:
CS:  e033 DS:  ES:  CR0: 80050033
CR2: 7fdee11425c0 CR3: 0002d370 CR4: 0660
Call Trace:
 ? btrfs_merge_delayed_refs+0x23c/0x3c0
 __btrfs_run_delayed_refs+0x320/0x1180
 btrfs_run_delayed_refs+0x105/0x1c0
 btrfs_commit_transaction+0x393/0x8a0
 ? wait_woken+0x80/0x80
 transaction_kthread+0x195/0x1b0
 kthread+0xf8/0x130
 ? btrfs_cleanup_transaction+0x540/0x540
 ? kthread_create_worker_on_cpu+0x50/0x50
 ret_from_fork+0x35/0x40
Code: 48 8b 04 24 48 8b 40 50 f0 48 0f ba a8 d0 16 00 00 02 72 19 83 fd 
fb 0f 84 07 03 00 00 89 ee 48 c7 c7 30 f3 d9 81 e8 2e a6 d7 ff <0f> 0b 
48 8b 3c 24 89 e9 ba 5c 1b 00 00 48 c7 c6 80 86 c3 81 e8

---[ end trace c83740faa277d833 ]---