Re: btrfs-convert missing in btrfs-tools v4.15.1

2018-08-23 Thread Duncan
Nicholas D Steeves posted on Thu, 23 Aug 2018 14:15:18 -0400 as excerpted:

>> It's in my interest to ship all tools in distros, but there's also only
>> that much what the upstream community can do. If you're going to
>> reconsider the status of btrfs-convert in Debian, please let me know.
> 
> Yes, I'd be happy to advocate for its reinclusion if the answer to 4/5
> of the following questions is "yes".  Does SUSE now recommend the use of
> btrfs-convert to its enterprise customers?  The following is a
> frustrating criterion, but: Can a random desktop user run btrfs-convert
> against their ext4 rootfs and expect the operation to succeed?  Is
> btrfs-convert now sufficiently trusted that it can be recommended with
> the same degree of confidence as a backup, mkfs.btrfs, then restore to
> new filesystem approach?  Does the user of a btrfs volume created with
> btrfs-convert have an equal or lesser probability of encountering bugs
> compared to one who used mkfs.btrfs?

Just a user and list regular here, on Gentoo not Debian, but for what 
it's worth...

I'd personally never consider or recommend a filesystem converter over 
the backup, mkfs-to-new-fs, restore-to-new-fs method, for three reasons.

1) Regardless of how stable a filesystem converter is and what two 
filesystems the conversion is between, "things" /do/ occasionally happen, 
thus making it irresponsible to use or recommend use of such a converter 
without a suitably current and tested backup, "just in case."

(This is of course a special case of the sysadmin's first rule of 
backups, that the true value of data is defined not by any arbitrary 
claims, but by the number of backups of that data it's considered worth 
the time/trouble/resources to make/have.  If the data value is trivial 
enough, sure, don't bother with the backup, but if it's of /that/ low a 
value, so low it's not worth a backup even when doing something as 
theoretically risky as a filesystem conversion, why is it worth the time 
and trouble to bother converting it in the first place, instead of just 
blowing it away and starting clean?)

2) Once a backup is considered "strongly recommended", as we've just 
established in (1) that it should be regardless of the stability of the 
converter, just using the existing filesystem as that backup, doing a 
fresh mkfs for the new filesystem, and copying things over is, simply 
put, the easiest and cleanest method to change filesystems.

3) (Pretty much)[1] Regardless of the filesystems in question, a fresh 
mkfs and clean sequential transfer of files from the old-fs/backup to the 
new one is pretty well guaranteed to be better optimized than conversion 
from an existing filesystem of a different type, particularly one that 
has been in normal operation for a while and thus has operational 
fragmentation of both data and free-space.  That's in addition to being 
less bug-prone, even for a "stable" converter.


Restating: (1) doing a conversion without a backup is irresponsible, 
(2) the easiest backup and conversion method is directly using the old fs 
as the backup, and copying over to the freshly mkfs-ed new filesystem, 
and (3) a freshly mkfs-ed filesystem and sequential copy of files to it 
from backup, whether that be the old filesystem or not, is going to be 
more efficient and less bug-prone than an in-place conversion.

Given the above, why would /anyone/ /sane/ consider using a converter?  
It simply doesn't make sense, even if the converter were as stable as the 
most stable filesystems we have.


So as a distro btrfs package maintainer, do what you wish in terms of the 
converter, but were it me, I might actually consider replacing it with an 
executable that simply printed out some form of the above argument, with 
a pointer to the sources should they still be interested after having 
read that argument.[2] Then, if people really are determined to 
unnecessarily waste their time to get a less efficient filesystem, 
possibly risking their data in the process of getting it, they can always 
build the converter from sources themselves.

---
[1] I debated omitting the qualifier as I know of no exceptions, but I'm 
not a filesystem expert and while I'm a bit skeptical, I suppose it's 
possible that they might exist.

[2] There's actually btrfs precedent for this in the form of the 
executable built as fsck.btrfs, which does nothing (successfully) but 
possibly print a message referring people to btrfs check, if run in 
interactive mode.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



dduper - Offline btrfs deduplication tool

2018-08-23 Thread Lakshmipathi.G
Hi -

dduper is an offline dedupe tool. Instead of reading whole file blocks and
computing checksums, it works by fetching checksums from the BTRFS csum tree.
This hugely improves the performance.

dduper works like:
- Read csum for given two files.
- Find matching location.
- Pass the location to ioctl_ficlonerange directly
  instead of ioctl_fideduperange

By default, dduper adds a safety check to the above steps by creating a 
backup reflink file and comparing md5sums after the dedupe. 
If the backup file matches the newly deduped file, the backup file is 
removed. You can skip this check by passing the --skip option. Here is 
sample CLI usage [1] and a quick demo [2].
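
For reference, here is a rough shell equivalent of that safety check (a
sketch only: the dduper invocation is a hypothetical example, while
cp --reflink=always and md5sum are the same tools dduper shells out to
internally):

    cp --reflink=always dst.file dst.file.bak     # cheap backup, shares extents
    dduper --device /dev/sdX --files src.file dst.file
    if [ "$(md5sum < dst.file.bak)" = "$(md5sum < dst.file)" ]; then
        rm dst.file.bak                           # contents unchanged, drop backup
    else
        echo "dedupe changed dst.file, keeping backup" >&2
    fi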

Some performance numbers: (with the --skip option)

Dedupe two 1GB files with same content - 1.2 seconds
Dedupe two 5GB files with same content - 8.2 seconds
Dedupe two 10GB files with same content - 13.8 seconds

dduper requires the `btrfs inspect-internal dump-csum` command; you can use 
this branch [3] or apply the patch yourself [4] 

[1] 
https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md
[2] http://giis.co.in/btrfs_dedupe.gif
[3] git clone https://gitlab.collabora.com/laks/btrfs-progs.git -b  dump_csum
[4] https://patchwork.kernel.org/patch/10540229/ 

Please remember it's version 0.1, so test it out before you use dduper on
real data.
Let me know if you have suggestions, feedback, or bugs :)

Cheers.
Lakshmipathi.G



[PATCH] btrfs-progs: dduper - BTRFS offline deduplication tool

2018-08-23 Thread Lakshmipathi Ganapathi
dduper is an offline dedupe tool. It works by fetching checksums from the
BTRFS csum tree, instead of reading whole file blocks and computing checksums.
This tool relies on output from the 'btrfs inspect-internal dump-csum' command.

Signed-off-by: Lakshmipathi.G 
---
 dduper | 310 +
 1 file changed, 310 insertions(+)
 create mode 100644 dduper

diff --git a/dduper b/dduper
new file mode 100644
index 000..2170b11
--- /dev/null
+++ b/dduper
@@ -0,0 +1,310 @@
+#!/usr/bin/env python
+
+""" dduper - BTRFS Dedupe tool.
+
+This is a offline dedupe tool. Instead of reading whole file blocks and
+computing checksum, It works by fetching checksum from BTRFS csum tree.
+This hugely improves the performance.
+
+Authors: Lakshmipathi.G 
+"""
+import argparse
+import errno
+import hashlib
+import numpy as np
+import math
+import os
+import pdb
+import struct
+import subprocess
+import sys
+
+from collections import OrderedDict
+from fcntl import ioctl
+from itertools import combinations
+from itertools import izip_longest
+from stat import *
+#  4kb block size
+blk_size = 4
+#  no. of csums on a single row - right now it's 8
+no_of_chunks = 0
+FICLONERANGE = 0x4020940d
+
+device_name = None
+skip = 0
+chunk_sz = 0
+run_len = 0
+ele_sz = 0
+
+# Already deduped files
+processed_files = []
+
+
+# From https://stackoverflow.com/questions/434287
+def grouper(iterable, n, fillvalue=None):
+    args = [iter(iterable)] * n
+    return izip_longest(*args, fillvalue=fillvalue)
+
+
+def get_ele_size():
+
+    global no_of_chunks, run_len
+    if chunk_sz <= 0 or chunk_sz % 32 != 0:
+        print "Ensure chunk size is a multiple of 32KB. (32,64,128 etc)"
+        sys.exit(-1)
+    no_of_chunks = chunk_sz / blk_size
+    ele_sz = no_of_chunks / 8
+    run_len = no_of_chunks * blk_size * 1024
+    return ele_sz
+
+
+def get_hashes(out1):
+    """
+    For each list item compute its hash and store it with offset as its key.
+    """
+    global ele_sz
+
+    if ele_sz == 1:
+        od = OrderedDict()
+        for idx, ele in enumerate(out1):
+            v = []
+            k = hashlib.md5(str(ele)).hexdigest()
+            v.append(idx)
+            if k in od:
+                print "Collision with: " + str(k) + " at offset: " + str(v)
+            od[k] = v
+    else:
+        od = OrderedDict()
+        for idx, ele in enumerate(grouper(out1, ele_sz, 'x')):
+            v = []
+            k = hashlib.md5(str(ele)).hexdigest()
+            v.append(idx)
+            if k in od:
+                print "Collision with: " + str(k) + " at offset: " + str(v)
+            od[k] = v
+
+    return od
+
+
+def ioctl_ficlonerange(dst_fd, s):
+
+    try:
+        ioctl(dst_fd, FICLONERANGE, s)
+    except Exception as e:
+        print "error({0})".format(e.errno)
+
+
+def cmp_files(file1, file2):
+
+    md1 = subprocess.Popen(['md5sum', file1], stdout=subprocess.PIPE,
+                           close_fds=True).stdout.read().split(" ")[0]
+    md2 = subprocess.Popen(['md5sum', file2], stdout=subprocess.PIPE,
+                           close_fds=True).stdout.read().split(" ")[0]
+    if md1 == md2:
+        return 0
+    else:
+        return 1
+
+
+def do_dedupe(src_file, dst_file, dry_run):
+
+    bkup_file = dst_file + ".__superduper"
+    src_fd = os.open(src_file, os.O_RDONLY)
+    dst_fd = os.open(dst_file, os.O_WRONLY)
+    perfect_match = 0
+
+    out1 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum',
+                             src_file, device_name],
+                            stdout=subprocess.PIPE, close_fds=True).stdout.readlines()
+    out2 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum',
+                             dst_file, device_name],
+                            stdout=subprocess.PIPE, close_fds=True).stdout.readlines()
+    # todo: perfect match files. Remove dst_file from further operations
+    if out1 == out2:
+        print "Perfect match : ", src_file, dst_file
+        perfect_match = 1
+
+    src_dict = get_hashes(out1)
+    dst_dict = get_hashes(out2)
+    total_entry = len(src_dict) - 1  # Fix missing final ele
+    file_size = os.path.getsize(src_file)
+
+    np1 = np.array([v for v in src_dict.keys()])
+    np2 = np.array([v for v in dst_dict.keys()])
+    matched_keys = np.intersect1d(np1, np2)
+    unmatched_keys = np.setdiff1d(np1, np2)
+
+    if dry_run == 0:
+        # todo: Clear dict/np/list if they are not used further
+        # todo: handle same content within a single file
+
+        if matched_keys is not None:
+            if skip == 0:
+                bkup2 = subprocess.Popen(['cp', '--reflink=always',
+                                          dst_file, bkup_file],
+                                         stdout=subprocess.PIPE)
+            print "*" * 24
+            # print "matched regions"
+            for location in matched_keys:
+                entry = src_dict[location][0]
+                src_len = no_of_chunks * blk_size * 1024
+                src_offset = src_dict[location][0] * src_len
+
+                multi_dst_offsets = dst_dict[location]  # list
+                for offset in 

[PATCH] btrfs: ctree.h: Fix suspicious rcu usage warning in btrfs_debug_in_rcu()

2018-08-23 Thread Misono Tomohiro
commit 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded rcu string handling with macro.

It turns out that btrfs_debug_in_rcu() is used for the first time, and
the macro lacks the rcu string lock/unlock for the non-debug case
(i.e. when the message is not printed), leading to a suspicious RCU usage
warning when CONFIG_PROVE_RCU is on.

Fix this by adding a wrapper that calls lock/unlock for the non-debug case too.

Fixes: 672d599041c8 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code")
Reported-by: David Howells 
Signed-off-by: Misono Tomohiro 
---
 fs/btrfs/ctree.h | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 53af9f5253f4..cc8b4ff8dcea 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3390,9 +3390,9 @@ do {							\
 #define btrfs_debug(fs_info, fmt, args...) \
btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_in_rcu(fs_info, fmt, args...) \
-   btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
+   btrfs_no_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
-   btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
+   btrfs_no_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_rl(fs_info, fmt, args...) \
btrfs_no_printk(fs_info, KERN_DEBUG fmt, ##args)
 #endif
@@ -3404,6 +3404,13 @@ do {							\
rcu_read_unlock();  \
 } while (0)
 
+#define btrfs_no_printk_in_rcu(fs_info, fmt, args...)  \
+do {   \
+   rcu_read_lock();\
+   btrfs_no_printk(fs_info, fmt, ##args);  \
+   rcu_read_unlock();  \
+} while (0)
+
 #define btrfs_printk_ratelimited(fs_info, fmt, args...)\
 do {   \
static DEFINE_RATELIMIT_STATE(_rs,  \
-- 
2.14.4



Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-23 Thread Zygo Blaxell
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > >   - should we just round down the EOF dedupe request to the
> > > block before EOF so dedupe still succeeds?
> > 
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
> 
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.

Here are some concrete examples.

In the following, letters are 4K disk blocks and also inode offsets
(i.e. "A" means a block containing 4096 x "A" located at inode offset 0,
"B" contains "B" located at inode offset 1, etc).  "|" indicates
a physical discontinuity of the blocks on disk.  Lowercase "a" has
identical content to uppercase "A", but they are located in different
physical blocks on disk.

Suppose you have two identical files with different write histories,
so they have different on-disk layouts:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

Inode 2:  a|b|c|d|e|f|g|hijklmnopqrstuvwxyz

A naive dedupe app might pick src and dst at random, and do this:

// dedupe(length, src_ino, src_off, dst_ino, dst_off)

dedupe(length 26, Inode 1, Offset 0, Inode 2, Offset 0)

with the result having 11 fragments in each file, all from the
original Inode 1:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

A smarter dedupe app might choose src and dst based on logical proximity
and/or physical seek distance, or the app might choose dst with the
smallest number of existing references in the filesystem, or the app might
simply choose the longest available src extents to minimize fragmentation:

dedupe(length 7, Inode 1, Offset 0, Inode 2, Offset 0)

dedupe(length 19, Inode 2, Offset 7, Inode 1, Offset 7)

with the result having 2 fragments in each file, each chosen
from a different original inode:

Inode 1:  ABCDEFG|hijklmnopqrstuvwxyz

Inode 2:  ABCDEFG|hijklmnopqrstuvwxyz

If the kernel continued past the 'length 7' size specified in the first
dedupe, then the 'hijklmnopqrstuvwxyz' would be *lost*, and the second
dedupe would be an expensive no-op because both Inode 1 and Inode 2
refer to the same physical blocks:

Inode 1:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ

  [---] - app asked for this
Inode 2:  ABCDEFGH|IJ|KL|M|N|O|PQ|RST|UV|WX|YZ
kernel does this too - [-]
and "hijklmnopqrstuvwxyz" no longer exists for second dedupe

A dedupe app willing to spend more on IO can create its own better src
with only one fragment:

open(with O_TMPFILE) -> Inode 3

copy(length 7, Inode 1, Offset 0, Inode 3, Offset 0)

copy(length 19, Inode 2, Offset 7, Inode 3, Offset 7)

dedupe(length 26, Inode 3, Offset 0, Inode 1, Offset 0)

dedupe(length 26, Inode 3, Offset 0, Inode 2, Offset 0)

close(Inode 3)

Now there is just one fragment referenced from two places:

Inode 1:  αβξδεφγηιςκλμνοπθρστυвшχψζ

Inode 2:  αβξδεφγηιςκλμνοπθρστυвшχψζ

[If encoding goes horribly wrong, the above are a-z transcoded as a mix
of Greek and Cyrillic Unicode characters.]
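
[A shell sketch of that four-step sequence, using dd for the copies (real
copies, since a reflink would just inherit the fragmented extents) and
xfs_io's dedupe command; a named temp file stands in for O_TMPFILE, and
inode1/inode2 are hypothetical paths for Inode 1 and Inode 2:]

    blk=4096
    tmp=$(mktemp --tmpdir=.)    # must live on the same btrfs filesystem

    # copy(length 7, Inode 1, Offset 0, Inode 3, Offset 0)
    dd if=inode1 of="$tmp" bs=$blk count=7 conv=notrunc status=none
    # copy(length 19, Inode 2, Offset 7, Inode 3, Offset 7)
    dd if=inode2 of="$tmp" bs=$blk skip=7 seek=7 count=19 conv=notrunc status=none
    sync "$tmp"                 # make sure the new single extent is on disk

    # dedupe(length 26, Inode 3, Offset 0, Inode 1/2, Offset 0)
    xfs_io -c "dedupe $tmp 0 0 $((26 * blk))" inode1
    xfs_io -c "dedupe $tmp 0 0 $((26 * blk))" inode2

    rm "$tmp"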

Real filesystems sometimes present thousands of possible dedupe
src/dst permutations to choose from.  The kernel shouldn't be trying to
second-guess an application that may have access to external information
to make better decisions (e.g. the full set of src extents available,
or knowledge of other calls the app will issue in the future).

> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks.  If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.
> 
> In a parallel universe (or a better filesystem, or a userspace emulation
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level.  The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks these point to, then adjusts all the references to the
> dst blocks at once, dealing with partial overlaps and snapshots and
> nodatacow and whatever other exotic features might be lurking in the
> filesystem, ending with every reference to every part of dst replaced
> by the longest possible contiguous 

Re: fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!

2018-08-23 Thread Misono Tomohiro
On 2018/08/24 0:49, David Howells wrote:
> I'm seeing the attached message generated from this line:
> 
>   btrfs_debug_in_rcu(fs_info,
>   "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
>   bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector,
>   (u_long)dev->bdev->bd_dev, rcu_str_deref(dev->name), dev->devid,
>   bio->bi_iter.bi_size);
> 
> in submit_stripe_bio().  I'm not sure exactly why, but I suspect
> rcu_str_deref() is the point from where it is generated.
> 
> Note that the mount succeeds.
> 
> This code was introduced by:
> 
>   commit 672d599041c862dd61a1576c32e946ef0d77aa34
>   Author: Misono Tomohiro 
>   Date:   Thu Aug 2 16:19:07 2018 +0900
> 


Thanks for the report.
I didn't turn on CONFIG_PROVE_RCU and missed the warning.

I will send a fix.

Thanks,
Misono

> David
> ---
> =
> WARNING: suspicious RCU usage
> 4.18.0-fscache+ #540 Not tainted
> -
> fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!
> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by mount/3194:
>  #0: 72604777 (&fs_type->s_umount_key#54/1){+.+.}, at: 
> alloc_super+0xa4/0x313
> 
> stack backtrace:
> CPU: 2 PID: 3194 Comm: mount Not tainted 4.18.0-fscache+ #540
> Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
> Call Trace:
>  dump_stack+0x67/0x8e
>  btrfs_map_bio+0x282/0x321
>  btree_submit_bio_hook+0x71/0xa6
>  submit_one_bio+0x55/0x66
>  read_extent_buffer_pages+0x1ec/0x2ab
>  btree_read_extent_buffer_pages+0x6e/0x237
>  ? alloc_extent_buffer+0x28f/0x2f2
>  read_tree_block+0x43/0x5e
>  open_ctree+0x139b/0x1ee4
>  btrfs_get_tree+0x357/0xa33
>  ? selinux_fs_context_dup+0x2d/0x104
>  vfs_get_tree+0x7a/0x162
>  btrfs_mount_root+0x52/0x8b
>  btrfs_get_tree+0x4ab/0xa33
>  ? vfs_parse_fs_string+0x5b/0x9e
>  vfs_get_tree+0x7a/0x162
>  do_mount+0x7f0/0x8b2
>  ? memdup_user+0x3e/0x5a
>  ksys_mount+0x72/0x97
>  __x64_sys_mount+0x21/0x24
>  do_syscall_64+0x7d/0x1a0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f11b8365ada
> Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 
> 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
> RSP: 002b:7ffe0d9a5b98 EFLAGS: 0246 ORIG_RAX: 00a5
> RAX: ffda RBX: 5562bf7702a0 RCX: 7f11b8365ada
> RDX: 5562bf77a620 RSI: 5562bf7704a0 RDI: 5562bf770480
> RBP:  R08:  R09: 7f11b8620c40
> R10: c0ed R11: 0246 R12: 5562bf770480
> R13: 5562bf77a620 R14:  R15: 7f11b90f9184
> 
> 



Re: Reproducer for "compressed data + hole data corruption bug, 2018 editiion"

2018-08-23 Thread Qu Wenruo


On 2018/8/24 12:44 AM, Zygo Blaxell wrote:
> On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
>> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
>>> This is a repro script for a btrfs bug that causes corrupted data reads
>>> when reading a mix of compressed extents and holes.  The bug is
>>> reproducible on at least kernels v4.1..v4.18.
>>
>> This bug already sounds more serious than the previous nodatasum +
>> compression bug.
> 
> Maybe.  "compression + holes corruption bug 2017" could be avoided with
> the max-inline=0 mount option without disabling compression.  This time,
> the workaround is more intrusive:  avoid all applications that use dedup
> or hole-punching.
> 
>>> Some more observations and background follow, but first here is the
>>> script and some sample output:
>>>
>>> root@rescue:/test# cat repro-hole-corruption-test
>>> #!/bin/bash
>>>
>>> # Write a 4096 byte block of something
>>> block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
>>>
>>> # Here is some test data with holes in it:
>>> for y in $(seq 0 100); do
>>> for x in 0 1; do
>>> block 0;
>>> block 21;
>>> block 0;
>>> block 22;
>>> block 0;
>>> block 0;
>>> block 43;
>>> block 44;
>>> block 0;
>>> block 0;
>>> block 61;
>>> block 62;
>>> block 63;
>>> block 64;
>>> block 65;
>>> block 66;
>>> done
>>
>> Does the content make any difference to this bug?
>> It's just 16 * 4K * 2 * 101 data writes *without* any holes so far.
> 
> The content of the extents doesn't seem to matter, other than it needs to
> be compressible so that the extents on disk are compressed.  The bug is
> also triggered by writing non-zero data to all blocks, and then punching
> the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".
> 
> The layout of the extents matters a lot.  I have to loop hundreds or
> thousands of times to hit the bug if the first block in the pattern is
> not a hole, or if the non-hole extents are different sizes or positions
> than above.
> 
> I tried random patterns of holes and extent refs, and most of them have
> an order of magnitude lower hit rates than the above.  This might be due
> to some relationship between the alignment of read() request boundaries
> with extent boundaries, but I haven't done any tests designed to detect
> such a relationship.
> 
> In the wild, corruption happens on some files much more often than others.
> This seems to be correlated with the extent layout as well.
> 
> I discovered the bug by examining files that were intermittently but
> repeatedly failing routine data integrity checks, and found that in every
> case they had similar hole + extent patterns near the point where data
> was corrupted.
> 
> I did a search on some big filesystems for the
> hole-refExtentA-hole-refExtentA pattern and found several files with
> this pattern that had passed previous data integrity checks, but would
> fail randomly in the sha1sum/drop-caches loop.
> 
>> This should indeed cause 101 128K compressed data extents.
>> But I'm wondering about the description of 'holes'.
> 
> The holes are coming, wait for it... ;)
> 
>>> done > am
>>> sync
>>>
>>> # Now replace those 101 distinct extents with 101 references to the 
>>> first extent
>>> btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 
>>> 131072)); done) 2>&1 | tail
>>
>> Will this bug still happen by creating one extent and then reflink it
>> 101 times?
> 
> Yes.  I used btrfs-extent-same because a binary is included in the
> Debian duperemove package, but I use it only for convenience.
> 
> It's not necessary to have hundreds of references to the same extent--even
> two refs to a single extent plus a hole can trigger the bug sometimes.
> 100 references in a single file will trigger the bug so often that it
> can be detected within the first 20 sha1sum loops.
> 
> When the corruption occurs, it affects around 90 of the original 101
> extents.  The different sha1sum results are due to different extents
> giving bad data on different runs.
> 
>>> # Punch holes into the extent refs
>>> fallocate -v -d am
>>
>> Hole-punch in fact happens here.
>>
>> BTW, will add a "sync" here change the result?
> 
> No.  You can reboot the machine here if you like, it does not change
> anything that happens during reads later.

So it looks like my assumption of a bad file extent interpreter is getting
more and more valid.

It has nothing to do with a race against hole punching/writes, but only
with the file layout and the extent map cache.

> 
> Looking at the extent tree in btrfs-debug-tree, the data on disk
> looks correct, and btrfs does read it correctly most of the time (the
> 

Re: lazytime mount option—no support in Btrfs

2018-08-23 Thread Janos Toth F.
On Tue, Aug 21, 2018 at 4:10 PM Austin S. Hemmelgarn
 wrote:
> Also, once you've got the space cache set up by mounting once writable
> with the appropriate flag and then waiting for it to initialize, you
> should not ever need to specify the `space_cache` option again.

True.
I am not sure why I thought I had to keep cache-v2 in the rootflags
list. I guess I forgot it was meant to be removed after a reboot. Or
maybe some old kernels misbehaved (always cleared the space tree for
some reason and re-initialized a space cache instead unless the flag
was there). But I removed it now and everything works as intended.
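
For the record, the one-time setup being described is roughly the
following (a sketch, assuming the filesystem is on /dev/sdX):

    mount -o space_cache=v2 /dev/sdX /mnt   # creates the free space tree once
    umount /mnt
    mount /dev/sdX /mnt                     # plain mounts keep using it
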
Thanks.


Device Delete Stalls - Conclusion

2018-08-23 Thread Stefan Malte Schumacher
Hello everybody,

first, let me thank everybody for their advice. What I did was close
the terminal with the device delete process running in it and fired it
up again. It took about 5 minutes of intensive IO usage and the data
was redistributed and /dev/sda removed from the list of
drives. I am currently running a scrub, which I hope will find and
correct any errors caused by the forceful abort of the device delete
process. Many people suggested it: next time, when I replace one
of the older 4TB drives with a bigger one, I will use "replace"
directly instead of "add/delete".

Yours sincerely
Stefan


Re: Device Delete Stalls

2018-08-23 Thread Chris Murphy
And by 4.14 I actually mean 4.14.60 or 4.14.62 (based on the
changelog). I don't think the single patch in 4.14.62 applies to your
situation.


Re: Device Delete Stalls

2018-08-23 Thread Chris Murphy
On Thu, Aug 23, 2018 at 8:04 AM, Stefan Malte Schumacher
 wrote:
> Hallo,
>
> I originally had RAID with six 4TB drives, which was more than 80
> percent full. So now I bought
> a 10TB drive, added it to the Array and gave the command to remove the
> oldest drive in the array.
>
>  btrfs device delete /dev/sda /mnt/btrfs-raid
>
> I kept a terminal with "watch btrfs fi show" open and It showed that
> the size of /dev/sda had been set to zero and that data was being
> redistributed to the other drives. All seemed well, but now the
> process stalls at 8GB being left on /dev/sda. It also seems that the
> size of the drive has been reset to the original value of 3.64TiB.
>
> Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
> Total devices 7 FS bytes used 8.07TiB
> devid1 size 3.64TiB used 8.00GiB path /dev/sda
> devid2 size 3.64TiB used 2.73TiB path /dev/sdc
> devid3 size 3.64TiB used 2.73TiB path /dev/sdd
> devid4 size 3.64TiB used 2.73TiB path /dev/sde
> devid5 size 3.64TiB used 2.73TiB path /dev/sdf
> devid6 size 3.64TiB used 2.73TiB path /dev/sdg
> devid7 size 9.10TiB used 2.50TiB path /dev/sdb
>
> I see no more btrfs worker processes and no more activity in iotop.
> How do I proceed? I am using a current debian stretch which uses
> Kernel 4.9.0-8 and btrfs-progs 4.7.3-1.
>
> How should I proceed? I have a backup but would prefer an easier and
> less time-consuming way out of this mess.

I'd let it keep running as long as you can tolerate it. In the
meantime, update your backups, and keep using the file system
normally; it should be safe to use. The block group migration can
sometimes be slow with "btrfs dev del" compared to the replace
operation. I can't explain why, but it might be related to some
combination of file and free space fragmentation as well as the number
of snapshots, and just the general complexity of what is effectively a
partial balance operation going on.

Next, you could do a sysrq + t, which dumps process state into the
kernel message buffer, which might not be big enough to contain the
output. If you're using systemd, journalctl -k will have it, and
presumably syslog's messages will have it. I can't parse this output,
but a developer might find it useful to see what's going on and whether
it's just plain wrong. Or if it's just slow.
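
Concretely, that is roughly (a sketch, assuming sysrq is enabled and
systemd's journal is in use):

    echo 1 > /proc/sys/kernel/sysrq   # allow all sysrq functions
    echo t > /proc/sysrq-trigger      # dump all task states to the kernel log
    journalctl -k | tail -n 500       # read it back; syslog works too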

Next, once you get sick of waiting, well you can force a reboot with
'reboot -f' or 'sysrq + b' but then what's the plan? Sure you could
just try again but I don't know that this should give different
results. It's either just slow, or it's a bug. And if it's a bug,
maybe it's fixed in something newer, in which case I'd try a much
newer kernel 4.14 at the oldest, and ideally 4.18.4, at least to
finish off this task.

For what it's worth, the bulk of the delete operation is like a
filtered balance, it's mainly relocating block groups, and that is
supposed to be COW. So it should be safe to do an abrupt reboot. If
you're not writing new information there's no information to lose; the
worst case is Btrfs has a slightly older superblock than the latest
generation for block group relocation and it starts from that point
again. I've done quite a lot of jerkface reboot -f and sysrq + b with
Btrfs and have never broken a file system so far (power failures,
different story) but maybe I'm lucky and I have a bunch of well
behaved devices.



-- 
Chris Murphy


Re: btrfs-convert missing in btrfs-tools v4.15.1

2018-08-23 Thread Nicholas D Steeves
Hi Qu,

On Sun, Jul 29, 2018 at 07:44:05AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/07/29 05:34, Nicholas D Steeves wrote:
> > Resending because I forgot to CC list.
> > 
> > Hi jkexcel,
> > 
> > On 28 July 2018 at 16:50, jkexcel  wrote:
> >>
> >> I'm an end user trying to use btrfs-convert but when I installed
> >> btrfs-tools and its dependency btrfs-progs on kubuntu 18.04, the
> >> installation was successful, and it shows that v4.15.1-1build1 was
> >> installed.
> >>
> >> However, when using the command  # brtfs-convert  /dev/sda4 (with the
> >> drive unmounted) the resulting error appears "command not found"
> >> I also tried command "btrfs convert" in case this was folded into the
> >> main tool, but this also failed.
> >>
> >> 1. Is btrfs-convert still available?
> >>
> >> 2. Where can I find it?
> >>
> >> 3. Has btrfs-convert been replaced? what is it's new name?
> >>
> >> 4. Is it safe to use a downgraded version of btrfs-tools ie: 4.14 or
> >> older?
> > 
> > You can blame me for that.  In Debian several users had reported
> > release-critical issues in btrfs-convert, so I submitted a patch to
> > disable it for the forseable future, eg:
> > 
> >   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864798
> >   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=854489
> 
> Both reports look pretty old (4.7~4.9).
> 
> In fact, just in v4.10 we had a big rework for convert.
> It should work much better after that.
> 
> Furthermore, we have newer (but smaller) fixes to remove a lot of
> BUG_ON(), and do graceful exit for ENOSPC case since then.
> 
> And the design of btrfs-convert (at least for the latest btrfs-convert)
> is to ensure that until everything goes well, we won't touch any bytes of the
> ext* fs (in fact we open the ext* fs in RO mode).
> So it at least won't corrupt the ext* fs.
> 
> > 
> > Also, please consider the official status "As of 4.0 kernels this feature
> > is not often used or well tested anymore, and there have been some reports
> > that the conversion doesn't work reliably. Feel free to try it out, but
> > make sure you have backups" (
> > https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3 ).
> 
> The wiki page looks like it needs to be updated.
> 
> Both btrfs-convert and the base btrfs-progs are improving; especially
> after v4.10, when btrfs-convert went through a big rework, it works well
> so far, and even added support for reiserfs under the same framework.
> 
> So IMHO it's at least worth trying.
> 
> Thanks,
> Qu

Thank you for sharing the cut-off where success became more likely :-)
Debian 9 could have had 4.9.1 at the newest, so it wouldn't have had
btrfs-convert.  So it sounds like btrfs-convert could have been
enabled for the experimental suite (which is almost only used by
Debian developers and not users) for 4.10.  Looking at the changelog
it seems we might have had to disable it again before 4.14.1 or 4.16.

I'm happy we're having this conversation now, because the time to give
it another try is probably sometime in the next month :-)  See my long
email to David for the caveats.

Cheers,
Nicholas


signature.asc
Description: PGP signature


Re: btrfs-convert missing in btrfs-tools v4.15.1

2018-08-23 Thread Nicholas D Steeves
Hi everyone,

Sorry for the delay replying, I've been busy with other work.  Reply
follows inline.  P.S. sorry about the table in this email that is 99
columns wide.

On Thu, Aug 09, 2018 at 01:50:46PM +0200, David Sterba wrote:
> On Sat, Jul 28, 2018 at 05:34:49PM -0400, Nicholas D Steeves wrote:
> > On 28 July 2018 at 16:50, jkexcel  wrote:
> > > I'm an end user trying to use btrfs-convert but when I installed
> > > btrfs-tools and its dependency btrfs-progs on kubuntu 18.04, the
> > > installation was successful, and it shows that v4.15.1-1build1 was
> > > installed.
> > >
> > > However, when using the command  # brtfs-convert  /dev/sda4 (with the
> > > drive unmounted) the resulting error appears "command not found"
> > > I also tried command "btrfs convert" in case this was folded into the
> > > main tool, but this also failed.
> > >
> > > 1. Is btrfs-convert still available?
> > >
> > > 2. Where can I find it?
> > >
> > > 3. Has btrfs-convert been replaced? what is it's new name?
> > >
> > > 4. Is it safe to use a downgraded version of btrfs-tools ie: 4.14 or
> > > older?
> > 
> > You can blame me for that.  In Debian several users had reported
> > release-critical issues in btrfs-convert, so I submitted a patch to
> > disable it for the forseable future, eg:
> > 
> >   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864798
> >   https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=854489
> 
> The reports are against version 4.7.3 released one year before the time
> of the bug reports. The fix for the reported bug happened in 4.10, that
> was half a year before the bug.

Debian stable will always have an old version, which will be in use
for two to four years.  Btrfs-progs 4.7.3 will be in use in Debian 9
until at least 2021, possibly longer.  Btw, I strongly believe Debian 9
should have shipped with btrfs-progs 4.9.1, but alas the primary
maintainer didn't upload it on time.

For "buster" (Debian 10), which will probably be released in mid 2019,
the newest possible btrfs-progs version that could be included is
whatever is current at the end of January 2019.  Exceptions are
sometimes granted for an unblock of a newer version.  For example, if
an LTS kernel won't be released until March, and the release,
technical committee, and kernel team decide that's the version we
want, then a preapproved exception will be granted.  At that time an
argument can be made for preapproval of a newer btrfs-progs as well.

That said, I try to keep a backported newer version of btrfs-progs for
the stable Debian release reasonably up-to-date (backports are
available to users on a per-package opt-in basis).  That's the channel
for feature enablement.  Also, my apologies, at the moment this
backport is very much out of date--it stalled while investigating
which packages would be broken by the library reorganisation;
although, from what I can tell that would only be snapper.

> > Also, please consider the official status "As of 4.0 kernels this feature
> > is not often used or well tested anymore, and there have been some reports
> > that the conversion doesn't work reliably. Feel free to try it out, but
> > make sure you have backups" (
> > https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3 ).
> 
> Sorry that this you take it as official status. The wiki is open to
> edits and such claims appear there from time to time. I've removed it,
> it's been there since 2015 when it possibly was accurate but it's not
> anymore.

Is there a more authoritative and up-to-date location for various
statuses?  It would be nice to have something in btrfs-progs as a
table like this, or exportable in some kind of human-friendly format:

+-------+------------------------------------------+------------+--------------+------------+
|Feature|1st mainline version where feature        |LTS kernel 1|LTS kernel 2  |LTS kernel 3|
|       |appeared                                  |eg: 4.4     |eg: 4.9       |eg: 4.14    |
+-------+------------------------------------------+------------+--------------+------------+
|convert|Assume -progs and kernel                  |exp?        |mostly?       |stable?     |
|from   |versions are strongly                     |            |  amend       |            |
|ext3/4 |associated, for simplicity.               |            |  status with:|            |
|       |                                          |            |4.9.z:testing |            |
+-------+------------------------------------------+------------+--------------+------------+
|foo    |3.16:danger||exp||mostly||testing||stable |exp         |mostly        |testing     |
+-------+------------------------------------------+------------+--------------+------------+

Ideally something that could be trivially exported to the format the
btrfs.wiki.kernel.org wiki uses.  The trick is to make it convenient
to maintain...  Other than CSV, I'm not sure what would work well for

Re: Device Delete Stalls

2018-08-23 Thread Austin S. Hemmelgarn

On 2018-08-23 10:04, Stefan Malte Schumacher wrote:

Hallo,

I originally had RAID with six 4TB drives, which was more than 80
percent full. So now I bought
a 10TB drive, added it to the Array and gave the command to remove the
oldest drive in the array.

  btrfs device delete /dev/sda /mnt/btrfs-raid

I kept a terminal with "watch btrfs fi show" open and It showed that
the size of /dev/sda had been set to zero and that data was being
redistributed to the other drives. All seemed well, but now the
process stalls at 8GB being left on /dev/sda. It also seems that the
size of the drive has been reset to the original value of 3.64TiB.

Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
 Total devices 7 FS bytes used 8.07TiB
 devid1 size 3.64TiB used 8.00GiB path /dev/sda
 devid2 size 3.64TiB used 2.73TiB path /dev/sdc
 devid3 size 3.64TiB used 2.73TiB path /dev/sdd
 devid4 size 3.64TiB used 2.73TiB path /dev/sde
 devid5 size 3.64TiB used 2.73TiB path /dev/sdf
 devid6 size 3.64TiB used 2.73TiB path /dev/sdg
 devid7 size 9.10TiB used 2.50TiB path /dev/sdb

I see no more btrfs worker processes and no more activity in iotop.
How do I proceed? I am using a current debian stretch which uses
Kernel 4.9.0-8 and btrfs-progs 4.7.3-1.

How should I proceed? I have a backup but would prefer an easier and
less time-consuming way out of this mess.


Not exactly what you asked for, but I do have some advice on how to 
avoid this situation in the future:


If at all possible, use `btrfs device replace` instead of an add/delete 
cycle.  The replace operation requires two things.  First, you have to 
be able to connect the new device to the system while all the old ones 
except the device you are removing are present.  Second, the new device 
has to be at least as big as the old one.  Assuming both conditions are 
met and you can use replace, it's generally much faster and is a lot 
more reliable than an add/delete cycle (especially when the array is 
near full).  This is because replace just copies the data that's on the 
old device directly (or rebuilds it directly if it's not present anymore 
or corrupted), whereas the add/delete method implicitly re-balances the 
entire array (which takes a long time and may fail if the array is 
mostly full).
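
For example, had the new 10TB drive not already been added, the replace 
form of this particular upgrade would have looked roughly like this (a 
sketch; the resize step assumes the replacement target inherits the old 
device's devid):

    # copy the old 4TB drive's data directly onto the new 10TB drive
    btrfs replace start /dev/sda /dev/sdb /mnt/btrfs-raid
    btrfs replace status /mnt/btrfs-raid    # monitor progress
    # the target keeps devid 1, so grow it to the new drive's full size
    btrfs filesystem resize 1:max /mnt/btrfs-raid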



Now, as far as what's actually going on here, I'm unfortunately not 
quite sure, and therefore I'm really not the best person to be giving 
advice on how to fix it.  I will comment that having info on the 
allocations for all the devices (not just /dev/sda) would be useful in 
debugging, but even with that I don't know that I personally can help.


Re: Reproducer for "compressed data + hole data corruption bug, 2018 editiion"

2018-08-23 Thread Zygo Blaxell
On Thu, Aug 23, 2018 at 01:10:48PM +0800, Qu Wenruo wrote:
> On 2018/8/23 11:11 AM, Zygo Blaxell wrote:
> > This is a repro script for a btrfs bug that causes corrupted data reads
> > when reading a mix of compressed extents and holes.  The bug is
> > reproducible on at least kernels v4.1..v4.18.
> 
> This bug already sounds more serious than the previous nodatasum +
> compression bug.

Maybe.  "compression + holes corruption bug 2017" could be avoided with
the max-inline=0 mount option without disabling compression.  This time,
the workaround is more intrusive:  avoid all applications that use dedup
or hole-punching.

> > Some more observations and background follow, but first here is the
> > script and some sample output:
> > 
> > root@rescue:/test# cat repro-hole-corruption-test
> > #!/bin/bash
> > 
> > # Write a 4096 byte block of something
> > block () { head -c 4096 /dev/zero | tr '\0' "\\$1"; }
> > 
> > # Here is some test data with holes in it:
> > for y in $(seq 0 100); do
> > for x in 0 1; do
> > block 0;
> > block 21;
> > block 0;
> > block 22;
> > block 0;
> > block 0;
> > block 43;
> > block 44;
> > block 0;
> > block 0;
> > block 61;
> > block 62;
> > block 63;
> > block 64;
> > block 65;
> > block 66;
> > done
> 
> Does the content make any difference to this bug?
> It's just 16 * 4K * 2 * 101 data writes *without* any holes so far.

The content of the extents doesn't seem to matter, other than it needs to
be compressible so that the extents on disk are compressed.  The bug is
also triggered by writing non-zero data to all blocks, and then punching
the holes later with "fallocate -p -l 4096 -o $(( insert math here ))".

The layout of the extents matters a lot.  I have to loop hundreds or
thousands of times to hit the bug if the first block in the pattern is
not a hole, or if the non-hole extents are different sizes or positions
than above.

I tried random patterns of holes and extent refs, and most of them have
an order of magnitude lower hit rates than the above.  This might be due
to some relationship between the alignment of read() request boundaries
with extent boundaries, but I haven't done any tests designed to detect
such a relationship.

In the wild, corruption happens on some files much more often than others.
This seems to be correlated with the extent layout as well.

I discovered the bug by examining files that were intermittently but
repeatedly failing routine data integrity checks, and found that in every
case they had similar hole + extent patterns near the point where data
was corrupted.

I did a search on some big filesystems for the
hole-refExtentA-hole-refExtentA pattern and found several files with
this pattern that had passed previous data integrity checks, but would
fail randomly in the sha1sum/drop-caches loop.
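
[A rough sketch of that kind of search, using filefrag -v rather than
whatever exact tooling was used: holes show up as gaps in the logical
offsets of the extent list, so this flags files whose extent list
contains a hole-extent pattern:]

    for f in "$@"; do
        filefrag -v "$f" | awk -v file="$f" '
            # extent rows look like: "  1:    2..    2:  34818..  34818:  1:"
            $1 ~ /^[0-9]+:$/ {
                start = $2 + 0                             # logical start block
                end   = $3 + 0                             # logical end block
                if (!seen && start > 0) holes++            # leading hole
                if (seen && start > prev_end + 1) holes++  # gap between extents
                prev_end = end; seen = 1
            }
            END { if (holes) printf "%s: %d hole(s) in extent list\n", file, holes }'
    done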

> This should indeed cause 101 128K compressed data extents.
> But I'm wondering about the description of 'holes'.

The holes are coming, wait for it... ;)

> > done > am
> > sync
> > 
> > # Now replace those 101 distinct extents with 101 references to the 
> > first extent
> > btrfs-extent-same 131072 $(for x in $(seq 0 100); do echo am $((x * 
> > 131072)); done) 2>&1 | tail
> 
> Will this bug still happen by creating one extent and then reflink it
> 101 times?

Yes.  I used btrfs-extent-same because a binary is included in the
Debian duperemove package, but I use it only for convenience.

It's not necessary to have hundreds of references to the same extent--even
two refs to a single extent plus a hole can trigger the bug sometimes.
100 references in a single file will trigger the bug so often that it
can be detected within the first 20 sha1sum loops.

When the corruption occurs, it affects around 90 of the original 101
extents.  The different sha1sum results are due to different extents
giving bad data on different runs.

> > # Punch holes into the extent refs
> > fallocate -v -d am
> 
> Hole-punch in fact happens here.
> 
> BTW, will add a "sync" here change the result?

No.  You can reboot the machine here if you like, it does not change
anything that happens during reads later.

Looking at the extent tree in btrfs-debug-tree, the data on disk
looks correct, and btrfs does read it correctly most of the time (the
correct sha1sum below is 6926a34e0ab3e0a023e8ea85a650f5b4217acab4).
The corruption therefore comes from btrfs read() producing incorrect
data in some instances.

> > # Do some other stuff on the machine while this runs, and watch the 
> > sha1sums change!
> > while :; do echo $(sha1sum am); sysctl -q vm.drop_caches={1,2,3}; sleep 
> > 1; done
> > 
> > 

Re: fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!

2018-08-23 Thread David Howells
David Sterba  wrote:

> The code previously had explicit rcu_lock/unlock, now it uses the
> btrfs_debug_in_rcu helper which is supposed to provide that. It's
> possible that the helper is missing it due to some #ifdef mess, but I
> don't see it.

I preprocessed the function and extracted submit_stripe_bio() (see attached).

The problem is that btrfs_no_printk() is being called, but not within an RCU
section.

#define btrfs_debug_in_rcu(fs_info, fmt, args...) \
btrfs_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args)

David
---
static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 u64 physical, int dev_nr, int async)
{
 struct btrfs_device *dev = bbio->stripes[dev_nr].dev;
 struct btrfs_fs_info *fs_info = bbio->fs_info;

 bio->bi_private = bbio;
 btrfs_io_bio(bio)->stripe_index = dev_nr;
 bio->bi_end_io = btrfs_end_bio;
 bio->bi_iter.bi_sector = physical >> 9;
 btrfs_no_printk(fs_info, "\001" "7" "btrfs_map_bio: rw %d 0x%x, sector=%llu, 
dev=%lu (%s id %llu), size=%u", ((bio)->bi_opf & ((1 << 8) - 1)), bio->bi_opf, 
(u64)bio->bi_iter.bi_sector, (u_long)dev->bdev->bd_dev, ({ struct rcu_string 
*__str = ({ typeof(*(dev->name)) *p1 = (typeof(*(dev->name)) *)({ union 
{ typeof((dev->name)) __val; char __c[1]; } __u; if (1) 
__read_once_size(&((dev->name)), __u.__c, sizeof((dev->name))); else 
__read_once_size_nocheck(&((dev->name)), __u.__c, sizeof((dev->name))); do { } 
while (0); __u.__val; }); do { static bool __attribute__ 
((__section__(".data.unlikely"))) __warned; if (debug_lockdep_rcu_enabled() && 
!__warned && (!((0) || rcu_read_lock_held( { __warned = true; 
lockdep_rcu_suspicious(



  "fs/btrfs/volumes.c"
# 6110 "../fs/btrfs/volumes.c"
 ,



  6114
# 6110 "../fs/btrfs/volumes.c"
 , "suspicious rcu_dereference_check() usage"); } } while (0); ; 
((typeof(*(dev->name)) *)(p1)); }); __str->str; }), dev->devid, 
bio->bi_iter.bi_size)



   ;
 do { if ((bio)->bi_disk != (dev->bdev)->bd_disk) bio_clear_flag(bio, 9); 
(bio)->bi_disk = (dev->bdev)->bd_disk; (bio)->bi_partno = 
(dev->bdev)->bd_partno; } while (0);

 btrfs_bio_counter_inc_noblocked(fs_info);

 if (async)
  btrfs_schedule_bio(dev, bio);
 else
  btrfsic_submit_bio(bio);
}


Re: fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!

2018-08-23 Thread David Sterba
On Thu, Aug 23, 2018 at 04:49:11PM +0100, David Howells wrote:
> I'm seeing the attached message generated from this line:
> 
>   btrfs_debug_in_rcu(fs_info,
>   "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
>   bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector,
>   (u_long)dev->bdev->bd_dev, rcu_str_deref(dev->name), dev->devid,
>   bio->bi_iter.bi_size);
> 
> in submit_stripe_bio().  I'm not sure exactly why, but I suspect
> rcu_str_deref() is the point from where it is generated.

Yes, the macro is rcu_dereference in disguise.

The code previously had explicit rcu_lock/unlock, now it uses the
btrfs_debug_in_rcu helper which is supposed to provide that. It's
possible that the helper is missing it due to some #ifdef mess, but I
don't see it.


fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!

2018-08-23 Thread David Howells
I'm seeing the attached message generated from this line:

btrfs_debug_in_rcu(fs_info,
"btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u",
bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector,
(u_long)dev->bdev->bd_dev, rcu_str_deref(dev->name), dev->devid,
bio->bi_iter.bi_size);

in submit_stripe_bio().  I'm not sure exactly why, but I suspect
rcu_str_deref() is the point from where it is generated.

Note that the mount succeeds.

This code was introduced by:

commit 672d599041c862dd61a1576c32e946ef0d77aa34
Author: Misono Tomohiro 
Date:   Thu Aug 2 16:19:07 2018 +0900

David
---
=
WARNING: suspicious RCU usage
4.18.0-fscache+ #540 Not tainted
-
fs/btrfs/volumes.c:6114 suspicious rcu_dereference_check() usage!

other info that might help us debug this:


rcu_scheduler_active = 2, debug_locks = 1
1 lock held by mount/3194:
 #0: 72604777 (&fs_type->s_umount_key#54/1){+.+.}, at: 
alloc_super+0xa4/0x313

stack backtrace:
CPU: 2 PID: 3194 Comm: mount Not tainted 4.18.0-fscache+ #540
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 dump_stack+0x67/0x8e
 btrfs_map_bio+0x282/0x321
 btree_submit_bio_hook+0x71/0xa6
 submit_one_bio+0x55/0x66
 read_extent_buffer_pages+0x1ec/0x2ab
 btree_read_extent_buffer_pages+0x6e/0x237
 ? alloc_extent_buffer+0x28f/0x2f2
 read_tree_block+0x43/0x5e
 open_ctree+0x139b/0x1ee4
 btrfs_get_tree+0x357/0xa33
 ? selinux_fs_context_dup+0x2d/0x104
 vfs_get_tree+0x7a/0x162
 btrfs_mount_root+0x52/0x8b
 btrfs_get_tree+0x4ab/0xa33
 ? vfs_parse_fs_string+0x5b/0x9e
 vfs_get_tree+0x7a/0x162
 do_mount+0x7f0/0x8b2
 ? memdup_user+0x3e/0x5a
 ksys_mount+0x72/0x97
 __x64_sys_mount+0x21/0x24
 do_syscall_64+0x7d/0x1a0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f11b8365ada
Code: 48 8b 0d c9 a3 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 
00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 
c3 48 8b 0d 96 a3 2b 00 f7 d8 64 89 01 48
RSP: 002b:7ffe0d9a5b98 EFLAGS: 0246 ORIG_RAX: 00a5
RAX: ffda RBX: 5562bf7702a0 RCX: 7f11b8365ada
RDX: 5562bf77a620 RSI: 5562bf7704a0 RDI: 5562bf770480
RBP:  R08:  R09: 7f11b8620c40
R10: c0ed R11: 0246 R12: 5562bf770480
R13: 5562bf77a620 R14:  R15: 7f11b90f9184


Device Delete Stalls - Addition

2018-08-23 Thread Stefan Malte Schumacher
Hello everybody,

I think this might be useful:

root@mars:~# btrfs dev usage /mnt/btrfs-raid/
/dev/sda, ID: 1
   Device size: 3.64TiB
   Device slack:  0.00B
   Data,RAID1:  7.00GiB
   Metadata,RAID1:  1.00GiB
   Unallocated: 3.63TiB

Yours sincerely
Stefan


Device Delete Stalls

2018-08-23 Thread Stefan Malte Schumacher
Hallo,

I originally had RAID with six 4TB drives, which was more than 80
percent full. So now I bought
a 10TB drive, added it to the Array and gave the command to remove the
oldest drive in the array.

 btrfs device delete /dev/sda /mnt/btrfs-raid

I kept a terminal with "watch btrfs fi show" open and It showed that
the size of /dev/sda had been set to zero and that data was being
redistributed to the other drives. All seemed well, but now the
process stalls at 8GB being left on /dev/sda. It also seems that the
size of the drive has been reset to the original value of 3.64TiB.

Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
Total devices 7 FS bytes used 8.07TiB
devid1 size 3.64TiB used 8.00GiB path /dev/sda
devid2 size 3.64TiB used 2.73TiB path /dev/sdc
devid3 size 3.64TiB used 2.73TiB path /dev/sdd
devid4 size 3.64TiB used 2.73TiB path /dev/sde
devid5 size 3.64TiB used 2.73TiB path /dev/sdf
devid6 size 3.64TiB used 2.73TiB path /dev/sdg
devid7 size 9.10TiB used 2.50TiB path /dev/sdb

I see no more btrfs worker processes and no more activity in iotop.
How do I proceed? I am using a current debian stretch which uses
Kernel 4.9.0-8 and btrfs-progs 4.7.3-1.

How should I proceed? I have a backup but would prefer an easier and
less time-consuming way out of this mess.

Yours
Stefan


Re: [PATCH v2.1] btrfs: Handle owner mismatch gracefully when walking up tree

2018-08-23 Thread Qu Wenruo


On 2018/8/23 9:24 PM, David Sterba wrote:
> On Tue, Aug 21, 2018 at 09:42:03AM +0800, Qu Wenruo wrote:
>> [BUG]
>> When mounting a certain crafted image, btrfs will trigger a kernel BUG_ON()
>> when trying to recover balance:
>> --
>> [ cut here ]
> 
> Please don't use lines starting with ---,

I thought "--" wouldn't trigger the "---" of git and was thus OK to use.

> an empty line is fine, or you
> can also indent the trace dump by a few spaces for clarity. I'll update
> the patches that have been merged so you don't need to resend.

I'll go with an empty line from now on.

Thanks,
Qu



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v2.1] btrfs: Handle owner mismatch gracefully when walking up tree

2018-08-23 Thread David Sterba
On Tue, Aug 21, 2018 at 09:42:03AM +0800, Qu Wenruo wrote:
> [BUG]
> When mounting a certain crafted image, btrfs will trigger a kernel BUG_ON()
> when trying to recover balance:
> --
> [ cut here ]

Please don't use lines starting with ---, an empty line is fine, or you
can also indent the trace dump by a few spaces for clarity. I'll update
the patches that have been merged so you don't need to resend.


Re: [PATCH v2.1] btrfs: Handle owner mismatch gracefully when walking up tree

2018-08-23 Thread David Sterba
On Tue, Aug 21, 2018 at 09:42:03AM +0800, Qu Wenruo wrote:
> [BUG]
> When mounting a certain crafted image, btrfs will trigger a kernel BUG_ON()
> when trying to recover balance:
> --
> [ cut here ]
> kernel BUG at fs/btrfs/extent-tree.c:8956!
> invalid opcode:  [#1] PREEMPT SMP NOPTI
> CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
> RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
> RSP: 0018:b53540c9b890 EFLAGS: 00010202
> Call Trace:
>  walk_up_tree+0x172/0x1f0 [btrfs]
>  btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
>  merge_reloc_roots+0xe1/0x1d0 [btrfs]
>  btrfs_recover_relocation+0x3ea/0x420 [btrfs]
>  open_ctree+0x1af3/0x1dd0 [btrfs]
>  btrfs_mount_root+0x66b/0x740 [btrfs]
>  mount_fs+0x3b/0x16a
>  vfs_kern_mount.part.9+0x54/0x140
>  btrfs_mount+0x16d/0x890 [btrfs]
>  mount_fs+0x3b/0x16a
>  vfs_kern_mount.part.9+0x54/0x140
>  do_mount+0x1fd/0xda0
>  ksys_mount+0xba/0xd0
>  __x64_sys_mount+0x21/0x30
>  do_syscall_64+0x60/0x210
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ---[ end trace d4344e4deee03435 ]---
> --
> 
> [CAUSE]
> Another extent tree corruption.
> 
> In this particular case, tree reloc root's owner is
> DATA_RELOC_TREE (should be TREE_RELOC_TREE), thus its backref is
> corrupted and we failed the owner check in walk_up_tree().
> 
> [FIX]
> It's pretty hard to take care of every extent tree corruption, but at
> least we can remove such BUG_ON() and exit more gracefully.
> 
> And since in this particular image, DATA_RELOC_TREE and TREE_RELOC_TREE
> share the same root (which is obviously invalid), we need to make
> __del_reloc_root() more robust to detect such invalid sharing to avoid
> a possible NULL dereference, as root->node can be NULL in this case.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
> Reported-by: Xu Wen 
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 


Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-23 Thread Zygo Blaxell
On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > - is documenting rejection on request alignment grounds
> >   (i.e. EINVAL) in the man page sufficient for app
> >   developers to understand what is going on here?
> 
> I think so.  The manpage says: "The filesystem does not support
> reflinking the ranges of the given files", which (to my mind) covers
> this case of not supporting dedupe of EOF blocks.

Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
this; however, on btrfs, not supporting dedupe of EOF blocks means small
files (one extent) cannot be deduped at all, because the EOF block holds
a reference to the entire dst extent.  If a dedupe app doesn't go all the
way to EOF on btrfs, then it should not attempt to dedupe any part of the
last extent of the file as the benefit would be zero or slightly negative.

The app developer would need to be aware that such a restriction could
exist on some filesystems, and be able to distinguish this from other
cases that could lead to EINVAL.  Portable code would have to try a dedupe
up to EOF, then if that failed, round down and retry, and if that failed
too, the app would have to figure out which filesystem it's running on
to know what to do next.  Performance demands the app know what the FS
will do in advance, and avoid a whole class of behavior.
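
[A sketch of that portable fallback, using xfs_io's dedupe command with
hypothetical $src/$dst paths, a 4K block size assumed, and xfs_io assumed
to report the ioctl failure in its exit status:]

    size=$(stat -c %s "$src")
    # Try deduping all the way to EOF first (on btrfs this is required to
    # actually free the extent); if the filesystem rejects the unaligned
    # EOF block with EINVAL, round down to a block boundary and retry.
    xfs_io -c "dedupe $src 0 0 $size" "$dst" ||
        xfs_io -c "dedupe $src 0 0 $(( size / 4096 * 4096 ))" "$dst"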

btrfs dedupe reports success if the src extent is inline and the same
size as the dst extent (i.e. file is smaller than one page).  No dedupe
can occur in such cases--a clone results in a simple copy, so the best
a dedupe could do would be a no-op.  Returning EINVAL there would break
a few popular tools like "cp --reflink".  Returning OK but doing nothing
seems to be the best option in that case.

> > - should we just round down the EOF dedupe request to the
> >   block before EOF so dedupe still succeeds?
> 
> I've often wondered if the interface should (have) be(en) that we start
> at src_off/dst_off and share as many common blocks as possible until we
> find a mismatch, then tell userspace where we stopped... instead of like
> now where we compare the entire extent and fail if any part of it
> doesn't match.

The usefulness or harmfulness of that approach depends a lot on what
the application expects the filesystem to do.

In btrfs, the dedupe operation acts on references to data, not the
underlying data blocks.  If there are 1000 slightly overlapping references
to a single contiguous range of data blocks in dst on disk, each dedupe
operation acts on only one of those, leaving the other 999 untouched.
If the app then submits 999 other dedupe requests, no references to the
dst blocks remain and the underlying data blocks can be deleted.

In a parallel universe (or a better filesystem, or a userspace emulation
built out of dedupe and other ioctls), dedupe could work at the extent
data (physical) level.  The app points at src and dst extent references
(inode/offset/length tuples), and the filesystem figures out which
physical blocks these point to, then adjusts all the references to the
dst blocks at once, dealing with partial overlaps and snapshots and
nodatacow and whatever other exotic features might be lurking in the
filesystem, ending with every reference to every part of dst replaced
by the longest possible contiguous reference(s) to src.

Problems arise if the length deduped is not exactly the length requested.
If the search continues until a mismatch is found, where does the search
for a mismatch lead?  Does the search follow physically contiguous
blocks on disk, or would dedupe follow logically contiguous blocks in
the src and dst files?  Or the intersection of those, i.e. physically
contiguous blocks that are logically contiguous in _any_ two files,
not limited to src and dst.

There is also the problem where the files could have been previously
deduped and then partially overwritten with identical data.  If the
application cannot control where the dedupe search for identical data
ends, it can end up accidentally creating new references to extents
while it is trying to eliminate those extents.  The kernel might do a
lot of extra work from looking ahead that the application has to undo
immediately (e.g. after the first few blocks of dst, the app wants to
do another dedupe with a better src extent elsewhere on the filesystem,
but the kernel goes ahead and dedupes with an inferior src beyond the
end of what the app asked for).

bees tries to determine exactly the set of dedupe requests required to
remove all references to duplicate extents (and maybe someday do defrag
as well).  If the kernel deviates from the requested sizes (e.g. because
the data changed on the filesystem between dedupe requests), the final
extent layout after the dedupe requests are finished won't match what
bees expected it to be, so bees has to reexamine the filesystem and
either retry with a fresh set of exact 

Re: [PATCH] Btrfs: use next_state in find_first_extent_bit

2018-08-23 Thread David Sterba
On Thu, Aug 23, 2018 at 03:14:53AM +0800, Liu Bo wrote:
> As next_state() is already defined to get the next state, update
> find_first_extent_bit() to use the helper.
> 
> No functional changes.
> 
> Signed-off-by: Liu Bo 

Added to misc-next, thanks.
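
For context, the helper being adopted is roughly the following (simplified
from its definition in fs/btrfs/extent_io.c); it wraps the rb_next()/
rb_entry() pair that find_first_extent_bit() previously open-coded:

        /* Approximate shape of the existing helper: return the next
         * extent_state in rbtree order, or NULL at the end of the tree. */
        static struct extent_state *next_state(struct extent_state *state)
        {
                struct rb_node *next = rb_next(&state->rb_node);

                if (next)
                        return rb_entry(next, struct extent_state, rb_node);
                return NULL;
        }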


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-08-23 Thread Nikolay Borisov



On 22.08.2018 22:51, Liu Bo wrote:
> Several structs in btrfs use rb_first() in a while loop; it'd be
> more efficient to do this with rb_first_cached(), which has O(1)
> complexity.
> 
> This patch set updates five structs which may have a large rb tree in
> practice.
> 
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached
> 
>  fs/btrfs/backref.c| 34 +-
>  fs/btrfs/delayed-inode.c  | 29 +--
>  fs/btrfs/delayed-inode.h  |  4 ++--
>  fs/btrfs/delayed-ref.c| 50 +++
>  fs/btrfs/delayed-ref.h|  4 ++--
>  fs/btrfs/disk-io.c|  8 +++
>  fs/btrfs/extent-tree.c| 19 ---
>  fs/btrfs/extent_map.c | 27 +++--
>  fs/btrfs/extent_map.h |  2 +-
>  fs/btrfs/inode.c  |  4 ++--
>  fs/btrfs/tests/extent-map-tests.c |  4 ++--
>  fs/btrfs/transaction.c|  5 ++--
>  fs/btrfs/volumes.c|  5 ++--
>  13 files changed, 107 insertions(+), 88 deletions(-)
> 

Do you have any performance data proving this is in fact helping?
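
Whatever the numbers show, the conversion itself is mechanical. A rough
generic sketch of the pattern against the kernel's cached-rbtree API
(<linux/rbtree.h>, available since 4.14; this is not code from the series):

        #include <linux/rbtree.h>
        #include <linux/types.h>

        struct item {
                struct rb_node node;
                u64 key;
        };

        /* rb_root becomes rb_root_cached, which also caches the leftmost
         * (smallest-key) node of the tree. */
        static struct rb_root_cached tree = RB_ROOT_CACHED;

        static void item_insert(struct item *new)
        {
                struct rb_node **p = &tree.rb_root.rb_node, *parent = NULL;
                bool leftmost = true;

                while (*p) {
                        struct item *cur = rb_entry(*p, struct item, node);

                        parent = *p;
                        if (new->key < cur->key) {
                                p = &(*p)->rb_left;
                        } else {
                                p = &(*p)->rb_right;
                                leftmost = false;       /* not the smallest */
                        }
                }
                rb_link_node(&new->node, parent, p);
                rb_insert_color_cached(&new->node, &tree, leftmost);
        }

        /* A while (rb_first(...)) drain loop becomes rb_first_cached(),
         * which reads the cached leftmost pointer in O(1) instead of
         * descending the tree in O(log n). */
        static struct item *item_pop_first(void)
        {
                struct rb_node *n = rb_first_cached(&tree);

                if (!n)
                        return NULL;
                rb_erase_cached(n, &tree);
                return rb_entry(n, struct item, node);
        }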


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-08-23 Thread Holger Hoffstätte

On 08/22/18 21:51, Liu Bo wrote:

Several structs in btrfs use rb_first() in a while loop; it'd be
more efficient to do this with rb_first_cached(), which has O(1)
complexity.

This patch set updates five structs which may have a large rb tree in
practice.


Great idea, works for me with no bad side effects so far. So:

Tested-by: Holger Hoffstätte 

cheers
Holger


Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option

2018-08-23 Thread Qu Wenruo



On 2018/8/23 3:36 PM, Nikolay Borisov wrote:
> 
> 
> On 23.08.2018 10:31, Qu Wenruo wrote:
>> Introduce the --breadth-first option to do a breadth-first tree dump.
>> This is especially handy for inspecting tall trees, e.g. comparing a
>> tree reloc tree with its source tree.
> 
> Will it make sense instead of exposing another option to just have a
> heuristics check that will switch to the BFS if the tree is higher than,
> say, 2 levels?

BFS has one obvious disadvantage here, so it may not be a good idea to
use it as the default.

>> More memory usage <<
   It needs to allocate heap memory, and this can be pretty large for
   leaves: at level 1 it already needs at least nr_leaves * sizeof(bfs_entry)
   bytes.
   Compared to that, DFS recurses at most 8 levels deep (the btrfs maximum
   tree height), and all of its memory usage is function call stack.

It only makes sense for my niche use case (comparing a tree reloc tree
with its source).
For real-world use cases the default DFS should work fine without all
the memory allocation burden.

So I still prefer to keep DFS as the default and only provide BFS as a
niche option for weird guys like me.

Thanks,
Qu
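
To make the memory argument concrete: a breadth-first dump needs a queue
that can hold an entire tree level at once, while depth-first needs only the
recursion stack, bounded by the tree height (at most 8 in btrfs). A minimal
user-space sketch of the BFS side (bfs_entry and the up-front queue sizing
are illustrative, not the btrfs-progs implementation):

        #include <stdio.h>
        #include <stdlib.h>

        struct node {
                int level;              /* 0 for leaves, as in btrfs */
                int nr_children;
                struct node **children;
        };

        /* Hypothetical queue entry, analogous to the bfs_entry above. */
        struct bfs_entry {
                struct node *node;
        };

        static void dump_breadth_first(struct node *root, size_t nr_nodes)
        {
                /* Sized for the whole tree for simplicity; the high-water
                 * mark is the widest level, i.e. nr_leaves entries when
                 * level 1 of a large tree is queued. */
                struct bfs_entry *queue = malloc(nr_nodes * sizeof(*queue));
                size_t head = 0, tail = 0;

                if (!queue)
                        return;
                queue[tail++].node = root;
                while (head < tail) {
                        struct node *cur = queue[head++].node;

                        printf("node at level %d\n", cur->level);
                        for (int i = 0; i < cur->nr_children; i++)
                                queue[tail++].node = cur->children[i];
                }
                free(queue);
        }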


Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option

2018-08-23 Thread Su Yue




On 08/23/2018 03:31 PM, Qu Wenruo wrote:

By default dump-tree does a depth-first search.

For 2-level trees that's completely OK, but for 3-level trees it becomes
pretty hard to locate the output of mid-level tree nodes.

Introduce the --breadth-first option to do a breadth-first tree dump.
This is especially handy for inspecting tall trees, e.g. comparing a
tree reloc tree with its source tree.

Signed-off-by: Qu Wenruo 


LGTM.

Reviewed-by: Su Yue 


Re: [PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option

2018-08-23 Thread Nikolay Borisov



On 23.08.2018 10:31, Qu Wenruo wrote:
> Introduce the --breadth-first option to do a breadth-first tree dump.
> This is especially handy for inspecting tall trees, e.g. comparing a
> tree reloc tree with its source tree.

Will it make sense instead of exposing another option to just have a
heuristics check that will switch to the BFS if the tree is higher than,
say, 2 levels?


[PATCH] btrfs-progs: dump-tree: Introduce --breadth-first option

2018-08-23 Thread Qu Wenruo
By default dump-tree does a depth-first search.

For 2-level trees that's completely OK, but for 3-level trees it becomes
pretty hard to locate the output of mid-level tree nodes.

Introduce the --breadth-first option to do a breadth-first tree dump.
This is especially handy for inspecting tall trees, e.g. comparing a
tree reloc tree with its source tree.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-inspect-internal.asciidoc |  5 ++
 cmds-inspect-dump-tree.c  | 26 --
 print-tree.c  | 89 +++
 print-tree.h  |  1 +
 4 files changed, 112 insertions(+), 9 deletions(-)

diff --git a/Documentation/btrfs-inspect-internal.asciidoc b/Documentation/btrfs-inspect-internal.asciidoc
index e2db64660b9a..5435455fe099 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -69,6 +69,9 @@ readable equivalents where possible.
 This is useful for analyzing filesystem state or inconsistencies and has
 a positive educational effect on understanding the internal filesystem structure.
 +
+By default *dump-tree* will dump the tree using depth-first search, this behavior
+can be changed by '--breadth-first' option.
++
 NOTE: contains file names, consider that if you're asked to send the dump for
 analysis. Does not contain file data.
 +
@@ -89,6 +92,8 @@ print only the uuid tree information, empty output if the tree does not exist
 print info of the specified block only
 --follow
 use with '-b', print all children tree blocks of ''
+--breadth-first
+use breadth-first search to dump trees instead of default depth-first search
 -t 
 print only the tree with the specified ID, where the ID can be numerical or
 common name in a flexible human readable form
diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index c8acd55a0c3a..ecda07d65ab6 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -200,6 +200,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
"-b|--block  print info from the specified block only",
"-t|--tree print only tree with the given id (string or 
number)",
"--follow   use with -b, to show all children tree blocks 
of ",
+   "--breadth-firstdo breadth-first search instead of default 
depth-first one",
NULL
 };
 
@@ -213,6 +214,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
struct extent_buffer *leaf;
struct btrfs_disk_key disk_key;
struct btrfs_key found_key;
+   void (*print_tree_func) (struct extent_buffer *, int follow);
char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
int ret;
int slot;
@@ -227,6 +229,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
u64 tree_id = 0;
bool follow = false;
 
+   print_tree_func = btrfs_print_tree;
/*
 * For debug-tree, we care nothing about extent tree (it's just backref
 * and usage accounting, only makes sense for RW operations).
@@ -238,7 +241,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
optind = 0;
while (1) {
int c;
-   enum { GETOPT_VAL_FOLLOW = 256 };
+   enum { GETOPT_VAL_FOLLOW = 256, GETOPT_VAL_BREADTH_FIRST };
static const struct option long_options[] = {
{ "extents", no_argument, NULL, 'e'},
{ "device", no_argument, NULL, 'd'},
@@ -248,6 +251,8 @@ int cmd_inspect_dump_tree(int argc, char **argv)
{ "block", required_argument, NULL, 'b'},
{ "tree", required_argument, NULL, 't'},
{ "follow", no_argument, NULL, GETOPT_VAL_FOLLOW },
+   { "breadth-first", no_argument, NULL,
+   GETOPT_VAL_BREADTH_FIRST },
{ NULL, 0, NULL, 0 }
};
 
@@ -303,6 +308,9 @@ int cmd_inspect_dump_tree(int argc, char **argv)
case GETOPT_VAL_FOLLOW:
follow = true;
break;
+   case GETOPT_VAL_BREADTH_FIRST:
+   print_tree_func = btrfs_print_tree_breadth_first;
+   break;
default:
usage(cmd_inspect_dump_tree_usage);
}
@@ -341,7 +349,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
(unsigned long long)block_only);
goto close_root;
}
-   btrfs_print_tree(leaf, follow);
+   print_tree_func(leaf, follow);
free_extent_buffer(leaf);
goto close_root;
}
@@ -368,17 +376,17 @@ int cmd_inspect_dump_tree(int argc, char **argv)
} else {
if (info->tree_root->node) {
printf("root tree\n");
- 

[PATCH] btrfs-progs: print-tree: Skip deprecated blockptr / nodesize output

2018-08-23 Thread Qu Wenruo
When printing tree nodes, we output slots like:
key (EXTENT_TREE ROOT_ITEM 0) block 73625600 (17975) gen 16

The number in the parentheses is blockptr / nodesize.

However this number doesn't really do anything useful.
And in fact for an unaligned metadata block group (one whose start
bytenr is not aligned to 16K), the number doesn't even make sense, as
it's rounded down.

The kernel, for that matter, never outputs such a divided result in its
print-tree.c.

Remove it so later readers won't wonder what the number means.

Signed-off-by: Qu Wenruo 
---
 print-tree.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index a09ecfbb28f0..31f6fa12522f 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -1420,9 +1420,8 @@ void btrfs_print_tree(struct extent_buffer *eb, int follow)
btrfs_disk_key_to_cpu(&key, &disk_key);
printf("\t");
btrfs_print_key(&disk_key);
-   printf(" block %llu (%llu) gen %llu\n",
+   printf(" block %llu gen %llu\n",
   (unsigned long long)blocknr,
-  (unsigned long long)blocknr / eb->len,
   (unsigned long long)btrfs_node_ptr_generation(eb, i));
fflush(stdout);
}
-- 
2.18.0
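
For comparison, with this patch applied the example line from the commit
message would read (derived directly from the diff above):

        key (EXTENT_TREE ROOT_ITEM 0) block 73625600 gen 16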