date:20181008

Re: qgroups not enabled, but perf stats reports btrfs_qgroup_release_data and btrfs_qgroup_free_delayed_ref

2018-10-08 Thread Qu Wenruo



On 2018/10/9 上午6:41, Chris Murphy wrote:
> [chris@flap ~]$ sudo perf stat -e 'btrfs:*' -a sleep 70
> ##And then I loaded a few sites in Firefox early on in those 70 seconds.
> 
>  Performance counter stats for 'system wide':
> 
>  5  btrfs:btrfs_transaction_commit
> 29  btrfs:btrfs_inode_new
> 29  btrfs:btrfs_inode_request
> 25  btrfs:btrfs_inode_evict
>  1,602  btrfs:btrfs_get_extent
>  0  btrfs:btrfs_handle_em_exist
>  1  btrfs:btrfs_get_extent_show_fi_regular
> 88  btrfs:btrfs_truncate_show_fi_regular
> 19  btrfs:btrfs_get_extent_show_fi_inline
>  2  btrfs:btrfs_truncate_show_fi_inline
>189  btrfs:btrfs_ordered_extent_add
>189  btrfs:btrfs_ordered_extent_remove
>  9  btrfs:btrfs_ordered_extent_start
>592  btrfs:btrfs_ordered_extent_put
>  1,207  btrfs:__extent_writepage
>  1,203  btrfs:btrfs_writepage_end_io_hook
> 25  btrfs:btrfs_sync_file
>  0  btrfs:btrfs_sync_fs
>  0  btrfs:btrfs_add_block_group
>  1,508  btrfs:add_delayed_tree_ref
>  1,498  btrfs:run_delayed_tree_ref
>379  btrfs:add_delayed_data_ref
>336  btrfs:run_delayed_data_ref
>  1,887  btrfs:add_delayed_ref_head
>  1,839  btrfs:run_delayed_ref_head
>  0  btrfs:btrfs_chunk_alloc
>  0  btrfs:btrfs_chunk_free
>794  btrfs:btrfs_cow_block
>  6,982  btrfs:btrfs_space_reservation
>  0  btrfs:btrfs_trigger_flush
>  0  btrfs:btrfs_flush_space
>952  btrfs:btrfs_reserved_extent_alloc
>  0  btrfs:btrfs_reserved_extent_free
>  1,005  btrfs:find_free_extent
>  1,005  btrfs:btrfs_reserve_extent
>816  btrfs:btrfs_reserve_extent_cluster
>  1  btrfs:btrfs_find_cluster
>  0  btrfs:btrfs_failed_cluster_setup
>  1  btrfs:btrfs_setup_cluster
>  5,952  btrfs:alloc_extent_state
>  6,034  btrfs:free_extent_state
>374  btrfs:btrfs_work_queued
>362  btrfs:btrfs_work_sched
>362  btrfs:btrfs_all_work_done
>116  btrfs:btrfs_ordered_sched
>  0  btrfs:btrfs_workqueue_alloc
>  0  btrfs:btrfs_workqueue_destroy
>  0  btrfs:btrfs_qgroup_reserve_data
>201  btrfs:btrfs_qgroup_release_data
>  1,839  btrfs:btrfs_qgroup_free_delayed_ref
>  0  btrfs:btrfs_qgroup_account_extents
>  0  btrfs:btrfs_qgroup_trace_extent
>  0  btrfs:btrfs_qgroup_account_extent
>  0  btrfs:qgroup_update_counters
>  0  btrfs:qgroup_update_reserve
>  0  btrfs:qgroup_meta_reserve
>  0  btrfs:qgroup_meta_convert
>  0  btrfs:qgroup_meta_free_all_pertrans
>  0  btrfs:btrfs_prelim_ref_merge
>  0  btrfs:btrfs_prelim_ref_insert
>  2,663  btrfs:btrfs_inode_mod_outstanding_extents
>  0  btrfs:btrfs_remove_block_group
>  0  btrfs:btrfs_add_unused_block_group
>  0  btrfs:btrfs_skip_unused_block_group
> 
>   70.004723586 seconds time elapsed
> 
> [chris@flap ~]$
> 
> 
> Seems like a lot of activity for just a few transactions, but what
> really caught my eye here is the qgroup reporting for a file system
> that has never had qgroups enabled. Is it expected?

Indeed some of them can be avoided, as for qgroup not enabled case, such
function is really doing nothing.

In the case of btrfs_qgroup_free_delayed_ref() case, it doesn't check if
qgroup is enabled and calls btrfs_qgroup_free_refroot() under all cases,
and expect btrfs_qgroup_free_refroot() to do some check.

However btrfs_qgroup_free_refroot() doesn't do any check on if qgroup is
enabled or not.
And btrfs_qgroup_free_refroot() will just find no corresponding qgroup
and exit.

Thanks for exposing such bug,
Qu


> 
> 
> Chris Murphy
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH v3 3/6] btrfs-progs: original check: Add ability to detect bad dev extents

2018-10-08 Thread Su Yue





On 10/8/18 8:30 PM, Qu Wenruo wrote:

Unlike lowmem mode check, we don't have good place for original mode to
check overlap dev extents.

So this patch introduces a new function, btrfs_check_dev_extents(), to
handle possible bad dev extents.

Reported-by: Hans van Kranenburg 
Signed-off-by: Qu Wenruo 


Reviewed-by: Su Yue 

---
  check/main.c | 99 
  1 file changed, 99 insertions(+)

diff --git a/check/main.c b/check/main.c
index bc2ee22f7943..ff9a785ce555 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8224,6 +8224,99 @@ out:
return ret;
  }
  
+/*

+ * Check if all dev extents are valid (not overlap nor beyond device
+ * boundary).
+ *
+ * Dev extents <-> chunk cross checking is already done in check_chunks().
+ */
+static int check_dev_extents(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_root *dev_root = fs_info->dev_root;
+   int ret;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
+
+   btrfs_init_path();
+
+   key.objectid = 1;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, dev_root, , , 0, 0);
+   if (ret < 0) {
+   error("failed to search device tree: %s", strerror(-ret));
+   goto out;
+   }
+   if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
+   ret = btrfs_next_leaf(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   }
+
+   while (1) {
+   struct btrfs_dev_extent *dev_ext;
+   struct btrfs_device *dev;
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
+   btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
+   if (key.type != BTRFS_DEV_EXTENT_KEY)
+   break;
+   dev_ext = btrfs_item_ptr(path.nodes[0], path.slots[0],
+struct btrfs_dev_extent);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], dev_ext);
+
+   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
+   if (!dev) {
+   error("failed to find device with devid %llu", devid);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (prev_devid == devid && prev_dev_ext_end > physical_offset) {
+   error(
+"dev extent devid %llu physical offset %llu overlap with previous dev extent end 
%llu",
+ devid, physical_offset, prev_dev_ext_end);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (physical_offset + physical_len > dev->total_bytes) {
+   error(
+"dev extent devid %llu physical offset %llu len %llu is beyond device boudnary 
%llu",
+ devid, physical_offset, physical_len,
+ dev->total_bytes);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   prev_devid = devid;
+   prev_dev_ext_end = physical_offset + physical_len;
+
+   ret = btrfs_next_item(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+   }
+out:
+   btrfs_release_path();
+   return ret;
+}
+
  static int check_chunks_and_extents(struct btrfs_fs_info *fs_info)
  {
struct rb_root dev_cache;
@@ -8318,6 +8411,12 @@ again:
goto out;
}
  
+	ret = check_dev_extents(fs_info);

+   if (ret < 0) {
+   err = ret;
+   goto out;
+   }
+
ret = check_chunks(_cache, _group_cache,
   _extent_cache, NULL, NULL, NULL, 0);
if (ret) {

Re: [PATCH v3 2/6] btrfs-progs: lowmem check: Add check for overlapping dev extents

2018-10-08 Thread Su Yue





On 10/8/18 8:30 PM, Qu Wenruo wrote:

Add such check at check_dev_item(), since at that timing we're also
iterating dev extents for dev item accounting.

Signed-off-by: Qu Wenruo 


LGTM.
Reviewed-by: Su Yue 

---
  check/mode-lowmem.c | 34 --
  1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 1bce44f5658a..07c03cad77af 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4065,6 +4065,8 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
u64 dev_id;
u64 used;
u64 total = 0;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
int ret;
  
  	dev_item = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);

@@ -4086,8 +4088,16 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
return REFERENCER_MISSING;
}
  
-	/* Iterate dev_extents to calculate the used space of a device */

+   /*
+* Iterate dev_extents to calculate the used space of a device
+*
+* Also make sure no dev extents overlap and end beyond device boundary
+*/
while (1) {
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
goto next;
  
@@ -4099,7 +4109,27 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
  
  		ptr = btrfs_item_ptr(path.nodes[0], path.slots[0],

 struct btrfs_dev_extent);
-   total += btrfs_dev_extent_length(path.nodes[0], ptr);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], ptr);
+
+   if (prev_devid == devid && physical_offset < prev_dev_ext_end) {
+   error(
+"dev extent devid %llu offset %llu len %llu overlap with previous dev extent end 
%llu",
+ devid, physical_offset, physical_len,
+ prev_dev_ext_end);
+   return ACCOUNTING_MISMATCH;
+   }
+   if (physical_offset + physical_len > total_bytes) {
+   error(
+"dev extent devid %llu offset %llu len %llu is beyond device boundary %llu",
+ devid, physical_offset, physical_len,
+ total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
+   prev_devid = devid;
+   prev_dev_ext_end = physical_offset + physical_len;
+   total += physical_len;
  next:
ret = btrfs_next_item(dev_root, );
if (ret)

Re: [PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Qu Wenruo



On 2018/10/9 上午6:20, Hans van Kranenburg wrote:
> On 10/08/2018 02:30 PM, Qu Wenruo wrote:
>> Obviously, used bytes can't be larger than total bytes.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  check/mode-lowmem.c | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
>> index 07c03cad77af..1173b963b8f3 100644
>> --- a/check/mode-lowmem.c
>> +++ b/check/mode-lowmem.c
>> @@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info 
>> *fs_info,
>>  used = btrfs_device_bytes_used(eb, dev_item);
>>  total_bytes = btrfs_device_total_bytes(eb, dev_item);
>>  
>> +if (used > total_bytes) {
>> +error("device %llu has incorrect used bytes %llu > total bytes 
>> %llu",
>> +dev_id, used, total_bytes);
>> +return ACCOUNTING_MISMATCH;
> 
> The message and return code point at an error in accounting logic.

One of the biggest problem in lowmem is we don't always have the error
code we really want.

And that's the case for this error message.
It's indeed not an accounting error, as in that case (just like that
test case image) the used bytes is correct accounted.

The good news is, the return value is never really used to classify the
error.
So as long as the error message makes sense, it's not a big problem.

Thanks,
Qu

> 
> However, if you have a fully allocated device and a DUP chunk ending
> beyond device, then having used > total_bytes is expected...
> 
> So maybe there's two possibilities... There's an error in the accounting
> logic, or there's an "over-allocation", which is another type of issue
> which produces used > total with correct accounting logic.
> 
>> +}
>>  key.objectid = dev_id;
>>  key.type = BTRFS_DEV_EXTENT_KEY;
>>  key.offset = 0;
>>
> 
> 



signature.asc
Description: OpenPGP digital signature

Re: merge status of per-chunk degradable check [was Re: Which device is missing ?]

2018-10-08 Thread Hans van Kranenburg

On 10/09/2018 02:08 AM, Nicholas D Steeves wrote:
> On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:
>> On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
>>> I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).
>>>
>>> But I have stranges status or errors  about "missing devices" and I
>>> do not understand the current situation :
> [...]
>>Note that, since the main FS is missing a device, it will probably
>> need to be mounted in degraded mode (-o degraded), and that on kernels
>> earlier than (IIRC) 4.14, this can only be done *once* without the FS
>> becoming more or less permanently read-only. On recent kernels, it
>> _should_ be OK.
>>
>> *WARNING ENDS*
> 
> I think this was the patch that addressed this?:
>   https://www.spinics.net/lists/linux-btrfs/msg47283.html
>   https://patchwork.kernel.org/patch/7226931/
> 
> In my notes it wasn't present in <= 4.14.15, but my notes might be
> wrong.  Does this patch resolve the one-shot -o degraded, reboot,
> forever read-only behaviour, or is something else required?  When was
> it merged?  Has it been or will it be backported to 4.14.x?  I'm
> guessing 4.9.x is too far back, but it would be really nice to see it
> there too :-)
> 
> Also, will this issue be resolved for linux-4.19?  If so I'd like to
> update the Debian btrfs wiki with this good news :-)

[...]

> P.S. Please let me know if you'd prefer for me to shift this
> documentation effort to btrfs.wiki.kernel.org.

Yes, absolutely. This is not specific to how we do things for Debian.
Upstream documentation can help all distros.

-- 
Hans van Kranenburg



signature.asc
Description: OpenPGP digital signature

merge status of per-chunk degradable check [was Re: Which device is missing ?]

2018-10-08 Thread Nicholas D Steeves

On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:
> On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> > I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).
> > 
> > But I have stranges status or errors  about "missing devices" and I
> > do not understand the current situation :
[...]
>Note that, since the main FS is missing a device, it will probably
> need to be mounted in degraded mode (-o degraded), and that on kernels
> earlier than (IIRC) 4.14, this can only be done *once* without the FS
> becoming more or less permanently read-only. On recent kernels, it
> _should_ be OK.
> 
> *WARNING ENDS*

I think this was the patch that addressed this?:
  https://www.spinics.net/lists/linux-btrfs/msg47283.html
  https://patchwork.kernel.org/patch/7226931/

In my notes it wasn't present in <= 4.14.15, but my notes might be
wrong.  Does this patch resolve the one-shot -o degraded, reboot,
forever read-only behaviour, or is something else required?  When was
it merged?  Has it been or will it be backported to 4.14.x?  I'm
guessing 4.9.x is too far back, but it would be really nice to see it
there too :-)

Also, will this issue be resolved for linux-4.19?  If so I'd like to
update the Debian btrfs wiki with this good news :-)

Finally, is the following a valid workaround for users who don't have
access to a kernel containing this fix:

1. Make a raid1 profile volume (both data and metadata) with >= 3 disks.
2. Lose one disk.
3. Allocator continues to write raid1 chunks instead of single,
because it is still possible to write one chunk to two disks.
4. Thus reboot twice -> forever read-only averted?

Kind regards,
Nicholas

P.S. Please let me know if you'd prefer for me to shift this
documentation effort to btrfs.wiki.kernel.org.

signature.asc
Description: PGP signature

Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-08 Thread Hans van Kranenburg

On 10/08/2018 03:19 PM, Hans van Kranenburg wrote:
> On 10/08/2018 08:43 AM, Qu Wenruo wrote:
>>
>>
>> On 2018/10/5 下午6:58, Hans van Kranenburg wrote:
>>> On 10/05/2018 09:51 AM, Qu Wenruo wrote:


 On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
> This patch set contains an additional fix for a newly exposed bug after
> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>
> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374

 For that bug, did you succeeded in reproducing the bug?
>>>
>>> Yes, open the above link and scroll to "Steps to reproduce".
>>
>> That's beyond device boundary one. Also reproduced here.
>> And hand-crafted a super small image as test case for btrfs-progs.
>>
>> But I'm a little curious about the dev extent overlapping case.
>> Have you got one?
> 
> Ah ok, I see. No, I didn't do that yet.
> 
> By using unmodified tooling, I think this can be done by a combination
> of a few resizings and using very specific balance to cause a fs of
> exactly 7880MiB again with a single 1578MiB gap inside...
> 
> I'll try later today to see if I can come up with a recipe for this.

Ok, this is actually pretty simple to do:

 >8 

-# mkdir bork
-# cd bork
-# dd if=/dev/zero of=image bs=1 count=0 seek=1024M
0+0 records in
0+0 records out
0 bytes copied, 0.000183343 s, 0.0 kB/s
-# mkfs.btrfs -d dup -m dup image
-# losetup -f image
-# mount -o space_cache=v2 /dev/loop0 mountpoint
-# dd if=/dev/zero of=mountpoint/flapsie bs=1M
dd: error writing 'mountpoint/flapsie': No space left on device
453+0 records in
452+0 records out
474185728 bytes (474 MB, 452 MiB) copied, 4.07663 s, 116 MB/s

 >8 

-# ./show_usage.py /bork/mountpoint/
Target profile for SYSTEM (chunk tree): DUP
Target profile for METADATA: DUP
Target profile for DATA: DUP
Mixed groups: False

Virtual space usage by block group type:
|
| typetotal used
| - 
| Data452.31MiB452.22MiB
| System8.00MiB 16.00KiB
| Metadata 51.19MiB656.00KiB

Total raw filesystem size: 1.00GiB
Total raw allocated bytes: 1023.00MiB
Allocatable bytes remaining: 1.00MiB
Unallocatable raw bytes: 0.00B
Unallocatable bytes that can be reclaimed by balancing: 0.00B

Estimated virtual space left to use for metadata: 51.05MiB
Estimated virtual space left to use for data: 96.00KiB

Allocated raw disk bytes by chunk type. Parity is a reserved part of the
allocated bytes, limiting the amount that can be used for data or metadata:
|
| flags   allocated used   parity
| -   -    --
| DATA|DUP904.62MiB904.44MiB0.00B
| SYSTEM|DUP   16.00MiB 32.00KiB0.00B
| METADATA|DUP102.38MiB  1.28MiB0.00B

Allocated bytes per device:
|
| devid  total sizeallocated path
| -  --- 
| 1 1.00GiB   1023.00MiB /dev/loop0

Allocated bytes per device, split up per chunk type. Parity bytes are again
part of the total amount of allocated bytes.
|
| Device ID: 1
| | flags   allocated   parity
| | -   -   --
| | DATA|DUP904.62MiB0.00B
| | SYSTEM|DUP   16.00MiB0.00B
| | METADATA|DUP102.38MiB0.00B

Unallocatable raw bytes per device:
|
| devidunallocatable
| --
| 1   0.00B

 >8 

Now we're gonna cause some neat 1578MiB to happen that we're going to
free up later to reproduce the failure:

-# dd if=/dev/zero of=image bs=1 count=0 seek=2602M
0+0 records in
0+0 records out
0 bytes copied, 0.000188621 s, 0.0 kB/s
-# losetup -c /dev/loop0
-# btrfs fi resize max mountpoint/
Resize 'mountpoint/' of 'max'
-# dd if=/dev/zero of=mountpoint/1578MiB bs=1M
dd: error writing 'mountpoint/1578MiB': No space left on device
790+0 records in
789+0 records out
827326464 bytes (827 MB, 789 MiB) copied, 12.2452 s, 67.6 MB/s

 >8 

-# python3
import btrfs
fs = btrfs.FileSystem('/bork/mountpoint/')
for d in fs.dev_extents():
  print("start {} end {} vaddr {}".format(d.paddr, d.paddr + d.length,
d.vaddr))

start 1048576 end 11534336 vaddr 547880960
start 11534336 end 22020096 vaddr 547880960
start 22020096 end 30408704 vaddr 22020096
start 30408704 end 38797312 vaddr 22020096
start 38797312 end 92471296 vaddr 30408704
start 92471296 end 146145280 vaddr 30408704
start 146145280 end 213254144 vaddr 84082688
start 213254144 end 280363008 vaddr 84082688
start 280363008 end 397803520 vaddr 151191552
start 397803520 end 515244032 vaddr 151191552
start 515244032 end 632684544 vaddr 268632064
start 632684544 end 750125056 vaddr 268632064
start 750125056 end 867565568 vaddr 386072576
start 867565568 end 985006080 vaddr 386072576
start 985006080 end

[PATCH v2] fstests: btrfs verify hardening agaist duplicate fsid

2018-10-08 Thread Anand Jain

We have a known bug in btrfs, that we let the device path be changed
after the device has been mounted. So using this loop hole the new
copied device would appears as if its mounted immediately after its
been copied. So this test case reproduces this issue.

For example:

Initially.. /dev/mmcblk0p4 is mounted as /

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part /
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi

btrfs fi show
Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4

Copy mmcblk0 to sda
dd if=/dev/mmcblk0 of=/dev/sda

And immediately after the copy completes the change in the device
superblock is notified which the automount scans using
btrfs device scan and the new device sda becomes the mounted root
device.

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda   8:01 14.9G  0 disk
|-sda48:414G  0 part /
|-sda28:21  500M  0 part
|-sda38:31  256M  0 part
`-sda18:11  256M  0 part
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi
btrfs fi show /
Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid1 size 4.00GiB used 3.00GiB path /dev/sda4

The bug is quite nasty that you can't either unmount /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take
the sda out of the system on to another system to change its fsid using
the 'btrfstune -u' command.

Signed-off-by: Anand Jain 
---
 tests/btrfs/173 | 88 +
 tests/btrfs/173.out |  6 
 tests/btrfs/group   |  1 +
 3 files changed, 95 insertions(+)
 create mode 100755 tests/btrfs/173
 create mode 100644 tests/btrfs/173.out

diff --git a/tests/btrfs/173 b/tests/btrfs/173
new file mode 100755
index ..b466ae921e19
--- /dev/null
+++ b/tests/btrfs/173
@@ -0,0 +1,88 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle. All Rights Reserved.
+#
+# FS QA Test 173
+#
+# Fuzzy test for FS image duplication.
+#  Could be fixed by
+#[patch] btrfs: harden agaist duplicate fsid
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+mnt=$TEST_DIR/$seq.mnt
+_cleanup()
+{
+   rm -rf $mnt > /dev/null 2>&1
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+_scratch_dev_pool_get 2
+
+dev_foo=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
+dev_bar=$(echo $SCRATCH_DEV_POOL | awk '{print $2}')
+
+echo dev_foo=$dev_foo >> $seqres.full
+echo dev_bar=$dev_bar >> $seqres.full
+echo | tee -a $seqres.full
+
+rm -rf $mnt > /dev/null 2>&1
+mkdir $mnt
+_mkfs_dev $dev_foo
+_mount $dev_foo $mnt
+
+check_btrfs_mount()
+{
+   local x=$(findmnt $mnt | grep -v TARGET | awk '{print $2}')
+   [[ $x == $dev_foo ]] && echo DEV_FOO
+   [[ $x == $dev_bar ]] && echo DEV_BAR
+}
+
+echo MNT $(check_btrfs_mount)
+
+for sb_bytenr in 65536 67108864
+do
+   echo -n "dd status=none if=$dev_foo of=$dev_bar bs=1 "\
+   "seek=$sb_bytenr skip=$sb_bytenr count=4096" >> $seqres.full
+   dd status=none if=$dev_foo of=$dev_bar bs=1 seek=$sb_bytenr \
+   skip=$sb_bytenr count=4096 >> $seqres.full 2>&1
+   echo ..:$? >> $seqres.full
+done
+
+#Original device is mounted, scan of its clone should fail
+$BTRFS_UTIL_PROG device scan $dev_bar >> $seqres.full 2>&1
+echo btrfs device scan dev_bar ...:$?| tee -a $seqres.full
+
+echo MNT $(check_btrfs_mount)
+
+#Original device scan should be successful
+$BTRFS_UTIL_PROG device scan $dev_foo >> $seqres.full 2>&1
+echo btrfs device scan dev_foo ...:$?| tee -a $seqres.full
+
+umount $mnt > /dev/null 2>&1
+_scratch_dev_pool_put
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/173.out b/tests/btrfs/173.out
new file mode 100644
index ..3c7e3fb4e3f7
--- /dev/null
+++ b/tests/btrfs/173.out
@@ -0,0 +1,6 @@
+QA output created by 173
+
+MNT DEV_FOO
+btrfs device scan dev_bar ...:1
+MNT DEV_FOO
+btrfs device scan dev_foo ...:0
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 45782565c3b7..b2f1393f3e97 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -175,3 +175,4 @@
 170 auto quick snapshot
 171 auto quick

qgroups not enabled, but perf stats reports btrfs_qgroup_release_data and btrfs_qgroup_free_delayed_ref

2018-10-08 Thread Chris Murphy

[chris@flap ~]$ sudo perf stat -e 'btrfs:*' -a sleep 70
##And then I loaded a few sites in Firefox early on in those 70 seconds.

 Performance counter stats for 'system wide':

 5  btrfs:btrfs_transaction_commit
29  btrfs:btrfs_inode_new
29  btrfs:btrfs_inode_request
25  btrfs:btrfs_inode_evict
 1,602  btrfs:btrfs_get_extent
 0  btrfs:btrfs_handle_em_exist
 1  btrfs:btrfs_get_extent_show_fi_regular
88  btrfs:btrfs_truncate_show_fi_regular
19  btrfs:btrfs_get_extent_show_fi_inline
 2  btrfs:btrfs_truncate_show_fi_inline
   189  btrfs:btrfs_ordered_extent_add
   189  btrfs:btrfs_ordered_extent_remove
 9  btrfs:btrfs_ordered_extent_start
   592  btrfs:btrfs_ordered_extent_put
 1,207  btrfs:__extent_writepage
 1,203  btrfs:btrfs_writepage_end_io_hook
25  btrfs:btrfs_sync_file
 0  btrfs:btrfs_sync_fs
 0  btrfs:btrfs_add_block_group
 1,508  btrfs:add_delayed_tree_ref
 1,498  btrfs:run_delayed_tree_ref
   379  btrfs:add_delayed_data_ref
   336  btrfs:run_delayed_data_ref
 1,887  btrfs:add_delayed_ref_head
 1,839  btrfs:run_delayed_ref_head
 0  btrfs:btrfs_chunk_alloc
 0  btrfs:btrfs_chunk_free
   794  btrfs:btrfs_cow_block
 6,982  btrfs:btrfs_space_reservation
 0  btrfs:btrfs_trigger_flush
 0  btrfs:btrfs_flush_space
   952  btrfs:btrfs_reserved_extent_alloc
 0  btrfs:btrfs_reserved_extent_free
 1,005  btrfs:find_free_extent
 1,005  btrfs:btrfs_reserve_extent
   816  btrfs:btrfs_reserve_extent_cluster
 1  btrfs:btrfs_find_cluster
 0  btrfs:btrfs_failed_cluster_setup
 1  btrfs:btrfs_setup_cluster
 5,952  btrfs:alloc_extent_state
 6,034  btrfs:free_extent_state
   374  btrfs:btrfs_work_queued
   362  btrfs:btrfs_work_sched
   362  btrfs:btrfs_all_work_done
   116  btrfs:btrfs_ordered_sched
 0  btrfs:btrfs_workqueue_alloc
 0  btrfs:btrfs_workqueue_destroy
 0  btrfs:btrfs_qgroup_reserve_data
   201  btrfs:btrfs_qgroup_release_data
 1,839  btrfs:btrfs_qgroup_free_delayed_ref
 0  btrfs:btrfs_qgroup_account_extents
 0  btrfs:btrfs_qgroup_trace_extent
 0  btrfs:btrfs_qgroup_account_extent
 0  btrfs:qgroup_update_counters
 0  btrfs:qgroup_update_reserve
 0  btrfs:qgroup_meta_reserve
 0  btrfs:qgroup_meta_convert
 0  btrfs:qgroup_meta_free_all_pertrans
 0  btrfs:btrfs_prelim_ref_merge
 0  btrfs:btrfs_prelim_ref_insert
 2,663  btrfs:btrfs_inode_mod_outstanding_extents
 0  btrfs:btrfs_remove_block_group
 0  btrfs:btrfs_add_unused_block_group
 0  btrfs:btrfs_skip_unused_block_group

  70.004723586 seconds time elapsed

[chris@flap ~]$


Seems like a lot of activity for just a few transactions, but what
really caught my eye here is the qgroup reporting for a file system
that has never had qgroups enabled. Is it expected?


Chris Murphy

Re: [PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Hans van Kranenburg

On 10/08/2018 02:30 PM, Qu Wenruo wrote:
> Obviously, used bytes can't be larger than total bytes.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  check/mode-lowmem.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
> index 07c03cad77af..1173b963b8f3 100644
> --- a/check/mode-lowmem.c
> +++ b/check/mode-lowmem.c
> @@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info 
> *fs_info,
>   used = btrfs_device_bytes_used(eb, dev_item);
>   total_bytes = btrfs_device_total_bytes(eb, dev_item);
>  
> + if (used > total_bytes) {
> + error("device %llu has incorrect used bytes %llu > total bytes 
> %llu",
> + dev_id, used, total_bytes);
> + return ACCOUNTING_MISMATCH;

The message and return code point at an error in accounting logic.

However, if you have a fully allocated device and a DUP chunk ending
beyond device, then having used > total_bytes is expected...

So maybe there's two possibilities... There's an error in the accounting
logic, or there's an "over-allocation", which is another type of issue
which produces used > total with correct accounting logic.

> + }
>   key.objectid = dev_id;
>   key.type = BTRFS_DEV_EXTENT_KEY;
>   key.offset = 0;
> 

-- 
Hans van Kranenburg

Re: Which device is missing ?

2018-10-08 Thread Pierre Couderc





On 10/08/2018 11:21 PM, Hugo Mills wrote:

On Mon, Oct 08, 2018 at 11:01:35PM +0200, Pierre Couderc wrote:

On 10/08/2018 06:14 PM, Hugo Mills wrote:

On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:

On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:

I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).

But I have stranges status or errors  about "missing devices" and I
do not understand the current situation :


root@server:~# btrfs fi show
Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
     Total devices 1 FS bytes used 190.91GiB
     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2

warning, device 1 is missing
Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
     Total devices 2 FS bytes used 116.18GiB
     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
     *** Some devices missing

This looks like you've created a RAID-1 array with /dev/sda2 and
/dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
original [part of a] filesystem on /dev/sda2, and replacing it with a
wholly different filesystem. Since the new FS on /dev/sda2 (UUID
28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
and the original FS was made of two devices, btrfs fi show is telling
you that there's some devices missing -- /dev/sda2 is no longer part
of that FS, and is therefore a missing device.

I note that you've got data on both filesystems, so they must both
have been mounted somewhere and had stuff put on them.

I recommend doing something like this:

# mkfs /media/btrfs/myraid1 /media/btrfs/tmp
# mount /dev/sdb /media/btrfs/myraid1/
# mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
# cp /media/btrfs/tmp/* /media/btrfs/myraid1 # put it where you want it
# umount /media/btrfs/tmp/
# wipefs /dev/sda2   # destroy the FS on sda2
# btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/

This will copy all the data from the filesystem on /dev/sda2 into
the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
as the second device for the main FS.

*WARNING!*

Note that, since the main FS is missing a device, it will probably
need to be mounted in degraded mode (-o degraded), and that on kernels
earlier than (IIRC) 4.14, this can only be done *once* without the FS
becoming more or less permanently read-only. On recent kernels, it
_should_ be OK.

*WARNING ENDS*

Oh, and for the record, to make a RAID-1 filesystem from scratch,
you simply need this:

# mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb

You do not need to run mkfs.btrfs on each device separately.

Hugo.

Thnk you very much. I understand a bit better. I think  that I have
nothing of interest on /dev/sdb and that its contents is the result
of previous trials.
And that my system is on /dev/dsda2 as :

root@server:~# df -h
Filesystem  Size  Used Avail Use% Mounted on
udev    3.9G 0  3.9G   0% /dev
tmpfs   787M  8.8M  778M   2% /run
/dev/sda2   1.9T  193G  1.7T  11% /
tmpfs   3.9G 0  3.9G   0% /dev/shm
tmpfs   5.0M 0  5.0M   0% /run/lock
tmpfs   3.9G 0  3.9G   0% /sys/fs/cgroup
/dev/sda1   511M  5.7M  506M   2% /boot/efi
tmpfs   100K 0  100K   0% /var/lib/lxd/shmounts
tmpfs   100K 0  100K   0% /var/lib/lxd/devlxd
root@server:~#

Is it exact ?

Yes, it looks like you're running / from the FS on /dev/sda2.


If yes, I suppose I should wipe data on /dev/sdb, then build the
RAID by expanding /dev/sda2.

Correct.

I would recommend putting a partition table on /dev/sdb, because it
doesn't take up much space, and it's always easier to have one already
there when you need it (and there's a few things that can get confused
if there isn't a partition table).


So I should :

wipefs /dev/sdb
btrfs device add /dev/sdb /
btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /
Does it sound correct ? (my kernel is boot/vmlinuz-4.18.0-1-amd64)

Yes, exactly.

Hugo.


Thnk you very very much.
I do it now, as you with a partition table on /dev/sdb !

Re: Which device is missing ?

2018-10-08 Thread Hugo Mills

On Mon, Oct 08, 2018 at 11:01:35PM +0200, Pierre Couderc wrote:
> On 10/08/2018 06:14 PM, Hugo Mills wrote:
> >On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:
> >>On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> >>>I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).
> >>>
> >>>But I have stranges status or errors  about "missing devices" and I
> >>>do not understand the current situation :
> >>>
> >>>
> >>>root@server:~# btrfs fi show
> >>>Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
> >>>     Total devices 1 FS bytes used 190.91GiB
> >>>     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> >>>
> >>>warning, device 1 is missing
> >>>Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
> >>>     Total devices 2 FS bytes used 116.18GiB
> >>>     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
> >>>     *** Some devices missing
> >>This looks like you've created a RAID-1 array with /dev/sda2 and
> >>/dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
> >>original [part of a] filesystem on /dev/sda2, and replacing it with a
> >>wholly different filesystem. Since the new FS on /dev/sda2 (UUID
> >>28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
> >>and the original FS was made of two devices, btrfs fi show is telling
> >>you that there's some devices missing -- /dev/sda2 is no longer part
> >>of that FS, and is therefore a missing device.
> >>
> >>I note that you've got data on both filesystems, so they must both
> >>have been mounted somewhere and had stuff put on them.
> >>
> >>I recommend doing something like this:
> >>
> >># mkfs /media/btrfs/myraid1 /media/btrfs/tmp
> >># mount /dev/sdb /media/btrfs/myraid1/
> >># mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
> >># cp /media/btrfs/tmp/* /media/btrfs/myraid1 # put it where you want it
> >># umount /media/btrfs/tmp/
> >># wipefs /dev/sda2   # destroy the FS on sda2
> >># btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/
> >>
> >>This will copy all the data from the filesystem on /dev/sda2 into
> >>the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
> >>as the second device for the main FS.
> >>
> >>*WARNING!*
> >>
> >>Note that, since the main FS is missing a device, it will probably
> >>need to be mounted in degraded mode (-o degraded), and that on kernels
> >>earlier than (IIRC) 4.14, this can only be done *once* without the FS
> >>becoming more or less permanently read-only. On recent kernels, it
> >>_should_ be OK.
> >>
> >>*WARNING ENDS*
> >Oh, and for the record, to make a RAID-1 filesystem from scratch,
> >you simply need this:
> >
> ># mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb
> >
> >You do not need to run mkfs.btrfs on each device separately.
> >
> >Hugo.
> Thnk you very much. I understand a bit better. I think  that I have
> nothing of interest on /dev/sdb and that its contents is the result
> of previous trials.
> And that my system is on /dev/dsda2 as :
> 
> root@server:~# df -h
> Filesystem  Size  Used Avail Use% Mounted on
> udev    3.9G 0  3.9G   0% /dev
> tmpfs   787M  8.8M  778M   2% /run
> /dev/sda2   1.9T  193G  1.7T  11% /
> tmpfs   3.9G 0  3.9G   0% /dev/shm
> tmpfs   5.0M 0  5.0M   0% /run/lock
> tmpfs   3.9G 0  3.9G   0% /sys/fs/cgroup
> /dev/sda1   511M  5.7M  506M   2% /boot/efi
> tmpfs   100K 0  100K   0% /var/lib/lxd/shmounts
> tmpfs   100K 0  100K   0% /var/lib/lxd/devlxd
> root@server:~#
> 
> Is it exact ?

   Yes, it looks like you're running / from the FS on /dev/sda2.

> If yes, I suppose I should wipe data on /dev/sdb, then build the
> RAID by expanding /dev/sda2.

   Correct.

   I would recommend putting a partition table on /dev/sdb, because it
doesn't take up much space, and it's always easier to have one already
there when you need it (and there's a few things that can get confused
if there isn't a partition table).

> So I should :
> 
> wipefs /dev/sdb
> btrfs device add /dev/sdb /
> btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /

> Does it sound correct ? (my kernel is boot/vmlinuz-4.18.0-1-amd64)

   Yes, exactly.

   Hugo.

-- 
Hugo Mills | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4  |Derek Lowe


signature.asc
Description: Digital signature

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg

On 10/08/2018 06:37 PM, Holger Hoffstätte wrote:
> On 10/08/18 17:46, Hans van Kranenburg wrote:
> 
>> fs.devices() also looks for dev_items in the chunk tree:
>>
>> https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481
>>
>> So, BOOM! you need root.
>>
>> Or just start a 0, ignore errors and start trying all devids until you
>> found num_devices amount of them that work, yolo.
> 
> Since I need to walk /sys/fs/btrfs/ anyway I *think* I can just look
> at the entries in /sys/fs/btrfs//devices/ and query them all
> directly.

But, you still need root for that right? The progs code does a RO open
directly on the block device.

-$ btrfs dev stats /dev/xvdb
ERROR: cannot open /dev/xvdb: Permission denied
ERROR: '/dev/xvdb' is not a mounted btrfs device

stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 0
stat("/dev/loop0", {st_mode=S_IFBLK|0660, st_rdev=makedev(7, 0), ...}) = 0
open("/dev/loop0", O_RDONLY)= -1 EACCES (Permission denied)

But:

-# btrfs dev stats /dev/xvdb
[/dev/xvdb].write_io_errs0
[/dev/xvdb].read_io_errs 0
[/dev/xvdb].flush_io_errs0
[/dev/xvdb].corruption_errs  0
[/dev/xvdb].generation_errs  0

> The skeleton btrfs_exporter already responds to http requests and
> returns dummy metrics, using the official python client library.
> I've found a nice little python sysfs scraper and now only need to
> figure out how best to map the btrfs info in sysfs to useful metrics.
> The privileged access issue would only have come into play much later,
> but it seems I can avoid it after all, which is great.
> I'm also (re-)learning python in the process, so that's the actual
> thing slowing me down..

:)

-- 
Hans van Kranenburg

Re: [PATCH] fstests: btrfs verify hardening agaist duplicate fsid

2018-10-08 Thread Anand Jain





On 10/06/2018 06:14 PM, Eryu Guan wrote:

On Mon, Oct 01, 2018 at 04:44:35PM +0800, Anand Jain wrote:

We have a known bug in btrfs, that we let the device path be changed
after the device has been mounted. So using this loop hole the new
copied device would appears as if its mounted immediately after its
been copied. So this test case reproduces this issue.

For example:

Initially.. /dev/mmcblk0p4 is mounted as /

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part /
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi

btrfs fi show
Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
 Total devices 1 FS bytes used 1.40GiB
 devid1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4

Copy mmcblk0 to sda
dd if=/dev/mmcblk0 of=/dev/sda

And immediately after the copy completes the change in the device
superblock is notified which the automount scans using
btrfs device scan and the new device sda becomes the mounted root
device.

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda   8:01 14.9G  0 disk
|-sda48:414G  0 part /
|-sda28:21  500M  0 part
|-sda38:31  256M  0 part
`-sda18:11  256M  0 part
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi
btrfs fi show /
Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
 Total devices 1 FS bytes used 1.40GiB
 devid1 size 4.00GiB used 3.00GiB path /dev/sda4

The bug is quite nasty that you can't either unmount /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take
the sda out of the system on to another system to change its fsid using
the 'btrfstune -u' command.

Signed-off-by: Anand Jain 


Looks like that the test will break the whole test env as it leaves an
unmountable $SCRATCH_MNT. I'd wait for the fix to get in first before
merging the test, in case it breaks normal regression tests. (I noticed
that the test is not in 'auto' group, so it's not that dangerous.)


Its possible that its unmountable without the kernel patch. But I am 
unable to reproduce it consistently with or without the kernel patch.


Any idea ways to make it auto for kernels without the patch?


Also, it'd be great if test can be reviewed by btrfs folks too!


---
  tests/btrfs/173 | 72 +
  tests/btrfs/173.out |  5 
  tests/btrfs/group   |  1 +
  3 files changed, 78 insertions(+)
  create mode 100755 tests/btrfs/173
  create mode 100644 tests/btrfs/173.out

diff --git a/tests/btrfs/173 b/tests/btrfs/173
new file mode 100755
index ..f59a62e206c3
--- /dev/null
+++ b/tests/btrfs/173
@@ -0,0 +1,72 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle. All Rights Reserved.
+#
+# FS QA Test 173
+#
+# Fuzzy test for FS image duplication.
+#  Could be fixed by
+#[patch] btrfs: harden agaist duplicate fsid
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+_scratch_dev_pool_get 2
+
+dev_foo=$(echo $SCRATCH_DEV_POOL | awk '{print $1}' | rev | cut -d"/" -f1 | 
rev)
+dev_bar=$(echo $SCRATCH_DEV_POOL | awk '{print $2}' | rev | cut -d"/" -f1 | 
rev)


This doesn't work if the devices in SCRATCH_DEV_POOL are symlinks, e.g.
lvm devices: /dev/mapper/testvg-testlv1, dev_foo is "testvg-testlv1" in
this case.


 Ah, right will fix.


+
+_mkfs_dev /dev/$dev_foo


But /dev/testvg-testlv1 isn't existed.

_short_dev and/or _real_dev is useful in this case. e.g.

dev_foo=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
# dev_foo is like "dm-1"
dev_foo=$(_short_dev $dev_foo)
# dev_foo is like "/dev/dm-1"
dev_foo=$(_real_dev $dev_foo)


 I changed the code a bit which avoids the split. Pls review if that 
will be ok.



+_mount /dev/$dev_foo $SCRATCH_MNT


It'd better to mount non-SCRATCH_DEV to other mount point, e.g.
$TEST_DIR/$seq.mnt


 Will do, any idea why? Isn't the framework automatically try to 
unmount SCRATCH_MNT.


Thanks, Anand


Thanks,
Eryu


+
+echo mount before btrfs image clone | tee -a $seqres.full
+findmnt /dev/$dev_foo | grep -v TARGET | awk '{print $1" "$2}' | \
+   sed -e "s/$dev_foo/dev_foo/g" | _filter_scratch | tee -a $seqres.full
+findmnt /dev/$dev_bar | grep -v TARGET | awk '{print $1"

[PATCH v2 rev log added] fstests: btrfs verify hardening agaist duplicate fsid

2018-10-08 Thread Anand Jain

We have a known bug in btrfs, that we let the device path be changed
after the device has been mounted. So using this loop hole the new
copied device would appears as if its mounted immediately after its
been copied. So this test case reproduces this issue.

For example:

Initially.. /dev/mmcblk0p4 is mounted as /

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part /
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi

btrfs fi show
Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4

Copy mmcblk0 to sda
dd if=/dev/mmcblk0 of=/dev/sda

And immediately after the copy completes the change in the device
superblock is notified which the automount scans using
btrfs device scan and the new device sda becomes the mounted root
device.

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda   8:01 14.9G  0 disk
|-sda48:414G  0 part /
|-sda28:21  500M  0 part
|-sda38:31  256M  0 part
`-sda18:11  256M  0 part
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi
btrfs fi show /
Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid1 size 4.00GiB used 3.00GiB path /dev/sda4

The bug is quite nasty that you can't either unmount /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take
the sda out of the system on to another system to change its fsid using
the 'btrfstune -u' command.

Signed-off-by: Anand Jain 
---
v1->v2: 
  dont play around with dev patch use it as it is.
  do not use SCRATCH_MNT instead create it at the TEST_DIR and its related
   changes.
  golden out changes
   
 tests/btrfs/173 | 88 +
 tests/btrfs/173.out |  6 
 tests/btrfs/group   |  1 +
 3 files changed, 95 insertions(+)
 create mode 100755 tests/btrfs/173
 create mode 100644 tests/btrfs/173.out

diff --git a/tests/btrfs/173 b/tests/btrfs/173
new file mode 100755
index ..b466ae921e19
--- /dev/null
+++ b/tests/btrfs/173
@@ -0,0 +1,88 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle. All Rights Reserved.
+#
+# FS QA Test 173
+#
+# Fuzzy test for FS image duplication.
+#  Could be fixed by
+#[patch] btrfs: harden agaist duplicate fsid
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+mnt=$TEST_DIR/$seq.mnt
+_cleanup()
+{
+   rm -rf $mnt > /dev/null 2>&1
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+_scratch_dev_pool_get 2
+
+dev_foo=$(echo $SCRATCH_DEV_POOL | awk '{print $1}')
+dev_bar=$(echo $SCRATCH_DEV_POOL | awk '{print $2}')
+
+echo dev_foo=$dev_foo >> $seqres.full
+echo dev_bar=$dev_bar >> $seqres.full
+echo | tee -a $seqres.full
+
+rm -rf $mnt > /dev/null 2>&1
+mkdir $mnt
+_mkfs_dev $dev_foo
+_mount $dev_foo $mnt
+
+check_btrfs_mount()
+{
+   local x=$(findmnt $mnt | grep -v TARGET | awk '{print $2}')
+   [[ $x == $dev_foo ]] && echo DEV_FOO
+   [[ $x == $dev_bar ]] && echo DEV_BAR
+}
+
+echo MNT $(check_btrfs_mount)
+
+for sb_bytenr in 65536 67108864
+do
+   echo -n "dd status=none if=$dev_foo of=$dev_bar bs=1 "\
+   "seek=$sb_bytenr skip=$sb_bytenr count=4096" >> $seqres.full
+   dd status=none if=$dev_foo of=$dev_bar bs=1 seek=$sb_bytenr \
+   skip=$sb_bytenr count=4096 >> $seqres.full 2>&1
+   echo ..:$? >> $seqres.full
+done
+
+#Original device is mounted, scan of its clone should fail
+$BTRFS_UTIL_PROG device scan $dev_bar >> $seqres.full 2>&1
+echo btrfs device scan dev_bar ...:$?| tee -a $seqres.full
+
+echo MNT $(check_btrfs_mount)
+
+#Original device scan should be successful
+$BTRFS_UTIL_PROG device scan $dev_foo >> $seqres.full 2>&1
+echo btrfs device scan dev_foo ...:$?| tee -a $seqres.full
+
+umount $mnt > /dev/null 2>&1
+_scratch_dev_pool_put
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/173.out b/tests/btrfs/173.out
new file mode 100644
index ..3c7e3fb4e3f7
--- /dev/null
+++ b/tests/btrfs/173.out
@@ -0,0 +1,6 @@
+QA output created by 173
+
+MNT DEV_FOO
+btrfs device scan dev_bar ...:1
+MNT DEV_FOO
+btrfs device scan dev_foo ...:0
diff --git

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Holger Hoffstätte


On 10/08/18 17:46, Hans van Kranenburg wrote:


fs.devices() also looks for dev_items in the chunk tree:

https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481

So, BOOM! you need root.

Or just start a 0, ignore errors and start trying all devids until you
found num_devices amount of them that work, yolo.


Since I need to walk /sys/fs/btrfs/ anyway I *think* I can just look
at the entries in /sys/fs/btrfs//devices/ and query them all
directly.

The skeleton btrfs_exporter already responds to http requests and
returns dummy metrics, using the official python client library.
I've found a nice little python sysfs scraper and now only need to
figure out how best to map the btrfs info in sysfs to useful metrics.
The privileged access issue would only have come into play much later,
but it seems I can avoid it after all, which is great.
I'm also (re-)learning python in the process, so that's the actual
thing slowing me down..

-h

Re: Which device is missing ?

2018-10-08 Thread Hugo Mills

On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:
> On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> > I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).
> > 
> > But I have stranges status or errors  about "missing devices" and I
> > do not understand the current situation :
> > 
> > 
> > root@server:~# btrfs fi show
> > Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
> >     Total devices 1 FS bytes used 190.91GiB
> >     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> > 
> > warning, device 1 is missing
> > Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
> >     Total devices 2 FS bytes used 116.18GiB
> >     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
> >     *** Some devices missing
> 
>This looks like you've created a RAID-1 array with /dev/sda2 and
> /dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
> original [part of a] filesystem on /dev/sda2, and replacing it with a
> wholly different filesystem. Since the new FS on /dev/sda2 (UUID
> 28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
> and the original FS was made of two devices, btrfs fi show is telling
> you that there's some devices missing -- /dev/sda2 is no longer part
> of that FS, and is therefore a missing device.
> 
>I note that you've got data on both filesystems, so they must both
> have been mounted somewhere and had stuff put on them.
> 
>I recommend doing something like this:
> 
> # mkfs /media/btrfs/myraid1 /media/btrfs/tmp
> # mount /dev/sdb /media/btrfs/myraid1/
> # mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
> # cp /media/btrfs/tmp/* /media/btrfs/myraid1 # put it where you want it
> # umount /media/btrfs/tmp/
> # wipefs /dev/sda2   # destroy the FS on sda2
> # btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/
> 
>This will copy all the data from the filesystem on /dev/sda2 into
> the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
> as the second device for the main FS.
> 
> *WARNING!*
> 
>Note that, since the main FS is missing a device, it will probably
> need to be mounted in degraded mode (-o degraded), and that on kernels
> earlier than (IIRC) 4.14, this can only be done *once* without the FS
> becoming more or less permanently read-only. On recent kernels, it
> _should_ be OK.
> 
> *WARNING ENDS*

   Oh, and for the record, to make a RAID-1 filesystem from scratch,
you simply need this:

# mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb

   You do not need to run mkfs.btrfs on each device separately.

   Hugo.

-- 
Hugo Mills | Welcome to Rivendell, Mr Anderson...
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Machinae Supremacy, Hybrid


signature.asc
Description: Digital signature

Re: Which device is missing ?

2018-10-08 Thread Hugo Mills

On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).
> 
> But I have stranges status or errors  about "missing devices" and I
> do not understand the current situation :
> 
> 
> root@server:~# btrfs fi show
> Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
>     Total devices 1 FS bytes used 190.91GiB
>     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> 
> warning, device 1 is missing
> Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
>     Total devices 2 FS bytes used 116.18GiB
>     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
>     *** Some devices missing

   This looks like you've created a RAID-1 array with /dev/sda2 and
/dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
original [part of a] filesystem on /dev/sda2, and replacing it with a
wholly different filesystem. Since the new FS on /dev/sda2 (UUID
28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
and the original FS was made of two devices, btrfs fi show is telling
you that there's some devices missing -- /dev/sda2 is no longer part
of that FS, and is therefore a missing device.

   I note that you've got data on both filesystems, so they must both
have been mounted somewhere and had stuff put on them.

   I recommend doing something like this:

# mkfs /media/btrfs/myraid1 /media/btrfs/tmp
# mount /dev/sdb /media/btrfs/myraid1/
# mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
# cp /media/btrfs/tmp/* /media/btrfs/myraid1 # put it where you want it
# umount /media/btrfs/tmp/
# wipefs /dev/sda2   # destroy the FS on sda2
# btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/

   This will copy all the data from the filesystem on /dev/sda2 into
the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
as the second device for the main FS.

*WARNING!*

   Note that, since the main FS is missing a device, it will probably
need to be mounted in degraded mode (-o degraded), and that on kernels
earlier than (IIRC) 4.14, this can only be done *once* without the FS
becoming more or less permanently read-only. On recent kernels, it
_should_ be OK.

*WARNING ENDS*

   Hugo.

[snip]

-- 
Hugo Mills | UNIX: Japanese brand of food containers
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |

signature.asc
Description: Digital signature

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg

On 10/08/2018 05:29 PM, Holger Hoffstätte wrote:
> On 10/08/18 16:40, Hans van Kranenburg wrote:
>>> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
>>> BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN.
>>
>> That's the tree search ioctl, for reading arbitrary metadata.
>>
>> The device stats ioctl is IOC_GET_DEV_STATS...
> 
> Yeah..OK, that is clear and gave me the hint to ask: why is it
> calling this in the first place? And as it turns out [1] is where
> it seems to go wrong, as is_block_device() returns 0 (as it should),
> fi_args.num_devices is never set (remains 0) and it proceeds to call
> everything below, eventually calling the BTRFS_IOC_FS_INFO ioctl in
> #1712. And that works fine:
> 
> 1711 if (fi_args->num_devices != 1) {
> (gdb) s
> 1712    ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args);
> (gdb) s
> 1713    if (ret < 0) {
> (gdb) p ret
> $28 = 0
> (gdb) p *fi_args
> $30 = {
>   max_id = 1,
>   num_devices = 1,
>   fsid = "z%:\371\315\033A\203\267.\200\255;FH\221",
>   nodesize = 16384,
>   sectorsize = 4096,
>   clone_alignment = 4096,
>   reserved32 = 0,
>   reserved = {0 }
> }
> 
> It's only when it goes into search_chunk_tree_for_fs_info()
> where things fail due to CAP_SYS_ADMIN.
> 
> And all this explains the actual bug: when I call btrfs device stats
> not on the mountpoint (as I've been trying all this time) but rather
> on the device, it works properly right away as regular user:
> 
> (gdb) set args device stats /dev/loop0
> (gdb) r
> Breakpoint 1, get_fs_info (path=path@entry=0x7fffe527 "/dev/loop0",
> fi_args=fi_args@entry=0x7fffd400,
>     di_ret=di_ret@entry=0x7fffd3f0) at utils.c:1652
> 1652    {
> (gdb) c
> Continuing.
> [/dev/loop0].write_io_errs    0
> [/dev/loop0].read_io_errs 0
> [/dev/loop0].flush_io_errs    0
> [/dev/loop0].corruption_errs  0
> [/dev/loop0].generation_errs  0
> [Inferior 1 (process 2805) exited normally]
> 
> So this is simply a discrepancy in handling a device vs. the device(s)
> for a mountpoint.

Apparently based on what you point it at, it does a different thing.

If you point it at a block device, it will try opening the block device
to find out which devid it has (from the superblock), find out where it
is mounted and then only ask for stats for this one device.

-# btrfs dev stats /dev/xvdc
[/dev/xvdc].write_io_errs0
[/dev/xvdc].read_io_errs 0
[/dev/xvdc].flush_io_errs0
[/dev/xvdc].corruption_errs  0
[/dev/xvdc].generation_errs  0

If you point it at a mounted filesystem, it lists stats for all devices.
Since there's no way to get a list of devids, it directly searches the
chunk tree directly for dev_item objects.

-# btrfs dev stats /mountpoint
[/dev/xvdb].write_io_errs0
[/dev/xvdb].read_io_errs 0
[/dev/xvdb].flush_io_errs0
[/dev/xvdb].corruption_errs  0
[/dev/xvdb].generation_errs  0
[/dev/xvdc].write_io_errs0
[/dev/xvdc].read_io_errs 0
[/dev/xvdc].flush_io_errs0
[/dev/xvdc].corruption_errs  0
[/dev/xvdc].generation_errs  0
[/dev/xvdd].write_io_errs0
[/dev/xvdd].read_io_errs 0
[/dev/xvdd].flush_io_errs0
[/dev/xvdd].corruption_errs  0
[/dev/xvdd].generation_errs  0

I do the same thing in the nagios plugin:

https://github.com/knorrie/python-btrfs/blob/master/examples/nagios/plugins/check_btrfs#L131

fs.devices() also looks for dev_items in the chunk tree:

https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481

So, BOOM! you need root.

Or just start a 0, ignore errors and start trying all devids until you
found num_devices amount of them that work, yolo.

>> I can do the device stats ioctl as normal user:
>>
>> import btrfs
>> fs = btrfs.FileSystem('/')
>> btrfs.utils.pretty_print(fs.dev_stats(1))
>>
>> 
>> devid: 1
>> nr_items: 5
>> flags: 0
>> write_errs: 0
>> read_errs: 0
>> flush_errs: 0
>> generation_errs: 0
>> corruption_errs: 0
> 
> That works for me too, and that's actually the important part. \o/
> Glad we talked about it. :}
> 
> -h
> 
> [1]
> https://github.com/kdave/btrfs-progs/blob/7faaca0d9f78f7162ae603231f693dd8e1af2a41/utils.c#L1666
> 
> 


-- 
Hans van Kranenburg

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Holger Hoffstätte


On 10/08/18 16:40, Hans van Kranenburg wrote:

Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN.


That's the tree search ioctl, for reading arbitrary metadata.

The device stats ioctl is IOC_GET_DEV_STATS...


Yeah..OK, that is clear and gave me the hint to ask: why is it
calling this in the first place? And as it turns out [1] is where
it seems to go wrong, as is_block_device() returns 0 (as it should),
fi_args.num_devices is never set (remains 0) and it proceeds to call
everything below, eventually calling the BTRFS_IOC_FS_INFO ioctl in
#1712. And that works fine:

1711 if (fi_args->num_devices != 1) {
(gdb) s
1712ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args);
(gdb) s
1713if (ret < 0) {
(gdb) p ret
$28 = 0
(gdb) p *fi_args
$30 = {
  max_id = 1,
  num_devices = 1,
  fsid = "z%:\371\315\033A\203\267.\200\255;FH\221",
  nodesize = 16384,
  sectorsize = 4096,
  clone_alignment = 4096,
  reserved32 = 0,
  reserved = {0 }
}

It's only when it goes into search_chunk_tree_for_fs_info()
where things fail due to CAP_SYS_ADMIN.

And all this explains the actual bug: when I call btrfs device stats
not on the mountpoint (as I've been trying all this time) but rather
on the device, it works properly right away as regular user:

(gdb) set args device stats /dev/loop0
(gdb) r
Breakpoint 1, get_fs_info (path=path@entry=0x7fffe527 "/dev/loop0", 
fi_args=fi_args@entry=0x7fffd400,
di_ret=di_ret@entry=0x7fffd3f0) at utils.c:1652
1652{
(gdb) c
Continuing.
[/dev/loop0].write_io_errs0
[/dev/loop0].read_io_errs 0
[/dev/loop0].flush_io_errs0
[/dev/loop0].corruption_errs  0
[/dev/loop0].generation_errs  0
[Inferior 1 (process 2805) exited normally]

So this is simply a discrepancy in handling a device vs. the device(s)
for a mountpoint.


I can do the device stats ioctl as normal user:

import btrfs
fs = btrfs.FileSystem('/')
btrfs.utils.pretty_print(fs.dev_stats(1))


devid: 1
nr_items: 5
flags: 0
write_errs: 0
read_errs: 0
flush_errs: 0
generation_errs: 0
corruption_errs: 0


That works for me too, and that's actually the important part. \o/
Glad we talked about it. :}

-h

[1] 
https://github.com/kdave/btrfs-progs/blob/7faaca0d9f78f7162ae603231f693dd8e1af2a41/utils.c#L1666

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg

On 10/08/2018 04:40 PM, Hans van Kranenburg wrote:
> On 10/08/2018 04:27 PM, Holger Hoffstätte wrote:
>> (moving the discussion here from GH [1])
>>
>> Apparently there is something weird going on with the device stats
>> ioctls. I cannot get them to work as regular user, while they work
>> for David. A friend confirms the same issue on his system - no access
>> as non-root.
>>
>> So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
>> debug symbols and stepped into search_chunk_tree_for_fs_info().
>> Everything is fine, all args are correct, right until:
>>
>> (gdb) s
>> 1614    ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, _args);
>> (gdb) s
>> 1615    if (ret < 0)
>> (gdb) p ret
>> $4 = -1
>> (gdb) p search_args
>> $5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset
>> = 1,
>> max_offset = 18446744073709551615, min_transid = 0, max_transid =
>> 18446744073709551615,
>> min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0,
>> unused2 = 0,
>> unused3 = 0, unused4 = 0}, buf = '\000' }
>>
>> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
>> BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN.
> 
> That's the tree search ioctl, for reading arbitrary metadata.
> 
> The device stats ioctl is IOC_GET_DEV_STATS...
> 
> I can do the device stats ioctl as normal user:
> 
> import btrfs
> fs = btrfs.FileSystem('/')
> btrfs.utils.pretty_print(fs.dev_stats(1))
> 
> 
> devid: 1
> nr_items: 5
> flags: 0
> write_errs: 0
> read_errs: 0
> flush_errs: 0
> generation_errs: 0
> corruption_errs: 0

By the way, I can also do BTRFS_IOC_FS_INFO, BTRFS_IOC_DEV_INFO and
BTRFS_IOC_SPACE_INFO as normal user.

However, while fs_info tells me that there are num_devices devices,
there's no place where you can actually get which devids these are, and
you need to provide them one by one to dev_info and dev_stats... :

btrfs.utils.pretty_print(fs.fs_info())


max_id: 1
num_devices: 1
nodesize: 4096
sectorsize: 4096
clone_alignment: 4096
fsid: 91077ca5-6559-4a90-9d03-912d3a33412e

btrfs.utils.pretty_print(fs.dev_info(1))

devid: 1
bytes_used: 60699967488
total_bytes: 107374182400
path: /dev/xvda
uuid: 7e998baa-b533-4476-9132-d7d748d28044


btrfs.utils.pretty_print(fs.space_info())
-
  
  flags: Data, single
  total_bytes: 54.00GiB
  used_bytes: 53.27GiB
-
  
  flags: System, single
  total_bytes: 32.00MiB
  used_bytes: 12.00KiB
-
  
  flags: Metadata, single
  total_bytes: 2.50GiB
  used_bytes: 1.30GiB
-
  
  flags: GlobalReserve, single
  total_bytes: 181.02MiB
  used_bytes: 0.00B


> 
>> So why can Dave get his dev stats as unprivileged user?
>> Does this work for anybody else? And why? :)
>>
>> cheers
>> Holger
>>
>> [1]
>> https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190
>>
> 
> 


-- 
Hans van Kranenburg

Re: [PATCH] btrfs: Remove unused variable mode in btrfs_mount

2018-10-08 Thread Goldwyn Rodrigues

On 15:03 08/10, David Sterba wrote:
> On Fri, Oct 05, 2018 at 07:26:15AM -0500, Goldwyn Rodrigues wrote:
> > Code cleanup.
> 
> Have you check when and why the variable become unused? Thanks.

No, I did not check it earlier. git blame points to
312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()")
Author cc'd.

-- 
Goldwyn

Re: Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Hans van Kranenburg

On 10/08/2018 04:27 PM, Holger Hoffstätte wrote:
> (moving the discussion here from GH [1])
> 
> Apparently there is something weird going on with the device stats
> ioctls. I cannot get them to work as regular user, while they work
> for David. A friend confirms the same issue on his system - no access
> as non-root.
> 
> So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
> debug symbols and stepped into search_chunk_tree_for_fs_info().
> Everything is fine, all args are correct, right until:
> 
> (gdb) s
> 1614    ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, _args);
> (gdb) s
> 1615    if (ret < 0)
> (gdb) p ret
> $4 = -1
> (gdb) p search_args
> $5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset
> = 1,
> max_offset = 18446744073709551615, min_transid = 0, max_transid =
> 18446744073709551615,
> min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0,
> unused2 = 0,
> unused3 = 0, unused4 = 0}, buf = '\000' }
> 
> Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
> BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN.

That's the tree search ioctl, for reading arbitrary metadata.

The device stats ioctl is IOC_GET_DEV_STATS...

I can do the device stats ioctl as normal user:

import btrfs
fs = btrfs.FileSystem('/')
btrfs.utils.pretty_print(fs.dev_stats(1))


devid: 1
nr_items: 5
flags: 0
write_errs: 0
read_errs: 0
flush_errs: 0
generation_errs: 0
corruption_errs: 0

> So why can Dave get his dev stats as unprivileged user?
> Does this work for anybody else? And why? :)
> 
> cheers
> Holger
> 
> [1]
> https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190
> 


-- 
Hans van Kranenburg

Curious problem: btrfs device stats & unpriviliged access

2018-10-08 Thread Holger Hoffstätte


(moving the discussion here from GH [1])

Apparently there is something weird going on with the device stats
ioctls. I cannot get them to work as regular user, while they work
for David. A friend confirms the same issue on his system - no access
as non-root.

So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with
debug symbols and stepped into search_chunk_tree_for_fs_info().
Everything is fine, all args are correct, right until:

(gdb) s
1614ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, _args);
(gdb) s
1615if (ret < 0)
(gdb) p ret
$4 = -1
(gdb) p search_args
$5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset = 1,
max_offset = 18446744073709551615, min_transid = 0, max_transid = 
18446744073709551615,
min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0, unused2 
= 0,
unused3 = 0, unused4 = 0}, buf = '\000' }

Looking at the kernel side of things in fs/btrfs/ioctl.c I see both
BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN.

So why can Dave get his dev stats as unprivileged user?
Does this work for anybody else? And why? :)

cheers
Holger

[1] 
https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190

Which device is missing ?

2018-10-08 Thread Pierre Couderc


I ma trying to make a "RAID1" with /dev/sda2 ans /dev/sdb (or similar).

But I have stranges status or errors  about "missing devices" and I do 
not understand the current situation :



root@server:~# btrfs fi show
Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
    Total devices 1 FS bytes used 190.91GiB
    devid    1 size 1.82TiB used 196.02GiB path /dev/sda2

warning, device 1 is missing
Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
    Total devices 2 FS bytes used 116.18GiB
    devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
    *** Some devices missing

root@server:~# fdisk -l
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: CFF97102-D6B5-4126-B2B4-FA735598D1F0

Device   Start    End    Sectors  Size Type
/dev/sda1 2048    1050623    1048576  512M EFI System
/dev/sda2  1050624 3907026943 3905976320  1.8T Linux filesystem


Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


root@server:~# btrfs fi usage /
Overall:
    Device size:   1.82TiB
    Device allocated:    196.02GiB
    Device unallocated:    1.63TiB
    Device missing:  0.00B
    Used:    191.84GiB
    Free (estimated):  1.63TiB  (min: 835.27GiB)
    Data ratio:   1.00
    Metadata ratio:   2.00
    Global reserve:  263.64MiB  (used: 0.00B)

Data,single: Size:192.01GiB, Used:189.98GiB
   /dev/sda2 192.01GiB

Metadata,DUP: Size:2.00GiB, Used:951.67MiB
   /dev/sda2   4.00GiB

System,DUP: Size:8.00MiB, Used:48.00KiB
   /dev/sda2  16.00MiB

Unallocated:
   /dev/sda2   1.63TiB
root@server:~#

Re: [PATCH 24/42] btrfs: assert on non-empty delayed iputs

2018-10-08 Thread David Sterba

On Fri, Sep 28, 2018 at 07:18:03AM -0400, Josef Bacik wrote:
> I ran into an issue where there was some reference being held on an
> inode that I couldn't track.  This assert wasn't triggered, but it at
> least rules out we're doing something stupid.
> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba

Re: [PATCH 23/42] btrfs: make sure we create all new bgs

2018-10-08 Thread David Sterba

On Fri, Sep 28, 2018 at 07:18:02AM -0400, Josef Bacik wrote:
> Allocating new chunks modifies both the extent and chunk tree, which can
> trigger new chunk allocations.  So instead of doing list_for_each_safe,
> just do while (!list_empty()) so we make sure we don't exit with other
> pending bg's still on our list.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 

Reviewed-by: David Sterba

Re: [PATCH] btrfs: Remove unused variable mode in btrfs_mount

2018-10-08 Thread David Sterba

On Fri, Oct 05, 2018 at 07:26:15AM -0500, Goldwyn Rodrigues wrote:
> Code cleanup.

Have you check when and why the variable become unused? Thanks.

[PATCH v3 6/6] btrfs-progs: fsck-tests: Add test image for dev extents beyond device boundary

2018-10-08 Thread Qu Wenruo

Now two locations can detect such problem, either by device item
used/total bytes check, or by early dev extents check against device
boundary.

The image is hand-crafted image which uses DATA SINGLE chunk to feed
btrfs check.
As expected, as long as block group item, chunk item, device used bytes
matches, older btrfs check can't detect such problem.

Signed-off-by: Qu Wenruo 
---
 .../over_dev_boundary.img.xz  | Bin 0 -> 1640 bytes
 tests/fsck-tests/036-bad-dev-extents/test.sh  |  20 ++
 2 files changed, 20 insertions(+)
 create mode 100644 
tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
 create mode 100755 tests/fsck-tests/036-bad-dev-extents/test.sh

diff --git a/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz 
b/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
new file mode 100644
index 
..47cb2a707b0097e369dc088ed0549f847995f136
GIT binary patch
literal 1640
zcmV-u2ABE$H+ooF000E$*0e?f03iVu0001VFXf})3;zZsT>wRyj;C3^v%$$4d1oRm
zhA1@4%tH=9jYF%IQSpIUKDpjXLRl?q4p$q;1zsY^#9_Lx=#tbGm>@S#e2aqt!?0}y
z2BPO%4~c9Q5)jKFD7}DURarKL)`^j{f?s>sEHdzmSvX^98%kGi<_8
zMnXynsC*B7}KE(6w>*wfdb|tw$kt^y>W+TB*?pon1P+)#u#?6bSIG)TooJx~u$#
zf^+}xKw`BfI}6=717S~Q%LW1kwY`pz9H{`uNNNFOk2w!VauNEaMfoLj)Z<)!1?F60
zJ+OEA|4$a=9W#XX*l{EG!j^s}p|
z0i#_%$Q}d)-EE8#8O5^x$$8Y0l2
zc19e0Et3O3m`pMOqEkasL8VGE+u~2lZn>sRCRj169Z6mQ3*+`D+C#F1V7POV%lx(9cB{WN*9OP%Zbd1VDn(S4HX^ad4-b~#H
z@9eUP4AAU`)yRf!k+rrrLSYfBSEi6RE#HtbqyPl11S>RCH
zqvJbt22t`FmU^tmTb+7LtpybCB-x1lGSlrpVQ9|6WNBs~q*M-to1gD1l41oiy~}+O?!{68Jvim2*%BznW)@B=3IH)Xi1795q{#>sF*y^0T2@b$`Wqwch?BgN}IR9Ui
z;!cs)hJFGsJmFaiUsYrN$c0^BLU^n-B%fagn+jR{?Dq+K%VyMG@pmAOShFY)k8zBxm@7YD
zb^ZXU;<`+_B+IV{+A~1Ku
zWggqWQd{E%hF}W2ArPwJ33zWP>MIe;YDN8WV+|+a4_c>yFN#ZGN00G#%3HYi
z4pePTnc{8k5CjwuN6hudRFcBkvB^pSFufzaaD`-;mPgJ~wr(

[PATCH v3 3/6] btrfs-progs: original check: Add ability to detect bad dev extents

2018-10-08 Thread Qu Wenruo

Unlike lowmem mode check, we don't have good place for original mode to
check overlap dev extents.

So this patch introduces a new function, btrfs_check_dev_extents(), to
handle possible bad dev extents.

Reported-by: Hans van Kranenburg 
Signed-off-by: Qu Wenruo 
---
 check/main.c | 99 
 1 file changed, 99 insertions(+)

diff --git a/check/main.c b/check/main.c
index bc2ee22f7943..ff9a785ce555 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8224,6 +8224,99 @@ out:
return ret;
 }
 
+/*
+ * Check if all dev extents are valid (not overlap nor beyond device
+ * boundary).
+ *
+ * Dev extents <-> chunk cross checking is already done in check_chunks().
+ */
+static int check_dev_extents(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_root *dev_root = fs_info->dev_root;
+   int ret;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
+
+   btrfs_init_path();
+
+   key.objectid = 1;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, dev_root, , , 0, 0);
+   if (ret < 0) {
+   error("failed to search device tree: %s", strerror(-ret));
+   goto out;
+   }
+   if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
+   ret = btrfs_next_leaf(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   }
+
+   while (1) {
+   struct btrfs_dev_extent *dev_ext;
+   struct btrfs_device *dev;
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
+   btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
+   if (key.type != BTRFS_DEV_EXTENT_KEY)
+   break;
+   dev_ext = btrfs_item_ptr(path.nodes[0], path.slots[0],
+struct btrfs_dev_extent);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], dev_ext);
+
+   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
+   if (!dev) {
+   error("failed to find device with devid %llu", devid);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (prev_devid == devid && prev_dev_ext_end > physical_offset) {
+   error(
+"dev extent devid %llu physical offset %llu overlap with previous dev extent 
end %llu",
+ devid, physical_offset, prev_dev_ext_end);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (physical_offset + physical_len > dev->total_bytes) {
+   error(
+"dev extent devid %llu physical offset %llu len %llu is beyond device boudnary 
%llu",
+ devid, physical_offset, physical_len,
+ dev->total_bytes);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   prev_devid = devid;
+   prev_dev_ext_end = physical_offset + physical_len;
+
+   ret = btrfs_next_item(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+   }
+out:
+   btrfs_release_path();
+   return ret;
+}
+
 static int check_chunks_and_extents(struct btrfs_fs_info *fs_info)
 {
struct rb_root dev_cache;
@@ -8318,6 +8411,12 @@ again:
goto out;
}
 
+   ret = check_dev_extents(fs_info);
+   if (ret < 0) {
+   err = ret;
+   goto out;
+   }
+
ret = check_chunks(_cache, _group_cache,
   _extent_cache, NULL, NULL, NULL, 0);
if (ret) {
-- 
2.19.1

[PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Qu Wenruo

Obviously, used bytes can't be larger than total bytes.

Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 07c03cad77af..1173b963b8f3 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
used = btrfs_device_bytes_used(eb, dev_item);
total_bytes = btrfs_device_total_bytes(eb, dev_item);
 
+   if (used > total_bytes) {
+   error("device %llu has incorrect used bytes %llu > total bytes 
%llu",
+   dev_id, used, total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
key.objectid = dev_id;
key.type = BTRFS_DEV_EXTENT_KEY;
key.offset = 0;
-- 
2.19.1

[PATCH v3 2/6] btrfs-progs: lowmem check: Add check for overlapping dev extents

2018-10-08 Thread Qu Wenruo

Add such check at check_dev_item(), since at that timing we're also
iterating dev extents for dev item accounting.

Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 1bce44f5658a..07c03cad77af 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4065,6 +4065,8 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
u64 dev_id;
u64 used;
u64 total = 0;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
int ret;
 
dev_item = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);
@@ -4086,8 +4088,16 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
return REFERENCER_MISSING;
}
 
-   /* Iterate dev_extents to calculate the used space of a device */
+   /*
+* Iterate dev_extents to calculate the used space of a device
+*
+* Also make sure no dev extents overlap and end beyond device boundary
+*/
while (1) {
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
goto next;
 
@@ -4099,7 +4109,27 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
 
ptr = btrfs_item_ptr(path.nodes[0], path.slots[0],
 struct btrfs_dev_extent);
-   total += btrfs_dev_extent_length(path.nodes[0], ptr);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], ptr);
+
+   if (prev_devid == devid && physical_offset < prev_dev_ext_end) {
+   error(
+"dev extent devid %llu offset %llu len %llu overlap with previous dev extent 
end %llu",
+ devid, physical_offset, physical_len,
+ prev_dev_ext_end);
+   return ACCOUNTING_MISMATCH;
+   }
+   if (physical_offset + physical_len > total_bytes) {
+   error(
+"dev extent devid %llu offset %llu len %llu is beyond device boundary %llu",
+ devid, physical_offset, physical_len,
+ total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
+   prev_devid = devid;
+   prev_dev_ext_end = physical_offset + physical_len;
+   total += physical_len;
 next:
ret = btrfs_next_item(dev_root, );
if (ret)
-- 
2.19.1

[PATCH v3 1/6] btrfs-progs: image: Use correct device size when restoring

2018-10-08 Thread Qu Wenruo

When restoring btrfs image, the total_bytes of device item is not
updated correctly. In fact total_bytes can be left 0 for restored image.

It doesn't trigger any error because btrfs check never checks
total_bytes of dev item.
However this is going to change.

Fix it by populating total_bytes of device item with the end position of
last dev extent to make later btrfs check happy.

Signed-off-by: Qu Wenruo 
---
 image/main.c | 48 +---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/image/main.c b/image/main.c
index 351c5a256938..d5b89bc3149f 100644
--- a/image/main.c
+++ b/image/main.c
@@ -2082,15 +2082,17 @@ static void remap_overlapping_chunks(struct 
mdrestore_struct *mdres)
 }
 
 static int fixup_devices(struct btrfs_fs_info *fs_info,
-struct mdrestore_struct *mdres, off_t dev_size)
+struct mdrestore_struct *mdres, int out_fd)
 {
struct btrfs_trans_handle *trans;
struct btrfs_dev_item *dev_item;
+   struct btrfs_dev_extent *dev_ext;
struct btrfs_path path;
struct extent_buffer *leaf;
struct btrfs_root *root = fs_info->chunk_root;
struct btrfs_key key;
u64 devid, cur_devid;
+   u64 dev_size; /* Get from last dev extents */
int ret;
 
trans = btrfs_start_transaction(fs_info->tree_root, 1);
@@ -2101,16 +2103,56 @@ static int fixup_devices(struct btrfs_fs_info *fs_info,
 
dev_item = _info->super_copy->dev_item;
 
+   btrfs_init_path();
devid = btrfs_stack_device_id(dev_item);
 
+   key.objectid = devid;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   key.offset = (u64)-1;
+
+   ret = btrfs_search_slot(NULL, fs_info->dev_root, , , 0, 0);
+   if (ret < 0) {
+   error("failed to locate last dev extent of devid %llu: %s",
+   devid, strerror(-ret));
+   btrfs_release_path();
+   return ret;
+   }
+   if (ret == 0) {
+   error("found invalid dev extent devid %llu offset -1",
+   devid);
+   btrfs_release_path();
+   return -EUCLEAN;
+   }
+   ret = btrfs_previous_item(fs_info->dev_root, , devid,
+ BTRFS_DEV_EXTENT_KEY);
+   if (ret > 0)
+   ret = -ENOENT;
+   if (ret < 0) {
+   error("failed to locate last dev extent of devid %llu: %s",
+   devid, strerror(-ret));
+   btrfs_release_path();
+   return ret;
+   }
+
+   btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
+   dev_ext = btrfs_item_ptr(path.nodes[0], path.slots[0],
+struct btrfs_dev_extent);
+   dev_size = key.offset + btrfs_dev_extent_length(path.nodes[0], dev_ext);
+   btrfs_release_path();
+
btrfs_set_stack_device_total_bytes(dev_item, dev_size);
btrfs_set_stack_device_bytes_used(dev_item, mdres->alloced_chunks);
+   /* Don't forget to enlarge the real file */
+   ret = ftruncate64(out_fd, dev_size);
+   if (ret < 0) {
+   error("failed to enlarge result image: %s", strerror(errno));
+   return -errno;
+   }
 
key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
key.type = BTRFS_DEV_ITEM_KEY;
key.offset = 0;
 
-   btrfs_init_path();
 
 again:
ret = btrfs_search_slot(trans, root, , , -1, 1);
@@ -2275,7 +2317,7 @@ static int restore_metadump(const char *input, FILE *out, 
int old_restore,
return 1;
}
 
-   ret = fixup_devices(info, , st.st_size);
+   ret = fixup_devices(info, , fileno(out));
close_ctree(info->chunk_root);
if (ret)
goto out;
-- 
2.19.1

[PATCH v3 5/6] btrfs-progs: original check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Qu Wenruo

Signed-off-by: Qu Wenruo 
---
 check/main.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/check/main.c b/check/main.c
index ff9a785ce555..12f12e18a83f 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7938,6 +7938,12 @@ static int check_device_used(struct device_record 
*dev_rec,
struct device_extent_record *dev_extent_rec;
u64 total_byte = 0;
 
+   if (dev_rec->byte_used > dev_rec->total_byte) {
+   error("device %llu has incorrect used bytes %llu > total bytes 
%llu",
+ dev_rec->devid, dev_rec->byte_used, dev_rec->total_byte);
+   return -EUCLEAN;
+   }
+
cache = search_cache_extent2(_cache->tree, dev_rec->devid, 0);
while (cache) {
dev_extent_rec = container_of(cache,
-- 
2.19.1

[PATCH v3 0/6] btrfs-progs: check: Detect invalid dev extents and device items

2018-10-08 Thread Qu Wenruo

This patchset can be fetch from github:
https://github.com/adam900710/btrfs-progs/tree/dev_extents_check

Hans van Kranenburg reported a case where btrfs DUP chunk allocator
could allocate invalid dev extents, either overlaps with existing dev
extents or beyond device boundary.

This patchset enhances the btrfs-progs side to detect such problems.
With hand crafted test image for it.

Link: https://www.spinics.net/lists/linux-btrfs/msg82370.html

Changelog:
v2:
  Fix a bug in the 1st patch which makes lowmem mode never checks
  overlap dev extents.
  Fix test case bug which never passes due to wrong script.
v3:
  Add btrfs-image fixes to make test cases happy.

Qu Wenruo (6):
  btrfs-progs: image: Use correct device size when restoring
  btrfs-progs: lowmem check: Add check for overlapping dev extents
  btrfs-progs: original check: Add ability to detect bad dev extents
  btrfs-progs: lowmem check: Add dev_item check for used bytes and total
bytes
  btrfs-progs: original check: Add dev_item check for used bytes and
total bytes
  btrfs-progs: fsck-tests: Add test image for dev extents beyond device
boundary

 check/main.c  | 105 ++
 check/mode-lowmem.c   |  39 ++-
 image/main.c  |  48 +++-
 .../over_dev_boundary.img.xz  | Bin 0 -> 1640 bytes
 tests/fsck-tests/036-bad-dev-extents/test.sh  |  20 
 5 files changed, 207 insertions(+), 5 deletions(-)
 create mode 100644 
tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
 create mode 100755 tests/fsck-tests/036-bad-dev-extents/test.sh

-- 
2.19.1

Re: Monitoring btrfs with Prometheus (and soon OpenMonitoring)

2018-10-08 Thread Austin S. Hemmelgarn


On 2018-10-07 09:37, Holger Hoffstätte wrote:


The Prometheus statistics collection/aggregation/monitoring/alerting system
[1] is quite popular, easy to use and will probably be the basis for the
upcoming OpenMetrics "standard" [2].

Prometheus collects metrics by polling host-local "exporters" that respond
to http requests; many such exporters exist, from the generic node_exporter
for OS metrics to all sorts of application-/service-specific varieties.

Since btrfs already exposes quite a lot of monitorable and - more
importantly - actionable runtime information in sysfs it only makes sense
to expose these metrics for visualization & alerting. I noodled over the
idea some time ago but got sidetracked, besides not being thrilled at all
by the idea of doing this in golang (which I *really* dislike).

However, exporters can be written in any language as long as they speak
the standard response protocol, so an alternative would be to use one
of the other official exporter clients. These provide language-native
"mini-frameworks" where one only has to fill in the blanks (see [3]
for examples).

Since the issue just came up in the node_exporter bugtracker [3] I
figured I ask if anyone here is interested in helping build a proper
standalone btrfs_exporter in C++? :D

..just kidding, I'd probably use python (which I kind of don't really
know either :) and build on Hans' python-btrfs library for anything
not covered by sysfs.

Anybody interested in helping? Apparently there are also golang libs
for btrfs [5] but I don't know anything about them (if you do, please
comment on the bug), and the idea of adding even more stuff into the
monolithic, already creaky and somewhat bloated node_exporter is not
appealing to me.

Potential problems wrt. btrfs are access to root-only information,
like e.g. the btrfs device stats/errors in the aforementioned bug,
since exporters are really supposed to run unprivileged due to network
exposure. The S.M.A.R.T. exporter [6] solves this with dual-process
contortions; obviously it would be better if all relevant metrics were
accessible directly in sysfs and not require privileged access, but
forking a tiny privileged process every polling interval is probably
not that bad.

All ideas welcome!
You might be interested in what Netdata [1] is doing.  We've already got 
tracking of space allocations via the sysfs interface (fun fact, you 
actually don't have to be root on most systems to read that data), and 
also ship some per-defined alarms that will trigger when the device gets 
close to full at a low-level (more specifically, if total chunk 
allocations exceed 90% of the total space of all the devices in the volume).


Actual data collection is being done in C (Netdata already has a lot of 
infrastructure for parsing things out of /proc or /sys), and there ahs 
been some discussion in the past of adding collection of device error 
counters (I've been working on and off on it myself, but I still don't 
have a good enough understanding of the C code to get anything actually 
working yet).


[1] https://my-netdata.io/

Re: Understanding BTRFS RAID0 Performance

2018-10-08 Thread Austin S. Hemmelgarn


On 2018-10-05 20:34, Duncan wrote:

Wilson, Ellis posted on Fri, 05 Oct 2018 15:29:52 + as excerpted:


Is there any tuning in BTRFS that limits the number of outstanding reads
at a time to a small single-digit number, or something else that could
be behind small queue depths?  I can't otherwise imagine what the
difference would be on the read path between ext4 vs btrfs when both are
on mdraid.


It seems I forgot to directly answer that question in my first reply.
Thanks for restating it.

Btrfs doesn't really expose much performance tuning (yet?), at least
outside the code itself.  There are a few very limited knobs, but they're
just that, few and limited or broad-stroke.

There are mount options like ssd/nossd, ssd_spread/nossd_spread, the
space_cache set of options (see below), flushoncommit/noflushoncommit,
commit=, etc (see the btrfs (5) manpage), but nothing really to
influence stride length, etc, or to optimize chunk placement between ssd
and non-ssd devices, for instance.

And there's a few filesystem features, normally set at mkfs.btrfs time
(and thus covered in the mkfs.btrfs manpage) but some of which can be
tuned later, but generally, the defaults have changed over time to
reflect the best case, and the older variants are there primarily to
retain backward compatibility with old kernels and tools that didn't
handle the newer variants.

That said, as I think about it there are some tunables that may be worth
experimenting with.  Most or all of these are covered in the btrfs (5)
manpage.

* Given the large device numbers you mention and raid0, you're likely
dealing with multi-TB-scale filesystems.  At this level, the
space_cache=v2 mount option may be useful.  It's not the default yet as
btrfs check, etc, don't yet handle it, but given your raid0 choice you
may not be concerned about that.  Need only be given once after which v2
is "on" for the filesystem until turned off.

* Consider experimenting with the thread_pool=n mount option.  I've seen
very little discussion of this one, but given your interest in
parallelization, it could make a difference.
Probably not as much as you might think.  I'll explain a bit more 
further down where this is being mentioned again.


* Possibly the commit= (default 30) mount option.  In theory,
upping this may allow better write merging, tho your interest seems to be
more on the read side, and the commit time has consequences at crash time.
Based on my own experience, having a higher commit time doesn't impact 
read or write performance much or really help all that much with write 
merging.  All it really helps with is minimizing overhead, but it's not 
even all that great at doing that.


* The autodefrag mount option may be considered if you do a lot of
existing file updates, as is common with database or VM image files.  Due
to COW this triggers high fragmentation on btrfs, and autodefrag should
help control that.  Note that autodefrag effectively increases the
minimum extent size from 4 KiB to, IIRC, 16 MB, tho it may be less, and
doesn't operate at whole-file size, so larger repeatedly-modified files
will still have some fragmentation, just not as much.  Obviously, you
wouldn't see the read-time effects of this until the filesystem has aged
somewhat, so it may not show up on your benchmarks.

(Another option for such files is setting them nocow or using the
nodatacow mount option, but this turns off checksumming and if it's on,
compression for those files, and has a few other non-obvious caveats as
well, so isn't something I recommend.  Instead of using nocow, I'd
suggest putting such files on a dedicated traditional non-cow filesystem
such as ext4, and I consider nocow at best a workaround option for those
who prefer to use btrfs as a single big storage pool and thus don't want
to do the dedicated non-cow filesystem for some subset of their files.)

* Not really for reads but for btrfs and any cow-based filesystem, you
almost certainly want the (not btrfs specific) noatime mount option.
Actually...  This can help a bit for some workloads.  Just like the 
commit time, it comes down to a matter of overhead.  Essentially, if you 
read a file regularly, than with the default of relatime, you've got a 
guaranteed write requiring a commit of the metadata tree once every 24 
hours.  It's not much to worry about for just one file, but if you're 
reading a very large number of files all the time, it can really add up.


* While it has serious filesystem integrity implications and thus can't
be responsibly recommended, there is the nobarrier mount option.  But if
you're already running raid0 on a large number of devices you're already
gambling with device stability, and this /might/ be an additional risk
you're willing to take, as it should increase performance.  But for
normal users it's simply not worth the risk, and if you do choose to use
it, it's at your own risk.
Agreed, if you're running RAID0 with this many drives, nobarrier may be 
worth it for a

Re: [PATCH 37/42] btrfs: wakeup cleaner thread when adding delayed iput

2018-10-08 Thread Filipe Manana

On Fri, Sep 28, 2018 at 12:21 PM Josef Bacik  wrote:
>
> The cleaner thread usually takes care of delayed iputs, with the
> exception of the btrfs_end_transaction_throttle path.  The cleaner
> thread only gets woken up every 30 seconds, so instead wake it up to do
> it's work so that we can free up that space as quickly as possible.
>
> Signed-off-by: Josef Bacik 
Reviewed-by: Filipe Manana 
> ---
>  fs/btrfs/inode.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 2b257d14bd3d..0a1671fb03bf 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -3323,6 +3323,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
> ASSERT(list_empty(>delayed_iput));
> list_add_tail(>delayed_iput, _info->delayed_iputs);
> spin_unlock(_info->delayed_iput_lock);
> +   wake_up_process(fs_info->cleaner_kthread);
>  }
>
>  void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

[PATCH v2 4/5] btrfs-progs: original check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Qu Wenruo

Signed-off-by: Qu Wenruo 
---
 check/main.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/check/main.c b/check/main.c
index ff9a785ce555..12f12e18a83f 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7938,6 +7938,12 @@ static int check_device_used(struct device_record 
*dev_rec,
struct device_extent_record *dev_extent_rec;
u64 total_byte = 0;
 
+   if (dev_rec->byte_used > dev_rec->total_byte) {
+   error("device %llu has incorrect used bytes %llu > total bytes 
%llu",
+ dev_rec->devid, dev_rec->byte_used, dev_rec->total_byte);
+   return -EUCLEAN;
+   }
+
cache = search_cache_extent2(_cache->tree, dev_rec->devid, 0);
while (cache) {
dev_extent_rec = container_of(cache,
-- 
2.19.1

[PATCH v2 3/5] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Qu Wenruo

Obviously, used bytes can't be larger than total bytes.

Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 07c03cad77af..1173b963b8f3 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
used = btrfs_device_bytes_used(eb, dev_item);
total_bytes = btrfs_device_total_bytes(eb, dev_item);
 
+   if (used > total_bytes) {
+   error("device %llu has incorrect used bytes %llu > total bytes 
%llu",
+   dev_id, used, total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
key.objectid = dev_id;
key.type = BTRFS_DEV_EXTENT_KEY;
key.offset = 0;
-- 
2.19.1

[PATCH v2 5/5] btrfs-progs: fsck-tests: Add test image for dev extents beyond device boundary

2018-10-08 Thread Qu Wenruo

Now two locations can detect such problem, either by device item
used/total bytes check, or by early dev extents check against device
boundary.

The image is hand-crafted image which uses DATA SINGLE chunk to feed
btrfs check.
As expected, as long as block group item, chunk item, device used bytes
matches, older btrfs check can't detect such problem.

Signed-off-by: Qu Wenruo 
---
 .../over_dev_boundary.img.xz  | Bin 0 -> 1640 bytes
 tests/fsck-tests/036-bad-dev-extents/test.sh  |  20 ++
 2 files changed, 20 insertions(+)
 create mode 100644 
tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
 create mode 100755 tests/fsck-tests/036-bad-dev-extents/test.sh

diff --git a/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz 
b/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
new file mode 100644
index 
..47cb2a707b0097e369dc088ed0549f847995f136
GIT binary patch
literal 1640
zcmV-u2ABE$H+ooF000E$*0e?f03iVu0001VFXf})3;zZsT>wRyj;C3^v%$$4d1oRm
zhA1@4%tH=9jYF%IQSpIUKDpjXLRl?q4p$q;1zsY^#9_Lx=#tbGm>@S#e2aqt!?0}y
z2BPO%4~c9Q5)jKFD7}DURarKL)`^j{f?s>sEHdzmSvX^98%kGi<_8
zMnXynsC*B7}KE(6w>*wfdb|tw$kt^y>W+TB*?pon1P+)#u#?6bSIG)TooJx~u$#
zf^+}xKw`BfI}6=717S~Q%LW1kwY`pz9H{`uNNNFOk2w!VauNEaMfoLj)Z<)!1?F60
zJ+OEA|4$a=9W#XX*l{EG!j^s}p|
z0i#_%$Q}d)-EE8#8O5^x$$8Y0l2
zc19e0Et3O3m`pMOqEkasL8VGE+u~2lZn>sRCRj169Z6mQ3*+`D+C#F1V7POV%lx(9cB{WN*9OP%Zbd1VDn(S4HX^ad4-b~#H
z@9eUP4AAU`)yRf!k+rrrLSYfBSEi6RE#HtbqyPl11S>RCH
zqvJbt22t`FmU^tmTb+7LtpybCB-x1lGSlrpVQ9|6WNBs~q*M-to1gD1l41oiy~}+O?!{68Jvim2*%BznW)@B=3IH)Xi1795q{#>sF*y^0T2@b$`Wqwch?BgN}IR9Ui
z;!cs)hJFGsJmFaiUsYrN$c0^BLU^n-B%fagn+jR{?Dq+K%VyMG@pmAOShFY)k8zBxm@7YD
zb^ZXU;<`+_B+IV{+A~1Ku
zWggqWQd{E%hF}W2ArPwJ33zWP>MIe;YDN8WV+|+a4_c>yFN#ZGN00G#%3HYi
z4pePTnc{8k5CjwuN6hudRFcBkvB^pSFufzaaD`-;mPgJ~wr(

[PATCH v2 1/5] btrfs-progs: lowmem check: Add check for overlapping dev extents

2018-10-08 Thread Qu Wenruo

Add such check at check_dev_item(), since at that timing we're also
iterating dev extents for dev item accounting.

Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 1bce44f5658a..07c03cad77af 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4065,6 +4065,8 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
u64 dev_id;
u64 used;
u64 total = 0;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
int ret;
 
dev_item = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);
@@ -4086,8 +4088,16 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
return REFERENCER_MISSING;
}
 
-   /* Iterate dev_extents to calculate the used space of a device */
+   /*
+* Iterate dev_extents to calculate the used space of a device
+*
+* Also make sure no dev extents overlap and end beyond device boundary
+*/
while (1) {
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
goto next;
 
@@ -4099,7 +4109,27 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
 
ptr = btrfs_item_ptr(path.nodes[0], path.slots[0],
 struct btrfs_dev_extent);
-   total += btrfs_dev_extent_length(path.nodes[0], ptr);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], ptr);
+
+   if (prev_devid == devid && physical_offset < prev_dev_ext_end) {
+   error(
+"dev extent devid %llu offset %llu len %llu overlap with previous dev extent 
end %llu",
+ devid, physical_offset, physical_len,
+ prev_dev_ext_end);
+   return ACCOUNTING_MISMATCH;
+   }
+   if (physical_offset + physical_len > total_bytes) {
+   error(
+"dev extent devid %llu offset %llu len %llu is beyond device boundary %llu",
+ devid, physical_offset, physical_len,
+ total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
+   prev_devid = devid;
+   prev_dev_ext_end = physical_offset + physical_len;
+   total += physical_len;
 next:
ret = btrfs_next_item(dev_root, );
if (ret)
-- 
2.19.1

[PATCH v2 2/5] btrfs-progs: original check: Add ability to detect bad dev extents

2018-10-08 Thread Qu Wenruo

Unlike lowmem mode check, we don't have good place for original mode to
check overlap dev extents.

So this patch introduces a new function, btrfs_check_dev_extents(), to
handle possible bad dev extents.

Reported-by: Hans van Kranenburg 
Signed-off-by: Qu Wenruo 
---
 check/main.c | 99 
 1 file changed, 99 insertions(+)

diff --git a/check/main.c b/check/main.c
index bc2ee22f7943..ff9a785ce555 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8224,6 +8224,99 @@ out:
return ret;
 }
 
+/*
+ * Check if all dev extents are valid (not overlap nor beyond device
+ * boundary).
+ *
+ * Dev extents <-> chunk cross checking is already done in check_chunks().
+ */
+static int check_dev_extents(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_root *dev_root = fs_info->dev_root;
+   int ret;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
+
+   btrfs_init_path();
+
+   key.objectid = 1;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, dev_root, , , 0, 0);
+   if (ret < 0) {
+   error("failed to search device tree: %s", strerror(-ret));
+   goto out;
+   }
+   if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
+   ret = btrfs_next_leaf(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   }
+
+   while (1) {
+   struct btrfs_dev_extent *dev_ext;
+   struct btrfs_device *dev;
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
+   btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
+   if (key.type != BTRFS_DEV_EXTENT_KEY)
+   break;
+   dev_ext = btrfs_item_ptr(path.nodes[0], path.slots[0],
+struct btrfs_dev_extent);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], dev_ext);
+
+   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
+   if (!dev) {
+   error("failed to find device with devid %llu", devid);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (prev_devid == devid && prev_dev_ext_end > physical_offset) {
+   error(
+"dev extent devid %llu physical offset %llu overlap with previous dev extent 
end %llu",
+ devid, physical_offset, prev_dev_ext_end);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (physical_offset + physical_len > dev->total_bytes) {
+   error(
+"dev extent devid %llu physical offset %llu len %llu is beyond device boudnary 
%llu",
+ devid, physical_offset, physical_len,
+ dev->total_bytes);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   prev_devid = devid;
+   prev_dev_ext_end = physical_offset + physical_len;
+
+   ret = btrfs_next_item(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+   }
+out:
+   btrfs_release_path();
+   return ret;
+}
+
 static int check_chunks_and_extents(struct btrfs_fs_info *fs_info)
 {
struct rb_root dev_cache;
@@ -8318,6 +8411,12 @@ again:
goto out;
}
 
+   ret = check_dev_extents(fs_info);
+   if (ret < 0) {
+   err = ret;
+   goto out;
+   }
+
ret = check_chunks(_cache, _group_cache,
   _extent_cache, NULL, NULL, NULL, 0);
if (ret) {
-- 
2.19.1

[PATCH v2 0/5] btrfs-progs: check: Detect invalid dev extents and device items

2018-10-08 Thread Qu Wenruo

This patchset can be fetch from github:
https://github.com/adam900710/btrfs-progs/tree/dev_extents_check

Hans van Kranenburg reported a case where btrfs DUP chunk allocator
could allocate invalid dev extents, either overlaps with existing dev
extents or beyond device boundary.

This patchset enhances the btrfs-progs side to detect such problems.
With hand crafted test image for it.

Link: https://www.spinics.net/lists/linux-btrfs/msg82370.html

Changelog:
v2:
  Fix a bug in the 1st patch which makes lowmem mode never checks
  overlap dev extents.
  Fix test case bug which never passes due to wrong script.

Qu Wenruo (5):
  btrfs-progs: lowmem check: Add check for overlapping dev extents
  btrfs-progs: original check: Add ability to detect bad dev extents
  btrfs-progs: lowmem check: Add dev_item check for used bytes and total
bytes
  btrfs-progs: original check: Add dev_item check for used bytes and
total bytes
  btrfs-progs: fsck-tests: Add test image for dev extents beyond device
boundary

 check/main.c  | 105 ++
 check/mode-lowmem.c   |  39 ++-
 .../over_dev_boundary.img.xz  | Bin 0 -> 1640 bytes
 tests/fsck-tests/036-bad-dev-extents/test.sh  |  20 
 4 files changed, 162 insertions(+), 2 deletions(-)
 create mode 100644 
tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
 create mode 100755 tests/fsck-tests/036-bad-dev-extents/test.sh

-- 
2.19.1

[PATCH] Btrfs: fix warning when replaying log after fsync of a tmpfile

2018-10-08 Thread fdmanana

From: Filipe Manana 

When replaying a log which contains a tmpfile (which necessarily has a
link count of 0) we end up calling inc_nlink(), at
fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
the following:

  [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 
inc_nlink+0x33/0x40
  [195191.943674] Modules linked in: btrfs dm_flakey dm_mod xor raid6_pq 
libcrc32c kvm_intel bochs_drm ttm kvm drm_kms_helper drm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 
crypto_simd cryptd glue_helper joydev sg button evdev pcspkr qemu_fw_cfg 
serio_raw parport_pc ppdev lp parport ip_tables x_tables autofs4 ext4 
crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod virtio_scsi ata_generic 
virtio_pci virtio_ring virtio ata_piix floppy crc32c_intel libata psmouse e1000 
scsi_mod i2c_piix4 [last unloaded: btrfs]
  [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 
4.19.0-rc6-btrfs-next-38 #1
  [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
  [195191.943726] RIP: 0010:inc_nlink+0x33/0x40
  [195191.943727] Code: c0 74 07 83 c0 01 89 47 48 c3 f6 87 d1 00 00 00 04 74 
17 48 8b 47 28 f0 48 83 a8 70 07 00 00 01 8b 47 48 83 c0 01 89 47 48 c3 <0f> 0b 
eb e5 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 ff 05 54
  [195191.943728] RSP: 0018:b96e425e3870 EFLAGS: 00010246
  [195191.943730] RAX:  RBX: 8c0d1e6af4f0 RCX: 
0006
  [195191.943731] RDX:  RSI:  RDI: 
8c0d1e6af4f0
  [195191.943731] RBP: 0097 R08: 0001 R09: 

  [195191.943732] R10:  R11:  R12: 
b96e425e3a60
  [195191.943733] R13: 8c0d10cff0c8 R14: 8c0d0d515348 R15: 
8c0d78a1b3f8
  [195191.943735] FS:  7f570ee24480() GS:8c0dfb20() 
knlGS:
  [195191.943736] CS:  0010 DS:  ES:  CR0: 80050033
  [195191.943737] CR2: 5593286277c8 CR3: bb8f2006 CR4: 
003606f0
  [195191.943739] DR0:  DR1:  DR2: 

  [195191.943740] DR3:  DR6: fffe0ff0 DR7: 
0400
  [195191.943741] Call Trace:
  [195191.943778]  replay_one_buffer+0x797/0x7d0 [btrfs]
  [195191.943802]  walk_up_log_tree+0x1c1/0x250 [btrfs]
  [195191.943809]  ? rcu_read_lock_sched_held+0x3f/0x70
  [195191.943825]  walk_log_tree+0xae/0x1d0 [btrfs]
  [195191.943840]  btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
  [195191.943856]  ? replay_dir_deletes+0x280/0x280 [btrfs]
  [195191.943870]  open_ctree+0x1c3b/0x22a0 [btrfs]
  [195191.943887]  btrfs_mount_root+0x6b4/0x800 [btrfs]
  [195191.943894]  ? rcu_read_lock_sched_held+0x3f/0x70
  [195191.943899]  ? pcpu_alloc+0x55b/0x7c0
  [195191.943906]  ? mount_fs+0x3b/0x140
  [195191.943908]  mount_fs+0x3b/0x140
  [195191.943912]  ? __init_waitqueue_head+0x36/0x50
  [195191.943916]  vfs_kern_mount+0x62/0x160
  [195191.943927]  btrfs_mount+0x134/0x890 [btrfs]
  [195191.943936]  ? rcu_read_lock_sched_held+0x3f/0x70
  [195191.943938]  ? pcpu_alloc+0x55b/0x7c0
  [195191.943943]  ? mount_fs+0x3b/0x140
  [195191.943952]  ? btrfs_remount+0x570/0x570 [btrfs]
  [195191.943954]  mount_fs+0x3b/0x140
  [195191.943956]  ? __init_waitqueue_head+0x36/0x50
  [195191.943960]  vfs_kern_mount+0x62/0x160
  [195191.943963]  do_mount+0x1f9/0xd40
  [195191.943967]  ? memdup_user+0x4b/0x70
  [195191.943971]  ksys_mount+0x7e/0xd0
  [195191.943974]  __x64_sys_mount+0x21/0x30
  [195191.943977]  do_syscall_64+0x60/0x1b0
  [195191.943980]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [195191.943983] RIP: 0033:0x7f570e4e524a
  [195191.943985] Code: 48 8b 0d 51 fc 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 
2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 
01 f0 ff ff 73 01 c3 48 8b 0d 1e fc 2a 00 f7 d8 64 89 01 48
  [195191.943986] RSP: 002b:7ffd83589478 EFLAGS: 0206 ORIG_RAX: 
00a5
  [195191.943989] RAX: ffda RBX: 563f335b2060 RCX: 
7f570e4e524a
  [195191.943990] RDX: 563f335b2240 RSI: 563f335b2280 RDI: 
563f335b2260
  [195191.943992] RBP:  R08:  R09: 
0020
  [195191.943993] R10: c0ed R11: 0206 R12: 
563f335b2260
  [195191.943994] R13: 563f335b2240 R14:  R15: 

  [195191.944002] irq event stamp: 8688
  [195191.944010] hardirqs last  enabled at (8687): [] 
console_unlock+0x503/0x640
  [195191.944012] hardirqs last disabled at (8688): [] 
trace_hardirqs_off_thunk+0x1a/0x1c
  [195191.944018] softirqs last  enabled at (8638): [] 
__set_page_dirty_nobuffers+0x101/0x150
  [195191.944020] softirqs last disabled at (8634): [] 
wb_wakeup_delayed+0x2e/0x60
  [195191.944022] ---[ end trace 5d6e873a9a0b811a ]---

This happens because the inode does not have the flag I_LINKABLE set,

Re: [PATCH 1/5] btrfs-progs: lowmem check: Add check for overlapping dev extents

2018-10-08 Thread Qu Wenruo




On 2018/10/8 下午5:28, Su Yue wrote:
> 
> 
> On 10/8/18 3:00 PM, Qu Wenruo wrote:
>> Add such check at check_dev_item(), since at that timing we're also
>> iterating dev extents for dev item accounting.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>   check/mode-lowmem.c | 32 ++--
>>   1 file changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
>> index 1bce44f5658a..d387423639e6 100644
>> --- a/check/mode-lowmem.c
>> +++ b/check/mode-lowmem.c
>> @@ -4065,6 +4065,8 @@ static int check_dev_item(struct btrfs_fs_info
>> *fs_info,
>>   u64 dev_id;
>>   u64 used;
>>   u64 total = 0;
>> +    u64 prev_devid = 0;
>> +    u64 prev_dev_ext_end = 0;
> 
> Those two new variables aren't assigned anymore in the patch...

Oh, what I'm doing... (palm face

I'll fix this with the test case bug.

Thanks for reviewing,
Qu

> 
> Thanks,
> Su
>>   int ret;
>>     dev_item = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);
>> @@ -4086,8 +4088,16 @@ static int check_dev_item(struct btrfs_fs_info
>> *fs_info,
>>   return REFERENCER_MISSING;
>>   }
>>   -    /* Iterate dev_extents to calculate the used space of a device */
>> +    /*
>> + * Iterate dev_extents to calculate the used space of a device
>> + *
>> + * Also make sure no dev extents overlap and end beyond device
>> boundary
>> + */
>>   while (1) {
>> +    u64 devid;
>> +    u64 physical_offset;
>> +    u64 physical_len;
>> +
>>   if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
>>   goto next;
>>   @@ -4099,7 +4109,25 @@ static int check_dev_item(struct
>> btrfs_fs_info *fs_info,
>>     ptr = btrfs_item_ptr(path.nodes[0], path.slots[0],
>>    struct btrfs_dev_extent);
>> -    total += btrfs_dev_extent_length(path.nodes[0], ptr);
>> +    devid = key.objectid;
>> +    physical_offset = key.offset;
>> +    physical_len = btrfs_dev_extent_length(path.nodes[0], ptr);
>> +
>> +    if (prev_devid == devid && physical_offset < prev_dev_ext_end) {
>> +    error(
>> +"dev extent devid %llu offset %llu len %llu overlap with previous dev
>> extent end %llu",
>> +  devid, physical_offset, physical_len,
>> +  prev_dev_ext_end);
>> +    return ACCOUNTING_MISMATCH;
>> +    }
>> +    if (physical_offset + physical_len > total_bytes) {
>> +    error(
>> +"dev extent devid %llu offset %llu len %llu is beyond device boundary
>> %llu",
>> +  devid, physical_offset, physical_len,
>> +  total_bytes);
>> +    return ACCOUNTING_MISMATCH;
>> +    }
>> +    total += physical_len;
>>   next:
>>   ret = btrfs_next_item(dev_root, );
>>   if (ret)
>>
> 
>

[PATCH] generic: test mounting filesystem after fsync of a tmpfile

2018-10-08 Thread fdmanana

From: Filipe Manana 

Test that if we fsync a tmpfile, without adding a hard link to it, and
then power fail, we will be able to mount the filesystem without
triggering any crashes, warnings or corruptions.

This test is motivated by an issue in btrfs where this scenario triggered
a warning (without any side effects). The following linux kernel patch
fixes the issue in btrfs:

  "Btrfs: fix warning when replaying log after fsync of a tmpfile"

Signed-off-by: Filipe Manana 
---
 tests/generic/506 | 58 +++
 tests/generic/506.out |  3 +++
 tests/generic/group   |  1 +
 3 files changed, 62 insertions(+)
 create mode 100755 tests/generic/506
 create mode 100644 tests/generic/506.out

diff --git a/tests/generic/506 b/tests/generic/506
new file mode 100755
index ..7d28d3b0
--- /dev/null
+++ b/tests/generic/506
@@ -0,0 +1,58 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test No. 506
+#
+# Test that if we fsync a tmpfile, without adding a hard link to it, and then
+# power fail, we will be able to mount the filesystem without triggering any
+# crashes, warnings or corruptions.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "-T"
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+
+# Create our tmpfile, write some data to it and fsync it. We want a power
+# failure to happen after the fsync, so that we have an inode with a link
+# count of 0 in our log/journal.
+$XFS_IO_PROG -T \
+   -c "pwrite -S 0xab 0 64K" \
+   -c "fsync" \
+   $SCRATCH_MNT | _filter_xfs_io
+
+# Simulate a power failure and mount the filesystem to check that it succeeds.
+_flakey_drop_and_remount
+
+_unmount_flakey
+
+status=0
+exit
diff --git a/tests/generic/506.out b/tests/generic/506.out
new file mode 100644
index ..f522e663
--- /dev/null
+++ b/tests/generic/506.out
@@ -0,0 +1,3 @@
+QA output created by 506
+wrote 65536/65536 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/generic/group b/tests/generic/group
index 4da0e188..2e2a6247 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -508,3 +508,4 @@
 503 auto quick dax punch collapse zero
 504 auto quick locks
 505 shutdown auto quick metadata
+506 auto quick log
-- 
2.11.0

Re: [PATCH 1/5] btrfs-progs: lowmem check: Add check for overlapping dev extents

2018-10-08 Thread Su Yue





On 10/8/18 3:00 PM, Qu Wenruo wrote:

Add such check at check_dev_item(), since at that timing we're also
iterating dev extents for dev item accounting.

Signed-off-by: Qu Wenruo 
---
  check/mode-lowmem.c | 32 ++--
  1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 1bce44f5658a..d387423639e6 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4065,6 +4065,8 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
u64 dev_id;
u64 used;
u64 total = 0;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;


Those two new variables aren't assigned anymore in the patch...

Thanks,
Su

int ret;
  
  	dev_item = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);

@@ -4086,8 +4088,16 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
return REFERENCER_MISSING;
}
  
-	/* Iterate dev_extents to calculate the used space of a device */

+   /*
+* Iterate dev_extents to calculate the used space of a device
+*
+* Also make sure no dev extents overlap and end beyond device boundary
+*/
while (1) {
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
goto next;
  
@@ -4099,7 +4109,25 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
  
  		ptr = btrfs_item_ptr(path.nodes[0], path.slots[0],

 struct btrfs_dev_extent);
-   total += btrfs_dev_extent_length(path.nodes[0], ptr);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], ptr);
+
+   if (prev_devid == devid && physical_offset < prev_dev_ext_end) {
+   error(
+"dev extent devid %llu offset %llu len %llu overlap with previous dev extent end 
%llu",
+ devid, physical_offset, physical_len,
+ prev_dev_ext_end);
+   return ACCOUNTING_MISMATCH;
+   }
+   if (physical_offset + physical_len > total_bytes) {
+   error(
+"dev extent devid %llu offset %llu len %llu is beyond device boundary %llu",
+ devid, physical_offset, physical_len,
+ total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
+   total += physical_len;
  next:
ret = btrfs_next_item(dev_root, );
if (ret)

Re: [PATCH 5/5] btrfs-progs: fsck-tests: Add test image for dev extents beyond device boundary

2018-10-08 Thread Su Yue





On 10/8/18 3:00 PM, Qu Wenruo wrote:

Now two locations can detect such problem, either by device item
used/total bytes check, or by early dev extents check against device
boundary.

The image is hand-crafted image which uses DATA SINGLE chunk to feed
btrfs check.
As expected, as long as block group item, chunk item, device used bytes
matches, older btrfs check can't detect such problem.

Signed-off-by: Qu Wenruo 
---
  .../over_dev_boundary.img.xz  | Bin 0 -> 1640 bytes
  tests/fsck-tests/036-bad-dev-extents/test.sh  |  19 ++
  2 files changed, 19 insertions(+)
  create mode 100644 
tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
  create mode 100755 tests/fsck-tests/036-bad-dev-extents/test.sh

diff --git a/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz 
b/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
new file mode 100644
index 
..47cb2a707b0097e369dc088ed0549f847995f136
GIT binary patch
literal 1640
zcmV-u2ABE$H+ooF000E$*0e?f03iVu0001VFXf})3;zZsT>wRyj;C3^v%$$4d1oRm
zhA1@4%tH=9jYF%IQSpIUKDpjXLRl?q4p$q;1zsY^#9_Lx=#tbGm>@S#e2aqt!?0}y
z2BPO%4~c9Q5)jKFD7}DURarKL)`^j{f?s>sEHdzmSvX^98%kGi<_8
zMnXynsC*B7}KE(6w>*wfdb|tw$kt^y>W+TB*?pon1P+)#u#?6bSIG)TooJx~u$#
zf^+}xKw`BfI}6=717S~Q%LW1kwY`pz9H{`uNNNFOk2w!VauNEaMfoLj)Z<)!1?F60
zJ+OEA|4$a=9W#XX*l{EG!j^s}p|
z0i#_%$Q}d)-EE8#8O5^x$$8Y0l2
zc19e0Et3O3m`pMOqEkasL8VGE+u~2lZn>sRCRj169Z6mQ3*+`D+C#F1V7POV%lx(9cB{WN*9OP%Zbd1VDn(S4HX^ad4-b~#H
z@9eUP4AAU`)yRf!k+rrrLSYfBSEi6RE#HtbqyPl11S>RCH
zqvJbt22t`FmU^tmTb+7LtpybCB-x1lGSlrpVQ9|6WNBs~q*M-to1gD1l41oiy~}+O?!{68Jvim2*%BznW)@B=3IH)Xi1795q{#>sF*y^0T2@b$`Wqwch?BgN}IR9Ui
z;!cs)hJFGsJmFaiUsYrN$c0^BLU^n-B%fagn+jR{?Dq+K%VyMG@pmAOShFY)k8zBxm@7YD
zb^ZXU;<`+_B+IV{+A~1Ku
zWggqWQd{E%hF}W2ArPwJ33zWP>MIe;YDN8WV+|+a4_c>yFN#ZGN00G#%3HYi
z4pePTnc{8k5CjwuN6hudRFcBkvB^pSFufzaaD`-;mPgJ~wr(
I checked to your branch and ran test but failed.
Should it be run_must_fail instead?

+}
+
+check_all_images

[PATCH 4/5] btrfs-progs: original check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Qu Wenruo

Signed-off-by: Qu Wenruo 
---
 check/main.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/check/main.c b/check/main.c
index ff9a785ce555..12f12e18a83f 100644
--- a/check/main.c
+++ b/check/main.c
@@ -7938,6 +7938,12 @@ static int check_device_used(struct device_record 
*dev_rec,
struct device_extent_record *dev_extent_rec;
u64 total_byte = 0;
 
+   if (dev_rec->byte_used > dev_rec->total_byte) {
+   error("device %llu has incorrect used bytes %llu > total bytes 
%llu",
+ dev_rec->devid, dev_rec->byte_used, dev_rec->total_byte);
+   return -EUCLEAN;
+   }
+
cache = search_cache_extent2(_cache->tree, dev_rec->devid, 0);
while (cache) {
dev_extent_rec = container_of(cache,
-- 
2.19.0

[PATCH 2/5] btrfs-progs: original check: Add ability to detect bad dev extents

2018-10-08 Thread Qu Wenruo

Unlike lowmem mode check, we don't have good place for original mode to
check overlap dev extents.

So this patch introduces a new function, btrfs_check_dev_extents(), to
handle possible bad dev extents.

Reported-by: Hans van Kranenburg 
Signed-off-by: Qu Wenruo 
---
 check/main.c | 99 
 1 file changed, 99 insertions(+)

diff --git a/check/main.c b/check/main.c
index bc2ee22f7943..ff9a785ce555 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8224,6 +8224,99 @@ out:
return ret;
 }
 
+/*
+ * Check if all dev extents are valid (not overlap nor beyond device
+ * boundary).
+ *
+ * Dev extents <-> chunk cross checking is already done in check_chunks().
+ */
+static int check_dev_extents(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_root *dev_root = fs_info->dev_root;
+   int ret;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
+
+   btrfs_init_path();
+
+   key.objectid = 1;
+   key.type = BTRFS_DEV_EXTENT_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, dev_root, , , 0, 0);
+   if (ret < 0) {
+   error("failed to search device tree: %s", strerror(-ret));
+   goto out;
+   }
+   if (path.slots[0] >= btrfs_header_nritems(path.nodes[0])) {
+   ret = btrfs_next_leaf(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   goto out;
+   }
+   }
+
+   while (1) {
+   struct btrfs_dev_extent *dev_ext;
+   struct btrfs_device *dev;
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
+   btrfs_item_key_to_cpu(path.nodes[0], , path.slots[0]);
+   if (key.type != BTRFS_DEV_EXTENT_KEY)
+   break;
+   dev_ext = btrfs_item_ptr(path.nodes[0], path.slots[0],
+struct btrfs_dev_extent);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], dev_ext);
+
+   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
+   if (!dev) {
+   error("failed to find device with devid %llu", devid);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (prev_devid == devid && prev_dev_ext_end > physical_offset) {
+   error(
+"dev extent devid %llu physical offset %llu overlap with previous dev extent 
end %llu",
+ devid, physical_offset, prev_dev_ext_end);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   if (physical_offset + physical_len > dev->total_bytes) {
+   error(
+"dev extent devid %llu physical offset %llu len %llu is beyond device boudnary 
%llu",
+ devid, physical_offset, physical_len,
+ dev->total_bytes);
+   ret = -EUCLEAN;
+   goto out;
+   }
+   prev_devid = devid;
+   prev_dev_ext_end = physical_offset + physical_len;
+
+   ret = btrfs_next_item(dev_root, );
+   if (ret < 0) {
+   error("failed to find next leaf: %s", strerror(-ret));
+   goto out;
+   }
+   if (ret > 0) {
+   ret = 0;
+   break;
+   }
+   }
+out:
+   btrfs_release_path();
+   return ret;
+}
+
 static int check_chunks_and_extents(struct btrfs_fs_info *fs_info)
 {
struct rb_root dev_cache;
@@ -8318,6 +8411,12 @@ again:
goto out;
}
 
+   ret = check_dev_extents(fs_info);
+   if (ret < 0) {
+   err = ret;
+   goto out;
+   }
+
ret = check_chunks(_cache, _group_cache,
   _extent_cache, NULL, NULL, NULL, 0);
if (ret) {
-- 
2.19.0

[PATCH 3/5] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-08 Thread Qu Wenruo

Obviously, used bytes can't be larger than total bytes.

Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index d387423639e6..c50e34236ac8 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
used = btrfs_device_bytes_used(eb, dev_item);
total_bytes = btrfs_device_total_bytes(eb, dev_item);
 
+   if (used > total_bytes) {
+   error("device %llu has incorrect used bytes %llu > total bytes 
%llu",
+   dev_id, used, total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
key.objectid = dev_id;
key.type = BTRFS_DEV_EXTENT_KEY;
key.offset = 0;
-- 
2.19.0

[PATCH 5/5] btrfs-progs: fsck-tests: Add test image for dev extents beyond device boundary

2018-10-08 Thread Qu Wenruo

Now two locations can detect such problem, either by device item
used/total bytes check, or by early dev extents check against device
boundary.

The image is hand-crafted image which uses DATA SINGLE chunk to feed
btrfs check.
As expected, as long as block group item, chunk item, device used bytes
matches, older btrfs check can't detect such problem.

Signed-off-by: Qu Wenruo 
---
 .../over_dev_boundary.img.xz  | Bin 0 -> 1640 bytes
 tests/fsck-tests/036-bad-dev-extents/test.sh  |  19 ++
 2 files changed, 19 insertions(+)
 create mode 100644 
tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
 create mode 100755 tests/fsck-tests/036-bad-dev-extents/test.sh

diff --git a/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz 
b/tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
new file mode 100644
index 
..47cb2a707b0097e369dc088ed0549f847995f136
GIT binary patch
literal 1640
zcmV-u2ABE$H+ooF000E$*0e?f03iVu0001VFXf})3;zZsT>wRyj;C3^v%$$4d1oRm
zhA1@4%tH=9jYF%IQSpIUKDpjXLRl?q4p$q;1zsY^#9_Lx=#tbGm>@S#e2aqt!?0}y
z2BPO%4~c9Q5)jKFD7}DURarKL)`^j{f?s>sEHdzmSvX^98%kGi<_8
zMnXynsC*B7}KE(6w>*wfdb|tw$kt^y>W+TB*?pon1P+)#u#?6bSIG)TooJx~u$#
zf^+}xKw`BfI}6=717S~Q%LW1kwY`pz9H{`uNNNFOk2w!VauNEaMfoLj)Z<)!1?F60
zJ+OEA|4$a=9W#XX*l{EG!j^s}p|
z0i#_%$Q}d)-EE8#8O5^x$$8Y0l2
zc19e0Et3O3m`pMOqEkasL8VGE+u~2lZn>sRCRj169Z6mQ3*+`D+C#F1V7POV%lx(9cB{WN*9OP%Zbd1VDn(S4HX^ad4-b~#H
z@9eUP4AAU`)yRf!k+rrrLSYfBSEi6RE#HtbqyPl11S>RCH
zqvJbt22t`FmU^tmTb+7LtpybCB-x1lGSlrpVQ9|6WNBs~q*M-to1gD1l41oiy~}+O?!{68Jvim2*%BznW)@B=3IH)Xi1795q{#>sF*y^0T2@b$`Wqwch?BgN}IR9Ui
z;!cs)hJFGsJmFaiUsYrN$c0^BLU^n-B%fagn+jR{?Dq+K%VyMG@pmAOShFY)k8zBxm@7YD
zb^ZXU;<`+_B+IV{+A~1Ku
zWggqWQd{E%hF}W2ArPwJ33zWP>MIe;YDN8WV+|+a4_c>yFN#ZGN00G#%3HYi
z4pePTnc{8k5CjwuN6hudRFcBkvB^pSFufzaaD`-;mPgJ~wr(

[PATCH 1/5] btrfs-progs: lowmem check: Add check for overlapping dev extents

2018-10-08 Thread Qu Wenruo

Add such check at check_dev_item(), since at that timing we're also
iterating dev extents for dev item accounting.

Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 32 ++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 1bce44f5658a..d387423639e6 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4065,6 +4065,8 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
u64 dev_id;
u64 used;
u64 total = 0;
+   u64 prev_devid = 0;
+   u64 prev_dev_ext_end = 0;
int ret;
 
dev_item = btrfs_item_ptr(eb, slot, struct btrfs_dev_item);
@@ -4086,8 +4088,16 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
return REFERENCER_MISSING;
}
 
-   /* Iterate dev_extents to calculate the used space of a device */
+   /*
+* Iterate dev_extents to calculate the used space of a device
+*
+* Also make sure no dev extents overlap and end beyond device boundary
+*/
while (1) {
+   u64 devid;
+   u64 physical_offset;
+   u64 physical_len;
+
if (path.slots[0] >= btrfs_header_nritems(path.nodes[0]))
goto next;
 
@@ -4099,7 +4109,25 @@ static int check_dev_item(struct btrfs_fs_info *fs_info,
 
ptr = btrfs_item_ptr(path.nodes[0], path.slots[0],
 struct btrfs_dev_extent);
-   total += btrfs_dev_extent_length(path.nodes[0], ptr);
+   devid = key.objectid;
+   physical_offset = key.offset;
+   physical_len = btrfs_dev_extent_length(path.nodes[0], ptr);
+
+   if (prev_devid == devid && physical_offset < prev_dev_ext_end) {
+   error(
+"dev extent devid %llu offset %llu len %llu overlap with previous dev extent 
end %llu",
+ devid, physical_offset, physical_len,
+ prev_dev_ext_end);
+   return ACCOUNTING_MISMATCH;
+   }
+   if (physical_offset + physical_len > total_bytes) {
+   error(
+"dev extent devid %llu offset %llu len %llu is beyond device boundary %llu",
+ devid, physical_offset, physical_len,
+ total_bytes);
+   return ACCOUNTING_MISMATCH;
+   }
+   total += physical_len;
 next:
ret = btrfs_next_item(dev_root, );
if (ret)
-- 
2.19.0

[PATCH 0/5] btrfs-progs: check: Detect invalid dev extents and device items

2018-10-08 Thread Qu Wenruo

This patchset can be fetch from github:
https://github.com/adam900710/btrfs-progs/tree/dev_extents_check

Hans van Kranenburg reported a case where btrfs DUP chunk allocator
could allocate invalid dev extents, either overlaps with existing dev
extents or beyond device boundary.

This patchset enhances the btrfs-progs side to detect such problems.
With hand crafted test image for it.

Qu Wenruo (5):
  btrfs-progs: lowmem check: Add check for overlapping dev extents
  btrfs-progs: original check: Add ability to detect bad dev extents
  btrfs-progs: lowmem check: Add dev_item check for used bytes and total
bytes
  btrfs-progs: original check: Add dev_item check for used bytes and
total bytes
  btrfs-progs: fsck-tests: Add test image for dev extents beyond device
boundary

 check/main.c  | 105 ++
 check/mode-lowmem.c   |  37 +-
 .../over_dev_boundary.img.xz  | Bin 0 -> 1640 bytes
 tests/fsck-tests/036-bad-dev-extents/test.sh  |  19 
 4 files changed, 159 insertions(+), 2 deletions(-)
 create mode 100644 
tests/fsck-tests/036-bad-dev-extents/over_dev_boundary.img.xz
 create mode 100755 tests/fsck-tests/036-bad-dev-extents/test.sh

-- 
2.19.0

Re: [PATCH 0/6] Chunk allocator DUP fix and cleanups

2018-10-08 Thread Qu Wenruo



On 2018/10/5 下午6:58, Hans van Kranenburg wrote:
> On 10/05/2018 09:51 AM, Qu Wenruo wrote:
>>
>>
>> On 2018/10/5 上午5:24, Hans van Kranenburg wrote:
>>> This patch set contains an additional fix for a newly exposed bug after
>>> the previous attempt to fix a chunk allocator bug for new DUP chunks:
>>>
>>> https://lore.kernel.org/linux-btrfs/782f6000-30c0-0085-abd2-74ec5827c...@mendix.com/T/#m609ccb5d32998e8ba5cfa9901c1ab56a38a6f374
>>
>> For that bug, did you succeeded in reproducing the bug?
> 
> Yes, open the above link and scroll to "Steps to reproduce".

That's beyond device boundary one. Also reproduced here.
And hand-crafted a super small image as test case for btrfs-progs.

But I'm a little curious about the dev extent overlapping case.
Have you got one?

Thanks,
Qu

> 
> o/
> 
>> I'm adding dev extent overlap checking in btrfs_verify_dev_extents() and
>> btrfs-progs.
>> I think it could help to detect such problem.
>>
>> Thanks,
>> Qu
>>
>>>
>>> The DUP fix is "fix more DUP stripe size handling". I did that one
>>> before starting to change more things so it can be applied to earlier
>>> LTS kernels.
>>>
>>> Besides that patch, which is fixing the bug in a way that is least
>>> intrusive, I added a bunch of other patches to help getting the chunk
>>> allocator code in a state that is a bit less error-prone and
>>> bug-attracting.
>>>
>>> When running this and trying the reproduction scenario, I can now see
>>> that the created DUP device extent is 827326464 bytes long, which is
>>> good.
>>>
>>> I wrote and tested this on top of linus 4.19-rc5. I still need to create
>>> a list of related use cases and test more things to at least walk
>>> through a bunch of obvious use cases to see if there's nothing exploding
>>> too quickly with these changes. However, I'm quite confident about it,
>>> so I'm sharing all of it already.
>>>
>>> Any feedback and review is appreciated. Be gentle and keep in mind that
>>> I'm still very much in a learning stage regarding kernel development.
>>>
>>> The stable patches handling workflow is not 100% clear to me yet. I
>>> guess I have to add a Fixes: in the DUP patch which points to the
>>> previous commit 92e222df7b.
>>>
>>> Moo!,
>>> Knorrie
>>>
>>> Hans van Kranenburg (6):
>>>   btrfs: alloc_chunk: do not refurbish num_bytes
>>>   btrfs: alloc_chunk: improve chunk size variable name
>>>   btrfs: alloc_chunk: fix more DUP stripe size handling
>>>   btrfs: fix ncopies raid_attr for RAID56
>>>   btrfs: introduce nparity raid_attr
>>>   btrfs: alloc_chunk: rework chunk/stripe calculations
>>>
>>>  fs/btrfs/volumes.c | 84 +++---
>>>  fs/btrfs/volumes.h |  4 ++-
>>>  2 files changed, 45 insertions(+), 43 deletions(-)
>>>
>>
> 
> 



signature.asc
Description: OpenPGP digital signature

56 matches

Mail list logo