Re: Re-mounting removable btrfs on different device

2018-09-06 Thread Remi Gauvin
On 2018-09-06 11:32 PM, Duncan wrote:

> Without the mentioned patches, the only way (other than reboot) is to 
> remove and reinsert the btrfs kernel module (assuming it's a module, not 
> built-in), thus forcing it to forget state.
> 
> Of course if other critical mounted filesystems (such as root) are btrfs, 
> or if btrfs is a kernel-built-in not a module and thus can't be removed, 
> the above doesn't work and a reboot is necessary.  Thus the need for 
> those patches you mentioned.
> 

Good to know, thanks.

Re: dduper - Offline btrfs deduplication tool

2018-09-06 Thread Lakshmipathi.G
> 
> One question:
> Why not ioctl_fideduperange?
> i.e. you kill most of benefits from that ioctl - atomicity.
> 
I plan to add fideduperange as an option too. User can
choose between fideduperange and ficlonerange call.

If I'm not wrong, with fideduperange, the kernel performs a
comparison check before deduping. And that will increase the
time it takes to dedupe files.

I believe the risk involved with ficlonerange is minimized
by keeping a backup (reflinked) file. We can revert to the
original file if we encounter problems.
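For readers unfamiliar with the ioctl being discussed, here is a rough userspace sketch of driving FIDEDUPERANGE. The struct layouts and ioctl number are taken from linux/fs.h, not from this thread, and the helper name is my own. The kernel's byte-for-byte verification of the two ranges before sharing them is exactly the comparison cost (and the atomicity benefit) mentioned above.

```python
import fcntl
import struct

# Layouts mirror struct file_dedupe_range / file_dedupe_range_info
# in linux/fs.h (assumed here, not quoted from this thread).
RANGE_FMT = "=QQHHI"        # src_offset, src_length, dest_count, resv16, resv32
INFO_FMT = "=qQQiI"         # dest_fd, dest_offset, bytes_deduped, status, resv
FIDEDUPERANGE = 0xC0189436  # _IOWR(0x94, 54, struct file_dedupe_range)

FILE_DEDUPE_RANGE_DIFFERS = 1  # status when the ranges are not identical

def dedupe_range(src_fd, src_off, length, dest_fd, dest_off):
    """Ask the kernel to share one extent between two files.

    The kernel locks both ranges and compares them byte-for-byte
    before sharing, so a concurrent writer can't corrupt the result;
    that compare is the extra time cost discussed above.
    Returns (bytes_deduped, status) from the info record.
    """
    buf = bytearray(struct.pack(RANGE_FMT, src_off, length, 1, 0, 0))
    buf += struct.pack(INFO_FMT, dest_fd, dest_off, 0, 0, 0)
    fcntl.ioctl(src_fd, FIDEDUPERANGE, buf)
    # bytes_deduped sits 16 bytes into the info record; status follows it.
    bytes_deduped, status = struct.unpack_from("=Qi", buf, 24 + 16)
    return bytes_deduped, status
```

Calling it requires both files on a filesystem with reflink support (btrfs, XFS); elsewhere the ioctl fails with EOPNOTSUPP.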

> 
> -- 
> Have a nice day,
> Timofey.

Cheers.
Lakshmipathi.G


Re: Re-mounting removable btrfs on different device

2018-09-06 Thread Duncan
Remi Gauvin posted on Thu, 06 Sep 2018 20:54:17 -0400 as excerpted:

> I'm trying to use a BTRFS filesystem on a removable drive.
> 
> The first time the drive was added to the system, it was /dev/sdb
> 
> Files were added and device unmounted without error.
> 
> But when I re-attach the drive, it becomes /dev/sdg (kernel is fussy
> about re-using /dev/sdb).
> 
> btrfs fi show: output:
> 
> Label: 'Archive 01'  uuid: 221222e7-70e7-4d67-9aca-42eb134e2041
>   Total devices 1 FS bytes used 515.40GiB
>   devid1 size 931.51GiB used 522.02GiB path /dev/sdg1
> 
> This causes BTRFS to fail mounting the device [errors snipped]

> I've seen some patches on this list to add a btrfs device forget option,
> which I presume would help with a situation like this.  Is there a way
> to do that manually?

Without the mentioned patches, the only way (other than reboot) is to 
remove and reinsert the btrfs kernel module (assuming it's a module, not 
built-in), thus forcing it to forget state.

Of course if other critical mounted filesystems (such as root) are btrfs, 
or if btrfs is a kernel-built-in not a module and thus can't be removed, 
the above doesn't work and a reboot is necessary.  Thus the need for 
those patches you mentioned.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 2:16 PM, Stefan Loewen  wrote:

> Data,single: Size:695.01GiB, Used:653.69GiB
> /dev/sdb1 695.01GiB
> Metadata,DUP: Size:4.00GiB, Used:2.25GiB
> /dev/sdb1   8.00GiB
> System,DUP: Size:40.00MiB, Used:96.00KiB


> Does that mean Metadata is duplicated?

Yes. Single copy for data. Duplicate for metadata+system, and there
are no single chunks for metadata/system.

>
> Ok so to summarize and see if I understood you correctly:
> There are bad sectors on disk. Running an extended selftest (smartctl -t
> long) could find those and replace them with spare sectors.

More likely if it finds a persistently failing sector, it will just
record the first failing sector LBA in its log, and then abort. You'll
see this info with 'smartctl -a' or with -x.

It is possible to resume the test using the selective option, picking a
4K-aligned 512-byte LBA value after the 4K sector with the defect.
Just because only one bad sector is reported in dmesg doesn't mean
there aren't more.
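To make the "pick a 4K-aligned 512-byte LBA after the defect" step concrete, here is a small sketch of the arithmetic (my own illustration, not from the original mail), assuming the usual 8 logical 512-byte sectors per 4K physical sector:

```python
LOGICAL_PER_PHYSICAL = 8  # 512-byte LBAs per 4K physical sector

def next_selective_start(bad_lba):
    """Return the first 512-byte LBA of the 4K physical sector *after*
    the one containing bad_lba, usable as the start of a resumed
    selective self-test, e.g. smartctl -t select,<start>-max."""
    return (bad_lba // LOGICAL_PER_PHYSICAL + 1) * LOGICAL_PER_PHYSICAL

# Using the failing LBA from the dmesg excerpt elsewhere in this thread:
print(next_selective_start(354853072))  # → 354853080
```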

It's unlikely the long test is going to actually fix anything; it'll
just give you more ammunition for getting a likely-under-warranty
device replaced, because it really shouldn't have any issues at this
age.


> If it does not I can try calculating the physical (4K) sector number and
> write to that to make the drive notice and mark the bad sector.
> Is there a way to find out which file I will be writing to beforehand?

I'm not sure how to do it easily.

>Or is
> it easier to just write to the sector and then wait for scrub to tell me
> (and the sector is broken anyways)?

If it's a persistent read error, then it's lost. So you might as well
overwrite it. If it's data, scrub will tell you what file is corrupted
(and restore can help you recover the whole file, of course it'll have
a 4K hole of zeros in it). If it's metadata, Btrfs will fix up the 4K
hole with duplicate metadata.

Gotcha is to make certain you've got the right LBA to write to. You
can use dd to test this, by reading the suspect bad sector, and if
you've got the right one, you'll get an I/O error in user space and
dmesg will have a message like before with sector value. Use the dd
skip= flag for reading, but make *sure* you use seek= when writing
*and* make sure you always use bs=4096 count=1 so that if you make a
mistake you limit the damage haha.
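As a sketch of that bookkeeping (my own example, not from the mail): convert the 512-byte LBA from dmesg into a 4K block index, then use the same index for both the skip= read test and the seek= overwrite, so the two dd invocations are guaranteed to address the same physical sector. The device path is a placeholder.

```python
def dd_for_bad_sector(bad_lba, dev="/dev/sdX"):
    """Build the read-test and overwrite dd commands for the 4K
    physical sector containing the 512-byte LBA reported in dmesg.
    bs=4096 count=1 caps the blast radius at one 4K sector."""
    blk = bad_lba // 8  # index of the enclosing 4K block
    read_cmd = f"dd if={dev} of=/dev/null bs=4096 count=1 skip={blk}"
    write_cmd = f"dd if=/dev/zero of={dev} bs=4096 count=1 seek={blk}"
    return blk, read_cmd, write_cmd

blk, read_cmd, write_cmd = dd_for_bad_sector(354853072)
print(blk)  # → 44356634
print(read_cmd)
```

If the read command errors in user space and dmesg logs the same sector, you have the right block; only then run the write command.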

>
> For the drive: Not under warranty anymore. It's an external HDD that I had
> lying around for years, mostly unused. Now I wanted to use it as part of my
> small DIY NAS.

Gotcha. Well you can read up on smartctl and smartd, and set it up for
regular extended tests, and keep an eye on rapidly changing values. It
might give you a 50/50 chance of an early heads up before it dies.

I've got an old Hitachi/Apple laptop drive that years ago developed
multiple bad sectors in different zones of the drive. They got
remapped and I haven't had a problem with that drive since. *shrug*
And in fact I did get a discrete error message from the drive for one
of those and Btrfs overwrote that bad sector with a good copy (it's in
a raid1 volume), so working as designed I guess.

Since you didn't get a fix up message from Btrfs, either the whole
thing just got confused with hanging tasks, or it's possible it's a
data block.


-- 
Chris Murphy


Re-mounting removable btrfs on different device

2018-09-06 Thread Remi Gauvin
I'm trying to use a BTRFS filesystem on a removable drive.

The first time the drive was added to the system, it was /dev/sdb

Files were added and device unmounted without error.

But when I re-attach the drive, it becomes /dev/sdg (kernel is fussy
about re-using /dev/sdb).

btrfs fi show: output:

Label: 'Archive 01'  uuid: 221222e7-70e7-4d67-9aca-42eb134e2041
Total devices 1 FS bytes used 515.40GiB
devid1 size 931.51GiB used 522.02GiB path /dev/sdg1

This causes BTRFS to fail mounting the device with the following errors:

sd 3:0:0:0: [sdg] Attached SCSI disk
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 1, flush 0,
corrupt 0, gen 0
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 2, flush 0,
corrupt 0, gen 0
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 3, flush 0,
corrupt 0, gen 0
blk_partition_remap: fail for partition 1
BTRFS error (device sdb1): bdev /dev/sdg1 errs: wr 0, rd 4, flush 0,
corrupt 0, gen 0
ata4: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
ata4: irq_stat 0x00400040, connection status changed
ata4: SError: { HostInt PHYRdyChg 10B8B DevExch }


I've seen some patches on this list to add a btrfs device forget option,
which I presume would help with a situation like this.  Is there a way
to do that manually?

[PATCH 3/3] btrfs: keep trim from interfering with transaction commits

2018-09-06 Thread jeffm
From: Jeff Mahoney 

Commit 499f377f49f08 (btrfs: iterate over unused chunk space in FITRIM)
fixed free space trimming, but introduced latency when it was running.
This is due to it pinning the transaction using both an incremented
refcount and holding the commit root sem for the duration of a single
trim operation.

This was to ensure safety, but it's unnecessary.  We already hold the
chunk mutex, so we know that the chunk we're using can't be allocated
while we're trimming it.

In order to check against chunks allocated already in this transaction,
we need to check the pending chunks list.  To do that safely without
joining the transaction (or attaching and then having to commit it) we
need to ensure that the dev root's commit root doesn't change underneath
us and the pending chunk list stays around until we're done with it.

We can ensure the former by holding the commit root sem and the latter
by pinning the transaction.  We do this now, but the critical section
covers the trim operation itself and we don't need to do that.

This patch moves the pinning and unpinning logic into helpers and
unpins the transaction after performing the search and check for
pending chunks.

Limiting the critical section of the transaction pinning improves
the latency substantially on slower storage (e.g. image files over NFS).

Fixes: 499f377f49f08 (btrfs: iterate over unused chunk space in FITRIM)
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 92e5e9fd9bdd..8dc8e090667c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10870,14 +10870,16 @@ int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info,
  * We don't want a transaction for this since the discard may take a
  * substantial amount of time.  We don't require that a transaction be
  * running, but we do need to take a running transaction into account
- * to ensure that we're not discarding chunks that were released in
- * the current transaction.
+ * to ensure that we're not discarding chunks that were released or
+ * allocated in the current transaction.
  *
  * Holding the chunks lock will prevent other threads from allocating
  * or releasing chunks, but it won't prevent a running transaction
  * from committing and releasing the memory that the pending chunks
  * list head uses.  For that, we need to take a reference to the
- * transaction.
+ * transaction and hold the commit root sem.  We only need to hold
+ * it while performing the free space search since we have already
+ * held back allocations.
  */
 static int btrfs_trim_free_extents(struct btrfs_device *device,
   u64 minlen, u64 *trimmed)
@@ -10908,9 +10910,13 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 
ret = mutex_lock_interruptible(&fs_info->chunk_mutex);
if (ret)
-   return ret;
+   break;
 
-   down_read(&fs_info->commit_root_sem);
+   ret = down_read_killable(&fs_info->commit_root_sem);
+   if (ret) {
+   mutex_unlock(&fs_info->chunk_mutex);
+   break;
+   }
 
spin_lock(&fs_info->trans_lock);
trans = fs_info->running_transaction;
if (trans)
@@ -10918,13 +10924,17 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
refcount_inc(&trans->use_count);
spin_unlock(&fs_info->trans_lock);
 
+   if (!trans)
+   up_read(&fs_info->commit_root_sem);
+
ret = find_free_dev_extent_start(trans, device, minlen, start,
 &start, &len);
-   if (trans)
+   if (trans) {
+   up_read(&fs_info->commit_root_sem);
btrfs_put_transaction(trans);
+   }
 
if (ret) {
-   up_read(&fs_info->commit_root_sem);
mutex_unlock(&fs_info->chunk_mutex);
if (ret == -ENOSPC)
ret = 0;
@@ -10932,7 +10942,6 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
}
 
ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
-   up_read(&fs_info->commit_root_sem);
mutex_unlock(&fs_info->chunk_mutex);
 
if (ret)
-- 
2.12.3



[PATCH 1/3] btrfs: use ->devices list instead of ->alloc_list in btrfs_trim_fs

2018-09-06 Thread jeffm
From: Jeff Mahoney 

btrfs_trim_fs iterates over the fs_devices->alloc_list while holding
the device_list_mutex.  The problem is that ->alloc_list is protected
by the chunk mutex.  We don't want to hold the chunk mutex over
the trim of the entire file system.  Fortunately, the ->dev_list
list is protected by the dev_list mutex and while it will give us
all devices, including read-only devices, we already just skip the
read-only devices.  Then we can continue to take and release the chunk
mutex while scanning each device.

Fixes: 499f377f49f (btrfs: iterate over unused chunk space in FITRIM)
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..a0e82589c3e8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -11008,8 +11008,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
}
 
mutex_lock(&fs_info->fs_devices->device_list_mutex);
-   devices = &fs_info->fs_devices->alloc_list;
-   list_for_each_entry(device, devices, dev_alloc_list) {
+   devices = &fs_info->fs_devices->devices;
+   list_for_each_entry(device, devices, dev_list) {
ret = btrfs_trim_free_extents(device, range->minlen,
  &group_trimmed);
if (ret)
-- 
2.12.3



[PATCH 0/3] btrfs: trim latency improvements

2018-09-06 Thread jeffm
From: Jeff Mahoney 

This patch set fixes a few issues with trim.

1) Fix device list iteration.  We're iterating the ->alloc_list while
   holding the device_list_mutex.  The ->alloc_list is protected by
   the chunk mutex and we don't want to hold it across the entire
   trim execution.  Instead, use the ->devices list, which is protected
   by the device_list_mutex.

2) Skip trim on devices that don't support it.  Rather than letting
   the block layer reject it, bounce out early.

3) Don't keep the commit_root_sem locked and the transaction pinned
   across the block layer component of trim.  We only need these to
   ensure the pending chunks list doesn't go away underneath us, so
   it's safe to drop across the trim itself.  Historically, this
   caused issues when fstrim and balance would run at the same time
   since balance would produce lots of transactions and would
   have to wait constantly, causing problems for everything else that
   wanted to start a transaction.

-Jeff
---

Jeff Mahoney (3):
  btrfs: use ->devices list instead of ->alloc_list in btrfs_trim_fs
  btrfs: don't attempt to trim devices that don't support it
  btrfs: keep trim from interfering with transaction commits

 fs/btrfs/extent-tree.c | 33 +++--
 1 file changed, 23 insertions(+), 10 deletions(-)

-- 
2.12.3



[PATCH 2/3] btrfs: don't attempt to trim devices that don't support it

2018-09-06 Thread jeffm
From: Jeff Mahoney 

We check whether any device the file system is using supports discard
in the ioctl call, but then we attempt to trim free extents on every
device regardless of whether discard is supported.  Due to the way
we mask off EOPNOTSUPP, we can end up issuing the trim operations
on each free range on devices that don't support it, just wasting time.

Fixes: 499f377f49f08 (btrfs: iterate over unused chunk space in FITRIM)
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a0e82589c3e8..92e5e9fd9bdd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10887,6 +10887,10 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
 
*trimmed = 0;
 
+   /* Discard not supported = nothing to do. */
+   if (!blk_queue_discard(bdev_get_queue(device->bdev)))
+   return 0;
+
/* Not writeable = nothing to do. */
if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state))
return 0;
-- 
2.12.3



[PATCH RESEND] btrfs: fix error handling in free_log_tree

2018-09-06 Thread jeffm
From: Jeff Mahoney 

When we hit an I/O error in free_log_tree->walk_log_tree during file system
shutdown we can crash due to there not being a valid transaction handle.

Use btrfs_handle_fs_error when there's no transaction handle to use.

BUG: unable to handle kernel NULL pointer dereference at 0060
IP: free_log_tree+0xd2/0x140 [btrfs]
PGD 0 P4D 0
Oops:  [#1] SMP DEBUG_PAGEALLOC PTI
Modules linked in: 
CPU: 2 PID: 23544 Comm: umount Tainted: GW4.12.14-kvmsmall #9 
SLE15 (unreleased)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
task: 96bfd3478880 task.stack: a7cf40d78000
RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
RSP: 0018:a7cf40d7bd10 EFLAGS: 00010282
RAX: fffb RBX: fffb RCX: 0002
RDX:  RSI: 96c02f07d4c8 RDI: 0282
RBP: 96c013cf1000 R08: 96c02f07d4c8 R09: 96c02f07d4d0
R10:  R11: 0002 R12: 
R13: 96c005e800c0 R14: a7cf40d7bdb8 R15: 
FS:  7f17856bcfc0() GS:96c03f60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0060 CR3: 45ed6002 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 ? wait_for_writer+0xb0/0xb0 [btrfs]
 btrfs_free_log+0x17/0x30 [btrfs]
 btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
 btrfs_free_fs_roots+0xc0/0x130 [btrfs]
 ? wait_for_completion+0xf2/0x100
 close_ctree+0xea/0x2e0 [btrfs]
 ? kthread_stop+0x161/0x260
 generic_shutdown_super+0x6c/0x120
 kill_anon_super+0xe/0x20
 btrfs_kill_super+0x13/0x100 [btrfs]
 deactivate_locked_super+0x3f/0x70
 cleanup_mnt+0x3b/0x70
 task_work_run+0x78/0x90
 exit_to_usermode_loop+0x77/0xa6
 do_syscall_64+0x1c5/0x1e0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x7f1784f90827
RSP: 002b:7ffdeeb03118 EFLAGS: 0246 ORIG_RAX: 00a6
RAX:  RBX: 556a60c62970 RCX: 7f1784f90827
RDX: 0001 RSI:  RDI: 556a60c62b50
RBP:  R08: 0005 R09: 
R10: 556a60c63900 R11: 0246 R12: 556a60c62b50
R13: 7f17854a81c4 R14:  R15: 
Code: 65 a1 fd ff be 01 00 00 00 48 89 ef e8 58 a1 fd ff 48 8b 7d 00 e8 9f 33 
fe ff 48 89 ef e8 17 6c d3 ed 48 83 c4 50 5b 5d 41 5c c3 <49> 8b 44 24 60 f0 0f 
ba a8 80 65 01 00 02 72 23 83 fb fb 75 39
RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: a7cf40d7bd10
CR2: 0060
---[ end trace 3bc199fbf8fb4977 ]---

Cc:  # v3.13
Fixes: 681ae50917df9 (Btrfs: cleanup reserved space when freeing tree log on 
error)
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/tree-log.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f8220ec02036..a5f6971a125f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3143,9 +3143,12 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
};
 
ret = walk_log_tree(trans, log, );
-   /* I don't think this can happen but just in case */
-   if (ret)
-   btrfs_abort_transaction(trans, ret);
+   if (ret) {
+   if (trans)
+   btrfs_abort_transaction(trans, ret);
+   else
+   btrfs_handle_fs_error(log->fs_info, ret, NULL);
+   }
 
while (1) {
ret = find_first_extent_bit(&log->dirty_log_pages,
-- 
2.12.3



Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata"

2018-09-06 Thread Liu Bo
On Thu, Sep 6, 2018 at 11:50 AM, Nikolay Borisov  wrote:
>
>
> On  6.09.2018 09:47, Liu Bo wrote:
>> On Wed, Sep 5, 2018 at 10:45 PM, Liu Bo  wrote:
>>> Somehow this ends up with a crash in btrfs/124; I'm trying to figure out
>>> what went wrong.
>>>
>>
>> It revealed the problem addressed in Josef's patch[1], so with it,
>> this patch works just fine.
>
> What exactly was the crash ?
>

assertion failed: list_empty(_group->bg_list), file:
fs/btrfs/extent-tree.c,

kernel BUG at fs/btrfs/ctree.h:3427!
...
close_ctree+0x142/0x310 [btrfs]

thanks,
liubo



>>
>> [1] btrfs: make sure we create all new bgs
>>
>> thanks,
>> liubo
>>
>>>
>>> On Tue, Sep 4, 2018 at 6:14 PM, Liu Bo  wrote:
 __btrfs_end_transaction() has done the metadata release twice,
 probably because it used to process delayed refs in between, but now
 that we don't process delayed refs any more, the 2nd release is always
 a noop.

 Signed-off-by: Liu Bo 
 ---
  fs/btrfs/transaction.c | 6 --
  1 file changed, 6 deletions(-)

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index bb1b9f526e98..94b036a74d11 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 return 0;
 }

 -   btrfs_trans_release_metadata(trans);
 -   trans->block_rsv = NULL;
 -
 -   if (!list_empty(&trans->new_bgs))
 -   btrfs_create_pending_block_groups(trans);
 -
 trans->delayed_ref_updates = 0;
 if (!trans->sync) {
 must_run_delayed_refs =
 --
 1.8.3.1

>>


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Loewen

[root@archlinux @data]# btrfs fi us /mnt/intenso_white/
Overall:
Device size: 911.51GiB
Device allocated:    703.09GiB
Device unallocated:  208.43GiB
Device missing:  0.00B
Used:    658.19GiB
Free (estimated):    249.75GiB  (min: 145.53GiB)
Data ratio:   1.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)
Data,single: Size:695.01GiB, Used:653.69GiB
/dev/sdb1 695.01GiB
Metadata,DUP: Size:4.00GiB, Used:2.25GiB
/dev/sdb1   8.00GiB
System,DUP: Size:40.00MiB, Used:96.00KiB
/dev/sdb1  80.00MiB
Unallocated:
/dev/sdb1 208.43GiB

Does that mean Metadata is duplicated?

Ok so to summarize and see if I understood you correctly:
There are bad sectors on disk. Running an extended selftest (smartctl -t 
long) could find those and replace them with spare sectors.
If it does not I can try calculating the physical (4K) sector number and 
write to that to make the drive notice and mark the bad sector.
Is there a way to find out which file I will be writing to beforehand? 
Or is it easier to just write to the sector and then wait for scrub to 
tell me (and the sector is broken anyways)?


For the drive: Not under warranty anymore. It's an external HDD that I 
had lying around for years, mostly unused. Now I wanted to use it as 
part of my small DIY NAS.



On 9/6/18 9:58 PM, Chris Murphy wrote:

On Thu, Sep 6, 2018 at 12:36 PM, Stefan Loewen  wrote:

Output of the commands is attached.

fdisk
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

smart
Sector Sizes: 512 bytes logical, 4096 bytes physical

So clearly the case is lying about the actual physical sector size of
the drive. It's very common. But it means to fix the bad sector by
writing to it, must be a 4K write. A 512 byte write to the reported
LBA, will fail because it is a RMW, and the read will fail. So if you
write to that sector, you'll get a read failure. Kinda confusing. So
you can convert the LBA to a 4K value, and use dd to write to that "4K
LBA" using bs=4096 and a count of 1 but only when you're ready to
lose all 4096 bytes in that sector. If it's data, it's fine. It's the
loss of one file, and scrub will find and report path to file so you
know what was affected.

If it's metadata, it could be a problem. What do you get for 'btrfs fi
us ' for this volume? I'm wondering if DUP metadata is
being used across the board with no single chunks. If so, then you can
zero that sector, and Btrfs will detect the missing metadata in that
chunk on scrub, and fix it up from a copy. But if you only have single
copy metadata, it just depends what's on that block as to how
recoverable or repairable this is.


195 Hardware_ECC_Recovered  -O-RCK   100   100   000-0
196 Reallocated_Event_Count -O--CK   252   252   000-0
197 Current_Pending_Sector  -O--CK   252   252   000-0
198 Offline_Uncorrectable   ----CK   252   252   000-0

Interesting, no complaints there. Unexpected.

11 Calibration_Retry_Count -O--CK   100   100   000-8
200 Multi_Zone_Error_Rate   -O-R-K   100   100   000-31

https://kb.acronis.com/content/9136

This is a low hour device, probably still under warranty? I'd get it
swapped out. If you want more ammunition for arguing in favor of a
swap out under warranty you could do

smartctl -t long /dev/sdb

That will take just under 4 hours to run (you can use the drive in the
meantime, but it'll take a bit longer); and then after that

smartctl -x /dev/sdb

And see if it's found a bad sector or updated any of those smart
values for the worse in particular the offline values.




SCT (Get) Error Recovery Control command failed

OK so not configurable, it is whatever it is and we don't know what
that is. Probably one of the really long recoveries.





The broken-sector-theory sounds plausible and is compatible with my new
findings:
I suspected the problem to be in one specific directory, let's call it
"broken_dir".
I created a new subvolume and copied broken_dir over.
- If I copied it with cp --reflink, made a snapshot and tried to btrfs-send
that, it hung
- If I rsynced broken_dir over I could snapshot and btrfs-send without a
problem.

Yeah I'm not sure what it is, maybe a data block.


But shouldn't btrfs scrub or check find such errors?

Nope. Btrfs expects the drive to complete the read command, but always
second guesses the content of the read by comparing to checksums. So
if the drive just supplied corrupt data, Btrfs would detect that and
discretely report, and if there's a good copy it would self heal. But
it can't do that because the drive or USB bus also seems to hang in
such a way that a bunch of tasks are also hung, and none of them are
getting a clear pass/fail for the read. It just hangs.

Arguably the device or the link should not hang. So I'm still
wondering if something else 

Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 12:36 PM, Stefan Loewen  wrote:
> Output of the commands is attached.

fdisk
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

smart
Sector Sizes: 512 bytes logical, 4096 bytes physical

So clearly the case is lying about the actual physical sector size of
the drive. It's very common. But it means to fix the bad sector by
writing to it, must be a 4K write. A 512 byte write to the reported
LBA, will fail because it is a RMW, and the read will fail. So if you
write to that sector, you'll get a read failure. Kinda confusing. So
you can convert the LBA to a 4K value, and use dd to write to that "4K
LBA" using bs=4096 and a count of 1 but only when you're ready to
lose all 4096 bytes in that sector. If it's data, it's fine. It's the
loss of one file, and scrub will find and report path to file so you
know what was affected.

If it's metadata, it could be a problem. What do you get for 'btrfs fi
us ' for this volume? I'm wondering if DUP metadata is
being used across the board with no single chunks. If so, then you can
zero that sector, and Btrfs will detect the missing metadata in that
chunk on scrub, and fix it up from a copy. But if you only have single
copy metadata, it just depends what's on that block as to how
recoverable or repairable this is.


195 Hardware_ECC_Recovered  -O-RCK   100   100   000-0
196 Reallocated_Event_Count -O--CK   252   252   000-0
197 Current_Pending_Sector  -O--CK   252   252   000-0
198 Offline_Uncorrectable   ----CK   252   252   000-0

Interesting, no complaints there. Unexpected.

11 Calibration_Retry_Count -O--CK   100   100   000-8
200 Multi_Zone_Error_Rate   -O-R-K   100   100   000-31

https://kb.acronis.com/content/9136

This is a low hour device, probably still under warranty? I'd get it
swapped out. If you want more ammunition for arguing in favor of a
swap out under warranty you could do

smartctl -t long /dev/sdb

That will take just under 4 hours to run (you can use the drive in the
meantime, but it'll take a bit longer); and then after that

smartctl -x /dev/sdb

And see if it's found a bad sector or updated any of those smart
values for the worse in particular the offline values.




SCT (Get) Error Recovery Control command failed

OK so not configurable, it is whatever it is and we don't know what
that is. Probably one of the really long recoveries.




>
> The broken-sector-theory sounds plausible and is compatible with my new
> findings:
> I suspected the problem to be in one specific directory, let's call it
> "broken_dir".
> I created a new subvolume and copied broken_dir over.
> - If I copied it with cp --reflink, made a snapshot and tried to btrfs-send
> that, it hung
> - If I rsynced broken_dir over I could snapshot and btrfs-send without a
> problem.

Yeah I'm not sure what it is, maybe a data block.

>
> But shouldn't btrfs scrub or check find such errors?

Nope. Btrfs expects the drive to complete the read command, but always
second guesses the content of the read by comparing to checksums. So
if the drive just supplied corrupt data, Btrfs would detect that and
discretely report, and if there's a good copy it would self heal. But
it can't do that because the drive or USB bus also seems to hang in
such a way that a bunch of tasks are also hung, and none of them are
getting a clear pass/fail for the read. It just hangs.

Arguably the device or the link should not hang. So I'm still
wondering if something else is going on, but this is just the most
obvious first problem, and maybe it's being complicated by another
problem we haven't figured out yet. Anyway, once this problem is
solved, it'll become clear if there are additional problems or not.

In my case, I often get usb reset errors when I directly connect USB
3.0 drives to my Intel NUC, but I don't ever get them when plugging
the drive into a dyconn hub. So if you don't already have a hub in
between the drive and the computer, it might be worth considering.
Basically the hub is going to read and completely rewrite the whole
stream that goes through it (in both directions).



-- 
Chris Murphy


[PATCH] btrfs: fix error handling in btrfs_dev_replace_start

2018-09-06 Thread jeffm
From: Jeff Mahoney 

When we fail to start a transaction in btrfs_dev_replace_start,
we leave dev_replace->replace_state set to STARTED but clear
->srcdev and ->tgtdev.  Later, that can result in an Oops in
btrfs_dev_replace_progress, where having the state set to STARTED or
SUSPENDED implies that ->srcdev is valid.

Also fix error handling when the state is already STARTED or
SUSPENDED while starting.  That, too, will clear ->srcdev and ->tgtdev
even though it doesn't own them.  This should be an impossible case to
hit since we should be protected by the BTRFS_FS_EXCL_OP bit being set.
Let's add an ASSERT there while we're at it.

Fixes: e93c89c1a (Btrfs: add new sources for device replace code)
Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/dev-replace.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e2ba0419297a..0581c8570a05 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -445,6 +445,7 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
break;
case BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED:
case BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED:
+   ASSERT(0);
ret = BTRFS_IOCTL_DEV_REPLACE_RESULT_ALREADY_STARTED;
goto leave;
}
@@ -487,6 +488,10 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
btrfs_dev_replace_write_lock(dev_replace);
+   dev_replace->replace_state =
+   BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED;
+   dev_replace->srcdev = NULL;
+   dev_replace->tgtdev = NULL;
goto leave;
}
 
@@ -508,8 +513,6 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
return ret;
 
 leave:
-   dev_replace->srcdev = NULL;
-   dev_replace->tgtdev = NULL;
btrfs_dev_replace_write_unlock(dev_replace);
btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
return ret;
-- 
2.12.3



Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata"

2018-09-06 Thread Nikolay Borisov



On  6.09.2018 09:47, Liu Bo wrote:
> On Wed, Sep 5, 2018 at 10:45 PM, Liu Bo  wrote:
>> Somehow this ends up with a crash in btrfs/124; I'm trying to figure out
>> what went wrong.
>>
> 
> It revealed the problem addressed in Josef's patch[1], so with it,
> this patch works just fine.

What exactly was the crash ?

> 
> [1] btrfs: make sure we create all new bgs
> 
> thanks,
> liubo
> 
>>
>> On Tue, Sep 4, 2018 at 6:14 PM, Liu Bo  wrote:
>>> __btrfs_end_transaction() has done the metadata release twice,
>>> probably because it used to process delayed refs in between, but now
>>> that we don't process delayed refs any more, the 2nd release is always
>>> a noop.
>>>
>>> Signed-off-by: Liu Bo 
>>> ---
>>>  fs/btrfs/transaction.c | 6 --
>>>  1 file changed, 6 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index bb1b9f526e98..94b036a74d11 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
>>> btrfs_trans_handle *trans,
>>> return 0;
>>> }
>>>
>>> -   btrfs_trans_release_metadata(trans);
>>> -   trans->block_rsv = NULL;
>>> -
>>> -   if (!list_empty(&trans->new_bgs))
>>> -   btrfs_create_pending_block_groups(trans);
>>> -
>>> trans->delayed_ref_updates = 0;
>>> if (!trans->sync) {
>>> must_run_delayed_refs =
>>> --
>>> 1.8.3.1
>>>
> 


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Loewen

Output of the commands is attached.

The broken-sector-theory sounds plausible and is compatible with my new 
findings:
I suspected the problem to be in one specific directory, let's call it 
"broken_dir".

I created a new subvolume and copied broken_dir over.
- If I copied it with cp --reflink, made a snapshot and tried to 
btrfs-send that, it hung
- If I rsynced broken_dir over I could snapshot and btrfs-send without a 
problem.


But shouldn't btrfs scrub or check find such errors?


On 9/6/18 8:16 PM, Chris Murphy wrote:

OK you've got a different problem.

[  186.898756] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  186.898762] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a0 d0
00 08 00 00
[  186.898764] print_req_error: I/O error, dev sdb, sector 354853072
[  187.109641] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.345245] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.657844] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.851336] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.026882] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.215881] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.247028] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  188.247041] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a8 d0
00 08 00 00
[  188.247048] print_req_error: I/O error, dev sdb, sector 354855120


This is a read error for a specific sector.  So your drive has media
problems. And I think that's the instigating problem here: a bunch of
other tasks hang because they depend on one or more reads that never
complete. But weirdly there also isn't any kind of libata reset. At
least on SATA, by default we see a link reset after a command has not
returned in 30 seconds. That reset would totally clear the drive's
command queue, and then things either can recover or barf. But in your
case, neither happens and it just sits there with hung tasks.

[  189.350360] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0,
rd 2, flush 0, corrupt 0, gen 0

And that's the last we really see from Btrfs. After that, it's all
just hung task traces, which are rather unsurprising to me.

Drives in USB cases add a whole bunch of complicating factors for
troubleshooting and repair. Including often masking the actual logical
and physical sector size, the min and max IO size, alignment offset,
and all kinds of things. They can have all sorts of bugs. And I'm also
not totally certain about the relationship between the usb reset
messages and the bad sector. As far as I know the only way we can get
a sector LBA expressly noted in dmesg along with the failed read(10)
command, is if the drive has reported back to libata that discrete
error with sense information. So I'm accepting that as a reliable
error, rather than it being something like a cable. But the reset
messages could possibly be something else in addition to that.

Anyway, the central issue is sector 354855120 is having problems. I
can't tell from the trace if it's transient or persistent. Maybe if
it's transient, that would explain how you sometimes get send to start
working again briefly but then it reverts to hanging. What do you get
for:

fdisk -l /dev/sdb
smartctl -x /dev/sdb
smartctl -l sct erc /dev/sdb

Those are all read only commands, nothing is written or changed.



[root@archlinux ~]# fdisk -l /dev/sdb
Disk /dev/sdb: 931.5 GiB, 1000204140544 bytes, 1953523712 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x1cc21bd2
Device Boot  StartEndSectors  Size Id Type
/dev/sdb1 2048 1933593750 1933591703  922G 83 Linux
/dev/sdb2   1933610176 1953525167   19914992  9.5G 83 Linux


[root@archlinux ~]# smartctl -x /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.5-arch1-1-ARCH] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Samsung SpinPoint M8 (AF)
Device Model:     ST1000LM024 HN-M101MBB
Serial Number:    S2RXJ9FCB07612
LU WWN Device Id: 5 0004cf 208d24759
Firmware Version: 2AR10002
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Sep  6 18:23:57 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM level is:     254 (maximum performance), recommended: 254
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write 

Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 10:03 AM, Stefan Löwen  wrote:
> I have one subvolume (rw) and 2 snapshots (ro) of it.
>
> I just tested 'btrfs send  > /dev/null' and that also shows no IO
> after a while but also no significant CPU usage.
> During this I tried 'ls' on the source subvolume and it hangs as well.
> dmesg has some interesting messages I think (see attached dmesg.log)
>

OK you've got a different problem.

[  186.898756] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  186.898762] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a0 d0
00 08 00 00
[  186.898764] print_req_error: I/O error, dev sdb, sector 354853072
[  187.109641] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.345245] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.657844] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  187.851336] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.026882] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.215881] usb 2-1: reset SuperSpeed Gen 1 USB device number 2
using xhci_hcd
[  188.247028] sd 2:0:0:0: [sdb] tag#0 FAILED Result:
hostbyte=DID_ERROR driverbyte=DRIVER_OK
[  188.247041] sd 2:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 15 26 a8 d0
00 08 00 00
[  188.247048] print_req_error: I/O error, dev sdb, sector 354855120


This is a read error for a specific sector.  So your drive has media
problems. And I think that's the instigating problem here: a bunch of
other tasks hang because they depend on one or more reads that never
complete. But weirdly there also isn't any kind of libata reset. At
least on SATA, by default we see a link reset after a command has not
returned in 30 seconds. That reset would totally clear the drive's
command queue, and then things either can recover or barf. But in your
case, neither happens and it just sits there with hung tasks.

[  189.350360] BTRFS error (device sdb1): bdev /dev/sdb1 errs: wr 0,
rd 2, flush 0, corrupt 0, gen 0

And that's the last we really see from Btrfs. After that, it's all
just hung task traces, which are rather unsurprising to me.

Drives in USB cases add a whole bunch of complicating factors for
troubleshooting and repair. Including often masking the actual logical
and physical sector size, the min and max IO size, alignment offset,
and all kinds of things. They can have all sorts of bugs. And I'm also
not totally certain about the relationship between the usb reset
messages and the bad sector. As far as I know the only way we can get
a sector LBA expressly noted in dmesg along with the failed read(10)
command, is if the drive has reported back to libata that discrete
error with sense information. So I'm accepting that as a reliable
error, rather than it being something like a cable. But the reset
messages could possibly be something else in addition to that.

Anyway, the central issue is sector 354855120 is having problems. I
can't tell from the trace if it's transient or persistent. Maybe if
it's transient, that would explain how you sometimes get send to start
working again briefly but then it reverts to hanging. What do you get
for:

fdisk -l /dev/sdb
smartctl -x /dev/sdb
smartctl -l sct erc /dev/sdb

Those are all read only commands, nothing is written or changed.



-- 
Chris Murphy
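
To turn the transient-vs-persistent question above into a concrete check, here is a minimal sketch. The sector number and 512-byte logical sector size are taken from the dmesg and fdisk output in this thread; the dd probe itself is left as a comment because it must run as root against the affected disk:

```python
SECTOR = 354855120      # failing LBA from the dmesg read error above
LOGICAL_SECTOR = 512    # logical sector size per the fdisk output above

# Byte offset of the suspect sector on the disk:
offset = SECTOR * LOGICAL_SECTOR
print(f"sector {SECTOR} = byte offset {offset}")

# Read-only probe of exactly that sector (root, affected machine only).
# A persistent media error fails on every attempt; a transient one may
# succeed on a retry:
#   dd if=/dev/sdb of=/dev/null bs=512 skip=354855120 count=1 iflag=direct
```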


Re: [PATCH v5 0/6] Btrfs: implement swap file support

2018-09-06 Thread Omar Sandoval
On Thu, Sep 06, 2018 at 01:59:54PM +0200, David Sterba wrote:
> On Fri, Aug 31, 2018 at 03:36:35PM -0700, Omar Sandoval wrote:
> > This series implements swap file support for Btrfs.
> > 
> > Changes since v4 [1]:
> > 
> > - Added a kernel doc for btrfs_get_chunk_map()
> > - Got rid of "Btrfs: push EXCL_OP set into btrfs_rm_device()"
> > - Made activate error messages more clear and consistent
> > - Changed clear vs unlock order in activate error case
> > - Added "mm: export add_swap_extent()" as a separate patch
> > - Added a btrfs_wait_ordered_range() at the beginning of
> >   btrfs_swap_activate() to catch newly created files
> > - Added some Reviewed-bys from Nikolay
> > 
> > I took a stab at adding support for balance when a swap file is active,
> > but it's a major pain: we need to mark block groups which contain swap
> > file extents, check the block group counter in relocate/scrub, then
> > unmark the block groups when the swap file is deactivated, which gets
> > really messy because the file can grow while it is an active swap file.
> > If this is a deal breaker, I can work something out, but I don't think
> > it's worth the trouble.
> 
> I'm afraid it is a deal breaker. Unlike dev-replace or resize, balance
> is used more often so switching off the swap file for the duration of
> the operation is administration pain.
> 
> If it's possible to constrain the swap file further, like no growing
> that you mention, or mandatory preallocation or similar, then I hope it
> would make it possible to implement in a sane way.

Alright, I'll have another go.


Re: [PATCH v3] Btrfs: set leave_spinning in btrfs_get_extent

2018-09-06 Thread David Sterba
On Sat, Aug 25, 2018 at 01:47:09PM +0800, Liu Bo wrote:
> Unless it's going to read inline extents from btree leaf to page,
> btrfs_get_extent won't sleep during the period of holding path lock.
> 
> This sets leave_spinning at first and sets path to blocking mode right
> before reading inline extent if that's the case.  The benefit is that a
> path in spinning mode typically has less impact (faster) on waiters
> rather than that in blocking mode.
> 
> Also fixes the misalignment of the prototype, which is too trivial for
> a single patch.

^^^ removed as it refers to the hunks from v2.

> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: fix alignment in declaration and prototype of btrfs_get_extent

2018-09-06 Thread David Sterba
On Sat, Aug 25, 2018 at 01:47:59PM +0800, Liu Bo wrote:
> This fixes btrfs_get_extent to be consistent with our existing
> declaration style.

For the record, indentation styles that are accepted are both: aligning
under the opening ( and tab or double tab indentation on the next line.
Preferably not splitting the type or long expressions in the argument
lists.

I personally don't like the alignment under opening ( and in some
excessive cases I reformat the code to be more compact. We'll never
agree on one style, so as long as the code does not look ugly as in the
2nd hunk of this patch, there's no "enforcement".

Patch added to misc-next, thanks.


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Löwen

I have one subvolume (rw) and 2 snapshots (ro) of it.

I just tested 'btrfs send  > /dev/null' and that also shows no 
IO after a while but also no significant CPU usage.

During this I tried 'ls' on the source subvolume and it hangs as well.
dmesg has some interesting messages I think (see attached dmesg.log)


On 9/6/18 5:48 PM, Chris Murphy wrote:

On Thu, Sep 6, 2018 at 9:04 AM, Stefan Loewen  wrote:

Update:
It seems like btrfs-send is not completely hung. It somewhat regularly
wakes up every hour to transfer a few bytes. I noticed this via a
periodic 'ls -l' on the snapshot file. These are the last outputs
(uniq'ed):

-rw--- 1 root root 1492797759 Sep  6 08:44 intenso_white.snapshot
-rw--- 1 root root 1493087856 Sep  6 09:44 intenso_white.snapshot
-rw--- 1 root root 1773825308 Sep  6 10:44 intenso_white.snapshot
-rw--- 1 root root 1773976853 Sep  6 11:58 intenso_white.snapshot
-rw--- 1 root root 1774122301 Sep  6 12:59 intenso_white.snapshot
-rw--- 1 root root 1774274264 Sep  6 13:58 intenso_white.snapshot
-rw--- 1 root root 1774435235 Sep  6 14:57 intenso_white.snapshot

I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I
have no idea if this is useful) but there are no changes, even during
the short wakeups.

I have a sort of "me too" here. I definitely see btrfs send just hang
for no apparent reason, but in my case it's for maybe 15-30 seconds.
Not an hour. Looking at top and iotop at the same time as the LED
lights on the drives, there's  definitely zero activity happening. I
can make things happen during this time - like I can read a file or
save a file from/to any location including the send source or receive
destination. It really just behaves as if the send thread is saying
"OK I'm gonna nap now, back in a bit" and then it is.

So what I end up with on drives with a minimum read-write of 80M/s, is
a send receive that's getting me a net of about 30M/s.

I have around 100 snapshots on the source device. How many total
snapshots do you have on your source? That does appear to affect
performance for some things, including send/receive.


[0.00] Linux version 4.18.5-arch1-1-ARCH (builduser@heftig-12250) (gcc version 8.2.0 (GCC)) #1 SMP PREEMPT Fri Aug 24 12:48:58 UTC 2018
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=a83c9650-c8f8-4afe-90a6-4e80156d523d rw quiet
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xdffe] usable
[0.00] BIOS-e820: [mem 0xdfff-0xdfff] ACPI data
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00011fff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.5 present.
[0.00] DMI: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[0.00] Hypervisor detected: KVM
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] last_pfn = 0x12 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR variable ranges disabled:
[0.00] Disabled
[0.00] x86/PAT: MTRRs disabled, skipping PAT initialization too.
[0.00] CPU MTRRs all blank - virtualized system.
[0.00] x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WB  WT  UC- UC  
[0.00] last_pfn = 0xdfff0 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x0009fff0-0x0009] mapped at [(ptrval)]
[0.00] Scanning 1 areas for low memory corruption
[0.00] Base memory trampoline at [(ptrval)] 99000 size 24576
[0.00] BRK [0x115a77000, 0x115a77fff] PGTABLE
[0.00] BRK [0x115a78000, 0x115a78fff] PGTABLE
[0.00] BRK [0x115a79000, 0x115a79fff] PGTABLE
[0.00] BRK [0x115a7a000, 0x115a7afff] PGTABLE
[0.00] BRK [0x115a7b000, 0x115a7bfff] PGTABLE

Re: [PATCH] btrfs: extent-tree.c: Remove redundant variable from btrfs_cross_ref_exist()

2018-09-06 Thread David Sterba
On Thu, Aug 30, 2018 at 10:59:16AM +0900, Misono Tomohiro wrote:
> Since commit d7df2c796d7e ("Btrfs: attach delayed ref updates to
> delayed ref heads"), check_delayed_ref() won't return -ENOENT.
> 
> In btrfs_cross_ref_exist(), the two variables 'ret' and 'ret2' were
> originally used to handle the -ENOENT error case.
> 
> Since that code is not needed anymore, let's just remove 'ret2'.

Good cleanup and the patch would be ok as-is. I've noticed that
check_delayed_ref now returns only two values so it might make sense to
turn it into a bool and update the name of check_delayed_ref to make it
clear what the return value means in the loop.

The concern here is that adding another error code in check_delayed_ref
in the future will not be caught in btrfs_cross_ref_exist. Adding an
assert filtering only EAGAIN would be sufficient, as otherwise the
extra value might be missed if it only occurred in an uncommon case.
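
The suggested guard can be sketched out of context. The stand-in below is illustrative only: the helper name and the 1/0/-EAGAIN return convention are assumed from the discussion, not taken from the btrfs code:

```python
import errno

def check_delayed_ref_stub(result):
    """Stand-in for check_delayed_ref(): per the discussion it currently
    returns only 1 (cross ref exists), 0 (none) or -EAGAIN (retry)."""
    return result

for ret in (1, 0, -errno.EAGAIN):
    ret = check_delayed_ref_stub(ret)
    # The suggested assert: any error code other than -EAGAIN trips here
    # instead of being silently misread by the caller's retry loop.
    assert ret >= 0 or ret == -errno.EAGAIN
print("ok")
```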


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Chris Murphy
On Thu, Sep 6, 2018 at 9:04 AM, Stefan Loewen  wrote:
> Update:
> It seems like btrfs-send is not completely hung. It somewhat regularly
> wakes up every hour to transfer a few bytes. I noticed this via a
> periodic 'ls -l' on the snapshot file. These are the last outputs
> (uniq'ed):
>
> -rw--- 1 root root 1492797759 Sep  6 08:44 intenso_white.snapshot
> -rw--- 1 root root 1493087856 Sep  6 09:44 intenso_white.snapshot
> -rw--- 1 root root 1773825308 Sep  6 10:44 intenso_white.snapshot
> -rw--- 1 root root 1773976853 Sep  6 11:58 intenso_white.snapshot
> -rw--- 1 root root 1774122301 Sep  6 12:59 intenso_white.snapshot
> -rw--- 1 root root 1774274264 Sep  6 13:58 intenso_white.snapshot
> -rw--- 1 root root 1774435235 Sep  6 14:57 intenso_white.snapshot
>
> I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I
> have no idea if this is useful) but there are no changes, even during
> the short wakeups.

I have a sort of "me too" here. I definitely see btrfs send just hang
for no apparent reason, but in my case it's for maybe 15-30 seconds.
Not an hour. Looking at top and iotop at the same time as the LED
lights on the drives, there's  definitely zero activity happening. I
can make things happen during this time - like I can read a file or
save a file from/to any location including the send source or receive
destination. It really just behaves as if the send thread is saying
"OK I'm gonna nap now, back in a bit" and then it is.

So what I end up with on drives with a minimum read-write of 80M/s, is
a send receive that's getting me a net of about 30M/s.

I have around 100 snapshots on the source device. How many total
snapshots do you have on your source? That does appear to affect
performance for some things, including send/receive.


-- 
Chris Murphy


Re: [PATCH] btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag

2018-09-06 Thread David Sterba
On Wed, Sep 05, 2018 at 11:07:33AM +0800, Su Yue wrote:
> Since commit 8b62f87bad9c ("Btrfs: rework outstanding_extents"),
> manual operations of outstanding_extent in btrfs_inode are replaced by
> btrfs_mod_outstanding_extents().
> The one in cluster_pages_for_defrag seems to have been missed, so
> replace it with btrfs_mod_outstanding_extents().
> 
> Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents")
> Signed-off-by: Su Yue 

Reviewed-by: David Sterba 


Re: [PATCH v2] Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes

2018-09-06 Thread David Sterba
On Wed, Sep 05, 2018 at 09:55:27AM +0800, Liu Bo wrote:
> Here we're not releasing any space, but transferring bytes from
> ->bytes_may_use to ->bytes_reserved.
> 
> Signed-off-by: Liu Bo 
> ---
> v2: Add missing commit log.

I've enhanced the changelog a bit regarding how the tracepoint got there.

Reviewed-by: David Sterba 


Re: btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Loewen
Update:
It seems like btrfs-send is not completely hung. It somewhat regularly
wakes up every hour to transfer a few bytes. I noticed this via a
periodic 'ls -l' on the snapshot file. These are the last outputs
(uniq'ed):

-rw--- 1 root root 1492797759 Sep  6 08:44 intenso_white.snapshot
-rw--- 1 root root 1493087856 Sep  6 09:44 intenso_white.snapshot
-rw--- 1 root root 1773825308 Sep  6 10:44 intenso_white.snapshot
-rw--- 1 root root 1773976853 Sep  6 11:58 intenso_white.snapshot
-rw--- 1 root root 1774122301 Sep  6 12:59 intenso_white.snapshot
-rw--- 1 root root 1774274264 Sep  6 13:58 intenso_white.snapshot
-rw--- 1 root root 1774435235 Sep  6 14:57 intenso_white.snapshot

I also monitor the /proc/3022/task/*/stack files with 'tail -f' (I
have no idea if this is useful) but there are no changes, even during
the short wakeups.
On Thu, Sep 6, 2018 at 11:22 AM Stefan Löwen  wrote:
>
> Hello linux-btrfs,
>
> I'm trying to clone a subvolume with 'btrfs send' but it always hangs
> for hours.
>
> I tested this on multiple systems. All showed the same result:
> - Manjaro (btrfs-progs v4.17.1; linux v4.18.5-1-MANJARO)
> - Ubuntu 18.04 in VirtualBox (btrfs-progs v4.15.1; linux v4.15.0-33-generic)
> - ArchLinux in VirtualBox (btrfs-progs v4.17.1; linux v4.18.5-arch1-1-ARCH)
> All following logs are from the ArchLinux VM.
>
> To make sure it's not the 'btrfs receive' at fault I tried sending into
> a file using the following command:
> 'strace -o btrfs-send.strace btrfs send -vvv -f intenso_white.snapshot
> /mnt/intenso_white/@data/.snapshots/test-snapshot'
>
> The 'btrfs send' process always copies around 1.2-1.4G of data, then
> stops all disk IO and fully loads one cpu core. btrfs scrub found 0
> errors. Neither did btrfsck. 'btrfs device stats' is all 0.
>
> I would be thankful for all ideas and tips.
>
> Regards
> Stefan
>
> -
>
> The btrfs-send.strace is attached. So is the dmesg.log during the hang.
>
> Stack traces of the hung process:
> --- /proc/3022/task/3022/stack ---
> [<0>] 0x
> --- /proc/3022/task/3023/stack ---
> [<0>] pipe_wait+0x6c/0xb0
> [<0>] splice_from_pipe_next.part.3+0x24/0xa0
> [<0>] __splice_from_pipe+0x43/0x180
> [<0>] splice_from_pipe+0x5d/0x90
> [<0>] default_file_splice_write+0x15/0x20
> [<0>] __se_sys_splice+0x31b/0x770
> [<0>] do_syscall_64+0x5b/0x170
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<0>] 0x
>
> [vagrant@archlinux mnt]$ uname -a
> Linux archlinux 4.18.5-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 24 12:48:58
> UTC 2018 x86_64 GNU/Linux
>
> [vagrant@archlinux mnt]$ btrfs --version
> btrfs-progs v4.17.1
>
> [vagrant@archlinux mnt]$ sudo btrfs fi show /dev/sdb1
> Label: 'intenso_white'  uuid: 07bf61ed-7728-4151-a784-c4b840e343ed
> Total devices 1 FS bytes used 655.82GiB
> devid1 size 911.51GiB used 703.09GiB path /dev/sdb1
>
> [vagrant@archlinux mnt]$ sudo btrfs fi df /mnt/intenso_white/
> Data, single: total=695.01GiB, used=653.69GiB
> System, DUP: total=40.00MiB, used=96.00KiB
> Metadata, DUP: total=4.00GiB, used=2.13GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
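
The hourly 'ls -l' polling described in the update above is easy to script. A minimal sketch, with the snapshot path and interval as placeholders (the bounded-iteration option is only there so the helper can be exercised without running forever):

```python
import os
import time

def poll_size(path, interval=60, iterations=None):
    """Print a timestamped line each time `path` changes size.

    iterations=None polls forever; a number bounds the loop. This is
    just a convenience wrapper around the manual 'ls -l' checks
    described in the thread.
    """
    last = None
    n = 0
    while iterations is None or n < iterations:
        size = os.path.getsize(path)
        if size != last:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), size)
            last = size
        n += 1
        time.sleep(0 if iterations is not None else interval)

# Example (hypothetical path):
# poll_size("intenso_white.snapshot")
```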


Re: [PATCH 0/3] btrfs: qgroup: Deprecate unused features for btrfs_qgroup_inherit()

2018-09-06 Thread Qu Wenruo


On 2018/9/5 9:00 PM, David Sterba wrote:
> On Fri, Aug 31, 2018 at 10:29:27AM +0800, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/linux/tree/qgroup_inherit_check
>> Which is based on v4.19-rc1 tag.
>>
>> This patchset will first set btrfs_qgroup_inherit structure size limit
>> from PAGE_SIZE to fixed SZ_4K.
>> I understand this normally would cause a compatibility problem, but
>> considering how minor this feature is and that no sane user would use it
>> for over 100 qgroups, it should be fine in the real world.
> 
> Agreed, please update the changelog of 1st patch with description on
> what will stop working and under what conditions. The 4k limit sounds
> good enough, the real difference would be on architectures with larger
> page sizes where the feature would be used.

No problem.

> 
>> The 2nd patch introduce check function for btrfs_qgroup_inherit
>> structure and deprecates the following features:
>> 1) limit set
>>Never utilized by btrfs-progs from the beginning.
>>
>> 2) copy rfer/excl
>>Although btrfs-progs provides support for it as a hidden,
>>undocumented feature, it's the easiest way to screw up qgroup
>>numbers.
>>And we already have patches wandering around the ML to remove such
>>support.
> 
> The deprecation should be done in a few steps. First issue a warning
> that the feature is deprecated and will be removed in release X. Then
> wait until somebody complains (or not) and remove the code in release X.
> 
> The X is something like 4.22, ie. at least 2 cycles after the
> deprecation warning is added.

Thanks for the deprecation progress.

However I'm wondering if the "release X" is really needed in the warning
message.
(I may forget to submit the real deprecation patch for that release).

Thanks,
Qu
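
For a sense of scale on the SZ_4K cap discussed above: assuming the uapi layout of struct btrfs_qgroup_inherit (four u64 header fields plus the 5-u64 struct btrfs_qgroup_limit, then one u64 per inherited qgroup; sizes taken from 4.19-era headers and worth double-checking), the cap still leaves room well beyond the ~100 qgroups mentioned:

```python
U64 = 8
HEADER = 4 * U64   # flags, num_qgroups, num_ref_copies, num_excl_copies
LIMIT = 5 * U64    # struct btrfs_qgroup_limit: flags + 4 limit fields
SZ_4K = 4096

# Each inherited qgroup adds one u64 to the trailing qgroups[] array.
max_qgroups = (SZ_4K - HEADER - LIMIT) // U64
print(max_qgroups)
```

i.e. roughly 500 qgroups fit under the 4K limit on any page size.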





Re: [PATCH v5 0/6] Btrfs: implement swap file support

2018-09-06 Thread David Sterba
On Fri, Aug 31, 2018 at 03:36:35PM -0700, Omar Sandoval wrote:
> This series implements swap file support for Btrfs.
> 
> Changes since v4 [1]:
> 
> - Added a kernel doc for btrfs_get_chunk_map()
> - Got rid of "Btrfs: push EXCL_OP set into btrfs_rm_device()"
> - Made activate error messages more clear and consistent
> - Changed clear vs unlock order in activate error case
> - Added "mm: export add_swap_extent()" as a separate patch
> - Added a btrfs_wait_ordered_range() at the beginning of
>   btrfs_swap_activate() to catch newly created files
> - Added some Reviewed-bys from Nikolay
> 
> I took a stab at adding support for balance when a swap file is active,
> but it's a major pain: we need to mark block groups which contain swap
> file extents, check the block group counter in relocate/scrub, then
> unmark the block groups when the swap file is deactivated, which gets
> really messy because the file can grow while it is an active swap file.
> If this is a deal breaker, I can work something out, but I don't think
> it's worth the trouble.

I'm afraid it is a deal breaker. Unlike dev-replace or resize, balance
is used more often so switching off the swap file for the duration of
the operation is administration pain.

If it's possible to constrain the swap file further, like no growing
that you mention, or mandatory preallocation or similar, then I hope it
would make it possible to implement in a sane way.


Re: [PATCH] btrfs: qgroup: Don't trace subtree if we're dropping tree reloc tree

2018-09-06 Thread Qu Wenruo


On 2018/9/6 7:15 PM, David Sterba wrote:
> On Thu, Sep 06, 2018 at 09:41:14AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/9/5 9:11 PM, David Sterba wrote:
>>> On Wed, Sep 05, 2018 at 01:03:39PM +0800, Qu Wenruo wrote:
 Tree reloc tree doesn't contribute to qgroup numbers, as we have
>>>
>>> I think you can call it just 'reloc tree', I'm fixing that in all
>>> changelogs and comments anyway.
>>
>> But there is another tree called data reloc tree.
>> That why I'm sticking to tree reloc tree to distinguish from data reloc
>> tree.
> 
> So call it 'reloc tree' and 'data reloc tree', the naming of
> BTRFS_TREE_RELOC_OBJECTID does not follow the other tree naming and I
> don't know if the inention was to name it after the 'tree log' scheme or
> the 'extent tree'. Or if there was an intention behind the naming at
> all.
> 
> If this is going to bother us too much, I won't mind renaming it to
> BTRFS_RELOC_TREE_OBJECTID everywhere. For consistency.
> 

I'll go reloc tree for later reference.

Renaming makes sense, although I'd like to change its name to something
more distinctive (without the reloc part).

For now, I'm fine using reloc tree.

 accounted them at balance time (check replace_path()).

 Skipping such unneeded subtree tracing should reduce some performance
 overhead.
>>>
>>> Please provide some numbers or description of the improvement. There are
>>> several performance problems caused by qgroups so it would be good to
>>> get a better idea how much this patch is going to help. Thanks.
>>
>> That's the problem.
>> For my internal test, with 3000+ tree blocks, metadata balance could
>> save about 1~2%.
>> But according to dump-tree, the tree layout is almost the worst case
>> scenario, just one metadata block group owns all the tree blocks.
> 
> If you do such test, it's also a good example for the changelog. It
> describes the worst case and this information can be used to prepare
> testing environment on large data samples.

Already preparing the real-world case (great thanks to the Fujitsu guys
for providing 1TB of storage for the test).

I'll provide the data along after the test is done.

Thanks,
Qu





Re: [PATCH] btrfs: qgroup: Don't trace subtree if we're dropping tree reloc tree

2018-09-06 Thread David Sterba
On Thu, Sep 06, 2018 at 09:41:14AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2018/9/5 9:11 PM, David Sterba wrote:
> > On Wed, Sep 05, 2018 at 01:03:39PM +0800, Qu Wenruo wrote:
> >> Tree reloc tree doesn't contribute to qgroup numbers, as we have
> > 
> > I think you can call it just 'reloc tree', I'm fixing that in all
> > changelogs and comments anyway.
> 
> But there is another tree called data reloc tree.
> That why I'm sticking to tree reloc tree to distinguish from data reloc
> tree.

So call it 'reloc tree' and 'data reloc tree', the naming of
BTRFS_TREE_RELOC_OBJECTID does not follow the other tree naming and I
don't know if the intention was to name it after the 'tree log' scheme or
the 'extent tree'. Or if there was an intention behind the naming at
all.

If this is going to bother us too much, I won't mind renaming it to
BTRFS_RELOC_TREE_OBJECTID everywhere. For consistency.

> >> accounted them at balance time (check replace_path()).
> >>
> >> Skipping such unneeded subtree tracing should reduce some performance
> >> overhead.
> > 
> > Please provide some numbers or description of the improvement. There are
> > several performance problems caused by qgroups so it would be good to
> > get a better idea how much this patch is going to help. Thanks.
> 
> That's the problem.
> For my internal test, with 3000+ tree blocks, metadata balance could
> save about 1~2%.
> But according to dump-tree, the tree layout is almost the worst case
> scenario, just one metadata block group owns all the tree blocks.

If you do such test, it's also a good example for the changelog. It
describes the worst case and this information can be used to prepare
testing environment on large data samples.


Re: Transactional btrfs

2018-09-06 Thread Austin S. Hemmelgarn

On 2018-09-06 03:23, Nathan Dehnel wrote:

https://lwn.net/Articles/287289/

In 2008, HP released the source code for a filesystem called advfs so
that its features could be incorporated into linux filesystems. Advfs
had a feature where a group of file writes were an atomic transaction.

https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

These guys used advfs to add a "syncv" system call that makes writes
across multiple files atomic.

https://lwn.net/Articles/715918/

A patch based on the previous paper was later submitted.

So I guess my question is, does btrfs support atomic writes across
multiple files? Or is anyone interested in such a feature?

I'm fairly certain that it does not currently, but in theory it would 
not be hard to add.


Realistically, the only cases I can think of where cross-file atomic 
_writes_ would be of any benefit are database systems.


However, if this were extended to include rename, unlink, touch, and a 
handful of other VFS operations, then I can easily think of a few dozen 
use cases.  Package managers in particular would likely be very 
interested in being able to atomically rename a group of files as a 
single transaction, as it would make their job _much_ easier.


btrfs send hung in pipe_wait

2018-09-06 Thread Stefan Löwen

Hello linux-btrfs,

I'm trying to clone a subvolume with 'btrfs send' but it always hangs 
for hours.


I tested this on multiple systems. All showed the same result:
- Manjaro (btrfs-progs v4.17.1; linux v4.18.5-1-MANJARO)
- Ubuntu 18.04 in VirtualBox (btrfs-progs v4.15.1; linux v4.15.0-33-generic)
- ArchLinux in VirtualBox (btrfs-progs v4.17.1; linux v4.18.5-arch1-1-ARCH)
All following logs are from the ArchLinux VM.

To make sure it's not the 'btrfs receive' at fault I tried sending into 
a file using the following command:
'strace -o btrfs-send.strace btrfs send -vvv -f intenso_white.snapshot 
/mnt/intenso_white/@data/.snapshots/test-snapshot'


The 'btrfs send' process always copies around 1.2-1.4G of data, then
stops all disk IO and fully loads one CPU core. btrfs scrub found 0
errors, and btrfsck found none either. 'btrfs device stats' is all 0.


I would be thankful for all ideas and tips.

Regards
Stefan

-

The btrfs-send.strace is attached. So is the dmesg.log during the hang.

Stack traces of the hung process:
--- /proc/3022/task/3022/stack ---
[<0>] 0x
--- /proc/3022/task/3023/stack ---
[<0>] pipe_wait+0x6c/0xb0
[<0>] splice_from_pipe_next.part.3+0x24/0xa0
[<0>] __splice_from_pipe+0x43/0x180
[<0>] splice_from_pipe+0x5d/0x90
[<0>] default_file_splice_write+0x15/0x20
[<0>] __se_sys_splice+0x31b/0x770
[<0>] do_syscall_64+0x5b/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0x

[vagrant@archlinux mnt]$ uname -a
Linux archlinux 4.18.5-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 24 12:48:58 
UTC 2018 x86_64 GNU/Linux


[vagrant@archlinux mnt]$ btrfs --version
btrfs-progs v4.17.1

[vagrant@archlinux mnt]$ sudo btrfs fi show /dev/sdb1
Label: 'intenso_white'  uuid: 07bf61ed-7728-4151-a784-c4b840e343ed
Total devices 1 FS bytes used 655.82GiB
devid    1 size 911.51GiB used 703.09GiB path /dev/sdb1

[vagrant@archlinux mnt]$ sudo btrfs fi df /mnt/intenso_white/
Data, single: total=695.01GiB, used=653.69GiB
System, DUP: total=40.00MiB, used=96.00KiB
Metadata, DUP: total=4.00GiB, used=2.13GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

execve("/usr/bin/btrfs", ["btrfs", "send", "-vvv", "-f", 
"intenso_white.snapshot", "/mnt/intenso_white/@data/.snapsh"...], 
0x7ffcb8f52718 /* 16 vars */) = 0
brk(NULL)   = 0x555c9eddb000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe8c9971e0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=28542, ...}) = 0
mmap(NULL, 28542, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffa9a016000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/libuuid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\24\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=26552, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7ffa9a014000
mmap(NULL, 2121752, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa99e0d000
mprotect(0x7ffa99e13000, 2093056, PROT_NONE) = 0
mmap(0x7ffa9a012000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7ffa9a012000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/libblkid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, 
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\240\231\0\0\0\0\0\0"..., 832) = 
832
fstat(3, {st_mode=S_IFREG|0755, st_size=326480, ...}) = 0
mmap(NULL, 2426656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa99bbc000
mprotect(0x7ffa99c06000, 2097152, PROT_NONE) = 0
mmap(0x7ffa99e06000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4a000) = 0x7ffa99e06000
mmap(0x7ffa99e0c000, 1824, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffa99e0c000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320!\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=91912, ...}) = 0
mmap(NULL, 2187280, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa999a5000
mprotect(0x7ffa999bb000, 2093056, PROT_NONE) = 0
mmap(0x7ffa99bba000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0x7ffa99bba000
close(3)= 0
openat(AT_FDCWD, "/usr/lib/liblzo2.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20'\0\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=137432, ...}) = 0
mmap(NULL, 2232528, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7ffa99783000
mprotect(0x7ffa997a4000, 2093056, PROT_NONE) = 0
mmap(0x7ffa999a3000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2) = 0x7ffa999a3000
close(3) 

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-06 Thread Dave Chinner
On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > >   - is documenting rejection on request alignment grounds
> > > > > (i.e. EINVAL) in the man page sufficient for app
> > > > > developers to understand what is going on here?
> > > > 
> > > > I think so.  The manpage says: "The filesystem does not support
> > > > reflinking the ranges of the given files", which (to my mind) covers
> > > > this case of not supporting dedupe of EOF blocks.
> > > 
> > > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > > files (one extent) cannot be deduped at all, because the EOF block holds
> > > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > > last extent of the file as the benefit would be zero or slightly negative.
> > 
> > That's a filesystem implementation issue, not an API or application
> > issue.
> 
> The API and application issue remains even if btrfs is not considered.
> btrfs is just the worst case outcome.  Other filesystems still have
> fragmentation issues, and applications have efficiency-vs-capability
> tradeoffs to make if they can't rely on dedupe-to-EOF being available.
> 
> Tools like 'cp --reflink=auto' work by trying the best case, then falling
> back to a second choice if the first choice returns an error.

Well, yes. That's necessary for the "cp" tool to behave according to
user expectations.  That's not a kernel API issue - that's just an
implementation of an *application requirement*.  Indeed, this is
identical to the behaviour of rename() in "mv" - if rename fails
with -EXDEV, mv needs to fall back to a manual copy because the user
expects the file to be moved.

IOWS, these application level requirements you talk about are just
not relevant to the kernel API for dedupe/clone operations.

[snip]

> It is something that naive dedupe apps will do.  "naive" here meaning
> "does not dive deeply into the filesystem's physical structure (or survey
> the entire filesystem with FIEMAP) to determine that the thousand-refs
> situation does not exist at dst prior to invoking the dedupe() call."

/me sighs and points at FS_IOC_GETFSMAP

$ man ioctl_getfsmap

DESCRIPTION
   This ioctl(2) operation retrieves physical extent mappings
   for a filesystem.  This information can be used to discover
   which files are mapped to a physical block, examine free
   space, or find known bad blocks, among other things.
.

I don't really care about "enabling" naive, inefficient
applications. I care about applications that scale to huge
filesystems and can get the job done quickly and efficiently.

> > XFS doesn't have partial overlaps, we don't have nodatacow hacks,
> > and the subvol snapshot stuff I'm working on just uses shared data
> > extents so it's 100% compatible with dedupe.
> 
> If you allow this sequence of operations, you get partial overlaps:
> 
>   dedupe(fd1, 0, fd2, 0, 1048576);
> 
>   dedupe(fd2, 16384, fd3, 0, 65536);

Oh, I misunderstood - I thought you were referring to sharing partial
filesystem blocks (like at EOF) because that's what this discussion
was originally about. XFS supports the above just fine.

[snip]

tl;dr we don't need a new clone or dedupe API

> For future development I've abandoned the entire dedupe_file_range
> approach.  I need to be able to read and dedupe the data blocks of
> the filesystem directly without having to deal with details like which
> files those blocks belong to, especially on filesystems with lots of
> existing deduped blocks and snapshots. 

IOWs, your desired OOB dedupe algorithm is:

a) ask the filesystem where all its file data is
b) read that used space to build a data hash index
c) on all the data hash collisions find the owners of the
   colliding blocks
d) if the block data is the same dedupe it

I agree - that's a simple and effective algorithm. It's also the
obvious solution to an expert in the field.

> The file structure is frankly
> just noise for dedupe on large filesystems.

We learnt that lesson back in the late 1990s. xfsdump, xfs_fsr, all
the SGI^WHPE HSM scanning tools, etc all avoid the directory
structure because it's so slow. XFS's bulkstat interface, OTOH, can
scan for target inodes at over a million inodes/sec if you've got
the IO and CPU to throw at it.

> I'm building a translation
> layer for bees that does this--i.e. the main dedupe loop works only with
> raw data blocks, and the translation layer maps read(blocknr, length)
> and dedupe(block1, 

Transactional btrfs

2018-09-06 Thread Nathan Dehnel
https://lwn.net/Articles/287289/

In 2008, HP released the source code for a filesystem called advfs so
that its features could be incorporated into linux filesystems. Advfs
had a feature where a group of file writes were an atomic transaction.

https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

These guys used advfs to add a "syncv" system call that makes writes
across multiple files atomic.

https://lwn.net/Articles/715918/

A patch was later submitted based on the previous paper in some way.

So I guess my question is, does btrfs support atomic writes across
multiple files? Or is anyone interested in such a feature?


Re: [PATCH] Btrfs: remove redundant btrfs_trans_release_metadata

2018-09-06 Thread Liu Bo
On Wed, Sep 5, 2018 at 10:45 PM, Liu Bo  wrote:
> Somehow this ends up with crash in btrfs/124, I'm trying to figure out
> what went wrong.
>

It revealed the problem addressed in Josef's patch[1], so with it,
this patch works just fine.

[1] btrfs: make sure we create all new bgs

thanks,
liubo

>
> On Tue, Sep 4, 2018 at 6:14 PM, Liu Bo  wrote:
>> __btrfs_end_transaction() has done the metadata release twice,
>> probably because it used to process delayed refs in between, but now
>> that we don't process delayed refs any more, the 2nd release is always
>> a noop.
>>
>> Signed-off-by: Liu Bo 
>> ---
>>  fs/btrfs/transaction.c | 6 --
>>  1 file changed, 6 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index bb1b9f526e98..94b036a74d11 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -826,12 +826,6 @@ static int __btrfs_end_transaction(struct 
>> btrfs_trans_handle *trans,
>> return 0;
>> }
>>
>> -   btrfs_trans_release_metadata(trans);
>> -   trans->block_rsv = NULL;
>> -
>> -   if (!list_empty(&trans->new_bgs))
>> -   btrfs_create_pending_block_groups(trans);
>> -
>> trans->delayed_ref_updates = 0;
>> if (!trans->sync) {
>> must_run_delayed_refs =
>> --
>> 1.8.3.1
>>


Re: [PATCH 22/35] btrfs: make sure we create all new bgs

2018-09-06 Thread Liu Bo
On Fri, Aug 31, 2018 at 7:03 AM, Josef Bacik  wrote:
> On Fri, Aug 31, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
>>
>>
>> On 30.08.2018 20:42, Josef Bacik wrote:
>> > We can actually allocate new chunks while we're creating our bg's, so
>> > instead of doing list_for_each_safe, just do while (!list_empty()) so we
>> > make sure to catch any new bg's that get added to the list.
>>
>> HOw can this occur, please elaborate and put an example callstack in the
>> commit log.
>>
>
> Eh?  We're modifying the extent tree and chunk tree, which can cause bg's
> to be allocated, it's just common sense.
>
>

This explains a bit.

  => btrfs_make_block_group
  => __btrfs_alloc_chunk
  => do_chunk_alloc
  => find_free_extent
  => btrfs_reserve_extent
  => btrfs_alloc_tree_block
  => __btrfs_cow_block
  => btrfs_cow_block
  => btrfs_search_slot
  => btrfs_update_device
  => btrfs_finish_chunk_alloc
  => btrfs_create_pending_block_groups
 ...


Reviewed-by: Liu Bo 

thanks,
liubo